[20230809]ora-04030问题分析整理.txt

寻技术 Oracle 2023年12月02日 121
[20230809]ora-04030问题分析整理.txt

--//生产系统同事使用toad连接经常出现ora-04030错误。
ORA-04030: out of process memory when trying to allocate 123416 bytes (QERHJ hash-joi,kllcqas:kllsltba)
--//仔细看joi确实不是join,开始猜测可能某个程序的sql语句选择hash-join,导致pga消耗太大。

--//同事给了我alert日志的截图,提示都是icare_s001_20087.trc的进程,难道全部使用共享连接模式报错!!
--//icare_s001_20087.trc的进程报如下错误,对应进程里面应该能看到sql语句。
ORA-04030: out of process memory when trying to allocate 82456 bytes (pga heap,control file i/o buffer)

--//原始的分析版本写的太乱,事后重新做一下整理,主要还是想记录一下当时的分析过程,估计还是很乱!!

1.环境:
# cat /etc/redhat-release
redhat 4
#Red Hat Enterprise Linux Server release 6.0 (Santiago)

# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      15775        178          0          4       5003
-/+ buffers/cache:      10767       5186
Swap:        15999      15999          0
--//昏,swap内存都全部使用了.

SYS@192.168.100.41:1521/icare> @ ver1
PORT_STRING                    VERSION        BANNER
------------------------------ -------------- ----------------------------------------------------------------
x86_64/Linux 2.4.xx            10.2.0.4.0     Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
--//哇!很古老的版本.

SYS@192.168.100.41:1521/icare:DEDICATED> select * from v$instance;
INSTANCE_NUMBER INSTANCE_NAME    HOST_NAME       VERSION           STARTUP_TIME        STATUS       PAR    THREAD# ARCHIVE LOG_SWITCH_WAIT LOGINS     SHU DATABASE_STATUS   INSTANCE_ROLE      ACTIVE_ST BLO
--------------- ---------------- --------------- ----------------- ------------------- ------------ --- ---------- ------- --------------- ---------- --- ----------------- ------------------ --------- ---
              1 icare            DZBL_DB_101.132 10.2.0.4.0        2021-08-24 08:24:17 OPEN         NO           1 STARTED                 ALLOWED    NO  ACTIVE            PRIMARY_INSTANCE   NORMAL    NO
--//出现问题后没有重启过机器!!顺便说明一下,这台机器很少访问,基本作为历史数据库,偶尔有应用连上就是查询数据.

2.分析:
SYS@192.168.100.41:1521/icare> show sga
Total System Global Area 6442450944 bytes
Fixed Size                  2093584 bytes
Variable Size            2181041648 bytes
Database Buffers         4244635648 bytes
Redo Buffers               14680064 bytes

SYS@192.168.100.41:1521/icare> show parameter pga
PARAMETER_NAME       TYPE        VALUE
-------------------- ----------- -------
pga_aggregate_target big integer 3G

SYS@192.168.100.41:1521/icare> @ spid
       SID    SERIAL# PROCESS      SERVER    SPID       PID  P_SERIAL# C50
---------- ---------- ------------ --------- ------ ------- ---------- --------------------------------------------------
       855      22839 31462        SHARED    1811        18        237 alter system kill session '855,22839' immediate;
--//噢!以前管理的数据库不合理,选择的连接模式是SHARED模式,spid=1811.
--//这样可能更好理解为什么alert日志都是记录都是S001的进程.

$ grep -B1 "ORA-04030" alert_icare.log  | grep Error | sort | uniq -c
      1 Errors in file /opt/oracle/admin/icare/bdump/icare_s000_18301.trc:
      1 Errors in file /opt/oracle/admin/icare/bdump/icare_s000_26348.trc:
   1968 Errors in file /opt/oracle/admin/icare/bdump/icare_s001_20087.trc:
--//很明显问题问题出在共享连接的进程,也就是可以估计应用段的进程都是以共享模式连接的.

SYS@192.168.100.41:1521/icare> show parameter disp
PARAMETER_NAME   TYPE     VALUE
---------------- -------- ---------------
dispatchers      string   (PROTOCOL=TCP)
max_dispatchers  integer
--//奇怪没有任何配置.难道同事建立的配置连接串没有(SERVER = DEDICATED)设置.缺省共享模式优先.

$ lsnrctl service
LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 10-AUG-2023 08:25:53
Copyright (c) 1991, 2007, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
Services Summary...
Service "icare" has 2 instance(s).
  Instance "icare", status UNKNOWN, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1419 refused:0
         LOCAL SERVER
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1458 refused:0 state:ready
         LOCAL SERVER
      "D000" established:7472195 refused:0 current:109 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Service "icaredg4" has 1 instance(s).
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1458 refused:0 state:ready
         LOCAL SERVER
      "D000" established:7472195 refused:0 current:109 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
Service "icaredg4_XPT" has 1 instance(s).
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1458 refused:0 state:ready
         LOCAL SERVER
      "D000" established:7472195 refused:0 current:109 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 26346>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=45587))
The command completed successfully
--//icare服务存在两者模式,缺省连接串没有配置,优先选择共享模式连接.

SYS@192.168.100.41:1521/icare> column Program format a30
SYS@192.168.100.41:1521/icare> @pga 10
NAME                             VALUE_MB
------------------------------ ----------
aggregate PGA target parameter       3072
total PGA inuse                17655.4385
total PGA allocated            19551.1279
over allocation count            18030955

PROFILE                                CNT PERCENTAGE
------------------------------- ---------- ----------
workarea executions - optimal     74202449        100
workarea executions - onepass        89372          0
workarea executions - multipass          3          0

SPID          SID    SERIAL# MACHINE          CLIENT_INFO          Program                        PGA_USED_MEM PGA_ALLOC_MEM PGA_MAX_MEM
------ ---------- ---------- ---------------- -------------------- ------------------------------ ------------ ------------- -----------
2290         1014      34111 H3C-APM          192.168.101.212      JDBC Thin Client                 6862056397    7554874813  7555005885
1811          855      22839 gxqyydg4         192.168.100.78       sqlplus@gxqyydg4 (TNS V1-V3)     2603811029    2875538877  2881830333
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15102         977       9825 DZBL_DB_101.132                       oracle@DZBL_DB_101.132 (LNS1)      38893197      79486477    79486477
26437        1089          3 DZBL_DB_101.132                       oracle@DZBL_DB_101.132 (ARC0)      22414453      47438477    48683661
26332        1101          1 DZBL_DB_101.132                       oracle@DZBL_DB_101.132 (LGWR)      10946301      28613773    33987725
26439        1086          3 DZBL_DB_101.132                       oracle@DZBL_DB_101.132 (ARC1)      10814885      24238733    25483917
26342        1096          1 DZBL_DB_101.132                       oracle@DZBL_DB_101.132 (MMON)       1770773       3662333     3793405
26346         833      21702                  192.168.100.51       w3wp.exe                             933309        933309     1761861
26346         835      18802                  192.168.100.51       w3wp.exe                             933309        933309     1761861
26346         839      27183                  192.168.100.51       w3wp.exe                             933309        933309     1761861
10 rows selected.
--//看看那个进程消耗很大的PGA。
--//注意看下划线是我诊断的连接进程.我不可能怎么没做就使用这么多PGA,一定是连接上共享模式的进程导致的问题.
--//Program=w3wp.exe的spid=26346,说明也是共享模式连接的进程.

--//后记:我这里分析有错!!实际上spid=2290进程占用的pga内存可能是多个回话共享的,因为当前仅仅1个连上,其他用户估计很难登陆上.
--//如果多个用户使用共享模式连上该进程,后面的PGA_USED_MEM都都等于6862056397.看看spid=26346,就很容易明白了.
--//也就是我写pga.sql脚本查询涉及共享模式使用pga的情况有点问题.

# ps -fp 2290,1811,26346
UID        PID  PPID  C STIME TTY          TIME CMD
oracle    1811     1  0  2022 ?        08:41:28 ora_s002_icare
oracle    2290     1  0  2022 ?        16:15:39 ora_s000_icare
oracle   26346     1 54  2021 ?        391-21:33:26 ora_d000_icare
--//都是共享模式相关的进程.
--//后记:这里出现一个ora_d000_icare进程,我当时并没有多想,实际上应用程序的执行卡在这里,还停留在ora_d000_icare进程,正常成
--//功的化,应该出现的是ora_s00N_icare的进程.当时没有考虑周全,经验不足.

--//如果安装smem软件包,可以执行如下:
# smem -tk -U oracle -P "ora_|oracleicare"

--//检查内核参数/proc/sys/vm/max_map_count:
# cat /proc/sys/vm/max_map_count
65530
--//来自链接 https://www.josip-pojatina.com/en/bulk-collect-and-memory-limits/
To increase a number of the map entries from 4 GB (65530 x 64 / 1024 to get the amount in GB) into the 16 GB, you need
to execute the following command as a root user:
--//65530*64/1024 = 4095.62500000000000000000
--//2^16 = 65536
--//如何知道每个64K?

# sysctl -w vm.max_map_count=262144
vm.max_map_count = 262144
--//这个我没有执行,目前不是这方面的问题.

--//顺便查看ash的情况:
SYS@192.168.100.41:1521/icare:DEDICATED> @ tpt/ash/ashtop10 sql_id 1=1 &day
    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN
--------- ------- ------- ------------- ------------------- -------------------
      791      .0   37% |               2023-08-09 08:58:00 2023-08-10 08:52:13
      200      .0    9% | 8szmwam7fysa3 2023-08-09 22:00:06 2023-08-09 22:03:26
      154      .0    7% | 93jgxvdzsx4y1 2023-08-09 08:57:58 2023-08-10 08:27:37
       93      .0    4% | 5r14h528vkacs 2023-08-09 08:58:03 2023-08-10 08:27:41
       47      .0    2% | b0nsh8xz40nhr 2023-08-09 08:58:01 2023-08-10 08:27:38
       42      .0    2% | 19nrxkxw2b8j1 2023-08-09 08:58:07 2023-08-10 08:27:44
       30      .0    1% | dqxfcr2c8265u 2023-08-09 09:06:36 2023-08-10 08:14:24
       30      .0    1% | g97w6fq8yw5br 2023-08-09 08:58:11 2023-08-10 08:27:48
       29      .0    1% | gy2ncu3j1x25j 2023-08-09 09:56:59 2023-08-10 08:44:33
       25      .0    1% | 6q6tgwwt6bku9 2023-08-09 09:58:02 2023-08-10 08:27:40
       23      .0    1% | 7dsgq5k08yymk 2023-08-09 12:28:06 2023-08-10 08:27:47
       22      .0    1% | 6wcywxctbbvs8 2023-08-09 11:17:26 2023-08-10 06:43:46
       21      .0    1% | 27m1sf1nknfz2 2023-08-09 08:58:10 2023-08-10 06:27:48
       21      .0    1% | 39nd965zn01yq 2023-08-09 09:26:37 2023-08-10 07:44:18
       19      .0    1% | 3vsh63tqr0sr3 2023-08-09 08:54:35 2023-08-10 07:41:41
15 rows selected.

SYS@192.168.100.41:1521/icare:DEDICATED> @ tpt/ash/ashtop10 event 1=1 &day
    Total
  Seconds     AAS %This   EVENT                         FIRST_SEEN          LAST_SEEN
--------- ------- ------- ----------------------------- ------------------- -------------------
     1053      .0   50% |                               2023-08-09 08:56:29 2023-08-10 08:52:13
      412      .0   19% | db file sequential read       2023-08-09 08:57:59 2023-08-10 08:48:57
      245      .0   12% | log file parallel write       2023-08-09 08:58:00 2023-08-10 08:52:09
      139      .0    7% | control file parallel write   2023-08-09 09:14:32 2023-08-10 08:51:24
       93      .0    4% | log file sync                 2023-08-09 08:58:25 2023-08-10 08:52:09
       86      .0    4% | control file sequential read  2023-08-09 08:58:04 2023-08-10 08:14:24
       26      .0    1% | null event                    2023-08-09 09:30:51 2023-08-10 08:41:22
       23      .0    1% | LNS wait on SENDREQ           2023-08-09 09:13:47 2023-08-10 08:41:36
       14      .0    1% | log file sequential read      2023-08-09 10:05:24 2023-08-10 08:42:45
       11      .0    1% | os thread startup             2023-08-09 09:00:13 2023-08-10 00:00:49
       10      .0    0% | db file scattered read        2023-08-09 22:00:19 2023-08-09 22:00:59
        3      .0    0% | db file parallel read         2023-08-09 16:31:31 2023-08-10 08:52:09
        3      .0    0% | log file switch completion    2023-08-09 22:00:05 2023-08-10 08:42:42
        2      .0    0% | Log archive I/O               2023-08-09 09:15:04 2023-08-09 12:30:26
        1      .0    0% | latch free                    2023-08-10 07:03:11 2023-08-10 07:03:11
15 rows selected.
--//数据库性能基本没有怎么问题.

3.临时解决方法:
--//先停止192.168.101.212 上的应用.修改配置连接参数加上
(SERVER = DEDICATED)
--//192.168.100.51 类似.其他地址类似.

--//剩下的是如何回收内存的工作.
SYS@192.168.100.41:1521/icare:DEDICATED> alter system kill session '1014,34113' immediate;
System altered.

# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      15781        172          0          3       5002
-/+ buffers/cache:      10775       5178
Swap:        15999      15999          0
--//前面的执行情况,为了对比方便我复制下来:
# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      15775        178          0          4       5003
-/+ buffers/cache:      10767       5186
Swap:        15999      15999          0
--//昏基本没有任何改变.

SYS@192.168.100.41:1521/icare:DEDICATED> @ pga 10
NAME                             VALUE_MB
------------------------------ ----------
aggregate PGA target parameter       3072
total PGA inuse                17656.9668
total PGA allocated            19554.3701
over allocation count            18031763

PROFILE                                 CNT PERCENTAGE
-------------------------------- ---------- ----------
workarea executions - optimal      74204980        100
workarea executions - onepass         89374          0
workarea executions - multipass           3          0

SPID          SID    SERIAL# MACHINE         CLIENT_INFO          Program                        PGA_USED_MEM PGA_ALLOC_MEM PGA_MAX_MEM
------ ---------- ---------- --------------- -------------------- ------------------------------ ------------ ------------- -----------
15102         977       9825 DZBL_DB_101.132                      oracle@DZBL_DB_101.132 (LNS1)      38893197      79486477    79486477
26437        1089          3 DZBL_DB_101.132                      oracle@DZBL_DB_101.132 (ARC0)      22414453      47438477    48683661
26332        1101          1 DZBL_DB_101.132                      oracle@DZBL_DB_101.132 (LGWR)      10946301      28613773    33987725
26439        1086          3 DZBL_DB_101.132                      oracle@DZBL_DB_101.132 (ARC1)      10814885      24238733    25483917
26342        1096          1 DZBL_DB_101.132                      oracle@DZBL_DB_101.132 (MMON)       1770773       3662333     3793405
11755         855      22871 gxqyydg4        192.168.100.78       sqlplus@gxqyydg4 (TNS V1-V3)        1617021       2220613     5497413
26346         833      21702                 192.168.100.51       w3wp.exe                             933309        933309     1761861
26346         835      18802                 192.168.100.51       w3wp.exe                             933309        933309     1761861
26346         839      27183                 192.168.100.51       w3wp.exe                             933309        933309     1761861
26346         840      28987                 192.168.100.51       w3wp.exe                             933309        933309     1761861
10 rows selected.
--//后记:这是我才反应过来前面alter system kill session '1014,34113' immediate;不可能完全回收s00N占用的内存的.我必须
--//kill对应的进程.这里也说明一点,如果这个时候使用pag.sql脚本查询,很难判断问题在那里.

--//先看看那些进程占用内存.
[root@DZBL_DB_101 app]# top
top - 09:30:22 up 762 days, 13:58,  5 users,  load average: 1.08, 1.14, 1.10
Tasks: 271 total,   2 running, 269 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us,  4.4%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16336568k total, 16169608k used,   166960k free,     4940k buffers
Swap: 16383992k total, 16383992k used,        0k free,  5137576k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26346 oracle    20   0 6292m 313m 311m R 100.0  2.0 564370:57 oracle
 2792 root      20   0  154m  27m 1816 S  5.9  0.2  25322:45 Xorg
10132 root      20   0 4470m 1.1g 2568 S  2.0  6.8  30280:47 gnome-system-mo
    1 root      20   0 19244  896  676 S  0.0  0.0   1:23.96 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.04 kthreadd
--//不熟悉top的使用,如果安装我熟悉的smem,也许问题定位会更快一些.实际上缺省按照CPU 使用排序的.spid=26346,CPU使用率100%,实
--//际上可能通过共享模式登陆还是报错.从以上输出可以发现启动到图形模式5,可以回到多用户模式3,回收一些内存,切换到多用户模式
--//,不启动图形模式X11,注意同时修改/etc/inittab文件:
--//id:3:initdefault:

[root@DZBL_DB_101 app]# who -r
         run-level 5  2021-07-08 19:31
[root@DZBL_DB_101 app]# init 3
[root@DZBL_DB_101 app]# who -r
         run-level 3  2023-08-10 09:33                   last=5

# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      14582       1371          0          9       5068
-/+ buffers/cache:       9503       6449
Swap:        15999      12577       3422
--//回收了一些内存.swap free增加3422M.

[root@DZBL_DB_101 app]# top
top - 09:57:24 up 762 days, 14:06,  2 users,  load average: 1.03, 1.33, 1.23
Tasks: 225 total,   2 running, 223 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us,  4.4%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16336568k total, 14947360k used,  1389208k free,    11068k buffers
Swap: 16383992k total, 12879480k used,  3504512k free,  5199580k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26346 oracle    20   0 6292m 313m 311m R 100.0  2.0 564378:32 oracle
 3956 root      20   0 1394m  30m 3992 S  2.0  0.2   0:41.88 infoeyeagent_mq
    1 root      20   0 19244  992  752 S  0.0  0.0   1:24.00 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.04 kthreadd
--//spid=26346还是占用许多内存,分析有误,实际上它不是占用最多的.
--//后记:我这里再次犯了严重错误,没有按照内存使用排序,如果排序能很快定位问题.不过我当时想既然spid=26346进程对应
--//ora_d000_icare,kill -9应该下面ora_s00N_icare进程也会一起kill掉,重新建立新的进程.

--//再尝试前,我想看看先kill对应的session如何.
# ps -fp 26346
UID        PID  PPID  C STIME TTY          TIME CMD
oracle   26346     1 54  2021 ?        391-22:19:20 ora_d000_icare
--//后记:实际上这时我才想起共享连接不应该停在这里,实际上前面的w3wp.exe执行卡在这里,无法继续下去.
--//取消kill这些连接应该没有任何问题,注实际上很多,差不多有1XX个.

SYS@192.168.100.41:1521/icare:DEDICATED> alter system kill session '833,21702,@1' immediate -- ICARE@ (w3wp.exe);
alter system kill session '833,21702,@1' immediate -- ICARE@ (w3wp.exe)
*
ERROR at line 1:
ORA-00026: missing or invalid session ID

SYS@192.168.100.41:1521/icare:DEDICATED> alter system kill session '833,21702' immediate -- ICARE@ (w3wp.exe);
System altered.
--//昏!10g版本竟然不支持@1这样的写法.

SYS@192.168.100.41:1521/icare:DEDICATED> @ killi program='w3wp.exe'
COMMANDS_TO_VERIFY_AND_RUN
--------------------------------------------------------------------------------------
alter system kill session '1010,44238' immediate -- ICARE@WORKGROUP\DELL70 (w3wp.exe);
...

--//拷贝然后执行,再查看:
[root@DZBL_DB_101 app]# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      14722       1230          0         13       5136
-/+ buffers/cache:       9572       6381
Swap:        15999      12557       3442
--//变化不是很大.

top - 10:07:41 up 762 days, 14:16,  2 users,  load average: 1.10, 1.20, 1.19
Tasks: 227 total,   2 running, 225 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us,  4.4%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16336568k total, 15085796k used,  1250772k free,    14212k buffers
Swap: 16383992k total, 12858468k used,  3525524k free,  5262764k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26346 oracle    20   0 6292m 316m 314m R 98.9  2.0 564388:39 oracle
    1 root      20   0 19244  992  752 S  0.0  0.0   1:24.02 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.04 kthreadd

--//简单一点直接kill掉,说明一下:在执行前我看过select * from V$BGPROCESS where PADDR<>'00';的输出,
--//ora_d000_icare,ora_s00N_icare居然没有出现在输出列表里面.有点奇怪!!

[root@DZBL_DB_101 app]# kill -9 26346
[root@DZBL_DB_101 app]# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      14812       1141          0         15       5196
-/+ buffers/cache:       9599       6353
Swap:        15999      12552       3447
--//对比前面基本没有任何变化.

[oracle@DZBL_DB_101 ~]$ ps -ef -u oracle | grep d0[0]
oracle   25655     1  0 10:08 ?        00:00:00 ora_d001_icare
root     30812     2  0 Jan19 ?        00:00:00 [kslowd000]
root     30813     2  0 Jan19 ?        00:00:00 [kslowd001]
--//启动1个新的d001进程.

$ lsnrctl service
LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 10-AUG-2023 10:16:38
Copyright (c) 1991, 2007, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
Services Summary...
Service "icare" has 2 instance(s).
  Instance "icare", status UNKNOWN, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:1419 refused:0
         LOCAL SERVER
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:3043 refused:0 state:ready
         LOCAL SERVER
      "D001" established:7 refused:0 current:3 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
Service "icaredg4" has 1 instance(s).
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:3043 refused:0 state:ready
         LOCAL SERVER
      "D001" established:7 refused:0 current:3 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
Service "icaredg4_XPT" has 1 instance(s).
  Instance "icare", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:3043 refused:0 state:ready
         LOCAL SERVER
      "D001" established:7 refused:0 current:3 max:1022 state:ready
         DISPATCHER <machine: DZBL_DB_101.132, pid: 25655>
         (ADDRESS=(PROTOCOL=tcp)(HOST=DZBL_DB_101.132)(PORT=33030))
The command completed successfully
--//使用d001替换原来的d000.

[oracle@DZBL_DB_101 bdump]$ ps -ef | grep s0[0]
oracle    1811     1  0  2022 ?        08:41:30 ora_s002_icare
oracle    2290     1  0  2022 ?        16:15:53 ora_s000_icare
oracle   20087     1  0  2022 ?        21:44:42 ora_s001_icare
oracle   26354     1  0  2021 ?        01:23:30 ora_s003_icare
--//这才发现ora_s00N_icare进程还在并没有删除.

[oracle@DZBL_DB_101 bdump]$ top -u oracle
top - 10:26:51 up 762 days, 14:35,  2 users,  load average: 0.01, 0.06, 0.36
Tasks: 229 total,   1 running, 228 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us,  4.4%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16336568k total, 15241824k used,  1094744k free,    19756k buffers
Swap: 16383992k total, 12854104k used,  3529888k free,  5416796k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1811 oracle    20   0 9057m 3.9g 2.8g S  0.0 25.1 521:30.70 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 2290 oracle    20   0 13.2g 7.4g 2.5g S  0.0 47.5 975:53.23 oracle
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 5500 oracle    20   0  180m 3064 1736 S  0.0  0.0  79:26.09 tnslsnr
15102 oracle    20   0 6338m  32m 6068 S  0.0  0.2 867:40.38 oracle
20087 oracle    20   0 15.3g 4.6g 2.7g S  0.0 29.3   1304:42 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23203 oracle    20   0 6291m  33m  28m S  0.0  0.2   0:00.08 oracle
25655 oracle    20   0 6290m  19m  15m S  0.0  0.1   0:00.07 oracle
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ =>d0001
26324 oracle    20   0 6290m  14m  13m S  0.0  0.1  67:26.29 oracle
26326 oracle    20   0 6289m 3588 3440 S  0.0  0.0  30:28.83 oracle
26328 oracle    20   0 6289m 943m 942m S  0.0  5.9  60:37.91 oracle
26330 oracle    20   0 6302m 1.7g 1.7g S  0.0 11.0  91:47.32 oracle
26332 oracle    20   0 6312m  19m  18m S  0.0  0.1 114:29.21 oracle
26334 oracle    20   0 6295m 193m 188m S  0.0  1.2 197:47.28 oracle
26336 oracle    20   0 6292m 1.2g 1.2g S  0.0  7.7  17:08.92 oracle
26338 oracle    20   0 6289m  19m  19m S  0.0  0.1   0:10.50 oracle
26340 oracle    20   0 6296m 152m 147m S  0.0  1.0 115:34.77 oracle
26342 oracle    20   0 6294m 225m 222m S  0.0  1.4 137:20.15 oracle
26344 oracle    20   0 6289m  34m  34m S  0.0  0.2 638:24.48 oracle
26354 oracle    20   0 6290m 1.6g 1.6g S  0.0 10.3  83:30.18 oracle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26437 oracle    20   0 6315m  17m 6324 S  0.0  0.1  10:20.44 oracle
26439 oracle    20   0 6304m  16m 5456 S  0.0  0.1   5:52.89 oracle
26455 oracle    20   0 6289m 8480 8320 S  0.0  0.1   9:42.74 oracle
26522 oracle    20   0 6291m  31m  30m S  0.0  0.2   0:03.41 oracle
26950 oracle    20   0 6291m 134m 129m S  0.0  0.8   0:00.39 oracle
30800 oracle    20   0 6289m 3832 3700 S  0.0  0.0   4:56.27 oracle
31446 oracle    20   0  105m 1900 1420 S  0.0  0.0   0:00.02 bash
31831 oracle    20   0 6290m  30m  26m S  0.0  0.2   0:00.03 oracle
32310 oracle    20   0 15072 1204  812 R  0.0  0.0   0:00.00 top
--//昏,没有按照内存使用排序,可以发现spid=1811,2290,20087,26354(注意坎%MEM列) 占用很大内存.
--//后记:自己对top使用的一些细节实在太不熟悉,还是最近干活太少的原因.另外写一篇使用top如何按照内存使用排序的文章.
--//顺便说一下按c可以可以看到COMMAND更加详细的内容.

--//找同事要安装盘没找到,管理混乱,想安装自己熟悉的smem包看看,6.0版本在网上已经很难找到,安装1个rh5的版本smem,居然可以安装
--//成功.
[root@DZBL_DB_101 rpm_tgz]# smem -tk -U oracle -P "ora_|oracleicare"
  PID User     Command                         Swap      USS      PSS      RSS
26326 oracle   ora_psp0_icare                  3.2M   160.0K   319.0K     3.5M
30800 oracle   ora_q002_icare                  3.3M   164.0K   328.0K     3.7M
26455 oracle   ora_qmnc_icare                  3.3M   200.0K   698.0K     8.3M
26324 oracle   ora_pmon_icare                  3.2M   628.0K     1.8M    14.0M
26338 oracle   ora_reco_icare                  3.7M   372.0K     1.9M    19.6M
26522 oracle   ora_q000_icare                  3.6M     1.0M     4.1M    31.7M
26332 oracle   ora_lgwr_icare                  5.5M     4.7M     6.4M    19.1M
25655 oracle   ora_d001_icare                     0     4.4M     7.2M    19.8M
 3510 oracle   oracleicare (LOCAL=NO)             0     4.7M     9.6M    36.5M
26344 oracle   ora_mmnl_icare                  3.1M     1.2M    10.7M    34.4M
31831 oracle   oracleicare (LOCAL=NO)             0     4.9M    10.9M    48.9M
26439 oracle   ora_arc1_icare                  3.5M    11.2M    11.5M    16.5M
26437 oracle   ora_arc0_icare                 14.3M    11.9M    12.4M    17.9M
26334 oracle   ora_ckpt_icare                  3.5M     4.6M    25.3M   193.0M
26340 oracle   ora_cjq0_icare                  4.2M     8.6M    26.8M   152.1M
15102 oracle   ora_lns1_icare                  3.2M    26.8M    27.2M    32.5M
26342 oracle   ora_mmon_icare                  3.8M     4.5M    33.5M   225.6M
26950 oracle   ora_j000_icare                     0    20.5M    40.6M   137.5M
26328 oracle   ora_mman_icare                  3.2M    44.7M   206.7M   943.1M
26336 oracle   ora_smon_icare                  3.3M    12.6M   237.0M     1.2G
26354 oracle   ora_s003_icare                  2.9M     5.8M   324.4M     1.6G
26330 oracle   ora_dbw0_icare                 10.0M   182.6M   558.5M     1.7G
 1811 oracle   ora_s002_icare                  1.5G     1.2G     1.8G     3.9G
20087 oracle   ora_s001_icare                  7.3G     1.9G     2.5G     4.6G
 2290 oracle   ora_s000_icare                  2.1G     4.9G     5.5G     7.4G
-------------------------------------------------------------------------------
   25 1                                       11.1G     8.4G    11.4G    22.3G
--//注意看s00X进程占用的内存空间.而且还使用了大量的swap内存.

SYS@192.168.100.41:1521/icare:DEDICATED> select * from V$BGPROCESS where PADDR<>'00';
PADDR              PSERIAL# NAME  DESCRIPTION                        ERROR
---------------- ---------- ----- ------------------------------ ---------
00000001FE637BB8          1 PMON  process cleanup                0
00000001FB641E88          1 PSP0  process spawner 0              0
00000001FE6383A0          1 MMAN  Memory Manager                 0
00000001FB642670          1 DBW0  db writer process 0            0
00000001FE63DA98          2 ARC0  Archival Process 0             0
00000001FB647D68          1 ARC1  Archival Process 1             0
00000001FB648550          1 ARC2  Archival Process 2             448
00000001FB645DC8         55 LNS1  Network Server 1               0
00000001FE63E280          1 LNS4  Network Server 4               448
00000001FE638B88          1 LGWR  Redo etc.                      0
00000001FB642E58          1 CKPT  checkpoint                     0
00000001FE639370          1 SMON  System Monitor Process         0
00000001FB643640          1 RECO  distributed recovery           0
00000001FE639B58          1 CJQ0  Job Queue Coordinator          0
00000001FE63E280          5 QMNC  AQ Coordinator                 0
00000001FB643E28          1 MMON  Manageability Monitor Process  0
00000001FE63A340          1 MMNL  Manageability Monitor Process  0
17 rows selected.
--//居然S00N不是后台进程应该可以kill掉.
--//后记:我在另外1台11g查看11g上有显示.

[root@DZBL_DB_101 rpm_tgz]# kill -9 1811
[root@DZBL_DB_101 rpm_tgz]# ps -ef | grep s0[0]
oracle    2290     1  0  2022 ?        16:15:55 ora_s000_icare
oracle   14542     1  0 11:03 ?        00:00:00 ora_s002_icare
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
oracle   20087     1  0  2022 ?        21:44:42 ora_s001_icare
oracle   26354     1  0  2021 ?        01:23:30 ora_s003_icare
--//发现重新建立新的ora_s002_icare进程.

--//全部删除ora_s00N_icare进程.
# ps -ef --no-header | grep s0[0]  | awk '{print $2}' | xargs -IQ  kill -9 Q
--//简单说明:grep 过滤s0[0],使用awk取第2个字段(就是进程号),通过xargs -IQ的模式传参调用kill -9执行.

[root@DZBL_DB_101 rpm_tgz]# ps -ef --no-header | grep s0[0]
oracle   15769     1  0 11:08 ?        00:00:00 ora_s000_icare
oracle   15771     1  0 11:08 ?        00:00:00 ora_s001_icare
oracle   15773     1  0 11:08 ?        00:00:00 ora_s002_icare
oracle   15775     1  0 11:08 ?        00:00:00 ora_s003_icare
--//OK,已经重新启动ora_s00N_icare进程,从时间列上也可以看出.

[root@DZBL_DB_101 rpm_tgz]# smem -tk -U oracle -P "ora_|oracleicare"
  PID User     Command                         Swap      USS      PSS      RSS
26326 oracle   ora_psp0_icare                  3.2M   164.0K   333.0K     3.5M
30800 oracle   ora_q002_icare                  3.3M   172.0K   358.0K     3.7M
26455 oracle   ora_qmnc_icare                  3.3M   200.0K   836.0K     8.3M
26338 oracle   ora_reco_icare                  3.7M   400.0K     2.7M    19.6M
26324 oracle   ora_pmon_icare                  3.2M     1.2M     3.2M    14.6M
15773 oracle   ora_s002_icare                     0     3.2M     4.6M    16.7M
15771 oracle   ora_s001_icare                     0     3.2M     4.6M    16.7M
15775 oracle   ora_s003_icare                     0     3.2M     4.6M    16.7M
15769 oracle   ora_s000_icare                     0     3.2M     4.6M    16.7M
26522 oracle   ora_q000_icare                  3.6M     1.2M     6.3M    31.7M
25655 oracle   ora_d001_icare                     0     4.7M     6.7M    19.9M
26332 oracle   ora_lgwr_icare                  5.5M     6.9M     8.6M    19.1M
26439 oracle   ora_arc1_icare                  3.5M    11.2M    11.7M    16.8M
26437 oracle   ora_arc0_icare                 14.3M    11.7M    12.4M    17.9M
31831 oracle   oracleicare (LOCAL=NO)             0     6.0M    13.8M    51.0M
 3510 oracle   oracleicare (LOCAL=NO)             0     8.7M    14.4M    40.3M
26344 oracle   ora_mmnl_icare                  3.1M    17.9M    20.4M    34.4M
15102 oracle   ora_lns1_icare                  3.2M    26.8M    27.3M    32.5M
26334 oracle   ora_ckpt_icare                  3.5M     4.8M    39.5M   193.0M
26340 oracle   ora_cjq0_icare                  4.2M    13.9M    47.2M   152.1M
26342 oracle   ora_mmon_icare                  3.8M    16.1M    67.1M   225.6M
26950 oracle   ora_j000_icare                     0   156.6M   233.9M   433.4M
26328 oracle   ora_mman_icare                  3.2M   365.8M   570.2M   943.1M
26336 oracle   ora_smon_icare                  3.3M   208.3M   614.1M     1.2G
26330 oracle   ora_dbw0_icare                 10.0M   783.5M     1.2G     1.7G
-------------------------------------------------------------------------------
   25 1                                       77.9M     1.6G     2.8G     5.2G
--//对比前面已经回收了大量的内存,swap仅仅占用77.9M.

[root@DZBL_DB_101 rpm_tgz]# free -m
             total       used       free     shared    buffers     cached
Mem:         15953       6521       9432          0         36       5534
-/+ buffers/cache:        951      15002
Swap:        15999       1298      14701
--//对比问题出现前的情况,估计在观察几天看看内存使用情况.下面的输出时最开始看到的内存使用情况.
# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      15775        178          0          4       5003
-/+ buffers/cache:      10767       5186
Swap:        15999      15999          0

4.总结:
--//配置10g的数据库连接串时一定要显式地加上使用专用模式连接或者共享模式.
--//我估计使用太久了,共享进程s00N占用内存存在bug,导致对应进程内存没有释放,出现这样的情况.有机会看看dump.
--//估计很久没做这类分析工作,top不熟悉,当时不知道如何实现按照内存使用排序,浪费一些时间.
--//对于共享模式连接也有点生疏,想当然认为kill ora_d000_icare进程,对应的ora_s00N_icare进程也会kill,os操作系统没学好.
--//如果服务器安装我熟悉的smem工具,也许就没有必要走这个大的弯路定位问题了.
--//惯性思维,以为alter system kill session '855,22839' immediate;对应进程就被kill掉.
--//解决问题时间过长,大约从8:30开始到11:30才定位解决问题,花了3个小时,事后我想我的问题思路是否有错,
--//ora-04030 => pga相关问题 -> pga.sql查询,如果当时这里看不出来,也许时间会更长...
--//另外顺手的工具很重要,实际上大部分时间浪费在上网找smem的安装包.
--//同事安装时应该不管怎么选择使用hugepages模式,减少页面表的使用.这部分内存占用1G.

5.附件脚本代码:
$ cat pga
column name format a30
column machine format a30
SELECT NAME, VALUE/1024/1024 VALUE_MB
FROM   V$PGASTAT
WHERE NAME IN ( 'aggregate PGA target parameter',
'total PGA allocated',
'total PGA inuse')
union all
SELECT NAME, VALUE
FROM   V$PGASTAT
WHERE NAME IN ('over allocation count');

SELECT name profile, cnt, decode(total, 0, 0, round(cnt*100/total)) percentage
FROM (SELECT name, value cnt, (sum(value) over ()) total
FROM V$SYSSTAT WHERE name like 'workarea exec%');

SELECT *
  FROM (  SELECT p.spid,
                 s.sid,
                 s.serial#,
                 s.machine,
                 s.client_info,
                 DECODE (s.program, NULL, p.program, s.program) AS "Program",
                 p.pga_used_mem,
                 p.pga_alloc_mem,
                 p.pga_max_mem
            FROM v$process p, v$session s
           WHERE s.paddr = p.addr
        ORDER BY p.pga_used_mem DESC)
 WHERE ROWNUM <= &1;

$ cat spid.sql
column sid     new_value newsid
column serial#  new_value newserial
column spid    new_value newspid

set verify off

SELECT s.sid
     , s.serial#
     , s.process
     , s.server
     , p.spid
     , p.pid
     , p.serial# p_serial#
     , 'alter system kill session '''||s.sid||' , '||s.serial#||''''||' immediate;' c50
  FROM v$session s
     , v$process p
 WHERE s.sid in ( SELECT sid FROM v$mystat WHERE rownum  = 1)
   AND s.paddr = p.addr;


关闭

用微信“扫一扫”