小y这个名字,是笔者临时想的一个笔名,其实没有什么特殊的含义,就暂且用他来代表我们这些为各个数据中心奉献自己青春的一群默默无闻的IT人吧!
小y今天要和大家分享的是一个疑难杂症的分析过程。如果大家有耐心读完这个案例,一定会或多或少有些收获,也就没浪费小y的一片苦心。
具体来说是一个应用间歇性局部挂起案例的分析过程,报告中将对
Oracle
数据库稳定运行的共性风险和隐患作出提醒。
据客户反映,应用会间歇性出现异常,包括
insert
单条记录在内的操作长时间无法完成,按照客户的说法,数据库内可能有“死锁”现象,希望能够找到问题发生的根因,提出解决方案,以避免问题再次发生。
2015
年
12
月
23
日,问题再次发生,客户再次联系到小
y,
小
y
通过远程方式进行了信息收集和故障诊断,最终定位了问题的根本原因。
环境介绍:
|
|
>
>
>
>
>
>
>
>
SQL> oradebug setmypid
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/admin/xxdb/udump/xxdb_ora_14136.trc
SQL>
SQL> oradebug dump systemstate 266
Statement processed.
SQL> oradebug tracefile_name
/oracle/admin/xxdb/udump/xxdb_ora_14136.trc
>
>
>
>
>
>
>
>
>
>
>
>
PROCESS 19:
—————————————-
SO: c00000003949b948, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=19, calls cur/top: c0000000397209b0/c0000000397209b0, flag: (0) –
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 121
last post received-location: kcbzww
last process to post me: c000000039496148 1 22
last post sent: 0 0 121
last post sent-location: kcbzww
last process posted by me: c000000039496148 1 22
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000039529928
O/S info: user: oracle, term: UNKNOWN, ospid: 11880
OSD pid info: Unix process pid: 11880, image: oracle@ap-machine-
*** 2015-12-22 10:34:53.431
Short stack dump:
ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_
pw_wait()+48<-pw_wait()
+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktugur()+4416<-ktuchg()+1280<-ktbchg2()+704<-kdiins0()+267536<-kdiinsp()+320<-kauxsin()+2960<-insidx()+1744<-insrow()+1440<-insdrv()+960<-inscovexe()+1408<-insExecStmtExecIniEngine()+176<-insexe()+1040<-opiexe()+13776<-kpoal8()+3808<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80
—————————————-
SO: c0000000396d80d8, type: 4, owner: c00000003949b948, flag: INIT/-/-/0x00
(session) sid: 315 trans: c0000000355b2a28, creator: c00000003949b948, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-0013-00000027, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 2, prv: 0, sql: c0000000384bad50, psql: c0000000384bad50, user: 33/XXDB
service name: xxdb
O/S info: user: , term: , ospid: 1234, machine: rvwapp2-1
program:
waiting for 'db file sequential read' wait_time=0, seconds since wait started=0
file#=c, block#=2f359, blocks=1
blocking sess=0x0000000000000000 seq=42271
Dumping Session Wait History
for 'undo segment extension' count=1 wait_time=2 min 45 sec
segment#=6, =0, =0
for 'buffer busy waits' count=1 wait_time=0.000009 sec
file#=2, block#=59, class#=1b
for 'SQL*Net message from client' count=1 wait_time=0.000087 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message to client' count=1 wait_time=0.000001 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message from client' count=1 wait_time=0.000086 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message to client' count=1 wait_time=0.000001 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message from client' count=1 wait_time=0.000086 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message to client' count=1 wait_time=0.000001 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message from client' count=1 wait_time=0.000087 sec
driver id=28444553, #bytes=1, =0
for 'SQL*Net message to client' count=1 wait_time=0.000001 sec
driver id=28444553, #bytes=1, =0
Sampled Session History of session 315 serial 4544
—————————————————
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
—————————————————
[120 samples, 10:32:52 – 10:34:53]
waited for 'undo segment extension', seq_num: 42270
p1: 'segment#'=0x6
p2: ''=0x0
p3: ''=0x0
time_waited: >= 120 sec (still in wait)
—————————————————
Sampled Session History Summary:
longest_non_idle_wait: 'undo segment extension'
[120 samples, 10:32:52 – 10:34:53]
time_waited: >= 120 sec (still in wait)
可以看到:
进程在被
ORADEBUG
间接唤醒后,不再等待
undo segment extension,
而是做
”
PROCESS 8:
—————————————-
SO: c000000039496148, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=8, calls cur/top: c00000003971e868/c00000003971e868, flag: (16) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 121
last post received-location: kcbzww
last process to post me: c0000000394a0948 211 0
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c000000039495148 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000039529928
O/S info: user: oracle, term: UNKNOWN, ospid: 10072
OSD pid info: Unix process pid: 10072, image: oracle@ap-machine- (SMON)
Short stack dump:
ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+352<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktmmon()+1168<-ktmSmonMain()+64<-ksbrdp()+2368<-opirip()+1184<-opidrv()+1184<-sou2o()+240<-opimai_real()+336<-main()+240<-main_opd_entry()+80
—————————————-
—————————————-
SO: c0000000396ead68, type: 4, owner: c000000039496148, flag: INIT/-/-/0x00
(session) sid: 329 trans: 0000000000000000, creator: c000000039496148, flag: (100051) USR/- BSY/-/-/-/-/-
DID: 0001-0008-00000002, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'smon timer' wait_time=0, seconds since wait started=3109
sleep time=12c, failed=0, =0
blocking sess=0x0000000000000000 seq=7382
Dumping Session Wait History
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 11 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
for 'smon timer' count=1 wait_time=4 min 53 sec
sleep time=12c, failed=0, =0
Sampled Session History of session 329 serial 1
可以看到,
SMON
进程在等“
SMON TIMER
”,即空闲等待。
这说明
SMON
没有因为阻塞在异常的等待上,导致无法腾出时间来处理前台进程发过来的
undo segment
扩展请求。
>
>
>
>
操作系统
The problem is a defect in the pw_wait() O/S system call.
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:
PHKL_37456 scheduler cumulative patch
即
HPUX11.31
上,当安装了
PHKL_37456
这个调度补丁后,操作系统调用
pw_wait
存在缺陷,导致进程无法从
post/wait
中被唤醒,解决方案是
Solution: on 11.31 install PHKL_38397 or later equivalent.
以下是官网原文
An example of the top of the call stack of an affected process, obtained using pstack:
|
>
>
>
>
INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)
之所以执行不下去,不是因为死锁,而是因为该会话即
SID=315
的会话
,
需要执行
DML
,因此需要
UNDO SEGMENT
即回滚段来存储前镜像,但发现回滚段空间不足,需要通知
SMON
后台进程来完成扩展的请求
,
但是长时间没有获得回滚段扩展成功与否的返回消息。前台进程和
SMON
之间通过
POST/WAIT
进制来通讯。
由于操作系统
HPUX
调度的缺陷,具体来说是
pw_wait
系统调用
(post/wait)
存在缺陷,当
SMON
进程完成回滚段扩展后,消息返回时,前台进程
SID=315
并没有能被及时调度到
CPU
上继续处理后续工作,即体现出来就是等待在“
undo segment extension
“事件上等待,当我们人为使用
ORADEBUG
对
SID 315
收集信息并间接唤醒了该前台进程后,前台进行可以继续往下工作,这更加印证了
SMON
已经将消息已经返回给前台进程,只是由于操作系统调度进制的问题,未能即使将前台进程调度到
CPU
上,拿到
UNDO
扩展成功的消息
造成该故障的根本原因是操作系统调度机制的问题,该问题命中操作系统
HPUX
上的已知缺陷。
PHKL_37456 scheduler cumulative patch
|
>
>
>
>
3.2 建议
Solution: on 11.31 install PHKL_38397 or later equivalent.
请系统管理员为操作系统安装
PHKL_38397
补丁。
本文是转载中亦安图的文章。