欢迎光临
我们一直在努力

技术人生系列 · 我和数据中心的故事——第一期


小y这个名字,是笔者临时想的一个笔名,其实没有什么特殊的含义,就暂且用他来代表我们这些为各个数据中心奉献自己青春的一群默默无闻的IT人吧!






小y今天要和大家分享的是一个疑难杂症的分析过程。如果大家有耐心读完这个案例,一定会或多或少有些收获,也就没浪费小y的一片苦心。




具体来说是一个应用间歇性局部挂起案例的分析过程,报告中将对


Oracle


数据库稳定运行的共性风险和隐患作出提醒。



1
问题描述


据客户反映,应用会间歇性出现异常,包括


insert


单条记录在内的操作长时间无法完成,按照客户的说法,数据库内可能有“死锁”现象,希望能够找到问题发生的根因,提出解决方案,以避免问题再次发生。



2015



12



23

日,问题再次发生,客户再次联系到小

y,



y

通过远程方式进行了信息收集和故障诊断,最终定位了问题的根本原因。



环境介绍:



操作系统


>


>


>

>


>


>


>


INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)


>


>


>


>

>


>


>


>


SQL> oradebug setmypid


Statement processed.


SQL> oradebug hanganalyze 3


Hang Analysis in /oracle/admin/xxdb/udump/xxdb_ora_14136.trc


SQL>


SQL> oradebug dump systemstate 266


Statement processed.


SQL> oradebug tracefile_name


/oracle/admin/xxdb/udump/xxdb_ora_14136.trc


>


>


>


>

>


>


>


>

>


>


>


>


PROCESS 19:


—————————————-


SO: c00000003949b948, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00


(process) Oracle pid=19, calls cur/top: c0000000397209b0/c0000000397209b0, flag: (0) –


int error: 0, call error: 0, sess error: 0, txn error 0


(post info) last post received: 0 0 121


last post received-location: kcbzww


last process to post me: c000000039496148 1 22


last post sent: 0 0 121


last post sent-location: kcbzww


last process posted by me: c000000039496148 1 22


(latch info) wait_event=0 bits=0


Process Group: DEFAULT, pseudo proc: c000000039529928


O/S info: user: oracle, term: UNKNOWN, ospid: 11880


OSD pid info: Unix process pid: 11880, image: oracle@ap-machine-


*** 2015-12-22 10:34:53.431



Short stack dump:


ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_


pw_wait()+48<-pw_wait()


+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktugur()+4416<-ktuchg()+1280<-ktbchg2()+704<-kdiins0()+267536<-kdiinsp()+320<-kauxsin()+2960<-insidx()+1744<-insrow()+1440<-insdrv()+960<-inscovexe()+1408<-insExecStmtExecIniEngine()+176<-insexe()+1040<-opiexe()+13776<-kpoal8()+3808<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80


—————————————-


SO: c0000000396d80d8, type: 4, owner: c00000003949b948, flag: INIT/-/-/0x00


(session) sid: 315 trans: c0000000355b2a28, creator: c00000003949b948, flag: (100041) USR/- BSY/-/-/-/-/-


DID: 0001-0013-00000027, short-term DID: 0000-0000-00000000


txn branch: 0000000000000000


oct: 2, prv: 0, sql: c0000000384bad50, psql: c0000000384bad50, user: 33/XXDB


service name: xxdb


O/S info: user: , term: , ospid: 1234, machine: rvwapp2-1


program:



waiting for 'db file sequential read' wait_time=0, seconds since wait started=0


file#=c, block#=2f359, blocks=1


blocking sess=0x0000000000000000 seq=42271


Dumping Session Wait History


for 'undo segment extension' count=1 wait_time=2 min 45 sec


segment#=6, =0, =0


for 'buffer busy waits' count=1 wait_time=0.000009 sec


file#=2, block#=59, class#=1b


for 'SQL*Net message from client' count=1 wait_time=0.000087 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message to client' count=1 wait_time=0.000001 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message from client' count=1 wait_time=0.000086 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message to client' count=1 wait_time=0.000001 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message from client' count=1 wait_time=0.000086 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message to client' count=1 wait_time=0.000001 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message from client' count=1 wait_time=0.000087 sec


driver id=28444553, #bytes=1, =0


for 'SQL*Net message to client' count=1 wait_time=0.000001 sec


driver id=28444553, #bytes=1, =0


Sampled Session History of session 315 serial 4544


—————————————————


The sampled session history is constructed by sampling


the target session every 1 second. The sampling process


captures at each sample if the session is in a non-idle wait,


an idle wait, or not in a wait. If the session is in a


non-idle wait then one interval is shown for all the samples


the session was in the same non-idle wait. If the


session is in an idle wait or not in a wait for


consecutive samples then one interval is shown for all


the consecutive samples. Though we display these consecutive


samples in a single interval the session may NOT be continuously


idle or not in a wait (the sampling process does not know).



The history is displayed in reverse chronological order.



sample interval: 1 sec, max history 120 sec


—————————————————


[120 samples, 10:32:52 – 10:34:53]


waited for 'undo segment extension', seq_num: 42270


p1: 'segment#'=0x6


p2: ''=0x0


p3: ''=0x0


time_waited: >= 120 sec (still in wait)


—————————————————


Sampled Session History Summary:


longest_non_idle_wait: 'undo segment extension'


[120 samples, 10:32:52 – 10:34:53]


time_waited: >= 120 sec (still in wait)




可以看到:






进程在被

ORADEBUG

间接唤醒后,不再等待

undo segment extension,

而是做




PROCESS 8:


—————————————-


SO: c000000039496148, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00


(process) Oracle pid=8, calls cur/top: c00000003971e868/c00000003971e868, flag: (16) SYSTEM


int error: 0, call error: 0, sess error: 0, txn error 0


(post info) last post received: 0 0 121


last post received-location: kcbzww


last process to post me: c0000000394a0948 211 0


last post sent: 0 0 24


last post sent-location: ksasnd


last process posted by me: c000000039495148 1 6


(latch info) wait_event=0 bits=0


Process Group: DEFAULT, pseudo proc: c000000039529928


O/S info: user: oracle, term: UNKNOWN, ospid: 10072


OSD pid info: Unix process pid: 10072, image: oracle@ap-machine- (SMON)


Short stack dump:


ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+352<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktmmon()+1168<-ktmSmonMain()+64<-ksbrdp()+2368<-opirip()+1184<-opidrv()+1184<-sou2o()+240<-opimai_real()+336<-main()+240<-main_opd_entry()+80


—————————————-


—————————————-


SO: c0000000396ead68, type: 4, owner: c000000039496148, flag: INIT/-/-/0x00


(session) sid: 329 trans: 0000000000000000, creator: c000000039496148, flag: (100051) USR/- BSY/-/-/-/-/-


DID: 0001-0008-00000002, short-term DID: 0000-0000-00000000


txn branch: 0000000000000000


oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS


service name: SYS$BACKGROUND



waiting for 'smon timer' wait_time=0, seconds since wait started=3109


sleep time=12c, failed=0, =0


blocking sess=0x0000000000000000 seq=7382


Dumping Session Wait History


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 11 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


for 'smon timer' count=1 wait_time=4 min 53 sec


sleep time=12c, failed=0, =0


Sampled Session History of session 329 serial 1



可以看到,

SMON

进程在等“

SMON TIMER

”,即空闲等待。



这说明

SMON

没有因为阻塞在异常的等待上,导致无法腾出时间来处理前台进程发过来的

undo segment

扩展请求。


>


>


>


>



操作系统


The problem is a defect in the pw_wait() O/S system call.

On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

PHKL_37456 scheduler cumulative patch






HPUX11.31


上,当安装了


PHKL_37456


这个调度补丁后,操作系统调用


pw_wait


存在缺陷,导致进程无法从


post/wait


中被唤醒,解决方案是


Solution: on 11.31 install PHKL_38397 or later equivalent.




以下是官网原文





Processes Hang Waiting on 'cursor: pin S wait on X' (and other Wait Events) on HP-UX 11.23 and 11.31 Itanium Systems (Doc ID 580273.1)





In this Document



Symptoms



Changes



Cause



Solution






APPLIES TO:




Oracle Database – Enterprise Edition – Version 9.2.0.1 and later




HP-UX Itanium



***Checked for relevance on 23-JUL-2013***


  • cursor: pin S wait on X (10g wait)


  • kksfbc child completion


  • SGA: allocation forcing component growth


  • There is no blocker process.

    The variety in wait events is explained by the fact that the underlying issue is with the pw_wait() HP-UX system call.

    Therefore, the above list is not exhaustive, use the pstack output below to match the problem in all cases.

    The problem is commonly seen on Parallel Execution Slave processes but can affect any Oracle process.

    A command to generate a systemstate, processstate, or errorstack dump e.g. via ALTER SESSION or oradebug command frees the hanging process.

    An example of the top of the call stack of an affected process, obtained using pstack:


    0: c0000000003e2ff0 : pw_wait() + 0x30 (/usr/lib/hpux64/libc.so.1)

    1: 4000000002e98a20 : pw_wait() + 0x80


  • Workaround: oradebug setospid+unlimit to free the hanging process:


  • 1.


  • Solution: on 11.31 install PHKL_38397 or later equivalent.


  • Solution: on 11.23 install PHKL_37809 or later equivalent.


  • Note: the patches whose names begin with PHKL_ are HP-UX patches. Please check with HP-UX support on the latest applicable patch numbers as such patches are regularly updated and superceded.


    3
    原因总结和建议


    >


    >


    >


    >


    INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)




    之所以执行不下去,不是因为死锁,而是因为该会话即

    SID=315

    的会话

    ,

    需要执行

    DML

    ,因此需要

    UNDO SEGMENT

    即回滚段来存储前镜像,但发现回滚段空间不足,需要通知

    SMON

    后台进程来完成扩展的请求

    ,

    但是长时间没有获得回滚段扩展成功与否的返回消息。前台进程和

    SMON

    之间通过

    POST/WAIT

    进制来通讯。




    由于操作系统

    HPUX

    调度的缺陷,具体来说是

    pw_wait

    系统调用

    (post/wait)

    存在缺陷,当

    SMON

    进程完成回滚段扩展后,消息返回时,前台进程

    SID=315

    并没有能被及时调度到

    CPU

    上继续处理后续工作,即体现出来就是等待在“

    undo segment extension

    “事件上等待,当我们人为使用

    ORADEBUG



    SID 315

    收集信息并间接唤醒了该前台进程后,前台进行可以继续往下工作,这更加印证了

    SMON

    已经将消息已经返回给前台进程,只是由于操作系统调度进制的问题,未能即使将前台进程调度到

    CPU

    上,拿到

    UNDO

    扩展成功的消息







    造成该故障的根本原因是操作系统调度机制的问题,该问题命中操作系统

    HPUX



    上的已知缺陷。






    The problem is a defect in the pw_wait() O/S system call.

    On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

    PHKL_37456 scheduler cumulative patch






    HPUX11.31


    上,当安装了


    PHKL_37456


    这个调度补丁后,操作系统调用


    pw_wait


    存在缺陷,导致进程无法从


    post/wait


    中被唤醒,解决方案是


    Solution: on 11.31 install PHKL_38397 or later equivalent.







    >


    >


    >


    >

     




    3.2 建议









    Solution: on 11.31 install PHKL_38397 or later equivalent.




    请系统管理员为操作系统安装


    PHKL_38397


    补丁。





    本文是转载中亦安图的文章。

    赞(0)
    【声明】:本博客不参与任何交易,也非中介,仅记录个人感兴趣的主机测评结果和优惠活动,内容均不作直接、间接、法定、约定的保证。访问本博客请务必遵守有关互联网的相关法律、规定与规则。一旦您访问本博客,即表示您已经知晓并接受了此声明通告。