XX Unicom SF4810 crash + VCS resources offline: incident handling summary

    Tech  2022-05-19

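    1. Fault description

    The Sun Fire 4810 (SF4810) at XX Unicom crashed unexpectedly, the VCS resources did not fail over automatically, and the billing data collection service was interrupted.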

     

    System version: Solaris 10

    VCS version: 5.0

    VxVM version: 5.0

    2. Handling process

    2.1 Check the hardware and boot the system

    The on-site inspection found faults on IB6, RP0, and RP2:

    sun4810_sc1:SC> showlog -v

    Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 846693 local0.notice] Device will not be polled

    Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 752932 local0.notice] PCI I/O Board at /N0/IB6 Device poll caused: sun.serengeti.FailedHwException: I2cComm.readCmd:  CBH Port is disabled: IB6.sbbc0.regs.c0 (118000c0)

    Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 846693 local0.notice] Device will not be polled

    Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 679592 local0.error] 

    Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 679592 local0.error] 

    Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 898456 local0.error] 

    /partition0/RP0/dx0: 

        General Error Status[0x1e] : 0x00010001

                    AccCPerr [16:16] : 0x1 

                       CPerr [00:00] : 0x1 Control Parity Error

        Safari Port Error Status 6[0x25] : 0x00050004

                     AccIFOv [16:16] : 0x1 

                      AccErr [18:18] : 0x1 

                      SafPar [02:02] : 0x1 Safari input parity error

    >>> Safari Port Error Status 7[0x26] : 0x00078004

                     AccIFOv [16:16] : 0x1 

                      AccErr [18:18] : 0x1 

                    AccIFPar [17:17] : 0x1 

                  FirstError [15:15] : 0x1 

                      SafPar [02:02] : 0x1 Safari input parity error

    Mar 14 09:53:14 sun4810_sc1 Platform.SC: [ID 208063 local0.error] [AD] Event: SF4810.ASIC.AR.ADR_PERR.10433006

         CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12

         Time: Sun Mar 14 09:53:13 GMT+08:00 2010

         FRU-List-Count: 1; FRU-PN: 5014953; FRU-SN: 002538; FRU-LOC: RP0

         Recommended-Action: Service action required

    [AD] Event: SF4810.ASIC.AR.ADR_PERR.10433007

         CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12

         Time: Sun Mar 14 09:53:13 GMT+08:00 2010

         FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 011488; FRU-LOC: /N0/IB6

                         FRU-PN: 5014953; FRU-SN: 014739; FRU-LOC: RP2

         Recommended-Action: Service action required

    Mar 14 09:53:14 sun4810_sc1 Platform.SC: [ID 789080 local0.crit] A fatal condition is detected on Domain A. Initiating automatic restoration for this domain.

    Mar 14 09:53:16 sun4810_sc1 Platform.SC: [ID 187965 local0.error] Data Parity error polling failed. Board will no longer be polled: JtagController.tapWait:  CBH Port is disabled: IB6.sdc.b0 (12c000b0)

    Mar 14 09:53:38 sun4810_sc1 Platform.SC: [ID 930884 local0.error] [AD] Event: SF4810.ASIC.DX.SAF_IN_PAR_ERR.30233026

         CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12

         Time: Sun Mar 14 09:53:38 GMT+08:00 2010

         FRU-List-Count: 3; FRU-PN: 5014953; FRU-SN: 002538; FRU-LOC: RP0

                            FRU-PN: 5014404; FRU-SN: 011488; FRU-LOC: /N0/IB6

                            FRU-PN: 5014953; FRU-SN: 014739; FRU-LOC: RP2

         Recommended-Action: Service action required

    Mar 14 09:54:07 sun4810_sc1 Platform.SC: [ID 620190 local0.notice] A: CycleKeyswitch: Initiating keyswitch: off, domain A.

    Mar 14 09:56:16 sun4810_sc1 Platform.SC: [ID 299301 local0.notice] A: CycleKeyswitch: Initiating keyswitch: on, domain A.
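The suspect parts (IB6, RP0, RP2) are the ones named in the `FRU-LOC:` fields of the `[AD] Event` records in this log. A minimal sketch for pulling them out of a saved copy of the `showlog -v` output follows; the function name `list_fru_locs` and the example file name are hypothetical, and the field layout is assumed to match the records shown here.

```shell
# list_fru_locs: read saved `showlog -v` output on stdin and print each
# distinct FRU location once.
# Assumption: every suspect part appears as "FRU-LOC: <location>", as in
# the [AD] Event records in this log.
list_fru_locs() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "FRU-LOC:") print $(i + 1)
    }' | LC_ALL=C sort -u
}

# Example (hypothetical file name):
#   list_fru_locs < showlog_sc1.txt
```

Sorting with `LC_ALL=C` keeps the order stable across locales, so the list can be compared between incidents.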

     

    Boot the system:

    sun4810_sc1:A> break

     

    This will suspend Solaris in domain A.

    Do you want to continue? [no] yes

    Type 'go' to resume

    debugger entered.

    {a} ok boot

     

    2.2 Resolve the VCS resources that fail to come online

    After the system booted, the VCS resources were FAULTED and could not be brought online:

    root@zhcj2 # hastatus -sum

    -- SYSTEM STATE

    -- System               State                Frozen             

    A  zhcj1                RUNNING              0                   

    A  zhcj2                RUNNING              0                   

    -- GROUP STATE

    -- Group           System               Probed     AutoDisabled    State         

    B  cvm             zhcj1                Y          N               ONLINE        

    B  cvm             zhcj2                Y          N               ONLINE        

    B  vrts_vea_cfs_int_cfsmount1  zhcj1      Y          N              STARTING|PARTIAL

    B  vrts_vea_cfs_int_cfsmount1 zhcj2       Y          N               STARTING|PARTIAL

    -- RESOURCES FAILED

    -- Group           Type                 Resource             System             

     

    C  vrts_vea_cfs_int_cfsmount1 CFSMount             cfsmount3            zhcj1               

    C  vrts_vea_cfs_int_cfsmount1 CFSMount             cfsmount3            zhcj2              

    C  vrts_vea_cfs_int_cfsmount1 CFSMount             cfsmount4            zhcj1              

    C  vrts_vea_cfs_int_cfsmount1 CFSMount             cfsmount4            zhcj2
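The RESOURCES FAILED section of this summary can be extracted mechanically when there are many groups. A minimal sketch, assuming the section headers and the Group / Type / Resource / System column order shown in this output; `list_failed_resources` is a hypothetical helper name, not a VCS command.

```shell
# list_failed_resources: read `hastatus -sum` output on stdin and print
# "resource system" for every row of the RESOURCES FAILED section.
# Assumptions: section headers look like "-- RESOURCES FAILED" (all caps),
# column headers like "-- Group Type ...", and data rows carry the columns
# <tag> Group Type Resource System.
# On Solaris 10, nawk or /usr/xpg4/bin/awk may be needed instead of awk.
list_failed_resources() {
    awk '
        # All-caps word after "--": a section header. Track whether we are
        # inside the RESOURCES FAILED section.
        $1 == "--" && $2 ~ /^[A-Z][A-Z]+$/ {
            in_failed = ($2 == "RESOURCES" && $3 == "FAILED")
            next
        }
        # Any other "--" line is a column header.
        $1 == "--" { next }
        in_failed && NF >= 5 { print $4, $5 }
    '
}

# Typical use on a cluster node:
#   hastatus -sum | list_failed_resources
```

Keying on the section header rather than on the leading letter of each row avoids depending on which tag letter a given VCS release uses for failed resources.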

     

    Clear the faulted resources:

    root@zhcj2 # hagrp -clear vrts_vea_cfs_int_cfsmount1 -sys zhcj1

    root@zhcj2 # hagrp -clear vrts_vea_cfs_int_cfsmount1 -sys zhcj2

    root@zhcj2 # hastatus -sum

    -- SYSTEM STATE

    -- System               State                Frozen             

    A  zhcj1                RUNNING              0                   

    A  zhcj2                RUNNING              0                   

     

    -- GROUP STATE

    -- Group           System               Probed     AutoDisabled    State         

     

    B  cvm             zhcj1                Y          N               ONLINE        

    B  cvm             zhcj2                Y          N               ONLINE        

    B  vrts_vea_cfs_int_cfsmount1 zhcj1       Y          N               PARTIAL       

    B  vrts_vea_cfs_int_cfsmount1 zhcj2        Y         N               PARTIAL

    Bring the resources online:

    root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj1

    root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj2

    root@zhcj2 # hastatus

    attempting to connect....

    attempting to connect....connected

    group           resource             system               message            

    --------------- -------------------- -------------------- --------------------

                                         zhcj1                RUNNING            

                                         zhcj2                RUNNING            

    cvm                                  zhcj1                ONLINE             

    cvm                                  zhcj2                ONLINE             

    -------------------------------------------------------------------------

    vrts_vea_cfs_int_cfsmount1                      zhcj1                STARTING PARTIAL   

    vrts_vea_cfs_int_cfsmount1                      zhcj2                STARTING PARTIAL   

    VCShmg                               zhcj1                OFFLINE            

    VCShmg                               zhcj2                OFFLINE            

                    vxfsckd              zhcj1                ONLINE             

    -------------------------------------------------------------------------

                    vxfsckd              zhcj2                ONLINE             

                    cvm_clus             zhcj1                ONLINE             

                    cvm_clus             zhcj2                ONLINE             

                    cvm_vxconfigd        zhcj1                ONLINE             

                    cvm_vxconfigd        zhcj2                ONLINE             

    -------------------------------------------------------------------------

                    cfsmount2            zhcj1                ONLINE             

                    cfsmount2            zhcj2                ONLINE             

                    cfsmount3            zhcj1                *FAULTED*          

                    cfsmount3            zhcj2                *FAULTED*          

                    cfsmount4            zhcj1                *FAULTED*          

    -------------------------------------------------------------------------

                    cfsmount4            zhcj2                *FAULTED*          

                    cvmvoldg1            zhcj1                ONLINE             

                    cvmvoldg1            zhcj2                ONLINE             

                    cvmvoldg2            zhcj1                ONLINE             

                    cvmvoldg2            zhcj2                ONLINE             

    -------------------------------------------------------------------------

                    VCShm                zhcj1                ONLINE             

                    VCShm                zhcj2                ONLINE          

     

    cfsmount2 came online normally, but cfsmount3 and cfsmount4 stayed *FAULTED* and could not be brought online.

     

    Check the log:

     

    #tail -f /var/VRTSvcs/log/engine_A.log

    2010/03/13 18:12:48 VCS NOTICE V-16-1-10232 Clearing Restart attribute for group vrts_vea_cfs_int_cfsmount1 on node zhcj1

    2010/03/13 18:12:48 VCS NOTICE V-16-1-10232 Clearing Restart attribute for group vrts_vea_cfs_int_cfsmount1 on node zhcj2

    2010/03/13 18:12:48 VCS NOTICE V-16-1-10301 Initiating Online of Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) on System zhcj1

    2010/03/13 18:12:49 VCS WARNING V-16-20011-5502 (zhcj1) CFSMount:cfsmount4:online:Mount Failed on this Node MountOptions -F vxfs -o cluster,nosuid,rw                 Block Device /dev/vx/dsk/dasdatadg/daslv : MountPoint /das

    2010/03/13 18:12:49 VCS WARNING V-16-20011-5508 (zhcj1) CFSMount:cfsmount4:online:Mount Error : UX:vxfs mount: ERROR: V-3-21264: /dev/vx/dsk/dasdatadg/daslv is already mounted, /das is busy,

                    allowable number of mount points exceeded

    2010/03/13 18:13:51 VCS ERROR V-16-2-13066 (zhcj1) Agent is calling clean for resource(cfsmount4) because the resource is not up even after online completed.

    2010/03/13 18:13:53 VCS INFO V-16-2-13068 (zhcj1) Resource(cfsmount4) - clean completed successfully.

    2010/03/13 18:13:53 VCS INFO V-16-2-13071 (zhcj1) Resource(cfsmount4): reached OnlineRetryLimit(0).

    2010/03/13 18:13:54 VCS ERROR V-16-1-10303 Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is FAULTED (timed out) on sys zhcj1

    2010/03/13 18:13:54 VCS INFO V-16-6-15004 (zhcj1) hatrigger:Failed to send trigger for resfault; script doesn't exist

    2010/03/13 18:13:57 VCS INFO V-16-1-10307 Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is offline on zhcj1 (Not initiated by VCS)

    2010/03/13 18:13:59 VCS NOTICE V-16-1-10301 Initiating Online of Resource cfsmount3 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) on System zhcj1

    2010/03/13 18:13:59 VCS WARNING V-16-20011-5502 (zhcj1) CFSMount:cfsmount3:online:Mount Failed on this Node MountOptions -F vxfs -o cluster,nosuid,rw                 Block Device /dev/vx/dsk/dasdata1dg/dasdata1lv : MountPoint /dasdata1

    2010/03/13 18:14:01 VCS WARNING V-16-20011-5508 (zhcj1) CFSMount:cfsmount3:online:Mount Error : UX:vxfs mount: ERROR: V-3-21268: /dev/vx/dsk/dasdata1dg/dasdata1lv is corrupted. needs checking

    2010/03/13 18:16:01 VCS ERROR V-16-2-13066 (zhcj1) Agent is calling clean for resource(cfsmount3) because the resource is not up even after online completed.

    2010/03/13 18:16:01 VCS INFO V-16-2-13068 (zhcj1) Resource(cfsmount3) - clean completed successfully.

    2010/03/13 18:16:01 VCS INFO V-16-2-13071 (zhcj1) Resource(cfsmount3): reached OnlineRetryLimit(0).

    2010/03/13 18:16:05 VCS ERROR V-16-1-10303 Resource cfsmount3 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is FAULTED (timed out) on sys zhcj1

    2010/03/13 18:16:05 VCS INFO V-16-6-15004 (zhcj1) hatrigger:Failed to send trigger for resfault; script doesn't exist

     

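    The log shows that the logical volume behind cfsmount3 needs an fsck and that the /das mount point used by cfsmount4 is busy, so neither resource could be mounted.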
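Both root causes appear on the `V-16-20011-5508` lines of `engine_A.log`. A minimal sketch that reduces the log to one line per failed mount follows; `mount_errors` is a hypothetical helper name, and it assumes that message ID and the `CFSMount:<resource>:online:` field layout seen in the excerpt above.

```shell
# mount_errors: read engine_A.log on stdin and print
# "<node> <resource>: <vxfs error text>" for each CFSMount mount error.
# Assumption: mount errors are logged with message ID V-16-20011-5508 and
# the 7th field has the form CFSMount:<resource>:online:..., as in the
# excerpt above.
mount_errors() {
    awk '$5 == "V-16-20011-5508" {
        split($7, part, ":")             # part[2] is the resource name
        node = $6
        gsub(/[()]/, "", node)           # "(zhcj1)" -> "zhcj1"
        msg = $0
        sub(/^.*ERROR: /, "", msg)       # keep only the vxfs error text
        print node, part[2] ": " msg
    }'
}

# Typical use:
#   mount_errors < /var/VRTSvcs/log/engine_A.log
```

Only the first line of a wrapped message is kept, which is enough to tell a busy mount point (V-3-21264) from a file system that needs checking (V-3-21268).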

     

    Run a full fsck on the logical volume:

     

    Running fsck -F vxfs /dev/vx/dsk/dasdata1dg/dasdata1lv reported that a full fsck was required.

    full fsck: A full filesystem check is generally not necessary unless there has been some sort of disk hardware failure.

    root@zhcj2 # fsck -y -F vxfs -o full /dev/vx/dsk/dasdata1dg/dasdata1lv

    After the full fsck, cfsmount3 came online normally when it was brought online again.
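The two-step approach used here (ordinary fsck first, full fsck only when that is not enough) can be sketched as a small wrapper. This is an illustration only: `check_vxfs` and the `FSCK` variable are hypothetical names, and treating any nonzero exit status of the first fsck as "full check needed" is an assumption, not documented VxFS behaviour. The volume must not be mounted on any node while this runs.

```shell
# FSCK can be overridden (for example with a stub) for dry runs.
FSCK=${FSCK:-fsck}

# check_vxfs <device>: try the ordinary VxFS fsck first and fall back to
# a full check only if it fails.
# Assumption: a nonzero exit status from the first fsck means the quick
# check was not sufficient.
check_vxfs() {
    dev=$1
    if "$FSCK" -F vxfs "$dev"; then
        echo "quick fsck was sufficient: $dev"
    else
        echo "escalating to full fsck: $dev"
        "$FSCK" -y -F vxfs -o full "$dev"
    fi
}

# Typical use:
#   check_vxfs /dev/vx/dsk/dasdata1dg/dasdata1lv
```

Trying the quick check first matters on large volumes, because a full structural check can take far longer than the quick one.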

     

     

    Resolve the cfsmount4 "busy" problem

    On node 1:

    root@zhcj1 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das

    UX:vxfs mount: ERROR: V-3-21264: /dev/vx/dsk/dasdatadg/daslv is already mounted, /das is busy,

                    allowable number of mount points exceeded

    root@zhcj1 # fuser -u /das

    /das:     9277c(das)

    root@zhcj1 # ps -ef|grep 9277

         das  9277  9275   0 17:31:32 pts/3       0:00 -sh

        root 25369  2177   0 18:30:16 pts/1       0:00 grep 9277

    root@zhcj1 # who    

    root       pts/1        Mar 13 17:14    (130.71.8.61)

    root       dtremote     Mar 13 17:38    (130.71.8.61:1)

    das        pts/3        Mar 13 17:31    (130.71.8.229)

    root       pts/5        Mar 13 17:39    (130.71.8.61:1.0)

    root       pts/6        Mar 13 17:44    (130.71.8.61)

    root       pts/7        Mar 13 17:47    (192.168.88.24)

    root@zhcj1 # kill -9 9277

    root@zhcj1 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das    (mount now succeeds)

    root@zhcj1 # umount /das

    On node 2:

    root@zhcj2 # fuser -u /das

    /das:     7959c(das)

    root@zhcj2 # who 

    root       console      Mar 13 17:30

    root       pts/1        Mar 13 17:31    (130.71.8.61)

    root       pts/2        Mar 13 17:32    (192.168.88.24)

    das        pts/3        Mar 13 17:38    (130.71.8.229)

    root@zhcj2 # kill 7959

    root@zhcj2 # kill -9 7959

    root@zhcj2 # fuser -u /das

    /das:

    root@zhcj2 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das    (mount now succeeds)

    root@zhcj2 # umount /das
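The same sequence was needed on both nodes: find the process holding /das, try a plain kill, then kill -9. A minimal sketch of that sequence follows; `parse_fuser`, `stop_pid`, and `GRACE` are hypothetical names, and the parser assumes the `fuser -u` line format shown above (`/das:     9277c(das)`).

```shell
# parse_fuser: read `fuser -u <dir>` terminal output on stdin and print
# "<pid> <user>" per process.
# Assumption: entries look like 9277c(das), i.e. PID, usage letters, and
# the login name in parentheses.
parse_fuser() {
    sed 's/^[^:]*://' |
        tr -s ' \t' '\n\n' |
        sed -n 's/^\([0-9][0-9]*\)[a-z]*(\(.*\))$/\1 \2/p'
}

# stop_pid <pid>: send SIGTERM, wait GRACE seconds (default 5), and send
# SIGKILL only if the process is still there.
stop_pid() {
    pid=$1
    kill "$pid" 2>/dev/null || true
    sleep "${GRACE:-5}"
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
    fi
}

# Typical use (fuser prints part of its report on stderr, hence 2>&1):
#   fuser -u /das 2>&1 | parse_fuser
#   stop_pid 9277
```

Printing the user next to each PID makes it possible to contact the owner of the session (here the das login on pts/3) before the shell is killed.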

    The resource group then comes online normally:

    root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj1

    root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj2

     

    Notify the customer to start the applications and restore service.

     

