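A Sun Fire 4810 (SF4810) at XX Unicom crashed unexpectedly. The VCS resources did not switch over automatically, and the billing data collection service was interrupted.
System version: Solaris 10
VCS version: 5.0
VxVM version: 5.0
On-site inspection found faults on IB6, RP0, and RP2: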
sun4810_sc1:SC> showlog -v
Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 846693 local0.notice] Device will not be polled
Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 752932 local0.notice] PCI I/O Board at /N0/IB6 Device poll caused: sun.serengeti.FailedHwException: I2cComm.readCmd: CBH Port is disabled: IB6.sbbc0.regs.c0 (118000c0)
Mar 14 09:53:10 sun4810_sc1 Platform.SC: [ID 846693 local0.notice] Device will not be polled
Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 679592 local0.error]
Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 679592 local0.error]
Mar 14 09:53:11 sun4810_sc1 Platform.SC: [ID 898456 local0.error]
/partition0/RP0/dx0:
General Error Status[0x1e] : 0x00010001
AccCPerr [16:16] : 0x1
CPerr [00:00] : 0x1 Control Parity Error
Safari Port Error Status 6[0x25] : 0x00050004
AccIFOv [16:16] : 0x1
AccErr [18:18] : 0x1
SafPar [02:02] : 0x1 Safari input parity error
>>> Safari Port Error Status 7[0x26] : 0x00078004
AccIFOv [16:16] : 0x1
AccErr [18:18] : 0x1
AccIFPar [17:17] : 0x1
FirstError [15:15] : 0x1
SafPar [02:02] : 0x1 Safari input parity error
Mar 14 09:53:14 sun4810_sc1 Platform.SC: [ID 208063 local0.error] [AD] Event: SF4810.ASIC.AR.ADR_PERR.10433006
CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12
Time: Sun Mar 14 09:53:13 GMT+08:00 2010
FRU-List-Count: 1; FRU-PN: 5014953; FRU-SN: 002538; FRU-LOC: RP0
Recommended-Action: Service action required
[AD] Event: SF4810.ASIC.AR.ADR_PERR.10433007
CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12
Time: Sun Mar 14 09:53:13 GMT+08:00 2010
FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 011488; FRU-LOC: /N0/IB6
FRU-PN: 5014953; FRU-SN: 014739; FRU-LOC: RP2
Recommended-Action: Service action required
Mar 14 09:53:14 sun4810_sc1 Platform.SC: [ID 789080 local0.crit] A fatal condition is detected on Domain A. Initiating automatic restoration for this domain.
Mar 14 09:53:16 sun4810_sc1 Platform.SC: [ID 187965 local0.error] Data Parity error polling failed. Board will no longer be polled: JtagController.tapWait: CBH Port is disabled: IB6.sdc.b0 (12c000b0)
Mar 14 09:53:38 sun4810_sc1 Platform.SC: [ID 930884 local0.error] [AD] Event: SF4810.ASIC.DX.SAF_IN_PAR_ERR.30233026
CSN: 138H2FE9 DomainID: A ADInfo: 1.SCAPP.20.12
Time: Sun Mar 14 09:53:38 GMT+08:00 2010
FRU-List-Count: 3; FRU-PN: 5014953; FRU-SN: 002538; FRU-LOC: RP0
FRU-PN: 5014404; FRU-SN: 011488; FRU-LOC: /N0/IB6
FRU-PN: 5014953; FRU-SN: 014739; FRU-LOC: RP2
Recommended-Action: Service action required
Mar 14 09:54:07 sun4810_sc1 Platform.SC: [ID 620190 local0.notice] A: CycleKeyswitch: Initiating keyswitch: off, domain A.
Mar 14 09:56:16 sun4810_sc1 Platform.SC: [ID 299301 local0.notice] A: CycleKeyswitch: Initiating keyswitch: on, domain A.
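The FRU locations named in the [AD] events can be pulled out of a saved copy of the `showlog -v` output to confirm which parts are implicated. This is a minimal sketch, assuming the log text has been captured to a file or pipe on a workstation; `fru_locations` is a hypothetical helper name, not a system controller command.

```shell
# Print each distinct FRU-LOC value found in showlog text read from stdin.
# Assumes at most one "FRU-LOC:" field per line, as in the excerpt above.
fru_locations() {
    sed -n 's/.*FRU-LOC: \([^ ;]*\).*/\1/p' | LC_ALL=C sort -u
}

# Example (showlog.txt is a hypothetical capture file):
#   fru_locations < showlog.txt
```

For the events shown above this would list /N0/IB6, RP0, and RP2, matching the boards identified on site.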
Boot the system:
sun4810_sc1:A> break
This will suspend Solaris in domain A.
Do you want to continue? [no] yes
Type 'go' to resume
debugger entered.
{a} ok boot
After the system booted, the VCS resources were faulted and could not be brought online:
root@zhcj2 # hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A zhcj1 RUNNING 0
A zhcj2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B cvm zhcj1 Y N ONLINE
B cvm zhcj2 Y N ONLINE
B vrts_vea_cfs_int_cfsmount1 zhcj1 Y N STARTING|PARTIAL
B vrts_vea_cfs_int_cfsmount1 zhcj2 Y N STARTING|PARTIAL
-- RESOURCES FAILED
-- Group Type Resource System
C vrts_vea_cfs_int_cfsmount1 CFSMount cfsmount3 zhcj1
C vrts_vea_cfs_int_cfsmount1 CFSMount cfsmount3 zhcj2
C vrts_vea_cfs_int_cfsmount1 CFSMount cfsmount4 zhcj1
C vrts_vea_cfs_int_cfsmount1 CFSMount cfsmount4 zhcj2
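The failed resources can also be extracted from the summary for use in a script. This is a rough sketch, assuming the `hastatus -sum` layout shown above, where rows in the RESOURCES FAILED section start with `C` and have exactly five fields; `list_failed_resources` is a hypothetical helper name.

```shell
# Read `hastatus -sum` output on stdin and print "resource system"
# for every row of the RESOURCES FAILED section.
# Assumption: those rows are "C <group> <type> <resource> <system>".
list_failed_resources() {
    awk '$1 == "C" && NF == 5 { print $4, $5 }'
}

# Example:
#   hastatus -sum | list_failed_resources
```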
Clear the faulted resources:
root@zhcj2 # hagrp -clear vrts_vea_cfs_int_cfsmount1 -sys zhcj1
root@zhcj2 # hagrp -clear vrts_vea_cfs_int_cfsmount1 -sys zhcj2
root@zhcj2 # hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A zhcj1 RUNNING 0
A zhcj2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B cvm zhcj1 Y N ONLINE
B cvm zhcj2 Y N ONLINE
B vrts_vea_cfs_int_cfsmount1 zhcj1 Y N PARTIAL
B vrts_vea_cfs_int_cfsmount1 zhcj2 Y N PARTIAL
Bring the resources online:
root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj1
root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj2
root@zhcj2 # hastatus
attempting to connect....
attempting to connect....connected
group resource system message
--------------- -------------------- -------------------- --------------------
zhcj1 RUNNING
zhcj2 RUNNING
cvm zhcj1 ONLINE
cvm zhcj2 ONLINE
-------------------------------------------------------------------------
vrts_vea_cfs_int_cfsmount1 zhcj1 STARTING PARTIAL
vrts_vea_cfs_int_cfsmount1 zhcj2 STARTING PARTIAL
VCShmg zhcj1 OFFLINE
VCShmg zhcj2 OFFLINE
vxfsckd zhcj1 ONLINE
-------------------------------------------------------------------------
vxfsckd zhcj2 ONLINE
cvm_clus zhcj1 ONLINE
cvm_clus zhcj2 ONLINE
cvm_vxconfigd zhcj1 ONLINE
cvm_vxconfigd zhcj2 ONLINE
-------------------------------------------------------------------------
cfsmount2 zhcj1 ONLINE
cfsmount2 zhcj2 ONLINE
cfsmount3 zhcj1 *FAULTED*
cfsmount3 zhcj2 *FAULTED*
cfsmount4 zhcj1 *FAULTED*
-------------------------------------------------------------------------
cfsmount4 zhcj2 *FAULTED*
cvmvoldg1 zhcj1 ONLINE
cvmvoldg1 zhcj2 ONLINE
cvmvoldg2 zhcj1 ONLINE
cvmvoldg2 zhcj2 ONLINE
-------------------------------------------------------------------------
VCShm zhcj1 ONLINE
VCShm zhcj2 ONLINE
The cfsmount2 resource came ONLINE normally, but cfsmount3 and cfsmount4 remained *FAULTED* and could not be brought online.
Check the engine log:
#tail -f /var/VRTSvcs/log/engine_A.log
2010/03/13 18:12:48 VCS NOTICE V-16-1-10232 Clearing Restart attribute for group vrts_vea_cfs_int_cfsmount1 on node zhcj1
2010/03/13 18:12:48 VCS NOTICE V-16-1-10232 Clearing Restart attribute for group vrts_vea_cfs_int_cfsmount1 on node zhcj2
2010/03/13 18:12:48 VCS NOTICE V-16-1-10301 Initiating Online of Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) on System zhcj1
2010/03/13 18:12:49 VCS WARNING V-16-20011-5502 (zhcj1) CFSMount:cfsmount4:online:Mount Failed on this Node MountOptions -F vxfs -o cluster,nosuid,rw Block Device /dev/vx/dsk/dasdatadg/daslv : MountPoint /das
2010/03/13 18:12:49 VCS WARNING V-16-20011-5508 (zhcj1) CFSMount:cfsmount4:online:Mount Error : UX:vxfs mount: ERROR: V-3-21264: /dev/vx/dsk/dasdatadg/daslv is already mounted, /das is busy,
allowable number of mount points exceeded
2010/03/13 18:13:51 VCS ERROR V-16-2-13066 (zhcj1) Agent is calling clean for resource(cfsmount4) because the resource is not up even after online completed.
2010/03/13 18:13:53 VCS INFO V-16-2-13068 (zhcj1) Resource(cfsmount4) - clean completed successfully.
2010/03/13 18:13:53 VCS INFO V-16-2-13071 (zhcj1) Resource(cfsmount4): reached OnlineRetryLimit(0).
2010/03/13 18:13:54 VCS ERROR V-16-1-10303 Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is FAULTED (timed out) on sys zhcj1
2010/03/13 18:13:54 VCS INFO V-16-6-15004 (zhcj1) hatrigger:Failed to send trigger for resfault; script doesn't exist
2010/03/13 18:13:57 VCS INFO V-16-1-10307 Resource cfsmount4 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is offline on zhcj1 (Not initiated by VCS)
2010/03/13 18:13:59 VCS NOTICE V-16-1-10301 Initiating Online of Resource cfsmount3 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) on System zhcj1
2010/03/13 18:13:59 VCS WARNING V-16-20011-5502 (zhcj1) CFSMount:cfsmount3:online:Mount Failed on this Node MountOptions -F vxfs -o cluster,nosuid,rw Block Device /dev/vx/dsk/dasdata1dg/dasdata1lv : MountPoint /dasdata1
2010/03/13 18:14:01 VCS WARNING V-16-20011-5508 (zhcj1) CFSMount:cfsmount3:online:Mount Error : UX:vxfs mount: ERROR: V-3-21268: /dev/vx/dsk/dasdata1dg/dasdata1lv is corrupted. needs checking
2010/03/13 18:16:01 VCS ERROR V-16-2-13066 (zhcj1) Agent is calling clean for resource(cfsmount3) because the resource is not up even after online completed.
2010/03/13 18:16:01 VCS INFO V-16-2-13068 (zhcj1) Resource(cfsmount3) - clean completed successfully.
2010/03/13 18:16:01 VCS INFO V-16-2-13071 (zhcj1) Resource(cfsmount3): reached OnlineRetryLimit(0).
2010/03/13 18:16:05 VCS ERROR V-16-1-10303 Resource cfsmount3 (Owner: unknown, Group: vrts_vea_cfs_int_cfsmount1) is FAULTED (timed out) on sys zhcj1
2010/03/13 18:16:05 VCS INFO V-16-6-15004 (zhcj1) hatrigger:Failed to send trigger for resfault; script doesn't exist
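The log shows two separate causes: the volume behind cfsmount3 (/dev/vx/dsk/dasdata1dg/dasdata1lv) is flagged as corrupted and needs an fsck, and the mount point of cfsmount4 (/das) is busy. Neither file system could therefore be mounted.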
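To isolate the mount errors from the rest of the engine log, the lines can be filtered by message ID. This is a minimal sketch, assuming the message format shown in the excerpt above (ID V-16-20011-5508, with the resource name after `CFSMount:`); `cfs_mount_errors` is a hypothetical helper name.

```shell
# Print "<resource>: <vxfs mount error text>" for every CFSMount
# mount error recorded in the given VCS engine log file.
cfs_mount_errors() {
    grep 'V-16-20011-5508' "$1" |
        sed 's/.*CFSMount:\([^:]*\):online:Mount Error : /\1: /'
}

# Example:
#   cfs_mount_errors /var/VRTSvcs/log/engine_A.log
```

When an error message wraps onto a second log line, as the V-3-21264 message does above, only its first line is printed.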
Run a full fsck on the volume:
Running fsck -F vxfs /dev/vx/dsk/dasdata1dg/dasdata1lv reported that a full fsck was required.
full fsck: A full filesystem check is generally not necessary unless there has been some sort of disk hardware failure.
root@zhcj2 # fsck -y -F vxfs -o full /dev/vx/dsk/dasdata1dg/dasdata1lv
After the full fsck, cfsmount3 came online normally.
Resolve the cfsmount4 busy problem
On node 1 (zhcj1):
root@zhcj1 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das
UX:vxfs mount: ERROR: V-3-21264: /dev/vx/dsk/dasdatadg/daslv is already mounted, /das is busy,
allowable number of mount points exceeded
root@zhcj1 # fuser -u /das
/das: 9277c(das)
root@zhcj1 # ps -ef|grep 9277
das 9277 9275 0 17:31:32 pts/3 0:00 -sh
root 25369 2177 0 18:30:16 pts/1 0:00 grep 9277
root@zhcj1 # who
root pts/1 Mar 13 17:14 (130.71.8.61)
root dtremote Mar 13 17:38 (130.71.8.61:1)
das pts/3 Mar 13 17:31 (130.71.8.229)
root pts/5 Mar 13 17:39 (130.71.8.61:1.0)
root pts/6 Mar 13 17:44 (130.71.8.61)
root pts/7 Mar 13 17:47 (192.168.88.24)
root@zhcj1 # kill -9 9277
root@zhcj1 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das
(the mount now succeeds)
root@zhcj1 # umount /das
On node 2 (zhcj2):
root@zhcj2 # fuser -u /das
/das: 7959c(das)
root@zhcj2 # who
root console Mar 13 17:30
root pts/1 Mar 13 17:31 (130.71.8.61)
root pts/2 Mar 13 17:32 (192.168.88.24)
das pts/3 Mar 13 17:38 (130.71.8.229)
root@zhcj2 # kill 7959
root@zhcj2 # kill -9 7959
root@zhcj2 # fuser -u /das
/das:
root@zhcj2 # mount -F vxfs -o cluster /dev/vx/dsk/dasdatadg/daslv /das
(the mount now succeeds)
root@zhcj2 # umount /das
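The check for processes holding /das can be scripted. This is a small sketch that parses `fuser` output of the form shown in the transcripts above (for example `/das: 9277c(das)`); `pids_from_fuser_output` is a hypothetical helper name, and the parsing assumes that format only.

```shell
# Read fuser output on stdin and print one PID per line.
# Strips the leading "path:" label, then keeps the leading digits of
# each token, dropping the usage letter and "(user)" suffix.
pids_from_fuser_output() {
    sed 's/^[^:]*://' | tr ' ' '\n' | sed -n 's/^\([0-9][0-9]*\).*/\1/p'
}

# Example: show who is holding the mount point before killing anything.
#   fuser -u /das 2>&1 | pids_from_fuser_output |
#       while read pid; do ps -fp "$pid"; done
```

As on node 2 above, it is usually worth trying a plain `kill` first and using `kill -9` only if the process does not exit.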
Bring the service group online (now succeeds):
root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj1
root@zhcj2 # hagrp -online vrts_vea_cfs_int_cfsmount1 -sys zhcj2
Notify the customer to start the application and restore service.