Oracle CSSD failure 134 Oracle reboots system
05-14-2010 12:47 AM
We are facing an issue with Oracle RAC on HP-UX. Our current environment is two rx3600 servers attached to an EVA 4400, using SGeRAC for Oracle RAC on HP-UX 11.23. We created shared logical volumes for Oracle RAC; the VG information is attached.
The issue: when we remove one SAN switch for testing, Oracle CRS fails and TOCs the servers. When we removed the second switch, Oracle did not crash, but we found "PV is still accessible" messages in syslog. Whenever the active PV path for the VG becomes unavailable and LVM switches the LV from one PV link to another, CRS fails and reboots the systems. The error logs follow:
May 13 22:41:55 KSEHPDR2 vmunix: fclp driver at 0/3/0/0/0/0 (/dev/fclp0) : detected that device id 0x10000, PWWN 0x5001438004c70f4d is offline.
May 13 22:41:55 KSEHPDR2 vmunix: fclp driver at 0/3/0/0/0/0 (/dev/fclp0) : detected that device id 0x10100, PWWN 0x5001438004c70f49 is offline.
May 13 22:42:27 KSEHPDR2 cmdisklockd[2487]: WARNING: Cluster lock disk /dev/dsk/c12t1d7 has failed: I/O error. Until it is fixed, a single failure could cause all nodes in the cluster to crash.
May 13 22:42:27 KSEHPDR2 cmcld[2484]: Cluster lock disk /dev/vglock:/dev/dsk/c12t1d7 is bad
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3ae8000), from raw device 0x1f0c1200 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1200 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3aec000), from raw device 0x1f0c1300 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1300 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af0000), from raw device 0x1f0c1400 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1400 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af4000), from raw device 0x1f0c1500 (with priority: 0, and current flags: 0x40) to raw device 0x1f081500 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af8000), from raw device 0x1f0c1600 (with priority: 0, and current flags: 0x40) to raw device 0x1f081600 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b380c000), from raw device 0x1f0c0200 (with priority: 0, and current flags: 0x40) to raw device 0x1f080200 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a78000), from raw device 0x1f0c0300 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e0300 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a7c000), from raw device 0x1f0c0400 (with priority: 0, and current flags: 0x40) to raw device 0x1f080400 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a80000), from raw device 0x1f0c0500 (with priority: 0, and current flags: 0x40) to raw device 0x1f080500 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a88000), from raw device 0x1f0c2100 (with priority: 0, and current flags: 0x40) to raw device 0x1f082100 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3afc000), from raw device 0x1f0c0600 (with priority: 0, and current flags: 0x40) to raw device 0x1f080600 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b66000), from raw device 0x1f0c0700 (with priority: 0, and current flags: 0x40) to raw device 0x1f080700 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b6a000), from raw device 0x1f0c1000 (with priority: 0, and current flags: 0x40) to raw device 0x1f081000 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b6e000), from raw device 0x1f0c1100 (with priority: 0, and current flags: 0x40) to raw device 0x1f081100 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c0600 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e0600 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c0700 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e0700 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c1000 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e1000 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c1100 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e1100 Failed! The PV is still accessible.
May 13 22:42:41 KSEHPDR2 vmunix: LVM: WARNING: VG 64 0x030000: LV 1: Some I/O requests to this LV are waiting
May 13 22:42:41 KSEHPDR2 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 13 22:42:41 KSEHPDR2 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 13 22:42:47 KSEHPDR2 syslog: Oracle CSSD failure 134.
May 13 22:42:47 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:48 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Sending SIGKILL to process /u01/crs/oracle/product/10/app/bin/ocssd.bin (pid: 3062) after communication problem detected.
May 13 22:42:47 KSEHPDR2 syslog: Oracle CSSD failure 134.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Sending SIGKILL to process /u01/crs/oracle/product/10/app/bin/ocssd.bin (pid: 3062) after communication problem detected.
May 13 22:42:48 KSEHPDR2 syslogd: restart
May 13 22:42:47 KSEHPDR2 cmclconfd[2510]: Updated file /etc/cmcluster/cmclconfig.tmp for node KSEHPDR2 (length = 67616).
May 13 22:42:47 KSEHPDR2 cmclconfd[2510]: Updated file /etc/cmcluster/cmclconfig.tmp for node KSEHPDR2 (length = 0).
May 13 22:42:47 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Request for primary member 2 on node 2 to leave group crsdrt_NG
May 13 22:42:47 KSEHPDR2 cmfileassistd[2493]: Updated file /etc/cmcluster/cmclconfig (length = 33036).
May 13 22:42:47 KSEHPDR2 cmcld[2484]: Sending file $SGCONF/cmclconfig (33036 bytes) to file assistant daemon.
May 13 22:42:48 KSEHPDR2 su: + tty?? root-oracle
May 13 22:42:48 KSEHPDR2 syslog: Oracle clsomon failed with fatal status 12.
May 13 22:42:48 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:48 KSEHPDR2 syslogd: restart
May 13 22:42:48 KSEHPDR2 su: + tty?? root-oracle
05-14-2010 01:23 AM
Re: Oracle CSSD failure 134 Oracle reboots system
Please provide the output of:
vgdisplay -v /dev/vglock
ls -l /dev/*/group | grep "0x070000"
Good luck
Prasanth
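For anyone following along, both checks can be run as root on either node. An illustrative HP-UX session sketch (the device paths are the ones named in this thread; actual output will vary per system):

```
# vgdisplay -v /dev/vglock              # lock VG status, plus each PV and alternate link
# ls -l /dev/*/group | grep "0x070000"  # map minor number 0x070000 back to its VG
```

The second command matters because the syslog lines identify VGs only as "VG 64 0x070000"; the group-file minor number is how you recover the human-readable VG name.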
05-14-2010 01:45 AM
Re: Oracle CSSD failure 134 Oracle reboots system
crw-r----- 1 root sys 64 0x000000 Nov 24 16:38 /dev/vg00/group
crw-rw-rw- 1 root sys 64 0x050000 Mar 9 11:06 /dev/vgasm1/group
crw-rw-rw- 1 root sys 64 0x070000 Mar 8 14:50 /dev/vgbackup/group
crw-rw-rw- 1 root sys 64 0x030000 Mar 3 12:45 /dev/vghp1/group
crw-rw-rw- 1 root sys 64 0x020000 Mar 8 12:24 /dev/vglock/group
crw-rw-rw- 1 root sys 64 0x060000 Mar 8 14:50 /dev/vgvote/group
05-14-2010 02:30 AM
Re: Oracle CSSD failure 134 Oracle reboots system
Regards
MASOOD
05-14-2010 02:57 AM
Re: Oracle CSSD failure 134 Oracle reboots system
How are you doing?
Can you please post the vgdisplay output of the VGs?
Regards,
Asif Sharif
05-14-2010 03:12 AM
Re: Oracle CSSD failure 134 Oracle reboots system
Yes, I already checked that output; all the alternate paths are correct in vgdisplay. Without Oracle running I have no issue at all during path switching: I created a 50 GB file on the SAN mount point, rcp'd it to another server's SAN mount point, removed the same SAN switch, and nothing happened; the file transferred without any issue. My issue is that when OCR is running on the shared logical volumes and an active path is lost, CRS senses a failure and reboots both machines for cluster integrity.
I don't have the vgdisplay output right now.
Regards
Masood Asif
05-14-2010 04:06 AM
Re: Oracle CSSD failure 134 Oracle reboots system
>> /dev/vghp1 - LV 1
Not accessible because of some unavailable PVs.
Check all the PVs in this VG. Do you have multipathing enabled for it?
Good luck
Prasanth
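A quick way to act on this advice (a sketch assuming classic LVM pv-links, which the "Performed a switch" syslog messages suggest, rather than a separate multipathing product):

```
# vgdisplay -v /dev/vghp1 | grep -E "PV Name|PV Status"   # every PV path and whether it is available
# ioscan -funC disk                                       # confirm both FC paths to each LUN are CLAIMED
```

Any path reported unavailable here would explain the "I/O requests ... waiting indefinitely for an unavailable PV" warning on LV 1.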
05-15-2010 11:08 PM
05-15-2010 11:46 PM
Re: Oracle CSSD failure 134 Oracle reboots system
WARNING: Cluster lock disk /dev/dsk/c12t1d7 has failed: I/O error. Until it is fixed, a single failure could cause all nodes in the cluster to crash.
May 13 22:42:27 KSEHPDR2 cmcld[2484]: Cluster lock disk /dev/vglock:/dev/dsk/c12t1d7 is bad
You may need to install AutoPath and configure the cluster lock disk on an AutoPath virtual disk path, so that this kind of failover is transparent to CRS.
Aneesh
05-16-2010 02:39 AM
Re: Oracle CSSD failure 134 Oracle reboots system
Could you please tell me which AutoPath you are talking about? We are using PV links. Or do you mean SecurePath for HP-UX?
Regards
Masood