
Oracle CSSD failure 134 Oracle reboots system

 
masoodasif
Advisor


Hi Gurus

We are facing an issue with Oracle RAC on HP-UX. Our environment is two rx3600 servers with an EVA 4400, running HP-UX 11.23, and we are using SGeRAC for Oracle RAC. We created shared logical volumes for Oracle RAC; the information for the VGs is attached.
The issue is this: when we removed one SAN switch for testing, Oracle CRS failed and TOC'd the servers. When we removed the second switch, Oracle didn't crash, but we found "The PV is still accessible" messages in the syslog. Whenever the active PV path for a VG becomes unavailable and LVM switches from one PV path to another, CRS fails and reboots the systems. The error logs follow:






May 13 22:41:55 KSEHPDR2 vmunix: fclp driver at 0/3/0/0/0/0 (/dev/fclp0) : detected that device id 0x10000, PWWN 0x5001438004c70f4d is offline.
May 13 22:41:55 KSEHPDR2 vmunix: fclp driver at 0/3/0/0/0/0 (/dev/fclp0) : detected that device id 0x10100, PWWN 0x5001438004c70f49 is offline.
May 13 22:42:27 KSEHPDR2 cmdisklockd[2487]: WARNING: Cluster lock disk /dev/dsk/c12t1d7 has failed: I/O error. Until it is fixed, a single failure could cause all nodes in the cluster to crash.
May 13 22:42:27 KSEHPDR2 cmcld[2484]: Cluster lock disk /dev/vglock:/dev/dsk/c12t1d7 is bad
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3ae8000), from raw device 0x1f0c1200 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1200 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3aec000), from raw device 0x1f0c1300 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1300 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af0000), from raw device 0x1f0c1400 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e1400 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af4000), from raw device 0x1f0c1500 (with priority: 0, and current flags: 0x40) to raw device 0x1f081500 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3af8000), from raw device 0x1f0c1600 (with priority: 0, and current flags: 0x40) to raw device 0x1f081600 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b380c000), from raw device 0x1f0c0200 (with priority: 0, and current flags: 0x40) to raw device 0x1f080200 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a78000), from raw device 0x1f0c0300 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e0300 (with priority: 1, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a7c000), from raw device 0x1f0c0400 (with priority: 0, and current flags: 0x40) to raw device 0x1f080400 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a80000), from raw device 0x1f0c0500 (with priority: 0, and current flags: 0x40) to raw device 0x1f080500 (with priority: 2, and current flags: 0x0).
May 13 22:42:29 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3a88000), from raw device 0x1f0c2100 (with priority: 0, and current flags: 0x40) to raw device 0x1f082100 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3afc000), from raw device 0x1f0c0600 (with priority: 0, and current flags: 0x40) to raw device 0x1f080600 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b66000), from raw device 0x1f0c0700 (with priority: 0, and current flags: 0x40) to raw device 0x1f080700 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b6a000), from raw device 0x1f0c1000 (with priority: 0, and current flags: 0x40) to raw device 0x1f081000 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0xe0000001b3b6e000), from raw device 0x1f0c1100 (with priority: 0, and current flags: 0x40) to raw device 0x1f081100 (with priority: 2, and current flags: 0x0).
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c0600 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e0600 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c0700 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e0700 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c1000 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e1000 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c1100 Failed! The PV is still accessible.
May 13 22:42:37 KSEHPDR2 vmunix: LVM: VG 64 0x070000: PVLink 31 0x0e1100 Failed! The PV is still accessible.
May 13 22:42:41 KSEHPDR2 vmunix: LVM: WARNING: VG 64 0x030000: LV 1: Some I/O requests to this LV are waiting
May 13 22:42:41 KSEHPDR2 vmunix: indefinitely for an unavailable PV. These requests will be queued until
May 13 22:42:41 KSEHPDR2 vmunix: the PV becomes available (or a timeout is specified for the LV).
May 13 22:42:47 KSEHPDR2 syslog: Oracle CSSD failure 134.
May 13 22:42:47 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:48 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Sending SIGKILL to process /u01/crs/oracle/product/10/app/bin/ocssd.bin (pid: 3062) after communication problem detected.
May 13 22:42:47 KSEHPDR2 syslog: Oracle CSSD failure 134.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Sending SIGKILL to process /u01/crs/oracle/product/10/app/bin/ocssd.bin (pid: 3062) after communication problem detected.
May 13 22:42:48 KSEHPDR2 syslogd: restart
May 13 22:42:47 KSEHPDR2 cmclconfd[2510]: Updated file /etc/cmcluster/cmclconfig.tmp for node KSEHPDR2 (length = 67616).
May 13 22:42:47 KSEHPDR2 cmclconfd[2510]: Updated file /etc/cmcluster/cmclconfig.tmp for node KSEHPDR2 (length = 0).
May 13 22:42:47 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:47 KSEHPDR2 cmgmsd[2507]: Request for primary member 2 on node 2 to leave group crsdrt_NG
May 13 22:42:47 KSEHPDR2 cmfileassistd[2493]: Updated file /etc/cmcluster/cmclconfig (length = 33036).
May 13 22:42:47 KSEHPDR2 cmcld[2484]: Sending file $SGCONF/cmclconfig (33036 bytes) to file assistant daemon.
May 13 22:42:48 KSEHPDR2 su: + tty?? root-oracle
May 13 22:42:48 KSEHPDR2 syslog: Oracle clsomon failed with fatal status 12.
May 13 22:42:48 KSEHPDR2 syslog: Oracle CRS failure. Rebooting for cluster integrity.
May 13 22:42:48 KSEHPDR2 syslogd: restart
May 13 22:42:48 KSEHPDR2 su: + tty?? root-oracle
Prasanth V Aravind
Trusted Contributor

Re: Oracle CSSD failure 134 Oracle reboots system


Please provide the output of:

vgdisplay -v /dev/vglock

ls -l /dev/*/group | grep "0x070000"


Good luck
Prasanth
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

# ll /dev/*/group
crw-r----- 1 root sys 64 0x000000 Nov 24 16:38 /dev/vg00/group
crw-rw-rw- 1 root sys 64 0x050000 Mar 9 11:06 /dev/vgasm1/group
crw-rw-rw- 1 root sys 64 0x070000 Mar 8 14:50 /dev/vgbackup/group
crw-rw-rw- 1 root sys 64 0x030000 Mar 3 12:45 /dev/vghp1/group
crw-rw-rw- 1 root sys 64 0x020000 Mar 8 12:24 /dev/vglock/group
crw-rw-rw- 1 root sys 64 0x060000 Mar 8 14:50 /dev/vgvote/group
#
#
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

vgbackup is not being used by anything: there is no filesystem on it and Oracle does not use it. It was created for future extension but is not in use. Serviceguard is still activating this VG in shared mode, though.

Regards
MASOOD
Asif Sharif
Honored Contributor

Re: Oracle CSSD failure 134 Oracle reboots system

Salam Masood,

How are you doing?

Can you please post the vgdisplay output for the VGs?


Regards,
Asif Sharif
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

w salam Asif Bhai,

Yes, I already checked that output: all the alternate paths are correct in vgdisplay, and without Oracle running I have no issue at all during path switching. I created a 50 GB file on a SAN mount point, started an rcp of it to another server's SAN mount point, and removed the same SAN switch; nothing happened and the file transferred without any issue. My issue is that when the OCR is running on shared logical volumes and one active path is lost, CRS senses a failure and reboots both machines for cluster integrity.
I don't have the vgdisplay output right now.

Regards
Masood Asif
Prasanth V Aravind
Trusted Contributor

Re: Oracle CSSD failure 134 Oracle reboots system


>> /dev/vghp1 - LV 1

Not accessible because of some unavailable PVs.

Check all the PVs in this VG. Do you have multipathing enabled for it?

Good luck
Prasanth
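
The lookup above (matching the "VG 64 0x......" minor number from a syslog line against the `/dev/*/group` listing) can be sketched in a few lines. This is only an illustrative helper, not an HP-UX tool: the function names are made up, and it assumes the listing format shown earlier in this thread.

```python
import re

# Group-file listing as posted earlier in this thread (ll /dev/*/group).
GROUP_LISTING = """\
crw-r----- 1 root sys 64 0x000000 Nov 24 16:38 /dev/vg00/group
crw-rw-rw- 1 root sys 64 0x050000 Mar 9 11:06 /dev/vgasm1/group
crw-rw-rw- 1 root sys 64 0x070000 Mar 8 14:50 /dev/vgbackup/group
crw-rw-rw- 1 root sys 64 0x030000 Mar 3 12:45 /dev/vghp1/group
crw-rw-rw- 1 root sys 64 0x020000 Mar 8 12:24 /dev/vglock/group
crw-rw-rw- 1 root sys 64 0x060000 Mar 8 14:50 /dev/vgvote/group
"""


def minor_to_vg(listing):
    """Build {minor number: VG name} from an 'll /dev/*/group' listing."""
    mapping = {}
    for line in listing.splitlines():
        # Major number 64, then the 6-digit hex minor, then the group path.
        m = re.search(r"\b64\s+(0x[0-9a-fA-F]{6})\b.*?/dev/([^/\s]+)/group", line)
        if m:
            mapping[m.group(1).lower()] = m.group(2)
    return mapping


def vg_from_log(log_line, mapping):
    """Return the VG name for an 'LVM: ... VG 64 0x......' syslog line, or None."""
    m = re.search(r"VG 64 (0x[0-9a-fA-F]{6})", log_line)
    return mapping.get(m.group(1).lower()) if m else None


mapping = minor_to_vg(GROUP_LISTING)
print(vg_from_log("vmunix: LVM: WARNING: VG 64 0x030000: LV 1: Some I/O requests", mapping))
print(vg_from_log("vmunix: LVM: VG 64 0x070000: PVLink 31 0x0c0600 Failed!", mapping))
```

With the listing from this thread, the "waiting indefinitely" warning resolves to vghp1 and the PVLink failures to vgbackup.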
AQadir_1
Occasional Contributor
Solution

Re: Oracle CSSD failure 134 Oracle reboots system

You should open an SR with Oracle. This seems to be a CRS-related issue.
Aneesh Mohan
Honored Contributor

Re: Oracle CSSD failure 134 Oracle reboots system


WARNING: Cluster lock disk /dev/dsk/c12t1d7 has failed: I/O error. Until it is fixed, a single failure could cause all nodes in the cluster to crash.
May 13 22:42:27 KSEHPDR2 cmcld[2484]: Cluster lock disk /dev/vglock:/dev/dsk/c12t1d7 is bad


You may need to install Autopath and configure the cluster lock disk using the Autopath virtual disk path, so that a failover like this is transparent to CRS.


Aneesh
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

hi

Could you please tell me which Autopath you are talking about? We are using PV links. Or do you mean Securepath for HP-UX?

Regards
Masood
Aneesh Mohan
Honored Contributor

Re: Oracle CSSD failure 134 Oracle reboots system


Yes, I mean use multipathing software, so that link failover does not affect CRS.


Securepath ---> active/passive disk arrays

Autopath ----> active/active disk arrays

Aneesh
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

hi

Can you please provide me the link for the Autopath software? I believe it is a licensed product, but why not use PV links instead of Autopath? I searched the HP site: it is licensed, but it is also an old product for XP or VA, and I couldn't find anything for the EVA 4400. Could you please explain a little more?

Regards
Masood
Aneesh Mohan
Honored Contributor

Re: Oracle CSSD failure 134 Oracle reboots system

Hi,

The EVA 4400 has active/active controllers. You may need Autopath (Securepath for active/active storage controllers). It is licensed software (HP StorageWorks Auto Path for HP-UX) and is not available on the net; you have to ask the HP software team for it.

PVLINK ---- The failover between storage links may affect the CRS cluster disk, since it is configured with only a single device path (legacy).

Regards,
Aneesh
masoodasif
Advisor

Re: Oracle CSSD failure 134 Oracle reboots system

Hi there,

Sorry for the late reply. We solved the problem by changing the LV timeout value to 120 seconds. We have 4 paths to the VG, and Oracle CTC recommends an LV timeout of (total PV paths * PV timeout); the PV timeout was 30 seconds by default, so 4 * 30 = 120.



thanks
Regards
Masood Asif
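
For anyone landing here later, the arithmetic behind the fix can be sketched as below. This is a rough illustration of the (total PV paths * PV timeout) rule quoted above, not an official tool. The LV path `/dev/vghp1/lvol1` is a hypothetical example, and the `lvchange -t` form is my assumption of how the timeout was applied; check lvchange(1M) on your system before running anything.

```python
def lv_io_timeout(path_count, pv_timeout_s=30):
    """LV timeout in seconds = number of PV paths * per-path PV timeout."""
    if path_count < 1 or pv_timeout_s < 1:
        raise ValueError("path_count and pv_timeout_s must be positive")
    return path_count * pv_timeout_s


def lvchange_command(lv_path, timeout_s):
    """Build (but do not run) the command line that would set the LV timeout."""
    return f"lvchange -t {timeout_s} {lv_path}"


# 4 paths to the VG, 30-second PV timeout, as described in this thread.
timeout = lv_io_timeout(4, 30)

# /dev/vghp1/lvol1 is a hypothetical LV name used only for illustration.
print(lvchange_command("/dev/vghp1/lvol1", timeout))
```

The idea is that the LV timeout should be long enough for LVM to try every path in turn before I/O is failed back to the caller.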