
SCSI errors in syslog with 2 machines connected to Model 12H

 
Stephen Andreassend
Regular Advisor


 
Clemens van Everdingen
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Hi,

These are SCSI errors relating to the disks or the controller.
I would check with mstm/xstm to see whether there are large numbers of read/write errors.

If so, I would log a hardware call with HP to have the defective disk(s) or controller replaced.

C.
The computer is a great invention, there are as many mistakes as ever, but they are nobody's fault !
Sandip Ghosh
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

I think the SCSI I/O timeout is still set to the default, which is 30 seconds. Try increasing it to 60 seconds. For normal disks you can do this with:

pvchange -t 60

For a LUN on the array, though, you can try setting it through the control panel of the 12H.

Sandip
Good Luck!!!
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Even though both machines are running the same operations concurrently (copying large files), only Node 2 has these errors.

Some more errors:

Apr 16 18:48:20 k570b vmunix: SCSI: Resetting SCSI -- lbolt: 10045700, bus: 1
Apr 16 18:48:20 k570b vmunix:
Apr 16 18:48:20 k570b vmunix: SCSI: Reset detected -- lbolt: 10045700, bus: 1
Apr 16 18:48:20 k570b vmunix: LVM: Path (device 0x1f011000) to PV 0 in VG 1 Failed!
Apr 16 18:48:20 k570b vmunix: LVM: VG 1 : PV 0 (device 0x1f011000) is POWERFAILED
Apr 16 18:48:20 k570b vmunix: LVM: Path (device 0x1f011200) to PV 0 in VG 3 Failed!
Apr 16 18:48:20 k570b vmunix: LVM: Path (device 0x1f011400) to PV 0 in VG 4 Failed!
Apr 16 18:48:20 k570b vmunix: LVM: Recovered Path (device 0x1f011200) to PV 0 in VG 3.
Apr 16 18:48:20 k570b vmunix: LVM: Recovered Path (device 0x1f011400) to PV 0 in VG 4.
Apr 16 18:48:20 k570b vmunix: LVM: Recovered Path (device 0x1f011000) to PV 0 in VG 1.
Apr 16 18:48:20 k570b vmunix: LVM: Restored PV 0 to VG 1.

Apr 16 18:48:24 k570b vmunix: SCSI: Resetting SCSI -- lbolt: 10054100, bus: 1
Apr 16 18:48:24 k570b vmunix:
Apr 16 18:48:24 k570b vmunix: SCSI: Reset detected -- lbolt: 10054100, bus: 1
Apr 16 18:48:24 k570b vmunix: LVM: Path (device 0x1f011200) to PV 0 in VG 3 Failed!
Apr 16 18:48:24 k570b vmunix: lv_readvgdats: Could not read VGSA 2 header & trailer from disk H/W path 10/8.1.0 (error = 5)
Apr 16 18:48:24 k570b vmunix: LVM: Failed to restore PV 0 to VG 1!
Apr 16 18:48:24 k570b vmunix: LVM: Path (device 0x1f011400) to PV 0 in VG 4 Failed!
Apr 16 18:48:24 k570b vmunix: LVM: Path (device 0x1f011000) to PV 0 in VG 1 Failed!
Apr 16 18:48:24 k570b vmunix: LVM: Recovered Path (device 0x1f011400) to PV 0 in VG 4.
Apr 16 18:48:24 k570b vmunix: LVM: Recovered Path (device 0x1f011000) to PV 0 in VG 1.
Apr 16 18:48:24 k570b vmunix: LVM: Recovered Path (device 0x1f011200) to PV 0 in VG 3.
Apr 16 18:48:24 k570b vmunix: LVM: Restored PV 0 to VG 1.
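
The device numbers in the LVM messages above can be turned back into device file names. The following is a minimal sketch, assuming the usual HP-UX legacy layout for disk device numbers (8-bit major, then an 8-bit card instance, 4-bit target, 4-bit LUN and 8 option bits in the minor number). The function name decode_dev is made up for illustration. The result for 0x1f011000 (target 1, LUN 0) agrees with the H/W path 10/8.1.0 reported in the log.

```shell
# Decode a disk device number, as printed in the LVM messages, into its
# major number and a cXtYdZ name.
# Assumption: the minor number is laid out as 0xIITLxx
# (II = card instance, T = SCSI target, L = LUN, xx = option bits).
decode_dev() {
    dev=$(( $1 ))                        # accepts hex such as 0x1f011000
    major=$(( (dev >> 24) & 0xff ))
    card=$(( (dev >> 16) & 0xff ))
    target=$(( (dev >> 12) & 0xf ))
    lun=$(( (dev >> 8) & 0xf ))
    echo "major=${major} c${card}t${target}d${lun}"
}

# The three devices named in the log excerpts:
for d in 0x1f011000 0x1f011200 0x1f011400; do
    decode_dev "$d"
done
```

Under that assumption all three failing paths are LUNs 0, 2 and 4 behind target 1 on card instance 1, that is, the same array controller on the same bus.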


Clemens van Everdingen
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Hi,

It might be that the controller on Node 2 is resetting or failing.

Check the controller with mstm.

C.
The computer is a great invention, there are as many mistakes as ever, but they are nobody's fault !
S.K. Chan
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

The problem seems to be on bus 1 (all the c1 devices). Try running diskinfo on all the disks on that bus:
# /etc/diskinfo /dev/rdsk/c1tXdX
Do they respond?

One more thing that stands out is the "Zalon Fatal error". Are you running 10.20? If you are, patch PHKL_16751 fixes a potential Zalon chip bug that resets the SCSI bus for no reason.
Sandip Ghosh
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Whenever a SCSI controller gets a timeout, it assumes there is a problem and initiates a SCSI reset. In your case, Node 1 took control of the SCSI bus during the transfer of the large Oracle file, and the second controller timed out because both are on the same SCSI chain. That is why the SCSI controller on Node 2 initiated a SCSI reset and you got all those error messages in syslog.

Sandip
Good Luck!!!
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

One more error on Node 2:

Apr 16 18:56:22 k570b vmunix: SCSI: Resetting SCSI -- lbolt: 10103600, bus: 1
Apr 16 18:56:22 k570b vmunix:
Apr 16 18:56:22 k570b vmunix: SCSI: Reset detected -- lbolt: 10103600, bus: 1
Apr 16 18:56:22 k570b vmunix: LVM: VG 1 : PV 0 (device 0x1f011000) is POWERFAILED
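
To compare the two nodes between cabling changes, it can help to count the events rather than read the log by eye. This is a rough sketch that relies only on the message wording shown in the excerpts in this thread. The function name summarize_scsi_errors is made up for illustration, and the syslog path in the usage note is the usual HP-UX location, so adjust it if yours differs.

```shell
# Count SCSI bus resets per bus and LVM path failures per device
# from syslog text on standard input.
summarize_scsi_errors() {
    awk '
        /SCSI: Resetting SCSI/ {
            resets[$NF]++                # last field is the bus number
        }
        /LVM: Path \(device/ && /Failed!/ {
            dev = $0
            sub(/.*\(device /, "", dev)  # strip everything up to the number
            sub(/\).*/, "", dev)         # strip everything after it
            failures[dev]++
        }
        END {
            for (b in resets)   print "bus " b ": " resets[b] " reset(s)"
            for (d in failures) print d ": " failures[d] " path failure(s)"
        }
    ' | LC_ALL=C sort
}
```

Typical use would be something like `summarize_scsi_errors < /var/adm/syslog/syslog.log` on each node after a test run.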

I have just checked the cabling etc:
Node 1 is on Controller X.
Node 2 (errors) is on Controller Y.
Controller X is primary.

I will switch Node 2 to Controller X and see what happens.
A. Clay Stephenson
Acclaimed Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

This is fairly typical on arrays. Almost certainly, you need to increase the I/O timeout on ALL LUNs using the pvchange command. The timeout should probably be set to between 120 and 180 seconds. Man pvchange for details.
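
As a sketch of what "all LUNs" means in practice, the helper below only prints the pvchange commands so they can be reviewed before anything is run. The function name print_pvchange_cmds is made up, and the device files in the usage note are an assumption inferred from the device numbers in the syslog excerpts, so check them against your own vgdisplay -v output first.

```shell
# Print (do not run) one "pvchange -t <timeout> <pv_path>" command per
# physical volume, so the list can be checked before it is executed.
# Usage: print_pvchange_cmds TIMEOUT_SECONDS PV_PATH...
print_pvchange_cmds() {
    timeout=$1
    shift
    for pv in "$@"; do
        echo "pvchange -t ${timeout} ${pv}"
    done
}
```

For example, `print_pvchange_cmds 120 /dev/dsk/c1t1d0 /dev/dsk/c1t1d2 /dev/dsk/c1t1d4` prints three commands, and piping that output to sh would run them once they look right.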

By the way, if cabled/terminated properly you could actually use both external busses on both hosts without conflict and this would give you an alternate SCSI path in case of failure on either host. Here's how:

HostA Controller 1 (SCSI ID 7) ----- 12H X Controller (SCSI ID 1) ---- HostB Controller 1 (SCSI ID 6)


HostA Controller 2 (SCSI ID 7) ----- 12H Y Controller (SCSI ID 2) ---- HostB Controller 2 (SCSI ID 6)

This way, each LUN could have both a primary and alternate path.

This works quite well.




If it ain't broke, I can fix that.
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

OK reconfigured system:
Node 1 on Controller X.
Node 2 on Controller X.
Controller X set as primary.

Errors are now occurring on both Node 1 and Node 2, even while idle.

Previously, errors were only occurring on Node 2, which was connected to Controller Y with Controller X as primary.

PHKL_16751 for SCSI resets on HPUX 10.20 is already installed via the latest Support bundle.
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Another reconfiguration:
Node 1 on Controller X.
Node 2 on Controller Y, but on a different port.
Controller X primary.

Node 1 is fine again, no errors.
Node 2 is spewing out the same errors even when idle.

I suspect that either the Node 2 SCSI adapter, cable, or terminator has a problem. Hopefully it's something obvious like bent pins.
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

No bent pins anywhere, problem still persists.
U.SivaKumar_2
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Hi,
Why don't you try changing the SCSI terminator and cable?

regards,
U.SivaKumar
Innovations are made when conventions are broken
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Using pvchange -t 120 to change the timeout has no effect either.
Stephen Andreassend
Regular Advisor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

OK I have made progress on this one.

I just finished some more stress tests after completely swapping the SCSI cables between machines.

The errors stop when the short SCSI cable is connected to the external K570 (a bit of a stretch!), and the long SCSI cable is connected to the internal K570.

The cables are the same model number, as are all the components in the 2 machines.

I guess the problem is solved but we would prefer the long SCSI cable on the external K570 as it would be safer in case someone moves the machines around.

Any explanations of why this now works?

Steve
U.SivaKumar_2
Honored Contributor

Re: SCSI errors in syslog with 2 machines connected to Model 12H

Hi,
Now I suspect only the SCSI terminators. Cable length is limited in order to avoid signal reflection and attenuation in the cable. The terminators set the proper impedance on the data bus to reduce the effect of signal reflections. So if a terminator malfunctions, it is natural for a device connected by a long SCSI cable to report errors associated with signal corruption.

regards,
U.SivaKumar
Innovations are made when conventions are broken