David G. Douthitt
Regular Advisor

Problem with shared DS2300 and rp7420 (2x)

The configuration: two rp7420s share one HP DS2300 with two drives, and both disks are visible to both systems. There have been no problems in the past. Neither system is currently using these disks (they were used at one time; now SAN is used instead), but they have been used without errors recently (within the last few months).

Both systems are now seeing SCSI reset errors at about 4:01 am, and this has happened for the last two days.

The rp7420 cells reporting errors are both HP-UX 11i v3 (different upgrades). One complex also contains a second cell split into a separate nPar; that nPar is running HP-UX 11i v2 and is not using/connected to the shared SCSI DS2300. (Verified no connection to DS2300).

The HP-UX 11i v3 hosts are in a Serviceguard cluster; one node was installed with A.11.18 and the other was upgraded from A.11.17 to A.11.18. The DS2300 has never been set up with clustered disks; they were in "use disks here; just make sure you don't use them *there*" mode :)

The errors being seen (filtered on C8xx) on hostN are:

Nov 25 04:00:52 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:00:52 vmunix: C8xx: -- lbolt: 52804655, bus: 2
Nov 25 04:00:55 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:00:55 vmunix: C8xx: -- lbolt: 52804955, bus: 2
Nov 25 04:00:55 vmunix: C8xx: Ultra160 Controller at 0/0/8/1/0/1/0: Error: The domain validation test for target 15 determined that communication may not be possible to this target. Verify the hardware configuration.
Nov 25 04:01:18 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:01:18 vmunix: C8xx: -- lbolt: 52807252, bus: 2
Nov 25 04:01:21 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:01:21 vmunix: C8xx: -- lbolt: 52807552, bus: 2
Nov 25 04:01:21 vmunix: C8xx: Ultra160 Controller at 0/0/8/1/0/1/0: Error: The domain validation test for target 15 determined that communication may not be possible to this target. Verify the hardware configuration.
Nov 25 04:01:45 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:01:45 vmunix: C8xx: -- lbolt: 52809878, bus: 2
Nov 25 04:01:48 vmunix: C8xx: Reset detected -- path: 0/0/8/1/0/1/0
Nov 25 04:01:48 vmunix: C8xx: -- lbolt: 52810178, bus: 2
Nov 25 04:01:48 vmunix: C8xx: Ultra160 Controller at 0/0/8/1/0/1/0: Error: The domain validation test for target 15 determined that communication may not be possible to this target. Verify the hardware configuration.

On hostS we see these errors (filtered on C8xx):

Nov 25 04:00:51 vmunix: C8xx: isrEscape Controller at 0/0/8/1/0/1/0.
Nov 25 04:00:51 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355685841, dev: ffffffff
Nov 25 04:00:52 vmunix: C8xx: Resetting SCSI -- lbolt: 355685941, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:00:52 vmunix: C8xx: Reset detected -- lbolt: 355685941, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:00:55 vmunix: C8xx: Reset detected -- lbolt: 355686241, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:17 vmunix: C8xx: isrEscape Controller at 0/0/8/1/0/1/0.
Nov 25 04:01:17 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355688438, dev: ffffffff
Nov 25 04:01:18 vmunix: C8xx: Resetting SCSI -- lbolt: 355688538, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:18 vmunix: C8xx: Reset detected -- lbolt: 355688538, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:18 vmunix: C8xx: Ultra160 Controller at 0/0/8/1/0/1/0: Error: The domain validation test for target 1 determined that communication may not be possible to this target. Verify the hardware configuration.
Nov 25 04:01:21 vmunix: C8xx: Reset detected -- lbolt: 355688838, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:44 vmunix: C8xx: isrEscape Controller at 0/0/8/1/0/1/0.
Nov 25 04:01:44 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355691064, dev: ffffffff
Nov 25 04:01:45 vmunix: C8xx: Resetting SCSI -- lbolt: 355691164, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:45 vmunix: C8xx: Reset detected -- lbolt: 355691164, bus: 2 path: 0/0/8/1/0/1/0
Nov 25 04:01:48 vmunix: C8xx: Reset detected -- lbolt: 355691464, bus: 2 path: 0/0/8/1/0/1/0
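
A note on reading the lbolt values in these messages: they behave like a kernel tick counter. In the hostN log, two resets stamped 04:00:52 and 04:00:55 have lbolt values exactly 300 apart, which suggests 100 ticks per second. The sketch below assumes that rate (it is inferred from the timestamps here, not taken from HP documentation) and uses the hostS bus-hang lines to measure how regularly the hangs repeat. The function name is made up for illustration.

```python
import re

# Bus-hang lines copied from the hostS log above.
LOG = """\
Nov 25 04:00:51 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355685841, dev: ffffffff
Nov 25 04:01:17 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355688438, dev: ffffffff
Nov 25 04:01:44 vmunix: C8xx: First party detected bus hang (HTH) -- lbolt: 355691064, dev: ffffffff
"""

# Assumption: inferred from the log timestamps, not from documentation.
TICKS_PER_SECOND = 100


def hang_intervals(log_text):
    """Return the seconds between consecutive bus-hang messages, using lbolt."""
    ticks = [
        int(m.group(1))
        for m in re.finditer(r"bus hang \(HTH\) -- lbolt: (\d+)", log_text)
    ]
    return [(b - a) / TICKS_PER_SECOND for a, b in zip(ticks, ticks[1:])]


print(hang_intervals(LOG))
```

The hangs come out roughly 26 seconds apart, a regular spacing that may point to something retrying or polling on a timer rather than random bus noise.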

Using cstm shows no errors in the disk logs (info log). The disks are listed as expected (CLAIMED, DEVICE, et al) in the ioscan output.

Here are my specific questions:

* Is the DS2300 itself going bad?
* Is a controller going bad?
* Disk can't be going bad....... can it?
* I'm implementing another DS2300 on a rx5670; should I worry?
* Can I power off the DS2300 without problems?
* What is the *real* error? Bus Hang? isrEscape? Domain validation error?
* Why didn't EMS notice this (not in logs anywhere, and no email received)?
* How can I make sure EMS is properly configured to be sending emails to me?
* Can I test things nondestructively with cstm?

Keep in mind that while both rp7420s (all cells) are production systems, the DS2300 is unused by all.

ioscan -funC disk output from hostN (filtered for disks at 0/0/8):

Class     I  H/W Path           Driver  S/W State  H/W Type  Description
=======================================================================
disk     31  0/0/8/1/0/1/0.0.0  sdisk   CLAIMED    DEVICE    HP 36.4GST336753LC
                                /dev/dsk/c2t0d0   /dev/rdsk/c2t0d0
disk     30  0/0/8/1/0/1/0.1.0  sdisk   CLAIMED    DEVICE    HP 36.4GST336753LC
                                /dev/dsk/c2t1d0   /dev/rdsk/c2t1d0

Same command, same filter, hostS:

# ioscan -funC disk
Class     I  H/W Path           Driver  S/W State  H/W Type  Description
=======================================================================
disk     71  0/0/8/1/0/1/0.0.0  sdisk   CLAIMED    DEVICE    HP 36.4GST336753LC
                                /dev/dsk/c2t0d0   /dev/rdsk/c2t0d0
disk     70  0/0/8/1/0/1/0.1.0  sdisk   CLAIMED    DEVICE    HP 36.4GST336753LC
                                /dev/dsk/c2t1d0   /dev/rdsk/c2t1d0
Torsten.
Acclaimed Contributor

Re: Problem with shared DS2300 and rp7420 (2x)

The message

Error: The domain validation test for target 15 determined that communication may not be possible to this target. Verify the hardware configuration.


points to the JBOD controller (BCC).

The problem is that if one device blocks the bus, you may receive messages about every other device on that bus.

What is STM telling you about the status of the BCCs and the disks?

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
David G. Douthitt
Regular Advisor

Re: Problem with shared DS2300 and rp7420 (2x)

cstm reports no errors.... except (uhoh!) 8 non-medium errors on 0/0/8/1/0/1/0.1.0.

Another interesting thing - not good as I see it... if I read this right, target 15 would be the same as device 0/0/8/1/0/1/0.15.0 (Disk Enclosure - HPA6491A). Isn't that the DS2300 itself?

The other system reported one "domain validation" error against target 1 - wouldn't that be the disk with non-medium errors?
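
One way to read these hardware paths, consistent with the ioscan output earlier in the thread: the last two dotted fields are the SCSI target and LUN, and everything before them is the controller path. On that reading, target 15 on controller 0/0/8/1/0/1/0 is 0/0/8/1/0/1/0.15.0 and target 1 is 0/0/8/1/0/1/0.1.0. A small sketch of that mapping (the function name is hypothetical, and it assumes the simple `controller.target.lun` form seen here):

```python
def split_scsi_path(hw_path):
    """Split 'controller.target.lun' into (controller_path, target, lun)."""
    controller, target, lun = hw_path.rsplit(".", 2)
    return controller, int(target), int(lun)


for path in ("0/0/8/1/0/1/0.0.0", "0/0/8/1/0/1/0.1.0", "0/0/8/1/0/1/0.15.0"):
    controller, target, lun = split_scsi_path(path)
    print(f"{path}: controller {controller}, target {target}, LUN {lun}")
```

This matches the device files too: /dev/dsk/c2t1d0 is target 1 on the same controller.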

I tried getting information from 0/0/8/1/0/1/0 (PCI SCSI Interface) and from 0/0/8/1/0/1/0.15.0 (Disk Enclosure) and from 0/0/8/1/0 (PCI Bus Adapter); none had any information available.

Is it time to retire this box of disks?
Torsten.
Acclaimed Contributor

Re: Problem with shared DS2300 and rp7420 (2x)

Please check the JBOD controllers too. Example output from STM:

Hardware path: 0/10/0/0.15.0


Product ID: A6491A

Controller A
------------
Hardware Path: 0/10/0/0.15.0
Serial No: S____________
Firmware Rev.: HP16

Annotation:


Enclosure Status
----------------
Bus Mode: Full

Disk Modules
------------

---------------------------------------------------------
SLOT | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |
BUS ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 |10 |11 |12 |13 |14 |
STATUS |OK |OK |OK |OK |OK |OK |NIN|OK |OK |OK |NIN|NIN|NIN|NIN|
---------------------------------------------------------

NIN - There is no disk device installed in this slot.


Controllers Status
----------- -------------
A: OK (Reporting Controller)
B: OK

Fan Modules Status
----------- -------------
A: OK
B: OK

Power Supplies Status
-------------- -------------
A: OK
B: OK

SCSI Port
Transceivers Status Mode
---------------- ------------- ----
Controller A: OK LVD
Controller B: OK No cable connected
Controller C: OK No cable connected
Controller D: OK LVD

Voltage Sensors Voltage Status
--------------- ------- -------------
Controller A
------------
3.3v: 3.36 OK
5.0v: 5.08 OK
12v: 12.24 OK

Controller B
------------
3.3v: 3.36 OK
5.0v: 5.12 OK
12v: 12.24 OK

Temp Sensors Temperature Status
------------ ----------- -------------
Sensor 1: 18 (Celsius) OK
Sensor 2: 21 (Celsius) OK
-- Information Tool Log for Disk Enclosure on path 0/10/0/0.15.0 --

Hope this helps!
Regards
Torsten.

David G. Douthitt
Regular Advisor

Re: Problem with shared DS2300 and rp7420 (2x)

I had checked the DS2300 Disk Enclosure (as you requested) from hostN. No info was available. After you asked for it (with output no less!) I tried from the other host.... and voila! There it was.

Then I tried hostN again - and it gave a little more (title only) - then it gave a full output (!).

Snippet from full output (attached):

Controllers Status
----------- -------------
A: OK (Reporting Controller)
B: NOT INSTALLED

SCSI Port
Transceivers Status Mode
---------------- ------------- ----
Controller A: OK LVD
Controller B: OK No cable connected
Controller C: NOT INSTALLED
Controller D: NOT INSTALLED

From the looks of the output, controller B is bad?

The cable connected to port B on the back goes to hostN; the cable connected to port A goes to hostS.
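
With status blocks this long, it is easy to miss the one line that differs. A minimal sketch, assuming the plain `name: status` layout shown in the snippet above, that pulls out entries that are not plainly OK or that report no cable. The function name and the choice of what counts as suspicious are illustrative and not part of STM.

```python
import re

# Status lines copied from the snippet above.
SNIPPET = """\
Controllers Status
----------- -------------
A: OK (Reporting Controller)
B: NOT INSTALLED

SCSI Port
Transceivers Status Mode
---------------- ------------- ----
Controller A: OK LVD
Controller B: OK No cable connected
Controller C: NOT INSTALLED
Controller D: NOT INSTALLED
"""


def suspicious_lines(text):
    """Return 'name: status' lines that are not plainly OK or that report no cable."""
    flagged = []
    for line in text.splitlines():
        match = re.match(r"^\s*([\w ]+?):\s+(.*)$", line)
        if not match:
            continue  # headers, rulers, and blank lines carry no status
        status = match.group(2).strip()
        if not status.startswith("OK") or "No cable connected" in status:
            flagged.append(line.strip())
    return flagged


for entry in suspicious_lines(SNIPPET):
    print(entry)
```

Running the same filter over the output from each host makes it easier to compare what the two sides report.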
Torsten.
Acclaimed Contributor

Re: Problem with shared DS2300 and rp7420 (2x)

If you have a controller B, it looks dead.

Hope this helps!
Regards
Torsten.

David G. Douthitt
Regular Advisor

Re: Problem with shared DS2300 and rp7420 (2x)

It sure looks like controller B is bad - but things are still confusing. Both hosts can read from the disks in the array - and both can report the full status (as given) of the disk array itself.

Both report the controller B as "not installed" or "no cable connected".

How can hostN talk to the disk array if the only controller it knows about is bad?

I'm confused...
David G. Douthitt
Regular Advisor

Re: Problem with shared DS2300 and rp7420 (2x)

Assuming controller B is bad (sure seems like it) then can I just power off the box, pull the cable, and power back up?

The disks aren't being used. Will the SCSI bus handle it popping off like that?

Also, why did EMS not report on this error?
David G. Douthitt
Regular Advisor

Re: Problem with shared DS2300 and rp7420 (2x)

Can anyone shed any light on my recent questions? Thanks...