TruCluster

trucluster member booting failure--failed to send read to dga

Frequent Advisor

trucluster member booting failure--failed to send read to dga

After a power outage and recovery, our cluster cannot boot.
When booting from boot disk dga74, at the step "reading 19 blocks from dga74.1001.0.10.0", it displays "failed to send read to dga74.1001.0.10.0" and "failed to read dga74.1001.0.10.0". Then the boot fails.

Sometimes it gets a little further, but I still get "bootstrap failure. device dga74.1001.0.10.0 is no longer valid"

Even after I clear everything in wwidmgr and redo quickset udid to get the boot device, the problem remains the same.


My trucluster is:
2 GS60e + RA8000 SAN,
All RA8000 units are OK, but the ECB expired.


The same boot problem occurs on the other member too.

Do appreciate your help.
21 REPLIES
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Probably some of your RAID devices are in a wrong state; maybe you need to flush the cache data. I don't know what kind of controllers the RA8000 uses, but they are probably HSG.

Then you should run show units and check the status of each one.

See:

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1118582
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

So was the cache battery found bad?
You'll need to ask the HSG directly.
You may need to run 'frutil' and 'clear cli'.
And the dreaded:

CLEAR_ERRORS THIS_CONTROLLER
INVALID_CACHE DESTROY_UNFLUSHED_DATA

similar topic

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=686194

HSG troubleshooting document (on the IBM site! :-)

ftp://ftp.software.ibm.com/storage/mss/specialist/EK-G6OTR-SA-A01.pdf

hth,
Hein
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

I replaced the ECB.

The controllers are now in a good state, as shown below:

HSG80> sho this ful
Controller:
HSG80 ZG93307978 Software V85F-0, Hardware E09
NODE_ID = 5000-1FE1-0001-CF10
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
Configured for dual-redundancy with ZG93308051
In dual-redundant configuration
Device Port SCSI address 7
Time: 17-JUN-2008 15:53:35
Command Console LUN is lun 0 (NOIDENTIFIER)
Host PORT_1:
Reported PORT_ID = 5000-1FE1-0001-CF11
PORT_1_TOPOLOGY = FABRIC (fabric up)
Address = 011000
Host PORT_2:
Reported PORT_ID = 5000-1FE1-0001-CF12
PORT_2_TOPOLOGY = FABRIC (standby)
NOREMOTE_COPY
Cache:
32 megabyte write cache, version 0012
Cache is GOOD
No unflushed data in cache
CACHE_FLUSH_TIMER = 45 (seconds)
Mirrored Cache:
32 megabyte write cache, version 0012
Cache is GOOD
No unflushed data in cache
Battery:
NOUPS
MORE THAN 50% CHARGED
Expires: 17-JUN-2010
Extended information:
Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
Operation control: 00000000 Security state code: 37783
Configuration backup enabled on 4 devices
HSG80> sho oth ful
Controller:
HSG80 ZG93308051 Software V85F-0, Hardware E09
NODE_ID = 5000-1FE1-0001-CF10
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
Configured for dual-redundancy with ZG93307978
In dual-redundant configuration
Device Port SCSI address 6
Time: 17-JUN-2008 15:53:48
Command Console LUN is lun 0 (NOIDENTIFIER)
Host PORT_1:
Reported PORT_ID = 5000-1FE1-0001-CF11
PORT_1_TOPOLOGY = FABRIC (standby)
Host PORT_2:
Reported PORT_ID = 5000-1FE1-0001-CF12
PORT_2_TOPOLOGY = FABRIC (fabric up)
Address = 011100
NOREMOTE_COPY
Cache:
32 megabyte write cache, version 0012
Cache is GOOD
No unflushed data in cache
CACHE_FLUSH_TIMER = 45 (seconds)
Mirrored Cache:
32 megabyte write cache, version 0012
Cache is GOOD
No unflushed data in cache
Battery:
NOUPS
MORE THAN 50% CHARGED
Expires: 17-JUN-2010
Extended information:
Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
Operation control: 00000000 Security state code: 56185
Configuration backup enabled on 4 devices


There was a data unit, D1 (not a system disk), in a failed state:

D1 DISK50300
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0018
IDENTIFIER = 101
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 256
Access:
ALL
State:
UNKNOWN
Misconfigured:
disk DISK50300 at PTL 5 3 0 --
No device installed, please see product documentation
Size: NOT YET KNOWN
Geometry (C/H/S): NOT YET KNOWN

I deleted D1.

But the boot problem still remains.
What should I do next?
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

All right, the controller looks good now.

Now can you tie dga74 back to a unit on the HSG? Use wwidmgr to see the WWID in its full glory. Do a show unit on the HSG to see it there, with the UID and WWID, and of course showing a size and such.

And if you are close to the box, then I would suggest jiggling the wires some. Unplug and replug those fiber strands and such. Admittedly, a power failure should not influence the physical connections, but still... maybe a GBIC got jolted?
Call support?

Hein.
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi,

Do you have dual SAN switches connecting the box to the storage ?

Maybe one of those is offline...

Hope this helps,

Regards,

Rob
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Please send output of:
HSG> show unit full
HSG> show id
HSG> show failed
from HSG console
and
>>> wwidmgr -show wwid -full
from SRM
In vino veritas, in VMS cluster
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

There are many units. I have copied only those related to booting.

D74 and D75 are the member boot disks. D70 is the disk where Tru64 is installed.

Unit:

D70 DISK10000 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0026
IDENTIFIER = 70
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 28089542 blocks
Geometry (C/H/S): ( 5530 / 20 / 254 )
D71 DISK10000 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0027
IDENTIFIER = 71
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 7111272 blocks
Geometry (C/H/S): ( 1400 / 20 / 254 )
D72 DISK10000 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0025
IDENTIFIER = 72
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 355560 blocks
Geometry (C/H/S): ( 70 / 20 / 254 )
D73 S3 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0033
IDENTIFIER = 73
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 35556347 blocks
Geometry (C/H/S): ( 7000 / 20 / 254 )
D74 S3 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0034
IDENTIFIER = 74
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 8533499 blocks
Geometry (C/H/S): ( 1680 / 20 / 254 )
D75 S3 (partition)
LUN ID: 6000-1FE1-0001-CF10-0009-9330-7978-0035
IDENTIFIER = 75
Switches:
RUN NOWRITE_PROTECT READ_CACHE
READAHEAD_CACHE WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 32
Access:
ALL
State:
ONLINE to the other controller
Size: 8533499 blocks
Geometry (C/H/S): ( 1680 / 20 / 254 )

Failedset : none

>>>wwidmgr -show wwid
(only some are copied here)
[0] UDID:74 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0034 (ev:none)
[0] UDID:75 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0035 (ev:none)
[0] UDID:70 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0036 (ev:none)

After doing quickset udid:
                     via adapter        via fc nport           connected
dga74.1001.0.10.0    kgpsa0.0.0.10.0    5000-1FE1-0001-CF11    yes
dga75.1001.0.10.0    kgpsa0.0.0.10.0    5000-1FE1-0001-CF11    yes
dga70.1001.0.10.0    kgpsa0.0.0.10.0    5000-1FE1-0001-CF11    yes
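One way to tie the console devices back to HSG units, as asked earlier in the thread, is to compare the WWID that wwidmgr reports for each UDID with the LUN ID that show unit reports for the unit with the same IDENTIFIER. The following is only a rough sketch in Python, not a tool that exists on the system: the function names are made up for illustration, and the values are copied by hand from the output posted above. On those values it flags UDID 70, where the console shows ...-0036 and the HSG shows ...-0026; that may simply be a typo made when copying the output, so it should be rechecked on the real console before drawing any conclusion.

```python
import re

# LUN IDs from "show unit full" on the HSG80, keyed by IDENTIFIER (hand-copied).
hsg_units = {
    70: "6000-1FE1-0001-CF10-0009-9330-7978-0026",
    74: "6000-1FE1-0001-CF10-0009-9330-7978-0034",
    75: "6000-1FE1-0001-CF10-0009-9330-7978-0035",
}

# Lines from "wwidmgr -show wwid" at the SRM console (hand-copied).
wwidmgr_lines = [
    "[0] UDID:74 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0034 (ev:none)",
    "[0] UDID:75 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0035 (ev:none)",
    "[0] UDID:70 WWID:01000010:6000-1fe1-0001-cf10-0009-9330-7978-0036 (ev:none)",
]

# Captures the UDID and the WWID body that follows the leading type field.
LINE_RE = re.compile(r"UDID:(\d+)\s+WWID:[0-9a-fA-F]+:([0-9a-fA-F-]+)")


def parse_wwidmgr(lines):
    """Return {udid: wwid}, with the WWID upper-cased and the type field dropped."""
    result = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            result[int(match.group(1))] = match.group(2).upper()
    return result


def mismatches(units, console):
    """Return the UDIDs whose console WWID differs from the HSG LUN ID."""
    return sorted(
        udid for udid, wwid in console.items()
        if units.get(udid, "").upper() != wwid
    )


console = parse_wwidmgr(wwidmgr_lines)
print(mismatches(hsg_units, console))
```

UDIDs 74 and 75, the member boot disks, match their units, so on this evidence the boot device mapping itself looks consistent.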
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

When one member is in diag mode, the switch port that its HBA connects to is green.

And the connections are:
HSG80> sho conn
Connection Unit
Name Operating system Controller Port Address Status Offset

!NEWCON25 TRU64_UNIX THIS 1 011500 OL this 0
HOST_ID=2000-0000-C922-2C96 ADAPTER_ID=1000-0000-C922-2C96

!NEWCON26 TRU64_UNIX OTHER 2 011500 OL other 100
HOST_ID=2000-0000-C922-2C96 ADAPTER_ID=1000-0000-C922-2C96

!NEWCON27 TRU64_UNIX THIS 1 011600 offline 0
HOST_ID=2000-0000-C922-22EF ADAPTER_ID=1000-0000-C922-22EF

!NEWCON28 TRU64_UNIX OTHER 2 offline 100
HOST_ID=2000-0000-C922-22EF ADAPTER_ID=1000-0000-C922-22EF

When I run >>>init, the switch port LED turns off, but show connection remains the same.

When I run >>>b and it fails,

the switch port LED is still off and show connection shows all connections offline.

What's wrong?
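To make the connection table easier to read, the rows can be grouped by host adapter. This is just an illustrative Python sketch with the rows typed in by hand from the show connection output above; the function name is made up. It only summarizes what the HSG reported at that moment and does not explain why a path is offline.

```python
from collections import defaultdict

# (connection name, controller, port, adapter ID, status), hand-copied from above.
connections = [
    ("!NEWCON25", "THIS", 1, "1000-0000-C922-2C96", "OL this"),
    ("!NEWCON26", "OTHER", 2, "1000-0000-C922-2C96", "OL other"),
    ("!NEWCON27", "THIS", 1, "1000-0000-C922-22EF", "offline"),
    ("!NEWCON28", "OTHER", 2, "1000-0000-C922-22EF", "offline"),
]


def online_paths_by_adapter(rows):
    """Return {adapter ID: [(controller, port), ...]} listing only online paths."""
    by_adapter = defaultdict(list)
    for _name, controller, port, adapter, status in rows:
        by_adapter[adapter]  # adapters with no online path still get an entry
        if status.startswith("OL"):
            by_adapter[adapter].append((controller, port))
    return dict(by_adapter)


summary = online_paths_by_adapter(connections)
for adapter, paths in summary.items():
    print(adapter, paths if paths else "no online path")
```

In the snapshot above, the adapter ending in 2C96 is online to both controllers, while the adapter ending in 22EF has no online path at all.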
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

You may have lost some FC connections.
Please post output of:
>>> wwidmgr -show wwid -full
In vino veritas, in VMS cluster
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Also please post output of:
>>> show dev
>>> show boot*
In vino veritas, in VMS cluster
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi,

As I mentioned previously, I suspect that one of your SAN switches is offline...

Have you checked them ?

Cheers,
Rob
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

I have only one switch. How do I check it?

Based on my connection state and the switch port LED, can you give any analysis?
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi,

I'm surprised that you only have one switch...

Anyway, telnet to the switch and type

switchShow

That should show you the status of the ports.
The output of

cfgShow

might also be useful to see...


Cheers,

Rob

P.S. I've assumed that you're using Brocade based switches...
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

I'm sorry I joined late, and sorry if I ask a dumb question.
What is the state of the cluster? Is it OK?
Are the other members running?
If yes, I would suggest taking the errlog files of this member and checking whether the disks have gone bad.
Change the disk, or else restore from backup.
BR,
Kapil
I am in this small bowl, I wane see the real world......
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

The cluster cannot boot.

There's no backup yet.
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

I think there is still hope for your cluster.
Do NOT reinstall yet.
And send output of:
>>> wwidmgr -show reach
to see whether the AlphaServer can reach the boot vdisk.
Do not forget that MA8000 storage is active/passive, so the active path could change.
Please send output of:
>>> show boot*
In vino veritas, in VMS cluster
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi Paul,
How many nodes does your cluster have? Can you try to boot your cluster from another boot disk?
If you are not sure about the configuration, you can try to boot from random disks.
Because only this disk seems to be corrupt, I think you can boot from other disks.

Or you could start one member from its member boot disk and then diagnose further from the errlog files.

BR,
Kapil
I am in this small bowl, I wane see the real world......
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi Paul,

Looking again at the output from your HSG80s, the disks are online to the "other controller", and not all of the HSG80 ports are up...

You could try forcing the ownership of the LUNs to "this controller", by doing a "restart other".

That may well bring things back to life...

Hope this helps,

Regards,

Rob
Frequent Advisor

Re: trucluster member booting failure--failed to send read to dga

>>> wwidmgr -show reach

It can reach my devices. The status is listed in the quickset udid output in my earlier reply. Maybe its format was not clear, but I note that all the devices are shown as connected.

Actually, when I do >>>show dev, I get all the devices set by wwidmgr quickset, as follows:

dga74.1001.0.10.0
dga75.1001.0.10.0
dga70.1001.0.10.0

I do not have a record of it, but I remember that show boot* showed the normal setup:

bootdef_dev dga74.1001.0.10.0
boot_osflag a

I state again that neither cluster member can boot. It seems that all the HSG disks have failed or cannot be accessed.
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

Hi,

If there are multiple paths to a disk, then you should be specifying all of them in the bootdef_dev variable...

So something like:

set bootdef_dev dga74.1001.0.10.0 dga74.1002.0.10.0 dga74.1003.0.10.0 dga74.1004.0.10.0

I'm fairly convinced that the paths you're trying to boot from are inactive...

Cheers,

Rob
Honored Contributor

Re: trucluster member booting failure--failed to send read to dga

I completely agree with Rob when he says
>>set bootdef_dev dga74.1001.0.10.0 dga74.1002.0.10.0 dga74.1003.0.10.0 dga74.1004.0.10.0

It should look like this. Please open a hardware case to have your hardware checked.
There must be some problem with the hardware.
I have never seen a TruCluster in such bad shape, and there are several ways to recover from TruCluster problems.

A hardware case may help,
because I feel that simply reinstalling is not a workable solution.
Kapil
I am in this small bowl, I wane see the real world......