Re: I/O error while reading the VGDA

julio quadros · ‎12-21-2002

Hello

We have a cluster of two L1000 machines running HP-UX B.11.00 and MC/ServiceGuard. Both machines have the same configuration:
vg00 --> 2x internal disks (c1t2d0 and c2t2d0)
other VGs on the external FC60 array

Today we had the following problem:

1. While running bdf, the session hung, and the same then happened in every other session. We stopped the cluster (cmhaltcl) on MACH1 and tried to shut the machine down. The shutdown hung at the "Unmounting filesystems" stage, so we had to power it off.

2. When we tried to boot it, it reported "System alert: 12 - Software failure" and, after we acknowledged that, showed "alert: 3" a couple of times and stayed there.

3. Then we booted with the "hpux -lm" command.

4. When trying to activate vg00 (vgchange -a y vg00), we got "I/O error while reading the VGDA".

5. We checked the disks with diskinfo and they were OK; dd worked as well.

6. After some research, we ran "vgcfgrestore -n /dev/vg00 /dev/rdsk/c1t2d0", but activation still failed with the same error as in step 4.

7. We ran the same command for disk c2t2d0.

8. Now the activation of vg00 worked.

9. We rebooted the machine and everything was fine.
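
For reference, steps 6 to 8 amount to the short sequence below. This is only a dry-run sketch: the function name `print_restore_plan` is made up for illustration, and it prints the commands instead of running them, since vgcfgrestore writes LVM structures back to the disk it is pointed at.

```shell
# print_restore_plan VG DISK...
# Prints (does not run) the restore-and-activate commands from steps 6 to 8
# for each disk name given, e.g. c1t2d0. The function name is hypothetical.
print_restore_plan() {
    vg=$1
    shift
    for disk in "$@"; do
        # Restore the LVM configuration backup onto the raw disk device.
        echo "vgcfgrestore -n /dev/$vg /dev/rdsk/$disk"
        # Then try to activate the volume group again.
        echo "vgchange -a y $vg"
    done
}
```

`print_restore_plan vg00 c1t2d0 c2t2d0` prints the four commands in the order we ran them.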

Now, the questions are:

1) Why did this happen?
2) Is there a way to find out what happened?

Thanks for your support

Julio Quadros

Eugeny Brychkov · ‎12-21-2002

Julio,
it looks like the on-disk structures got corrupted. Since dd works but you received an I/O error reading the VGDA, I think an underlying structure (the BDRA) was pointing to an invalid or nonexistent disk location, hence the I/O error. Check:
- /var/adm/crash for crashdumps;
- /var/adm/tombstones for processor logs;
- the expert tool in STM, to see whether either of these two disks has defects logged in its grown defect list.
Did you try to boot with '-lm' from both disks? I mean, can we tell whether the information was corrupt on only one disk or on both?
Since you ran 'vgcfgrestore -n /dev/vg00 /dev/rdsk/c1t2d0', I guess this disk had corrupt LVM structures. I would check its health with Diagnose/Verify in STM and pay attention to its grown defect table. In addition, since these two boot disks are on different controllers, the problem could be with the controller that c1t2d0 is connected to.
If you have an active contract or warranty, I would suggest calling HP.
Eugeny

Julio Quadros_1 · ‎12-22-2002

Dear Eugeny, thanks for your reply and my apologies for my ignorance.
1) What is STM? How do I check the disks for errors?

2) At the PDC or ISL level, is there a way to see whether the disks are working properly? Something like diskinfo?

I will follow your recommendations and post the results here.

Thanks

JQ

Eugeny Brychkov · ‎12-22-2002

STM is the Support Tools Manager, available in HP-UX:
cstm - command-line STM
mstm - menu-driven
xstm - graphical
If you have a 700/96 terminal, use mstm. Select the desired disk, go to Tools, and then try 'Information', 'Verify' and 'Diagnose'. Try the expert tools to see the disk defect table. If they are not available, you need a temporary password: call HP.
At the PDC level, go to the service menu and run the 'pim' command to see whether there is a valid timestamp in its output. You can see memory statistics in the information menu.
From ISL there is ODE (the Offline Diagnostics Environment), but only if you have installed it. It has an expert tool for diagnosing disks, but it is better to use STM in HP-UX.
As I already mentioned, check the files in /var/adm/crash and /var/adm/tombstones (the most recent file should be ts99).
You can do all of this yourself.
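
The file checks above can be scripted. A minimal sketch, assuming a POSIX shell; `report_dir` is a made-up helper name, not an HP-UX tool. It only says whether a directory exists and lists what is in it, newest first:

```shell
# report_dir DIR
# Reports whether DIR exists and lists its entries, newest first.
# Returns 1 if DIR is missing. The helper name is hypothetical.
report_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 1
    fi
    # ls -t sorts by modification time, most recent entry first.
    entries=$(ls -t "$dir")
    if [ -z "$entries" ]; then
        echo "$dir: empty"
    else
        echo "$dir: newest first"
        # Indent each entry so it reads as belonging to the directory.
        echo "$entries" | sed 's/^/  /'
    fi
}
```

Run it as `report_dir /var/adm/crash` and `report_dir /var/adm/tombstones`. A "missing" or "empty" answer is itself worth reporting when you call HP.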
Eugeny

julio quadros · ‎12-22-2002

Hi Evgueny

HELP HELP HELP

It happened again today, and once again I had to vgcfgrestore c2t2d0.
Following your advice, I ran xstm, and this is what I got:

1) No errors on either disk
2)"Diagnose" and "Expert tool" are not available
3) /var/adm/crash has no crash dumps
4) My machine has no /var/adm/tombstones. In /etc/rc.config.d I have the file pdcinfo with PDCINFO=1 and PDCINFO_OPTS=
5) The information for both controllers does not show any errors. If there were a controller error, would it show up here? For both controllers it shows:
Device status:
Bit 9-10:DEVSEL timing 01 - medium

I am really worried because this is a critical cluster and I don't know what the root cause is.

Thanks for your support

JQ

avsrini · ‎12-22-2002

Hi Julio,
Are the disks c1t2d0 and c2t2d0 mirrored?
It seems that your c1t2d0 disk is about to fail, so it is time to make a make_recovery tape and look into replacing the disk.
As STM reports no errors on the disks, it may be the LVM information on the disk that is getting corrupted.

Also check the SCSI termination on the bus.

Are there any errors logged in the syslog file, such as disk access errors on c1t2d0?

Srini

Be on top.

Eugeny Brychkov · ‎12-22-2002

Srini is right. Check /var/adm/syslog/syslog.log and OLDsyslog.log to see whether any disk events are logged (PV failed, SCSI resets/aborts, etc.).
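
A quick way to do that check is sketched below, assuming a POSIX shell. `scan_logs` is a hypothetical helper, and the keywords in the usage line are an assumption based on the event types named above; adjust them to whatever your logs actually contain.

```shell
# scan_logs PATTERN FILE...
# Prints every line matching PATTERN (extended regex, case-insensitive)
# as file:line-number:text. Files that do not exist are skipped.
# The helper name is hypothetical.
scan_logs() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        # The extra /dev/null argument makes grep print the file name
        # even when it is given only one real file.
        grep -E -i -n "$pattern" "$f" /dev/null || true
    done
}
```

For example: `scan_logs 'scsi|lvm|reset|abort' /var/adm/syslog/syslog.log /var/adm/syslog/OLDsyslog.log`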
I think you should call HP. They will diagnose the system and develop an action plan.
Eugeny

T. M. Louah · ‎12-27-2002

To add to the above: when `bdf` hangs, you can run:
# tail -f /var/adm/syslog/syslog.log
and check what kind of events are being logged at that point in time. You can always go back in time in syslog.log and OLDsyslog.log for any indications:
# grep -Ei "scsi|power|lbolt" syslog.log
should return any SCSI resets or power-fail messages. Watch for messages such as: .. dev_t 0x1F012000
The hex code above means disk c1t2d0 has a problem, not necessarily a hardware one. Power-fail messages can be corrected by increasing the I/O timeout (shown by pvdisplay /dev/dsk/c#t#d#). It is usually set to the default, but can be changed with, for example, pvchange -t 90 disk. More information is in the pvchange man page.
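
To see how that hex code maps to a disk name, here is a small sketch. It assumes the usual HP-UX 11.x layout for SCSI disk device numbers: the top 8 bits are the major number, followed by 8 bits of card instance, 4 bits of target, 4 bits of LUN and 8 option bits. `decode_dev_t` is a made-up name, and the layout should be checked against `ll /dev/dsk` on your own system. It is written for a shell that accepts C-style hex constants in `$(( ))`, such as bash or dash; older ksh88-derived shells may need the `16#1F012000` form instead.

```shell
# decode_dev_t HEX
# Splits a 32-bit dev_t into its major number and a cXtYdZ disk name,
# under the field layout assumed in the text above (hypothetical helper).
decode_dev_t() {
    dev=$(( $1 ))
    major=$(( (dev >> 24) & 0xff ))   # top 8 bits: driver major number
    card=$(( (dev >> 16) & 0xff ))    # next 8 bits: interface card instance
    target=$(( (dev >> 12) & 0xf ))   # next 4 bits: SCSI target
    lun=$(( (dev >> 8) & 0xf ))       # next 4 bits: LUN
    echo "major=$major c${card}t${target}d${lun}"
}
```

`decode_dev_t 0x1F012000` prints `major=31 c1t2d0`, which matches the disk named above.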
Generally, a hardware failure can be predicted if you see a high number of I/O errors for a specific device. With diagnostics installed, run cstm ---> ru --> logtool option --> rs (to run summary). This will show you the number of I/O errors per device.
To reset the log you can run SL (for switch log) at the LogUtility prompt.

Cheers,
T?

Little learning is dangerous!
