

 
kenny chia
Regular Advisor

Disk error ?

I found this in the syslog file
Feb 26 20:48:17 qc1 vmunix: LVM: VG 1 : PV 0 (device 0x1f040000) is POWERFAILED
Feb 26 20:48:17 qc1 vmunix: LVM: Recovered Path (device 0x1f040000) to PV 0 in VG 1.
Feb 26 20:48:17 qc1 EMS [1852]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/storage/events/disk_arrays/High_Availability/8_12.8.0.255.0.1" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 121372690 -r /storage/events/disk_arrays/High_Availability/8_12.8.0.255.0.1 -n 121372678 -a
Feb 26 20:49:59 qc1 vmunix: LVM: Restored PV 0 to VG 1.

I have done a pvdisplay but I found no stale PEs. What could be wrong?

 

 

P.S. This thread has been moved from Disk to HP-UX > sysadmin. -HP Forum Moderator

All Your Bases Are Belong To Us!
4 REPLIES
Eugeny Brychkov
Honored Contributor

Re: Disk error ?

Disk is c4t0d0. Do:
- 'dd if=/dev/rdsk/c4t0d0 of=/dev/null bs=4096k' to check that every extent on the disk is readable. If you get an 'I/O error', back up the data and replace the disk;
- this disk has a (relatively) low priority. If the disks with SCSI IDs 1-7 are heavily loaded, requests to this disk may time out. Solution: 'pvchange -t 180'.
If there is a disk array at this address, please describe what it is and how it is configured.
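
For readers wondering how "device 0x1f040000" in the syslog maps to c4t0d0, here is a minimal Python sketch. It assumes the legacy HP-UX disk device number layout (8-bit major, then a minor of the form 0xIITLOO: card instance, target, LUN, option bits); the function name is made up for illustration, and the result should be confirmed with ioscan on the real system.

```python
def decode_hpux_disk_dev(dev):
    """Split a legacy HP-UX disk device number into (major, 'cXtYdZ').

    Assumed layout (not taken from the thread): the top 8 bits are the
    major number and the 24-bit minor is 0xIITLOO, where II is the card
    instance, T the target, L the LUN and OO the option bits.
    """
    major = (dev >> 24) & 0xFF
    minor = dev & 0xFFFFFF
    instance = (minor >> 16) & 0xFF   # II: interface card instance
    target = (minor >> 12) & 0xF      # T: SCSI target ID
    lun = (minor >> 8) & 0xF          # L: LUN
    return major, f"c{instance}t{target}d{lun}"


# Device number copied from the syslog line in the original post.
major, name = decode_hpux_disk_dev(0x1F040000)
print(major, name)  # prints: 31 c4t0d0
```

Under that assumption, major 31 is the block disk driver and the minor number decodes to card instance 4, target 0, LUN 0.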
Eugeny
kenny chia
Regular Advisor

Re: Disk error ?

I have done a
pvdisplay -v /dev/dsk/c4t0d0
There are no stale PEs. The device is a Nike30 with 7x18GB disks, configured as RAID5 with one hot spare.
All Your Bases Are Belong To Us!
Eugeny Brychkov
Honored Contributor

Re: Disk error ?

Ok Kenny,
it's a different story (FC attached disk array, I missed it in my last reply)!
What does this message mean? LVM (the Logical Volume Manager) detected a timeout on the path it uses to reach this FC device. In effect it says "the device is not responding, so I assume it has lost power" and keeps retrying (if an alternate path is available, it switches to that path). As soon as it detects that the device is back on the original path, it switches back.
Looking at the output you provided, I see problems with LUN 0 (c4t0d0) and LUN 1 (8/12.8.0.255.0.1). I suspect these are different LUNs, because the hardware path of LUN 0 should end in .0 while that of LUN 1 should end in .1.
The time elapsed until the PV was restored is about 1.7 minutes (20:48:17 to 20:49:59).
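
The outage length can be checked directly from the syslog timestamps in the original post. A minimal Python sketch follows; the helper names are made up, and because syslog omits the year, a placeholder year is assumed purely so the timestamps can be parsed.

```python
from datetime import datetime


def parse_stamp(line, year=2000):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line.

    The year is a placeholder (assumption): syslog does not record it.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def outage_seconds(lines):
    """Seconds between the first POWERFAILED line and the next 'Restored PV' line."""
    failed_at = None
    for line in lines:
        if failed_at is None and "POWERFAILED" in line:
            failed_at = parse_stamp(line)
        elif failed_at is not None and "Restored PV" in line:
            return (parse_stamp(line) - failed_at).total_seconds()
    return None  # no complete failure/restore pair found


# The two LVM lines copied from the syslog excerpt in the original post.
log = [
    "Feb 26 20:48:17 qc1 vmunix: LVM: VG 1 : PV 0 (device 0x1f040000) is POWERFAILED",
    "Feb 26 20:49:59 qc1 vmunix: LVM: Restored PV 0 to VG 1.",
]
print(outage_seconds(log))  # prints: 102.0
```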
In general there are many possible causes:
hardware -
- bad FC cable: check the cables. If you see these errors for only one FC30 controller, try replacing its cable; if you see them for both and you use a hub, replace the FC cable from the server's HBA to the hub;
- bad hub or FC HBA: try swapping them;
software -
- make sure you have the latest FC driver installed http://www.software.hp.com/cgi-bin/swdepot_parser.cgi/cgi/displayProducts.pl?group_type=category&group_name=DRIVERS ;
- make sure your system is well patched: recent Diagnostics, GR and HWE patch bundles, all taken from the CDs of one release.
In addition, you can log in to the FC30 serial port (do it for both controllers!) and go through the unsolicited logs and the presentation utility to see how the FC30 is doing.
Eugeny
Michael Steele_2
Honored Contributor

Re: Disk error ?

A very reliable method for detecting HW errors is the LOGTOOL utility found in STM. LOGTOOL lists entries by HW address, so you'll need to cross-reference them with ioscan. It reports accumulated error counts, and very high numbers, say over 200, indicate a HW problem with that particular device. To invoke and read LOGTOOL, follow these steps:

A.) STM > TOOLS > UTILITY > RUN > LOGTOOL > FILE > VIEW > RAW SUMMARY

B.) Note the First and Last Date at the top of the report, especially the Last Date. If you're still collecting errors, the Last Date will be very recent.

C.) Read down the report by HW address and note the integer in parentheses. You're looking for numbers in the hundreds, so you can ignore anything smaller.

For example, suppose the First and Last Date span a 4-hour period and HW address 8_12.8.0.255.0.1 shows (435). This is a very high number for a 4-hour period, and an HW call should be opened.
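
The rule of thumb above can be sketched in a few lines of Python. This does not parse LOGTOOL output (its report format is not shown in this thread); it assumes you have copied the per-address counts from the raw summary by hand. The function name is hypothetical, the second entry is made up for contrast, and the threshold of 200 is the figure suggested above.

```python
def flag_suspects(counts, window_hours, threshold=200):
    """Return (hw_address, count, errors_per_hour) for counts above threshold.

    counts       -- dict of HW address -> accumulated error count (hand-copied)
    window_hours -- hours between the report's First and Last Date
    threshold    -- rule-of-thumb cutoff from the post, not an official limit
    """
    suspects = [
        (address, count, count / window_hours)
        for address, count in counts.items()
        if count > threshold
    ]
    # Worst offender first.
    return sorted(suspects, key=lambda item: item[1], reverse=True)


counts = {
    "8_12.8.0.255.0.1": 435,  # the example figure from the post
    "8_12.8.0.255.0.0": 12,   # made-up low count, for contrast only
}
for address, count, rate in flag_suspects(counts, window_hours=4):
    print(f"{address}: {count} errors, {rate:.2f} per hour")
# prints: 8_12.8.0.255.0.1: 435 errors, 108.75 per hour
```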

LOGTOOL is the most reliable method for detecting HW errors in HP-UX.
Support Fatherhood - Stop Family Law