
Emulex HBAs - Tru64UNIX v5.1B-4 - Errors

 
A.W.R
Frequent Advisor


Hi,

I am getting a lot of errors from my KGPSA FC HBAs on Tru64 UNIX v5.1B-4.

31-Aug-2009 10:36:36 [700] EMX fiber channel adaptor (KGPSA) event

31-Aug-2009 10:36:36 [700] EMX fiber channel adaptor (KGPSA) event

31-Aug-2009 10:36:36 [700] EMX fiber channel adaptor (KGPSA) event

EMX[0]: H/W Error detected - adapter failed to complete io 0xfffffc20fbd88208 (1251707796:297031 vs 1251296703:412209 = 411092884ms) ccb 0xfffffc20fbd883f8 0/56

EMX[0]: H/W Error detected - reset scheduled for failed HBA.

EMX[0]: H/W Error detected - adapter failed to complete io 0xfffffc10fbd7f708 (1251707796:297031 vs 1251296703:412209 = 411092884ms)

EMX[0]: H/W Error detected - reset scheduled for failed HBA.

EMX[0]: H/W Error detected - adapter failed to complete io 0xfffffc20fbd74208 (1251707796:297031 vs 1251296703:412209 = 411092884ms) ccb 0xfffffc20fbd743f8 0/56

This is a dual-port HBA, and I get the messages first on EMX[0] and then on EMX[1]. The HBA has not failed. If I run (# hwmgr -view devices) on the system, I can still see the FC drives.

If anyone can suggest what might cause this, or where I can find further information, it would be appreciated.

Thanks
Andrew
2 REPLIES
DCBrown
Frequent Advisor

Re: Emulex HBAs - Tru64UNIX v5.1B-4 - Errors

"The HBA has not failed."

Well, maybe it has.

This specific logic was recently added to the driver in an attempt to detect a common internal hba failure that caused hung systems. There was even a specific systemic failure on one model of the FCA adapters at a specific firmware rev. There's an advisory out on that somewhere. But that's an older issue (~2+ years?).

At each heartbeat the driver looks at the top of each time-ordered hash list. If it sees the same io there too many times, it checks whether the allotted timeout period has expired. All io with a timeout of less than 256 seconds is timed within the hba itself. If the hba's internal timeout fails, as it sometimes does on flaky h/w, the system doesn't get the io back... ever... and this can precipitate a hang. The "error" you see displayed is the driver complaining that the io timeout within the hardware failed. A soft/firmware reset of the adapter is performed to try to recover rather than just sitting there waiting for a hang. BTW, it is possible to turn off the reset portion of the logic, since that is controlled via a different, existing config. In that case the software keeps complaining about the stuck hardware and keeps scheduling the reset, but one never happens.
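To make the heartbeat check concrete, here is a rough Python sketch of the logic as described above. It is not the driver's code: every name in it (`Io`, `Watchdog`, `heartbeat`, `reset_enabled`) is made up for illustration, and the `MAX_SIGHTINGS` threshold is an assumption, since the actual "too many times" count isn't stated here.

```python
from dataclasses import dataclass, field

MAX_SIGHTINGS = 3  # assumption: the real "seen too many times" threshold is unknown


@dataclass
class Io:
    tag: int          # stands in for the io address printed in the message
    issued_at: float  # seconds
    timeout_s: float  # allotted timeout for this io


@dataclass
class Watchdog:
    reset_enabled: bool = True  # hypothetical stand-in for the separate reset config
    last_head: dict = field(default_factory=dict)  # list index -> (tag, sightings)
    messages: list = field(default_factory=list)
    resets: int = 0

    def heartbeat(self, hash_lists, now):
        for idx, ios in enumerate(hash_lists):
            if not ios:
                self.last_head.pop(idx, None)
                continue
            head = ios[0]  # lists are time ordered, so the head is the oldest io
            tag, seen = self.last_head.get(idx, (None, 0))
            seen = seen + 1 if tag == head.tag else 1
            self.last_head[idx] = (head.tag, seen)
            if seen <= MAX_SIGHTINGS:
                continue  # not stuck at the head long enough to look suspicious
            if now - head.issued_at <= head.timeout_s:
                continue  # still inside its allotted timeout
            self.messages.append(f"adapter failed to complete io {head.tag:#x}")
            self.messages.append("reset scheduled for failed HBA")
            if self.reset_enabled:
                self.resets += 1
                for lst in hash_lists:
                    lst.clear()  # model: the reset hands all io back to be retried
                self.last_head.clear()
                return


# A stuck io with a 60 second timeout, checked once per second from t=100.
lists = [[Io(tag=0x10, issued_at=0.0, timeout_s=60.0)]]
wd = Watchdog()
for t in range(100, 105):
    wd.heartbeat(lists, float(t))
print(wd.resets, wd.messages)
```

With `reset_enabled=False` the same sketch keeps appending the two messages on every later heartbeat while `resets` stays at zero, which is the "keeps complaining, keeps scheduling, never resets" behaviour.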

The various sysconfigs should be in the release notes. You can turn them off if you want, but if the h/w really is broken then the system will eventually hang due to io that never completes.

The reset logic forces all io to be retried, so it "heals" the issue, but you'd then see messages about the hardware being reset, and those aren't shown in what you posted.

The one hole in the logic is if you glitch the system clock. Then you can make it appear as if the io has been outstanding longer than it really has and cause the logic to falsely trigger... but you need to bump the clock a lot.
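For what it's worth, the numbers in the posted message can be checked by hand. The sketch below assumes the two values in parentheses are seconds:microseconds pairs (that is my reading; it isn't documented here, but the arithmetic matches the printed millisecond figure). The function name and regex are just for illustration.

```python
import re
from datetime import datetime, timezone

LINE = ("EMX[0]: H/W Error detected - adapter failed to complete io "
        "0xfffffc20fbd88208 (1251707796:297031 vs 1251296703:412209 = 411092884ms)")

# Assumed layout: (now_sec:now_usec vs issued_sec:issued_usec = delta ms)
PAIR = re.compile(r"\((\d+):(\d+) vs (\d+):(\d+) = (\d+)ms\)")


def outstanding_ms(line):
    """Return (recomputed delta in ms, delta in ms as printed in the message)."""
    m = PAIR.search(line)
    if m is None:
        raise ValueError("no timestamp pair found")
    now_s, now_us, then_s, then_us, reported = map(int, m.groups())
    delta_us = (now_s - then_s) * 1_000_000 + (now_us - then_us)
    return delta_us // 1000, reported


computed, reported = outstanding_ms(LINE)
print(computed, reported)                     # both values in milliseconds
print(round(computed / 86_400_000, 2), "days outstanding")
print(datetime.fromtimestamp(1251707796, tz=timezone.utc))
```

The recomputed value agrees with the 411092884ms in the message, which works out to several days. That is far beyond a 256 second timeout, so either the io really has been stuck that long or the clock moved by that much; the message alone can't tell you which.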
Vladimir Fabecic
Honored Contributor

Re: Emulex HBAs - Tru64UNIX v5.1B-4 - Errors

I also think that the FC HBA card may be damaged.
Did you check the logs on the FC switch?
In vino veritas, in VMS cluster