Re: P410 lockups on 256 SMART calls when firmware >= 5.70

Jack-S · ‎02-08-2015

Can anyone confirm some behavior we're seeing with the HP Smart Array P410 controller? Our disk monitoring system reads the SMART attributes reported by the disks to give us early warning of predicted disk failures. However, we've found that if "smartctl --all" is called more than 256 times, the controller locks up until the server is rebooted.

Testing has narrowed the cause to either the SMART READ ATTRIBUTE THRESHOLDS call or the SMART STATUS CHECK call. Testing was done on a variety of ProLiant DL180 G6 servers, with firmware versions ranging from 3.66 to 6.60. We have found that when the firmware version is less than 5.70 (5.14 being common for us), the lockups do not occur. Firmware versions 5.70, 6.00-2, 6.40, 6.48, and 6.60 have all been confirmed to lock up after 256 calls.
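
For anyone checking a fleet of servers against the version cutoff described above, the comparison could be scripted roughly as follows. This is a minimal sketch, assuming GNU sort (for its -V version ordering) is available. The function name fw_affected is hypothetical, the firmware version has to be supplied by the caller, and the 5.70 cutoff is only what the testing in this post observed, not something confirmed by HP.

```shell
# Return 0 (true) if the given firmware version is >= 5.70, the lowest
# version observed to lock up in this thread (assumption, not an HP statement).
# Relies on GNU sort -V for version-aware ordering.
fw_affected() {
    cutoff="5.70"
    lowest=$(printf '%s\n' "$cutoff" "$1" | sort -V | head -n 1)
    # If the cutoff sorts first (or ties), the given version is >= cutoff.
    [ "$lowest" = "$cutoff" ]
}

# Example: report on a version string passed by hand.
if fw_affected "6.00-2"; then
    echo "6.00-2: in the range seen to lock up"
fi
```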

This is what appears in dmesg when a lockup occurs:

[ 699.483852] hpsa 0000:06:00.0: Abort request on C0:B0:T0:L1
[ 723.171333] hpsa 0000:06:00.0: Controller lockup detected: 0x0015002d
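
To watch for this message from a monitoring script, the lockup code can be pulled out of kernel log lines like the ones above. This is a rough sketch only: the function name lockup_codes is made up for illustration, and it assumes the message format is exactly as shown in this post.

```shell
# Read kernel log lines on stdin and print the hex code from every
# hpsa "Controller lockup detected" message (format as quoted in this thread).
lockup_codes() {
    sed -n 's/.*hpsa .*Controller lockup detected: \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Typical use (needs permission to read the kernel log):
#   dmesg | lockup_codes
```

If the function prints nothing, no lockup message was found in the input.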

We have crafted a one-line command to reproduce these lockups. In this example, smartctl -H is called against a single disk in rapid succession. However, these lockups also occur when smartctl is run against multiple disks over a span of days to weeks. That is to say, it does not appear to be disk-specific or time-dependent; the lockup simply occurs once the call count reaches 256.

i=1; while true; do echo "SMARTCTL ID $i"; smartctl -r ioctl,3 -d sat+cciss,0 -H /dev/sda; let i=$i+1; done
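
A bounded variant of the loop above may be easier to use when comparing firmware versions, since it stops on its own and reports the call number at which a command first failed. This is a sketch under assumptions: the function name smart_loop is hypothetical, any non-zero exit status is treated as a failure, and a call that hangs instead of returning will not be detected.

```shell
# Usage: smart_loop MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first non-zero exit status.
smart_loop() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "call $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $max calls"
}

# Example, using the same smartctl arguments as the one-liner above:
#   smart_loop 300 smartctl -r ioctl,3 -d sat+cciss,0 -H /dev/sda
```

Note that smartctl's exit status is a bit mask, so a non-zero value can also mean the disk itself reported a problem; check the dmesg output before concluding the controller locked up.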

Please let me know if any other information is needed to help diagnose these lockups.

P.S. This thread has been moved from Disk Array to ProLiant Servers (ML,DL,SL). - Hp Forum Moderator

Johan Guldmyr · ‎02-09-2015

Hi,

I tried to run this on a DL 360 G7 with a P410i and a P411, both with 6.00-2. /dev/sda is behind first controller.

To get the smartctl to work I had to remove the sat+ part of the -d parameter.

Also made it log each run to syslog, bit easier to follow then :)

Unfortunately it didn't crash after 256 runs.

i=1; while true; do logger "SMARTCTL ID $i"; smartctl -r ioctl,3 -d cciss,0 -H /dev/sda; let i=$i+1; done

On another DL360 G7 we had problems with resets every now and then as well.

Never figured out what caused it but haven't seen it since replacing the controller + cache.

Which version of the hpsa module are you using? There is one in the kernel and one from hp:

http://downloads.linux.hp.com/SDR/repo/spp/RHEL/6/x86_64/current/ for example has kmod-hpsa-3.4.6-171 for RHEL6.

Jack-S · ‎02-11-2015

Hi Johan,

First, thanks for your help in trying to solve this!

Since we aren't running a Red Hat-based distro, we typically use the kernel's hpsa driver. However, since this server isn't in production, I tried installing CentOS 6.6 so I could test HP's official driver. Using kmod-hpsa-3.4.6-171.rhel6u6.x86_64.rpm I was still able to trigger a controller lockup, but this time it showed more information:

hpsa 0000:06:00.0: resetting device 0:0:0:0
hpsa 0000:06:00.0: Controller lockup detected: 0x0015002d
hpsa 0000:06:00.0: CDB 2a000914181800000800000000000000 : hardware error
hpsa 0000:06:00.0: CDB 2a007413ae9000002800000000000000 : hardware error
hpsa 0000:06:00.0: CDB 2a000014884800000800000000000000 : hardware error
hpsa 0000:06:00.0: cp ffff880037990300 had hardware error
hpsa 0000:06:00.0: resetting device failed.
sd 0:0:0:0: Device offlined - not ready after error recovery

[additional messages from ext4/sd; see http://pastebin.com/TjuaeCnk for full log]
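
For reading the CDB lines in the log above: the first byte, 2a, is the SCSI WRITE(10) opcode, so these appear to be ordinary writes that were outstanding when the controller locked up rather than the SMART commands themselves. A small decoder, sketched here with the hypothetical name decode_write10, pulls out the logical block address (bytes 2 to 5) and the transfer length (bytes 7 to 8) from such a line. It handles only WRITE(10) and assumes the lowercase hex format printed by the driver.

```shell
# Decode a WRITE(10) CDB given as a hex string (as printed by hpsa).
# Byte 0 = opcode 0x2a, bytes 2-5 = LBA, bytes 7-8 = transfer length in blocks.
decode_write10() {
    cdb=$1
    op=$(printf '%s' "$cdb" | cut -c1-2)
    if [ "$op" != "2a" ]; then
        echo "not a WRITE(10) CDB" >&2
        return 1
    fi
    lba_hex=$(printf '%s' "$cdb" | cut -c5-12)
    len_hex=$(printf '%s' "$cdb" | cut -c15-18)
    printf 'WRITE(10) lba=%d blocks=%d\n' "0x$lba_hex" "0x$len_hex"
}

decode_write10 2a000914181800000800000000000000
```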

What is the best way to proceed from here? Is this truly a hardware error even though we've only been able to reproduce it on firmware versions >= 5.70?

Johan Guldmyr · ‎02-12-2015

Hi,

OK all I know is that the errors look pretty much identical to what I had on one server.

At some point in January this was happening almost every day.

Since the replacement (now a month ago) we haven't seen this behavior or any errors.

With earlier firmware (this was before replacing the controller), the problem we had was that the server rebooted after the "resetting device" message. With the newer firmware, the controller was no longer visible from the OS and the server had to be rebooted manually. I guess some error handling has been changed/improved.

Maybe our errors aren't exactly the same and what you have is an actual firmware problem.

Jan 7 02:13:24 host kernel: hpsa 0000:06:00.0: resetting device 1:0:0:0
Jan 7 02:13:45 host kernel: hpsa 0000:06:00.0: Controller lockup detected: 0x00130024
Jan 7 02:13:45 host kernel: hpsa 0000:06:00.0: CDB 2a0000002bb000000800000000000000 : hardware error
Jan 7 02:13:45 host kernel: hpsa 0000:06:00.0: CDB 8a000000000475000838000000200000 : hardware error
[snip]
Jan 7 02:13:45 host kernel: hpsa 0000:06:00.0: cp ffff880037c39f00 had hardware error
Jan 7 02:13:45 host kernel: hpsa 0000:06:00.0: resetting device failed.
Jan 7 02:13:45 host kernel: sd 1:0:0:5: Device offlined - not ready after error recovery
