HP MSA 2040 - Dual Controller iSCSI - Disappeared. Critical Error: OSMEnterDebugger

StephenWagner7 · ‎06-16-2014

Purchased a brand new HP MSA 2040 Dual Controller around 30 days ago, configured in minutes, worked great. This morning at 4:40AM it completely disappeared.

My apologies for the long-winded post.

Info:

-Dual Controller

-24 X 900GB Dual Port Enterprise Drives

-Used with VMWare

-2 ESXi hosts, each with 2 DAC connections (each host has 1 connection to controller a, 1 connection to controller b)

-Configured as per HP's best practice for MSA 2040 with vSphere document

-Round Robin enabled, multiple links active (each host has 1 connection to each controller)

-2 vDisks, cont a owns disks 1-12. cont b owns disks 13-24

-Different subnets used on each link

-4:40AM errors and warnings in event log start to occur (discovered this later viewing back)

-6:00AM I wake up and notice storage is down; the servers themselves are OK

-Check physical SAN, fans are 100%, amber health led is illuminated - degraded health

-Look at controllers, NICs are down on Cont A, NICs are up on Cont B, Amber health LED flashing on both controllers

-Go to log in to web interface, page loads, but states it's unavailable, cannot log in

-SSH in, logged in, every command gets a reply of:

Error: The MC is not ready. Wait a few seconds then retry the request. (2014-06-16 06:32:00)

After trying everything, couldn't do anything. Unplugged power cables, waited a couple seconds, and plugged back in.

-Unit came up, amber LED illuminated. Within 15 minutes this cleared by itself (after viewing logs, it recovered and wrote back the flash, I think)

I extracted these logs using the "save log" feature. They are very similar to the actual logs presented inside the management interface. I'll attach a screenshot of the actual logs in the web interface.

B440       2014-06-16 04:38:03  194   INFORMATIONAL  Auto-write-through trigger event: partner processor is not up.
B441       2014-06-16 04:38:03  71    INFORMATIONAL  Failover started. (failed or shutdown controller: A)
B442       2014-06-16 04:38:05  19    INFORMATIONAL  A rescan-bus operation was done. (number of disks that were found: 24, number of enclosures that were found: 1) (rescan reason: initiated by internal logic, rescan reason code: 2)
B443       2014-06-16 04:38:05  77    INFORMATIONAL  Write-back cache was initialized for controller A. Write-back data was found.
B444       2014-06-16 04:40:07  107   ERROR          Critical Error: OSMEnterDebugger  p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB   CThr:   IcMsgMon, DbgRegNum=255
B445       2014-06-16 06:37:19  56    INFORMATIONAL  Storage Controller booted up (cold boot - power up). SC firmware version: GLS105R04-01
B446       2014-06-16 06:37:52  84    WARNING        Killed partner controller. (reason: Non volatile device flush or restore failure)
B447       2014-06-16 06:37:52  204   INFORMATIONAL  The system has come up normally and the NV device is in a normal expected state. (p1: 0x0, p2: 0x2F, p3: 0x0, p4: 0x0)
B448       2014-06-16 06:38:23  204   INFORMATIONAL  The system has come up normally and the NV device is in a normal expected state. (p1: 0x0, p2: 0x30, p3: 0x0, p4: 0x0)
B449       2014-06-16 06:38:27  211   INFORMATIONAL  The SAS topology changed (components were added or removed). (Channel: 0, number of elements: 95, expanders: 1, native levels: 1, partner levels: 0, device PHYs: 25)
B450       2014-06-16 06:38:33  112   INFORMATIONAL  Host link down. (port: 1)
B451       2014-06-16 06:38:34  112   INFORMATIONAL  Host link down. (port: 2)
B452       2014-06-16 06:38:34  112   INFORMATIONAL  Host link down. (port: 3)
B453       2014-06-16 06:38:34  112   INFORMATIONAL  Host link down. (port: 4)
B454       2014-06-16 06:38:34  310   INFORMATIONAL  Discovery and initialization of enclosure data was completed following a rescan.
B455       2014-06-16 06:38:35  188   INFORMATIONAL  Write-back cache was disabled.
B456       2014-06-16 06:38:35  190   INFORMATIONAL  Auto-write-through trigger event: supercapacitor charging.
B457       2014-06-16 06:38:35  310   INFORMATIONAL  Discovery and initialization of enclosure data was completed following a rescan.
B458       2014-06-16 06:38:35  77    INFORMATIONAL  Write-back cache was initialized for controller B. Write-back data was found.
B459       2014-06-16 06:38:41  77    INFORMATIONAL  Write-back cache was initialized for controller A. Write-back data was found.
B460       2014-06-16 06:38:50  19    INFORMATIONAL  A rescan-bus operation was done. (number of disks that were found: 24, number of enclosures that were found: 1) (rescan reason: initiated by internal logic, rescan reason code: 27)
B461       2014-06-16 06:39:01  202   INFORMATIONAL  Auto-write-through: Write-back cache was reenabled.
B462       2014-06-16 06:39:01  191   INFORMATIONAL  Auto-write-through trigger event: supercapacitor good.
B463       2014-06-16 06:39:49  81    INFORMATIONAL  Kill was released (that is, the partner controller was allowed to boot up), automatic.
B464       2014-06-16 06:39:54  211   INFORMATIONAL  The SAS topology changed (components were added or removed). (Channel: 1, number of elements: 91, expanders: 1, native levels: 0, partner levels: 1, device PHYs: 25)
B465       2014-06-16 06:39:54  310   INFORMATIONAL  Discovery and initialization of enclosure data was completed following a rescan.
B466       2014-06-16 06:39:56  19    INFORMATIONAL  A rescan-bus operation was done. (number of disks that were found: 24, number of enclosures that were found: 1) (rescan reason: initiated by internal logic, rescan reason code: 27)
B467       2014-06-16 06:40:26  363   INFORMATIONAL  Firmware versions match those in the firmware bundle. (controller: B)
B468       2014-06-16 06:40:26  181   INFORMATIONAL  Management Controller configuration parameters were set.
B469       2014-06-16 06:40:26  181   INFORMATIONAL  Management Controller configuration parameters were set.
B470       2014-06-16 06:40:26  141   INFORMATIONAL  The Management Controller IP address changed. (new IP address: IP: 10.127.32.16/255.255.255.0/10.127.32.5)
B471       2014-06-16 06:40:26  139   INFORMATIONAL  The Management Controller booted up. MC firmware version: GLM105R009-01 (baselevel: L100)
B472       2014-06-16 06:40:28  111   INFORMATIONAL  Host link up. (port: 3, speed: 10 Gbps)
B473       2014-06-16 06:40:31  111   INFORMATIONAL  Host link up. (port: 4, speed: 10 Gbps)
B474       2014-06-16 06:40:32  112   WARNING        Host link down. (port: 3)
B475       2014-06-16 06:40:33  111   INFORMATIONAL  Host link up. (port: 3, speed: 10 Gbps)
B476       2014-06-16 06:40:35  112   WARNING        Host link down. (port: 4)
B477       2014-06-16 06:40:36  111   INFORMATIONAL  Host link up. (port: 4, speed: 10 Gbps)
B478       2014-06-16 06:41:26  310   INFORMATIONAL  Discovery and initialization of enclosure data was completed following a rescan.
B479       2014-06-16 06:41:31  195   INFORMATIONAL  Auto-write-through trigger event: partner processor is up.
B480       2014-06-16 06:41:31  73    INFORMATIONAL  Heartbeat was detected from the partner controller. This indicates that the partner controller is operational.
B481       2014-06-16 06:41:31  72    INFORMATIONAL  Recovery was initiated for controller A.
B482       2014-06-16 06:41:49  19    INFORMATIONAL  A rescan-bus operation was done. (number of disks that were found: 24, number of enclosures that were found: 1) (rescan reason: initiated by internal logic, rescan reason code: 27)
B483       2014-06-16 06:41:53  19    INFORMATIONAL  A rescan-bus operation was done. (number of disks that were found: 24, number of enclosures that were found: 1) (rescan reason: initiated by internal logic, rescan reason code: 6)
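For anyone sifting through a similar saved log, here is a minimal Python sketch for pulling out the warnings and errors and for spotting the stretch where the controller logged nothing. It assumes the column layout seen above (event ID, timestamp, event code, severity, message). The names `parse_event`, `notable`, and `longest_gap` are made up for illustration and are not part of any HP tooling.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional, Tuple

# Assumed layout, taken from the saved log in this post:
# <event id> <date> <time> <event code> <severity> <message>
EVENT_RE = re.compile(
    r"^(?P<event_id>[AB]\d+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<code>\d+)\s+"
    r"(?P<severity>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


class Event(NamedTuple):
    event_id: str
    timestamp: datetime
    code: int
    severity: str
    message: str


def parse_event(line: str) -> Optional[Event]:
    """Parse one log line; return None if it does not match the layout."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    return Event(
        m["event_id"],
        datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
        int(m["code"]),
        m["severity"],
        m["message"].strip(),
    )


def notable(events: List[Event]) -> List[Event]:
    """Keep only WARNING and ERROR events."""
    return [e for e in events if e.severity in ("WARNING", "ERROR")]


def longest_gap(events: List[Event]) -> Tuple[Event, Event]:
    """Return the adjacent pair with the longest silence between them.

    Needs at least two events, otherwise max() raises ValueError.
    """
    pairs = zip(events, events[1:])
    return max(pairs, key=lambda p: p[1].timestamp - p[0].timestamp)


# Four lines copied from the log above.
SAMPLE = [
    "B443       2014-06-16 04:38:05  77    INFORMATIONAL  Write-back cache was initialized for controller A. Write-back data was found.",
    "B444       2014-06-16 04:40:07  107   ERROR          Critical Error: OSMEnterDebugger  p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB   CThr:   IcMsgMon, DbgRegNum=255",
    "B445       2014-06-16 06:37:19  56    INFORMATIONAL  Storage Controller booted up (cold boot - power up). SC firmware version: GLS105R04-01",
    "B446       2014-06-16 06:37:52  84    WARNING        Killed partner controller. (reason: Non volatile device flush or restore failure)",
]

events = [e for e in map(parse_event, SAMPLE) if e is not None]
flagged = notable(events)
before, after = longest_gap(events)

for e in flagged:
    print(e.event_id, e.severity, e.code)
print("longest gap:", before.event_id, "->", after.event_id, after.timestamp - before.timestamp)
```

On the four sample lines, the longest gap falls between B444 (the critical error) and B445 (the cold boot after the power cycle), which matches the window in which the unit was unreachable.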

Inside the controller B kernel log, I noticed a bunch of these:

Jun 16 05:29:17 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))

Looking further, the NETDEV WATCHDOG reported errors on this interface. I don't know if this is a LAN-type interface for communication between the two controllers.

Jun 16 04:37:55 (none) user.warn kernel: MCMC: error status (0xda) - Unexpected failure. (Inter MC link message(0x17))
Jun 16 04:37:56 (none) user.warn kernel: ------------[ cut here ]------------
Jun 16 04:37:56 (none) user.warn kernel: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x15c/0x24c()
Jun 16 04:37:56 (none) user.info kernel: NETDEV WATCHDOG: mcmc (): transmit queue 0 timed out
Jun 16 04:37:56 (none) user.warn kernel: Modules linked in: mcmclink g_serial ocores_udc mcfulink mooseproc mcscbridge msgdrv
Jun 16 04:37:56 (none) user.warn kernel: [<c0014428>] (unwind_backtrace+0x0/0xec) from [<c02a6c50>] (dump_stack+0x20/0x24)
Jun 16 04:37:56 (none) user.warn kernel: [<c02a6c50>] (dump_stack+0x20/0x24) from [<c001bf60>] (warn_slowpath_common+0x5c/0x74)
Jun 16 04:37:56 (none) user.warn kernel: [<c001bf60>] (warn_slowpath_common+0x5c/0x74) from [<c001c034>] (warn_slowpath_fmt+0x40/0x48)
Jun 16 04:37:56 (none) user.warn kernel: [<c001c034>] (warn_slowpath_fmt+0x40/0x48) from [<c022a1e4>] (dev_watchdog+0x15c/0x24c)
Jun 16 04:37:56 (none) user.warn kernel: [<c022a1e4>] (dev_watchdog+0x15c/0x24c) from [<c00281b8>] (run_timer_softirq+0x1d0/0x2dc)
Jun 16 04:37:56 (none) user.warn kernel: [<c00281b8>] (run_timer_softirq+0x1d0/0x2dc) from [<c0021cfc>] (__do_softirq+0xd8/0x1c0)
Jun 16 04:37:56 (none) user.warn kernel: [<c0021cfc>] (__do_softirq+0xd8/0x1c0) from [<c002219c>] (irq_exit+0x50/0x5c)
Jun 16 04:37:56 (none) user.warn kernel: [<c002219c>] (irq_exit+0x50/0x5c) from [<c000f750>] (handle_IRQ+0x84/0xa4)
Jun 16 04:37:56 (none) user.warn kernel: [<c000f750>] (handle_IRQ+0x84/0xa4) from [<c00086b8>] (asm_do_IRQ+0x18/0x1c)
Jun 16 04:37:56 (none) user.warn kernel: [<c00086b8>] (asm_do_IRQ+0x18/0x1c) from [<c000e394>] (__irq_svc+0x34/0x80)
Jun 16 04:37:56 (none) user.warn kernel: Exception

Looks like a kernel panic.

Controller A was also filled with:

Jun 16 05:38:09 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))
Jun 16 05:38:14 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))
Jun 16 05:38:15 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))
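To get a feel for how many of these there are and which status codes dominate, a small tally can help. This is only a sketch that assumes the kernel log lines look exactly like the ones pasted here; `count_mcmc_errors` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches e.g. "MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))"
MCMC_RE = re.compile(
    r"MCMC: error status \((0x[0-9a-fA-F]+)\) - (.+?)\. \(Inter MC link message"
)


def count_mcmc_errors(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count MCMC error lines, keyed by (status code, description)."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        m = MCMC_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Lines copied from the kernel logs in this post.
SAMPLE_KERNEL_LOG = [
    "Jun 16 04:37:55 (none) user.warn kernel: MCMC: error status (0xda) - Unexpected failure. (Inter MC link message(0x17))",
    "Jun 16 05:38:09 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))",
    "Jun 16 05:38:14 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))",
    "Jun 16 05:38:15 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))",
]

mcmc_counts = count_mcmc_errors(SAMPLE_KERNEL_LOG)
for (status, description), n in mcmc_counts.most_common():
    print(status, description, n)
```

Run against the full kernel log from each controller, this would show whether 0xda or 0xdc is the more common status on each side.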

About 15 minutes after plugging it back in, the amber health LED cleared, and the system was back online and everything was good.

I called HP to see if they had input. They mentioned they don't know what caused this, however there were notes about the:

Critical Error: OSMEnterDebugger p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB CThr: IcMsgMon, DbgRegNum=255

being a firmware related issue. They mentioned there are internal notes on other SANs (P2000, 2012i), but none for the MSA 2040.
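When comparing this error against other reports (or against another unit's log), it can help to break the line into its fields first. The sketch below assumes the field layout seen in this post (four hex parameters, a thread name after `CThr:`, and a `DbgRegNum` value). What the fields actually mean is not documented here, so the key names are only labels, and `parse_critical_error` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Whitespace varies between the saved log and the copy quoted in the post,
# so the pattern tolerates one or more spaces between fields.
CRIT_RE = re.compile(
    r"Critical Error:\s+(?P<name>\w+)\s+"
    r"p1:\s*(?P<p1>0x[0-9A-Fa-f]+),\s*p2:\s*(?P<p2>0x[0-9A-Fa-f]+),\s*"
    r"p3:\s*(?P<p3>0x[0-9A-Fa-f]+),\s*p4:\s*(?P<p4>0x[0-9A-Fa-f]+)\s+"
    r"CThr:\s*(?P<thread>\w+),\s*DbgRegNum=(?P<dbg_reg>\d+)"
)


def parse_critical_error(text: str) -> Optional[Dict[str, object]]:
    """Split a 'Critical Error' event message into its fields."""
    m = CRIT_RE.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    return {
        "name": fields["name"],
        "params": tuple(int(fields[k], 16) for k in ("p1", "p2", "p3", "p4")),
        "thread": fields["thread"],
        "dbg_reg": int(fields["dbg_reg"]),
    }


# The same error as it appears in the saved log and as quoted in the post.
LOG_LINE = "Critical Error: OSMEnterDebugger  p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB   CThr:   IcMsgMon, DbgRegNum=255"
QUOTED_LINE = "Critical Error: OSMEnterDebugger p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB CThr: IcMsgMon, DbgRegNum=255"

parsed = parse_critical_error(LOG_LINE)
print(parsed)
print("same error:", parsed == parse_critical_error(QUOTED_LINE))
```

Comparing the parsed dictionaries rather than the raw strings avoids false mismatches caused by spacing differences between log exports.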

They offered to replace controller A, however this unit is BRAND new, so I don't want to replace a controller with a "repaired" controller. They also mentioned the compact flash may not be working due to the error logged in the logs.

However, I think the compact flash error was triggered because I plugged the unit back in right after unplugging it, while it was probably still using the supercapacitor to write the cache to the compact flash (since later on it was able to successfully write it back to disk, I think).

Anyone have any input? I was told normally they could make an internal note for investigation on the:

Critical Error: OSMEnterDebugger p1: 0x03259E6, p2: 0x0325E43, p3: 0x03268AD, p4: 0x0326DCB CThr: IcMsgMon, DbgRegNum=255

error, however a new firmware was released 4 days ago, so they can't since I'm running the 2nd newest firmware. Keep in mind the latest firmware only resolves two issues that are completely unrelated.

Any help would be appreciated. It's been back online for 6+ hours now with no issues. I'm just shocked both controllers wigged out and there was no fallback.

I'm thinking it's firmware related, but not sure.

StephenWagner7 · ‎06-16-2014

Here's a screenshot of the logs. Please find attached.

Sorry, I couldn't copy and paste from the web interface for some reason.

StephenWagner7 · ‎06-17-2014

Digging further into the controller kernel logs, I'm betting this entry set it off:

Jun 16 04:38:08 (none) user.warn kernel: MCMC: error status (0xda) - Unexpected failure. (Inter MC link message(0x17))

The kernel logs on Controller A were flooded with these when the unit went offline. I'm assuming this is the network link used in the backplane for the controllers to talk to each other.

The kernel log on controller B showed the same (tons of these):

Jun 16 04:37:58 (none) user.warn kernel: MCMC: error status (0xda) - Unexpected failure. (Inter MC link message(0x17))

Jun 16 04:39:57 (none) user.warn kernel: MCMC: error status (0xdc) - Memory write failed. (Inter MC link message(0x17))

And it all started with:

Jun 16 04:37:55 (none) user.warn kernel: MCMC: error status (0xda) - Unexpected failure. (Inter MC link message(0x17))
Jun 16 04:37:56 (none) user.warn kernel: ------------[ cut here ]------------
Jun 16 04:37:56 (none) user.warn kernel: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x15c/0x24c()
Jun 16 04:37:56 (none) user.info kernel: NETDEV WATCHDOG: mcmc (): transmit queue 0 timed out
Jun 16 04:37:56 (none) user.warn kernel: Modules linked in: mcmclink g_serial ocores_udc mcfulink mooseproc mcscbridge msgdrv
Jun 16 04:37:56 (none) user.warn kernel: [<c0014428>] (unwind_backtrace+0x0/0xec) from [<c02a6c50>] (dump_stack+0x20/0x24)
Jun 16 04:37:56 (none) user.warn kernel: [<c02a6c50>] (dump_stack+0x20/0x24) from [<c001bf60>] (warn_slowpath_common+0x5c/0x74)
Jun 16 04:37:56 (none) user.warn kernel: [<c001bf60>] (warn_slowpath_common+0x5c/0x74) from [<c001c034>] (warn_slowpath_fmt+0x40/0x48)
Jun 16 04:37:56 (none) user.warn kernel: [<c001c034>] (warn_slowpath_fmt+0x40/0x48) from [<c022a1e4>] (dev_watchdog+0x15c/0x24c)
Jun 16 04:37:56 (none) user.warn kernel: [<c022a1e4>] (dev_watchdog+0x15c/0x24c) from [<c00281b8>] (run_timer_softirq+0x1d0/0x2dc)
Jun 16 04:37:56 (none) user.warn kernel: [<c00281b8>] (run_timer_softirq+0x1d0/0x2dc) from [<c0021cfc>] (__do_softirq+0xd8/0x1c0)
Jun 16 04:37:56 (none) user.warn kernel: [<c0021cfc>] (__do_softirq+0xd8/0x1c0) from [<c002219c>] (irq_exit+0x50/0x5c)
Jun 16 04:37:56 (none) user.warn kernel: [<c002219c>] (irq_exit+0x50/0x5c) from [<c000f750>] (handle_IRQ+0x84/0xa4)
Jun 16 04:37:56 (none) user.warn kernel: [<c000f750>] (handle_IRQ+0x84/0xa4) from [<c00086b8>] (asm_do_IRQ+0x18/0x1c)
Jun 16 04:37:56 (none) user.warn kernel: [<c00086b8>] (asm_do_IRQ+0x18/0x1c) from [<c000e394>] (__irq_svc+0x34/0x80)
Jun 16 04:37:56 (none) user.warn kernel: Exception stack(0xc03e7f50 to 0xc03e7f98)
Jun 16 04:37:56 (none) user.warn kernel: 7f40: 00000000 0005317f 0005217f 60000013
Jun 16 04:37:56 (none) user.warn kernel: 7f60: c03e86c8 00000000 c07400e0 c03eb1fc 49804000 41069265 49bd4dc4 c03e7fa4
Jun 16 04:37:56 (none) user.warn kernel: 7f80: 600000d3 c03e7f98 c000f8ec c000f8f8 60000013 ffffffff
Jun 16 04:37:56 (none) user.warn kernel: [<c000e394>] (__irq_svc+0x34/0x80) from [<c000f8f8>] (default_idle+0x3c/0x40)
Jun 16 04:37:56 (none) user.warn kernel: [<c000f8f8>] (default_idle+0x3c/0x40) from [<c000fad0>] (cpu_idle+0x60/0xb4)
Jun 16 04:37:56 (none) user.warn kernel: [<c000fad0>] (cpu_idle+0x60/0xb4) from [<c02a42c4>] (rest_init+0x68/0x80)
Jun 16 04:37:56 (none) user.warn kernel: [<c02a42c4>] (rest_init+0x68/0x80) from [<c03be7a0>] (start_kernel+0x2a8/0x300)
Jun 16 04:37:56 (none) user.warn kernel: ---[ end trace ec0622d06186d082 ]---
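The backtrace is easier to read as a plain call chain. Here is a rough Python sketch that extracts the function names, assuming each frame line has the `[<addr>] (callee+off/size) from [<addr>] (caller+off/size)` shape shown above and that the frames appear in order. `call_chain` is a made-up helper name.

```python
import re
from typing import Iterable, List

# One frame per line: "[<c0014428>] (unwind_backtrace+0x0/0xec) from [<c02a6c50>] (dump_stack+0x20/0x24)"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] \((?P<callee>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\) "
    r"from \[<[0-9a-f]+>\] \((?P<caller>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\)"
)


def call_chain(lines: Iterable[str]) -> List[str]:
    """Return function names from innermost to outermost frame.

    Lines that are not frame lines (headers, stack dumps) are skipped.
    """
    chain: List[str] = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m is None:
            continue
        if not chain:
            chain.append(m["callee"])
        chain.append(m["caller"])
    return chain


# First few lines of the trace above.
SAMPLE_TRACE = [
    "Jun 16 04:37:56 (none) user.info kernel: NETDEV WATCHDOG: mcmc (): transmit queue 0 timed out",
    "Jun 16 04:37:56 (none) user.warn kernel: [<c0014428>] (unwind_backtrace+0x0/0xec) from [<c02a6c50>] (dump_stack+0x20/0x24)",
    "Jun 16 04:37:56 (none) user.warn kernel: [<c02a6c50>] (dump_stack+0x20/0x24) from [<c001bf60>] (warn_slowpath_common+0x5c/0x74)",
    "Jun 16 04:37:56 (none) user.warn kernel: [<c001bf60>] (warn_slowpath_common+0x5c/0x74) from [<c001c034>] (warn_slowpath_fmt+0x40/0x48)",
    "Jun 16 04:37:56 (none) user.warn kernel: [<c001c034>] (warn_slowpath_fmt+0x40/0x48) from [<c022a1e4>] (dev_watchdog+0x15c/0x24c)",
]

chain = call_chain(SAMPLE_TRACE)
print(" <- ".join(chain))
```

Read this way, the trace is `dev_watchdog` emitting a warning through the `warn_slowpath_*` helpers after the `mcmc` transmit queue timed out, which is consistent with the `WARNING: at net/sched/sch_generic.c:256` line at the top of the trace.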

Doug Skranak · ‎07-24-2014

hi Stephen,

Like you, I have a brand new MSA (controller firmware revision: GLS105R04-01) and I was able to crash the controller. I'm in the process of opening a ticket with HP concerning this. I'm curious to know what you find out, and I'll post back to your thread when I find out more with my ticket. :)

I crashed Controller A as well, curiously, but it was also the controller that I was doing a benchmark run on.

I'm going to look through your posts and my logs a little more carefully a bit later and follow up.

Thanks,

Doug

StephenWagner7 · ‎07-24-2014

Hmm,

Shortly after this I updated to the latest firmware, and I haven't encountered the issue again. Keep in mind I haven't done any benchmarks on the unit since. It's been fairly stable, and I occasionally restart the individual controllers.

HP kept in touch regarding this issue with my case, however I didn't want to replace any of the hardware since my unit is brand new. Eventually they closed the case since this didn't occur again.

I'm betting money this is firmware related. Hopefully, if enough people complain, they will investigate the issue and come out with a fix.

hpnoobie · ‎10-17-2014

For anyone else searching this post, here is some more useful information:

We've also experienced the same issue: an MSA2040 using iSCSI, directly attached to c7000 Virtual Connect modules. Under heavy network traffic to/from the MSA2040, it simply disappears from VMware and none of the hosts can see it.

We updated the MSA2040 to the latest firmware, added drivers to ESXi for the network adapters in each blade, and still no dice. From reading around, it could be a firmware issue on the network adapter in each blade, or an issue with ESXi 5.5. Will post back here once we have the solution. Some other posts about similar issues with the P2000 suggest that it might be an ESXi 5.5 issue and you (we) might have to roll back to 5.1 for compatibility. Bummer.

StephenWagner7 · ‎10-17-2014

Just an update on my situation: shortly after the incident I updated to the latest firmware on the MSA 2040, and since then I've also applied a few ESXi updates.

Since then, this issue has not occurred again for me. If anything happens on my side, I'll post back.

aniebuhr · ‎10-31-2016

Hello, did anybody solve this problem?

I have these errors and the controller does not boot:

WARNING Killed partner controller. (reason: Non volatile device flush or restore failure)

StephenWagner7 · ‎10-31-2016

Hello AnieBuhr,

Could you clarify if you have the problem mentioned in the initial thread? Or is it a new different problem?

If it is the same problem that was mentioned in the initial thread, please make sure you have your firmware up to date on the MSA2040 SAN (as the firmware update resolved my issues).

If your issue is different from the problem mentioned in the initial thread, please contact HPE support, as they will be able to provide support and help diagnose your issue if your MSA2040 is under warranty or covered by an HPE Care Pack.
