Re: What happens during disk replacement?

skendric · 01-18-2012

We have a T800 running v2.3.1, loaded with ~500 1-2TB SATA drives. Four clients (the two heads of a clustered NetApp v3140, a Solaris x86 box, and a Win2008 box).

Does anyone else experience hiccups when a tech replaces a disk?

We lose disks of course: ~1-3 / month. On three occasions over the last two years, while the tech is replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (same NetApp head each time: Tungsten-A) panics and hands its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly -- some services require manual intervention before they start on Tungsten-B ... ouch.]

Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.

A few tidbits from syslog (I'm leaving most lines out of course):

[I think this is where the tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}

[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal

[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql-box.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0

[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.

[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled

[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.

[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.

[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled
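
[The excerpt above interleaves lines from several hosts and is not strictly chronological: the 12:52:47 hamster-2 line appears before a 12:46:12 tungsten-a line. As a minimal sketch for lining things up, the snippet below sorts merged syslog lines by timestamp. It assumes the classic "Mon DD HH:MM:SS host message" prefix, an English locale for month names, and a caller-supplied year, since the lines do not carry one. The function names parse_line and build_timeline are made up for illustration.]

```python
import re
from datetime import datetime

# Assumption: classic syslog prefix "Mon DD HH:MM:SS host message".
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse_line(line, year=2012):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, message = m.groups()
    # The year is not present in the log line, so it is passed in by the caller.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def build_timeline(lines, year=2012):
    """Parse all lines, drop the ones that do not match, and sort by time."""
    events = [e for e in (parse_line(l, year) for l in lines) if e is not None]
    return sorted(events, key=lambda e: e[0])


# Three lines copied from the excerpt, deliberately out of order.
sample = [
    "Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.",
    "Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled",
    "Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)",
]

for when, host, message in build_timeline(sample):
    print(when.strftime("%H:%M:%S"), host, message[:60])
```

[This only orders events by each host's own clock; if the hosts are not time-synchronized, the merged order can still be misleading.]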

==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages?
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?

--sk

Stuart Kendrick
FHCRC

Sheldon Smith · 01-19-2012

The removal/replacement should be transparent to all hosts. What does HP Support say?

Note: While I am an HPE Employee, all of my comments (whether noted or not), are my own and are not any official representation of the company

Torsten. · 01-19-2012

Let's assume you have a failed disk inside a magazine with three other working disks. During replacement, the good disks can either be suspended or have all their data moved away. In both cases it is safe to pull out all four disks once the array is ready for that.

Neither action is "visible" from outside.

You should discuss this with HP support.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.
__________________________________________________
No support by private messages. Please ask the forum!
