3PAR StoreServ Storage

What happens during disk replacement?

skendric
Occasional Visitor

We have a T800 running v2.3.1, loaded with ~500 SATA drives (1-2 TB each). Four clients: a clustered NetApp v3140 (two heads), a Solaris x86 box, and a Win2008 box.

Does anyone else experience hiccups when a tech replaces a disk?

We lose disks of course: ~1-3 / month. On three occasions over the last two years, while the tech was replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (the same head each time: Tungsten-A) panicked and handed its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly -- some services require manual intervention before they start on Tungsten-B ... ouch.]

Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.

A few tidbits from syslog (I'm leaving most lines out of course):

[I think this is where the tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}

[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal

[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql-box.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0

[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.


[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled

[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.

[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.


[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled


==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages?
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?
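
The excerpts above interleave three sources (Cobalt, Tungsten, and the Tungsten clients), and not every line is in time order. One way to line them up is to merge everything into a single timeline sorted by timestamp. What follows is a minimal sketch, not a tested tool: the function names `parse_line` and `merged_timeline` are made up for illustration, the `year` argument is an assumption (classic syslog lines carry no year), and `%b` assumes an English locale.

```python
import re
from datetime import datetime

# Classic syslog prefix as seen in the excerpts: "Mon DD HH:MM:SS host message"
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse_line(line, year=2000):
    """Return (timestamp, host, message), or None for non-syslog lines.

    `year` is a placeholder assumption: the log lines do not include one.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, message = m.groups()
    ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return ts, host, message


def merged_timeline(lines, year=2000):
    """Parse lines from any mix of hosts and return them sorted by time."""
    events = [e for e in (parse_line(l, year) for l in lines) if e is not None]
    return sorted(events, key=lambda e: e[0])


if __name__ == "__main__":
    sample = [
        "Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.",
        "Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled",
        "Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal",
        "Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...",
    ]
    previous = None
    for ts, host, message in merged_timeline(sample):
        # Show the gap since the previous event to make delays stand out.
        gap = "" if previous is None else f"(+{int((ts - previous).total_seconds())}s)"
        print(ts.strftime("%H:%M:%S"), host, gap, message[:60])
        previous = ts
```

Printing the gap between consecutive events makes delays easy to spot, such as the time between the magazine returning to Normal and the first client complaint. This only works if the clocks on all hosts are synchronized, which is an assumption here.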

--sk

Stuart Kendrick
FHCRC

2 REPLIES
Sheldon Smith
Honored Contributor

Re: What happens during disk replacement?

The removal/replacement should be transparent to all hosts. What does HP Support say?


Note: While I work for Hewlett Packard Enterprise, all of my comments (whether noted or not), are my own and are not any official representation of the company.
Torsten.
Acclaimed Contributor

Re: What happens during disk replacement?

Let's assume you have a failed disk inside a magazine with three other working disks. During replacement, the good disks can either be suspended or have all their data moved away. In both cases it is safe to pull out all four disks once the array is ready for that.

Neither action is "visible" from the outside.


You should discuss this with HP support.

Hope this helps!
Regards
Torsten.
