01-18-2012 09:20 AM
What happens during disk replacement?
We have a T800 running v2.3.1, loaded with ~500 SATA drives of 1-2 TB each. It serves four clients: a clustered NetApp v3140 (two heads), a Solaris x86 box, and a Win2008 box.
Does anyone else experience hiccups when a tech replaces a disk?
We lose disks, of course: roughly 1-3 per month. On three occasions over the last two years, while the tech was replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (the same head each time: Tungsten-A) panicked and handed its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly: some services require manual intervention before they start on Tungsten-B ... ouch.]
Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.
A few tidbits from syslog (I'm leaving most lines out of course):
[I think this is where the tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}
[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal
[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql-box.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0
[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.
[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.
[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.
[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled
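[A rough sketch of how the excerpts above could be merged into one timeline, since lines from different hosts end up out of order when pasted by section. It assumes the classic syslog prefix seen above (month, day, time, host), an English locale for month names, and the year 2012, which the lines themselves don't carry. The function names and the Normal/Degraded/Failed pattern are my own choices for illustration, not anything provided by the T800.]

```python
import re
from datetime import datetime

# Classic syslog prefix as seen in the excerpt: "Jan 10 12:13:41 host message..."
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

# Pulls "Physical Disk 231 Failed" style fragments out of comp_state_change lines.
# The three state names are the ones that appear in the excerpt; others may exist.
PD_RE = re.compile(r"Physical Disk (?P<pd>\d+) (?P<state>Normal|Degraded|Failed)\b")


def parse_line(line, year=2012):
    """Return (timestamp, host, message), or None if the line has no syslog prefix."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Syslog omits the year, so it is supplied by the caller (assumption: 2012).
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["msg"]


def merged_timeline(lines, year=2012):
    """Parse lines from any mix of hosts and return them sorted by timestamp."""
    events = [e for e in (parse_line(l, year) for l in lines) if e is not None]
    return sorted(events, key=lambda e: e[0])


def pd_transitions(events):
    """Map each physical disk number to its list of (timestamp, state) changes."""
    history = {}
    for ts, _host, msg in events:
        m = PD_RE.search(msg)
        if m:
            history.setdefault(int(m["pd"]), []).append((ts, m["state"]))
    return history


# Four lines copied verbatim from the excerpt above, in the order they were pasted.
SAMPLE = [
    "Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)",
    "Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)",
    "Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.",
    "Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled",
]

if __name__ == "__main__":
    timeline = merged_timeline(SAMPLE)
    for ts, host, msg in timeline:
        print(ts.strftime("%H:%M:%S"), host, msg[:60])
    for pd, changes in pd_transitions(timeline).items():
        print("pd", pd, [state for _ts, state in changes])
```

[Feeding it the full syslog from Cobalt, both Tungsten heads, and the Windows clients would show how many minutes separate the magazine reinsertion from the first client complaint, and which disks changed state in between.]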
==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages?
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?
--sk
Stuart Kendrick
FHCRC
01-19-2012 06:08 AM
Re: What happens during disk replacement?
The removal/replacement should be transparent to all hosts. What does HP Support say?
Note: While I am an HPE Employee, all of my comments (whether noted or not), are my own and are not any official representation of the company
01-19-2012 06:21 AM
Re: What happens during disk replacement?
Neither action (removal or replacement) should be "visible" from outside.
You should discuss this with HP support.
Hope this helps!
Regards
Torsten.
__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.
__________________________________________________
No support by private messages. Please ask the forum!
If you feel this was helpful please click the KUDOS! thumb below!