MSA Storage

MSA 2040 SAN- Completely Lost Connection

 
Clipnews_ITD
New Member


Dear HPE Community, 

Yesterday, 29th of April, our SAN disappeared. At first it seemed that controller B had lost connection, but as far as I can tell controller A was not active either, since we completely lost data connectivity.

I'm providing log excerpts from that time; please have a look, maybe you can see something we can't.

From Logs Folder A controller:

INF) 2020-04-29 09:25:12.421 (29) Sync: {A} Received Partner State: ActiveActive/Synchronized
INF) 2020-04-29 09:25:19.182 (27) MEM: h/9289728 u/8505136, f/784592, e/2396160, d/8, t/15992, td/0
ERR) 2020-04-29 09:25:39.014 (10) LocalPipe: SC is down
INF) 2020-04-29 09:25:39.016 (10) SC state: Me Down/NotAvailable Other None/Unknown 2c0 2c0 61
INF) 2020-04-29 09:25:39.017 (10) bool services::capi::ControllerContext::setControllerStates(services::capi::ControllerContext::OperationalState, services::capi::ControllerContext::AvailabilityState, bool)[492] Me SC State changed: OperationalState now [Down]
INF) 2020-04-29 09:25:39.018 (10) data state transition: [Stale] to [ScDead]
INF) 2020-04-29 09:25:39.046 (25) Sync: {A} sending object to partner: runinfo
WAR) 2020-04-29 09:25:39.051 (10) Sync: {A} sc state changed, running op state machine
INF) 2020-04-29 09:25:39.053 (10) Sync: {A} Me Operational State: from [ActiveActive] to [Inactive]
INF) 2020-04-29 09:25:39.055 (9) EventNotifier: {A} Me Operational State: from [Active] to [Inactive]
INF) 2020-04-29 09:25:39.064 (10) SCHED {} Scheduler opStateMachine Sync: Inactive/ActiveActive
INF) 2020-04-29 09:25:39.066 (10) SCHED {} Operational State: from [ActiveActive] to [Inactive]
INF) 2020-04-29 09:25:39.067 (10) SCHED {} NOT RUNNING
INF) 2020-04-29 09:25:39.126 (11) HEALTH: add unhealthy component controller/controller_a
INF) 2020-04-29 09:25:39.136 (11) iomHealth: unexpected controllerStatus=5215=ControllerStatusUndefined
ERR) 2020-04-29 09:25:39.203 (11) LocalPipe: SC is down
WAR) 2020-04-29 09:25:39.204 (11) Down mode: not issuing CAPI_COMMAND_U_GET_CONTROLLER_DATA
ERR) 2020-04-29 09:25:39.206 (11) CapiRequestGetController failed. [3047:CAPI_ERROR_CONTROLLER_SHUTDOWN]
INF) 2020-04-29 09:25:39.554 (12) CHC: SC state: from [RUNNING] to [LOST_COMM]
WAR) 2020-04-29 09:25:39.555 (12) virtual int services::capi::CapiHighCommand::poll()[1453] GOCAPI_UPDATE_ALL run, result: 10089
ERR) 2020-04-29 09:25:39.556 (12) Failed Error Handling Update all
WAR) 2020-04-29 09:25:39.564 (12) void services::capi::CapiHighCommand::run()[459] Poll failed
ERR) 2020-04-29 09:25:39.566 (12) LocalPipe: SC is down
WAR) 2020-04-29 09:25:39.567 (12) Down mode: not issuing CAPI_COMMAND_U_GET_CONFIG_SEQ_NUMBER
INF) 2020-04-29 09:25:39.568 (12) CHC: SC state: waiting for SC to start talking again ...
ERR) 2020-04-29 09:25:44.016 (11) LocalPipe: SC is down
WAR) 2020-04-29 09:25:44.017 (11) Down mode: not issuing CAPI_COMMAND_U_GET_CONTROLLER_DATA
ERR) 2020-04-29 09:25:44.018 (11) CapiRequestGetController failed. [3047:CAPI_ERROR_CONTROLLER_SHUTDOWN]
ERR) 2020-04-29 09:25:47.015 (11) virtual services::capi::Pipe::ResponseCode services::capi::LocalPipe::Process(services::capi::CapiRequest&)[309] [0] packet send error
ERR) 2020-04-29 09:25:49.024 (10) LocalPipe: SC is down
ERR) 2020-04-29 09:25:49.027 (11) LocalPipe: SC is down
WAR) 2020-04-29 09:25:49.028 (11) Down mode: not issuing CAPI_COMMAND_U_GET_CONTROLLER_DATA
ERR) 2020-04-29 09:25:49.029 (11) CapiRequestGetController failed. [3047:CAPI_ERROR_CONTROLLER_SHUTDOWN]
ERR) 2020-04-29 09:25:49.575 (12) LocalPipe: SC is down

 

Also from Controller A:

appsv : Up ( 2084), suspend: false, last transition: 2589777, code 18896 kb data: 15992 kb shared 1752 kb
MEM: F/231300 kb (59%), C/77128 kb (20%), B/16236 kb (4%), A/36308 kb (9%)
INF) 2020-04-29 09:23:01.125 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:25:01.125 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:25:47.558 (8) MC Inner Path is now DOWN vvvvv
INF) 2020-04-29 09:26:39.028 (2) Total in: /cfg/log, 3944174
INF) 2020-04-29 09:26:39.050 (2) FileLogger::cleanupLogs(): Check Limit: total: 3944174, limit: 4000000, path: /cfg/log/appsv_000163_000099.log, target: /mem/log/appsv_000163_000099.log.gz
INF) 2020-04-29 09:26:39.051 (2) FileLogger::cleanupLogs(): done
INF) 2020-04-29 09:27:01.125 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:27:56.155 (7)
appsv : Up ( 2084), suspend: false, last transition: 2590077, code 18904 kb data: 16112 kb shared 1752 kb
MEM: F/231176 kb (59%), C/77152 kb (20%), B/16236 kb (4%), A/36420 kb (9%)
INF) 2020-04-29 09:29:01.126 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:31:01.126 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:31:39.029 (2) Total in: /cfg/log, 3974916
INF) 2020-04-29 09:31:39.031 (2) FileLogger::cleanupLogs(): Check Limit: total: 3974916, limit: 4000000, path: /cfg/log/appsv_000163_000099.log, target: /mem/log/appsv_000163_000099.log.gz
INF) 2020-04-29 09:31:39.032 (2) FileLogger::cleanupLogs(): done
INF) 2020-04-29 09:32:56.161 (7)
appsv : Up ( 2084), suspend: false, last transition: 2590377, code 18904 kb data: 16152 kb shared 1756 kb
MEM: F/231176 kb (59%), C/77176 kb (20%), B/16236 kb (4%), A/36420 kb (9%)
INF) 2020-04-29 09:33:01.132 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:35:01.195 (7) MEM: h/610304 u/470728, f/139576, e/0, d/0, t/2516, td/0
INF) 2020-04-29 09:36:39.019 (2) Total in: /cfg/log, 4008500
INF) 2020-04-29 09:36:39.020 (2) FileLogger::cleanupLogs(): Check Limit: total: 4008500, limit: 4000000, path: /cfg/log/appsv_000163_000099.log, target: /mem/log/appsv_000163_000099.log.gz
INF) 2020-04-29 09:36:39.022 (2) migrateFile log file source /cfg/log/appsv_000163_000099.log, dest /mem/log/appsv_000163_000099.log.gz

 

---------------------------------------------------------------------------------------------------------------------------------

 

 

From Logs Folder B controller:

INF) 2020-04-29 09:25:32.617 (71) Sync: {B} Received Partner State: ActiveActive/Synchronized
INF) 2020-04-29 09:25:38.656 (71) RunInfo: update MCA info
INF) 2020-04-29 09:25:38.725 (71) HEALTH: add unhealthy component controller/controller_a
INF) 2020-04-29 09:25:38.767 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:25:38.768 (71) Sync: {B} Partner Operational State: from [ActiveActive] to [Inactive]
INF) 2020-04-29 09:25:38.771 (71) SCHED {} Scheduler opStateMachine Sync: ActiveActive/Inactive
INF) 2020-04-29 09:25:38.772 (71) SCHED {} Operational State: from [ActiveActive] to [Inactive]
INF) 2020-04-29 09:25:38.773 (71) SCHED {} NOT RUNNING
INF) 2020-04-29 09:25:40.829 (26)
httpd-progress: Down ( 0), suspend: true , last transition: 43948139, code 0 kb data: 0 kb shared 0 kb
logc : Up ( 2062), suspend: false, last transition: 43948268, code 12244 kb data: 2672 kb shared 4 kb
pfud : Up ( 1881), suspend: false, last transition: 43948268, code 476 kb data: 568 kb shared 0 kb
pure-ftpd : Up ( 2190), suspend: false, last transition: 43948268, code 428 kb data: 112 kb shared 0 kb
rpcd : Up ( 1871), suspend: false, last transition: 43948268, code 428 kb data: 76 kb shared 0 kb
sfcbd : Up ( 0), suspend: false, last transition: 43948239, code 32824 kb data: 17580 kb shared 176 kb
( 2323), code: 1144 kb data: 444 kb shared: 0 kb
( 2328), code: 728 kb data: 204 kb shared: 0 kb
( 2330), code: 1472 kb data: 532 kb shared: 0 kb
( 2331), code: 1084 kb data: 844 kb shared: 0 kb
( 2333), code: 1016 kb data: 412 kb shared: 0 kb
( 2334), code: 1492 kb data: 10304 kb shared: 0 kb
( 2339), code: 1076 kb data: 480 kb shared: 0 kb
( 2345), code: 22752 kb data: 3440 kb shared: 176 kb
( 2428), code: 1024 kb data: 456 kb shared: 0 kb
( 2440), code: 1036 kb data: 464 kb shared: 0 kb
slpd : Up ( 2249), suspend: false, last transition: 43948259, code 640 kb data: 156 kb shared 0 kb
snmpd : Up ( 2204), suspend: false, last transition: 43948259, code 13536 kb data: 5196 kb shared 4 kb
sshd : Up ( 2384), suspend: false, last transition: 43948229, code 860 kb data: 304 kb shared 0 kb
telnetd : Up ( 2214), suspend: false, last transition: 43948259, code 728 kb data: 152 kb shared 0 kb
web : Up ( 2173), suspend: false, last transition: 43948268, code 13416 kb data: 7544 kb shared 504 kb
MEM: F/190916 kb (49%), C/80500 kb (20%), B/16572 kb (4%), A/52832 kb (13%)
ERR) 2020-04-29 09:25:50.355 (26) MEM: h/9134080 u/8346088, f/787992, e/2396160, d/360, t/17992, td/8
INF) 2020-04-29 09:26:18.617 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:26:58.617 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:27:38.619 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:27:50.355 (26) MEM: h/9134080 u/8346088, f/787992, e/2396160, d/0, t/17992, td/0
INF) 2020-04-29 09:28:18.618 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:28:58.618 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:29:38.619 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:29:50.355 (26) MEM: h/9134080 u/8346104, f/787976, e/2396160, d/16, t/17992, td/0
INF) 2020-04-29 09:30:18.619 (71) Sync: {B} Received Partner State: Inactive/Synchronized
INF) 2020-04-29 09:30:50.826 (26)

 

Also from the Main Log File:

04/29 09:25:46.000000 -HB HeartBeat: OldState = 2, NewState = 0
04/29 09:25:46.000042 HB Comm Lost reason: 3.
04/29 09:25:46.000084 Killing other (reason=3)
04/29 09:25:46.000124 FOM: Other State = DOWN
04/29 09:25:46.000164 FCR: Kill for reason (3) with data (0)
04/29 09:25:46.000283 EVENT W KILLED OTHER No Heartbeat
04/29 09:25:46.000325 Killing Other
04/29 09:25:46.000363 Stop sending HB
04/29 09:25:46.000437 Asserting Kill: 04/29 09:25:46
04/29 09:25:46.004502 FXC: ScheduleReset: 2, 43948373
04/29 09:25:46.004541 FOM: SetOtherReady old=1 new=0
04/29 09:25:46.004579 HAL: Changing Mirroring Broadcast State to : F
04/29 09:25:46.004618 HAL: Changing MetaData Broadcast State to : F
04/29 09:25:46.004657 Disable MetaData broadcast on engine 1
04/29 09:25:46.004697 HAL: Changing RDE Broadcast State for engine 1 to : F
04/29 09:25:46.004736 Disable broadcast RDE broadcast on engine 1
04/29 09:25:46.004781 ECHO: 1 0 5 0Read/Verify from HAL Meta disabled
04/29 09:25:46.004860 ECHO: E1 0 E5 0 (ECHO OFF)
04/29 09:25:46.004904 RDE 1 00 0200000000 0004320000 Mem Buff 0 20 ffffff0000 0 0 0
04/29 09:25:46.005056 Disable broadcast for 'A' and 'B' buses
04/29 09:25:46.005098 Read/Verify from HAL_EnableBroadcast
04/29 09:25:46.005139 ECHO: E1 0 E5 0 (ECHO OFF)
04/29 09:25:46.005182 RDE 1 00 0200000000 0004320000 Mem Buff 0 20 ffffff0000 0 0 0
04/29 09:25:46.005411 EVENT AWT PARTNER DOWN
04/29 09:25:46.005451 SP_PTCC: p0 partnerActive=0 localActive=0
04/29 09:25:46.005490 SP_PTCC: p1 partnerActive=0 localActive=0
04/29 09:25:46.005529 Scheduling Failover failover:1.
04/29 09:25:46.006033 AA Start task:failover.
04/29 09:25:46.006072 AA FAILOVER::Start 1
04/29 09:25:46.006111 NOCP::UnQuiesce userId 0 userMask 0h
04/29 09:25:46.006156 FDE_ADMIN-1949: UpdateFdeState(), start
04/29 09:25:46.006248 FOFB Start 43948371
04/29 09:25:46.053978 FDE_ADMIN-2110: UpdateArrayState(), start
04/29 09:25:46.054081 Partner UID present - set failover on boot false
04/29 09:25:46.054121 Failover - own foset B, not doing anything
04/29 09:25:46.054159 Failover - foset A
04/29 09:25:46.054281 EVENT FAILOVER Initiated Failover Set A
04/29 09:25:46.054329 AA Failover Phase 0 subphase 00
04/29 09:25:46.054406 AA Failover Phase 0 subphase 01
04/29 09:25:46.054476 AA Failover Phase 0 subphase 02
04/29 09:25:46.054546 AA Failover Phase 0 subphase 03
04/29 09:25:46.054616 AA Failover Phase 0 subphase 04
04/29 09:25:46.054686 AA Failover Phase 1 subphase 00
04/29 09:25:46.054752 FOSET A disk, Failover Phase1, Real Failover
04/29 09:25:46.054795 AA Failover Phase 1 subphase 01
04/29 09:25:46.054865 AA Failover Phase 1 subphase 02
04/29 09:25:46.054934 AA Failover Phase 1 subphase 03
04/29 09:25:46.055000 DMSSNAP::FailoverP1 SetA hostID=ffh real=Real Failover
04/29 09:25:46.055043 AA Failover Phase 1 subphase 04
04/29 09:25:46.055109 FOSET A idset, Failover Phase1, hostId=0xff, Real Failover
04/29 09:25:46.055152 AA Failover Phase 2 subphase 00
04/29 09:25:46.055217 FOSET A disk, Failover Phase2, Real Failover
04/29 09:25:46.055262 NVD storeNextUIDInCache() updateDone: 0 at 2
04/29 09:25:46.055304 DFE::FOP2 - become CURRENT owner of
04/29 09:25:46.055343 DS 534925cb
04/29 09:25:46.055385 DS_ForceOwnership, DS 534925cb to 00c0ff1b70030000, gen 25
04/29 09:25:46.055456 FM: Starting Failover Monitor
04/29 09:25:46.275010 DS_DiskSetUpdate - 13 drives updated, 0 1 2 3 4 5 6 7 8 9 10 22 23
04/29 09:25:46.275061 AA Msg CancelSendMessages.
04/29 09:25:46.275106 Abort Msg RCB 709c6381 FCB 70a63424 coalesced 0
04/29 09:25:46.275146 Not Canceling This Msg = 709c6381 it is on the IcUDMABusyQueue
04/29 09:25:46.275187 Abort Msg RCB 709c6ac8 FCB 70a62a84 coalesced 0
04/29 09:25:46.275229 MsgIob failed type=1 msgType=29 seq=24921 error=3
04/29 09:25:46.275272 CACHE: Mirror Msg Err = 3 CIO(53069227)-LBA = 127db6f8 v=197 len=b SGL LBA=127db6f8 len=b
04/29 09:25:46.275314 MsgIob failed type=1 msgType=29 seq=24923 error=3
04/29 09:25:46.275356 CACHE: Mirror Msg Err = 3 CIO(52fe4877)-LBA = 127db703 v=197 len=2 SGL LBA=127db703 len=2
04/29 09:25:46.275398 MsgIob failed type=1 msgType=29 seq=24925 error=3

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As far as I can see, and if I'm not wrong, neither controller could serve I/O, and that's why the server connections were completely lost.

In this situation, after power-cycling the MSA 2040, about 5 minutes later I managed to connect to controller A (web interface) and restart controller B.

I'm happy to provide anything someone may need if you would like to help, both to avoid a similar outage in the future and, of course, to share the knowledge with the community!

 

Furthermore, I would be more than happy if you could provide a link that describes how to properly read the store.log file, or guide me to the specific files in the log folder structure.

Thank you very much in advance!

Dimitris.


Re: MSA 2040 SAN- Completely Lost Connection

A couple of questions:

1> Did you try to ping both controller management IPs at the time of the issue?

2> Are you sure that both controllers were physically down, i.e. no LEDs at all on the controllers?

3> Are your volumes presented to the hosts through all host ports of both controllers?

4> Did you verify that the hosts were up at that time and had not rebooted?

You can also check whether multipathing is properly configured at the host end and whether the Round-Robin policy is set.
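If it helps, here are example host-side commands for those checks. The IP addresses are placeholders, and the output notes are illustrative; adjust for your own OS and multipath stack.

```shell
# Verify the controller management IPs respond (substitute your own IPs):
ping -c 4 192.168.0.10   # controller A management IP (example address)
ping -c 4 192.168.0.11   # controller B management IP (example address)

# On a Linux host using device-mapper-multipath, list the paths per LUN.
# A healthy dual-controller setup shows paths through both controllers:
multipath -ll

# On VMware ESXi, the equivalent check is:
esxcli storage nmp device list
# Look for "Path Selection Policy: VMW_PSP_RR" (Round Robin) per device.
```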

Also, if you wish to check other components on your array you can perform an MSA Health Check here:  

www.hpe.com/storage/MSAHealthCheck

Download your MSA Log File from your MSA array
Upload the MSA Log File into the MSA Health Check website
Review Results by clicking through the tabs and saving the PDF report
Links to array, enclosure, and drive firmware will be provided

Without checking the entire log, it will be difficult to tell anything more.

 

Hope this helps!
Regards
Subhajit

I am an HPE employee

Shawn_K
HPE Pro

Re: MSA 2040 SAN- Completely Lost Connection

Hello Dimitris,

From the log information you have provided, you had two issues.

Controller A went down due to the Storage Controller (SC) suffering a crash. There are no crash logs provided, so I cannot determine why the SC crashed.

ERR) 2020-04-29 09:25:39.014 (10) LocalPipe: SC is down

INF) 2020-04-29 09:25:39.126 (11) HEALTH: add unhealthy component controller/controller_a
INF) 2020-04-29 09:25:39.136 (11) iomHealth: unexpected controllerStatus=5215=ControllerStatusUndefined
ERR) 2020-04-29 09:25:39.203 (11) LocalPipe: SC is down

Controller B lost the heartbeat communication with Controller A (although Controller A was still up at the time), and due to the lost PCIe link Controller B went down to allow Controller A to remain active. Unfortunately, Controller A was itself going down because of the SC crash.

A couple of things you can do:

Check and make sure there is no unwritten cache on either controller. The cache may have already been flushed, but while Controller B was going down there was some cache still sitting on that controller.
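As a sketch, the unwritten cache state can be checked from the MSA CLI (SSH to a controller management IP); the command below is documented in the MSA 2040 CLI reference, and the output note is illustrative:

```shell
# Check for cache data that has not yet been written to disk:
show unwritable-cache
# 0% unwritable cache on both controllers means no data is stranded.
```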

There have been many fixes in newer firmware, some relating to SC stability and PCIe link improvement. I suggest you upgrade your system to the latest firmware versions. Also, confirm the HBAs in your servers are running the latest firmware and drivers.
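To compare against the latest release before upgrading, the installed bundle versions can be listed from the MSA CLI:

```shell
# Show firmware versions currently installed on both controllers:
show versions
# For per-component detail (SC, MC, expander firmware), use:
show versions detail
```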

As Subhajit mentioned - check the other components on your array by performing an MSA Health Check here:  

www.hpe.com/storage/MSAHealthCheck

Download your MSA Log File from your MSA array
Upload the MSA Log File into the MSA Health Check website
Review Results by clicking through the tabs and saving the PDF report
Links to array, enclosure, and drive firmware will be provided

Cheers,
Shawn

I work for Hewlett Packard Enterprise. The comments in this post are my own and do not represent an official reply from HPE. No warranty or guarantees of any kind are expressed in my reply.

