StoreVirtual Storage

Failover manager degrades twice a day.

 
PaulId
Occasional Contributor

Failover manager degrades twice a day.

Hi

 

We started receiving emails from our CMC twice a day.

 

E00050100 EID_FOM_SERVER_STATUS_G_DOWN

E00050101 EID_FOM_SERVER_STATUS_G_DEGRADED

E00050100 EID_FOM_SERVER_STATUS_G_DOWN

E00050102 EID_FOM_SERVER_STATUS_G_UP

 

The VM logs show no downtime and no networking issues.

 

We also receive a number of Windows event log errors, from MPIO and iSCSI.

 

HP P4000 DSM for MPIO failed to return a Path to \Device\MPIODisk3.

 

The two P4500 G2 nodes carry on working.

 

Here are some logs...

 

Feb  5 22:01:02 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:vote seqno=30370, latency=59.939

Feb  5 22:01:02 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=volume_quota_hard_increase2(CL01-CSV01(110), hard=2852536, ltime=93), reply=ok, message='ok'

Feb  5 22:01:02 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=volume_quota_hard_increase2(CL01-CSV01(110), hard=2852552, ltime=93), reply=ok, message='ok'

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:SUSPENDING TPC:IO degraded

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:SUSPENDED TPC

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=T

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=69 nserver=1

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_down;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2']

Feb  5 22:01:44 FOM1 dbd_manager[1749]: DBD_MANAGER_STATS:cluster:SAS-CL01             iscsi:active=000 rate=20516511266163/s lat=000ms len=000k bw=20035455224k 00%w nstore=2

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER_STATS:cluster:SAS-CL01              phys:active=000 rate=0092/s lat=000ms recent_maxlat=001ms

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:vote seqno=30372, latency=50.964

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:vote dropped(seqno=30372;ltime=68;id=0) op->ltime_recv=68 blocked_ltime=67 tpc=f suspending=T

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:state seqno=30371, latency=16.167

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:resume TPC in 0.000 secs

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:RESUMING TPC

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:system_name='MG01' system_id=16cd2cc7db2310b25b3429d22feeade3 seqno=30371 quorum=0

Feb  5 22:01:53 FOM1 dbd_manager[1749]: TPC:creating:proto=:BOTTOM:MNAK:PT2PT:FRAG:LOCAL:TOP_APPL:STABLE:VSYNC:SYNC:ELECT:INTRA:INTER:LEAVE:SUSPECT:PRESENT:HEAL:TOP:

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=f

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=69 nserver=0

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:tpc blocked, blocked_ltime=69

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:state transfer completed seqno=30377

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234,server=T

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[1]={-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234,server=T

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[2]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=f

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=T coordinator=f ltime=70 nserver=2

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_up;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2, IPADDRESS1']

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up_and_coordinating;name='SAN1(1)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up;name='SAN2(92)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=fom_global_status_degraded;name='IPADDRESS1(153)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:state_set_done

Feb  5 22:01:54 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=nop, reply=ok, message='ok'

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:SUSPENDING TPC:not degraded

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:SUSPENDED TPC

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=f

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=71 nserver=0

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_down;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2']

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:resume TPC in 0.000 secs

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:RESUMING TPC

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:system_name='MG01' system_id=16cd2cc7db2310b25b3429d22feeade3 seqno=30378 quorum=0

Feb  5 22:02:01 FOM1 dbd_manager[1749]: TPC:creating:proto=:BOTTOM:MNAK:PT2PT:FRAG:LOCAL:TOP_APPL:STABLE:VSYNC:SYNC:ELECT:INTRA:INTER:LEAVE:SUSPECT:PRESENT:HEAL:TOP:

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=T

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=71 nserver=1

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:tpc blocked, blocked_ltime=71

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234,server=T

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[1]={-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234,server=T

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[2]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=T

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=T coordinator=f ltime=72 nserver=3

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_up;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2, IPADDRESS1']

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up_and_coordinating;name='SAN1(1)';user='';attr=[]

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up;name='SAN2(92)';user='';attr=[]

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=fom_global_status_up;name='IPADDRESS1(153)';user='';attr=[]

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=nop, reply=ok, message='ok'

Feb  5 22:02:05 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=initiator_session_set4(volume=CL01-CSV01(110), auth=HOST2(40), gateway=IPADDRESS(166), target ip={inet=IPADDRESS;port=13849} initiator iqn=iqn.1991-05.com.microsoft:, initiator ip={inet=IPADDRESS12;port=57440}, status=connected, isid=70368764559362 tpgt=1 last_modified=9), reply=ok, message='ok'

 

 

One difference I have seen is that the FOM is on a different version to the SANs: 9.5.00.1234 on the SANs and 9.5.00.1215 on the FOM.

 

Could this be an issue?

 

Thanks

 

Paul

 

7 REPLIES
oikjn
Honored Contributor

Re: Failover manager degrades twice a day.

What MPIO errors do you get on your Windows servers? The FOM doesn't have any connections to the Windows machines, so that should guide you towards where your problem may be. If the problem is happening at the same time every day, or on a pattern, it sounds like there is something external going on.
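If you need to pull them together quickly, something like this against the System log should do it (I'm assuming the inbox drivers' usual provider names, 'mpio' and 'iScsiPrt'; adjust if yours log under something else):

    # Recent MPIO and iSCSI driver events from the Windows System log
    Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='mpio','iScsiPrt'} -MaxEvents 50 |
        Format-Table TimeCreated, Id, ProviderName, Message -AutoSize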

 

All that said, I would definitely update the FOM to the current version, or just remove it and recreate it. If something got corrupted in the FOM, it isn't worth the time to figure out what when you can recreate a FOM in under 30 minutes.

PaulId
Occasional Contributor

Re: Failover manager degrades twice a day.

Hi, my MPIO errors are:

 

HP P4000 DSM for MPIO failed to return a Path to \Device\MPIODisk3.

 

and my iSCSI errors are:

 

Target did not respond in time for a SCSI request.
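Next time it fires I can check how the DSM sees its paths with mpclaim (built into Windows; disk 3 below just mirrors the event text):

    # List all MPIO disks and their load-balance policies
    mpclaim -s -d
    # Show the individual paths and their state for MPIO Disk 3
    mpclaim -s -d 3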

 

We are planning on upgrading the SAN to the latest version in the coming weeks.

 

I was mainly wondering if it was a version mismatch.

 

My boss tried to upgrade the FOM, but it failed because the FOM had a gateway address configured, which I believe is no longer an issue in the latest version.

 

Hopefully the upgrade will solve a few issues.

 

Thanks

 

Paul

 

 

oikjn
Honored Contributor

Re: Failover manager degrades twice a day.

Hmm... that sounds like an MPIO system fault, and not just the loss of a single MPIO path, which is an odd issue that might be down to storage system latency. Are you noticing anything funky on the nodes when these errors occur? Does it happen when you initiate backups or something like that?

 

 

The more I think about the FOM, the stranger it is: I'm not sure how you managed to get the FOM off-version from the management group, unless you upgraded the management group, installed the FOM afterwards, and never updated it. Usually you cannot upgrade parts of a management group selectively.

 

You might want to just select the "stay on current version" option in the CMC upgrade and then run the update again to get the FOM to the same level as the rest of the group... I would also just remove that FOM and install a new one, because it takes so little time and eliminates the possibility of anything corrupt on that FOM causing an issue.

RossK
Occasional Visitor

Re: Failover manager degrades twice a day.

Hi Paul,

 

As much as I hate to post a 'me too', I'm seeing something similar on a software VSA under VMware. We're running the latest version but one (our environment is sufficiently sandboxed not to have to worry about the security update at this stage).

 

On occasion, I've also been seeing this error associated with some kind of underlying storage subsystem failure ("Disk off or removed"). Meanwhile, the physical disk subsystem is an HP Smart Array P420i with 1 GB of flash-backed cache and RAID 6 sets under it...

 

I've been down the route of modifying the queue depth to 4 and MaxIO to 1024 for the physical controller under VMware, and setting the queue depth to 8 for the iSCSI initiator.
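For anyone retracing this, the host-side settings map roughly to the esxcli calls below; I'm assuming 'MaxIO' means the Disk.DiskMaxIOSize advanced option, and the physical controller's queue depth parameter is driver-specific so I've left it out:

    # Software iSCSI initiator LUN queue depth (takes effect after a reboot)
    esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=8
    # Cap the largest single I/O the host will issue, in KB
    esxcli system settings advanced set -o /Disk/DiskMaxIOSize -i 1024
    # Verify both afterwards
    esxcli system module parameters list -m iscsi_vmk
    esxcli system settings advanced list -o /Disk/DiskMaxIOSize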

 

This appears to be something to do with the physical disk and/or controller, at least some of the time...

 

Have you arrived at a solution for your deployment?

 

 

Ross

Emilo
Trusted Contributor

Re: Failover manager degrades twice a day.

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member[0]={-5905646969214021532_FOM1_00:15:5D:BF:FF:00,a4eddc25dd53c2735835680842a2b1f9},version=9.5.00.1215,server=T

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234


 

OK, the above lines show that the FOM is losing connectivity with the other nodes; here the FOM thinks that SAN1 has gone offline.

 

 

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=69 nserver=1

Feb  5 22:01:37 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_down;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2']

 

OK, the above lines show that the FOM is losing connectivity with the other nodes; here the FOM thinks that SAN2 has gone offline.

 

If the other two nodes are staying up, then the FOM is having network issues. It would be interesting to look at the switch counters; I would bet you are seeing lots of dropped packets.

 

Roughly 16 seconds later (22:01:37 to 22:01:53), service is restored, as the FOM now sees the other two nodes 'joining' the management group.

 

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=T coordinator=f ltime=70 nserver=2

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_up;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2, IPADDRESS1']

 

Here it is restored; you can see the FOM now thinks the two nodes are rejoining the network.

 

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up_and_coordinating;name='SAN1(1)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=m_global_status_up;name='SAN2(92)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=fom_global_status_degraded;name='IPADDRESS1(153)';user='';attr=[]

Feb  5 22:01:53 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:state_set_done

Feb  5 22:01:54 FOM1 dbd_manager[1749]: DBD_MANAGER_GLOBAL:op=nop, reply=ok, message='ok'

 

Here it is 'flapping' again: seven seconds later the first member leaves. I am going to guess that the FOM is not the only manager online?

 

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member left {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:status enabled=f coordinator=f ltime=71 nserver=0

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_EVENT:POST:type=mg_connection_status_down;name='MG01(-MG01)';user='';attr=[mgrName='IPADDRESS1',mgrList='SAN1, SAN2']

 

It didn't stay down very long this round, as we can see it restoring here:

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8635866999098246674_SAN1_78:E3:B5:08:41:C0,c8df407b51465f00aa7f9e30a65bbe16},version=9.5.00.1234

Feb  5 22:02:01 FOM1 dbd_manager[1749]: DBD_MANAGER:TPC:member joined {-8858197615815142666_SAN2_78:E3:B5:07:FA:94,78fd907398a119d5a3e8e54811f63561},version=9.5.00.1234

 

I would look at the switch counters on the port where the FOM is connected. The FOM is not crashing; it is just losing network connectivity to the other nodes. It is able to record and report this information to the logs, so it is staying up.

 

Get the ifconfig output from the FOM for the same time period and see if it shows drops.
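A minimal check, assuming the FOM's interface is eth0 (substitute whatever ifconfig lists):

    # Show the NIC counters on the FOM; run it again a minute later and
    # compare - climbing 'errors' or 'dropped' values in the RX/TX lines
    # point at the network rather than the FOM itself
    ifconfig eth0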

MPIO should not even be configured on the FOM, so the MPIO issue you are seeing could be part of a wider networking issue, but the FOM does not participate in MPIO at all.

 

Your statement points to a network issue of some type with the FOM, as the other nodes stay up. I'll bet that if we looked at the logs for those two nodes we would see the FOM leaving while they carry on communicating with each other (not much of a reach).

 

You can change out the cable or move it to another port.

Before you move it, you might want to look at the switch counters and reset them.
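For example, on a Cisco IOS switch (hypothetical port number; other vendors have equivalents):

    ! look for input errors, CRC errors, and output drops on the FOM's port
    show interfaces GigabitEthernet0/1
    ! zero the counters so the next incident stands out clearly
    clear counters GigabitEthernet0/1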

 

The version of SAN/iQ should not be the issue.

 

Good luck

 

Dennis Handly
Acclaimed Contributor

Re: Failover manager degrades twice a day.

>As much as I hate to post a 'me too',

 

Have you clicked on the "Me too" button?  :-)

sergitin
New Member

Re: Failover manager degrades twice a day.

Hi,

I had the same message with code E00050101, and another with code E00000300, both at the same time.

I just checked the time sync between the SAN nodes; they were out of sync and needed updating.
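For anyone checking the same thing, assuming you can get a shell on the nodes and they run ntpd:

    # Compare the wall clock on each node
    date
    # the 'offset' column shows each node's drift from its NTP source in ms
    ntpq -p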

Just that.

txs