HPE SimpliVity

OVC state "faulty" after power outage

 
adidasnmotion
Advisor

We recently had a power outage, and I wasn't able to shut down the SimpliVity cluster before the UPS ran out of power. When power was restored and everything was powered back up, I ran the svt-federation-show command and one of the OVCs showed a state of "faulty". If I run the dsv-balance-show command I only see 2 of the 3 hosts; the one that's missing is the one that showed as "faulty". If I run dsv-balance-manual -q and choose the datacenter in question, I get this message:

node 192.168.40.41, state 3, is invalid

Fatal condition [code: 1], see /var/log/dsv-balance-manual.log for details

In that log it only shows the error mentioned above: "node 192.168.40.41, state 3, is invalid"

I powered down that OVC and ESXi host and powered it back on after 10 minutes but that did not help resolve the issue.  In vCenter I'm not seeing any strange events for that host and the status for all the hardware on the host shows a status of "normal".

Are there any diagnostic steps I could take to troubleshoot this further? Are there other log files I could look at? Are there any repair commands for the OVC?

Our support contract has expired, so that's why I'm hoping for some DIY help options here.

We're running SimpliVity 3.7.9 on VMware VCSA 6.5.

Thank you

 

adidasnmotion
Advisor

Re: OVC state "faulty" after power outage

New information that hopefully gets me closer to a solution. I noticed that the svt-federation-show command throws the following error:

Error: Thrift::SSLSocket: Could not connect to 192.168.40.41:9190 (Connection refused)

I then found this HPE support article:

https://support.hpe.com/hpesc/public/docDisplay?docId=sf000063383en_us&docLocale=en_US

Using the troubleshooting steps from there, I opened a shell on our vCenter server and ran this command against the OVC in question:

nc -zw3 192.168.40.15 9190 && echo "opened" || echo "closed"

I get a response back of "closed".  If I run that command to any other OVC I get a response back of "opened".
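
The same probe can be repeated across every OVC in one pass. The following is only a sketch: it wraps the nc command from above in a hypothetical helper named check_port, and the host list is whatever you pass on the command line.

```shell
#!/bin/sh
# Probe one TCP port on a host and print "opened" or "closed".
# Assumes nc accepts -z (connect only) and -w (timeout in seconds),
# as in the command quoted above. check_port is a hypothetical helper name.
check_port() {
    if nc -zw3 "$1" "$2" >/dev/null 2>&1; then
        echo "opened"
    else
        echo "closed"
    fi
}

# Check port 9190 on every host given as an argument.
for ovc in "$@"; do
    printf '%s:9190 %s\n' "$ovc" "$(check_port "$ovc" 9190)"
done
```

Saved under a name of your choosing (for example check_ovc_ports.sh), it could be run as: sh check_ovc_ports.sh 192.168.40.40 192.168.40.41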

If I use the command netstat -tpan on the faulty OVC I get the following results:

Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:9160 0.0.0.0:* LISTEN 5323/java
tcp 0 0 0.0.0.0:9097 0.0.0.0:* LISTEN 2505/java
tcp 0 0 0.0.0.0:9130 0.0.0.0:* LISTEN 4648/java
tcp 0 0 0.0.0.0:8010 0.0.0.0:* LISTEN 2632/python
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 2587/java
tcp 0 0 127.0.0.1:9101 0.0.0.0:* LISTEN 2454/remotepsapp
tcp 0 0 127.0.0.1:9135 0.0.0.0:* LISTEN 2505/java
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 893/rpcbind
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 5024/java
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 4979/nginx
tcp 0 0 127.0.0.1:9170 0.0.0.0:* LISTEN 5323/java
tcp 0 0 0.0.0.0:9140 0.0.0.0:* LISTEN 4648/java
tcp 0 0 0.0.0.0:9397 0.0.0.0:* LISTEN 4648/java
tcp 0 0 0.0.0.0:9110 0.0.0.0:* LISTEN 4648/java
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1676/sshd
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 3163/postgres
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN 1887/master
tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 4979/nginx
tcp 0 0 0.0.0.0:41245 0.0.0.0:* LISTEN 4648/java
tcp 0 0 127.0.0.1:9150 0.0.0.0:* LISTEN 5323/java
tcp 0 0 127.0.0.1:47920 127.0.0.1:5432 ESTABLISHED 4648/java
tcp 0 188 192.168.40.41:22 192.168.3.91:57994 ESTABLISHED 13084/sshd: adminis
tcp 0 0 127.0.0.1:5432 127.0.0.1:48076 ESTABLISHED 5394/postgres: mgmt
tcp 0 0 127.0.0.1:35686 127.0.0.1:5432 ESTABLISHED 5024/java
tcp 0 0 192.168.40.41:60800 192.168.40.31:22 ESTABLISHED 2587/java
tcp 0 0 127.0.0.1:34938 127.0.0.1:5432 ESTABLISHED 5024/java
tcp 0 0 127.0.0.1:5432 127.0.0.1:43012 ESTABLISHED 15142/postgres: mgm
tcp 0 0 127.0.0.1:48076 127.0.0.1:5432 ESTABLISHED 5323/java
tcp 0 0 127.0.0.1:35832 127.0.0.1:5432 ESTABLISHED 5024/java
tcp 0 0 127.0.0.1:5432 127.0.0.1:36280 ESTABLISHED 13904/postgres: mgm
tcp 0 0 127.0.0.1:5432 127.0.0.1:47920 ESTABLISHED 5223/postgres: mgmt
tcp 0 0 127.0.0.1:43012 127.0.0.1:5432 ESTABLISHED 5024/java
tcp 0 0 127.0.0.1:5432 127.0.0.1:35832 ESTABLISHED 13883/postgres: mgm
tcp 0 0 127.0.0.1:5432 127.0.0.1:34938 ESTABLISHED 13847/postgres: mgm
tcp 0 0 127.0.0.1:5432 127.0.0.1:35686 ESTABLISHED 13879/postgres: mgm
tcp 0 0 127.0.0.1:36280 127.0.0.1:5432 ESTABLISHED 5024/java
tcp 0 0 127.0.0.1:34172 127.0.0.1:5432 TIME_WAIT -
tcp6 0 0 :::111 :::* LISTEN 893/rpcbind
tcp6 0 0 :::22 :::* LISTEN 1676/sshd
tcp6 0 0 :::25 :::* LISTEN 1887/master

Running the same command on a different OVC that is not having issues gives a much longer list, with connections to other servers and many more listening ports:

tcp 0 0 127.0.0.1:7010 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 192.168.40.40:9190 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:9095 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:9160 0.0.0.0:* LISTEN 6283/java
tcp 0 0 127.0.0.1:7080 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:9097 0.0.0.0:* LISTEN 2513/java
tcp 0 0 0.0.0.0:9130 0.0.0.0:* LISTEN 14181/java
tcp 0 0 127.0.0.1:7210 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:7050 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 192.168.40.40:22122 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 192.168.42.40:22122 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:8010 0.0.0.0:* LISTEN 2640/python
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 2595/java
tcp 0 0 127.0.0.1:7020 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:9101 0.0.0.0:* LISTEN 2462/remotepsapp
tcp 0 0 0.0.0.0:9390 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:38767 0.0.0.0:* LISTEN 14181/java
tcp 0 0 127.0.0.1:9135 0.0.0.0:* LISTEN 2513/java
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 905/rpcbind
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 5779/java
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 5734/nginx
tcp 0 0 127.0.0.1:9170 0.0.0.0:* LISTEN 6283/java
tcp 0 0 0.0.0.0:9140 0.0.0.0:* LISTEN 14181/java
tcp 0 0 127.0.0.1:7060 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:7220 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:9397 0.0.0.0:* LISTEN 14181/java
tcp 0 0 0.0.0.0:9110 0.0.0.0:* LISTEN 14181/java
tcp 0 0 127.0.0.1:7030 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1680/sshd
tcp 0 0 127.0.0.1:7000 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 3166/postgres
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN 1899/master
tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 5734/nginx
tcp 0 0 127.0.0.1:9150 0.0.0.0:* LISTEN 6283/java
tcp 0 0 127.0.0.1:7230 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:7070 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:7040 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 127.0.0.1:7200 0.0.0.0:* LISTEN 3722/svtfs
tcp 0 0 192.168.40.40:9390 192.168.40.40:51164 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:5432 127.0.0.1:39724 ESTABLISHED 14548/postgres: mgm
tcp 0 0 192.168.40.40:49115 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:60370 127.0.0.1:9099 ESTABLISHED 14181/java
tcp 0 0 127.0.0.1:57036 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 127.0.0.1:5432 127.0.0.1:52812 ESTABLISHED 6797/postgres: mgmt
tcp 0 0 127.0.0.1:43728 127.0.0.1:9097 TIME_WAIT -
tcp 0 0 127.0.0.1:57000 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 192.168.40.40:44991 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:5432 127.0.0.1:52766 ESTABLISHED 6782/postgres: mgmt
tcp 0 0 127.0.0.1:43746 127.0.0.1:9097 TIME_WAIT -
tcp 0 0 127.0.0.1:57040 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 127.0.0.1:34996 127.0.0.1:9099 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:44097 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:41339 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.42.40:34319 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:43659 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:39724 127.0.0.1:5432 ESTABLISHED 14181/java
tcp 0 0 127.0.0.1:52822 127.0.0.1:5432 ESTABLISHED 5779/java
tcp 0 0 192.168.40.40:34479 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:57034 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 127.0.0.1:9099 127.0.0.1:60370 ESTABLISHED 2595/java
tcp 0 0 127.0.0.1:52766 127.0.0.1:5432 ESTABLISHED 5779/java
tcp 0 0 127.0.0.1:9099 127.0.0.1:34996 ESTABLISHED 2595/java
tcp 0 0 127.0.0.1:5432 127.0.0.1:56994 ESTABLISHED 6356/postgres: mgmt
tcp 0 0 192.168.40.40:42139 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:53895 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:36257 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:35589 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:52966 127.0.0.1:5432 ESTABLISHED 5779/java
tcp 0 0 192.168.42.40:40187 192.168.42.41:22122 CLOSE_WAIT 3722/svtfs
tcp 0 0 192.168.40.40:48208 192.168.40.15:443 TIME_WAIT -
tcp 0 0 127.0.0.1:43734 127.0.0.1:9097 TIME_WAIT -
tcp 0 0 192.168.42.40:42499 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:51164 192.168.40.40:9390 ESTABLISHED 14181/java
tcp 0 0 192.168.40.40:43627 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:39971 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.42.40:34605 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:38725 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:5432 127.0.0.1:52966 ESTABLISHED 6840/postgres: mgmt
tcp 0 0 192.168.40.40:48202 192.168.40.15:443 TIME_WAIT -
tcp 0 0 192.168.40.40:48196 192.168.40.15:443 TIME_WAIT -
tcp 0 0 192.168.40.40:48214 192.168.40.15:443 TIME_WAIT -
tcp 0 0 192.168.42.40:36687 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:48220 192.168.40.15:443 TIME_WAIT -
tcp 0 0 127.0.0.1:43752 127.0.0.1:9097 TIME_WAIT -
tcp 0 0 127.0.0.1:43740 127.0.0.1:9097 TIME_WAIT -
tcp 0 1688 192.168.40.40:22 192.168.3.91:57757 ESTABLISHED 2583/sshd: administ
tcp 0 0 192.168.40.40:58817 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.42.40:59065 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:59819 192.168.50.40:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:60181 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.42.40:54895 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:42259 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:56994 127.0.0.1:5432 ESTABLISHED 6283/java
tcp 0 0 127.0.0.1:52768 127.0.0.1:5432 ESTABLISHED 5779/java
tcp 0 0 192.168.40.40:53279 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:45667 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:56998 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 192.168.40.40:39945 192.168.50.41:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:52812 127.0.0.1:5432 ESTABLISHED 5779/java
tcp 0 0 192.168.42.40:48865 192.168.42.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 192.168.40.40:60609 192.168.50.42:22122 ESTABLISHED 3722/svtfs
tcp 0 0 127.0.0.1:5432 127.0.0.1:52822 ESTABLISHED 6803/postgres: mgmt
tcp 0 0 127.0.0.1:57038 127.0.0.1:7210 TIME_WAIT -
tcp 0 0 127.0.0.1:5432 127.0.0.1:52768 ESTABLISHED 6783/postgres: mgmt
tcp6 0 0 :::2049 :::* LISTEN 3722/svtfs
tcp6 0 0 :::111 :::* LISTEN 905/rpcbind
tcp6 0 0 :::22 :::* LISTEN 1680/sshd
tcp6 0 0 :::25 :::* LISTEN 1899/master
tcp6 0 0 :::32767 :::* LISTEN 3722/svtfs
tcp6 0 0 192.168.41.40:2049 192.168.41.50:955 ESTABLISHED 3722/svtfs
tcp6 0 0 192.168.41.40:2049 192.168.41.50:954 ESTABLISHED 3722/svtfs
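
Comparing the two listings above by eye, every svtfs listener (including 192.168.40.40:9190) appears only on the healthy OVC, and the faulty OVC's list contains no svtfs process at all, which would be consistent with the "Connection refused" error on port 9190. The sketch below automates that comparison. It assumes each netstat -tpan output has been saved to a text file; the function name missing_listeners and the file names are hypothetical.

```shell
#!/bin/sh
# Print "port program" for each LISTEN row that appears in the healthy
# capture but not in the faulty one.
# $1 = netstat -tpan output saved from the healthy OVC
# $2 = netstat -tpan output saved from the faulty OVC
missing_listeners() {
    awk '
        $6 != "LISTEN" { next }                # skip header and non-listening rows
        {
            n = split($4, a, ":")              # port is the last ":"-separated piece
            prog = $7
            sub(/^[0-9]+\//, "", prog)         # drop the leading "PID/"
            key = a[n] " " prog
        }
        FILENAME == ARGV[1] { faulty[key] = 1; next }   # first file: faulty node
        !(key in faulty) && !seen[key]++ { print key }  # second file: healthy node
    ' "$2" "$1"
}
```

Usage would look like: missing_listeners healthy.txt faulty.txt. Matching on port plus program name rather than on PID is deliberate, since PIDs differ between nodes even when the same service is running.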

Unfortunately that article doesn't really go into how to resolve this type of issue.  Any help would be greatly appreciated.

Rajini_Saini
HPE Pro

Re: OVC state "faulty" after power outage

Hi @adidasnmotion,

Thank you for choosing HPE.
This issue can occur for a number of reasons. To start with, please open a PuTTY (SSH) session to the faulty OVC and run the commands below.

cd /var/svtfs/svt-hal/0
ls
Please check whether there is a file named nostart. If there is, it has to be deleted:

sudo su
source /var/tmp/build/bin/appsetup
rm -rf nostart
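
Before deleting anything, it may help to confirm whether the marker file exists at all. This is a read-only sketch, not an HPE tool: find_nostart is a hypothetical helper, and the directories to check are passed as arguments.

```shell
#!/bin/sh
# Report any "nostart" marker file found directly under the given
# directories. Nothing is deleted; this only prints what it finds.
find_nostart() {
    for dir in "$@"; do
        if [ -e "$dir/nostart" ]; then
            echo "found: $dir/nostart"
        fi
    done
}
```

For the directories named in this thread, the call would be: find_nostart /var/svtfs/0 /var/svtfs/svt-hal/0. No output means no nostart file was found in either location.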

regards,
Rajini Saini


I work for HPE


adidasnmotion
Advisor

Re: OVC state "faulty" after power outage

Thank you for your response.  Unfortunately there is no nostart file at that location.

db13
HPE Pro

Re: OVC state "faulty" after power outage

@adidasnmotion, if there is no nostart file in /var/svtfs/0 or /var/svtfs/svt-hal/0, you would need to contact support to investigate the issue further. It is possible that the power outage caused data on the node to become lost or corrupt, services to fail, logical drives to become inaccessible, etc.

I am an HPE Employee

steez
Frequent Advisor

Re: OVC state "faulty" after power outage

I had the same issue, also due to a power outage.

As the host was faulty and not participating in the cluster, I rebooted the OVC. After that, the status changed to Healthy. It doesn't sound like a smart solution, but it might be worth a try, since the reboot doesn't really affect the cluster.