HPE Ezmeral Software platform

Hpe Ezmeral Data Fabric Yarn service down

 
mglee0812
Occasional Advisor

We are running Data Fabric 6.1 on a 5-node cluster. ZooKeeper is operating normally on three nodes, but YARN does not come up after a service restart. Because YARN is down, neither the Application Manager nor the NodeManager is working. I tried restarting the service through Warden, but that did not fix the error.

The error log is as follows:

>> org.apache.hadoop.metrics2.MetricsException: Metrics source ClusterMetrics already exists

I stopped the service through Warden and confirmed with ps -ef that the process had exited. The node's disk and license are also fine.

Please help me solve the problem.

support_s
System Recommended

Query: Hpe Ezmeral Data Fabric Yarn service down

System recommended content:

1. HPE Ezmeral Data Fabric – Customer-Managed 7.6.1 Documentation | Administering Services

2. HPE Ezmeral Data Fabric – Customer-Managed 7.7.0 Documentation | Administering Services

 


mglee0812
Occasional Advisor

Re: Query: Hpe Ezmeral Data Fabric Yarn service down

I have been running the service reliably for two years, and this problem appeared suddenly. I don't see anything wrong in my yarn-site.xml:

<configuration>
<!-- Resource Manager MapR HA Configs -->
<property>
<name>yarn.resourcemanager.ha.custom-ha-enabled</name>
<value>true</value>
<description>MapR Zookeeper based RM Reconnect Enabled. If this is true, set the failover proxy to be the class MapRZKBasedRMFailoverProxyProvider</description>
</property>

<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.MapRZKBasedRMFailoverProxyProvider</value>
<description>Zookeeper based reconnect proxy provider. Should be set if and only if mapr-ha-enabled property is true.</description>
</property>

<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
<description>RM Recovery Enabled</description>
</property>

<property>
<name>yarn.resourcemanager.ha.custom-ha-rmaddressfinder</name>
<value>org.apache.hadoop.yarn.client.MapRZKBasedRMAddressFinder</value>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>20480</value>
<source>yarn-default.xml</source>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
<source>yarn-default.xml</source>
</property>
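
For anyone checking the same settings on their own cluster: with yarn.resourcemanager.recovery.enabled set to true, the ResourceManager tries to reload saved state at startup, so it is worth confirming the value a node actually has. Below is a minimal sketch that pulls one property out of a yarn-site.xml. The function name get_yarn_property is made up for illustration, and it assumes each <name> and <value> sits on its own line, as in the file above. It is not a general XML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the <value> of one property from a
# Hadoop-style XML config file.
# Assumption: <name> and <value> are each on their own line.
get_yarn_property() {
  local file="$1" name="$2"
  awk -v want="$name" '
    # Remember whether the most recent <name> is the one we want.
    /<name>/ { n = $0; gsub(/.*<name>|<\/name>.*/, "", n); found = (n == want) }
    # Print the first <value> that follows the wanted <name>, then stop.
    /<value>/ && found { v = $0; gsub(/.*<value>|<\/value>.*/, "", v); print v; exit }
  ' "$file"
}
```

Usage would look like `get_yarn_property /path/to/yarn-site.xml yarn.resourcemanager.recovery.enabled`, where the path is whatever your installation uses. The function prints nothing if the property is not present.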

Dave Olker
Neighborhood Moderator

Re: Query: Hpe Ezmeral Data Fabric Yarn service down

One thing you can try is moving the contents of FSRMStateRoot to a backup directory and then deleting the original directory:

# hadoop fs -mkdir -p /FSRMStateRoot_backup
# hadoop fs -mv /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/* /FSRMStateRoot_backup/
# hadoop fs -rm -r /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/

After making the above changes, restart the ResourceManager service.
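
A cautious way to run the steps above is to print the commands first and only execute them once they look right. This is a rough sketch, not an official procedure: backup_and_clear_rm_state and the RUN variable are hypothetical names, and the backup path is simply the one used above.

```shell
#!/usr/bin/env bash
# Sketch: back up the ResourceManager state store, then remove it.
# STATE_ROOT is the path from the reply above.
STATE_ROOT="/var/mapr/cluster/yarn/rm/system/FSRMStateRoot"

# Hypothetical helper. By default every command is only printed (dry run).
# Call it with RUN set to an empty string to really execute the commands.
backup_and_clear_rm_state() {
  local backup_dir="$1"
  local run="${RUN-echo}"
  # The glob is quoted so that hadoop, not the local shell, expands it.
  $run hadoop fs -mkdir -p "$backup_dir" &&
  $run hadoop fs -mv "$STATE_ROOT/*" "$backup_dir/" &&
  $run hadoop fs -rm -r "$STATE_ROOT"
}
```

Running `backup_and_clear_rm_state /FSRMStateRoot_backup` prints the three hadoop commands without touching the cluster. Running `RUN= backup_and_clear_rm_state /FSRMStateRoot_backup` executes them, stopping at the first command that fails so the state store is not deleted unless the move succeeded.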



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
mglee0812
Occasional Advisor

Re: Query: Hpe Ezmeral Data Fabric Yarn service down

I checked the logs (below). Would it be effective to restart the ResourceManager and NodeManager after the steps you suggested?

INFO org.apache.hadoop.yarn.server.api.ConfigurableAuxServices: Adding auxiliary service RMVolumeManager
INFO com.mapr.hadoop.yarn.resourcemanager.RMVolumeManager: Checking for ResourceManager volume. If volume not present command will create and mount it. Command invoked as: /opt/mapr/server/createdTVolume.sh abigufa2 /var/mapr/cluster/yarn/rm /var/mapr/cluster/yarn/rm/system with permission: rwx---
com.mapr.hadoop.yarn.resourcemanager.RMVolumeManager: Successfully created ResourceManager volume and mounted at /var/mapr/cluster/yarn/rm.
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state.
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started.
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.2.
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Done loading applications from FS state store.
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state.
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
    at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$AMRMTokenSecretManagerStateProto.<init>(YarnServerResourceManagerRecoveryProtos.java:3938)
    at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$AMRMTokenSecretManagerStateProto.<init>(YarnServerResourceManagerRecoveryProtos.java:3902)
    at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$AMRMTokenSecretManagerStateProto$1.parsePartialFrom(YarnServerResourceManagerRecoveryProtos.java:4006)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.loadState(FileSystemRMStateStore.java:206)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:1032)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
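
The stack trace shows recovery failing inside FileSystemRMStateStore.loadState while parsing AMRMTokenSecretManagerStateProto, which suggests that a file in the RM state store could not be read. To pull just the failure cause out of a long ResourceManager log, something like the following might help. rm_failure_causes is a hypothetical helper name, and the log file to pass in is whatever your installation writes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each distinct cause recorded when a Hadoop
# service failed to start, based on the "failed in state ...; cause: ..."
# lines that AbstractService writes (as seen in the log above).
rm_failure_causes() {
  grep 'failed in state' "$1" | sed -n 's/.*cause: //p' | sort -u
}
```

Called as `rm_failure_causes /path/to/resourcemanager.log`, it prints one line per distinct cause, so repeated restart attempts that fail the same way collapse into a single entry.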

Dave Olker
Neighborhood Moderator

Re: Query: Hpe Ezmeral Data Fabric Yarn service down

You could do a Warden restart if you prefer and force all services to restart.  
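
If only one service needs a restart, maprcli can target it on specific nodes instead of restarting everything through Warden. A small sketch, assuming maprcli is on the PATH. restart_service, the RUN variable, and the node names in the usage note are hypothetical, and by default the function only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart one Warden-managed service on the given nodes.
# By default the maprcli commands are only printed (dry run).
# Call it with RUN set to an empty string to really execute them.
restart_service() {
  local run="${RUN-echo}" service="$1" node
  shift
  for node in "$@"; do
    $run maprcli node services -name "$service" -action restart -nodes "$node"
  done
}
```

For example, `restart_service resourcemanager node1 node2` prints one maprcli command per node, and `RUN= restart_service resourcemanager node1 node2` runs them. The heavier alternative mentioned above is restarting Warden itself on a node, for example with `service mapr-warden restart`.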


