HPE Ezmeral Software platform

Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r ()

 
SOLVED
filip_novak
Occasional Contributor

Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r ()

Hi!

Recently our cluster (Ezmeral Data Fabric v7.3 Customer Managed) went down because of a disk problem (fixed with fsck -r). Now we have a new problem: one of the three ZooKeeper nodes won't start up, failing with this error:

Failed to set cldb key file /opt/mapr/conf/cldb.key err com.mapr.security.MutableInt@70be0a2b
Cldb key can not be obtained: 2

It seems to be a known issue (since version 7.0 there is no cldb.key file) and may be related to this discussion.

I have checked:

/opt/mapr/conf/maprhsm.conf
/opt/mapr/conf/tokens/

and noticed that on the problem node:

-rw------- 1 mapr mapr 8 Jan 10 16:49 generation
-rw------- 1 mapr mapr 0 Nov 29 19:12 token.lock
-rw------- 1 mapr mapr 0 Nov 29 19:12 token.object

and on master:

-rw------- 1 mapr mapr 8 Jan 10 17:12 generation
-rw------- 1 mapr mapr 0 Jan 10 17:12 token.lock
-rw------- 1 mapr mapr 320 Jan 10 17:12 token.object

so I did:

cd /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/
mkdir ~/zookeeper_backup/
mv generation token* ~/zookeeper_backup
scp nl-hadoop-master:/opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/generation .
scp nl-hadoop-master:/opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/token* .
sudo systemctl restart mapr-zookeeper
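(A quick sanity check at this point, assuming ssh access to the master as the mapr user, would be to confirm the copied files now match the master's checksums; this is only a sketch, not something I actually ran:)

# compare the local token store with the master's copy (same paths as above)
cd /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/
md5sum generation token.object
ssh nl-hadoop-master "cd /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/ && md5sum generation token.object"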

Then I saw this error in the ZooKeeper logs:

2025-01-10 17:24:03,664 [myid:2] - INFO  [main:FileSnap@83] - Reading snapshot /opt/mapr/zkdata/version-2/snapshot.5005d10a3
2025-01-10 17:24:03,833 [myid:2] - ERROR [main:Util@211] - Last transaction was partial.
2025-01-10 17:24:03,836 [myid:2] - ERROR [main:QuorumPeer@955] - Unable to load database on disk
java.io.IOException: The accepted epoch, 5 is less than the current epoch, 6
	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:952)
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:905)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
2025-01-10 17:24:03,838 [myid:2] - ERROR [main:QuorumPeerMain@101] - Unexpected exception, exiting abnormally

so I moved the ZooKeeper snapshot directory to the backup folder (as suggested here):

mv /opt/mapr/zkdata/version-2 ~/zookeeper_backup/
sudo systemctl restart mapr-zookeeper
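(To confirm the node rejoined the ensemble after wiping its local snapshot, something like the following should work; qstatus ships with the mapr-zookeeper package, and srvr is in ZooKeeper's default four-letter-word whitelist:)

# check the role of this ZooKeeper after the restart (expect follower or leader)
/opt/mapr/initscripts/zookeeper qstatus
# query the server directly on the MapR client port 5181
echo srvr | nc localhost 5181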

Hadoop fs commands are working now, but cldb.log suggests that some problems are still present:

2025-01-10 18:02:55,243 INFO  ZooKeeper [main]: Client environment:java.library.path=/opt/mapr/lib
2025-01-10 18:02:55,243 INFO  ZooKeeper [main]: Client environment:java.io.tmpdir=/tmp
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:java.compiler=<NA>
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:os.name=Linux
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:os.arch=amd64
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:os.version=5.4.0-204-generic
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:user.name=mapr
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:user.home=/home/mapr
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:user.dir=/opt/mapr/initscripts
2025-01-10 18:02:55,244 INFO  ZooKeeper [main]: Client environment:os.memory.free=2302MB
2025-01-10 18:02:55,245 INFO  ZooKeeper [main]: Client environment:os.memory.max=4000MB
2025-01-10 18:02:55,245 INFO  ZooKeeper [main]: Client environment:os.memory.total=2402MB
2025-01-10 18:02:55,255 INFO  ZooKeeper [main]: Initiating client connection, connectString=nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181 sessionTimeout=30000 watcher=com.mapr.fs.cldb.CLDBServer@7e809b79
2025-01-10 18:02:55,260 INFO  X509Util [main]: Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2025-01-10 18:02:55,265 INFO  ClientCnxnSocket [main]: jute.maxbuffer value is 4194304 Bytes
2025-01-10 18:02:55,271 INFO  ClientCnxn [main]: zookeeper.request.timeout value is 0. feature enabled=
2025-01-10 18:02:55,272 INFO  CLDBServer [main]: CLDB configured with ZooKeeper ensemble with connection string nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181
2025-01-10 18:02:55,320 INFO  S3ServerHandler [main]: Creating new instance of S3ServerHandler
2025-01-10 18:02:55,330 INFO  ActiveContainersMap [main]: Caching a max of 3177503 containers in cache
2025-01-10 18:02:55,357 INFO  Login [main-SendThread(nl-hadoop-worker2.local:5181)]: Client successfully logged in.
2025-01-10 18:02:55,357 ERROR UsageEmailManager [UsageEmailManager]: Failed to send email
2025-01-10 18:02:55,359 INFO  ZooKeeperSaslClient [main-SendThread(nl-hadoop-worker2.local:5181)]: Client will use MAPR-SECURITY as SASL mechanism.
2025-01-10 18:02:55,363 INFO  ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Opening socket connection to server nl-hadoop-worker2.local/10.0.2.42:5181. Will attempt to SASL-authenticate using Login Context section 'Client', mechanism MAPR-SECURITY, principal zookeeper/nl-hadoop-worker2.local
2025-01-10 18:02:55,366 INFO  ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Socket connection established, initiating session, client: /10.0.2.40:12864, server: nl-hadoop-worker2.local/10.0.2.42:5181
2025-01-10 18:02:55,379 INFO  ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Session establishment complete on server nl-hadoop-worker2.local/10.0.2.42:5181, sessionid = 0x20000819cc700a7, negotiated timeout = 30000
2025-01-10 18:02:55,381 INFO  CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
2025-01-10 18:02:55,395 INFO  CLDBServer [main-EventThread]: onZKConnect: The CLDB has successfully connected to the ZooKeeper server State:CONNECTED Timeout:30000 sessionid:0x20000819cc700a7 local:/10.0.2.40:12864 remoteserver:nl-hadoop-worker2.local/10.0.2.42:5181 lastZxid:0 xid:3 sent:1 recv:3 queuedpkts:0 pendingresp:0 queuedevents:1 in the ZooKeeper ensemble with connection string nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181
2025-01-10 18:02:55,407 INFO  TierGatewayHandler [main]: Init TierGatewayHandler
2025-01-10 18:02:55,447 INFO  ZooKeeperClient [main-EventThread]: Setting Cldb Info in ZooKeeper, external Port:7222
2025-01-10 18:02:55,457 INFO  ECTierManager [main]: Subscribed for EC gateway registration notifications. Current gateways...
2025-01-10 18:02:55,467 INFO  CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
2025-01-10 18:02:55,470 INFO  CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2025-01-10 18:02:56,858 INFO  CLDB [main]: CLDBState: CLDB State change : WAIT_FOR_FILESERVERS
2025-01-10 18:02:56,861 INFO  CLDBWatchdog [main]: CLDB memory threshold(heap + non heap) is set to : 8096 MB. Xmx: 4000, Configured Non-Heap: 4096
2025-01-10 18:02:56,862 INFO  CLDB [main]: [Starting RPCServer] port: 7222 num threads: 10 heap size: 4000MB IPGutsShm 32768 startup options: -Xms2400m -Xmx4000m -XX:ErrorFile=/opt/cores/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/cores -XX:ThreadStackSize=1024
2025-01-10 18:02:56,862 INFO  CLDB [main]: Starting 2 RPC Instances for CLDB
2025-01-10 18:02:56,863 ERROR CLDB [main]: Exception in RPC init
2025-01-10 18:02:56,863 ERROR CLDB [main]: Could not initialize RPC.. aborting
2025-01-10 18:03:08,889 INFO  ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:03:09,454 INFO  ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDeleted
2025-01-10 18:03:09,455 INFO  ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb. Event state: SyncConnected. Event type: NodeChildrenChanged
2025-01-10 18:03:09,561 INFO  ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services_config/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDataChanged
2025-01-10 18:03:09,706 INFO  ClusterGroup [Thread-7]: periodicPull: Found change in Ips for cluster hadoop.cluster.local. Old APiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local newApiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local Old CLDBs: nl-hadoop-master.local:7222 New CLDBs:
2025-01-10 18:03:12,212 INFO  Alarms [RPC-4]: Adding NODE_ALARM_SERVICE_CLDB_DOWN into AlarmNameToHashMap
2025-01-10 18:03:12,212 WARN  Alarms [RPC-4]: composeEmailMessage: Alarm raised: NODE_ALARM_SERVICE_CLDB_DOWN:nl-hadoop-master.local:NODE_ALARM; Cluster: hadoop.cluster.local; Can not determine if service: cldb is running. Check logs at: /opt/mapr/logs/cldb.log
2025-01-10 18:03:13,203 ERROR EmailManager [EmailManager]: EmailManager: Failed to send email for alarm: NODE_ALARM_SERVICE_CLDB_DOWN:nl-hadoop-master.local:NODE_ALARM, Error:  Sending the email to the following server failed : smtp.gmail.com:587
2025-01-10 18:04:05,990 INFO  LabelBasedAllocator [RPC-2]: ContainerAssign ignore Writer 10.0.2.40:0 for volume, mapr.resourcemanager.volume with max size, 32768 and with 85 containers
2025-01-10 18:04:11,312 ERROR S3VolumeCache [ACR-6]: updateBucketCount Account: 0 doesn't belong to S3 Volume cache
2025-01-10 18:05:08,894 INFO  ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:05:23,448 ERROR S3VolumeCache [ACR-4]: updateBucketCount Account: 0 doesn't belong to S3 Volume cache
2025-01-10 18:07:08,900 INFO  ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:08:08,853 INFO  DiskBalancer [DBal]: Balancing SPs in the average and below-average bins
2025-01-10 18:08:08,856 INFO  MutableContainerInfo [RBal]:  Container ID:2237 vol:188767170 Servers:  10.0.2.45 10.0.2.40 10.0.2.43-R Epoch:304 Ctx [Switch Roles] (I<->T) Cid(D): 2237 Src SP: 89e15385f480bf800065aa4a2d0305de Dst SP: 75aa600482d668cb0065685e910523fe
2025-01-10 18:08:08,857 INFO  RoleBalancer [RBal]: [StoragePool: 75aa600482d668cb0065685e910523fe, IP: 10.0.2.43-] Increased Tails by 1246 MB
2025-01-10 18:08:08,871 INFO  DiskBalancer [DBal]:  Container ID:2648 vol:158847011 Servers:  10.0.2.41 10.0.2.45 10.0.2.44 10.0.2.42-2-R Epoch:38 Ctx ClusterAvg 62 moving container of size 29942 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 0]  to sp 5ae035730a77f9c6006565d53106d277 on fs 10.0.2.42 binIndex: 0 [ % 18 u 602336 c 3349964 in 29942 out 0]
2025-01-10 18:08:08,889 INFO  DiskBalancer [DBal]:  Container ID:2654 vol:158847011 Servers:  10.0.2.45 10.0.2.43 10.0.2.41 10.0.2.42-2-R Epoch:33 Ctx ClusterAvg 62 moving container of size 31220 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 29942]  to sp e857a79128c235da0067515e8900f259 on fs 10.0.2.42 binIndex: 0 [ % 17 u 584929 c 3450316 in 31220 out 0]
2025-01-10 18:08:08,894 INFO  DiskBalancer [DBal]:  Container ID:2603 vol:158847011 Servers:  10.0.2.40 10.0.2.41 10.0.2.47 10.0.2.42-2-R Epoch:26 Ctx ClusterAvg 62 moving container of size 27344 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 61162]  to sp 5ae035730a77f9c6006565d53106d277 on fs 10.0.2.42 binIndex: 0 [ % 19 u 602336 c 3349964 in 57286 out 0]
2025-01-10 18:08:08,894 INFO  DiskBalancer [DBal]: Moved 3 containers from sp d19d39b73aa48290006564c24e08b78e
2025-01-10 18:08:10,872 INFO  MutableContainerInfo [ACR-6]:  Container ID:2237 vol:188767170 Servers:  10.0.2.45 10.0.2.40 10.0.2.43 Epoch:304 Ctx 10.0.2.43 resync response
2025-01-10 18:08:23,861 INFO  MutableContainerInfo [RBal]:  Container ID:2659 vol:188767170 Servers:  10.0.2.47 10.0.2.44 10.0.2.45-R Epoch:176 Ctx [Switch Roles] (I<->T) Cid(D): 2659 Src

Core files are also present on worker2:

ls -la /opt/cores
total 8
drwxrwxrwt 2 root root 4096 Jan 10 15:07 .
drwxr-xr-x 8 root root 4096 Nov 28  2023 ..
-rw-r--r-- 1 root root    0 Jan 10 15:07 loopbacknfs.core.1382.nl-hadoop-worker2.local

The repeated MASTER_MISSING messages and this line worry me:

ClusterGroup [Thread-7]: periodicPull: Found change in Ips for cluster hadoop.cluster.local. Old APiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local newApiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local Old CLDBs: nl-hadoop-master.local:7222 New CLDBs:

And this doesn't look good either:

2025-01-10 18:03:09,454 INFO  ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDeleted
syeddula
HPE Pro

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

Possible Causes:

  • Data corruption: running fsck -r on ZooKeeper or CLDB data directories might have corrupted data, leading to the initialization failure.
  • Other issues: there could be other underlying problems with the CLDB service itself that are unrelated to fsck -r.


Could you please share the information below?
Also, can you check for any disk-related issues? (A disk-check sketch follows the commands below.)
 
systemctl status mapr-zookeeper.service
maprcli dump cldbstate
/opt/mapr/initscripts/zookeeper qstatus
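For the disk-related check, something along these lines (a sketch; adjust the hostname and log paths to your nodes):

# MapR's view of the disks on the affected node
maprcli disk list -host nl-hadoop-worker2.local
# recent kernel-level I/O errors
dmesg -T | grep -iE 'i/o error|blk_update_request' | tail -n 50
# fileserver-level disk errors
grep -iE 'disk.*(error|fail)' /opt/mapr/logs/mfs.log* | tail -n 50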

ParvYadav
HPE Pro

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

Can you share the output of the following commands:

1. maprcli dump containerinfo -ids 1 -json

2. /opt/mapr/server/mrconfig sp list -v (from the problematic node)

filip_novak
Occasional Contributor

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

@syeddula 

systemctl status mapr-zookeeper.service (nl-hadoop-worker2)

systemctl status mapr-zookeeper.service
● mapr-zookeeper.service - MapR Technologies, Inc. zookeeper service
     Loaded: loaded (/etc/systemd/system/mapr-zookeeper.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2025-01

 cldb state (nl-hadoop-worker2)

maprcli dump cldbstate
mode               s3Info  ip         state                      stateDuration  desc
MASTER_READ_WRITE  ...     10.0.2.40  CLDB_IS_MASTER_READ_WRITE  65:27:10       kvstore tables loading complete, cldb running as master

qstatus (nl-hadoop-worker2)

/opt/mapr/initscripts/zookeeper qstatus
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Mode: follower

 qstatus (nl-hadoop-worker1)

/opt/mapr/initscripts/zookeeper qstatus
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Mode: follower

@ParvYadav 

maprcli dump containerinfo -ids 1 -json
{
	"timestamp":1736759573667,
	"timeofday":"2025-01-13 11:12:53.667 GMT+0200 AM",
	"status":"OK",
	"total":1,
	"data":[
		{
			"ContainerId":1,
			"Epoch":29,
			"mirrorCid":0,
			"Primary":"10.0.2.40:5660--29-VALID",
			"ActiveServers":{
				"IP":[
					"10.0.2.40:5660--29-VALID",
					"10.0.2.46:5660--29-VALID",
					"10.0.2.44:5660--29-VALID",
					"10.0.2.43:5660--29-VALID",
					"10.0.2.41:5660--29-VALID",
					"10.0.2.47:5660--29-VALID"
				]
			},
			"InactiveServers":{

			},
			"UnusedServers":{

			},
			"OwnedSizeMB":"0 MB",
			"SharedSizeMB":"0 MB",
			"LogicalSizeMB":"0 MB",
			"TotalSizeMB":"0 MB",
			"NameContainer":"true",
			"ContainerType":"NameSpaceContainer",
			"ReplicationType":"STAR",
			"CreatorContainerId":0,
			"CreatorVolumeUuid":"",
			"UseActualCreatorId":false,
			"VolumeName":"mapr.cldb.internal",
			"VolumeId":1,
			"VolumeReplication":6,
			"NameSpaceReplication":6,
			"VolumeMounted":false,
			"AccessTime":"September 26, 2023",
			"TopologyViolated":"false",
			"LabelViolated":"false"
		}
	]
}
/opt/mapr/server/mrconfig sp list -v
ListSPs resp: status 0:3
No. of SPs (3), totalsize 7522381 MiB, totalfree 5202621 MiB

SP 0: name SP3, Online, size 3349964 MiB, free 2344583 MiB, path /dev/sdb, log 200 MiB, port 5660, guid 5ae035730a77f9c6006565d53106d277, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sdb /dev/sdc, dare 0, label default:0
SP 1: name SP4, Online, size 722099 MiB, free 404698 MiB, path /dev/sdd, log 200 MiB, port 5660, guid 1fa0c87596603b29006568294006f1a4, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sdd, dare 0, label default:0
SP 2: name SP6, Online, size 3450316 MiB, free 2453339 MiB, path /dev/sde, log 200 MiB, port 5660, guid e857a79128c235da0067515e8900f259, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sde, dare 0, label default:0

ParvYadav
HPE Pro

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

@filip_novak Everything looks good in the outputs you have shared. You have not shared the output from the third ZooKeeper node, so I am assuming that one is the leader.

filip_novak
Occasional Contributor
Solution

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

@ParvYadav Thanks!

Yes, the third ZooKeeper is the leader.

I checked /opt/mapr/pid and noticed that cldb.pid was owned by root, and the process listed in cldb.pid wasn’t running. So, I stopped the warden and zookeeper, deleted all PID files, and restarted the master node. Now, the CLDB has started and appears to be working.
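(For anyone hitting the same thing, the sequence was roughly the sketch below; the exact commands are reconstructed from memory, and starting ZooKeeper before Warden is the usual recommendation:)

# on the master node: inspect the stale PID file and confirm the process is gone
ls -l /opt/mapr/pid/cldb.pid
ps -p "$(cat /opt/mapr/pid/cldb.pid)"   # reported no such process

# stop services, clear the stale PID files, then bring the node back up
sudo systemctl stop mapr-warden
sudo systemctl stop mapr-zookeeper
sudo rm -f /opt/mapr/pid/*.pid
sudo systemctl start mapr-zookeeper
sudo systemctl start mapr-warden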

How can I verify that the cluster is in a healthy state?

Are these normal log entries?

2025-01-13 14:38:44,590 INFO  ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-13 14:40:44,597 INFO  ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-13 14:42:42,816 INFO  ErrorNotificationHandler [RPC-3]: [container failure report] cid: 3024 reporting fs: 10.0.2.41 sp: 9bbd5c33ac74cc84006568293909f4f0 failing server: 10.0.2.47
2025-01-13 14:42:42,819 INFO  MutableContainerInfo [RPC-3]:  Container ID:3024 vol:158847011 Servers:  10.0.2.40 10.0.2.41 10.0.2.42-C Epoch:26

ParvYadav
HPE Pro

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

@filip_novak You can check the cluster health from MCS. You can also run the command below to check it through the CLI:

maprcli node list -columns svc,csvc,healthDesc,id
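A couple of additional checks that may help (standard maprcli commands):

# any active cluster, node, or volume alarms
maprcli alarm list
# confirm the CLDB is running as read-write master
maprcli dump cldbstate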

Regarding the log lines you shared:

They report an issue for container 3024, which has nothing to do with the overall cluster health. To investigate this, please check:

maprcli dump containerinfo -ids 3024 -json

and check whether the container has a valid master copy. If the master is missing, we need to investigate why it went missing and try to bring it back if possible. If we cannot retrieve it, we can force-master the other replica with the highest epoch. (An example epoch check is sketched after the note below.)

NOTE: If the other replica copy has a lower epoch than the previous master copy, some data loss is to be expected.
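For example, to see the replicas and their epochs at a glance (a sketch; assumes jq is installed, and the --<n>- token embedded in each server entry appears to correspond to that replica's epoch):

maprcli dump containerinfo -ids 3024 -json | jq '.data[0] | {Epoch, Primary, ActiveServers, InactiveServers, UnusedServers}'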

filip_novak
Occasional Contributor

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

The s3server service is down on nl-hadoop-worker2, but we are not using S3, so that's OK (a possible restart is sketched after the node list below).

hostname                      service                                                                                                                           ip         configuredservice                                                                                                                             healthDesc                    id
nl-hadoop-master.local   hs2,cldb,mastgateway,hcat,hbasethrift,spark-thriftserver,hoststats,collectd,fluentd,hivemeta,resourcemanager,fileserver           10.0.2.40  hs2,cldb,mastgateway,hcat,hbasethrift,s3server,spark-thriftserver,hoststats,collectd,fluentd,hivemeta,resourcemanager,fileserver,loopbacknfs  One or more alarms raised     4999696247408828448
nl-hadoop-worker1.local  httpfs,fileserver,mastgateway,nodemanager,spark-historyserver,hoststats,collectd,fluentd,grafana,historyserver,gateway,apiserver  10.0.2.41  httpfs,fileserver,mastgateway,nodemanager,s3server,spark-historyserver,hoststats,collectd,fluentd,grafana,historyserver,gateway,apiserver     Healthy                       4950730202321169056
nl-hadoop-worker2.local  fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,grafana,gateway,apiserver                                           10.0.2.42  httpfs,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,grafana,gateway,apiserver                                       One or more services is down  5027436551618226432
nl-hadoop-worker3.local  fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,livy,hbaserest,grafana,gateway,apiserver                            10.0.2.43  httpfs,fileserver,mastgateway,nodemanager,s3server,zeppelin,hoststats,collectd,fluentd,livy,hbaserest,grafana,gateway,apiserver               Healthy                       2470949201523985696
nl-hadoop-worker4.local  elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb                                              10.0.2.44  elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb                                                 Healthy                       139650559888138400
nl-hadoop-worker5.local  elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb                                              10.0.2.45  elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb                                                 Healthy                       8245262840894133376
nl-hadoop-worker6.local  elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb                                              10.0.2.46  elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb                                                 Healthy                       600409955788436096
nl-hadoop-worker7.local  fileserver,mastgateway,nodemanager,kibana,hoststats,collectd,fluentd,hue                                                          10.0.2.47  fileserver,mastgateway,nodemanager,kibana,drill-bits,s3server,hoststats,collectd,fluentd,hue                                                  Healthy                       4582742894302409344
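(If the service-down alarm becomes a nuisance, I could probably just restart s3server on that node with the standard service-action command; a sketch, not yet tried:)

maprcli node services -name s3server -action restart -nodes nl-hadoop-worker2.local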

maprcli dump containerinfo -ids 3024 -json
{
	"timestamp":1736847749615,
	"timeofday":"2025-01-14 11:42:29.615 GMT+0200 AM",
	"status":"OK",
	"total":1,
	"data":[
		{
			"ContainerId":3024,
			"Epoch":26,
			"mirrorCid":0,
			"Primary":"10.0.2.40:5660--26-VALID",
			"ActiveServers":{
				"IP":[
					"10.0.2.40:5660--26-VALID",
					"10.0.2.41:5660--26-VALID",
					"10.0.2.42:5660--26-VALID"
				]
			},
			"InactiveServers":{

			},
			"UnusedServers":{

			},
			"OwnedSizeMB":"30.16 GB",
			"SharedSizeMB":"0 MB",
			"LogicalSizeMB":"30.26 GB",
			"TotalSizeMB":"30.16 GB",
			"NumInodesInUse":9012,
			"Mtime":"January 14, 2025",
			"NameContainer":"false",
			"ContainerType":"DataContainer",
			"ReplicationType":"CASCADE",
			"CreatorContainerId":0,
			"CreatorVolumeUuid":"",
			"UseActualCreatorId":true,
			"VolumeName":"users",
			"VolumeId":158847011,
			"VolumeReplication":3,
			"NameSpaceReplication":3,
			"VolumeMounted":true,
			"AccessTime":"January 14, 2025",
			"TopologyViolated":"false",
			"LabelViolated":"false"
		}
	]
}

It seems to be ok. I appreciate your help!

Sunitha_Mod
Honored Contributor

Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r (

Hello @filip_novak,

That's Awesome! 

We are delighted to hear the issue has been resolved, and we appreciate you keeping us updated.