Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
01-10-2025 08:16 AM - last edited on 01-12-2025 11:30 PM by support_s
Hi!
Recently our cluster (Ezmeral Data Fabric v7.3 Customer Managed) went down because of a disk problem (fixed with fsck -r). Now we have a new problem: ZooKeeper (1 of 3) won't start, failing with this error:
Failed to set cldb key file /opt/mapr/conf/cldb.key err com.mapr.security.MutableInt@70be0a2b
Cldb key can not be obtained: 2
It seems to be a known bug (since version 7.0 there is no cldb.key), and it may be related to this discussion.
I have checked
/opt/mapr/conf/maprhsm.conf
/opt/mapr/conf/tokens/
and noticed that on the problem node:
-rw------- 1 mapr mapr 8 Jan 10 16:49 generation
-rw------- 1 mapr mapr 0 Nov 29 19:12 token.lock
-rw------- 1 mapr mapr 0 Nov 29 19:12 token.object
and on master:
-rw------- 1 mapr mapr 8 Jan 10 17:12 generation
-rw------- 1 mapr mapr 0 Jan 10 17:12 token.lock
-rw------- 1 mapr mapr 320 Jan 10 17:12 token.object
so I did:
cd /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/
mkdir ~/zookeeper_backup/
mv generation token* ~/zookeeper_backup
scp nl-hadoop-master:/opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/generation .
scp nl-hadoop-master:/opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/token* .
sudo systemctl restart mapr-zookeeper
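For completeness, a quick sanity check of the copied files is worth doing (a sketch; the expected sizes and ownership are taken from the listings above):
ls -la /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/
# generation should be 8 bytes and token.object around 320 bytes, all owned by mapr:mapr
sudo chown mapr:mapr /opt/mapr/conf/tokens/ce0feb26-f264-d800-4031-547ee9363e50/*   # only if scp ran as a different user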
Then I saw this error in the ZooKeeper logs:
2025-01-10 17:24:03,664 [myid:2] - INFO [main:FileSnap@83] - Reading snapshot /opt/mapr/zkdata/version-2/snapshot.5005d10a3
2025-01-10 17:24:03,833 [myid:2] - ERROR [main:Util@211] - Last transaction was partial.
2025-01-10 17:24:03,836 [myid:2] - ERROR [main:QuorumPeer@955] - Unable to load database on disk
java.io.IOException: The accepted epoch, 5 is less than the current epoch, 6
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:952)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:905)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
2025-01-10 17:24:03,838 [myid:2] - ERROR [main:QuorumPeerMain@101] - Unexpected exception, exiting abnormally
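As an aside, ZooKeeper 3.5 keeps the two epochs it is comparing here as small text files in the data directory, so (assuming the default MapR layout under /opt/mapr/zkdata) the mismatch can be inspected directly before deciding what to clean up:
cat /opt/mapr/zkdata/version-2/acceptedEpoch   # presumably 5 here, per the error above
cat /opt/mapr/zkdata/version-2/currentEpoch    # presumably 6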
So I moved the ZooKeeper snapshot to a backup folder (as suggested here):
mv /opt/mapr/zkdata/version-2 ~/zookeeper_backup/
sudo systemctl restart mapr-zookeeper
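After this restart, the node can be checked to confirm it actually rejoined the ensemble, using the qstatus helper shipped with the MapR init scripts:
sudo systemctl status mapr-zookeeper
/opt/mapr/initscripts/zookeeper qstatus   # expect Mode: follower (or leader)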
hadoop fs commands are working, but cldb.log suggests that some problems remain:
2025-01-10 18:02:55,243 INFO ZooKeeper [main]: Client environment:java.library.path=/opt/mapr/lib
2025-01-10 18:02:55,243 INFO ZooKeeper [main]: Client environment:java.io.tmpdir=/tmp
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:java.compiler=<NA>
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:os.name=Linux
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:os.arch=amd64
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:os.version=5.4.0-204-generic
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:user.name=mapr
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:user.home=/home/mapr
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:user.dir=/opt/mapr/initscripts
2025-01-10 18:02:55,244 INFO ZooKeeper [main]: Client environment:os.memory.free=2302MB
2025-01-10 18:02:55,245 INFO ZooKeeper [main]: Client environment:os.memory.max=4000MB
2025-01-10 18:02:55,245 INFO ZooKeeper [main]: Client environment:os.memory.total=2402MB
2025-01-10 18:02:55,255 INFO ZooKeeper [main]: Initiating client connection, connectString=nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181 sessionTimeout=30000 watcher=com.mapr.fs.cldb.CLDBServer@7e809b79
2025-01-10 18:02:55,260 INFO X509Util [main]: Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2025-01-10 18:02:55,265 INFO ClientCnxnSocket [main]: jute.maxbuffer value is 4194304 Bytes
2025-01-10 18:02:55,271 INFO ClientCnxn [main]: zookeeper.request.timeout value is 0. feature enabled=
2025-01-10 18:02:55,272 INFO CLDBServer [main]: CLDB configured with ZooKeeper ensemble with connection string nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181
2025-01-10 18:02:55,320 INFO S3ServerHandler [main]: Creating new instance of S3ServerHandler
2025-01-10 18:02:55,330 INFO ActiveContainersMap [main]: Caching a max of 3177503 containers in cache
2025-01-10 18:02:55,357 INFO Login [main-SendThread(nl-hadoop-worker2.local:5181)]: Client successfully logged in.
2025-01-10 18:02:55,357 ERROR UsageEmailManager [UsageEmailManager]: Failed to send email
2025-01-10 18:02:55,359 INFO ZooKeeperSaslClient [main-SendThread(nl-hadoop-worker2.local:5181)]: Client will use MAPR-SECURITY as SASL mechanism.
2025-01-10 18:02:55,363 INFO ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Opening socket connection to server nl-hadoop-worker2.local/10.0.2.42:5181. Will attempt to SASL-authenticate using Login Context section 'Client', mechanism MAPR-SECURITY, principal zookeeper/nl-hadoop-worker2.local
2025-01-10 18:02:55,366 INFO ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Socket connection established, initiating session, client: /10.0.2.40:12864, server: nl-hadoop-worker2.local/10.0.2.42:5181
2025-01-10 18:02:55,379 INFO ClientCnxn [main-SendThread(nl-hadoop-worker2.local:5181)]: Session establishment complete on server nl-hadoop-worker2.local/10.0.2.42:5181, sessionid = 0x20000819cc700a7, negotiated timeout = 30000
2025-01-10 18:02:55,381 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
2025-01-10 18:02:55,395 INFO CLDBServer [main-EventThread]: onZKConnect: The CLDB has successfully connected to the ZooKeeper server State:CONNECTED Timeout:30000 sessionid:0x20000819cc700a7 local:/10.0.2.40:12864 remoteserver:nl-hadoop-worker2.local/10.0.2.42:5181 lastZxid:0 xid:3 sent:1 recv:3 queuedpkts:0 pendingresp:0 queuedevents:1 in the ZooKeeper ensemble with connection string nl-hadoop-master.local:5181,nl-hadoop-worker1.local:5181,nl-hadoop-worker2.local:5181
2025-01-10 18:02:55,407 INFO TierGatewayHandler [main]: Init TierGatewayHandler
2025-01-10 18:02:55,447 INFO ZooKeeperClient [main-EventThread]: Setting Cldb Info in ZooKeeper, external Port:7222
2025-01-10 18:02:55,457 INFO ECTierManager [main]: Subscribed for EC gateway registration notifications. Current gateways...
2025-01-10 18:02:55,467 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
2025-01-10 18:02:55,470 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2025-01-10 18:02:56,858 INFO CLDB [main]: CLDBState: CLDB State change : WAIT_FOR_FILESERVERS
2025-01-10 18:02:56,861 INFO CLDBWatchdog [main]: CLDB memory threshold(heap + non heap) is set to : 8096 MB. Xmx: 4000, Configured Non-Heap: 4096
2025-01-10 18:02:56,862 INFO CLDB [main]: [Starting RPCServer] port: 7222 num threads: 10 heap size: 4000MB IPGutsShm 32768 startup options: -Xms2400m -Xmx4000m -XX:ErrorFile=/opt/cores/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/cores -XX:ThreadStackSize=1024
2025-01-10 18:02:56,862 INFO CLDB [main]: Starting 2 RPC Instances for CLDB
2025-01-10 18:02:56,863 ERROR CLDB [main]: Exception in RPC init
2025-01-10 18:02:56,863 ERROR CLDB [main]: Could not initialize RPC.. aborting
2025-01-10 18:03:08,889 INFO ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:03:09,454 INFO ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDeleted
2025-01-10 18:03:09,455 INFO ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb. Event state: SyncConnected. Event type: NodeChildrenChanged
2025-01-10 18:03:09,561 INFO ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services_config/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDataChanged
2025-01-10 18:03:09,706 INFO ClusterGroup [Thread-7]: periodicPull: Found change in Ips for cluster hadoop.cluster.local. Old APiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local newApiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local Old CLDBs: nl-hadoop-master.local:7222 New CLDBs:
2025-01-10 18:03:12,212 INFO Alarms [RPC-4]: Adding NODE_ALARM_SERVICE_CLDB_DOWN into AlarmNameToHashMap
2025-01-10 18:03:12,212 WARN Alarms [RPC-4]: composeEmailMessage: Alarm raised: NODE_ALARM_SERVICE_CLDB_DOWN:nl-hadoop-master.local:NODE_ALARM; Cluster: hadoop.cluster.local; Can not determine if service: cldb is running. Check logs at: /opt/mapr/logs/cldb.log
2025-01-10 18:03:13,203 ERROR EmailManager [EmailManager]: EmailManager: Failed to send email for alarm: NODE_ALARM_SERVICE_CLDB_DOWN:nl-hadoop-master.local:NODE_ALARM, Error: Sending the email to the following server failed : smtp.gmail.com:587
2025-01-10 18:04:05,990 INFO LabelBasedAllocator [RPC-2]: ContainerAssign ignore Writer 10.0.2.40:0 for volume, mapr.resourcemanager.volume with max size, 32768 and with 85 containers
2025-01-10 18:04:11,312 ERROR S3VolumeCache [ACR-6]: updateBucketCount Account: 0 doesn't belong to S3 Volume cache
2025-01-10 18:05:08,894 INFO ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:05:23,448 ERROR S3VolumeCache [ACR-4]: updateBucketCount Account: 0 doesn't belong to S3 Volume cache
2025-01-10 18:07:08,900 INFO ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-10 18:08:08,853 INFO DiskBalancer [DBal]: Balancing SPs in the average and below-average bins
2025-01-10 18:08:08,856 INFO MutableContainerInfo [RBal]: Container ID:2237 vol:188767170 Servers: 10.0.2.45 10.0.2.40 10.0.2.43-R Epoch:304 Ctx [Switch Roles] (I<->T) Cid(D): 2237 Src SP: 89e15385f480bf800065aa4a2d0305de Dst SP: 75aa600482d668cb0065685e910523fe
2025-01-10 18:08:08,857 INFO RoleBalancer [RBal]: [StoragePool: 75aa600482d668cb0065685e910523fe, IP: 10.0.2.43-] Increased Tails by 1246 MB
2025-01-10 18:08:08,871 INFO DiskBalancer [DBal]: Container ID:2648 vol:158847011 Servers: 10.0.2.41 10.0.2.45 10.0.2.44 10.0.2.42-2-R Epoch:38 Ctx ClusterAvg 62 moving container of size 29942 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 0] to sp 5ae035730a77f9c6006565d53106d277 on fs 10.0.2.42 binIndex: 0 [ % 18 u 602336 c 3349964 in 29942 out 0]
2025-01-10 18:08:08,889 INFO DiskBalancer [DBal]: Container ID:2654 vol:158847011 Servers: 10.0.2.45 10.0.2.43 10.0.2.41 10.0.2.42-2-R Epoch:33 Ctx ClusterAvg 62 moving container of size 31220 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 29942] to sp e857a79128c235da0067515e8900f259 on fs 10.0.2.42 binIndex: 0 [ % 17 u 584929 c 3450316 in 31220 out 0]
2025-01-10 18:08:08,894 INFO DiskBalancer [DBal]: Container ID:2603 vol:158847011 Servers: 10.0.2.40 10.0.2.41 10.0.2.47 10.0.2.42-2-R Epoch:26 Ctx ClusterAvg 62 moving container of size 27344 MB from sp d19d39b73aa48290006564c24e08b78e on fs 10.0.2.41 binIndex: 3 [ % 71 u 3628981 c 5051852 in 0 out 61162] to sp 5ae035730a77f9c6006565d53106d277 on fs 10.0.2.42 binIndex: 0 [ % 19 u 602336 c 3349964 in 57286 out 0]
2025-01-10 18:08:08,894 INFO DiskBalancer [DBal]: Moved 3 containers from sp d19d39b73aa48290006564c24e08b78e
2025-01-10 18:08:10,872 INFO MutableContainerInfo [ACR-6]: Container ID:2237 vol:188767170 Servers: 10.0.2.45 10.0.2.40 10.0.2.43 Epoch:304 Ctx 10.0.2.43 resync response
2025-01-10 18:08:23,861 INFO MutableContainerInfo [RBal]: Container ID:2659 vol:188767170 Servers: 10.0.2.47 10.0.2.44 10.0.2.45-R Epoch:176 Ctx [Switch Roles] (I<->T) Cid(D): 2659 Src
Core files are also present on worker2:
ls -la /opt/cores
total 8
drwxrwxrwt 2 root root 4096 Jan 10 15:07 .
drwxr-xr-x 8 root root 4096 Nov 28 2023 ..
-rw-r--r-- 1 root root 0 Jan 10 15:07 loopbacknfs.core.1382.nl-hadoop-worker2.local
The MASTER_MISSING messages and this line:
ClusterGroup [Thread-7]: periodicPull: Found change in Ips for cluster hadoop.cluster.local. Old APiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local newApiips: nl-hadoop-worker1.local, nl-hadoop-worker3.local, nl-hadoop-worker2.local Old CLDBs: nl-hadoop-master.local:7222 New CLDBs:
don't look good, and neither does this:
2025-01-10 18:03:09,454 INFO ZKDataRetrieval [cAdmin-1-EventThread]: Process path: /services/cldb/nl-hadoop-master.local. Event state: SyncConnected. Event type: NodeDeleted
01-12-2025 09:38 PM - edited 01-12-2025 09:40 PM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
Possible Causes:
- Data Corruption: running fsck -r on ZooKeeper or CLDB data directories might have caused data corruption, leading to the initialization failure.
- Other Issues: There could be other underlying problems with the CLDB service itself that are unrelated to fsck -r.
Could you please share the information below?
Also, can you check for any disk-related issues?
systemctl status mapr-zookeeper.service
maprcli dump cldbstate
/opt/mapr/initscripts/zookeeper qstatus
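If it helps, something like the loop below can collect these from all three ZooKeeper nodes in one pass (a sketch; host names are taken from the connection string above and passwordless ssh is assumed):
for h in nl-hadoop-master.local nl-hadoop-worker1.local nl-hadoop-worker2.local; do
  echo "== $h =="
  ssh "$h" 'systemctl status mapr-zookeeper.service --no-pager; /opt/mapr/initscripts/zookeeper qstatus'
done
maprcli dump cldbstate   # run once from any cluster node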
01-13-2025 12:49 AM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
Can you share the output of the commands below:
1. maprcli dump containerinfo -ids 1 -json
2. /opt/mapr/server/mrconfig sp list -v (from the problematic node)
01-13-2025 01:31 AM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
systemctl status mapr-zookeeper.service (nl-hadoop-worker2)
systemctl status mapr-zookeeper.service
● mapr-zookeeper.service - MapR Technologies, Inc. zookeeper service
Loaded: loaded (/etc/systemd/system/mapr-zookeeper.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2025-01
cldb state (nl-hadoop-worker2)
maprcli dump cldbstate
mode s3Info ip state stateDuration desc
MASTER_READ_WRITE ... 10.0.2.40 CLDB_IS_MASTER_READ_WRITE 65:27:10 kvstore tables loading complete, cldb running as master
qstatus (nl-hadoop-worker2)
/opt/mapr/initscripts/zookeeper qstatus
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Mode: follower
qstatus (nl-hadoop-worker1)
/opt/mapr/initscripts/zookeeper qstatus
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Mode: follower
maprcli dump containerinfo -ids 1 -json
{
"timestamp":1736759573667,
"timeofday":"2025-01-13 11:12:53.667 GMT+0200 AM",
"status":"OK",
"total":1,
"data":[
{
"ContainerId":1,
"Epoch":29,
"mirrorCid":0,
"Primary":"10.0.2.40:5660--29-VALID",
"ActiveServers":{
"IP":[
"10.0.2.40:5660--29-VALID",
"10.0.2.46:5660--29-VALID",
"10.0.2.44:5660--29-VALID",
"10.0.2.43:5660--29-VALID",
"10.0.2.41:5660--29-VALID",
"10.0.2.47:5660--29-VALID"
]
},
"InactiveServers":{
},
"UnusedServers":{
},
"OwnedSizeMB":"0 MB",
"SharedSizeMB":"0 MB",
"LogicalSizeMB":"0 MB",
"TotalSizeMB":"0 MB",
"NameContainer":"true",
"ContainerType":"NameSpaceContainer",
"ReplicationType":"STAR",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":false,
"VolumeName":"mapr.cldb.internal",
"VolumeId":1,
"VolumeReplication":6,
"NameSpaceReplication":6,
"VolumeMounted":false,
"AccessTime":"September 26, 2023",
"TopologyViolated":"false",
"LabelViolated":"false"
}
]
}
/opt/mapr/server/mrconfig sp list -v
ListSPs resp: status 0:3
No. of SPs (3), totalsize 7522381 MiB, totalfree 5202621 MiB
SP 0: name SP3, Online, size 3349964 MiB, free 2344583 MiB, path /dev/sdb, log 200 MiB, port 5660, guid 5ae035730a77f9c6006565d53106d277, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sdb /dev/sdc, dare 0, label default:0
SP 1: name SP4, Online, size 722099 MiB, free 404698 MiB, path /dev/sdd, log 200 MiB, port 5660, guid 1fa0c87596603b29006568294006f1a4, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sdd, dare 0, label default:0
SP 2: name SP6, Online, size 3450316 MiB, free 2453339 MiB, path /dev/sde, log 200 MiB, port 5660, guid e857a79128c235da0067515e8900f259, clusterUuid -5972596296548369882-9071595250348215226, disks /dev/sde, dare 0, label default:0
01-13-2025 03:27 AM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
@filip_novak Everything looks good in the outputs you have shared. You have not shared the output from the 3rd ZooKeeper node, so I am assuming that one is the leader.
01-13-2025 04:53 AM
Solution
@ParvYadav Thanks!
Yes, the 3rd ZooKeeper is the leader.
I checked /opt/mapr/pid and noticed that cldb.pid was owned by root, and the process listed in cldb.pid wasn’t running. So, I stopped the warden and zookeeper, deleted all PID files, and restarted the master node. Now, the CLDB has started and appears to be working.
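In case it helps someone else, the sequence was roughly the following (a sketch, assuming the standard mapr-warden and mapr-zookeeper systemd units):
ls -la /opt/mapr/pid/                      # cldb.pid was owned by root and pointed at a dead process
sudo systemctl stop mapr-warden mapr-zookeeper
sudo rm -f /opt/mapr/pid/*.pid
sudo reboot                                # or start mapr-zookeeper and then mapr-warden again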
How can I verify that the cluster is in a healthy state?
Are these log entries normal?
2025-01-13 14:38:44,590 INFO ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-13 14:40:44,597 INFO ReplicationHandlerThread [Repl]: <MASTER_MISSING> P=576; F=576; QS=72;
2025-01-13 14:42:42,816 INFO ErrorNotificationHandler [RPC-3]: [container failure report] cid: 3024 reporting fs: 10.0.2.41 sp: 9bbd5c33ac74cc84006568293909f4f0 failing server: 10.0.2.47
2025-01-13 14:42:42,819 INFO MutableContainerInfo [RPC-3]: Container ID:3024 vol:158847011 Servers: 10.0.2.40 10.0.2.41 10.0.2.42-C Epoch:26
01-13-2025 09:22 AM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
@filip_novak You can check the cluster health from the MCS. You can also run the command below to check it through the CLI:
maprcli node list -columns svc,csvc,healthDesc,id
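To spot only the nodes with problems at a glance, that output can also be filtered, or the active alarms listed directly (a sketch):
maprcli node list -columns svc,csvc,healthDesc,id | grep -vi healthy
maprcli alarm list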
Regarding the log lines that you have shared:
They report an issue for container 3024, which has nothing to do with the overall cluster health. To deal with this issue, please check:
maprcli dump containerinfo -ids 3024 -json
and check whether the container has a valid master copy. If the master is missing, we need to investigate why it went missing and try to bring it back if possible. If we cannot retrieve it, we can force-master the other replica with the highest epoch.
NOTE: If the other replica copy has a lower epoch than the previous master copy, some data loss is to be expected.
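For example, to pull just the primary, the epoch, and the replica list (each entry carries its own epoch suffix) out of the dump, something like this works (a sketch using jq, based on the JSON fields shown earlier in the thread):
maprcli dump containerinfo -ids 3024 -json | jq '.data[0] | {Epoch, Primary, Replicas: .ActiveServers.IP}'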
01-14-2025 01:47 AM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
The s3server service is down on nl-hadoop-worker2, but we are not using S3, so that's OK.
hostname service ip configuredservice healthDesc id
nl-hadoop-master.local hs2,cldb,mastgateway,hcat,hbasethrift,spark-thriftserver,hoststats,collectd,fluentd,hivemeta,resourcemanager,fileserver 10.0.2.40 hs2,cldb,mastgateway,hcat,hbasethrift,s3server,spark-thriftserver,hoststats,collectd,fluentd,hivemeta,resourcemanager,fileserver,loopbacknfs One or more alarms raised 4999696247408828448
nl-hadoop-worker1.local httpfs,fileserver,mastgateway,nodemanager,spark-historyserver,hoststats,collectd,fluentd,grafana,historyserver,gateway,apiserver 10.0.2.41 httpfs,fileserver,mastgateway,nodemanager,s3server,spark-historyserver,hoststats,collectd,fluentd,grafana,historyserver,gateway,apiserver Healthy 4950730202321169056
nl-hadoop-worker2.local fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,grafana,gateway,apiserver 10.0.2.42 httpfs,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,grafana,gateway,apiserver One or more services is down 5027436551618226432
nl-hadoop-worker3.local fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,livy,hbaserest,grafana,gateway,apiserver 10.0.2.43 httpfs,fileserver,mastgateway,nodemanager,s3server,zeppelin,hoststats,collectd,fluentd,livy,hbaserest,grafana,gateway,apiserver Healthy 2470949201523985696
nl-hadoop-worker4.local elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb 10.0.2.44 elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb Healthy 139650559888138400
nl-hadoop-worker5.local elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb 10.0.2.45 elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb Healthy 8245262840894133376
nl-hadoop-worker6.local elasticsearch,fileserver,mastgateway,nodemanager,hoststats,collectd,fluentd,opentsdb 10.0.2.46 elasticsearch,fileserver,mastgateway,nodemanager,s3server,hoststats,collectd,fluentd,opentsdb Healthy 600409955788436096
nl-hadoop-worker7.local fileserver,mastgateway,nodemanager,kibana,hoststats,collectd,fluentd,hue 10.0.2.47 fileserver,mastgateway,nodemanager,kibana,drill-bits,s3server,hoststats,collectd,fluentd,hue Healthy 4582742894302409344
maprcli dump containerinfo -ids 3024 -json
{
"timestamp":1736847749615,
"timeofday":"2025-01-14 11:42:29.615 GMT+0200 AM",
"status":"OK",
"total":1,
"data":[
{
"ContainerId":3024,
"Epoch":26,
"mirrorCid":0,
"Primary":"10.0.2.40:5660--26-VALID",
"ActiveServers":{
"IP":[
"10.0.2.40:5660--26-VALID",
"10.0.2.41:5660--26-VALID",
"10.0.2.42:5660--26-VALID"
]
},
"InactiveServers":{
},
"UnusedServers":{
},
"OwnedSizeMB":"30.16 GB",
"SharedSizeMB":"0 MB",
"LogicalSizeMB":"30.26 GB",
"TotalSizeMB":"30.16 GB",
"NumInodesInUse":9012,
"Mtime":"January 14, 2025",
"NameContainer":"false",
"ContainerType":"DataContainer",
"ReplicationType":"CASCADE",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":true,
"VolumeName":"users",
"VolumeId":158847011,
"VolumeReplication":3,
"NameSpaceReplication":3,
"VolumeMounted":true,
"AccessTime":"January 14, 2025",
"TopologyViolated":"false",
"LabelViolated":"false"
}
]
}
It seems to be OK. I appreciate your help!
01-15-2025 11:14 PM
Re: Ezmeral Data Fabric v7.3 Customer Managed: Zookeeper and CLDB problems after fsck -r
Hello @filip_novak,
That's awesome!
We are delighted to hear the issue has been resolved, and we appreciate you keeping us updated.