06-05-2024 12:11 AM
Unable to connect to any of the cluster's CLDBs
Hi,
I have an EDF 7.3 cluster with 8 nodes. The CLDB service was running on the first 3 nodes. Now I am getting the errors below when I try to initialize a MapR ticket and when I access HDFS.
[mapr@node ~]$ maprlogin password
[Password for user 'mapr' at cluster 'cluster.ezm.tst': ]
Unable to connect to any of the cluster's CLDBs. CLDBs tried: node2.ezm.tst:7443, node3.ezm.tst:7443, node1.ezm.tst:7443. Please check your cluster configuration.
[mapr@node ~]$ hdfs dfs -ls
ls: Could not create FileClient err: 104
I have checked the cldb.log and cldb.out files; their contents are below, respectively.
java.lang.Exception: Username in ticket file doesn't match with cluster owner
at com.mapr.fs.cldb.CLDBServer.initSecurity(CLDBServer.java:1605)
at com.mapr.fs.cldb.CLDBServer.<init>(CLDBServer.java:523)
at com.mapr.fs.cldb.CLDBServerHolder.getInstance(CLDBServerHolder.java:24)
at com.mapr.fs.cldb.CLDB.<init>(CLDB.java:76)
at com.mapr.fs.cldb.CLDB.main(CLDB.java:411)
2024-06-05 12:05:37,5476 :1596 Obtained CLDB key from PKCS#11 file store
CLDBJNI: Initializing cldb jni with memory 838860800 estContainerSize:144 maxContainersInCache:5825422 mapr-version: $Id: mapr-version: 7.3.0.0.20230425002320.GA 35c1bacac83b999156e2572f2619da84fe2e225e $
fs/common/daremgr.cc:189: HSM enabled, but DARE key not found on HSM. Check log for details
I am using the same mapr user to initialize the ticket that I used while creating the setup. Can anyone please help me bring up the CLDB service?
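In case it helps narrow this down, I believe the checks relevant to the "Username in ticket file doesn't match with cluster owner" message are something like the following (assuming the default secure-cluster file locations; please correct me if I should look elsewhere):
# which user owns the server ticket the CLDB daemon reads
maprlogin print -ticketfile /opt/mapr/conf/maprserverticket
# the configured cluster-owner (daemon) user
cat /opt/mapr/conf/daemon.conf
# UID/GID of the cluster user on this node
id mapr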
06-05-2024 01:17 AM
Re: Unable to connect to any of the cluster's CLDBs
Hello,
Let's check cluster health.
1. Are the SPs (storage pools) online on all the nodes?
/opt/mapr/server/mrconfig sp list -v
2. Is the cluster user present on all the nodes?
grep CMD /opt/mapr/logs/configure.log
3. How many nodes are in the cluster?
4. Were any configuration changes made recently? (See the sketch below for items 3 and 4.)
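For items 3 and 4, a rough sketch of what to run (adjust as needed for your environment):
# node count and the services on each node, if the CLDB responds
maprcli node list -columns hostname,svc
# recently modified configuration files and recent configure.sh runs
ls -lt /opt/mapr/conf | head
tail -n 20 /opt/mapr/logs/configure.log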
I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

06-05-2024 10:22 PM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
1. Only one node shows the message below:
[mapr@node1 ~]$ /opt/mapr/server/mrconfig sp list -v
ListSPs resp: status 0:1
No. of SPs (1), totalsize 290266 MiB, totalfree 288052 MiB
SP 0: name SP1, Online, size 290266 MiB, free 288052 MiB, path /dev/sdc, log 200 MiB, port 5660, guid c015e6bc572b86630065800c66090a38, clusterUuid -5881150811843788447--6420490077672223050, disks /dev/sdc /dev/sdb /dev/sdd, dare 0, label default:0
All other nodes show:
[mapr@node8 ~]$ /opt/mapr/server/mrconfig sp list -v
2024-06-06 10:00:10,1077 ERROR Global mrconfig.cc:782 ListSPs rpc failed Connection reset by peer.(104).
2024-06-06 10:00:10,1078 ERROR Global mrconfig.cc:10539 ProcessSPList failed Connection reset by peer.(104).
2. Yes, the cluster user (mapr) is present on all nodes.
[mapr@node8 ~]$ grep CMD /opt/mapr/logs/configure.log
2023-12-18 15:20:31.46 node8.ezm.tst configure.sh(14967) Install main:4180 CMD: /opt/mapr/server/configure.sh -N cluster.ezm.tst -u mapr -g mapr -f -no-autostart -on-prompt-cont y -secure -v -no-autostart -HS node4.ezm.tst -OT node5.ezm.tst,node6.ezm.tst,node7.ezm.tst -C node1.ezm.tst,node2.ezm.tst,node3.ezm.tst -Z node1.ezm.tst,node2.ezm.tst,node3.ezm.tst -EC -hiveMetastoreHost node4.ezm.tst
2023-12-18 15:34:06.796 node8.ezm.tst configure.sh(31059) Install main:4180 CMD: /opt/mapr/server/configure.sh -R -v -no-autostart -HS node4.ezm.tst -OT node5.ezm.tst,node6.ezm.tst,node7.ezm.tst -EPcollectd -all -EC -hiveMetastoreHost node4.ezm.tst
2023-12-27 09:34:39.842 node8.ezm.tst configure.sh(27919) Install main:4180 CMD: /opt/mapr/server/configure.sh --noRecalcMem -R
2023-12-27 09:45:55.252 node8.ezm.tst configure.sh(8256) Install main:4180 CMD: /opt/mapr/server/configure.sh -R -v -no-autostart -HS node4.ezm.tst -OT node5.ezm.tst,node6.ezm.tst,node7.ezm.tst -EPcollectd -all -EC -hiveMetastoreHost node4.ezm.tst
3. The cluster contains 8 nodes in total.
4. No configuration changes were made recently.
06-05-2024 11:02 PM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
Hi
Can you share the output of the commands below?
grep "ZK-Connect" /opt/mapr/logs/cldb.log
grep "FATAL" /opt/mapr/logs/cldb.log
Also, please confirm that the mapr user is present on all the nodes and whether the UID of the mapr user has changed.
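For example, run these on each node and compare the output across the cluster:
id mapr
getent passwd mapr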
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

06-05-2024 11:30 PM
Re: Unable to connect to any of the cluster's CLDBs
Hello,
If this is the output of "/opt/mapr/server/mrconfig sp list -v" on all 7 of those nodes, then there is a high chance that CID 1 (the primary container) is down, because the SPs are down on those 7 nodes.
And in all other nodes, it is showing as
[mapr@node8 ~]$ /opt/mapr/server/mrconfig sp list -v
2024-06-06 10:00:10,1077 ERROR Global mrconfig.cc:782 ListSPs rpc failed Connection reset by peer.(104).
2024-06-06 10:00:10,1078 ERROR Global mrconfig.cc:10539 ProcessSPList failed Connection reset by peer.(104).
Request: could you please share the output of the commands below as well, along with the details already requested, so we can check this further?
Run the commands below on all the CLDB nodes:
#maprcli dump cldbstate
#/opt/mapr/server/mrconfig info dumpcontainers | grep "cid:1"
And the following on all the ZooKeeper nodes:
#/opt/mapr/initscripts/zookeeper qstatus
#/opt/mapr/initscripts/zookeeper status
Note: what is the status of Warden and ZooKeeper on all 7 nodes?
#systemctl status mapr-zookeeper
#systemctl status mapr-warden
Thanks,
I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

06-06-2024 12:01 AM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
[mapr@node1 ~]$ grep "ZK-Connect" /opt/mapr/logs/cldb.log
2023-12-18 14:41:16,343 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-18 14:41:36,345 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient : No KvStore Epoch info found in ZooKeeper. New Installation, becoming Master
2023-12-18 14:41:36,345 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_MASTER_LOCK
2023-12-18 14:41:36,349 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: CLDB is current Master
2023-12-18 14:41:36,349 INFO ZooKeeperClient [ZK-Connect]: CLDB became master. Creating new KvStoreContainer with no fileservers for cid: 1
2023-12-18 14:41:36,352 INFO ZooKeeperClient [ZK-Connect]: Storing KvStoreContainerInfo to ZooKeeper Container ID:1 Servers: Inactive: Unused: Epoch:3 SizeMB:0 CType:NameSpaceContainer
2023-12-18 14:41:36,366 INFO ZooKeeperClient [ZK-Connect]: CLDB became master. Initializing KvStoreContainer for cid: 1
2023-12-18 14:41:36,369 INFO ZooKeeperClient [ZK-Connect]: becomeMasterForKvStoreContainer: CID 1 servers info Container ID:1 Servers: Inactive: Unused: Epoch:3 SizeMB:0 CType:NameSpaceContainer
2023-12-18 14:41:36,369 INFO ZooKeeperClient [ZK-Connect]: Storing KvStoreContainerInfo to ZooKeeper Container ID:1 Servers: Inactive: Unused: Epoch:3 SizeMB:0 CType:NameSpaceContainer
2023-12-18 14:41:36,371 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from INITIALIZE to MASTER_REGISTER_READY
2023-12-18 14:41:36,371 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to AWAITING_FS_REGISTER
2023-12-18 14:41:36,371 INFO CLDBServer [ZK-Connect]: Starting thread to monitor waiting for local kvstore to become master
2023-12-18 15:36:03,285 ERROR CLDB [main-EventThread]: Thread: ZK-Connect ID: 21
2023-12-18 15:37:45,957 INFO CLDBServer [ZK-Connect]: tryBecomeMaster: Waiting for cldb init to complete.
2023-12-18 15:37:48,961 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-18 15:37:48,962 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_MASTER_LOCK
2023-12-18 15:37:48,964 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-18 15:37:48,964 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-18 15:37:48,965 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-18 15:37:48,965 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from INITIALIZE to BECOMING_SLAVE
2023-12-18 15:37:48,965 INFO CLDBServer [ZK-Connect]: Starting thread to become slave CLDB
2023-12-18 15:38:15,252 ERROR CLDB [Becoming Slave Thread]: Thread: ZK-Connect ID: 21
2023-12-18 15:38:47,791 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-18 15:39:07,796 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-18 15:39:07,798 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_MASTER_LOCK
2023-12-18 15:39:07,801 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-18 15:39:07,802 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-18 15:39:07,802 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-18 15:39:07,802 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from INITIALIZE to BECOMING_SLAVE
2023-12-18 15:39:07,802 INFO CLDBServer [ZK-Connect]: Starting thread to become slave CLDB
2023-12-18 15:39:14,431 ERROR CLDB [Becoming Slave Thread]: Thread: ZK-Connect ID: 21
2023-12-18 15:39:46,963 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-18 15:40:06,968 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-18 15:40:06,968 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_MASTER_LOCK
2023-12-18 15:40:06,971 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-18 15:40:06,971 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-18 15:40:06,971 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-18 15:40:06,971 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from INITIALIZE to BECOMING_SLAVE
2023-12-18 15:40:06,972 INFO CLDBServer [ZK-Connect]: Starting thread to become slave CLDB
2023-12-18 15:40:13,539 ERROR CLDB [Becoming Slave Thread]: Thread: ZK-Connect ID: 21
2023-12-18 15:51:20,547 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-18 15:51:40,551 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-18 15:51:40,552 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_MASTER_LOCK
2023-12-18 15:51:40,557 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-18 15:51:40,557 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-18 15:51:40,557 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-18 15:51:40,558 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from INITIALIZE to BECOMING_SLAVE
2023-12-18 15:51:40,558 INFO CLDBServer [ZK-Connect]: Starting thread to become slave CLDB
2023-12-27 09:01:05,202 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-27 09:01:25,204 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-27 09:01:25,204 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from CLDB_IS_SLAVE_READ_ONLY to AWAITING_MASTER_LOCK
2023-12-27 09:01:25,208 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-27 09:01:25,208 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-27 09:01:25,208 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-27 09:01:25,208 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to CLDB_IS_SLAVE_READ_ONLY
2023-12-27 09:01:34,934 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-27 09:01:54,935 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-27 09:01:54,935 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from CLDB_IS_SLAVE_READ_ONLY to AWAITING_MASTER_LOCK
2023-12-27 09:01:54,938 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-27 09:01:54,938 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-27 09:01:54,939 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-27 09:01:54,939 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to CLDB_IS_SLAVE_READ_ONLY
2023-12-29 11:07:26,420 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-29 11:07:46,421 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-29 11:07:46,421 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from CLDB_IS_SLAVE_READ_ONLY to AWAITING_MASTER_LOCK
2023-12-29 11:07:46,422 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-29 11:07:46,423 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-29 11:07:46,423 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-29 11:07:46,423 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to CLDB_IS_SLAVE_READ_ONLY
2023-12-29 17:06:36,196 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-29 17:06:56,197 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-29 17:06:56,197 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from CLDB_IS_SLAVE_READ_ONLY to AWAITING_MASTER_LOCK
2023-12-29 17:06:56,202 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient createActiveEphemeralMasterZNode: /datacenter/controlnodes/cldb/active/CLDBMaster already exists
2023-12-29 17:06:56,202 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: Some other CLDB become master. Current CLDB is Slave
2023-12-29 17:06:56,203 INFO ZooKeeperClient [ZK-Connect]: CLDB got role of slave
2023-12-29 17:06:56,203 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to CLDB_IS_SLAVE_READ_ONLY
2023-12-29 17:07:06,937 INFO CLDBServer [ZK-Connect]: Previous CLDB was not a clean shutdown waiting for 20000ms before attempting to become master
2023-12-29 17:07:26,940 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
2023-12-29 17:07:26,941 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from CLDB_IS_SLAVE_READ_ONLY to AWAITING_MASTER_LOCK
2023-12-29 17:07:26,949 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: CLDB is current Master
2023-12-29 17:07:26,949 INFO ZooKeeperClient [ZK-Connect]: CLDB became master. Initializing KvStoreContainer for cid: 1
2023-12-29 17:07:26,952 INFO ZooKeeperClient [ZK-Connect]: becomeMasterForKvStoreContainer: CID 1 servers info Container ID:1 Master: 10.1.84.67-5(3162807577909630400) SPGUID:8db905e34208aef00065800c660aea86 Servers: 10.1.84.67-5(3162807577909630400) SPGUID:8db905e34208aef00065800c660aea86 10.1.84.66-5(1017878648718736928) SPGUID:c015e6bc572b86630065800c66090a38 10.1.84.68-5(4530750987536816096) SPGUID:2d9f890c872085ea0065800c66058d47 Inactive: Unused: Epoch:5 SizeMB:0 CType:NameSpaceContainer
2023-12-29 17:07:26,953 INFO ZooKeeperClient [ZK-Connect]: Storing KvStoreContainerInfo to ZooKeeper Container ID:1 Servers: 10.1.84.66-5(1017878648718736928) SPGUID:c015e6bc572b86630065800c66090a38 Inactive: 10.1.84.67-5(3162807577909630400) SPGUID:8db905e34208aef00065800c660aea86 10.1.84.68-5(4530750987536816096) SPGUID:2d9f890c872085ea0065800c66058d47 Unused: Epoch:5 SizeMB:0 CType:NameSpaceContainer
2023-12-29 17:07:26,955 INFO CLDBConfiguration [ZK-Connect]: cldb mode changed from SLAVE_READ_ONLY to MASTER_REGISTER_READY
2023-12-29 17:07:26,955 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_MASTER_LOCK to AWAITING_FS_REGISTER
2023-12-29 17:07:26,955 INFO CLDBServer [ZK-Connect]: Starting thread to monitor waiting for local kvstore to become master
2024-01-09 04:30:16,049 INFO CLDBServer [ZK-Connect]: This CLDB is not currently connected to ZooKeeper. It will try to reestablish a connection to the ZooKeeper ensemble for up to 10000 milliseconds before giving up and shutting down
2024-01-23 11:43:52,998 INFO CLDBServer [ZK-Connect]: This CLDB is not currently connected to ZooKeeper. It will try to reestablish a connection to the ZooKeeper ensemble for up to 10000 milliseconds before giving up and shutting down
2024-01-23 11:43:53,107 ERROR CLDB [main-EventThread]: Thread: ZK-Connect ID: 21
2024-06-05 12:05:39,681 INFO CLDBServer [ZK-Connect]: tryBecomeMaster: Waiting for cldb init to complete.
2024-06-05 12:05:42,685 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore does not have epoch entry CLDB trying to wait until it is Ready
2024-06-05 12:05:42,686 INFO CldbDiagnostics [ZK-Connect]: cldbState changed from AWAITING_ZK_CONNECT to AWAITING_CID1_EPOCH
2024-06-05 12:05:45,702 INFO ZooKeeperClient [ZK-Connect]: Waiting for local KvStoreContainer to become valid. KvStore ContainerInfo Container ID:1 Servers: 10.1.84.67-75508(3162807577909630400) SPGUID:8db905e34208aef00065800c660aea86 Inactive: 10.1.84.73-75508(5333387396529766432) SPGUID:aa1fb11bae23e61d006580162b0392d6 10.1.84.68-75508(4530750987536816096) SPGUID:2d9f890c872085ea0065800c66058d47 Unused: Epoch:75508 SizeMB:0 CType:NameSpaceContainer CLDB ServerID : 1017878648718736928
2024-06-05 12:07:58,229 WARN ZooKeeperClient [ZK-Connect]: ZooKeeperClient : KvStoreContainerInfo read received connection loss exception. Sleeping for 30 Number of retry left 1
[mapr@node1 ~]$ grep "FATAL" /opt/mapr/logs/cldb.log
2023-12-18 15:36:03,258 FATAL CLDB [main-EventThread]: CLDBShutdown: This CLDB will shutdown now because it was holding the master CLDB lock and received notification from the ZooKeeper ensemble that the lock was deleted
2023-12-18 15:38:15,224 FATAL BecomeSlaveThread [Becoming Slave Thread]: license not found for CLDB HA: shutting down
2023-12-18 15:38:15,224 FATAL CLDB [Becoming Slave Thread]: CLDBShutdown: license not found for CLDB HA: shutting down
2023-12-18 15:39:14,412 FATAL BecomeSlaveThread [Becoming Slave Thread]: license not found for CLDB HA: shutting down
2023-12-18 15:39:14,413 FATAL CLDB [Becoming Slave Thread]: CLDBShutdown: license not found for CLDB HA: shutting down
2023-12-18 15:40:13,533 FATAL BecomeSlaveThread [Becoming Slave Thread]: license not found for CLDB HA: shutting down
2023-12-18 15:40:13,534 FATAL CLDB [Becoming Slave Thread]: CLDBShutdown: license not found for CLDB HA: shutting down
2024-01-23 11:43:53,090 FATAL CLDB [main-EventThread]: CLDBShutdown: This CLDB will shutdown now because it was holding the master CLDB lock and received notification from the ZooKeeper ensemble that the lock was deleted
Yes, the mapr user is present on all the nodes and there has been no UID change.
06-06-2024 12:19 AM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
Hi
Thanks for sharing the logs. From the recent entries I can see that container 1 does not yet have a valid copy, so the CLDB is waiting for CID 1 to become valid. Please check the storage pool status on the node, and check /opt/mapr/logs/mfs.log-3 for any errors while loading the storage pools (a quick grep for this is sketched at the end of this post).
/opt/mapr/server/mrconfig sp list -v
Also, please share the output of the following command:
maprcli dump cldbstate
And the following from all three CLDB nodes:
/opt/mapr/server/mrconfig info dumpcontainers | grep -w "cid:1"
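A quick way to scan that log (a generic grep; adjust the pattern if needed):
grep -iE "error|fatal" /opt/mapr/logs/mfs.log-3 | tail -n 50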
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

06-06-2024 12:55 AM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
The CLDB runs on the first three nodes (node1 to node3) of the cluster.
1. On node1, I am getting the message below:
[mapr@node1 ~]$ /opt/mapr/server/mrconfig sp list -v
ListSPs resp: status 0:1
No. of SPs (1), totalsize 290266 MiB, totalfree 288052 MiB
SP 0: name SP1, Online, size 290266 MiB, free 288052 MiB, path /dev/sdc, log 200 MiB, port 5660, guid c015e6bc572b86630065800c66090a38, clusterUuid -5881150811843788447--6420490077672223050, disks /dev/sdc /dev/sdb /dev/sdd, dare 0, label default:0
On all other nodes, I am getting the error below:
[mapr@node2 ~]$ /opt/mapr/server/mrconfig sp list -v
2024-06-06 13:18:13,4211 ERROR Global mrconfig.cc:782 ListSPs rpc failed Connection reset by peer.(104).
2024-06-06 13:18:13,4211 ERROR Global mrconfig.cc:10539 ProcessSPList failed Connection reset by peer.(104).
2. I am getting the same message on all the nodes for cldbstate:
[mapr@node2 ~]$ maprcli dump cldbstate
ip error
10.x.x.x Couldn't connect to the CLDB service
10.x.x.x Couldn't connect to the CLDB service
10.x.x.x Couldn't connect to the CLDB service
3. I am getting the message below on node1:
[mapr@node1 ~]$ /opt/mapr/server/mrconfig info dumpcontainers | grep -w "cid:1"
cid:1 volid:1 sp:SP1:/dev/sdc spid:c015e6bc572b86630065800c66090a38 prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:1 querycldb:0 resyncinprog:0 shared:0 owned:0 logical:0 snapusage:0 snapusageupdated:0 ismirror:0 isrwmirrorcapable:0 role:-1 awaitingrole:0 totalInodes:0 freeInodes:0 dare:0 istiered:0 numtotalblocks:0 numpurgedblocks:0 numoffloadedblocks:0 isConStatsEnabled:0 mirrorId:0 rollforwardInProg:0 rollforwardpending:0 maxUniq:0 isResyncSnapshot:0 snapId:0 port:5660
And the error below on node2 and node3:
[mapr@node3 ~]$ /opt/mapr/server/mrconfig info dumpcontainers | grep -w "cid:1"
2024-06-06 13:19:21,0401 ERROR Global mrconfig.cc:4069 RPC to dump containers failed Connection reset by peer.(104).
06-06-2024 02:34 AM - last edited on 09-16-2024 02:18 AM by support_s
Re: Unable to connect to any of the cluster's CLDBs
Hi
Thanks for sharing the details. The storage pool on the first CLDB node seems to be fine, but CID 1 is in a stale state, and on the other two nodes the storage pools are not yet loaded. For the CLDB to start, CID 1 must have a minimum number of valid replicas. Please check whether the mapr-warden service is running on the 2nd and 3rd CLDB nodes; if not, please start it.
systemctl status mapr-warden
systemctl start mapr-warden
Once it started please check the storage status.
/opt/mapr/server/mrconfig sp list -v
If the above command still fails, please check /opt/mapr/logs/mfs.log-3 and share any errors you find.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

06-06-2024 02:55 AM - last edited on 06-06-2024 11:10 PM by Sunitha_Mod
Re: Unable to connect to any of the cluster's CLDBs
@Mahesh_S I started mapr-warden and checked the storage status. It shows the error below:
2024-06-06 15:23:49,1548 ERROR Global mrconfig.cc:782 ListSPs rpc failed Connection reset by peer.(104).
2024-06-06 15:23:49,1548 ERROR Global mrconfig.cc:10539 ProcessSPList failed Connection reset by peer.(104).
Then I checked the /opt/mapr/logs/mfs.log-3 file and noticed that it was last updated two days ago and shows the entries below:
2024-06-04 16:14:00,2027 INFO CLDBHA cldbha.cc:1170 Above message hit 2 times in 1717497825236 ms
2024-06-04 16:14:02,1826 INFO CLDBHA cldbha.cc:1170 RegisterToCldbDone iid 0 regnErr 30 from CLDB 10.x.x.x:7222
2024-06-04 16:14:02,2112 INFO CLDBHA cldbha.cc:1170 RegisterToCldbDone iid 0 regnErr 3 from CLDB 10.x.x.x:7222
2024-06-04 16:14:03,2142 INFO CLDBHA cldbha.cc:1170 RegisterToCldbDone iid 0 regnErr 3 from CLDB 10.x.x.x:7222
2024-06-04 16:14:03,2142 INFO CLDBHA cldbha.cc:1170 Above message hit 1 times in 1717497824236 ms
06-07-2024 12:34 AM
Re: Unable to connect to any of the cluster's CLDBs
Hi
Thanks for sharing the details. It seems MFS has not started yet. Please check /opt/mapr/logs/warden.log for any errors.
Also, please check whether any PID files exist in /opt/mapr/pid. If they do, stop the Warden service and clear all the PID files. Check whether any mapr processes are still running (ps -ef | grep mapr); if so, stop them and restart the mapr-warden service. Once it has started, please monitor warden.log and mfs.log-3.
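A minimal sequence for that cleanup, assuming the default paths (run with root privileges):
systemctl stop mapr-warden
# look for stale PID files
ls -l /opt/mapr/pid
# clear them (assuming they follow the usual *.pid naming)
rm -f /opt/mapr/pid/*.pid
# confirm no leftover mapr processes; stop any that remain
ps -ef | grep mapr
systemctl start mapr-warden
# watch startup progress
tail -f /opt/mapr/logs/warden.log /opt/mapr/logs/mfs.log-3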
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
