HPE Ezmeral Software platform

msaidbilgehan
Advisor

Incremental Installation of Zookeeper at Data Fabric

During an incremental installation on our current cluster, we expanded ZooKeeper from one node to three nodes. The problem below occurred:

Screenshots: Screenshot 2023-08-07 at 15.16.56.png, Screenshot 2023-08-07 at 15.17.01.png, Screenshot 2023-08-07 at 15.10.15.png

All ZK logs from every node (node 1: master; nodes 2 and 3: slaves): Logs

Shishir_Prakash
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

Make sure ports 5181, 2888, and 3888 are open between all three ZooKeeper nodes. Also share your zoo.cfg file from all three ZooKeeper nodes for review.
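A quick connectivity check for those ports can be scripted. This is only a sketch: the node names are placeholders, and it assumes bash's /dev/tcp support and the coreutils timeout command are available on your nodes.

```shell
#!/bin/bash
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  timeout 2 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# Placeholder hostnames -- replace with your three ZooKeeper nodes.
for node in node-1 node-2 node-3; do
  for port in 5181 2888 3888; do
    if check_port "$node" "$port"; then
      echo "OPEN   $node:$port"
    else
      echo "CLOSED $node:$port"
    fi
  done
done
```

Run it from each ZooKeeper node in turn; every line should report OPEN once the ports are reachable.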



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

zoo.cfg files for all nodes: zoo_conf_drive 

I checked the firewall status and the open-port/process status on all nodes; output below:


node 1 (Master - ZK and CLDB):

> mapr@node-1:~$ sudo ufw status verbose
Status: inactive

> mapr@node-1:~$ sudo lsof -t -i:3888
1960

> mapr@node-1:~$ sudo lsof -t -i:2888

> mapr@node-1:~$ sudo lsof -t -i:5181
1960


Nodes 2 and 3 give the same output for the same commands (slaves; the incremental installation fails on them):

> sudo ufw status verbose
Status: inactive

> sudo lsof -t -i:5181


> sudo lsof -t -i:2888


> sudo lsof -t -i:3888

> sudo systemctl status mapr-zookeeper.service
● mapr-zookeeper.service - MapR Technologies, Inc. zookeeper service
Loaded: loaded (/etc/systemd/system/mapr-zookeeper.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2023-08-07 10:09:30 EDT; 16h ago
Process: 3205 ExecStart=/opt/mapr/initscripts/zookeeper start (code=exited, status=0/SUCCESS)
Main PID: 3378 (code=exited, status=1/FAILURE)

Aug 07 10:09:30 node-2.treo.com.tr systemd[1]: mapr-zookeeper.service: Scheduled restart job, restart counter is at 3.
Aug 07 10:09:30 node-2.treo.com.tr systemd[1]: Stopped MapR Technologies, Inc. zookeeper service.
Aug 07 10:09:30 node-2.treo.com.tr systemd[1]: mapr-zookeeper.service: Start request repeated too quickly.
Aug 07 10:09:30 node-2.treo.com.tr systemd[1]: mapr-zookeeper.service: Failed with result 'exit-code'.
Aug 07 10:09:30 node-2.treo.com.tr systemd[1]: Failed to start MapR Technologies, Inc. zookeeper service.

> sudo systemctl start mapr-zookeeper.service

Job for mapr-zookeeper.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status mapr-zookeeper.service" and "journalctl -xe" for details.

> sudo systemctl status mapr-zookeeper.service

● mapr-zookeeper.service - MapR Technologies, Inc. zookeeper service
Loaded: loaded (/etc/systemd/system/mapr-zookeeper.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: protocol) since Tue 2023-08-08 03:08:33 EDT; 1s ago
Process: 122255 ExecStart=/opt/mapr/initscripts/zookeeper start (code=exited, status=0/SUCCESS)


The incremental installation failed, so nodes 2 and 3 have no ZooKeeper running.
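One note on the "Start request repeated too quickly" message in the status output above: systemd stops retrying a unit after several rapid failures, so even once the underlying problem is fixed, the unit can refuse to start until the failure counter is cleared. A sketch of the retry, to run on the failing node (an ops fragment; it needs a live cluster):

```shell
# Clear systemd's failure/rate-limit counter for the unit,
# then start it again and check the journal for the real error.
sudo systemctl reset-failed mapr-zookeeper.service
sudo systemctl start mapr-zookeeper.service
sudo journalctl -u mapr-zookeeper.service --no-pager -n 50
```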

Shishir_Prakash
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

Can you restart the mapr-zookeeper service on all three ZooKeeper nodes and then share the output of the command below from all three nodes?

/opt/mapr/initscripts/zookeeper qstatus

Also, please upload the ZooKeeper logs from all three nodes.



msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

 

After the restart, ZooKeeper logs from every node: zk_after_restart_nodes

Here is the "/opt/mapr/initscripts/zookeeper qstatus" command output:

> ssh mapr@10.34.2.129 "/opt/mapr/initscripts/zookeeper qstatus"
(mapr@10.34.2.129) Password:
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Error contacting service. It is probably not running.

> ssh mapr@10.34.2.131 "/opt/mapr/initscripts/zookeeper qstatus"
(mapr@10.34.2.131) Password:
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Error contacting service. It is probably not running.

> ssh mapr@10.34.2.135 "/opt/mapr/initscripts/zookeeper qstatus"
(mapr@10.34.2.135) Password:
Using config: /opt/mapr/zookeeper/zookeeper-3.5.6/conf/zoo.cfg
Client port found: 5181. Client address: localhost.
Error contacting service. It is probably not running.

 

Awez
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

Let us know if you have gone through https://docs.ezmeral.hpe.com/datafabric-customer-managed/61/AdministratorGuide/AddingZKrole.html

I'm an HPE employee.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
Shishir_Prakash
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

From the provided logs it looks like /opt/mapr/conf/cldb.key is missing or has an issue. Can you check whether that file is present? Also, you mentioned earlier that the cluster is running with one ZooKeeper, but from the command output it seems none of the ZK nodes are running. Are all three of the nodes you are trying to add new nodes?

2023-08-09 04:09:59,082 [myid:1] - ERROR [main:MaprSecurityLoginModule@71] - Failed to set cldb key file /opt/mapr/conf/cldb.key err com.mapr.security.MutableInt@d83da2e
2023-08-09 04:09:59,084 [myid:1] - ERROR [main:MaprSecurityLoginModule@79] - Cldb key can not be obtained: 2
ldarby
HPE Pro
Solution

Re: Incremental Installation of Zookeeper at Data Fabric

@msaidbilgehan I've reproduced this at last and found the problem: the installer needs to copy the file /opt/mapr/conf/maprhsm.conf and the whole directory /opt/mapr/conf/tokens to the new ZooKeeper nodes. I've raised a bug against the installer for this.

To copy them manually, do this:

scp /opt/mapr/conf/maprhsm.conf  mapr@<new zk node>:/opt/mapr/conf/maprhsm.conf
scp -r /opt/mapr/conf/tokens/*  mapr@<new zk node>:/opt/mapr/conf/tokens/

(Run these as the mapr user, not root, so the mapr user can read the copied files.)
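If several nodes are being added, the two copies can be wrapped in a loop. A sketch, not an official procedure: the node names are placeholders, and DRY_RUN=echo just prints the commands so they can be reviewed first; set it empty to actually copy.

```shell
# Placeholder hostnames -- replace with the ZooKeeper nodes being added.
NEW_ZK_NODES="node-2 node-3"
# Default to printing the commands; set DRY_RUN="" to really run them.
DRY_RUN=${DRY_RUN-echo}

for node in $NEW_ZK_NODES; do
  $DRY_RUN scp /opt/mapr/conf/maprhsm.conf "mapr@${node}:/opt/mapr/conf/maprhsm.conf"
  $DRY_RUN scp -r /opt/mapr/conf/tokens/* "mapr@${node}:/opt/mapr/conf/tokens/"
done
```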

Unfortunately, these steps are also missing from the docs page posted above; there's a separate bug filed with the docs team about that.

Regards,
Laurence Darby

msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

@ldarby At which step should I run these copy commands? Should I wait for the installation to fail, copy the files, and then retry?

 

BTW, I just did what you told me, and now it seems all ZKs are running; service status output below:

Screenshot 2023-08-10 at 13.34.23-min.png

 

Now, as @Shishir_Prakash mentioned, the cldb key is missing on node-1, which should have it. There is also one more problem: the installer reset itself, as explained in here (the ticket). I need to solve these two remaining problems, and then it should work, I guess.

msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

As you mention, the "cldb.key" file is missing. CLDB and ZK were on node-1; the other nodes (node-2 and node-3) run other services. For the incremental installation, I selected node-2 and node-3 in addition to node-1. This issue happened after the failure. Also, the installer reset itself as explained in this ticket.

 

I have checked all 3 nodes for "cldb.key" and it was not found on any of them.

ldarby
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

Hi @msaidbilgehan,

Apologies for the confusion: cldb.key no longer exists; it was replaced by maprhsm.conf and the tokens directory in version 7.0, but the error message about cldb.key didn't get updated. I've again requested engineering to fix this error message. @Shishir_Prakash you may also want to push engineering to fix this incorrect error message.

Also, apologies, I'm not super familiar with the installer, so I'm not sure if there is a way to tell it that the problem has been manually resolved. Possibly the only way is to re-run the incremental install, which should work now that you've copied the files. I'll check internally about this.

Regards,
Laurence Darby

 

 

msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

I see, all good then. I cannot do the incremental installation because the installer is stuck at the installation page and I cannot change that right now. I will continue with the installer issue ticket then. Thanks for the support.

msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

@ldarby  @Shishir_Prakash 

Before leaving the ticket, I just realized that the qstatus command output shows some errors, as below:

Screenshot 2023-08-10 at 14.30.50.png

Here are the zk logs of every node: https://drive.google.com/drive/folders/1QavBt3D1L4wU3cLECsROr7awe7voA1xC?usp=drive_link

ldarby
HPE Pro

Re: Incremental Installation of Zookeeper at Data Fabric

Hi @msaidbilgehan,

Unfortunately, these error logs with 'java.io.IOException: ZK down' are pretty hard to diagnose. One known cause is replacing the self-signed certificate with one signed by a CA where the new cert is missing the 'TLS Web Client Authentication' flag; this flag is needed by the ZooKeepers to connect to each other with SSL client certificates (mutual SSL auth). Have you done this? (I haven't seen you mention custom CAs so far, so I think not.)
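To check a certificate for that flag without a full debug session, openssl can print the extended key usages it carries. A sketch: the certificate path is a placeholder, so point it at whatever cert your cluster actually serves.

```shell
# Succeeds if the certificate advertises the clientAuth extended key usage,
# which openssl renders as "TLS Web Client Authentication".
has_client_auth() {
  openssl x509 -in "$1" -noout -text | grep -q "TLS Web Client Authentication"
}

# Placeholder path -- substitute your cluster's certificate file.
# has_client_auth /path/to/cluster-cert.pem && echo "clientAuth present" || echo "clientAuth MISSING"
```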

To debug this, on one of the ZK nodes, edit /opt/mapr/zookeeper/zookeeper-3.5.6/bin/zkServer.sh and change the line with ZK_SUPPORT_OPTS="-XX..." to be this:

ZK_SUPPORT_OPTS="-XX:ErrorFile=${ZOO_LOG_DIR}/hs_err_pid%p.log -Djavax.net.debug=ssl:trustmanager:verbose -Djavax.net.debug=ssl:handshake:verbose "

Then start a tcpdump:

tcpdump -i any -s 0 -n -w zk.pcap

Then restart that ZK:

systemctl restart mapr-zookeeper

The tcpdump should then capture ZK starting up and producing the error message for the first time after startup, and the logs should hopefully have more info.

(This is what I had to do earlier to discover the missing TLS Client Auth flag issue.)

Regards,
Laurence Darby

 

 

msaidbilgehan
Advisor

Re: Incremental Installation of Zookeeper at Data Fabric

I have started installing from scratch, so the next time this happens I will try your suggestion and create a new ticket with detailed logs.