Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
i m trying to implement Kafka with zookeeper in my network but i am facing a weird issue with zookeeper. i have looked around google and realized that many other users reported such issue but no one posted any proper solution for this .
My current setup has 3 different zookeeper nodes (32 GB ram dedicated boxes)
The issue is if i kill zookeeper leader, both the remaining follower nodes also goes down and they do not recover for at-least next 15-20 minutes.
All i am getting in the zookeeper logs is "notification timeouts" without any explanation
Here is my zookeeper config file
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=100
maxSessionTimeout=50000
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=100
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=3000000
server.1=zo1:2888:3888
server.2=zo2:2888:3888
server.3=zo3:2888:3888
in my /etc/hosts file i have mapped zo1 , zo2 , zo3 to their ip address.
Note:i have also tested by setting current node ip to 0.0.0.0 it doesn't makes any difference.
just few minutes ago i tested it and again it failed to recover.
As i have three node cluster zo1 , zo2 and zo3 . zo3 was the leader and zo1 and zo2 were followers. after i killed zo3 node . it took approx 13 minutes to recover automatically . i got the following logs in zo1 and zo2 .
Log for zo1.
tail /var/lib/zookeeper/zookeeper.out -n 10000 | grep 'QuorumPeer'
2019-01-02 10:25:50,848 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2019-01-02 10:25:50,848 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2019-01-02 10:25:50,849 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2019-01-02 10:25:50,849 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2019-01-02 10:25:50,850 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 1, proposed zxid=0x2d00035c8e
2019-01-02 10:25:51,057 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
2019-01-02 10:25:51,458 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
2019-01-02 10:25:52,259 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
2019-01-02 10:25:53,859 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
2019-01-02 10:25:57,060 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
2019-01-02 10:26:03,461 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
2019-01-02 10:26:16,262 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
2019-01-02 10:26:41,862 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
2019-01-02 10:27:33,063 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:28:33,065 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:29:33,066 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:30:33,066 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:31:33,067 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:32:33,068 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:33:33,069 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:34:33,069 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:35:33,070 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:36:33,071 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:37:33,071 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:38:33,072 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:39:33,073 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:40:33,074 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:41:33,075 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:42:33,076 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,076 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,082 [myid:1] - INFO [WorkerSender[myid=1]:QuorumPeer$QuorumServer@167] - Resolved hostname: zo3 to address: zo3/144.76.xxx.xxx
2019-01-02 10:43:33,091 [myid:1] - INFO [WorkerSender[myid=1]:QuorumPeer$QuorumServer@167] - Resolved hostname: zo3 to address: zo3/144.76.xxx.xxx
2019-01-02 10:43:33,290 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
2019-01-02 10:43:33,290 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 50000 datadir /var/lib/zookeeper/version-2 snapdir /var/lib/zookeeper/version-2
2019-01-02 10:43:33,291 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER ELECTION TOOK - 1062441
2019-01-02 10:43:33,291 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@167] - Resolved hostname: zo2 to address: zo2/88.198.35.34
2019-01-02 10:43:33,294 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@237] - Unexpected exception, tries=0, connecting to zo2/88.198.35.34:2888
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2019-01-02 10:43:34,468 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@332] - Getting a diff from the leader 0x2d00035c8e
2019-01-02 10:43:35,120 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@687] - Established session 0x2680a49e3dc0013 with negotiated timeout 6000 for client /5.9.xxx.xxx:36664
2019-01-02 10:43:35,244 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@687] - Established session 0x1680a49b6b90011 with negotiated timeout 30000 for client /5.9.xxx.xxx:36668
2019-01-02 10:43:35,625 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@118] - Got zxid 0x2e00000001 expected 0x1
Logs from node zo2 , which became leader later
2019-01-02 10:25:50,852 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed socket connection for client /5.9.xxx.xxx:21218 which had sessionid 0x2680a49e3dc0012
2019-01-02 10:25:50,852 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
2019-01-02 10:25:50,853 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting down
2019-01-02 10:25:50,853 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down
2019-01-02 10:25:50,854 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2019-01-02 10:25:50,854 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2019-01-02 10:25:50,856 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2019-01-02 10:25:50,857 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2019-01-02 10:25:50,858 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x2d00035c8e
2019-01-02 10:25:51,061 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
2019-01-02 10:25:51,462 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
2019-01-02 10:25:52,283 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
2019-01-02 10:25:53,884 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
2019-01-02 10:25:57,084 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
2019-01-02 10:26:03,485 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
2019-01-02 10:26:16,286 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
2019-01-02 10:26:41,887 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
2019-01-02 10:27:33,087 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:28:33,088 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:29:33,089 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:30:33,090 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:31:33,091 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:32:33,092 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:33:33,092 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:34:33,093 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:35:33,094 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:36:33,095 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:37:33,095 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:38:33,096 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:39:33,097 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:40:33,098 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:41:33,099 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:42:33,100 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
2019-01-02 10:43:33,293 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@947] - LEADING
2019-01-02 10:43:33,299 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@62] - TCP NoDelay set to: true
2019-01-02 10:43:33,301 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 50000 datadir /var/lib/zookeeper/version-2 snapdir /var/lib/zookeeper/version-2
2019-01-02 10:43:33,301 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@371] - LEADING - LEADER ELECTION TOOK - 1062443
2019-01-02 10:43:34,307 [myid:2] - INFO [LearnerHandler-/144.76.120.143:64542:LearnerHandler@346] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@33d2c290
2019-01-02 10:43:34,509 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@961] - Have quorum of supporters, sids: [ 1,2 ]; starting up and setting last processed zxid: 0x2e00000000
As you can see all i am getting is continuous timeouts in the log without any explanation.
Been testing it since more then a week still cant find any solution for this.
I would be very grateful if somebody could point me in right direction.
Thanks
We faced the same issue and found that zookeeper leader election happened after 15 minutes.
we bypassed this 15 min idle time by setting tcpKeepAlive=true
in zookeeper properties.
No election can happen when the cluster size is 2. If you intend to run a HA Zookeeper cluster, increase your zookeeper count to 5. Also, Zookeeper doesn't need 32GB to run effectively.
Check out:
https://docs.confluent.io/current/zookeeper/deployment.html#multi-node-setup for cluster information
https://docs.confluent.io/current/zookeeper/deployment.html#jvm for JVM sizing.
–
TO ANYONE WHO FACE THE SAME ISSUE - the reason for this was zookeeper being idle. if you the zookeeper nodes becomes idle due to no activity (data transaction) between the nodes then internally the zookeeper nodes stops talking to each other . and if any zookeeper nodes goes down the other nodes will not know until they do their periodic ping after 10-15 minutes.
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.