Improve keepalive checks in cluster #1482

Merged
jesuslinares merged 12 commits into 3.7 from dev-3.7-cluster-lastkeepalive-checker on Oct 3, 2018

@mgmacias95
Contributor

Hi @jesuslinares,

This PR improves keepalive based checks in the cluster, to disconnect a node if there's no internet connection.

The implemented solution is detailed in #1355.

Best regards,
Marta

@mgmacias95 mgmacias95 force-pushed the dev-3.7-cluster-lastkeepalive-checker branch from 4c72024 to 648b4ad Compare September 28, 2018 14:09
Marta Gómez Macías added 2 commits October 1, 2018 11:39
The time to disconnect a worker must be the same on both the master and the workers, so the number of attempts has been decreased to two. A master disconnects a worker after 120 s.
Even though all worker threads were stopped, the asyncore loop was still running, blocking the worker node.
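
The timing constraint in the first commit note can be checked with simple arithmetic. The sketch below is only an illustration: the two attempts and the 120 s master limit come from this PR, while the 60 s keepalive interval and all names are assumptions made for the example.

```python
# Hypothetical names and values; only MAX_FAILED_ATTEMPTS = 2 and the
# 120 s master limit are stated in this PR.
KEEPALIVE_INTERVAL = 60          # seconds between worker keepalives (assumed)
MAX_FAILED_ATTEMPTS = 2          # attempts before a worker gives up
MAX_ALLOWED_LASTKEEPALIVE = 120  # seconds before the master disconnects a worker


def worker_disconnect_time(interval, attempts):
    """Time a worker waits before considering the master unreachable."""
    return interval * attempts


def timeouts_match(interval, attempts, master_limit):
    """True when both sides give up after the same amount of time."""
    return worker_disconnect_time(interval, attempts) == master_limit
```

Under these assumptions both sides disconnect after the same 120 s window, while a third attempt would leave the worker waiting longer than the master.
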
@mgmacias95
Contributor Author

An update about this:
The worker threads were correctly stopped but the worker was blocked in the asyncore loop (this is why it took so long to reconnect):

asyncore.loop(timeout=1, use_poll=False, map=manager.handler.map, count=None)

To fix this, it is necessary to stop the asyncore loop before stopping the threads:
self.worker_handler.handle_close()
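
The ordering issue can be reproduced without the cluster code. The following is a minimal sketch, not the actual implementation: `FakeHandler` and `loop` are hypothetical stand-ins that only mimic how `asyncore.loop(count=None)` keeps running while its channel map is non-empty and how closing a channel removes it from that map.

```python
import threading
import time


class FakeHandler:
    """Stand-in for an asyncore dispatcher that registers itself in a map."""

    def __init__(self, channel_map):
        self.map = channel_map
        self.map[id(self)] = self

    def handle_close(self):
        # Closing a channel removes it from the map it was registered in.
        self.map.pop(id(self), None)


def loop(channel_map, timeout=0.01):
    """Mimics asyncore.loop(count=None): runs while the map is non-empty."""
    while channel_map:
        time.sleep(timeout)  # placeholder for the real select/poll call


def shutdown(handler, stopper, loop_thread):
    handler.handle_close()       # 1. empty the map so the loop can return
    stopper.set()                # 2. then tell the worker threads to stop
    loop_thread.join(timeout=2)
    return not loop_thread.is_alive()


channel_map = {}
handler = FakeHandler(channel_map)
stopper = threading.Event()
loop_thread = threading.Thread(target=loop, args=(channel_map,), daemon=True)
loop_thread.start()
loop_stopped = shutdown(handler, stopper, loop_thread)
```

If only `stopper.set()` were called, the stand-in loop would keep spinning because the handler would still be registered in the map, which matches the blocking behaviour described above.
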

"string_transfer_send": 0.1
"string_transfer_send": 0.1,

"max_allowed_lastkeepalive": 120
Contributor

This should be in "master".


logger.info("{0}: End. Sleeping: {1}s.".format(self.thread_tag, self.interval))
self.sleep(self.interval)
logger.info("{0}: Stopped.".format(self.thread_tag))
Contributor

These messages should be logged at debug level.

    self.worker_handler.handle_close()
    self.stopper.set()
else:
    raise e
Contributor

Please add comments here.


for worker, worker_info in self.master.get_connected_workers().items():
    if time.time() - worker_info['status']['last_keep_alive'] > get_cluster_items_communication_intervals()['max_allowed_lastkeepalive']:
        logger.critical("[Master] [ClientStatus] Last keep alive from worker {} is higher than allowed maximum. Disconnecting.".format(worker))
Contributor

WorkerChecks instead of ClientStatus
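
For reference, the master-side check reviewed above can be sketched as a standalone function. This is an illustrative rewrite, not the merged code: `find_stale_workers` and the sample worker names are hypothetical, and the dictionary layout is assumed to match the `worker_info['status']['last_keep_alive']` access in the snippet.

```python
import time


def find_stale_workers(connected_workers, max_allowed, now=None):
    """Return the names of workers whose last keepalive is older than max_allowed seconds."""
    now = time.time() if now is None else now
    return [
        name
        for name, info in connected_workers.items()
        if now - info['status']['last_keep_alive'] > max_allowed
    ]


# Sample data with a fixed clock so the outcome is deterministic.
workers = {
    'worker-1': {'status': {'last_keep_alive': 1000.0}},  # keepalive just received
    'worker-2': {'status': {'last_keep_alive': 870.0}},   # 130 s old
}
stale = find_stale_workers(workers, max_allowed=120, now=1000.0)
```

Passing `now` explicitly keeps the comparison testable. The strict `>` mirrors the original condition, so a keepalive exactly 120 s old is still accepted.
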

@jesuslinares jesuslinares merged commit 38de3fd into 3.7 Oct 3, 2018
@jesuslinares jesuslinares deleted the dev-3.7-cluster-lastkeepalive-checker branch October 3, 2018 15:27
@albertomn86 albertomn86 mentioned this pull request Nov 23, 2018
89 tasks
@chemamartinez chemamartinez mentioned this pull request Jan 9, 2019
89 tasks
@albertomn86 albertomn86 mentioned this pull request Jan 15, 2019
89 tasks
@albertomn86 albertomn86 mentioned this pull request Feb 18, 2019
38 tasks
@albertomn86 albertomn86 mentioned this pull request Apr 25, 2019
39 tasks