Improve keepalive checks in cluster #1482
Merged
jesuslinares merged 12 commits into 3.7 on Oct 3, 2018
4c72024 to
648b4ad
added 2 commits
October 1, 2018 11:39
The time to disconnect a worker must be the same on both the master and the workers. This is why the attempts have been decreased to two. A master disconnects a worker after 120s.
Even though all worker threads were stopped, the asyncore loop was still running, blocking the worker node.
Contributor
Author
An update about this:
wazuh/framework/scripts/wazuh-clusterd.py, line 179 in 588888a
To fix this, it is necessary to stop the asyncore loop before stopping the threads:
wazuh/framework/wazuh/cluster/worker.py, line 615 in a8ffad1
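The shutdown order described above can be sketched as follows. This is only an illustration and not the code from worker.py: `PollLoop` is a hypothetical stand-in for the asyncore loop (it just sleeps in bounded passes), and `WorkerThread`, `shutdown` and the timeouts are assumed names and values.

```python
import threading
import time


class PollLoop(threading.Thread):
    """Hypothetical stand-in for the asyncore loop: polls until told to stop."""

    def __init__(self, poll_timeout=0.05):
        super().__init__(daemon=True)
        self.poll_timeout = poll_timeout
        self.running = threading.Event()
        self.running.set()

    def run(self):
        while self.running.is_set():
            # Placeholder for one bounded poll pass over the sockets.
            time.sleep(self.poll_timeout)

    def stop(self):
        self.running.clear()


class WorkerThread(threading.Thread):
    """Hypothetical periodic thread that exits once the shared stopper is set."""

    def __init__(self, stopper, interval=0.05):
        super().__init__(daemon=True)
        self.stopper = stopper
        self.interval = interval

    def run(self):
        while not self.stopper.is_set():
            self.stopper.wait(self.interval)


def shutdown(loop, threads, stopper, timeout=2.0):
    """Stop the loop first, then the threads; return True if all exited."""
    # 1. Stop the network loop so it cannot keep the node blocked.
    loop.stop()
    loop.join(timeout)
    # 2. Only then signal the worker threads and wait for them.
    stopper.set()
    for thread in threads:
        thread.join(timeout)
    return not loop.is_alive() and not any(t.is_alive() for t in threads)
```

With this ordering, the loop has already exited by the time the threads are told to stop, so it cannot keep running after them.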
jesuslinares
suggested changes
Oct 3, 2018
framework/wazuh/cluster/cluster.json
Outdated
    "string_transfer_send": 0.1
    "string_transfer_send": 0.1,
    "max_allowed_lastkeepalive": 120
Contributor
This should be in "master".
framework/wazuh/cluster/worker.py
Outdated
    logger.info("{0}: End. Sleeping: {1}s.".format(self.thread_tag, self.interval))
    self.sleep(self.interval)
    logger.info("{0}: Stopped.".format(self.thread_tag))
framework/wazuh/cluster/worker.py
Outdated
        self.worker_handler.handle_close()
        self.stopper.set()
    else:
        raise e
framework/wazuh/cluster/master.py
Outdated
    for worker, worker_info in self.master.get_connected_workers().items():
        if time.time() - worker_info['status']['last_keep_alive'] > get_cluster_items_communication_intervals()['max_allowed_lastkeepalive']:
            logger.critical("[Master] [ClientStatus] Last keep alive from worker {} is higher than allowed maximum. Disconnecting.".format(worker))
Contributor
WorkerChecks instead of ClientStatus
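For reference, the check in the snippet above can be isolated into a small function. This is a minimal sketch and not the code in master.py: `find_stale_workers` and the `now` parameter are hypothetical, and the default of 120 is taken from the `max_allowed_lastkeepalive` value proposed for cluster.json.

```python
import time

# Value proposed for "max_allowed_lastkeepalive" in cluster.json (seconds).
MAX_ALLOWED_LASTKEEPALIVE = 120


def find_stale_workers(connected_workers, max_allowed=MAX_ALLOWED_LASTKEEPALIVE, now=None):
    """Return the names of workers whose last keepalive is older than max_allowed.

    connected_workers is assumed to map a worker name to a dict shaped like
    {'status': {'last_keep_alive': <unix timestamp>}}, as in the diff above.
    """
    now = time.time() if now is None else now
    return [
        name
        for name, info in connected_workers.items()
        if now - info['status']['last_keep_alive'] > max_allowed
    ]
```

Passing `now` explicitly keeps the comparison deterministic in tests. The master would then disconnect each returned worker and log the event.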
added 6 commits
October 3, 2018 12:05
…guration in cluster.json
…ched If an exception is raised, a log message saying "Unknown error" is printed. I think it's better to return None and get a message saying "Result: Error".
Hi @jesuslinares,
This PR improves the keepalive-based checks in the cluster, so that a node is disconnected if there's no internet connection.
The implemented solution is detailed in #1355.
Best regards,
Marta