Improve keepalive checks in cluster by mgmacias95 · Pull Request #1482 · wazuh/wazuh

mgmacias95 · 2018-09-28T13:51:03Z

This PR improves the keepalive-based checks in the cluster so that a node is disconnected when its network connection is lost.

The implemented solution is detailed in #1355.

Best regards,
Marta

mgmacias95 · 2018-10-01T09:45:12Z

An update about this:
The worker threads were correctly stopped but the worker was blocked in the asyncore loop (this is why it took so long to reconnect):

wazuh/framework/scripts/wazuh-clusterd.py

Line 179 in 588888a

asyncore.loop(timeout=1, use_poll=False, map=manager.handler.map, count=None)

To fix this, it is necessary to stop the asyncore loop before stopping the threads:

wazuh/framework/wazuh/cluster/worker.py

Line 615 in a8ffad1

self.worker_handler.handle_close()
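The ordering matters because `asyncore.loop(..., count=None)` keeps iterating for as long as the socket map it was given is non-empty. A minimal sketch of that behaviour follows. It uses a stand-in loop rather than `asyncore` itself (the module was removed from the standard library in Python 3.12), and `FakeHandler`, `poll_loop` and `shutdown` are hypothetical names for illustration, not Wazuh code.

```python
import threading
import time


class FakeHandler:
    """Hypothetical stand-in for an asyncore dispatcher registered in a socket map."""

    def __init__(self, channel_map):
        self.channel_map = channel_map
        self.channel_map[id(self)] = self

    def handle_close(self):
        # Closing a dispatcher removes it from the map it is registered in.
        self.channel_map.pop(id(self), None)


def poll_loop(channel_map, timeout=0.01):
    """Mimics asyncore.loop(count=None): keeps running while the map is non-empty."""
    while channel_map:
        time.sleep(timeout)


def shutdown(handler, stopper):
    """Close the handler first so the loop can exit, then signal the threads."""
    handler.handle_close()
    stopper.set()


channel_map = {}
handler = FakeHandler(channel_map)
stopper = threading.Event()

loop_thread = threading.Thread(target=poll_loop, args=(channel_map,), daemon=True)
loop_thread.start()

# Setting only the stop flag does not end the loop: the map is still populated.
stopper.set()
loop_thread.join(timeout=0.1)
loop_still_running = loop_thread.is_alive()

# Closing the handler empties the map, so the loop returns.
stopper.clear()
shutdown(handler, stopper)
loop_thread.join(timeout=1)
loop_finished = not loop_thread.is_alive()
```

In this sketch the stop flag alone leaves the loop running, while closing the handler lets it return, which is the situation the fix addresses.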

jesuslinares · 2018-10-03T09:11:15Z

framework/wazuh/cluster/cluster.json

-            "string_transfer_send": 0.1
+            "string_transfer_send": 0.1,
+
+            "max_allowed_lastkeepalive": 120


This should be in "master".

jesuslinares · 2018-10-03T09:13:46Z

framework/wazuh/cluster/worker.py


            logger.info("{0}: End. Sleeping: {1}s.".format(self.thread_tag, self.interval))
            self.sleep(self.interval)
+        logger.info("{0}: Stopped.".format(self.thread_tag))


jesuslinares · 2018-10-03T09:15:49Z

framework/wazuh/cluster/worker.py

+                self.worker_handler.handle_close()
+                self.stopper.set()
+            else:
+                raise e


jesuslinares · 2018-10-03T09:19:24Z

framework/wazuh/cluster/master.py

+
+            for worker, worker_info in self.master.get_connected_workers().items():
+                if time.time() - worker_info['status']['last_keep_alive'] > get_cluster_items_communication_intervals()['max_allowed_lastkeepalive']:
+                    logger.critical("[Master] [ClientStatus] Last keep alive from worker {} is higher than allowed maximum. Disconnecting.".format(worker))


WorkerChecks instead of ClientStatus
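The comparison in the diff above can be isolated into a small function for illustration. This is only a sketch: the dictionary layout mirrors the keys used in the diff (`status`, `last_keep_alive`), but `find_stale_workers` and its `now` parameter are hypothetical and do not appear in the PR.

```python
import time


def find_stale_workers(connected_workers, max_allowed, now=None):
    """Return the names of workers whose last keepalive is older than max_allowed seconds."""
    now = time.time() if now is None else now
    return [
        name
        for name, info in connected_workers.items()
        if now - info['status']['last_keep_alive'] > max_allowed
    ]


# Hypothetical sample data: timestamps are plain seconds for readability.
workers = {
    'worker-1': {'status': {'last_keep_alive': 1000.0}},
    'worker-2': {'status': {'last_keep_alive': 1100.0}},
}

# With now=1150, worker-1 has been silent for 150 s and worker-2 for 50 s.
stale = find_stale_workers(workers, max_allowed=120, now=1150.0)
```

Passing `now` explicitly keeps the check deterministic, which makes the threshold easy to test without waiting for real time to elapse.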

Marta Gómez Macías added 3 commits September 25, 2018 09:08

Save last keep alive timestamp for each worker in cluster

0957cd7

Add last keep alive to API and cluster_control cluster healthcheck

71f554c

Remove client from master if its last keep alive is longer than 120s

54f645c

mgmacias95 added module/framework module/cluster labels Sep 28, 2018

mgmacias95 assigned jesuslinares Sep 28, 2018

mgmacias95 requested review from druizz90 and jesuslinares September 28, 2018 13:51

mgmacias95 mentioned this pull request Sep 28, 2018

Cluster doesn't work if the network service starts after the cluster #1355

Closed

Close connection in worker side if keep alives fail to be sent

648b4ad

mgmacias95 force-pushed the dev-3.7-cluster-lastkeepalive-checker branch from 4c72024 to 648b4ad September 28, 2018 14:09

Marta Gómez Macías added 2 commits October 1, 2018 11:39

Decrease keep alive attempts to 2 in worker nodes

63b31fa

The time to disconnect a worker must be the same in both master and workers. This is why the attempts have been decreased to two. A master disconnects a worker after 120s.
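The reasoning in this commit message can be sketched as follows. The 60-second keepalive interval is an assumption made purely for illustration (the PR states only the 120 s master-side limit and the two attempts), and `KeepAliveCounter` is a hypothetical helper, not the worker's actual implementation.

```python
MASTER_TIMEOUT = 120     # seconds, as stated in the PR
MAX_ATTEMPTS = 2         # as stated in the PR
KEEPALIVE_INTERVAL = 60  # seconds; ASSUMPTION, not taken from the PR


def worker_disconnect_delay(interval, max_attempts):
    """Approximate time before a worker gives up after consecutive failed keepalives."""
    return interval * max_attempts


class KeepAliveCounter:
    """Hypothetical counter of consecutive failed keepalive attempts."""

    def __init__(self, max_attempts):
        self.max_attempts = max_attempts
        self.failed = 0

    def record(self, sent_ok):
        """Register one attempt; return True when the worker should disconnect."""
        if sent_ok:
            # A successful keepalive resets the failure streak.
            self.failed = 0
            return False
        self.failed += 1
        return self.failed >= self.max_attempts


delay = worker_disconnect_delay(KEEPALIVE_INTERVAL, MAX_ATTEMPTS)

counter = KeepAliveCounter(MAX_ATTEMPTS)
outcomes = [counter.record(ok) for ok in (False, True, False, False)]
```

Under the assumed interval, two failed attempts add up to the same 120 s that the master waits, so both sides would drop the connection at roughly the same time.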

Stop asyncore loop before stopping threads in worker

a8ffad1

Even though all worker threads were stopped, the asyncore loop was still running, blocking the worker node.

jesuslinares suggested changes Oct 3, 2018

Marta Gómez Macías added 6 commits October 3, 2018 12:05

Rename max_allowed_lastkeepalive variable and move it to master configuration in cluster.json

ebaf226

Change message level to debug

c1da92e

Add more comments in the Worker's keep alive thread

2fe042f

Change thread tag: ClientStatus -> WorkerChecks

1a1435b

Change client for worker in log message

405ceb6

Change behaviour when max attempts of failed timeouts havent been reached

3277602

If an exception is raised, a log message saying "Unknown error" is printed. I think it's better to return None and get a message saying "Result: Error".

jesuslinares merged commit 38de3fd into 3.7 Oct 3, 2018

jesuslinares deleted the dev-3.7-cluster-lastkeepalive-checker branch October 3, 2018 15:27

albertomn86 mentioned this pull request Nov 23, 2018

Test: Cluster #1952

Closed

89 tasks

chemamartinez mentioned this pull request Jan 9, 2019

Test: Cluster #2257

Closed

89 tasks

albertomn86 mentioned this pull request Jan 15, 2019

Test: Cluster 2 #2363

Closed

89 tasks

jesuslinares mentioned this pull request Feb 4, 2019

Testing new cluster and embedded Python #2520

Closed

albertomn86 mentioned this pull request Feb 18, 2019

Test: Cluster wazuh/wazuh-qa#14

Closed

38 tasks

albertomn86 mentioned this pull request Apr 25, 2019

Test: Cluster wazuh/wazuh-qa#67

Closed

39 tasks

jesuslinares merged 12 commits into 3.7 from dev-3.7-cluster-lastkeepalive-checker
