Monitoring tool for Framaspace. [Online documentation](https://argos-monitoring.framasoft.org/)

Argos

Argos is an HTTP monitoring service. It allows you to define a list of websites to monitor, and a list of checks to run on these websites. It will then run these checks periodically, and alert you if something goes wrong.

Todo:

  • Retrying: attempt 1413 ended with: <Future at 0x104f39390 state=finished raised RuntimeError> Cannot reopen a client instance, once it has been closed.
  • Cleandb should keep a maximum number of results per task
  • Do not return empty list on / when no results from agents.
  • Last seen agents
  • Give a quick overview of the monitoring status.
  • Rename "error" to "unexpected error"
  • Use background tasks for alerting
  • Delete outdated tasks from config
  • Implement alerting tasks
  • Handle multiple alerting backends (email, sms, gotify)
  • A configuration flag automatically adds a job that checks for the 301 redirect from the HTTP version to HTTPS
  • Add an "unknown" severity for check errors
  • Add a way to specify the severity of the alerts in the config
  • Add a command to generate new authentication token

Implemented checks:

  • Returned status code matches what you expect;
  • Returned body matches what you expect;
  • SSL certificate expires in more than X days.
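
As a rough illustration, the three checks can be written as pure predicates. This is a minimal sketch, not the code in the argos package: the function names are hypothetical, and treating "body matches" as a substring test is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def status_matches(actual: int, expected: int) -> bool:
    """True when the returned status code is the expected one."""
    return actual == expected


def body_matches(body: str, expected: str) -> bool:
    """True when the expected text appears in the body (substring test is an assumption)."""
    return expected in body


def certificate_valid_for(
    not_after: datetime, min_days: int, now: Optional[datetime] = None
) -> bool:
    """True when the certificate expires in more than `min_days` days."""
    now = now or datetime.now(timezone.utc)
    return not_after - now > timedelta(days=min_days)
```

Passing `now` explicitly keeps the expiry check deterministic in tests.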

Development notes

On service start:

  1. Read the job definitions file and populate the database.
  2. From the job definition, create a list of tasks to execute.
  3. From time to time (?) clean the db.

On configuration changes:

  • Find and tombstone the JobDefinitions that are not useful anymore.
  • Cascade delete the child tasks that are planned. Tombstone them as well.
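
One possible reading of these two steps, sketched with in-memory objects. The `JobDefinition` and `PlannedTask` dataclasses, the `tombstoned_at` field, and the idea of keying jobs by URL are all assumptions made for illustration; the notes above do not specify the schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class PlannedTask:
    check: str
    tombstoned_at: Optional[datetime] = None


@dataclass
class JobDefinition:
    url: str
    tasks: List[PlannedTask] = field(default_factory=list)
    tombstoned_at: Optional[datetime] = None


def tombstone_removed_jobs(
    jobs: List[JobDefinition], configured_urls: Set[str], now: datetime
) -> List[str]:
    """Tombstone jobs absent from the new config, cascading to their tasks."""
    removed = []
    for job in jobs:
        # Skip jobs still in the config and jobs already tombstoned.
        if job.url in configured_urls or job.tombstoned_at is not None:
            continue
        job.tombstoned_at = now
        for task in job.tasks:
            task.tombstoned_at = now
        removed.append(job.url)
    return removed
```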

On worker demand:

  • Find the tasks for which:
    • last_check is not defined
    • OR last_check + max_timedelta <= datetime.now() (the task is due again)
    • AND selected_by is not defined (no worker holds the task).
  • Mark these tasks as selected by the current worker, on the current date.
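
The selection rule can be sketched in plain Python as follows. It assumes a task is due once last_check + max_timedelta is in the past, and that the lock condition applies to both branches of the OR. The `Task` dataclass, `is_due`, `select_tasks`, and the `limit` parameter are hypothetical names; the real service would run the equivalent query against its database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Task:
    url: str
    max_timedelta: timedelta
    last_check: Optional[datetime] = None
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def is_due(task: Task, now: datetime) -> bool:
    """(never checked OR check interval elapsed) AND not locked by a worker."""
    never_checked = task.last_check is None
    overdue = not never_checked and task.last_check + task.max_timedelta <= now
    return (never_checked or overdue) and task.selected_by is None


def select_tasks(
    tasks: List[Task], worker_id: str, now: datetime, limit: int = 10
) -> List[Task]:
    """Pick up to `limit` due tasks and lock them for `worker_id`."""
    selected = []
    for task in tasks:
        if len(selected) >= limit:
            break
        if is_due(task, now):
            task.selected_by = worker_id
            task.selected_at = now
            selected.append(task)
    return selected
```

With a real database, the find and the mark need to happen in one transaction so that two workers cannot pick the same task.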

From time to time (cleanup):

  • Check for stalled tasks: when (datetime.now() - selected_at) > MAX_WORKER_TIME, remove the lock.
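
A minimal sketch of that cleanup pass, assuming the lock is the pair of `selected_by` and `selected_at` fields. The `LockedTask` dataclass, the function name, and the five-minute value of `MAX_WORKER_TIME` are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

MAX_WORKER_TIME = timedelta(minutes=5)  # assumed value, not taken from the project


@dataclass
class LockedTask:
    url: str
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def release_stalled_tasks(
    tasks: List[LockedTask],
    now: datetime,
    max_worker_time: timedelta = MAX_WORKER_TIME,
) -> int:
    """Clear the lock on tasks held longer than `max_worker_time`; return the count."""
    released = 0
    for task in tasks:
        if task.selected_at is not None and now - task.selected_at > max_worker_time:
            task.selected_by = None
            task.selected_at = None
            released += 1
    return released
```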

On the worker side

  1. Hey, I'm XX, give me some work.
  2. OK, this is done, here are the results for Task: response.
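
The exchange above can be sketched as a small loop. Everything here is hypothetical: `InMemoryServer` stands in for the HTTP API, and `get_tasks`, `post_result`, and `run_worker` are illustrative names, not the project's actual endpoints.

```python
from typing import Callable, Dict, List, Tuple


class InMemoryServer:
    """Stand-in for the server side of the exchange (an assumption for this sketch)."""

    def __init__(self, pending: List[str]) -> None:
        self.pending = list(pending)
        self.results: Dict[str, Tuple[str, str]] = {}

    def get_tasks(self, worker_id: str, limit: int = 2) -> List[str]:
        # "Hey, I'm XX, give me some work."
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        return batch

    def post_result(self, worker_id: str, task: str, response: str) -> None:
        # "OK, this is done, here are the results for Task: response."
        self.results[task] = (worker_id, response)


def run_worker(
    server: InMemoryServer, worker_id: str, run_check: Callable[[str], str]
) -> int:
    """Ask for work until none is left; report each result; return tasks done."""
    done = 0
    while True:
        tasks = server.get_tasks(worker_id)
        if not tasks:
            break
        for task in tasks:
            server.post_result(worker_id, task, run_check(task))
            done += 1
    return done
```

A real worker would presumably sleep and poll again instead of exiting when no work is available.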