Argos

Argos is an HTTP monitoring service. It allows you to define a list of websites to monitor, and a list of checks to run on these websites. It will then run these checks periodically, and alert you if something goes wrong.

Todo:

  • Fix the retry failure: "Retrying: attempt 1413 ended with: <Future at 0x104f39390 state=finished raised RuntimeError> Cannot reopen a client instance, once it has been closed."
  • Cleandb should keep a maximum number of results per task
  • Do not return empty list on / when no results from agents.
  • Last seen agents
  • Give a quick overview of the monitoring status.
  • Rename "error" to "unexpected error"
  • Use background tasks for alerting
  • Delete outdated tasks from config
  • Implement alerting tasks
  • Handle multiple alerting backends (email, SMS, Gotify)
  • A configuration flag to automatically add a job that checks the 301 redirect from the HTTP version to HTTPS
  • Add an "unknown" severity for check errors
  • Add a way to specify the severity of the alerts in the config
  • Add a command to generate a new authentication token

Features:

  • Uses .yaml files for configuration;
  • Reads the configuration file and converts it to tasks;
  • Stores tasks in a database;
  • Multiple paths per website can be tested;
  • Handles job failures on the clients;
  • Exposes an HTTP API that can be consumed by other systems;
  • Checks can be distributed on the network thanks to a job queue.

Implemented checks:

  • Returned status code matches what you expect;
  • Returned body matches what you expect;
  • SSL certificate expires in more than X days.
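
As an illustration, the three checks can be pictured as small pure functions. This is only a sketch, not the project's actual code: the function names, the shape of `thresholds` (a list of day/severity pairs) and the "ok" fallback are assumptions made for the example.

```python
from datetime import datetime, timedelta


def check_status_is(expected: str, actual_status: int) -> bool:
    """True when the returned status code matches the expected one."""
    return str(actual_status) == expected


def check_body_contains(expected: str, body: str) -> bool:
    """True when the expected fragment appears in the response body."""
    return expected in body


def ssl_severity(not_after: datetime, now: datetime,
                 thresholds: list[tuple[int, str]]) -> str:
    """Map the remaining certificate lifetime to a severity.

    `thresholds` is a hypothetical list of (days, severity) pairs, e.g.
    [(1, "critical"), (5, "warning")]; the tightest matching one wins.
    """
    remaining = not_after - now
    for days, severity in sorted(thresholds):
        if remaining < timedelta(days=days):
            return severity
    return "ok"  # assumed label for "expires later than every threshold"
```

Keeping the checks free of network code makes them easy to test: the agent would fetch the response and certificate once, then hand the values to these functions.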

How to run?

To install it, create a virtualenv and install the dependencies:

python3 -m venv venv
source venv/bin/activate
pip install -e .

Prepare a configuration file. You can copy the config-example.yaml file and edit it:

cp config-example.yaml config.yaml

Then, you can run the server:

argos server run

You can configure the server with environment variables, or put them in a .env file:

ARGOS_DATABASE_URL=postgresql://localhost/argos
ARGOS_YAML_FILE=config.yaml

The server reads a YAML file at startup and populates the tasks specified in it. See the configuration section below for more information on how to configure the checks you want to run.

And here is how to run the agent:

argos agent http://localhost:8000 "<auth-token>"

You also need to run cleaning tasks periodically. argos server clean --help will give you more information on how to do that.

Here is a crontab example:

# Run the cleaning tasks every hour (at minute 7)
7 * * * * argos server clean --max-results 100000 --max-lock-seconds 3600

Configuration

Here is a simple configuration file:

general:
  frequency: "1m" # Run checks every minute.
  alerts:
    error:
      - local
    warning:
      - local
    alert:
      - local
service:
  secrets:
    # Secrets can be generated using `openssl rand -base64 32`.
    # DO NOT REUSE THESE ONES.
    - "O4kt8Max9/k0EmHaEJ0CGGYbBNFmK8kOZNIoUk3Kjwc"
    - "x1T1VZR51pxrv5pQUyzooMG4pMUvHNMhA5y/3cUsYVs="

ssl:
  thresholds:
    - "1d": critical
    - "5d": warning

# It's also possible to define the checks in another file
# with the include syntax:
# 
#   websites: !include websites.yaml
#
websites:
  - domain: "https://mypads.framapad.org"
    paths:
      - path: "/mypads/"
        checks:
          - status-is: "200"
          - body-contains: '<div id= "mypads"></div>'
          - ssl-certificate-expiration: "on-check"
      - path: "/admin/"
        checks:
          - status-is: "401"
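
To make the mapping from configuration to tasks more concrete, here is a minimal sketch that flattens the `websites` section into one task per URL and check. It works on an already-parsed dictionary (what a YAML loader would return). The `Task` fields and the `config_to_tasks` name are assumptions for illustration, not the project's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    url: str
    check: str
    expected: str


def config_to_tasks(config: dict) -> list[Task]:
    """Create one task per check of every path of every website."""
    tasks = []
    for website in config.get("websites", []):
        domain = website["domain"].rstrip("/")
        for path in website.get("paths", []):
            url = domain + path["path"]
            for check in path.get("checks", []):
                # Each check is a single-entry mapping: {name: expected}.
                for name, expected in check.items():
                    tasks.append(Task(url=url, check=name, expected=str(expected)))
    return tasks


config = {
    "websites": [
        {
            "domain": "https://mypads.framapad.org",
            "paths": [
                {"path": "/mypads/", "checks": [{"status-is": "200"}]},
                {"path": "/admin/", "checks": [{"status-is": "401"}]},
            ],
        }
    ]
}
tasks = config_to_tasks(config)
```

With the two paths above, `tasks` holds two entries, one `status-is` check for `/mypads/` and one for `/admin/`.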

Development notes

On service start:

  1. Read the job definitions file and populate the database.
  2. From the job definition, create a list of tasks to execute.
  3. From time to time (?) clean the db.

On configuration changes:

  • Find and tombstone the JobDefinitions that are not useful anymore.
  • Cascade delete the child tasks that are planned. Tombstone them as well.
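
The tombstoning step could look roughly like the sketch below. The `JobDefinition` and `PlannedTask` classes, their fields and the `key` identifier are hypothetical stand-ins for database rows; only the idea (mark outdated definitions, then cascade to their planned tasks) comes from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedTask:
    name: str
    tombstoned: bool = False


@dataclass
class JobDefinition:
    key: str  # hypothetical identifier, e.g. derived from URL and check
    tasks: list[PlannedTask] = field(default_factory=list)
    tombstoned: bool = False


def tombstone_outdated(stored: list[JobDefinition],
                       configured_keys: set[str]) -> list[JobDefinition]:
    """Tombstone definitions missing from the new configuration, and their tasks."""
    outdated = [job for job in stored if job.key not in configured_keys]
    for job in outdated:
        job.tombstoned = True
        for task in job.tasks:
            task.tombstoned = True  # cascade to the planned child tasks
    return outdated
```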

On worker demand:

  • Find the tasks for which:
    • last_check is not defined
    • OR last_check + max_timedelta < datetime.now()
    • AND selected_by is not defined.
  • Mark these tasks as selected by the current worker, on the current date.
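
A minimal in-memory sketch of this selection follows. It assumes a task is due once max_timedelta has elapsed since last_check, and that the "selected_by not defined" condition applies to both branches of the OR. The `Task` class and the `select_tasks` name are illustrative; a real implementation would do this in a single database query or transaction.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    url: str
    last_check: Optional[datetime] = None
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def select_tasks(tasks: list[Task], worker: str, now: datetime,
                 max_timedelta: timedelta) -> list[Task]:
    """Lock and return the tasks that are due and not held by another worker."""
    selected = []
    for task in tasks:
        never_checked = task.last_check is None
        # Assumption: due means max_timedelta has elapsed since last_check.
        due = never_checked or task.last_check + max_timedelta <= now
        if due and task.selected_by is None:
            task.selected_by = worker  # mark as selected by the current worker
            task.selected_at = now     # ... on the current date
            selected.append(task)
    return selected
```

Because selected tasks carry a lock, a second worker asking right afterwards receives none of them.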

From time to time (cleanup):

  • Check for stalled tasks, i.e. those where (datetime.now() - selected_at) > MAX_WORKER_TIME, and remove their lock.
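
The lock cleanup can be sketched as follows. Tasks are plain dictionaries here for brevity, and the value of `MAX_WORKER_TIME` is an assumption chosen to match the 3600 seconds used in the crontab example.

```python
from datetime import datetime, timedelta

# Assumed value, chosen to match the crontab example (3600 seconds).
MAX_WORKER_TIME = timedelta(seconds=3600)


def release_stalled(tasks: list[dict], now: datetime) -> int:
    """Remove the lock on tasks held longer than MAX_WORKER_TIME; return the count."""
    released = 0
    for task in tasks:
        selected_at = task.get("selected_at")
        if selected_at is not None and now - selected_at > MAX_WORKER_TIME:
            task["selected_by"] = None
            task["selected_at"] = None
            released += 1
    return released
```

Once released, a stalled task becomes eligible again the next time a worker asks for work.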

On the worker side:

  1. Hey, I'm XX, give me some work.
  2. OK, this is done, here are the results for Task: response.
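
This exchange can be sketched as one iteration of a worker loop. The three callables (`fetch_tasks`, `run_check`, `post_result`) are hypothetical placeholders for the HTTP calls and check logic, injected as parameters so the sketch stays independent of any real API.

```python
from typing import Callable


def run_worker_once(worker_id: str,
                    fetch_tasks: Callable[[str], list[dict]],
                    run_check: Callable[[dict], str],
                    post_result: Callable[[str, dict, str], None]) -> int:
    """Ask for work, run each task, report each result; return the task count."""
    tasks = fetch_tasks(worker_id)  # "Hey, I'm XX, give me some work."
    for task in tasks:
        try:
            status = run_check(task)
        except Exception as exc:
            # Report the failure instead of crashing the worker.
            status = f"error: {exc}"
        post_result(worker_id, task, status)  # "OK, this is done, here are the results"
    return len(tasks)
```

Reporting a failed check as a result, rather than letting the exception escape, keeps one broken task from blocking the rest of the batch.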