Monitoring tool for Framaspace. [Online documentation](https://argos-monitoring.framasoft.org/)
Alexis Métaireau 83f57c6e47 Started working on a simple web interface.
- The web interface is exposed at /, and the api
  at /api.
- Include picocss for a minimal CSS framework
- Added some queries and models.Task properties
  to access the latest results
2023-10-13 09:43:47 +02:00

Argos

🚧 This is mainly a work in progress for now. It is not working yet, so don't try to install it! 🚧

Argos is an HTTP monitoring service. It's meant to be simple to configure and simple to use.

Features:

  • Uses .yaml files for configuration;
  • Reads the configuration file and converts it to tasks;
  • Stores tasks in a database;
  • Multiple paths per website can be tested;
  • Handles job failures on the clients;
  • Exposes an HTTP API that can be consumed by other systems;
  • Checks can be distributed on the network thanks to a job queue;
  • Change the naming and use service/agent;
  • Packaging (and argos agent / argos service commands);
  • Endpoints are protected by an authentication token;
  • Task frequency can be defined in the configuration;
  • Add a command to generate new authentication tokens;
  • Local task for database cleanup (to run periodically);
  • Handles multiple alerting backends (email, SMS, gotify);
  • Exposes a simple read-only website;
  • Add a way to specify the severity of the alerts in the config;
  • No need to return the expected and got values in case it worked in check-status and body-contains.

Implemented checks:

  • Returned status code matches what you expect;
  • Returned body matches what you expect;
  • SSL certificate expires in more than X days.
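
As an illustration only, the three checks can be reduced to small pure functions. This is a sketch, not the code used by Argos: the function names and signatures are hypothetical, and fetching the response or the certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def check_status(expected: int, got: int) -> bool:
    """The returned status code matches the expected one."""
    return got == expected


def check_body_contains(expected: str, body: str) -> bool:
    """The returned body contains the expected text."""
    return expected in body


def check_ssl_expiration(not_after: datetime, now: datetime, min_days: int) -> bool:
    """The certificate expires in more than `min_days` days."""
    return not_after - now > timedelta(days=min_days)
```

Passing `now` explicitly (rather than calling `datetime.now()` inside the function) keeps the expiration check easy to test.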

How to run?

We're using pipenv to manage the virtual environment and the dependencies. You can install it with pipx:

pipx install pipenv

Then check out this repository and sync its pipenv environment:

pipenv sync

Once all the dependencies are in place, here is how to run the server:

pipenv run argos server

The server will read a config.yaml file at startup, and will populate the tasks specified in it. See the configuration section below for more information on how to configure the checks you want to run.

And here is how to run the agent:

pipenv run argos agent --server http://localhost:8000 --auth "<auth-token>"
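
The value passed to --auth is presumably one of the secrets declared in the service configuration, which the sample file says can be generated with `openssl rand -base64 32`. For environments without openssl, here is a minimal Python sketch that produces a token of the same shape; `generate_token` is a hypothetical helper, not an Argos command.

```python
import base64
import secrets


def generate_token(nbytes: int = 32) -> str:
    """Return `nbytes` cryptographically random bytes, encoded as base64 text."""
    return base64.b64encode(secrets.token_bytes(nbytes)).decode("ascii")
```

Calling `print(generate_token())` prints a 44-character string that can be pasted into the configuration and passed to the agent.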

Configuration

Here is a simple configuration file:

general:
    frequency: 4h # Run checks every 4 hours.
    alerts:
        error:
            - local
        warning:
            - local
        alert:
            - local
service:
    port: 8888
    # Can be generated using `openssl rand -base64 32`.
    secrets:
        - "O4kt8Max9/k0EmHaEJ0CGGYbBNFmK8kOZNIoUk3Kjwc"
        - "x1T1VZR51pxrv5pQUyzooMG4pMUvHNMhA5y/3cUsYVs="

ssl:
    thresholds:
        critical: "1d"
        warning: "10d"

websites:
    - domain: "https://blog.notmyidea.org"
      paths:
          - path: "/"
            checks:
                - status-is: 200
                - body-contains: "Alexis"
                - ssl-certificate-expiration: "on-check"
          - path: "/foo"
            checks:
                - status-is: 400
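
To make "convert the configuration to tasks" concrete, here is a rough sketch that flattens the websites / paths / checks hierarchy into one task per check. It is an assumption about the conversion, not the actual Argos code: `parse_duration` and `config_to_tasks` are hypothetical names, the duration syntax is guessed from the "4h" and "10d" values above, and the configuration is given as the dictionary a YAML parser would return, so that the example has no third-party dependency.

```python
import re
from datetime import timedelta

# The sample configuration, restricted to the keys this sketch uses.
config = {
    "general": {"frequency": "4h"},
    "websites": [
        {
            "domain": "https://blog.notmyidea.org",
            "paths": [
                {
                    "path": "/",
                    "checks": [
                        {"status-is": 200},
                        {"body-contains": "Alexis"},
                        {"ssl-certificate-expiration": "on-check"},
                    ],
                },
                {"path": "/foo", "checks": [{"status-is": 400}]},
            ],
        }
    ],
}

UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Turn a string such as "4h" or "10d" into a timedelta (assumed syntax)."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{UNITS[match.group(2)]: int(match.group(1))})


def config_to_tasks(config: dict) -> list[dict]:
    """Flatten websites -> paths -> checks into one task per check."""
    frequency = parse_duration(config["general"]["frequency"])
    tasks = []
    for website in config["websites"]:
        for path in website["paths"]:
            for check in path["checks"]:
                # Each check is a single-entry mapping: {name: expected value}.
                (name, expected), = check.items()
                tasks.append(
                    {
                        "url": website["domain"] + path["path"],
                        "check": name,
                        "expected": expected,
                        "frequency": frequency,
                    }
                )
    return tasks


tasks = config_to_tasks(config)
```

With the sample configuration this yields four tasks: three for "/" and one for "/foo".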

Development notes

On service start:

  1. Read the job definitions file and populate the database.
  2. From the job definition, create a list of tasks to execute.
  3. From time to time (?) clean the db.

On configuration changes:

  • Find and tombstone the JobDefinitions that are not useful anymore.
  • Cascade delete the child tasks that are planned. Tombstone them as well.
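
One possible reading of these two steps, sketched with in-memory objects instead of a database. Everything here is hypothetical (the `key` identifier, the `planned` and `tombstoned` flags, the function name), and the sketch only tombstones planned tasks rather than deleting them.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    planned: bool = True        # True while the task has not been executed yet
    tombstoned: bool = False


@dataclass
class JobDefinition:
    key: str                    # hypothetical identifier, e.g. URL + check name
    tasks: list[Task] = field(default_factory=list)
    tombstoned: bool = False


def apply_config_change(
    stored: list[JobDefinition], configured_keys: set[str]
) -> list[JobDefinition]:
    """Tombstone the definitions missing from the new configuration,
    together with their planned tasks. Returns the tombstoned definitions."""
    removed = []
    for job in stored:
        if job.key in configured_keys or job.tombstoned:
            continue
        job.tombstoned = True
        for task in job.tasks:
            if task.planned:
                task.tombstoned = True
        removed.append(job)
    return removed
```

Tasks that already ran are left untouched, so their results stay available.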

On worker demand:

  • Find the tasks that are due and not locked, i.e. for which:
    • last_check is not defined
    • OR last_check + max_timedelta < datetime.now() (the last check is too old)
    • AND, in both cases, selected_by is not defined.
  • Mark these tasks as selected by the current worker, on the current date.
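
The selection can be sketched in plain Python as follows. This is an in-memory illustration rather than the database query the service would run: the `Task` dataclass and `select_tasks` are hypothetical, and a task is treated as due when it was never checked or when last_check + max_timedelta lies in the past.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    url: str
    max_timedelta: timedelta
    last_check: Optional[datetime] = None
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def select_tasks(tasks: list[Task], worker: str, now: datetime) -> list[Task]:
    """Pick the tasks that are due and unlocked, and lock them for `worker`."""
    selected = []
    for task in tasks:
        never_checked = task.last_check is None
        due = never_checked or task.last_check + task.max_timedelta < now
        if due and task.selected_by is None:
            task.selected_by = worker   # lock: no other worker will get it
            task.selected_at = now
            selected.append(task)
    return selected
```

Because selected tasks are locked immediately, a second worker asking at the same moment receives none of them.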

From time to time (cleanup):

  • Check for stalled tasks, i.e. tasks for which (datetime.now() - selected_at) > MAX_WORKER_TIME, and remove their lock.
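
A minimal sketch of that cleanup, again on in-memory objects. The `Task` dataclass, the `release_stalled` name and the 10-minute value for MAX_WORKER_TIME are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_WORKER_TIME = timedelta(minutes=10)  # assumed value, not from Argos


@dataclass
class Task:
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def release_stalled(tasks: list[Task], now: datetime) -> int:
    """Unlock tasks held for longer than MAX_WORKER_TIME; return how many."""
    released = 0
    for task in tasks:
        if task.selected_at is not None and now - task.selected_at > MAX_WORKER_TIME:
            task.selected_by = None
            task.selected_at = None
            released += 1
    return released
```

Once unlocked, a task becomes selectable again the next time a worker asks for work.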

On the worker side:

  1. Hey, I'm XX, give me some work.
  2. OK, this is done, here are the results for Task: response.
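
The two-step exchange can be sketched as a single function. To avoid guessing the HTTP API, the transport is injected as callables: `get_tasks`, `run_check` and `post_result` are hypothetical stand-ins for the requests to the service and for the check itself.

```python
from typing import Callable


def work_once(
    worker_id: str,
    get_tasks: Callable[[str], list[dict]],
    run_check: Callable[[dict], dict],
    post_result: Callable[[str, dict, dict], None],
) -> int:
    """One round trip: ask for work, run each task, report each result."""
    tasks = get_tasks(worker_id)              # 1. "I'm XX, give me some work."
    for task in tasks:
        result = run_check(task)
        post_result(worker_id, task, result)  # 2. "Here are the results for Task."
    return len(tasks)
```

An agent would call `work_once` in a loop, sleeping for a while whenever it returns 0.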