mirror of
https://framagit.org/framasoft/framaspace/argos.git
synced 2025-04-28 18:02:41 +02:00
# Argos

Argos is an HTTP monitoring service. It allows you to define a list of websites to monitor, and a list of checks to run on these websites. It will then run these checks periodically, and alert you if something goes wrong.

Todo:

- [ ] Fix: `Retrying: attempt 1413 ended with: <Future at 0x104f39390 state=finished raised RuntimeError> Cannot reopen a client instance, once it has been closed.`
- [ ] Cleandb should keep a maximum number of results per task
- [ ] Do not return an empty list on `/` when there are no results from agents
- [ ] Last seen agents
- [ ] Give a quick overview of the monitoring status
- [ ] Rename "error" to "unexpected error"
- [ ] Use background tasks for alerting
- [ ] Delete outdated tasks from config
- [ ] Implement alerting tasks
- [ ] Handle multiple alerting backends (email, SMS, Gotify)
- [ ] Add a configuration flag that automatically adds a job checking the 301 redirect from the HTTP version to HTTPS
- [ ] Add an "unknown" severity for check errors
- [ ] Add a way to specify the severity of the alerts in the config
- [ ] Add a command to generate a new authentication token

Features:

- [x] Uses `.yaml` files for configuration;
- [x] Reads the configuration file and converts it to tasks;
- [x] Stores tasks in a database;
- [x] Multiple paths per website can be tested;
- [x] Handles job failures on the clients;
- [x] Exposes an HTTP API that can be consumed by other systems;
- [x] Checks can be distributed on the network thanks to a job queue.

Implemented checks:

- [x] Returned status code matches what you expect;
- [x] Returned body matches what you expect;
- [x] SSL certificate expires in more than X days.

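The three checks listed above can be pictured as small predicates. The sketch below is only an illustration: the function names and signatures are hypothetical and are not taken from the Argos code base, and the body check is assumed to be a plain substring match.

```python
from datetime import datetime, timedelta, timezone


def status_is(actual_status: int, expected: str) -> bool:
    """Pass when the returned status code equals the expected one."""
    return str(actual_status) == expected


def body_contains(body: str, expected: str) -> bool:
    """Pass when the expected fragment appears in the response body."""
    return expected in body


def certificate_valid_for(not_after: datetime, min_days: int, now: datetime) -> bool:
    """Pass when the certificate expires in more than `min_days` days."""
    return not_after - now > timedelta(days=min_days)


# Hypothetical usage with fixed values, so the result does not depend on the clock.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(status_is(200, "200"))
print(body_contains('<div id= "mypads"></div>', "mypads"))
print(certificate_valid_for(now + timedelta(days=10), 5, now))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the expiration check easy to test.
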
## How to run

To install it, create a virtualenv and install the dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

Prepare a configuration file. You can copy the `config-example.yaml` file and edit it:

```bash
cp config-example.yaml config.yaml
```

Then, you can run the server:

```bash
argos server run
```

You can configure the server with environment variables, either set directly or placed in a `.env` file:

```bash
ARGOS_DATABASE_URL=postgresql://localhost/argos
ARGOS_YAML_FILE=config.yaml
```

The server reads a YAML file at startup and populates the tasks specified in it. See the configuration section below for more information on how to configure the checks you want to run.

And here is how to run the agent:

```bash
argos agent http://localhost:8000 "<auth-token>"
```

You also need to run cleaning tasks periodically. `argos server clean --help` will give you more information on how to do that.

Here is a crontab example:

```bash
# Run the cleaning tasks every hour (at minute 7)
7 * * * * argos server clean --max-results 100000 --max-lock-seconds 3600
```

## Configuration

Here is a simple configuration file:

```yaml
general:
  frequency: "1m" # Run checks every minute.
  alerts:
    error:
      - local
    warning:
      - local
    alert:
      - local
service:
  secrets:
    # Secrets can be generated using `openssl rand -base64 32`.
    # DO NOT REUSE THESE ONES.
    - "O4kt8Max9/k0EmHaEJ0CGGYbBNFmK8kOZNIoUk3Kjwc"
    - "x1T1VZR51pxrv5pQUyzooMG4pMUvHNMhA5y/3cUsYVs="

ssl:
  thresholds:
    - "1d": critical
    - "5d": warning

# It's also possible to define the checks in another file
# with the include syntax:
#
# websites: !include websites.yaml
#
websites:
  - domain: "https://mypads.framapad.org"
    paths:
      - path: "/mypads/"
        checks:
          - status-is: "200"
          - body-contains: '<div id= "mypads"></div>'
          - ssl-certificate-expiration: "on-check"
      - path: "/admin/"
        checks:
          - status-is: "401"
```

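To make the "configuration file to tasks" step more concrete, here is a minimal sketch of how the `websites` section could be flattened into one task per check. It is an assumption about the shape of the conversion, not the actual Argos implementation: `expand_tasks` is a hypothetical helper, and the configuration is written as a Python dict so that the example does not need a YAML parser.

```python
# Hypothetical in-memory equivalent of the `websites` section above.
config = {
    "websites": [
        {
            "domain": "https://mypads.framapad.org",
            "paths": [
                {
                    "path": "/mypads/",
                    "checks": [
                        {"status-is": "200"},
                        {"body-contains": '<div id= "mypads"></div>'},
                        {"ssl-certificate-expiration": "on-check"},
                    ],
                },
                {"path": "/admin/", "checks": [{"status-is": "401"}]},
            ],
        }
    ]
}


def expand_tasks(config: dict) -> list[tuple[str, str, str]]:
    """Flatten the config into (url, check name, expected value) tuples."""
    tasks = []
    for website in config["websites"]:
        for path in website["paths"]:
            url = website["domain"] + path["path"]
            for check in path["checks"]:
                # Each check is a single-key mapping: {name: expected}.
                for name, expected in check.items():
                    tasks.append((url, name, expected))
    return tasks


tasks = expand_tasks(config)
for task in tasks:
    print(task)
```

With the example configuration this yields four tasks: three for `/mypads/` and one for `/admin/`.
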
## Development notes

### On service start

1. Read the job definitions file and populate the database.
2. From the job definition, create a list of tasks to execute.
3. From time to time (?) clean the db.

### On configuration changes

- Find and tombstone the JobDefinitions that are not useful anymore.
- Cascade delete the child tasks that are planned. Tombstone them as well.

### On worker demand

- Find the tasks for which:
  - (`last_check` is not defined
  - OR `last_check + max_timedelta < datetime.now()`, i.e. the check is overdue)
  - AND `selected_by` is not defined.
- Mark these tasks as selected by the current worker, on the current date.

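The selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the real query: the `Task` class and `select_tasks` function are hypothetical, tasks live in a plain list instead of the database, and a task is treated as due once `last_check + max_timedelta` lies in the past.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    # Field names follow the notes above; the class itself is hypothetical.
    id: int
    last_check: Optional[datetime] = None
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def select_tasks(tasks, worker, now, max_timedelta):
    """Lock and return the tasks that are due and not held by any worker."""
    selected = []
    for task in tasks:
        never_checked = task.last_check is None
        overdue = not never_checked and task.last_check + max_timedelta < now
        if (never_checked or overdue) and task.selected_by is None:
            # Mark the task as selected by this worker, on the current date.
            task.selected_by = worker
            task.selected_at = now
            selected.append(task)
    return selected


now = datetime(2025, 1, 1, 12, 0)
tasks = [
    Task(1),                                          # never checked
    Task(2, last_check=now - timedelta(minutes=5)),   # overdue
    Task(3, last_check=now - timedelta(seconds=10)),  # checked recently
    Task(4, selected_by="other-agent"),               # already locked
]
picked = select_tasks(tasks, "agent-1", now, timedelta(minutes=1))
print([task.id for task in picked])
```

In a real database the selection and the locking would need to happen atomically (for example in a single transaction) so that two workers cannot pick the same task.
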
### From time to time (cleanup)

- Check for stalled tasks, i.e. those where `(datetime.now() - selected_at) > MAX_WORKER_TIME`, and remove their lock.

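A minimal sketch of this cleanup, again on an in-memory list: `Task` and `release_stalled` are hypothetical names, and the one-hour value of `MAX_WORKER_TIME` is an assumption chosen to mirror the `--max-lock-seconds 3600` option of the crontab example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed value, mirroring `--max-lock-seconds 3600` from the crontab example.
MAX_WORKER_TIME = timedelta(seconds=3600)


@dataclass
class Task:
    # Hypothetical stand-in for a stored task.
    id: int
    selected_by: Optional[str] = None
    selected_at: Optional[datetime] = None


def release_stalled(tasks, now, max_worker_time=MAX_WORKER_TIME):
    """Remove the lock of tasks held longer than `max_worker_time`."""
    released = 0
    for task in tasks:
        if task.selected_at is not None and now - task.selected_at > max_worker_time:
            task.selected_by = None
            task.selected_at = None
            released += 1
    return released


now = datetime(2025, 1, 1, 12, 0)
tasks = [
    Task(1, "agent-1", now - timedelta(hours=2)),    # stalled lock
    Task(2, "agent-2", now - timedelta(minutes=5)),  # still within the limit
    Task(3),                                         # not locked
]
released = release_stalled(tasks, now)
print(released)
```

Once a lock is removed, the task becomes eligible again the next time a worker asks for work.
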
### On the worker side

1. Hey, I'm XX, give me some work.
2. <Service answers> OK, this is done, here are the results for Task<id>: response.

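
The exchange above can be read as one round of a loop: ask for work, run it, report the results. The sketch below illustrates that round with in-memory stand-ins. Every name in it (`agent_round`, `get_tasks`, `run_check`, `post_results`, the dictionary keys) is hypothetical, and it does not reflect the real HTTP API.

```python
def agent_round(agent_id, get_tasks, run_check, post_results) -> int:
    """Run one request/response cycle between a worker and the service."""
    # Step 1: identify ourselves and ask the service for work.
    tasks = get_tasks(agent_id)
    # Run each task locally and keep the outcome keyed by task id.
    results = [{"task": task["id"], "response": run_check(task)} for task in tasks]
    # Step 2: send the results back to the service.
    if results:
        post_results(agent_id, results)
    return len(results)


# Stand-ins for the service and the check runner, for illustration only.
received = []
count = agent_round(
    "agent-1",
    get_tasks=lambda agent_id: [{"id": 1, "url": "https://example.org/"}],
    run_check=lambda task: "success",
    post_results=lambda agent_id, results: received.extend(results),
)
print(count, received)
```

Passing the service calls in as functions keeps the loop independent of the transport, so it can be exercised without a running server.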