title | status |
---|---|
A GeoJSON distributed Key/Value store | draft |
Our requirements for uMap made it clear that we don't really need a complex CRDT library. We need a way for peers to agree on who the last writer is, or, to put it differently, a way to order the operations.
The rest of the picture is rather simple: just replace the value at a given location.
The context
In our case, we have a server, and for the foreseeable future we will continue to have one. We don't want to rely on the server to assign timestamps to each write operation, because that would add too much delay, but we can use it to make sure the messages are transmitted to the right peers.
So:
- The server has a GeoJSON document, and wants to sync it with other peers.
- The server has a way to reach all the peers, using a WebSocket connection (sketched after this list).
- Peers can go online and offline, and sync their changes with other peers when they have a connection.
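To illustrate that shape, here is a minimal sketch of the relay part, using the Python websockets library. This is an assumption about how it could look, not uMap's actual server code:

```python
import asyncio

import websockets

# All the peers currently connected to this map.
CONNECTED = set()

async def relay(websocket):
    """Accept a peer connection and relay its operations to the other peers."""
    CONNECTED.add(websocket)
    try:
        async for message in websocket:
            # Forward each operation to every other connected peer.
            websockets.broadcast(CONNECTED - {websocket}, message)
    finally:
        CONNECTED.discard(websocket)

async def main():
    async with websockets.serve(relay, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```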
A Hybrid Logical Clock (HLC)
The different peers will send operations to the server when something changes on their end. We need a way for peers to agree on the ordering of the operations, without having to coordinate with each other.
One way to solve this is by using a hybrid logical clock. Each peer keeps a hybrid physical + logical clock, composed of:
- The physical time: the current clock time.
- A logical counter: an incrementing counter, to differentiate between events that happened at the same clock tick.
- A peer identifier: a unique identifier for the peer.
This makes it possible to order the different "times": first compare the physical times, then the logical counters, and finally the peer identifiers if the other components are equal.
```python
def hlc_compare(hlc1, hlc2):
    """Compare two HLCs; returns -1, 0 or 1, like an old-style cmp()."""
    physical_time1, logical_time1, peer_id1 = hlc1
    physical_time2, logical_time2, peer_id2 = hlc2
    if physical_time1 != physical_time2:
        # The physical (wall-clock) time wins first.
        return (physical_time1 > physical_time2) - (physical_time1 < physical_time2)
    elif logical_time1 != logical_time2:
        # Same physical tick: fall back to the logical counter.
        return (logical_time1 > logical_time2) - (logical_time1 < logical_time2)
    else:
        # Still tied: the peer identifier breaks the tie deterministically.
        return (peer_id1 > peer_id2) - (peer_id1 < peer_id2)
```
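For completeness, here is a minimal sketch of how a peer could maintain such a clock; the `HLC` class below and its update rules are an assumption about the technique, not uMap's actual code:

```python
import time


class HLC:
    """A minimal hybrid logical clock (illustrative sketch)."""

    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.physical_time = 0
        self.logical_time = 0

    def tick(self):
        """Advance the clock for a local event and return a new timestamp."""
        now = int(time.time())
        if now > self.physical_time:
            # The wall clock moved forward: reset the logical counter.
            self.physical_time, self.logical_time = now, 0
        else:
            # Same second (or a clock that went backwards): bump the counter.
            self.logical_time += 1
        return (self.physical_time, self.logical_time, self.peer_id)

    def update(self, remote):
        """Merge a timestamp received from another peer."""
        remote_physical, remote_logical, _ = remote
        now = int(time.time())
        if now > max(self.physical_time, remote_physical):
            self.physical_time, self.logical_time = now, 0
        elif remote_physical > self.physical_time:
            self.physical_time = remote_physical
            self.logical_time = remote_logical + 1
        elif remote_physical == self.physical_time:
            self.logical_time = max(self.logical_time, remote_logical) + 1
        else:
            self.logical_time += 1
```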
Operations
For now, let's represent the operations like this:

```python
# Set foobar to peerA: (physical time, logical counter, peer id, payload)
(1712938520, 500, "Peer A", "foobar=peerA")
```
Peers will send operations to the server, which will assign an incremental ID to each one, making it possible to keep track of which operations have already been received by each peer.
ID | Clock | Payload |
---|---|---|
90 | 1712938501, 500, "Peer A" | title=super |
99 | 1712938520, 501, "Peer A" | layertype=cluster |
100 | 1712938502, 300, "Peer B" | color=blue |
101 | 1712938510, 301, "Peer B" | markertype=drop |
102 | 1712938520, 502, "Peer A" | foobar=peerA |
103 | 1712938510, 302, "Peer B" | foobar=peerB |
As we can see, incremental IDs might not match the clock order.
They are mainly here as a way for the server to know which operations should be returned to the peers, based on their last known ID.
For instance, if peer A knows up to ID 99, when it gets back online it can ask the server for the operations that happened since then, and apply them locally before rendering its interface.
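Server-side, this doesn't need much more than an append-only log that can be sliced by ID. Here is a minimal sketch; the class and method names are made up for illustration:

```python
class OperationLog:
    """An append-only log of operations, indexed by an incremental ID
    (a sketch, not uMap's actual server code)."""

    def __init__(self):
        self.operations = []  # list of (id, hlc, payload) tuples

    def append(self, hlc, payload):
        """Assign the next incremental ID to an incoming operation."""
        next_id = len(self.operations) + 1
        self.operations.append((next_id, hlc, payload))
        return next_id

    def since(self, last_known_id):
        """Return the operations a reconnecting peer hasn't seen yet."""
        return [op for op in self.operations if op[0] > last_known_id]
```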
Compacting the data
We need to distinguish between these use cases:
- A new peer enters, and needs to get the freshest view from the server.
- An existing peer reconnects; it might have local data to send, and needs to reconcile it with the freshest state.
A new peer enters
When a new peer enters and needs to get fresh data, we can do it in multiple ways. The simplest one is to make it redo exactly what happened on the other peers: get the same GeoJSON and apply the same operations to it.
This works, but it requires the server to send every operation that ever happened to every client, which is suboptimal. We can avoid this by compacting the data on the server before serving it.
This compacted view of the document needs to keep an HLC associated with each key.
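As a minimal sketch of that compaction, reusing `hlc_compare` from above and assuming the simple `key=value` payloads from the examples (the real GeoJSON case is richer):

```python
def compact(operations):
    """Fold a list of (hlc, payload) operations into the latest value per key."""
    document = {}  # key -> (hlc, value)
    for hlc, payload in operations:
        key, value = payload.split("=", 1)
        current = document.get(key)
        # Keep, for each key, the write with the highest HLC.
        if current is None or hlc_compare(hlc, current[0]) > 0:
            document[key] = (hlc, value)
    return document
```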
An existing peer reconnects
- T0 = Peer A retrieves the data, then goes offline with it.
- T1 = Peer A updates the data locally.
- T100 = Peer B updates the data and syncs.
- T101 = Peer A goes back online.
At T101, Peer A will tell the server: "hey, let me know what data you have since T0". The server will return the operations that happened since then, and Peer A will apply them locally. It will then send the server its own changes made at T1.
Because peers know the time at which each change was made, old data arriving late will only be applied if it's fresher than the current value.
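In code, this is the same last-writer-wins rule compaction relies on; a sketch, assuming the `key -> (hlc, value)` representation used above:

```python
def apply_operation(document, hlc, payload):
    """Apply an incoming operation only if its HLC is fresher than what we have."""
    key, value = payload.split("=", 1)
    current = document.get(key)
    if current is None or hlc_compare(hlc, current[0]) > 0:
        document[key] = (hlc, value)
        return True  # applied
    return False  # stale write, ignored
```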
The data
Because we're using GeoJSON, the data is well defined. It looks like this:
```json
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}
```
On uMap, a simple `.geojson` map looks like this (it's long, but I already removed a lot of data from it):
```json
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
-1.426849365234375,
48.182112019648514
]
},
"properties": {
"name": "Title",
"zoom": 10,
"miniMap": false,
"overlay": {},
"slideshow": {},
"tilelayer": {
"tms": false,
"name": "OSM Positron (Carto)",
"maxZoom": 20,
"minZoom": 0,
"attribution": "Map tiles by [[http://cartodb.com/attributions#basemaps|CartoDB]], under [[https://creativecommons.org/licenses/by/3.0/|CC BY 3.0]]. map data © [[http://osm.org/copyright|OpenStreetMap contributors]] under ODbL ",
"url_template": "https://cartodb-basemaps-{s}.global.ssl.fastly.net/light_all/{z}/{x}/{y}.png"
},
"longCredit": "https://www.data.gouv.fr/fr/datasets/repertoire-national-des-associations/",
"defaultView": "data",
"description": "",
"limitBounds": {},
"onLoadPanel": "caption",
// redacted. Tons of properties
"permissions": {
"edit_status": 3,
"share_status": 1,
"owner": {
"id": 2712,
"name": "ametaireau",
"url": "/fr/user/ametaireau/"
},
"editors": []
},
"umap_id": 1053526,
"umap_version": "2.1.3",
"featuresHaveOwner": false,
"datalayers": [
{
"id": "badb1518-9ff1-40a2-b8a8-4d6976904469",
"fill": true,
"name": "Concentration d'associations",
"type": "Choropleth",
"opacity": 0.1,
"editMode": "disabled",
"labelKey": "{nom}: {orgCount} ",
"browsable": true,
"inCaption": true,
"showLabel": null,
"choropleth": {
"mode": "kmeans",
"breaks": "0,11,25,51,149,420,708,2746,2746",
"classes": 8,
"property": "orgCount"
},
"remoteData": {},
"description": "Le nombre d'associations par commune",
"fillOpacity": 0.8,
"permissions": {
"edit_status": 0
},
"displayOnLoad": true,
"labelInteractive": false
}
]
}
}
```
And a layer looks like this:
```json
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
},
"properties": {
"nom": "Dinagat Islands",
"orgCount": 10,
}
}
],
"_umap_options": {
"displayOnLoad": true,
"inCaption": true,
"browsable": true,
"editMode": "advanced",
"name": "Photographies de 2019",
"color": "Crimson",
"iconUrl": "/uploads/pictogram/embassy-24.png",
"iconClass": "Ball",
"remoteData": {},
"description": ""
}
}
```
It's interesting to note that the data here has a mixed purpose: it is at the same time configuration for uMap and data for the geographical objects.
Now, what we want to do is to propagate the changes from one peer to another, going through the server. Because uMap doesn't internally use the GeoJSON keys to handle its changes, we will need to match changes in the GeoJSON with changes in uMap. It actually goes both ways: when detecting the modified data, and when applying the received changes.
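As a rough illustration (the property paths and option names below are hypothetical, not uMap's actual schema), this mapping could start as a simple two-way dictionary:

```python
# Hypothetical mapping between GeoJSON property paths and uMap option
# names; the entries are illustrative, not uMap's actual schema.
GEOJSON_TO_UMAP = {
    "properties.name": "options.name",
    "properties.description": "options.description",
}
# The reverse direction, used when applying received changes.
UMAP_TO_GEOJSON = {umap: geojson for geojson, umap in GEOJSON_TO_UMAP.items()}
```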