blog.notmyidea.org/content/code/2024-02-24-crdts.md

title: A comparison of different JavaScript CRDTs
status: draft
tags: crdts, umap, sync
display_toc: true

Collaboration has been one of the most requested features on uMap for a long time. We've added a way to merge edits made to the same layer, but ideally we would love to make things easier to understand and more fluid for the users.

For this reason, I dug deeper into CRDTs, with the goal of understanding how they work, which libraries are out there, and which one would be a good fit for us, if any.

So far, the way I've thought about collaboration features on uMap is:

  • a) catching changes when they are made in the interface;
  • b) sending messages to the other party; and
  • c) applying the changes on the receiving client.

This works well in general, but it doesn't handle conflicts, especially when a client gets disconnected.

Our requirements

We're looking for something that:

  • Stores and retrieves arbitrary key/value pairs, like the name of the map, the default color of the markers, the type of layers, etc.
  • Propagates the changes to the other parties, so they can see the edits in real-time ;
  • Handles disconnections, so it's possible to work offline, or with an intermittent/bad connection.

In terms of API, we want something reactive and simple. I'm thinking about:

let store = new Store({websocket:"wss://server"})
store.on((updates) => {
	// do something with the updates
})
store.set('key', 'value')
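
To make the target more concrete, here is a minimal in-memory sketch of such a store, following steps a) to c) above. It is only an illustration: the `transport` object (anything with a `send` method and an `onmessage` hook) is a hypothetical stand-in for the websocket, and there is no conflict handling at all.

```javascript
// Hypothetical sketch of the reactive store described above.
// `transport` stands in for the websocket: it needs a `send(message)` method,
// and the store assigns its `onmessage` callback.
class Store {
  constructor(transport) {
    this.data = new Map();
    this.listeners = [];
    this.transport = transport;
    // c) apply the changes coming from the other party
    transport.onmessage = (message) => {
      const { key, value } = JSON.parse(message);
      this.apply(key, value);
    };
  }

  // Register a callback that receives every update, local or remote.
  on(listener) {
    this.listeners.push(listener);
  }

  // a) catch the local change, then b) send it to the other party.
  set(key, value) {
    this.apply(key, value);
    this.transport.send(JSON.stringify({ key, value }));
  }

  apply(key, value) {
    this.data.set(key, value);
    this.listeners.forEach((listener) => listener([{ key, value }]));
  }
}

// Two stores wired directly to each other, without any network.
const wireA = { send: (message) => wireB.onmessage(message) };
const wireB = { send: (message) => wireA.onmessage(message) };
const storeA = new Store(wireA);
const storeB = new Store(wireB);

storeB.on((updates) => console.log("storeB received", updates));
storeA.set("name", "My map");
```

If both stores set the same key at the same time, each one ends up with the other's value, which is exactly the kind of conflict the rest of this article is about.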

What are CRDTs anyway?

Conflict-free Replicated Data Types (CRDTs) are a specific kind of datatype, able to merge its state with other states without generating conflicts. They handle consistency in distributed systems, making them particularly well-suited for collaborative real-time applications.

CRDTs ensure that multiple participants can make changes without strict coordination, and all replicas converge to the same state upon synchronization, without conflicts.

"Append-only sets" are probably one of the most common types of CRDT: if multiple parties add the same element, it will be present only once. It's our old friend Set.

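As a small illustration of the append-only set, merging two replicas can be sketched as a plain union. The marker ids are made up for the example.

```javascript
// Merging two append-only sets is a plain union.
function mergeSets(a, b) {
  return new Set([...a, ...b]);
}

const replicaA = new Set(["marker-1", "marker-2"]);
const replicaB = new Set(["marker-2", "marker-3"]);

// The direction of the merge doesn't matter:
// both replicas end up with the same three elements.
const mergedOnA = mergeSets(replicaA, replicaB);
const mergedOnB = mergeSets(replicaB, replicaA);

console.log([...mergedOnA].sort().join(", ")); // marker-1, marker-2, marker-3
console.log([...mergedOnB].sort().join(", ")); // marker-1, marker-2, marker-3
```

Merging twice, or in a different order, gives the same result, which is what makes the convergence possible without coordination.
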
How can CRDTs help in our case?

For uMap, CRDTs offer a solution to several challenges:

  1. Simultaneous Editing: When multiple users interact with the same map, their changes must not only be reflected in real-time but also merged seamlessly without overwriting each other's contributions.

  2. Network Latency and Partition: uMap operates over networks that can experience delays or temporary outages. CRDTs can handle these conditions gracefully, enabling offline editing and eventual consistency.

  3. Simplified Conflict Resolution: Traditional methods often require complex algorithms to resolve conflicts, while CRDTs inherently minimize the occurrence of conflicts altogether.

  4. Decentralization: While uMap currently relies on a central server, adopting CRDTs could pave the way for a more decentralized architecture, increasing resilience and scalability.

How do CRDTs differ from traditional data synchronization methods?

Traditional data synchronization methods typically rely on a central source of truth, such as a server, to manage and resolve conflicts. When changes are made by different users, these traditional systems require a round-trip to the server for conflict resolution and thus can be slow or inadequate for real-time collaboration.

In contrast, CRDTs leverage mathematical properties (the fact that the datatypes can converge) to ensure that every replica independently reaches the same state, without the need for a central authority, thus minimizing the amount of coordination and communication needed between nodes.

This ability to maintain consistency sets CRDTs apart from conventional synchronization approaches and makes them particularly valuable for the development of collaborative tools like uMap, where real-time updates and reliability are important.

Last Write Wins Registers

For managing key/value data, I'm leaning towards Last-Write-Wins (LWW) registers within CRDTs. With LWW, the main concern is establishing the sequence of updates. In a single-client scenario or with a central time reference, sequencing is straightforward. However, in a distributed environment, time discrepancies across peers can complicate things, as clocks may drift and lose synchronization.
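
To give an idea of what this looks like, here is a minimal sketch of a last-write-wins key/value map. It assumes a logical counter instead of wall-clock time, with the peer id used only to break ties. The class and field names are mine, not taken from any of the libraries discussed here.

```javascript
// Each value carries a timestamp: a logical counter plus the id of the
// peer that wrote it (used only to break ties between equal counters).
function isNewer(a, b) {
  if (a.counter !== b.counter) return a.counter > b.counter;
  return a.peer > b.peer;
}

class LWWMap {
  constructor(peer) {
    this.peer = peer;
    this.counter = 0;
    this.entries = new Map(); // key -> { value, counter, peer }
  }

  set(key, value) {
    this.counter += 1;
    const entry = { value, counter: this.counter, peer: this.peer };
    this.entries.set(key, entry);
    return { key, ...entry }; // the update to send to the other peers
  }

  // Apply a remote update; keep it only if it is newer than what we have.
  receive({ key, value, counter, peer }) {
    this.counter = Math.max(this.counter, counter);
    const current = this.entries.get(key);
    const incoming = { value, counter, peer };
    if (!current || isNewer(incoming, current)) {
      this.entries.set(key, incoming);
    }
  }

  get(key) {
    const entry = this.entries.get(key);
    return entry ? entry.value : undefined;
  }
}

const alice = new LWWMap("alice");
const bob = new LWWMap("bob");

// Both peers edit the same key before hearing from each other.
const fromAlice = alice.set("color", "red");
const fromBob = bob.set("color", "blue");
alice.receive(fromBob);
bob.receive(fromAlice);

console.log(alice.get("color"), bob.get("color")); // blue blue
```

Both writes have the same counter, so the tie is broken on the peer id. The choice is arbitrary, but every peer makes the same one, so they all converge.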

To address this, CRDTs typically rely on logical clocks such as vector clocks, a data structure that helps establish the relative ordering of events across distributed systems and pinpoint any inconsistencies.

A vector clock is a data structure used for determining the partial ordering of events in a distributed system and detecting causality violations.

Wikipedia
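
As a rough sketch of the idea (not how any particular library implements it), a vector clock can be represented as a plain object mapping each peer id to the number of events seen from that peer. Comparing two clocks then tells whether one event happened before the other, or whether they are concurrent.

```javascript
// A vector clock maps each peer id to the number of events seen from that peer.
// compareClocks() returns "before", "after", "equal" or "concurrent".
function compareClocks(a, b) {
  const peers = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false;
  let bAhead = false;
  for (const peer of peers) {
    const inA = a[peer] || 0;
    const inB = b[peer] || 0;
    if (inA > inB) aAhead = true;
    if (inB > inA) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// bob has seen everything alice did, plus one more event from each peer.
console.log(compareClocks({ alice: 1 }, { alice: 2, bob: 1 })); // before

// Each side has seen something the other hasn't: the edits are concurrent.
console.log(compareClocks({ alice: 2 }, { alice: 1, bob: 1 })); // concurrent
```

The "concurrent" case is the one where a tie-breaking rule, such as last-write-wins, is needed.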

At first, I found CRDTs somewhat confusing, owing to their role in addressing complex challenges. CRDTs come in various forms, with much of their intricacy tied to resolving content conflicts within textual data or elaborate hierarchical structures. Fortunately for us, our use case is comparatively straightforward.

CRDTs converging to the same state

Note that we could also use a library such as rxdb to handle syncing, offline support, etc., because we have a master: the server, which we can use to handle merge conflicts. But doing so gives more responsibility to the server, whereas with CRDTs the merge can happen on the clients only.


The libraries

So, with this in mind, I've tried to have a look at the different libraries that are out there, and assess how they would work for us.

I'll be comparing three CRDT libraries (Y.js, Automerge and JSON Joy) in the context of a JS mapping application built with Leaflet.

Here are the different areas I'm interested in:

  1. Type: Is it state-based or operation-based? What is the impact for us?
  2. Offline Sync: How does it handle offline edits and their integration upon reconnection?
  3. Efficiency: Probe the bandwidth and storage demands. What's being transmitted over the wire? To test that, I've just connected two peers and added 20 points on each, and looked at the network impact.
  4. Community and Support: How large and active is the developer community / ecosystem?
  5. Size of the JavaScript library
  6. Browser support in general.

Different types of CRDTs

While reading the literature, I found that there are two kinds of CRDTs: state-based and operation-based. So, which one do we need?

It turns out most of the CRDT implementations I looked at are operation-based, and expose an API to interact with them as you change the state, so in practice it doesn't really matter.

The two alternatives are theoretically equivalent, as each can emulate the other. However, there are practical differences. State-based CRDTs are often simpler to design and to implement; their only requirement from the communication substrate is some kind of gossip protocol. Their drawback is that the entire state of every CRDT must be transmitted eventually to every other replica, which may be costly. In contrast, operation-based CRDTs transmit only the update operations, which are typically small.

Wikipedia on CRDTs
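
The difference can be sketched with a simple counter. This is an illustration of the two families, not code taken from any of the libraries: the state-based version ships its whole state and merges it, while the operation-based version only ships the operations.

```javascript
// State-based: each replica keeps one count per peer and ships its whole state.
// Merging takes the maximum per peer, so receiving the same state twice is harmless.
function mergeState(local, remote) {
  const merged = { ...local };
  for (const [peer, count] of Object.entries(remote)) {
    merged[peer] = Math.max(merged[peer] || 0, count);
  }
  return merged;
}
const total = (state) => Object.values(state).reduce((sum, n) => sum + n, 0);

// Operation-based: only the operation travels, which is smaller, but the
// transport must then neither lose nor duplicate operations.
function applyOp(value, op) {
  return op.type === "increment" ? value + op.amount : value;
}

const stateA = { a: 2, b: 0 };
const stateB = { a: 1, b: 3 };
console.log(total(mergeState(stateA, stateB))); // 5
console.log(total(mergeState(mergeState(stateA, stateB), stateB))); // 5

const ops = [
  { type: "increment", amount: 2 },
  { type: "increment", amount: 3 },
];
console.log(ops.reduce(applyOp, 0)); // 5
```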

Offline support

One thing that I would like to clarify is how these libs behave when peers go offline and come back online, if they do something at all.

I was expecting something along the lines of:

  1. Connection is lost: changes are applied locally (messages are piling up somehow);
  2. Connection is back: the client syncs with other clients.

Which is exactly what happens. CRDTs don't do any magic here: they send "sync" messages whenever needed, which brings all the clients to the same state.
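
The two steps above can be sketched as a small queue. This is a simplified illustration, not what the libraries do internally (they exchange sync messages rather than replaying a queue). The `send` function is a hypothetical stand-in for whatever pushes a message to the other peers.

```javascript
// Hypothetical wrapper: `send` is whatever function pushes a message to the
// other peers (a websocket send, for instance).
class OfflineQueue {
  constructor(send) {
    this.send = send;
    this.online = true;
    this.pending = [];
  }

  push(message) {
    if (this.online) {
      this.send(message);
    } else {
      this.pending.push(message); // 1. connection is lost: messages pile up
    }
  }

  disconnect() {
    this.online = false;
  }

  reconnect() {
    this.online = true;
    // 2. connection is back: flush what was queued, in order
    while (this.pending.length > 0) {
      this.send(this.pending.shift());
    }
  }
}

const delivered = [];
const queue = new OfflineQueue((message) => delivered.push(message));

queue.push("edit-1");
queue.disconnect();
queue.push("edit-2");
queue.push("edit-3");
console.log(delivered.length); // 1

queue.reconnect();
console.log(delivered.length); // 3
```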

How the server fits in the picture

The sync protocol is well defined and documented. While discussing with the team, I realized that I had been expecting the server to simply pass the messages along to the other parties, and that this would be how the synchronisation is done. It turns out I was mistaken: in this approach, the clients send updates to the server, which merges everything together and only then sends the updates to the other peers.

In order to have peers working with each other, I would need to change the way the provider works, so we can have the server be "brainless" and just relay the messages.

For Automerge, it would mean the provider will "just" handle the websocket connection (disconnect and reconnect) and all the peers would be able to talk with each other. The other solution for us would be to have the merge algorithm working on the server side, which comes with upsides (no need to find when the document should be saved by the client to the server) and downsides (it takes some CPU and memory to run the CRDTs on the server).
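
A "brainless" relay could look like the sketch below. It is transport-agnostic on purpose: a "peer" is any object with a `send(message)` method (for instance a thin wrapper around a websocket connection), and the `Relay` class is a name I made up for the example.

```javascript
// Hypothetical relay: a "peer" is any object with a `send(message)` method.
class Relay {
  constructor() {
    this.peers = new Set();
  }

  join(peer) {
    this.peers.add(peer);
  }

  leave(peer) {
    this.peers.delete(peer);
  }

  // Forward the message untouched to everyone except its sender.
  // The relay never decodes or merges anything.
  broadcast(sender, message) {
    for (const peer of this.peers) {
      if (peer !== sender) peer.send(message);
    }
  }
}

// Fake peers that just record what they receive.
const makePeer = () => {
  const inbox = [];
  return { inbox, send: (message) => inbox.push(message) };
};

const relay = new Relay();
const peerOne = makePeer();
const peerTwo = makePeer();
const peerThree = makePeer();
[peerOne, peerTwo, peerThree].forEach((peer) => relay.join(peer));

relay.broadcast(peerOne, "opaque-sync-message");
console.log(peerOne.inbox.length, peerTwo.inbox.length, peerThree.inbox.length); // 0 1 1
```

With this design the server needs no knowledge of the CRDT format, at the cost of not being able to persist a merged document by itself.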

Versioning

A quick note about versioning: it's possible to get a snapshot of the document at any point in time, and it's possible to store this information.

Y.js

Y.js is the first library I looked at, because it's the oldest one, and the most commonly referred to.

The API seems to offer what we're looking for, and provides a way to observe changes.

import * as Y from 'yjs'

const doc = new Y.Doc()
const map = doc.getMap()

// Register the observer first, so it sees the change made below.
map.observe((event) => {
  // read the keys that changed
  event.keysChanged

  // If I need to iterate on the keys, or get the old values, it's possible.
  event.changes.keys.forEach((change, key) => {
    // here, change.action can be "add", "update" or "delete"
    // So it's the right place to know what happened, and act accordingly.
    // You can get the latest value of the key with `map.get`
    map.get(key)
  })
})

map.set('key', 'value')

It comes with multiple "providers", which make it possible to sync with different protocols (there is even a way to sync over the matrix protocol 😇). More usefully for us, there is an implemented protocol for websockets.

Using a provider is as easy as:

import { WebsocketProvider } from "y-websocket";

// Sync clients with the y-websocket provider
const provider = new WebsocketProvider("ws://localhost:1234", "leaflet-sync", doc);

It's also possible to send "awareness" information (some state you don't want to persist, like the position of the cursor). It contains some useful meta information, such as the number of connected peers.

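// The awareness instance is exposed by the provider.
const awareness = provider.awareness;

// Here, `map` is the Leaflet map (not the Y.js map used earlier).
map.on("mousemove", ({ latlng }) => {
  awareness.setLocalStateField("user", {
    cursor: latlng,
  });
});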

I built a quick proof of concept with Y.js in a few hours, without hitting any problem. It handles offline mode and reconnections, and exposes awareness information.

Python

Y.js has been rewritten in Rust, with the Y.rs project, which makes it possible to use it from Python (see Y.py) if needed. The project has been implemented quite recently and is currently looking for a maintainer.

Library size

Size: Y.js is 4.16 kB, Y-Websocket is 21.14 kB.

The library is currently used in production by large projects such as AFFiNE and Evernote.

The data being transmitted

In the scenario where all clients connect to a central server, which handles the CRDT locally and then transmits the result back to the other parties, I found that adding 20 points on one client, and then 20 points on another, generates ~5 kB of data (~16 bytes per edit).

Pros:

  • The API felt natural to me: it handles plain old JavaScript objects, making it easy to integrate.
  • It seems to be widely used, and the community seems active.
  • It is well documented.
  • There is awareness support.

Cons:

  • It doesn't seem to work well without a JS bundler, which could be a problem for us.

Automerge

Automerge is another library to handle CRDTs. Automerge is actually the low level interface, and there is a higher-level interface exposed as Automerge-repo. Here is how to use it:

let handle = repo.create()

handle.change(d => { d.key = "value"})

When you change the document, you actually call change which makes it possible to do the changes in a kind of "transaction function".

You can observe the changes, getting you the whole list of patches:

handle.on("change", ({ doc, patches }) => {
	patches.forEach(({ action, path }) => {
		// Here I'm only taking action when a value is inserted
		// at the position "uuid".
		if (path.length == 2 && action === "insert") {
			let value = doc[path[0]];
			// do something here with the value
		}
	});
});

There is a high-level API, with support for Websocket:

import { Repo } from "@automerge/automerge-repo";
import { BrowserWebSocketClientAdapter } from "@automerge/automerge-repo-network-websocket";
import { IndexedDBStorageAdapter } from "@automerge/automerge-repo-storage-indexeddb";

const repo = new Repo({
	network: [new BrowserWebSocketClientAdapter("wss://sync.automerge.org")],
	storage: new IndexedDBStorageAdapter(),
});

Python

There is an automerge.py project, but no changes have been made to it in the last 3 years.

Library size

Size: 1.64 MB, total is 1.74 MB. It relies on WebAssembly by default.

The large bundle size is something the team is aware of, and is working on a solution for. For us, it's important to have something as lightweight as possible, considering CRDTs are only one part of what we're doing, and that mapping can be done in contexts where the connection is not that reliable or fast.

The data being transmitted

In the same scenario, I found that adding 20 points on one client, and then 20 points on another, generates 90 messages and 24.94 kB of data transmitted (~12 kB sent and ~12 kB received), so approximately 75 bytes per edit.

Pros:

  • There is an API to get informed when a conflict occurred.
  • In general, the documentation is low level, which can be a good thing while debugging, or for more advanced usage.
  • The team was responsive and tried to help.

Cons:

  • Documentation was a bit hard to understand and to navigate. Sometimes, it's easier to go look at the code.
  • The API is more verbose. You can see it as "less magical".
  • There is no way (at the moment) to tell whether a transaction is local or remote (but in practice it wasn't a problem).

JSON Joy

JSON Joy is the latest to join the party. It takes another stab at the problem by providing small libraries, each with a narrow functional scope. It sounds promising, even if still quite new, and would leave our hands free to implement the protocol that works for us.

import { Model } from 'json-joy/es2020/json-crdt';
import { s } from "json-joy/es6/json-crdt-patch";
import { encode, decode } from "json-joy/es6/json-crdt-patch/codec/verbose";

// Create a new JSON CRDT document.
const model = Model.withLogicalClock();

// Set the root of the document: an object holding the markers.
model.api.root({
	markers: {},
});

// `uuid` and `target` are defined elsewhere in the application
// (here, `target` is a Leaflet marker).
model.api.obj(["markers", uuid]).set(s.con(target._latlng));
model.api.obj("markers").events.onViewChanges.listen((changes) => console.log(changes))

When receiving an update, you could apply it, like this:

let patch = decode(payload);
model.api.apply(patch);

// And see the model with 
model.api.view();

Metrics:

  • Size: 143 kB
  • Data transmitted for 2 peers and 40 edits: (35 bytes per edit)

Pros:

  • It's low level, so you know what you're doing.
  • It's split into different small atomic libraries, which makes it easy to switch bits of the code if needed, without throwing everything away.
  • The interface lets you store different types of data (constants, values, arrays, etc).
  • It is distributed as different types of JS bundles (modules, wasm, etc).

Cons:

  • It doesn't provide a high-level interface for sync.
  • It seems to be a one-person project so far.
  • It's quite recent, so there are probably rough spots to be found.
  • I didn't find a lot of support (probably because the project is still quite new).

Summarizing CRDT Libraries for uMap

Let's summarize the key considerations and how each CRDT library aligns with our objectives for uMap. The goal is to assess their fit in terms of specific collaborative features, efficiency, and ease of integration.

| Feature / Library | Y.js | Automerge | JSON Joy |
| --- | --- | --- | --- |
| Intuitive API | Provides a natural feel with native JS objects. | API is transactional, detailed for change tracking. | Low-level API offers granular control. |
| Synchronization Protocol | Multiple options with providers, including WebSockets. | Multiple options with providers, including WebSockets. | Requires custom implementation for sync. |
| Conflict Resolution | Automatic merging with Y.js's internal mechanisms. | Detailed conflict detection and resolution API. | Allows control over data types and operations. |
| Offline Changes Handling | Inbuilt support for offline edits and synchronization. | Requires a more manual approach to handle offline changes. | Focused on model updates without specifics on network handling. |
| Versioning and History | Supports selective versioning through snapshots. | Designed with robust version history tracking. | Model is versioned but favors compact storage. |
| Community and Support | Active community with regular updates. | Strong support with a focus on collaboration. | Smaller community; promising but less established. |
| Library Size / Efficiency | Small size with efficient operation. | Larger library with dependency on WebAssembly. | Modular design with compact size. |
| Browser Compatibility | Broad compatibility, some bundler dependencies. | Supports modern browsers with potential polyfills. | Flexible bundles for diverse browser support. |
| Suitability for uMap | Ready-to-use with good documentation and examples. | Strong features, may require significant integration. | Promising, would need robustness as it matures. |

Notes on YATA and RGA

While researching, I found that the two popular CRDT implementations out there use different approaches for the virtual counter:

  • RGA [used by Automerge] maintains a single globally incremented counter (which can be an ordinary integer value), that's updated any time we detect that a remote insert has an id with a sequence number higher than the local counter. Therefore, every time we produce a new insert operation, we give it the highest counter value known at the time.
  • YATA [used by Yjs] also uses a single integer value, however unlike in the case of RGA we don't use a single counter shared with other replicas, but rather let each peer keep its own, which is incremented monotonically only by that peer. Since increments are monotonic, we can also use them to detect missing operations, e.g. updates marked as A:1 and A:3 imply that there must be another (potentially missing) update A:2.
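
The two counter strategies can be sketched as follows. This is my own simplified reading of the descriptions above, not code from Automerge or Yjs. The names are made up, and sequence numbers are assumed to start at 1, as in the A:1 / A:3 example.

```javascript
// RGA-style: one logical counter, bumped whenever a remote id is ahead of us.
class SharedCounter {
  constructor() {
    this.value = 0;
  }
  // Called when producing a new local operation.
  next() {
    this.value += 1;
    return this.value;
  }
  // Called when a remote operation is received.
  observe(remoteValue) {
    this.value = Math.max(this.value, remoteValue);
  }
}

// YATA-style: each peer numbers its own updates, so gaps reveal missing ones.
function missingUpdates(received) {
  const seen = new Set(received);
  const highest = Math.max(0, ...received);
  const missing = [];
  for (let n = 1; n <= highest; n += 1) {
    if (!seen.has(n)) missing.push(n);
  }
  return missing;
}

const counter = new SharedCounter();
counter.next(); // local operation gets 1
counter.observe(7); // a remote insert carried sequence number 7
const nextLocal = counter.next();
console.log(nextLocal); // 8

// We received A:1 and A:3, so A:2 must exist somewhere.
const gaps = missingUpdates([1, 3]);
console.log(gaps.join(", ")); // 2
```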
