diff --git a/docs/developer/gvisor.md b/docs/developer/gvisor.md
new file mode 100644
index 0000000..3f41dda
--- /dev/null
+++ b/docs/developer/gvisor.md
@@ -0,0 +1,284 @@
+# gVisor integration
+
+Dangerzone has relied on the container runtime available in each supported
+operating system (Docker Desktop on Windows / macOS, Podman on Linux) to isolate
+the host from the sanitization process. The problem with this type of isolation
+is that it exposes a rather large attack surface: the Linux kernel.
+
+[gVisor](https://gvisor.dev/) is an application kernel that emulates a
+substantial portion of the Linux Kernel API in Go. What's more interesting to
+Dangerzone is that it also offers an OCI runtime (`runsc`) that enables
+containers to transparently run this application kernel.
+
+At the time of writing, Dangerzone uses two containers to sanitize a document:
+* The first container reads a document from stdin, converts each page to pixels,
+  and writes them to stdout.
+* The second container reads the pixels from a mounted volume (the host has
+  taken care of this), and saves the final PDF to another mounted volume.
+
+Our threat model considers the computation and output of the first container
+as **untrusted**, and the computation and output of the second container as
+trusted. For this reason, and because we are about to remove the need for the
+second container, our integration plan will focus on the first container.
+
+## Design overview
+
+Our integration goals are to:
+* Make gVisor available on all of our supported platforms.
+* Avoid asking users to run any commands on their system to do so.
+
+Because gVisor does not support Windows and macOS systems out of the box,
+Dangerzone will be responsible for "shipping" gVisor to those users. It will do
+so using nested containers:
+* The **outer** container is the Docker/Podman container that Dangerzone uses
+  already. This container acts as our **portability layer**.
+  Its main purpose is to bundle all the necessary configuration files and
+  programs to run gVisor on all of our platforms.
+* The **inner** container is the gVisor container, created with `runsc`. This
+  container acts as our **isolation layer**. It is responsible for running the
+  Python code that rasterizes a document, in a way that will be fully isolated
+  from the host.
+
+### Building the container image
+
+This nested container approach directly affects the container image as well,
+which will also have two layers:
+* The **outer** container image will contain just Python 3 and `runsc`, the
+  latter downloaded from the official gVisor website. It will also contain an
+  entrypoint that will launch `runsc`. Finally, it will contain the **inner**
+  container image (see below) as a filesystem clone under
+  `/dangerzone-image/rootfs`.
+* The **inner** container image is practically the original Dangerzone image, as
+  we've always built it, which contains the necessary tooling to rasterize a
+  document.
+
+### Spawning the container
+
+Spawning the container now becomes a multi-stage process:
+
+The `Container` isolation provider spawns the container as before, with the
+following changes:
+
+* It adds two Linux capabilities to the **outer** container that didn't exist
+  before: `SETFCAP` and `SYS_CHROOT`. Those capabilities are necessary to run
+  `runsc` rootless, and are not inherited by the **inner** container.
+* It removes the `--userns keep-id` argument, which mapped the user outside the
+  container to the same UID (normally `1000`) within the container. This was
+  originally required when we were mounting host directories within the
+  container, but this no longer applies to the gVisor integration. By removing
+  this flag, the host user maps to the root user within the container (UID `0`).
+  - In distributions that offer Podman version 4 or greater, we use the
+    `--userns nomap` flag.
+    This flag greatly reduces the attack surface,
+    since the host user is not mapped within the container at all.
+* In distributions that offer Podman 3.x, we add a seccomp filter that allows
+  the `ptrace` syscall, which is required for running gVisor.
+
+Then, the following happens when Podman/Docker spawns the container:
+
+1. _(outer container)_ The entrypoint code reads from `sys.argv` the command
+   that Dangerzone passed to the `docker run` / `podman run` invocation.
+   Typically, this command is:
+
+   ```
+   /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
+   ```
+
+2. _(outer container)_ The entrypoint code then creates an OCI config for
+   `runsc` with the following properties:
+   * Use UID/GID 1000 in the **inner** container image.
+   * Run the command we detected in step 1.
+   * Drop all Linux capabilities.
+   * Limit the number of open files to 4096.
+   * Use the `/dangerzone-image/rootfs` directory as the root path for the
+     **inner** container.
+   * Mount a gVisor view of the `procfs` hierarchy under `/proc`, and then
+     mount `tmpfs` on the `/dev`, `/sys` and `/tmp` mount points. This way, no
+     host-specific info may leak to the **inner** container.
+     - Mount `tmpfs` on some more mountpoints where we want write access.
+3. _(outer container)_ If `RUNSC_DEBUG` has been specified, add some debug
+   arguments to `runsc` (applies to development environments only).
+4. _(outer container)_ If `RUNSC_FLAGS` has been specified, pass some
+   user-specified flags to `runsc` (applies to development environments only).
+5. _(outer container)_ Spawn `runsc` as a Python subprocess, and wait for it to
+   complete.
+6. _(inner container)_ Read the document from stdin and write pixels to stdout.
+   - In practice, nothing changes here, as far as the document conversion is
+     concerned. The Python process transparently uses the emulated Linux Kernel
+     API that gVisor provides.
+7. _(outer container)_ Exit the container with the same exit code as the inner
+   container.
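
The entrypoint steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `entrypoint.py`: the helper names (`tmpfs_mount`, `build_oci_config`, `build_runsc_argv`, `main`), the `BUNDLE_DIR` constant, the container ID `dangerzone`, the choice to write `config.json` into the bundle directory, and the exact `runsc` flags (`--rootless=true`, `--network=none`, `--debug=true`) are all assumptions made for the example. The config is also reduced to the properties listed in step 2.

```python
import json
import os
import shlex
import subprocess
import sys

# Assumption: the OCI bundle directory that holds rootfs/ and config.json.
BUNDLE_DIR = "/dangerzone-image"


def tmpfs_mount(destination, read_only=False):
    """Return an OCI mount entry for a locked-down tmpfs."""
    options = ["nosuid", "noexec", "nodev"]
    if read_only:
        options.append("ro")
    return {
        "destination": destination,
        "type": "tmpfs",
        "source": "tmpfs",
        "options": options,
    }


def build_oci_config(command):
    """Step 2: describe the inner container (reduced set of properties)."""
    capability_sets = ("bounding", "effective", "inheritable", "permitted")
    return {
        "ociVersion": "1.0.0",
        "process": {
            "user": {"uid": 1000, "gid": 1000},
            "args": list(command),
            "cwd": "/",
            # Empty sets mean that every Linux capability is dropped.
            "capabilities": {name: [] for name in capability_sets},
            "rlimits": [{"type": "RLIMIT_NOFILE", "hard": 4096, "soft": 4096}],
        },
        # The path is relative to the bundle directory.
        "root": {"path": "rootfs", "readonly": True},
        "mounts": [
            {"destination": "/proc", "type": "proc", "source": "proc"},
            tmpfs_mount("/dev"),
            tmpfs_mount("/sys", read_only=True),
            tmpfs_mount("/tmp"),
        ],
    }


def build_runsc_argv(environ):
    """Steps 3-4: assemble the runsc command line (flags are assumptions)."""
    argv = ["/usr/bin/runsc", "--rootless=true", "--network=none"]
    if environ.get("RUNSC_DEBUG"):
        argv.append("--debug=true")
    # Developer-supplied flags, split like a shell would split them.
    argv += shlex.split(environ.get("RUNSC_FLAGS", ""))
    return argv + ["run", "--bundle", BUNDLE_DIR, "dangerzone"]


def main():
    command = sys.argv[1:]  # Step 1: the command passed to `podman run`.
    with open(os.path.join(BUNDLE_DIR, "config.json"), "w") as f:
        json.dump(build_oci_config(command), f)
    # Step 5: stdin/stdout are inherited, so the document and the pixels
    # flow straight through the outer container.
    result = subprocess.run(build_runsc_argv(os.environ))
    sys.exit(result.returncode)  # Step 7: propagate the inner exit code.

# A real entrypoint script would end with a call to main().
```

Keeping the config and argument builders as pure functions makes them easy to check without having `runsc` installed; only `main()` touches the filesystem and spawns a process.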
+
+## Implementation details
+
+### Creating the outer container image
+
+In order to achieve the above, we add one more build stage in our Dockerfile
+(see [multi-stage builds](https://docs.docker.com/build/building/multi-stage/))
+that copies the result of the previous stages under `/dangerzone-image/rootfs`.
+Also, we install `runsc` and Python, and copy our entrypoint to that layer.
+
+Here's what it looks like:
+
+```dockerfile
+# NOTE: The following lines are appended to the end of our original Dockerfile.
+
+# Install some commands required by the entrypoint.
+FROM alpine:latest
+RUN apk --no-cache -U upgrade && \
+    apk --no-cache add \
+    python3 \
+    su-exec
+
+# Add the previous build stage (`dangerzone-image`) as a filesystem clone under
+# the /dangerzone-image/rootfs directory.
+RUN mkdir --mode=0755 -p /dangerzone-image/rootfs
+COPY --from=dangerzone-image / /dangerzone-image/rootfs
+
+# Copy the entrypoint first, since the next step makes it executable.
+COPY gvisor_wrapper/entrypoint.py /
+
+# Download and install gVisor, based on the official instructions.
+RUN GVISOR_URL="https://storage.googleapis.com/gvisor/releases/release/latest/$(uname -m)"; \
+    wget "${GVISOR_URL}/runsc" "${GVISOR_URL}/runsc.sha512" && \
+    sha512sum -c runsc.sha512 && \
+    rm -f runsc.sha512 && \
+    chmod 555 runsc /entrypoint.py && \
+    mv runsc /usr/bin/
+
+ENTRYPOINT ["/entrypoint.py"]
+```
+
+### OCI config
+
+The OCI config that gets produced is similar to this:
+
+```json
+{
+    "ociVersion": "1.0.0",
+    "process": {
+        "user": {
+            "uid": 1000,
+            "gid": 1000
+        },
+        "args": [
+            "/usr/bin/python3",
+            "-m",
+            "dangerzone.conversion.doc_to_pixels"
+        ],
+        "env": [
+            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
+            "PYTHONPATH=/opt/dangerzone",
+            "TERM=xterm"
+        ],
+        "cwd": "/",
+        "capabilities": {
+            "bounding": [],
+            "effective": [],
+            "inheritable": [],
+            "permitted": []
+        },
+        "rlimits": [
+            {
+                "type": "RLIMIT_NOFILE",
+                "hard": 4096,
+                "soft": 4096
+            }
+        ]
+    },
+    "root": {
+        "path": "rootfs",
+        "readonly": true
+    },
+    "hostname":
"dangerzone", + "mounts": [ + { + "destination": "/proc", + "type": "proc", + "source": "proc" + }, + { + "destination": "/dev", + "type": "tmpfs", + "source": "tmpfs", + "options": [ + "nosuid", + "noexec", + "nodev" + ] + }, + { + "destination": "/sys", + "type": "tmpfs", + "source": "tmpfs", + "options": [ + "nosuid", + "noexec", + "nodev", + "ro" + ] + }, + { + "destination": "/tmp", + "type": "tmpfs", + "source": "tmpfs", + "options": [ + "nosuid", + "noexec", + "nodev" + ] + }, + { + "destination": "/home/dangerzone", + "type": "tmpfs", + "source": "tmpfs", + "options": [ + "nosuid", + "noexec", + "nodev" + ] + }, + { + "destination": "/usr/lib/libreoffice/share/extensions/", + "type": "tmpfs", + "source": "tmpfs", + "options": [ + "nosuid", + "noexec", + "nodev" + ] + } + ], + "linux": { + "namespaces": [ + { + "type": "pid" + }, + { + "type": "network" + }, + { + "type": "ipc" + }, + { + "type": "uts" + }, + { + "type": "mount" + } + ] + } +} + +``` + +## Security considerations + +* gVisor does not have an official release on Alpine Linux. The developers + provide gVisor binaries from a GCS bucket. In order to verify the integrity of + these binaries, they also provide a SHA-512 hash of the files. + - If we choose to pin the hash, then we essentially pin gVisor, and we may + lose security updates. + +## Alternatives + +gVisor can be integrated with Podman/Docker, but this is the case only on Linux. +Because we want gVisor on Windows and macOS as well, we decided to not move +forward with this approach.