The `exit()` [1] function is not necessarily present in every Python
environment, as it's added by the `site` module. Also, this function is
"[...] useful for the interactive interpreter shell and should not be
used in programs"
For this reason, we replace all such occurrences with `sys.exit()` [2],
which is the canonical function to exit Python programs.
[1]: https://docs.python.org/3/library/constants.html#exit
[2]: https://docs.python.org/3/library/sys.html#sys.exit
On Windows, if we somehow attempt to archive the same document twice
(e.g, because it got archived once, and then we copy it back), we will
get an error, because Windows does not overwrite the target path, if it
already exists.
Fix this issue by always removing the previously archived version, when
performing the next archival action, and update our tests.
Quote from `pytest-mock` docs [1]:
> The purpose of this plugin is to make the use of context managers and
> function decorators for mocking unnecessary, so it will emit a warning
> when used as such.
Thus using it as a context manager currently produces a warning during
test runs in CI which is extra noise that could make new (possibly more
important) warnings harder to spot.
[1]: https://pytest-mock.readthedocs.io/en/latest/usage.html#usage-as-context-manager
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).
From `click` docs[2]:
> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.
To fix that, we use `utils.replace_control_chars()` before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.
Fixes#768
The `util.replace_control_chars()` function was overly strict, and
would replace every non-ASCII character with "_". This included both
control characters, as well as normal characters in a non-English
alphabet.
Relax these restrictions by checking each character and deciding if it's
a Unicode control character, using the `unicodedata` Python package.
With this change, emojis and non-English letters are now allowed.
Add a pytest fixture that crafts a filename with Unicode characters that
are not considered common for this use. By default, this fixture uses
an invalid Unicode character as well, but we strip it in case of macOS
(APFS) since filenames must be UTF-8 encoded.
[1]: https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations
Add termination tests for the Dummy provider, so that we can have
cross-platform coverage in our Windows/macOS CI runners, which can't use
the Container isolation providers.
Add termination-related tests for Qubes. To achieve this, we need
to make a change to the Qubes isolation provider. More specifically,
we need to make the isolation provider yield control to the caller only
when the disposable qube is up and running.
Qubes does not provide us a solid guarantee to do so, but we've found a
hacky workaround, whereby we wait until the `qrexec-client-vm` process
opens a `/dev/xen` character device. This should happen, in theory, once
the disposable qube is ready, and has sent a `MSG_SERVICE_CONNECT` RPC
message to the caller.
Add termination-related tests for containers. To achieve this, we need
to make a change to the container isolation provider. More specifically,
we need to make the isolation provider yield control to the caller only
when the container is up and running. Failure to do so may lead to
lingering processes.
Add some test cases in the isolation provider tests, that check how it
behaves when a process completes successfully, lingers, or cannot
terminate.
These tests cannot run yet, since they must be imported by a concrete
isolation provider test class. In subsequent commits, we will start
enabling them.
Pass the Document instance that will be converted to the
`IsolationProvider.start_doc_to_pixels_proc()` method. Concrete classes
can then associate this name with the started process, so that they can
later on kill it.
Previously settings was implicitly tested on tests/gui/test_updater.py.
However this was concerned with updater-related tests only, which
incidentally covered almost all of settings.py. However, a few tests
were missing. This commit increases the test coverage, but also tests
additional test conditions.
The goal is to help us increase the test coverage of the previous
scenario, which tested for the persistence of user data (settings). This
way we can drop the requirement to test this on linux hosts, which is
slightly harder (more cumbersome) to do.
When we get an early EOF from the converter process, we should
immediately get the exit code of that process, to find out the actual
underlying error. Currently, the exception we raise masks the underlying
error.
Raise a ConverterProcException, that in turns makes our error handling
code read the exit code of the spawned process, and converts it to a
helpful error message.
Fixes#714
Remove timeouts due to several reasons:
1. Lost purpose: after implementing the containers page streaming the
only subprocess we have left is LibreOffice. So don't have such a
big risk of commands hanging (the original reason for timeouts).
2. Little benefit: predicting execution time is generically unsolvable
computer science problem. Ultimately we were guessing an arbitrary
time based on the number of pages and the document size. As a guess
we made it pretty lax (30s per page or MB). A document hanging for
this long will probably lead to user frustration in any case and the
user may be compelled to abort the conversion.
3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)
Fixes#687
This reverts commit fea193e935.
This is part of the purge of timeout-related code since we no longer
need it [1]. Non-blocking reads were introduced in the reverted commit
in order to be able to cut a stream mid-way due to a timeout. This is
no longer needed now that we're getting rid of timeouts.
[1]: https://github.com/freedomofpress/dangerzone/issues/687
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.
The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.
Fixes#456
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.
This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.
The main conversion method (_convert) now in the superclass simply calls
two methods:
- doc_to_pixels()
- pixels_to_pdf()
Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.
Known regressions:
- progress reports stopped working on containers
Fixes#443
PyMuPDF replaced the need for almost all dependencies, which this commit
now removes.
We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
However, now that tesseract-ocr is not available directly as a binary
tool, the `test_ocr.py` needed to be changed.
Fixes#658
[1]: https://github.com/freedomofpress/dangerzone/issues/658#issuecomment-1861033149
This PR reverts the patch that disables HWP / HWPX conversion on MacOS M1.
It does not fix conversion on Qubes OS (#494).
Previously, HWP / HWPX conversion didn't work on MacOS (Apple silicon CPU) (#498)
because libreoffice wasn't built with Java support on Alpine Linux for ARM (aarch64).
Gratefully, the Alpine team has enabled Java support on the aarch64
system [1], so we can enable it again for ARM architectures.
And this patch is included in Alpine 3.19
This commit was included in #541 and reverted on #562 due to a stability issue.
Fixes#498
[1]: 74d443f479
Fix a bug in the "Change Selection" action, whereby changing your
selection and picking files from another directory results in:
"Dangerzone does not support adding documents from multiple
locations. The newly added documents were ignored."
To fix this, change the output directory when we change selection as
well.
In Qubes OS it's often the case that the user doesn't have enough
RAM to start the conversion. In this case it raises BrokenPipeException
and exits with code 126.
It didn't seem possible to distinguish this kind of failure to one
where the user has misconfigured qrexec policies.
NOTE: this approach is not ideal UX-wise. After the first doc failing
the next one will also try and fail. Upon first failure we should
inform the user that they need to close some programs or qubes.
Because the server also checks the MAX_PAGES limit, the test in base
would hide the fact that the client is not enforcing the limit. This
ensures that's not the case.
When the pages in containers are streamed (#443), then this test should
be in base.py.
Theoretically the max pages would be 65536 (2byte unsigned int.
However this limit is much higher than practical documents have
and larger ones can lead to unforseen problems, for example RAM
limitations.
We thus opted to use a lower limit of 10K. The limit must be
detected client-side, given that the server is distrusted. However
we also check it in the server, just as a fail-early mechanism.
Isolation provider tests done in tests/test_base.py and had
pytest.mark.parameterize() for each isolation provider. This logic
would not work well when we had test that diverge. We could have marked
each one as compatible with one provider or another, but in the end it
turned out to be better to have the common ones in a base class and
the divergent ones in each.
NOTE: this has a strange side-effect: inherited test classes need to
have imports for all of the fixtures even if they are not explictly used
This PR reverts the patch that disables HWP / HWPX conversion on MacOS
M1. It does not fix conversion on Qubes OS (#494)
Previously, HWP / HWPX conversion didn't work on MacOS M1 systems (#498)
because libreoffice wasn't built with Java support on Alpine Linux for
ARM (aarch64).
Gratefully, the Alpine team has enabled Java support on the aarch64
system [1], so we can enable it again for ARM architectures.
Fixes#498
[1]: 74d443f479
The "check for updates" button wasn't showing up immediately as checked
as soon as the user is prompted for checking updates. This fixes that.
Fixes#513
Reporting script now parses JunitXML instead of a series of
".container_log" files. The script in in changed submodule.
Additionally it makes failed tests actually fail so that this is
recorded in the JunitXML report.
Adds a large pool of document that can and should be used prior to a
release to understand effects of the new release over a real-world
scenario.
Documents are stored in an external git LFS repo under
`tests/test_docs_large` and currently it's about 11K documents gathered
from multiple PDF readers and office suite's test sets.
Documentation on how to run the tests is under
`docs/developer/TESTING.md`
The HWP / HWPX conversion feature does not work on the following
platforms:
* MacOS with Apple Silicon CPU
* Native Qubes OS
For this reason, we need to:
1. Disable it on the GUI side, by not allowing the user to select these
files.
2. Throw an error on the isolation provider side, in case the user
directly attempts to convert the file (either through CLI or via
"Open With").
Refs #494
Refs #498
Add extra files and base64 encode externally contributed docs. This
prevents the accidental opening of such documents, since they couldn't
be rebuit by the Dangerzone developers to ensure their safety.
Improve the `parse_progress()` method of the container isolation
provider in the following ways:
1. Make sure that the fields of the progress report have the expected
type.
2. In case of a JSON parsing error, sanitize the invalid string so that
it doesn't contain escape sequences, or the user considers it as
trusted.
Update the common `print_progress()` method in the base
`IsolationProvider` class, with two extra features:
1. Always sanitize the provided text argument.
2. Mark the sanitized text argument as untrusted.
This is default behavior from now on, since this function is commonly
used to parse progress reports from the conversion sandbox.
Fully test the update check logic, by introducing several Qt tests.
Also, improve the `UpdaterThread.get_letest_info()` method, that gets
the latest version and changelog from GitHub, with several checks.
These checks are also tested in our newly added tests.
We want to differentiate between the user clicking on "Cancel" and
clicking on "X", since in the second case, we want to remind them again
on the next run.
Now that sample_doc was renamed to sample_pdf it could cause some
confusion the fact that that the TestBase class had an attribute called
sample_doc which referenced the sample PDF.
By removing this attribute and passing the fixture instead we are
following a more pytest-native approach of passing arguments explicitly.
Add a Pytest fixture that returns an UpdaterThread instance which has
its own unique settings directory. Note that the UpdaterThread instance
needs to be slightly nerfed, so that it doesn't rely on Qt functionality
or any isolation providers.
In PR #378 ("container: Allow converting more document formats"), we
added support for the following MIME types:
* application/zip
* application/octet-stream
* application/x-ole-storage
* application/vnd.oasis.opendocument.spreadsheet-template
* application/vnd.oasis.opendocument.text-template
However, we forgot to add some tests for these MIME types in the repo.
In this commit, we add a file for each of these MIME types, to make sure
we have no regressions in the future.
Due to a bump in our Python dependencies, we now install Mypy 1.1.1
instead of 0.982. This change triggered the following errors:
* Incompatible default for argument <a> (default has type
None, argument has type <t>):
Mypy further explains here that PEP 484 prohibits implicit Optional,
so we need to make these types explicit Optional.
* Unused "type: ignore" comment, use narrower [method-assign] instead of
[assignment]:
Mypy has specialized some of its lints, meaning that we should switch
to the newer variants.
Also, it detected several other small inconsistencies. We fix all of
these errors in this commit.
Do not leave stale temporary directories when conversion fails
unexpectedly. Instead, wrap the conversion operation in a context
manager that wipes the temporary dir afterwards.
Fixes#317
Run each CLI command in a separate config/cache dir, to avoid leaks
between tests. Moreover, this way we are able to check the contents of
the config/cache dirs for a single CLI run.
Adds tests for macOS and Windows with the dummy converter. Tests won't
actually perform the conversion. But it should be enough for us to test
the remainder of the codebase.
Fixes#229
When enabled, the conversion part does nothing but print some simulated
output. This can be useful for testing non-conversion code (e.g. GUI).
Activated with the hidden flag --unsafe-dummy-conversion.
If a user has provided an output filename for a document, then we should
no longer accept suffixes. The reason is that we can't do something
meaningful with it, as we can't alter the provided output filename.
The proper behavior is to reject this action with an exception. Note
that this acts more of a safeguard, since (currently) there is no path
where a user may add a suffix to a document that already has an output
filename.
These will be needed in for the GUI's settings. This also adds test
cases for these documents. The methods are the following:
- set_output_dir()
For changing the output directory of the safe file
- suffix setter and getter - for changing the suffix of the file
Checking if files were writeable created files in the process. In the
case where someone adds a list of N files to dangerzone but exits before
converting, they would be left with N 0-byte files for the -safe
version. Now they don't.
Fixes#214
All filename-related exceptions were of class DocumentFilenameException.
This made it difficult to disambiguate them. Specializing them makes it
it easier for tests to detect which exception in particular we want to
verify.
The document's state update is better update in the convert() function.
This is because this function is always called for the conversion
progress regardless of the frontend.
Wildcard arguments like `*` can lead to security vulnerabilities
if files are maliciously named as would-be parameters. In the following
scenario if a file in the current directory was named '--help', running
the following command would show the help.
$ dangerzone-cli *
By checking if parameters also happen to be files, we mitigate this
risk and have a chance to warn the user.
Let the Document class suggest the default filename for the safe PDF,
based on the provided input filename, appended with the extension
`-safe.pdf`.
Previously, this logic was copy-pasted throughout the code, which made
it difficult to maintain.
Run Mypy static checks against our tests. This brings them inline with
the rest of the codebase, and we have an extra level of certainty that
the tests (and unit tests in particular) will not significantly diverge
from the code they are testing.
Concatenate directories and filenames in a platform-independent way, by
using pathlib.Path. This fixes issues in the tests where the "/" path
separator made the tests fail on Windows.
Add two tests that check if Dangerzone properly handles input and output
filenames with spaces in them. Previously this was not straight-forward
because we didn't tokenize arguments, which lead to Click splitting
filenames with spaces in two.
Pass tokenized arguments (i.e., arguments as lists of strings) to CLI
invocations, else Click will attempt to tokenize them internally. The
problem with leaving tokenization to Click is that it uses
`shlex.split()`, which is Unix-oriented, and may miss some cases in
Windows.
Wrap Click results (`Result`) with a new class (`CLIResult`), which
includes:
1. Assertion statements.
2. Logic for formatting and printing a Click result.
3. Invocation arguments, which are missing from the original `Result`
class.
- display_banner() was only displayed in CLI mode so it makes sense
for it to be in the CLI.
- get_version(), was mvoed to util since it is a static function
that is needed in multiple parts of the application.
static methods that are used application-wide should belong to
the utilities python file.
inspired by @gmarmstrong's PR #166 on refactoring global_common
methods to be static and have a dzutil.py
originally PDF files were included for these edge-cases but in
reality all we want to test is the filename itself. So it reduces
repo size if we have them generated dynamically.