Advice for parallelization#

SwhFS is designed to provide access to the complete archive: users may use it to scan large portions of the archive, if not all of it. This requires parallelization. SwhFS has been tested with up to 10000 concurrent processes, which was not trivial because of the FUSE architecture, because SwhFS is implemented in Python, and because of system constraints sometimes enforced by HPC infrastructures. This section collects advice and tips for large deployments.

Note

If you do not need to read files’ contents at all, we advise using the compressed graph directly instead.

Use local and fast data sources#

The most important thing for large scans is to use local and fast data sources as much as possible. At the time of writing, this means:

a compressed graph,
a digestmap,
a local objstorage, or at worst its HTTP implementation pointing to S3 (see our configuration examples).

This may change over time: contact us on our development channels to validate your architecture before launching major operations.

One mountpoint per user process with SwhFsTmpMount#

Code scanners usually crawl repositories or source trees one by one. This can be parallelized by launching as many instances as there are CPUs available, each instance picking from a list of directory SWHIDs to be analyzed. However, the instances should not all access the archive/ folder of the same mountpoint: due to the FUSE architecture, and to SwhFS being implemented in Python, SwhFS might become a bottleneck even when running just a dozen instances. We instead advise creating one mountpoint per running instance of your scanner.

This can be simplified by using the swh.fuse.fuse.SwhFsTmpMount context manager from your batch manager, which is especially useful to dispatch work from a Python script. In that case, be careful to also disable on-disk caching entirely, using the in-memory and bypass settings.
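The pattern above can be sketched without a real archive. In the minimal sketch below, each scanner instance opens its own temporary mountpoint and only reads below it. `tmp_mount` is a hypothetical stand-in built on `tempfile.TemporaryDirectory`: in a real deployment it would be replaced by `swh.fuse.fuse.SwhFsTmpMount`, whose constructor arguments and yielded value are not described in this section and should be checked against the SwhFS reference. The `archive/<SWHID>` path layout and the function name `scan_instance` are also assumptions made for illustration.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tmp_mount():
    """Hypothetical stand-in for swh.fuse.fuse.SwhFsTmpMount.

    It only creates a temporary directory containing an empty archive/
    folder, so that this sketch runs without SwhFS or FUSE installed.
    """
    with tempfile.TemporaryDirectory(prefix="swhfs-sketch-") as tmp:
        mountpoint = Path(tmp)
        (mountpoint / "archive").mkdir()
        yield mountpoint


def scan_instance(swhids):
    """One scanner instance: a private mountpoint, reused for all its SWHIDs."""
    visited = []
    with tmp_mount() as mountpoint:
        for swhid in swhids:
            # Assumed layout: objects are reachable as archive/<SWHID>.
            # A real scanner would walk this path; here we only record it.
            target = mountpoint / "archive" / swhid
            visited.append((swhid, target))
    return visited
```

The point of the design is that no two instances ever share a mountpoint: the mountpoint is created when an instance starts and released when it has finished its own list of SWHIDs.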

How to shortcut Python startup times#

When creating many mountpoints, avoid calling swh fs mount repeatedly: each call starts a new Python process, which has to import the many libraries SwhFS needs. This can take a few seconds per process. Instead, you can create a main Python process that forks subprocesses using a ProcessPoolExecutor.

Warning

Importing swh.fuse does not trigger all necessary Python imports: the remaining ones are imported only at mounting time, depending on the configuration (this avoids importing all supported data sources’ dependencies). Therefore, when using a ProcessPoolExecutor, take care to mount once before creating the pool.

SwhFS sources include an example batch manager relying on ProcessPoolExecutor and SwhFsTmpMount, whose worker processes avoid Python startups entirely: examples/parallel_processing.py.
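As a rough illustration of the ordering described above (warm up first, then fork), here is a minimal sketch that does not depend on SwhFS. It is not the content of examples/parallel_processing.py. The functions `warm_up`, `worker` and `run_batches` are hypothetical names; `warm_up` merely imports a standard-library module where a real batch manager would mount SwhFS once with its final configuration. The sketch assumes a platform where the `fork` start method is available, such as Linux.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor


def warm_up():
    """Placeholder for "mount once before creating the pool".

    A real batch manager would mount SwhFS once here, so that the lazily
    imported modules are loaded in the parent process before it forks.
    """
    import json  # stands in for the heavy, lazily imported dependencies

    return json.__name__


def worker(swhids):
    """Runs in a forked child, which inherits the parent's loaded modules."""
    # A real worker would open its own mountpoint here and scan each SWHID.
    return os.getpid(), len(swhids)


def run_batches(batches, max_workers=2):
    warm_up()  # must happen before the pool forks its workers
    # Request the fork start method explicitly: it is not the default
    # on every platform or Python version.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
        return list(pool.map(worker, batches))
```

Calling `run_batches([["a", "b"], ["c"]])` returns one `(pid, count)` pair per batch, in input order. With a start method other than `fork`, children would start from a fresh interpreter and the warm-up in the parent would bring no benefit.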

Work around missing permissions#

Some Linux environments may not grant you admin access, nor allow you to mount FUSE filesystems, even if you can install libfuse and fusermount3. In that case, we advise gaining super-powers by running your batch manager program in a Linux namespace, with the appropriate options of unshare (from the util-linux package):

unshare --pid --kill-child --user --map-root-user --mount ./parallel_processor.py
