Development

Getting started

Run your own Software Heritage → deploy a local copy of the Software Heritage software stack in less than 5 minutes, or
Developer setup → get a working development setup that lets you hack on the Software Heritage software stack
Frequently Asked Questions

Contributing

Submitting code to SWH → learn how to submit your code to the Software Heritage codebase
Code review → rules and guidelines to review code in Software Heritage
Python style guide → how to format the Python code you write

Architecture

Software Architecture Overview → get a glimpse of the Software Heritage software architecture
Metadata workflow → learn how Software Heritage stores and handles metadata

Data Model and Specifications

SoftWare Heritage persistent IDentifiers (SWHIDs) → specifications of the SoftWare Hash persistent IDentifiers (SWHID)
Data model → documentation of the main Software Heritage archive data model
Journal Specification → documentation of the Kafka journal of the Software Heritage archive
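As a rough illustration of what a SWHID looks like, the sketch below computes the identifier of a file content. It assumes the core layout `swh:1:<object type>:<hex digest>` and the Git-compatible blob hashing used for content objects; `content_swhid` is a hypothetical helper written for this example, not a function of the Software Heritage code base. The linked specification remains the reference.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (object type ``cnt``).

    Hypothetical helper for illustration only.
    """
    # Content identifiers reuse Git's blob hashing: SHA-1 computed over a
    # "blob <length>\0" header followed by the raw bytes of the content.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    # Scheme "swh", schema version 1, object type "cnt", then the digest.
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The empty content hashes to the same value as Git's empty blob.
    print(content_swhid(b""))
```

Because the digest depends only on the bytes of the content, the same file yields the same identifier regardless of where or when it was archived.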

Tutorials

Roadmap

Current roadmap: Roadmap 2025
Previous roadmaps

System Administration

Network Infrastructure
Description → learn what a Software Heritage mirror is and how to set one up
Keycloak → learn how to use Keycloak, the authentication system used by Software Heritage’s web interface and public APIs

Components

Here is a brief overview of the most relevant software components in the Software Heritage stack, in alphabetical order. For a better introduction to the architecture, see the Software Architecture Overview, which presents each of them in a didactic order.

Each component name is linked to the development documentation of the corresponding Python module.

swh.alter: archive alteration facilities
swh.auth: low-level library used by modules that need Keycloak authentication
swh.coarnotify: a COAR Notify server implementation in Django
swh.core: low-level utilities and helpers used by almost all other modules in the stack
swh.counters: service providing efficient estimates of the number of objects in the SWH archive, using Redis's HyperLogLog
swh.datasets: datasets derived from periodic data dumps created by swh.export
swh.deposit: push-based deposit of software artifacts to the archive
swh.digestmap: efficient mapping of content hashes
swh.docs: developer documentation (used to generate this doc you are reading)
swh.export: public datasets and periodic data dumps of the archive released by Software Heritage
swh.fuse: virtual file system to browse the Software Heritage archive, based on FUSE
swh.graph: fast, compressed, in-memory representation of the archive, with tooling to generate and query it
swh.graphql: GraphQL API to request archive data, offering more precise and flexible queries than the REST API
swh.indexer: tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it
swh.journal: persistent logger of changes to the archive, with publish-subscribe support
swh.lister: collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.)
swh.loader-core: low-level loading utilities and helpers used by all other loaders
swh.loader-bzr: loader for Bazaar and Breezy repositories
swh.loader-git: loader for Git repositories
swh.loader-mercurial: loader for Mercurial repositories
swh.loader-metadata: pseudo-loader, which fetches extrinsic metadata from forges instead of software artifacts
swh.loader-svn: loader for Subversion repositories
swh.loader-cvs: loader for CVS repositories
swh.model: implementation of the Data model to archive source code artifacts
swh.objstorage: content-addressable object storage
swh.objstorage.replayer: object storage replication tool
swh.shard: low-level management of read-only content-addressable object storage indexed with a perfect hash table
swh.provenance: query service for questions like: “where does this given object come from?” or “what is the oldest revision in which this object has been found?”
swh.scanner: source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage
swh.scheduler: task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package)
swh.scrubber: tooling to check the integrity of various data stores (swh.journal, swh.objstorage, swh.storage) and fix corrupt objects they contain
swh.search: search engine for the archive
swh.storage: abstraction layer over the archive, providing access to all stored source code artifacts as well as their metadata
swh.vault: implementation of the vault service, which lets users retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)
swh.web: web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use
swh.web.client: Python client for swh.web
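
Several of the components above (swh.objstorage, swh.shard, swh.digestmap) revolve around content-addressable storage, where an object is looked up by a hash of its bytes rather than by a name. The following is a minimal in-memory sketch of that idea. `TinyObjStorage` and its methods are hypothetical names chosen for this example and do not reflect the swh.objstorage API; the choice of SHA-256 is likewise an assumption made for illustration.

```python
import hashlib


class TinyObjStorage:
    """Hypothetical in-memory content-addressable store (illustration only)."""

    def __init__(self) -> None:
        # Maps hex digest -> raw bytes of the stored object.
        self._objects: dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        """Store ``data`` and return the identifier derived from its bytes."""
        obj_id = hashlib.sha256(data).hexdigest()
        # Adding the same bytes twice is harmless: the key is identical.
        self._objects[obj_id] = data
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return the bytes stored under ``obj_id`` (KeyError if absent)."""
        return self._objects[obj_id]

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = TinyObjStorage()
    first = store.add(b"int main(void) { return 0; }\n")
    second = store.add(b"int main(void) { return 0; }\n")
    # Identical content maps to a single stored object.
    print(first == second, len(store))
```

Deriving the key from the content gives deduplication for free and lets a reader verify an object by rehashing it, which is the kind of check an integrity tool can run over a store.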

Dependencies

The dependency relationships among the various modules are depicted below.

Figure (../_images/py-deps-swh.svg): Dependencies among top-level Python modules.

Archive

Archive ChangeLog: notable changes to the archive over time