Development#
Getting started#
- Run your own Software Heritage → deploy a local copy of the Software Heritage software stack in less than 5 minutes
- Developer setup → get a working development setup that lets you hack on the Software Heritage software stack
Contributing#
- Submitting code to SWH → learn how to submit your code to the Software Heritage codebase 
- Code review → rules and guidelines for reviewing code in Software Heritage
- Python style guide → how to format the Python code you write 
Architecture#
- Software Architecture Overview → get a glimpse of the Software Heritage software architecture 
- Metadata workflow → learn how Software Heritage stores and handles metadata 
Data Model and Specifications#
- SoftWare Heritage persistent IDentifiers (SWHIDs) → specifications of the SoftWare Hash persistent IDentifiers (SWHID)
- Data model → documentation of the main Software Heritage archive data model
- Journal Specification → documentation of the Kafka journal of the Software Heritage archive
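As a rough illustration of what a SWHID looks like, the sketch below derives the identifier of a content object (a file's raw bytes) from its Git-style blob hash. The function name `content_swhid` is hypothetical and exists only for this example; it is not part of any Software Heritage module. The SWHID specification linked above remains the authoritative reference, and it also covers the other object types, which this sketch does not.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a SWHID-style identifier for raw file content.

    Illustrative helper only (hypothetical name, not a Software Heritage API).
    """
    # Content objects are hashed like Git blobs:
    # the header "blob <length>\0" followed by the raw bytes.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    # "swh" is the scheme, "1" the scheme version, "cnt" the content object type.
    return f"swh:1:cnt:{digest}"


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is computed from the bytes alone, two files with identical content get the same SWHID regardless of their name or origin.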
Tutorials#
Roadmap#
- Current roadmap: Roadmap 2025 
- Previous roadmaps 
System Administration#
- Description → learn what a Software Heritage mirror is and how to set one up
- Keycloak → learn how to use Keycloak, the authentication system used by Software Heritage’s web interface and public APIs 
Components#
Here is a brief overview of the most relevant software components in the Software Heritage stack, in alphabetical order. For a better introduction to the architecture, see the Software Architecture Overview, which presents each of them in a didactic order.
Each component name is linked to the development documentation of the corresponding Python module.
- swh.alter
- archive alteration facilities 
- swh.auth
- low-level library used by modules needing Keycloak authentication
- swh.coarnotify
- a COAR Notify server implementation in Django 
- swh.core
- low-level utilities and helpers used by almost all other modules in the stack 
- swh.counters
- service providing efficient estimates of the number of objects in the SWH archive, using Redis’s HyperLogLog
- swh.datasets
- datasets derived from periodic data dumps created by swh.export 
- swh.deposit
- push-based deposit of software artifacts to the archive 
- swh.digestmap
- efficient mapping of content hashes 
- swh.docs
- developer documentation (used to generate this doc you are reading) 
- swh.export
- public datasets and periodic data dumps of the archive released by Software Heritage 
- swh.fuse
- virtual file system to browse the Software Heritage archive, based on FUSE
- swh.graph
- fast, compressed, in-memory representation of the archive, with tooling to generate and query it
- swh.graphql
- GraphQL API to request archive data, offering more precise and flexible queries than the REST API
- swh.indexer
- tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it 
- swh.journal
- persistent logger of changes to the archive, with publish-subscribe support 
- swh.lister
- collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) 
- swh.loader-core
- low-level loading utilities and helpers used by all other loaders 
- swh.loader-bzr
- loader for Bazaar repositories
- swh.loader-git
- loader for Git repositories 
- swh.loader-mercurial
- loader for Mercurial repositories 
- swh.loader-metadata
- pseudo-loader, which fetches extrinsic metadata from forges instead of software artifacts 
- swh.loader-svn
- loader for Subversion repositories 
- swh.loader-cvs
- loader for CVS repositories 
- swh.model
- implementation of the Data model to archive source code artifacts 
- swh.objstorage
- content-addressable object storage 
- swh.objstorage.replayer
- object storage replication tool
- swh.shard
- low-level management for read-only content-addressable object storage indexed with a perfect hash table
- swh.provenance
- query service for questions like: “where does this given object come from?” or “what is the oldest revision in which this object has been found?”
- swh.scanner
- source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage 
- swh.scheduler
- task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) 
- swh.scrubber
- tooling to check the integrity of various data stores (swh.journal, swh.objstorage, swh.storage) and fix the corrupt objects they contain
- swh.search
- search engine for the archive 
- swh.storage
- abstraction layer over the archive, providing access to all stored source code artifacts as well as their metadata
- swh.vault
- implementation of the vault service, which lets users retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)
- swh.web
- web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use
- swh.web.client
- Python client for swh.web 
Dependencies#
The dependency relationships among the various modules are depicted below.
Dependencies among top-level Python modules.#
Archive#
- Archive ChangeLog: notable changes to the archive over time