Draft: Layered store for archive nodes
Context
This MR introduces a layered store for archive nodes. Thanks to irmin.3.7, archive nodes are now implemented with two layers:
- an upper layer with all the live blocks, which consists of a few cycles only (5 complete cycles on mainnet);
- a lower layer with all the blocks below the last allowed fork level (i.e., all the blocks that are no longer subject to reorgs).

Before that, archive nodes were made of a single, huge storage file that was slow to access because of poor data locality. With this change, the upper layer is expected to be much faster (as fast as for Full and Rolling nodes), and the lower layer is expected to be slower than the upper one, but still faster than the previous implementation.
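For illustration, the resulting on-disk layout looks roughly as follows. The `volumes/volume.N` names come from the `du` listings below; the rest is a hedged sketch, as the exact upper-layer file set depends on the irmin-pack version:

```
<data-dir>/context/          # upper layer: the few live cycles
<data-dir>/context/volumes/  # lower layer, split into volumes
├── volume.0                 # oldest cycles
├── volume.1
└── ...                      # one volume per fixed number of cycles
```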
Closes: #4754
Benchmarks
Perfs
To get the best from the lower store, some constants need calibrating. The lower store is made of volumes, each storing a finite number of commits. The volume size must be chosen to find an optimum: a big volume is slower to access than a small one, while every volume introduces a disk size overhead.
To do so, we ran various experiments comparing the performance of master, 10 cycles per volume, and 50 cycles per volume on a limanet node with 500_000 blocks. The benchmark consists in deterministically requesting the context data (using the /context/raw/bytes RPC) of random blocks of the upper or lower layer. LH corresponds to blocks accessed from a low level to a high one; HL is the opposite. This aims to exercise the caching mechanism and the redundancy of the data read. We then compare the mean RPC time.
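To make the measurement concrete, here is a minimal sketch of such a run. The RPC endpoint is the node's default localhost:8732 and `${KEY}` stands for whichever context path the benchmark actually reads; both are assumptions, not taken from the benchmark code:

```
# Time 100 deterministically-chosen "random" levels via the raw context RPC
# and print the mean response time. --random-source makes shuf reproducible.
for level in $(shuf -i 0-500000 -n 100 --random-source=/dev/zero); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    "http://localhost:8732/chains/main/blocks/${level}/context/raw/bytes/${KEY}"
done | awk '{ sum += $1 } END { print "mean rpc time:", sum / NR, "s" }'
```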
Limanet 500k, SSD vs HDD
On SSDs, the upper store performs slightly better, but the difference is barely noticeable. On HDDs, however, the upper store performs much better. Surprisingly, the non-layered store is faster than the lower store.
**TO BE UPDATED**
To be representative, the benchmarks must be redone with the latest mainnet state, as the cycle size and its content have grown a lot.
Disk usage
As more volumes come with more prefix files, a small disk overhead is expected.
For the 500k limanet experiment, the store is 776MB bigger with 10 cycles per volume and 312MB bigger with 50, compared to the 77.7GB reference (roughly 1% and 0.4% overhead).
For a 3M-block mainnet experiment, the overall context size is 17GB bigger with 50 cycles per volume than with 100 cycles per volume. Note that the size of the volumes is not the only factor: the pending suffix waiting to be appended to the next volume is bigger with big volumes.
```
nl@machine2:/hdd2$ du -h mainnet_irmin.3.7_100/context/volumes/ | sort -k 1 -h
11G mainnet_irmin.3.7_100/context/volumes/volume.0
16G mainnet_irmin.3.7_100/context/volumes/volume.1
20G mainnet_irmin.3.7_100/context/volumes/volume.2
65G mainnet_irmin.3.7_100/context/volumes/volume.3
344G mainnet_irmin.3.7_100/context/volumes/volume.4
454G mainnet_irmin.3.7_100/context/volumes/
nl@machine2:/hdd2$ du -h mainnet_irmin.3.7_50/context/volumes/ | sort -k 1 -h
4,1G mainnet_irmin.3.7_50/context/volumes/volume.0
6,2G mainnet_irmin.3.7_50/context/volumes/volume.1
7,5G mainnet_irmin.3.7_50/context/volumes/volume.2
8,4G mainnet_irmin.3.7_50/context/volumes/volume.3
9,5G mainnet_irmin.3.7_50/context/volumes/volume.4
11G mainnet_irmin.3.7_50/context/volumes/volume.5
15G mainnet_irmin.3.7_50/context/volumes/volume.6
25G mainnet_irmin.3.7_50/context/volumes/volume.11
51G mainnet_irmin.3.7_50/context/volumes/volume.7
133G mainnet_irmin.3.7_50/context/volumes/volume.8
167G mainnet_irmin.3.7_50/context/volumes/volume.9
175G mainnet_irmin.3.7_50/context/volumes/volume.10
610G mainnet_irmin.3.7_50/context/volumes/
```
Upgrade
TODO
As this MR comes with a new irmin version, it also comes with a new storage format that requires an upgrade.
To fully benefit from this, it is necessary to bootstrap a node from scratch. Indeed, the automatic upgrade is there for backward compatibility and simply moves the former data into a single, huge volume. Accessing the data of this volume will be as slow as before; the benefits will only come with the new blocks/cycles/volumes. Bootstrapping from scratch shapes the storage in an optimized way.
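For reference, the in-place upgrade goes through the node's existing storage upgrade entry point; a sketch, assuming the command is unchanged by this MR:

```
# Upgrade an existing archive node in place: backward-compatible, but the
# former data ends up as a single, huge volume.
octez-node upgrade storage --data-dir "$DATA_DIR"
```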
Manually testing the MR
Manual tests only:
- run your own node and bootstrap it from scratch, in archive mode (see the sketch after this list);
- try the upgrade procedure on an existing archive node.
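A minimal sketch of the first scenario, assuming mainnet defaults (adjust the data directory to taste):

```
# Bootstrap a fresh archive node so the layered store is shaped optimally
# from the start.
octez-node config init --data-dir "$DATA_DIR" --history-mode archive
octez-node identity generate --data-dir "$DATA_DIR"
octez-node run --data-dir "$DATA_DIR"
```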
Look at the green CI
Checklist
- Document the interface of any function added or modified (see the coding guidelines)
- Document any change to the user interface, including configuration parameters (see node configuration)
- Provide automatic testing (see the testing guide).
- For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
- Select suitable reviewers using the Reviewers field below.
- Select as Assignee the next person who should take action on that MR.

