
DAL: Tests for skip list migration

Global note: in the manual tests below (those whose title contains 'test commitments history with migration'), migration_level was set to 20 instead of 10.

Legacy unattested case in Content_v2 encoding

Tests like --file tezt/tests/dal.ml --title 'Quebec->Alpha: test commitments history with migration (storage_backend = legacy)' already cover this case.

To check this manually, comment out the unattested case and run the test. It should fail with an error like:

[07:58:09.693] [node1] Oct 29 08:58:09.693:   Storage error:
[07:58:09.693] [node1] Oct 29 08:58:09.693:     Failed to parse the data at 'dal/slot_headers_history'.
[07:58:09.694] [client2] Error:
[07:58:09.694] [client2]   Storage error:
[07:58:09.694] [client2]     Failed to parse the data at 'dal/slot_headers_history'.

Legacy attested case in Content_v2 encoding

Doing the same as for the unattested case above doesn't trigger any issue.

With the extension of the test (see MR (!TODO)), removing the legacy attested encoding case in the Content_v2 module of dal_slot_repr produces the following error, as expected:

[16:22:42.363] [RPC] RPC response: 500 Internal Server Error
[16:22:42.363] [RPC] [
[16:22:42.363] [RPC]   {
[16:22:42.363] [RPC]     "kind": "temporary",
[16:22:42.363] [RPC]     "id": "failure",
[16:22:42.363] [RPC]     "msg": "Data_encoding.Read_error(Unexpected tag 1)"
[16:22:42.363] [RPC]   }
[16:22:42.363] [RPC] ]
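
For reference, here is a self-contained sketch (illustrative type and tags, not the actual Content_v2 definitions of dal_slot_repr) of why dropping a legacy case from a union encoding breaks the decoding of previously stored data, in both this experiment and the unattested one above:

  open Data_encoding

  type content = Legacy_attested of int32 | Attested of int32

  (* Union that still knows about the legacy case (tag 1). *)
  let v_with_legacy =
    union
      [
        case (Tag 1) ~title:"legacy_attested" int32
          (function Legacy_attested l -> Some l | _ -> None)
          (fun l -> Legacy_attested l);
        case (Tag 2) ~title:"attested" int32
          (function Attested l -> Some l | _ -> None)
          (fun l -> Attested l);
      ]

  (* Same union with the legacy case removed, as in the experiment above. *)
  let v_without_legacy =
    union
      [
        case (Tag 2) ~title:"attested" int32
          (function Attested l -> Some l | _ -> None)
          (fun l -> Attested l);
      ]

  let () =
    (* Data serialized with the legacy tag... *)
    let bytes = Binary.to_bytes_exn v_with_legacy (Legacy_attested 12l) in
    (* ...cannot be decoded anymore: this fails with a read error on the
       unexpected tag 1, the same kind of error as reported by the RPC above. *)
    ignore (Binary.of_bytes_exn v_without_legacy bytes)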

Legacy vs new hash function when adding new cells to the skip list

Tests like --file tezt/tests/dal.ml --title 'Quebec->Alpha: test commitments history with migration (storage_backend = legacy)' already cover this case. They allowed uncovering a bug in the comparison:

            if Raw_level_repr.(attested_level >= migration_level) then

found here: https://gitlab.com/tezos/tezos/-/blob/master/src/proto_alpha/lib_protocol/dal_slot_repr.ml#L575. In fact, the condition should rather be attested_level > migration_level, since migration_level = Storage.Tenderbake.First_level_of_protocol.get refers to the level at which the new protocol is activated but not yet run (the first effective level of the new protocol is migration_level + 1).
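
A minimal sketch of the corrected selection (plain integers instead of Raw_level_repr; the names are illustrative, not the actual protocol code):

  type hash_version = Legacy | New

  (* migration_level is the activation level of the new protocol, which only
     effectively runs from migration_level + 1 on, so cells attested at
     migration_level itself must still be hashed with the legacy function. *)
  let hash_version_of ~attested_level ~migration_level =
    if attested_level > migration_level then New else Legacy

  (* With migration_level = 20 (as in the manual tests):
     attested_level = 20 -> Legacy, attested_level = 21 -> New. *)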

To help find the bug and test the fix, we added some debug traces to the Alpha and Quebec protocols. With the fix, we got something like the following (where migration_level was set to 20 and attestation_lag to 8):

...
### [Quebec] hashing cell at level published_level: 12, index: 30 yield hash dask13HZTUFEurzUD1GDtfVDn8TBvBQUB9ZjVE2rq2XGAmuGCFR2Fop
### [Quebec] hashing cell at level published_level: 12, index: 31 yield hash dask12ahsH5aCQ5ex1Jb6XNux6T4PKHprTsSQkJk8MLoPn1F3bUASwS
### [Alpha] hashing cell at level published_level: 12, index: 31 yield hash dask12ahsH5aCQ5ex1Jb6XNux6T4PKHprTsSQkJk8MLoPn1F3bUASwS (version=legacy)
### [Alpha] hashing cell at level published_level: 13, index: 0 yield hash dask12fUNgSM9ziEHC4A85XeQBsedtt9VzyQB5PQ1hi925WZyGNqkwv (version=new)
### [Alpha] hashing cell at level published_level: 13, index: 0 yield hash dask12fUNgSM9ziEHC4A85XeQBsedtt9VzyQB5PQ1hi925WZyGNqkwv (version=new)
...

Notice that the hash function first used by protocol Alpha at level 13, on the cell with published_level = 12, is the legacy hash function.

Testing remove_old_published_dal_slot_headers in init storage

Since the encoding/type of the commitments waiting for attestation that we store in the L1 context evolved, we chose to wipe all published slots in the interval [migration_level - attestation_lag, migration_level] (those published in the old protocol but attested in the new one).
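
As a worked example, a tiny self-contained sketch (plain integers, not the protocol code) of the wiped interval with the values used in the manual tests (migration_level = 20, attestation_lag = 8):

  (* Published levels wiped during migration: the closed interval
     [migration_level - attestation_lag; migration_level]. *)
  let levels_to_wipe ~migration_level ~attestation_lag =
    List.init (attestation_lag + 1) (fun n -> migration_level - n)

  (* levels_to_wipe ~migration_level:20 ~attestation_lag:8
     = [20; 19; 18; 17; 16; 15; 14; 13; 12] *)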

With the current implementation, test --file tezt/tests/dal.ml --title 'Quebec->Alpha: test commitments history with migration (storage_backend = legacy)' succeeds. But:

  • if we change aux ctxt 0 to aux ctxt 1 in function remove_old_published_dal_slot_headers of init_storage.ml (to keep slots published at migration_level), we get a parsing error as expected:
[18:35:38.173] [node1] Oct 30 19:35:38.174:   Storage error:
[18:35:38.173] [node1] Oct 30 19:35:38.174:     Failed to parse the data at 'dal/level/20/slot_headers'.
[18:35:38.174] [client2] Error:
[18:35:38.174] [client2]   Storage error:
[18:35:38.174] [client2]     Failed to parse the data at 'dal/level/20/slot_headers'.
  • if we change if Compare.Int.(n > attestation_lag) then to if Compare.Int.(n >= attestation_lag) then in the same function (to keep slots published at migration_level - attestation_lag), the test still succeeds, because there is an off-by-one in the code (not an issue). But changing the condition to if Compare.Int.(n >= attestation_lag - 1) then triggers the expected error:
[18:49:10.072] [node1] Oct 30 19:49:10.072:   Storage error:
[18:49:10.072] [node1] Oct 30 19:49:10.072:     Failed to parse the data at 'dal/level/13/slot_headers'.
[18:49:10.072] [client2] Error:
[18:49:10.072] [client2]   Storage error:
[18:49:10.072] [client2]     Failed to parse the data at 'dal/level/13/slot_headers'.
[18:49:10.075] client2 exited with code 1.

Testing migration_level value when adding new cells to the skip list

In dal_slot_storage.ml, function update_skip_list:

  • altering the proto_migration_level by making it one level in the past:
  let proto_migration_level =
    Raw_level_repr.sub proto_migration_level 1 |> function
    | Some v -> v
    | None -> assert false
  in

triggers an error, because it would make the new protocol hash cells already produced (and hashed) by the old one, which makes the hashing globally non-deterministic:

[06:54:27.653] [RPC] GET http://127.0.0.1:52554/plugin/ProtoALphaALphaALphaALphaALphaALphaALphaALphaDdp3zK/commitments_history/hash/dask12SAoH6WDwMzritC4CVyBFrc51mrsihVY53fUVQfUqntmNv3L6g
[06:54:27.653] [RPC] RPC response: 500 Internal Server Error
[06:54:27.653] [RPC] [
[06:54:27.653] [RPC]   {
[06:54:27.653] [RPC]     "kind": "permanent",
[06:54:27.653] [RPC]     "id": "stdlib_unix.missing_kvs_data",
[06:54:27.653] [RPC]     "filepath": "/var/folders/z3/8nqng_p9727_38v_n0bj14lm0000gn/T/tezt-98489/1/octez-dal-node1/store/skip_list_store/cells/AAAAIEu3wsJVQUQfDBUQxben0ul36-0HWDHAKM7xkOpp_kim",
[06:54:27.653] [RPC]     "index": 0
[06:54:27.653] [RPC]   }
[06:54:27.653] [RPC] ]
[06:54:27.653] [error] GET http://127.0.0.1:52554/plugin/ProtoALphaALphaALphaALphaALphaALphaALphaALphaDdp3zK/commitments_history/hash/dask12SAoH6WDwMzritC4CVyBFrc51mrsihVY53fUVQfUqntmNv3L6g returned 500 Internal Server Error
  • altering the proto_migration_level by making it one level in the future:
  let proto_migration_level = Raw_level_repr.add proto_migration_level 1 in

doesn't trigger any error. This is fine: it is OK for the new protocol to produce its first skip list cells while still using the old hash function (the hashing remains globally consistent/deterministic).
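
To summarize the constraint, a small self-contained sketch (illustrative names only, not the protocol code) of why shifting the switch point is only safe in the future direction:

  (* Quebec always hashes with the legacy function; Alpha switches to the new
     function for cells strictly above [switch_level] (normally the real
     migration_level). *)
  let version_used_by_alpha ~switch_level ~attested_level =
    if attested_level > switch_level then `New else `Legacy

  (* Cells up to the real migration_level were already hashed by Quebec with
     the legacy function (see the trace above). Moving the switch one level
     into the future keeps Alpha on `Legacy for those cells, so hashes stay
     consistent. Moving it one level into the past makes Alpha pick `New for a
     cell Quebec already hashed with `Legacy, which is exactly the
     non-determinism detected above. *)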

Initial notes

Add a test for:

  • the migration case in init_storage (migration of published slot headers' type to remember the publisher's TZ address)
  • the migration of skip list cells
