dstm<->dtss msync appears to have cache update issues
Overview
Given a setup with dtss<->dstm in steady operation, where things are working as they should, the following has nevertheless been observed:
Initial conditions
The system has been running (for hours/days) with the dtss<->dstm services operating normally: the time-series observed at the db-storage, in the dtss.cache, and in the mirrored dstm.dtss.cache are equal. Changing time-series at the dtss, or through the dstm, also works as it should, including cache propagation, notify-change, and expression subscription re-evaluation.
Other information/assumptions (to be verified where possible):
- Version assumed to be the latest: 19.0.0 (it might not matter for this specific case, but it could)
- msync parameters at their defaults:
  - master_poll every 0.01 s
  - unsubscribe parameters: max items 100 (unsubscribe carried out immediately if more than this), max defer (any items) 1.0 s
- there was no msync loss/restore of communication (observed or known)
- all time-series requests were for existing items (i.e. no misconfiguration in the setup causing non-existent time-series to be read (and subscribed to))
- there was no change of time-series resolution (e.g. overwriting an hourly time-series with a 15 min time-series etc.)
Observations done
At some point, after a change/write to a time-series 'a' on the dtss, the change is reflected and observed correctly at the dtss, both at db-level and in the cache.
The dstm.cache, however, still appears to hold the old version of the 'a' time-series. Thus the normal and required propagation of changes from the dtss to the dstm.cache seems to have been missed.
Additional actions and observations
- dstm.read('a', use_cache_if_available) reveals the old values of 'a'.
- dstm.read('a', with_no_cache) gives the correct dtss master values of 'a'.
- dstm.read('a', with_no_cache, but_update_local_cache_with_read_result) gives the correct values of 'a', and subsequent reads using the cache now give correct values.
- The dstm msync worker seems to be alive (not dead): other time-series in the cache are updated when they change at the dtss, and the dstm.cache.misses delta is zero, so the msync has indeed transported those changes from the dtss to the dstm cache.
- At the initial startup, the task startup might have removed time-series at the dtss (related to tasks) for which the dstm.msync had already established subscriptions (web-clients, still holding the old tasks, will generate requests that create subs/cache entries etc.).
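The dstm.read variants above can be summarized with a small, self-contained model. This is illustrative only; the dictionaries and the read() helper below stand in for the dstm py-API cache semantics and are not the actual interface:

```python
master = {"a": [1.0, 2.0, 3.0]}       # values as stored on the dtss master (db + dtss.cache)
dstm_cache = {"a": [1.0, 2.0, 9.9]}   # stale replica in the dstm.cache (the observed state)

def read(ts, use_cache=True, update_cache=False):
    if use_cache and ts in dstm_cache:
        return dstm_cache[ts]                 # 'use_cache_if_available': may return stale values
    values = list(master[ts])                 # 'with_no_cache': always reflects the dtss master
    if update_cache:
        dstm_cache[ts] = values               # '...but_update_local_cache_with_read_result'
    return values

print(read("a"))                                      # old (stale) values
print(read("a", use_cache=False))                     # correct master values
print(read("a", use_cache=False, update_cache=True))  # correct values, and the cache is repaired
print(read("a"))                                      # now correct, served from the refreshed cache
```

This mirrors the observation that only the no-cache read path reflects the dtss master, and that a no-cache read with cache update repairs the replica.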
Assessments
It appears 'as if' the msync has missed a 'notify-change' and the subscription-read that normally, and per design, should update the dstm.cache. This is based on the observations above:
- The dtss master was correct, both in db and in cache
- The dstm could be forced to update its own cache, and this had the wanted effect
We have also looked into ts_frag.merge, which is in action when caching ts_frags; its test coverage is extensive, including the cases for this specific time-series:
- a break-point ts, stair-case
- its initial cache content spanned roughly two weeks, with a few values at the beginning and the last one extending to the end of the time-series period
- the 'patch', or change, introduced two new values into an 'empty' period, most likely not overwriting existing points but merely adding two points (t1,0.0), (t2,1.0), which is a trivial case that is covered (see the sketch below)
However, since it was possible to bring the cache back in sync by a simple read/update, and since the merge is a pure algorithm with extensive test coverage, it does not seem to be a probable candidate to look at first.
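For reference, a minimal sketch of that trivial merge case (an illustration of the expected behaviour, not the actual ts_frag.merge implementation; the point lists and patch period are made up to match the description above):

```python
def merge_points(cached, patch, patch_period):
    """cached/patch: sorted lists of (t, value); patch_period: (start, end) the patch covers."""
    start, end = patch_period
    kept = [(t, v) for (t, v) in cached if not (start <= t < end)]  # drop cached points inside the patch period
    return sorted(kept + patch)

cached = [(0, 5.0), (3600, 6.0), (14 * 86400, 7.0)]   # few values at the start, last one extends to the end
patch = [(5 * 86400, 0.0), (6 * 86400, 1.0)]          # (t1, 0.0), (t2, 1.0) written into an 'empty' period
print(merge_points(cached, patch, (5 * 86400, 7 * 86400)))
# -> cached points untouched, the two new points simply inserted
```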
So, it seems like:
- the dstm.cache did have 'a' with the old values
- the dstm.msync did not detect the change, which implies it did not have a 'sub' for 'a'
Thus the question is:
Is it possible that the dstm/msync can allow an item into its cache (like 'a') without at the same time having an active 'sub' (with a gross period covering the cache entry)?
(Notice that there are variations of this, but they are more or less ruled out by the observations done.)
Possible root-cause finding (investigation concluded)
Given that there was a restart or a communication loss: reading through the code related to recovering from communication loss, it appears that it is indeed possible, and quite likely, that the dstm ends up losing its dtss.subs.
Given the following sequence:
- dstm & dtss start up, establishes normal operation, with ten-thousands of subs that are kept in sync, and working. (the assumed observed state).
- the dtss is shut down (a restart or network connection)
- the dstm tries to recover the subs, but fails (takes longer than 60 s, or dlib socket error is (re)thrown). In this case the worker just continue its loop.
- the dtss comes online again (and existing sessions are on the dtss are cleared out, either due to a complete restart, or if network problems, server-side clean-up of failed client-connections)
- worker resumes its work using the msync io connection
IF we on this attempt do not detect there was a lost connection, and restore it, then all subs prior to the broken connection will be 'cold' (not present on the dtss.side, thus there will be no updates on this, but updates on any new/refreshed subs will start to fill in).
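A pseudo-sketch of that sequence, to make the failure mode concrete. Client and method names here are hypothetical, not the actual msync worker code:

```python
import time

def msync_worker(client, subs, cache, master_poll=0.01):
    """subs: locally known subscriptions; cache: the dstm replica cache (both hypothetical)."""
    while True:
        try:
            # the server only reports changes for subs it knows about in the *current* session
            for ts_id, period in client.poll_changes():
                cache[ts_id] = client.read(ts_id, period)
        except ConnectionError:
            try:
                client.reconnect()
                client.resubscribe(subs)     # re-establish all subs on the new server session
            except (ConnectionError, TimeoutError):
                pass                         # recovery failed (> 60 s, or socket error re-thrown):
                                             # the worker just continues its loop
        time.sleep(master_poll)
        # Failure mode: if the dtss restarts and the next poll does NOT raise/detect the lost
        # connection, poll_changes() succeeds against an empty server-side sub set; every sub
        # established before the restart is now 'cold' and its cache entries are never refreshed.
```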
Findings:
- The dtss::client connection_count is quite clever at telling about any restore of communication, incrementing on connection loss. So, at a first walk-through, it does not seem likely that a communication loss/recovery goes undetected (when detected, it will attempt to restore all subscriptions and continue with the full set of subs).
- The unsubscribe call invoked by the worker thread might throw an exception if the connection is lost during that call. This could cause the msync.worker to terminate (and thus ALL updates would be lost). However, per the observations, the msync.worker seems to be alive, so this is not likely the cause.
- As @jehelset pointed out, there is a small possibility related to optimizing away unneeded subscription checks: the check is done prior to some work, and the conditions established before the work may have changed by the time actions are taken on the decision. The concrete case is that the dstm establishes the first sub (so is_active() goes from false to true) while this work is in progress. Given that coincidence, the dstm would have read and cached the old values and set the water-mark for the cached series, while at the dtss server the notify-change (on write) was skipped (because subs were empty when the store started). Ref. to fix for possible glitch. An illustrative re-enactment is sketched after this list.
- Found a repeatable trivial case: a combination of the break-point ts surrounding-read semantics and updates at the main dtss that causes old values to appear in the replicated cache.
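A minimal sequential re-enactment of the check-then-act glitch mentioned above. Variable names are hypothetical and the concurrency is shown as ordering only, not real threads:

```python
# Server-side decision made *before* the store work:
subs_active_for_a = False                 # no dstm sub for 'a' yet
skip_notify_change = not subs_active_for_a

# ...while the store work is in progress, the dstm does its first read of 'a':
subs_active_for_a = True                  # is_active() flips from False to True
dstm_cache_a = "old values of 'a'"        # old values cached, water-mark set at read time

# Server finishes writing the new values and acts on the stale decision:
master_a = "new values of 'a'"
if not skip_notify_change:
    dstm_cache_a = master_a               # the subscription update that never happens
print(dstm_cache_a, "!=", master_a)       # dstm keeps the old values; nothing triggers a refresh
```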
Actions to resolve issue
- Verify that we do have tests covering the case (Done: indeed, there are)
- 1st: read through the code and look for possible glitches/race conditions (Done: hard to find anything, but possible to improve; none found that could explain the observations)
- Initial code changes and improvements, msync fixes:
  - Enforce that msync always subscribes on read (remove the option to not subscribe on read). The original code in play did indeed always pass subscribe-on-read, so as of now we cannot consider this a root cause.
  - Put mechanisms in place so that dstm.cache items that msync is not subscribing to are detected and evicted from the cache (this does not address the root cause, but it eliminates a long-lasting effect of it, and also logs if 'the impossible' happens, helping us resolve it while working on it). A hedged sketch follows below.
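A hedged sketch of that safeguard; names and data shapes are placeholders, not the actual msync code:

```python
import logging

def evict_unsubscribed(cache, subs):
    """cache: {ts_id: (cached_period, frag)}, subs: {ts_id: gross_period}; periods are (start, end)."""
    for ts_id, (period, _frag) in list(cache.items()):
        sub = subs.get(ts_id)
        covered = sub is not None and sub[0] <= period[0] and period[1] <= sub[1]
        if not covered:
            # 'the impossible': a cached item without a covering subscription
            logging.error("dstm.cache item %s %s has no covering sub (%s); evicting", ts_id, period, sub)
            del cache[ts_id]

cache = {"a": ((0, 100), "frag-a"), "b": ((0, 100), "frag-b")}
subs = {"a": (0, 200)}                    # no sub for 'b' -> 'b' gets logged and evicted
evict_unsubscribed(cache, subs)
print(cache)                              # {'a': ((0, 100), 'frag-a')}
```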
- Add a specific test to repeat the observed issue, try to reconstruct a repeatable error:
  - Is it possible to force the dstm.cache to have an 'a' entry without valid subs covering it? Finally, thanks to @jehelset's inputs, we found a repeatable trivial case for break-point ts, related to surrounding reads (not the subs themselves, but the effect of them), that causes the cache inconsistency observed (see the simulation sketched after this item):
    - initial condition:
      - cache ok, subs ok for 'a'
      - read-period [t3 .. t4) that gives a surrounding read, which is cached
      - reply period [t1 .. t5), total period [t0 .. t7)
    - event 1 at server:
      - new frag write [t1,t2 .. t2), notify-change done
    - event 2, subs update at dstm/slave:
      - subs update read [t3 .. t4)
      - reply frag [t2 .. t5)
      - the cache update on the dstm/slave will now keep the old t1/value in the cache! Permanent damage, until the client-side read-period is extended outside the first surrounding read, i.e. most likely never.
    - All values to the left of the new surrounding-read value will be outdated (never updated).
    - Combinations of subs period and order of updates can be constructed to get arbitrarily outdated values in the gap between the first surrounding read and the sub period.
    - The same applies to the right-hand-side surrounding read (linear ts).
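A small, self-contained simulation of the timeline above. The stair-case ts is a plain {t: value} dict, and surrounding_read/merge are simplified stand-ins for the actual read and cache-merge semantics, so this is an illustration of the mechanism rather than the Shyft code:

```python
t0, t1, t2, t3, t4, t5, t7 = 0, 10, 20, 30, 40, 50, 70

def surrounding_read(points, start, end):
    """Read [start, end): the reply starts at the last break-point <= start
    and ends at the first break-point >= end."""
    times = sorted(points)
    lo = max((t for t in times if t <= start), default=start)
    hi = min((t for t in times if t >= end), default=end)
    return {t: points[t] for t in times if lo <= t < hi}, (lo, hi)

def merge(cache_points, cache_period, frag_points, frag_period):
    """Cache update: points inside the new frag period are replaced, the rest are kept."""
    lo, hi = frag_period
    kept = {t: v for t, v in cache_points.items() if not (lo <= t < hi)}
    kept.update(frag_points)
    return kept, (min(cache_period[0], lo), max(cache_period[1], hi))

master = {t0: 1.0, t1: 2.0, t5: 5.0}                    # master dtss content, total period [t0, t7)

# initial condition: dstm reads [t3, t4); the surrounding read caches the point at t1
cache, cache_period = surrounding_read(master, t3, t4)   # -> {t1: 2.0}, reply period [t1, t5)

# event 1 at server: new frag written, changing the value at t1 and adding a break-point at t2
master.update({t1: 2.5, t2: 3.0})                        # notify-change done at the dtss

# event 2: msync subs-update re-reads the sub period [t3, t4); the reply frag is now [t2, t5)
frag, frag_period = surrounding_read(master, t3, t4)     # -> {t2: 3.0}, reply period [t2, t5)
cache, cache_period = merge(cache, cache_period, frag, frag_period)

print(cache)        # -> {10: 2.0, 20: 3.0}: the old value at t1 (2.0) is still in the replica cache
print(master[t1])   # -> 2.5 on the master: the cached t1 value is outdated and never refreshed
```

Any later read of a period inside [t3, t4) produces a reply whose surrounding read starts at t2, so the stale point at t1 is never touched again.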
- Find a fix that resolves the repeatable issue above. The fix: ensure that the subs period at the dtss main and in msync is always extended to cover the real period of the read reply; subs covering what is cached resolves the problem consistently (see the continuation of the sketch below).
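Continuing the simulation sketched above, under the same simplified semantics: if the subscription, and hence the subs-update read, covers the actual cached/reply period instead of only the requested [t3, t4), the update read again starts at t1 and overwrites the stale point:

```python
sub_start, sub_end = cache_period                      # sub extended to cover what is cached: [t1, t5)
frag, frag_period = surrounding_read(master, sub_start, sub_end)
cache, cache_period = merge(cache, cache_period, frag, frag_period)
print(cache)                                           # -> {10: 2.5, 20: 3.0}: consistent with the master
```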
- Improve/deduplicate code:
  - Remove bw-compat code that by now should be obsolete (try_read/read, try_store/store). This is code maintenance that would be done anyway; it helps minimize code and thus improve code quality (needed).
  - Ensure read-subscription is best effort (so a failed ts-name is pruned out and does not fail the entire read operation).
  - Use a non-tracking strategy for serialization (the use-case where the bug was detected had ~200k series in cache (and subs); under massive changes to 200k series, tracking serialization causes a serious load on the system that can easily be handled when using non-tracking serialization).
  - utctime was tracked and should be untracked, and any larger array operations should be blobs, e.g. time_axis::point_dt.t. Simply using binary serialize would break existing stored models, so for now untracked utctime, plus binary-blob load/store of vectors of binary-serialized data, seems sufficient for this fix.
- During repair of subs, the current approach uses a grand-period read; this could harm the system if there is one ts with, say, 30 years of period and the others with 14 days. The repair routine should bundle ts with the same period and then read per bundle of time-series (a minimal sketch follows below).
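A minimal sketch of the bundling idea; the grouping is illustrative only, the real repair routine would work on the msync subscription set:

```python
from collections import defaultdict

def bundle_by_period(subs):
    """subs: iterable of (ts_id, (start, end)) -> {period: [ts_ids]}"""
    bundles = defaultdict(list)
    for ts_id, period in subs:
        bundles[period].append(ts_id)
    return bundles

subs = [("a", (0, 14 * 86400)), ("b", (0, 14 * 86400)), ("c", (0, 30 * 365 * 86400))]
for period, ts_ids in bundle_by_period(subs).items():
    # one read per bundle: the 30-year series no longer forces a 30-year read of the 14-day ones
    print(period, ts_ids)
```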
- During the update of subs.period introduced, hash lookups are used where they could be transformed into a linear lookup; this causes a potentially large load/cpu penalty for large change-sets and subs.
- Dtss/geo functions are currently pass-through, so they are effectively routed to the server. Thus geo functions, and related time-series, will not be cached locally. This inhibits effective use of this scalability; followed up in #1237 (closed).
- Expose dstm.client.cache_stats and flush_cache_all() to enable more insight into, and control of, the cache.
- Ensure that the default parameters of dstm.evaluate and dstm.evaluate_model cache results at the dstm (a performance and consistency requirement; by default this is the wanted behaviour, similar to the dtss).
- Considered, but for now not needed: an optional/other strategy for cache entries not subscribed to: ensure msync always mirrors the items in the dstm.cache (the periods in the cache are known, so it is indeed possible to do this precisely).
- Add logging/statistics features to the msync interface, so that any loss/restore of communication is possible to detect.
- Rerun manual tests verifying that msync can always recover lost connections to the master dtss. The reason is that during the code review for this issue, it was discovered that the setup of the automatic test was broken at some point and configured to be skipped. So it either needs to be fixed so that it is automatically tested, or manual test procedures for the functions need to be carried out prior to any release (that could influence this part of the code). Scenarios:
  - with zero subs
  - with many subs
  - connection loss detected by the msync worker
  - connection loss detected by dstm.clients (which causes use of the msync.master connection)
- Run a dedicated more-than-one-person code review of the msync (dtss<->dtss, dstm<->dtss) related code.
- Add a 'torture' test-case that simulates high load/updates/subscriptions, in a separate issue to follow up, #1235:
  - A writer thread that randomly updates random period ranges of a mix of fixed-interval and break-point ts at the main dtss.
  - Multiple ts-readers that read/subscribe to the sub-node dtss, or dstm, which should reflect the main dtss (after some acceptable delay).
  - A writer thread that updates a few time-series from the sub-node dtss, or dstm (emulates user input etc.).
  - Run this at high speed for some time.
  - Then stop writer activity (and let msync run for a while to catch up).
  - Then verify that time-series slices as observed from the sub-node dtss, or dstm, are identical to the content on the main dtss.
- Ensure to run valgrind/memcheck after the changes.
- Add any manual tests as a required checklist for releases of new Shyft versions; follow up in #1236.
Possible actions for workaround on existing systems
These are actions that can be done by configuring existing parameters, with no code changes.
- Limit the dstm.cache size so that caching is minimal, or zero (this does not fix a missed subs update, so the end user of the web interface could still see 'stale values', but a refresh would give fresh values). Would require a restart of the service.
- Flush the dstm.cache (can be done from the py interfaces; a hedged sketch follows below). Does not require a restart of the service, but is only useful if errors are detected.
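Pseudo-usage only: the client object and connection details below are placeholders, and the cache_stats/flush_cache_all() names follow the action item above that proposes exposing them:

```python
c = DstmClient("dstm-host:30000")   # placeholder: connect to the running dstm service via the py interface
print(c.cache_stats)                # inspect cache counters before flushing (assumed exposed per the action item)
c.flush_cache_all()                 # drop all cached fragments; subsequent reads repopulate from the dtss master
```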