dstm<->dtss msync appears to have cache update issues
Overview
Given a setup with dtss<->dstm in steady operation, where things are working as they should, the following has nevertheless been observed:
Initial conditions
The system has been running (for hours/days) with the dtss<->dstm services operating normally: the time-series observed at the db-storage, in the dtss.cache, and in the mirrored dstm.dtss.cache are equal. Changing time-series at the dtss, or through the dstm, also works as it should, including cache propagation, notify-change, and expression subscription re-evaluation.
Other information/assumptions (to be verified where possible):
- Version assumed to be the latest: 19.0.0 (it might not matter for this specific case, but it could)
- msync parameters at their defaults:
  - master_poll every 0.01 s
  - unsubscribe parameters: max items 100 (unsubscribe carried out immediately if more than this), max defer (any items) 1.0 s
- there was no msync loss/restore of communication (observed or known)
- all time-series requests were for existing items (i.e. no misconfiguration in the setup causing non-existent time-series to be read (and subscribed to))
- there was no change of time-series resolution (e.g. overwriting an hourly time-series with a 15 min time-series etc.)
Observations done
At some point, after a change/write to a time-series 'a' on the dtss, the change is reflected and observed correctly at the dtss, both at db-level and in the cache.
The dstm.cache, however, still appears to hold the old version of the 'a' time-series. Thus the normal and required propagation of changes from the dtss to the dstm.cache seems to have been missed.
Additional actions and observations
- dstm.read('a', use_cache_if_available) reveals the old values of 'a'.
- dstm.read('a', with_no_cache) gives the correct dtss master values of 'a'.
- dstm.read('a', with_no_cache, but_update_local_cache_with_read_result) gives the correct values of 'a', and subsequent reads using the cache now give correct values.
- The dstm msync worker seems to be alive (not dead): other time-series in the cache are updated when they change at the dtss, and the dstm.cache.misses delta is zero, so the msync has indeed transported those changes from the dtss to the dstm cache.
- At the initial startup, the task startup might have removed time-series at the dtss (related to tasks) for which the dstm.msync had already established subscriptions (web-clients, still holding the old tasks, will generate requests that create subs/cache entries etc.).
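The dstm.read variants above can be summarized with a small, self-contained model. This is illustrative only; the dictionaries and the read() helper below stand in for the dstm py-API cache semantics and are not the actual interface:

```python
master = {"a": [1.0, 2.0, 3.0]}       # values as stored on the dtss master (db + dtss.cache)
dstm_cache = {"a": [1.0, 2.0, 9.9]}   # stale replica in the dstm.cache (the observed state)

def read(ts, use_cache=True, update_cache=False):
    if use_cache and ts in dstm_cache:
        return dstm_cache[ts]                 # 'use_cache_if_available': may return stale values
    values = list(master[ts])                 # 'with_no_cache': always reflects the dtss master
    if update_cache:
        dstm_cache[ts] = values               # '...but_update_local_cache_with_read_result'
    return values

print(read("a"))                                      # old (stale) values
print(read("a", use_cache=False))                     # correct master values
print(read("a", use_cache=False, update_cache=True))  # correct values, and the cache is repaired
print(read("a"))                                      # now correct, served from the refreshed cache
```

This mirrors the observation that only the no-cache read path reflects the dtss master, and that a no-cache read with cache update repairs the replica.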
Assessments
It appears 'as if' the msync has missed a 'notify-change' and the subscription-read that normally, and per design, should update the dstm.cache. This is based on the observations above:
- The dtss master was correct, both in db and in cache
- The dstm could be forced to update its own cache, and this had the wanted effect
We have also looked into ts_frag.merge, which is in action when caching ts_frags; its test coverage is extensive, including the cases for this specific time-series:
- a break-point ts, stair-case
- its initial cache content spanned roughly two weeks, with a few values at the beginning and the last one extending to the end of the time-series period
- the 'patch', or change, introduced two new values into an 'empty' period, most likely not overwriting existing points but merely adding two points (t1,0.0), (t2,1.0), which is a trivial case that is covered (see the sketch below)
However, since it was possible to bring the cache back in sync by a simple read/update, and since the merge is a pure algorithm with extensive test coverage, it does not seem to be a probable candidate to look at first.
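For reference, a minimal sketch of that trivial merge case (an illustration of the expected behaviour, not the actual ts_frag.merge implementation; the point lists and patch period are made up to match the description above):

```python
def merge_points(cached, patch, patch_period):
    """cached/patch: sorted lists of (t, value); patch_period: (start, end) the patch covers."""
    start, end = patch_period
    kept = [(t, v) for (t, v) in cached if not (start <= t < end)]  # drop cached points inside the patch period
    return sorted(kept + patch)

cached = [(0, 5.0), (3600, 6.0), (14 * 86400, 7.0)]   # few values at the start, last one extends to the end
patch = [(5 * 86400, 0.0), (6 * 86400, 1.0)]          # (t1, 0.0), (t2, 1.0) written into an 'empty' period
print(merge_points(cached, patch, (5 * 86400, 7 * 86400)))
# -> cached points untouched, the two new points simply inserted
```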
So, it seems like:
- the dstm.cache did have 'a' with the old values
- the dstm.msync did not detect the change, which implies it did not have a 'sub' for 'a'
Thus the question is:
Is it possible that the dstm/msync can allow an item into its cache (like 'a') without at the same time having an active 'sub' (with a gross period covering the cache entry)?
(Notice that there are variations of this, but they are more or less ruled out by the observations done.)
Possible root-cause finding (investigation concluded)
Given that there was a restart or a communication loss: reading through the code related to recovering from communication loss, it appears that it is indeed possible, and quite likely, that the dstm ends up losing its dtss.subs.
Given the following sequence:
- dstm & dtss start up, establishes normal operation, with ten-thousands of subs that are kept in sync, and working. (the assumed observed state).
- the dtss is shut down (a restart or network connection)
- the dstm tries to recover the subs, but fails (takes longer than 60 s, or dlib socket error is (re)thrown). In this case the worker just continue its loop.
- the dtss comes online again (and existing sessions are on the dtss are cleared out, either due to a complete restart, or if network problems, server-side clean-up of failed client-connections)
- worker resumes its work using the msync io connection
IF we on this attempt do not detect there was a lost connection, and restore it, then all subs prior to the broken connection will be 'cold' (not present on the dtss.side, thus there will be no updates on this, but updates on any new/refreshed subs will start to fill in).
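A pseudo-sketch of that sequence, to make the failure mode concrete. Client and method names here are hypothetical, not the actual msync worker code:

```python
import time

def msync_worker(client, subs, cache, master_poll=0.01):
    """subs: locally known subscriptions; cache: the dstm replica cache (both hypothetical)."""
    while True:
        try:
            # the server only reports changes for subs it knows about in the *current* session
            for ts_id, period in client.poll_changes():
                cache[ts_id] = client.read(ts_id, period)
        except ConnectionError:
            try:
                client.reconnect()
                client.resubscribe(subs)     # re-establish all subs on the new server session
            except (ConnectionError, TimeoutError):
                pass                         # recovery failed (> 60 s, or socket error re-thrown):
                                             # the worker just continues its loop
        time.sleep(master_poll)
        # Failure mode: if the dtss restarts and the next poll does NOT raise/detect the lost
        # connection, poll_changes() succeeds against an empty server-side sub set; every sub
        # established before the restart is now 'cold' and its cache entries are never refreshed.
```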
Findings:
- The dtss::client connection_count is quite clever at telling about any restore of communication, incrementing on connection loss. So, at a first walk-through, it does not seem likely that a communication loss/recovery goes undetected (when detected, it will attempt to restore all subscriptions and continue with the full set of subs).
- The unsubscribe call invoked by the worker thread might throw an exception if the connection is lost during that call. This could cause the msync.worker to terminate (and thus ALL updates would be lost). However, per the observations, the msync.worker seems to be alive, so this is not likely the cause.
- As @jehelset pointed out, there is a small possibility related to optimizing away unneeded subscription checks: the check is done prior to some work, and the conditions established before the work may have changed by the time actions are taken on the decision. The concrete case is that the dstm establishes the first sub (so is_active() goes from false to true) while this work is in progress. Given that coincidence, the dstm would have read and cached the old values and set the water-mark for the cached series, while at the dtss server the notify-change (on write) was skipped (because subs were empty when the store started). Ref. to fix for possible glitch. An illustrative re-enactment is sketched after this list.
- Found a repeatable trivial case: a combination of the break-point ts surrounding-read semantics and updates at the main dtss that causes old values to appear in the replicated cache.
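A minimal sequential re-enactment of the check-then-act glitch mentioned above. Variable names are hypothetical and the concurrency is shown as ordering only, not real threads:

```python
# Server-side decision made *before* the store work:
subs_active_for_a = False                 # no dstm sub for 'a' yet
skip_notify_change = not subs_active_for_a

# ...while the store work is in progress, the dstm does its first read of 'a':
subs_active_for_a = True                  # is_active() flips from False to True
dstm_cache_a = "old values of 'a'"        # old values cached, water-mark set at read time

# Server finishes writing the new values and acts on the stale decision:
master_a = "new values of 'a'"
if not skip_notify_change:
    dstm_cache_a = master_a               # the subscription update that never happens
print(dstm_cache_a, "!=", master_a)       # dstm keeps the old values; nothing triggers a refresh
```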
Actions to resolve issue
- Verify that we do have tests covering the case (Done: indeed, there are)
- 1st: read through the code and look for possible glitches/race conditions (Done: hard to find anything, but possible to improve; none found that could explain the observations)
- Initial code changes and improvements, msync fixes:
  - Enforce that msync always subscribes on read (remove the option to not subscribe on read). The original code in play did indeed always pass subscribe-on-read, so as of now we cannot consider this a root cause.
  - Put mechanisms in place so that dstm.cache items that msync is not subscribing to are detected and evicted from the cache (this does not address the root cause, but it eliminates a long-lasting effect of it, and also logs if 'the impossible' happens, helping us resolve it while working on it). A hedged sketch follows below.
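A hedged sketch of that safeguard; names and data shapes are placeholders, not the actual msync code:

```python
import logging

def evict_unsubscribed(cache, subs):
    """cache: {ts_id: (cached_period, frag)}, subs: {ts_id: gross_period}; periods are (start, end)."""
    for ts_id, (period, _frag) in list(cache.items()):
        sub = subs.get(ts_id)
        covered = sub is not None and sub[0] <= period[0] and period[1] <= sub[1]
        if not covered:
            # 'the impossible': a cached item without a covering subscription
            logging.error("dstm.cache item %s %s has no covering sub (%s); evicting", ts_id, period, sub)
            del cache[ts_id]

cache = {"a": ((0, 100), "frag-a"), "b": ((0, 100), "frag-b")}
subs = {"a": (0, 200)}                    # no sub for 'b' -> 'b' gets logged and evicted
evict_unsubscribed(cache, subs)
print(cache)                              # {'a': ((0, 100), 'frag-a')}
```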
- Add a specific test to repeat the observed issue, try to reconstruct a repeatable error:
  - Is it possible to force the dstm.cache to have an 'a' entry without valid subs covering it? Finally, thanks to @jehelset's inputs, we found a repeatable trivial case for break-point ts, related to surrounding reads (not the subs themselves, but the effect of them), that causes the cache inconsistency observed (see the simulation sketched after this item):
    - initial condition:
      - cache ok, subs ok for 'a'
      - read-period [t3 .. t4) that gives a surrounding read, which is cached
      - reply period [t1 .. t5), total period [t0 .. t7)
    - event 1 at server:
      - new frag write [t1,t2 .. t2), notify-change done
    - event 2, subs update at dstm/slave:
      - subs update read [t3 .. t4)
      - reply frag [t2 .. t5)
      - the cache update on the dstm/slave will now keep the old t1/value in the cache! Permanent damage, until the client-side read-period is extended outside the first surrounding read, i.e. most likely never.
    - All values to the left of the new surrounding-read value will be outdated (never updated).
    - Combinations of subs period and order of updates can be constructed to get arbitrarily outdated values in the gap between the first surrounding read and the sub period.
    - The same applies to the right-hand-side surrounding read (linear ts).
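A small, self-contained simulation of the timeline above. The stair-case ts is a plain {t: value} dict, and surrounding_read/merge are simplified stand-ins for the actual read and cache-merge semantics, so this is an illustration of the mechanism rather than the Shyft code:

```python
t0, t1, t2, t3, t4, t5, t7 = 0, 10, 20, 30, 40, 50, 70

def surrounding_read(points, start, end):
    """Read [start, end): the reply starts at the last break-point <= start
    and ends at the first break-point >= end."""
    times = sorted(points)
    lo = max((t for t in times if t <= start), default=start)
    hi = min((t for t in times if t >= end), default=end)
    return {t: points[t] for t in times if lo <= t < hi}, (lo, hi)

def merge(cache_points, cache_period, frag_points, frag_period):
    """Cache update: points inside the new frag period are replaced, the rest are kept."""
    lo, hi = frag_period
    kept = {t: v for t, v in cache_points.items() if not (lo <= t < hi)}
    kept.update(frag_points)
    return kept, (min(cache_period[0], lo), max(cache_period[1], hi))

master = {t0: 1.0, t1: 2.0, t5: 5.0}                    # master dtss content, total period [t0, t7)

# initial condition: dstm reads [t3, t4); the surrounding read caches the point at t1
cache, cache_period = surrounding_read(master, t3, t4)   # -> {t1: 2.0}, reply period [t1, t5)

# event 1 at server: new frag written, changing the value at t1 and adding a break-point at t2
master.update({t1: 2.5, t2: 3.0})                        # notify-change done at the dtss

# event 2: msync subs-update re-reads the sub period [t3, t4); the reply frag is now [t2, t5)
frag, frag_period = surrounding_read(master, t3, t4)     # -> {t2: 3.0}, reply period [t2, t5)
cache, cache_period = merge(cache, cache_period, frag, frag_period)

print(cache)        # -> {10: 2.0, 20: 3.0}: the old value at t1 (2.0) is still in the replica cache
print(master[t1])   # -> 2.5 on the master: the cached t1 value is outdated and never refreshed
```

Any later read of a period inside [t3, t4) produces a reply whose surrounding read starts at t2, so the stale point at t1 is never touched again.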
- Find a fix that resolves the repeatable issue above. The fix: ensure that the subs period at the dtss main and in msync is always extended to cover the real period of the read reply; subs covering what is cached resolves the problem consistently (see the continuation of the sketch below).
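Continuing the simulation sketched above, under the same simplified semantics: if the subscription, and hence the subs-update read, covers the actual cached/reply period instead of only the requested [t3, t4), the update read again starts at t1 and overwrites the stale point:

```python
sub_start, sub_end = cache_period                      # sub extended to cover what is cached: [t1, t5)
frag, frag_period = surrounding_read(master, sub_start, sub_end)
cache, cache_period = merge(cache, cache_period, frag, frag_period)
print(cache)                                           # -> {10: 2.5, 20: 3.0}: consistent with the master
```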
- Improve/deduplicate code:
  - Remove bw-compat code that by now should be obsolete (try_read/read, try_store/store). This is code maintenance that would be done anyway; it helps minimize code and thus improve code quality (needed).
  - Ensure read-subscription is best effort (so a failed ts-name is pruned out and does not fail the entire read operation).
  - Use a non-tracking strategy for serialization (the use-case where the bug was detected had ~200k series in cache (and subs); under massive changes to 200k series, tracking serialization causes a serious load on the system that can easily be handled when using non-tracking serialization).
  - utctime was tracked and should be untracked, and any larger array operations should be blobs, e.g. time_axis::point_dt.t. Simply using binary serialize would break existing stored models, so for now untracked utctime, plus binary-blob load/store of vectors of binary-serialized data, seems sufficient for this fix.
- During repair of subs, the current approach uses a grand-period read; this could harm the system if there is one ts with, say, 30 years of period and the others with 14 days. The repair routine should bundle ts with the same period and then read per bundle of time-series (a minimal sketch follows below).
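A minimal sketch of the bundling idea; the grouping is illustrative only, the real repair routine would work on the msync subscription set:

```python
from collections import defaultdict

def bundle_by_period(subs):
    """subs: iterable of (ts_id, (start, end)) -> {period: [ts_ids]}"""
    bundles = defaultdict(list)
    for ts_id, period in subs:
        bundles[period].append(ts_id)
    return bundles

subs = [("a", (0, 14 * 86400)), ("b", (0, 14 * 86400)), ("c", (0, 30 * 365 * 86400))]
for period, ts_ids in bundle_by_period(subs).items():
    # one read per bundle: the 30-year series no longer forces a 30-year read of the 14-day ones
    print(period, ts_ids)
```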
- During the update of subs.period introduced, hash lookups are used where they could be transformed into a linear lookup; this causes a potentially large load/cpu penalty for large change-sets and subs.
- Dtss/geo functions are currently pass-through, so they are effectively routed to the server. Thus geo functions, and related time-series, will not be cached locally. This inhibits effective use of this scalability; followed up in #1237 (closed).
- Expose dstm.client.cache_stats and flush_cache_all() to enable more insight into, and control of, the cache.
- Ensure that the default parameters of dstm.evaluate and dstm.evaluate_model cache results at the dstm (a performance and consistency requirement; by default this is the wanted behaviour, similar to the dtss).
- Considered, but for now not needed: an optional/other strategy for cache entries not subscribed to: ensure msync always mirrors the items in the dstm.cache (the periods in the cache are known, so it is indeed possible to do this precisely).
- Add logging/statistics features to the msync interface, so that any loss/restore of communication is possible to detect.
- Rerun manual tests verifying that msync can always recover lost connections to the master dtss. The reason is that during the code review for this issue, it was discovered that the setup of the automatic test was broken at some point and configured to be skipped. So it either needs to be fixed so that it is automatically tested, or manual test procedures for the functions need to be carried out prior to any release (that could influence this part of the code). Scenarios:
  - with zero subs
  - with many subs
  - connection loss detected by the msync worker
  - connection loss detected by dstm.clients (which causes use of the msync.master connection)
- Run a dedicated more-than-one-person code review of the msync (dtss<->dtss, dstm<->dtss) related code.
- Add a 'torture' test-case that simulates high load/updates/subscriptions, in a separate issue to follow up, #1235:
  - A writer thread that randomly updates random period ranges of a mix of fixed-interval and break-point ts at the main dtss.
  - Multiple ts-readers that read/subscribe to the sub-node dtss, or dstm, which should reflect the main dtss (after some acceptable delay).
  - A writer thread that updates a few time-series from the sub-node dtss, or dstm (emulates user input etc.).
  - Run this at high speed for some time.
  - Then stop writer activity (and let msync run for a while to catch up).
  - Then verify that time-series slices as observed from the sub-node dtss, or dstm, are identical to the content on the main dtss.
- Ensure to run valgrind/memcheck after the changes.
- Add any manual tests as a required checklist for releases of new Shyft versions; follow up in #1236.
Possible actions for workaround on existing systems
These are actions that can be done by configuring existing parameters, with no code changes.
- Limit the dstm.cache size so that caching is minimal, or zero (this does not fix a missed subs update, so the end user of the web interface could still see 'stale values', but a refresh would give fresh values). Would require a restart of the service.
- Flush the dstm.cache (can be done from the py interfaces; a hedged sketch follows below). Does not require a restart of the service, but is only useful if errors are detected.
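Pseudo-usage only: the client object and connection details below are placeholders, and the cache_stats/flush_cache_all() names follow the action item above that proposes exposing them:

```python
c = DstmClient("dstm-host:30000")   # placeholder: connect to the running dstm service via the py interface
print(c.cache_stats)                # inspect cache counters before flushing (assumed exposed per the action item)
c.flush_cache_all()                 # drop all cached fragments; subsequent reads repopulate from the dtss master
```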