-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update consitent_index when applying fails #13942
Conversation
Without this fix, the new added test case
After applying the fix in this PR, then the case succeeds. |
Codecov Report
@@ Coverage Diff @@
## main #13942 +/- ##
==========================================
- Coverage 72.52% 72.14% -0.38%
==========================================
Files 469 469
Lines 38413 38411 -2
==========================================
- Hits 27859 27712 -147
- Misses 8776 8901 +125
- Partials 1778 1798 +20
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
94ae5ee
to
d72aebd
Compare
Just rebased this PR. |
7b60b76
to
b5bda01
Compare
b5bda01
to
1ffc0f2
Compare
1ffc0f2
to
449785a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test failure - flake (https://github.com/etcd-io/etcd/runs/6091231156?check_suite_focus=true):
logger.go:130: 2022-04-20T08:11:00.714Z FATAL m0 Verification failed {"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (24) must be >= last snapshot index (25)"}
requires a second look.
91f4fdb
to
30ec049
Compare
The root cause should be coming from server.go#L919, and it should be a legacy issue. When sending snapshot to a slow follwer,
|
6bd4edb
to
f087e20
Compare
When clients have no permission to perform whatever operation, then the applying may fail. We should also move consistent_index forward in this case, otherwise the consitent_index may smaller than the snapshot index.
f087e20
to
6eef7ed
Compare
Everything should be good now. Eventually the if statement is as below. It's really subtle & interesting :)
|
I think the change makes sense, however it's just a band-aid solution for the consistent index code that needs a architectural redesign. |
I think that the @ptabor PR wanted to fix the development issue of picking either |
So would not consider it 'alternative', even if not contributing directly to the new approach.
|
In general, I agree with @ptabor , but we still need to clarify and breakdown the details. I will keep thinking about it and probably send out a draft/doc for review later. |
Fix issues/13937
When clients have no permission to perform whatever operation, then the applying may fail. We should also move consistent_index forward in this case, otherwise the consitent_index may be smaller than the snapshot index.
If the Auth isn't enabled, then users will not run into this issue. Usually K8s depends on certificate/TLS to communicate with etcd, and the Auth isn't enabled, so it should be fine in this case.
If the Auth is enabled, it's also not easy to run into this issue, because the default value for
--snapshot-count
is 100000. The consistent_index will be updated each time when there is a successful applying. This issue will be bypassed if there is at least one successful applying after generating each snapshot.If users eventually run into this issue, please build a binary with the following patch based on 3.5.3. Replace the binary, and start the etcd cluster again. Afterwards try to put & delete at least one K/V. At last, stop the cluster and rollback the binary. Everything will be fine by then.