
Update consistent_index when applying fails #13942

Merged
merged 1 commit into etcd-io:main from move_cindex_on_apply_fail
Apr 21, 2022

Conversation

@ahrtr (Member) commented Apr 14, 2022

Fixes issue #13937

When a client has no permission to perform an operation, the apply may fail. We should still move consistent_index forward in this case, otherwise consistent_index may become smaller than the snapshot index.

If auth isn't enabled, users will not run into this issue. K8s usually relies on certificates/TLS to communicate with etcd and does not enable auth, so it should be fine in that case.

Even if auth is enabled, it's not easy to run into this issue, because the default value of --snapshot-count is 100000 and consistent_index is updated on every successful apply. The issue is avoided as long as there is at least one successful apply after each snapshot is generated.
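
For intuition, here is a toy Go sketch (not etcd code; the names, the loop, and the small snapshot count are purely illustrative) of why every apply since the last snapshot has to fail for the problem to surface: the consistent index only advances on successful applies, while snapshots are cut purely by applied-entry count.

package main

import (
	"errors"
	"fmt"
)

func main() {
	const snapshotCount = 5 // stand-in for --snapshot-count (default is 100000)

	var consistentIndex, snapshotIndex uint64
	permissionDenied := errors.New("auth: permission denied")

	for appliedIndex := uint64(1); appliedIndex <= 10; appliedIndex++ {
		err := permissionDenied // every apply fails, e.g. the client has no permission
		if err == nil {
			consistentIndex = appliedIndex // pre-fix behaviour: only successful applies move the index
		}
		if appliedIndex%snapshotCount == 0 {
			snapshotIndex = appliedIndex // snapshots are cut regardless of the apply outcome
		}
	}

	// On restart, the verification step rejects this state:
	fmt.Printf("backend.ConsistentIndex (%d) must be >= last snapshot index (%d)\n",
		consistentIndex, snapshotIndex)
}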

If users do run into this issue, build a binary based on 3.5.3 with the following patch. Replace the binary and start the etcd cluster again. Afterwards, put & delete at least one key/value. Finally, stop the cluster and roll back the binary. Everything will be fine by then.

$ git diff
diff --git a/server/etcdserver/backend.go b/server/etcdserver/backend.go
index 2beef5763..44e583b66 100644
--- a/server/etcdserver/backend.go
+++ b/server/etcdserver/backend.go
@@ -104,6 +104,9 @@ func recoverSnapshotBackend(cfg config.ServerConfig, oldbe backend.Backend, snap
        if snapshot.Metadata.Index <= consistentIndex {
                return oldbe, nil
        }
+       if true {
+               return oldbe, nil
+       }
        oldbe.Close()
        return openSnapshotBackend(cfg, snap.New(cfg.Logger, cfg.SnapDir()), snapshot, hooks)
 }

@ahrtr (Member, Author) commented Apr 14, 2022

Without this fix, the newly added test case TestV3AuthEmptyUserPut always fails.

logger.go:130: 2022-04-14T07:42:28.034+0800	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestV3AuthEmptyUserPut3114970556/002/etcd371615291", "error": "backend.ConsistentIndex (8) must be >= last snapshot index (16)"}

After applying the fix in this PR, the test case succeeds.
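
For reference, the invariant the failing verification step enforces can be sketched as below (a simplified stand-alone version with assumed names, not the actual verification code):

package main

import "fmt"

// verifyConsistentIndex is a simplified stand-in for the check that produces the
// FATAL message above: the backend's consistent index must not lag behind the
// last snapshot index.
func verifyConsistentIndex(backendIndex, snapshotIndex uint64) error {
	if backendIndex < snapshotIndex {
		return fmt.Errorf("backend.ConsistentIndex (%d) must be >= last snapshot index (%d)",
			backendIndex, snapshotIndex)
	}
	return nil
}

func main() {
	fmt.Println(verifyConsistentIndex(8, 16))  // the failing state from the log above
	fmt.Println(verifyConsistentIndex(16, 16)) // passes once the index keeps up
}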

@codecov-commenter commented Apr 14, 2022

Codecov Report

Merging #13942 (77ee442) into main (4555fc3) will decrease coverage by 0.37%.
The diff coverage is 100.00%.

❗ Current head 77ee442 differs from pull request most recent head 6eef7ed. Consider uploading reports for the commit 6eef7ed to get more accurate results

@@            Coverage Diff             @@
##             main   #13942      +/-   ##
==========================================
- Coverage   72.52%   72.14%   -0.38%     
==========================================
  Files         469      469              
  Lines       38413    38411       -2     
==========================================
- Hits        27859    27712     -147     
- Misses       8776     8901     +125     
- Partials     1778     1798      +20     
Flag Coverage Δ
all 72.14% <100.00%> (-0.38%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/etcdserver/server.go 83.80% <100.00%> (-0.59%) ⬇️
server/proxy/grpcproxy/election.go 8.69% <0.00%> (-73.92%) ⬇️
server/proxy/grpcproxy/lock.go 33.33% <0.00%> (-66.67%) ⬇️
...ver/proxy/grpcproxy/adapter/lock_client_adapter.go 33.33% <0.00%> (-66.67%) ⬇️
...proxy/grpcproxy/adapter/election_client_adapter.go 6.89% <0.00%> (-62.07%) ⬇️
...xy/grpcproxy/adapter/maintenance_client_adapter.go 5.71% <0.00%> (-28.58%) ⬇️
server/proxy/grpcproxy/maintenance.go 68.29% <0.00%> (-14.64%) ⬇️
client/pkg/v3/fileutil/lock_linux.go 72.22% <0.00%> (-8.34%) ⬇️
server/storage/schema/cindex.go 93.33% <0.00%> (-6.67%) ⬇️
client/v3/leasing/util.go 91.66% <0.00%> (-6.67%) ⬇️
... and 30 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4555fc3...6eef7ed. Read the comment docs.

server/etcdserver/server.go: outdated review thread (resolved)
@ahrtr force-pushed the move_cindex_on_apply_fail branch from 94ae5ee to d72aebd on April 16, 2022 19:35
@ahrtr (Member, Author) commented Apr 16, 2022

Just rebased this PR.

@ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 7b60b76 to b5bda01 on April 18, 2022 23:24
server/etcdserver/server.go: outdated review thread (resolved)
@serathius mentioned this pull request Apr 19, 2022
@ptabor changed the title from "Update conssitent_index when applying fails" to "Update consitent_index when applying fails" on Apr 19, 2022
@ahrtr force-pushed the move_cindex_on_apply_fail branch from b5bda01 to 1ffc0f2 on April 20, 2022 01:40
@ahrtr force-pushed the move_cindex_on_apply_fail branch from 1ffc0f2 to 449785a on April 20, 2022 08:05
server/etcdserver/server.go: outdated review thread (resolved)
@ptabor (Contributor) left a comment


Test failure - flake (https://github.com/etcd-io/etcd/runs/6091231156?check_suite_focus=true):

    logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (24) must be >= last snapshot index (25)"}

requires a second look.

@ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 91f4fdb to 30ec049 on April 20, 2022 11:39
@ahrtr (Member, Author) commented Apr 20, 2022

Test failure - flake (https://github.com/etcd-io/etcd/runs/6091231156?check_suite_focus=true):

    logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (24) must be >= last snapshot index (25)"}

requires a second look.

The root cause comes from server.go#L919, and it should be a legacy issue. When sending a snapshot to a slow follower, appliedIndex and appliedTerm are used as the snapshot index and term. But if the sender's consistent_index is somehow smaller than appliedIndex, we run into this issue.

logger.go:130: 2022-04-20T08:10:41.907Z INFO m0 applying snapshot {"member": "m0", "current-snapshot-index": 0, "current-applied-index": 8, "incoming-leader-snapshot-index": 24, "incoming-leader-snapshot-term": 17}
logger.go:130: 2022-04-20T08:11:00.714Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync501892535/002/etcd4044943490", "error": "backend.ConsistentIndex (23) must be >= last snapshot index (24)"}
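
To illustrate the mismatch, here is a rough Go sketch with hypothetical names (not the actual server.go code): the outgoing snapshot is stamped with the leader's applied index/term, while the data it carries only reflects the possibly older consistent index.

package main

import "fmt"

// snapshotMeta stands in for the metadata stamped on an outgoing leader snapshot.
type snapshotMeta struct {
	Index, Term uint64
}

// buildLeaderSnapshot sketches the legacy behaviour described above: the snapshot
// index/term come from the applied state, not from the backend's consistent index.
func buildLeaderSnapshot(appliedIndex, appliedTerm, consistentIndex uint64) snapshotMeta {
	if consistentIndex < appliedIndex {
		// The receiving follower will later fail the
		// "backend.ConsistentIndex must be >= last snapshot index" verification.
		fmt.Printf("warning: consistent index %d lags applied index %d\n", consistentIndex, appliedIndex)
	}
	return snapshotMeta{Index: appliedIndex, Term: appliedTerm}
}

func main() {
	meta := buildLeaderSnapshot(24, 17, 23) // numbers taken from the log lines above
	fmt.Printf("incoming-leader-snapshot-index=%d, term=%d\n", meta.Index, meta.Term)
}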

@ahrtr force-pushed the move_cindex_on_apply_fail branch 2 times, most recently from 6bd4edb to f087e20 on April 20, 2022 13:27
When clients have no permission to perform an operation, the apply
may fail. We should also move consistent_index forward in this case,
otherwise consistent_index may be smaller than the snapshot index.
@ahrtr force-pushed the move_cindex_on_apply_fail branch from f087e20 to 6eef7ed on April 20, 2022 13:49
@ahrtr (Member, Author) commented Apr 20, 2022

Everything should be good now.

The if statement eventually ends up as below. It's really subtle & interesting :)

if !applyV3Performed || (ar != nil && ar.err != nil) 
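
A minimal, self-contained sketch of what that condition achieves (the types and function below are illustrative stand-ins, not etcd's actual ones):

package main

import "fmt"

// applyResult loosely mirrors the shape of the internal apply result.
type applyResult struct {
	err error
}

// consistentIndex is a stand-in for the backend's consistent index state.
type consistentIndex struct {
	index, term uint64
}

func (c *consistentIndex) set(index, term uint64) { c.index, c.term = index, term }

// advanceIndex captures the intent of the condition: bump the consistent index when
// the v3 apply path never ran, or when it ran but returned an error (e.g. the client
// had no permission), so the index can never lag behind the snapshot index.
func advanceIndex(ci *consistentIndex, entryIndex, entryTerm uint64, applyV3Performed bool, ar *applyResult) {
	if !applyV3Performed || (ar != nil && ar.err != nil) {
		ci.set(entryIndex, entryTerm)
	}
}

func main() {
	ci := &consistentIndex{}
	// An apply that fails with a permission error: the index still moves forward.
	advanceIndex(ci, 9, 2, true, &applyResult{err: fmt.Errorf("auth: permission denied")})
	fmt.Println(ci.index, ci.term) // 9 2
}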

@serathius (Member) commented Apr 21, 2022

I think the change makes sense; however, it's just a band-aid solution for the consistent index code, which needs an architectural redesign.

@serathius merged commit e02ac59 into etcd-io:main on Apr 21, 2022
@ahrtr (Member, Author) commented Apr 21, 2022

I think the change makes sense; however, it's just a band-aid solution for the consistent index code, which needs an architectural redesign.

Yes, I agree. But a redesign isn't an easy task. I will try to drive it if @ptabor doesn't have the bandwidth, and it will probably be built on top of @ptabor's PR #13878.

@serathius (Member) commented Apr 21, 2022

I think @ptabor's PR wanted to fix the development issue of picking either LockInsideApply or LockOutsideApply. However, looking at the apply code that was changed in this PR, it will still be hard to decide whether the CI (consistent index) should be flushed or not. For v3.5.3 we fixed one type of unintentional update of the consistent index, but broke the case where the update should still happen when authorization fails. We can continue to detect and fix such issues, but it might never end as new features and error states are added.

@ptabor (Contributor) commented Apr 21, 2022
  1. "broke the case where the update should still happen when authorization fails" -> I think it was a preexisting issue, just observed now that we dump snapshots more frequently.

  2. I think PR #13878 (Encapsulation of applier logic: Move Txn related code out of applier.go) helps with overall code quality:

    • breaks the huge etcdserver package into 2 subpackages with cleaner dependencies between them
    • reduces the objects exposed directly in EtcdServer
    • makes the wrapping story for appliers 'cleaner': no cyclic dependency.

So I would not consider it an 'alternative', even if it does not contribute directly to the new approach.

  3. I don't see a way to avoid passing some form of applyingContext to all the transactions. We will need to do this on every code path leading to a transaction, or make it part of 'applying'-specific helpers on top of the objects representing post-raft state. In general I'm in favor of strongly separating pre-raft and post-raft state objects... and then we know that the post-raft object mutators are responsible for cindex. (A rough sketch of that separation is below.)
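
As a rough sketch of that separation (all names are hypothetical; this is an illustration, not a proposed API): pre-raft objects only validate and propose, while post-raft objects are the only mutators of the store and therefore naturally own cindex.

package main

import "fmt"

// proposal is what the pre-raft side hands over to raft (illustrative only).
type proposal struct {
	key, value string
}

// preRaftState: request validation (auth, quotas, ...) happens here; it has no
// responsibility for the consistent index.
type preRaftState struct{}

func (preRaftState) propose(key, value string) (proposal, error) {
	if key == "" {
		return proposal{}, fmt.Errorf("empty key")
	}
	return proposal{key: key, value: value}, nil
}

// postRaftState: every mutation goes through apply, which advances cindex itself,
// even when the apply ultimately fails.
type postRaftState struct {
	store           map[string]string
	consistentIndex uint64
}

func (s *postRaftState) apply(entryIndex uint64, p proposal) error {
	defer func() { s.consistentIndex = entryIndex }() // always advance
	s.store[p.key] = p.value
	return nil
}

func main() {
	pre, post := preRaftState{}, &postRaftState{store: map[string]string{}}
	if p, err := pre.propose("foo", "bar"); err == nil {
		post.apply(1, p)
	}
	fmt.Println(post.store, post.consistentIndex) // map[foo:bar] 1
}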

@ahrtr (Member, Author) commented Apr 21, 2022

In general I agree with @ptabor, but we still need to clarify and break down the details. I will keep thinking about it and will probably send out a draft/doc for review later.

Development

Successfully merging this pull request may close these issues:

etcd panic on startup (auth enabled)
4 participants