-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allocate IPv6 addresses after detecting IPv6 support #47406
Conversation
9f40892
to
9b7b3c0
Compare
9b7b3c0
to
4a26d99
Compare
4a26d99
to
c37c9d4
Compare
157a62b
to
ee61361
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to insist that IPAM must continue to happen when the endpoint is created, I would rather go back to the way you had it before where the daemon inspected the sandbox to decide whether or not to enable IPv6 on the endpoint. What you have currently is the same thing with many extra steps and lots of complexity without any upside.
Why is deferring IPAM allocation to Join/Restore infeasible and what do manually-assigned addresses have to do with it?
ee61361
to
ec2be5b
Compare
ec2be5b
to
e478178
Compare
I've un-merged CreateEndpoint/Endpoint.Join, so that the Sandbox can be configured for legacy links between the two. CreateEndpoint still needs to know whether to set up IPv6 addresses, so I've added CreateEndpointForSandbox that takes the Sandbox as a param. Once we get rid of legacy links perhaps we'll make it do the Join as well, but there's no need to do that as part of this change. Moving address assignment to Join would mean CreateEndpoint doesn't need to see the Sandbox, but it wouldn't help with legacy links ... Join would need a callback after allocating addresses, before doing the Join. The final commit isn't intended to be merged in its current state, if we want to do it or something like it, the changes will need to be pushed down into the other commits - it's just an experiment/proposal for removing references to legacy-links from the libnetwork code, while making the Sandbox update for legacy links type safe.
That sounds good. But, the Sandbox options to implement legacy links are very legacy-linky, not generic things that we'll want to leave behind for other purposes when we drop support for legacy links. For example ...
Even So, I don't think removing references to legacy links gives us anything but obfuscation. But, I know there are strong feelings on the subject - so, can do something like the change in the final commit? Or, is there a better approach? |
e478178
to
669cd2e
Compare
697436a
to
fae93cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we discussed on Slack about the attachment ordering issue in the 2nd commit, let me send my review as is. I'll take a look at other commits.
libnetwork/sandbox_linux.go
Outdated
return fmt.Errorf("failed to set IPv6 gateway while updating gateway: %v", err) | ||
// If IPv6 has been disabled in the sandbox a gateway may still have been | ||
// configured, don't attempt to apply it. | ||
if ipv6, ok := sb.ipv6Enabled(); !ok || ipv6 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at ipv6Enabled()
, the only case where ok == false
is when sb.osSbox == nil
. This means, if we enter this branch because ok == false
, the next line will panic.
moby/libnetwork/sandbox_linux.go
Lines 219 to 221 in 32c33db
if osSbox == nil { | |
return false, false | |
} |
if ipv6, ok := sb.ipv6Enabled(); !ok || ipv6 { | |
if ipv6, ok := sb.ipv6Enabled(); ipv6 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yes - the function would have returned early if osSbox == nil
, so the check is just unnecessary (until some change happens in the implementation ipv6Enabled
that means it returns !ok
for some other reason).
That code goes away again in a later commit, so I've changed it anyway.
// on sandbox initialize for link, but the sandbox only be initialized | ||
// on first network connecting. | ||
defaultNetName := network.DefaultNetwork | ||
if nConf, ok := ctr.NetworkSettings.Networks[defaultNetName]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the CLI prevents you from attaching to the default network and custom networks at the same time, there's no such check in the daemon. That means, this condition was previously defining a specific order of operations:
- The default
bridge
network, if it's specified; - Then, any custom networks.
But since the order of network attachment defines the default gateway, removing that code means the default bridge might not be the default gateway now. Although the legacy --link justification doesn't hold anymore, we need to keep that old code / behavior for the time being (until we add a way to specify which attachment should be the default gateway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After chatting more about this, I think you're right: the default gateway is changed whenever an endpoint that sorts first (in lexicographic order) is attached to the sandbox. We should be good to go with that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking as "unresolved" to keep these comments visible (making it easier to discover in future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note then ... the endpoints are sorted by gateway-ness, then by name. (So, the "best" choices of gateway endpoints are preferred, but equally good endpoints are just ordered lexicographically.)
Lines 677 to 683 in 59eba0a
// <=> Returns true if a < b, false if a > b and advances to next level if a == b | |
// epi.prio <=> epj.prio # 2 < 1 | |
// epi.gw <=> epj.gw # non-gw < gw | |
// epi.internal <=> epj.internal # non-internal < internal | |
// epi.joininfo <=> epj.joininfo # ipv6 < ipv4 | |
// epi.name <=> epj.name # bar < foo | |
func (epi *Endpoint) Less(epj *Endpoint) bool { |
1c123e4
to
72c93c0
Compare
Rebased to resolve merge conflicts and run up-to-date tests. |
72c93c0
to
6e19ae6
Compare
6e19ae6
to
57b26b9
Compare
Rebased again (new conflict in |
return nil | ||
} | ||
for _, child := range children { | ||
if !isLinkable(child) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a follow-up; now that this code is in a platform-specific file, we should inline this function (makes it clearer what it does);
moby/daemon/container_operations_unix.go
Lines 403 to 407 in 9eeec5f
func isLinkable(child *container.Container) bool { | |
// A container is linkable only if it belongs to the default network | |
_, ok := child.NetworkSettings.Networks[network.DefaultNetwork] | |
return ok | |
} |
And remove the windows stub;
moby/daemon/container_operations_windows.go
Lines 153 to 155 in 9eeec5f
func isLinkable(child *container.Container) bool { | |
return false | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's now #48821.
return errors.Wrapf(err, "failed to add IPv6 address to /etc/hosts for link to %s", child.Name) | ||
} | ||
} | ||
cEndpointID = defaultNW.EndpointID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's existing code, but this is confusing; i.e., we overwrite cEndpointID
in each iteration, and cEndpointID
is used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I see Cory commented about that as well; 68a4761#r1526578113
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That link doesn't work - but I think it's #47406 (comment)
In any case, I think it's something for follow-up as it's not changed by this PR - which is already big enough!
daemon/container_operations.go
Outdated
func (daemon *Daemon) initializeNetworking( | ||
ctx context.Context, | ||
cfg *config.Config, | ||
ctr *container.Container, | ||
) (newSandbox *libnetwork.Sandbox, retErr error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly tempted to keep this unwrapped; yes, it's a long line, but most of it can be skimped over, knowing that it's just the func's signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Is the rule "no line-wrapping in function declarations"?
(To me, it's just slightly harder to read and see the return types. And diffs won't pick out future changes as clearly.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the codebase is written that way, but I admit I'm not a big fan of that. To @robmry's point, it's harder to read IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tempted to say; if the list of arguments for a function is becoming so long that it needs wrapping, that's a very good indicator that the signature may be in need of an overhaul 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two params and a return value plus context and error, seems ok? So we need shorter names?
3640838
to
bb99866
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another pass; overall looking good. Perhaps ready to go, but more eyes won't hurt on this one!
daemon/container_operations.go
Outdated
if err := daemon.updateNetworkSettings(ctr, n, endpointConfig); err != nil { | ||
return err | ||
} | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (but there may be some linters complaining);
return daemon.updateNetworkSettings(ctr, n, endpointConfig)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
@@ -20,6 +20,7 @@ func TestDNSNamesOrder(t *testing.T) { | |||
Config: &containertypes.Config{ | |||
Hostname: "baz", | |||
}, | |||
HostConfig: &containertypes.HostConfig{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious (no need to change); did this panic somewhere? (in that case; are we missing a check somewhere to prevent that)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ...
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x17aaf44]
goroutine 27 [running]:
testing.tRunner.func1.2({0x1aa5ce0, 0x3171920})
/usr/local/go/src/testing/testing.go:1631 +0x1c4
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1634 +0x33c
panic({0x1aa5ce0?, 0x3171920?})
/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/docker/docker/daemon.(*Daemon).updateNetworkSettings(0x400051dd00, 0x400051dab8, 0x40000504e0, 0x40001a5b20)
/go/src/github.com/docker/docker/daemon/container_operations.go:170 +0x224
github.com/docker/docker/daemon.(*Daemon).updateNetworkConfig(0x400051dd00, 0x400051dab8, 0x40000504e0, 0x40001a5b20)
/go/src/github.com/docker/docker/daemon/container_operations.go:631 +0x2ac
github.com/docker/docker/daemon.TestDNSNamesOrder(0x4000563d40)
/go/src/github.com/docker/docker/daemon/container_operations_test.go:35 +0x264
testing.tRunner(0x4000563d40, 0x1ee8270)
/usr/local/go/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1742 +0x318
Which comes from:
moby/daemon/container_operations.go
Line 170 in 0cba410
if !ctr.HostConfig.NetworkMode.IsHost() && containertypes.NetworkMode(n.Type()).IsHost() { |
Looks like it's normally initialised in daemon.newContainer
- and there's lots of code that assumes it exists:
Line 151 in afdfe4f
base.HostConfig = &containertypes.HostConfig{} |
sb.config.extraHosts = append(sb.config.extraHosts, extraHost{name: name, IP: ip}) | ||
return sb.rebuildHostsFile(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely new (I think), but it caught my eye now;
This logic;
- Appends extrahosts
- Calls "rebuild" for the hosts file
But looking at the "rebuild";
- It generates a new host-file (to replace the existing one?) with the new config (which includes the new "extra hosts")
- After that, it updates the file for each endpoint
That last part is done in a loop, and for each iteration it adds an endpoint, and (from a quick look), locks the hosts-file and writes it to disk
I'm wondering;
- Could we could we skip the "updating" part if we would update the config before we call
sb.buildHostsFile()
? - (or have
sb.buildHostsFile()
accept the list of "additional hosts" to add) - ☝️ so that it has all the data it needs, and just has to construct (and write) once
func (sb *Sandbox) rebuildHostsFile(ctx context.Context) error {
if err := sb.buildHostsFile(); err != nil {
return errdefs.System(err)
}
for _, ep := range sb.Endpoints() {
if err := sb.updateHostsFile(ctx, ep.getEtcHostsAddrs()); err != nil {
return errdefs.System(err)
}
}
return nil
}
If that's not an option, I noticed that sb.updateHostsFile()
accepts a slice of addresses; so instead of updating it one at a time, would that mean we could just collect the endpoints, and pass it? Something like;
eps := sb.Endpoints()
addrs := make([]string, 0, len(eps))
for _, ep := range eps {
addrs = append(addrs, ep.getEtcHostsAddrs()...)
if err := sb.updateHostsFile(ctx, ep.getEtcHostsAddrs()); err != nil {
return errdefs.System(err)
}
}
if err := sb.updateHostsFile(ctx, addrs...); err != nil {
// NOTE: should sb.updateHostsFile already return a proper errdefs type (errdefs.System)?
return errdefs.System(err)
}
return nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sandbox.AddHostsEntry
is only used for legacy --link
options for containers on the default bridge network, before they're started - and those containers are only likely to be connected to the default bridge (only one Endpoint) ... so I don't think there's much to gain by optimising it.
Sandbox.rebuildHostsFile
is called from SetKey
for a newly created container once the its support for IPv6 is known. That container could be connected to multiple endpoints - so maybe there are some msecs to be gained by accumulating addresses from all of the endpoints. But, optimising the whole etchosts
thing so that it doesn't keep writing/reading/merging the contents (before a container starts and the user's had a chance to modify it), is probably the way to go if that's an issue.
These /etc/hosts
entries are pretty questionable anyway for user-defined networks that have an internal DNS resolver. If we figure out a fix for buildkit, so that we can use the internal resolver for the default network too, even more so.
I don't really want to muck about with it in this PR, maybe as a follow-up if needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now #48823
} | ||
|
||
return nil, ipv6Miss | ||
// this map is to avoid IP duplicates, this can happen during a transition period where 2 services are using the same IP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly curious; is there any order in the results? In this case, should "last" or "first" take precedence? (currently the first one is used, and the last one is discarded; wondering if we should loop in reverse).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR only changes whitespace here ... but no, I don't think the order can matter - the results come from a map in the first place, and DNS resolvers are free to rotate addresses in responses anyway to balance load.
} | ||
|
||
return nil, ipv6Miss | ||
// this map is to avoid IP duplicates, this can happen during a transition period where 2 services are using the same IP | ||
noDup := make(map[string]bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not relevant for this PR, but this could uses a map[string]struct{}
to avoid allocations.
bb99866
to
1814519
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Second attempt to stop using the OCI prestart hook to call SetKey to set up the OS Sandbox's key and perform network config in the new network namespace. The first attempt was reverted because it made it impossible to use --sysctl to set per-interface sysctls on an interface that had not yet been moved into the new network namespace. Now, per-interface sysctls can be used to do that (with less ambiguity because the setting is not tied to the interface using an unpredictably assigned name). Signed-off-by: Rob Murray <rob.murray@docker.com>
When connecting a container to a new network, its NetworkSettings were unconditionally updated. But, when creating a new container, they were only updated if there were no NetworkSettings before a network was connected. But, that's always the case - so, make the update unconditionally. Signed-off-by: Rob Murray <rob.murray@docker.com>
If config for legacy links needs to be added to a libnetwork.Sandbox, add it when constructing the Endpoint that needs it - removing the constraint on ordering of Endpoint construction, and the dependency between Endpoint and Sandbox construction. So, now a Sandbox can be constructed in one place, before the first Endpoint. Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
For Linux, delay construction and configuration of network endpoints until the container has been created (but not started). Signed-off-by: Rob Murray <rob.murray@docker.com>
When a container doesn't support IPv6 and it's joined to an IPv6 network, don't allocate an IPv6 address for it. Update the DNS resolver to understand that it can have an 'ipv6miss' (meaning an IPv4 address exists, but no IPv6) when a network is IPv6 enabled. Signed-off-by: Rob Murray <rob.murray@docker.com>
Interface DNSBackend.ResolveName, implemented by Network, Sandbox (and noopDNSBackend) had a bool return value that meant 'ipv6Miss'. But, it was always set to true on a hit, and callers had to deal with that. So, changed the meaning of the return value to indicate whether the name was found - which will also work for 'ipv4Miss' when we have IPv6-only containers/networks. Signed-off-by: Rob Murray <rob.murray@docker.com>
1814519
to
caf2d5d
Compare
- What I did
If IPv6 is disabled for a container, do not allocate an IPv6 address when it's attached to an IPv6 network.
- How I did it
There are a few commits - it's probably easiest to review them separately ...
Unconditionally update container.NetworkSettings
In
daemon.allocateNetwork()
, flagupdateSettings
was set ifcontainer.NetworkSettings.Networks
was empty. That flag was passed down through the call stack to avoid a call todaemon.updateNetworkSettings()
indaemon.updateNetworkConfig()
.Other calls to
daemon.updateNetworkSettings()
, which happen when connecting an existing container to another network, unconditionally made the NetworkSettings update.I don't think there's a circumstance where NetworkSettings is not empty during container creation. Even if there is, the update is harmless and cheap. Eliminating the
updateSettings
flag simplifies the code a little and, in the next commit whereallocateNetwork()
is split out ofinitializeNetworking()
, it avoids the need to pass that flag between the two.Separate Sandbox/Endpoint construction
Previously, the
libnetwork.Sandbox
was created indaemon.allocateNetwork()
if there was no network - otherwise it happened while connecting an Endpoint indaemon.connectToNetwork()
. If there was an endpoint in the default bridge, it had to be connected first, because extra configuration is needed in the Sandbox for legacy links and that could only happen during Sandbox construction.Now, config for legacy links is added to the Sandbox when constructing the Endpoint that needs it - removing the constraint on ordering of Endpoint construction, and the dependency between Endpoint and Sandbox construction.
So, now a Sandbox can be constructed in one place, before the first Endpoint.
Also, replaces some legacy-link specific Sandbox option-setters with Sandbox methods for updating
/etc/hosts
, and updates a legacy-link parent's IPv6 address as well as its IPv4 address when a child container restarts.Configure network endpoints after creating a container
This commit splits the bulk of
daemon.allocateNetwork()
out ofdaemon.initializeNetworking()
.daemon.initializeNetworking()
hasn't moved, but it now only prepares configuration and doesn't do any Endpoint construction or configuration. Sandbox construction still happens here.Endpoint construction is in
daemon.allocateNetwork()
which, on non-Windows, is called after the container task has been created (before it's started).On Windows, Endpoint construction still happens before the container task is created. If it's created afterwards, some DNS lookups for the container don't seem to end up in our resolver - see the TODO comment in the code. (But, Windows can't benefit from the delayed assignment of IPv6 addresses anyway.)
Doing the Endpoint construction after the
osSbox
has been set up makes it possible to probe the container's network for IPv6 support first - enabling the next commit ...Only assign IPv6 addresses if required
If a Sandbox is provided to
CreateEndpointForSandbox
, check whether it has IPv6 before allocating IPv6 addresses.Because no IPv6 address is added to the DNS when the Sandbox doesn't support it, a lookup for a container on an IPv6 network can now have no IPv6 address. The resolver is updated to treat a missing IPv6 address on an IPv4 hit as
ipv6miss
even if the network has IPv6 enabled. (So, an empty DNS response is returned, avoiding the upstream DNS request and NXDOMAIN.)The
macvlan
driver expected an endpoint on a network with IPv6 subnets to have an IPv6 address - it searched the subnets for the address assigned to the endpoint, to work out which gateway address to use (crashing if there was no address). Now, it only makes that search if there is an IPv6 address.- How to verify it
Existing regression tests for the refactoring.
New integration test checks that a container with IPv6 disabled via
sysctl
on an IPv6 network has no IPv6 address.- Description for the changelog
If IPv6 is disabled for a container, do not allocate an IPv6 address when it's attached to an IPv6 network.