Starting with commit 0d6e7cd983,
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.
Signed-off-by: Cory Snider <csnider@mirantis.com>
When libnetwork receives a watch event for a driver table entry from
NetworkDB it passes the event along to the interested driver. This code
contains a subtle bug: update events from NetworkDB are passed along to
the driver as Delete events! This bug was lying dormant as driver-table
entries can only be added by the driver, not updated. Now that NetworkDB
broadcasts an UpdateEvent to watchers if the entry is already known to
the local NetworkDB, irrespective of whether the event received from the
remote peer was a CREATE or UPDATE event, the bug is causing problems.
Whenever a remote node replaces an entry in the overlay_peer_table but
the intermediate delete state was not received by the local node, the
new CREATE event would be translated to an UpdateEvent by NetworkDB and
subsequently handled by the overlay driver as if the entry was deleted!
Bubble table UPDATE events up to the network driver as Update events.
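A minimal sketch of the corrected translation, assuming the agent-side
watch loop maps networkdb events onto driverapi event types along these
lines (the helper function itself is illustrative, not the actual agent
code):

    // driverEventType is an illustrative helper, not the real agent code.
    func driverEventType(ev any) driverapi.EventType {
        switch ev.(type) {
        case networkdb.CreateEvent:
            return driverapi.Create
        case networkdb.UpdateEvent:
            // This case used to return driverapi.Delete, which is the
            // bug described above.
            return driverapi.Update
        case networkdb.DeleteEvent:
            return driverapi.Delete
        default:
            return 0
        }
    }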
Signed-off-by: Cory Snider <csnider@mirantis.com>
The Inter-Network Communication (INC) rules in the iptables chains
DOCKER-ISOLATION-STAGE-1 / DOCKER-ISOLATION-STAGE-2 (which are
called from filter-FORWARD) currently:
- Block access from containers in one bridge network, to ports
published to host addresses by containers in other bridge
networks, when the userland-proxy is disabled.
- But, that access is allowed when the proxy is enabled.
- Block access to all ports on container addresses in gateway
mode "nat-unprotected" networks.
- But, those ports can be accessed from anywhere else, including
other hosts. Just not other bridge networks.
- Allow access from containers in "nat" bridge networks to published
ports on container addresses in "routed" networks. But, to do that,
extra INC rules are added for the routed network.
The INC rules are no longer needed to block access from containers
in one network to unpublished ports on container addresses in
other networks. Direct routing to containers in NAT networks is
blocked by the "raw-PREROUTING" rules that block access from
untrusted interfaces (all interfaces apart from the network's
own bridge).
Drop these INC rules to resolve the inconsistencies listed above,
with this change:
- Published ports on host addresses can be accessed from containers
in other networks (even without the userland-proxy).
- The rules for direct routing between bridge networks are the same
as the rules for direct routing from outside the Docker host
(allowed for gw modes "routed" and "nat-unprotected", disallowed
for "nat").
- Fewer rules, so it's simpler, and perhaps slightly faster.
Internal networks (with no access to networks outside the host)
are also implemented using rules in the DOCKER-ISOLATION chains.
This change moves those rules to a new chain, DOCKER-INTERNAL,
and drops the DOCKER-ISOLATION chains.
Signed-off-by: Rob Murray <rob.murray@docker.com>
A network node is responsible both for broadcasting table events for
entries it owns and for rebroadcasting table events it has received from
other nodes. Table events to be broadcast are added to a single queue
per network, including events for rebroadcasting. As the memberlist
TransmitLimitedQueue is (to a first approximation) LIFO, a flood of
events from other nodes could delay the broadcasting of
locally-generated events indefinitely. Prioritize broadcasting local
events by splitting up the queues and only pulling from the rebroadcast
queue if there is free space in the gossip packet after draining the
local-broadcast queue.
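A minimal sketch of the prioritization, assuming one
memberlist.TransmitLimitedQueue for local events and one for rebroadcasts
per network (the helper and its size accounting are illustrative, not the
actual NetworkDB code):

    // getBroadcasts drains the local-broadcast queue first and only fills
    // whatever space is left in the gossip packet from the rebroadcast
    // queue.
    func getBroadcasts(local, rebroadcast *memberlist.TransmitLimitedQueue,
        overhead, limit int,
    ) [][]byte {
        msgs := local.GetBroadcasts(overhead, limit)
        used := 0
        for _, m := range msgs {
            used += len(m) + overhead
        }
        if remaining := limit - used; remaining > 0 {
            msgs = append(msgs, rebroadcast.GetBroadcasts(overhead, remaining)...)
        }
        return msgs
    }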
Signed-off-by: Cory Snider <csnider@mirantis.com>
Log more details when assertions fail to provide a more complete picture
of what went wrong when TestCRUDTableEntries fails. Log the state of
each NetworkDB instance at various points in TestCRUDTableEntries to
provide an even more complete picture.
Increase the global logger verbosity in tests so warnings and debug logs
are printed to the test log.
Signed-off-by: Cory Snider <csnider@mirantis.com>
NetworkDB uses a multi-dimensional map of struct network values to keep track of
network attachments for both remote nodes and the local node. Only a
subset of the struct fields are used for remote nodes' network
attachments. The tableBroadcasts pointer field in particular is
always initialized for network values representing local attachments
(read: nDB.networks[nDB.config.NodeID]) and always nil for remote
attachments. Consequently, unnecessary defensive nil-pointer checks are
peppered throughout the code despite the aforementioned invariant.
Enshrine in the type system the invariant that tableBroadcasts is
initialized iff the network attachment is for the local node. Pare down
struct network to only the fields needed for remote network attachments
and move the local-only fields into a new struct thisNodeNetwork. Elide
the unnecessary nil-checks.
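A rough sketch of the resulting shape (field names and types are
illustrative, not the exact NetworkDB definitions; serf.LamportTime and
memberlist.TransmitLimitedQueue are the hashicorp types assumed here):

    // network holds only what is needed to track a remote node's
    // attachment.
    type network struct {
        ltime    serf.LamportTime
        leaving  bool
        reapTime time.Duration
    }

    // thisNodeNetwork embeds network and adds the local-only state, so
    // tableBroadcasts is non-nil by construction and needs no nil checks.
    type thisNodeNetwork struct {
        network
        tableBroadcasts *memberlist.TransmitLimitedQueue
    }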
Signed-off-by: Cory Snider <csnider@mirantis.com>
When joining a network that was previously joined but not yet reaped,
NetworkDB replaces the network struct value with a zeroed-out one,
copying over only the entries count. This is also the case when joining a
network that is currently joined! Consequently, joining a network has
the side effect of clearing the broadcast queue. If the queue is cleared
while messages are still pending broadcast, convergence may be delayed
until the next bulk sync cycle.
Make it an error to join a network twice without leaving. Retain the
existing broadcast queue when rejoining a network that has not yet been
reaped.
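A minimal sketch of the resulting join logic under the constraints
described above (the thisNodeNetworks map, newThisNodeNetwork helper and
field names are illustrative):

    func (nDB *NetworkDB) joinNetwork(nid string) error {
        n, ok := nDB.thisNodeNetworks[nid]
        switch {
        case ok && !n.leaving:
            return fmt.Errorf("network %s is already joined", nid)
        case ok:
            // Rejoining before the previous attachment was reaped: keep
            // the existing broadcast queue so pending messages are not
            // dropped.
            n.leaving = false
            n.reapTime = 0
        default:
            nDB.thisNodeNetworks[nid] = newThisNodeNetwork(nid)
        }
        // ... broadcast the join event and trigger a bulk sync as before ...
        return nil
    }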
Signed-off-by: Cory Snider <csnider@mirantis.com>
The map key for nDB.networks is the network ID, so the corresponding
struct field is not actually used anywhere in practice.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Also add test cases for:
- empty options for all fields
- invalid nameServer (domain instead of IP).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Gracefully leaving the memberlist cluster is a best-effort operation.
Failing to successfully broadcast the leave message to a peer should not
prevent NetworkDB from cleaning up the memberlist instance on close. But
that was not the case in practice. Log the error returned from
(*memberlist.Memberlist).Leave instead of returning it, and proceed with
shutting down regardless.
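A minimal sketch of the intended shutdown sequence (the wrapper function,
logging call and timeout value are illustrative; Leave and Shutdown are
the memberlist methods named above):

    // closeMemberlist is an illustrative wrapper; leaving is best-effort,
    // so a Leave failure is logged and shutdown continues.
    func closeMemberlist(mlist *memberlist.Memberlist) error {
        if err := mlist.Leave(10 * time.Second); err != nil {
            log.Printf("networkdb: failed to broadcast memberlist leave message: %v", err)
        }
        return mlist.Shutdown()
    }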
Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.
One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.
Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.
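A sketch of what a lifted call site looks like (the secure flag, error
handling and method names follow this series but are illustrative here):

    // The call site now states the intent directly instead of hiding the
    // branch inside checkEncryption. n, add and remoteIP come from the
    // surrounding (hypothetical) context.
    if n.secure {
        var err error
        if add {
            err = d.setupEncryption(remoteIP)
        } else {
            err = d.removeEncryption(remoteIP)
        }
        if err != nil {
            log.Printf("overlay: updating encryption for %s: %v", remoteIP, err)
        }
    }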
Signed-off-by: Cory Snider <csnider@mirantis.com>
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling
checkEncryption(nid, netip.Addr{}, true).
What was previously a call to checkEncryption with isLocal=true,
add=false is now more explicitly a no-op. It always was effectively a
no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.
Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.
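A minimal sketch of the convention (peerEntry and its vtep field are
illustrative names):

    // A local peer is identified by the absence of a VTEP; the zero
    // netip.Addr is not a valid address, so IsValid reports false for it.
    func (p peerEntry) isLocal() bool {
        return !p.vtep.IsValid()
    }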
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver's checkEncryption function configures the IPsec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.
Filter out local peers when walking the peerDB by checking whether the
entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.
Signed-off-by: Cory Snider <csnider@mirantis.com>
We set SO_REUSEADDR on sockets used for host port mappings by
docker-proxy - which means it's possible to bind the same port
on a specific address as well as 0.0.0.0/::.
For TCP sockets, an error is raised when listen() is called on
both sockets - and the port allocator will be called again to
avoid the clash (if the port was allocated from a range, otherwise
the container will just fail to start).
But, for UDP sockets, there's no listen() - so take more care
to avoid the clash in the portallocator.
The port allocator keeps a set of allocated ports for each of
the host IP addresses it's seen, including 0.0.0.0/::. So, if a
mapping to 0.0.0.0/:: is requested, find a port that's free in
the range for each of the known IP addresses (but still only
mark it as allocated against 0.0.0.0/::). And, if a port is
requested for specific host addresses, make sure it's also
free in the corresponding 0.0.0.0/:: set (but only mark it as
allocated against the specific addresses - because the same
port can be allocated against a different specific address).
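A minimal sketch of the availability check described above, assuming the
allocator tracks a set of allocated ports per host address (the types and
names are illustrative, not the actual portallocator API):

    // allocator tracks allocated ports per host address, including the
    // unspecified addresses 0.0.0.0 and ::.
    type allocator struct {
        allocated map[netip.Addr]map[int]bool
    }

    // portAvailable reports whether port can be allocated for ip without
    // clashing under the rules described above. It only checks; marking
    // the port as allocated is done separately, against the requested
    // address only.
    func (pa *allocator) portAvailable(ip netip.Addr, port int) bool {
        if ip.IsUnspecified() {
            // A 0.0.0.0/:: mapping must be free on every known host
            // address.
            for _, ports := range pa.allocated {
                if ports[port] {
                    return false
                }
            }
            return true
        }
        // A specific-address mapping must be free on that address and
        // must not clash with an existing 0.0.0.0/:: allocation in the
        // same address family.
        unspec := netip.IPv4Unspecified()
        if ip.Is6() {
            unspec = netip.IPv6Unspecified()
        }
        return !pa.allocated[ip][port] && !pa.allocated[unspec][port]
    }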
Signed-off-by: Rob Murray <rob.murray@docker.com>