Commit Graph

3990 Commits

Author SHA1 Message Date
Sebastiaan van Stijn
ed1406cb93 Merge pull request #50236 from corhere/libn/overlay-really-delete-neigh
libn/d/overlay: delete FDB entry from AF_BRIDGE
2025-06-24 18:13:54 +02:00
Albin Kerouanton
a41225dcfd Merge pull request #50091 from corhere/libn/overlay-refactor-checkencryption
libnetwork/d/overlay: simplify the encryption logic
2025-06-23 21:32:02 +02:00
Sebastiaan van Stijn
6025adfbef Merge pull request #50226 from robmry/driver_api_optional_extconn
libnet: split ProgramExternalConnectivity/RevokeExternalConnectivity out of driverapi
2025-06-20 13:31:18 +02:00
Cory Snider
7a12bbe5d3 libn/d/overlay: delete FDB entry from AF_BRIDGE
Starting with commit 0d6e7cd983
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-19 17:24:16 -04:00
Cory Snider
a7f01d238e libnetwork: fix flaky Swarm service DNS
When libnetwork receives a watch event for a driver table entry from
NetworkDB it passes the event along to the interested driver. This code
contains a subtle bug: update events from NetworkDB are passed along to
the driver as Delete events! This bug was lying dormant as driver-table
entries can only be added by the driver, not updated. Now that NetworkDB
broadcasts an UpdateEvent to watchers if the entry is already known to
the local NetworkDB, irrespective of whether the event received from the
remote peer was a CREATE or UPDATE event, the bug is causing problems.
Whenever a remote node replaces an entry in the overlay_peer_table but
the intermediate delete state was not received by the local node, the
new CREATE event would be translated to an UpdateEvent by NetworkDB and
subsequently handled by the overlay driver as if the entry was deleted!

Bubble table UPDATE events up to the network driver as Update events.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-18 17:58:08 -04:00
Cory Snider
de24c536b0 Merge pull request #50193 from corhere/libn/networkdb-fix-crudtable-flakes-harder
libnetwork/networkdb: prioritize local table broadcasts over event rebroadcasts
2025-06-18 13:34:57 -04:00
Rob Murray
bf002e51a0 Split [Program|Revoke]ExternalConnectivity out of libnet driverapi
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-18 17:38:35 +01:00
Rob Murray
b387526fce Merge pull request #49981 from robmry/drop_inc_rules
Drop DOCKER-ISOLATION rules
2025-06-17 20:22:00 +01:00
Rob Murray
ec185e57cf Test Nftabler params
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-17 18:28:20 +01:00
Rob Murray
126f99d776 Add a way to undo nftables.Enable(), for unit tests
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-17 09:52:48 +01:00
Rob Murray
c66abe486b nftabler: add mirrored WSL2 loopback0 workaround
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 15:19:47 +01:00
Rob Murray
d31956b2f7 Add an outline nftabler
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 15:19:45 +01:00
Rob Murray
1ad9599da7 Drop DOCKER-ISOLATION rules
The Inter-Network Communication rules in the iptables chains
DOCKER-ISOLATION-STAGE-1 / DOCKER-ISOLATION-STAGE-2 (which are
called from filter-FORWARD) currently:
- Block access from containers in one bridge network, to ports
  published to host addresses by containers in other bridge
  networks, when the userland-proxy is disabled.
  - But, that access is allowed when the proxy is enabled.
- Block access to all ports on container addresses in gateway
  mode "nat-unprotected" networks.
  - But, those ports can be accessed from anywhere else, including
    other hosts. Just not other bridge networks.
- Allow access from containers in "nat" bridge networks to published
  ports on container addresses in "routed" networks. But, to do that,
  extra INC rules are added for the routed network.

The INC rules are no longer needed to block access from containers
in one network to unpublished ports on container addresses in
other networks. Direct routing to containers in NAT networks is
blocked by the "raw-PREROUTING" rules that block access from
untrusted interfaces (all interfaces apart from the network's
own bridge).

Drop these INC rules to resolve the inconsistencies listed above,
with this change:
- Published ports on host addresses can be accessed from containers
  in other networks (even without the userland-proxy).
- The rules for direct routing between bridge networks are the same
  as the rules for direct routing from outside the Docker host
  (allowed for gw modes "routed" and "nat-unprotected", disallowed
  for "nat").

Fewer rules, so it's simpler, and perhaps slightly faster.

Internal networks (with no access to networks outside the host)
are also implemented using rules in the DOCKER-ISOLATION chains.
This change moves those rules to a new chain, DOCKER-INTERNAL,
and drops the DOCKER-ISOLATION chains.

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 14:54:31 +01:00
Cory Snider
6ec6e0991a libnetwork/networkdb: prioritize local broadcasts
A network node is responsible for both broadcasting table events for
entries it owns and for rebroadcasting table events from other nodes it
has received. Table events to be broadcast are added to a single queue
per network, including events for rebroadcasting. As the memberlist
TransmitLimitedQueue is (to a first approximation) LIFO, a flood of
events from other nodes could delay the broadcasting of
locally-generated events indefinitely. Prioritize broadcasting local
events by splitting up the queues and only pulling from the rebroadcast
queue if there is free space in the gossip packet after draining the
local-broadcast queue.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-13 17:08:18 -04:00
Cory Snider
e9a7154909 libnetwork/networkdb: improve TestCRUDTableEntries
Log more details when assertions fail to provide a more complete picture
of what went wrong when TestCRUDTableEntries fails. Log the state of
each NetworkDB instance at various points in TestCRUDTableEntries to
provide an even more complete picture.

Increase the global logger verbosity in tests so warnings and debug logs
are printed to the test log.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-13 17:08:17 -04:00
Cory Snider
dbb0d88109 libn/networkdb: use distinct type for own networks
NetworkDB uses a muli-dimensional map of struct network to keep track of
network attachments for both remote nodes and the local node. Only a
subset of the struct fields are used for remote nodes' network
attachments. The tableBroadcasts pointer field in particular is
always initialized for network values representing local attachments
(read: nDB.networks[nDB.config.NodeID]) and always nil for remote
attachments. Consequently, unnecessary defensive nil-pointer checks are
peppered throughout the code despite the aforementioned invariant.

Enshrine the invariant that tableBroadcasts is initialized iff the
network attachment is for the local node in the type system. Pare down
struct network to only the fields needed for remote network attachments
and move the local-only fields into a new struct thisNodeNetwork. Elide
the unnecessary nil-checks.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-13 17:08:11 -04:00
Cory Snider
51f31826ee libnetwork/networkdb: don't clear queue on rejoin
When joining a network that was previously joined but not yet reaped,
NetworkDB replaces the network struct value with a zeroed-out one with
the entries count copied over. This is also the case when joining a
network that is currently joined! Consequently, joining a network has
the side effect of clearing the broadcast queue. If the queue is cleared
while messages are still pending broadcast, convergence may be delayed
until the next bulk sync cycle.

Make it an error to join a network twice without leaving. Retain the
existing broadcast queue when rejoining a network that has not yet been
reaped.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-13 17:08:09 -04:00
Cory Snider
30b27ab6ea libnetwork/networkdb: drop id field from network
The map key for nDB.networks is the network ID. The struct field is not
actually used anywhere in practice.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-13 16:25:19 -04:00
Matthieu MOREL
6d737371b8 fix comparison rule from errorlint
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-13 08:26:56 +00:00
Paweł Gronowski
c430c9c7f2 Merge pull request #50115 from corhere/libn/fix-47859-networkdb-clusterleave-leak
libnetwork/networkdb: always shut down memberlist
2025-06-12 11:49:00 +00:00
Paweł Gronowski
5bbdb066d8 Merge pull request #50031 from vvoland/bbolt-1.4
vendor: go.etcd.io/bbolt v1.4.0
2025-06-12 10:09:09 +00:00
Sebastiaan van Stijn
13879e7b49 Merge pull request #50082 from mmorel-35/go-critic
fix go-critic linter
2025-06-10 13:09:05 +02:00
Paweł Gronowski
52a8109a6b Merge pull request #50155 from robmry/windows_no_mirrored_plugin
Windows: don't try to load "mirrored" network plugin
2025-06-10 09:36:22 +00:00
Paweł Gronowski
2e25775c83 libnetwork: Replace deprecated usages
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
2025-06-09 19:30:00 +02:00
Rob Murray
55f47f9e34 Windows: don't try to load "mirrored" network plugin
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-09 11:38:13 +01:00
Matthieu MOREL
b8a4f6534f fix stringsCompare and stringConcatSimplify from go-critic
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-07 09:57:59 +02:00
Matthieu MOREL
a62de57aa1 fix sprintfQuotedString from go-critic
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-07 09:57:59 +02:00
Matthieu MOREL
bc9ec5fc02 fix emptyStringTest from go-critic
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-07 09:57:59 +02:00
Matthieu MOREL
e5be7b54b1 fix yodaStyleExpr from go-critic
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-07 09:57:58 +02:00
Rob Murray
793dd8385a Only "prune" Windows networks created by Docker
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-06 20:24:04 +01:00
Rob Murray
9663b36b6d Merge pull request #50054 from robmry/fix_port_mapping
Avoid selecting duplicate host ports for mappings to 0.0.0.0 and specific addresses
2025-06-04 16:46:29 +01:00
Sebastiaan van Stijn
ee69d8ac95 Merge pull request #50051 from robmry/nftables_no_docker_user
nftables: don't create DOCKER-USER iptables chains
2025-06-03 13:19:12 +02:00
Sebastiaan van Stijn
b6fa565cba libnetwork/resolvconf: Build: decorate error for invalid nameservers
Using the same prefix as is used in `Sandbox.loadResolvConf`, but omiting
the value, as it's already part of the error message;
829b695375/libnetwork/sandbox_dns_unix.go (L258-L261)

Unfortunately, `netip.ParseAddr` returns a non-exported (`parseAddrError`)
error-type; https://cs.opensource.google/go/go/+/refs/tags/go1.24.3:src/net/netip/netip.go;l=115

So we don't have the option to omit the `` from the error-message, and to
take the underlying `msg` field;
https://cs.opensource.google/go/go/+/refs/tags/go1.24.3:src/net/netip/netip.go;l=141-153

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-06-02 12:33:28 +02:00
Sebastiaan van Stijn
35e062dde1 libnetwork/resolvconf: rewrite TestBuild tests to a table-test
Also adding test-cases for;

- empty options for all fields
- invalid nameServer (domain instead of IP).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-06-02 12:20:19 +02:00
Cory Snider
16ed51d864 libnetwork/networkdb: always shut down memberlist
Gracefully leaving the memberlist cluster is a best-effort operation.
Failing to successfully broadcast the leave message to a peer should not
prevent NetworkDB from cleaning up the memberlist instance on close. But
that was not the case in practice. Log the error returned from
(*memberlist.Memberlist).Leave instead of returning it and proceed with
shutting down irrespective of whether Leave() returns an error.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-30 13:56:47 -04:00
Sebastiaan van Stijn
fca97dae9d libnet/d/overlay/overlayutils: prevent uint32 overflow
CodeQL was complaining about the conversion to uint32

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-05-30 14:00:14 +02:00
Cory Snider
df6b405796 libnetwork/d/overlay: drop initEncryption function
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.

One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
713f887698 libnetwork/d/overlay: drop checkEncryption function
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.

Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
cb4e7b2f03 libnetwork/d/overlay: make setupEncryption a method
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
0d893252ac libnetwork/d/overlay: checkEncryption: drop isLocal param
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling

    checkEncryption(nid, netip.Addr{}, true)

Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
4b1c1236b9 libnetwork/d/overlay: peerdb: drop isLocal param
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
48e0b24ff7 libnetwork/d/overlay: elide vtep for local peers
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.

Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
a9e2d6d06e libnetwork/d/overlay: filter local peers explicitly
The overlay driver's checkEncryption function configures the IPSec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.

Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
f144264bae Merge pull request #50090 from corhere/libn/overlay-netip
libnetwork/d/overlay: use netip types more
2025-05-29 14:12:28 -04:00
Rob Murray
a2652d4b81 Don't set up iptables chain DOCKER-USER when using nftables
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-29 19:12:23 +01:00
Rob Murray
768cfaeb62 Merge pull request #50050 from robmry/nftables_internal_dns
nftables: rules for the internal DNS resolver
2025-05-29 19:11:27 +01:00
Rob Murray
d3289dda4b Add nftables NAT rules for internal DNS resolver
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-29 17:20:25 +01:00
Rob Murray
b43afbf898 Merge pull request #50098 from robmry/remove_docker-user_return_rule
iptables: Drop explicit RETURN rule from DOCKER-USER
2025-05-29 11:27:54 +01:00
Rob Murray
d6620915db portallocator: always check for ports allocated for 0.0.0.0/::
We set SO_REUSEADDR on sockets used for host port mappings by
docker-proxy - which means it's possible to bind the same port
on a specific address as well as 0.0.0.0/::.

For TCP sockets, an error is raised when listen() is called on
both sockets - and the port allocator will be called again to
avoid the clash (if the port was allocated from a range, otherwise
the container will just fail to start).

But, for UDP sockets, there's no listen() - so take more care
to avoid the clash in the portallocator.

The port allocator keeps a set of allocated ports for each of
the host IP addresses it's seen, including 0.0.0.0/::. So, if a
mapping to 0.0.0.0/:: is requested, find a port that's free in
the range for each of the known IP addresses (but still only
mark it as allocated against 0.0.0.0/::). And, if a port is
requested for specific host addresses, make sure it's also
free in the corresponding 0.0.0.0/:: set (but only mark it as
allocated against the specific addresses - because the same
port can be allocated against a different specific address).

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-28 14:00:33 +01:00
Paweł Gronowski
0e2cc22d36 Merge pull request #50049 from robmry/nftables_env_var_enable
nftables: enable using env var
2025-05-28 12:58:21 +00:00