Commit Graph

1168 Commits

Author SHA1 Message Date
Sebastiaan van Stijn
ed1406cb93 Merge pull request #50236 from corhere/libn/overlay-really-delete-neigh
libn/d/overlay: delete FDB entry from AF_BRIDGE
2025-06-24 18:13:54 +02:00
Albin Kerouanton
a41225dcfd Merge pull request #50091 from corhere/libn/overlay-refactor-checkencryption
libnetwork/d/overlay: simplify the encryption logic
2025-06-23 21:32:02 +02:00
Cory Snider
7a12bbe5d3 libn/d/overlay: delete FDB entry from AF_BRIDGE
Starting with commit 0d6e7cd983
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-06-19 17:24:16 -04:00
Rob Murray
bf002e51a0 Split [Program|Revoke]ExternalConnectivity out of libnet driverapi
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-18 17:38:35 +01:00
Rob Murray
b387526fce Merge pull request #49981 from robmry/drop_inc_rules
Drop DOCKER-ISOLATION rules
2025-06-17 20:22:00 +01:00
Rob Murray
ec185e57cf Test Nftabler params
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-17 18:28:20 +01:00
Rob Murray
c66abe486b nftabler: add mirrored WSL2 loopback0 workaround
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 15:19:47 +01:00
Rob Murray
d31956b2f7 Add an outline nftabler
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 15:19:45 +01:00
Rob Murray
1ad9599da7 Drop DOCKER-ISOLATION rules
The Inter-Network Communication rules in the iptables chains
DOCKER-ISOLATION-STAGE-1 / DOCKER-ISOLATION-STAGE-2 (which are
called from filter-FORWARD) currently:
- Block access from containers in one bridge network, to ports
  published to host addresses by containers in other bridge
  networks, when the userland-proxy is disabled.
  - But, that access is allowed when the proxy is enabled.
- Block access to all ports on container addresses in gateway
  mode "nat-unprotected" networks.
  - But, those ports can be accessed from anywhere else, including
    other hosts. Just not other bridge networks.
- Allow access from containers in "nat" bridge networks to published
  ports on container addresses in "routed" networks. But, to do that,
  extra INC rules are added for the routed network.

The INC rules are no longer needed to block access from containers
in one network to unpublished ports on container addresses in
other networks. Direct routing to containers in NAT networks is
blocked by the "raw-PREROUTING" rules that block access from
untrusted interfaces (all interfaces apart from the network's
own bridge).

Drop these INC rules to resolve the inconsistencies listed above,
with this change:
- Published ports on host addresses can be accessed from containers
  in other networks (even without the userland-proxy).
- The rules for direct routing between bridge networks are the same
  as the rules for direct routing from outside the Docker host
  (allowed for gw modes "routed" and "nat-unprotected", disallowed
  for "nat").

Fewer rules, so it's simpler, and perhaps slightly faster.

Internal networks (with no access to networks outside the host)
are also implemented using rules in the DOCKER-ISOLATION chains.
This change moves those rules to a new chain, DOCKER-INTERNAL,
and drops the DOCKER-ISOLATION chains.

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-16 14:54:31 +01:00
Matthieu MOREL
6d737371b8 fix comparison rule from errorlint
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-13 08:26:56 +00:00
Sebastiaan van Stijn
13879e7b49 Merge pull request #50082 from mmorel-35/go-critic
fix go-critic linter
2025-06-10 13:09:05 +02:00
Paweł Gronowski
52a8109a6b Merge pull request #50155 from robmry/windows_no_mirrored_plugin
Windows: don't try to load "mirrored" network plugin
2025-06-10 09:36:22 +00:00
Rob Murray
55f47f9e34 Windows: don't try to load "mirrored" network plugin
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-09 11:38:13 +01:00
Matthieu MOREL
bc9ec5fc02 fix emptyStringTest from go-critic
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-06-07 09:57:59 +02:00
Rob Murray
793dd8385a Only "prune" Windows networks created by Docker
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-06-06 20:24:04 +01:00
Rob Murray
9663b36b6d Merge pull request #50054 from robmry/fix_port_mapping
Avoid selecting duplicate host ports for mappings to 0.0.0.0 and specific addresses
2025-06-04 16:46:29 +01:00
Sebastiaan van Stijn
fca97dae9d libnet/d/overlay/overlayutils: prevent uint32 overflow
CodeQL was complaining about the conversion to uint32

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-05-30 14:00:14 +02:00
Cory Snider
df6b405796 libnetwork/d/overlay: drop initEncryption function
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.

One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
713f887698 libnetwork/d/overlay: drop checkEncryption function
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.

Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
cb4e7b2f03 libnetwork/d/overlay: make setupEncryption a method
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
0d893252ac libnetwork/d/overlay: checkEncryption: drop isLocal param
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling

    checkEncryption(nid, netip.Addr{}, true)

Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
4b1c1236b9 libnetwork/d/overlay: peerdb: drop isLocal param
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
48e0b24ff7 libnetwork/d/overlay: elide vtep for local peers
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.

Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
a9e2d6d06e libnetwork/d/overlay: filter local peers explicitly
The overlay driver's checkEncryption function configures the IPSec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.

Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-29 14:13:13 -04:00
Cory Snider
f144264bae Merge pull request #50090 from corhere/libn/overlay-netip
libnetwork/d/overlay: use netip types more
2025-05-29 14:12:28 -04:00
Paweł Gronowski
0e2cc22d36 Merge pull request #50049 from robmry/nftables_env_var_enable
nftables: enable using env var
2025-05-28 12:58:21 +00:00
Paweł Gronowski
e37efd4c2d Merge pull request #50068 from mmorel-35/github.com/containerd/errdefs
refactor: replace uses of errdefs package
2025-05-28 12:57:15 +00:00
Rob Murray
19dc38f79b Listen on mapped host ports before mapping more ports
Because we set SO_REUSEADDR on sockets for host ports, if there
are port mappings for INADDR_ANY (the default) as well as for
specific host ports - bind() cannot be used to detect clashes.

That means, for example, on daemon startup, if the port allocator
returns the first port in its ephemeral range for a specific host
adddress, and the next port mapping is for 0.0.0.0 - the same port
is returned and both bind() calls succeed. Then, the container
fails to start later when listen() spots the problem and it's too
late to find another port.

So, bind and listen to each set of ports as they're allocated
instead of just binding.

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-28 11:38:59 +01:00
Rob Murray
21a165de23 Use env-var DOCKER_FIREWALL_BACKEND=nftables to enable nftables
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-28 09:57:26 +01:00
Matthieu MOREL
8561016335 libnetwork: replace uses of errdefs package
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-05-28 05:40:38 +00:00
Cory Snider
d188df0039 libn/d/overlay: use netip types more
The netip types are really useful for tracking state in the overlay
driver as they are hashable, unlike net.IP and friends, making them
directly useable as map keys. Converting between netip and net types is
fairly trivial, but fewer conversions is more ergonomic.

The NetworkDB entries for the overlay peer table encode the IP addresses
as strings. We need to parse them to some representation before
processing them further. Parse directly into netip types and pass those
values around to cut down on the number of conversions needed.

The peerDB needs to marshal the keys and entries to structs of hashable
values to be able to insert them into the SetMatrix. Use netip.Addr in
peerEntry so that peerEntry values can be directly inserted into the
SetMatrix without conversions. Use a hashable struct type as the
SetMatrix key to avoid having to marshal the whole struct to a string
and parse it back out.

Use netip.Addr as the map key for the driver's encryption map so the
values do not need to be converted to and from strings. Change the
encryption configuration methods to take netip types so the peerDB code
can pass netip values directly.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-27 13:47:11 -04:00
Cory Snider
0317f773a6 libnetwork/internal/setmatrix: make keys generic
Make the SetMatrix key's type generic so that e.g. netip.Addr values can
be used as matrix keys.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-27 13:29:41 -04:00
Cory Snider
0d6e7cd983 libnetwork/osl: stop tracking neighbor entries
The Namespace keeps some state for each inserted neighbor-table entry
which is used to delete the entry (and any related entries) given only
the IP and MAC address of the entry to delete. This state is not
strictly required as the retained data is a pure function of the
parameters passed to AddNeighbor(), and the kernel can inform us whether
an attempt to add a neighbor entry would conflict with an existing
entry. Get rid of the neighbor state in Namespace. It's just one more
piece of state that can cause lots of grief if it falls out of sync with
ground truth. Require callers to call DeleteNeighbor() with the same
aguments as they had passed to AddNeighbor(). Push the responsibility
for detecting attempts to insert conflicting entries into the neighbor
table onto the kernel by using (*netlink.Handle).NeighAdd() instead of
NeighSet().

Modernize the error messages and logging in DeleteNeighbor() and
AddNeighbor().

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-27 11:46:21 -04:00
Cory Snider
3bdf99d127 libn/osl: drop unused AddNeighbor force parameter
func (*Namespace) AddNeighbor is only ever called with the force
parameter set to false. Remove the parameter and eliminate dead code.

Signed-off-by: Cory Snider <csnider@mirantis.com>
2025-05-27 11:30:30 -04:00
Rob Murray
0facb0cd63 Merge pull request #49969 from robmry/firewaller_wsl2_param
Make WSL2Mirrored into a Firewaller param
2025-05-21 15:32:04 +01:00
Sebastiaan van Stijn
2a96d2eb8d align //go:build versions
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-05-16 17:52:43 +02:00
Matthieu MOREL
205ba05feb fix usestdlibvars
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-05-15 18:24:58 +02:00
Rob Murray
eeba428939 Make WSL2Mirrored a Firewaller param
The bridge driver should figure out whether it's running in
a mirrored WSL2 setup, and tell the firewaller.

So, move the WSL2-deciding code back into the bridge driver
and unit test it there. Use TestIptabler to check the rules
are constructed properly.

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-14 09:38:32 +01:00
Rob Murray
ba0ad9e80f Unit test the bridge driver in terms of its firewaller
Don't inspect iptables rules, because the driver's Firewaller won't
always be an iptabler.

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-13 18:27:55 +01:00
Rob Murray
daeb080ff1 Test Iptabler params
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-12 17:30:40 +01:00
Rob Murray
cb38cc0fdd Merge pull request #49860 from robmry/firewaller_interface
Firewaller interface
2025-05-12 14:18:16 +01:00
Sebastiaan van Stijn
cfdfbfab9b libnetwork/drivers/remote: inline decodeToMap
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-05-08 14:33:09 +02:00
Rob Murray
b0777be89e Use firewaller.IPVersion instead of iptables.IPVersion for gwmode
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-06 10:18:22 +01:00
Rob Murray
3cbb1ae736 Move filter-FORWARD DROP setting to the firewaller
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-06 10:18:22 +01:00
Rob Murray
44843d9917 Pass context to more places
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-06 10:18:22 +01:00
Rob Murray
a9bf151260 Put Iptabler behind a Firewaller interface.
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-05-06 10:18:22 +01:00
Matthieu MOREL
70139978d3 fix(ST1016): Use consistent method receiver names
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-05-01 14:30:52 +02:00
Sebastiaan van Stijn
4ae3705d06 Merge pull request #49895 from thaJeztah/linting_fix_noshadow
fix various "no shadow" linting issues (govet)
2025-05-01 13:09:38 +02:00
Rob Murray
44a3453d73 Add daemon option --allow-direct-routing
Per-network option com.docker.network.bridge.trusted-host-interfaces
accepts a list of interfaces that are allowed to route
directly to a container's published ports in a bridge
network with nat enabled.

This daemon level option disables direct access filtering,
enabling direct access to published ports on container
addresses in all bridge networks, via all host interfaces.

It overlaps with short-term env-var workaround:
  DOCKER_INSECURE_NO_IPTABLES_RAW=1
- it does not allow packets sent from outside the host to reach
  ports published only to 127.0.0.1
- it will outlive iptables (the workaround was initially intended
  for hosts that do not have kernel support for the "raw" iptables
  table).

Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-04-30 20:59:28 +01:00
Rob Murray
c16caabe36 Add TestNetworkConfigurationMarshalling
Signed-off-by: Rob Murray <rob.murray@docker.com>
2025-04-30 20:59:28 +01:00