Add integration tests for Windows container functionality focusing on network drivers and container isolation modes.
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
The TestBridgeICCWindows test was failing on Windows due to a context timeout:
=== FAIL: github.com/docker/docker/integration/networking TestBridgeICCWindows/User_defined_nat_network (9.02s)
bridge_test.go:243: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.44/containers/62a4ed964f125e023cc298fde2d4d2f8f35415da970fd163b24e181b8c0c6654/start": context deadline exceeded
panic.go:635: assertion failed: error is not nil: Error response from daemon: error while removing network: network mynat id 25066355c070294c1d8d596c204aa81f056cc32b3e12bf7c56ca9c5746a85b0c has active endpoints
=== FAIL: github.com/docker/docker/integration/networking TestBridgeICCWindows (17.65s)
Windows appears to be slower to start, so these timeouts are expected.
Increase the context timeout to give it a little more time.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 0ea28fede0)
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
It looks like the error returned by Windows changed in Windows 2025. Before
Windows 2025, this produced an `ERROR_INVALID_NAME`:
The filename, directory name, or volume label syntax is incorrect.
But Windows 2025 produces an `ERROR_DIRECTORY` ("The directory name is invalid."):
CreateFile \\\\?\\Volume{d9f06b05-0405-418b-b3e5-4fede64f3cdc}\\windows\\system32\\drivers\\etc\\hosts\\: The directory name is invalid.
Docs: https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
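A test that must pass on both old and new Windows versions could accept either message. The following is only a sketch of that idea: the function name isInvalidPathError and the sample message in main are made up for illustration, and the real test may match the error differently.

```go
package main

import (
	"fmt"
	"strings"
)

// The two messages quoted above: ERROR_INVALID_NAME (before Windows 2025)
// and ERROR_DIRECTORY (Windows 2025).
const (
	msgInvalidName = "The filename, directory name, or volume label syntax is incorrect."
	msgDirectory   = "The directory name is invalid."
)

// isInvalidPathError reports whether errMsg contains either message.
// The name is hypothetical; it is not taken from the test suite.
func isInvalidPathError(errMsg string) bool {
	return strings.Contains(errMsg, msgInvalidName) ||
		strings.Contains(errMsg, msgDirectory)
}

func main() {
	// Made-up sample message, for illustration only.
	sample := `CreateFile C:\example\hosts\: The directory name is invalid.`
	fmt.Println(isInvalidPathError(sample))
}
```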
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d3d20b9195)
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
Introduce the DOCKER_DISABLE_WEAK_CIPHERS environment variable to allow
disabling weak TLS ciphers. When set to true, the daemon restricts
TLS to a modern, secure subset of cipher suites, disabling known weak
ciphers such as CBC-mode ciphers.
This is intended as an edge-case option and is not exposed via a CLI flag or
config option. By default, weak ciphers remain enabled for backward compatibility.
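A minimal sketch of how such a switch could be wired into a tls.Config follows. It is an assumption-laden illustration, not the daemon's code: parsing the value with strconv.ParseBool, the helper names, and the rule "drop every suite whose name contains CBC" are all choices made for this example. Note that Config.CipherSuites only affects TLS 1.0 through 1.2 in Go; TLS 1.3 suites are not configurable.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// weakCiphersDisabled interprets the value of DOCKER_DISABLE_WEAK_CIPHERS.
// Using strconv.ParseBool is an assumption of this sketch.
func weakCiphersDisabled(val string) bool {
	disabled, err := strconv.ParseBool(val)
	return err == nil && disabled
}

// strongCipherSuites returns the IDs of the suites Go does not consider
// insecure, minus any CBC-mode suite (hypothetical selection rule).
func strongCipherSuites() []uint16 {
	var ids []uint16
	for _, cs := range tls.CipherSuites() {
		if strings.Contains(cs.Name, "CBC") {
			continue
		}
		ids = append(ids, cs.ID)
	}
	return ids
}

// containsCBC reports whether any of the given suite IDs is a CBC-mode suite.
func containsCBC(ids []uint16) bool {
	for _, id := range ids {
		if strings.Contains(tls.CipherSuiteName(id), "CBC") {
			return true
		}
	}
	return false
}

func main() {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if weakCiphersDisabled(os.Getenv("DOCKER_DISABLE_WEAK_CIPHERS")) {
		cfg.CipherSuites = strongCipherSuites()
	}
	// A nil CipherSuites slice means "use Go's defaults".
	fmt.Println("explicit cipher suites:", len(cfg.CipherSuites))
}
```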
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
This change reworks the Go mod tidy/vendor checks to run for all Go modules tracked by the project and to fail on any uncommitted changes.
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
(cherry picked from commit f6e1bf2808)
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
The `tar` utility is included in Windows 10 (17063+) and Windows Server
2019+, so we can use it directly.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit 8c8324b37f)
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.
The libnetwork controller naively handled events for endpoint_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, adding a new
service binding without removing the old one. This would have various
side effects, such as the "transient state" of multiple conflicting
service bindings, where more than one endpoint is assigned the same IP
address, never settling.
Modify the libnetwork controller to handle an UPDATE by removing the
previous service binding then adding the new one.
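The remove-then-add handling can be sketched with a toy table. Everything here is a simplified stand-in: bindingTable, handleEvent and the use of an empty string for "no value" are invented for this example and do not mirror the controller's actual types.

```go
package main

import "fmt"

// bindingTable maps a service IP to the set of endpoints bound to it.
// It is a hypothetical stand-in for the controller's service bindings.
type bindingTable map[string]map[string]bool

func (t bindingTable) add(ip, endpoint string) {
	if t[ip] == nil {
		t[ip] = map[string]bool{}
	}
	t[ip][endpoint] = true
}

func (t bindingTable) remove(ip, endpoint string) {
	delete(t[ip], endpoint)
	if len(t[ip]) == 0 {
		delete(t, ip)
	}
}

// handleEvent applies one table event for an endpoint. An empty prevIP
// means CREATE, an empty newIP means DELETE, and both set means UPDATE.
// An UPDATE removes the previous binding before adding the new one, so a
// coalesced leave+rejoin cannot leave a stale binding behind.
func (t bindingTable) handleEvent(endpoint, prevIP, newIP string) {
	if prevIP != "" {
		t.remove(prevIP, endpoint)
	}
	if newIP != "" {
		t.add(newIP, endpoint)
	}
}

func main() {
	t := bindingTable{}
	t.handleEvent("ep1", "", "10.0.0.2")         // CREATE
	t.handleEvent("ep1", "10.0.0.2", "10.0.0.3") // coalesced UPDATE
	fmt.Println(len(t), t["10.0.0.3"]["ep1"])
}
```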
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 4538a1de0a)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.
The overlay driver naively handled events for overlay_peer_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, inserting a new
entry into peerDB without removing the old one. This would
have various side effects, such as the "transient state" of
multiple entries in peerDB with the same peer IP never settling.
Update driverapi to pass both the previous and new value of a table
entry into the driver. Modify the overlay driver to handle an UPDATE by
removing the previous peer entry from peerDB then adding the new one.
Modify the Windows overlay driver to match.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit e1a586a9a7)
libn/d/overlay: don't deref nil PeerRecord on error
If unmarshaling the peer record fails, there is no need to check if it's
a record for a local peer. Attempting to do so anyway will result in a
nil-dereference panic. Don't do that.
The Windows overlay driver has a typo: prevPeer is being checked twice
for whether it was a local-peer record. Check prevPeer once and newPeer
once each, as intended.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 12c6345d3a)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Windows and Linux overlay driver instances are interoperable, working
from the same NetworkDB table for peer discovery. As both drivers
produce and consume serialized data through the table, they both need to
have a shared understanding of the shape and semantics of that data.
The Windows overlay driver contains a duplicate copy of the protobuf
definitions used for marshaling and unmarshaling the NetworkDB peer
entries for dubious reasons. It gives us the flexibility to have the
definitions diverge, which is only really useful for shooting ourselves
in the foot.
Make libnetwork/drivers/overlay the source of truth for the peer record
definitions and the name of the NetworkDB table for distributing peer
records.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 8340e109de)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The macAddr and ipmac types are generally useful within libnetwork. Move
them to a dedicated package and overhaul the API to be more like that of
the net/netip package.
Update the overlay driver to utilize these types, adapting to the new
API.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit c7b93702b9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Overlay is the only driver which makes use of the EventNotify facility,
yet all other driver implementations are forced to provide a stub
implementation. Move the EventNotify and DecodeTableEntry methods into a
new optional TableWatcher interface and remove the stubs from all the
other drivers.
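The optional-interface pattern can be sketched as below. The Driver interface, the EventNotify signature and the notify helper are pared-down assumptions for illustration; the real driverapi interfaces have more methods and different parameters.

```go
package main

import "fmt"

// Driver is a hypothetical, pared-down driver interface.
type Driver interface {
	Type() string
}

// TableWatcher is the optional interface. Only drivers that care about
// table events implement it. The signature is simplified for this sketch.
type TableWatcher interface {
	EventNotify(table, key string, prev, value []byte)
}

// bridgeDriver needs no table events, so it has no stub method.
type bridgeDriver struct{}

func (bridgeDriver) Type() string { return "bridge" }

// overlayDriver opts in by implementing TableWatcher.
type overlayDriver struct{ events int }

func (*overlayDriver) Type() string { return "overlay" }

func (d *overlayDriver) EventNotify(table, key string, prev, value []byte) {
	d.events++
}

// notify delivers the event only if the driver implements TableWatcher
// and reports whether it was delivered.
func notify(d Driver, table, key string, prev, value []byte) bool {
	w, ok := d.(TableWatcher)
	if !ok {
		return false
	}
	w.EventNotify(table, key, prev, value)
	return true
}

func main() {
	ov := &overlayDriver{}
	fmt.Println(notify(bridgeDriver{}, "overlay_peer_table", "k", nil, []byte("v")))
	fmt.Println(notify(ov, "overlay_peer_table", "k", nil, []byte("v")), ov.events)
}
```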
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 844023f794)
Signed-off-by: Cory Snider <csnider@mirantis.com>
When handling updates to existing entries, it is often necessary to know
what the previous value was. NetworkDB knows the previous and new values
when it broadcasts an update event for an entry. Include both values in
the update event so the watchers do not have to do their own parallel
bookkeeping.
Unify the event types under WatchEvent, as representing the operation
kind in the type system has been inconvenient rather than useful. The
operation is now implied by the nilness of the Value and Prev event fields.
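How the operation kind can be derived from nilness is sketched below. The struct is a reduced stand-in that keeps only the Prev and Value fields named above plus a Key; the Kind method is invented for this example and is not claimed to be part of NetworkDB.

```go
package main

import "fmt"

// WatchEvent is a reduced stand-in for the unified event type.
type WatchEvent struct {
	Key   string
	Prev  []byte // nil when the entry did not exist before
	Value []byte // nil when the entry no longer exists
}

// Kind derives the operation from which fields are nil.
// This helper is hypothetical.
func (e WatchEvent) Kind() string {
	switch {
	case e.Prev == nil && e.Value != nil:
		return "create"
	case e.Prev != nil && e.Value != nil:
		return "update"
	case e.Prev != nil && e.Value == nil:
		return "delete"
	default:
		return "invalid"
	}
}

func main() {
	events := []WatchEvent{
		{Key: "ep1", Value: []byte("a")},
		{Key: "ep1", Prev: []byte("a"), Value: []byte("b")},
		{Key: "ep1", Prev: []byte("b")},
	}
	for _, ev := range events {
		fmt.Println(ev.Key, ev.Kind())
	}
}
```

An empty but non-nil slice still counts as a value in this sketch, so producers would need to be careful to use nil only for "absent".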
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 69c3c56eba)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The concurrency control in the overlay driver is logically unsound.
While the use of mutexes is sufficient to prevent data races --
violations of the Go memory model -- many operations which need to be
atomic are performed with unbounded concurrency.
Overhaul the use of locks in the overlay network driver. Implement sound
locking at the network granularity: operations may proceed concurrently
iff they are being applied to distinct networks. Push the responsibility
of locking up to the code which calls methods or accesses struct fields
to avoid deadlock situations like we had previously with
d.initSandboxPeerDB() and to make the code easier to reason about.
Each overlay network has a distinct peer db. The NetworkDB watch for the
overlay peer table for the network will only start after
(*driver).CreateNetwork returns and will be stopped before libnetwork
calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for
a network is constrained to the lifetime of the network itself. Yet the
peer db for a network is tracked in a dedicated map, separately from the
network objects themselves. This has resulted in a parallel set of
mutexes to manage concurrency of the peer db distinct from the mutexes
for the driver and networks. Move the peer db for a network into a field
of the network struct and guard it from concurrent access using the
per-network lock. Move the methods for manipulating the peer db into the
network struct so that the methods can only be called if the caller has
a reference to the network object.
Network creation and deletion are synchronized using the driver-scope
mutex, but some of the kernel programming is performed outside of the
critical section. It is possible for network deletion to race with
recreating the network, interleaving the kernel programming for the
network creation and deletion, resulting in inconsistent kernel state.
Parallelize network creation and deletion soundly. Use a double-checked
locking scheme to soundly handle the case of concurrent CreateNetwork
and DeleteNetwork for the same network id without blocking operations
on other networks. Synchronize operations on a network so that
operations on the network such as adding a neighbor to the peer db are
performed atomically, not interleaved with deleting the network.
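One way to get per-network atomicity with a re-check after locking is sketched below. It is an illustration under assumptions, not the driver's implementation: the types, the deleted flag, the lockNetwork helper and the choice to fail a create while an old network with the same id still exists are all made up for this example.

```go
package main

import (
	"fmt"
	"sync"
)

type network struct {
	mu      sync.Mutex
	deleted bool              // set under mu once deletion has begun
	peers   map[string]string // peer IP -> VTEP, guarded by mu
}

type driver struct {
	mu       sync.Mutex // guards the networks map only
	networks map[string]*network
}

func newDriver() *driver { return &driver{networks: map[string]*network{}} }

// lockNetwork looks up the network, locks it, then checks again that it
// was not deleted in between. The caller must unlock n.mu.
func (d *driver) lockNetwork(id string) (*network, bool) {
	for {
		d.mu.Lock()
		n, ok := d.networks[id]
		d.mu.Unlock()
		if !ok {
			return nil, false
		}
		n.mu.Lock()
		if !n.deleted {
			return n, true
		}
		n.mu.Unlock() // deleted meanwhile: retry the lookup
	}
}

func (d *driver) CreateNetwork(id string) error {
	n := &network{peers: map[string]string{}}
	n.mu.Lock() // held until setup is complete
	defer n.mu.Unlock()
	d.mu.Lock()
	_, exists := d.networks[id]
	if !exists {
		d.networks[id] = n
	}
	d.mu.Unlock()
	if exists {
		return fmt.Errorf("network %s already exists", id)
	}
	// Kernel programming for the new network would happen here.
	return nil
}

func (d *driver) DeleteNetwork(id string) error {
	n, ok := d.lockNetwork(id)
	if !ok {
		return fmt.Errorf("network %s not found", id)
	}
	defer n.mu.Unlock()
	n.deleted = true
	// Kernel teardown would happen here, before the id is released.
	d.mu.Lock()
	delete(d.networks, id)
	d.mu.Unlock()
	return nil
}

func (d *driver) AddPeer(id, ip, vtep string) error {
	n, ok := d.lockNetwork(id)
	if !ok {
		return fmt.Errorf("network %s not found", id)
	}
	defer n.mu.Unlock()
	n.peers[ip] = vtep
	return nil
}

func main() {
	d := newDriver()
	var wg sync.WaitGroup
	for _, id := range []string{"net1", "net2"} {
		wg.Add(1)
		go func(id string) { // distinct networks proceed concurrently
			defer wg.Done()
			_ = d.CreateNetwork(id)
			_ = d.AddPeer(id, "10.0.0.2", "192.0.2.1")
			_ = d.DeleteNetwork(id)
		}(id)
	}
	wg.Wait()
	fmt.Println("networks left:", len(d.networks))
}
```

The driver-scope lock is never held while waiting on a network lock, which keeps the lock ordering simple and avoids blocking operations on unrelated networks.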
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 89d3419093)
Signed-off-by: Cory Snider <csnider@mirantis.com>
There is a dedicated mutex for synchronizing access to the encrMap.
Separately, the main driver mutex is used for synchronizing access to
the encryption keys. Their use is sufficient to prevent data races (if
used correctly, which is not the case) but not logical race conditions.
Programming the encryption parameters for a peer can race with
encryption keys being updated, which could lead to inconsistencies
between the parameters programmed into the kernel and the desired state.
Introduce a new mutex for synchronizing encryption operations. Use that
mutex to synchronize access to both encrMap and keys. Handle encryption
key updates in a critical section so they can no longer be interleaved
with kernel programming of encryption parameters.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 843cd96725)
Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*driver) secMapWalk is a curious beast. It is named walk, yet it
also mutates the collection being iterated over. It returns an error,
but that error is always nil. It takes a callback that can break
iteration, yet the only caller makes no use of that affordance. Its
utility is limited and the abstraction hinders readability more than it
helps. Open-code the d.secMap.nodes loop into
func (*driver) updateKeys(), the only caller.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a1d299749c)
Signed-off-by: Cory Snider <csnider@mirantis.com>
It is easier to find all references when they are struct fields rather
than embedded structs.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 74713e1a7d)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and
Security Policy Database entries) for a particular overlay network peer
(VTEP) are shared global state as they have to be programmed into the
root network namespace. The same parameters are used when encrypting
VXLAN traffic to a particular VTEP for all overlay networks. Deleting
the entries for a VTEP will break encryption to that VTEP across all
encrypted overlay networks, therefore the decision of when to delete the
entries must take the state of all overlay networks into account.
Unfortunately this is not the case.
The overlay driver uses local per-network state to decide when to
program and delete the parameters for a VTEP. In practice, the
parameters for all VTEPs participating in an encrypted overlay network
are deleted when the network is deleted. Encryption to that VTEP over
all other active encrypted overlay networks would be broken until some
other incidental peerDB event triggered a re-programming of the
parameters for that VTEP.
Change the setupEncryption and removeEncryption functions to be
reference-counted. The removeEncryption function needs to be called the
same number of times as setupEncryption before the parameters are
deleted from the kernel.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 057e35dd65)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always
converge to a 1:1:1 mapping from peer endpoint IP address to MAC address
to VTEP. While this currently holds true in practice most of the time,
it is not an invariant and there are ways that users can violate this
assumption.
The driver detects whether peer entries conflict with each other by
matching up (IP, MAC) tuples. In the common case this works out fine as
the MAC address for an endpoint is generally derived from the assigned
IP address. If an IP address gets reassigned to a container on another
node the MAC address will follow, so the driver's conflict resolution
logic will behave as intended. However users may explicitly configure
the MAC address for a container's network endpoints. If an IP address
gets reassigned from a container with an auto-generated MAC address to a
container with a manually-configured MAC, or vice versa, the driver
would not detect the conflict as the (IP, MAC) tuples won't match up. It
would attempt to program the kernel's neighbor table with two
conflicting MAC addresses for one IP, which will fail. And since it
does not realize that there is a conflict, the driver won't reprogram
the kernel from the remaining entry when the other entry is deleted.
The assumption that only one IP address may resolve to a given MAC
address is violated if multiple IP addresses are assigned to an
endpoint. This rarely comes up in practice today as the overlay driver
only supports IPv4 single-stack connectivity for endpoints. If multiple
distinct peer entries exist with the same MAC address, the driver will
delete the MAC->VTEP mapping from the kernel's forwarding database when
any entry is deleted, even if other entries remain active. This
limitation is one of the biggest obstacles in the way of supporting IPv6
and dual-stack connectivity for endpoints attached to overlay networks.
Modify the peer db logic to correctly handle the cases where peer
entries have non-unique MAC or VTEP values. Treat any set of entries
with non-unique IP addresses as a conflict, irrespective of the entries'
MAC addresses. Maintain a reference count of forwarding database entries
and only delete the MAC->VTEP mapping from the kernel when there are no
longer any neighbor entries which resolve to that MAC.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 1c2b744ca2)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The peer db implementation is more complex than it needs to be.
Notably, the peerCRUD / peerCRUDOp function split is a vestige of its
evolution from a worker goroutine receiving commands over a channel.
Refactor the peer db operations to be easier to read, understand and
modify. Factor the kernel-programming operations out into dedicated
addNeighbor and deleteNeighbor functions. Inline the rest of the
peerCRUDOp functions into their respective peerCRUD wrappers.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 59437f56f9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.
One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit df6b405796)
Signed-off-by: Cory Snider <csnider@mirantis.com>
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for
security-relevant code to have.
Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 713f887698)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit cb4e7b2f03)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling
checkEncryption(nid, netip.Addr{}, true)
Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0d893252ac)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 4b1c1236b9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address were to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.
Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 48e0b24ff7)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver's checkEncryption function configures the IPsec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.
Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a9e2d6d06e)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The netip types are really useful for tracking state in the overlay
driver as they are hashable, unlike net.IP and friends, making them
directly usable as map keys. Converting between netip and net types is
fairly trivial, but fewer conversions are more ergonomic.
The NetworkDB entries for the overlay peer table encode the IP addresses
as strings. We need to parse them to some representation before
processing them further. Parse directly into netip types and pass those
values around to cut down on the number of conversions needed.
The peerDB needs to marshal the keys and entries to structs of hashable
values to be able to insert them into the SetMatrix. Use netip.Addr in
peerEntry so that peerEntry values can be directly inserted into the
SetMatrix without conversions. Use a hashable struct type as the
SetMatrix key to avoid having to marshal the whole struct to a string
and parse it back out.
Use netip.Addr as the map key for the driver's encryption map so the
values do not need to be converted to and from strings. Change the
encryption configuration methods to take netip types so the peerDB code
can pass netip values directly.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d188df0039)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Make the SetMatrix key's type generic so that e.g. netip.Addr values can
be used as matrix keys.
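What a generic key buys can be shown with a reduced set-matrix. This is a simplified sketch written for this note: the method set and return values are assumptions and do not claim to match the real SetMatrix API.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// SetMatrix maps each key to a set of values. Both type parameters only
// need to be comparable, so netip.Addr works as a key with no string
// conversion. The zero value is ready to use.
type SetMatrix[K, V comparable] struct {
	mu sync.Mutex
	m  map[K]map[V]struct{}
}

// Insert adds value to the set at key. It reports whether the value was
// newly inserted and the resulting size of the set.
func (s *SetMatrix[K, V]) Insert(key K, value V) (inserted bool, size int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m == nil {
		s.m = map[K]map[V]struct{}{}
	}
	set, ok := s.m[key]
	if !ok {
		set = map[V]struct{}{}
		s.m[key] = set
	}
	if _, dup := set[value]; dup {
		return false, len(set)
	}
	set[value] = struct{}{}
	return true, len(set)
}

// Remove deletes value from the set at key. It reports whether the value
// was present and the remaining size of the set.
func (s *SetMatrix[K, V]) Remove(key K, value V) (removed bool, size int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	set := s.m[key]
	if _, ok := set[value]; !ok {
		return false, len(set)
	}
	delete(set, value)
	if len(set) == 0 {
		delete(s.m, key)
	}
	return true, len(set)
}

func main() {
	var peers SetMatrix[netip.Addr, string]
	ip := netip.MustParseAddr("10.0.0.2")
	fmt.Println(peers.Insert(ip, "02:42:0a:00:00:02"))
	fmt.Println(peers.Insert(ip, "02:00:00:00:00:99")) // second value: size 2
	fmt.Println(peers.Remove(ip, "02:42:0a:00:00:02"))
}
```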
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0317f773a6)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The Namespace keeps some state for each inserted neighbor-table entry
which is used to delete the entry (and any related entries) given only
the IP and MAC address of the entry to delete. This state is not
strictly required as the retained data is a pure function of the
parameters passed to AddNeighbor(), and the kernel can inform us whether
an attempt to add a neighbor entry would conflict with an existing
entry. Get rid of the neighbor state in Namespace. It's just one more
piece of state that can cause lots of grief if it falls out of sync with
ground truth. Require callers to call DeleteNeighbor() with the same
arguments as they had passed to AddNeighbor(). Push the responsibility
for detecting attempts to insert conflicting entries into the neighbor
table onto the kernel by using (*netlink.Handle).NeighAdd() instead of
NeighSet().
Modernize the error messages and logging in DeleteNeighbor() and
AddNeighbor().
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0d6e7cd983)
libn/d/overlay: delete FDB entry from AF_BRIDGE
Starting with commit 0d6e7cd983,
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 7a12bbe5d3)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The isDefault and nlHandle fields are immutable once the Namespace is
constructed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 9866738736)
Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*Namespace) AddNeighbor is only ever called with the force
parameter set to false. Remove the parameter and eliminate dead code.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 3bdf99d127)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The writeToStore() call was removed from CreateNetwork in
commit 0fa873c0fe. The comment about
undoing the write is no longer applicable.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d90277372f)
Signed-off-by: Cory Snider <csnider@mirantis.com>