Add integration tests for Windows container functionality focusing on network drivers and container isolation modes.
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
The TestBridgeICCWindows test was failing on Windows due to a context timeout:
=== FAIL: github.com/docker/docker/integration/networking TestBridgeICCWindows/User_defined_nat_network (9.02s)
bridge_test.go:243: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.44/containers/62a4ed964f125e023cc298fde2d4d2f8f35415da970fd163b24e181b8c0c6654/start": context deadline exceeded
panic.go:635: assertion failed: error is not nil: Error response from daemon: error while removing network: network mynat id 25066355c070294c1d8d596c204aa81f056cc32b3e12bf7c56ca9c5746a85b0c has active endpoints
=== FAIL: github.com/docker/docker/integration/networking TestBridgeICCWindows (17.65s)
Windows appears to be slower to start, so these timeouts are expected.
Increase the context timeout to give it a little more time.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 0ea28fede0)
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
It looks like the error returned by Windows changed in Windows 2025; before
Windows 2025, this produced a `ERROR_INVALID_NAME`;
The filename, directory name, or volume label syntax is incorrect.
But Windows 2025 produces a `ERROR_DIRECTORY` ("The directory name is invalid."):
CreateFile \\\\?\\Volume{d9f06b05-0405-418b-b3e5-4fede64f3cdc}\\windows\\system32\\drivers\\etc\\hosts\\: The directory name is invalid.
Docs; https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d3d20b9195)
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
Introduce the DOCKER_DISABLE_WEAK_CIPHERS environment variable to allow
disabling weak TLS ciphers. When set to true, the daemon restricts
TLS to a modern, secure subset of cipher suites, disabling known weak
ciphers such as CBC-mode ciphers.
This is intended as an edge-case option and is not exposed via a CLI flag or
config option. By default, weak ciphers remain enabled for backward compatibility.
Signed-off-by: Sopho Merkviladze <smerkviladze@mirantis.com>
This change reworks the Go mod tidy/vendor checks to run for all tracked Go modules by the project and fail for any uncommitted changes.
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
(cherry picked from commit f6e1bf2808)
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
`tar` utility is included in Windows 10 (17063+) and Windows Server
2019+ so we can use it directly.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit 8c8324b37f)
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.
The libnetwork controller naively handled events for endpoint_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, adding a new
service binding without removing the old one. This would
have various side effects, such as having the "transient state" of
having multiple conflicting service bindings where more than one
endpoint is assigned an IP address never settling.
Modify the libnetwork controller to handle an UPDATE by removing the
previous service binding then adding the new one.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 4538a1de0a)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.
The overlay driver naively handled events for overlay_peer_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, inserting a new
entry into peerDB without removing the old one. This would
have various side effects, such as having the "transient state" of
multiple entries in peerDB with the same peer IP never settle.
Update driverapi to pass both the previous and new value of a table
entry into the driver. Modify the overlay driver to handle an UPDATE by
removing the previous peer entry from peerDB then adding the new one.
Modify the Windows overlay driver to match.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit e1a586a9a7)
libn/d/overlay: don't deref nil PeerRecord on error
If unmarshaling the peer record fails, there is no need to check if it's
a record for a local peer. Attempting to do so anyway will result in a
nil-dereference panic. Don't do that.
The Windows overlay driver has a typo: prevPeer is being checked twice
for whether it was a local-peer record. Check prevPeer once and newPeer
once each, as intended.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 12c6345d3a)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Windows and Linux overlay driver instances are interoperable, working
from the same NetworkDB table for peer discovery. As both drivers
produce and consume serialized data through the table, they both need to
have a shared understanding of the shape and semantics of that data.
The Windows overlay driver contains a duplicate copy of the protobuf
definitions used for marshaling and unmarshaling the NetworkDB peer
entries for dubious reasons. It gives us the flexibility to have the
definitions diverge, which is only really useful for shooting ourselves
in the foot.
Make libnetwork/drivers/overlay the source of truth for the peer record
definitions and the name of the NetworkDB table for distributing peer
records.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 8340e109de)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The macAddr and ipmac types are generally useful within libnetwork. Move
them to a dedicated package and overhaul the API to be more like that of
the net/netip package.
Update the overlay driver to utilize these types, adapting to the new
API.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit c7b93702b9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Overlay is the only driver which makes use of the EventNotify facility,
yet all other driver implementations are forced to provide a stub
implementation. Move the EventNotify and DecodeTableEntry methods into a
new optional TableWatcher interface and remove the stubs from all the
other drivers.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 844023f794)
Signed-off-by: Cory Snider <csnider@mirantis.com>
When handling updates to existing entries, it is often necessary to know
what the previous value was. NetworkDB knows the previous and new values
when it broadcasts an update event for an entry. Include both values in
the update event so the watchers do not have to do their own parallel
bookkeeping.
Unify the event types under WatchEvent as representing the operation kind
in the type system has been inconvenient, not useful. The operation is
now implied by the nilness of the Value and Prev event fields.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 69c3c56eba)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The concurrency control in the overlay driver is logically unsound.
While the use of mutexes is sufficient to prevent data races --
violations of the Go memory model -- many operations which need to be
atomic are performed with unbounded concurrency.
Overhaul the use of locks in the overlay network driver. Implement sound
locking at the network granularity: operations may proceed concurrently
iff they are being applied to distinct networks. Push the responsibility
of locking up to the code which calls methods or accesses struct fields
to avoid deadlock situations like we had previously with
d.initSandboxPeerDB() and to make the code easier to reason about.
Each overlay network has a distinct peer db. The NetworkDB watch for the
overlay peer table for the network will only start after
(*driver).CreateNetwork returns and will be stopped before libnetwork
calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for
a network is constrained to the lifetime of the network itself. Yet the
peer db for a network is tracked in a dedicated map, separately from the
network objects themselves. This has resulted in a parallel set of
mutexes to manage concurrency of the peer db distinct from the mutexes
for the driver and networks. Move the peer db for a network into a field
of the network struct and guard it from concurrent access using the
per-network lock. Move the methods for manipulating the peer db into the
network struct so that the methods can only be called if the caller has
a reference to the network object.
Network creation and deletion are synchronized using the driver-scope
mutex, but some of the kernel programming is performed outside of the
critical section. It is possible for network deletion to race with
recreating the network, interleaving the kernel programming for the
network creation and deletion, resulting in inconsistent kernel state.
Parallelize network creation and deletion soundly. Use a double-checked
locking scheme to soundly handle the case of concurrent CreateNetwork
and DeleteNetwork for the same network id without blocking operations
on other networks. Synchronize operations on a network so that
operations on the network such as adding a neighbor to the peer db are
performed atomically, not interleaved with deleting the network.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 89d3419093)
Signed-off-by: Cory Snider <csnider@mirantis.com>
There is a dedicated mutex for synchronizing access to the encrMap.
Separately, the main driver mutex is used for synchronizing access to
the encryption keys. Their use is sufficient to prevent data races (if
used correctly, which is not the case) but not logical race conditions.
Programming the encryption parameters for a peer can race with
encryption keys being updated, which could lead to inconsistencies
between the parameters programmed into the kernel and the desired state.
Introduce a new mutex for synchronizing encryption operations. Use that
mutex to synchronize access to both encrMap and keys. Handle encryption
key updates in a critical section so they can no longer be interleaved
with kernel programming of encryption parameters.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 843cd96725)
Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*driver) secMapWalk is a curious beast. It is named walk, yet it
also mutates the collection being iterated over. It returns an error,
but that error is always nil. It takes a callback that can break
iteration, yet the only caller makes no use of that affordance. Its
utility is limited and the abstraction hinders readability more than it
helps. Open-code the d.secMap.nodes loop into
func (*driver) updateKeys(), the only caller.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a1d299749c)
Signed-off-by: Cory Snider <csnider@mirantis.com>
It is easier to find all references when they are struct fields rather
than embedded structs.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 74713e1a7d)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and
Security Policy Database entries) for a particular overlay network peer
(VTEP) are shared global state as they have to be programmed into the
root network namespace. The same parameters are used when encrypting
VXLAN traffic to a particular VTEP for all overlay networks. Deleting
the entries for a VTEP will break encryption to that VTEP across all
encrypted overlay networks, therefore the decision of when to delete the
entries must take the state of all overlay networks into account.
Unfortunately this is not the case.
The overlay driver uses local per-network state to decide when to
program and delete the parameters for a VTEP. In practice, the
parameters for all VTEPs participating in an encrypted overlay network
are deleted when the network is deleted. Encryption to that VTEP over
all other active encrypted overlay networks would be broken until some
other incidental peerDB event triggered a re-programming of the
parameters for that VTEP.
Change the setupEncryption and removeEncryption functions to be
reference-counted. The removeEncryption function needs to be called the
same number of times as addEncryption before the parameters are deleted
from the kernel.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 057e35dd65)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always
converge to a 1:1:1 mapping from peer endpoint IP address to MAC address
to VTEP. While this currently holds true in practice most of the time,
it is not an invariant and there are ways that users can violate this
assumption.
The driver detects whether peer entries conflict with each other by
matching up (IP, MAC) tuples. In the common case this works out fine as
the MAC address for an endpoint is generally derived from the assigned
IP address. If an IP address gets reassigned to a container on another
node the MAC address will follow, so the driver's conflict resolution
logic will behave as intended. However users may explicitly configure
the MAC address for a container's network endpoints. If an IP address
gets reassigned from a container with an auto-generated MAC address to a
container with a manually-configured MAC, or vice versa, the driver
would not detect the conflict as the (IP, MAC) tuples won't match up. It
would attempt to program the kernel's neighbor table with two
conflicting MAC addresses for one IP, which will fail. And since it
does not realize that there is a conflict, the driver won't reprogram
the kernel from the remaining entry when the other entry is deleted.
The assumption that only one IP address may resolve to a given MAC
address is violated if multiple IP addresses are assigned to an
endpoint. This rarely comes up in practice today as the overlay driver
only supports IPv4 single-stack connectivity for endpoints. If multiple
distinct peer entries exist with the same MAC address, the driver will
delete the MAC->VTEP mapping from the kernel's forwarding database when
any entry is deleted, even if other entries remain active. This
limitation is one of the biggest obstacles in the way of supporting IPv6
and dual-stack connectivity for endpoints attached to overlay networks.
Modify the peer db logic to correctly handle the cases where peer
entries have non-unique MAC or VTEP values. Treat any set of entries
with non-unique IP addresses as a conflict, irrespective of the entries'
MAC addresses. Maintain a reference count of forwarding database entries
and only delete the MAC->VTEP mapping from the kernel when there are no
longer any neighbor entries which resolve to that MAC.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 1c2b744ca2)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The peer db implementation is more complex than it needs to be.
Notably, the peerCRUD / peerCRUDOp function split is a vestige of its
evolution from a worker goroutine receiving commands over a channel.
Refactor the peer db operations to be easier to read, understand and
modify. Factor the kernel-programming operations out into dedicated
addNeighbor and deleteNeighbor functions. Inline the rest of the
peerCRUDOp functions into their respective peerCRUD wrappers.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 59437f56f9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.
One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit df6b405796)
Signed-off-by: Cory Snider <csnider@mirantis.com>
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.
Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 713f887698)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit cb4e7b2f03)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling
checkEncryption(nid, netip.Addr{}, true)
Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0d893252ac)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 4b1c1236b9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.
Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 48e0b24ff7)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver's checkEncryption function configures the IPSec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.
Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a9e2d6d06e)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The netip types are really useful for tracking state in the overlay
driver as they are hashable, unlike net.IP and friends, making them
directly useable as map keys. Converting between netip and net types is
fairly trivial, but fewer conversions is more ergonomic.
The NetworkDB entries for the overlay peer table encode the IP addresses
as strings. We need to parse them to some representation before
processing them further. Parse directly into netip types and pass those
values around to cut down on the number of conversions needed.
The peerDB needs to marshal the keys and entries to structs of hashable
values to be able to insert them into the SetMatrix. Use netip.Addr in
peerEntry so that peerEntry values can be directly inserted into the
SetMatrix without conversions. Use a hashable struct type as the
SetMatrix key to avoid having to marshal the whole struct to a string
and parse it back out.
Use netip.Addr as the map key for the driver's encryption map so the
values do not need to be converted to and from strings. Change the
encryption configuration methods to take netip types so the peerDB code
can pass netip values directly.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d188df0039)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Make the SetMatrix key's type generic so that e.g. netip.Addr values can
be used as matrix keys.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0317f773a6)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The Namespace keeps some state for each inserted neighbor-table entry
which is used to delete the entry (and any related entries) given only
the IP and MAC address of the entry to delete. This state is not
strictly required as the retained data is a pure function of the
parameters passed to AddNeighbor(), and the kernel can inform us whether
an attempt to add a neighbor entry would conflict with an existing
entry. Get rid of the neighbor state in Namespace. It's just one more
piece of state that can cause lots of grief if it falls out of sync with
ground truth. Require callers to call DeleteNeighbor() with the same
aguments as they had passed to AddNeighbor(). Push the responsibility
for detecting attempts to insert conflicting entries into the neighbor
table onto the kernel by using (*netlink.Handle).NeighAdd() instead of
NeighSet().
Modernize the error messages and logging in DeleteNeighbor() and
AddNeighbor().
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 0d6e7cd983)
libn/d/overlay: delete FDB entry from AF_BRIDGE
Starting with commit 0d6e7cd983
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 7a12bbe5d3)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The isDefault and nlHandle fields are immutable once the Namespace is
constructed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 9866738736)
Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*Namespace) AddNeighbor is only ever called with the force
parameter set to false. Remove the parameter and eliminate dead code.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 3bdf99d127)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The writeToStore() call was removed from CreateNetwork in
commit 0fa873c0fe. The comment about
undoing the write is no longer applicable.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d90277372f)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Passing the Auth to the redirected location was fixed in curl 7.58:
https://curl.se/changes.html#7_58_0 so we no longer need the extra
handling and can just use `-L` to let curl handle redirects.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Correctly parse HTTP response that doesn't contain an HTTP version with a decimal place:
```
< HTTP/2 307
```
The previous version would only match strings like `HTTP/2.0 307`.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The property test for the mRandomNodes function revealed that it may
sometimes pick out a sample of fewer than m nodes even when the number
of nodes to pick from (excluding the local node) is >= m. Rewrite it
using a random shuffle or permutation so that it always picks a
uniformly-distributed sample of the requested size whenever the
population is large enough.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit ac5f464649)
Signed-off-by: Cory Snider <csnider@mirantis.com>
TestNetworkDBAlwaysConverges will occasionally find a failure where one
entry is missing on one node even after waiting a full five minutes. One
possible explanation is that the selection of nodes to gossip with is
biased in some way. Test that the mRandomNodes function picks a
uniformly distributed sample of node IDs of sufficient length.
The new test reveals that mRandomNodes may sometimes pick out a sample
of fewer than m nodes even when the number of nodes to pick from
(excluding the local node) is >= m. Put the test behind an xfail tag so
it is opt-in to run, without interfering with CI or bisecting.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 5799deb853)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Add a property-based test which asserts that a cluster of NetworkDB
nodes always eventually converges to a consistent state. As this test
takes a long time to run it is build-tagged to be excluded from CI.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d8730dc1d3)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Add a feature to NetworkDB to log the encryption keys to a file for the
Wireshark memberlist plugin to consume, configured using an environment
variable.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit ebfafa1561)
Signed-off-by: Cory Snider <csnider@mirantis.com>
When a node leaves a network, all entries owned by that node are
implicitly deleted. The other NetworkDB nodes handle the leave by
setting the deleted flag on the entries owned by the left node in their
local stores. This behaviour is problematic as it results in two
conflicting entries with the same Lamport timestamp propagating
through the cluster.
Consider two NetworkDB nodes, A, and B, which are both joined to some
network. Node A in quick succession leaves the network, immediately
rejoins it, then creates an entry. If Node B processes the
entry-creation event first, it will add the entry to its local store
then set the deleted flag upon processing the network-leave. No matter
how many times B bulk-syncs with A, B will ignore the live entry for
having the same timestamp as its local tombstone entry. Once this
situation occurs, the only way to recover is for the entry to get
updated by A with a new timestamp.
There is no need for a node to store forged tombstones for another
node's entries. All nodes will purge the entries naturally when they
process the network-leave or node-leave event. Simply delete the
non-owned entries from the local store so there is no inconsistent state
to interfere with convergence when nodes rejoin a network. Have nodes
update their local store with tombstones for entries when leaving a
network so that after a rapid leave-then-rejoin the entry deletions
propagate to nodes which may have missed the leave event.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 21d9109750)
Signed-off-by: Cory Snider <csnider@mirantis.com>
NetworkDB's JoinNetwork function enqueues a message onto a
TransmitLimitedQueue while holding the NetworkDB mutex locked for
writing. The TransmitLimitedQueue has its own synchronization;
it locks its mutex when enqueueing a message. Locking order:
1. (NetworkDB).RWMutex.Lock()
2. (TransmitLimitedQueue).mu.Lock()
NetworkDB's gossip periodic task calls GetBroadcasts on the same
TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts
invokes the queue's NumNodes callback while the mutex is locked. The
NumNodes callback function that NetworkDB sets locks the NetworkDB mutex
for reading to take the length of the nodes map. Locking order:
1. (TransmitLimitedQueue).mu.Lock()
2. (NetworkDB).RWMutex.RLock()
If one goroutine calls GetBroadcasts on the queue concurrently with
another goroutine calling JoinNetwork on the NetworkDB, the goroutines
may deadlock due to the lock inversion.
Fix the deadlock by caching the number of nodes in an atomic variable so
that the NumNodes callback can load the value without blocking or
violating Go's memory model. And fix a similar deadlock situation with
the table-event broadcast queues.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 08bde5edfa)
Signed-off-by: Cory Snider <csnider@mirantis.com>
With rejoinClusterBootStrap fixed in tests, split clusters should
reliably self-heal in tests as well as production. Work around the other
source of flakiness in TestNetworkDBIslands: timing out waiting for a
failed node to transition to gracefully left. This flake happens when
one of the leaving nodes sends its NodeLeft message to the other leaving
node, and the second is shut down before it has a chance to rebroadcast
the message to the remaining nodes. The proper fix would be to leverage
memberlist's own bookkeeping instead of duplicating it poorly with user
messages, but doing so requires a change in the memberlist module.
Instead have the test check that the sum of failed+left nodes is
expected instead of waiting for all nodes to have failed==3 && left==0.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit aff444df86)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The rejoinClusterBootStrap periodic task rejoins with the bootstrap
nodes if none of them are members of the cluster. It correlates the
cluster nodes with the bootstrap list by comparing IP addresses,
ignoring ports. In normal operation this works out fine as every node
has a unique IP address, but in unit tests every node listens on a
distinct port of 127.0.0.1. This situation causes the check to
incorrectly filter out all nodes from the list, mistaking them for the
local node.
Filter out the local node using pointer equality of the *node to avoid
any ambiguity. Correlate the remote nodes by IP:port so that the check
behaves the same in tests and in production.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 1e1be54d3e)
Signed-off-by: Cory Snider <csnider@mirantis.com>
A network node is responsible for both broadcasting table events for
entries it owns and for rebroadcasting table events from other nodes it
has received. Table events to be broadcast are added to a single queue
per network, including events for rebroadcasting. As the memberlist
TransmitLimitedQueue is (to a first approximation) LIFO, a flood of
events from other nodes could delay the broadcasting of
locally-generated events indefinitely. Prioritize broadcasting local
events by splitting up the queues and only pulling from the rebroadcast
queue if there is free space in the gossip packet after draining the
local-broadcast queue.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 6ec6e0991a)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Log more details when assertions fail to provide a more complete picture
of what went wrong when TestCRUDTableEntries fails. Log the state of
each NetworkDB instance at various points in TestCRUDTableEntries to
provide an even more complete picture.
Increase the global logger verbosity in tests so warnings and debug logs
are printed to the test log.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit e9a7154909)
Signed-off-by: Cory Snider <csnider@mirantis.com>
NetworkDB uses a muli-dimensional map of struct network to keep track of
network attachments for both remote nodes and the local node. Only a
subset of the struct fields are used for remote nodes' network
attachments. The tableBroadcasts pointer field in particular is
always initialized for network values representing local attachments
(read: nDB.networks[nDB.config.NodeID]) and always nil for remote
attachments. Consequently, unnecessary defensive nil-pointer checks are
peppered throughout the code despite the aforementioned invariant.
Enshrine the invariant that tableBroadcasts is initialized iff the
network attachment is for the local node in the type system. Pare down
struct network to only the fields needed for remote network attachments
and move the local-only fields into a new struct thisNodeNetwork. Elide
the unnecessary nil-checks.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit dbb0d88109)
Signed-off-by: Cory Snider <csnider@mirantis.com>
When joining a network that was previously joined but not yet reaped,
NetworkDB replaces the network struct value with a zeroed-out one with
the entries count copied over. This is also the case when joining a
network that is currently joined! Consequently, joining a network has
the side effect of clearing the broadcast queue. If the queue is cleared
while messages are still pending broadcast, convergence may be delayed
until the next bulk sync cycle.
Make it an error to join a network twice without leaving. Retain the
existing broadcast queue when rejoining a network that has not yet been
reaped.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 51f31826ee)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The map key for nDB.networks is the network ID. The struct field is not
actually used anywhere in practice.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 30b27ab6ea)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The loopback-test fixes seem to be sufficient to resolve the flakiness
of all the tests aside from TestFlakyNetworkDBIslands.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 697c17ca95)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Gracefully leaving the memberlist cluster is a best-effort operation.
Failing to successfully broadcast the leave message to a peer should not
prevent NetworkDB from cleaning up the memberlist instance on close. But
that was not the case in practice. Log the error returned from
(*memberlist.Memberlist).Leave instead of returning it and proceed with
shutting down irrespective of whether Leave() returns an error.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 16ed51d864)
Signed-off-by: Cory Snider <csnider@mirantis.com>
- refactor programIngressPorts to use Rule.Insert/Append/Delete for improved rule management
- split programIngress() and dependent functions on Add and Del functions
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
(cherry picked from commit 8b208f1b95)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
- Extract plumpIngressProxy steps in a separate function
- Don't create a new listener if there's already one in ingressProxyTbl
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
(cherry picked from commit c2e2e7fe24)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
On firewalld reload, all the iptables rules are deleted. Legacy
links use iptables.OnReloaded to achieve that - but there's no
way to deleted an OnRelaoded callback. So, a firewalld reload
after the linked containers are deleted results in zombie rules
being re-created.
Legacy links are created by ProgramExternalConnectivity, but
removed in Leave (rather than RevokeExternalConnectivity).
So, restore legacy links for current endpoints, along with the
other per-network/per-port rules.
Move link-removal to RevokeExternalConnectivity, so that it
happens with the configNetwork lock held.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Using iptables.OnReloaded to restore individual per-network rules
on firewalld reload means rules for deleted networks pop back in
to existence (because there was no way to delete the callbacks on
network-delete).
So, on firewalld reload, walk over current networks and ask them
to restore their iptables rules.
Signed-off-by: Rob Murray <rob.murray@docker.com>
(cherry picked from commit a527e5a546)
Test that firewalld reload doesn't re-create deleted iptables rules
Signed-off-by: Rob Murray <rob.murray@docker.com>
(cherry picked from commit c3fa7c1779)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
Doesn't look like it would ever have worked, but:
- init the dbus connection to avoid a segv
- include the chain name when creating the rule
- remove the test rule if it's created
Signed-off-by: Rob Murray <rob.murray@docker.com>
(cherry picked from commit 0ab6f07c31)
Signed-off-by: Cory Snider <csnider@mirantis.com>
When libnetwork receives a watch event for a driver table entry from
NetworkDB it passes the event along to the interested driver. This code
contains a subtle bug: update events from NetworkDB are passed along to
the driver as Delete events! This bug was lying dormant as driver-table
entries can only be added by the driver, not updated. Now that NetworkDB
broadcasts an UpdateEvent to watchers if the entry is already known to
the local NetworkDB, irrespective of whether the event received from the
remote peer was a CREATE or UPDATE event, the bug is causing problems.
Whenever a remote node replaces an entry in the overlay_peer_table but
the intermediate delete state was not received by the local node, the
new CREATE event would be translated to an UpdateEvent by NetworkDB and
subsequently handled by the overlay driver as if the entry was deleted!
Bubble table UPDATE events up to the network driver as Update events.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a7f01d238e)
Signed-off-by: Cory Snider <csnider@mirantis.com>
While github.com/stretchr/testify is not used directly by any of the
repository code, it is a transitive dependency via Swarmkit and
therefore still easy to use without having to revendor. Add lint rules
to ban importing testify packages to make sure nobody does.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 7ebd88d2d9)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Apply command gotest.tools/v3/assert/cmd/gty-migrate-from-testify to the
cnmallocator package to be consistent with the assertion library used
elsewhere in moby.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry-picked from commit 4f30a930ad)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Moby imports Swarmkit; Swarmkit no longer imports Moby. In order to
accomplish this feat, Swarmkit has introduced a new plugin.Getter
interface so it could stop importing our pkg/plugingetter package. This
new interface is not entirely compatible with our
plugingetter.PluginGetter interface, necessitating a thin adapter.
Swarmkit had to jettison the CNM network allocator to stop having to
import libnetwork as the cnmallocator package is deeply tied to
libnetwork. Move the CNM network allocator into libnetwork, where it
belongs. The package had a short an uninteresting Git history in the
Swarmkit repository so no effort was made to retain history.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry-picked from commit 7b0ab1011c)
d/cluster/convert: expose Addr() on plugins
The swarmPlugin type does not implement the Swarm plugin.AddrPlugin
interface because it embeds an interface value which does not include
that method in its method set. (You can type-assert an interface value
to another interface which the concrete type implements, but a struct
embedding an interface value is not itself an interface value.) Wrap the
plugin with a different adapter type which exposes the Addr() method if
the concrete plugin implements it.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 8b6d6b9ad5)
libnetwork/cnmallocator: fix non-constant format string in call (govet)
libnetwork/cnmallocator/drivers_ipam.go:43:31: printf: non-constant format string in call to (*github.com/docker/docker/vendor/github.com/sirupsen/logrus.Entry).Infof (govet)
log.G(context.TODO()).Infof("Swarm initialized global default address pool to: " + str.String())
^
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 7b60a7047d)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The "fluentd-async-connect" option was deprecated in 20.10 through
cc1f3c750e, and removed in 28.0 trough
49ec488036, which added migration code
on daemon startup.
This patch ports the migration code to the 25.0 branch to prevent future
disruption when upgrading existing containers to a new version of the
daemon.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When a node leaves a network or the cluster, or memberlist considers the
node as failed, NetworkDB atomically deletes all table entries (for the
left network) owned by the node. This maintains the invariant that table
entries owned by a node are present in the local database indices iff
that node is an active cluster member which is participating in the
network the entries pertain to.
(*NetworkDB).handleTableEvent() is written in a way which attempts to
minimize the amount of time it is in a critical section with the mutex
locked for writing. It first checks under a read-lock whether both the
local node and the node where the event originated are participating in
the network which the event pertains to. If the check passes, the mutex
is unlocked for reading and locked for writing so the local database
state is mutated in a critical section. That leaves a window of time
between the participation check the write-lock being acquired for a
network or node event to arrive and be processed. If a table event for a
node+network races a node or network event which triggers the purge of
all table entries for the same node+network, the invariant could be
violated. The table entry described by the table event may be reinserted
into the local database state after being purged by the node's leaving,
resulting in an orphaned table entry which the local node will bulk-sync
to other nodes indefinitely.
It's not completely wrong to perform a pre-flight check outside of the
critical section. It allows for an early return in the no-op case
without having to bear the cost of synchronization. But such optimistic
concurrency control is only sound if the condition is double-checked
inside the critical section. It is tricky to get right, and this
instance of optimistic concurrency control smells like a case of
premature optimization. Move the pre-flight check into the critical
section to ensure that the invariant is maintained.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 270a4d41dc)
Signed-off-by: Cory Snider <csnider@mirantis.com>
NetworkDB's Watch() facility is problematic to use in practice. The
stream of events begins when the watch is started, so the watch cannot
be used to process table entries that existed beforehand. Either option
to process existing table entries is racy: walking the table before
starting the watch leaves a race window where events could be missed,
and walking the table after starting the watch leaves a race window
where created/updated entries could be processed twice.
Modify Watch() to initialize the channel with synthetic CREATE events
for all existing entries owned by remote nodes before hooking it up to
the live event stream. This way watchers observe an equivalent sequence
of events irrespective of whether the watch was started before or after
entries from remote nodes are added to the database. Remove the bespoke
and racy synthetic event replay logic for driver watches from the
libnetwork agent.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit a3aea15257)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The gossip protocol which powers NetworkDB does not guarantee in-order
reception of events. This poses a problem with deleting entries: without
some mechanism to discard stale CREATE or UPDATE events received after a
DELETE, out-of-order reception of events could result in a deleted entry
being spuriously resurrected in the local NetworkDB state! NetworkDB
handles this situation by storing "tombstone" entries for a period of
time with the Lamport timestamps of the entries' respective DELETE
events. Out-of-order CREATE or UPDATE events will be ignored by virtue
of having older timestmaps than the tombstone entry, just like how it
works for entries that have not yet been deleted.
NetworkDB was only storing a tombstone if the entry was already present
in the local database at the time of the DELETE event. If the first
event received for an entry is a DELETE, no tombstone is stored. If a
stale CREATE/UPDATE event for the entry (with an older timestamp than
the DELETE) is subsequently received, NetworkDB erroneously creates a
live entry in the local state with stale data. Modify NetworkDB to store
tombstones for DELETE events irrespective of whether the entry was known
to NetworkDB beforehand so that it correctly discards out-of-order
CREATEs and UPDATEs in all cases.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit ada8bc3695)
Signed-off-by: Cory Snider <csnider@mirantis.com>
NetworkDB gossips changes to table entries to other nodes using distinct
CREATE, UPDATE and DELETE events. It is unfortunate that the wire
protocol distinguishes CREATEs from UPDATEs as nothing useful can be
done with this information. Newer events for an entry invalidate older
ones, so there is no guarantee that a CREATE event is broadcast to any
node before an UPDATE is broadcast. And due to the nature of gossip
protocols, even if the CREATE event is broadcast from the originating
node, there is no guarantee that any particular node will receive the
CREATE before an UPDATE. Any code which handles an UPDATE event
differently from a CREATE event is therefore going to behave in
unexpected ways in less than perfect conditions.
NetworkDB table watchers also receive CREATE, UPDATE and DELETE events.
Since the watched tables are local to the node, the events could all
have well-defined meanings that are actually useful. Unfortunately
NetworkDB is just bubbling up the wire-protocol event types to the
watchers. Redefine the table-watch events such that a CREATE event is
broadcast when an entry pops into existence in the local NetworkDB, an
UPDATE event is broadcast when an entry which was already present in the
NetworkDB state is modified, and a DELETE event is broadcast when an
entry which was already present in the NetworkDB state is marked for
deletion. DELETE events are broadcast with the same value as the most
recent CREATE or UPDATE event for the entry.
The handler for endpoint table events in the libnetwork agent assumed
incorrectly that CREATE events always correspond to adding a new active
endpoint and that UPDATE events always correspond to disabling an
endpoint. Fix up the handler to handle CREATE and UPDATE events using
the same code path, checking the table entry's ServiceDisabled flag to
determine which action to take.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit c68671d908)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The function was accessing the index map without holding the mutex, so
it would race any mutation to the database indexes. Fetch the reference
to the tree's root while holding a read lock. Since the radix tree is
immutable, taking a reference to the root is equivalent to starting a
read-only database transaction, providing a consistent view of the data
at a snapshot in time, even as the live state is mutated concurrently.
Also optimize the WalkTable function by leveraging the immutability of
the radix tree.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit ec65f2d21b)
(*NetworkDB).SetPrimaryKey() acquires a read lock on the NetworkDB
instance. That seems sound on the surface as it is only reading from the
NetworkDB struct, not mutating it. However, concurrent calls to
(*memberlist.Keyring).UseKey() would get flagged by Go's race detector
due to some questionable locking in its implementation. Acquire an
exclusive lock in SetPrimaryKey so concurrent calls don't race each
other.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit c9b01e0c4c)
Signed-off-by: Drew Erny <derny@mirantis.com>
Logic was added to the Swarm executor in commit 0d9b0ed678
to clean up managed networks whenever the node's load-balancer IP
address is removed or changed in order to free up the address in the
case where the container fails to start entirely. Unfortunately, due to
an oversight the function returns early if the Swarm is lacking
an ingress network. Remove the early return so that load-balancer IP
addresses for all the other networks are freed as appropriate,
irrespective of whether an ingress network exists in the Swarm.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 56ad941564)
Signed-off-by: Drew Erny <derny@mirantis.com>
NetworkDB uses a hierarchy of queues to prioritize messages for
broadcast. Unfortunately the logic to pull from multiple queues is
flawed. The length of the messages pulled from the first queue is not
taken into account when pulling messages from the second queue. A list
of messages up to tiwce as long as the limit could be returned! Messages
beyond the limit will be truncated unceremoniously by memberlist.
Memberlist broadcast queues assume that all messages returned from a
GetBroadcasts call will be broadcasted to other nodes in the cluster.
Messages are popped from the queue once they have hit their retransmit
limit. On a busy system messages may be broadcast fewer times than
intended, possibly even being dropped without ever being broadcast!
Subtract the length of messages pulled from the first queue from the
broadcast size limit so the limit is not exceeded when pulling from the
second queue.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit dacf445614)
Signed-off-by: Drew Erny <derny@mirantis.com>
NetworkDB defaults to binding to the unspecified address for gossip
communications, with no advertise address set. In this configuration,
the memberlist instance listens on all network interfaces and picks one
of the host's public IP addresses as the advertise address.
The NetworkDB unit tests don't override this default, leaving them
vulnerable to flaking out as a result of rogue network traffic
perturbing the test, or the inferred advertise address not being useable
for loopback testing. And macOS prompts for permission to allow the test
executable to listen on public interfaces every time it is rebuilt.
Modify the NetworkDB tests to explicitly bind to, advertise, and join
ports on 127.0.0.1 to make the tests more robust to flakes in CI and
more convenient to run locally.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 90ec2c209b)
Signed-off-by: Drew Erny <derny@mirantis.com>
The NetworkDB unit tests instantiate clusters which communicate over
loopback where every "node" listens on a distinct localhost port. The
tests make use of a NetworkDB configuration knob to set the port. When
the NetworkDB configuration's BindPort field is set to a nonzero value,
its memberlist instance is configured to bind to the specified port
number. However, the advertise port is left at the
memberlist.DefaultLANConfig() default value of 7946. Because of this,
nodes would be unable to contact any of the other nodes in the cluster
learned by gossip as the gossiped addresseses specify the wrong ports!
The flaky tests passed as often as they did thanks to the robustness of
the memberlist module: NetworkDB gossip and and memberlist node
liveness-probe pings to unreachable nodes can all be relayed through
the reachable nodes, the nodes on the bootstrap join list.
Make the NetworkDB unit tests less flaky by setting each node's
advertise port to the bind port.
The daemon is unaffected by this oversight as it unconditionally uses
the default listen port of 7946, which aligns with the advertise port.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit e3f9edd348)
Signed-off-by: Drew Erny <derny@mirantis.com>
Mark the following tests as flaky:
- TestNetworkDBCRUDTableEntry
- TestNetworkDBCRUDTableEntries
- TestNetworkDBIslands
- TestNetworkDBNodeLeave
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit 9893520c62)
libnetwork tests tend to be flaky (namely `TestNetworkDBIslands` and
`TestNetworkDBCRUDTableEntries`).
Move execution of tests which name has `TestFlaky` prefix to a separate
gotestsum pass which allows them to be reran 4 times.
On Windows, the libnetwork test execution is not split into a separate
pass.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit d0d8d5d97d)
Go maintainers started to unconditionally update the minimum go version
for golang.org/x/ dependencies to go1.23, which means that we'll no longer
be able to support any version below that when updating those dependencies;
> all: upgrade go directive to at least 1.23.0 [generated]
>
> By now Go 1.24.0 has been released, and Go 1.22 is no longer supported
> per the Go Release Policy (https://go.dev/doc/devel/release#policy).
>
> For golang/go#69095.
This updates our minimum version to go1.23, as we won't be able to maintain
compatibility with older versions because of the above.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 7c52c4d92e)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
# Conflicts:
# api/server/router/container/inspect.go
# api/server/router/grpc/grpc.go
# api/server/router/system/system.go
# api/server/router/system/system_routes.go
# api/types/registry/registry.go
# api/types/registry/registry_test.go
# builder/builder-next/adapters/containerimage/pull.go
# container/view.go
# daemon/container_operations.go
# daemon/containerd/image_inspect.go
# daemon/containerd/image_push_test.go
# daemon/create.go
# daemon/daemon.go
# daemon/daemon_unix.go
# daemon/info.go
# daemon/inspect.go
# daemon/logger/loggerutils/logfile.go
# internal/gocompat/modulegenerator.go
# internal/maputil/maputil.go
# internal/platform/platform_linux.go
# internal/sliceutil/sliceutil.go
# libnetwork/config/config.go
# libnetwork/drivers/bridge/port_mapping_linux.go
# libnetwork/drivers/overlay/peerdb.go
# libnetwork/endpoint.go
# libnetwork/endpoint_store.go
# libnetwork/internal/l2disco/unsol_arp_linux.go
# libnetwork/internal/l2disco/unsol_na_linux.go
# libnetwork/internal/nftables/nftables_linux.go
# libnetwork/internal/resolvconf/resolvconf.go
# libnetwork/internal/setmatrix/setmatrix.go
# libnetwork/ipams/defaultipam/address_space.go
# libnetwork/ipamutils/utils.go
# libnetwork/iptables/iptables.go
# libnetwork/netutils/utils_linux.go
# libnetwork/network.go
# libnetwork/network_store.go
# libnetwork/networkdb/networkdb.go
# libnetwork/options/options.go
# libnetwork/osl/interface_linux.go
# libnetwork/osl/route_linux.go
# libnetwork/portallocator/portallocator.go
# libnetwork/sandbox.go
# libnetwork/service.go
# oci/defaults.go
# plugin/v2/plugin_linux.go
# testutil/daemon/daemon.go
# testutil/helpers.go
Go maintainers started to unconditionally update the minimum go version
for golang.org/x/ dependencies to go1.23, which means that we'll no longer
be able to support any version below that when updating those dependencies;
> all: upgrade go directive to at least 1.23.0 [generated]
>
> By now Go 1.24.0 has been released, and Go 1.22 is no longer supported
> per the Go Release Policy (https://go.dev/doc/devel/release#policy).
>
> For golang/go#69095.
This updates our minimum version to go1.23, as we won't be able to maintain
compatibility with older versions because of the above.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 6e8eb8a90f)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
# Conflicts:
# hack/with-go-mod.sh
# vendor.mod
full diff: https://github.com/golang/go/compare/go1.23.7...go1.23.8
release notes: https://go.dev/doc/devel/release#go1.24.2
go1.23.8 (released 2025-04-01) includes security fixes to the net/http package,
as well as bug fixes to the runtime and the go command. See the Go 1.23.8
milestone on our issue tracker for details;
https://github.com/golang/go/issues?q=milestone%3AGo1.23.8+label%3ACherryPickApproved
From the mailing list:
Hello gophers,
We have just released Go versions 1.24.2 and 1.23.8, minor point releases.
These minor releases include 1 security fixes following the security policy:
- net/http: request smuggling through invalid chunked data
The net/http package accepted data in the chunked transfer encoding
containing an invalid chunk-size line terminated by a bare LF.
When used in conjunction with a server or proxy which incorrectly
interprets a bare LF in a chunk extension as part of the extension,
this could permit request smuggling.
The net/http package now rejects chunk-size lines containing a bare LF.
Thanks to Jeppe Bonde Weikop for reporting this issue.
This is CVE-2025-22871 and Go issue https://go.dev/issue/71988.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 74b71c41ac)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
# Conflicts:
# .github/workflows/.test.yml
# .github/workflows/.windows.yml
# .github/workflows/arm64.yml
# .github/workflows/buildkit.yml
# .github/workflows/codeql.yml
# .github/workflows/test.yml
# .golangci.yml
# Dockerfile
# Dockerfile.simple
# Dockerfile.windows
# hack/dockerfiles/generate-files.Dockerfile
# hack/dockerfiles/govulncheck.Dockerfile
These tests don't actually run the integration-cli suite, but
the global hack/xxx script errors because it's not set;
---> Making bundle: test-docker-py (in bundles/test-docker-py)
---> Making bundle: .integration-daemon-start (in bundles/test-docker-py)
Using test binary /usr/local/cli-integration/docker
# DOCKER_EXPERIMENTAL is set: starting daemon with experimental features enabled!
# cgroup v2 requires TEST_SKIP_INTEGRATION_CLI to be set
make: *** [Makefile:220: test-docker-py] Error 1
Error: Process completed with exit code 2.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 06b87d80ee)
Signed-off-by: Andrey Epifanov <aepifanov@mirantis.com>
This is the fifth patch release in the 1.2.z series of runc. It primarily fixes
an issue caused by an upstream systemd bug.
* There was a regression in systemd v230 which made the way we define
device rule restrictions require a systemctl daemon-reload for our
transient units. This caused issues for workloads using NVIDIA GPUs.
Workaround the upstream regression by re-arranging how the unit properties
are defined.
* Dependency github.com/cyphar/filepath-securejoin is updated to v0.4.1,
to allow projects that vendor runc to bump it as well.
* CI: fixed criu-dev compilation.
* Dependency golang.org/x/net is updated to 0.33.0.
full diff: https://github.com/opencontainers/runc/compare/v1.2.4...v1.2.5
release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.5
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 838ae09a23)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
This minor release include 1 security fix following the security policy:
- crypto/elliptic: timing sidechannel for P-256 on ppc64le
Due to the usage of a variable time instruction in the assembly implementation
of an internal function, a small number of bits of secret scalars are leaked on
the ppc64le architecture. Due to the way this function is used, we do not
believe this leakage is enough to allow recovery of the private key when P-256
is used in any well known protocols.
This is CVE-2025-22866 and Go issue https://go.dev/issue/71383.
View the release notes for more information:
https://go.dev/doc/devel/release#go1.22.12
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit a584f0b227)
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
go1.22.11 (released 2025-01-16) includes security fixes to the crypto/x509 and
net/http packages, as well as bug fixes to the runtime. See the Go 1.22.11
milestone on our issue tracker for details.
- https://github.com/golang/go/issues?q=milestone%3AGo1.22.11+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.22.10...go1.22.11
Hello gophers,
We have just released Go versions 1.23.5 and 1.22.11, minor point releases.
These minor releases include 2 security fixes following the security policy:
- crypto/x509: usage of IPv6 zone IDs can bypass URI name constraints
A certificate with a URI which has a IPv6 address with a zone ID may
incorrectly satisfy a URI name constraint that applies to the certificate
chain.
Certificates containing URIs are not permitted in the web PKI, so this
only affects users of private PKIs which make use of URIs.
Thanks to Juho Forsén of Mattermost for reporting this issue.
This is CVE-2024-45341 and Go issue https://go.dev/issue/71156.
- net/http: sensitive headers incorrectly sent after cross-domain redirect
The HTTP client drops sensitive headers after following a cross-domain redirect.
For example, a request to a.com/ containing an Authorization header which is
redirected to b.com/ will not send that header to b.com.
In the event that the client received a subsequent same-domain redirect, however,
the sensitive headers would be restored. For example, a chain of redirects from
a.com/, to b.com/1, and finally to b.com/2 would incorrectly send the Authorization
header to b.com/2.
Thanks to Kyle Seely for reporting this issue.
This is CVE-2024-45336 and Go issue https://go.dev/issue/70530.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit c83862c541)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
This is the fourth patch release of the 1.2.z release branch of runc. It
includes a fix for a regression introduced in 1.2.0 related to the
default device list.
- Re-add tun/tap devices to built-in allowed devices lists.
In runc 1.2.0 we removed these devices from the default allow-list
(which were added seemingly by accident early in Docker's history) as
a precaution in order to try to reduce the attack surface of device
inodes available to most containers. At the time we thought
that the vast majority of users using tun/tap would already be
specifying what devices they need (such as by using --device with
Docker/Podman) as opposed to doing the mknod manually, and thus
there would've been no user-visible change.
Unfortunately, it seems that this regressed a noticeable number of
users (and not all higher-level tools provide easy ways to specify
devices to allow) and so this change needed to be reverted. Users
that do not need these devices are recommended to explicitly disable
them by adding deny rules in their container configuration.
full diff: https://github.com/opencontainers/runc/compare/v1.2.3...v1.2.4
release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.4
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit aad7bcedd2)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
This is the third patch release of the 1.2.z release branch of runc. It
primarily fixes some minor regressions introduced in 1.2.0.
- Fixed a regression in use of securejoin.MkdirAll, where multiple
runc processes racing to create the same mountpoint in a shared rootfs
would result in spurious EEXIST errors. In particular, this regression
caused issues with BuildKit.
- Fixed a regression in eBPF support for pre-5.6 kernels after upgrading
Cilium's eBPF library version to 0.16 in runc.
full diff: https://github.com/opencontainers/runc/compare/v1.2.2...v1.2.3
release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.3
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit ec5c9e06e3)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
The output variable was renamed in 0503cf2510,
but that commit failed to change this defer, which was now checking the
wrong error.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 01a55860c6)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fixes compatibility with alpine 3.21
- Fix additional possible `xx-cc`/`xx-cargo` compatibility issue with Alpine 3.21
- Support for Alpine 3.21
- Fix `xx-verify` with `file` 5.46+
- Fix possible error taking lock in `xx-apk` in latest Alpine without `coreutils`
full diff: https://github.com/tonistiigi/xx/compare/v1.5.0...v1.6.1
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 89899b71a0)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When dockerd is executed with the `dockerd-rootless.sh` script, make
/etc/cdi and /var/run/cdi available to the daemon if they exist.
This makes it possible to enable the CDI integration in rootless mode.
Fixes: #47676
Signed-off-by: Rafael Fernández López <ereslibre@ereslibre.es>
(cherry picked from commit 4e30acb63f)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After the image is tagged, the engine attempts to delete a dangling
image of the source image, so the image is no longer dangling.
When the source image is not dangling, the removal errors out (as
expected), but a warning is logged to the daemon log:
```
time="2024-12-02T10:44:25.386957553Z" level=warning msg="unexpected error when deleting dangling image" error="NotFound: image \"moby-dangling@sha256:54d8c2251c811295690b53af7767ecaf246f1186c36e4f2b2a63e0bfa42df045\": not found" imageID="sha256:54d8c2251c811295690b53af7767ecaf246f1186c36e4f2b2a63e0bfa42df045" spanID=bd10a21a07830d72 tag="docker.io/library/test:latest" traceID=4cf61671c2dc6da3dc7a09c0c6ac4e16
```
Remove that log as it causes unnecessary confusion, as the failure is
expected.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
(cherry picked from commit a93f6c61db)
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
- 1.2.2 release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.2
- 1.2.1 release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.1
- 1.2.0 release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.0
Breaking changes and deprecations are included below;
Breaking changes:
Several aspects of how mount options work has been adjusted in a way that
could theoretically break users that have very strange mount option strings.
This was necessary to fix glaring issues in how mount options were being
treated. The key changes are:
- Mount options on bind-mounts that clear a mount flag are now always
applied. Previously, if a user requested a bind-mount with only clearing
options (such as rw,exec,dev) the options would be ignored and the
original bind-mount options would be set. Unfortunately this also means
that container configurations which specified only clearing mount options
will now actually get what they asked for, which could break existing
containers (though it seems unlikely that a user who requested a specific
mount option would consider it "broken" to get the mount options they
asked foruser who requested a specific mount option would consider it
"broken" to get the mount options they asked for). This also allows us to
silently add locked mount flags the user did not explicitly request to be
cleared in rootless mode, allowing for easier use of bind-mounts for
rootless containers.
- Container configurations using bind-mounts with superblock mount flags
(i.e. filesystem-specific mount flags, referred to as "data" in
mount(2), as opposed to VFS generic mount flags like MS_NODEV) will
now return an error. This is because superblock mount flags will also
affect the host mount (as the superblock is shared when bind-mounting),
which is obviously not acceptable. Previously, these flags were silently
ignored so this change simply tells users that runc cannot fulfil their
request rather than just ignoring it.
Deprecated
- runc option --criu is now ignored (with a warning), and the option will
be removed entirely in a future release. Users who need a non-standard
criu binary should rely on the standard way of looking up binaries in
$PATH.
- runc kill option -a is now deprecated. Previously, it had to be specified
to kill a container (with SIGKILL) which does not have its own private PID
namespace (so that runc would send SIGKILL to all processes). Now, this is
done automatically.
- github.com/opencontainers/runc/libcontainer/user is now deprecated, please
use github.com/moby/sys/user instead. It will be removed in a future
release.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit e257856116)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
- validate-prepare and smoke-prepare took 10 seconds; limiting to 10 minutes
- smoke tests took less than 3 minutes; limiting to 10 minutes
- validate: most took under a minute, but "deprecate-integration-cli" took
14 minutes; limiting to 30 minutes
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit a051aba82e)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
To be more explicit on what we're using.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 91c448bfb5)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
- add `--quiet` to suppress pull progress output
- use `./` instead of `$(pwd)` now that relative paths are supported
- set the working directory on the container, so that we don't have to `cd`
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 9a14299540)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Alpine 3.16 has been EOL for some time. Update to the latest version.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 3cb98d759d)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Regular runs are under 5 minutes.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit cfe0d2a131)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Regular runs are under a minute.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit e75f7aca2f)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Make sure the module is loaded, as we're not able to load it from within
the dev-container;
time="2024-11-29T20:40:42Z" level=error msg="Running modprobe br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found in directory /lib/modules/5.15.0-1072-aws\n" error="exit status 1"
Also moving these steps _before_ the "print info" step, so that docker info
doesn't show warnings that bridge-nf-call-iptables and bridge-nf-call-ip6tables
are not loaded.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit cce5dfe1e7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Update the runc binary that's used in CI and for the static packages.
diff: https://github.com/opencontainers/runc/compare/v1.1.13...v1.1.14
Release Notes:
- Fix CVE-2024-45310, a low-severity attack that allowed maliciously configured containers to create empty files and directories on the host.
- Add support for Go 1.23.
- Revert "allow overriding VERSION value in Makefile" and add EXTRA_VERSION.
- rootfs: consolidate mountpoint creation logic.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 2189aa2426)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Update the runc binary that's used in CI and for the static packages.
full diff: https://github.com/opencontainers/runc/compare/v1.1.12...v1.1.13
Release notes:
* If building with Go 1.22.x, make sure to use 1.22.4 or a later version.
* Support go 1.22.4+.
* runc list: fix race with runc delete.
* Fix set nofile rlimit error.
* libct/cg/fs: fix setting rt_period vs rt_runtime.
* Fix a debug msg for user ns in nsexec.
* script/*: fix gpg usage wrt keyboxd.
* CI fixes and misc backports.
* Fix codespell warnings.
* Silence security false positives from golang/net.
* libcontainer: allow containers to make apps think fips is enabled/disabled for testing.
* allow overriding VERSION value in Makefile.
* Vagrantfile.fedora: bump Fedora to 39.
* ci/cirrus: rm centos stream 8.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 9101392309)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
These log-entries were added in 10d57fde44,
but it looks like I accidentally left them as Error-logs following some
debugging (whoops!).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 352b4ff2f1)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
`Parser.ParseMountRaw()` labels anonymous volumes with a `AnonymousLabel` label
(`com.docker.volume.anonymous`) label based on whether a volume has a name
(named volume) or no name (anonymous) (see [1]).
However both `VolumesService.Create()` (see [1]) and `Parser.ParseMountRaw()`
(see [2], [3]) were generating a random name for anonymous volumes. The latter
is called before `VolumesService.Create()` is called, resulting in such volumes
not being labeled as anonymous.
Generating the name was originally done in Create (fc7b904dce),
but duplicated in b3b7eb2723 with the introduction
of the new Mounts field in HostConfig. Duplicating this effort didn't have a
real effect until (`Create` would just skip generating the name), until
618f26ccbc introduced the `AnonymousLabel` in
(v24.0.0, backported to v23.0.0).
Parsing generally should not fill in defaults / generate names, so this patch;
- Removes generating volume names from `Parser.ParseMountRaw()`
- Adds a debug-log entry to `VolumesService.Create()`
- Touches up some logs to use structured logs for easier correlating logs
With this patch applied:
docker run --rm --mount=type=volume,target=/toto hello-world
DEBU[2024-10-24T22:50:36.359990376Z] creating anonymous volume volume-name=0cfd63d4df363571e7b3e9c04e37c74054cc16ff1d00d9a005232d83e92eda02
DEBU[2024-10-24T22:50:36.360069209Z] probing all drivers for volume volume-name=0cfd63d4df363571e7b3e9c04e37c74054cc16ff1d00d9a005232d83e92eda02
DEBU[2024-10-24T22:50:36.360341209Z] Registering new volume reference driver=local volume-name=0cfd63d4df363571e7b3e9c04e37c74054cc16ff1d00d9a005232d83e92eda02
[1]: 032721ff75/volume/service/service.go (L72-L83)
[2]: 032721ff75/volume/mounts/linux_parser.go (L330-L336)
[3]: 032721ff75/volume/mounts/windows_parser.go (L394-L400)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 10d57fde44)
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.