This package was originally internal, but was moved out when BuildKit
used it for its integration tests. That's no longer the case, so we
can make it internal again.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Move the option-types to the client and in some cases create a
copy for the backend. These types are used to construct query-
args, and not marshaled to JSON, and can be replaced with functional
options in the client.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
With nftables only, never enable IP forwarding on the host. Instead,
return an error on network creation if forwarding is required by a
bridge network and --ip-forward=true, but forwarding is not enabled.
If IPv4 forwarding is not enabled when the daemon is started with
nftables enabled and other config at defaults, the daemon will
exit when it tries to create the default bridge.
Otherwise, network creation will fail with an error if IPv4/IPv6
forwarding is not enabled when a network is created with IPv4/IPv6.
It's the user's responsibility to configure and secure their host
when they run Docker with nftables.
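The check described above can be sketched roughly as follows. This is an
illustration only: the function names are hypothetical and are not the
daemon's code, and it assumes the standard Linux sysctl file
/proc/sys/net/ipv4/ip_forward holds "1" when forwarding is enabled.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// forwardingEnabled interprets the contents of a forwarding sysctl file.
// Hypothetical helper: only the value "1" is treated as enabled.
func forwardingEnabled(sysctlValue string) bool {
	return strings.TrimSpace(sysctlValue) == "1"
}

// checkForwarding returns an error rather than changing host configuration
// when forwarding is needed but disabled.
func checkForwarding(sysctlValue, family string) error {
	if forwardingEnabled(sysctlValue) {
		return nil
	}
	return fmt.Errorf("%s forwarding is disabled on the host; enable it before creating the network", family)
}

func main() {
	const path = "/proc/sys/net/ipv4/ip_forward" // standard Linux location
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("cannot read sysctl:", err)
		return
	}
	if err := checkForwarding(string(data), "IPv4"); err != nil {
		fmt.Println("network creation would fail:", err)
		return
	}
	fmt.Println("IPv4 forwarding is enabled")
}
```

The point of the sketch is that the host's sysctl is only read, never
written.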
Signed-off-by: Rob Murray <rob.murray@docker.com>
Follow-up to 494677f93f, which added
the aliases, but did not yet replace our own use of the nat types.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Packets with the given firewall mark are accepted by the bridge
driver's filter-FORWARD rules.
The value can either be an integer mark, or it can include a
mask in the format "<mark>/<mask>".
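A minimal sketch of parsing such a value is shown below. The type and
function names are hypothetical, and two details are assumptions rather
than documented behaviour: that hex values such as "0x10" are accepted,
and that a missing mask means "match all 32 bits".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fwMark is a hypothetical representation of a firewall mark and mask.
type fwMark struct {
	Mark uint32
	Mask uint32
}

// parseFwMark parses "<mark>" or "<mark>/<mask>". Base 0 lets ParseUint
// accept decimal and 0x-prefixed hex (an assumption for this sketch).
func parseFwMark(s string) (fwMark, error) {
	markStr, maskStr, hasMask := strings.Cut(s, "/")
	mark, err := strconv.ParseUint(markStr, 0, 32)
	if err != nil {
		return fwMark{}, fmt.Errorf("invalid mark %q: %w", markStr, err)
	}
	mask := uint64(0xffffffff) // assumed default: compare every bit
	if hasMask {
		mask, err = strconv.ParseUint(maskStr, 0, 32)
		if err != nil {
			return fwMark{}, fmt.Errorf("invalid mask %q: %w", maskStr, err)
		}
	}
	return fwMark{Mark: uint32(mark), Mask: uint32(mask)}, nil
}

func main() {
	for _, s := range []string{"42", "0x10/0xff", "bad"} {
		m, err := parseFwMark(s)
		fmt.Printf("%q -> %+v, err=%v\n", s, m, err)
	}
}
```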
Signed-off-by: Rob Murray <rob.murray@docker.com>
When an endpoint in a gateway mode "nat" network is selected
as a container's default gateway, the bridge driver sets up
bindings between host and container ports (NAT, userland proxy
etc).
When gateway mode "routed" was added as an alternative to
the default "nat" mode - port bindings followed the same rules.
But, unlike "nat" mode, there's no host port binding to set
up - there's routing between remote client and the container,
so it doesn't matter what the default gateway is.
So, in "routed" mode, set up the rules to make a container's
published ports accessible when the endpoint is added, and
remove those rules when the endpoint is removed (when the
container is disconnected from the endpoint's network).
Port mappings are only provided by ProgramExternalConnectivity;
they can't be set up during the Join. So, include routed
bindings in the port bindings mode that's stored as part of
endpoint state - and use that to work out whether to add or
remove bindings.
Signed-off-by: Rob Murray <rob.murray@docker.com>
The stdcopy package is used to produce and read multiplexed streams for
"attach" and "logs". It is used both by the API server (to produce), and
the client (to read / de-multiplex).
Move it to the api package, so that it can be included in the api module.
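For readers unfamiliar with the multiplexed format, here is a small
sketch of writing and reading one frame. It assumes the 8-byte header
documented for the attach endpoint (one stream-type byte, three zero
bytes, then a big-endian uint32 payload length). The helper names are
hypothetical and are not the stdcopy API.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const (
	headerLen = 8
	stdout    = 1 // stream identifiers as documented for attach
	stderr    = 2
)

// writeFrame prefixes payload with the 8-byte header and writes both.
func writeFrame(w io.Writer, stream byte, payload []byte) error {
	var hdr [headerLen]byte
	hdr[0] = stream
	binary.BigEndian.PutUint32(hdr[4:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one header, then exactly the number of bytes it announces.
func readFrame(r io.Reader) (byte, []byte, error) {
	var hdr [headerLen]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, nil, err
	}
	payload := make([]byte, binary.BigEndian.Uint32(hdr[4:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return 0, nil, err
	}
	return hdr[0], payload, nil
}

// roundTrip multiplexes one frame into a buffer and de-multiplexes it again.
func roundTrip(stream byte, payload string) (byte, string, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, stream, []byte(payload)); err != nil {
		return 0, "", err
	}
	s, p, err := readFrame(&buf)
	return s, string(p), err
}

func main() {
	stream, payload, err := roundTrip(stderr, "oops\n")
	fmt.Printf("stream=%d payload=%q err=%v\n", stream, payload, err)
}
```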
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Port bindings are currently sorted — to form groups that should be
mapped in one go — and then normalized by `configurePortBindingIPv[4|6]`.
However, gw_modes might not be the same for IPv4/v6, so the upcoming
split of NATed / routed portmappers will require that they're processed
independently.
With this commit, PBs are now normalized (by calling the `configure...`
funcs), and then sorted. The sort func is updated to group routed PBs.
`needSamePort` was comparing the container's IP address, but this field
was never set by the time it's called. Now it's set, and has a different
value when IPv4 / IPv6 portmappings are mixed, so remove it from the
comparison.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
We set SO_REUSEADDR on sockets used for host port mappings by
docker-proxy - which means it's possible to bind the same port
on a specific address as well as 0.0.0.0/::.
For TCP sockets, an error is raised when listen() is called on
both sockets - and the port allocator will be called again to
avoid the clash (if the port was allocated from a range, otherwise
the container will just fail to start).
But, for UDP sockets, there's no listen() - so take more care
to avoid the clash in the portallocator.
The port allocator keeps a set of allocated ports for each of
the host IP addresses it's seen, including 0.0.0.0/::. So, if a
mapping to 0.0.0.0/:: is requested, find a port that's free in
the range for each of the known IP addresses (but still only
mark it as allocated against 0.0.0.0/::). And, if a port is
requested for specific host addresses, make sure it's also
free in the corresponding 0.0.0.0/:: set (but only mark it as
allocated against the specific addresses - because the same
port can be allocated against a different specific address).
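The cross-checking between the wildcard and specific-address sets can be
sketched as below. This is a simplified model, not the real port
allocator: the type, its fields and the first-fit search are all
hypothetical, and protocols are ignored.

```go
package main

import (
	"fmt"
	"net/netip"
)

// portAllocator is a hypothetical, simplified allocator: one set of
// allocated ports per host address it has seen, wildcards included.
type portAllocator struct {
	begin, end int
	used       map[netip.Addr]map[int]bool
}

func newPortAllocator(begin, end int) *portAllocator {
	return &portAllocator{begin: begin, end: end, used: map[netip.Addr]map[int]bool{}}
}

// isFree applies the cross-checks: a wildcard request must be free on every
// known address of the same family, and a specific-address request must
// also be free in the matching wildcard set.
func (a *portAllocator) isFree(ip netip.Addr, port int) bool {
	if a.used[ip][port] {
		return false
	}
	if ip.IsUnspecified() {
		for other, ports := range a.used {
			if other.Is4() == ip.Is4() && ports[port] {
				return false
			}
		}
		return true
	}
	wildcard := netip.IPv6Unspecified()
	if ip.Is4() {
		wildcard = netip.IPv4Unspecified()
	}
	return !a.used[wildcard][port]
}

// allocate returns the first free port and marks it only against hostIP.
func (a *portAllocator) allocate(hostIP string) (int, error) {
	ip, err := netip.ParseAddr(hostIP)
	if err != nil {
		return 0, err
	}
	for p := a.begin; p <= a.end; p++ {
		if a.isFree(ip, p) {
			if a.used[ip] == nil {
				a.used[ip] = map[int]bool{}
			}
			a.used[ip][p] = true
			return p, nil
		}
	}
	return 0, fmt.Errorf("no free port for %s in %d-%d", ip, a.begin, a.end)
}

func main() {
	a := newPortAllocator(32768, 32770)
	for _, ip := range []string{"127.0.0.1", "0.0.0.0", "192.168.0.1"} {
		p, err := a.allocate(ip)
		fmt.Println(ip, p, err)
	}
}
```

In this model 127.0.0.1 and 192.168.0.1 can both get the first port in
the range, while 0.0.0.0 is pushed to the next one.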
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because we set SO_REUSEADDR on sockets for host ports, if there
are port mappings for INADDR_ANY (the default) as well as for
specific host addresses - bind() cannot be used to detect clashes.
That means, for example, on daemon startup, if the port allocator
returns the first port in its ephemeral range for a specific host
address, and the next port mapping is for 0.0.0.0 - the same port
is returned and both bind() calls succeed. Then, the container
fails to start later when listen() spots the problem and it's too
late to find another port.
So, bind and listen to each set of ports as they're allocated
instead of just binding.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because nftables tables and chains aren't fixed, as they are
in iptables - this change makes an assumption about the
bridge driver's naming.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Per-network option com.docker.network.bridge.trusted-host-interfaces
accepts a list of interfaces that are allowed to route
directly to a container's published ports in a bridge
network with nat enabled.
This daemon level option disables direct access filtering,
enabling direct access to published ports on container
addresses in all bridge networks, via all host interfaces.
It overlaps with the short-term env-var workaround
DOCKER_INSECURE_NO_IPTABLES_RAW=1 but, unlike the workaround:
- it does not allow packets sent from outside the host to reach
ports published only to 127.0.0.1
- it will outlive iptables (the workaround was initially intended
for hosts that do not have kernel support for the "raw" iptables
table).
Signed-off-by: Rob Murray <rob.murray@docker.com>
Interfaces listed in trusted_host_interface have access to published ports on container
addresses - enabling direct routing to the container via those
interfaces.
Signed-off-by: Rob Murray <rob.murray@docker.com>
With firewalld enabled in CI, TestAccessPublishedPortFromHost/userland-proxy=true/IPv6=true
consistently fails when trying to use a link-local address on
eth0 (it's ok for the ULL added by the test).
In a local moby dev container, it passes - although it sometimes
fails when making its request to the host's ::1.
Signed-off-by: Rob Murray <rob.murray@docker.com>
The daemon runs in a separate netns, but when it wants to create
an iptables rule it sends a dbus message to firewalld - which is
running in the host's netns.
Signed-off-by: Rob Murray <rob.murray@docker.com>
For kernels that don't have CONFIG_IP_NF_RAW, if the env
var DOCKER_INSECURE_NO_IPTABLES_RAW is set to "1", don't
try to create raw rules.
This means direct routing to published ports is possible
from other hosts on the local network, even if the port
is published to a loopback address.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Traditionally, when Linux receives remote packets with daddr set to a
loopback address, it rejects them as 'martians'. However, when a NAT rule
is applied through iptables this doesn't happen. Our current DNAT rule
used to map host ports to containers is applied unconditionally, even
for such 'martian' packets.
This means a neighbor host (i.e. a host connected to the same L2
segment) can send packets to a port mapped on a loopback address. The
purpose of publishing on a loopback address is to make ports
inaccessible to remote hosts -- lack of proper filtering defeats that.
This commit adds an iptables rule to the raw-PREROUTING chain to drop
packets with a loopback dest address and coming from any interface other
than lo.
To accommodate WSL2 mirrored mode, another rule is inserted beforehand to
specifically accept packets coming from the loopback0 interface.
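As a rough illustration, the two rules could be expressed as iptables
arguments like this. The helper is hypothetical, and matching on
protocol and destination port is an assumption made for the example.
The rule specs actually generated by the daemon may differ.

```go
package main

import "fmt"

// loopbackRules returns iptables argument strings for a port published on
// a loopback address. Hypothetical sketch: the ACCEPT rule for the WSL2
// "loopback0" interface comes first, then the DROP rule for any input
// interface other than "lo".
func loopbackRules(proto, hostIP string, port int) []string {
	match := fmt.Sprintf("-p %s -d %s --dport %d", proto, hostIP, port)
	return []string{
		"-t raw -A PREROUTING " + match + " -i loopback0 -j ACCEPT",
		"-t raw -A PREROUTING " + match + " ! -i lo -j DROP",
	}
}

func main() {
	for _, rule := range loopbackRules("tcp", "127.0.0.1", 8080) {
		fmt.Println("iptables", rule)
	}
}
```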
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
When a NAT-based port mapping is created, the daemon adds a DNAT rule in
nat-DOCKER to replace the dest addr with the container IP. However, the
daemon never sets up rules to filter packets destined directly to the
container port. This allows a rogue neighbor (i.e. a host that shares an
L2 segment with the host) to send packets directly to the container on
its container-side exposed port.
For instance, if container port 5000 is mapped to host port 6000, a
neighbor could send packets directly to the container on its port 5000.
Since nat-DOCKER mangles the dest addr, and the nat table forbids DROP
rules, this change adds a new rule in the raw-PREROUTING chain to filter
ingress connections targeting the container's IP address.
This filtering is only done when gw_mode=nat. For the unprotected
variant, no filtering is done.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Commit fc7caf96d reverted 433b1f9b1 as it was introducing a regression,
i.e. containers couldn't reach ports published on the host using their
gateway's IP address or the host IP address.
These scenarios are now tested.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
When a NAT-based port mapping is created with a HostIP specified, we
insert a DNAT rule in nat-DOCKER to replace the dest addr with the
container IP. Then, in filter chains, we allow access to the container
port for any packet not coming from the container's network itself (if
hairpinning is disabled), nor from another host bridge.
However, we don't set any rule that prevents a rogue neighbor that shares
an L2 segment with the host, but not the one where the port binding is
expected to be published, from sending packets destined to that HostIP.
For instance, if a port binding is created with HostIP == '127.0.0.1',
this port should not be accessible from anything but the lo interface.
That's currently not the case and this provides a false sense of
security.
Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP
rules, this change adds rules into raw-PREROUTING to filter ingress
packets destined to mapped ports based on the input interface, the dest
addr and the dest port.
Interfaces are dynamically resolved when packets hit the host, thanks
to iptables' addrtype extension. This extension does a fib lookup of the
dest addr and checks that it's associated with the interface reached.
Also, when a proxy-based port mapping is created, as is the case when an
IPv6 HostIP is specified but the container is only IPv4-capable, we
don't set any sort of filtering. So the same issue might happen. The
reason is a bit different - in that case, that's just how the kernel
works. But, in order to stay consistent with NAT-based mappings, these
rules are also applied.
The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any
true-ish value to globally disable this behavior.
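What counts as "true-ish" is not spelled out here. One plausible
reading, shown purely as an assumption, is whatever Go's
strconv.ParseBool accepts as true ("1", "t", "true" and their
upper-case variants). The helper name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// isTruthy reports whether v parses as a true boolean. Unset, empty or
// unparseable values count as false (an assumption for this sketch).
func isTruthy(v string) bool {
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

func main() {
	disabled := isTruthy(os.Getenv("DOCKER_DISABLE_INPUT_IFACE_FILTERING"))
	fmt.Println("input interface filtering disabled:", disabled)
}
```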
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Same as "nat" mode, there's masquerading and port mapping from the
host - but no port/protocol filtering for direct access to the
container's address from remote hosts.
This is the old default behaviour for IPv4 when the filter-FORWARD
chain's default policy was "ACCEPT" (the daemon would only set it
to "DROP" when it set sysctl "ip_forward" itself, but it didn't set
up DROP rules for unpublished ports).
Now, port filtering doesn't depend on the filter-FORWARD policy. So,
this mode is added as a way to restore the old/surprising/insecure
behaviour for anyone who's depending on it. Networks will need to
be re-created with this new gateway mode.
Signed-off-by: Rob Murray <rob.murray@docker.com>
When a container sends a packet to one of its own published ports on the
host, it's normally picked up by the userland proxy and sent back.
When the userland proxy is disabled, a masquerade rule is needed in
order for responses to the container to have the host's source address.
The masquerade rule matches the container's address as source and dest,
and the published port as the dest. It's only used for the no-proxy
case.
So, when the userland proxy is enabled, don't create the masquerade
rule.
Signed-off-by: Rob Murray <rob.murray@docker.com>
In release 27.0, ip6tables was enabled by default. That caused a
problem on some hosts where iptables was explicitly disabled and
loading the br_netfilter module (which loads with its nf-call-iptables
settings enabled) caused user-defined iptables rules to block traffic
on bridges, breaking inter-container communication.
In 27.3.0, commit 5c499fc4b2 delayed
loading of the br_netfilter module until it was needed. The load
now happens in the function that sets bridge-nf-call-ip[6]tables when
needed. It was only called for icc=false networks.
However, br_netfilter is also needed when userland-proxy=false.
Without it, packets addressed to a host-mapped port for a container
on the same network are not DNAT'd properly (responses have the server
container's address instead of the host's).
That means, in all releases including 26.x, if br_netfilter was loaded
before the daemon started - and the OS/user/other-application had
disabled bridge-nf-call-ip[6]tables, it would not be enabled by the
daemon. So, ICC would fail for host-mapped ports with the userland-proxy
disabled.
The change in 27.3.0 made this worse - previously, loading br_netfilter
whenever iptables/ip6tables was enabled meant that bridge-netfiltering
got enabled, even though the daemon didn't check it was enabled.
So... check that br_netfilter is loaded, with bridge-nf-call-ip[6]tables
enabled, if userland-proxy=false.
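A sketch of the decision and the check follows. It relies on the fact
that /proc/sys/net/bridge/bridge-nf-call-iptables only exists once
br_netfilter is loaded. The function names and status strings are
hypothetical and do not reflect the daemon's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strings"
)

// needsBridgeNetfilter captures the two cases described above:
// icc=false networks, and any network when userland-proxy=false.
func needsBridgeNetfilter(icc, userlandProxy bool) bool {
	return !icc || !userlandProxy
}

// bridgeNfStatus classifies the state of a bridge-nf-call-* sysctl.
func bridgeNfStatus(fileExists bool, content string) string {
	switch {
	case !fileExists:
		return "br_netfilter not loaded"
	case strings.TrimSpace(content) == "1":
		return "enabled"
	default:
		return "disabled"
	}
}

func main() {
	const path = "/proc/sys/net/bridge/bridge-nf-call-iptables"
	data, err := os.ReadFile(path)
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		fmt.Println("cannot read sysctl:", err)
		return
	}
	status := bridgeNfStatus(err == nil, string(data))
	// Example configuration: icc=true, userland-proxy=false.
	if needsBridgeNetfilter(true, false) && status != "enabled" {
		fmt.Println("bridge netfiltering is needed, but it is:", status)
		return
	}
	fmt.Println("bridge-nf-call-iptables:", status)
}
```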
Signed-off-by: Rob Murray <rob.murray@docker.com>