Packets with the given firewall mark are accepted by the bridge
driver's filter-FORWARD rules.
The value can either be an integer mark, or it can include a
mask in the format "<mark>/<mask>".
Signed-off-by: Rob Murray <rob.murray@docker.com>
The stdcopy package is used to produce and read multiplexed streams for
"attach" and "logs". It is used both by the API server (to produce), and
the client (to read / de-multiplex).
Move it to the api package, so that it can be included in the api module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Port bindings are currently sorted — to form groups that should be
mapped in one go — and then normalized by `configurePortBindingIPv[4|6]`.
However, gw_modes might not be the same for IPv4/v6, so the upcoming
split of NATed / routed portmappers will require that they're processed
independently.
With this commit, PBs are now normalized (by calling the `configure...`
funcs), and then sorted. The sort func is updated to group routed PBs.
`needSamePort` was comparing the container's IP address, but this field
was never set by the time it's called. Now it's set, and has a different
value when IPv4 / IPv6 portmappings are mixed, so remove it from the
comparison.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The TestNatNetworkICC and TestFlakyPortMappedHairpinWindows (TestPortMappedHairpinWindows)
tests were frequently failing on Windows with a context timeout;
=== FAIL: github.com/docker/docker/integration/networking TestNatNetworkICC/User_defined_nat_network (9.67s)
nat_windows_test.go:62: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.51/containers/4357bd24c9b77b955ee961530d1f552ce099b3dcbeb396db599971b2396d8b08/start": context deadline exceeded
panic.go:636: assertion failed: error is not nil: Error response from daemon: error while removing network: network mynat has active endpoints (name:"ctr2" id:"dc8d597dafef")
=== FAIL: github.com/docker/docker/integration/networking TestNatNetworkICC (18.34s)
=== FAIL: github.com/docker/docker/integration/networking TestFlakyPortMappedHairpinWindows (13.02s)
nat_windows_test.go:110: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.51/containers/65207ae3d6953d85cd2123feac45af60b059842d570d4f897ea53c813cba3cb4/start": context deadline exceeded
panic.go:636: assertion failed: error is not nil: Error response from daemon: error while removing network: network clientnet has active endpoints (name:"amazing_visvesvaraya" id:"18add58d415e")
These timeouts were set in c1ab6eda4b and
2df4391473, and were shared between Linux
and Windows; likely Windows is slower to start, so these timeouts to be
expected.
Let's increase the context timeout to give it a bit more time.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The Inter-Network Communication rules in the iptables chains
DOCKER-ISOLATION-STAGE-1 / DOCKER-ISOLATION-STAGE-2 (which are
called from filter-FORWARD) currently:
- Block access from containers in one bridge network, to ports
published to host addresses by containers in other bridge
networks, when the userland-proxy is disabled.
- But, that access is allowed when the proxy is enabled.
- Block access to all ports on container addresses in gateway
mode "nat-unprotected" networks.
- But, those ports can be accessed from anywhere else, including
other hosts. Just not other bridge networks.
- Allow access from containers in "nat" bridge networks to published
ports on container addresses in "routed" networks. But, to do that,
extra INC rules are added for the routed network.
The INC rules are no longer needed to block access from containers
in one network to unpublished ports on container addresses in
other networks. Direct routing to containers in NAT networks is
blocked by the "raw-PREROUTING" rules that block access from
untrusted interfaces (all interfaces apart from the network's
own bridge).
Drop these INC rules to resolve the inconsistencies listed above,
with this change:
- Published ports on host addresses can be accessed from containers
in other networks (even without the userland-proxy).
- The rules for direct routing between bridge networks are the same
as the rules for direct routing from outside the Docker host
(allowed for gw modes "routed" and "nat-unprotected", disallowed
for "nat").
Fewer rules, so it's simpler, and perhaps slightly faster.
Internal networks (with no access to networks outside the host)
are also implemented using rules in the DOCKER-ISOLATION chains.
This change moves those rules to a new chain, DOCKER-INTERNAL,
and drops the DOCKER-ISOLATION chains.
Signed-off-by: Rob Murray <rob.murray@docker.com>
We set SO_REUSEADDR on sockets used for host port mappings by
docker-proxy - which means it's possible to bind the same port
on a specific address as well as 0.0.0.0/::.
For TCP sockets, an error is raised when listen() is called on
both sockets - and the port allocator will be called again to
avoid the clash (if the port was allocated from a range, otherwise
the container will just fail to start).
But, for UDP sockets, there's no listen() - so take more care
to avoid the clash in the portallocator.
The port allocator keeps a set of allocated ports for each of
the host IP addresses it's seen, including 0.0.0.0/::. So, if a
mapping to 0.0.0.0/:: is requested, find a port that's free in
the range for each of the known IP addresses (but still only
mark it as allocated against 0.0.0.0/::). And, if a port is
requested for specific host addresses, make sure it's also
free in the corresponding 0.0.0.0/:: set (but only mark it as
allocated against the specific addresses - because the same
port can be allocated against a different specific address).
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because we set SO_REUSEADDR on sockets for host ports, if there
are port mappings for INADDR_ANY (the default) as well as for
specific host ports - bind() cannot be used to detect clashes.
That means, for example, on daemon startup, if the port allocator
returns the first port in its ephemeral range for a specific host
adddress, and the next port mapping is for 0.0.0.0 - the same port
is returned and both bind() calls succeed. Then, the container
fails to start later when listen() spots the problem and it's too
late to find another port.
So, bind and listen to each set of ports as they're allocated
instead of just binding.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because nftables tables/chain aren't fixed, like they are
in iptables - this change makes an assumption about the
bridge driver's naming.
Signed-off-by: Rob Murray <rob.murray@docker.com>
This test is failing frequently on Windows;
=== FAIL: github.com/docker/docker/integration/networking TestPortMappedHairpinWindows (12.37s)
nat_windows_test.go:108: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.49/containers/1181d6510a2f55c742b7b183aa7324eddbc213cd15797428c4062dcb031fb825/start": context deadline exceeded
panic.go:636: assertion failed: error is not nil: Error response from daemon: error while removing network: network clientnet has active endpoints (name:"laughing_lederberg" id:"8605ebbc2c7c")
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Per-network option com.docker.network.bridge.trusted-host-interfaces
accepts a list of interfaces that are allowed to route
directly to a container's published ports in a bridge
network with nat enabled.
This daemon level option disables direct access filtering,
enabling direct access to published ports on container
addresses in all bridge networks, via all host interfaces.
It overlaps with short-term env-var workaround:
DOCKER_INSECURE_NO_IPTABLES_RAW=1
- it does not allow packets sent from outside the host to reach
ports published only to 127.0.0.1
- it will outlive iptables (the workaround was initially intended
for hosts that do not have kernel support for the "raw" iptables
table).
Signed-off-by: Rob Murray <rob.murray@docker.com>
trusted_host_interface have access to published ports on container
addresses - enabling direct routing to the container via those
interfaces.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Commit 27adcd5 ("libnet/d/bridge: drop connections to lo mappings, and
direct remote connections") introduced an iptables rule to drop 'direct'
remote connections made to the container's IP address - for each
published port on the container.
The normal filter-FORWARD rules would then drop packets sent directly to
unpublished ports. This rule was only created along with the rest of port
publishing (when a container's endpoint was selected as its gateway). Until
then, all packets addressed directly to the container's ports were dropped
by the filter-FORWARD rules.
But, the rule doesn't need to be per-port. Just drop packets sent
directly to a container's address unless they originate on the host.
That means fewer rules, that can be created along with the endpoint (then
directly-routed get dropped at the same point whether or not the endpoint
is currently the gateway - very slightly earlier than when it's not the
gateway).
Signed-off-by: Rob Murray <rob.murray@docker.com>
Report FirewallBackend in "docker info".
It's currently "iptables" or "iptables+firewalld" on Linux, and
omitted on Windows.
Signed-off-by: Rob Murray <rob.murray@docker.com>
With firewalld enabled in CI, TestAccessPublishedPortFromHost/userland-proxy=true/IPv6=true
consistently fails when trying to use a link-local address on
eth0 (it's ok for the ULL added by the test).
In a local moby dev container, it passes - although it sometimes
fails when making its request to the host's ::1.
Signed-off-by: Rob Murray <rob.murray@docker.com>
The daemon runs in a separate netns, but when it wants to create
an iptables rule it sends a dbus message to firewalld - which is
running in the host's netns.
Signed-off-by: Rob Murray <rob.murray@docker.com>
For kernels that don't have CONFIG_IP_NF_RAW, if the env
var DOCKER_INSECURE_NO_IPTABLES_RAW is set to "1", don't
try to create raw rules.
This means direct routing to published ports is possible
from other hosts on the local network, even if the port
is published to a loopback address.
Signed-off-by: Rob Murray <rob.murray@docker.com>
If the userland-proxy is running, packets from one bridge network
addressed to the host port are not DNAT'd - so that docker-proxy
can pick them up, and therefore the packet bypasses the network
isolation rules.
Without the userland-proxy, there's no way for a packet from one
bridge network to bypass the network isolation rules. So, in this
case, DNAT is not skipped - and that at-least allows packets
originating from the network that published the port to access
the host port.
Commit 0546d90 improved support for routed mode networks (allowing
nat-mode networks access to containers in routed-mode networks, as
well as just remote access).
That commit changed the "SKIP DNAT" logic, making sure DNAT was
skipped for a routed-mode network if the userland-proxy was enabled
(so, containers in routed mode networks could access ports published
by other networks).
But, it still skipped DNAT for a routed mode network if the userland
proxy was disabled - packets from the routed mode network aimed at
any other network would be dropped by the network isolation rules
anyway, and containers in a routed mode network don't need access to
ports published from that network (because, by definition, there
can't be any).
However, network isolation rules can be worked-around with a rule
in the DOCKER-USER chain, but the SKIP DNAT rule is harder to deal
with.
So, for routed-mode, only skip DNAT if the userland-proxy is
enabled (just like nat-mode networks).
Signed-off-by: Rob Murray <rob.murray@docker.com>
When an IPv6 network is first created with no specific IPAM config,
network inspect adds a CIDR range to the gateway address. After the
daemon has been restarted, it's just a plain address.
Once the daaemon's been restated, "info" becomes "config", and the
address is reported correctly from "config".
Make the IPv6 code to report the gateway from "info" use net.IPNet.IP
instead of the whole net.IPNet - like the IPv4 code.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Traditionally when Linux receives remote packets with daddr set to a
loopback address, it reject them as 'martians'. However, when a NAT rule
is applied through iptables this doesn't happen. Our current DNAT rule
used to map host ports to containers is applied unconditionally, even
for such 'martian' packets.
This means a neighbor host (ie. a host connected to the same L2
segment) can send packets to a port mapped on a loopback address. The
purpose of publishing on a loopback address is to make ports
inaccessible to remote hosts -- lack of proper filtering defeats that.
This commit adds an iptables rule to the raw-PREROUTING chain to drop
packets with a loopback dest address and coming from any interface other
than lo.
To accomodate WSL2 mirrored mode, another rule is inserted beforehand to
specifically accept packets coming from the loopback0 interface.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
When a NAT-based port mapping is created, the daemon adds a DNAT rule in
nat-DOCKER to replace the dest addr with the container IP. However, the
daemon never sets up rules to filter packets destined directly to the
container port. This allows a rogue neighbor (ie. a host that shares a
L2 segment with the host) to send packets directly to the container on
its container-side exposed port.
For instance, if container port 5000 is mapped to host port 6000, a
neighbor could send packets directly to the container on its port 5000.
Since nat-DOCKER mangles the dest addr, and the nat table forbids DROP
rules, this change adds a new rule in the raw-PREROUTING chain to filter
ingress connections targeting the container's IP address.
This filtering is only done when gw_mode=nat. For the unprotected
variant, no filtering is done.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Commit fc7caf96d reverted 433b1f9b1 as it was introducing a regression,
ie. containers couldn't reach ports published on the host using their
gateway's IP address or the host IP address.
These scenarios are now tested.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
This test has failed a couple of times in CI, but can't repro locally.
Let's find out whether there are any clues in the daemon log.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Now a gratuitous/unsolicted ARP is sent, there's no need to
use an IPv4-based MAC address to preserve arp-cache mappings
between an endpoint's IP addresses and its MAC addresses.
Because a random MAC address is used for the default bridge,
it no longer makes sense to derive container IPv6 addresses
from the MAC address. This "postIPv6" behaviour was needed
before IPv6 addresses could be configured, but not now. So,
IPv6 addresses will now be IPAM-allocated on the default
bridge network, just as they are for user-defined bridges.
Signed-off-by: Rob Murray <rob.murray@docker.com>
When a NAT-based port mapping is created with a HostIP specified, we
insert a DNAT rule in nat-DOCKER to replace the dest addr with the
container IP. Then, in filter chains, we allow access to the container
port for any packet not coming from the container's network itself (if
hairpinning is disabled), nor from another host bridge.
However we don't set any rule that prevents a rogue neighbor that shares
a L2 segment with the host, but not the one where the port binding is
expected to be published, from sending packets destined to that HostIP.
For instance, if a port binding is created with HostIP == '127.0.0.1',
this port should not be accessible from anything but the lo interface.
That's currently not the case and this provides a false sense of
security.
Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP
rules, this change adds rules into raw-PREROUTING to filter ingress
packets destined to mapped ports based on the input interface, the dest
addr and the dest port.
Interfaces are dynamically resolved when packets hit the host, thanks
to iptables' addrtype extension. This extension does a fib lookup of the
dest addr and checks that it's associated with the interface reached.
Also, when a proxy-based port mapping is created, as is the case when an
IPv6 HostIP is specified but the container is only IPv4-capable, we
don't set any sort of filtering. So the same issue might happen. The
reason is a bit different - in that case, that's just how the kernel
works. But, in order to stay consistent with NAT-based mappings, these
rules are also applied.
The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any
true-ish value to globally disable this behavior.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Same as "nat" mode, there's masquerading and port mapping from the
host - but no port/protocol filtering for direct access to the
container's address from remote hosts.
This is the old default behaviour for IPv4 when the filter-FORWARD
chain's default policy was "ACCEPT" (the daemon would only set it
to "DROP" when it set sysctl "ip_forward" itself, but it didn't set
up DROP rules for unpublished ports).
Now, port filtering doesn't depend on the filter-FORWARD policy. So,
this mode is added as a way to restore the old/surprising/insecure
behaviour for anyone who's depending on it. Networks will need to
be re-created with this new gateway mode.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Use the same logic to generate IPAMConf for IPv6 as for IPv4.
- When no fixed-cidr-v6 is specified, rather than error out, use
the default address pools (as for an IPv4 default bridge with no
fixed-cidr, and as for user-defined networks).
- Add daemon option --bip6, similar to --bip.
- Necessary because it's the only way to override an old address
on docker0 (daemon-managed default bridge), as illustrated by
test cases.
- For a user-managed default bridge (--bridge), use IPv6 addresses
on the user's bridge to determine the pool, sub-pool and gateway.
Following the same rules as IPv4.
- Don't set up IPv6 IPAMConf if IPv6 is not enabled.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Add an integration test to check that a container on a network
with gateway-mode=nat can access a container on a network with
gateway-mode=routed, but not vice-versa.
Signed-off-by: Rob Murray <rob.murray@docker.com>