kernel-memory limits are not supported in cgroups v2, and were obsoleted in
[kernel v5.4], producing a `ENOTSUP` in kernel v5.16. Support for this option
was removed in runc and other runtimes, as various LTS kernels contained a
broken implementation, resulting in unpredictable behavior.
We deprecated this option in [moby@b8ca7de], producing a warning when used,
and actively ignore the option since [moby@0798f5f].
Given that setting this option had no effect in most situations, we should
just remove this option instead of continuing to handle it with the expectation
that a runtime may still support it.
Note that we still support RHEL 8 (kernel 4.18) and RHEL 9 (kernel 5.14). We
no longer build packages for Ubuntu 20.04 (kernel 5.4) and Debian Bullseye 11
(kernel 5.10), which still have an LTS / ESM programme, but for those it would
only impact situations where a runtime is used that still supports it, and
an old API version was used.
[kernel v5.4]: https://github.com/torvalds/linux/commit/0158115f702b0ba208ab0
[moby@b8ca7de]: b8ca7de823
[moby@0798f5f]: 0798f5f5cf
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These comments were added to enforce using the correct import path for
our packages ("github.com/docker/docker", not "github.com/moby/moby").
However, when working in go module mode (not GOPATH / vendor), they have
no effect, so their impact is limited.
Remove these imports in preparation of migrating our code to become an
actual go module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- use t.TempDir()
- combine various tests to check if New() sets expected values instead
of skipping tests when not.
- remove gotest.tools, as it was only used minimally
- replace uses of "path" for filepath operations.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The netfilter module is now loaded on-demand, and no longer during daemon
startup, making these fields obsolete. These fields are now always `false`
and will be removed in the next relase.
This patch deprecates:
- the `BridgeNfIptables` field in `api/types/system.Info`
- the `BridgeNfIp6tables` field in `api/types/system.Info`
- the `BridgeNFCallIPTablesDisabled` field in `pkg/sysinfo.SysInfo`
- the `BridgeNFCallIP6TablesDisabled` field in `pkg/sysinfo.SysInfo`
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The github.com/containerd/containerd/log package was moved to a separate
module, which will also be used by upcoming (patch) releases of containerd.
This patch moves our own uses of the package to use the new module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Omit `KernelMemory` and `KernelMemoryTCP` fields in `/info` response if they're
not supported, or when using API v1.42 or up.
- Re-enable detection of `KernelMemory` (as it's still needed for older API versions)
- Remove warning about kernel memory TCP in daemon logs (a warning is still returned
by the `/info` endpoint, but we can consider removing that).
- Prevent incorrect "Minimum kernel memory limit allowed" error if the value was
reset because it's not supported by the host.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- remove KernelMemory option from `v1.42` api docs
- remove KernelMemory warning on `/info`
- update changes for `v1.42`
- remove `KernelMemory` field from endpoints docs
Signed-off-by: aiordache <anca.iordache@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Reimplement GetCgroupMounts using the github.com/containerd/cgroups and
github.com/moby/sys/mountinfo packages.
Signed-off-by: Cory Snider <csnider@mirantis.com>
This replaces the local SeccompSupported() utility for the implementation in containerd,
which performs the same check.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
The "quiet" argument was only used in a single place (at daemon startup), and
every other use had to pass "false" to prevent this function from logging
warnings.
Now that SysInfo contains the warnings that occurred when collecting the
system information, we can make leave it up to the caller to use those
warnings (and log them if wanted).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We pass the SysInfo struct to all functions. Adding cg2Controllers as a
(non-exported) field makes passing around this information easier.
Now that infoCollector and infoCollectorV2 have the same signature, we can
simplify some bits and use a single slice for all "collectors".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We pass the SysInfo struct to all functions. Adding cgMounts as a
(non-exported) field makes passing around this information easier.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes it clearer that this code is the cgroups v1 equivalent of newV2().
Also moves the "options" handling to newV2() because it's currently only used
for cgroupsv2.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
libcontainer does not guarantee a stable API, and is not intended
for external consumers.
this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
As it turns out, we call this function every time someone calls `docker
info`, every time a contianer is created, and every time a container is
started.
Certainly this should be refactored as a whole, but for now, memoize the
seccomp value.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.
Same for CPU realtime.
For compatibility reasons, /info stays the same for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, commit 69cf03700f chose not to use information
already fetched, and called cgroups.FindCgroupMountpoint() instead.
This is not a cheap call, as it has to parse the whole nine yards
of /proc/self/mountinfo, and the info which it tries to get (whether
the pids controller is present) is already available from cgMounts map.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
enable resource limitation by disabling cgroup v1 warnings
resource limitation still doesn't work with rootless mode (even with systemd mode)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.
Fixes#38332
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
Please refer to `docs/rootless.md`.
TLDR:
* Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
* `dockerd-rootless.sh --experimental`
* `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This fix tries to address the issue raised in 37038 where
there were no memory.kernelTCP support for linux.
This fix add MemoryKernelTCP to HostConfig, and pass
the config to runtime-spec.
Additional test case has been added.
This fix fixes 37038.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Changes most references of syscall to golang.org/x/sys/
Ones aren't changes include, Errno, Signal and SysProcAttr
as they haven't been implemented in /x/sys/.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed
per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
Upon each container create I'm seeing these warning **every** time in the
daemon output:
```
WARN[0002] Your kernel does not support swap memory limit
WARN[0002] Your kernel does not support cgroup rt period
WARN[0002] Your kernel does not support cgroup rt runtime
```
Showing them for each container.create() fills up the logs and encourages
people to ignore the output being generated - which means its less likely
they'll see real issues when they happen. In short, I don't think we
need to show these warnings more than once, so let's only show these
warnings at daemon start-up time.
Signed-off-by: Doug Davis <dug@us.ibm.com>
containers may specify these cgroup values at runtime. This will allow
processes to change their priority to real-time within the container
when CONFIG_RT_GROUP_SCHED is enabled in the kernel. See #22380.
Also added sanity checks for the new --cpu-rt-runtime and --cpu-rt-period
flags to ensure that that the kernel supports these features and that
runtime is not greater than period.
Daemon will support a --cpu-rt-runtime flag to initialize the parent
cgroup on startup, this prevents the administrator from alotting runtime
to docker after each restart.
There are additional checks that could be added but maybe too far? Check
parent cgroups to ensure values are <= parent, inspecting rtprio ulimit
and issuing a warning.
Signed-off-by: Erik St. Martin <alakriti@gmail.com>