kernel-memory limits are not supported in cgroups v2, and were obsoleted in
[kernel v5.4], producing a `ENOTSUP` in kernel v5.16. Support for this option
was removed in runc and other runtimes, as various LTS kernels contained a
broken implementation, resulting in unpredictable behavior.
We deprecated this option in [moby@b8ca7de], producing a warning when used,
and actively ignore the option since [moby@0798f5f].
Given that setting this option had no effect in most situations, we should
just remove this option instead of continuing to handle it with the expectation
that a runtime may still support it.
Note that we still support RHEL 8 (kernel 4.18) and RHEL 9 (kernel 5.14). We
no longer build packages for Ubuntu 20.04 (kernel 5.4) and Debian Bullseye 11
(kernel 5.10), which still have an LTS / ESM programme, but for those it would
only impact situations where a runtime is used that still supports it, and
an old API version was used.
[kernel v5.4]: https://github.com/torvalds/linux/commit/0158115f702b0ba208ab0
[moby@b8ca7de]: b8ca7de823
[moby@0798f5f]: 0798f5f5cf
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These comments were added to enforce using the correct import path for
our packages ("github.com/docker/docker", not "github.com/moby/moby").
However, when working in go module mode (not GOPATH / vendor), they have
no effect, so their impact is limited.
Remove these imports in preparation of migrating our code to become an
actual go module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `BridgeNfIptables` and `BridgeNfIp6tables` fields in the
`GET /info` response were deprecated in API v1.48, and are now omitted
in API v1.50.
With this patch, old API version continue to return the field:
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.48/info | jq .BridgeNfIp6tables
false
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.48/info | jq .BridgeNfIptables
false
Omitting the field in API v1.50 and above
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.50/info | jq .BridgeNfIp6tables
null
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.50/info | jq .BridgeNfIptables
null
This reverts commit eacbbdeec6, and re-applies
a variant of 5d2006256f
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This reverts commit 5d2006256f, which
caused some issues in the docker/cli formatting code that needs some
investigating.
Let's (temporarily) revert this while we look what's wrong.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `BridgeNfIptables` and `BridgeNfIp6tables` fields in the
`GET /info` response were deprecated in API v1.48, and are now omitted
in API v1.49.
With this patch, old API version continue to return the field:
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.48/info | jq .BridgeNfIp6tables
false
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.48/info | jq .BridgeNfIptables
false
Omitting the field in API v1.49 and above
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.49/info | jq .BridgeNfIp6tables
null
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.49/info | jq .BridgeNfIptables
null
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The netfilter module is now loaded on-demand, and no longer during daemon
startup, making these fields obsolete. These fields are now always `false`
and will be removed in the next relase.
This patch deprecates:
- the `BridgeNfIptables` field in `api/types/system.Info`
- the `BridgeNfIp6tables` field in `api/types/system.Info`
- the `BridgeNFCallIPTablesDisabled` field in `pkg/sysinfo.SysInfo`
- the `BridgeNFCallIP6TablesDisabled` field in `pkg/sysinfo.SysInfo`
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Omit `KernelMemory` and `KernelMemoryTCP` fields in `/info` response if they're
not supported, or when using API v1.42 or up.
- Re-enable detection of `KernelMemory` (as it's still needed for older API versions)
- Remove warning about kernel memory TCP in daemon logs (a warning is still returned
by the `/info` endpoint, but we can consider removing that).
- Prevent incorrect "Minimum kernel memory limit allowed" error if the value was
reset because it's not supported by the host.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We pass the SysInfo struct to all functions. Adding cg2Controllers as a
(non-exported) field makes passing around this information easier.
Now that infoCollector and infoCollectorV2 have the same signature, we can
simplify some bits and use a single slice for all "collectors".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We pass the SysInfo struct to all functions. Adding cg2GroupPath as a
(non-exported) field makes passing around this information easier.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We pass the SysInfo struct to all functions. Adding cgMounts as a
(non-exported) field makes passing around this information easier.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.
Same for CPU realtime.
For compatibility reasons, /info stays the same for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use Getpid and SchedGetaffinity from golang.org/x/sys/unix to get the
number of CPUs in numCPU on Linux.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.
Fixes#38332
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
This fix tries to address the issue raised in 37038 where
there were no memory.kernelTCP support for linux.
This fix add MemoryKernelTCP to HostConfig, and pass
the config to runtime-spec.
Additional test case has been added.
This fix fixes 37038.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Using a value such as `--cpuset-mems=1-9223372036854775807` would cause
`dockerd` to run out of memory allocating a map of the values in the
validation code. Set limits to the normal limit of the number of CPUs,
and improve the error handling.
Reported by Huawei PSIRT.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
containers may specify these cgroup values at runtime. This will allow
processes to change their priority to real-time within the container
when CONFIG_RT_GROUP_SCHED is enabled in the kernel. See #22380.
Also added sanity checks for the new --cpu-rt-runtime and --cpu-rt-period
flags to ensure that that the kernel supports these features and that
runtime is not greater than period.
Daemon will support a --cpu-rt-runtime flag to initialize the parent
cgroup on startup, this prevents the administrator from alotting runtime
to docker after each restart.
There are additional checks that could be added but maybe too far? Check
parent cgroups to ensure values are <= parent, inspecting rtprio ulimit
and issuing a warning.
Signed-off-by: Erik St. Martin <alakriti@gmail.com>
This fix tries to fix wrong CPU count after CPU hot-plugging.
On windows, GetProcessAffinityMask has been used to probe the
number of CPUs in real time.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Before this patch libcontainer badly errored out with `invalid
argument` or `numerical result out of range` while trying to write
to cpuset.cpus or cpuset.mems with an invalid value provided.
This patch adds validation to --cpuset-cpus and --cpuset-mems flag along with
validation based on system's available cpus/mems before starting a container.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Carried: #14015
If kernel is compiled with CONFIG_FAIR_GROUP_SCHED disabled cpu.shares
doesn't exist.
If kernel is compiled with CONFIG_CFQ_GROUP_IOSCHED disabled blkio.weight
doesn't exist.
If kernel is compiled with CONFIG_CPUSETS disabled cpuset won't be
supported.
We need to handle these conditions by checking sysinfo and verifying them.
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Memory swappiness option takes 0-100, and helps to tune swappiness
behavior per container.
For example, When a lower value of swappiness is chosen
the container will see minimum major faults. When no value is
specified for memory-swappiness in docker UI, it is inherited from
parent cgroup. (generally 60 unless it is changed).
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
If memory cgroup is mounted, memory limit is always supported,
no need to check if these files are exist.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>