Commit Graph

162 Commits

Author SHA1 Message Date
Sebastiaan van Stijn
b57d41c4bf Merge pull request #49799 from thaJeztah/apparmor_cleanups
profiles/apparmor: add some optimisations and tests
2025-04-16 13:21:25 +02:00
Sebastiaan van Stijn
7eda35fd05 profiles/apparmor: IsLoaded: optimize
- Use a bufio.Scanner to read the profiles
- Use strings.Cut

Before/After:

    BenchmarkIsLoaded-10  2258	    508049 ns/op    244266 B/op    10004 allocs/op
    BenchmarkIsLoaded-10  5680	    208703 ns/op      4264 B/op	       4 allocs/op

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:13 +02:00
Sebastiaan van Stijn
0462b5e318 profiles/apparmor: add BenchmarkIsLoaded
go test -bench=. ./profiles/apparmor/
    goos: linux
    goarch: arm64
    pkg: github.com/docker/docker/profiles/apparmor
    BenchmarkIsLoaded-10    	    2258	    508049 ns/op	  244266 B/op	   10004 allocs/op
    PASS
    ok  	github.com/docker/docker/profiles/apparmor	1.210s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:13 +02:00
Sebastiaan van Stijn
b23d267cb5 profiles/apparmor: add basic unit-test for IsLoaded
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:12 +02:00
Sebastiaan van Stijn
0dd5959eeb profiles/apparmor: InstallDefault: slight cleanup and optimization
The existing code was more complicated than needed. By default, the daemon
runs "unconfined", but we try to detect the current profile that's set.
When failing to do so (error, or detected profile is empty), we assume
the default ("unconfined").

This patch simplifies the logic;

- Set the default ("unconfined")
- Only update the default when we successfully found the current profile
  (no error occurred, and the profile is not empty).

While updating, also;

- Replaced use of `strings.SplitN` for `strings.Cut`, which is more
  efficient, and doesn't allocate.
- Move constructing the profileData closer to where it's used.
- Remove intermediate var.
- Combine defers and change the order (close file first, before removing),
  and suppress errors to keep linters happy.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:12 +02:00
Sebastiaan van Stijn
0bb761698c profiles/apparmor: loadprofile: fix double command in error message
`exec.Cmd.Path` already contains the command that was executed, so we
were printing the command twice. However, `exec.Cmd` implements a stringer
interface, which provides a readable version of the command that was
executed, so use that instead. While updating, lso change backticks in
the error for regular quotes.

Before:

    running `/usr/sbin/apparmor_parser apparmor_parser -Kr /no/such/file` failed with output: Cache read/write disabled: interface file missing. (Kernel needs AppArmor 2.4 compatibility patch.)
    Warning: unable to find a suitable fs in /proc/mounts, is it mounted?
    Use --subdomainfs to override.

    error: exit status 1

After:

    running '/usr/sbin/apparmor_parser -Kr /no/such/file' failed with output: Cache read/write disabled: interface file missing. (Kernel needs AppArmor 2.4 compatibility patch.)
    Warning: unable to find a suitable fs in /proc/mounts, is it mounted?
    Use --subdomainfs to override.

    error: exit status 1

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:12 +02:00
Sebastiaan van Stijn
8e1c366773 profiles/apparmor: remove "// import" comments
We are considering moving the apparmor profile to a separate module,
so removing these comments in preparation. These comments are ignored
already when building in go module mode, so have little benefits.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 13:34:12 +02:00
Sebastiaan van Stijn
1fa6a46c5d profiles/seccomp: remove "// import" comments
We are considering moving the seccomp profile to a separate module,
so removing these comments in preparation. These comments are ignored
already when building in go module mode, so have little benefits.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 10:46:03 +02:00
Sebastiaan van Stijn
89604f1df1 profiles/seccomp: use stdlib for asserting
We are considering moving the seccomp profile to a separate module,
so reducing the list of dependencies for this package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2025-04-12 10:26:10 +02:00
Sebastiaan van Stijn
443a074fa4 profiles/seccomp: remove redundant capturing of loop vars (copyloopvar)
profiles/seccomp/kernel_linux_test.go:52:3: The copy of the 'for' variable "tc" can be deleted (Go 1.22+) (copyloopvar)
            tc := tc
            ^
    profiles/seccomp/kernel_linux_test.go:111:3: The copy of the 'for' variable "tc" can be deleted (Go 1.22+) (copyloopvar)
            tc := tc
            ^
    profiles/seccomp/seccomp_test.go:135:3: The copy of the 'for' variable "tc" can be deleted (Go 1.22+) (copyloopvar)
            tc := tc
            ^
    profiles/seccomp/seccomp_test.go:223:3: The copy of the 'for' variable "tc" can be deleted (Go 1.22+) (copyloopvar)
            tc := tc
            ^
    profiles/seccomp/seccomp_test.go:265:3: The copy of the 'for' variable "tc" can be deleted (Go 1.22+) (copyloopvar)
            tc := tc
            ^

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-11-12 14:02:10 +01:00
Sebastiaan van Stijn
5c48736863 remove redundant alias for runtime-spec
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-10-26 18:31:39 +02:00
George Adams
1161b790cf seccomp: add riscv64 mapping to seccomp_linux.go
Signed-off-by: George Adams <georgeadams1995@gmail.com>
2024-09-09 21:18:30 +01:00
Tomáš Virtus
5ebe2c0d6b apparmor: Allow confined runc to kill containers
/usr/sbin/runc is confined with "runc" profile[1] introduced in AppArmor
v4.0.0. This change breaks stopping of containers, because the profile
assigned to containers doesn't accept signals from the "runc" peer.
AppArmor >= v4.0.0 is currently part of Ubuntu Mantic (23.10) and later.

In the case of Docker, this regression is hidden by the fact that
dockerd itself sends SIGKILL to the running container after runc fails
to stop it. It is still a regression, because graceful shutdowns of
containers via "docker stop" are no longer possible, as SIGTERM from
runc is not delivered to them. This can be seen in logs from dockerd
when run with debug logging enabled and also from tracing signals with
killsnoop utility from bcc[2] (in bpfcc-tools package in Debian/Ubuntu):

  Test commands:

    root@cloudimg:~# docker run -d --name test redis
    ba04c137827df8468358c274bc719bf7fc291b1ed9acf4aaa128ccc52816fe46
    root@cloudimg:~# docker stop test

  Relevant syslog messages (with wrapped long lines):

    Apr 23 20:45:26 cloudimg kernel: audit:
      type=1400 audit(1713905126.444:253): apparmor="DENIED"
      operation="signal" class="signal" profile="docker-default" pid=9289
      comm="runc" requested_mask="receive" denied_mask="receive"
      signal=kill peer="runc"
    Apr 23 20:45:36 cloudimg dockerd[9030]:
      time="2024-04-23T20:45:36.447016467Z"
      level=warning msg="Container failed to exit within 10s of kill - trying direct SIGKILL"
      container=ba04c137827df8468358c274bc719bf7fc291b1ed9acf4aaa128ccc52816fe46
      error="context deadline exceeded"

  Killsnoop output after "docker stop ...":

    root@cloudimg:~# killsnoop-bpfcc
    TIME      PID      COMM             SIG  TPID     RESULT
    20:51:00  9631     runc             3    9581     -13
    20:51:02  9637     runc             9    9581     -13
    20:51:12  9030     dockerd          9    9581     0

This change extends the docker-default profile with rules that allow
receiving signals from processes that run confined with either runc or
crun profile (crun[4] is an alternative OCI runtime that's also confined
in AppArmor >= v4.0.0, see [1]). It is backward compatible because the
peer value is a regular expression (AARE) so the referenced profile
doesn't have to exist for this profile to successfully compile and load.

Note that the runc profile has an attachment to /usr/sbin/runc. This is
the path where the runc package in Debian/Ubuntu puts the binary. When
the docker-ce package is installed from the upstream repository[3], runc
is installed as part of the containerd.io package at /usr/bin/runc.
Therefore it's still running unconfined and has no issues sending
signals to containers.

[1] https://gitlab.com/apparmor/apparmor/-/commit/2594d936
[2] https://github.com/iovisor/bcc/blob/master/tools/killsnoop.py
[3] https://download.docker.com/linux/ubuntu
[4] https://github.com/containers/crun

Signed-off-by: Tomáš Virtus <nechtom@gmail.com>
2024-04-24 13:07:48 +02:00
George Ma
14a8fac092 chore: fix mismatched function names in godoc
Signed-off-by: George Ma <mayangang@outlook.com>
2024-03-22 16:24:31 +08:00
Sebastiaan van Stijn
4adc40ac40 fix duplicate words (dupwords)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-03-07 10:57:03 +01:00
Sebastiaan van Stijn
d69729e053 seccomp: add futex_wake syscall (kernel v6.7, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 9f6c532f59

    futex: Add sys_futex_wake()

    To complement sys_futex_waitv() add sys_futex_wake(). This syscall
    implements what was previously known as FUTEX_WAKE_BITSET except it
    uses 'unsigned long' for the bitmask and takes FUTEX2 flags.

    The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 14:12:40 +01:00
Sebastiaan van Stijn
10d344d176 seccomp: add futex_wait syscall (kernel v6.7, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: cb8c4312af

    futex: Add sys_futex_wait()

    To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This
    syscall implements what was previously known as FUTEX_WAIT_BITSET
    except it uses 'unsigned long' for the value and bitmask arguments,
    takes timespec and clockid_t arguments for the absolute timeout and
    uses FUTEX2 flags.

    The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 14:12:40 +01:00
Sebastiaan van Stijn
df57a080b6 seccomp: add futex_requeue syscall (kernel v6.7, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 0f4b5f9722

    futex: Add sys_futex_requeue()

    Finish off the 'simple' futex2 syscall group by adding
    sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are
    too numerous to fit into a regular syscall. As such, use struct
    futex_waitv to pass the 'source' and 'destination' futexes to the
    syscall.

    This syscall implements what was previously known as FUTEX_CMP_REQUEUE
    and uses {val, uaddr, flags} for source and {uaddr, flags} for
    destination.

    This design explicitly allows requeueing between different types of
    futex by having a different flags word per uaddr.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 14:12:31 +01:00
Sebastiaan van Stijn
8826f402f9 seccomp: add map_shadow_stack syscall (kernel v6.6, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: c35559f94e

    x86/shstk: Introduce map_shadow_stack syscall

    When operating with shadow stacks enabled, the kernel will automatically
    allocate shadow stacks for new threads, however in some cases userspace
    will need additional shadow stacks. The main example of this is the
    ucontext family of functions, which require userspace allocating and
    pivoting to userspace managed stacks.

    Unlike most other user memory permissions, shadow stacks need to be
    provisioned with special data in order to be useful. They need to be setup
    with a restore token so that userspace can pivot to them via the RSTORSSP
    instruction. But, the security design of shadow stacks is that they
    should not be written to except in limited circumstances. This presents a
    problem for userspace, as to how userspace can provision this special
    data, without allowing for the shadow stack to be generally writable.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 14:02:33 +01:00
Sebastiaan van Stijn
6f242f1a28 seccomp: add fchmodat2 syscall (kernel v6.6, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: 09da082b07

    fs: Add fchmodat2()

    On the userspace side fchmodat(3) is implemented as a wrapper
    function which implements the POSIX-specified interface. This
    interface differs from the underlying kernel system call, which does not
    have a flags argument. Most implementations require procfs [1][2].

    There doesn't appear to be a good userspace workaround for this issue
    but the implementation in the kernel is pretty straight-forward.

    The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
    unlike existing fchmodat.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 13:59:04 +01:00
Sebastiaan van Stijn
4d0d5ee10d seccomp: add cachestat syscall (kernel v6.5, libseccomp v2.5.5)
Add this syscall to match the profile in containerd

containerd: a6e52c74fa
libseccomp: 53267af3fb
kernel: cf264e1329

    NAME
        cachestat - query the page cache statistics of a file.

    SYNOPSIS
        #include <sys/mman.h>

        struct cachestat_range {
            __u64 off;
            __u64 len;
        };

        struct cachestat {
            __u64 nr_cache;
            __u64 nr_dirty;
            __u64 nr_writeback;
            __u64 nr_evicted;
            __u64 nr_recently_evicted;
        };

        int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
            struct cachestat *cstat, unsigned int flags);

    DESCRIPTION
        cachestat() queries the number of cached pages, number of dirty
        pages, number of pages marked for writeback, number of evicted
        pages, number of recently evicted pages, in the bytes range given by
        `off` and `len`.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 13:57:00 +01:00
Sebastiaan van Stijn
1251982cf7 seccomp: add set_mempolicy_home_node syscall (kernel v5.17, libseccomp v2.5.4)
This syscall is gated by CAP_SYS_NICE, matching the profile in containerd.

containerd: a6e52c74fa
libseccomp: d83cb7ac25
kernel: c6018b4b25

    mm/mempolicy: add set_mempolicy_home_node syscall
    This syscall can be used to set a home node for the MPOL_BIND and
    MPOL_PREFERRED_MANY memory policy.  Users should use this syscall after
    setting up a memory policy for the specified range as shown below.

      mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
            new_nodes->size + 1, 0);
      sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
                    home_node, 0);

    The syscall allows specifying a home node/preferred node from which
    kernel will fulfill memory allocation requests first.
    ...

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-02-06 13:53:15 +01:00
Sebastiaan van Stijn
6fae583dba pkg/aaparser: remove, and integrate into profiles/apparmor
This package provided utilities to obtain the apparmor_parser version, as well
as loading a profile.

Commit e3e715666f (included in v24.0.0 through
bfffb0974e) deprecated GetVersion, as it was no
longer used, which made LoadProfile the only utility remaining in this package.

LoadProfile appears to have no external consumers, and the only use in our code
is "profiles/apparmor".

This patch moves the remaining code (LoadProfile) to profiles/apparmor as a
non-exported function, and deletes the package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-01-02 15:15:49 +01:00
Albin Kerouanton
891241e7e7 seccomp: block io_uring_* syscalls in default profile
This syncs the seccomp profile with changes made to containerd's default
profile in [1].

The original containerd issue and PR mention:

> Security experts generally believe io_uring to be unsafe. In fact
> Google ChromeOS and Android have turned it off, plus all Google
> production servers turn it off. Based on the blog published by Google
> below it seems like a bunch of vulnerabilities related to io_uring can
> be exploited to breakout of the container.
>
> [2]
>
> Other security reaserchers also hold this opinion: see [3] for a
> blackhat presentation on io_uring exploits.

For the record, these syscalls were added to the allowlist in [4].

[1]: a48ddf4a20
[2]: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
[3]: https://i.blackhat.com/BH-US-23/Presentations/US-23-Lin-bad_io_uring.pdf
[4]: https://github.com/moby/moby/pull/39415

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
2023-11-02 19:05:47 +01:00
Bjorn Neergaard
bddd826d7a profiles/apparmor: deny /sys/devices/virtual/powercap
While this is not strictly necessary as the default OCI config masks this
path, it is possible that the user disabled path masking, passed their
own list, or is using a forked (or future) daemon version that has a
modified default config/allows changing the default config.

Add some defense-in-depth by also masking out this problematic hardware
device with the AppArmor LSM.

Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-09-18 16:41:03 -06:00
Sebastiaan van Stijn
498da44aab remove some remaining pre-go1.17 build-tags
commit ab35df454d removed most of the pre-go1.17
build-tags, but for some reason, "go fix" doesn't remove these, so removing
the remaining ones manually

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-08-24 17:51:07 +02:00
Sebastiaan van Stijn
169fab5146 profiles/seccomp: format code with gofumpt
Formatting the code with https://github.com/mvdan/gofumpt

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-06-29 00:31:50 +02:00
Bjorn Neergaard
b335e3d305 seccomp: add name_to_handle_at to allowlist
Based on the analysis on [the previous PR][1].

  [1]: https://github.com/moby/moby/pull/45766#pullrequestreview-1493908145

Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-06-28 05:44:48 -06:00
Vitor Anjos
fdc9b7cceb remove name_to_handle_at(2) from filtered syscalls
Signed-off-by: Vitor Anjos <bartier@users.noreply.github.com>
2023-06-27 09:49:38 -03:00
Sebastiaan van Stijn
ab35df454d remove pre-go1.17 build-tags
Removed pre-go1.17 build-tags with go fix;

    go mod init
    go fix -mod=readonly ./...
    rm go.mod

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-05-19 20:38:51 +02:00
Sebastiaan van Stijn
ecaab085db profiles/apparmor: remove use of aaparser.GetVersion()
commit 7008a51449 removed version-conditional
rules from the template, so we no longer need the apparmor_parser Version.

This patch removes the call to `aaparser.GetVersion()`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-05-08 13:50:13 +02:00
Sebastiaan van Stijn
7008a51449 profiles/apparmor: remove version-conditional constraints (< 2.8.96)
These conditions were added in 8cf89245f5
to account for old versions of debian/ubuntu (apparmor_parser < 2.8.96)
that lacked some options;

> This allows us to use the apparmor profile we have in contrib/apparmor/
> and solves the problems where certain functions are not apparent on older
> versions of apparmor_parser on debian/ubuntu.

Those patches were from 2015/2016, and all currently supported distro
versions should now have more current versions than that. Looking at the
oldest supported versions;

Ubuntu 18.04 "Bionic":

    apparmor_parser --version
    AppArmor parser version 2.12
    Copyright (C) 1999-2008 Novell Inc.
    Copyright 2009-2012 Canonical Ltd.

Debian 10 "Buster"

    apparmor_parser --version
    AppArmor parser version 2.13.2
    Copyright (C) 1999-2008 Novell Inc.
    Copyright 2009-2018 Canonical Ltd.

This patch removes the conditionals.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-02-08 18:04:04 +01:00
Sebastiaan van Stijn
57b229012a seccomp: block socket calls to AF_VSOCK in default profile
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as 17a9324035

Some background from the associated ticket:

> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, becuase VSOCK is almost exclusively used for host-guest commmunication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: https://github.com/moby/moby/pull/29076#commitcomment-21831387
> [2]: dcf2632945
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: https://github.com/kubevirt/kubevirt/pull/8546

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-12-01 14:06:37 +01:00
Sebastiaan van Stijn
7b7d1132e8 seccomp: allow "bpf", "perf_event_open", gated by CAP_BPF, CAP_PERFMON
Update the profile to make use of CAP_BPF and CAP_PERFMON capabilities. Prior to
kernel 5.8, bpf and perf_event_open required CAP_SYS_ADMIN. This change enables
finer control of the privilege setting, thus allowing us to run certain system
tracing tools with minimal privileges.

Based on the original patch from Henry Wang in the containerd repository.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-08-18 18:34:09 +02:00
zhubojun
e258d66f17 profiles: seccomp: add syscalls related to PKU in default policy
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) in seccomp default profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they are existing "safe for everyone" syscalls.

close issue: #43481

Signed-off-by: zhubojun <bojun.zhu@foxmail.com>
2022-07-11 09:50:53 +08:00
Bastien Pascard
420142a886 profiles: seccomp: allow clock_settime64 when CAP_SYS_TIME is added
Signed-off-by: Bastien Pascard <bpascard@hotmail.com>
2022-07-06 23:45:13 +02:00
Phil Sphicas
66f14e4ae9 Fix AppArmor profile docker-default /proc/sys rule
The current docker-default AppArmor profile intends to block write
access to everything in `/proc`, except for `/proc/<pid>` and
`/proc/sys/kernel/shm*`.

Currently the rules block access to everything in `/proc/sys`, and do
not successfully allow access to `/proc/sys/kernel/shm*`. Specifically,
a path like /proc/sys/kernel/shmmax matches this part of the pattern:

    deny @{PROC}/{[^1-9][^0-9][^0-9][^0-9]*     }/** w,
         /proc  / s     y     s     /     kernel /shmmax

This patch updates the rule so that it works as intended.

Closes #39791

Signed-off-by: Phil Sphicas <phil.sphicas@att.com>
2022-06-30 21:12:58 +02:00
Kir Kolyshkin
8a5c13155e all: use unix.ByteSliceToString for utsname fields
This also fixes the GetOperatingSystem function in
pkg/parsers/operatingsystem which mistakenly truncated utsname.Machine
to the index of \0 in utsname.Sysname.

Fixes: 7aeb3efcb4
Cc: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-05-18 17:13:20 -07:00
Djordje Lukic
7de9f4f82d Allow different syscalls from kernels 5.12 -> 5.16
Kernel 5.12:

    mount_setattr: needs CAP_SYS_ADMIN

Kernel 5.14:

    quotactl_fd: needs CAP_SYS_ADMIN
    memfd_secret: always allowed

Kernel 5.15:

    process_mrelease: always allowed

Kernel 5.16:

    futex_waitv: always allowed

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
2022-05-13 12:35:08 +02:00
Justin Cormack
f1dd6bf84e Merge pull request #43553 from AkihiroSuda/riscv64
seccomp: support riscv64
2022-05-13 10:41:53 +01:00
Sebastiaan van Stijn
e9712464ad Merge pull request #43199 from Xyene/allow-landlock
seccomp: add support for Landlock syscalls in default policy
2022-05-13 10:18:45 +02:00
Tianon Gravi
c9e19a2aa1 Remove "seccomp" build tag
Similar to the (now removed) `apparmor` build tag, this build-time toggle existed for users who needed to build without the `libseccomp` library.  That's no longer necessary, and given the importance of seccomp to the overall default security profile of Docker containers, it makes sense that any binary built for Linux should support (and use by default) seccomp if the underlying host does.

Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
2022-05-12 14:48:35 -07:00
Akihiro Suda
4c2f18f6cc seccomp: support riscv64
Corresponds to containerd PR 6882

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-05-02 17:41:43 +09:00
Tudor Brindus
af819bf623 seccomp: add support for Landlock syscalls in default policy
This commit allows the Landlock[0] system calls in the default seccomp
policy.

Landlock was introduced in kernel 5.13, to fill the gap that inspecting
filepaths passed as arguments to filesystem system calls is not really
possible with pure `seccomp` (unless involving `ptrace`).

Allowing Landlock by default fits in with allowing `seccomp` for
containerized applications to voluntarily restrict their access rights
to files within the container.

[0]: https://www.kernel.org/doc/html/latest/userspace-api/landlock.html

Signed-off-by: Tudor Brindus <me@tbrindus.ca>
2022-01-31 08:44:04 -05:00
Sören Tempel
85eaf23bf4 seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.

Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.

This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.

[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2021-12-18 14:06:07 +01:00
Eng Zer Jun
c55a4ac779 refactor: move from io/ioutil to io and os package
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2021-08-27 14:56:57 +08:00
Sebastiaan van Stijn
686be57d0a Update to Go 1.17.0, and gofmt with Go 1.17
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-24 23:33:27 +02:00
Sebastiaan van Stijn
2480bebf59 Merge pull request #42649 from kinvolk/rata/seccomp-default-errno
seccomp: Use explicit DefaultErrnoRet
2021-08-03 15:13:42 +02:00
Rodrigo Campos
fb794166d9 seccomp: Use explicit DefaultErrnoRet
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.

Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-07-30 19:13:21 +02:00
Daniel P. Berrangé
9f6b562dd1 seccomp: add support for "clone3" syscall in default policy
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  7a8d7162f9

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-07-27 10:56:07 +01:00