eqvinox 3 days ago

Using seccomp with a default-open filter is a terrible idea to begin with; it wasn't really designed for any of this. Seccomp in its most basic form didn't even have a filter list, it just allowed read() and write(). (And close() or something, don't quote me on the details, the point is it was a fixed list.) You're supposed to use it with a default-closed filter and fully enumerate what you need. (Yes, that's hard in a lot of cases, but still.)

There have been other cases where syscalls got cloned, mostly to add new parameters, but either way seccomp with an "open" filter can only ever be defense-in-depth, not a critical line in itself.

(Don't misunderstand, defense-in-depth is good, and keep using seccomp for it. But an open seccomp filter MUST be considered bypassable.)

  • poincaredisk 3 days ago

    >it just allowed read() and write(). A fun consequence of this is that even though there was a function to check if seccomp is enabled or not, it could only ever do one of two things: return "not enabled" or crash the process.

    I agree with everything you wrote. I'll add that having a whitelist is not easy too, I've witnessed many situations where seccomp sandbox broke because glibc/python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open)

    • eqvinox 3 days ago

      > I've witnessed many situations where seccomp sandbox broke because glibc/python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open)

      ACK, that's what I meant with "hard in a lot of cases"… to be honest I think this is a failure of the ecosystem at-large. It's a bit of a half feature without some kind of higher-level userspace mechanism to collect who needs what, especially when a bunch libraries are involved. It's admittedly a very hard problem, e.g. just because something is linking libcurl as a 2nd or 3rd level dependency doesn't mean you intend your process to ever make network connections… I don't think it's unsolveable though.

deathanatos 4 days ago

This seems like an instance of an anti-pattern I've seen, which is inflating "permission" and "API call" to the same thing.

IIRC, AWS does this, where permission is by API call. As an example, you can have permission to call ssm:GetParameter n times, but if you try to combine those n API calls into a batch with GetParameters, that's a different IAM perm, even though exactly the same thing is occurring.

  • thayne 3 days ago

    I find that so frustrating. Another example is uploading an image to ECR (elastic container registry). You need like four different permissions to do it, which I think correspond to individual http requests, but it is usually just a single docker/podman/skopeo command, and I can't think of a situation where you would want to grant permission to initiate an upload but not complete it.

    Multipart uploads in s3 have a similar problem.

cpuguy83 4 days ago

Both Docker and containerd have started to block io_uring in the default profile for about a year now due to too many security issues with it.

  • bri3d 3 days ago

    And Google, in ChromeOS, Android, and purportedly, Google production servers, for around a year and a half, as well. For this reason it's also disabled in several of the kernelCTF configurations and in the ones where it remains (GKE), it only pays out at half-rate in bug bounty.

  • hinkley 4 days ago

    Has anyone speculated yet about how much slower a secure io_uring has to be? Is it still a net win once you lock it down fully?

    • JackSlateur 3 days ago

      As far as I know, io_uring is quite secure: a user cannot perform a syscall through it unless it has the privileges required to perform this syscall directly

      I would gladly get more details about the exact purpose of seccomp in a container environment. Reading a bit of internet, I find that docker "uses seccomp to block mount(2), which could be used to escape the container", which makes no sense to me because mount(2) requires CAP_SYS_ADMIN

      • poincaredisk 3 days ago

        That's not contradictory. Capabilities in docker are also limited, but both are used as a part of defense in depth.

    • cpuguy83 3 days ago

      That would be impossible to know. The main thing with io_uring is it makes it so you don't need to context switch (ie make system calls) to perform a number of operations.

theamk 3 days ago

I was thinking about how one would change io_uring design to be compatible with seccomp, and came up with a very simple one:

A new io_uring fd comes with all operations disabled by default. User has to call "io_uring_register(fd, ENABLE_OP, op)" before operation is used for the first time. Then seccomp filter can easily filter enable_op calls to prohibit certain operations.

It could even be added now in backward-compatible way - add a new feature to io_uring_setup that enables it. Then one could set seccomp filter to only accept setup requests with this feature set, and deny all others. Together, this should allow cooperating programs to pass seccomp filter, while programs that won't register ops could not use seccomp at all.

  • eqvinox 3 days ago

    I agree and think your approach would work, but I need to point out that seccomp BPF filters can also match on syscall arguments. For example, you can allow fcntl(F_DUPFD, …) but deny fcntl(F_SETLEASE, …). For some syscalls (fcntl, ioctl, setsockopt, …), this is rather important.

FridgeSeal 3 days ago

Surely this is a seccomp shortcoming, or kernel auth shortcoming, rather than an io_uring problem?

That is, seccomp is (apparently? I’ve never used it myself) capable of intercepting direct calls. Obviously, that design isn’t going to be able to handle “indirect” calls in its default implementation.

Either seccomp needs a way to act on the buffer or intercept io_uring calls, or there’s a need for a new auth mechanism that’s capable of handling io_uring style API’s.

Torpedoing the whole api (a la gcp) feels like throwing the baby out with the bath water.

  • tptacek 3 days ago

    That framing doesn't make sense. System calls and their arguments are an obvious security boundary and have been a sandboxing component for decades. io_uring blows that boundary apart. The "problem" is io_uring, not seccomp.

    If you want to make a case for io_uring being benign for security, the right argument is probably against all unmediated shared-kernel multitenancy (ie: multitenancy either through virtualization, or WASM/V8-type language runtimes, and nothing else). It doesn't make sense to say system call filters are flawed because someone came up with an omni-syscall that breaks those filters.

    • asveikau 3 days ago

      The syscall implementations themselves do checks and return EPERM/EACCES when appropriate. The mechanism for doing the syscall can change. I mean, in the 90s it happened via int 0x80, then we got sysenter, then the vdso. io_uring just moved part of it to user mode.

      It seems like a totally reasonable design to me to "just" put the right hooks into the filter mechanism and make it get called the same way regardless of the syscall mechanism.

    • thayne 3 days ago

      The obvious solution is to block operations over io_uring if the equivalent syscall would have been blocked by seccomp. But I'm not sure if there is some reason that wouldn't work.

      Another possibility would be to allow setting restrictions on all io_uring operations for the current and all child processes, although that would be less convenient than using the existing seccomp system.

      • tptacek 3 days ago

        I assume it's not so much that it can't be done, just that it hasn't been done yet.

leni536 4 days ago

> But if you've got a separation of duties where a sysadmin sets up seccomp filtering generically across applications

Is this even possible, regardless of io_uring?

  • amarshall 4 days ago

    Well the article brings up containers as an example. If the sysadmin controls “your” parent or root process (e.g. the login shell), they can just perform seccomp filtering there and it applies to everything within it (like any other sandbox).

    • 0x74696d 3 days ago

      (author here) I'm one of the maintainers of HashiCorp's Nomad, so that example was likely inspired by the separation of duties that's part of our security model. In that environment, there's a subset of task (ex. container) configuration that's controlled by the cluster admin and a subset that's controlled by the job author deploying onto the cluster.

  • klooney 3 days ago

    Yes- systemd will let you do that, as well docker/containerd/podman.

0x74696d 3 days ago

Author here! The motivating example of this post is frankly pretty lousy in retrospect (and was even so soon after writing, given the friendly reminder from Giovanni Campagna that `socket` wasn't one of the io_uring opcodes). At best this is an interesting limitation of seccomp. Maybe relevant if you were using gVisor?