Io_uring and seccomp (2022)

blog.0x74696d.com

82 points by pncnmnp a year ago

eqvinox a year ago

Using seccomp with a default-open filter is a terrible idea to begin with; it wasn't really designed for any of this. Seccomp in its most basic form didn't even have a filter list, it just allowed read() and write(). (And close() or something, don't quote me on the details, the point is it was a fixed list.) You're supposed to use it with a default-closed filter and fully enumerate what you need. (Yes, that's hard in a lot of cases, but still.)

There have been other cases where syscalls got cloned, mostly to add new parameters, but either way seccomp with an "open" filter can only ever be defense-in-depth, not a critical line in itself.

(Don't misunderstand, defense-in-depth is good, and keep using seccomp for it. But an open seccomp filter MUST be considered bypassable.)
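
To make the open-versus-closed distinction concrete, here is a toy model in Python. It is not the seccomp API: a dictionary lookup stands in for the filter, and the names `make_filter`, `ALLOW`, and `KILL` are made up for illustration. It only shows that a default-open filter that blocks `open` says nothing about `openat` or `openat2`, while a default-closed filter rejects anything it was not told about.

```python
ALLOW, KILL = "allow", "kill"


def make_filter(rules, default):
    """Return a function mapping a syscall name to an action.

    `rules` holds the explicit decisions; `default` covers everything else.
    """
    def check(syscall):
        return rules.get(syscall, default)
    return check


# Default-open: block what is known to be dangerous, allow the rest.
denylist = make_filter({"open": KILL}, default=ALLOW)

# Default-closed: allow only what the program is known to need.
allowlist = make_filter(
    {"read": ALLOW, "write": ALLOW, "close": ALLOW}, default=KILL
)

for name in ("open", "openat", "openat2"):
    print(name, denylist(name), allowlist(name))
```

The same property that makes the closed filter safe is what makes it brittle: any syscall the program starts using later is denied until someone adds it.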

poincaredisk a year ago

> it just allowed read() and write().
A fun consequence of this is that even though there was a function to check if seccomp is enabled or not, it could only ever do one of two things: return "not enabled" or crash the process.
I agree with everything you wrote. I'll add that maintaining a whitelist is not easy either. I've witnessed many situations where a seccomp sandbox broke because glibc or the Python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open).
- eqvinox a year ago
  
  > I've witnessed many situations where seccomp sandbox broke because glibc/python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open)
  ACK, that's what I meant with "hard in a lot of cases"… to be honest I think this is a failure of the ecosystem at large. It's a bit of a half feature without some kind of higher-level userspace mechanism to collect who needs what, especially when a bunch of libraries are involved. It's admittedly a very hard problem, e.g. just because something is linking libcurl as a 2nd or 3rd level dependency doesn't mean you intend your process to ever make network connections… I don't think it's unsolvable though.

deathanatos a year ago

This seems like an instance of an anti-pattern I've seen, which is conflating "permission" and "API call" into the same thing.

IIRC, AWS does this, where permission is by API call. As an example, you can have permission to call ssm:GetParameter n times, but if you try to combine those n API calls into a batch with GetParameters, that's a different IAM perm, even though exactly the same thing is occurring.

thayne a year ago

I find that so frustrating. Another example is uploading an image to ECR (Elastic Container Registry). You need like four different permissions to do it, which I think correspond to individual HTTP requests, but it is usually just a single docker/podman/skopeo command, and I can't think of a situation where you would want to grant permission to initiate an upload but not complete it.
Multipart uploads in S3 have a similar problem.

cpuguy83 a year ago

Both Docker and containerd have been blocking io_uring in their default seccomp profiles for about a year now, due to too many security issues with it.

bri3d a year ago

And Google, in ChromeOS, Android, and purportedly, Google production servers, for around a year and a half, as well. For this reason it's also disabled in several of the kernelCTF configurations and in the ones where it remains (GKE), it only pays out at half-rate in bug bounty.

hinkley a year ago

Has anyone speculated yet about how much slower a secure io_uring has to be? Is it still a net win once you lock it down fully?
- JackSlateur a year ago
  
  As far as I know, io_uring is quite secure: a user cannot perform a syscall through it unless they have the privileges required to perform that syscall directly.
  I would welcome more details about the exact purpose of seccomp in a container environment. Reading around a bit online, I find that Docker "uses seccomp to block mount(2), which could be used to escape the container", which makes no sense to me because mount(2) requires CAP_SYS_ADMIN.
  
  cpuguy83 a year ago
  
  io_uring CVEs: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
  seccomp is used for defense in depth. If someone managed to escalate privileges through some means, the seccomp policy will still prevent them from doing nasty things or escalating further.
  
  poincaredisk a year ago
  
  That's not contradictory. Capabilities in docker are also limited, but both are used as a part of defense in depth.
- cpuguy83 a year ago
  
  That would be impossible to know. The main thing with io_uring is that it lets you perform a number of operations without context switching (i.e. without making system calls).

theamk a year ago

I was thinking about how one would change io_uring design to be compatible with seccomp, and came up with a very simple one:

A new io_uring fd comes with all operations disabled by default. The user has to call "io_uring_register(fd, ENABLE_OP, op)" before an operation is used for the first time. Then a seccomp filter can easily filter ENABLE_OP calls to prohibit certain operations.

It could even be added now in a backward-compatible way: add a new feature flag to io_uring_setup that enables it. Then one could set the seccomp filter to only accept setup requests with this flag set, and deny all others. Together, this should allow cooperating programs to pass the seccomp filter, while programs that don't register ops could not use io_uring at all.
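
A rough Python model of this proposal, to show where the filter would get its hook. Everything here is hypothetical: `ENABLE_OP` does not exist as written, and `Ring`, `register_enable_op`, and `denied_ops` are illustrative names, not kernel or liburing APIs.

```python
class Ring:
    """Toy stand-in for an io_uring instance under the proposed scheme."""

    def __init__(self):
        self.enabled = set()  # every operation starts out disabled

    def submit(self, op):
        if op not in self.enabled:
            raise PermissionError(f"{op} was never enabled on this ring")
        return f"{op} submitted"


def register_enable_op(ring, op, denied_ops):
    """Model of io_uring_register(fd, ENABLE_OP, op) passing through a filter.

    Because `op` is a plain syscall argument here, a seccomp-style filter
    can inspect it; `denied_ops` plays the role of that filter.
    """
    if op in denied_ops:
        raise PermissionError(f"filter refused to enable {op}")
    ring.enabled.add(op)


ring = Ring()
policy = {"connect"}  # operations the sandbox refuses to enable
register_enable_op(ring, "read", policy)
print(ring.submit("read"))
```

The design choice is that the decision moves from submission time, which the filter cannot see, to registration time, which is an ordinary syscall.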

eqvinox a year ago

I agree and think your approach would work, but I need to point out that seccomp BPF filters can also match on syscall arguments. For example, you can allow fcntl(F_DUPFD, …) but deny fcntl(F_SETLEASE, …). For some syscalls (fcntl, ioctl, setsockopt, …), this is rather important.
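
As a sketch of what argument matching means, here is the fcntl example as a plain Python function. This is a model of the decision logic only, not a BPF program; `check` is an illustrative name, and the two constants are the Linux values for those commands.

```python
# Linux values for the two fcntl commands mentioned above.
F_DUPFD = 0
F_SETLEASE = 1024

ALLOW, DENY = "allow", "deny"


def check(syscall, args):
    """Decide on a call from its name and its raw integer arguments."""
    if syscall == "fcntl":
        cmd = args[1]  # fcntl(fd, cmd, ...): cmd is the second argument
        return ALLOW if cmd == F_DUPFD else DENY
    if syscall in ("read", "write", "close"):
        return ALLOW
    return DENY  # default-closed for everything else


print(check("fcntl", (3, F_DUPFD, 0)))
print(check("fcntl", (3, F_SETLEASE, 0)))
```

A real filter sees only the raw argument values, so it can compare an integer command like this one but cannot follow a pointer into user memory. That is also why it cannot see which operations sit in an io_uring submission queue.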

FridgeSeal a year ago

Surely this is a seccomp shortcoming, or kernel auth shortcoming, rather than an io_uring problem?

That is, seccomp is (apparently? I’ve never used it myself) capable of intercepting direct calls. Obviously, that design isn’t going to be able to handle “indirect” calls in its default implementation.

Either seccomp needs a way to act on the buffer or intercept io_uring calls, or there’s a need for a new auth mechanism that’s capable of handling io_uring-style APIs.

Torpedoing the whole API (a la GCP) feels like throwing the baby out with the bath water.

tptacek a year ago

That framing doesn't make sense. System calls and their arguments are an obvious security boundary and have been a sandboxing component for decades. io_uring blows that boundary apart. The "problem" is io_uring, not seccomp.
If you want to make a case for io_uring being benign for security, the right argument is probably against all unmediated shared-kernel multitenancy (ie: multitenancy either through virtualization, or WASM/V8-type language runtimes, and nothing else). It doesn't make sense to say system call filters are flawed because someone came up with an omni-syscall that breaks those filters.
- asveikau a year ago
  
  The syscall implementations themselves do checks and return EPERM/EACCES when appropriate. The mechanism for doing the syscall can change. I mean, in the 90s it happened via int 0x80, then we got sysenter, then the vDSO. io_uring just moved part of it to user mode.
  It seems like a totally reasonable design to me to "just" put the right hooks into the filter mechanism and make it get called the same way regardless of the syscall mechanism.
- thayne a year ago
  
  The obvious solution is to block operations over io_uring if the equivalent syscall would have been blocked by seccomp. But I'm not sure if there is some reason that wouldn't work.
  Another possibility would be to allow setting restrictions on all io_uring operations for the current and all child processes, although that would be less convenient than using the existing seccomp system.
  
  tptacek a year ago
  
  I assume it's not so much that it can't be done, just that it hasn't been done yet.
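
A minimal Python sketch of the "block the operation if the equivalent syscall would be blocked" idea from this subthread. Nothing like this is claimed to exist in the kernel: the opcode names are real io_uring opcodes, but the mapping is partial and illustrative, and `seccomp_policy` and `check_uring_op` are hypothetical names.

```python
ALLOW, DENY = "allow", "deny"

# Partial, illustrative mapping from io_uring opcodes to the syscall
# each one stands in for.
OP_TO_SYSCALL = {
    "IORING_OP_READ": "read",
    "IORING_OP_WRITE": "write",
    "IORING_OP_OPENAT": "openat",
    "IORING_OP_CONNECT": "connect",
}


def seccomp_policy(syscall):
    """Default-closed policy: only plain file I/O is allowed."""
    return ALLOW if syscall in {"read", "write", "openat", "close"} else DENY


def check_uring_op(opcode, policy=seccomp_policy):
    """Apply the syscall policy to an io_uring operation.

    Opcodes with no known syscall equivalent are denied.
    """
    syscall = OP_TO_SYSCALL.get(opcode)
    if syscall is None:
        return DENY
    return policy(syscall)


for op in OP_TO_SYSCALL:
    print(op, check_uring_op(op))
```

The awkward part in practice would be operations with no one-to-one syscall equivalent and filters that match on arguments, which this sketch sidesteps by denying anything unmapped.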

leni536 a year ago

> But if you've got a separation of duties where a sysadmin sets up seccomp filtering generically across applications

Is this even possible, regardless of io_uring?

amarshall a year ago

Well the article brings up containers as an example. If the sysadmin controls “your” parent or root process (e.g. the login shell), they can just perform seccomp filtering there and it applies to everything within it (like any other sandbox).
- 0x74696d a year ago
  
  (author here) I'm one of the maintainers of HashiCorp's Nomad, so that example was likely inspired by the separation of duties that's part of our security model. In that environment, there's a subset of task (ex. container) configuration that's controlled by the cluster admin and a subset that's controlled by the job author deploying onto the cluster.

klooney a year ago

Yes, systemd will let you do that, as will docker/containerd/podman.

0x74696d a year ago

Author here! The motivating example of this post is frankly pretty lousy in retrospect (and was even so soon after writing, given the friendly reminder from Giovanni Campagna that `socket` wasn't one of the io_uring opcodes). At best this is an interesting limitation of seccomp. Maybe relevant if you were using gVisor?