kkielhofner 4 days ago

This is very interesting but many of the motivations listed are far better served with alternate approaches.

For "remote" model training there is NCCL + Deepspeed/FSDP/etc. For remote inferencing there are solutions like Triton Inference Server[0] that can do very high-performance hosting of any model for inference. For LLMs specifically there are nearly countless implementations.

That said, the ability to use this for testing is interesting, but I wonder about GPU contention, and as others have noted the performance of such a solution will be terrible even with a relatively high-speed interconnect (100/400Gb Ethernet, etc).

NCCL has been optimized to support DMA directly between network interfaces and GPUs, which is of course considerably faster than solutions like this. Triton can also make use of shared memory, mmap, NCCL, MPI, etc., which is one of the many tricks it uses for very performant inference - even across multiple chassis over another network layer.
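
As a rough sketch of what the client side of that looks like with Triton's Python HTTP client (server URL, model name, and tensor names here are made up; assumes the server already has the model loaded):

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a hypothetical Triton server on the LAN
    client = httpclient.InferenceServerClient(url="gpu-box:8000")

    # Build a request for a hypothetical model "resnet50" with an FP32 NCHW input
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("input__0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)

    result = client.infer(model_name="resnet50", inputs=[inp])
    print(result.as_numpy("output__0").shape)

The dynamic batching and shared-memory tricks all happen server side; the client just ships tensors over HTTP or gRPC.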

[0] - https://github.com/triton-inference-server/server

  • theossuary 4 days ago

    I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda, as they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

    This has been a struggle for data scientists for a while now. I haven't seen a good solution that lets a data scientist work locally but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter) or submitting remote jobs (through SLURM or a library-specific Kubernetes integration). Scuda is an interesting step towards a better solution for easily utilizing remote GPUs across a wide range of libraries, not just PyTorch and TensorFlow.

    • kkielhofner 3 days ago

      > I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda

      I put "remote" in quotes because they're not direct equivalents but from a practical standpoint it's the alternate current approach.

      > they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

      IME this has changed quite a bit. Between improved support for torch FSDP, Deepspeed, and especially HF Accelerate wrapping each of them for transformer models, it's been a while since I've had to put much (if any) work in.

      That said, if you're running random training scripts it likely won't "just work", but with larger models becoming more common I see a lot more torchrun, accelerate, deepspeed, etc. in READMEs and code.
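
      For reference, the pattern that makes most of this transparent nowadays is Accelerate's prepare() wrapper. A toy sketch (made-up model and data just to keep it self-contained; launch with "accelerate launch" or torchrun):

          import torch
          from torch import nn
          from torch.utils.data import DataLoader, TensorDataset
          from accelerate import Accelerator

          accelerator = Accelerator()  # picks up FSDP/DeepSpeed settings from "accelerate config"

          # Toy model/data to show the pattern; in practice this would be a transformers model
          model = nn.Linear(128, 2)
          optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
          loader = DataLoader(TensorDataset(torch.randn(256, 128),
                                            torch.randint(0, 2, (256,))), batch_size=32)

          # Accelerate wraps these for whatever backend is configured (DDP, FSDP, DeepSpeed)
          model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

          for x, y in loader:
              optimizer.zero_grad()
              loss = nn.functional.cross_entropy(model(x), y)
              accelerator.backward(loss)  # instead of loss.backward()
              optimizer.step()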

      > This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter)

      Remotely, as in over the internet? 400Gb Ethernet is already too slow vs PCIe5 x16 (forget SXM). A 10Gb internet connection is 40x slower (plus latency impacts).

      Remote development over the internet with scuda would be unusably slow.
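
      Back-of-envelope, using theoretical peak bandwidths and ignoring protocol overhead and latency:

          # Rough peak bandwidths in GB/s (theoretical, one direction)
          pcie5_x16 = 64.0       # ~64 GB/s
          eth_400g = 400 / 8     # 50 GB/s
          inet_10g = 10 / 8      # 1.25 GB/s

          print(pcie5_x16 / eth_400g)  # ~1.3x -- 400Gb Ethernet already trails one PCIe5 x16 slot
          print(eth_400g / inet_10g)   # 40x  -- and a 10Gb internet link is 40x slower than that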

    • seattleeng 4 days ago

      Why is working locally important?

      • theossuary 4 days ago

        Working locally still matters, and this is coming from someone who normally works in tmux/nvim. When doing vision and 3D ML work, being able to quickly open a visualizer window is imperative to understanding what's going on. For Gaussian Splatting, point cloud work, SLAM, etc. you have to have access to a desktop environment to see visualizations; they very rarely work well remotely (even if they have some Jupyter support).

        Working remotely when you have to use a desktop environment is painful, no matter the technology. The best I've come up with is tmux/vim plus Sunshine/Moonlight, but even then I'd rather just have access to everything locally.

ranger_danger 4 days ago

This appears to only support CUDA on nvidia. I'm curious why they didn't just expose /dev/nvidia-uvm as a socket and forward that over the network instead of hooking hundreds of functions (maybe it's not that simple and I just don't know).

  • monocasa 4 days ago

    You can't mmap a socket, and mmap is core to how /dev/nvidia-uvm works.

    • yencabulator 10 hours ago

      What you can do is mmap a file that's in a FUSE filesystem and relay reads/writes over the network to a server that holds that mmap.
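
      Roughly like this, as a sketch - assuming a FUSE mount at /mnt/remote (sshfs, or a custom daemon) that forwards I/O to the machine actually holding the buffer; the path is made up:

          import mmap, os

          # Page faults on the mapping become FUSE read() calls; dirtied pages
          # are written back through the FUSE daemon and over the network.
          fd = os.open("/mnt/remote/gpu_buffer.bin", os.O_RDWR | os.O_CREAT)
          os.ftruncate(fd, 4096)      # make sure there's at least one page to map
          buf = mmap.mmap(fd, 4096)
          buf[0:4] = b"ping"
          buf.flush()
          os.close(fd)

      Whether that performs well enough to be useful is another question, but it sidesteps the "can't mmap a socket" problem.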

    • afr0ck 4 days ago

      Well, it's not impossible. It's just software after all. You can mmap a remote device file, but you need OS support to do the magical paging for you, probably some sort of page ownership tracking protocol like in HMM [1], but outside a coherence domain.

      I was once working on CXL [2] and memory ownership tracking in the Linux kernel and wanted to play with Nvidia GPUs, but I hit a wall when I realised that a lot of the functionality runs on the GSP or in the firmware blob with little to no documentation. I ended up not liking Nvidia's system software stack and gave up on the project. The UVM subsystem in the open kernel driver is a bit of an exception, but a lot of the control path is still handled from the closed-source CUDA libraries in userspace.

      tldr; it's very hard to do systems hacking with Nvidia GPUs.

      [1] https://www.kernel.org/doc/html/v5.0/vm/hmm.html [2] https://en.wikipedia.org/wiki/Compute_Express_Link

      • monocasa 4 days ago

        Yeah, the Nvidia stuff isn't really made to be hacked on.

        I'd check out the AMD side since you can at least have a full open source GPU stack to play with, and they make a modicum of effort to document their gpus.

    • majke 4 days ago

      This is the first time I've heard of /dev/nvidia-uvm. Is there any documentation on how the Nvidia API works? In particular, how strong is the multi-tenancy story? Can two users share one GPU and expect reasonable security?

      Last time I checked, the GPU did offer some kind of memory isolation, but only on their datacenter cards, not consumer ones.

      • monocasa 4 days ago

        There aren't a lot of docs on how it works. It used to live entirely in the closed-source driver; now it's mainly a thin bridge to the closed-source firmware blob.

        But yes, for more than a decade now, even with consumer cards, separate user processes have had separate hardware-enforced contexts. This is as true for consumer cards as it is for datacenter cards. It's core to how something like WebGL works without exposing everything else being rendered on your desktop to the public Internet. There have been bugs, but per-process hardware isolation with a GPU-local MMU has been table stakes for a modern GPU for nearly twenty years.

        What datacenter GPUs expose on top of that is multiple virtual GPUs, sort of like SR-IOV, where a single GPU can be presented to multiple guest kernels running in virtual machines.

    • XorNot 4 days ago

      Which seems weird to me: if we're going to have device files, it's super annoying that they actually don't really act like files.

      Like, we really should just have enough RDMA support in the kernel to let that work.

      • monocasa 4 days ago

        At its core, this device file is responsible for managing a GPU-local address space and securely sharing memory with that address space, so there's somewhere to write command buffers and data that the GPU can see. It doesn't really make sense without a heavy memory-mapping component.

        A Plan 9-like model, where it's essentially just a standard file, would massively cut into GPU performance.

      • gorkish 4 days ago

        I agree with you that making RDMA a more accessible commodity technology is very important for "the future of compute". Properly configuring something like RoCEv2 or InfiniBand is expensive and difficult. These technologies need to become robust enough to run on commodity networks.

    • gorkish 4 days ago

      Granted, it requires additional support from your NICs/switches, but it's probably straightforward to remote nvidia-uvm with an RDMA server.

dschuetz 4 days ago

More like "virtual cuda only gpu" over IP.

some1else 4 days ago

You might have a problem using CUDA as part of the name, since Nvidia has it trademarked. Maybe you can switch to Scuba if they give you trouble; it sounds like a good name for the tool.

  • n3storm 4 days ago

    Buda may be a better name

  • teeray 4 days ago

    We need to do for CUDA what was done for Jell-o and Kleenex.

AkashKaStudio 4 days ago

Would this let an Nvidia card be accessible from Apple Silicon over TB4 for training with an eGPU caddy? I would happily relegate my desktop to HTPC/gaming duties.

ghxst 4 days ago

This looks more like CUDA over IP or am I missing something?

saurik 4 days ago

Reminds me of this, from a couple months ago.

https://news.ycombinator.com/item?id=41203475

  • friedtofu 4 days ago

    Was going to post a reference to the same thing! I tested it, and I'm not sure if it was just being hugged to death at the time, but the network performance was incredibly poor.

    As a user I find having something you can self-host really neat, but what I really want is something more like

    https://github.com/city96/ComfyUI_NetDist + OP's project mashed together.

    Say I'm almost able to execute a workflow that would normally require ~16GB of VRAM. I have an Nvidia 3060 12GB running headless with PRIME, executing the workflow via the CLI.

    Right now, I'd probably just have to run the workflow in a Paperspace (or any other cloud compute) container, or borrow the power of a local Apple M1 when using the second repository I mentioned.

    I wish I had something that could lend me extra resources and temporarily act as either the host GPU or a secondary one depending on the memory needed, only when I need it (if that makes sense).

gchamonlive 4 days ago

I have a laptop with a serviceable GPU but only 16GB of RAM, and another with a low-tier GPU but 32GB of RAM. Wondering: would it be too slow to use the latter as the control plane and delegate inference to the former using something like ComfyUI to run text-to-image models?

  • friedtofu 3 days ago

    I referenced this already, but definitely check out https://github.com/city96/ComfyUI_NetDist?tab=readme-ov-file...

    I guess that depends on what you mean by "too slow". What card is the low-tier GPU? An Nvidia Tesla? I've always been under the assumption that when running two cards in parallel, the faster card will almost always slow down to the speed of the card with the most memory, though the only reference I have is from using Nvidia SLI with two 8800s almost a decade ago.

    I could also be completely and utterly wrong, would love to hear from anyone in the field of GPU architecture or around it for some clarification though :)

    • gchamonlive 2 days ago

      Should have given more info, indeed. One notebook has a 3070 Ti, which has 8GB of VRAM, and the other has an MX150, which I guess has 2GB of dedicated VRAM.

rtghrhtr 4 days ago

Everyone hates nvidia but treats ATI as an afterthought. Another completely useless tool to throw on the pile.

  • dahart 4 days ago

    > Everyone hates nvidia but treats ATI as an afterthought.

    Hehe, do you mean AMD?

  • gorkish 4 days ago

    ATI? afterthought, indeed

elintknower 4 days ago

Curious if this could be simplified to provide NVENC over IP?

Technetium 4 days ago

It would be nice to have a description added.

kbumsik 4 days ago

I have heard NVSwitch is used for GPU-to-GPU interconnection over a network.

How is it different?

  • nsteel 4 days ago

    Isn't this GPU-to-CPU? And really slow. And only CUDA. And over IP. And implemented in software. I think it's really very different.