Distributed Systems Software Engineer

About the Role

As a distributed systems software engineer, you’ll work on our in-house resource orchestration system. This system coordinates state and access to hundreds (soon thousands) of GPU compute nodes in multi-tenant clusters spanning multiple data centers. Some responsibilities of the role include:

Design of distributed system architectures that enable highly available, fault-tolerant state management
Deployment automation and performance optimization of virtual machines that run on bare metal and use GPU passthrough
Design and deployment of multi-tier, high-performance network-attached storage systems

About Us

We’re the San Francisco Compute Company. We’re building the first real-time compute trading platform. We think that over the next decade, thousands of startups and labs are going to be training and serving large models. They need compute to do this, and we’re building a platform on which that compute can be traded. If we’re successful, it will be possible to scale to tens of thousands of accelerators for hours at a time without having to build your own infrastructure. This will greatly increase the number of organizations that can afford to train large models, which will make the most important technology of our lifetime accessible to more people.

About You

You have built fault-tolerant distributed systems that manage hardware resources at scale
You enjoy creating self-correcting systems that contribute to hardware health and reliability
You have experience with Linux virtualization (Cloud Hypervisor, QEMU, libvirt, virtiofs, SR-IOV, PCIe passthrough)
You appreciate and value good documentation

Some Nice to Haves

Experience with Rust (our VM orchestrator is written in Rust)
Experience with etcd
Experience with high-performance storage systems (WEKA, VAST, Ceph, etc.)

Compensation

US: $170k - $300k + equity