By Alex Gajewski and Evan Conrad
July 22, 2023
The Idea
- We’re getting a bunch of startups together that need compute for training large models
- Rather than each of K startups individually buying clusters of N GPUs, together we buy a cluster with N*K GPUs
- Then we set up a job scheduler to allocate compute fairly across all the startups (proportional to how much of the cluster they own)
- This means rather than needing to keep 128 A100s busy constantly for a month, you can burst up to 512 A100s for a week (roughly the same number of GPU-hours) and get your model sooner
- Also if there’s ever idle compute, the scheduler will just give it to you, so you may end up with more than your share of compute if you get lucky
- Big labs like OpenAI and DeepMind have big clusters that support this kind of bursty allocation for their researchers, but startups so far have had to get very small clusters on very long-term contracts, wait through months of lead time, and try to keep them busy all the time
- We ought to be able to get something like $1.75/hr per H100, but with bursty allocation and short-term contracts
- If you’re interested in having your startup join, fill out this form
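To make the scheduling idea above concrete, here is a minimal sketch of one way a scheduler could split the cluster: weighted max-min fairness, where each startup's weight is the number of GPUs it owns and anything a startup does not ask for is handed to the others. This is only an illustration under assumptions, not the scheduler we will actually run; the function name `fair_share`, the startup names, and the 128/128/256 ownership split are all made up for the example.

```python
def fair_share(total_gpus, owned, demand):
    """Weighted max-min fair split of total_gpus (illustrative sketch).

    owned:  GPUs each startup paid for, used as its weight (must be > 0)
    demand: GPUs each startup is asking for right now
    """
    alloc = {name: 0.0 for name in owned}
    # Only startups that currently want GPUs take part in the split.
    active = {name for name in owned if demand.get(name, 0) > 0}
    remaining = float(total_gpus)
    while active and remaining > 1e-9:
        total_weight = sum(owned[n] for n in active)
        # Offer each active startup a slice proportional to what it owns.
        offer = {n: remaining * owned[n] / total_weight for n in active}
        satisfied = {n for n in active if offer[n] >= demand[n]}
        if not satisfied:
            # Everyone wants more than their slice: hand out the slices.
            for n in active:
                alloc[n] = offer[n]
            break
        # Cap satisfied startups at their demand and re-split the leftovers.
        for n in satisfied:
            alloc[n] = float(demand[n])
            remaining -= demand[n]
        active -= satisfied
    return alloc


# Hypothetical 512-GPU cluster shared by three startups.
owned = {"a": 128, "b": 128, "c": 256}
busy = fair_share(512, owned, {"a": 512, "b": 512, "c": 512})
quiet = fair_share(512, owned, {"a": 512, "b": 0, "c": 128})
print(busy)
print(quiet)
```

When everyone wants the whole cluster, each startup gets exactly what it owns. When "b" is idle and "c" only needs 128 GPUs, the rest flows to "a", which ends up with 384 GPUs despite owning 128.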
Coming and going and scaling up
- Like a hacker house, if you ever want to leave the cluster (e.g. to build your own), just give us a month or two of notice so we can find another person to fill your place
- We can add new startups to the group in batches, and add new H100s to the cluster every couple of months
- Same deal if you’re already in the group and you want to scale up to more compute
- We may want to overprovision a little bit so that, e.g., if one of our friends wants a couple of nodes to run a small experiment on, we can just offer them at a good price
- If we overprovision by 10%, this would raise the hourly H100 price by about 10%, assuming the spare capacity brings in no revenue of its own
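The overprovisioning arithmetic above can be sketched as follows. This is a rough back-of-the-envelope model, not a quote: it assumes cluster cost scales linearly with GPU count, uses the $1.75/hr figure purely as an illustrative base price, and adds a hypothetical `spare_sold_fraction` parameter for how much of the spare capacity ends up being sold at the same hourly price.

```python
def price_per_sold_gpu_hour(base_price, overprovision, spare_sold_fraction=0.0):
    """Hourly price members pay so that revenue still covers the cluster cost.

    base_price:          $/GPU-hour with no overprovisioning (assumed figure)
    overprovision:       extra capacity as a fraction, e.g. 0.10 for 10%
    spare_sold_fraction: share of the spare capacity that is actually sold
                         at the same price (hypothetical parameter)
    """
    # Cost grows with every GPU bought; revenue only comes from GPUs sold.
    bought = 1.0 + overprovision
    sold = 1.0 + overprovision * spare_sold_fraction
    return base_price * bought / sold


unsold_spare = price_per_sold_gpu_hour(1.75, 0.10)
fully_sold_spare = price_per_sold_gpu_hour(1.75, 0.10, spare_sold_fraction=1.0)
print(f"spare capacity idle: ${unsold_spare:.3f}/hr")
print(f"spare capacity sold: ${fully_sold_spare:.3f}/hr")
```

If the spare 10% sits idle, the price rises by 10%; the more of it gets used by paying friends, the closer the price moves back to the base figure.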
Finances
- We have a good lead on 512 H100s that would come online in 4-6 weeks
- If we have more than that much demand, we can probably find more H100s that would be delivered in about 8 weeks
- We can probably get a good deal from a bank to spread out the cost of buying the cluster, so we can do something like $1.75/hr for H100s, but on a short-term contract and with bursty job allocation
- We can make a separate entity for this thing, so if we make any big financial mistakes, this entity dies but your startup is fine
Infra