Hot-swap models, minimize cold starts, keep models up to date, and enforce access controls.
TL;DR: Instant model hot swaps, fast cold starts, automatic model updates, predictive LLM scaling, secure access control, all on your infrastructure or private cloud.
Get started now at https://outerport.com!
Horizontal scaling of LLM inference is difficult. Preparing a server for LLM inference roughly involves the following steps:

1. Download the model weights from remote storage (e.g. S3 or a model hub) to local disk.
2. Load the weights from disk into CPU memory.
3. Transfer the weights from CPU memory onto the GPU.
Implemented naively, just these 3 steps can take around 4 minutes even for a small 7B-parameter LLM. To optimize this, you need to implement model chunking, parallel downloads, and streaming from the network directly into memory, and to use local SSDs. Even after doing all of this, model loading can take upwards of ~30 seconds, a long time to keep impatient customers waiting.
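To make the gap concrete, here is a rough Python sketch (not Outerport's code; the URL and paths are placeholders) of the naive path alongside one of the optimizations mentioned above, a chunked, parallel download that streams bytes straight into memory instead of round-tripping through disk:

```python
import concurrent.futures

import requests
import torch

WEIGHTS_URL = "https://example.com/models/llm-7b.pt"  # placeholder URL


def naive_load(url: str, device: str = "cuda") -> dict:
    """Naive path: download the whole file, write it to disk, then load it onto the GPU."""
    resp = requests.get(url, timeout=600)
    resp.raise_for_status()
    with open("/tmp/model.pt", "wb") as f:                        # 1. network -> local disk
        f.write(resp.content)
    state_dict = torch.load("/tmp/model.pt", map_location="cpu")  # 2. disk -> CPU memory
    return {k: v.to(device) for k, v in state_dict.items()}       # 3. CPU memory -> GPU


def parallel_download(url: str, num_chunks: int = 8) -> bytes:
    """Chunked, parallel HTTP range requests, kept entirely in memory."""
    total = int(requests.head(url, timeout=60).headers["Content-Length"])
    bounds = [(i * total // num_chunks, (i + 1) * total // num_chunks - 1)
              for i in range(num_chunks)]

    def fetch(start_end):
        start, end = start_end
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=600)
        r.raise_for_status()
        return r.content

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_chunks) as pool:
        parts = list(pool.map(fetch, bounds))  # order is preserved by map()
    return b"".join(parts)
```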
Outerport achieves a ~2-second model load time by keeping models warm in a pinned-memory cache daemon, with predictive orchestration to decide where and when to keep models warm. We bring what many serverless providers have already figured out for container images, specialized for model weights, which come with their own set of challenges.
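As a minimal sketch of the underlying idea (this is not Outerport's daemon or API, just an illustration in PyTorch), keeping weights pinned in host memory means the only work left at swap time is the host-to-GPU copy:

```python
import torch


class PinnedWeightCache:
    """Toy pinned-CPU-memory weight cache for fast GPU hot swaps (illustrative only)."""

    def __init__(self):
        self._cache: dict[str, dict[str, torch.Tensor]] = {}

    def warm(self, name: str, state_dict: dict[str, torch.Tensor]) -> None:
        # Keep a copy of each tensor in page-locked (pinned) host memory so later
        # host-to-device copies can run as fast, asynchronous DMA transfers.
        self._cache[name] = {k: v.cpu().pin_memory() for k, v in state_dict.items()}

    def swap_in(self, name: str, device: str = "cuda") -> dict[str, torch.Tensor]:
        # Hot swap: the only remaining cost is the copy over PCIe/NVLink to the GPU.
        gpu_weights = {
            k: v.to(device, non_blocking=True) for k, v in self._cache[name].items()
        }
        torch.cuda.synchronize()  # wait for the async copies to finish
        return gpu_weights


# Usage (assumes a CUDA GPU and enough pinned host RAM to hold the cached models):
# cache = PinnedWeightCache()
# cache.warm("llm-7b", torch.load("llm-7b.pt", map_location="cpu"))
# weights = cache.swap_in("llm-7b")  # seconds, instead of a full cold start
```

Pinned (page-locked) memory is what allows the copies to be asynchronous DMA transfers, which is why the GPU load can complete in seconds even for multi-gigabyte models.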
Here's a live demo of model hot swapping:
https://www.youtube.com/watch?v=YoA2elVvo_o
With Outerport, you can also get:
- A model registry: `push` models to it and `pull` to get them fast.

Overall system architecture:
We (Towaki and Allen) bring experience in ML infrastructure and systems from NVIDIA, Tome, LinkedIn, and Meta. Allen shipped fine-tuned LLM inference features to 10s of millions of customers at his previous startup, and Towaki worked on writing GPU code & optimizing 3D foundation model training at NVIDIA.
Now we want to unlock this capability for everyone else. Ping us at founders@outerport.com or book a demo at https://outerport.com.
Our ask: if you fit any of the descriptions below, or know someone who does, we'd love to talk! Please reach out to founders@outerport.com or book a demo at https://outerport.com.