Hot-swap models, minimize cold starts, keep models up to date, and enforce access controls.
TL;DR: Instant model hot swaps, fast cold starts, automatic model updates, predictive LLM scaling, secure access control, all on your infrastructure or private cloud.
Get started now at https://outerport.com!
Horizontal scaling of LLM inference is difficult. Preparing a server for LLM inference roughly involves the following steps:

1. Download the model weights from remote storage (e.g. S3 or a model hub) to local disk.
2. Load the weights from disk into CPU memory.
3. Transfer the weights from CPU memory onto the GPU.
Implemented naively, just these 3 steps can take around 4 minutes even for a small 7B-parameter LLM. To optimize this, you need to implement model chunking, parallel downloads, and streaming from the network directly into memory, and to use local SSDs. Even after doing all of this, model loading can take upwards of ~30 seconds, a long time to keep impatient customers waiting.
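To make the gap concrete, here is a rough Python sketch (not Outerport's code; the URL and paths are placeholders) of the naive path alongside one of the optimizations mentioned above, a chunked, parallel download that streams bytes straight into memory instead of round-tripping through disk:

```python
import concurrent.futures

import requests
import torch

WEIGHTS_URL = "https://example.com/models/llm-7b.pt"  # placeholder URL


def naive_load(url: str, device: str = "cuda") -> dict:
    """Naive path: download the whole file, write it to disk, then load it onto the GPU."""
    resp = requests.get(url, timeout=600)
    resp.raise_for_status()
    with open("/tmp/model.pt", "wb") as f:                        # 1. network -> local disk
        f.write(resp.content)
    state_dict = torch.load("/tmp/model.pt", map_location="cpu")  # 2. disk -> CPU memory
    return {k: v.to(device) for k, v in state_dict.items()}       # 3. CPU memory -> GPU


def parallel_download(url: str, num_chunks: int = 8) -> bytes:
    """Chunked, parallel HTTP range requests, kept entirely in memory."""
    total = int(requests.head(url, timeout=60).headers["Content-Length"])
    bounds = [(i * total // num_chunks, (i + 1) * total // num_chunks - 1)
              for i in range(num_chunks)]

    def fetch(start_end):
        start, end = start_end
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=600)
        r.raise_for_status()
        return r.content

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_chunks) as pool:
        parts = list(pool.map(fetch, bounds))  # order is preserved by map()
    return b"".join(parts)
```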
Outerport achieves a ~2-second model load time by keeping models warm in a pinned-memory cache daemon, with predictive orchestration to decide where and when to keep models warm. We bring what many serverless providers have already figured out for container images, specialized for model weights, which come with their own set of challenges.
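As a minimal sketch of the underlying idea (this is not Outerport's daemon or API, just an illustration in PyTorch), keeping weights pinned in host memory means the only work left at swap time is the host-to-GPU copy:

```python
import torch


class PinnedWeightCache:
    """Toy pinned-CPU-memory weight cache for fast GPU hot swaps (illustrative only)."""

    def __init__(self):
        self._cache: dict[str, dict[str, torch.Tensor]] = {}

    def warm(self, name: str, state_dict: dict[str, torch.Tensor]) -> None:
        # Keep a copy of each tensor in page-locked (pinned) host memory so later
        # host-to-device copies can run as fast, asynchronous DMA transfers.
        self._cache[name] = {k: v.cpu().pin_memory() for k, v in state_dict.items()}

    def swap_in(self, name: str, device: str = "cuda") -> dict[str, torch.Tensor]:
        # Hot swap: the only remaining cost is the copy over PCIe/NVLink to the GPU.
        gpu_weights = {
            k: v.to(device, non_blocking=True) for k, v in self._cache[name].items()
        }
        torch.cuda.synchronize()  # wait for the async copies to finish
        return gpu_weights


# Usage (assumes a CUDA GPU and enough pinned host RAM to hold the cached models):
# cache = PinnedWeightCache()
# cache.warm("llm-7b", torch.load("llm-7b.pt", map_location="cpu"))
# weights = cache.swap_in("llm-7b")  # seconds, instead of a full cold start
```

Pinned (page-locked) memory is what allows the copies to be asynchronous DMA transfers, which is why the GPU load can complete in seconds even for multi-gigabyte models.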
Here's a live demo of model hot swapping:
https://www.youtube.com/watch?v=YoA2elVvo_o
With Outerport, you can also get:
- A model registry: `push` models to it and `pull` to get them fast.

Overall system architecture:
We (Towaki and Allen) bring experience in ML infrastructure and systems from NVIDIA, Tome, LinkedIn, and Meta. Allen shipped fine-tuned LLM inference features to 10s of millions of customers at his previous startup, and Towaki worked on writing GPU code & optimizing 3D foundation model training at NVIDIA.
Now we want to unlock this capability for everyone else. Ping us at founders@outerport.com or book a demo at https://outerport.com.
Our ask: if you fit any of the descriptions below, or know someone who does, we'd love to talk! Please reach out to founders@outerport.com or book a demo at https://outerport.com.