The most cost-effective way to set up and scale your AI infrastructure.
We’ve built an AI model inference system that can serve requests at scale like no other, and we’re now releasing it to the public as a rate-limit-free API. We serve any open-source LLM and can also deploy optimized versions of your custom fine-tuned LLM with cost-effective autoscaling. Sign up here, create an API key, get $100 of credit on us, and run as many requests as you like!
Deploying AI models in production requires expensive infrastructure. Serving more than ~10 req/s on a single GPU with open-source inference engines like vLLM results in terrible quality of service. Time-to-first-token skyrockets to more than 10 s, and end-to-end latency degrades even more!
The common solution: scale horizontally by adding more GPUs.
The problem: GPUs are expensive and hard to find.
We’ve built an AI inference serving system that can sustain hundreds of requests per second while maintaining a time-to-first-token of <1 s, using ~30% fewer GPUs than NVIDIA’s NIM containers and up to 2x fewer GPUs than vLLM.
This enables us to provide a rate-limit-free API while maintaining a high quality of service. Alternatively, we can provide this as a cost-effective on-prem deployment solution, ensuring your infrastructure costs don’t blow up as the number of requests served grows. We support any open-source model and can also host your custom fine-tuned model as an API with autoscaling enabled.
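To make the GPU savings above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the ~10 req/s per GPU figure quoted above is the ceiling for acceptable quality of service with vLLM, and it treats the "up to 2x fewer GPUs" figure as a best case. The 200 req/s target load and the function names are hypothetical, chosen only for illustration; real sizing depends on the model, the hardware, and the request mix.

```python
import math

# Assumption taken from the text: beyond roughly 10 req/s per GPU,
# vLLM's quality of service degrades badly.
VLLM_REQS_PER_GPU = 10.0


def gpus_needed(target_rps: float, rps_per_gpu: float) -> int:
    """GPUs required to serve target_rps if each GPU sustains rps_per_gpu."""
    return math.ceil(target_rps / rps_per_gpu)


def gpus_after_reduction(baseline_gpus: int, reduction_factor: float) -> int:
    """GPUs required when the baseline count shrinks by reduction_factor
    (2.0 means "2x fewer")."""
    return math.ceil(baseline_gpus / reduction_factor)


# Hypothetical target load, for illustration only.
target_rps = 200

baseline = gpus_needed(target_rps, VLLM_REQS_PER_GPU)
best_case = gpus_after_reduction(baseline, 2.0)

print(f"vLLM baseline: {baseline} GPUs")
print(f"Best case with 2x fewer GPUs: {best_case} GPUs")
```

Under these assumptions, a 200 req/s workload that would need about 20 GPUs on the vLLM baseline would need about 10 in the best case.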
To build such a scalable and available system, we needed a top-quality hardware provider. We want to take this opportunity to shout out Ori Global Cloud, a key partner in this journey, whose serverless Kubernetes platform enables AI inference at scale. Ori Serverless Kubernetes is an infrastructure service that combines powerful scalability, simple management, and affordability to help AI-focused startups realize their wildest AI ambitions. Reach out to Ori for exclusive GPU cloud deals!
Our pricing is transparent and can be found here: https://console.ncompass.tech/public-pricing