nCompass Technologies: Reliable LLM API with no rate-limits

The most cost-effective way to set up and scale your AI infrastructure.

TL;DR:

We’ve built an AI model inference system that serves requests at scale like no other, and now we’re releasing it to the public as a rate-limit-free API. We serve any open-source LLM and can also deploy optimized versions of your custom fine-tuned LLM with cost-effective autoscaling. Sign up here, create an API key, get $100 of credit on us, and run as many requests as you like!

The Problem

Deploying AI models in production requires expensive infrastructure. Serving more than ~10 req/s with open-source inference engines like vLLM on a single GPU results in terrible quality of service: time-to-first-token skyrockets past 10s, and end-to-end latency degrades even further!
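You can see this degradation for yourself with a simple load probe pointed at a vLLM server (vLLM ships an OpenAI-compatible HTTP server). This is a minimal sketch, not a rigorous benchmark: the base URL, model name, and concurrency level below are placeholders, and the probe just measures time-to-first-token (TTFT) across concurrent streaming requests.

```python
# Minimal TTFT probe against a vLLM server (vLLM exposes an
# OpenAI-compatible HTTP server). The base URL, model name, and
# concurrency below are placeholders; adjust them to your deployment.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1"  # placeholder vLLM endpoint
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
CONCURRENCY = 32  # simultaneous requests; raise this to watch TTFT degrade

async def time_to_first_token(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    async with client.stream(
        "POST",
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Hello!"}],
            "stream": True,
            "max_tokens": 64,
        },
        timeout=120.0,
    ) as resp:
        resp.raise_for_status()
        async for line in resp.aiter_lines():
            if line.strip():  # first streamed chunk ~= first token
                return time.perf_counter() - start
    return float("inf")

async def main() -> None:
    async with httpx.AsyncClient() as client:
        ttfts = sorted(
            await asyncio.gather(
                *(time_to_first_token(client) for _ in range(CONCURRENCY))
            )
        )
    print(f"p50 TTFT: {ttfts[len(ttfts) // 2]:.2f}s")
    print(f"p95 TTFT: {ttfts[int(0.95 * (len(ttfts) - 1))]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```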

The common solution: horizontally scale up GPUs.

The problem: GPUs are expensive and hard to find.

Why should you care?

  1. API user: these high infrastructure costs are why you suffer rate limits with existing API providers.
  2. Deploying on-prem: these same costs might be the reason your PoC never moves to production.

Our Solution

We’ve built an AI inference serving system that can sustain hundreds of requests per second while maintaining a time-to-first-token of <1s, on ~30% fewer GPUs than NVIDIA’s NIM containers and up to 2x fewer GPUs than vLLM. For example, a workload that vLLM serves on 10 GPUs and NIM on roughly 7, we serve on about 5.

This enables us to provide a rate-limit-free API while maintaining a high quality of service. Alternatively, we can provide the same system as a cost-effective on-prem deployment, ensuring your infrastructure costs don’t blow up as served requests grow. We support any open-source model and can also host your custom fine-tuned model as an autoscaling API.

Tutorials
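As a quick start, here is a minimal sketch of calling the API from Python. The base URL, endpoint path, payload shape, and environment-variable name below are assumptions for illustration only; see the nCompass console and docs for the actual API reference and to create your real API key.

```python
# Hypothetical quickstart for the nCompass API. The base URL, endpoint
# path, and payload shape are illustrative assumptions, not the
# documented API; check https://console.ncompass.tech for the real one.
import os

import httpx

API_KEY = os.environ["NCOMPASS_API_KEY"]  # key created in the console
BASE_URL = "https://api.ncompass.tech/v1"  # placeholder base URL

resp = httpx.post(
    f"{BASE_URL}/chat/completions",  # assumes an OpenAI-style endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # any supported open-source model
        "messages": [{"role": "user", "content": "What is nCompass?"}],
        "max_tokens": 128,
    },
    timeout=60.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

With no rate limits, you can fire requests like this as fast as your workload demands.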

Shout out

Building such a scalable and available system required a top-quality hardware provider, and we want to take this opportunity to shout out Ori Global Cloud, a key partner on this journey whose serverless Kubernetes platform enables AI inference at scale. Ori Serverless Kubernetes is an infrastructure service that combines powerful scalability, simple management, and affordability to help AI-focused startups realize their wildest AI ambitions. Reach out to Ori for exclusive GPU cloud deals!

Asks

Our pricing is transparent and can be found here: https://console.ncompass.tech/public-pricing