nCompass is a platform for accelerating and hosting open-source and custom AI models. We provide low-latency AI deployment without rate-limiting you. All with just one line of code.
I am a recent PhD graduate from Imperial College London with experience in machine learning algorithms, compilers, and hardware architectures. I've worked on compiler teams at Qualcomm and Huawei and served as a reviewer for ICML. My co-founder and I are building nCompass, a platform for accelerating and hosting both open-source and custom large AI models. Our focus is on providing rate-unlimited, low-latency large AI inference with only one line of code.
I'm a recent PhD graduate from Imperial College London, where I specialized in reconfigurable hardware architectures for accelerated machine learning and reduced-precision training algorithms. I have also worked as an AI feasibility consultant, prototyping and evaluating AI spin-outs. We are building nCompass, a platform for accelerating and hosting both open-source and custom large AI models. Our focus is on providing rate-unlimited, low-latency large AI inference with only one line of code.
We’ve built an AI model inference system that serves requests at scale like no other, and now we’re releasing it to the public as a rate-limit-free API. We serve any open-source LLM and can also deploy optimized versions of your custom fine-tuned LLM with cost-effective autoscaling. Sign up here, create an API key, get $100 of credit on us, and run as many requests as you like!
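As a rough sketch of what "one line of code" looks like in practice, assuming an OpenAI-compatible chat-completions interface (the base URL, model name, and environment variable below are illustrative placeholders, not confirmed details of the API):

```python
# Illustrative sketch only: assumes an OpenAI-compatible chat-completions
# endpoint. The base URL, model name, and env var are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ncompass.tech/v1",   # placeholder endpoint
    api_key=os.environ["NCOMPASS_API_KEY"],    # key created in the console
)

# One call against any hosted open-source model; no rate limits to manage.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```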
Deploying AI models in production requires expensive infrastructure. Serving more than ~10 req/s using open-source inference engines like vLLM on a single GPU results in terrible quality of service: time-to-first-token skyrockets to more than 10s, and end-to-end latency degrades even more!
The common solution: horizontally scale out with more GPUs.
The problem: GPUs are expensive and hard to find.
We’ve built an AI inference serving system that sustains hundreds of requests per second while maintaining a time-to-first-token of under 1s, using ~30% fewer GPUs than NVIDIA’s NIM containers and up to 2x fewer GPUs than vLLM.
This enables us to provide a rate-limit-free API while maintaining a high quality of service. Alternatively, we can provide this as a cost-effective on-prem deployment solution, ensuring your infrastructure costs don’t blow up as your request volume grows. We support any open-source model and can host your custom fine-tuned model as an autoscaled API as well.
To build such a scalable and available system, we needed a top-quality hardware provider. We want to take this opportunity to shout out Ori Global Cloud, a key partner in this journey, whose serverless Kubernetes platform powers our AI inference at scale. Ori Serverless Kubernetes is an infrastructure service that combines powerful scalability, simple management, and affordability to help AI-focused startups realize their wildest AI ambitions. Reach out to Ori for exclusive GPU cloud deals!
Our pricing is transparent and can be found here: https://console.ncompass.tech/public-pricing