
Confident AI: The DeepEval LLM Evaluation Platform

An LLM testing platform for engineering teams to pinpoint which iteration of their app to put in production.

TL;DR – Confident AI is the cloud platform for DeepEval, an open-source evaluation framework we've built to help engineers unit-test LLM applications. DeepEval has 4.8k stars and 500k monthly downloads, runs 700k evaluations every day, and is most commonly found in the CI/CD pipelines of enterprises such as BCG, AstraZeneca, AXA, and Microsoft. Confident AI helps engineering teams iterate on their LLM apps 10x faster by bringing DeepEval to the cloud.

Try Confident AI today (setup in 5 min)

The Problem

Although there are many LLM evaluation solutions on the market, the problem remains unsolved. General LLMOps observability platforms that offer evals lack robust metrics and are better suited for debugging through tracing UIs, while evaluation-focused frameworks don't offer enough control for users to customize metrics and make them reliable for specific use cases.

As a result, developers often build custom evaluation metrics and pipelines from scratch, writing hundreds or even thousands of lines of code to test their LLM apps. The worst part? Once they've fine-tuned their metrics and are ready to deploy them across the organization, they hit a roadblock: there's no easy way to collaborate. Because these custom metrics live in scattered code rather than an integrated ecosystem, rolling them out for team-wide adoption becomes frustrating and inefficient.

The Solution

We built DeepEval for engineers to create use-case-specific, deterministic LLM evaluation metrics, and when you're ready, Confident AI brings these evaluation results to the cloud. This allows teams to collaborate on LLM app iteration — with no extra setup required.
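In DeepEval, a metric lives inside an ordinary unit test, which is what lets it drop into CI/CD. Here's a minimal sketch of what that looks like; the metric choice, threshold, and strings are illustrative, and exact imports may vary between DeepEval versions:

  # Illustrative DeepEval unit test; metric choice and threshold are placeholders.
  from deepeval import assert_test
  from deepeval.test_case import LLMTestCase
  from deepeval.metrics import AnswerRelevancyMetric

  def test_refund_question():
      # One test case: the input sent to your LLM app and the output it produced.
      test_case = LLMTestCase(
          input="What is your refund policy?",
          actual_output="You can request a full refund within 30 days of purchase.",
      )
      # A thresholded metric: scoring below 0.7 fails the test, just like any assert.
      metric = AnswerRelevancyMetric(threshold=0.7)
      assert_test(test_case, [metric])

You run this like any other pytest suite (DeepEval also ships a "deepeval test run" CLI wrapper), which is how it typically ends up in a CI/CD pipeline.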

  1. Curate your evaluation dataset on Confident AI.
  2. Run evaluations locally with DeepEval's metrics, pulling datasets from Confident AI (see the sketch after this list).
  3. View and share testing reports to compare prompts and models and refine your LLM application.
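Here's a minimal sketch of steps 2 and 3, assuming a dataset with the placeholder alias "My Evals Dataset" already exists on Confident AI; my_llm_app stands in for your own application code, and the dataset API shown may differ slightly across DeepEval versions:

  # Sketch of pulling a Confident AI dataset and evaluating it locally.
  # "My Evals Dataset" and my_llm_app are placeholders for your own setup.
  from deepeval import evaluate
  from deepeval.dataset import EvaluationDataset
  from deepeval.metrics import AnswerRelevancyMetric
  from deepeval.test_case import LLMTestCase

  def my_llm_app(prompt: str) -> str:
      # Placeholder for your own application code.
      return "..."

  # Step 2: pull the dataset curated on Confident AI.
  dataset = EvaluationDataset()
  dataset.pull(alias="My Evals Dataset")

  # Generate an output from your LLM app for each golden in the dataset.
  test_cases = [
      LLMTestCase(input=golden.input, actual_output=my_llm_app(golden.input))
      for golden in dataset.goldens
  ]

  # Step 3: run the metrics locally; with a Confident AI API key configured,
  # the results appear as a shareable testing report for the team.
  evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])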

Confident AI also continuously evaluates monitored LLM outputs in production, automatically enriching your dataset with real-world, adversarial test cases. This keeps your evaluation data high-quality and reflective of your use case.
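On the production side, the rough shape looks something like the following. The deepeval.monitor(...) call and its argument names are an assumption based on older DeepEval docs, so treat this purely as a sketch and check the current Confident AI documentation for the exact API:

  # Assumed sketch of logging a production interaction to Confident AI.
  # The monitor() helper and its arguments are an assumption and may not
  # match the current DeepEval/Confident AI API.
  import deepeval

  def my_llm_app(prompt: str) -> str:
      # Placeholder for your own application code.
      return "..."

  def handle_user_query(user_input: str) -> str:
      response = my_llm_app(user_input)
      # Send the monitored output to Confident AI so it can be evaluated
      # online and, where useful, folded back into the evaluation dataset.
      deepeval.monitor(
          event_name="production-chatbot",
          model="gpt-4o",
          input=user_input,
          response=response,
      )
      return response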

https://youtu.be/yLIhVn3B8Wg

How are we different?

What we've learned is that to get evaluation results legitimate enough for benchmark-driven iteration, you need extremely high-quality metrics and datasets. That's why we've built each piece specifically for the ideal LLM evaluation workflow:

  • DeepEval handles robust, deterministic metrics required for rigorous, use-case-tailored validation.
  • Confident AI provides teams the ability to collaborate on curating the best possible evaluation dataset.

If your evaluation results are 100% reflective of your LLM application's performance, what's stopping you from shipping the best version of your LLM app?

Customer ROI metrics

We've been working with companies of all sizes, and some of the ROI metrics include:

  • Decreasing LLM costs by more than 70% by using evaluation to safely switch from GPT-4o to cheaper models.
  • Decreasing time to deployment for a team of 7 from 1-2 weeks to 2-3 hours.
  • Helping a team of 30 customer support agents save 200+ hours a week by having a centralized place to analyze LLM performance live.

There's more to come as we start publishing some case studies on our website, so stay tuned!

Our Ask

Thanks for sticking with us to the end! We’re on a mission to help companies get the most ROI out of their LLM use cases, and we believe this is only achievable through rigorous LLM evaluation at scale. If our mission resonates with you, Confident AI is always here and available to try immediately (coding required, but free to try): https://docs.confident-ai.com/confident-ai/confident-ai-introduction

If you want to explore our enterprise offering, you can always talk to us here.

About Us

Confident AI was founded by Jeffrey Ip, a software engineer formerly at Google, where he scaled YouTube's Creator Studio infrastructure, and at Microsoft, where he built document recommenders for Office 365, and Kritin Vongthongsri, an AI researcher and CHI-published author who previously built NLP pipelines for fintech startups and researched self-driving cars and HCI at Princeton, where he studied ORFE and CS.