
Confident AI

The DeepEval LLM Evaluation Platform

Confident AI allows companies of all sizes to benchmark, safeguard, and improve LLM applications, with best-in-class metrics and guardrails powered by DeepEval. Built by the creators of DeepEval (4.3k stars, >400k monthly downloads), Confident AI offers battle-tested, open-source evaluation algorithms while providing the infrastructure teams need to stay confident in their LLM systems.
Founded: 2024
Team Size: 2
Status: Active
Location: San Francisco
Group Partner: Tom Blomfield
Active Founders

Jeffrey Ip, CEO & Cofounder

Creator of DeepEval, the open-source LLM evaluation framework, which he has grown to over 400k monthly downloads and counting. Previously SWE @ Google, Microsoft.

Kritin Vongthongsri, Co-Founder

Building the #1 LLM Evaluation Platform & empowering teams to red-team and safeguard LLM apps. AI researcher and CHI-published author; previously built NLP pipelines for a fintech startup and researched self-driving cars/HCI while at Princeton (ORFE '24 + CS).
Company Launches
Confident AI: The DeepEval LLM Evaluation Platform

TL;DR – Confident AI is the cloud platform for DeepEval, an open-source evaluation framework we've built to help engineers unit-test LLM applications. DeepEval has 4.8k stars and 500k monthly downloads, runs 700k evaluations every day, and is most commonly found in the CI/CD pipelines of enterprises such as BCG, AstraZeneca, AXA, and Microsoft. Confident AI allows engineering teams to iterate on their LLM app 10x faster by bringing DeepEval to the cloud.
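To make "unit-testing LLM applications" concrete, here is a minimal sketch of a pytest-style DeepEval test of the kind that runs in CI/CD; the query_llm_app function, the example strings, and the 0.7 threshold are illustrative assumptions, not part of the launch post.

```python
# test_llm_app.py: a minimal DeepEval unit test, runnable with pytest or `deepeval test run`
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def query_llm_app(user_input: str) -> str:
    # Hypothetical stand-in for your actual LLM application call.
    return "You can return any item within 30 days for a full refund."


def test_return_policy_answer():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output=query_llm_app("What is your return policy?"),
    )
    # Fails the test (and the CI/CD run) if relevancy scores below the 0.7 threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Wired into a pipeline, a failing metric fails the build, which is how breaking changes get caught before they ship.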

Try Confident AI today (setup in 5 min)

The Problem

Although there are many LLM evaluation solutions on the market, the problem remains unsolved. General LLMOps observability platforms that offer evals lack robust metrics and are better suited for debugging through tracing UIs, while evaluation-focused frameworks don't offer enough control for users to customize metrics and make them reliable for specific use cases.

As a result, developers often build custom evaluation metrics and pipelines from scratch—writing hundreds or even thousands of lines of code to test their LLM apps. The worst part? Once they’ve fine-tuned their metrics and are ready to deploy them across the organization, they hit a roadblock: there’s no easy way to collaborate. Because these custom metrics exist in scattered code rather than an integrated ecosystem, incorporating them to enable team-wide adoption becomes frustrating and inefficient.

The Solution

We built DeepEval for engineers to create use-case-specific, deterministic LLM evaluation metrics, and when you're ready, Confident AI brings these evaluation results to the cloud. This allows teams to collaborate on LLM app iteration — with no extra setup required.

  1. Curate your evaluation dataset on Confident AI.
  2. Run evaluations locally with DeepEval's metrics, pulling datasets from Confident AI (a sketch follows this list).
  3. View and share testing reports to compare prompts and models and refine your LLM application.
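As a rough sketch of steps 1–3 under stated assumptions (the dataset alias "My Evals Dataset", the llm_app function, and the metric threshold are hypothetical), pulling a Confident AI dataset and evaluating it locally with DeepEval might look like this:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def llm_app(user_input: str) -> str:
    # Hypothetical stand-in for your actual LLM application call.
    return "Our premium plan includes unlimited evaluations."


# Step 1: pull the dataset your team curated on Confident AI (alias is an assumption)
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Step 2: generate fresh outputs from your LLM app for each golden in the dataset
test_cases = [
    LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    for golden in dataset.goldens
]

# Step 3: run DeepEval's metrics locally; with a Confident AI API key configured,
# the results show up as a shareable testing report on the platform
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```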

In production, Confident AI continuously evaluates monitored LLM outputs, automatically enriching your dataset with real-world, adversarial test cases. This keeps your evaluation data high-quality and reflective of your use case.

https://youtu.be/yLIhVn3B8Wg

How are we different?

What we've learned is that to get the legitimate evaluation results required for benchmark-driven iteration, you need extremely high-quality metrics and datasets. That's why we've built DeepEval and Confident AI specifically for the ideal LLM evaluation workflow:

  • DeepEval handles the robust, deterministic metrics required for rigorous, use-case-tailored validation (see the sketch after this list).
  • Confident AI provides teams the ability to collaborate on curating the best possible evaluation dataset.
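For a sense of what a use-case-tailored metric can look like in DeepEval, here is a minimal sketch using its G-Eval-style custom metric; the metric name, criteria text, threshold, and example test case are all illustrative assumptions.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom, use-case-specific metric defined in plain language (criteria are assumptions)
correctness = GEval(
    name="Medical Correctness",
    criteria="Determine whether the actual output is factually consistent "
             "with the expected output and avoids unsupported medical claims.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.8,
)

test_case = LLMTestCase(
    input="Can I take ibuprofen with aspirin?",
    actual_output="It is generally not recommended without consulting a doctor.",
    expected_output="Combining ibuprofen and aspirin should be discussed with a doctor.",
)

# Score the test case and inspect the metric's reasoning
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```

Because the metric definition lives in code alongside the test cases, the same definition can run locally, in CI/CD, and against datasets curated on Confident AI.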

If your evaluation results are 100% reflective of your LLM application's performance, what's stopping you from shipping the best version of your LLM app?

Customer ROI Metrics

We've been working with companies of all sizes, and some of the ROI metrics include:

  • Decreasing LLM costs by more than 70% by using evaluation to safely switch from GPT-4o to cheaper models.
  • Decreasing time to deployment for a team of 7 from 1-2 weeks to 2-3 hours.
  • Helping a team of 30 customer support agents save 200+ hours a week by having a centralized place to analyze LLM performance live.

There's more to come as we start publishing some case studies on our website, so stay tuned!

Our Ask

Thanks for sticking with us to the end! We're on a mission to help companies get the most ROI out of their LLM use cases, and we believe this is only achievable through rigorous LLM evaluation at scale. If our mission resonates with you, Confident AI is available to try immediately (coding required, but free to try): https://docs.confident-ai.com/confident-ai/confident-ai-introduction

If you want to explore our enterprise offering, you can always talk to us here.

About Us

Confident AI was founded by Jeffrey Ip, a SWE formerly at Google, where he scaled YouTube's Creator Studio infrastructure, and at Microsoft, where he built document recommenders for Office 365, and Kritin Vongthongsri, an AI researcher and CHI-published author who previously built NLP pipelines for fintech startups and researched self-driving cars/HCI during his time at Princeton, where he studied ORFE and CS.

Selected answers from Confident AI's original YC application for the W25 Batch

Describe what your company does in 50 characters or less.

LLM Evaluation Platform for LLM Practitioners

What is your company going to make? Please describe your product and what it does or will do.

We are building an open-source LLM evaluation framework (DeepEval) for LLM practitioners to unit-test LLM applications. When used in conjunction with our evaluation platform (Confident AI), we provide insights on the best parameters (e.g. model, prompt-template) to use, a centralized place for teams to collaborate on evaluation datasets, and real-time performance tracking for LLM applications in production.

Without Confident AI, companies would have to build their own framework to automate LLM testing in CI/CD and prevent unnoticed breaking changes, would have no visibility into which parameters give the best-performing results, would pass evaluation datasets around between teams via email or Slack to discuss failing test cases, would be unable to pinpoint how LLM performance relates to top-line business KPIs, and would need to hire expert human evaluators to evaluate sampled LLM responses in production.