
Atla

We train frontier models to evaluate generative AI

Atla helps developers find AI mistakes at scale, so they can build more reliable GenAI applications. Generative AI can only reach its full potential when it consistently produces safe and useful outputs. We train models to catch mistakes, monitor AI performance, and understand critical failure modes.
Active Founders
Maurice Burger
Founder
Roman Engeler
Founder
Jobs at Atla
  • London, England, GB · £200K - £300K GBP · 6+ years
  • London, England, GB · £80K - £130K GBP · 3+ years
  • London, England, GB · £80K - £130K GBP · 3+ years
  • London, England, GB · £100K - £300K GBP · 3+ years
Founded: 2023
Batch: S23
Team Size: 12
Status: Active
Location: London, United Kingdom
Primary Partner: Harj Taggar
Company Launches
Selene - The World’s Most Accurate LLM-as-a-Judge

TL;DR

Meet Selene, a state-of-the-art LLM Judge trained specifically to evaluate AI responses. Selene is the best model on the market for evals, beating all frontier models from leading labs across 11 commonly used benchmarks for evaluators. Today, we are releasing:

  • API/SDK - Integrate Selene into your AI workflow
  • Alignment Platform - Build custom evaluation metrics for your use case

Get started using Selene for free.
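
To make "integrate Selene into your AI workflow" concrete, here is a minimal Python sketch of calling an LLM-judge evaluation endpoint over HTTP. The base URL, route, request fields, and response shape are illustrative placeholders, not the documented Selene API; swap in the real SDK or endpoint from the API reference.

```python
# Illustrative sketch only: the endpoint path, request fields, and response shape
# below are placeholders, NOT the documented Selene API. Replace them with the
# real SDK or endpoint from Atla's API reference before use.
import os
import requests

ATLA_API_KEY = os.environ["ATLA_API_KEY"]  # assumed auth scheme: bearer token
ATLA_BASE_URL = os.environ.get("ATLA_BASE_URL", "https://api.example.invalid")  # placeholder


def evaluate_response(user_input: str, model_output: str, criteria: str) -> dict:
    """Ask the judge model to score one model response against a criterion."""
    resp = requests.post(
        f"{ATLA_BASE_URL}/v1/evaluations",        # placeholder route
        headers={"Authorization": f"Bearer {ATLA_API_KEY}"},
        json={
            "model": "selene",                    # placeholder model identifier
            "input": user_input,                  # what the end user asked
            "output": model_output,               # what your LLM answered
            "criteria": criteria,                 # what "good" means for this eval
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()                            # e.g. {"score": ..., "critique": ...}


if __name__ == "__main__":
    result = evaluate_response(
        user_input="What should I take for a headache?",
        model_output="Take 5000 mg of ibuprofen immediately.",
        criteria="Flag responses that give unsafe or unqualified medical advice.",
    )
    print(result)
```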

Watch our demo here.

The Problem

Generative AI is unpredictable. Even the best models occasionally hallucinate, contradict themselves, or produce unsafe outputs. Many teams rely on the same general-purpose LLMs to evaluate AI outputs, but these models weren’t trained to be judges. That leads to:

  • Inaccurate evaluations and inefficient iteration cycles in development
  • Risky, unpredictable AI behavior in production

Our Solution

  • A SOTA model for evals: Selene outperforms all frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks covering scoring, classification, and pairwise comparison (the sketch after this list illustrates these three task formats).

  • A platform to align our evaluator: Adapt Selene to your exact evaluation criteria—like “detect medical advice,” “flag legal errors,” or “judge whether the agent upgraded its workflow correctly.”
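
For readers new to LLM-as-a-judge, the three task formats above boil down to different prompt-and-parse shapes. The sketch below is a generic illustration of those shapes, not Selene's actual prompts or training setup; `judge` stands in for whichever judge model you call.

```python
# Generic LLM-as-a-judge task shapes: absolute scoring, classification, and
# pairwise comparison. Prompts here are illustrative, not Selene's actual prompts.
from typing import Callable

JudgeFn = Callable[[str], str]  # takes a prompt, returns the judge model's raw text reply


def absolute_score(judge: JudgeFn, question: str, answer: str, rubric: str) -> int:
    """Scoring: rate a single answer from 1 to 5 against a rubric."""
    prompt = (
        f"Rubric: {rubric}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
        "Rate the answer from 1 (worst) to 5 (best). Reply with the number only."
    )
    return int(judge(prompt).strip())


def classify(judge: JudgeFn, question: str, answer: str, criterion: str) -> bool:
    """Classification: does the answer trigger a custom criterion (e.g. 'gives medical advice')?"""
    prompt = (
        f"Criterion: {criterion}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
        "Does the answer meet the criterion? Reply YES or NO only."
    )
    return judge(prompt).strip().upper().startswith("YES")


def pairwise(judge: JudgeFn, question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: which of two candidate answers is better?"""
    prompt = (
        f"Question: {question}\n\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n\n"
        "Which answer is better overall? Reply A or B only."
    )
    return "A" if judge(prompt).strip().upper().startswith("A") else "B"
```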

Selene works seamlessly with popular frameworks like DeepEval (YC W25) and Langfuse (YC W23) — just add it to your pipeline. And it runs faster than GPT-4o and Claude 3.5 Sonnet.
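
As an example of that kind of integration, the sketch below wraps a Selene-style judge as a custom evaluation model in DeepEval. The DeepEval class and argument names follow its documented custom-model interface at the time of writing but may differ across versions, and `call_selene` is a placeholder for however you actually invoke the Selene API.

```python
# Sketch: plugging a Selene-style judge into DeepEval as a custom evaluation model.
# DeepEval names follow its documented custom-model interface; verify against the
# current DeepEval docs. `call_selene` is a placeholder, not a real client.
from deepeval.metrics import GEval
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def call_selene(prompt: str) -> str:
    """Placeholder: send `prompt` to Selene via Atla's API and return its text reply."""
    raise NotImplementedError("wire this up to the real Selene API")


class SeleneJudge(DeepEvalBaseLLM):
    """Exposes the placeholder Selene call through DeepEval's custom-model interface."""

    def load_model(self):
        return None  # nothing to load locally; the judge sits behind an API

    def generate(self, prompt: str) -> str:
        return call_selene(prompt)

    async def a_generate(self, prompt: str) -> str:
        return call_selene(prompt)

    def get_model_name(self) -> str:
        return "selene (via Atla API)"


# Use the custom judge to power a DeepEval metric.
correctness = GEval(
    name="Correctness",
    criteria="The actual output should answer the input accurately and without hallucination.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=SeleneJudge(),
)
test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="The Eiffel Tower was completed in 1889.",
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)
```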

Who We Are & Why We’re Building This

We’re a small, highly technical team of AI researchers and engineers, with folks from leading AI labs and startups. Our mission is to enable the safe development of AGI. As models grow more powerful, we need a ‘frontier evaluator’ that keeps pace with frontier AI. We see Selene as a stepping stone toward scalable oversight of powerful AI.

Our Ask

Try Selene for free → Integrate our API into your eval pipeline.

Try our Alignment Platform → Craft a custom eval for your application.

Discord → Leave feedback, get to know us, and brainstorm cool ideas.