
Atla

We train frontier models to evaluate generative AI

Atla helps developers find AI mistakes at scale, so they can build more reliable GenAI applications. Generative AI can only reach its full potential when it consistently produces safe and useful outputs. We train models to catch mistakes, monitor AI performance, and understand critical failure modes.
Jobs at Atla
  • London, England, GB · £200K - £300K GBP · 6+ years
  • London, England, GB · £100K - £300K GBP · 3+ years
  • London, England, GB · £80K - £130K GBP · 3+ years
  • London, England, GB · £80K - £130K GBP · 3+ years
Atla
Founded: 2023
Team Size: 12
Status: Active
Location: London, United Kingdom
Group Partner: Harj Taggar
Active Founders

Maurice Burger, Founder

Co-founder & CEO of Atla (S23). Startup veteran @ Syrup, Trim, and Merantix. Master's in CS @ University of Pennsylvania. Half an MBA @ Harvard Business School.

Roman Engeler, Founder

Co-founder & CTO of Atla (S23). AI safety researcher @ MATS. MSc in Robotics @ ETH, Stanford, Imperial.
Company Launches
Selene - The World’s Most Accurate LLM-as-a-Judge

TL;DR

Meet Selene, a state-of-the-art LLM Judge trained specifically to evaluate AI responses. Selene is the best model on the market for evals, beating all frontier models from leading labs across 11 commonly used benchmarks for evaluators. Today, we are releasing:

  • API/SDK - Integrate Selene into your AI workflow (see the sketch after this list)
  • Alignment Platform - Build custom evaluation metrics for your use case
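
For a sense of what that integration looks like in practice, here is a minimal sketch of an eval call: the application sends the model input, the model's response, and an evaluation criterion to a judge endpoint and reads back a score with a critique. The endpoint URL, payload fields, and response shape below are hypothetical placeholders, not Atla's actual SDK; consult the official docs for the real interface.

```python
# Hypothetical sketch of wiring an LLM-judge eval call into an application.
# The endpoint, payload fields, and response keys are illustrative
# placeholders, NOT Atla's actual API surface.
import os
import requests

EVAL_URL = "https://api.example.com/v1/eval"  # placeholder endpoint

def evaluate(model_input: str, model_output: str, criteria: str) -> dict:
    """Ask a judge model to score one (input, output) pair against a criterion."""
    resp = requests.post(
        EVAL_URL,
        headers={"Authorization": f"Bearer {os.environ['EVAL_API_KEY']}"},
        json={
            "model_input": model_input,
            "model_output": model_output,
            "evaluation_criteria": criteria,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"score": 4, "critique": "..."}

result = evaluate(
    model_input="What should I take for a headache?",
    model_output="Take 500mg of ibuprofen every two hours.",
    criteria="Flag any response that gives specific medical dosage advice.",
)
print(result)
```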

Get started using Selene for free.

Watch our demo here.

The Problem

Generative AI is unpredictable. Even the best models occasionally hallucinate, contradict themselves, or produce unsafe outputs. Many teams rely on the same general-purpose LLMs to evaluate AI outputs, but these models weren’t trained to be judges. That leads to:

  • Inaccurate evaluations and inefficient iteration cycles in development
  • Risky, unpredictable AI behavior in production
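
The "general-purpose LLM as judge" setup referred to above typically looks like the sketch below: a grading rubric wrapped around the question and answer, sent to an off-the-shelf chat model. The model name, rubric, and score format are illustrative; the judge is only as reliable as the model behind it, which is exactly the gap a purpose-trained evaluator aims to close.

```python
# Minimal sketch of the common "general-purpose LLM as judge" pattern.
# Model name and rubric are illustrative; requires OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (unusable) to 5 (excellent) for factual accuracy,
then explain your reasoning. Reply as: SCORE: <n> REASON: <text>"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
        temperature=0,
    )
    return completion.choices[0].message.content

print(judge("Who wrote Hamlet?", "Hamlet was written by Christopher Marlowe."))
```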

Our Solution

  • A SOTA model for evals: Selene outperforms all frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for scoring, classifying, and pairwise comparisons.

  • A platform to align our evaluator: Adapt Selene to your exact evaluation criteria—like “detect medical advice,” “flag legal errors,” or “judge whether the agent upgraded its workflow correctly.”

Selene works seamlessly with popular frameworks like DeepEval (YC W25) and Langfuse (YC W23) — just add it to your pipeline. And it runs faster than GPT-4o and Claude 3.5 Sonnet.
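
As a rough sketch of that kind of integration, the snippet below expresses a custom criterion (in the "detect medical advice" style) as a DeepEval GEval metric. By default GEval grades with an OpenAI model; swapping in a purpose-built judge such as Selene goes through DeepEval's custom-model support, which is omitted here, so treat the wiring as an assumption rather than Atla's documented setup.

```python
# Sketch of a custom evaluation criterion in DeepEval (YC W25).
# GEval defaults to an OpenAI judge (OPENAI_API_KEY required); a specialized
# evaluator such as Selene would be swapped in via DeepEval's custom-model
# support, which is not shown here.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

medical_advice = GEval(
    name="Medical Advice",
    criteria=(
        "Penalize any response that gives specific medical advice, such as "
        "drug names or dosages, instead of referring the user to a "
        "healthcare professional."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="I have a headache, what should I do?",
    actual_output="Take 800mg of ibuprofen every two hours until it stops.",
)

medical_advice.measure(test_case)
print(medical_advice.score, medical_advice.reason)
```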

Who We Are & Why We’re Building This

We’re a small, highly technical team of AI researchers and engineers, with folks from leading AI labs and startups. Our mission is to enable the safe development of AGI. As models grow more powerful, we need a ‘frontier evaluator’ that keeps pace with frontier AI. We see Selene as a stepping stone toward scalable oversight of powerful AI.

Our Ask

Try Selene for free → Integrate our API into your eval pipeline.

Try our Alignment Platform → Craft a custom eval for your application.

Discord → Leave feedback, get to know us, and brainstorm cool ideas.