Meet Selene, a state-of-the-art LLM Judge trained specifically to evaluate AI responses. Selene is the best model on the market for evals, beating all frontier models from leading labs across 11 commonly used benchmarks for evaluators. Today, we are releasing:
A SOTA model for evals: Selene outperforms all frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for scoring, classifying, and pairwise comparisons.
A platform to align our evaluator: Adapt Selene to your exact evaluation criteria—like “detect medical advice,” “flag legal errors,” or “judge whether the agent upgraded its workflow correctly.”
Get started using Selene for free.
Watch our demo here.
Generative AI is unpredictable. Even the best models occasionally hallucinate, contradict themselves, or produce unsafe outputs. Many teams rely on the same general-purpose LLMs to evaluate AI outputs, but these models weren’t trained to be judges. That leads to noisy, inconsistent evaluations.
Selene works seamlessly with popular frameworks like DeepEval (YC W25) and Langfuse (YC W23) — just add it to your pipeline. And it runs faster than GPT-4o and Claude 3.5 Sonnet.
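To make “add it to your pipeline” concrete, here is a minimal sketch of the LLM-as-judge pattern: send the user input, the model’s response, and your evaluation criteria to the judge, and get back a score with a critique. The endpoint URL, request fields, and response shape below are placeholder assumptions for illustration, not the actual Selene API or the DeepEval and Langfuse integrations.

```python
"""Minimal LLM-as-judge sketch.

The endpoint URL, request fields, and response shape are placeholder
assumptions for illustration only, not the real Selene API.
"""
import os

import requests

SELENE_URL = "https://api.example.com/v1/eval"  # hypothetical endpoint


def judge(user_input: str, model_output: str, criteria: str) -> dict:
    """Score one model response against one evaluation criterion."""
    response = requests.post(
        SELENE_URL,
        headers={"Authorization": f"Bearer {os.environ['SELENE_API_KEY']}"},
        json={
            "input": user_input,      # what the user asked
            "output": model_output,   # what your model answered
            "criteria": criteria,     # what "good" means for your app
        },
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"score": float, "critique": str}
    return response.json()


if __name__ == "__main__":
    result = judge(
        user_input="I have a headache, what should I take?",
        model_output="Take 400mg of ibuprofen every four hours.",
        criteria="Flag any response that gives medical advice.",
    )
    print(result["score"], result["critique"])
```

The same call sits behind framework integrations: the framework builds the judge prompt and parses the score, so routing its evaluations through Selene is a drop-in change rather than a rewrite of your tests.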
We’re a small, highly technical team of AI researchers and engineers, with folks from leading AI labs and startups. Our mission is to enable the safe development of AGI. As models grow more powerful, we need a ‘frontier evaluator’ that keeps pace with frontier AI. We see Selene as a stepping stone toward scalable oversight of powerful AI.
Try Selene for free → Integrate our API into your eval pipeline (see the sketch after this list).
Try our Alignment Platform → Craft a custom eval for your application.
Discord → Leave feedback, get to know us, and brainstorm cool ideas.
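As a rough illustration of what integrating the API into an eval pipeline can look like, the sketch below scores a small batch of logged interactions with a judge call like the one above and fails the run when the average score falls below a threshold. The `judge` stub, test cases, criteria, and threshold are all assumptions for illustration.

```python
"""Sketch: gating an eval run on judge scores.

`judge` is a stand-in for the Selene call shown in the earlier sketch; the
test cases, criteria, and threshold are made up for illustration.
"""
from statistics import mean


def judge(user_input: str, model_output: str, criteria: str) -> dict:
    # Stub so this sketch runs on its own; replace with a real Selene call.
    return {"score": 0.9, "critique": "No medical or legal advice detected."}


TEST_CASES = [
    {"input": "Summarise this contract clause.", "output": "..."},
    {"input": "What does my MRI result mean?", "output": "..."},
]
CRITERIA = "Flag any response that gives medical or legal advice."
THRESHOLD = 0.8  # arbitrary pass bar for this sketch

scores = []
for case in TEST_CASES:
    result = judge(case["input"], case["output"], CRITERIA)
    scores.append(result["score"])
    print(f"{result['score']:.2f}  {result['critique']}")

print(f"average judge score: {mean(scores):.2f}")
if mean(scores) < THRESHOLD:
    raise SystemExit("Eval gate failed: judge scores below threshold.")
```

Wired into CI this way, judge scores become a regression gate for your prompts and agents rather than a one-off spot check.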