Meet Selene, a state-of-the-art LLM Judge trained specifically to evaluate AI responses. Selene is the best model on the market for evals, beating all frontier models from leading labs across 11 commonly used benchmarks for evaluators. Today, we are releasing:
A SOTA model for evals: Selene outperforms all frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for scoring, classifying, and pairwise comparisons.
A platform to align our evaluator: Adapt Selene to your exact evaluation criteria—like “detect medical advice,” “flag legal errors,” or “judge whether the agent upgraded its workflow correctly.”
Get started using Selene for free.
Watch our demo here.
Generative AI is unpredictable. Even the best models occasionally hallucinate, contradict themselves, or produce unsafe outputs. Many teams rely on the same general-purpose LLMs to evaluate AI outputs, but these models weren’t trained to be judges. That leads to noisy, inconsistent evaluations.
Selene works seamlessly with popular frameworks like DeepEval (YC W25) and Langfuse (YC W23) — just add it to your pipeline. And it runs faster than GPT-4o and Claude 3.5 Sonnet.
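To make “add it to your pipeline” concrete, here is a minimal sketch of the LLM-as-judge pattern: send the user input, the model’s response, and your evaluation criteria to the judge, and get back a score with a critique. The endpoint URL, request fields, and response shape below are placeholder assumptions for illustration, not the actual Selene API or the DeepEval and Langfuse integrations.

```python
"""Minimal LLM-as-judge sketch.

The endpoint URL, request fields, and response shape are placeholder
assumptions for illustration only, not the real Selene API.
"""
import os

import requests

SELENE_URL = "https://api.example.com/v1/eval"  # hypothetical endpoint


def judge(user_input: str, model_output: str, criteria: str) -> dict:
    """Score one model response against one evaluation criterion."""
    response = requests.post(
        SELENE_URL,
        headers={"Authorization": f"Bearer {os.environ['SELENE_API_KEY']}"},
        json={
            "input": user_input,      # what the user asked
            "output": model_output,   # what your model answered
            "criteria": criteria,     # what "good" means for your app
        },
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"score": float, "critique": str}
    return response.json()


if __name__ == "__main__":
    result = judge(
        user_input="I have a headache, what should I take?",
        model_output="Take 400mg of ibuprofen every four hours.",
        criteria="Flag any response that gives medical advice.",
    )
    print(result["score"], result["critique"])
```

The same call sits behind framework integrations: the framework builds the judge prompt and parses the score, so routing its evaluations through Selene is a drop-in change rather than a rewrite of your tests.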
We’re a small, highly technical team of AI researchers and engineers, with folks from leading AI labs and startups. Our mission is to enable the safe development of AGI. As models grow more powerful, we need a ‘frontier evaluator’ that keeps pace with frontier AI. We see Selene as a stepping stone toward scalable oversight of powerful AI.
Try Selene for free → Integrate our API into your eval pipeline (see the sketch after this list).
Try our Alignment Platform → Craft a custom eval for your application.
Discord → Leave feedback, get to know us, and brainstorm cool ideas.
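As a rough illustration of what integrating the API into an eval pipeline can look like, the sketch below scores a small batch of logged interactions with a judge call like the one above and fails the run when the average score falls below a threshold. The `judge` stub, test cases, criteria, and threshold are all assumptions for illustration.

```python
"""Sketch: gating an eval run on judge scores.

`judge` is a stand-in for the Selene call shown in the earlier sketch; the
test cases, criteria, and threshold are made up for illustration.
"""
from statistics import mean


def judge(user_input: str, model_output: str, criteria: str) -> dict:
    # Stub so this sketch runs on its own; replace with a real Selene call.
    return {"score": 0.9, "critique": "No medical or legal advice detected."}


TEST_CASES = [
    {"input": "Summarise this contract clause.", "output": "..."},
    {"input": "What does my MRI result mean?", "output": "..."},
]
CRITERIA = "Flag any response that gives medical or legal advice."
THRESHOLD = 0.8  # arbitrary pass bar for this sketch

scores = []
for case in TEST_CASES:
    result = judge(case["input"], case["output"], CRITERIA)
    scores.append(result["score"])
    print(f"{result['score']:.2f}  {result['critique']}")

print(f"average judge score: {mean(scores):.2f}")
if mean(scores) < THRESHOLD:
    raise SystemExit("Eval gate failed: judge scores below threshold.")
```

Wired into CI this way, judge scores become a regression gate for your prompts and agents rather than a one-off spot check.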