The only end-to-end AI development platform you need: prompt management, evals, observability
@Sumanyu Sharma and @Marius Buleandra from @Hamming AI
TLDR: Are you struggling to make your RAG & AI agents reliable? We're launching our AI Optimization Platform to help eng teams speed up iteration velocity, root-cause bad AI outputs, and prevent regressions.
Click here to try our AI Optimization Platform
Previously, Marius and I ran growth-focused eng and data teams at Tesla, Citizen, Spell & Anduril. We learned that running experiments is the best way to move a metric. More experiments = more growth.
Last year, we shipped 10+ RAG & AI agents to production. We found the same pattern holds when building reliable AI products. More experiments = more reliability = more retention for your AI products.
Here's the workflow most teams follow:
Steps 2 and 4 are often the slowest & most painful parts of the feedback loop. This is what we tackle.
We use LLMs to score the outputs of other LLMs. This is the fastest way to speed up the feedback loop.
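To make the pattern concrete, here is a minimal LLM-as-judge sketch using the OpenAI Python SDK. The rubric, model choice, and 1-5 scale are illustrative assumptions, not Hamming's internals.

```python
# Minimal LLM-as-judge sketch (illustrative; not Hamming's implementation).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}.
Score 5 = fully correct and grounded, 1 = wrong or hallucinated."""

def judge(question: str, answer: str) -> dict:
    """Score one output with another LLM and return {"score", "reason"}."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```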
We go beyond passive LLM / trace-level monitoring. We actively score your production outputs in real time and flag the cases your team needs to dig into, so eng teams can quickly prioritize what to fix.
We make offline evaluations easy, so you can change your system and get feedback in minutes.
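As a rough sketch of what that feedback loop looks like, the snippet below runs a toy golden dataset through a placeholder pipeline and averages the scores. `run_pipeline` and `judge` stand in for your system under test and your scorer; both are assumptions for illustration.

```python
# Toy offline eval loop (illustrative). `run_pipeline` is a placeholder for
# your RAG/agent; `judge` is any scorer, e.g. the LLM judge sketched above.
from statistics import mean

golden_dataset = [
    {"input": "What is our refund window?", "expected": "30 days"},
    {"input": "Do we ship to Canada?", "expected": "Yes"},
]

def run_pipeline(question: str) -> str:
    # Placeholder: call your system under test here.
    return "We offer a 30 day refund window."

def judge(question: str, answer: str, expected: str) -> float:
    # Placeholder scorer: naive substring match; swap in an LLM judge in practice.
    return 1.0 if expected.lower() in answer.lower() else 0.0

scores = []
for example in golden_dataset:
    output = run_pipeline(example["input"])
    scores.append(judge(example["input"], output, example["expected"]))

print(f"avg score: {mean(scores):.2f} over {len(scores)} examples")
```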
Eval-driven prompt iteration
Rapidly iterate with new prompts and models in our prompt playground with first-class support for function-calling. Run evals so you know your changes are improving things.
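For function-calling evals specifically, one simple check is asserting that the model invoked the expected tool with the expected arguments. A hypothetical sketch (the tool name and data shapes are made up):

```python
# Toy check that a model response contains the expected tool call
# (illustrative; tool/function names here are hypothetical).
expected_call = {"name": "lookup_order", "args": {"order_id": "A123"}}

def called_expected_tool(response_tool_calls: list[dict], expected: dict) -> bool:
    """Pass if any tool call matches the expected name and arguments."""
    return any(
        call["name"] == expected["name"] and call["args"] == expected["args"]
        for call in response_tool_calls
    )

model_tool_calls = [{"name": "lookup_order", "args": {"order_id": "A123"}}]
assert called_expected_tool(model_tool_calls, expected_call)
```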
Deploy prompt changes without a code change
We keep track of all prompt versions and update your prompts on the fly without needing a code change.
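Mechanically, this kind of setup amounts to the application reading the currently deployed prompt version from a registry at request time. The sketch below is a hypothetical pattern with a made-up `fetch_prompt` helper and endpoint, not Hamming's actual SDK.

```python
# Hypothetical prompt-registry pattern (not Hamming's actual SDK).
# The app fetches the currently deployed prompt version at request time,
# so promoting a new version in the registry requires no code deploy.
import requests

REGISTRY_URL = "https://example.com/prompts"  # placeholder endpoint

def fetch_prompt(name: str, label: str = "production") -> str:
    """Return the prompt template currently labeled `label` for `name`."""
    resp = requests.get(f"{REGISTRY_URL}/{name}", params={"label": label}, timeout=5)
    resp.raise_for_status()
    return resp.json()["template"]

template = fetch_prompt("support-agent")
prompt = template.format(question="How do I reset my password?")
```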
Easily create golden datasets
The bottleneck for offline evaluations is building a high-quality golden dataset of input/output pairs. We support converting production traces to dataset examples in one click.
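Conceptually, the conversion just lifts the input, output, and retrieved context out of a logged trace into a dataset row. A hypothetical sketch (the field names are assumptions):

```python
# Hypothetical shape of a trace-to-dataset conversion (field names are assumed).
def trace_to_example(trace: dict) -> dict:
    """Turn one logged production trace into a golden-dataset example."""
    return {
        "input": trace["request"]["question"],
        "expected_output": trace["response"]["answer"],  # curate/edit before trusting
        "context": trace.get("retrieved_chunks", []),    # keep retrieval for RAG evals
    }

trace = {
    "request": {"question": "Do you support SSO?"},
    "response": {"answer": "Yes, we support SAML and OIDC."},
    "retrieved_chunks": ["Our enterprise plan includes SAML SSO."],
}
print(trace_to_example(trace))
```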
Quickly diagnose retrieval, reasoning, and function-calling errors
Differentiating between retrieval, reasoning, and function-calling errors is time-consuming. We score each retrieved context chunk on metrics like hallucination, recall, and precision so you can focus your eng effort where it matters.
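For intuition, here is how context-level retrieval precision and recall can be computed against a labeled set of relevant chunks. These are illustrative definitions; a hallucination check would typically use an LLM judge rather than string matching.

```python
# Toy context-level retrieval metrics (illustrative definitions, not Hamming's scorers).
def retrieval_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

def retrieval_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return sum(chunk in relevant for chunk in set(retrieved)) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_d"]
relevant = {"chunk_a", "chunk_c"}
print(retrieval_precision(retrieved, relevant), retrieval_recall(retrieved, relevant))
```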
Override AI scores
Sometimes our AI scores disagree with your definition of "good". We make it easy to override our scores with your preferences. Our AI scorer learns from your feedback.
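One common way a scorer can "learn" from overrides is to fold the corrected examples back into the judge prompt as calibration guidance. The sketch below is a hypothetical illustration of that idea, not Hamming's actual mechanism.

```python
# Hypothetical feedback loop: human overrides become calibration examples for the judge.
overrides = [
    {"question": "Can I pay by invoice?",
     "answer": "No.",
     "ai_score": 4, "human_score": 1,
     "note": "Curt one-word answers should not score well."},
]

def build_judge_prompt(question: str, answer: str) -> str:
    guidance = "\n".join(
        f"- Q: {o['question']} / A: {o['answer']} -> human score {o['human_score']} ({o['note']})"
        for o in overrides
    )
    return (
        "Grade the answer 1-5. Calibrate to these human-corrected examples:\n"
        f"{guidance}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    )

print(build_judge_prompt("Do you offer refunds?", "Yes, within 30 days."))
```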
Sumanyu previously helped Citizen (a safety app backed by Founders Fund, Sequoia, and 8VC) grow its users 4X, and grew an AI-powered sales program at Tesla to $100s of millions in revenue/year.
Marius previously ran data infrastructure @ Anduril, drove user growth at Citizen with Sumanyu, and was a founding engineer @ Spell (an MLOps startup acquired by Reddit).
We previously launched Prompt Optimizer, which saves 80% of manual prompt engineering effort, on Launch YC. In this launch, we show how teams use Hamming to build reliable RAG and AI agents.
Free 1:1 debug session. Struggling to make your RAG/agents reliable? We're offering a complimentary 1:1 RAG/agent debugging session. Book time with us here.