tl;dr: Parea AI automates the creation of evals for your AI products. We do this by bootstrapping an evaluation function from human annotations, letting you automagically turn “vibe checks” into scalable, reliable evaluations aligned with human judgment.
Evaluating free-form text usually comes down to two options: having humans review outputs or using LLMs to judge them. The former is laborious, slow, and expensive, while the latter often scores outputs incorrectly out of the box. For LLM evaluations to work properly, they need to be prompt engineered; i.e., they require their own optimization process.
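To make that concrete, here is a minimal sketch of an LLM-as-judge eval. The prompt wording, scoring scale, and model choice are all illustrative assumptions, and each of them typically needs several rounds of iteration before the scores reliably track human judgment.

```python
# Hypothetical LLM-as-judge eval; the prompt, scale, and model are illustrative only.
from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str) -> float:
    """Ask an LLM to score an answer from 0-10 and return a normalized 0-1 score."""
    prompt = (
        "Rate how helpful and factually correct the answer is on a scale of 0-10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip()) / 10
```

Everything in that prompt, from the rating criteria to the output format, is a knob you would normally have to tune by hand against human labels.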
The best LLM evals are adapted to your particular business use case and data. We've developed a method for uploading human annotations (via CSV or our Annotation Queue) and bootstrapping an evaluation that mimics those annotations. You need as few as 20 sample annotations to create a human-aligned eval. Using your new LLM eval is as easy as copying the generated code into your codebase or calling it directly via Parea's API. Check out our docs to see the complete workflow.
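For a rough sense of what dropping a bootstrapped eval into your codebase could look like, here is a sketch. The function name, signature, and placeholder body are hypothetical, not Parea's actual generated code; the real workflow and API are covered in the docs linked above.

```python
# Hypothetical usage of a bootstrapped eval in an existing codebase.
# The name, signature, and body are illustrative stand-ins, not generated output.

def support_reply_eval(output: str) -> float:
    """Stand-in for an eval bootstrapped from ~20 human annotations.

    The generated version would prompt an LLM tuned to mimic your annotators
    and return a score between 0 and 1.
    """
    # Trivial placeholder heuristic so this sketch runs end to end.
    return 1.0 if "sorry" in output.lower() else 0.0

if __name__ == "__main__":
    draft = "We're sorry for the delay; your refund is on its way."
    score = support_reply_eval(draft)
    print(f"eval score: {score:.2f}")  # gate CI or deployments on this score
```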