tl;dr: Vectorview makes it easy to evaluate the capabilities of foundation models and LLM agents. We do this by building custom task evaluators that red team for safety and benchmark performance.
Hi Everyone! We’re Lukas (left) and Emil (right). Lukas has a background as an engineer, has previously worked with LLMs at Google, and is passionate about AI safety. Emil also has a background as an engineer specializing in AI and ML. Previously he led a team at Klarna aimed at market expansion.
It’s difficult to prevent unwanted behaviors in LLMs due to their non-deterministic nature. Testing them against every possible scenario is hard, making it tough to catch all unintended behaviors. Additionally, most evaluation benchmarks (like MMLU or BBQ) are too general, missing the specific issues that can arise in real-world use. Take this example:
This issue isn’t limited to chatbots. It spans across LLM agents designed for specialized tasks and extends to AI labs striving for model safety. The task of crafting, deploying, and precisely scoring custom evaluations is complex and time-consuming.
Each use case demands a custom evaluation. In the case of of the Chevrolet chatbot, our custom auto-red teaming solution could be implemented to prevent the mistake.
Our platform offers a suite of custom evaluation tools designed to benchmark AI applications against specific, real-world scenarios they are likely to encounter. This targeted approach ensures that AI behaves as intended, mitigating the risk of unintended behaviors that generic benchmarks often miss.
Are you working with LLM agents or foundation models? If you need custom evaluations specific to your use case - we can help!