We build tools for developing applications with generative AI models. Whether you are using LLMs, vector databases, or text-to-image models like Stable Diffusion, our tools help you find the right model, prompt, and configuration, and consistently monitor their behavior in production.

---

**PromptTools**: the first open-source, developer-focused SDK for experimenting with and evaluating prompts, models, and vector databases. You can try the repo here: https://github.com/hegelai/prompttools. If you're building with LLMs and struggling with evaluation, reach out and we can help.
Building open-source developer tools for large language models at Hegel AI. Previously worked on PyTorch at Meta AI and on quantitative development at Bridgewater Associates. Studied Computer Science and Economics at UChicago. Happy to connect and chat about how your company is using AI and LLMs!
TL;DR Hegel AI is building developer tools for evaluating and continuously testing LLMs and prompts, with over a thousand downloads per week since launch. We’re a team of ex-PyTorch engineers from Meta and Google.
Is Llama 2 really better? Is GPT-4 getting worse? Developers building with language models constantly face challenges with (1) model drift, (2) new models, and (3) insufficient evaluation systems. Besides academic metrics and “eyeballing it”, there’s no existing solution for continuous LLM regression testing or experimentation.
What’s more, the complexity of LLMs and related tools like vector databases makes it hard for developers to set up honest apples-to-apples comparisons for their use case.
We want to make high-quality evaluation and testing systems available to everyone, which is why we’re building an open-source platform for language model evaluation. Developers can use our product to monitor GPT-4 regressions, audit models for safety, and measure the performance of vector databases.
Developers come to our product with some test cases and an idea of the models or vector databases they want to evaluate. We handle all of the testing infrastructure to help them (1) set up experiments across models/DBs in notebooks and (2) scale those experiments into reusable test suites.
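To give a rough feel for the "experiments across models" idea, here is a minimal sketch that runs every test prompt against every model and collects the results in one list. It does not use the PromptTools API: `run_grid`, `fake_model_a`, and `fake_model_b` are hypothetical names, and the "models" are plain Python functions standing in for real LLM calls.

```python
from itertools import product
from typing import Callable, Dict, List


# Hypothetical stand-ins for real model calls (assumptions, not a real API).
def fake_model_a(prompt: str) -> str:
    return prompt.upper()


def fake_model_b(prompt: str) -> str:
    return prompt[::-1]


def run_grid(models: Dict[str, Callable[[str], str]], prompts: List[str]) -> List[dict]:
    """Run every prompt against every model and return one row per pair."""
    rows = []
    for (name, model), prompt in product(models.items(), prompts):
        rows.append({"model": name, "prompt": prompt, "response": model(prompt)})
    return rows


models = {"model_a": fake_model_a, "model_b": fake_model_b}
prompts = ["hello", "world"]
results = run_grid(models, prompts)

for row in results:
    print(row)
```

Because each row records the model, the prompt, and the response, the same list can be inspected in a notebook or checked by assertions in a test suite.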
We support multiple strategies for evaluation: auto-eval by another model, systematic evaluations like structured output validation or semantic similarity to an expected output, or human feedback gathered through our notebook UI.
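As an illustration of the systematic evaluations mentioned above, the sketch below shows two simple checks written with the Python standard library only. It is not the PromptTools implementation: `is_valid_json_with_keys` and `similarity` are hypothetical helpers, and the similarity score uses character-level matching from `difflib` as a crude stand-in for semantic similarity, which in practice would more likely compare embeddings.

```python
import json
from difflib import SequenceMatcher


def is_valid_json_with_keys(response: str, required_keys: set) -> bool:
    """Structured output validation: the response must parse as a JSON
    object and contain every required key."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def similarity(response: str, expected: str) -> float:
    """Lexical stand-in for semantic similarity: a case-insensitive
    character-level match ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio()


# Example checks on hand-written responses (illustrative data only).
print(is_valid_json_with_keys('{"name": "Ada", "age": 36}', {"name", "age"}))
print(is_valid_json_with_keys("not json", {"name"}))
print(similarity("Paris is the capital of France", "paris is the capital of france"))
```

Each check returns a plain boolean or float, so a score threshold (a choice left to the developer) can turn either one into a pass/fail test case.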
We, Kevin Tse and Steven Krawczyk, met 9 years ago when we started our undergraduate studies at the University of Chicago. We crossed paths again years after graduation when we were both working on PyTorch, an open-source deep learning library, at Meta and Google. We worked on the core library, data loading, and the XLA compiler library. Prior to that, we worked at Amazon and Bridgewater Associates.
It’s hard to convince senior management that investing in LLMs is worth it, without doing expensive experimentation on my own.
If you are doing anything remotely related to prompting, you'd know how frustrating & inefficient it is to compare between prompts. I, for one, am using it to make my life easier in building Watto AI.