Prompt Engineering: Super powers for prompt engineers
- Compare prompts, models, and even LLM providers side-by-side
- Curate a library of test cases to evaluate prompts against
- Quantitatively evaluate the output of your prompts using industry-standard ML metrics (BLEU, METEOR, Levenshtein distance, semantic similarity)

Deployments: Confidently iterate on models in production
- Simple API interface that proxies requests to any model provider
- Back-testing & version control
- Observability of all your inputs and outputs; UI & API to submit explicit or implicit user feedback

Documents: Use your proprietary data in LLM applications
- Robust API endpoint to submit documents (“corpus of text”) for querying against
- Configurable chunking and semantic search strategies
- Ability to query against corpus of text at run time

Continuous Improvement: Continuously fine-tune to improve quality and lower cost
- Passively accumulate training data to fine-tune your own proprietary models
- Swap model providers or parameters under the hood – no code changes required

We’re a team of MIT engineers and McKinsey consultants who’ve been building apps on GPT-3 for 3 years since it first came out. We’ve built similar tools in MLOps for 4 years and have closely experienced the pain we’re solving for our customers today. We believe that AI is the greatest technological leap since the internet. Our mission is to help companies adopt AI by taking their prototypes to production. If you have an AI use-case in mind, please reach out!
Hello hello! We’re Noa Flaherty, Akash Sharma, and Sidd Seethepalli from Vellum.
TL;DR: "Workflows" is a new product in Vellum's LLM dev platform that helps you quickly prototype, deploy, and manage complex chains of LLM calls and the business logic that ties them together. We solve the "whack-a-mole" problem encountered by companies that use popular open source frameworks to build AI applications but are scared to make changes for fear of introducing regressions in production.
The Problem 😰: Many AI use-cases require chains of prompts, but experimenting with and productionizing complex chains is hard.
We have helped dozens of customers take their AI prototypes to production by delivering tools for efficient prompt engineering, tightly integrated semantic search, prompt versioning, and performance monitoring. However, as the AI industry matures, we’ve found that more and more real-world use-cases require multi-step flows across actions like semantic search, multiple prompts/LLM calls, and bespoke business logic.
For example, if building a customer-support chatbot, you may want to:
Unfortunately, existing tools and frameworks don’t make it easy to:
The Solution 🤤: A fully managed platform for experimenting with, deploying, and managing AI workflows that power your app
Vellum Workflows provides a low-code UI for experimenting with and deploying LLM workflows to power features in your app.
You can construct a workflow from different “Nodes,” define the workflow’s “Input Variables,” set their values across different “Scenarios,” and run it with a single click to see the output at each step along the way.
Shown here is one of the workflows used in production by a customer of ours, Miri Health, for powering their health & wellness AI chatbot.
You get immediate feedback on whether your chain/prompts perform the way you expect without having to edit code, inspect console logs, or hop between browser tabs. You can validate that your workflow does what it should across a variety of scenarios / test cases.
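To make the Nodes / Input Variables / Scenarios model concrete, here is a minimal Python sketch of the idea. It is not Vellum's SDK or internal implementation: the `Node` type, `run_workflow`, the toy keyword lookup standing in for semantic search, and the placeholder "LLM" step are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; this is not Vellum's SDK.
State = Dict[str, str]


@dataclass
class Node:
    name: str
    run: Callable[[State], str]  # reads the shared state, returns this node's output


def run_workflow(nodes: List[Node], inputs: State) -> List[Tuple[str, str]]:
    """Run nodes in order and record each node's output as a trace."""
    state = dict(inputs)  # copy so scenarios never share state
    trace = []
    for node in nodes:
        output = node.run(state)
        state[node.name] = output  # later nodes can read earlier outputs
        trace.append((node.name, output))
    return trace


# Toy knowledge base; a real workflow would query a semantic search index.
DOCS = {
    "refund": "Refunds are issued within 5 days.",
    "shipping": "Orders ship in 2 days.",
}


def search(state: State) -> str:
    # Keyword lookup standing in for semantic search.
    for key, text in DOCS.items():
        if key in state["question"].lower():
            return text
    return ""


def draft_answer(state: State) -> str:
    # Placeholder for a prompt / LLM call.
    return f"Based on our docs: {state['search']}" if state["search"] else ""


def escalate_or_reply(state: State) -> str:
    # Business logic: hand off to a human when there is no drafted answer.
    return state["draft"] or "ESCALATE_TO_HUMAN"


workflow = [
    Node("search", search),
    Node("draft", draft_answer),
    Node("final", escalate_or_reply),
]

# Each scenario assigns values to the workflow's input variables.
scenarios = {
    "refund question": {"question": "How do I get a refund?"},
    "unknown topic": {"question": "Can I pay with crypto?"},
}

for label, inputs in scenarios.items():
    print(label)
    for name, output in run_workflow(workflow, inputs):
        print(f"  {name}: {output!r}")
```

Running every scenario through the same node list and printing the per-node trace is the hand-rolled equivalent of seeing the output at each step; the same trace can be asserted against when a prompt or node changes, to catch regressions.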
Once you’re happy, you can deploy the Workflow directly in Vellum and invoke it through an API via Vellum’s Python/Node SDKs. Events for nodes that you subscribe to are streamed back using Server-Sent Events.
Invoke a workflow via a simple API. Use our officially supported Python and Node SDKs, or roll your own.
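If you roll your own client, the main piece to handle is the Server-Sent Events stream. The sketch below parses the standard `text/event-stream` line format in plain Python. The event name `node_result` and the JSON payload fields are assumptions made up for this example; they are not taken from Vellum's API, so check the official documentation for the real event shape.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of text/event-stream lines."""
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)
        # Other fields (id, retry) are ignored in this sketch.


# Hypothetical stream; the event name and payload shape are illustrative only.
sample = [
    ": keep-alive\n",
    "event: node_result\n",
    'data: {"node": "search", "state": "done"}\n',
    "\n",
    "event: node_result\n",
    'data: {"node": "final", "state": "done", "output": "Hello"}\n',
    "\n",
]

events = list(parse_sse(sample))
for e in events:
    payload = json.loads(e["data"])
    print(e["event"], payload["node"], payload["state"])
```

In a real client the lines would come from a streaming HTTP response body rather than a list, and the parser would stay the same because it only needs an iterable of lines.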
By deploying your Workflow through Vellum, you can:
Monitor how your workflows are performing in production, with the ability to inspect the inputs/outputs of the workflow as a whole, as well as each step in the chain.
Looking Ahead
This is just the beginning! Our beta customers are already asking for things like:
Why Vellum?
Our focus to date has been to provide robust building blocks for creating production-ready AI applications. We’ve seen our customers assemble Vellum-powered Prompts and Semantic Search to create incredible products, version control and debug them using Vellum Deployments, and validate them when making changes using Vellum Test Suites.
Now that we have the building blocks, we’re well-positioned to help you assemble them. Workflows has been in closed beta for a few weeks now and we already have customers using it to power their entire AI backend in production.
Vellum Workflows give us the opportunity to really tailor different parts of our product to the end users’ needs without having to invest in tons of custom development, which has dramatically decreased our time to market. As a technical, but non-engineering stakeholder, I’m able to truly participate in the development of the product experience and help deliver personalized AI-powered experiences to customers faster than I could have ever imagined.
Adam Daigian, Product Lead at Miri Health
We firmly believe that the best AI-powered products out there will be the result of close collaboration between technical and non-technical team members. We’ve repeatedly seen engineers set up the initial scaffolding, integrations, and guard-rails, while non-technical folks run experiments and tweak prompts/chains. No other platform facilitates this collaboration as well as Vellum.
Ask: How you can help
We worked together at Dover (YC S19) for 2+ years, where we built production use-cases of LLMs. Noa and Sidd are MIT engineers who have worked on DataRobot’s MLOps team and Quora’s ML Platform team, respectively. Akash spent 5 years at McKinsey’s Silicon Valley Office. While working with GPT-3 and Cohere to build user-facing LLM apps, we found ourselves building complex internal tooling to compare models, fine-tune them, measure performance, and improve quality over time. This took time away from building our user-facing product. We’ve worked on MLOps for traditional ML and wished we had the same tooling when we later worked with LLMs, so we’re building it.