Hi all, we’re Mehul, Ishaan and Akhilesh - co-founders of Chunkr.
TLDR: We built Chunkr to solve the one-size-doesn't-fit-all problem in document parsing for RAG/LLM applications. Get granular control over your pipeline to balance speed, quality and features - like a tool you built in-house, minus the headache.
https://www.youtube.com/watch?v=2Iy5ssyeneA
We were initially building Lumina, the AI search engine for scientific literature. The biggest undertaking was developing a document ingestion pipeline that could process ~600M pages of scientific literature and extract everything we needed to build a great RAG experience. The OG "deep research".
Document processing solutions on the market couldn't deliver the performance, observability and features needed to win our trust. So we built parsing in-house - and spent so much time fighting foundational problems that we pivoted to solve the underlying challenge itself. That's how Chunkr was born - with a singular goal to build document ingestion infra that devs can love.
We released an API and easy-to-self-host pipeline in October 2024 as a test and didn’t expect much. What we ended up with was a very viral launch (400K impressions) - and a year’s worth of learnings condensed into 2 weeks.
By Nov 2024 - we had processed documents across verticals like healthcare, finance, research, government, education and hardware for RAG/LLM use-cases. Beyond common issues like bounding boxes for citations, accurate HTML/MD extraction, and chunking - it became clear that document parsing isn’t one size fits all. Some of the cases we ran into:
"I just need fast, simple full page OCR - leave out the rest"
"We need custom VLM processing for specific pages/segments (formulas, chart-to-table conversion)"
"Give us high-res crops of specific document sections"
"Everything must run in our self-hosted VPC"
"Can we apply our own chunking strategy?"
Chunkr gives you “knobs” to balance speed, quality, and features depending on your use case - like a tool you built in-house, minus the headache. Being open source means minimal lock-in and a variety of integration options.
🧩 Working at the segment level
This is where layout analysis really shines. Break a page into bounded titles, tables, formulas, captions, etc. (we call these segments) - and the downstream processing options are vast.
Each segment can be configured differently: fast OCR for text, VLMs for complex elements like tables and formulas, optional high-res image cropping, custom prompts for charts and image descriptions.
Abstractions have been kept to a minimum while maintaining a smooth DX. It’s everything you could need for generating great RAG data on your own terms.
At a higher level, you can configure a host of other options like the layout analysis provider and chunking strategy. No matter how gnarly the document is, there’s a config that can work! Another advantage is a pipeline that stands the test of progress: as VLMs improve every week, Chunkr gets better too.
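To make the “knobs” concrete, here’s a sketch of what a per-segment plus top-level configuration could look like. The field names and values below are illustrative only, not the exact Chunkr SDK schema:

```python
# Illustrative configuration sketch -- field and option names here are
# hypothetical, not the exact Chunkr SDK schema.
config = {
    # Top-level knobs
    "layout_provider": "default",
    "chunking": {"strategy": "by_heading", "target_tokens": 512},
    # Per-segment knobs: cheap OCR for plain text, VLMs for the hard parts
    "segment_processing": {
        "Text":    {"method": "ocr"},
        "Table":   {"method": "vlm", "format": "html"},
        "Formula": {"method": "vlm", "format": "latex"},
        "Picture": {"method": "vlm", "crop": "high_res",
                    "prompt": "Describe this chart as a markdown table."},
    },
}

def method_for(segment_type: str) -> str:
    """Pick the processing method for a segment type, defaulting to fast OCR."""
    return config["segment_processing"].get(segment_type, {"method": "ocr"})["method"]
```

The point is that each segment type gets its own path through the pipeline, so swapping a VLM in for tables later is a one-line config change rather than a rewrite.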
🛠️ Production ready API / Self-host
The pipeline comes with all the last-mile engineering you need for production use-cases. We’ve built it all in Rust to offer great performance & reliability. Tasks like image conversions, page parallelization, reading order, failure handling and batching are taken care of.
Self-hosting is simple with ready-to-go Docker images + Helm charts. Processing speed is around 4 pages per second on a single RTX 4090, which is enough to handle over 10 million pages per month. Renting that hardware on Runpod costs about $249/month, so the math is pretty compelling.
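The back-of-the-envelope math, using the 4 pages/s and $249/month figures above and assuming a 30-day month:

```python
pages_per_second = 4
seconds_per_month = 60 * 60 * 24 * 30   # 30-day month
gpu_cost_per_month = 249.0              # RTX 4090 rental on Runpod, USD

pages_per_month = pages_per_second * seconds_per_month
cost_per_million_pages = gpu_cost_per_month / (pages_per_month / 1_000_000)

print(f"{pages_per_month:,} pages/month")             # 10,368,000 pages/month
print(f"${cost_per_million_pages:.2f} per 1M pages")  # $24.02 per 1M pages
```

That's roughly $0.000024 per page at full utilization.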
Try out Chunkr (https://chunkr.ai/) - and share feedback with us. We support PDFs, office file types (docx, ppt, excel), and images. Get started for free with our no-code playground, raw POST requests, or our Python package! Our dashboard gives you all the visibility needed to evaluate quality in seconds.
Give our repo a star! (https://github.com/lumina-ai-inc/chunkr)
**Email:** team@chunkr.ai