Chunkr: Open Source Document Parsing You Can Own

Vision infrastructure to turn complex docs into LLM/RAG-ready data

Mehul Chadda

10 months ago

Hi all, we’re Mehul, Ishaan and Akhilesh - co-founders of Chunkr.

TLDR: We built Chunkr to solve the one-size-doesn't-fit-all problem in document parsing for RAG/LLM applications. Get granular control over your pipeline to balance speed, quality and features - like a tool you built in-house, minus the headache.

https://www.youtube.com/watch?v=2Iy5ssyeneA

We were initially building Lumina, the AI search engine for scientific literature. The biggest undertaking was developing a document ingestion pipeline that could process ~600M pages of scientific literature and extract everything we needed to build a great RAG experience. The OG "deep research".

Document processing solutions on the market couldn't deliver the performance, observability and features needed to win our trust. So we built parsing in-house - and spent so much time fighting foundational problems that we pivoted to solve the underlying challenge itself. That's how Chunkr was born - with a singular goal to build document ingestion infra that devs can love.

The Problem

We released an API and an easy-to-self-host pipeline in October 2024 as a test and didn’t expect much. What we ended up with was a very viral launch (400K impressions) - and a year's worth of learnings condensed into 2 weeks.

By Nov 2024, we had processed documents across verticals like healthcare, finance, research, government, education and hardware for RAG/LLM use-cases. Beyond common issues like bounding boxes for citations, accurate HTML/MD extraction, and chunking - it became clear that document parsing isn’t one-size-fits-all. Some of the requests we ran into:

"I just need fast, simple full page OCR - leave out the rest"

"We need custom VLM processing for specific pages/segments (formulas, chart-to-table conversion)"

"Give us high-res crops of specific document sections"

"Everything must run in our self-hosted VPC"

"Can we apply our own chunking strategy?"

The Solution

Chunkr gives you “knobs” to balance speed, quality, and features depending on your use case - like a tool you built in-house, minus the headache. Being open source means minimal lock-in, and a variety of integration options.

🧩 Working at the segment level

This is where layout analysis really shines. Break a page into bounded titles, tables, formulas, captions etc (we call these segments) - and the downstream processing options are vast.

Each segment can be configured differently: fast OCR for text, VLMs for complex elements like tables and formulas, optional high-res image cropping, custom prompts for charts and image descriptions.
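
To make the idea concrete, here is a minimal sketch of what per-segment configuration could look like. Every name in it (`SegmentConfig`, `PIPELINE`, `config_for`, the strategy strings and segment type labels) is hypothetical and chosen for illustration - this is not Chunkr's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SegmentConfig:
    """Hypothetical processing options for one segment type."""
    strategy: str                 # "ocr" (fast) or "vlm" (higher quality); illustrative values
    crop_image: bool = False      # whether to keep a high-res crop of the segment
    prompt: Optional[str] = None  # optional custom prompt for VLM processing


# Fallback for segment types without an explicit entry: fast OCR, no crop.
DEFAULT = SegmentConfig(strategy="ocr")

# Illustrative mapping from segment type to processing options.
PIPELINE = {
    "Text": SegmentConfig("ocr"),
    "Title": SegmentConfig("ocr"),
    "Table": SegmentConfig("vlm"),
    "Formula": SegmentConfig("vlm"),
    "Picture": SegmentConfig(
        "vlm",
        crop_image=True,
        prompt="Describe this chart and convert its data to a table.",
    ),
}


def config_for(segment_type: str) -> SegmentConfig:
    """Return the options for a segment type, falling back to DEFAULT."""
    return PIPELINE.get(segment_type, DEFAULT)
```

With a mapping like this, plain text stays on the fast path while tables, formulas and charts get the slower, higher-quality treatment.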

Abstractions have been kept to a minimum while maintaining a smooth DX. It’s everything you could need for generating great RAG data on your own terms.

At a higher level - you can configure a host of other options like layout analysis provider and chunking strategy. No matter how gnarly the document is, there’s a config that can work! Another great advantage is a pipeline that can stand the test of progress. As VLMs improve every week - Chunkr gets better as well.
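
As an illustration of what a custom chunking strategy can look like, the sketch below greedily packs segments into chunks in reading order and starts a new chunk at every title. The segment shape (dicts with `type` and `text` keys), the `chunk_segments` function and the `target_words` parameter are assumptions made for this example, not Chunkr's built-in strategy.

```python
from typing import Dict, List


def chunk_segments(segments: List[Dict], target_words: int = 512) -> List[List[Dict]]:
    """Pack segments (in reading order) into chunks of roughly target_words words.

    A new chunk starts whenever a Title segment appears or adding the next
    segment would exceed the word budget. A single oversized segment is kept
    whole in its own chunk rather than being split.
    """
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    count = 0
    for seg in segments:
        words = len(seg["text"].split())
        starts_section = seg["type"] == "Title"
        if current and (starts_section or count + words > target_words):
            chunks.append(current)
            current, count = [], 0
        current.append(seg)
        count += words
    if current:
        chunks.append(current)
    return chunks
```

Because the input is already split into typed segments, swapping in a different rule (per page, per table, fixed token windows) only changes the condition that closes a chunk.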

🛠️ Production ready API / Self-host

The pipeline comes with all the last-mile engineering you need for production use-cases. We’ve built it all in Rust to offer great performance & reliability. Tasks like image conversions, page parallelization, reading order, failure handling and batching are taken care of.
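
For a feel of what page parallelization and failure handling involve, here is a small Python sketch of the general pattern. It is not Chunkr's Rust implementation; `process_pages`, the `worker` callback and the retry count are hypothetical names and defaults picked for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence, Tuple


def process_pages(
    pages: Sequence[Any],
    worker: Callable[[Any], Any],
    max_workers: int = 4,
    retries: int = 2,
) -> List[Tuple[str, Any]]:
    """Run worker on every page in parallel, retrying failed pages.

    Returns one ("ok", result) or ("failed", error message) tuple per page,
    in the same order as the input, so one bad page never sinks the document.
    """

    def run(page: Any) -> Tuple[str, Any]:
        last_error = None
        for _ in range(retries + 1):
            try:
                return ("ok", worker(page))
            except Exception as exc:  # record the error and try again
                last_error = exc
        return ("failed", str(last_error))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, which preserves page order.
        return list(pool.map(run, pages))
```

Usage would look like `process_pages(page_images, ocr_page)`, where `ocr_page` is whatever per-page function you supply.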

Self-hosting is simple with ready-to-go Docker images and Helm charts. Processing speed is around 4 pages per second on a single RTX 4090, which is enough to handle over 10 million pages per month (4 × 86,400 × 30 ≈ 10.4M). Renting that hardware on Runpod costs about $249/month, so the math is pretty compelling.
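
A quick back-of-the-envelope check of that math, assuming exactly 4 pages per second, a 30-day month and uninterrupted processing (real utilization will be lower):

```python
PAGES_PER_SECOND = 4            # throughput quoted for a single RTX 4090
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
DAYS_PER_MONTH = 30             # assumption: 30-day month, 100% utilization
MONTHLY_COST_USD = 249          # quoted Runpod rental price

pages_per_month = PAGES_PER_SECOND * SECONDS_PER_DAY * DAYS_PER_MONTH
cost_per_1000_pages = MONTHLY_COST_USD / pages_per_month * 1000

print(f"{pages_per_month:,} pages/month")             # 10,368,000 pages/month
print(f"${cost_per_1000_pages:.3f} per 1,000 pages")  # $0.024 per 1,000 pages
```

That is roughly 10.4 million pages per month, or about two and a half cents per thousand pages of compute under these assumptions.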

Asks

Try out Chunkr (https://chunkr.ai/) - and share feedback with us. We support PDFs, office file types (docx, ppt, excel), and images. Get started for free with our no-code playground, raw POST requests, or our Python package! Our dashboard gives you all the visibility needed to evaluate quality in seconds.

Give our repo a star! (https://github.com/lumina-ai-inc/chunkr)

Email: team@chunkr.ai
