Hi all, we’re Mehul, Ishaan and Akhilesh - co-founders of Chunkr.
TLDR: We built Chunkr to solve the one-size-doesn't-fit-all problem in document parsing for RAG/LLM applications. Get granular control over your pipeline to balance speed, quality and features - like a tool you built in-house, minus the headache.
https://www.youtube.com/watch?v=2Iy5ssyeneA
We were initially building Lumina, the AI search engine for scientific literature. The biggest undertaking was developing a document ingestion pipeline that could process ~600M pages of scientific literature and extract everything we needed to build a great RAG experience. The OG "deep research".
Document processing solutions on the market couldn't deliver the performance, observability and features needed to win our trust. So we built parsing in-house - and spent so much time fighting foundational problems that we pivoted to solve the underlying challenge itself. That's how Chunkr was born - with a singular goal to build document ingestion infra that devs can love.
We released an API and easy-to-self-host pipeline in October 2024 as a test and didn’t expect much. What we ended up with was a very viral launch (400K impressions) - and a year’s worth of learnings condensed into 2 weeks.
By Nov 2024 - we had processed documents across verticals like healthcare, finance, research, government, education and hardware for RAG/LLM use-cases. Beyond common issues like bounding boxes for citations, accurate HTML/MD extraction, and chunking - it became clear that document parsing isn’t one size fits all. Some of the cases we ran into:
"I just need fast, simple full page OCR - leave out the rest"
"We need custom VLM processing for specific pages/segments (formulas, chart-to-table conversion)"
"Give us high-res crops of specific document sections"
"Everything must run in our self-hosted VPC"
"Can we apply our own chunking strategy?"
Chunkr gives you “knobs” to balance speed, quality, and features depending on your use case - like a tool you built in-house, minus the headache. Being open source means minimal lock-in and a variety of integration options.
🧩 Working at the segment level
This is where layout analysis really shines. Break a page into bounded titles, tables, formulas, captions, etc. (we call these segments) - and the downstream processing options are vast.
Each segment can be configured differently: fast OCR for text, VLMs for complex elements like tables and formulas, optional high-res image cropping, custom prompts for charts and image descriptions.
Abstractions have been kept to a minimum while maintaining a smooth DX. It’s everything you could need for generating great RAG data on your own terms.
At a higher level, you can configure a host of other options like the layout analysis provider and chunking strategy. No matter how gnarly the document is, there’s a config that can work! Another advantage is a pipeline that stands the test of progress: as VLMs improve every week, Chunkr gets better too.
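To make the “knobs” concrete, here’s a sketch of what a per-segment plus top-level configuration could look like. The field names and values below are illustrative only, not the exact Chunkr SDK schema:

```python
# Illustrative configuration sketch -- field and option names here are
# hypothetical, not the exact Chunkr SDK schema.
config = {
    # Top-level knobs
    "layout_provider": "default",
    "chunking": {"strategy": "by_heading", "target_tokens": 512},
    # Per-segment knobs: cheap OCR for plain text, VLMs for the hard parts
    "segment_processing": {
        "Text":    {"method": "ocr"},
        "Table":   {"method": "vlm", "format": "html"},
        "Formula": {"method": "vlm", "format": "latex"},
        "Picture": {"method": "vlm", "crop": "high_res",
                    "prompt": "Describe this chart as a markdown table."},
    },
}

def method_for(segment_type: str) -> str:
    """Pick the processing method for a segment type, defaulting to fast OCR."""
    return config["segment_processing"].get(segment_type, {"method": "ocr"})["method"]
```

The point is that each segment type gets its own path through the pipeline, so swapping a VLM in for tables later is a one-line config change rather than a rewrite.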
🛠️ Production ready API / Self-host
The pipeline comes with all the last-mile engineering you need for production use-cases. We’ve built it all in Rust to offer great performance & reliability. Tasks like image conversions, page parallelization, reading order, failure handling and batching are taken care of.
Self-hosting is simple with ready-to-go Docker images + Helm charts. Processing speed is around 4 pages per second on a single RTX 4090, which is enough to handle over 10 million pages per month. Renting that hardware on Runpod costs about $249/month, so the math is pretty compelling.
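The back-of-the-envelope math, using the 4 pages/s and $249/month figures above and assuming a 30-day month:

```python
pages_per_second = 4
seconds_per_month = 60 * 60 * 24 * 30   # 30-day month
gpu_cost_per_month = 249.0              # RTX 4090 rental on Runpod, USD

pages_per_month = pages_per_second * seconds_per_month
cost_per_million_pages = gpu_cost_per_month / (pages_per_month / 1_000_000)

print(f"{pages_per_month:,} pages/month")             # 10,368,000 pages/month
print(f"${cost_per_million_pages:.2f} per 1M pages")  # $24.02 per 1M pages
```

That's roughly $0.000024 per page at full utilization.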
Try out Chunkr (https://chunkr.ai/) - and share feedback with us. We support PDFs, office file types (docx, ppt, excel), and images. Get started for free with our no-code playground, raw POST requests, or our Python package! Our dashboard gives you all the visibility needed to evaluate quality in seconds.
Give our repo a star! (https://github.com/lumina-ai-inc/chunkr)
**Email:** team@chunkr.ai