
Mundo AI

High Quality Multilingual Training Data for AI Models

AI models are terrible in non-English languages because it's nearly impossible to find training data in other languages. So, we're building the world's largest and highest-quality multilingual data library.
Founded: 2024
Team Size: 4
Status: Active
Location: San Francisco
Group Partner: Aaron Epstein
Active Founders

Jason Liao, CEO and Founder

Hi! I'm Jason. Last year I worked on ML research abroad, where I discovered how challenging it is to build good multilingual AI models. Before that, I was the youngest quant researcher at a $60B hedge fund in Canada. Email: jason@mundoai.world

Naijide Anwaer, Founder

Co-founder at Mundo AI building the largest and highest quality multilingual datasets. Product builder. Lover of languages and culture. Ex-Platform PM at Binance.US.

Garreth Lee, Founder

Co-Founder at Mundo AI. I previously worked on pretraining data at Cohere and tokenization at Hugging Face. I'm interested in all things related to ML/AI and love playing music.

Kenneth Wu, Founder

doing fun stuff
Company Launches
Mundo AI - High Quality Multilingual Training Data for AI Models

TL;DR

AI models are great at English but struggle with almost every other language. So, we are building the world’s largest and highest-quality multilingual data library to help AI labs build better non-English models.

The Story

When Jason was working on AI research abroad, he found that it was incredibly difficult to find training data in non-English languages. Because of this, his peers were all working on English models rather than ones in their native language.

The Problem

After speaking with researchers and entrepreneurs around the world, it became clear to us that AI usability lags dramatically in non-English languages, even major ones like Hindi and Arabic. The cause is a severe shortage of high-quality training data in those languages. That leaves the 75% of the world that does not speak English out of the AI revolution.

Data has been a major bottleneck for researchers and AI labs building multilingual AI models, and the demand for better and larger datasets is only increasing.

Current workarounds such as synthetic data and machine translation simply don’t achieve the desired results, and open-source efforts fail to produce datasets in the quantity and quality required.

How we are solving this

We work directly with native speakers to create completely novel, high-quality datasets. We do this by setting up end-to-end operations in the countries where a language's native speakers live, and by using our proprietary software platform to streamline data collection, generation, annotation, and quality assurance.

Demo Video

https://www.youtube.com/watch?v=zZiilPrhDJs

The Team

Jason Liao helped build a record-breaking fraud detection AI model at Tsinghua University. Before that, he led a quant research team at a $60B quant hedge fund.

Kenneth Wu was a quant at Canada’s largest quant fund. He previously worked as a software engineer at Amazon Web Services and as an analyst at the Ontario Teachers’ Pension Plan.

Naijide Anwaer was the youngest Platform PM at Binance.US. He speaks four languages.

Garreth Lee was an ML engineer and the first Indonesian at Hugging Face, where he helped build the world’s best open pre-training dataset. Previously a member of technical staff at Cohere.

Also, a shoutout to our founding PM, Ahnaf Muqset Haque.

Our Ask

Do you know any researchers or data partnership managers at any AI labs? We’d love to get in touch! We’re trying to learn as much as we can about the data bottlenecks that are preventing researchers from making progress.

You can reach us at contact@mundoai.world