Mundo AI

High Quality Multilingual Training Data for AI Models

Active

AI models are terrible in non-English languages because it's nearly impossible to find training data in other languages. So, we're building the world's largest and highest-quality multilingual data library.

Active Founders

Jason Liao

CEO and Founder

Hi! I'm Jason. Last year I worked on ML research abroad, where I discovered how challenging it is to build good multilingual AI models. Before that, I was the youngest quant researcher at a $60B hedge fund in Canada. jason@mundoai.world

Naijide Anwaer

Founder

Co-founder at Mundo AI building the largest and highest quality multilingual datasets. Product builder. Lover of languages and culture. Ex-Platform PM at Binance.US.

Garreth Lee

Founder

Co-Founder at Mundo AI. I previously worked on pretraining data at Cohere and tokenization at Hugging Face. I'm interested in all things related to ML/AI and love playing music.

Kenneth Wu

Founder

doing fun stuff

Company Launches

Mundo AI - High Quality Multilingual Training Data for AI Models

TL;DR

AI models are great at English but struggle with almost every other language. So, we are building the world’s largest and highest-quality multilingual data library to help AI labs build better non-English models.

The Story

When Jason was working on AI research abroad, he found that it was incredibly difficult to find training data in non-English languages. Because of this, his peers were all working on English models rather than ones in their native language.

The Problem

After speaking with researchers and entrepreneurs around the world, it became clear to us that AI usability lags dramatically in non-English languages, even for major languages like Hindi and Arabic. The cause is a severe shortage of high-quality training data in those languages. That leaves the 75% of the world that does not speak English out of the AI revolution.

Data has been a major bottleneck for researchers and AI labs building multilingual AI models, and the demand for better and larger datasets is only increasing.

Current workarounds such as synthetic data and machine translation simply don’t achieve the desired results, and open-source efforts fail to produce datasets in the quantity and quality required.

How we are solving this

We work directly with native speakers to create entirely novel, high-quality datasets. We do this by setting up end-to-end operations in the countries where a language's native speakers live, and by using our proprietary software platform to streamline data collection, generation, annotation, and quality assurance.

Demo Video

https://www.youtube.com/watch?v=zZiilPrhDJs

The Team

Jason Liao helped build a record-breaking fraud detection AI model at Tsinghua University. Before that, he led a quant research team at a $60B quant hedge fund.

Kenneth Wu was a quant at Canada’s largest quant fund. He previously worked as a software engineer at Amazon Web Services and as an analyst at the Ontario Teachers’ Pension Plan.

Naijide Anwaer was the youngest Platform PM at Binance US. He speaks 4 languages.

Garreth Lee was an ML engineer and the first Indonesian at Hugging Face, where he helped build the world’s best open pre-training dataset. Previously a member of technical staff at Cohere.

Also, a shoutout to our founding PM, Ahnaf Muqset Haque.

Our Ask

Do you know any researchers or data partnership managers at any AI labs? We’d love to get in touch! We’re trying to learn as much as we can about the data bottlenecks that are preventing researchers from making progress.

You can reach us at contact@mundoai.world
