TL;DR: we’ve built a state-of-the-art lip-sync model – and we’re building towards real-time face-to-face conversations w/ AI indistinguishable from humans 🦾
try our playground here: https://app.synclabs.so/playground
how does it work?
theoretically, our models can support any language — they learn phoneme / viseme mappings: how the most basic units ("tokens") of the sounds we make map to the shapes our mouths make to produce them. it's simple, but a start towards learning a foundational understanding of humans from video.
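to make the idea concrete, here's a toy sketch of the phoneme → viseme idea. the groupings follow standard phonetics (e.g. /p/, /b/, /m/ all close the lips), but the class names and the lookup-table framing are purely illustrative — our models learn this mapping end-to-end from video, not from a hand-written table.

```python
# toy illustration: many phonemes (sounds) collapse onto fewer visemes (mouth shapes).
# the groupings are standard phonetics; the labels are illustrative, not our model's outputs.
PHONEME_TO_VISEME = {
    # bilabial closure: lips pressed together
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    # labiodental: lower lip against upper teeth
    "f": "labiodental", "v": "labiodental",
    # rounded lips
    "ow": "rounded", "uw": "rounded", "w": "rounded",
    # open-jaw vowels
    "aa": "open", "ae": "open", "ah": "open",
}

def visemes_for(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to the mouth shapes that produce it."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# "mama" -> m aa m aa -> the same two mouth shapes repeated
print(visemes_for(["m", "aa", "m", "aa"]))  # ['bilabial', 'open', 'bilabial', 'open']
```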
why is this useful?
[1] we can dissolve language as a barrier
check out how we used it to dub the entire 2-hour Tucker Carlson interview so that Putin speaks fluent English.
imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.
realtime at the edge takes us further — live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.
[2] we can move the human-computer interface beyond text-based-chat
keyboards / mice are lossy + low bandwidth. human communication is rich and goes beyond just the words we say. what if we could compute w/ a face-to-face interaction?
maybe embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way. this thread of research is exciting.
[3] and more
powerful models small enough to run at the edge could unlock a lot:
e.g.
extreme compression for face-to-face video streaming
enhanced, spatial-aware transcription w/ lip-reading
detecting deepfakes in the wild
on-device real-time video translation
etc.
who are we?
Prady Modukuru [CEO] | Led product for a research team at Microsoft that made Defender a $350M+ product; took MSR research into production, moving it from the bottom of the market to #1 in industry evals.
Rudrabha Mukhopadhyay [CTO] | PhD, CVIT @ IIIT-Hyderabad; co-authored Wav2Lip + 20+ major publications with 1200+ citations in the last 5 years.
Prajwal K R [CRO] | PhD, VGG @ University of Oxford w/ Andrew Zisserman; prev. Research Scientist @ Meta; authored multiple breakthrough research papers (incl. Wav2Lip) on understanding and generating humans in video.
Pavan Reddy [COO/CFO] | 2x venture-backed founder/operator; built the first smart air purifier in India; prev. monetized state-of-the-art research @ IIIT-Hyderabad; engineering @ IIT Madras.
how did we meet?
Prajwal + Rudrabha worked together at IIIT-Hyderabad — and became famous by shipping the world's first model that could sync the lips in a video to any audio in the wild, in any language, no training required.
they formed a company w/ Pavan and then worked w/ the university to monetize state-of-the-art research coming out of the labs and bring it to market.
Prady met everyone online — first hacking together a viral app around their open-source models, then collaborating on product + research for fun, and eventually cofounding sync. + going mega-viral.
since then we've hacked irl across 4 different countries and both US coasts, and moved into a hacker house in SF together.
what’s our ask?
try out our playground and API and let us know how we can make it easier to understand and simpler to use 😄
play around here: https://app.synclabs.so/playground
lip-sync video to audio in any language, in one shot
Prady + Pavan have been full-time on sync since June 2023
Rudrabha contributed heavily while finishing his PhD + joined full-time in October 2023
Prajwal is finishing up his PhD and is joining full-time once he completes it in May 2024 – his supervisor is Professor Andrew Zisserman (190k+ citations, a foremost expert in the field we're playing in). his proximity helps us stay sota + learn from the bleeding edge.
we're building generative models to modify / synthesize humans in video + hosting production APIs to let anyone plug them into their own apps / platforms / services.
today we're focused on visual dubbing – we built + launched an updated lip-sync model that lets anyone lip-sync a video to any audio, in any language, in near real-time for HD videos.
as part of the AI translation stack, we're used as a post-processing step to sync the lips in a video to the new dubbed audio track – this lets everyone around the world experience content as if it was made for them in their native language (no more bad / misaligned dubs).
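for a sense of how this slots into a dubbing pipeline, here's a rough sketch of calling a lip-sync API as that final post-processing step. the endpoint path, payload fields, and polling flow below are illustrative assumptions, not our documented API — check the playground + API docs for the real interface.

```python
# illustrative sketch only: endpoint, payload fields, and response shape are assumptions,
# not the documented sync. API — see https://app.synclabs.so/playground for the real thing.
import time
import requests

API_KEY = "YOUR_API_KEY"                      # hypothetical auth header value
BASE_URL = "https://api.example-lipsync.dev"  # placeholder, not a real endpoint

def lipsync(video_url: str, dubbed_audio_url: str) -> str:
    """Submit a video + dubbed audio track, poll until the lip-synced video is ready."""
    headers = {"x-api-key": API_KEY}
    job = requests.post(
        f"{BASE_URL}/v1/lipsync",
        json={"video_url": video_url, "audio_url": dubbed_audio_url},
        headers=headers,
        timeout=30,
    ).json()

    # poll the job until the model finishes rendering (assumed job/status shape)
    while True:
        status = requests.get(
            f"{BASE_URL}/v1/lipsync/{job['id']}", headers=headers, timeout=30
        ).json()
        if status["status"] == "completed":
            return status["output_url"]  # URL of the re-synced video
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "lipsync job failed"))
        time.sleep(5)

# usage: last step of a translation stack — transcribe, translate, TTS, then re-sync the lips
# synced = lipsync("https://cdn.example.com/talk.mp4", "https://cdn.example.com/talk_es.mp3")
```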
in the future we plan to build + host a suite of production-ready models to modify + generate the full human body digitally in video (e.g. facial expressions, head + hand + eye movements, etc.) that can be used for anything from seamless localization of content (cross-language) to generative videos.