at sync. we're making video as fluid and editable as a word document.

how much time would you save if you could *record every video in a single take?*

no more re-recording yourself because you didn't like what you said, or how you said it. just shoot once, revise yourself to do exactly what you want, and post. that's all.

this is the future of video: *AI modified >> AI generated*

we're playing at the edge of science + fiction. our team is young, hungry, uniquely experienced, and advised by some of the greatest research minds + startup operators in the world. we're driven to solve impossible problems, impossibly fast.

our founders are the original team behind the open-source wav2lip — the most prolific lip-sync model to date w/ 10k+ GitHub stars.

we [1] train state-of-the-art generative models, [2] productize + host them for scale, [3] grow virally through content, [4] upsell into enterprise
ceo & cofounder at sync. labs | product engineer obsessed w/ networks of people + products. before startups, I helped incubate, launch, and scale AI-powered cybersecurity products at Microsoft impacting over 500M consumers + $1T worth of publicly traded companies – and became the youngest product leader in my org.
Co-founder & Chief Scientist at sync. labs. Ph.D. from University of Oxford with Prof. Andrew Zisserman. Authored multiple breakthrough research papers (incl. Wav2Lip) on understanding and generating humans in video.
I am the co-founder and CTO of Sync Labs. At Sync, we’re building audio-visual models to understand, modify, and synthesize humans in video. I am one of the primary authors of Wav2Lip, one of the most prolific lip-syncing models in the world, published in 2020. I did my PhD at IIIT Hyderabad on audio-visual deep learning and have been involved in several important projects in the community.
Driving sales/operations and finance/strategic roadmap at sync. 2x VC-backed entrepreneur. IIT Madras alumnus. Worked with IIIT Hyderabad to productize research work for the market. Key strength: connecting dots / identifying patterns across different fields to unlock value.
TL;DR: we’ve built a state-of-the-art lip-sync model – and we’re building towards real-time face-to-face conversations w/ AI indistinguishable from humans 🦾
try our playground here: https://app.synclabs.so/playground
how does it work?
theoretically, our models can support any language — they learn phoneme / viseme mappings (the most basic units / “tokens” of how the sounds we make map to the mouth shapes that produce them). it’s simple, but a start towards learning a foundational understanding of humans from video.
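for intuition, here's a toy sketch of that idea. this is not our actual model, just an illustration with a made-up phoneme → viseme table; in practice the mapping is learned implicitly from audio-video pairs, conditioned on the speaker's face.

```python
# toy illustration of a phoneme -> viseme lookup (hypothetical table, for intuition only).
# the real model learns this mapping implicitly from audio-video data --
# there is no hand-written table.
PHONEME_TO_VISEME = {
    "p": "lips_closed",    # p / b / m share a closed-lip viseme
    "b": "lips_closed",
    "m": "lips_closed",
    "f": "lip_to_teeth",   # f / v: lower lip touches upper teeth
    "v": "lip_to_teeth",
    "aa": "jaw_open",      # open vowels drop the jaw
    "iy": "lips_spread",   # "ee" sounds spread the lips
    "uw": "lips_rounded",  # "oo" sounds round the lips
}

def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to the viseme sequence a mouth would show."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(phonemes_to_visemes(["m", "aa", "p"]))  # ['lips_closed', 'jaw_open', 'lips_closed']
```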
why is this useful?
[1] we can dissolve language as a barrier
check out how we used it to dub the entire 2-hour Tucker Carlson interview with Putin speaking fluent English.
imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.
realtime at the edge takes us further — live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.
[2] we can move the human-computer interface beyond text-based-chat
keyboards / mice are lossy + low-bandwidth. human communication is rich and goes beyond just the words we say. what if we could compute w/ a face-to-face interaction?
maybe embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way. this thread of research is exciting.
[3] and more
powerful models small enough to run at the edge could unlock a lot:
e.g.
extreme compression for face-to-face video streaming
enhanced, spatial-aware transcription w/ lip-reading
detecting deepfakes in the wild
on-device real-time video translation
etc.
who are we?
Prady Modukuru [CEO] | Led product for a research team at Microsoft that made Defender a $350M+ product, took MSR research into production and moved it from the bottom of the market to #1 in industry evals.
Rudrabha Mukhopadhyay [CTO] | PhD CVIT @ IIIT-Hyderabad, co-authored wav2lip / 20+ major publications + 1200+ citations in the last 5 years.
Prajwal K R [CRO] | PhD, VGG @ University of Oxford, w/ Andrew Zisserman, prev. Research Scientist @ Meta, authored multiple breakthrough research papers (incl. Wav2Lip) on understanding and generating humans in video
Pavan Reddy [COO/CFO] | 2x venture-backed founder/operator, built the first smart air purifier in India, prev. monetizing sota research @ IIIT-Hyderabad, engineering @ IIT Madras
how did we meet?
Prajwal + Rudrabha worked together at IIIT-Hyderabad — and became famous by shipping the world’s first model that could sync the lips in a video to any audio in the wild in any language, no training required.
they formed a company w/ Pavan and then worked w/ the university to monetize state-of-the-art research coming out of the labs and bring it to market.
Prady met everyone online — first by hacking together a viral app around their open-source models, then collaborating on product + research for fun, and finally cofounding sync. + going mega-viral.
since then we’ve hacked irl across 4 different countries and both US coasts, and moved into a hacker house in SF together.
what’s our ask?
try out our playground and API and let us know how we can make it easier to understand and simpler to use 😄
play around here: https://app.synclabs.so/playground
lipsync video to audio in any language in one shot (rough API sketch below)
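for reference, here's a rough sketch of what a call could look like from Python. the endpoint path, field names, and auth header below are placeholders, not the real spec, so check the docs in the app for the actual interface.

```python
# hypothetical sketch of calling a hosted lipsync API with Python's `requests`.
# the base URL, endpoint path, request fields, and auth header are placeholders --
# see https://app.synclabs.so for the actual interface.
import requests

API_KEY = "YOUR_API_KEY"                 # placeholder
BASE_URL = "https://api.example.com/v1"  # placeholder base URL

def lipsync(video_url: str, audio_url: str) -> dict:
    """Submit a video + a (dubbed) audio track and get back a lip-sync job."""
    resp = requests.post(
        f"{BASE_URL}/lipsync",
        headers={"x-api-key": API_KEY},
        json={"videoUrl": video_url, "audioUrl": audio_url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. a job id + status to poll until the render is done

job = lipsync(
    "https://example.com/original_video.mp4",
    "https://example.com/dubbed_audio_en.wav",
)
print(job)
```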
Prady + Pavan have been full-time on sync since June 2023
Rudrabha contributed heavily while finishing his PhD + joined full-time in October 2023
Prajwal is finishing up his PhD and is joining full-time once he completes it in May 2024 – his supervisor is Professor Andrew Zisserman (190k+ citations / the foremost expert in the field we are playing in). his proximity helps us stay sota + learn from the bleeding edge.
we're building generative models to modify / synthesize humans in video + hosting production APIs to let anyone plug them into their own apps / platforms / services.
today we're focused on visual dubbing – we built + launched an updated lip-synchronizing model that lets anyone lip-sync a video to any audio in any language in near real-time, even for HD videos.
as part of the AI translation stack we're used as a post-processing step to sync the lips in a video to the new dubbed audio track – this lets everyone around the world experience content like it was made for them in their native language (no more bad / misaligned dubs).
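to make that concrete, here's a simplified sketch of where the lip-sync step sits in a dubbing pipeline. transcribe / translate / synthesize_speech are hypothetical placeholders for whichever ASR / MT / TTS providers you use; lipsync_fn is a lip-sync call like the one sketched earlier.

```python
# simplified sketch of an AI dubbing pipeline, with lip-sync as the final
# post-processing step. the first three functions are placeholders for whichever
# ASR / MT / TTS providers you plug in -- only the last step is the lip-sync call.
from typing import Callable

def transcribe(video_url: str) -> str:
    """Placeholder: run speech-to-text on the video's original audio."""
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:
    """Placeholder: machine-translate the transcript."""
    raise NotImplementedError

def synthesize_speech(text: str, voice: str) -> str:
    """Placeholder: generate dubbed audio (TTS) and return its URL."""
    raise NotImplementedError

def dub_video(video_url: str, target_lang: str, voice: str,
              lipsync_fn: Callable[[str, str], dict]) -> dict:
    """Dub a video into `target_lang`, then re-sync the lips to the new audio."""
    transcript = transcribe(video_url)
    translated = translate(transcript, target_lang)
    dubbed_audio_url = synthesize_speech(translated, voice)
    # post-processing step: align the speaker's mouth with the dubbed track,
    # so the result looks native instead of misaligned.
    return lipsync_fn(video_url, dubbed_audio_url)
```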
in the future we plan to build + host a suite of production-ready models to modify + generate the full human body digitally in video (e.g. facial expressions, head + hand + eye movements, etc.) that can be used for anything from seamless cross-language localization of content to generative videos.