lipsync anyone in any video, no training required, available now via API.
https://www.youtube.com/watch?v=j5iJ2k05ltc
We built lipsync-2, the first in a new generation of zero-shot lipsyncing models. It seamlessly edits any person's lip movements in a video to match any audio, without being trained or fine-tuned on that person.
Zero-shot lipsync models are versatile because they can edit arbitrary people and voices without training or fine-tuning on each speaker. But traditionally they lose the traits that make a person unique: their speaking style, skin texture, teeth, and so on.
With lipsync-2, we introduce a new capability in zero-shot lipsync: style preservation. We learn a representation of how a person speaks directly from the input video: a spatiotemporal transformer encodes the different mouth shapes it observes into a style representation, and a generative transformer synthesizes new mouth movements by conditioning on the target speech and that learned style representation.
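To make this two-stage design concrete, here's a minimal PyTorch sketch of a style encoder feeding a speech-conditioned generator. It illustrates the general pattern only; the module names, dimensions, and interfaces are assumptions made for this example, not lipsync-2's actual architecture.

```python
# Minimal sketch of the style-preservation idea: encode the speaker's mouth
# motion into a style vector, then generate new mouth motion conditioned on
# target speech plus that style. Everything here is illustrative.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Encodes mouth-region frames from the input video into a style embedding."""

    def __init__(self, frame_dim=512, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(frame_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mouth_frames):          # (batch, time, frame_dim)
        x = self.proj(mouth_frames)
        x = self.encoder(x)                   # temporal self-attention over frames
        return x.mean(dim=1)                  # pooled per-speaker style vector


class LipGenerator(nn.Module):
    """Generates new mouth motion conditioned on target speech and the style vector."""

    def __init__(self, audio_dim=80, d_model=256, out_dim=512, n_layers=6, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.style_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, out_dim)

    def forward(self, audio_feats, style):    # audio_feats: (batch, time, audio_dim)
        x = self.audio_proj(audio_feats)
        x = x + self.style_proj(style).unsqueeze(1)   # condition every timestep on style
        return self.head(self.backbone(x))            # predicted mouth-motion features


if __name__ == "__main__":
    frames = torch.randn(1, 120, 512)   # a few seconds of mouth crops as feature vectors
    audio = torch.randn(1, 120, 80)     # target speech as mel-spectrogram frames
    style = StyleEncoder()(frames)
    motion = LipGenerator()(audio, style)
    print(motion.shape)                 # torch.Size([1, 120, 512])
```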
We built a simple API that lets you build workflows around our core lipsyncing models. You submit a video and an audio file (or a script and a voice ID to generate the audio from), and get back a response with the final output.
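As a rough illustration of that submit-and-receive flow, here's a hedged Python sketch. The base URL, endpoint paths, field names, and response shape below are placeholders invented for this example rather than the documented sync.so API; check the official API reference for the real interface.

```python
# Hypothetical submit-and-poll workflow for a lipsync job. Endpoint names and
# response fields are assumptions for illustration only.
import time
import requests

API_BASE = "https://api.example.com/v1"     # placeholder, not the real base URL
HEADERS = {"x-api-key": "YOUR_API_KEY"}

# Submit a job: a source video plus the audio it should be synced to.
job = requests.post(
    f"{API_BASE}/generate",
    headers=HEADERS,
    json={
        "model": "lipsync-2",
        "video_url": "https://example.com/input.mp4",
        "audio_url": "https://example.com/target_speech.wav",
    },
).json()

# Poll until processing finishes, then grab the final output.
while True:
    status = requests.get(f"{API_BASE}/generate/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

print(status.get("output_url"))
```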
We see thousands of developers and businesses integrating our APIs to build generative video workflows into their products and services.
Notice how, even across different languages, we preserve Nicolas Cage's speaking style. lipsync-2 is the first zero-shot lipsyncing model to achieve this.
We can even handle long videos with multiple speakers — we built a state-of-the-art active speaker detection pipeline that associates a unique voice with a unique face, and only applies lipsync when we detect that person is actively speaking.
https://www.youtube.com/watch?v=ZaXbiKdoBz8
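Conceptually, the multi-speaker case reduces to a gating step: diarize the audio, associate each voice with a face track, and edit a face only during the segments where that person is speaking. The sketch below illustrates that gating logic with hypothetical types and helpers; it is not our actual pipeline.

```python
# Simplified sketch of active-speaker gating for multi-speaker lipsync.
# Segment fields and lipsync_fn are hypothetical stand-ins for real face
# tracking, speaker diarization, and lipsync components.
from dataclasses import dataclass


@dataclass
class Segment:
    start: float          # seconds
    end: float
    face_id: str          # face track judged to be the active speaker
    voice_id: str         # diarized voice associated with that face


def apply_multispeaker_lipsync(segments, target_voice_id, lipsync_fn):
    """Apply lipsync only where the target speaker is actively speaking."""
    edits = []
    for seg in segments:
        if seg.voice_id == target_voice_id:
            # Only this face, in this time range, gets new mouth movements.
            edits.append(lipsync_fn(seg.face_id, seg.start, seg.end))
    return edits


# Example: edit only the segments where "speaker_a" talks.
# edits = apply_multispeaker_lipsync(segments, "speaker_a", my_lipsync_fn)
```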
It also works on animated characters, from Pixar-level animation to AI-generated characters.
https://www.youtube.com/watch?v=F_6lGFl6bcA
But translation is only the beginning. With the power to edit dialogue in any video in post-production, we're on the cusp of reimagining how we create, edit, and consume video forever.
Imagine a world where you only ever have to hit record once. lipsync-2 is the only model that lets you edit dialogue while preserving the original speaker's style, without needing to train or fine-tune beforehand.
In an age where we can generate any video by typing a few lines of text, we don’t have to limit ourselves to what we can capture with a camera.
https://youtube.com/shorts/KnzWtu3niKQ
For any YC company, we're giving away our Scale Plan free for 4 months, plus $1000 to spend on usage.
With the Scale Plan you get up to 15 concurrent jobs and can process videos up to 30 minutes long at a time. Used at full capacity, that works out to roughly 90 minutes of generated video per hour, every hour.
Launch an AI ad maker, a video translation tool, or any other content generation workflow you want, and serve viral-scale load with speed, reliability, and best-in-class quality.
Email us at yc@sync.so and we’ll get you set up.
At sync, AI lipsync is just the beginning.
We live in an extraordinary age.
A high schooler can craft a masterpiece with an iPhone. A studio can produce a movie at a tenth of the cost, 10x faster. Every video can be distributed worldwide in any language, instantly. Video is becoming as malleable as text.
But we have two fundamental problems to tackle before this is a reality:
[1] Large video models are great at generating entirely new scenes and worlds, but struggle with precise control and fine-grained edits. The ability to make subtle, intentional adjustments, the kind that separates good content from great content, doesn't exist yet.
[2] If video generation is world modeling, each human is a world unto themselves. We each have idiosyncrasies that make us unique — building primitives to capture, express, and modify them with high precision is the key to breaking through the uncanny valley.
We’re excited about lipsync-2, and for what’s coming up next. Reach out to founders@sync.so if you have any questions or are curious about our roadmap.