Meta's game-changer: how Movie Gen outshines OpenAI's Sora in video generation
Next-level video generation with Llama and flow matching
Meta has made a significant move in the AI space with the release of its new video generation model, Movie Gen, which claims superiority over OpenAI's Sora. At a time when Sora is reportedly struggling with internal quality issues, Meta seized the opportunity to unveil its advancements, supported by an extensive 95-page technical report—a bold step that signals Meta's intent to disrupt the video generation landscape.
Unlike Sora, which relies on the established DiT (Diffusion Transformer) architecture, Meta's Movie Gen takes a different approach. The new model, with a massive 30 billion parameters, abandons DiT in favor of Meta's own Llama architecture and trains it with Flow Matching, a technique that replaces the stochastic diffusion process with a learned, direct path from random noise to video output.
According to Andrew Brown, a research scientist on Meta's video generation team, the project demonstrated the sheer power of data, computing resources, and model parameters—elements that, when combined with Flow Matching, yielded Meta's most powerful video model to date.
What is Movie Gen?
Meta's Movie Gen isn't just a single model—it's a suite of media foundation models designed to generate high-quality, personalized media content. The series includes:
- Movie Gen Video: A 30-billion-parameter model for generating high-definition (1080p) videos up to 16 seconds long.
- Movie Gen Audio: A 13-billion-parameter model that generates synchronized audio up to 45 seconds in length.
- Personalized Movie Gen Video: A post-trained version that allows for personalized video content by incorporating user images.
- Movie Gen Edit: A video editing tool that allows for precise modifications, such as adding or removing elements through text prompts.
Together, these models can create 16-second personalized videos (at 16 frames per second) with matching 48kHz audio. Movie Gen empowers users to generate HD videos based on text prompts, create personalized content featuring themselves, and edit video elements with unprecedented precision—addressing a major pain point in current video generation products.
The battle against DiT
The biggest revelation from Meta's release is the complete shift away from DiT, the diffusion-based architecture that has become the standard for text-to-video models. Traditionally, diffusion models generate content by iteratively removing noise from a latent space, guided by textual input. DiT improved on this by replacing the conventional U-Net backbone with a Transformer, allowing the model to capture context across the whole sequence.
Flow Matching, by contrast, dispenses with the stochastic denoising objective: the model learns the velocity field of an ordinary differential equation (ODE) that transports random noise to the target video, and a sample is generated by integrating that ODE. Meta's choice to pair Flow Matching with Llama results in a powerful generative model that delivers high-quality video without relying on the stepwise diffusion process.
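The core idea is compact enough to sketch. The toy training step below illustrates the standard flow matching objective with straight-line (linear interpolation) paths; it is only an illustration, not Movie Gen's actual implementation, which uses a 30-billion-parameter Llama-style Transformer rather than the stand-in model shown here:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """One flow matching training step on a batch of target samples x1.

    A point x_t is drawn on the straight-line path from noise x0 to
    data x1; the model is trained to predict that path's constant
    velocity, v = x1 - x0. At inference time, integrating the learned
    velocity field from t=0 to t=1 maps noise to data.
    """
    x0 = rng.standard_normal(x1.shape)       # random noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1             # point on the noise-to-data path
    v_target = x1 - x0                       # velocity of the linear path
    v_pred = model(xt, t)
    return np.mean((v_pred - v_target) ** 2)

# Stand-in "model" that always predicts zero velocity.
zero_model = lambda xt, t: np.zeros_like(xt)
x1 = rng.standard_normal((4, 8))             # batch of 4 toy data samples
loss = flow_matching_loss(zero_model, x1, rng)
```

Because the regression target is a simple velocity rather than noise at a particular diffusion timestep, sampling reduces to solving an ODE, which typically needs far fewer steps than a full diffusion schedule.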
Meta has also invested heavily in computational power—training Movie Gen on 6,144 H100 GPUs, each with a 700W TDP, using Meta's Grand Teton AI server platform. With such massive infrastructure, Meta could leverage additional innovations like the Temporal Autoencoder (TAE) to reduce computational complexity while still generating high-quality results. Essentially, the Movie Gen approach is about brute force done smartly—harnessing immense computing resources to bypass some of the inefficiencies of existing diffusion methods.
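A quick back-of-the-envelope calculation shows why a temporal autoencoder matters at this scale. The compression factors and patch size below are assumptions for illustration only (the 95-page report gives the exact TAE configuration); the point is simply that compressing along time and space shrinks the Transformer's token count by orders of magnitude:

```python
# Token budget for a 16 s, 16 fps, 1080p video, before and after a
# temporal autoencoder (TAE). All compression factors are illustrative.
frames, height, width = 16 * 16, 1080, 1920  # 256 frames at 1080p
patch = 2                                    # assumed spatial patch size
t_comp, s_comp = 8, 8                        # assumed compression per axis

# Tokens if the Transformer saw raw pixel patches directly.
raw_tokens = frames * (height // patch) * (width // patch)

# Tokens after the TAE compresses time and space before patching.
tae_tokens = (frames // t_comp) \
    * (height // (s_comp * patch)) \
    * (width // (s_comp * patch))

print(raw_tokens, tae_tokens)  # roughly a 500x reduction
```

Even with thousands of H100s, attention over hundreds of millions of tokens per clip would be infeasible; operating in a compressed latent space is what makes the brute-force scaling tractable.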
A call to developers
In its promotional efforts for Movie Gen, Meta has made its intentions clear: it wants developers to move away from the Sora route and explore the potential of Llama for generative video. Movie Gen's success isn't just a technical feat—it's a strategic play to bolster Meta's open-source ambitions and build a developer ecosystem centered on its technologies.
At the same time, this new "family of models" seems to be less about open-source breakthroughs and more about practical application. Meta is positioning Movie Gen as a tool for real-world use cases—from enhancing Reels content to integrating into Orion, Meta's next-generation computing platform. As Mark Zuckerberg himself hinted, the goal is to make everyday creative expression accessible and exciting: imagine animating a slice of your daily life or sending a custom birthday greeting video to friends via WhatsApp—the possibilities are limitless.
Conclusion
With Movie Gen, Meta is redefining the rules of video generation. By shifting away from DiT and embracing Llama and Flow Matching, Meta is making a bold statement: the established path is not the only one—and sometimes, taking the road less traveled delivers more powerful results.