How Open-Sora 2.0 cuts the costs of AI video generation without sacrificing quality

Open Sora (image created with Imagen 3)

This article is part of our coverage of the latest in AI research.

Video generation models are notoriously expensive to train and run. But Open-Sora 2.0, a new model developed by researchers at HPC-AI Tech, shows that a commercial-level video generation model can be trained on a reasonable budget.

Open-Sora 2.0 was trained on $200,000 worth of compute, 5-10 times less than comparable models, while producing results on par with some of the most advanced video generation models, including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. The key to Open-Sora 2.0's success is a smart data collection pipeline, a novel architecture, and a very efficient training regime.

“We show that a top-performing video generation model can be trained at a highly controlled cost, offering new insights into cost-effective training and efficiency optimization,” the researchers write.

A smart data collection pipeline

While other labs focus on creating larger neural networks and throwing more data and compute at their models, Open-Sora 2.0 is based on the joint optimization of data curation, training strategy, and model architecture.

To curate the training dataset, the researchers developed a hierarchical data pyramid that provides suitable training examples for each stage of training. They first preprocess the raw videos by removing short and low-quality videos and segmenting the rest into clips suitable for training the model. They then apply a set of filters to remove low-quality segments (e.g., clips that are too fast or too slow, contain too much text, are blurry, or suffer from camera jitter).

Next, they use the open-source LLaVA-Video model to generate text descriptions for the video samples.
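
To make the data pyramid more concrete, here is a minimal Python sketch of what such a filter-then-caption step could look like. The scoring helpers, thresholds, and the `caption_with_llava_video` wrapper are hypothetical stand-ins for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a filter-then-caption curation step. All helper functions
# and thresholds are hypothetical placeholders, not the Open-Sora 2.0 code.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    duration_s: float

# --- placeholder quality metrics; a real pipeline would compute these from the frames ---
def motion_score(clip: Clip) -> float: return 0.5    # e.g., mean optical-flow magnitude
def text_coverage(clip: Clip) -> float: return 0.0   # e.g., OCR-detected text-area ratio
def blur_score(clip: Clip) -> float: return 0.0      # e.g., Laplacian-variance sharpness proxy
def jitter_score(clip: Clip) -> float: return 0.0    # e.g., frame-to-frame camera shake
def caption_with_llava_video(path: str) -> str: return "a placeholder caption"

def passes_filters(clip: Clip) -> bool:
    """Drop clips that are too short, too fast/slow, text-heavy, blurry, or shaky."""
    return (clip.duration_s >= 2.0
            and 0.2 <= motion_score(clip) <= 0.8
            and text_coverage(clip) <= 0.1
            and blur_score(clip) <= 0.5
            and jitter_score(clip) <= 0.5)

def curate(raw_clips: list[Clip]) -> list[dict]:
    """Keep clips that pass every filter and pair each with a generated caption."""
    return [{"video": c.path, "caption": caption_with_llava_video(c.path)}
            for c in raw_clips if passes_filters(c)]
```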

Open-Sora data processing and filtering pipeline (source: arXiv)

An efficient model architecture

The research team put together an architecture that maintains the quality of the output video while reducing the memory footprint. Open-Sora 2.0 is composed of a video autoencoder (VAE) and a diffusion transformer. The VAE learns to encode videos into latent representations that capture their key features, while the diffusion transformer is tasked with generating the video segments.

Video DC-AE, the encoder of Open-Sora 2.0, is based on HunyuanVideo VAE and enhanced with deep compression to reduce the costs of training and inference.

DC-AE (Deep Compression Autoencoder) is an algorithm that increases the autoencoder's downsampling ratios while maintaining the reconstruction quality of the input. The original algorithm was designed for image encoding; the Open-Sora team extended it to support video encoding.
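
As a rough back-of-the-envelope illustration of why deeper compression matters, the sketch below counts the latent tokens the diffusion transformer would have to process for a 128-frame, 768×768px clip under two sets of downsampling ratios. The ratios (4×8×8 in time×height×width for a conventional video VAE, 4×32×32 for a DC-AE-style encoder) are illustrative assumptions, not the paper's exact figures.

```python
# Rough illustration of why deeper compression matters: count the latent tokens
# the diffusion transformer must process. The downsampling ratios are illustrative,
# not necessarily the exact ones used by Open-Sora 2.0.
def latent_tokens(frames: int, height: int, width: int, dt: int, dh: int, dw: int) -> int:
    """Tokens left after the autoencoder downsamples time by dt and space by dh x dw."""
    return (frames // dt) * (height // dh) * (width // dw)

clip = dict(frames=128, height=768, width=768)
baseline = latent_tokens(**clip, dt=4, dh=8, dw=8)     # conventional video-VAE-style ratios
deep     = latent_tokens(**clip, dt=4, dh=32, dw=32)   # DC-AE-style deep compression

print(baseline, deep, baseline // deep)  # deep compression yields ~16x fewer tokens per clip
```

Since self-attention cost grows roughly with the square of the token count, the actual compute savings are even larger than the token-count ratio suggests.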

After the input video or initial image is encoded by the autoencoder, the diffusion transformer processes the latent representation to generate the output video. The diffusion transformer incorporates powerful pre-trained models, including T5-XXL, a text encoder that captures the textual semantics of the input prompt, and CLIP-Large, a model that improves the alignment between the text prompt and the generated visual content. Open-Sora's diffusion transformer also employs several techniques to improve the consistency and quality of the generated video.
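
The exact wiring is not spelled out here, but a common pattern in diffusion transformers is to inject the T5 token sequence through cross-attention while a pooled CLIP embedding modulates each block globally. The PyTorch sketch below shows that general pattern; the dimensions, module layout, and names are illustrative assumptions rather than Open-Sora 2.0's actual code.

```python
import torch
import torch.nn as nn

class DualTextConditionedBlock(nn.Module):
    """Illustrative DiT-style block: T5 tokens enter via cross-attention, while a
    pooled CLIP embedding produces a global (AdaLN-style) scale/shift modulation.
    Dimensions and layout are assumptions, not Open-Sora 2.0's actual code."""
    def __init__(self, dim: int = 1024, t5_dim: int = 4096, clip_dim: int = 768, heads: int = 16):
        super().__init__()
        self.t5_proj = nn.Linear(t5_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale_shift = nn.Linear(clip_dim, 2 * dim)   # global modulation from CLIP
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, video_tokens, t5_tokens, clip_pooled):
        # Global conditioning: scale/shift the normalized latent video tokens.
        scale, shift = self.scale_shift(clip_pooled).chunk(2, dim=-1)
        x = self.norm(video_tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # Fine-grained conditioning: attend over the projected T5 token sequence.
        ctx = self.t5_proj(t5_tokens)
        attn_out, _ = self.cross_attn(x, ctx, ctx)
        return video_tokens + attn_out

block = DualTextConditionedBlock()
out = block(torch.randn(1, 256, 1024),   # latent video tokens
            torch.randn(1, 77, 4096),    # T5-XXL token embeddings
            torch.randn(1, 768))         # pooled CLIP-Large embedding
```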

Open-Sora 2.0 architecture (source: arXiv)

An adaptive training regime

The training pipeline for Open-Sora 2.0 is a three-stage process optimized for cost-effectiveness. The researchers build on previous findings showing that pretraining on image datasets can significantly accelerate video model training. Therefore, they initialized their 11-billion-parameter text-to-video model from Flux, an open-source text-to-image model.

Training the model on a full high-resolution video dataset would be slow and costly. Therefore, they first trained the model on low-resolution videos (256×256px), allowing it to learn different motion patterns at a low cost. They then fine-tuned the model on a smaller dataset of high-resolution videos (768×768px). “To optimize efficiency, we allocate the majority of computational resources to low-resolution training, reducing the need for expensive high-resolution computations,” the researchers write.
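
A quick way to see why this split pays off is to compare per-clip token counts at the two resolutions, since self-attention cost grows roughly with the square of the token count. The numbers below reuse the illustrative downsampling ratios from the earlier sketch and are not figures from the paper.

```python
# Why most compute is spent at 256px: token counts per clip (and hence attention cost)
# grow sharply with resolution. Downsampling ratios are illustrative, as in the earlier sketch.
def tokens(frames: int, size: int, dt: int = 4, ds: int = 32) -> int:
    return (frames // dt) * (size // ds) ** 2

low, high = tokens(128, 256), tokens(128, 768)
print(high / low)         # 9x more tokens per clip at 768px
print((high / low) ** 2)  # ~81x more self-attention compute per clip
```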

Another interesting finding in the paper is that adapting a model from 256px to 768px resolution is significantly more efficient with an image-to-video approach than with text-to-video. Based on this observation, the researchers prioritized training an image-to-video model at high resolution. During inference, Open-Sora first generates an image from the text prompt using Flux, then generates the video conditioned on both the image and the text.
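
At inference time this becomes a simple two-step pipeline: text to image, then image (plus text) to video. The sketch below is only schematic; `generate_image_with_flux` and `OpenSora2ImageToVideo` are hypothetical wrappers, not the project's actual API.

```python
# Schematic text -> image -> video pipeline. The two wrappers below are
# hypothetical stand-ins for the real Flux and Open-Sora 2.0 interfaces.

def generate_image_with_flux(prompt: str):
    """Hypothetical wrapper: run the Flux text-to-image model on the prompt."""
    raise NotImplementedError

class OpenSora2ImageToVideo:
    """Hypothetical wrapper: generate a video conditioned on a start frame and the prompt."""
    def __call__(self, image, prompt: str, num_frames: int = 128, resolution: int = 768):
        raise NotImplementedError

def text_to_video(prompt: str, num_frames: int = 128, resolution: int = 768):
    # Step 1: turn the text prompt into a high-quality starting frame.
    first_frame = generate_image_with_flux(prompt)
    # Step 2: animate that frame with the image-to-video model, keeping the text as guidance.
    i2v = OpenSora2ImageToVideo()
    return i2v(first_frame, prompt, num_frames=num_frames, resolution=resolution)
```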

“During training, text/image-to-video training at low resolution followed by brief fine-tuning on high-resolution videos yields high-quality results with minimal additional training,” the researchers write.

Open-Sora 2.0 training costs vs. other models (source: arXiv)

With this approach, they were able to cut the cost of a single training run to around $200,000, as opposed to millions of dollars for comparable models. (Note that the one-time training figure does not cover the entire cost of development, which can include multiple runs and experiments to find the recipe that yields the best results.)

Open-Sora 2.0 vs other text-to-video models

Open-Sora 2.0 supports both text-to-video and image-to-video generation of up to 128 frames at 256×256px and 768×768px resolutions. (Note that since the model has been optimized for image-to-video generation, the text-to-image-to-video pipeline first generates an image with Flux and uses it as the starting frame for video generation.)

Open-Sora 2.0 performance vs. other video-generation models (source: arXiv)

The researchers used 100 text prompts to compare Open-Sora 2.0 with several closed-source APIs and open-source models, including Runway Gen-3 Alpha, Luma Ray2, HunyuanVideo, and Step-Video-T2V, rating the models on visual quality, prompt adherence, and motion quality. According to the human evaluation results, despite its low cost, Open-Sora 2.0 outperforms the other top-performing models in at least two of the three aspects.

The researchers have released Open-Sora 2.0 under the permissive Apache 2.0 license, which means it can be used for commercial applications. “We hope that by open-sourcing Open-Sora 2.0, we can provide the community with tools to collectively tackle these challenges, fostering innovation and advancements in the field of video generation,” the researchers write.
