The recent public release of the Hunyuan Video generative AI model has intensified ongoing discussions about the potential of large multimodal vision-language models to eventually create entire movies.
However, as we have observed, this is a very distant prospect at the moment, for a number of reasons. One is the very short attention window of most AI video generators, which struggle to maintain consistency even in a short single shot, let alone a series of shots.
Another is that consistent references to video content (such as explorable environments, which should not change randomly if you retrace your steps through them) can only be achieved in diffusion models through customization techniques such as low-rank adaptation (LoRA), which limits the out-of-the-box capabilities of foundation models.
Therefore the evolution of generative video seems set to stall unless new approaches to narrative continuity are developed.
Recipe for Continuity
With this in mind, a new collaboration between the US and China has proposed the use of instructional cooking videos as a possible template for future narrative continuity systems.
Click to play. The VideoAuteur project systematizes the analysis of the parts of a cooking process to produce a finely-captioned new dataset and an orchestration method for the generation of cooking videos. Refer to the source site for better resolution. Source: https://videoauteur.github.io/
Titled VideoAuteur, the work proposes a two-stage pipeline to generate instructional cooking videos using coherent states that combine keyframes and captions, achieving state-of-the-art results in what is, admittedly, an under-subscribed field.
VideoAuteur’s project page also includes a number of rather more interesting videos that use the same technique, such as a proposed trailer for a (non-existent) Marvel/DC crossover:
Click to play. Two superheroes from alternate universes come face to face in a fake trailer from VideoAuteur. Refer to the source site for better resolution.
The page also features similarly-styled promo videos for an equally non-existent Netflix animal series and a Tesla car ad.
In creating VideoAuteur, the authors experimented with a number of loss functions and other novel approaches. To develop a recipe how-to generation workflow, they also curated CookGen, the largest dataset focused on the cooking domain, featuring 200,000 video clips with an average duration of 9.5 seconds.
At an average of 768.3 words per video, CookGen is comfortably the most extensively-annotated dataset of its kind. A variety of vision/language models were used, among other approaches, to ensure that descriptions were as detailed, relevant and accurate as possible.
Cooking videos were chosen because cooking instruction walk-throughs have a structured and unambiguous narrative, making annotation and evaluation an easier task. Apart from pornographic videos (likely to enter this particular space sooner rather than later), it is difficult to think of any other genre quite as visually and narratively ‘formulaic’.
The authors state:
‘Our proposed two-stage auto-regressive pipeline, which includes a long narrative director and visual-conditioned video generation, demonstrates promising improvements in semantic consistency and visual fidelity in generated long narrative videos.
Through experiments on our dataset, we observe enhancements in spatial and temporal coherence across video sequences.
‘We hope our work can facilitate further research in long narrative video generation.’
The new work is titled VideoAuteur: Towards Long Narrative Video Generation, and comes from eight authors across Johns Hopkins University, ByteDance, and ByteDance Seed.
Dataset Curation
To develop CookGen, which powers a two-stage generative system for producing AI cooking videos, the authors used material from the YouCook and HowTo100M collections. The authors compare the scale of CookGen to previous datasets focused on narrative development in generative video, such as the Flintstones dataset, the Pororo cartoon dataset, StoryGen, Tencent’s StoryStream, and VIST.
CookGen focuses on real-world narratives, particularly procedural activities like cooking, offering clearer and easier-to-annotate stories compared to image-based comic datasets. It exceeds the largest existing dataset, StoryStream, with 150x more frames and 5x denser textual descriptions.
The researchers fine-tuned a captioning model using the methodology of LLaVA-NeXT as a base. The automatic speech recognition (ASR) pseudo-labels obtained for HowTo100M were used as ‘actions’ for each video, and then refined further by large language models (LLMs).
For instance, GPT-4o was used to produce a caption dataset, and was asked to focus on subject-object interactions (such as hands handling utensils and food), object attributes, and temporal dynamics.
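As an illustration of how such LLM-based refinement might look in practice, the sketch below passes a raw ASR transcript to GPT-4o through the OpenAI Python client; the prompt wording and parameters are our own assumptions, not the authors’ actual configuration.

```python
# Hypothetical caption refinement: turn a noisy ASR pseudo-label into a dense,
# detailed caption, emphasizing subject-object interactions, object attributes
# and temporal dynamics (prompt wording is illustrative).
from openai import OpenAI

client = OpenAI()

REFINE_PROMPT = (
    "Rewrite this noisy ASR transcript of a cooking clip as a dense caption. "
    "Focus on subject-object interactions (e.g. hands handling utensils and food), "
    "object attributes, and temporal dynamics.\n\nTranscript: {asr_text}"
)

def refine_caption(asr_text: str) -> str:
    """Ask the LLM to rewrite a raw ASR snippet as a detailed visual caption."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": REFINE_PROMPT.format(asr_text=asr_text)}],
    )
    return response.choices[0].message.content

print(refine_caption("so now you just uh flip the pancake when the edges look dry"))
```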
Since ASR scripts are likely to contain inaccuracies and to be generally ‘noisy’, Intersection-over-Union (IoU) was used as a metric to measure how closely the captions conformed to the section of the video they were addressing. The authors note that this was crucial for the creation of narrative consistency.
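For readers unfamiliar with the metric, a minimal sketch of temporal Intersection-over-Union is shown below; the (start, end) interval format in seconds is an assumption for illustration, not a detail taken from the paper.

```python
# Temporal IoU between a caption's time span and the clip it is meant to describe.
def temporal_iou(caption_span, clip_span):
    """IoU of two (start, end) intervals, in seconds."""
    start_a, end_a = caption_span
    start_b, end_b = clip_span
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# A caption covering 12s-21s of a clip spanning 10s-20s overlaps reasonably well:
print(temporal_iou((12.0, 21.0), (10.0, 20.0)))  # ~0.73
```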
The curated clips were evaluated using Fréchet Video Distance (FVD), which measures the disparity between ground truth (real-world) examples and generated examples, both with and without ground truth keyframes, arriving at a competitive result.
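FVD itself is the Fréchet distance between feature distributions extracted from real and generated clips (typically with an I3D backbone). The sketch below assumes the features have already been extracted and uses random arrays as stand-ins; it is not the authors’ evaluation code.

```python
# Fréchet distance between Gaussians fitted to two sets of video features (N x D).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Dummy 16-dimensional features for 64 real and 64 generated clips:
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(64, 16)), rng.normal(size=(64, 16))))
```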
Additionally, the clips were rated both by GPT-4o, and six human annotators, following LLaVA-Hound‘s definition of ‘hallucination’ (i.e., the capacity of a model to invent spurious content).
The researchers compared the quality of the captions to the Qwen2-VL-72B collection, obtaining a slightly improved score.
Method
VideoAuteur’s generative phase is divided between the Long Narrative Director (LND) and the visual-conditioned video generation model (VCVGM).
LND generates a sequence of visual embeddings or keyframes that characterize the narrative flow, similar to ‘essential highlights’. The VCVGM generates video clips based on these choices.
The authors extensively discuss the differing merits of an interleaved image-text director and a language-centric keyframe director, and conclude that the former is the more effective approach.
The interleaved image-text director generates a sequence by interleaving text tokens and visual embeddings, using an auto-regressive model to predict the next token, based on the combined context of both text and images. This ensures a tight alignment between visuals and text.
By contrast, the language-centric keyframe director synthesizes keyframes using a text-conditioned diffusion model based solely on captions, without incorporating visual embeddings into the generation process.
The researchers found that while the language-centric method generates visually appealing keyframes, it lacks consistency across frames, arguing that the interleaved method achieves higher scores in realism and visual consistency. They also found that this method was better able to learn a realistic visual style through training, though sometimes with some repetitive or noisy elements.
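To make the interleaved idea more concrete, the toy sketch below (our own, not the authors’ code) shows a single auto-regressive backbone consuming a mixed sequence of caption-token embeddings and keyframe latents, and reading its final hidden state out either as next-token logits or as the next visual embedding. Dimensions, layer counts and the absence of causal masking are all simplifications.

```python
import torch
import torch.nn as nn

class InterleavedDirector(nn.Module):
    """Toy interleaved image-text director: one transformer, two output heads."""
    def __init__(self, vocab_size=32000, dim=512, visual_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.visual_in = nn.Linear(visual_dim, dim)       # project visual latents into the LM space
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # causal masking omitted for brevity
        self.text_head = nn.Linear(dim, vocab_size)       # predicts the next caption token
        self.visual_head = nn.Linear(dim, visual_dim)     # regresses the next keyframe latent

    def forward(self, text_ids, visual_latents):
        # Build a mixed context of caption tokens and keyframe latents for the current step.
        seq = torch.cat([self.token_embed(text_ids), self.visual_in(visual_latents)], dim=1)
        hidden = self.backbone(seq)
        last = hidden[:, -1]                              # summary of the combined text/visual context
        return self.text_head(last), self.visual_head(last)

model = InterleavedDirector()
logits, next_latent = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 1, 768))
print(logits.shape, next_latent.shape)   # torch.Size([1, 32000]) torch.Size([1, 768])
```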
Unusually, in a research strand dominated by the co-opting of Stable Diffusion and Flux into workflows, the authors used Tencent’s SEED-X 7B-parameter multi-modal LLM foundation model for their generative pipeline (though this model does leverage Stability.ai’s SDXL release of Stable Diffusion for a limited part of its architecture).
The authors state:
‘Unlike the classic Image-to-Video (I2V) pipeline that uses an image as the starting frame, our approach leverages [regressed visual latents] as continuous conditions throughout the [sequence].
‘Furthermore, we improve the robustness and quality of the generated videos by adapting the model to handle noisy visual embeddings, since the regressed visual latents may not be perfect due to regression errors.’
Though typical visual-conditioned generative pipelines of this kind often use initial keyframes as a starting point for model guidance, VideoAuteur expands on this paradigm by generating multi-part visual states in a semantically coherent latent space, avoiding the potential bias of basing further generation solely on ‘starting frames’.
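As we read it, the robustness adaptation amounts to conditioning the generator on deliberately perturbed latents during training, so that imperfectly regressed embeddings at inference time do not derail generation. The sketch below illustrates that idea only; the noise scale is an assumption.

```python
import torch

def condition_with_noisy_latents(visual_latents: torch.Tensor,
                                 noise_std: float = 0.1,
                                 training: bool = True) -> torch.Tensor:
    """Return conditioning latents, jittered during training only."""
    if training:
        return visual_latents + noise_std * torch.randn_like(visual_latents)
    return visual_latents

clean = torch.randn(4, 16, 768)              # (clips, keyframes per clip, latent dim) - illustrative
conditioning = condition_with_noisy_latents(clean)
print((conditioning - clean).abs().mean())   # small perturbation, roughly 0.08
```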
Tests
In line with the methods of SeedStory, the researchers use SEED-X to apply LoRA fine-tuning on their narrative dataset, enigmatically describing the result as a ‘Sora-like model’, pre-trained on large-scale video/text couplings, and capable of accepting both visual and text prompts and conditions.
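The paper does not publish its fine-tuning code, but a generic LoRA setup with Hugging Face’s peft library looks roughly like the sketch below; the rank, target modules, and the stock GPT-2 stand-in for the SEED-X backbone are all illustrative choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the SEED-X backbone
lora_config = LoraConfig(
    r=16,                         # low-rank update dimension
    lora_alpha=32,
    target_modules=["c_attn"],    # attention projections in GPT-2; differs per architecture
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()            # only a small fraction of weights are trainable
```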
32,000 narrative videos were used for model development, with 1,000 held aside as validation samples. The videos were cropped to 448 pixels on the short side and then center-cropped to 448x448px.
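The stated preprocessing corresponds to a short-side resize followed by a center crop, which in torchvision (an assumption on our part; the paper does not name a library) would look like:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(448),       # scale the shorter side to 448 pixels, preserving aspect ratio
    transforms.CenterCrop(448),   # then crop the central 448x448 region
    transforms.ToTensor(),
])
```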
Narrative generation was evaluated primarily on the YouCook2 validation set, while the HowTo100M set was used for data quality evaluation and also for image-to-video generation.
For visual conditioning loss, the authors used diffusion loss from DiT and a 2024 work based around Stable Diffusion.
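For context, a standard epsilon-prediction diffusion loss of the kind used by DiT and Stable Diffusion is sketched below; the exact formulation adopted in the paper may differ, and the shapes and noise schedule are illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, timesteps, alphas_cumprod):
    """MSE between true noise and the model's predicted noise at sampled timesteps."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[timesteps].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward (noising) process
    return F.mse_loss(model(x_t, timesteps), noise)

# Toy usage with a dummy denoiser that always predicts zero noise:
dummy = lambda x_t, t: torch.zeros_like(x_t)
x0 = torch.randn(2, 4, 56, 56)                               # illustrative latent-frame shape
alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)
print(diffusion_loss(dummy, x0, torch.tensor([10, 500]), alphas))
```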
To prove their contention that interleaving is a superior approach, the authors pitted VideoAuteur against several methods that rely solely on text-based input: EMU-2, SEED-X, SDXL and FLUX.1-schnell (FLUX.1-s).
The authors state:
‘The language-centric approach using text-to-image models produces visually appealing keyframes but suffers from a lack of consistency across frames due to limited mutual information. In contrast, the interleaved generation method leverages language-aligned visual latents, achieving a realistic visual style through training.
‘However, it occasionally generates images with repetitive or noisy elements, as the auto-regressive model struggles to create accurate embeddings in a single pass.’
Human evaluation further confirms the authors’ contention about the improved performance of the interleaved approach, with interleaved methods achieving the highest scores in a survey.
We note, however, that language-centric approaches achieve the best aesthetic scores. The authors contend that this is not the central issue in the generation of long narrative videos.
Click to play. Segments generated for a pizza-building video, by VideoAuteur.
Conclusion
The most popular strand of research in regard to this challenge, i.e., narrative consistency in long-form video generation, is concerned with single images. Projects of this kind include DreamStory, StoryDiffusion, TheaterGen and NVIDIA’s ConsiStory.
In a sense, VideoAuteur also falls into this ‘static’ class, since it makes use of seed images from which clip-sections are generated. However, the interleaving of video and semantic content brings the approach a step closer to a practical pipeline.
First published Thursday, January 16, 2025