2025-12-16

Meta Launches OneStory: A Short-Drama Model That Remembers and Generates 10 Linked Scenes

Yuna Huang, Marketing Director

Meta AI has introduced OneStory, a breakthrough framework that solves the "amnesia" problem in generative video by enabling consistent, multi-shot storytelling. Unlike current models (e.g., Sora, Gen-3) that struggle with continuity, OneStory uses a "Frame Selection" brain and "Adaptive Conditioner" to maintain character and background identity across distinct scenes. Powered by a curated dataset of 60,000 narrative-rich videos, this architecture outperforms existing methods on consistency benchmarks, effectively functioning as an automated director for coherent long-form video content.


The "3-second clip" era of AI video is ending. While models like Sora, Kling, and Gen-3 have mastered visual fidelity, they struggle with the filmmaker's most fundamental tool: narrative consistency over time.

Current Image-to-Video (I2V) models usually produce a single continuous scene. Ask them to generate a second shot of the same character from a different angle, and the face changes, the clothes morph, and the background glitches. This is the "amnesia" problem of generative video.

Meta AI has just addressed this bottleneck with OneStory, a new framework that generates coherent, multi-shot videos. It doesn't just generate pixels; it "remembers" context across 10+ linked scenes, effectively acting as an automated director for short dramas.


As shown above, OneStory maintains character identity (the cat, the woman, the LEGO figures) across radically different camera angles and lighting conditions—something previous models have failed to do reliably.

Here is the technical breakdown of how OneStory works, why its "Adaptive Memory" is a game-changer, and why data curation was the hidden engine behind this breakthrough.

The Problem: Why AI Directors Have "Amnesia"

To understand OneStory, we must look at why previous Multi-Shot Video (MSV) attempts failed. Existing methods generally fall into two traps:

  1. Fixed-Window Attention: The model looks at a specific window of previous frames (e.g., the last 3 seconds). As the video progresses, the window slides forward, causing the model to "forget" what the protagonist looked like in Shot 1 by the time it reaches Shot 5.
  2. Keyframe Conditioning: The model generates a single image (keyframe) for the next shot and animates it. This limits context to a single static image, losing the complex motion and narrative cues from the previous video sequence.

OneStory reformulates the problem entirely. Instead of treating video as a block, it treats it as a "Next-Shot Generation Task"—similar to how LLMs predict the next token, OneStory predicts the next shot.
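
To make the reformulation concrete, here is a minimal Python sketch of an autoregressive next-shot loop. Everything in it is a stand-in made up for illustration: the `embed` and `generate_shot` functions are placeholders, not Meta's code, and the point is only the control flow, where each new shot conditions on memory drawn from every prior shot rather than a sliding window.

```python
import numpy as np

EMB_DIM = 512  # stand-in embedding size; the real model works on latent video tokens

def embed(item) -> np.ndarray:
    """Stand-in for a text/vision encoder (deterministic pseudo-embedding)."""
    rng = np.random.default_rng(abs(hash(str(item))) % (2**32))
    return rng.standard_normal(EMB_DIM)

def generate_shot(caption: str, memory: list[np.ndarray], n_frames: int = 16) -> list[np.ndarray]:
    """Stand-in for the diffusion video generator; 'frames' are just vectors here."""
    base = embed(caption) + np.mean(memory, axis=0)
    return [base + 0.01 * np.random.standard_normal(EMB_DIM) for _ in range(n_frames)]

def generate_story(first_frame: np.ndarray, shot_captions: list[str]) -> list[list[np.ndarray]]:
    """Autoregressive next-shot loop: each new shot conditions on ALL prior shots,
    not on a sliding window of the most recent frames."""
    shots = [[first_frame]]
    for caption in shot_captions:
        memory = [f for shot in shots for f in shot]  # global memory of every prior frame
        # OneStory's Frame Selection module (next section) would subselect the
        # most relevant of these frames instead of passing everything through.
        shots.append(generate_shot(caption, memory))
    return shots

shots = generate_story(embed("opening frame"),
                       ["the cat jumps onto the kitchen table",
                        "close-up of the cat's face",
                        "the woman enters the kitchen"])
print([len(s) for s in shots])  # [1, 16, 16, 16]
```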

The Solution: An Autoregressive "Brain" for Video

To achieve consistent storytelling without exploding computational costs, Meta introduced a novel architecture that mimics human memory.


As illustrated in the architecture diagram, the model relies on two distinct modules to manage long-term context:

The "Brain": Frame Selection Module

In a movie, Shot 10 might reference Shot 1 (the protagonist's face) rather than Shot 9 (a landscape). OneStory is built on the insight that the most recent shot is not always the most relevant one.

The Frame Selection module builds a "global memory" by selecting a sparse set of informative frames from all prior shots. It scores frames based on semantic relevance to the current caption. This allows the model to recall specific details—like a shirt color from minute one—even if the character hasn't been on screen for thirty seconds.
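
As a rough illustration (not OneStory's actual scoring function), frame selection can be sketched as a relevance ranking in a shared text-image embedding space: score every stored frame against the upcoming shot's caption and keep a small budget of the best-matching ones, preserving temporal order.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_frames(frame_embs: list[np.ndarray],
                  caption_emb: np.ndarray,
                  budget: int = 8) -> list[int]:
    """Return indices of a sparse 'global memory': the frames from all prior
    shots that score highest against the upcoming shot's caption."""
    scores = [cosine(f, caption_emb) for f in frame_embs]
    ranked = np.argsort(scores)[::-1]               # most relevant first
    return sorted(int(i) for i in ranked[:budget])  # keep them in temporal order

# Toy usage: 3 prior shots of 16 frames each, one upcoming caption.
rng = np.random.default_rng(0)
history = [rng.standard_normal(512) for _ in range(48)]
caption = rng.standard_normal(512)
print(select_frames(history, caption))  # e.g. 8 indices scattered across all prior shots
```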

The "Compressor": Adaptive Conditioner

Feeding raw frames into a diffusion model is computationally expensive. The Adaptive Conditioner dynamically compresses the selected frames: instead of treating all history equally, it assigns "finer patchifiers" (less compression) to high-importance frames and heavier compression to less relevant background details. This lets OneStory inject global context directly into the generator efficiently.
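
The sketch below illustrates the general idea with made-up patch sizes and an invented importance threshold, not the paper's values: important frames are patchified finely (many small tokens), background frames coarsely (few large tokens), and everything is projected to one shared token width before being handed to the generator.

```python
import numpy as np

TOKEN_DIM = 256  # shared conditioning-token width (illustrative)

def patchify(frame: np.ndarray, patch: int) -> np.ndarray:
    """Split an HxWxC frame into non-overlapping (patch x patch) flattened patches."""
    h, w, c = frame.shape
    frame = frame[: h - h % patch, : w - w % patch]  # crop to a multiple of the patch size
    grid = frame.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def adaptive_condition(frames, importance, fine=8, coarse=32, threshold=0.5):
    """Finer patchifier (more tokens) for important frames, coarser (fewer tokens)
    for the rest, then project everything to one shared token width."""
    rng = np.random.default_rng(0)
    # Stand-in linear projections; assumes 3-channel RGB frames.
    project = {p: rng.standard_normal((p * p * 3, TOKEN_DIM)) for p in (fine, coarse)}
    tokens = []
    for frame, score in zip(frames, importance):
        patch = fine if score >= threshold else coarse
        tokens.append(patchify(frame, patch) @ project[patch])
    return np.concatenate(tokens, axis=0)  # (total_tokens, TOKEN_DIM)

frames = [np.random.rand(256, 256, 3) for _ in range(4)]
importance = [0.9, 0.2, 0.1, 0.8]  # e.g. two frames show the protagonist's face
cond = adaptive_condition(frames, importance)
print(cond.shape)  # far fewer tokens than patchifying every frame finely
```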

The Evidence: Beating the Benchmarks

Theory is good, but performance is better. How does OneStory stack up against current state-of-the-art pipelines like Flux combined with Wan2.1 or LTX-Video?

The results are decisive. OneStory achieves the highest scores in Character Consistency (0.5874) and Environment Consistency (0.5752), significantly outperforming the fixed-window approach (Mask2DiT) and the keyframe-conditioning method (StoryDiffusion). It shows that an adaptive memory architecture is essential for long-form content.

The Hidden Engine: Why Data Quality Was Key

We often say that algorithms are the car, but data is the fuel. OneStory is the perfect proof of this.

The architecture alone wasn't enough. To make OneStory work, Meta had to curate a brand new, high-quality dataset of 60,000 multi-shot videos. They couldn't use standard stock footage; they needed videos with "referential narrative flow", footage that teaches the model that entities (people, objects) persist across shots, solving the identity consistency problem at the data level.

The leap from single-shot to multi-shot video generation requires a massive upgrade in data strategy. As the OneStory paper demonstrates, generic metadata is no longer sufficient. Models now require narrative-aware annotation and interlinked datasets.
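
As a hypothetical example of what narrative-aware annotation could look like (this is an illustrative schema, not Meta's actual format), each clip can carry persistent entity IDs and explicit links between shots, so a model can learn that the character in shot 2 is the same one introduced in shot 1:

```python
# Hypothetical multi-shot annotation record (illustrative schema, not Meta's):
# entities get persistent IDs so identity can be supervised across shots.
clip_annotation = {
    "video_id": "drama_000123",
    "entities": {
        "char_01": {"type": "person", "description": "woman in a red coat"},
        "char_02": {"type": "animal", "description": "grey tabby cat"},
    },
    "shots": [
        {
            "shot_index": 0,
            "caption": "A woman in a red coat feeds a grey cat in the kitchen.",
            "entities_on_screen": ["char_01", "char_02"],
            "references_shots": [],
        },
        {
            "shot_index": 1,
            "caption": "Close-up of the same cat looking out the window.",
            "entities_on_screen": ["char_02"],
            "references_shots": [0],  # explicit link back to where the cat was introduced
        },
    ],
}
```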

If you are building the next generation of video models, don't let your data break the story.

Contact Abaka AI to Scale Your Video Data Pipeline ->

