VideoFlexTok: Flexible-Length Coarse-to-Fine
Video Tokenization

1Apple   2Swiss Federal Institute of Technology (EPFL)
* Equal contribution

VideoFlexTok represents videos with a flexible-length coarse-to-fine sequence of tokens. Given an input video (leftmost), VideoFlexTok maps it to a temporal sequence of tokens of shape $T\times256$, where the second dimension corresponds to the coarse-to-fine ordered tokens. The generative flow decoder enables realistic video reconstructions using any number of tokens $T \times k$. We find that the first few tokens (emergently) capture abstract information, such as semantics and motion, while later tokens add finer details (e.g., the cars' motion in the first two rows, or the camera motion in the last row). This property allows adapting the token count according to downstream needs and encoding longer videos than the baselines with the same total budget.

Abstract

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. The de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, resulting in a needlessly hard learning problem.

We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding longer videos than the baselines with the same budget.

We evaluate VideoFlexTok on class-to-video and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 10x smaller model (0.4B vs 3.6B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

VideoFlexTok method overview

We design our VideoFlexTok tokenizer to have the following three main properties:

  • Flexible-length tokenization: The encoder produces a variable-length sequence of tokens. This allows the downstream models to adapt the token count based on task requirements. Following FlexTok, we achieve this by applying nested dropout over the register tokens.

  • Coarse-to-fine semantic ordering: Earlier tokens capture higher-level semantic information, while later tokens encode finer details. We achieve this by using DINOv2 self-supervised features as an auxiliary target during training.

  • High-fidelity reconstruction: The decoder reconstructs plausible realistic videos from any number of tokens, preserving information captured by the tokens. We achieve this by using a generative flow decoder conditioned on the encoded tokens, which also provides a reconstruction objective for encoder training.

VideoFlexTok method overview

VideoFlexTok: The ViT encoder with registers resamples 3D video VAE latents into a temporal coarse-to-fine sequence of tokens. Nested dropout is applied over the register tokens to promote the coarse-to-fine structure; combined with a semantic bias in the form of a REPA loss predicting DINOv2 features, this leads to the early tokens capturing the most salient semantic information. A generative flow decoder reconstructs realistic videos from any number of tokens.

VideoFlexTok training

For implementation, we mainly follow FlexTok and extend it to temporal video data. Our architecture consists of the following three main components:

  • Time-Causal Encoder is a ViT with register tokens that maps the input 3D VAE latent frames $T \times H \times W$ to a temporal 2D sequence of tokens $T \times K$, where $K$ indexes the coarse-to-fine ordered tokens and $T$ is the number of latent frames (after temporal compression in the VAE). Unlike the fully 1D sequence used in LARP, our encoder uses a time-causal attention pattern that preserves the temporal structure of the original signal, which we found to be beneficial for downstream video modeling and which also enables streaming tokenization. Since our main downstream application is video generation via a GPT-like autoregressive transformer, we also apply FSQ quantization to the register tokens.

  • Nested dropout randomly drops the last $k < K$ register tokens during training, promoting the model to learn an ordered representation where earlier tokens capture the most important information and later tokens provide finer details (see the sketch after this list). This is the key component that enables flexible-length coarse-to-fine tokenization.

  • Time-Causal Decoder is a DiT-based conditional generative flow model. Given the masked token sequence after nested dropout and noised VAE latents, it reconstructs the clean VAE latents. Reconstruction-based objectives, however, tend to prioritize low-level details, which can prevent the earlier tokens in the hierarchy from focusing mainly on semantically meaningful information. We therefore use a semantic bias in the form of the REPA loss, found to be useful in FlexTok. Specifically, we train a shallow readout network that predicts DINOv2 features. Finally, we opt for a time-causal attention pattern in the decoder, which we found to improve downstream generative modeling performance, especially when using fewer tokens.
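To make the nested dropout and time-causal attention described above concrete, below is a minimal PyTorch sketch. All names, shapes, and the masking convention are illustrative assumptions, not the actual VideoFlexTok implementation.

```python
import torch

def nested_dropout(registers: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Keep only the first keep[b] register tokens per sample and zero out
    (or, equivalently, mask out of attention) the remaining suffix.

    registers: (B, T, K, D) per-frame register tokens
    keep:      (B,) number of tokens to keep, sampled in [1, K]
    """
    K = registers.shape[2]
    idx = torch.arange(K, device=registers.device)               # (K,)
    mask = idx[None, None, :] < keep[:, None, None]              # (B, 1, K)
    return registers * mask.unsqueeze(-1)                        # drop the suffix

def time_causal_mask(T: int, K: int, device=None) -> torch.Tensor:
    """Boolean attention mask over the flattened (T*K)-token sequence:
    a token in latent frame t may attend to all tokens in frames <= t."""
    frame_id = torch.arange(T, device=device).repeat_interleave(K)    # (T*K,)
    return frame_id[None, :] <= frame_id[:, None]                     # (T*K, T*K)

# Example usage with made-up shapes (5 latent frames, 256 registers per frame).
B, T, K, D = 2, 5, 256, 768
registers = torch.randn(B, T, K, D)
keep = torch.randint(1, K + 1, (B,))         # sampled keep-length per clip
registers = nested_dropout(registers, keep)
attn_mask = time_causal_mask(T, K)           # passed to the encoder's attention
```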

We train two main VideoFlexTok versions. VideoFlexTok-K600 is a 570M-parameter model (excluding adaLN parameters) with an 18-layer encoder and decoder, trained on the Kinetics-600 dataset on 17-frame clips at 128x128 resolution for 400B VAE tokens. VideoFlexTok-Panda is a 1.3B-parameter model (excluding adaLN parameters) with an 18-layer encoder and a 28-layer decoder, trained on a subset of the Panda70M dataset on 17-frame clips at 256x256 resolution for 400B VAE tokens. We use a codebook size of 64,000 for both models.

For VideoFlexTok-Panda, we introduce an additional training stage where we fix the encoder and fine-tune the decoder for another 400B tokens with the following two interventions. First, we switch from time-causal to full attention pattern in the decoder, which leads to better reconstruction quality, especially improving temporal consistency. Note that the encoder remains fixed during this stage, so it retains the benefits from being trained with the time-causal decoder. Second, we introduce a frame-conditioning capability by randomly providing a clean first frame instead of a noised one. This enables streaming tokenization by conditioning the decoder on its previously reconstructed frames during inference. We demonstrate results using this tokenizer version unless stated otherwise.
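As an illustration of the frame-conditioning intervention, here is a hedged sketch of how a clean first latent frame could be swapped in during decoder fine-tuning; the actual conditioning mechanism and flow parameterization in VideoFlexTok may differ.

```python
import torch

def build_decoder_input(clean_latents: torch.Tensor, t: torch.Tensor,
                        p_frame_cond: float = 0.5):
    """Noise the VAE latents (a rectified-flow style interpolation is assumed here)
    and, with probability p_frame_cond, provide the first latent frame clean.

    clean_latents: (B, T, C, H, W) VAE latents
    t:             (B,) flow time in [0, 1], with t = 1 corresponding to clean data
    """
    B, T = clean_latents.shape[:2]
    noise = torch.randn_like(clean_latents)
    t_ = t.view(B, 1, 1, 1, 1)
    x_t = (1.0 - t_) * noise + t_ * clean_latents             # noised latents

    # Frame conditioning: randomly replace the noised first frame with the clean one.
    cond = torch.rand(B, device=clean_latents.device) < p_frame_cond
    x_t[cond, 0] = clean_latents[cond, 0]

    # Mark conditioning frames so they can be excluded from the reconstruction loss.
    is_cond_frame = torch.zeros(B, T, dtype=torch.bool, device=clean_latents.device)
    is_cond_frame[cond, 0] = True
    return x_t, is_cond_frame
```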

VideoFlexTok reconstruction visualization

The following visualizations demonstrate the flexible-length tokenization capability of VideoFlexTok. First, we see that no matter how many tokens are used, the reconstructions remain plausible and realistic, thanks to the generative flow decoder. Second, and most interestingly, we find that the first few tokens capture semantically meaningful information, such as object types, their motion, and the overall scene geometry, while abstracting away more nuanced details such as color, texture, etc. Later tokens progressively add finer details, enabling high-fidelity reconstruction when using more tokens. In the car example, note how the object type and its rotational motion are well preserved with as few as 1-4 tokens per frame, while the color and finer details are only reconstructed when using more tokens.

Flexible-length tokenization


We design the following probing experiment to analyze the information contained in the first few VideoFlexTok tokens. Given a source video, we keep only 1 or 2 tokens per latent frame and make an isolated change to its first frame (e.g., changing an orange to an apple using Nano Banana). We then condition the decoder on both the original tokens and the new, edited frame for reconstruction. We find that, in most cases, VideoFlexTok preserves the edit throughout the reconstructed video, suggesting that the first tokens primarily capture motion information.

Probing the first VideoFlexTok tokens

Generative modeling with VideoFlexTok

We evaluate the representations learned by VideoFlexTok on two downstream video generation tasks: class-to-video (C2V) and text-to-video (T2V) generation. We demonstrate that using VideoFlexTok's tokens together with the generative decoder enables coarse-to-fine video generation, allowing the downstream model to adapt the number of generated tokens. This leads to more efficient downstream generative modeling, achieving the same or better performance with smaller models and/or less training compute compared to standard 3D grid tokenization.

Coarse-to-fine generation of VideoFlexTok tokens

Autoregressive coarse-to-fine generation
We train a GPT-like autoregressive transformer to generate VideoFlexTok tokens in a coarse-to-fine manner. Specifically, we use a "time-first" ordering where we generate the first token for all frames, then the second token for all frames, and so on. Using this ordering, we can generate videos with any desired level of detail (number of tokens) and use the VideoFlexTok decoder to fill in the rest of the details and reconstruct a realistic video. In addition, we empirically found this ordering to work better than the "depth-first" approach, where all tokens for a single frame are generated before moving to the next frame (see Supplementary material).
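Concretely, the two orderings differ only in how the $T \times K$ token grid is flattened into a 1D sequence for the autoregressive model; a small sketch with illustrative shapes:

```python
import torch

# Token-id grid for one video: T latent frames, K coarse-to-fine tokens per frame.
T, K = 5, 256
tokens = torch.arange(T * K).view(T, K)

# "Time-first": the first token of every frame, then the second token of every
# frame, and so on. Truncating after T*k elements yields a complete video at
# detail level k, which the VideoFlexTok decoder can already render.
time_first = tokens.t().reshape(-1)      # [f0k0, f1k0, ..., f4k0, f0k1, f1k1, ...]

# "Depth-first": all K tokens of frame 0, then all of frame 1, etc. Stopping
# early here leaves later frames with no tokens at all.
depth_first = tokens.reshape(-1)         # [f0k0, f0k1, ..., f0k255, f1k0, ...]
```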
We train our class-to-video models on Kinetics-600 and text-to-video models on a subset of the Panda70M dataset with synthetic captions obtained following ShareGPT4V. Below we demonstrate examples of the coarse-to-fine generation from our 5B T2V model trained for 400B VideoFlexTok tokens.

Coarse-to-fine generation



Efficient generative modeling with VideoFlexTok

As demonstrated above, it is possible to express text conditioning well with only 16-64 VideoFlexTok tokens per latent frame. This suggests that we should be able to train a downstream generative model more efficiently, for example, by using a smaller model and/or training for fewer iterations. We design a series of scaling experiments to study the efficiency of VideoFlexTok for downstream generative modeling. For the text-to-video tasks, we use a Chinchilla-inspired scaling approach and scale both the model size $N$ and the number of training tokens $D$ using the heuristic $D \approx 20N$. Our sweep spans FLOPs from $1.6\times 10^{20}$ to $5\times 10^{21}$ with model sizes from 0.16B to 5B parameters.
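For reference, a minimal expression of the scaling heuristic (the paper's exact FLOPs accounting behind the reported budgets may include additional terms, so this shows only the token-allocation rule):

```python
# Chinchilla-inspired heuristic used for the sweep: scale training tokens with
# model size as D ~= 20 * N.
def training_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return tokens_per_param * n_params

for n in (0.16e9, 0.4e9, 3.6e9, 5.0e9):    # model sizes mentioned in the text
    print(f"N = {n / 1e9:.2f}B params -> D ~= {training_tokens(n) / 1e9:.0f}B tokens")
```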



FLOPs-efficient T2V modeling with VideoFlexTok

FLOPs-efficient T2V modeling via VideoFlexTok. We train a series of T2V models with different compute budgets (FLOPs) using 3D Grid and VideoFlexTok tokenizers. We evaluate the fidelity (gFVD) and conditioning alignment (ViCLIP) of the generated videos. We find that choosing the best number of tokens at inference time for VideoFlexTok results in significantly more FLOPs-efficient generative modeling. Moreover, directly training AR models on only 32 VideoFlexTok tokens (per latent frame, purple) further improves efficiency by reducing the training compute requirement. Overall, VideoFlexTok can achieve similar performance to a 3D grid tokenizer with ~10x less training compute.

We train T2V models using both 3D grid and VideoFlexTok tokenizers across a range of compute budgets (FLOPs), using the full sequence of 256 tokens per latent frame (1280 tokens in total for 5 latent frames) during training. At inference time, we can vary the number of generated VideoFlexTok tokens and select the best-performing configuration. Intuitively, this allows us to approximate the performance of a model that was trained specifically for that token count, without the need to train multiple models.

We find that choosing the optimal number of tokens leads to significantly more FLOPs-efficient generative modeling compared to using a fixed 3D grid of tokens. This is especially pronounced with gFVD, which is only concerned with fidelity, not alignment with the conditioning. Therefore, generating even a single token, which can be done very efficiently, and reusing the generative VideoFlexTok decoder to fill in the rest of the details, can lead to low gFVD. On the other hand, ViCLIP score requires better alignment with the text conditioning, which requires generating more tokens. We observe similar trends for the class-to-video task, which you can find in our paper.

It is important to note that these efficiency gains rely on the ability of the VideoFlexTok decoder to generate plausible samples given any number of tokens, and training this generative decoder is itself compute-intensive. To measure the efficiency of the complete pipeline as a single run, one would need to include the cost of both training stages. However, we believe the main value of VideoFlexTok lies in amortizing the training cost of the tokenizer and its decoder over multiple downstream tasks and runs. Indeed, just as image encoders such as CLIP or DINO are pre-trained once and then reused for multiple downstream tasks, a strong decoder with a flexible conditioning mechanism can play an analogous role for generative modeling, essentially democratizing it.

Long video modeling with VideoFlexTok

Finally, we provide a proof-of-concept demonstration of how VideoFlexTok can enable long video modeling without incurring prohibitive computational costs. Specifically, we demonstrate two capabilities: First, we show how to extend VideoFlexTok to streaming tokenization of arbitrary-length videos. Second, leveraging our findings on efficient generative modeling, we train a text-to-video model to generate 10-second (81-frame) videos using only 672 tokens in total—8× fewer than a comparable 3D grid tokenizer (5376 tokens).


Streaming tokenization with VideoFlexTok

How can we extend VideoFlexTok, trained on fixed-length clips, to streaming tokenization of arbitrary-length videos? A common approach is to use overlapping sliding windows and rely only on the information in the tokens during decoding. This works well when the tokens preserve most of the information about the input video. However, when decoding from only a few VideoFlexTok tokens, the decoder has to fill in, i.e., generate, the missing details, and these generated details must remain consistent across time. We therefore design a decoding scheme, inspired by ARLON, where the decoder is conditioned not only on the current window's tokens but also on the previously decoded frames. This allows the decoder to maintain temporal consistency across windows even when decoding from only a few tokens.
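A high-level sketch of this streaming decoding loop is shown below; the function signatures, the `frame_cond` argument, and the overlap handling are assumptions for illustration, not the actual interface.

```python
import torch

def stream_decode(encoder, decoder, video_latents, window=5, overlap=1, k_tokens=32):
    """Tokenize and decode an arbitrary-length latent video with overlapping windows.

    video_latents: (T_total, C, H, W) VAE latents of the full video
    window:        latent frames per encoder window
    overlap:       latent frames shared between consecutive windows (1 in the paper)
    k_tokens:      coarse-to-fine tokens kept per latent frame
    """
    decoded = []                                             # previously decoded frames
    start = 0
    while start < video_latents.shape[0]:
        chunk = video_latents[start:start + window]
        tokens = encoder(chunk)[:, :k_tokens]                # keep only the first k tokens

        if not decoded:
            # First window: condition on the tokens only.
            out = decoder(tokens)
        else:
            # Later windows: additionally condition on the previously decoded
            # overlap frames so generated low-level details stay consistent.
            context = torch.stack(decoded[-overlap:])
            out = decoder(tokens, frame_cond=context)
            out = out[overlap:]                              # drop the re-decoded overlap
        decoded.extend(list(out))
        start += window - overlap
    return torch.stack(decoded)
```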


Streaming tokenization with VideoFlexTok

Streaming tokenization with VideoFlexTok. We apply the encoder in a sliding window manner with overlapping frames (1 frame in our experiments). The first window is decoded as usual, conditioning only on the tokens from that window (32 tokens in this example). For subsequent windows, the decoder is conditioned on both the current window's tokens and the previously decoded frames, allowing it to maintain temporal consistency even when decoding from only a few tokens. Note how the decoded frames for the first window vary in some low-level details (e.g., the plant, table, etc.) due to using only 32 tokens, but these get preserved across time in the subsequent windows due to conditioning on the previously generated frames.

To enable this decoding scheme, we fine-tune our VideoFlexTok decoder in the frame-conditional setting. Similar to Open-Sora, we observed that even a short fine-tuning can be sufficient to acquire this capability. Below we demonstrate examples of streaming tokenization using a varying number of tokens per latent frame.


In-context long video generation with VideoFlexTok

Finally, we combine streaming tokenization with our findings on efficient generative modeling to train a text-to-video model on 10-second (81-frame) videos. By using only 32 tokens per latent frame, we require just 672 tokens per video, 8× fewer than the 5376 tokens required by comparable 3D grid tokenizers [1, 2, 3]. Importantly, this allows us to fit the entire 10-second video within the context window without prohibitive computational costs.
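The token counts follow directly from the tokenizer's temporal compression (17 input frames map to 5 latent frames, i.e., a temporal stride of 4 plus one leading frame), as the quick check below shows; the stride formula is inferred from those reported numbers.

```python
# Latent-frame count consistent with the reported numbers (17 frames -> 5 latent frames).
def n_latent_frames(n_frames: int, temporal_stride: int = 4) -> int:
    return (n_frames - 1) // temporal_stride + 1

T = n_latent_frames(81)                  # 21 latent frames for a 10-second, 81-frame video
videoflextok_tokens = T * 32             # 32 coarse-to-fine tokens per latent frame
grid_tokens = T * 256                    # full 3D-grid budget of 256 tokens per latent frame

print(T, videoflextok_tokens, grid_tokens, grid_tokens // videoflextok_tokens)
# -> 21 672 5376 8
```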

We train a 3.2B AR model for $\sim$55B tokens resulting in a total of $\sim 10^{21}$ FLOPs, which falls within the middle range of our previous scaling experiments. Below we demonstrate qualitative examples of the generated 10-second videos from our model.

10-second video generation



Conclusion and discussion

We introduce VideoFlexTok, a tokenizer that represents videos with a flexible-length sequence of tokens structured in a coarse-to-fine manner, allowing these representations to be adapted to particular downstream needs. Its generative flow decoder can decode realistic videos from any number of tokens. We demonstrate that this structure leads to more computationally efficient generative modeling and can enable the generation of longer videos without substantially increasing the context length and computational cost, effectively democratizing video generative modeling.

We believe that modeling in more compact and semantically aware abstract representation spaces like VideoFlexTok's will enable capturing higher-level, long-range dependencies from videos more efficiently than learning them directly from pixels. The coarse-to-fine structure enables capturing these dependencies at different levels of abstraction. This, in turn, can lead to more efficient and performant visual reasoning models that adaptively decide what level of abstraction to work in.

BibTeX


Acknowledgments

We thank Mingfei Gao, David Mizrahi, Enrico Fini, Philipp Dufter, and Erik Daxberger for their feedback and discussion during the early stages of the project. We also thank Jason Toskov, Rishubh Singh, Kunal Singh, and Ali Garjani for their help in preparing the manuscript. This work was supported under project ID a08 as part of the Swiss AI Initiative, through a grant from the ETH Domain and computational resources provided by the Swiss National Supercomputing Centre (CSCS) under the Alps infrastructure. This work has received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI).