Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. The de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This forces the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, making the learning problem unnecessarily hard.
We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner: the first tokens (emergently) capture abstract information, such as semantics and motion, while later tokens add fine-grained details. A generative flow decoder enables realistic video reconstructions from any token count. This structure makes it possible to adapt the token count to downstream needs and to encode longer videos than 3D grid baselines within the same token budget.
We evaluate VideoFlexTok on class-to-video and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 10x smaller model (0.4B vs 3.6B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
We train two main VideoFlexTok versions. VideoFlexTok-K600 is a 570M-parameter model (excluding adaLN parameters) with an 18-layer encoder and an 18-layer decoder, trained on the Kinetics-600 dataset on 17-frame clips at 128x128 resolution for 400B VAE tokens. VideoFlexTok-Panda is a 1.3B-parameter model (excluding adaLN parameters) with an 18-layer encoder and a 28-layer decoder, trained on a subset of the Panda70M dataset on 17-frame clips at 256x256 resolution for 400B VAE tokens. Both models use a codebook size of 64,000.
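For reference, the two configurations can be summarized as follows. This is a minimal illustrative sketch; the field names are hypothetical and do not come from the actual training code.

```python
from dataclasses import dataclass

@dataclass
class TokenizerConfig:
    """Illustrative summary of the two configurations above (hypothetical field names)."""
    params: str            # parameter count, excluding adaLN parameters
    encoder_layers: int
    decoder_layers: int
    dataset: str
    frames: int            # frames per training clip
    resolution: int        # square spatial resolution
    train_tokens: str      # VAE tokens seen during training
    codebook_size: int

VIDEOFLEXTOK_K600 = TokenizerConfig(
    params="570M", encoder_layers=18, decoder_layers=18,
    dataset="Kinetics-600", frames=17, resolution=128,
    train_tokens="400B", codebook_size=64_000,
)
VIDEOFLEXTOK_PANDA = TokenizerConfig(
    params="1.3B", encoder_layers=18, decoder_layers=28,
    dataset="Panda70M (subset)", frames=17, resolution=256,
    train_tokens="400B", codebook_size=64_000,
)
```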
For VideoFlexTok-Panda, we introduce an additional training stage in which we freeze the encoder and fine-tune the decoder for another 400B tokens with two interventions. First, we switch the decoder from a time-causal to a full attention pattern, which improves reconstruction quality and especially temporal consistency. Since the encoder remains frozen during this stage, it retains the benefits of having been trained with the time-causal decoder. Second, we introduce a frame-conditioning capability by randomly providing a clean first frame instead of a noised one. This enables streaming tokenization, where the decoder is conditioned on its previously reconstructed frames during inference. Unless stated otherwise, we report results using this tokenizer version.
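A minimal sketch of the frame-conditioning intervention is shown below, assuming a simple per-batch coin flip and a [batch, frame, ...] tensor layout; both the probability value and the layout are assumptions rather than the actual training setup.

```python
import torch

def decoder_frame_conditioning(clean_frames: torch.Tensor,
                               noised_frames: torch.Tensor,
                               p_clean_first_frame: float = 0.5) -> torch.Tensor:
    """Sketch: with some probability, the decoder sees a clean first frame instead of a
    noised one during fine-tuning, so that at inference time it can be conditioned on
    previously reconstructed frames (streaming tokenization). Probability and tensor
    layout are illustrative assumptions."""
    decoder_input = noised_frames.clone()
    if torch.rand(()) < p_clean_first_frame:
        # Substitute the clean first frame for its noised version.
        decoder_input[:, 0] = clean_frames[:, 0]
    return decoder_input
```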
The following visualizations demonstrate the flexible-length tokenization capability of VideoFlexTok. First, no matter how many tokens are used, the reconstructions remain plausible and realistic, thanks to the generative flow decoder. Second, and most interestingly, the first few tokens capture semantically meaningful information, such as object types, their motion, and the overall scene geometry, while abstracting away more nuanced details such as color and texture. Later tokens progressively add finer details, enabling high-fidelity reconstruction when more tokens are used. In the car example, note how the object type and its rotational motion are well preserved with as few as 1-4 tokens per frame, while the color and finer details are only reconstructed when using more tokens.
We design the following probing experiment to analyze the information contained in the first few VideoFlexTok tokens. Given a source video, we keep only 1 or 2 tokens per latent frame and make an isolated change to its first frame (e.g., changing an orange to an apple using Nano Banana). We then condition the decoder on both the retained original tokens and the edited first frame for reconstruction. In most cases, VideoFlexTok propagates the edit throughout the reconstructed video, suggesting that the first tokens primarily capture motion information.
Figure: An original video and its first frame; three edited first frames (Edit #1-#3); and the corresponding reconstructions (Reconstruction #1-#3), obtained by keeping only 1 token per frame and conditioning the decoder on each edited first frame.
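A minimal sketch of the probing procedure above is given below; `encode`, `decode`, and `edit_first_frame` are placeholder callables standing in for the tokenizer, the generative flow decoder, and an external image editor, and the token layout is an assumption.

```python
def probe_first_tokens(video, encode, decode, edit_first_frame,
                       edit_prompt, tokens_per_latent_frame=1):
    """Sketch of the probing experiment; callables and the assumed
    [latent_frame][token] layout of the token sequence are placeholders."""
    # Tokenize the source video and keep only the first 1-2 tokens per latent frame.
    tokens = encode(video)
    kept = [frame_tokens[:tokens_per_latent_frame] for frame_tokens in tokens]
    # Make an isolated change to the first frame only (e.g., orange -> apple).
    edited_first_frame = edit_first_frame(video[0], edit_prompt)
    # Reconstruct, conditioning the decoder on both the retained tokens
    # and the edited first frame.
    return decode(kept, first_frame=edited_first_frame)
```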
As demonstrated above, text conditioning can be expressed well with only 16-64 VideoFlexTok tokens per latent frame. This suggests that a downstream generative model can be trained more efficiently, for example by using a smaller model and/or training for fewer iterations. We design a series of scaling experiments to study the efficiency of VideoFlexTok for downstream generative modeling. For the text-to-video task, we follow a Chinchilla-inspired scaling approach and scale both the model size $N$ and the number of training tokens $D$ using the heuristic $D \approx 20N$. Our sweep spans FLOPs from $1.6\times 10^{20}$ to $5\times 10^{21}$ with model sizes from 0.16B to 5B parameters.
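For illustration, the heuristic ties the training-token budget directly to model size; the listed sizes are example points within the stated range, not necessarily the actual sweep configurations.

```python
# D ~= 20N heuristic: training tokens scale linearly with parameter count.
# Model sizes below are illustrative points within the 0.16B-5B range.
for n_params_b in [0.16, 0.4, 1.0, 3.6, 5.0]:
    train_tokens_b = 20 * n_params_b
    print(f"{n_params_b:>4.2f}B params -> ~{train_tokens_b:.0f}B training tokens")
```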
We train T2V models with both the 3D grid and VideoFlexTok tokenizers across a range of compute budgets (FLOPs), using the full sequence of 256 tokens per latent frame (1280 tokens in total for 5 latent frames) during training. At inference time, we can vary the number of generated VideoFlexTok tokens and select the best-performing configuration. Intuitively, this approximates the performance of a model trained specifically for that token count, without the need to train multiple models.
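A minimal sketch of this inference-time selection is shown below; `generate`, `decode_tokens`, and `evaluate` are placeholder callables, and the candidate counts and lower-is-better metric convention are assumptions.

```python
def select_token_count(generate, decode_tokens, evaluate,
                       candidate_counts=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Evaluate a single trained model at several generated-token budgets and
    keep the best one, instead of training one model per token count."""
    scores = {}
    for k in candidate_counts:
        tokens = generate(tokens_per_latent_frame=k)  # generate only the first k tokens per latent frame
        videos = decode_tokens(tokens)                # the flow decoder fills in the remaining detail
        scores[k] = evaluate(videos)                  # e.g., gFVD (lower is better)
    best_k = min(scores, key=scores.get)
    return best_k, scores
```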
We find that choosing the optimal number of tokens leads to significantly more FLOPs-efficient generative modeling than using a fixed 3D grid of tokens. This is especially pronounced for gFVD, which measures only fidelity, not alignment with the conditioning: generating even a single token, which can be done very efficiently, and letting the generative VideoFlexTok decoder fill in the remaining details can already yield low gFVD. The ViCLIP score, on the other hand, requires better alignment with the text conditioning and therefore benefits from generating more tokens. We observe similar trends for the class-to-video task; see our paper for details.
It is important to note that these efficiency gains rely on the ability of the VideoFlexTok decoder to generate plausible samples given any number of tokens, and training this generative decoder is itself compute-intensive. To measure the efficiency of the complete pipeline as a single run, one would need to include the cost of both training stages. However, we believe the main value of VideoFlexTok lies in amortizing the training cost of the tokenizer and its decoder over multiple downstream tasks and runs. Indeed, just as image encoders such as CLIP or DINO are pre-trained once and then reused for multiple downstream tasks, a strong decoder with a flexible conditioning mechanism can play the same role for generative modeling, essentially democratizing it.
Finally, we provide a proof-of-concept demonstration of how VideoFlexTok can enable long video modeling without incurring prohibitive computational costs. Specifically, we demonstrate two capabilities: First, we show how to extend VideoFlexTok to streaming tokenization of arbitrary-length videos. Second, leveraging our findings on efficient generative modeling, we train a text-to-video model to generate 10-second (81-frame) videos using only 672 tokens in total—8× fewer than a comparable 3D grid tokenizer (5376 tokens).
How can we extend VideoFlexTok, trained on fixed-length clips, to streaming tokenization of arbitrary-length videos? A common approach is to use overlapping sliding windows and rely only on the information in the tokens during decoding. This works well when the tokens preserve most of the information about the input video. However, when decoding from only a few VideoFlexTok tokens, the decoder needs to fill in, i.e., generate, the missing details, and these generated details must remain consistent across time. We therefore design a decoding scheme, inspired by ARLON, in which the decoder is conditioned not only on the current window's tokens but also on the previously decoded frames. This allows the decoder to maintain temporal consistency across windows even when decoding from only a few tokens.
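A minimal sketch of this decoding scheme follows, with `decode_window` standing in for the frame-conditional decoder; the window layout and the one-frame overlap are assumptions.

```python
def streaming_decode(window_tokens_list, decode_window, overlap=1):
    """Decode a long video window by window, conditioning each window on frames
    already reconstructed from the previous one so that generated details stay
    consistent across windows."""
    decoded_frames = []
    context_frames = None
    for window_tokens in window_tokens_list:
        frames = decode_window(window_tokens, conditioning_frames=context_frames)
        # Skip frames that overlap with what the previous window already emitted.
        decoded_frames.extend(frames if context_frames is None else frames[overlap:])
        # Condition the next window on the most recently reconstructed frame(s).
        context_frames = frames[-overlap:]
    return decoded_frames
```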
To enable this decoding scheme, we fine-tune the VideoFlexTok decoder in the frame-conditional setting. Similar to Open-Sora, we observe that even short fine-tuning is sufficient to acquire this capability. Below we show examples of streaming tokenization using a varying number of tokens per latent frame.
Finally, we combine streaming tokenization with our findings on efficient generative modeling to train a text-to-video model on 10-second (81-frame) videos. By using only 32 tokens per latent frame, we require just 672 tokens per video, 8× fewer than the 5376 tokens required by comparable 3D grid tokenizers [1, 2, 3]. Importantly, this allows us to fit the entire 10-second video within the context window without prohibitive computational costs.
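The token budget works out as follows, assuming the same 4× temporal downsampling implied by 17 input frames mapping to 5 latent frames:

$$\frac{81-1}{4}+1 = 21 \ \text{latent frames}, \qquad 21 \times 32 = 672 \ \text{VideoFlexTok tokens}, \qquad 21 \times 256 = 5376 \ \text{3D grid tokens}.$$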
We train a 3.2B-parameter AR model for $\sim$55B tokens, resulting in a total of $\sim 10^{21}$ FLOPs, which falls within the middle of the range covered by our scaling experiments above. Below we show qualitative examples of 10-second videos generated by this model.
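As a rough sanity check, using the common $C \approx 6ND$ approximation for dense-transformer training compute (an approximation, not necessarily the exact accounting used above):

$$C \approx 6 \times (3.2 \times 10^{9}) \times (55 \times 10^{9}) \approx 1.1 \times 10^{21} \ \text{FLOPs}.$$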
We introduce VideoFlexTok, a tokenizer that represents videos with a flexible-length sequence of tokens structured in a coarse-to-fine manner, allowing these representations to be adapted to particular downstream needs. Its generative flow decoder can decode realistic videos from any number of tokens. We demonstrate that this structure leads to more computationally efficient generative modeling and enables the generation of longer videos without substantially increasing the context length and computational cost, effectively democratizing video generative modeling.
We believe that modeling in more compact, semantically aware, and abstract representation spaces like VideoFlexTok's will make it possible to capture higher-level, long-range dependencies in videos more efficiently than learning them directly from pixels. The coarse-to-fine structure enables capturing these dependencies at different levels of abstraction. This, in turn, can lead to more efficient and performant visual reasoning models that adaptively decide which level of abstraction to work at.
We thank Mingfei Gao, David Mizrahi, Enrico Fini, Philipp Dufter, and Erik Daxberger for their feedback and discussions during the early stages of the project. We also thank Jason Toskov, Rishubh Singh, Kunal Singh, and Ali Garjani for their help in preparing the manuscript. This work was supported under project ID a08 as part of the Swiss AI Initiative, through a grant from the ETH Domain and computational resources provided by the Swiss National Supercomputing Centre (CSCS) under the Alps infrastructure. This work has received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI).