Anthea Y. Li, Hao Tan, Kai Zhang, Hanwen Jiang, Antonio Torralba, Sai Bi



<aside>

Autoregressive block-diffusion video generation enables world models and long video sequences. However, each video rollout is prohibitively expensive, and reinforcement learning (RL) based post-training, which requires many rollouts for a single network update, compounds this cost. We explore flexible-length block updates that shorten the update horizon and extend rollouts on demand. We roll out diverse trajectories of shorter block sequences and compute the reward over the generated groups. We then randomly select a subset of trajectories to extend further and recompute the reward (sketched below). In this way, we densify reward back-propagation along the sequence length.

</aside>
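
Below is a minimal sketch of the flexible-length rollout procedure described in the abstract. The interfaces (`policy.rollout_blocks`, `policy.extend_rollout`, `compute_reward`) and the group/extension sizes are illustrative assumptions, not the actual implementation.

```python
import random

def flexible_length_rollouts(policy, prompt, compute_reward,
                             group_size=8, short_len=4,
                             extend_len=4, extend_frac=0.5):
    """Roll out a group of short block trajectories, score them, then
    extend a random subset and re-score on the longer horizon.
    `policy.rollout_blocks`, `policy.extend_rollout`, and `compute_reward`
    are assumed interfaces."""
    # 1) Diverse short rollouts of `short_len` blocks each.
    group = [policy.rollout_blocks(prompt, num_blocks=short_len)
             for _ in range(group_size)]
    rewards = [compute_reward(traj) for traj in group]

    # 2) Randomly pick a subset and extend each by `extend_len` more blocks.
    subset = random.sample(range(group_size), k=int(group_size * extend_frac))
    extended_rewards = {}
    for i in subset:
        longer = policy.extend_rollout(group[i], num_blocks=extend_len)
        extended_rewards[i] = compute_reward(longer)

    # 3) Short- and long-horizon rewards together densify the reward signal
    #    used for the policy update (e.g. a DPO-style objective).
    return group, rewards, extended_rewards
```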

Autoregressive Video Diffusion RL

Autoregressive block diffusion for video generation nests diffusion within autoregression: rolling out a single video trajectory takes (number of autoregressive steps) × (number of diffusion time steps) model evaluations. This nested loop makes generating long video sequences extremely expensive, and RL algorithms that require many rollouts for a single network update multiply the cost further.
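
To make the cost concrete, here is a minimal sketch of the nested rollout loop; `model.denoise_step` and `sample_noise` are assumed interfaces, not real APIs.

```python
def rollout_video(model, context, num_blocks, num_steps, sample_noise):
    """Nested rollout: num_blocks * num_steps forward passes of `model`.
    `model.denoise_step` and `sample_noise` are assumed interfaces."""
    blocks = []
    for b in range(num_blocks):                  # outer autoregressive loop
        x = sample_noise()                       # each block starts from noise
        for t in reversed(range(num_steps)):     # inner diffusion loop
            x = model.denoise_step(x, t, context=context + blocks)
        blocks.append(x)                         # clean block conditions later ones
    return blocks
```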

Continuous Block Diffusion DPO Formulation

For direct preference optimization (DPO), or other variants of policy optimization, we maintain an importance sampling ratio $\frac{\pi_\theta(a)}{\pi_{\theta_{\mathrm{old}}}(a)}$.
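
For reference, the generic DPO objective over a preferred/rejected trajectory pair $(a^w, a^l)$, with frozen reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$, is the standard formulation below (not necessarily the exact variant used here):

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(a^w,\, a^l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(a^w)}{\pi_{\mathrm{ref}}(a^w)} - \beta \log \frac{\pi_\theta(a^l)}{\pi_{\mathrm{ref}}(a^l)}\right)\right]
$$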

The naive formulation of a policy trajectory for block diffusion directly nests the diffusion SDE trajectory inside the autoregression, which yields the AR denoising-diffusion factorization:
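
A sketch of this nested factorization, writing $x^b_t$ for block $b$ at diffusion step $t$, with $B$ blocks and $T$ denoising steps (the notation is illustrative, not taken verbatim from the paper):

$$
\pi_\theta(\tau) \;=\; \prod_{b=1}^{B} p\!\left(x^b_T\right) \prod_{t=T}^{1} p_\theta\!\left(x^b_{t-1} \,\middle|\, x^b_t,\; x^{<b}_0\right)
$$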

We treat each denoising time step as a conditional Gaussian distribution, and the resulting log-probability is dictated by the denoising scheduler:
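
Under this Gaussian assumption, each transition's log-probability has the usual closed form, with mean $\mu_\theta$ and variance $\sigma_t^2$ given by the denoising scheduler and $d$ the block dimensionality (a generic sketch):

$$
\log p_\theta\!\left(x^b_{t-1} \,\middle|\, x^b_t,\; x^{<b}_0\right) \;=\; -\,\frac{\left\| x^b_{t-1} - \mu_\theta\!\left(x^b_t,\, t,\, x^{<b}_0\right) \right\|^2}{2\sigma_t^2} \;-\; \frac{d}{2}\log\!\left(2\pi\sigma_t^2\right)
$$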

Now, a different formulation of $\log p$ is introduced through an ELBO bound based on the L2 flow-matching loss:
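
One common form of such a bound, assuming a linear interpolation path $x_t = (1-t)\,x_0 + t\,\epsilon$ with velocity target $\epsilon - x_0$ and a scheduler-dependent weighting $w(t)$ (these details are assumptions here), is:

$$
\log p_\theta(x_0) \;\geq\; -\,\mathbb{E}_{t \sim \mathcal{U}(0,1),\; \epsilon \sim \mathcal{N}(0, I)}\!\left[\, w(t)\, \big\| v_\theta(x_t,\, t) - (\epsilon - x_0) \big\|^2 \,\right] \;+\; C,
$$

where $C$ is a constant independent of $\theta$.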