Figure 1. Illustration of SAGE-GRPO. (Left) In high-noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. Our precise Stochastic Differential Equation (SDE) removes this unnecessary noise energy, enabling more precise exploration and better adherence to the learned data manifold. (Right) With improved exploration, our method yields more stable and better-aligned generations than DanceGRPO, FlowGRPO, and CPS.
Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts for language models and image generation. This gap arises because video generation has a complex solution space, and the Ordinary Differential Equation (ODE)-to-SDE conversion used for exploration can inject excess noise. This excess noise lowers rollout quality and makes reward estimates less reliable, which destabilizes post-training alignment.
To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold. This constraint ensures that rollout quality is preserved and reward estimates remain reliable.
We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.
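For readers unfamiliar with the underlying objective, the sketch below shows the standard GRPO ingredients that the macro-level constraints act on: group-relative advantage normalization and a clipped policy-ratio surrogate, which serves as a stepwise trust region. This is a minimal generic sketch; the dual trust region with a periodic moving anchor described above is specific to SAGE-GRPO and is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # Standard GRPO advantage: normalize each rollout's reward by the
    # mean and std of its rollout group, so updates depend on relative
    # quality within the group rather than absolute reward scale.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(ratio, adv, clip=0.2):
    # PPO-style clipped objective: clipping the new/old policy ratio
    # acts as a stepwise trust region that limits per-update drift.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - clip, 1 + clip) * adv)

rewards = [0.2, 0.8, 0.5, 0.9]           # rewards for one rollout group
adv = group_relative_advantages(rewards)
ratios = np.array([1.5, 0.7, 1.0, 1.3])  # new/old policy likelihood ratios
obj = clipped_surrogate(ratios, adv)
print(adv.round(3), obj.round(3))
```

The normalized advantages always have zero mean and unit standard deviation within a group, so a group where every rollout collapses to the same low quality contributes no gradient signal, which is why preserving rollout quality matters for reliable reward estimates.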
We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods in Visual Quality (VQ), Motion Quality (MQ), Text Alignment (TA), and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.
We formulate GRPO for video generation as a manifold-constrained exploration problem and show that the ODE-to-SDE conversions used in existing methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.
@article{zheng2026sagegrpo,
  title={Manifold-Aware Exploration for Reinforcement Learning in Video Generation},
  author={Zheng, Mingzhe and Kong, Weijie and Wu, Yue and Jiang, Dengyang and Ma, Yue and He, Xuanhua and Lin, Bin and Gong, Kaixiong and Zhong, Zhao and Bo, Liefeng and Chen, Qifeng and Yang, Harry},
  journal={arXiv preprint arXiv:2603.21872},
  year={2026}
}