Figure 1. Illustration of SAGE-GRPO. (Left) In high-noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. Our precise Stochastic Differential Equation (SDE) removes this unnecessary noise energy, enabling more precise exploration and better adherence to the learned data manifold. (Right) With improved exploration, our method yields more stable and better-aligned generations than DanceGRPO, FlowGRPO, and CPS.
Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts for language models and image generation. This gap arises because video generation has a complex solution space, and the Ordinary Differential Equation (ODE)-to-SDE conversion used for exploration can inject excess noise. This excess noise lowers rollout quality and makes reward estimates less reliable, which destabilizes post-training alignment.
To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold. This constraint ensures that rollout quality is preserved and reward estimates remain reliable.
We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.
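For readers unfamiliar with the underlying objective, the sketch below shows the standard GRPO ingredients that the macro-level constraints act on: group-relative advantage normalization and a clipped policy-ratio surrogate, which serves as a stepwise trust region. This is a minimal generic sketch; the dual trust region with a periodic moving anchor described above is specific to SAGE-GRPO and is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # Standard GRPO advantage: normalize each rollout's reward by the
    # mean and std of its rollout group, so updates depend on relative
    # quality within the group rather than absolute reward scale.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(ratio, adv, clip=0.2):
    # PPO-style clipped objective: clipping the new/old policy ratio
    # acts as a stepwise trust region that limits per-update drift.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - clip, 1 + clip) * adv)

rewards = [0.2, 0.8, 0.5, 0.9]           # rewards for one rollout group
adv = group_relative_advantages(rewards)
ratios = np.array([1.5, 0.7, 1.0, 1.3])  # new/old policy likelihood ratios
obj = clipped_surrogate(ratios, adv)
print(adv.round(3), obj.round(3))
```

The normalized advantages always have zero mean and unit standard deviation within a group, so a group where every rollout collapses to the same low quality contributes no gradient signal, which is why preserving rollout quality matters for reliable reward estimates.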
We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods in Visual Quality (VQ), Motion Quality (MQ), Text Alignment (TA), and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.
We formulate GRPO for video generation as a manifold-constrained exploration problem and show that the ODE-to-SDE conversions used in existing methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.
@article{zheng2026sagegrpo,
  title={Manifold-Aware Exploration for Reinforcement Learning in Video Generation},
  author={Zheng, Mingzhe and Kong, Weijie and Wu, Yue and Jiang, Dengyang and Ma, Yue and He, Xuanhua and Lin, Bin and Gong, Kaixiong and Zhong, Zhao and Bo, Liefeng and Chen, Qifeng and Yang, Harry},
  journal={arXiv preprint arXiv:2603.21872},
  year={2026}
}