PhysPO: Physics-Aware Local Preference Optimization for Physically Consistent Video Generation

Abstract

Despite rapid progress in video diffusion models (VDMs), ensuring semantic adherence and physical commonsense remains a fundamental challenge, as even state-of-the-art systems frequently violate real-world physical laws, such as those of dynamics, thermodynamics, and optics. While Direct Preference Optimization (DPO) has become a popular post-training strategy for aligning generative models, existing video DPO methods for improving physical commonsense suffer from high computational costs, poorly matched preference pairs, and ambiguous global supervision that fails to localize physical violations. We propose PhysPO, a physics-aware local preference optimization framework for physically consistent video generation. We first introduce CounterPhyPipe, a physics-aware counterfactual data construction pipeline that forms video preference pairs with consistent global semantics and local physical violations, enabling meaningful comparisons for preference learning. Then, we leverage high-frequency decomposition to introduce a physical mask, which localizes regions where physical state transitions may occur. Finally, we develop PhysPO, a physics-aware DPO framework that concentrates supervision on physically relevant regions while enforcing neutrality on background regions. This mechanism reduces gradient noise and mitigates shortcut optimization, encouraging the model to focus on genuine physical discrepancies rather than superficial cues. Extensive experiments demonstrate that PhysPO significantly improves physical commonsense without compromising semantic adherence.
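
To make the idea of a physical mask and region-restricted preference supervision concrete, here is a minimal NumPy sketch. It is not the authors' implementation; everything in it is an assumption for illustration: the mask is obtained from a temporal high-pass filter (the abstract does not say whether the decomposition is spatial or temporal), the preference term follows the usual Diffusion-DPO form on per-pixel denoising errors, and background neutrality is modeled as a symmetric "tie" cross-entropy. The names `high_frequency_mask`, `local_dpo_loss`, `rel_threshold`, and `tie_weight` are hypothetical.

```python
import numpy as np

def high_frequency_mask(video, cutoff=1, rel_threshold=0.1):
    """Assumed mask: pixels with strong temporal high-frequency energy.

    video: float array of shape (T, H, W). Returns a boolean (H, W) mask.
    """
    spectrum = np.fft.rfft(video, axis=0)
    spectrum[:cutoff] = 0  # drop the lowest temporal frequencies (incl. DC)
    high = np.fft.irfft(spectrum, n=video.shape[0], axis=0)
    energy = np.mean(high ** 2, axis=0)
    return energy > rel_threshold * energy.max()

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def masked_mean(values, mask):
    weight = np.broadcast_to(mask, values.shape).astype(values.dtype)
    return (values * weight).sum() / max(weight.sum(), 1.0)

def local_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, mask,
                   beta=1.0, tie_weight=1.0):
    """Assumed local DPO objective on per-pixel squared denoising errors.

    err_w / err_l: errors of the trained model on the preferred / rejected video.
    err_*_ref: the same errors under the frozen reference model.
    """
    # Negative gain means the model improved more on the preferred video.
    gain = (err_w - err_w_ref) - (err_l - err_l_ref)
    margin_phys = -beta * masked_mean(gain, mask)    # physical regions
    margin_bg = -beta * masked_mean(gain, ~mask)     # background regions
    preference = -log_sigmoid(margin_phys)
    # Tie term: minimized when the background margin is exactly zero.
    tie = -0.5 * (log_sigmoid(margin_bg) + log_sigmoid(-margin_bg))
    return preference + tie_weight * tie
```

In this sketch, lowering the error on the preferred video inside the mask reduces the loss, while the same change in the background increases it, which mirrors the stated goal of keeping supervision away from superficial cues.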

PhysPO Overview

1. Comparison with Baselines

We compare against baselines across three fundamental categories of everyday physics: Dynamics, Thermodynamics, and Optics.
PhysPO improves the physical commonsense (PC) of generated videos while maintaining semantic adherence (SA).

1.1 Dynamics

Wan2.1-T2V-1.3B vs. Wan2.1-T2V-1.3B+PhysPO

A metal pendulum swinging left and right under gravity in a quiet room.

"Wan2.1-T2V-1.3B": irregular and non-smooth motion with unstable trajectories that violate basic dynamic constraints.

A metal pendulum swinging left and right under gravity in a quiet room.

"Wan2.1-T2V-1.3B+PhysPO": physically consistent pendulum motion, exhibiting smooth periodic oscillations, gradually decaying amplitude due to damping, and gravity-governed trajectories.
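
For reference, the behavior described above (smooth periodic oscillation with gradually decaying amplitude) matches the textbook small-angle, underdamped pendulum. The sketch below is only an illustrative physical reference, not part of PhysPO; the function name and default parameter values are hypothetical. It evaluates the solution of θ'' + 2γθ' + (g/L)θ = 0 for a pendulum released from rest at angle θ0.

```python
import math

def damped_pendulum_angle(t, theta0=0.3, length=1.0, damping=0.1, g=9.81):
    """Angle (rad) at time t (s) of a small-angle, underdamped pendulum.

    theta0: release angle (rad); damping: decay rate gamma (1/s).
    """
    omega0_sq = g / length
    if damping ** 2 >= omega0_sq:
        raise ValueError("this sketch covers only the underdamped case")
    omega_d = math.sqrt(omega0_sq - damping ** 2)  # damped angular frequency
    envelope = theta0 * math.exp(-damping * t)     # decaying amplitude
    return envelope * (math.cos(omega_d * t)
                       + (damping / omega_d) * math.sin(omega_d * t))
```

Sampling this function over a few seconds gives a trajectory whose peaks shrink by the factor exp(-damping * period) each cycle, which is the kind of decay a physically plausible pendulum video should show.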

1.2 Thermodynamics

CogVideoX vs. CogVideoX+PhysPO

Candle burning with smoke.

"CogVideoX": displays unrealistic flame behavior with abrupt upward movement and inconsistent smoke patterns that violate thermodynamic principles.

Candle burning with smoke.

"CogVideoX+PhysPO": follows thermodynamic principles, including gradual wax melting, continuous flame flickering, and upward smoke convection driven by heat.

1.3 Optics

CogVideoX vs. CogVideoX+PhysPO

Straw bending in water.

"CogVideoX": geometrically inconsistent distortions, such as misplaced bending points that contradict optical refraction laws.

Straw bending in water.

"CogVideoX+PhysPO": refraction-induced visual displacement, where the apparent bending occurs consistently at the air–water interface and varies smoothly with perspective.
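
The apparent bend at the air-water interface follows from Snell's law, n1 sin(θ1) = n2 sin(θ2). The short sketch below is an illustrative reference only, not part of PhysPO; the function name is hypothetical and the refractive indices (1.0 for air, about 1.333 for water) are standard approximate values.

```python
import math

def refraction_angle_deg(incidence_deg, n1=1.0, n2=1.333):
    """Refracted angle (degrees from the surface normal) via Snell's law."""
    sin_refracted = n1 * math.sin(math.radians(incidence_deg)) / n2
    if abs(sin_refracted) > 1.0:
        raise ValueError("total internal reflection: no refracted ray")
    return math.degrees(math.asin(sin_refracted))
```

Because the refracted angle changes continuously with the incidence angle, the apparent displacement of the straw should also vary smoothly as the viewpoint changes, and the kink should always sit at the water surface.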

2. Comparison with SOTA Models

We compare with two categories of state-of-the-art models: commercial models (OpenAI Sora2) and physics-aware models (PhyT2V).

2.1 Comparison with Commercial Models: OpenAI Sora2 vs. PhysPO

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"OpenAI Sora2": After being hit by the bat, the bottle deforms and moves as a single object instead of shattering.

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"PhysPO": The bat strikes the bottle, triggering visible crack propagation and realistic glass fragmentation. The shards scatter outward following momentum transfer and then fall under gravity, producing physically consistent motion.

2.2 Comparison with Physics-aware Models: PhyT2V vs. PhysPO

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"PhyT2V": The glass fragments appear abruptly without clear crack propagation after impact.

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"PhysPO": The bat strikes the bottle, triggering visible crack propagation and realistic glass fragmentation. The shards scatter outward following momentum transfer and then fall under gravity, producing physically consistent motion.

3. Comparison with Other DPO Methods

Based on Wan2.1-T2V-1.3B, we compare results after post-training using VideoDPO, PhyGDPO, and PhysPO (ours).

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"Wan2.1": The baseball bat and the glass bottle both undergo non-rigid deformation and develop flexible bending, which does not conform to physical laws.

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"Wan2.1+VideoDPO": Lacking semantic adherence, this video fails to accurately represent the actual shape of the glass bottle.

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"Wan2.1+PhyGDPO": The baseball bat undergoes non-rigid deformation and develops flexible bending, which does not conform to physical laws.

A baseball bat smashes a glass bottle, sending shards flying in all directions.

"Wan2.1+PhysPO": The bat strikes the bottle, triggering visible crack propagation and realistic glass fragmentation. The shards scatter outward following momentum transfer and then fall under gravity, producing physically consistent motion.

4. Ablation Study

Qualitative results for the ablation study.

Water poured into cup.

"PhysPO": water being poured into a cup, where the fluid motion follows physical laws.

Water poured into cup.

"PhysPO w/o CounterPhy": lacking physical commonsense, the cup tips over but the water does not spill.

Water poured into cup.

"PhysPO w/o PA-DPO": lacking physical commonsense, the water spurts abruptly at the surface.

Water poured into cup.

"PhysPO w/o Tie-DPO": lacking semantic adherence, the water is poured outside the cup.