PhysPO: Physics-Aware Local Preference Optimization for Physically Consistent Video Generation

Abstract

Despite rapid progress in video diffusion models (VDMs), ensuring semantic adherence and physical commonsense remains a fundamental challenge: even state-of-the-art systems frequently violate real-world physical laws in domains such as dynamics, thermodynamics, and optics. While Direct Preference Optimization (DPO) has become a popular post-training strategy for aligning generative models, existing video DPO methods for improving physical commonsense suffer from high computational costs, poorly matched preference pairs, and ambiguous global supervision that fails to localize physical violations. We propose PhysPO, a physics-aware local preference optimization framework for physically consistent video generation. We first introduce CounterPhyPipe, a physics-aware counterfactual data construction pipeline that forms video preference pairs with consistent global semantics and local physical violations, enabling meaningful comparisons for preference learning. We then leverage high-frequency decomposition to derive a physical mask, which localizes regions where physical state transitions may occur. Finally, we develop the PhysPO objective, a physics-aware DPO formulation that concentrates supervision on physically relevant regions while enforcing neutrality on background regions. This mechanism reduces gradient noise and mitigates shortcut optimization, encouraging the model to focus on genuine physical discrepancies rather than superficial cues. Extensive experiments demonstrate that PhysPO significantly improves physical commonsense without compromising semantic adherence.
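The abstract does not give the exact form of the physical mask or the local objective, so the following is only a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes a spatial FFT high-pass filter as the "high-frequency decomposition", a standard Diffusion-DPO-style margin restricted to the masked region, and an L1 penalty toward the reference model as the background "neutrality" term. All function names and the parameters `cutoff`, `quantile`, `beta`, and `lam` are hypothetical.

```python
import numpy as np

def high_frequency_mask(video, cutoff=0.25, quantile=0.8):
    """Hypothetical physical mask from a spatial high-pass filter.

    video: float array of shape (T, H, W), one grayscale frame per step.
    Returns a boolean array of the same shape marking the pixels with
    the largest high-frequency energy.
    """
    _, H, W = video.shape
    spec = np.fft.fft2(video, axes=(-2, -1))
    fy = np.fft.fftfreq(H)[:, None]   # vertical frequency, cycles/pixel
    fx = np.fft.fftfreq(W)[None, :]   # horizontal frequency, cycles/pixel
    radius = np.sqrt(fy ** 2 + fx ** 2)
    high = spec * (radius > cutoff)   # zero out low-frequency components
    energy = np.abs(np.fft.ifft2(high, axes=(-2, -1)))
    return energy > np.quantile(energy, quantile)

def masked_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, mask,
                    beta=1.0, lam=0.1):
    """Hypothetical local preference loss.

    err_*: per-pixel denoising errors, shape (T, H, W), for the preferred
    (w) and rejected (l) videos under the trained model and the frozen
    reference model (_ref). mask: boolean physical mask.
    """
    m = mask.astype(float)
    fg = m.sum() + 1e-8               # number of masked pixels
    bg = (1.0 - m).sum() + 1e-8       # number of background pixels
    dw = err_w - err_w_ref            # change on the preferred video
    dl = err_l - err_l_ref            # change on the rejected video
    # Preference margin averaged over the masked region only.
    margin = ((dw - dl) * m).sum() / fg
    # -log(sigmoid(-beta * margin)), computed stably.
    preference = np.logaddexp(0.0, beta * margin)
    # Assumed neutrality term: stay close to the reference in background.
    neutrality = ((np.abs(dw) + np.abs(dl)) * (1.0 - m)).sum() / bg
    return preference + lam * neutrality
```

Under these assumptions, the loss falls below log 2 only when the model lowers its error on the preferred video relative to the rejected one inside the mask, while any drift from the reference model in the background is penalized regardless of direction.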

PhysPO Overview

1. Comparison with Baselines

We compare against the baselines on three fundamental categories of everyday physics: Dynamics, Thermodynamics, and Optics.
PhysPO improves the physical commonsense (PC) of generated videos while maintaining semantic adherence (SA).

1.1 Dynamics

Wan2.1-T2V-1.3B vs. Wan2.1-T2V-1.3B+PhysPO

1.2 Thermodynamics

CogVideoX vs. CogVideoX+PhysPO

1.3 Optics

CogVideoX vs. CogVideoX+PhysPO

2. Comparison with SOTA Models

We compare with two categories of state-of-the-art models: commercial models (OpenAI Sora2) and physics-aware models (PhyT2V).

2.1 Comparison with Commercial Models: OpenAI Sora2 vs. PhysPO

2.2 Comparison with Physics-aware Models: PhyT2V vs. PhysPO

3. Comparison with other DPO Methods

Using Wan2.1-T2V-1.3B as the base model, we compare the results of post-training with VideoDPO, PhyGDPO, and PhysPO (ours).

4. Ablation Study

Qualitative results for the ablation study.