Varun Varma Thozhiyoor¹, Shivam Tripathi¹, Venkatesh Babu Radhakrishnan¹, Anand Bhattad²
¹Indian Institute of Science, Bangalore ²Johns Hopkins University
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects that fall with a lower effective acceleration than terrestrial gravity. However, such physical tests are often confounded by ambiguous metric scale. We first investigate whether the observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions), and find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio t₁² / t₂² = h₁ / h₂, a relationship independent of g, focal length, and scale. This relative test reveals violations of Galileo’s equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises g_eff from 1.81 m/s² to 6.43 m/s² (reaching 65% of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
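The two-object test follows from h = ½ g t² for an object released from rest: t² = 2h / g, so g cancels in the ratio t₁² / t₂² = h₁ / h₂. The Python sketch below illustrates this cancellation and is not the paper's evaluation code. The names `fall_time` and `timing_ratio_check` are hypothetical, and the fall times are generated from the ideal formula instead of being measured from generated video.

```python
import math


def fall_time(height, g=9.81):
    """Ideal time for an object released from rest to fall `height` under constant g."""
    return math.sqrt(2.0 * height / g)


def timing_ratio_check(t1, t2, h1, h2):
    """Compare the measured ratio t1^2 / t2^2 with the predicted ratio h1 / h2.

    Returns (measured, predicted, relative_deviation). Times may be in any
    common unit (seconds or frames) and heights in any common unit (metres
    or pixels), because only ratios enter the test.
    """
    measured = (t1 ** 2) / (t2 ** 2)
    predicted = h1 / h2
    deviation = abs(measured - predicted) / predicted
    return measured, predicted, deviation


if __name__ == "__main__":
    h1, h2 = 2.0, 1.0  # illustrative drop heights (assumed values)

    # The ratio holds for any constant g, including an under-accelerated one.
    for g in (9.81, 1.81):
        measured, predicted, dev = timing_ratio_check(
            fall_time(h1, g), fall_time(h2, g), h1, h2
        )
        print(f"g={g}: measured={measured:.6f} predicted={predicted:.6f} dev={dev:.2e}")

    # A wrong frame-rate assumption rescales both times by the same factor k,
    # and an unknown metric scale rescales both heights by the same factor s.
    k, s = 0.5, 37.0
    measured, predicted, dev = timing_ratio_check(
        k * fall_time(h1), k * fall_time(h2), s * h1, s * h2
    )
    print(f"rescaled: measured={measured:.6f} predicted={predicted:.6f} dev={dev:.2e}")
```

Because a constant g, the time scale, and the length scale all cancel, any systematic deviation of the measured ratio from h₁ / h₂ cannot be explained by a misjudged frame rate or scene scale.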
Figure 1: We plot measured timing ratios t₁² / t₂² against theoretical predictions h₁ / h₂ across multiple height ratios. The gray dashed line indicates perfect agreement. All models deviate systematically, confirming that under-acceleration is not an artifact of scale estimation but reflects a genuine failure to model the physics, including violations of Galileo's principle.
Figure 2: We report effective gravity values computed as g_eff = 2h/t² (m/s²). The ground truth is 9.81 m/s². All models under-accelerate, and Gravity Adapters consistently reduce this deficit. Reported mean values are averaged over four random seeds and all test examples. Median and range values are computed across all seeds and test samples.
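The effective-gravity statistic in Figure 2 can be sketched as follows, assuming each test sample yields a drop height h in metres and a fall time t in seconds. The helper names `effective_gravity` and `summarize`, the pooling of samples into one list, and the sample values are all hypothetical. This is a minimal illustration of g_eff = 2h/t², not the authors' measurement pipeline.

```python
import statistics


def effective_gravity(height_m, fall_time_s):
    """Effective gravity g_eff = 2h / t^2 in m/s^2 for a drop from rest."""
    return 2.0 * height_m / fall_time_s ** 2


def summarize(samples):
    """Summarize g_eff over (height_m, fall_time_s) pairs.

    `samples` is assumed to pool every seed and test example into one list.
    """
    values = [effective_gravity(h, t) for h, t in samples]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": (min(values), max(values)),
    }


if __name__ == "__main__":
    # Made-up (height, fall time) pairs for illustration only.
    samples = [(1.0, 1.05), (1.0, 0.98), (1.5, 1.30), (2.0, 1.45)]
    stats = summarize(samples)
    print(f"mean g_eff   = {stats['mean']:.2f} m/s^2")
    print(f"median g_eff = {stats['median']:.2f} m/s^2")
    print(f"range        = {stats['range'][0]:.2f} to {stats['range'][1]:.2f} m/s^2")
```

Unlike the timing-ratio test, this statistic depends on the assumed metric scale and frame rate, since h and t enter in absolute units.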