TransCoPhy: Modernizing Causal Physical Reasoning with Transformers
Code is available at this url.
Introduction: Predicting the Unseen
If you see a tower of blocks, can you predict if it will fall? Usually, yes. But what if you can’t see how heavy the blocks are, or how slippery the table is?
Standard video prediction models try to predict the future based solely on visual history. However, in the real world, visual history isn’t enough. Hidden factors, or confounders, like mass and friction dictate the future just as much as the visible position of objects.
For my course project in Causal Machine Learning, I explored CoPhy, a benchmark that tests a model's ability to reason about these hidden physical properties using counterfactuals. The goal was to take the existing baseline architecture (CoPhyNet) and modernize it by replacing its recurrent core with a Transformer. I call the result TransCoPhy.
The Challenge: The Ambiguous Tower
The core problem with standard “feedforward” prediction is that it ignores the hidden physics. Consider the “Ambiguous Tower” scenario:
- Scenario A (High Friction): A block tower stands perfectly still.
- Scenario B (Low Friction): The exact same tower collapses immediately.
To a model looking only at the initial frame \(X_0\), these two scenarios look identical, so it has no basis for choosing between them and can only guess. To predict correctly, the model must first observe a video of the same objects moving to infer their hidden properties (Abduction), and then apply those properties to a new scenario (Prediction).
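To make the ambiguity concrete, here is a minimal toy sketch. It is not the CoPhy simulator: it assumes a single block resting on a surface tilted by 10 degrees, uses one friction coefficient for both static and kinetic friction, and the function name `simulate_block` and all parameter values are hypothetical. The 30 steps of 0.2 s mirror a 6-second clip at 5 fps.

```python
import math

def simulate_block(mu, tilt_deg=10.0, steps=30, dt=0.2, g=9.81):
    """Positions of a block that starts at rest on a tilted surface.

    Toy assumption: one coefficient `mu` covers static and kinetic friction.
    """
    theta = math.radians(tilt_deg)
    # Net downhill acceleration; friction cancels gravity when mu >= tan(theta).
    accel = max(g * (math.sin(theta) - mu * math.cos(theta)), 0.0)
    x, v = 0.0, 0.0
    trajectory = [x]
    for _ in range(steps):
        v += accel * dt
        x += v * dt
        trajectory.append(x)
    return trajectory

# Same visible initial state, different hidden friction.
high_friction = simulate_block(mu=0.5)   # block never moves
low_friction = simulate_block(mu=0.05)   # block slides away

print("identical first frame:", high_friction[0] == low_friction[0])
print("final position (high friction):", high_friction[-1])
print("final position (low friction):", round(low_friction[-1], 2))
```

Both runs start from the same position and velocity, so nothing in the first frame distinguishes them. Only the hidden coefficient decides the outcome.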
The Causal Approach
The project builds on the CoPhyNet architecture (Baradel et al., ICLR 2020), which utilizes Pearl’s 3-step causal framework:
- Abduction (Infer U): The model watches an observed sequence (Past + Outcome) to estimate a latent representation of the confounders, U (e.g., determining “these blocks are heavy”).
- Action (Intervene): We intervene on the initial state, setting up a new configuration C while keeping the physics U we just learned.
- Prediction (Simulate): The model predicts the counterfactual outcome using the new starting positions and the inferred physics.
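The three steps above can be sketched with a closed-form toy problem. This is only an illustration under strong assumptions: a puck sliding on a flat surface with constant kinetic friction, where the confounder U is a single explicit coefficient. In CoPhyNet, U is a learned latent vector rather than a named physical quantity, and the function names here are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2

def stopping_distance(v0, mu):
    """Distance a puck slides before friction stops it: v0^2 / (2 * mu * g)."""
    return v0 ** 2 / (2 * mu * G)

def abduct_friction(v0, observed_distance):
    """Abduction: invert the physics to recover the hidden friction coefficient."""
    return v0 ** 2 / (2 * G * observed_distance)

# Observed sequence: the model sees the start (v0) and the outcome (distance),
# but never the friction coefficient itself.
true_mu = 0.3                      # hidden from the "model"
v0_observed = 2.0
d_observed = stopping_distance(v0_observed, true_mu)

# 1. Abduction: infer U from the observed past and outcome.
mu_hat = abduct_friction(v0_observed, d_observed)

# 2. Action: intervene on the initial state, keeping the inferred U.
v0_counterfactual = 3.0

# 3. Prediction: simulate the counterfactual outcome.
d_counterfactual = stopping_distance(v0_counterfactual, mu_hat)

print("inferred friction:", round(mu_hat, 3))
print("counterfactual stopping distance:", round(d_counterfactual, 3))
```

The key design point is that step 3 reuses `mu_hat` from step 1. The intervention changes only the initial condition, never the physics.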
The Transformation: RNN vs. Transformer
The original CoPhyNet relies on a Recurrent Neural Network (GRU) to encode the time-series data during the Abduction phase. While functional, an RNN processes frames one at a time and compresses the whole history into a single hidden state. This creates an information bottleneck and, even with gating, can lead to vanishing gradients over long sequences.
My proposal: TransCoPhy.
I replaced the sequential GRU encoder with a Transformer Encoder.
- Time View: A GRU consumes frames one at a time, whereas a Transformer encoder attends over the entire sequence at once.
- Event Detection: Self-attention lets the model attend directly to critical physical events (such as a collision) instead of waiting for that information to propagate through hidden states.
- Efficiency: A Transformer processes all time steps in parallel during training, rather than one after another.
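The "whole sequence at once" idea can be illustrated with a bare-bones scaled dot-product self-attention in pure Python. This is a sketch, not the TransCoPhy encoder: it assumes identity query/key/value projections, a single head, no positional encoding, and made-up per-frame features in which the third frame stands in for a collision.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Single-head self-attention with identity Q/K/V projections (toy assumption).

    sequence: list of T feature vectors, each of length d.
    Returns (outputs, weights), where weights[i][t] is how much step i attends to step t.
    """
    d = len(sequence[0])
    outputs, weights = [], []
    for query in sequence:
        # Every query is compared against every time step directly.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in sequence]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[t] * sequence[t][j] for t in range(len(sequence)))
                        for j in range(d)])
    return outputs, weights

# Hypothetical per-frame features (e.g. velocity change); frame 2 is the "collision".
frames = [[0.1, 0.0], [0.1, 0.0], [3.0, 2.0], [0.1, 0.0]]
outputs, weights = self_attention(frames)

last_step = weights[-1]
print("attention from last frame:", [round(w, 3) for w in last_step])
print("most attended frame:", last_step.index(max(last_step)))
```

The last frame reaches the collision frame in a single step, with no intermediate hidden states in between. A trained model would learn projections that decide which events matter, and this toy simply lets feature magnitude drive the scores.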
Results
I tested both models on the BallsCF dataset (3 objects), predicting 3D trajectories and stability. Switching to a Transformer architecture improved both computational efficiency and predictive performance:
- Faster Training: TransCoPhy reduced the training time per epoch from 165s to 135s, an 18% reduction (roughly a 1.22x speedup), which I attribute to parallel processing of the sequence.
- Improved Accuracy: TransCoPhy reached a lower Mean Squared Error (MSE) on the counterfactual trajectories than the GRU baseline.
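For reference, the two quantities above can be computed as follows. The epoch times are the ones reported in this post. The helper names are hypothetical, and `trajectory_mse` assumes one common convention (averaging squared error over every coordinate of every time step), which may differ from the exact reduction used in the CoPhy code.

```python
def relative_reduction(before, after):
    """Fraction of time saved, e.g. 0.18 means 18% less time per epoch."""
    return (before - after) / before

def speedup_factor(before, after):
    """How many times faster the new run is."""
    return before / after

def trajectory_mse(predicted, target):
    """Mean squared error over all coordinates of all time steps.

    predicted, target: lists of (x, y, z) tuples, one per time step.
    """
    total, count = 0.0, 0
    for p, t in zip(predicted, target):
        for p_i, t_i in zip(p, t):
            total += (p_i - t_i) ** 2
            count += 1
    return total / count

reduction = relative_reduction(165.0, 135.0)
factor = speedup_factor(165.0, 135.0)
print(f"time reduction: {reduction:.1%}, speedup: {factor:.2f}x")

# Made-up one-step trajectory, only to show the call signature.
example_mse = trajectory_mse([(0.0, 0.0, 0.0)], [(1.0, 2.0, 2.0)])
print("example MSE:", example_mse)
```

Note the difference between the two timing figures: an 18% reduction in time corresponds to a 1.22x speedup, not 1.18x.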
Conclusion and Future Work
This project suggests that causal physical reasoning benefits from modern sequence modeling architectures. The current benchmark is restricted to short clips (6 seconds @ 5fps, i.e. 30 frames), where a GRU is least likely to struggle. Because a Transformer connects any two time steps directly instead of through a long recurrent chain, I expect TransCoPhy's advantage over RNN baselines to grow on newer, higher-frequency physics datasets (25fps+) with longer time horizons, though this remains to be tested.
By moving from sequential processing to attention-based mechanisms, we bring AI one step closer to intuitively understanding the physics of the world, rather than just memorizing pixel movements.
