## FILTERED-COPHY: UNSUPERVISED LEARNING OF COUNTERFACTUAL PHYSICS IN PIXEL SPACE


**Steeven Janny**
LIRIS, INSA Lyon, France

steeven.janny@insa-lyon.fr

**Madiha Nadri**
LAGEPP, Univ. Lyon 1, France

madiha.nadri-wolf@univ-lyon1.fr


**Fabien Baradel**
Naver Labs Europe, France

fabien.baradel@naverlabs.com

**Greg Mori**
Simon Fraser Univ., Canada

mori@cs.sfu.ca

**Natalia Neverova**
Meta AI

nneverova@fb.com

**Christian Wolf**
LIRIS, INSA Lyon, France

christian.wolf@insa-lyon.fr


ABSTRACT

Learning causal relationships in high-dimensional data (images, videos) is a hard
task, as they are often defined on low-dimensional manifolds and must be extracted from complex signals dominated by appearance, lighting, textures and also
spurious correlations in the data. We present a method for learning counterfactual
reasoning of physical processes in pixel space, which requires the prediction of
the impact of interventions on initial conditions. Going beyond the identification
of structural relationships, we deal with the challenging problem of forecasting
raw video over long horizons. Our method does not require the knowledge or supervision of any ground truth positions or other object or scene properties. Our
model learns and acts on a suitable hybrid latent representation based on a combination of dense features, sets of 2D keypoints and an additional latent vector per
keypoint. We show that this better captures the dynamics of physical processes
than purely dense or sparse representations. We introduce a new challenging and
carefully designed counterfactual benchmark for predictions in pixel space and
outperform strong baselines in physics-inspired ML and video prediction.

1 INTRODUCTION

Reasoning on complex, multi-modal and high-dimensional data is a natural ability of humans and
other intelligent agents (Martin-Ordas et al., 2008), and one of the most important and difficult challenges of AI. While machine learning is well suited for capturing regularities in high-dimensional
signals, in particular by using high-capacity deep networks, some applications also require an accurate modeling of causal relationships. This is particularly relevant in physics, where causation is
considered as a fundamental axiom. In the context of machine learning, correctly capturing or modeling causal relationships can also lead to more robust predictions, in particular better generalization
to out-of-distribution samples, indicating that a model has overcome the exploitation of biases and
shortcuts in the training data. In recent literature on physics-inspired machine learning, causality
has often been forced through the addition of prior knowledge about the physical laws that govern
the studied phenomena, e.g. (Yin et al., 2021). A similar idea lies behind structured causal models,
widely used in the causal inference community, where domain experts model these relationships
directly in a graphical notation. This particular line of work allows to perform predictions beyond
statistical forecasting, for instance by predicting unobserved counterfactuals, the impact of unobserved interventions (Balke & Pearl, 1994) — “What alternative outcome would have happened, if the observed event X had been replaced with an event Y (after an intervention)?”. Counterfactuals are
interesting, as causality intervenes through the effective modification of an outcome. As an example, taken from (Schölkopf et al., 2021), an agent can identify the direction of a causal relationship
between an umbrella and rain from the fact that removing an umbrella will not affect the weather.

We focus on counterfactual reasoning on high-dimensional signals, in particular videos of complex
physical processes. Learning such causal interactions from data is a challenging task, as spurious
correlations are naturally and easily picked up by trained models. Previous work in this direction



was restricted to discrete outcomes, as in CLEVRER (Yi et al., 2020), or to the prediction of 3D
trajectories, as in CoPhy (Baradel et al., 2020), which also requires supervision of object positions.
In this work, we address the hard problem of predicting the alternative (counterfactual) outcomes
of physical processes in pixel space, i.e. we forecast sequences of 2D projective views of the 3D
scene, requiring the prediction over long horizons (150 frames corresponding to ∼ 6 seconds). We
conjecture that causal relationships can be modeled on a low dimensional manifold of the data,
and propose a suitable latent representation for the causal model, in particular for the estimation of
the confounders and the dynamic model itself. Similar to V-CDN (Kulkarni et al., 2019; Li et al.,
2020), our latent representation is based on the unsupervised discovery of keypoints, complemented
by additional information in our case. Indeed, while keypoint-based representations can easily be
encoded from visual input, as stable mappings from images to points arise naturally, we claim that
they are not the most suitable representation for dynamic models. We identified and addressed
two principal problems: (i) the individual points of a given set are discriminated through their 2D
positions only, therefore shape, geometry and relationships between multiple moving objects need to
be encoded through the relative positions of points to each other, and (ii) the optimal representation
for a physical dynamic model is not necessarily a 2D keypoint space, where the underlying object
dynamics has also been subject to the imaging process (projective geometry).

We propose a new counterfactual model, which learns a sparse representation of visual input in the
form of 2D keypoints coupled with a (small) set of coefficients per point modeling complementary
shape and appearance information. Confounders (object masses and initial velocities) in the studied
problem are extracted from this representation, and a learned dynamic model forecasts the entire
trajectory of these keypoints from a single (counterfactual) observation. Building on recent work in
data-driven analysis of dynamic systems (Janny et al., 2021; Peralez & Nadri, 2021), the dynamic
model is presented in a higher-dimensional state space, where dynamics are less complex. We
show, that these design choices are key to the performance of our model, and that they significantly
improve the capability to perform long-term predictions. Our proposed model outperforms strong
baselines for physics-informed learning of video prediction.

We introduce a new challenging dataset for this problem, which builds on CoPhy, a recent counterfactual physics benchmark (Baradel et al., 2020). We go beyond the prediction of sequences of
3D positions and propose a counterfactual task for predictions in pixel space after interventions on
initial conditions (displacing, re-orienting or removing objects). In contrast to the literature, our
benchmark also better controls for the identifiability of causal relationships and counterfactual variables and provides more accurate physics simulation.

2 RELATED WORK

**Counterfactual (CF) reasoning — and learning of causal relationships in ML was made popular**
by works of J. Pearl, e.g. (Pearl, 2000), which motivate and introduce mathematical tools detailing
the principles of do-calculus, i.e. study of unobserved interventions on data. A more recent survey
links these concepts to the literature in ML (Schölkopf et al., 2021). Recent years have seen the
emergence of several benchmarks for CF reasoning in physics. CLEVRER (Yi et al., 2020) is a visual question answering dataset, where an agent is required to answer a CF question after observing
a video showing 3D objects moving and colliding. Li et al. (2020) introduce a CF benchmark with
two tasks: a scenario where balls interact with each other according to unknown interaction laws
(such as gravity or elasticity), and a scenario where clothes are folded by the wind. The agent needs
to identify CF variables and causal relationships between objects, and to predict future frames. CoPhy (Baradel et al., 2020) clearly dissociates the observed experiment from the CF one, and contains
three complex 3D scenarios involving rigid body dynamics. However, the proposed method relies
on the supervision of 3D object positions, while our work does not require any meta data.

**Physics-inspired ML — and learning visual dynamics has been dealt early on with recurrent mod-**
els (Srivastava et al., 2015; Finn et al., 2016; Lu et al., 2017), or GANs (Vondrick et al., 2016;
Mathieu et al., 2016). Kwon & Park (2019) adopt a Cycle-GAN with two discriminator heads, in
charge of identifying false images and false sequences in order to improve the temporal consistency
of the model in long term prediction. Nonetheless, the integration of causal reasoning and prior
knowledge in these models is not straightforward. Typical work in physics-informed models relies
on disentanglement between physics-informed features and residual features (Villegas et al., 2017a;



Denton & Birodkar, 2017) and may incorporate additional information based on the available priors
on the scene (Villegas et al., 2017b; Walker et al., 2017). PhyDNet (Le Guen & Thome, 2020) explicitly disentangles visual features from dynamical features, which are assumed to follow a PDE.
It achieves SOTA performance on Human3.6M (Ionescu et al., 2014) and Sea Surface Temperature
(de Bezenac et al., 2018), but we show that it fails on our challenging benchmark.

**Keypoint detection — is a well researched problem in vision with widely used handcrafted base-**
lines (Lowe, 1999). New unsupervised variants emerged recently and have been shown to provide a
suitable object-centric representation, close to attention models, which simplify the use of physical
and/or geometric priors (Locatello et al., 2020; Veerapaneni et al., 2020). They are of interest in
robotics and reinforcement learning, where a physical agent has to interact with objects (Kulkarni
et al., 2019; Manuelli et al., 2020; 2019). KeypointNet (Suwajanakorn et al., 2018) is a geometric reasoning framework which discovers meaningful 3D keypoints through spatial coherence between viewpoints. Close to our work, Minderer et al. (2019) propose to learn a keypoint-based
stochastic dynamic model. However, the model is not suited for CF reasoning in physics and may
suffer from inconsistency in the prediction of dynamics over long horizons.

3 THE FILTERED-COPHY BENCHMARK

We build on CoPhy (Baradel et al., 2020), retaining its strengths, but explicitly focusing on a counterfactual scenario in pixel space and eliminating the ill-posedness of tasks we identified in the
existing work. Each data sample is called an experiment, represented as a pair of trajectories: an
observed one with initial condition X_0 = A and outcome X_{t=1..T} = B (a sequence), and a counterfactual one X̄_0 = C and X̄_{t=1..T} = D (a sequence). Throughout this paper we will use the letters **A**, **B**, **C** and **D** to distinguish the different parts of each experiment. The initial conditions **A** and **C**
are linked through a do-operator do(X_0 = C), which modifies the initial condition (Pearl, 2018).
Experiments are parameterized by a set of intrinsic physical parameters z which are not observable
from a single initial image A. We refer to these as confounders. As in CoPhy, in our benchmark the
do-operator is observed during training, but confounders are not — they have been used to generate
the data, but are not used during training or testing. Following (Pearl, 2018), the counterfactual
task consists in inferring the counterfactual outcome D given the observed trajectory AB and the
counterfactual initial state C, following a three-step process:

1. Abduction: use the observed data AB to compute the counterfactual variables, i.e. the physical parameters, which are not affected by the do-operation.

2. Action: update the causal model; keep the same identified confounders and apply the do-operator, i.e. replace the initial state A by C.

3. Prediction: compute the counterfactual outcome D using the causal graph.

The benchmark contains three scenarios involving rigid body dynamics. BlocktowerCF studies
stable and unstable 3D cube towers, the confounders are masses. BallsCF focuses on 2D collisions
between moving spheres (confounders are masses and initial velocities). CollisionCF is about
collisions between a sphere and a cylinder (confounders are masses and initial velocities) (Fig. 1).

Unlike CoPhy, our benchmark involves predictions in RGB pixel space only. The do-operation
consists in visually observable interventions on A, such as moving or removing an object. The
confounders cannot be identified from the single-frame observation A, identification requires the
analysis of the entire AB trajectory.

**Identifiability of confounders — For an experiment (AB, CD, z) to be well-posed, the con-**
founders z must be retrievable from AB. For example, since the masses of a stable cube tower
cannot be identified generally in all situations, it can be impossible to predict the counterfactual
outcome of an unstable tower, as collisions are not resolvable without known masses. In contrast
to CoPhy, we ensure that each experiment ψ : (X_0, z) ↦ X_{t=1..T}, given initial condition X_0 and
confounders z, is well posed and satisfies the following constraints:

**Definition 1 (Identifiability, (Pearl, 2018)).** The experiment (AB, CD, z) is identifiable if, for any set of confounders z′:

ψ(A, z) = ψ(A, z′)  ⇒  ψ(C, z) = ψ(C, z′).   (1)


Figure 1: The Filtered-CoPhy benchmark suite contains three challenging scenarios involving 2D or 3D rigid body dynamics with complex interactions, including collision and resting contact. Initial conditions A are modified to C by an intervention. Initial motion is indicated through arrows. Panels: (a) BlocktowerCF (BT-CF), (b) CollisionCF (C-CF); each panel shows the observed pair (A, B) and the counterfactual pair (C, D).

In an identifiable experiment there is no pair (z, z′) that gives the same trajectory AB but different counterfactual outcomes CD. Details on implementation and impact are in appendix A.1.
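
To make Definition 1 concrete, the sketch below shows one way such a rejection filter could be implemented over a finite grid of candidate confounders; the simulator interface `psi(x0, z)`, the candidate set and the tolerance-based trajectory comparison are illustrative assumptions, not the benchmark's actual generation code.

```python
import numpy as np

def same_trajectory(t1, t2, tol=1e-3):
    # Trajectories are arrays of shape [T, ...]; compare them up to a small tolerance.
    return np.allclose(t1, t2, atol=tol)

def is_identifiable(psi, A, C, z, candidate_confounders):
    """Reject the experiment if some z' reproduces AB but changes CD (Definition 1)."""
    traj_AB = psi(A, z)
    traj_CD = psi(C, z)
    for z_prime in candidate_confounders:
        if same_trajectory(psi(A, z_prime), traj_AB) and not same_trajectory(psi(C, z_prime), traj_CD):
            return False  # a confounder indistinguishable on AB yields a different CD
    return True
```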

**Counterfactuality —** We enforce sufficient difficulty of the problem through the meaningfulness of the confounders. We remove initial situations where the choice of confounder values has no significant impact on the final outcome:

**Definition 2 (Counterfactuality).** Let z^k be the set of confounders z where the k-th value has been modified. The experiment (AB, CD, z) is counterfactual if and only if:

∃k : ψ(C, z^k) ≠ ψ(C, z).   (2)


In other words, we impose the existence of an object in the scene for which the (unobserved) physical properties have a determining effect on the trajectory. Details on how this constraint was enforced are given in appendix A.2.

**Temporal resolution —** the physical laws we target involve highly non-linear phenomena, in particular collisions and resting contacts. Collisions are difficult to learn because their actions are intense, brief and highly non-linear, depending on the geometry of the objects in 3D space. The temporal resolution of physical simulations is therefore of prime importance. A parallel can be made with the Nyquist-Shannon frequency: a trajectory sampled at too low a frequency cannot be reconstructed with precision. We simulate and record trajectories at 25 FPS, compared to the 5 FPS chosen in CoPhy, a choice justified by two experiments. Firstly, Fig. 2 shows the trajectories of the centers of mass of the cubes in BlocktowerCF; colored dots are shown at 25 FPS and black dots at 5 FPS. We can see that collisions with the ground fall below the sampling rate of 5 FPS, making it hard to infer physical laws from regularities in the data at this frequency. A second experiment involves learning a prediction model at different frequencies, confirming the choice of 25 FPS — details are given in appendix A.3.

Figure 2: Impact of the temporal frequency on dynamics; 3D trajectories of each cube are shown. Black dots are sampled at 5 FPS, colored dots at 25 FPS. Collisions between the red cube and the ground are not well described by the black dots, making it hard to infer physical laws from regularities in data.


4 UNSUPERVISED LEARNING OF COUNTERFACTUAL PHYSICS

We introduce a new model for counterfactual learning of physical processes capable of predicting
visual sequences D in the image space over long horizons. The method does not require any supervision other than videos of the observed and counterfactual experiments. The code is publicly available online at [https://filteredcophy.github.io](https://filteredcophy.github.io). The model consists of three parts, learning the
latent representation and its (counterfactual) dynamics:

-  The encoder (De-Rendering module) learns a hybrid representation of an image in the form of (i) a dense feature map, (ii) 2D keypoints, and (iii) a low-dimensional vector of coefficients per keypoint, see Fig. 3. Without any state supervision, we show that the model learns a representation which encodes positions in the keypoints, and appearance and orientation in the coefficients.

-  The Counterfactual Dynamic model (CoDy), based on recurrent graph networks, in the lines of (Baradel et al., 2020). It estimates a latent representation of the confounders z from the keypoint + coefficient trajectories of AB provided by the encoder, and then predicts D in this same space.

-  The decoder, which uses the predicted keypoints and coefficients to generate a pixel-space representation of D.

Figure 3: We de-render visual input into a latent space composed of a dense feature map F modeling static information, a set of keypoints k, and associated coefficients c. We here show the training configuration taking as input pairs (X_source, X_target) of images. Without any supervision, a tracking strategy emerges naturally through the unsupervised objective: we optimize the reconstruction of the target image given dense features from the source and keypoints + coefficients from the target.

4.1 DISENTANGLING VISUAL INFORMATION FROM DYNAMICS

The encoder takes an input image X and predicts a representation with three streams, sharing a
common conv. backbone, as shown in Fig. 3. We propose an unsupervised training objective, which
favors the emergence of a latent representation disentangling static and dynamic information.

1. A dense feature map F = F(X), which contains static information, such as the background.

2. A set of 2D keypoints k = K(X), which carry positional information from moving objects.

3. A set of corresponding coefficients c = C(X), one vector ck of size C per keypoint k, which
encodes orientation and appearance information.

The unsupervised objective is formulated on pairs of images (X_source, X_target) randomly sampled from the same sequence D (see appendix D.1 for details on sampling). Exploiting the assumption of the absence of camera motion¹, the goal is to favor the emergence of a representation disentangling static and
dynamic information. To this end, both images are encoded, and the reconstruction of the target
image is predicted with a decoder D fusing the source dense feature map and the target keypoints
and coefficients. This formulation requires the decoder to aggregate dense information from the
source and sparse values from the target, naturally leading to motion being predicted by the latter.

On the decoder D side, we add inductive bias, which favors the usage of the 2D keypoint information
in a spatial way. The 2D coordinates k_k of each keypoint k are encoded as Gaussian heatmaps G(k_k), i.e. 2D Gaussian functions centered on the keypoint position. The additional coefficients, carrying appearance information, are then used to deform the Gaussian mapping into an anisotropic shape using a fixed filter bank H, as follows:

D(F, k, c) = R(F, G_1^1, ..., G_1^C, G_2^1, ..., G_K^C),   with   G_k^i = c_k^{C+1} c_k^i (G(k_k) ∗ H^i),   (3)

where R(...) is a refinement network performing trained upsampling with transposed convolutions, whose inputs are stacked channel-wise. The G_k^i are Gaussian mappings produced from the keypoint positions k, deformed by the filters of bank H and weighted by the coefficients c_k^i. The filters H^i are defined as fixed horizontal, vertical and diagonal convolution kernels. This choice is discussed in section 5.
The joint encoding and decoding pipeline is illustrated in Fig. 3.
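
As an illustration of Eq. (3), here is a hedged sketch of how keypoints and coefficients could be turned into the decoder inputs G_k^i; the image resolution, the Gaussian width and the shape of the fixed filter bank (C oriented 3×3 kernels stacked as a [C, 1, 3, 3] tensor) are assumptions for the example, not the released implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_maps(keypoints, size=28, sigma=1.5):
    """keypoints: [B, K, 2] in [-1, 1] -> isotropic Gaussian heatmaps [B, K, H, W]."""
    ys = torch.linspace(-1, 1, size).view(1, 1, size, 1)
    xs = torch.linspace(-1, 1, size).view(1, 1, 1, size)
    kx = keypoints[..., 0].view(*keypoints.shape[:2], 1, 1)
    ky = keypoints[..., 1].view(*keypoints.shape[:2], 1, 1)
    return torch.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))

def decoder_inputs(keypoints, coeffs, filter_bank):
    """Sketch of G_k^i = c_k^{C+1} * c_k^i * (G(k_k) * H^i) from Eq. (3).

    coeffs: [B, K, C+1]; filter_bank: [C, 1, 3, 3] fixed oriented kernels (assumption).
    Returns a [B, K*C, H, W] tensor to be stacked with the source feature map.
    """
    B, K, Cp1 = coeffs.shape
    C = Cp1 - 1
    g = gaussian_maps(keypoints)                        # [B, K, H, W]
    g = g.view(B * K, 1, *g.shape[-2:])
    deformed = F.conv2d(g, filter_bank, padding=1)      # [B*K, C, H, W]
    deformed = deformed.view(B, K, C, *deformed.shape[-2:])
    weights = coeffs[..., -1:].unsqueeze(-1).unsqueeze(-1) * coeffs[..., :C].unsqueeze(-1).unsqueeze(-1)
    return (weights * deformed).flatten(1, 2)           # [B, K*C, H, W]
```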

1If this assumption is not satisfied, global camera motion could be compensated after estimation.



The model is trained to minimize a mean squared error (MSE) reconstruction loss, regularized with a loss on spatial gradients ∇X, weighted by hyper-parameters γ_1, γ_2 ∈ ℝ:

L_deren = γ_1 ‖X_target − X̂_target‖_2^2 + γ_2 ‖∇X_target − ∇X̂_target‖_2^2,   (4)

where X̂_target = D(F(X_source), K(X_target), C(X_target)) is the reconstructed image.
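
A minimal sketch of the reconstruction objective of Eq. (4), using finite differences for the spatial-gradient term; the gradient operator and the default weights are illustrative choices, not the exact ones used in the paper.

```python
import torch
import torch.nn.functional as F

def spatial_gradients(x):
    # Finite-difference image gradients along height and width for a [B, C, H, W] tensor.
    dy = x[..., 1:, :] - x[..., :-1, :]
    dx = x[..., :, 1:] - x[..., :, :-1]
    return dy, dx

def derendering_loss(x_target, x_hat, gamma1=1.0, gamma2=1.0):
    """MSE on pixels plus MSE on spatial gradients, in the spirit of Eq. (4)."""
    dy_t, dx_t = spatial_gradients(x_target)
    dy_h, dx_h = spatial_gradients(x_hat)
    return (gamma1 * F.mse_loss(x_hat, x_target)
            + gamma2 * (F.mse_loss(dy_h, dy_t) + F.mse_loss(dx_h, dx_t)))
```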

**Related work —** our unsupervised objective is somewhat related to Transporter (Kulkarni et al., 2019), which, like our model, computes visual features F_source and F_target as well as 2D keypoints K_source and K_target, modeled as 2D heatmaps via Gaussian mapping. It leverages a handcrafted transport equation: Ψ̂_target = F_source × (1 − K_source) × (1 − K_target) + F_target × K_target. As in our case, the target image is reconstructed through a refiner network, X̂_target = R(Ψ̂_target). The Transporter suffers from a major drawback when used for video prediction, as it requires parts of the target image to reconstruct the target image — the model was originally proposed in the context of RL and control, where reconstruction is not an objective. It also does not use shape coefficients, requiring shapes either to be encoded by several keypoints or, abusively, to be carried through the dense features F_target. This typically leads to complex dynamics that are not representative of the moving objects. We conduct an in-depth comparison between the Transporter and our representation in appendix C.2.

4.2 DYNAMIC MODEL AND CONFOUNDER ESTIMATION

Our counterfactual dynamic model (CoDy) leverages multiple graph network (GN) based modules
(Battaglia et al., 2016) that join forces to solve the counterfactual forecasting tasks of Filtered-CoPhy. Each one of these networks is a classical GN, abbreviated GN(x_k), which contextualizes input node embeddings x_k through incoming edge interactions e_ik, providing output node embeddings x̂_k (parameters are not shared over the instances):

GN(x_k) = x̂_k,   such that   x̂_k = g(x_k, Σ_i e_ik),   with e_ij = f(x_i, x_j),   (5)

where f is a message-passing function and g is an aggregation function.
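
A minimal sketch of one such graph-network block in the sense of Eq. (5), with small MLPs for the message function f and the aggregation g and a sum over incoming edges on a fully connected keypoint graph; the hidden sizes and the summation-based aggregation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GNBlock(nn.Module):
    """One graph-network pass over a fully connected keypoint graph (cf. Eq. 5)."""

    def __init__(self, node_dim, hidden_dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * node_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))       # message function f
        self.g = nn.Sequential(nn.Linear(node_dim + hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, node_dim))         # aggregation/update g

    def forward(self, x):
        # x: [B, K, node_dim]; build all pairwise messages e_ij = f(x_i, x_j).
        B, K, D = x.shape
        xi = x.unsqueeze(2).expand(B, K, K, D)    # sender node i along dim 1
        xj = x.unsqueeze(1).expand(B, K, K, D)    # receiver node j along dim 2
        e = self.f(torch.cat([xi, xj], dim=-1))   # [B, K, K, hidden]
        incoming = e.sum(dim=1)                   # sum over senders i for each receiver k
        return self.g(torch.cat([x, incoming], dim=-1))
```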

We define the state of frame X_t at time t as a stacked vector composed of the keypoints and coefficients computed by the encoder (the de-rendering module), i.e. s(t) = [s_1(t) ... s_K(t)] where s_k(t) = [k_k, c_k^1, ..., c_k^{C+1}](t). In the lines of (Baradel et al., 2020), given the original initial condition and outcome AB, CoDy estimates an unsupervised representation u_k of the latent confounder variables per keypoint k through the counterfactual estimator (CF estimator in Fig. 4). It first contextualizes the sequence s^AB(t) through a graph network: GN(s^AB(t)) = h^AB(t) = [h_1(t) ... h_K(t)]. We then model the temporal evolution of this representation with a gated recurrent unit (Cho et al., 2014) per keypoint, sharing parameters over keypoints, taking as input the sequence h_k. Its last hidden vector is taken as the confounder estimate u_k.
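
The confounder estimator described above can be sketched as a per-frame GN contextualization followed by a per-keypoint GRU whose last hidden state is kept as u_k; the sketch reuses the hypothetical `GNBlock` from the previous listing, and all dimensions are placeholders rather than the actual architecture.

```python
import torch
import torch.nn as nn

class CFEstimator(nn.Module):
    """Estimate one latent confounder vector u_k per keypoint from the AB sequence."""

    def __init__(self, state_dim, confounder_dim=32):
        super().__init__()
        self.gn = GNBlock(state_dim)                                     # contextualize keypoints per frame
        self.gru = nn.GRU(state_dim, confounder_dim, batch_first=True)   # shared over keypoints

    def forward(self, s_ab):
        # s_ab: [B, T, K, state_dim] keypoint+coefficient states of the observed sequence AB.
        B, T, K, D = s_ab.shape
        h = self.gn(s_ab.flatten(0, 1)).view(B, T, K, D)   # h^{AB}(t)
        h = h.permute(0, 2, 1, 3).reshape(B * K, T, D)     # one temporal sequence per keypoint
        _, last = self.gru(h)                              # last hidden state of the GRU
        return last.squeeze(0).view(B, K, -1)              # u_k: [B, K, confounder_dim]
```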

Recent works on the Koopman operator (Lusch et al., 2018) and the Kazantzis-Kravaris-Luenberger observer (Janny et al., 2021; Peralez & Nadri, 2021) have shown theoretically that, under mild assumptions, there exists a latent space of higher dimension in which a dynamical system given as a PDE can have simpler dynamics. Inspired by this idea, we use an encoder-decoder structure within CoDy, which projects our dynamical system into a higher-dimensional state space, performs forecasting of the dynamics in this latent space, and then projects predictions back to the original keypoint space. Note that this dynamics encoder/decoder is different from the encoder/decoder of the de-rendering/rendering modules discussed in section 4.1. The state encoder E(s(t)) = σ(t) is modeled as a graph network GN whose aggregation function projects into an output embedding space σ(t) of dimension 256. The decoder ∆(σ(t)) = s(t) temporally processes the individual contextualized states σ(t) with a GRU, followed by a new contextualization with a graph network GN. Details on the full architecture are provided in appendix D.2.

The dynamic model CoDy performs forecasting in the higher-dimensional space σ(t), computing a displacement vector δ(t + 1) such that σ(t + 1) = σ(t) + δ(t + 1). It takes the projected state embeddings σ_k^CD(t) per keypoint k, concatenated with the confounder representation u_k, and contextualizes them with a graph network, resulting in embeddings h_k^CD(t) = GN([σ_k^CD(t), u_k]), which are processed temporally by a GRU. We compute the displacement vector at time t + 1 as a linear transformation of the hidden state of the GRU.


Figure 4: During training, we disconnect the dynamic prediction module (CoDy) from the rendering module (decoder). At test time, we reconnect the two modules. CoDy forecasts the counterfactual outcome D from the sparse keypoint representation of AB and C. The confounders are discovered in an unsupervised manner and provided to the dynamical model.

Table 1: Comparison with state-of-the-art models in physics-inspired machine learning of video signals, reporting reconstruction error (PSNR and the introduced L-PSNR); N and 2N denote the number of keypoints.

| Scenario | Metric | Ours (N) | Ours (2N) | UV-CDN (N) | UV-CDN (2N) | PhyDNet | PredRNN |
|---|---|---|---|---|---|---|---|
| BT-CF | PSNR | 23.48 | 24.69 | 21.07 | 21.99 | 16.49 | 22.04 |
| BT-CF | L-PSNR | 25.51 | 26.79 | 22.36 | 23.64 | 23.03 | 24.97 |
| B-CF | PSNR | 21.19 | 21.33 | 19.51 | 19.54 | 18.56 | 22.31 |
| B-CF | L-PSNR | 23.88 | 24.12 | 22.35 | 22.38 | 22.55 | 22.63 |
| C-CF | PSNR | 24.09 | 24.09 | 23.73 | 23.83 | 19.69 | 24.70 |
| C-CF | L-PSNR | 26.07 | 26.55 | 26.08 | 26.34 | 24.61 | 26.39 |

Table 2: Comparison with copying baselines; we report MSE × 10⁻³ on the prediction of keypoints + coefficients. Copy B assumes the absence of an intervention (always outputs sequence B); Copy C assumes that the tower is stable (always outputs input C).

| Scenario | Copy B | Copy C | Ours |
|---|---|---|---|
| BT-CF | 43.2 | 92.3 | 9.58 |
| B-CF | 20.0 | 7.6 | 36.12 |
| C-CF | 44.3 | 40.3 | 5.14 |

We apply the dynamic model in an auto-regressive way to forecast long-term trajectories in the projected space σ_k, and apply the state decoder to obtain a prediction ŝ^CD(t).
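
Putting the pieces together, the forecasting step is a residual, autoregressive update in the projected space; in the sketch below, `state_encoder` (E), `state_decoder` (∆) and `cody_step` (GN + GRU + linear head producing δ and the updated recurrent state) are stand-in callables for illustration, not the released modules.

```python
import torch

def rollout(s0_cd, u, horizon, state_encoder, state_decoder, cody_step):
    """Forecast keypoint/coefficient states over `horizon` steps.

    s0_cd: [B, K, state_dim] encoded counterfactual initial condition C.
    u:     [B, K, conf_dim]  confounder estimates from the AB sequence.
    """
    sigma = state_encoder(s0_cd)                     # project into the higher-dimensional space
    hidden = None                                    # recurrent state of the GRU inside cody_step
    predictions = []
    for _ in range(horizon):
        delta, hidden = cody_step(sigma, u, hidden)  # displacement in the latent space
        sigma = sigma + delta                        # sigma(t+1) = sigma(t) + delta(t+1)
        predictions.append(state_decoder(sigma))     # back to keypoint + coefficient space
    return torch.stack(predictions, dim=1)           # [B, horizon, K, state_dim]
```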

The dynamic model is trained with a loss in keypoint space,

L_physics = Σ_t ( ‖s^CD(t) − ŝ^CD(t)‖_2^2 + γ_3 ‖s^CD(t) − ∆(E(s^CD(t)))‖_2^2 ).   (6)


The first term trains the model to predict the outcomes, and the second term favors correct reconstruction of the state in keypoint space. The two terms are balanced by a scalar weight γ_3.

4.3 TRAINING

End-to-end training of all three modules jointly is challenging, as the same pipeline controls both the
keypoint-based state representation and the dynamic module (CoDy), involving two adversarial objectives: optimizing reconstruction pushes the keypoint encoding to be as representative as possible,
but learning the dynamics favors a simple representation. Faced with these two contradictory tasks,
the model is numerically unstable and rapidly converges to regression to the mean. As described
above, we therefore train the encoder+decoder pair separately, without dynamic information, on reconstruction only, cf. Equation (4). Then we freeze the parameters of the keypoint detector and train CoDy to forecast the keypoints of D, minimizing the loss in Equation (6).
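
The two-stage schedule can be summarized as follows; the optimizer settings, epoch counts, data-loader contents and the `encoder.features` / `encoder.keypoints` interface are hypothetical placeholders, and `derendering_loss` / `physics_loss` stand in for Eqs. (4) and (6).

```python
import torch

def train_two_stages(encoder, decoder, cody, derender_loader, dynamics_loader,
                     derendering_loss, physics_loss, epochs=(100, 100), lr=1e-4):
    # Stage 1: train the de-rendering/rendering pair on reconstruction only (Eq. 4).
    opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs[0]):
        for x_source, x_target in derender_loader:
            # Hypothetical interface: features from the source, (keypoints, coeffs) from the target.
            x_hat = decoder(encoder.features(x_source), *encoder.keypoints(x_target))
            loss = derendering_loss(x_target, x_hat)
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the keypoint detector and train CoDy on keypoint forecasting (Eq. 6).
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(cody.parameters(), lr=lr)
    for _ in range(epochs[1]):
        for s_ab, s_cd in dynamics_loader:      # pre-encoded keypoint+coefficient states
            s_hat = cody(s_ab, s_cd[:, 0])      # forecast D from AB and the first frame of CD
            loss = physics_loss(s_cd, s_hat)
            opt2.zero_grad(); loss.backward(); opt2.step()
```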

5 EXPERIMENTS

We compare the proposed model to three strong baselines for physics-inspired video prediction.

-  PhyDNet (Le Guen & Thome, 2020) is a non-counterfactual video prediction model that forecasts
future frames using a decomposition between (a) a feature vector that temporally evolves via an
LSTM and (b) a dynamic state that follows a PDE learned through specifically designed cells.


Figure 5: Network R learns a distortion from multiple oriented ellipses to target shapes.

Table 3: Impact of having additional orientation/shape coefficients compared to the keypoint-only solution, for different numbers of keypoints: equal to the number of objects (= N), 2N and 4N.

| Scenario | Metric | With coeffs (N) | With coeffs (2N) | With coeffs (4N) | No coeffs (N) | No coeffs (2N) | No coeffs (4N) |
|---|---|---|---|---|---|---|---|
| BT-CF | PSNR | 23.48 | 24.69 | 23.54 | 22.71 | 23.28 | 23.17 |
| BT-CF | L-PSNR | 21.75 | 23.03 | 21.80 | 21.18 | 21.86 | 21.70 |
| B-CF | PSNR | 21.19 | 21.33 | 21.37 | 20.49 | 21.09 | 20.97 |
| B-CF | L-PSNR | 27.88 | 27.16 | 27.07 | 26.33 | 27.07 | 26.73 |
| C-CF | PSNR | 24.09 | 24.09 | 24.26 | 23.84 | 23.66 | 24.06 |
| C-CF | L-PSNR | 23.32 | 23.46 | 23.44 | 22.58 | 22.81 | 23.45 |

- V-CDN (Li et al., 2020) is a counterfactual model based on keypoints, close to our work. It
identifies confounders from the beginning of a sequence and learns a keypoint predictor through

auto-encoding using the Transporter equation (see discussion in Sect. 4.1). As it is, it cannot
be used for video prediction and is not directly comparable with our work; see details in appendix C. We
therefore replace the Transporter by our own de-rendering/rendering modules, from which we
remove the additional coefficients. We refer to this model as UV-CDN (for Unsupervised V-CDN).

-  PredRNN (Wang et al., 2017) is a ConvLSTM-based video prediction model that leverages spatial
and temporal memories through a spatiotemporal LSTM cell.

All models are implemented in PyTorch; architectures are described in appendix D. For the
baselines PhyDNet, UV-CDN and PredRNN, we used the official source code provided by the authors. We evaluate on each scenario of Filtered-CoPhy on the counterfactual video prediction task.
For the two counterfactual models (Ours and UV-CDN), we evaluate on the tasks as intended: we
provide the observed sequence AB and the CF initial condition C, and forecast the sequence D.
The non-CF baselines are required to predict the entire video from a single frame, in order to prevent
them from leveraging shortcuts in a part of the video and bypass the need for physical reasoning.

We measure performance with the time-averaged peak signal-to-noise ratio (PSNR), which directly measures reconstruction quality. However, this metric is dominated mainly by the error on the static background, which is not our main interest. We therefore also introduce a Localized PSNR (L-PSNR), which measures the error in the important regions near moving objects, computed on masked images. We compute the masks using classical background subtraction techniques.
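
For reference, here is a hedged sketch of such a localized PSNR, assuming a binary foreground mask obtained by thresholding the difference to a static background frame; the exact masking procedure used for the benchmark may differ.

```python
import numpy as np

def foreground_mask(frame, background, threshold=0.05):
    # Simple background subtraction: keep pixels that differ from the static background.
    return np.abs(frame - background).max(axis=-1) > threshold

def localized_psnr(pred, target, mask, max_val=1.0):
    """PSNR restricted to masked (moving-object) regions; images are [H, W, 3] in [0, max_val]."""
    mask = mask[..., None]  # broadcast the binary mask over color channels
    mse = ((pred - target) ** 2 * mask).sum() / (mask.sum() * pred.shape[-1] + 1e-8)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))
```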

**Comparison to the SOTA — We compare our model against UV-CDN, PhyDNet and PredRNN**
in Table 1, consistently and significantly outperforming the baselines. The gap with UV-CDN is
particularly interesting, as it confirms the choice of additional coefficients to model the dynamics of
moving objects. PredRNN shows competitive performance, especially on CollisionCF. However, our localized PSNR tends to indicate that this baseline does not accurately reconstruct the foreground, favoring the reconstruction of the background to the detriment of the dynamics of the scene. Fig. 6 visualizes the prediction on a single example; more can be found in appendix G. We also compare to trivial copying baselines in Table 2, namely Copy B, which assumes no intervention and outputs the B sequence, and Copy C, which assumes a stable tower. We evaluate these models in keypoint space, measuring the MSE on keypoints and coefficients averaged over time, as copying baselines are unbeatable in the regions of static background, making the PSNR metrics unusable.

We provide additional empirical results, comparing the models using Multi-Object Tracking metrics and studying the impact of the do-operations on PSNR, in appendix E. We also compute an upper bound for our model using the CoPhyNet baseline described in Baradel et al. (2020).

**Performance on real-world data — is reported in appendix F, showing experiments on 516 videos**
of real wooden blocks introduced in (Lerer et al., 2016).

**Impact of appearance coefficients —** is reported in Table 3, comparing against a baseline using a keypoint-only representation. The coefficients have a significant impact: even increasing the number of keypoints to compensate for the loss of information cannot overcome the advantage of disentangling positions and shapes, as done in our model. We provide a deeper analysis of the de-rendering/rendering modules in appendix B, which includes visualizations of the navigation of the latent shape space in B.2.


(Figure 6 panels: timestamps t = 0, 21, 42, 63, 85, 106, 127, 149; rows: Ground Truth, Filtered CoPhy (8 kpts), PhyDNet, UV-CDN (8 kpts), PredRNN.)


Figure 6: Visualization of the counterfactual video prediction quality, comparing our proposed
model (Filtered CoPhy) with the two baselines, PhyDNet and UV-CDN, over different timestamps.



Table 4: Impact of the dynamic state auto-encoder in CoDy against the baseline operating directly in the keypoint + coefficient space. We report MSE × 10⁻³ on the prediction of keypoints + coefficients (4 keypoints).

| Scenario | With state auto-encoder | Without state auto-encoder |
|---|---|---|
| BT-CF | 9.58 | 11.10 |
| B-CF | 36.12 | 36.88 |
| C-CF | 5.14 | 16.16 |

Table 5: Learning the filter bank H from scratch has a mild negative effect on the reconstruction task. We report the PSNR on static reconstruction performance, without the dynamic model.

| Scenario | Fixed filter bank | Learned filter bank |
|---|---|---|
| BT-CF | 34.40 | 32.04 |
| B-CF | 37.76 | 31.25 |
| C-CF | 34.09 | 33.88 |



**Learning filters —** does not have a positive impact on reconstruction performance compared to the choice of the handcrafted bank, as can be seen in Table 5. We conjecture that the additional degrees of freedom are redundant with parts of the filter kernels in the refinement module R: this corresponds to jointly learning a multi-channel representation {G_k^i} (i = 1..C, k = 1..K) for shapes as well as the mapping which geometrically distorts them into the target object shapes. Fixing the latent representation does not constrain the system, as the mapping R can adjust to it — see Fig. 5.

**Impact of the high-dimensional dynamic space — We evaluate the impact of modeling object**
dynamics in high-dimensional space through the CoDy encoder in Table 4, comparing projection to
256 dimensions to the baseline reasoning directly in keypoint + coefficient space. The experiment
confirms this choice of KKL-like encoder (Janny et al., 2021).

6 CONCLUSION


We introduced a new benchmark for counterfactual reasoning in physical processes that requires video prediction, i.e. predicting raw pixel observations over a long horizon. The benchmark has been carefully designed and generated, imposing constraints on identifiability and counterfactuality. We also propose a new method for counterfactual reasoning, which is based on a hybrid
latent representation combining 2D keypoints and additional latent vectors encoding appearance
and shape. We introduce an unsupervised learning algorithm for this representation, which does not
require any supervision on confounders or other object properties and processes raw video. Counterfactual prediction of video frames remains a challenging task, and Filtered CoPhy still exhibits
failures in maintaining rigid structures of objects over long prediction time-scales. We hope that our
benchmark will inspire further breakthroughs in this domain.



REFERENCES

Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds
and applications. In UAI, 1994.

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. Cophy: Counterfactual learning of physical dynamics. In International Conference on Learning Representations,
2020.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray kavukcuoglu.
Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th
_International Conference on Neural Information Processing Systems, 2016._

Keni Bernardin and Rainer Stiefelhagen. Evaluating Multiple Object Tracking Performance: The
CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder
for statistical machine translation. In Conference on Empirical Methods in Natural Language
_Processing, 2014._

Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes:
Incorporating prior scientific knowledge. In International Conference on Learning Representa_tions, 2018._

Emily L. Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations
from video. ArXiv, abs/1705.10915, 2017.

Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction
through video prediction. In Proceedings of the 30th International Conference on Neural Infor_mation Processing Systems, 2016._

Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale
datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions
_on Pattern Analysis and Machine Intelligence, 2014._

Steeven Janny, Madiha Nadri Vincent Andrieu, and Christian Wolf. Deep kkl: Data-driven output
prediction for non-linear systems. In Conference on Decision and Control (CDC21), 2021.

Tejas D. Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, and Volodymyr Mnih. Unsupervised learning of object keypoints for perception
and control. In Advances in Neural Information Processing Systems 32: Annual Conference on
_Neural Information Processing Systems 2019, 2019._

Y. Kwon and M. Park. Predicting future frames using retrospective cycle gan. 2019 IEEE/CVF
_Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1811–1820, 2019._

Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for
unsupervised video prediction. In Computer Vision and Pattern Recognition (CVPR), 2020.

Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example.
In Proceedings of the 33rd International Conference on International Conference on Machine
_Learning - Volume 48. JMLR.org, 2016._

Yunzhu Li, Antonio Torralba, Anima Anandkumar, Dieter Fox, and Animesh Garg. Causal discovery in physical systems from videos. Advances in Neural Information Processing Systems, 33,
2020.

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, G. Heigold,
Jakob Uszkoreit, A. Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention.
_ArXiv, abs/2006.15055, 2020._

David G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.


-----

Chaochao Lu, Michael Hirsch, and Bernhard Schölkopf. Flexible spatio-temporal networks for video prediction. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Bethany Lusch, Nathan Kutz, and Steven Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 2018.

Lucas Manuelli, Wei Gao, Peter R. Florence, and Russ Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. ArXiv, abs/1903.06684, 2019.

Lucas Manuelli, Yunzhu Li, Peter R. Florence, and Russ Tedrake. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. CoRR, 2020.

G. Martin-Ordas, J. Call, and F. Colmenares. Tubes, tables and traps: great apes solve two functionally equivalent trap tasks but show no evidence of transfer across tasks. Animal Cognition, 11(3):432–430, 2008.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, K. Murphy, and Honglak Lee. Unsupervised learning of object structure and dynamics from videos. ArXiv, abs/1906.07889, 2019.

Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Judea Pearl. Causal and counterfactual inference. The Handbook of Rationality, pp. 1–41, 2018.

Johan Peralez and Madiha Nadri. Deep learning-based Luenberger observer design for discrete-time non-linear systems. In Conference on Decision and Control (CDC21), 2021.

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 2021.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015.

Supasorn Suwajanakorn, Noah Snavely, Jonathan Tompson, and Mohammad Norouzi. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In NeurIPS 2018, 2018.

Rishi Veerapaneni, John D. Co-Reyes, Michael Chang, Michael Janner, Chelsea Finn, Jiajun Wu, Joshua Tenenbaum, and Sergey Levine. Entity abstraction in visual model-based reinforcement learning. In Proceedings of the Conference on Robot Learning, 2020.

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ArXiv, volume abs/1706.08033, 2017a.

Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017b.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.

Jacob Walker, Kenneth Marino, Abhinav Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3352–3361, 2017.

Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, 2017.


-----

Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, 2017.

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. In ICLR, 2020.

Yuan Yin, Vincent Le Guen, Jérémie Dona, Emmanuel de Bezenac, Ibrahim Ayed, Nicolas Thome, and Patrick Gallinari. Augmenting physical models with deep networks for complex dynamics forecasting. In International Conference on Learning Representations (ICLR), 2021.


-----

# Appendix

A FURTHER DETAILS ON DATASET GENERATION

**Confounders in our setup are masses, which we discretize in {1, 10}.** For BallsCF and CollisionCF, we can also consider the continuous initial velocities of each object as confounder variables, since they have to be identified in AB to forecast CD.

We simulate all trajectories associated with the various possible combinations of masses from the
same initial condition.

**Do-interventions, however, depend on the task.** For BlocktowerCF and BallsCF, do-interventions consist in (a) removing the top cube or a ball, or (b) shifting a cube/ball on the horizontal plane. In the latter case, for BlocktowerCF, we make sure that the cube does not move too far from the tower, in order to maintain contact. For CollisionCF, the do-interventions are restricted to shifting operations, since there are only two objects (a ball and a cylinder): either a switch of the cylinder's orientation between vertical and horizontal, or a shift of the position of the moving object relative to the resting one along one of the three canonical directions x, y and z.

A.1 ENFORCING THE IDENTIFIABILITY CONSTRAINT

The identifiability and counterfactuality constraints described in Section 3 are imposed numerically, i.e. we first sample and simulate trajectories with random parameters and then reject those that violate these constraints.

As stated in Section 3, an identifiable experiment guarantees that there is no pair (z, z′) that gives the same trajectory AB but a different counterfactual outcome CD. Otherwise, there would be no way to choose between z and z′ from looking at AB alone, and thus no way to correctly forecast the counterfactual experiment. By enforcing this constraint, we make sure that there exists at least a set {z, z′, ...} of confounders that give at the same time similar observed outcomes AB and similar counterfactual outcomes CD.

In practice, there exists a finite set of possible variables z_i, corresponding to every combination of masses for each object in the scene (masses take their value in {1, 10}). During generation, we submit each candidate experiment (AB, CD, z) to a test ensuring that the candidate is identifiable. Let ψ(X_0, z) be the function that gives the trajectory of a system with initial condition X_0 and confounders z. We simulate all possible trajectories ψ(A, z_i) and ψ(C, z_i) for every possible z_i. If there exists z′ ≠ z such that the experiment is not identifiable, the candidate is rejected. This constraint requires simulating the trajectory of each experiment several times while modifying the physical properties of the objects.

Equalities in Definition 1 are relaxed by thresholding distances between trajectories. We reject a candidate experiment if there exists a z′ such that

$$\sum_{t=0}^{T} \left\|\psi(A, z) - \psi(A, z')\right\|_2 < \varepsilon \quad\text{and}\quad \sum_{t=0}^{T} \left\|\psi(C, z) - \psi(C, z')\right\|_2 > \varepsilon. \qquad (7)$$
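To make the rejection test concrete, here is a minimal sketch of the check behind Eq. 7. The simulator handle `simulate(X0, masses)` is a hypothetical stand-in for ψ, assumed to return object positions as a NumPy array of shape (T, n_objects, 3); this illustrates the procedure rather than reproducing our generation code.

```python
import itertools
import numpy as np

def is_identifiable(simulate, A, C, z, n_objects, eps=100.0):
    """Reject a candidate if some z' yields the same AB trajectory as z
    but a different counterfactual outcome CD (Eq. 7)."""
    ab_ref = simulate(A, z)
    cd_ref = simulate(C, z)
    for z_prime in itertools.product([1, 10], repeat=n_objects):
        if z_prime == tuple(z):
            continue
        ab_alt = simulate(A, z_prime)
        cd_alt = simulate(C, z_prime)
        # Sum over time of the per-object L2 distances between trajectories.
        d_ab = np.linalg.norm(ab_ref - ab_alt, axis=-1).sum()
        d_cd = np.linalg.norm(cd_ref - cd_alt, axis=-1).sum()
        if d_ab < eps and d_cd > eps:
            return False  # same observation, different counterfactual: reject
    return True
```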


The choice of the threshold value ε is critical, in particular for the identifiability constraint:

-  If the threshold is too high, all counterfactual outcomes CD are considered equal (the second condition in Eq. 7 is never met), which results in the acceptance of unidentifiable experiments.

-  If the threshold is too low, all observed trajectories AB are considered different (the first condition in Eq. 7 is never met). Again, this leads to mistakenly accepting unidentifiable experiments.

There exists an optimal value of ε which allows us to correctly reject unidentifiable experiments. To measure this optimal threshold, we generated a small instance of the BlocktowerCF dataset without constraining the experiments, i.e. trajectories can be unidentifiable and non-counterfactual. We then plot the percentage of rejected experiments in this unfiltered dataset against the threshold value (Fig. 7, left).

-----

[Figure 7 plots: "Threshold grid search for identifiability constraint" and "Threshold grid search for counterfactuality constraint"; y-axis: percentage of identifiable / counterfactual experiments, x-axis: threshold ε.]


Figure 7: Experimental tuning of the threshold parameter. We generate an unconstrained subset of
BlocktowerCF and plot the percentage of identifiable experiments as function of the threshold ε.


| | without constraint | with constraint |
|---|---|---|
| Accuracy | 56% | 84% |
| Corrected Acc. | 58% | 91% |

(a)

| FPS | 5 | 15 | 25 | 35 | 45 |
|---|---|---|---|---|---|
| MSE (×10⁻²) | 4.58 | 3.97 | 3.74 | 3.82 | 3.93 |

(b)


Table 6: (a) Sanity check of the identifiability constraint in BlocktowerCF, which results in better
estimation of cube masses. The corrected accuracy only considers those cubes for which changes
in masses are consequential for the trajectory D. (b) MSE between ground truth 3D positions and
predicted positions after 1 second, depending on the sampling rate of the trajectory.

We chose the threshold ε = 100, which optimizes discrimination and rejects the highest number of “unidentifiable” trajectories.

To demonstrate the importance of this constraint, we train a recurrent graph network on BlocktowerCF to predict the cube masses from ground-truth state trajectories AB, including poses and velocities, see Fig. 8. It predicts each cube's mass by solving a binary classification task. We train this model on both BlocktowerCF and an alternative version of the scenario generated without the identifiability constraint. The results are shown in Table 6a. We are not aiming at 100% accuracy, and this problem remains difficult in the sense that the identifiability constraint ensures the identifiability of a set of confounder variables, while our sanity check tries to predict a unique z.

However, the addition of the identifiability constraint to the benchmark significantly improves the model's accuracy, which indicates that the property acts positively on the feasibility of Filtered-CoPhy. The corrected accuracy metric focuses solely on the critical cubes, i.e. those cubes whose masses directly define the trajectory CD.

A.2 ENFORCING THE COUNTERFACTUALITY CONSTRAINT

Let (AB, CD, z) be a candidate experiment, and z[k] be a combination of masses identical to z
except for the k[th] value. The counterfactuality constraint consists in checking that there exists at
least one k such that ψ(C, z) ̸= ψ(C, z[k]). To do so, we simulate ψ(C, z[k]) for all k and measure the
difference with the candidate trajectory ψ(C, z). Formally, we verify the existence of k such that:

_T_

_∥ψ(C, z[k]) −_ _ψ(C, z)∥2 < ε._ (8)
_t=0_

X
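The counterfactuality test of Eq. 8 admits a similarly small sketch, with the same hypothetical `simulate` handle as above and an illustrative threshold value:

```python
import numpy as np

def is_counterfactual(simulate, C, z, eps=100.0):
    """Check Eq. 8: at least one single-mass flip z^k must change the outcome CD."""
    cd_ref = simulate(C, z)
    for k in range(len(z)):
        z_k = list(z)
        z_k[k] = 1 if z_k[k] == 10 else 10      # flip the k-th confounder
        cd_alt = simulate(C, z_k)
        if np.linalg.norm(cd_ref - cd_alt, axis=-1).sum() > eps:
            return True
    return False
```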

A.3 ANALYZING TEMPORAL RESOLUTION

We analyzed the choice of temporal frequency for the benchmark with another sanity check. We simulate a non-counterfactual dataset from BlocktowerCF where all cubes have equal masses.


-----

[Figure 8 diagrams: AB states (position, quaternion, angular and linear velocities) → GN → GRU → MLP → m ∈ {1, 10}; CD states over 1 second → GN → GRU → CD states over the next second.]

Figure 8: Impact of the choice of temporal resolution. Left: we check the identifiability constraint by training a model to predict the cube masses in BlocktowerCF from the observed trajectory AB. The model is a graph neural network followed by a gated recurrent unit. Right: we check the effect of the sampling rate by training an agent to forecast a one-second trajectory from the states of the previous second.


Figure 9: Visual examples of the impact of temporal resolution on dynamical information for each
task in Filtered-CoPhy. Black dots are sampled at 6 FPS while red dots are sampled at 25 FPS.

A recurrent graph network takes as input cube trajectories (poses and velocities) over a time interval of one second and predicts the rest of the trajectory over the following second. We vary the sampling frequency; for example, at 5 FPS, the model receives 5 measurements and predicts the next 5 time steps, which correspond to a one-second rollout into the future. Finally, we compare the error in 3D positions between the predictions and the ground truth at the last predicted step. Results are shown in Table 6b. This check clearly shows that 25 FPS corresponds to the best trade-off between an accurate representation of collisions and the amount of training data. Fig. 9 shows a practical example of the effect of time resolution on dynamical information in a trajectory.

A.4 SIMULATION DETAILS

We used PyBullet as the physics engine to simulate Filtered-CoPhy. Each experiment is designed to respect the balance between a good coverage of confounder combinations and the counterfactuality and identifiability constraints described above. We generate the trajectories iteratively (a minimal sketch of this loop is given after the numbered list):

1. We sample a combination of masses and other physical characteristics of the given experiment, such as the stability of the tower, the object motion in CollisionCF, or whether the do-operation consists in removing an object. This allows us to maintain a balance of confounder configurations.

2. Then we search for an initial configuration A. For BlocktowerCF, we make sure that this
configuration is unstable to ensure identifiability. Then we simulate the trajectory B.


-----

[Figure 10 panels: ballsCF, BlocktowerCF, collisionCF; y-axis: confounder (mass) combinations, x-axis: number of experiments; BlocktowerCF bars are split into stable and unstable configurations.]


Figure 10: During the dataset generation process, we carefully balance combinations of masses, i.e.
the confounders. For BlocktowerCF, we also guarantee that the proportion of stable CD towers
is close to 50% for each confounder configuration.

3. We look for a valid do-operation such that identifiability and counterfactuality constraints are
satisfied. If no valid do-operation is found after a fixed number of trials, we reject this experiment.


4. If a valid pair (AB, CD) is found, we add the sample to the dataset.
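The generation loop described in steps 1–4 above can be summarized by the following sketch. Here `sample_masses`, `sample_initial_configuration`, `sample_do_operation` and `apply_do` are hypothetical helpers standing in for our sampling code, the retry budget is arbitrary, and the `is_identifiable` / `is_counterfactual` functions refer to the sketches from Appendices A.1 and A.2.

```python
def generate_experiment(simulate, sample_masses, sample_initial_configuration,
                        sample_do_operation, apply_do, n_objects, max_do_trials=20):
    """One iteration of the generation loop (steps 1-4), as a simplified sketch."""
    z = sample_masses(n_objects)                    # step 1: balanced confounder sampling
    A = sample_initial_configuration(z)             # step 2: e.g. an unstable tower for BT-CF
    AB = simulate(A, z)
    for _ in range(max_do_trials):                  # step 3: search for a valid do-operation
        do_op = sample_do_operation(A)
        C = apply_do(A, do_op)
        if (is_identifiable(simulate, A, C, z, n_objects)
                and is_counterfactual(simulate, C, z)):
            return AB, simulate(C, z), do_op, z     # step 4: accept the sample (AB, CD)
    return None                                     # no valid do-operation found: reject
```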

The trajectories were simulated with a sampling time of 0.04 seconds. The video resolution is 448 × 448, and each video represents 6 seconds for BlocktowerCF and BallsCF, and 3 seconds for CollisionCF. We invite interested readers to look at our code for more details, such as do-operation sampling or intrinsic camera parameters. Fig. 10 shows the confounder distribution in the three tasks.


B PERFORMANCE EVALUATION OF THE DE-RENDERING MODULE

B.1 IMAGE RECONSTRUCTION


We evaluate the reconstruction performance of the de-rendering module on the reconstruction task. Note that there is a trade-off between reconstruction performance and dynamic forecasting accuracy: a higher number of keypoints may lead to better reconstruction, but can hurt prediction performance, as the dynamic model becomes more difficult to learn.


| # Keypoints | | N | 2N | 4N |
|---|---|---|---|---|
| BT-CF | PSNR | 34.40 | 35.41 | 34.92 |
| | MSE Grad | 27.24 | 21.39 | 23.99 |
| B-CF | PSNR | 37.76 | 37.06 | 36.98 |
| | MSE Grad | 3.47 | 3.77 | 3.95 |
| C-CF | PSNR | 32.00 | 35.41 | 34.42 |
| | MSE Grad | 32.00 | 12.57 | 17.09 |

Table 7: PSNR (dB) on the task of reconstructing the target from the source (both randomly sampled from Filtered-CoPhy), using 5 coefficients per keypoint. We vary the number of keypoints in our model. Here N is the maximum number of objects in the scene.

**Reconstruction error –** We first investigate the impact of the number of keypoints in Table 7 by measuring the Peak Signal-to-Noise Ratio (PSNR) between the target image and its reconstruction. We vary the number of keypoints among multiples of N, the maximum number of objects in the scene. Increasing the number of keypoints increases reconstruction quality (PSNR) up to a certain point, but results in a degradation of forecasting performance. Furthermore, doubling the number of keypoints only slightly improves reconstruction accuracy. This tends to indicate that our additional coefficients are already sufficient to model finer-grained visual details. Table 3 in the main paper measures the impact of the number of keypoints and the presence of the additional appearance coefficients on the full pipeline including the dynamic model. Table 8 illustrates the impact of the number of keypoints and the additional appearance coefficients on the reconstruction performance alone. As we can see, the addition of the coefficients consistently improves PSNR for low numbers of keypoints (over 2 dB for N keypoints).


-----

[Figure 11 panels: Ground Truth; Ours (N, 2N, 4N keypoints); Without coefficients (N, 2N, 4N keypoints).]


Figure 11: Reconstructions produced by the de-rendering module. Our model correctly marks each
object in the scene and achieves satisfactory reconstruction.

The improvement is less visible for larger numbers of keypoints, since 3D visual details can then be encoded via the keypoint positions, making the coefficients less relevant. Visualizations are shown in Fig. 11.

| | Coefficients | N: without | N: with | 2N: without | 2N: with | 4N: without | 4N: with |
|---|---|---|---|---|---|---|---|
| BT-CF | PSNR | 32.53 | 34.40 | 33.97 | 35.41 | 34.57 | 34.92 |
| | MSE Grad | 41.86 | 27.24 | 35.24 | 21.39 | 28.06 | 23.99 |
| B-CF | PSNR | 34.62 | 37.76 | 36.94 | 37.06 | 37.15 | 36.98 |
| | MSE Grad | 6.22 | 3.47 | 4.16 | 3.77 | 4.07 | 3.95 |
| C-CF | PSNR | 30.65 | 32.00 | 33.89 | 35.41 | 35.63 | 34.42 |
| | MSE Grad | 12.78 | 32.00 | 20.59 | 12.57 | 11.72 | 17.09 |



Table 8: Impact of the number of keypoints and of the presence of the additional appearance coefficients in the de-rendering module for pure image reconstruction (no dynamic model). We report PSNR (dB) and MSE on the image gradient. N is the maximum number of objects in the scene. The coefficients significantly improve the reconstruction for low numbers of keypoints. This table is related to Table 3 in the main paper, which measures this impact on the full pipeline.

B.2 NAVIGATING THE LATENT COEFFICIENT MANIFOLD

We evaluate the influence of the additional appearance coefficients on our de-rendering model by navigating its manifold. To do so, we sample a random pair (Xsource, Xtarget) from an experiment in BlocktowerCF and compute the corresponding source features and target keypoints and coefficients. Then, we vary each component of the target keypoints and coefficients and observe the reconstructed image (Fig. 12). We observe that the keypoints accurately control the position of the cube along both spatial axes. The rendering module does infer some hints of 3D shape information from the vertical position of the cube, exploiting a shortcut in learning. On the other hand, while not being supervised, the coefficients naturally learn to encode different orientations in space and the distance from the camera. Interestingly, a form of disentanglement emerges. For example, coefficients 1 and 2 control rotation around the z-axis, and coefficient 4 models rotation around the y-axis. The last coefficient represents both the size of the cube and its presence in the image.


-----

[Figure 12 rows: Keypoint X, Keypoint Y, and Coefficients 1–5, each swept over a range of values (keypoints from -0.7 to 0.7, coefficients from 0.1 to 0.9).]


Figure 12: Navigating the manifold of the latent coefficient representation. Each line corresponds
to variations of one keypoint coordinate or coefficient and shows the effect on a single cube.

C COMPARISON WITH THE TRANSPORTER BASELINE

C.1 COMPARISON WITH OUR DE-RENDERING MODEL

As described in Section 4.1, the Transporter (Kulkarni et al., 2019) is a keypoint detection model somewhat close to our de-rendering module. It leverages the transport equation to compute a reconstruction:

$$\hat{\Psi}_{\text{target}} = F_{\text{source}} \times (1 - K_{\text{source}}) \times (1 - K_{\text{target}}) + F_{\text{target}} \times K_{\text{target}}. \qquad (9)$$

This equation allows information to be transmitted from the input by two means: the 2D positions of the keypoints (Ktarget) and the dense visual features of the target (Ftarget). In comparison, our de-rendering solely relies on the keypoints from the target image and does not require a dense vector to be computed on the target to reconstruct it. This makes the Transporter not directly comparable with our de-rendering module. We nevertheless compare the performance of the two models in Table 9, and provide visual examples in Fig. 13. Even though the two models are not comparable, as the Transporter uses additional information, our model still outperforms the Transporter for small numbers of keypoints. Interestingly, for higher numbers of keypoints the Transporter tends to discover keypoints far from the objects. We investigate this behavior in the following section, and show that this is actually a critical problem for learning causal reasoning on the discovered keypoints.
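For reference, the transport operation of Eq. 9 reduces to a one-line tensor expression. The sketch below assumes dense feature maps of shape (batch, channels, H, W) and keypoint heatmaps summed over keypoints with values in [0, 1]; it is an illustration of the equation, not the Transporter authors' code.

```python
import torch

def transport(f_source, k_source, f_target, k_target):
    """Eq. 9: suppress source content at both keypoint sets, then paste the
    target features at the target keypoint locations.

    f_*: dense feature maps, shape (B, C, H, W)
    k_*: summed keypoint heatmaps, shape (B, 1, H, W), values in [0, 1]
    """
    return f_source * (1.0 - k_source) * (1.0 - k_target) + f_target * k_target
```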

C.2 ANALYSIS OF TRANSPORTER’S BEHAVIOR

The original version of the V-CDN model (Li et al., 2020) is based on the Transporter (Kulkarni et al., 2019). We have already highlighted the fact that this model is not comparable on our task, as it requires not only the target keypoints Ktarget but also a dense feature map Ftarget, whose dynamics can hardly be learned due to its high dimensionality. More precisely, the transport equation (Eq. 9) allows information to pass from the target by two means: the 2D positions of the keypoints (Ktarget) and the dense feature map of the target (Ftarget). The number of keypoints therefore becomes a highly sensitive parameter, as the Transporter can choose to transfer information preferably through the target features rather than through the keypoint locations. When the number of keypoints is low, they act as a bottleneck, and the model has to carefully discover them to reconstruct the image.


-----

| # Keypoints | Ours: 4 | Ours: 8 | Ours: 16 | Transporter: 4 | Transporter: 8 | Transporter: 16 |
|---|---|---|---|---|---|---|
| BT-CF | 34.40 | 35.41 | 34.92 | 34.10 | 34.88 | 39.20 |
| B-CF | 37.76 | 37.06 | 36.98 | 34.75 | 34.78 | 35.13 |
| C-CF | 35.41 | 34.42 | 35.98 | 32.66 | 33.39 | 34.47 |


Table 9: PSNR (dB) on the task of reconstructing the target from the source (both randomly sampled from Filtered-CoPhy), using 5 coefficients per keypoint. We vary the number of keypoints in both our model and the Transporter. Note that the Transporter uses target features to reconstruct the image, hence it is not comparable with our model.


[Figure 13 panels (three rows of examples): Ground Truth; Ours (4k, 8k, 16k); Learned (4k); Transporter (4k, 8k, 16k).]


Figure 13: Examples of images reconstructed by our de-rendering module and by the Transporter.

On the other hand, when we increase the number of keypoints, Transporter stops tracking objects
in the scene and transfers visual information through the dense feature map, making the predicted
keypoints unnecessary for image reconstruction, and therefore not representative of the dynamics.

To illustrate our hypothesis, we set up the following experiment. Starting from a trained Transporter model, we fix the source image to be X0 (the first frame of the trajectory) during the evaluation step. Then, we compute features and keypoints on target frames Xt regularly sampled in time. We reconstruct the target image using the transport equation, but without updating the target keypoints. In practice, this consists in computing Ψ̂target with Eq. (9) while substituting Ksource for Ktarget.

Results are shown in Fig. 14. There is no dynamic forecasting involved in this figure, and the Transporter we used was trained in the regular way; we only change the transport equation at evaluation time. Even though the keypoint positions have been fixed, the Transporter manages to reconstruct a significant part of the images, which indicates that a part of the dynamics has been encoded in the dense feature map.

In contrast, this issue does not arise with our de-rendering module, since our decoder solely relies on the target keypoints to reconstruct the image. Note that this is not in contradiction with the claims in Li et al. (2020), since they do not evaluate V-CDN in pixel space. A rational choice of the number of keypoints leads to satisfactory performance, allowing V-CDN to accurately forecast the trajectory in keypoint space and retrieve the hidden confounders on their dataset.

C.3 TEMPORAL INCONSISTENCY ISSUES

Increasing the number of keypoints of the Transporter may lead to temporal inconsistency during
the long-range reconstruction. For example, a keypoint that tracks the edge of a cube in the first


-----

[Figure 14 rows: Ground Truth, Transporter 4, Transporter 8, Transporter 16; columns: t = 0, 7, 15, 22, 30, 37.]


Figure 14: We evaluate the Transporter with a varying number of keypoints to reconstruct images
regularly sampled in a trajectory while having the target keypoints fixed. Even if the keypoints
are not moving, the Transporter still manages to reconstruct a significant part of the image, which
indicates that the keypoints are not fully responsible for encoding the dynamics of the scene.

frame may end up tracking a face of the same cube later in the sequence, since the dynamics does not intervene in the keypoint discovery process.

Our de-rendering module directly addresses this through the use of the additional appearance coefficients, which allows us to limit the number of keypoints to the number of objects in the scene, effectively alleviating the consistency issue. Fig. 15 illustrates this phenomenon by plotting the discovered keypoint locations forward in time, as well as the 2D location of the center of mass of each object. Note that the Transporter suffers from the temporal inconsistency issue with numbers of keypoints as low as 4 (green cube). In contrast, our model manages to solve the problem and accurately tracks the centers of mass, even though they were never supervised.

D DETAILS OF MODEL ARCHITECTURES

D.1 DE-RENDERING MODULE

We call a “block” a 2D convolutional layer followed by a 2D batch norm layer and ReLU activation.
The exact architecture of each part of the encoder is described in Table 10a. The decoder hyperparameters are described in Table 10b.

**Dense feature map estimator F –** We compute the feature vector from Xsource by applying a convolutional network F to the output of the common CNN of the encoder. This produces the source feature vector Fsource of shape (batch, 16, 28, 28).

**Keypoint detector K –** The convolutional network K outputs a set of 2D heatmaps of shape (batch, K, 28, 28), where K is the desired number of keypoints. We apply a spatial softmax function on the two last dimensions, then we extract a pair of coordinates on each heatmap by looking for the location of the maximum, which gives us Ktarget of shape (batch, K, 2).
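A minimal sketch of this extraction step (spatial softmax followed by taking the location of the maximum) is given below; the normalization of the coordinates to [-1, 1] is our illustrative convention and not necessarily the one used in the released code.

```python
import torch

def extract_keypoints(heatmaps):
    """heatmaps: (B, K, H, W) -> keypoints: (B, K, 2), coordinates in [-1, 1]."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, h * w)
    probs = torch.softmax(flat, dim=-1)                      # spatial softmax over the grid
    idx = probs.argmax(dim=-1)                               # location of the maximum per heatmap
    rows = torch.div(idx, w, rounding_mode="floor").float()
    cols = (idx % w).float()
    ys = rows / (h - 1) * 2.0 - 1.0                          # row index  -> y in [-1, 1]
    xs = cols / (w - 1) * 2.0 - 1.0                          # column index -> x in [-1, 1]
    return torch.stack([xs, ys], dim=-1)
```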


-----

t=0 t=30 t=60 t=70 t=80 t=90

Transporter (4kpts)

Transporter (8kpts)
Ours (4kpts)

Ours (8kpts)


Figure 15: Temporal inconsistency in long-range reconstruction. We show the keypoints discovered
on images taken from different time steps (black dots). We also compute the 2D location of the
center of mass of each object in the scene (white dots). Our de-rendering module accurately tracks
the centers of mass, which have never been supervised.

**Coefficient estimator C –** We obtain the coefficients by applying a third convolutional network C to the output of the common encoder CNN, which again results in a set of 2D maps of shape (batch, K, 28, 28). These maps are flattened channel-wise, providing a tensor of shape (batch, K, 28 × 28) that is fed to an MLP (see Table 10a for the exact architecture), which estimates the coefficients Ctarget of shape (batch, K, C + 1).

**Gaussian mapping G –** The keypoint vector Ktarget is mapped to a 2D heatmap through a Gaussian mapping process:

$$\mathcal{G}(\mathbf{k})(x, y) = \exp\left(-\frac{(x - k_x)^2 + (y - k_y)^2}{\sigma^2}\right), \qquad (10)$$

where $\mathcal{G}(\mathbf{k}) \in \mathbb{R}^{28 \times 28}$ is the Gaussian mapping of the keypoint $\mathbf{k} = [k_x\ k_y]$. We deform these Gaussian mappings by applying convolutions with filters $H_i$ controlled by the coefficients $c^i_k$.

The filters $H_i$ are 5 × 5 kernels that elongate the Gaussian in a specific direction. In practice, we obtain the filter $H_i$ by drawing a line crossing the center of the kernel with a slope angle of $i\frac{\pi}{C}$, where $C$ is the number of coefficients. We then apply a 2D convolution:

$$\mathbf{G}^i_k = c^{C+1}_k\, c^i_k \left(\mathcal{G}(\mathbf{k}_k) * H_i\right). \qquad (11)$$

Note that we also compute a supplementary coefficient $c^{C+1}_k$ used as a gate on the keypoints. By setting this coefficient to zero, the de-rendering module can deactivate a keypoint (which is redundant with deactivating the full set of coefficients for this keypoint).
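The sketch below illustrates Eqs. 10 and 11: rendering one Gaussian per keypoint and deforming it with line-shaped 5 × 5 kernels weighted by the coefficients. The grid size, the value of σ and the way the line kernels are rasterized are simplifying assumptions for illustration, not the exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_maps(keypoints, size=28, sigma=1.5):
    """keypoints: (B, K, 2) in [-1, 1] -> heatmaps: (B, K, size, size) (Eq. 10)."""
    coords = torch.linspace(-1.0, 1.0, size)
    ys, xs = torch.meshgrid(coords, coords, indexing="ij")
    kx = keypoints[..., 0][..., None, None]                   # (B, K, 1, 1)
    ky = keypoints[..., 1][..., None, None]
    return torch.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / sigma ** 2)

def line_kernels(n_coeffs, ksize=5):
    """One kernel per coefficient: a line through the center with slope angle i*pi/C."""
    kernels = torch.zeros(n_coeffs, 1, ksize, ksize)
    c = ksize // 2
    for i in range(n_coeffs):
        angle = i * math.pi / n_coeffs
        for r in torch.linspace(-c, c, 4 * ksize):            # rasterize the line
            x = int(round(c + float(r) * math.cos(angle)))
            y = int(round(c + float(r) * math.sin(angle)))
            kernels[i, 0, y, x] = 1.0
    return kernels

def deformed_maps(keypoints, coeffs, n_coeffs):
    """Eq. 11: G_k^i = c_k^{C+1} * c_k^i * (G(k_k) * H_i); output (B, K*C, 28, 28)."""
    b, k, _ = keypoints.shape
    g = gaussian_maps(keypoints).view(b * k, 1, 28, 28)
    h = line_kernels(n_coeffs)
    out = F.conv2d(g, h, padding=h.shape[-1] // 2)            # (B*K, C, 28, 28)
    out = out.view(b, k, n_coeffs, 28, 28)
    gate = coeffs[..., -1:, None, None]                       # gating coefficient c_k^{C+1}
    weights = coeffs[..., :-1, None, None]                    # c_k^1 ... c_k^C
    return (gate * weights * out).flatten(1, 2)
```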

**Refiner R –** To reconstruct the target image, we channel-wise stack the feature vectors from the source with the constructed filters and feed them to the decoder CNN R (Table 10b).

We train the de-rendering module on pairs of images (Xsource, Xtarget) randomly sampled from sequences D. For a given sequence D, we take the first T − 25 frames of the trajectory as sources (where T is the number of frames in the video). The last 25 frames are used as targets. For evaluation, we take the 25th frame as the source and the 50th frame as the target. We use the Adam optimizer with a learning rate of 10⁻³, γ₁ = 10⁴ and γ₂ = 10⁻¹ to minimize Equation 4.


-----

Common CNN:

| | Module | in ch. | out ch. | kernel | stride | pad. |
|---|---|---|---|---|---|---|
| 1 | Block | 3 | 32 | 7 | 1 | 3 |
| 2 | Block | 32 | 32 | 3 | 1 | 1 |
| 3 | Block | 32 | 64 | 3 | 2 | 1 |
| 4 | Block | 64 | 64 | 3 | 1 | 1 |
| 5 | Block | 64 | 128 | 3 | 2 | 1 |

F():

| | Module | in ch. | out ch. | kernel | stride | pad. |
|---|---|---|---|---|---|---|
| 1 | Block | 128 | 16 | 3 | 1 | 1 |

K():

| | Module | in ch. | out ch. | kernel | stride | pad. |
|---|---|---|---|---|---|---|
| 1 | Block | 128 | 128 | 3 | 1 | 1 |
| 2 | Conv2d | 128 | K | 3 | 1 | 1 |
| 3 | Softplus | | | | | |

C():

| | Module | in ch. | out ch. | kernel | stride | pad. |
|---|---|---|---|---|---|---|
| 1 | Block | 128 | K | 3 | 1 | 1 |
| 2 | Flatten | | | | | |

| | Module | in | out |
|---|---|---|---|
| 3 | Linear+ReLU | 784 | 2048 |
| 4 | Linear+ReLU | 2048 | 1024 |
| 5 | Linear+ReLU | 1024 | 512 |
| 6 | Linear+ReLU | 512 | C |
| 7 | Sigmoid | | |

(a) Encoder architecture

R():

| | Module | in ch. | out ch. | kernel | stride | pad. |
|---|---|---|---|---|---|---|
| 1 | Block | 16 + K × C | 128 | 3 | 1 | 1 |
| 2 | Block | 128 | 128 | 3 | 1 | 1 |
| 3 | Block | 128 | 64 | 3 | 1 | 1 |
| 4 | UpSamplingBilinear2d(2) | | | | | |
| 5 | Block | 64 | 64 | 3 | 1 | 1 |
| 6 | Block | 64 | 32 | 3 | 1 | 1 |
| 7 | UpSamplingBilinear2d(2) | | | | | |
| 8 | Block | 32 | 32 | 3 | 1 | 1 |
| 9 | Block | 32 | 32 | 7 | 1 | 1 |
| 10 | Conv2d | 32 | 3 | 1 | 1 | 1 |
| 11 | TanH | | | | | |


(b) Decoder architecture


Table 10: Architectural details of the de-rendering module.

D.2 CODY

We describe the architectural choices made in CoDy. Let

$$\mathbf{s}(t) = \left[\mathbf{k}_k \;\; c^1_k \;\dots\; c^{C+1}_k \;\; \dot{\mathbf{k}}_k \;\; \dot{c}^1_k \;\dots\; \dot{c}^{C+1}_k\right]_{k=1..K} \qquad (12)$$

be the state representation of an image $X_t$, composed of the 2D coordinates of the $K$ keypoints together with their $C + 1$ coefficients. The time derivative of each component of the state is computed with an implicit Euler scheme, e.g. $\dot{\mathbf{k}}(t) = \mathbf{k}(t) - \mathbf{k}(t - 1)$. We use superscripts to distinguish the keypoints from AB and CD.
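A small sketch of how the state of Eq. 12 can be assembled from per-frame keypoints and coefficients, using the backward difference as the velocity term; the tensor shapes are illustrative assumptions.

```python
import torch

def build_state(keypoints, coeffs):
    """keypoints: (T, K, 2), coeffs: (T, K, C+1) -> states s(t): (T, K, 2 * (2 + C + 1)).

    Each keypoint node carries its position/coefficients and their backward
    differences, e.g. k_dot(t) = k(t) - k(t-1), set to zero at t = 0."""
    feats = torch.cat([keypoints, coeffs], dim=-1)            # (T, K, 2 + C + 1)
    vel = torch.zeros_like(feats)
    vel[1:] = feats[1:] - feats[:-1]                          # implicit Euler differences
    return torch.cat([feats, vel], dim=-1)
```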

**CF estimator –** The latent representation of the confounders is discovered from s^AB. The graph neural network in this module implements the message passing function f() and the aggregation function g() (see Equation 5) as MLPs with 3 hidden layers of 64 neurons and ReLU activations. The resulting node embeddings h^AB(t) = GN(s^AB(t)) belong to R^128. We then apply a gated recurrent unit with 2 layers and a hidden vector of size 32 to each node in h^AB(t) (sharing parameters between nodes). The last hidden vector is used as the latent representation of the confounders u_k.

**State encoder-decoder –** The state encoder is modeled as a GN where the message passing function and the aggregation function are MLPs with one hidden layer of 32 units. The encoded state σ^CD = E(s^CD) lies in R^256. We perform dynamical prediction in this σ space, and then project the forecast back into keypoint space using a decoder. The decoder ∆(σ(t)) first applies a shared GRU with one layer and a hidden vector size of 256 to each keypoint σ_k(t), followed by a graph neural network with the same structure as the state encoder.

**Dynamic system –** Our dynamic system forecasts the future state σ̂(t + 1) from the current estimate σ̂(t) and the confounders u = [u_1 ... u_K]. It first applies a graph neural network to the concatenated vector [σ̂(t), u]. The message passing function f and the aggregation function g are MLPs with 3 hidden layers of 64 neurons and ReLU activations. The resulting node embeddings GN(σ̂(t)) belong to R^64 and are fed to a GRU with 2 layers and a hidden vector of size 64, sharing weights among nodes. This GRU updates the hidden vector v^CD(t) = [v_1 ... v_K], which is then used to compute a displacement with a linear transformation:

$$\hat{\sigma}_k(t + 1) = \sigma_k(t) + W v_k(t) + b. \qquad (13)$$

CoDy is trained using Adam to minimize Equation 6 (learning rate 10⁻⁴, γ₃ = 1). We train each CoDy instance by providing it with the fixed keypoint states s^AB(t) and the initial condition s^CD(t = 0) computed by our trained de-rendering module. CoDy first computes the latent confounder representation u_k, and then projects the initial condition into the latent dynamic space, σ(t = 0) = E(s^CD(t = 0)). We apply the dynamic model multiple times in order to recursively forecast T time steps of CD. We then apply the decoder ∆(σ̂(t)) to compute the trajectory in keypoint space.
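The inference procedure described above can be summarized by the following sketch; `cf_estimator`, `encoder`, `dynamics` and `decoder` are placeholders for the trained modules, with interfaces we assume here for illustration.

```python
import torch

@torch.no_grad()
def cody_rollout(cf_estimator, encoder, dynamics, decoder, s_ab, s_cd0, horizon):
    """Counterfactual rollout: estimate confounders from AB, encode the CD initial
    state, unroll the latent dynamics and decode back to keypoint space.

    s_ab:  (T_ab, K, D) observed keypoint states of AB
    s_cd0: (K, D)       initial keypoint state of CD
    """
    u = cf_estimator(s_ab)                  # per-keypoint latent confounders u_k
    sigma = encoder(s_cd0)                  # sigma(t=0) = E(s^CD(0))
    trajectory = []
    for _ in range(horizon):
        sigma = dynamics(sigma, u)          # latent update driven by Eq. 13
        trajectory.append(decoder(sigma))   # back to keypoints + coefficients
    return torch.stack(trajectory, dim=0)   # (horizon, K, D)
```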

E ADDITIONAL QUANTITATIVE EVALUATION

E.1 MULTI-OBJECT TRACKING METRICS

When the number of keypoints matches the number of objects in the scene, the keypoint detector naturally, and in an unsupervised manner, places keypoints near the center of mass of each object (see Figure 15). Leveraging this emerging property, we provide an additional empirical demonstration of the accuracy of our model by computing classical Multi-Object Tracking (MOT) metrics. In particular, we compute the Multi-Object Tracking Precision (MOTP) and the Multi-Object Tracking Accuracy (MOTA) as described in Bernardin & Stiefelhagen (2008).

-  MOTA requires computing the number of missed objects (i.e. objects not tracked by any keypoint) and the number of false positives (i.e. keypoints that do not represent an actual object). MOTA takes values in [−1, 1], where 1 represents perfect tracking:

$$\text{MOTA} = 1 - \frac{\sum_t \left(m_t + f_t + s_t\right)}{\sum_t g_t}, \qquad (14)$$

where $m_t$ is the number of missed objects at time $t$, $f_t$ is the number of false positives at time $t$, $s_t$ is the number of swaps at time $t$, and $g_t$ is the number of objects at time $t$.

-  MOTP measures the distance between the keypoints and the ground-truth centers of mass, conditioned on the pairing process:

$$\text{MOTP} = \frac{\sum_{i,t} d^i_t}{\sum_t c_t}, \qquad (15)$$

where $c_t$ is the number of accurately tracked objects at time $t$ and $d^i_t$ is the distance between the keypoint and the center of mass in the $i$-th association {keypoint, center of mass}.

Note that these metrics are related: low MOTP indicates that the tracked objects are tracked precisely, and low MOTA indicates that many objects are missed. Thus, to be efficient, a model needs
to achieve both low MOTP and high MOTA.
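A simplified sketch of how these two metrics can be computed when keypoints and ground-truth centers of mass are already paired one-to-one (so s_t = 0 and misses and false positives coincide) and a miss is defined by a distance threshold; the full CLEAR-MOT protocol of Bernardin & Stiefelhagen (2008) additionally handles matching and identity swaps.

```python
import numpy as np

def mot_metrics(keypoints, centers, match_threshold=5.0):
    """keypoints, centers: arrays of shape (T, K, 2) in pixels, already paired.

    Returns (MOTA, MOTP) under the simplifying assumptions stated above."""
    dists = np.linalg.norm(keypoints - centers, axis=-1)      # (T, K) distances
    matched = dists < match_threshold
    misses = (~matched).sum()           # objects not covered by their keypoint
    false_pos = (~matched).sum()        # symmetric here: unmatched keypoints
    n_objects = matched.size
    mota = 1.0 - (misses + false_pos) / n_objects
    motp = dists[matched].sum() / max(matched.sum(), 1)
    return mota, motp
```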

We also report the performance of CoPhyNet (Baradel et al., 2020), which predicts counterfactual outcomes in Euclidean space using ground-truth 3D states. As it uses GT object positions during training, it is not comparable and should be considered as a soft upper bound for our method. We present our results in Table 11. They confirm the superiority of our method over UV-CDN in keypoint space. The upper bound CoPhyNet takes advantage of the non-ambiguous 3D representation given by the ground-truth states of the objects in the scene.

Our method also outperforms CoPhyNet on the BallsCF task, probably due to two phenomena. First, BallsCF is the only 2D task of Filtered-CoPhy, so CoPhyNet does not benefit from using ground-truth 3D positions. Second, the state encoder in CoDy projects the 2D position of each sphere into a space where the dynamics is easier to learn, probably by breaking the non-linearity of collisions.


-----

| | | Ours | UV-CDN | CoPhyNet (not comparable) |
|---|---|---|---|---|
| BT-CF | MOTA ↑ | 0.46 | 0.16 | 0.44 |
| | MOTP ↓ | 3.34 | 4.51 | 0.72 |
| B-CF | MOTA ↑ | -0.07 | -0.73 | -0.16 |
| | MOTP ↓ | 4.64 | 5.83 | 5.10 |
| C-CF | MOTA ↑ | -0.14 | -0.19 | 0.21 |
| | MOTP ↓ | 6.35 | 6.35 | 4.37 |

Table 11: MOT metrics for different methods. While not comparable, we report the CoPhyNet
performance as a soft upper bound. Our method and UV-CDN use one keypoint per object.
MOTA ↑: higher is better; MOTP ↓: lower is better;


[Figure 16 panels: BlocktowerCF, BallsCF, CollisionCF; left: PSNR against displacement amplitude (individual measurements and per-bin averages); right: PSNR against do-operation type (Rotate, Move, Remove).]


Figure 16: Effect of the do-operation on the quality of the forecasted video. Left: our method generalizes well to a wide range of "Move" operation amplitudes. Right: we observe a difference of 3 dB in favor of the Move do-operation, which is unsurprising, as it is the least disturbing intervention.

E.2 IMPACT OF THE DO-OPERATIONS


We also measure the impact of the do-operation types on video forecasting. Fig. 16 (left) is obtained by computing the PSNR for each example of the training set and reporting the result on a 2D graph as a function of the amplitude of the displacement that characterizes the do-operation. We applied the same method to obtain Fig. 16 (right), which focuses on the type of do-operation, i.e. moving, removing or rotating an object. These figures are computed using the 2N-keypoint models.

Our method generalizes well across different do-operations, including both the type of operation and its amplitude. A key to this success is the careful design of the dataset (balanced with respect to the types of do-operations) and a reasonable representation (our set of keypoints and coefficients) able to detect and model each do-operation from images.

F EXPERIMENTS ON REAL-WORLD DATA


Our contributions are focused on the discovery of causality in physics through counterfactual reasoning. We designed our model in order to solve the new benchmark and provided empirical evidence that our method is well suited for modeling rigid-body physics and counterfactual reasoning. The following section aims to demonstrate that our approach can also be extended to a real-world dataset. We provide qualitative results obtained on a derivative of BlocktowerCF using real cube towers (Lerer et al., 2016).


-----

[Figure 17: three examples; rows: Ground Truth, Prediction; columns: Frames 1, 3, 5, 9, 11, 16, 20.]


Figure 17: We evaluate our method on a real-world dataset Blocktower IRL. After fine-tuning, CoDy
manages to accurately forecast future frames from real videos.

We refer to this dataset as Blocktower IRL. It is composed of 516 videos of wooden blocks stacked in a stable or unstable manner. The number of cubes in a tower varies from 2 to 4. We aim to predict the dynamics of the tower in pixel space. This is closely related to our BlocktowerCF task (which is inspired by the seminal work of Lerer et al. (2016)), with three main differences: (1) the dataset shows real cube towers, (2) the problem is not counterfactual, i.e. every cube has the same mass, and (3) the dataset contains only a few videos.

To cope with the lack of data, we exploit our models pre-trained on BlocktowerCF and fine-tune them on Blocktower IRL. The adaptation of the de-rendering module is straightforward: we choose the 4-keypoint, 5-coefficient configuration and train the module for image reconstruction after loading the weights from the previous training on our simulated task. CoDy, on the other hand, requires careful tuning to preserve the regularities learned from BlocktowerCF and to prevent over-fitting. Since Blocktower IRL is not counterfactual, we de-activate the confounder estimator and set u_k to vectors of ones. We also freeze the weights of the last layers of the MLPs in the dynamic model.

To the best of our knowledge, we are the first to use this dataset for video prediction. Lerer et al. (2016) and Wu et al. (2017) leverage the videos for stability prediction, but actual trajectory forecasting was not their main objective. To quantitatively evaluate our method, we predict 20 frames into the future from a single image sampled in the trajectory. We measured an average PSNR of 26.27 dB, which is of the same order of magnitude as the results obtained in simulation. Figure 17 provides visual examples of the output.


-----

G QUALITATIVE EVALUATION: MORE VISUAL EXAMPLES

More qualitative results produced by our model on different tasks from our datasets are given below.

[Figure 18: two examples; rows: Ground Truth, Filtered CoPhy (8 kpts), PhyDNet, UV-CDN (8 kpts), PredRNN; columns: t = 0 to 149.]


Figure 18: Qualitative performance on the BlocktowerCF (BT-CF) benchmark.


-----

[Figure 19: two examples; rows: Ground Truth, Filtered CoPhy (8 kpts), PhyDNet, UV-CDN (8 kpts), PredRNN; columns: t = 0 to 149.]


Figure 19: Qualitative performance on the BallsCF (B-CF) benchmark.


-----

[Figure 20: two examples; rows: Ground Truth, Filtered CoPhy (4 kpts), PhyDNet, UV-CDN (4 kpts), PredRNN; columns: t = 0 to 74.]


Figure 20: Qualitative performance on the CollisionCF (C-CF) benchmark.


-----