# TOWARDS STRUCTURED DYNAMIC SPARSE PRE-TRAINING OF BERT

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Identifying algorithms for computationally efficient unsupervised training of large
language models is an important and active area of research. In this work, we
develop and study a straightforward, dynamic always-sparse pre-training approach
for BERT language modeling, which leverages periodic compression steps based
on magnitude pruning followed by random parameter re-allocation. This approach
enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad
spectrum of network sizes. Furthermore, we demonstrate that training remains
FLOP-efficient when using coarse-grained block sparsity, making it particularly
promising for efficient execution on modern hardware accelerators.

1 INTRODUCTION

The increasing task performance gains of large, pre-trained language models have fueled interest
in computationally efficient unsupervised training (Kaplan et al., 2020). In recent years, sparsity
has regained popularity as a technique for improving the computational efficiency of deep learning
models (Hoefler et al., 2021). Current sparsity methods can be distinguished into approaches that
impose sparsity on the weights of neural networks via weight sparsity (Frankle & Carbin, 2019; Gale
et al., 2019; Bellec et al., 2017; Mostafa & Wang, 2019; Evci et al., 2019; Dettmers & Zettlemoyer,
2019; Mocanu et al., 2018; Jayakumar et al., 2020), or techniques that dynamically route activations
to only interact with a subset of the network weights via conditional sparsity (Shazeer et al., 2017;
Lepikhin et al., 2020; Fedus et al., 2021; Lewis et al., 2021).

In weight sparse training (Frankle & Carbin, 2019; Gale et al., 2019), the number of network
parameters is reduced by imposing sparsity patterns on the network weights. As a result, weight
sparse training can lead to significant savings in FLOPs, making it promising for scaling to larger
network architectures for a given compute budget. One of the most promising candidates for weight
sparse training is dynamic sparsity (DynSparse), which reduces FLOPs while only requiring training
of sparse subsets of the over-parameterized network (Bellec et al., 2017; Mostafa & Wang, 2019;
Evci et al., 2019; Dettmers & Zettlemoyer, 2019; Mocanu et al., 2018; Jayakumar et al., 2020; Liu
et al., 2021a). In DynSparse approaches, the sparsity pattern imposed on the weights is continuously
modified during training using pruning and re-allocation strategies. This evolution leads to a joint
exploration of both network topology and parameters, which has been shown to outperform static
sparsity baselines (Bellec et al., 2017; Mostafa & Wang, 2019; Evci et al., 2019; Dettmers &
Zettlemoyer, 2019).

However, so far, the limited performance on language modeling tasks (Evci et al., 2019) has resulted in
DynSparse training not seeing wide adoption for large-scale language modeling tasks despite recent
advances (Jayakumar et al., 2020). Given the high cost and energy consumption of unsupervised
training of large-scale language models (Strubell et al., 2019; Patterson et al., 2021), dynamic sparsity
bears the potential to make pre-training more efficient and affordable. To this end, we adopt and
investigate DynSparse training techniques (Dettmers & Zettlemoyer, 2019; Evci et al., 2019) for
pre-training of the BERT bidirectional language encoder (Devlin et al., 2018), based on the highly scalable
Transformer architecture (Vaswani et al., 2017).

Our work achieves Pareto improvements versus the dense baseline using both structured and unstructured DynSparse training of BERT.



1.1 CONTRIBUTIONS

_Investigating dynamic always-sparse training for BERT pre-training._ We adapt the DynSparse training algorithm to BERT pre-training (Section 2.1). In particular, we find that gradient-based re-allocation (Evci et al., 2019) results in a collapse of the explored network parameters (Figure 11), which we mitigate through the use of random parameter re-allocation.

_Achieving scalable, FLOP-efficient dynamic sparse training._ We compare dense and sparse methods for a given FLOPs budget and demonstrate both algorithmic scalability and a Pareto improvement on the FLOPs scale, as shown in Figure 3.

_Adapting dynamic always-sparse training to block structures._ We extend unstructured DynSparse training towards block-sparse structure (Section 3.2). In particular, we find that the choice of metric during block pruning has a strong influence on the task performance, as shown in Figure 7.

_Pareto improvements for structured DynSparse training._ We show that the resulting structured DynSparse training of BERT without structured regularization gives Pareto improvements compared to the dense BERT baseline, as shown in Figure 1.

In the following section, we report the results of explorative experiments conducted to motivate our
study of DynSparse training of BERT. The rest of the paper then concentrates on DynSparse training,
with methodology discussed in Section 2 and results presented in Section 3.

1.2 IDENTIFYING SUITABLE ANGLE OF ATTACKS FOR SPARSE PRE-TRAINING

Sparse training of unsupervised language models is relatively under-explored, compared to sparse
training of the supervised fine-tuning objective (Radiya-Dixit & Wang, 2020; Chen et al., 2020; Sanh
et al., 2020). Consequently, we design two explorative experiments to assess whether DynSparse
training is a suitable sparse training algorithm for pre-training of BERT.

Firstly, we analyze the importance of trainable parameters by keeping a random pattern of weights
non-trainable (constant non-zero) or zero-valued throughout training. This experiment allows us to
disentangle the role of ’zero’ vs. ’untrainable’ weights in the connectivity patterns, to shed light
on the parameter dependence of BERT pre-training. Like zeroed weights, the constant weights
are unable to encode new information about the task. Still, they might promote the propagation of
signals or gradient-flow through the network, which has been considered a core aspect of some sparse
training algorithms in vision (Evci et al., 2020; Lubana & Dick, 2021; Tessera et al., 2021). Non-zero
parameters might also lead to better utilization of the remaining trainable parameters. However, as
shown in Figure 1(a), we find that none of these effects plays a relevant role in the training dynamics,
since the task performance of the network with sparsified weights (dashed orange line) matches
the one with the same fraction of untrained weights (solid blue line). Unlike vision models, which are often based on convolutions, the transformer architecture contains large dense matrices and multiplicative interactions (Jayakumar et al., 2019). While zeroing parameters is not expected to affect
the training dynamics, the performance remains bounded by the number of trainable parameters.
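
To make the setup concrete, the following is a minimal NumPy sketch of the two variants of this experiment (zeroed vs. untrained weights); all names are illustrative and not taken from our training code.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)) * 0.02      # randomly initialized weights
fixed = rng.random(w.shape) < 0.9               # random 90% subset kept fixed

# Variant "zero": the selected weights are set to zero and never trained.
w_zero = np.where(fixed, 0.0, w)

# Variant "untrained": the selected weights keep their random initial values
# but receive no gradient updates; only the remaining 10% are trained.
def masked_update(w, grad, lr):
    return w - lr * np.where(fixed, 0.0, grad)
```

In both variants the same fraction of weights is trainable; the only difference is whether the fixed weights carry zero or constant non-zero values.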

Secondly, we would like to analyze the effect of sparsification at different stages of the training process.
For this purpose, we keep a random subset of the network parameters untrainable in the first half of
the pre-training, before making them trainable in the second half (and vice versa). Unlike magnitude
pruning, which eliminates learned information, freezing and unfreezing parameters ensures symmetry
between the different phases (ignoring the linearly decaying learning rate schedule). The agreement
in the task performance towards the end of training in Figure 1(b) indicates that representation is
_continuously built up during training, with no particular effect of when the sparsification is applied._
This lack of preference is interesting, given that pre-training has been found to lead to a reduction of
the intrinsic dimension (Li et al., 2018) with respect to downstream tasks (Aghajanyan et al., 2020).
Our results suggest that sparse pre-training does not necessarily profit from an initial dense training
phase. Therefore, we can distribute computation in a way that is both algorithmically and computationally
beneficial. In DynSparse training, the network representation is "always-sparse", i.e. it does not rely
on the representation of the underlying dense network, making the approach suited for sparse training
of large language modeling architectures. Consequently, we believe that BERT pre-training is well
suited for DynSparse training.


[Figure 1: panel (a) plots validation loss against the fraction of trainable parameters; panel (b) plots training loss against the number of sequences (×10^8).]
Figure 1: (a) MLM validation loss of BERT-Small with a random subset of parameters set to zero (solid blue curve) or kept untrained (dashed orange). (b) Training loss curves of BERT-Small during pre-training of 10 epochs (757k steps), fixing a random subset of the parameters either early (orange dashed) or late (blue dash-dotted) during training, as well as for the dense baseline (solid black). The vertical line indicates the unfreeze (freeze) event, where untrainable parameters are made trainable (or trainable parameters are frozen). We pick the best learning rate for each experiment using a grid search over 0.000087 · 2^m with m = 0, 1, ..., 5, given in Table A.15.

2 METHODOLOGY


Throughout this work, we study the self-supervised pre-training objective from the original BERT model (Devlin et al., 2018), which consists of the Masked Language Model (MLM) loss, corresponding to the task performance in predicting a random subset of masked tokens, and the noisier Next Sentence Prediction (NSP) loss for binarized next-sentence prediction. We focus on a single phase of pre-training with a sequence length of 128, using the Adam optimizer. All hyperparameters are given in Appendix A for a training length of 10 epochs.

2.1 ADAPTING UNSTRUCTURED DYNSPARSE ALGORITHM TO BERT


Figure 2: Schematic illustration of the pruning and re-allocation steps in a typical DynSparse training algorithm, leading to an evolution
of the network representation in parameter space. The dynamic
evolution of the sparsity pattern allows the DynSparse training
algorithm to explore a larger fraction of the network parameters
compared to static sparsity, while remaining "always sparse". For
unstructured DynSparse training, the granularity of the sparsity
pattern is of block size 1×1, while for structured DynSparse
training, the block size is chosen between 4×4, 8×8 and 16×16.

In the present work, we first study and adapt the unstructured DynSparse training schematically
shown in Figure 2 to pre-training of the BERT language models. Specifically, we initialize the
sparsity pattern randomly with the same fixed sparsity ratio on all fully connected encoder weights
(non-embedding weights). The weights are initialized using a truncated normal distribution (see
also Figure 9). During an update step of DynSparse training (see Algorithm 1), we use magnitude pruning to remove a time-dependent fraction pr(t) of the network parameters. The same fraction of
parameters is re-allocated elsewhere in the weight tensor. To complete the sparsity update step, all
newly allocated parameters and their corresponding first and second-order moments of the Adam
optimizer are initialized to zero. Given that DynSparse training has been primarily developed for
vision architectures (Dettmers & Zettlemoyer, 2019; Evci et al., 2019) and did not show competitive
performance on the language tasks, we find it necessary to reassess some of the algorithm choices for
BERT. In particular, during the re-allocation step of DynSparse training, we use random re-allocation
of pruned weights instead of gradient-based techniques as in RigL (Evci et al., 2019). For one, this
avoids potential issues from a collapse of the explored parameter space (compare Figure 11). More
importantly, the absence of dense gradient computation makes our approach always-sparse, such that
the full dense model is never actually instantiated. We found that the cosine decay of the pruning ratio introduced in Evci et al. (2019) outperforms constant pruning schedules and leads to a reduction
of the changes in network topology during training. We refer to the maximum pruning ratio pr simply
as "pruning ratio" throughout the paper. All DynSparse hyperparameters are optimized for a sparsity
ratio of 0.9 (for more details, refer to Appendix A.1).
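
As an illustration of this update step, the sketch below shows one unstructured DynSparse pattern update in NumPy: magnitude pruning of a fraction pr(t) of the active weights, random re-allocation of the same fraction, and zero-initialization of the newly allocated parameters together with their Adam moments. The function name and signature are ours, not those of the DynSparse library.

```python
import numpy as np

def dynsparse_update(w, m, v, mask, prune_frac, rng):
    """One DynSparse sparsity-pattern update (illustrative sketch).

    w     : weight tensor, zero outside the current sparsity pattern
    m, v  : first- and second-order Adam moments, same shape as w
    mask  : boolean tensor marking the currently active parameters
    prune_frac : fraction pr(t) of active parameters to prune and re-allocate
    """
    active = np.flatnonzero(mask)
    n_update = int(prune_frac * active.size)

    # Magnitude pruning: drop the active weights with the smallest |w|.
    pruned = active[np.argsort(np.abs(w.flat[active]))[:n_update]]
    mask.flat[pruned] = False

    # Random re-allocation among the currently inactive positions.
    grown = rng.choice(np.flatnonzero(~mask), size=n_update, replace=False)
    mask.flat[grown] = True

    # Pruned and newly allocated parameters, as well as their Adam moments,
    # are (re-)initialized to zero.
    for t in (w, m, v):
        t.flat[pruned] = 0.0
        t.flat[grown] = 0.0
    return w, m, v, mask
```

Note that in this simple sketch a just-pruned position can in principle be re-selected for growth; at high sparsity the random re-allocation makes this unlikely.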

2.2 BLOCK-SPARSE DYNSPARSE ALGORITHM

Extending unstructured DynSparse training towards structured sparse computation requires modifications to both the prune and update steps in Figure 2. Magnitude pruning can be justified as a simple compression algorithm resulting in unstructured sparsity. However, there is no unique way to extend the magnitude pruning metric to blocks of parameters. Choosing a good metric for block pruning is essential, as magnitude pruning has been surprisingly successful in preserving the task performance of sparse networks (Gale et al., 2019). In the following, we evaluate a selection of norms consisting of the L^∞-norm, the L^2-norm and the L^1-norm as different criteria for estimating the importance of blocks of parameters. For a block of weights B = W_{r,s | (r,s)∈B} taken from a weight tensor W indexed by B, the L^p-norm is given by

L^p(B) = ( Σ_{i,j} |B_{i,j}|^p )^{1/p},    (1)

where the exponent p, e.g. p = ∞, 2, 1, controls the relative importance of individual parameters of a block according to their magnitude. In the limit of block size 1×1, block pruning according to Eq. (1) reduces to magnitude pruning, allowing us to investigate the task performance with increasing block sizes. For small values of p → 0, each parameter in the block contributes equally towards the importance of the block, since |B_{i,j}|^p → 1, while for large values of p the importance of the block collapses towards the parameter with the largest magnitude, with L^{p→∞}(B) → max(|B|). Therefore, the pruning metric for blocks controls the extent to which the magnitude of each of the parameters in a block contributes to the importance of the block itself.
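
As a concrete illustration, the block scores of Eq. (1) can be computed for a 2-D weight tensor by reshaping it into B×B tiles; the sketch below uses NumPy, assumes the tensor dimensions are divisible by the block size, and the helper name is ours.

```python
import numpy as np

def block_lp_norm(w, block, p):
    """L^p norm of every block x block tile of a 2-D weight tensor, Eq. (1)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # Group the weights into (rows/block, cols/block) tiles of block*block entries.
    tiles = np.abs(
        w.reshape(rows // block, block, cols // block, block)
         .transpose(0, 2, 1, 3)
         .reshape(rows // block, cols // block, -1)
    )
    if np.isinf(p):
        return tiles.max(axis=-1)            # L^inf: largest magnitude in the block
    return (tiles ** p).sum(axis=-1) ** (1.0 / p)

# Example: score the 16x16 blocks of a 1024x1024 weight matrix.
w = np.random.default_rng(0).standard_normal((1024, 1024))
l1_scores = block_lp_norm(w, block=16, p=1)        # sum of magnitudes per block
linf_scores = block_lp_norm(w, block=16, p=np.inf)
```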

**Algorithm 1: Structured DynSparse training**
**Input:** total number of training steps T, total number of sparsity updates n − 1, pruning ratio pr(t) at time t, sparsity ratio s, block size B;
**Initialize:** Impose a random block-sparse pattern on the non-embedding weights with a uniform, constant sparsity ratio across all layers. Initialize the weights sampled from a truncated normal distribution;
**for** k = 1 **to** n **do**
  Train the network with a static sparsity pattern for T/n steps;
  For each weight tensor:
    **prune** the fraction pr(t) of blocks with the smallest L^p-norm defined in Eq. (1);
    **re-allocate** the same fraction pr(t) of blocks; re-allocated parameters and their first- and second-order Adam moments are initialized to zero;
**end**


**Structured regularization** Group Lasso regularization is commonly used as a structured sparsity-inducing regularizer (Wen et al., 2017; Narang et al., 2017). We introduce the Group Lasso regularization in the update ∆W of a weight tensor W following the decoupling of weight decay and the Adam optimizer from Loshchilov & Hutter (2017). More specifically, the entry (i, j) of the parameter update ∆W_ij is adjusted to ∆W_ij^reg as

∆W_ij^reg = ∆W_ij − lr(t) · λ_group · w_std · √B · W_ij / ( Σ_{(r,s)∈B(i,j)} W_{r,s}^2 + ϵ )^{1/2},    (2)

where B(i, j) denotes the set of weight indices that belong to the same block as the (i, j)-th weight element and lr(t) is the linearly decaying learning rate (see Appendix A). The remaining coefficients are the Group Lasso coefficient λ_group, the extra pre-factors w_std = 0.02 and √B, and the small constant ϵ = 10^−6 for numerical stability. The pre-factors, corresponding to the standard deviation of the weights at initialization (see Appendix A) and the square root of the block size respectively, are chosen such as to ensure that the regularization coefficients of weight decay and Group Lasso are comparable in magnitude.
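
The sketch below applies this decoupled Group Lasso adjustment to a 2-D weight tensor partitioned into B×B blocks, following the reconstruction of Eq. (2) above; the names and the exact call signature are ours.

```python
import numpy as np

def group_lasso_adjust(delta_w, w, block, lr, lam_group, w_std=0.02, eps=1e-6):
    """Adjust the update delta_w as in Eq. (2), decoupled from the Adam step."""
    rows, cols = w.shape
    # Sum of squared weights of each block, expanded back to the weight shape.
    block_sq = (w ** 2).reshape(rows // block, block, cols // block, block).sum(axis=(1, 3))
    block_sq = np.repeat(np.repeat(block_sq, block, axis=0), block, axis=1)
    # Each weight is shrunk towards zero in proportion to W_ij / ||block||_2,
    # scaled by the learning rate, lambda_group, w_std and sqrt(B).
    penalty = lr * lam_group * w_std * np.sqrt(block) * w / np.sqrt(block_sq + eps)
    return delta_w - penalty
```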

2.3 PARETO CURVE ASSESSMENT OF FLOPS EFFICIENCY

A recent review by Hoefler et al. (2021) pointed out the need for a rigorous framework for comparing
sparse training algorithms. In the present work, we introduce a methodology for comparing the
sparse task performance on the full BERT-family Pareto curve (Turc et al., 2019), beyond the Same Capacity Sparse vs. Dense Comparison approach introduced by Tessera et al. (2021). Comparing different algorithms using a Pareto curve allows us to perform a multi-objective assessment under competing constraints, e.g., the desire to use little compute and achieve high task performance.
This multi-objective assessment is particularly useful for assessing the generality and scalability of
different training algorithms. Furthermore, the use of Pareto curves allows us to systematically assess
algorithmic differences by comparing DynSparse training with dense and static baselines on an equal
FLOPs budget.

Choosing optimal learning rates for sparse and dense models of various sparsity ratios and model
sizes is essential to ensure a fair comparison of different methods. Naive grid search optimization of
the hyperparameters for a full Pareto investigation quickly becomes intractable. To mitigate this, we
have identified and tested the applicability of scaling rules for learning rates across model sizes and
sparsity ratios.

The dense and sparse BERT-family learning rates are obtained from a grid search over 0.0001 · 2^m with m = 0, 1, ..., 5, shown in Figure 12 (see Appendix A.4). Interestingly, the results indicate that
for a given number of parameters, the optimal learning rate of the sparse model is significantly larger
than the learning rates of dense models (Figure 13). To reduce the number of hyperparameter sweeps
for large model sizes, we generalize the learning rate scaling with sparsity s as

η(s) = η(s = 0) · exp(1.969 s^2 + 0.2905 s),    (3)

where η(s = 0) is the optimal learning rate obtained for a dense model of a given size. We tested the
prediction of the unstructured static sparse learning rate fit from Eq. (3) using DynSparse training
with block sparsity 16×16 across both model sizes and sparsity ratio, and obtained good agreement
between the predicted optimal sparse learning rate obtained from this rule and values obtained through
a grid search, as shown in Figure 13. We also found that the learning rate rule generalizes from static
to unstructured DynSparse, as shown in Table A.14. Identifying the mechanism allowing sparse
models to profit from larger learning rates than dense models with the same number of parameters
(see Figure 13) is left as an exciting direction for future research.
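
For reference, applying the scaling rule of Eq. (3) is a one-liner; the function name below is ours.

```python
import math

def sparse_lr(dense_lr, sparsity):
    """Scale the optimal dense learning rate to sparsity ratio s, following Eq. (3)."""
    return dense_lr * math.exp(1.969 * sparsity ** 2 + 0.2905 * sparsity)

# Example: a dense model tuned at lr = 1e-4, trained at sparsity s = 0.9.
print(sparse_lr(1e-4, 0.9))  # approx. 6.4e-4
```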

3 RESULTS

3.1 ADAPTING DYNSPARSE TRAINING TO BERT

In order to establish a general improvement of the dynamic sparse training algorithm in both FLOPs and memory, we study dynamic sparse training of the BERT family across multiple model sizes.
We analyze the scaling behavior of our DynSparse models with model size (for a fixed sparsity ratio)
and sparsity ratio (for a fixed model size).

We find that the DynSparse training algorithm with random re-allocation and sparsity s = 0.9 leads
to Pareto improvements compared to the dense BERT-family (see Figure 3). The improvements
of DynSparse training over the dense baseline remain for a range of model sizes, indicating that
DynSparse training can achieve more efficient utilization of FLOPs or network parameters at any
scale. Furthermore, we find that these performance advantages are due to the continued updates of
the sparsity pattern. We do not observe any improvements of the static baseline in FLOPs efficiency
of larger models when the randomly initialized sparsity pattern is kept constant. In fact, for large
model sizes, static sparsity almost perfectly matches the dense baseline. This indicates that the
sparse network architecture itself brings no performance advantages. Any improvements are therefore
expected to arise from continuous compression and evolution of the network representation. For DynSparse BERT-Base, we achieve a reduction of FLOPs for the same MLM loss by a factor of 0.48 compared to an interpolation of the dense BERT family, as indicated by the horizontal black arrow in Figure 3.


Figure 3: Pareto curve of the BERT family (Turc et al., 2019), comparing the validation MLM loss of unstructured DynSparse training (orange dotted line) with static sparsity (solid blue line) and the dense baseline (black dashed line; the standard deviation is not visible at this scale) as a function of FLOPs. All sparsity results are obtained for pre-training with sparsity ratio 0.9, n = 160 pattern updates, and the optimal pruning ratio pr = 0.5 (see Figure 5). The black arrow indicates a reduction of FLOPs for the same MLM loss by a factor of 0.48.

Figure 4: Comparing the validation MLM loss of DynSparse training of BERT-Medium with various sparsity ratios (indicated by color and marker style and joined by the orange dotted line) with dense training of the BERT family (black dashed line) as a function of non-embedding FLOPs. For all sparsity ratios, we use the hyperparameters optimized for sparsity ratio 0.9.


We observe task performance improvements across a range of sparsity ratios (see Figure 4). However,
since the results used hyperparameters tuned for sparsity 0.9, performance for other sparsity ratios
could potentially be further improved with additional tuning. In sum, we find that DynSparse training
leads to more efficient utilization of parameters and FLOPs for all model sizes.



Figure 5: Characterization of the DynSparse pre-training of BERT-Medium with sparsity ratio 0.9. All layer-wise averages shown correspond to the maximum value obtained during training. (a) MLM loss as a function of the fraction of explored network parameters (DOF) for a varying number of sparsity pattern updates n. (b) MLM loss as a function of the ratio of removed, new weights for a varying pruning ratio pr. (c) Joint effect of the pruning ratio pr (solid line) on the ratio of removed, new weights and the DOF covered during DynSparse training. The best performing values (n = 160, pr = 0.5) from (a) are marked by a circle.

To improve our understanding of the sparse training dynamics, we extract measures that can help
to explain the efficiency of specific hyperparameter choices (see Appendix A.1). Given that the
DynSparse task performance advantage arises from the continual update of the sparsity pattern, we
begin by quantifying the amount of parameter exploration. While the DynSparse models have only a
tiny fraction of parameters available at any given time, the pattern update means that they can explore
all network parameters throughout training and thus increase the effective weight space. We measure
the effectively covered space by tracking the fraction of network weights of the corresponding dense
network that have been activated at any point during the training, and compare it with the parameter count of the equivalent dense network to obtain the total explored degrees of freedom (DOF)[1].


[1] A similar quantity has been independently studied in Liu et al. (2021b) as "in-time over-parametrization."


Figure 6: MLM validation loss of DynSparse BERT-Medium with sparsity s = 0.9 and block size B = 16 as a function of the regularization coefficient for Group Lasso regularization (solid blue) or weight decay (orange dashed). The error bars correspond to the standard deviation over three runs. Number of updates n = 80, pruning ratio pr = 0.5.



Figure 7: Block metric dependence of DynSparse training of BERT-Medium with sparsity s = 0.9 and block size B = 16. The confidence interval is estimated by calculating the standard deviation over three data points; the numerical values are given in Table A.7.


All quantities shown in the following correspond to averages taken over all layers, as we did not observe a systematic layer dependence of these quantities.

The total explored degrees of freedom increases monotonically during training, starting at the fraction of non-zero parameters at the beginning of training and then saturating at an algorithm- and hyperparameter-dependent value toward the end of training (see Figure 11 for a typical shape). We observe that the
maximal number of explored DOF can be controlled through the pruning ratio pr and the number of
sparsity pattern updates n (Figure 5). An increase in the update frequency leads to a simultaneous
saturation in both task performance and the number of explored degrees of freedom (Figure 5(a)). On
the other hand, the pruning ratio pr reaches an optimal value and strongly influences the performance
with a different fraction of removed, new weights (Figure 5(b)). Notably, we find that the best pruning
ratios are reached once the ratio of DOF approaches 1, corresponding to almost complete exploration
of all network parameters (Figure 5(c)). Further increases in pr remove trainable weights that have
just been initialized in the previous update step and lead to a deterioration in the task performance.
Overall, we note that the best task performance is obtained by balancing the DOF while avoiding
wasted compute in the form of parameters that are being allocated and immediately removed (as
demonstrated in Figure 5). Given these findings, we postulate that ideal training outcomes require an exploration of all available parameters as well as only a moderate amount of noise injection.
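
The explored DOF described above can be tracked with a simple per-layer accumulator over the boolean sparsity masks; a minimal sketch (class name ours):

```python
import numpy as np

class DOFTracker:
    """Track the fraction of dense-network weights activated at any point."""

    def __init__(self, shape):
        self.ever_active = np.zeros(shape, dtype=bool)

    def update(self, mask):
        # Accumulate every position that has been part of the sparsity
        # pattern at some point during training.
        self.ever_active |= mask

    @property
    def dof(self):
        # Explored degrees of freedom relative to the parameter count of
        # the equivalent dense weight tensor.
        return self.ever_active.mean()
```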

3.2 BLOCK-SPARSE DYNSPARSE TRAINING


In structured DynSparse training, the block pruning is done according to the L^p-norms from Eq. (1) of the respective blocks of parameters. For the L^p-norms studied (p = 1, 2, ∞), as shown in Figure 7, we obtain the best performance using the L^1-norm, which corresponds to the sum of the parameter magnitudes. Moreover, all block parameters contribute toward the block's importance, given that the L^1-norm outperforms other norms that assign larger importance to the dominating weights in a block.

Next, we evaluate the use of structured regularization applied to sparse weights during DynSparse
training with block size 16×16. To compare potential advantages of structured regularization against an unstructured regularization method, we have also evaluated the task performance when tuning the weight decay coefficient instead of the Group Lasso coefficient. As shown in Figure 6,
we obtain the best task performance using weight decay. The regularization coefficients are only
tuned for the sparse, non-embedding weights. Other sources of unstructured regularization such as
dropout (and in the case of Group Lasso also weight decay) are set to zero. While our results are in
agreement with the competitiveness of block pruning versus the Group Lasso experiments in Narang
et al. (2017), we have not tested more advanced regularization methods (Yang et al., 2019; Mummadi
et al., 2019). We find that the structured regularization does not lead to any performance advantages
over tuning weight decay.


Table 1: Task performance of DynSparse training of BERT-Base with sparsity s = 0.9 for various block sizes B, compared to dense BERT-Small with a similar number of FLOPs and to a linear interpolation of baseline values ("Matched") with exactly the same number of FLOPs. Hyperparameters are not specifically tuned for B = 16 (number of updates n = 80, pruning ratio pr = 0.5). See Appendix Table A.8 for the block size dependence. The standard deviation is estimated over three runs.

| Model | B | MLM | FLOPs |
|---|---|---|---|
| Small (dense) | – | 2.310 ± 0.002 | 10.07 · 10^9 |
| Matched (dense) | – | 2.350 | 8.33 · 10^9 |
| Base (s = 0.9) | 16 | 2.311 ± 0.01 | 8.33 · 10^9 |
| Base (s = 0.9) | 8 | 2.295 ± 0.002 | 8.33 · 10^9 |
| Base (s = 0.9) | 4 | 2.272 ± 0.01 | 8.33 · 10^9 |
| Base (s = 0.9) | 1 | 2.160 ± 0.002 | 8.33 · 10^9 |

Figure 8: Block size dependence of the reduction in FLOPs for DynSparse training compared to an interpolation of the dense BERT family for a given task performance. Values correspond to the block-sparse DynSparse training of BERT-Base given in Table 1.


Next, we compare structured DynSparse training against the baseline using both horizontal and vertical slices of a Pareto plot. For a vertical slice, e.g., a constant-FLOPs comparison, we demonstrate that DynSparse training can preserve some of its task performance advantages when block sparsity of size 4×4, 8×8, and 16×16 is used (Table 1). For a horizontal slice (see the horizontal arrow in Figure 3), measuring the reduction in FLOPs for a given task performance, we achieve a reduction factor between 0.5 for unstructured sparsity and 0.83 for block size B = 16, as shown in Figure 8. The inverse of this FLOPs reduction gives the maximum relative per-FLOP execution time of sparse compared to dense computation that still preserves a Pareto improvement in terms of wall-clock time; it needs to be below a factor of 2 for unstructured sparsity and below 1.2 for block sparsity (see Appendix A.3). This compute efficiency makes DynSparse training promising for practical applications that seek to further benefit from the higher computational efficiency of block computation.

4 RELATED WORK


**Lottery ticket and pruning at initialization** Weight sparsity has traditionally been viewed as a technique for compressing the network representation, leading to reduced FLOPs and fewer trainable parameters. The lottery ticket hypothesis (Frankle & Carbin, 2019; Frankle et al., 2020) postulates that through iterative pruning and re-initialization, it is often possible to identify smaller subnetworks at initialization or early on during training that can be trained to the full model performance of a large over-parametrized network. Since then, there has been a significant amount of work studying techniques
for identifying sparsity distributions at initialization (Lee et al., 2019; Wang et al., 2020; Tanaka et al.,
2020; Zhang & Stadie, 2020; Lee et al., 2020; Frankle et al., 2021; Su et al., 2020). Recently, the
identification of lottery tickets early during training has allowed time-to-train savings after a short, dense training phase, by pruning attention heads and neurons early during the pre-training phase (Chen et al., 2021).

**Dynamic sparsity** In DynSparse (Mocanu et al., 2018; Bellec et al., 2017; Liu et al., 2019; Mostafa
& Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2019; Liu et al., 2021a), the sparse connectivity pattern is evolved during training. Most DynSparse algorithms currently rely on magnitude
pruning to remove unimportant network parameters. However, the algorithms show large differences
in the exact re-allocation criteria, which range from random re-allocation (Bellec et al., 2017; Mocanu et al., 2018; Liu et al., 2019; 2021a) to a directed evolution based on momentum (Dettmers &
Zettlemoyer, 2019) or gradients (Evci et al., 2019).

**Compute-efficient sparse training** Complementary to viewing weight sparsity as a compression
technique of dense networks, sparsity allows increasing network dimensions, potentially resulting in
an augmentation of the effective model capacity for a given amount of compute and memory (Gray
et al., 2017). However, most investigations into sparse training currently impose algorithmic constraints through the use of pre-defined sparsity patterns (Vooturi et al., 2020; Zhou et al., 2021), coarse-grained sparsity structures (Gray et al., 2017), or even result in increased compute and memory compared to dense training through the use of masking.

In the present work, we contribute towards compute-efficient training from an algorithmic point
of view, by extending DynSparse training towards structure. Additionally, we leverage the 2nd
generation of Graphcore’s Intelligence Processing Unit (IPU) (Graphcore, 2021) to dynamically train
large, structured DynSparse models using Graphcore’s DynSparse library[2].

**Structured sparsity** Simple unstructured sparse training algorithms based on the magnitude pruning heuristic have shown a remarkable ability to preserve the task performance of over-parametrized
neural networks (Gale et al., 2019). Nevertheless, on the execution side, unconstrained magnitude
pruning results in unstructured sparsity patterns, which remain challenging to support on traditional
hardware accelerators (Narang et al., 2017). Using coarser-grained sparsity structures resulting in
contiguous memory access can mitigate this problem. Nevertheless, the resulting gains in execution
efficiency are often achieved at the cost of a deterioration in task performance (Narang et al., 2017;
Mostafa & Wang, 2019). Approaches to improve the task performance of structured sparsity during
training range from structured regularization (Wen et al., 2017; Narang et al., 2017; Yang et al.,
2019; Mummadi et al., 2019; Louizos et al., 2018), threshold-based pruning using representation
based on block sparsity (Narang et al., 2017), network slimming (Liu et al., 2017) and low-rank
factorization (Wang et al., 2019) to frequently changing sparsity patterns with granularity varying from blocks (Hadifar et al., 2020) and channels (Gao et al., 2018) to full layers (Fan et al., 2020).

**Conditional sparsity and models with large number of parameters** Unlike dynamic sparsity,
conditional sparsity does not reduce the number of trainable parameters that define the model. The
task performance of semi-supervised language models generally improves with model size under
appropriate scaling of total computation time and dataset size (Kaplan et al., 2020). In conditional
sparse training (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Lewis et al., 2021),
activations are dynamically routed to subsets of the network weights distributed over a large number
of hardware accelerators. Conditional sparse training leverages increases in the number of network
parameters to improve the task performance for a constant FLOPs budget (Fedus et al., 2021).

5 CONCLUSION & FUTURE WORK

In this work, we demonstrated that DynSparse training of BERT leads to a more FLOP-efficient utilization of the trainable parameters. Our experimental work has focused on BERT MLM pre-training with sequence length 128, and further research is needed to evaluate the performance of pre-training with larger sequence lengths and of fine-tuning on downstream tasks.

An important direction stems from the practical opportunity to translate the FLOPs savings into
reduced cost of training. Remarkably, we found that even a naive block-sparse version of the
DynSparse algorithm remains FLOP-Pareto efficient, which forms the first step towards more
compute-efficient training of large-scale language models. However, further task performance
improvements are necessary to fully translate the task performance advantages into a time-to-train win on the Pareto curve. In particular, it will be important to shed further light on the conditions
that enable the performance gains in unsupervised training, particularly the relationship between the
number of available parameters and achievable task performance.

REFERENCES

Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic Dimensionality Explains the
Effectiveness of Language Model Fine-Tuning. arXiv Prepr. arXiv2012.13255, 2020.

Brian R. Bartoldson, Ari S. Morcos, Adrian Barbu, and Gordon Erlebacher. The Generalization-Stability Tradeoff in Neural Network Pruning. ICLR, 2019.

Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep Rewiring: Training
very sparse deep networks. 6th Int. Conf. Learn. Represent. ICLR - Conf. Track Proc., 2017.

[2] https://github.com/graphcore/examples/tree/master/applications/tensorflow/dynamic_sparsity



Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and
Michael Carbin. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. NeurIPS, 2020.

Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, and Jingjing Liu. EarlyBERT:
Efficient BERT Training via Early-bird Lottery Tickets. ACL-IJCNLP, 2021.

Tim Dettmers and Luke Zettlemoyer. Sparse Networks from Scratch: Faster Training without Losing
Performance. arXiv Prepr. arXiv1907.04840, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. NAACL-HLT (1), pp. 4171–4186, 2018.

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the Lottery:
Making All Tickets Winners. 37th Int. Conf. Mach. Learn. ICML, pp. 2923–2933, 2019.

Utku Evci, Yani A. Ioannou, Cem Keskin, and Yann Dauphin. Gradient Flow in Sparse Neural
Networks and How Lottery Tickets Win. arXiv Prepr. arXiv2010.03533, 2020.

Angela Fan, Edouard Grave, and Armand Joulin. Reducing Transformer Depth on Demand with
Structured Dropout. ICLR, 2020.

William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter
Models with Simple and Efficient Sparsity. arXiv Prepr. arXiv2101.03961, 2021.

Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable
Neural Networks. ICLR, 2019.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear Mode
Connectivity and the Lottery Ticket Hypothesis. ICML, 2020.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Pruning Neural
Networks at Initialization: Why are We Missing the Mark? ICLR, 2021.

Trevor Gale, Erich Elsen, and Sara Hooker. The State of Sparsity in Deep Neural Networks. arXiv
_Prepr. arXiv1902.09574, 2019._

Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic Channel
Pruning: Feature Boosting and Suppression. 7th Int. Conf. Learn. Represent. ICLR, 2018.

[Graphcore. Graphcore Homepage, 2021. URL https://www.graphcore.ai/.](https://www.graphcore.ai/)

Scott Gray, Alec Radford, and Diederik P Kingma. GPU Kernels for Block-Sparse Weights, 2017.
[URL https://openai.com/blog/block-sparse-gpu-kernels/.](https://openai.com/blog/block-sparse-gpu-kernels/)

Amir Hadifar, Johannes Deleu, Chris Develder, and Thomas Demeester. Block-wise Dynamic
Sparseness. arXiv Prepr. arXiv2001.04686, 2020.

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in Deep
Learning: Pruning and growth for efficient inference and training in neural networks. arXiv Prepr.
_arXiv2102.00554, 2021._

Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative Interactions and Where to Find Them. ICLR, 2019.

Siddhant M. Jayakumar, Razvan Pascanu, Jack W. Rae, Simon Osindero, and Erich Elsen. Top-KAST:
Top-K Always Sparse Training. NeurIPS, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language
Models. arXiv Prepr. arXiv2001.08361, 2020.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: Single-shot Network Pruning
based on Connection Sensitivity. ICLR, 2019.



Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A Signal Propagation
Perspective for Pruning Neural Networks at Initialization. ICLR, 2020.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang,
Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional
Computation and Automatic Sharding. arXiv Prepr. arXiv2006.16668, 2020.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE Layers:
Simplifying Training of Large, Sparse Models. arXiv Prepr. arXiv2103.16716, 2021.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the Intrinsic Dimension
of Objective Landscapes. 6th Int. Conf. Learn. Represent. ICLR - Conf. Track Proc., 2018.

Shiwei Liu, Decebal Constantin Mocanu, Amarsagar Reddy Ramapuram Matavalam, Yulong Pei, and
Mykola Pechenizkiy. Sparse evolutionary Deep Learning with over one million artificial neurons
on commodity hardware. arXiv Prepr. arXiv1901.09181, 2019.

Shiwei Liu, Decebal Constantin Mocanu, Yulong Pei, and Mykola Pechenizkiy. Selfish Sparse RNN
Training. Proc. 38th Int. Conf. Mach. Learn., 2021a.

Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do We Actually Need
Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. arXiv Prepr.
_arXiv2102.02887, 2021b._

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning
Efficient Convolutional Networks through Network Slimming. In Proc. IEEE Int. Conf. Comput.
_Vis., volume 2017-Octob, pp. 2755–2763, 2017._

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. 7th Int. Conf. Learn.
_Represent. ICLR, 2017._

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning Sparse Neural Networks through
L_0 Regularization. ICLR, 2018.

Ekdeep Singh Lubana and Robert P. Dick. A Gradient Flow Framework For Analyzing Network
Pruning. ICLR, 2021.

Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu,
and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity
inspired by network science. Nat. Commun., 9(1):2383, 2018.

Hesham Mostafa and Xin Wang. Parameter Efficient Training of Deep Convolutional Neural Networks
by Dynamic Sparse Reparameterization. ICML, 2019.

Chaithanya Kumar Mummadi, Tim Genewein, Dan Zhang, Thomas Brox, and Volker Fischer. Group
Pruning using a Bounded-Lp norm for Group Gating and Regularization. Ger. Conf. Pattern
_Recognit., 2019._

Sharan Narang, Eric Undersander, and Gregory Diamos. Block-Sparse Recurrent Neural Networks.
_arXiv Prepr. arXiv1711.02782, 2017._

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild,
David So, Maud Texier, and Jeff Dean. Carbon Emissions and Large Neural Network Training.
_arXiv Prepr. arXiv2104.10350, 2021._

Alexandra Peste, Eugenia Iofinova, Adrian Vladu, and Dan Alistarh. AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks. arXiv Prepr. arXiv2106.12379, 2021.

Evani Radiya-Dixit and Xin Wang. How fine can fine-tuning be? Learning efficient language models.
_arXiv Prepr., arXiv 2004.14129, 2020._

Victor Sanh, Thomas Wolf, and Alexander M. Rush. Movement Pruning: Adaptive Sparsity by
Fine-Tuning. NeuIPS, 2020.



Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and
Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
_5th Int. Conf. Learn. Represent. ICLR - Conf. Track Proc., 2017._

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for
Deep Learning in NLP. ACL 2019 - 57th Annu. Meet. Assoc. Comput. Linguist. Proc. Conf., pp.
3645–3650, 2019.

Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D. Lee.
Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. NeurIPS, 2020.

Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. Pruning neural networks
without any data by iteratively conserving synaptic flow. NeurIPS, 2020.

Kale-ab Tessera, Sara Hooker, and Benjamin Rosman. Keep the Gradients Flowing: Using Gradient
Flow to Study Sparse Network Optimization. arXiv Prepr. arXiv2102.01670, 2021.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-Read Students Learn Better:
On the Importance of Pre-training Compact Models. arXiv Prepr. arXiv1908.08962, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention Is All You Need. Adv. Neural Inf. Process. Syst.
_30, pp. 5998–6008, 2017._

Dharma Teja Vooturi, Girish Varma, and Kishore Kothapalli. Ramanujan Bipartite Graph Products
for Efficient Block Sparse Neural Networks. arXiv Prepr. arXiv2006.13486, 2020.

Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by
preserving gradient flow. arXiv Prepr. arXiv2002.07376, 2020.

Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured Pruning of Large Language Models. Assoc.
_Comput. Linguist., pp. 6151–6162, 2019._

Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran
Chen, and Hai Li. Learning Intrinsic Sparse Structures within Long Short-Term Memory. 6th Int.
Conf. Learn. Represent. ICLR - Conf. Track Proc., 2017.

Huanrui Yang, Wei Wen, and Hai Li. DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures. ICLR, 2019.

Matthew Shunshi Zhang and Bradly C Stadie. One Shot Pruning of Recurrent Neural Networks by
Jacobian spectrum evaluation. ICLR, 2020.

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and
Hongsheng Li. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch.
_ICLR, 2021._



TECHNICAL DETAILS

• Optimizer: Throughout this work we use element-wise optimization based on Adam with
weight decay 0.01, β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁶ × loss-scaling-factor and gradient
clipping, which is known to work well with the sparse gradients found in NLP models.

• Default learning rate schedule: 10000 linear warmup steps up to the maximum
learning rate (0.0002 for BERT-Medium and 0.0001 for BERT-Base), followed by a linear
decay over the full training run; a minimal sketch of this schedule is given after this list.

• Default dropout is 0.1 for all models larger than BERT-Small. To avoid artificial performance
gains through an adjustment of the regularizer in the presence of sparsity-induced
regularization (Bartoldson et al., 2019), we keep dropout in the sparse models identical to
the one used in the corresponding baseline.

• Default floating-point precision: We use datatype FP16.16 (16 bit compute with 16 bit
partials) throughout the model. The second-order moment in the Adam optimizer is computed and stored in FP32. The embedding is kept in FP16. The default loss-scaling factor for
both BERT-Medium and BERT-Base is 512.

• Initialization scheme: The sparsity pattern is initialized randomly. The weights are initialized
using a truncated normal initializer with an initialization range of wstd = 0.02. This
choice is motivated by a comparison of different initializations for the sparse model, in which
the dense default truncated normal gave the best task performance, as shown
in Figure 9. We found that preserving the variance of the activation statistics of the sparse
model relative to the dense model (Evci et al., 2020) does not lead to any performance
gains.


Figure 9: MLM loss vs. learning rate η for BERT-Medium with static unstructured sparsity s = 0.9
imposed on all weights, using the Glorot (blue) or truncated normal (orange) initialization scheme.
The marker shape indicates whether the standard deviation of the weight initialization was increased
(rescale_var).



• Pre-training dataset: Phase I pre-training is performed on Wikipedia and BookCorpus
using Whole Word Masking with a sequence length of 128.

• Code for the DynSparse library used in this work is available on Graphcore's GitHub:
https://github.com/graphcore/examples/tree/master/applications/tensorflow/dynamic_sparsity
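As a concrete reference for the default schedule above, the following is a minimal sketch (plain Python with illustrative function and argument names, not the training code from the repository linked above) of linear warmup to the peak learning rate followed by linear decay over the remaining steps:

```python
def learning_rate(step, total_steps, peak_lr=0.0002, warmup_steps=10000):
    # Linear warmup to peak_lr over warmup_steps, then linear decay to zero
    # over the remainder of the training run (BERT-Medium default peak rate).
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * max(1.0 - (step - warmup_steps) / remaining, 0.0)
```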

A.1 HYPERPARAMETERS SPECIFIC TO DYNAMIC SPARSITY (DYNSPARSE)


Two particularly important hyperparameters for DynSparse training are the sparsity pattern update frequency,
i.e. how often the network topology is modified, and the pruning ratio, which determines the fraction
of the network topology modified at each update step. The sparsity ratio per layer is kept fixed
throughout training; a minimal sketch of a single update step is given below.
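The following NumPy sketch illustrates one sparsity-pattern update under the defaults used in this appendix (magnitude pruning with a cosine-decayed pruning ratio and random re-allocation). The function names, the mask representation, and the exclusion of just-pruned positions from regrowth are illustrative choices, not the exact DynSparse library implementation:

```python
import numpy as np

def cosine_decay(pr0, update_idx, n_updates):
    # Cosine decay of the pruning ratio over the n sparsity-pattern updates.
    return 0.5 * pr0 * (1.0 + np.cos(np.pi * update_idx / n_updates))

def dynsparse_update(weights, mask, pruning_ratio, rng):
    # Prune the smallest-magnitude active weights and randomly re-allocate the
    # same number of connections, so the per-layer sparsity ratio stays fixed.
    active = np.flatnonzero(mask)
    k = int(pruning_ratio * active.size)
    if k == 0:
        return mask
    prune = active[np.argsort(np.abs(weights.flat[active]))[:k]]
    mask.flat[prune] = 0
    weights.flat[prune] = 0.0
    # Re-grow k connections at random among the remaining inactive positions
    # (assumes enough inactive positions are available, e.g. at sparsity 0.9).
    candidates = np.setdiff1d(np.flatnonzero(mask == 0), prune)
    grow = rng.choice(candidates, size=k, replace=False)
    mask.flat[grow] = 1
    return mask

# Example: n = 160 updates starting from pr0 = 0.5 on a small illustrative layer.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 1024))
m = (rng.random(w.shape) < 0.1).astype(np.int8)  # roughly 10% density
for t in range(160):
    m = dynsparse_update(w, m, cosine_decay(0.5, t, 160), rng)
```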

• Update frequency dependence: Comparing the task performance of 20, 40, 80, 160 and
320 updates at sparsity ratio 0.9, we have found that the task performance improves with the
number of sparsity pattern updates (Tables A.2 and A.4). We chose the optimal number of
updates as n = 160 (n = 80) for sparsity ratio 0.9 and block size 1×1 (16×16).

• Extreme sparsity regime: All hyperparameters have been optimized for sparsity 0.9.
However, we tested how well the pruning ratio and update frequency optimized



Table A.2: Number of sparsity pattern updates n dependence of unstructured (1×1)
DynSparse BERT-Medium, η = 0.001397,
sparsity s = 0.9, 10 epochs, phase I (pruning
ratio pr = 0.5 with cosine decay and random
reallocation).

_n_ **MLM loss** **NSP loss**

40 2.468 0.645
80 2.430 0.656
160 **2.409** 0.626
320 2.419 0.649

Table A.4: Number of sparsity pattern updates n dependence of structured (16×16)
DynSparse BERT-Medium, η = 0.001397,
sparsity s = 0.9, 10 epochs, phase I (pruning
ratio pr = 0.5 with cosine decay and random
reallocation).

_n_ **MLM loss** **NSP loss**

40 2.616 0.731
80 2.606 0.650
160 2.633 0.692
320 2.645 0.693


Table A.3: Pruning ratio pr dependence of unstructured (1×1) DynSparse BERT-Medium,
_η = 0.001397, sparsity s = 0.9, 10 epochs,_
phase I (number of updates n = 160 with
cosine decay and random reallocation). Same
hyperparameters as in Table A.2.

_pr_ **MLM loss** **NSP loss**

0.25 2.439 0.655
0.50 2.413 0.684
0.75 2.411 0.668
1.00 2.459 0.698

Table A.5: Pruning ratio pr dependence
of structured (16×16) DynSparse BERT-Medium, η = 0.001397, sparsity s = 0.9, 10
epochs, phase I (number of updates n = 160
with cosine decay and random reallocation).
Same hyperparameters as in Table A.4.

_pr_ **MLM loss** **NSP loss**

0.25 2.648 0.694
0.50 **2.633** 0.692
0.75 2.634 0.745
1.00 2.675 0.701


Table A.6: Pruning ratio pr and number of updates n dependence of unstructured (1×1) DynSparse
BERT-Medium, η = 0.001837, sparsity s = 0.99, 10 epochs, phase I (with cosine decay and random
reallocation).

_s_ _n_ _pr_ **MLM loss** **NSP loss**

0.99 160 0.10 2.999 0.833
0.99 160 0.25 2.939 0.789
0.99 160 0.50 2.889 0.750
0.99 160 0.75 **2.872** 0.775

0.99 80 0.50 2.922 0.842
0.99 160 0.50 2.889 0.750
0.99 320 0.50 2.868 0.772
0.99 640 0.50 2.886 0.791




Figure 10: MLM loss vs pruning ratio pr times
number of sparsity pattern updates n for unstructured DynSparse training of BERT-Medium
with sparsity ratio 0.9 for different values of
(Top panel) pruning ratio pr (with n = 160) and
(Bottom panel) sparsity pattern updates n (with
_pr = 0.5). Same data as in Figure 5._


(Figure 10 x-axis: pr · n, with legends pr ∈ {0.25, 0.5, 0.75, 1.0} at n = 160 and n ∈ {40, 80, 160, 320} at pr = 0.5. Figure 11, left panel: fraction of explored degrees of freedom vs. training iterations, for sparsity ratios 0.5 and 0.9 with static, RigL and random re-allocation.)

Figure 11: (Left panel) Fraction of explored degrees of freedom for static sparsity and unstructured
DynSparse training using gradient-based (RigL) (Evci et al., 2019) vs. random re-allocation (Dettmers
& Zettlemoyer, 2019). (Right panel) Corresponding sparsity patterns for the first up-projection in
the feedforward component ("Boom-up") of the second transformer block, accumulated throughout
training, for sparsity ratio 0.9 using gradient-based (RigL) and random re-allocation. A black
(white) dot corresponds to a parameter being non-zero (zero) at any point during training. The dark
horizontal blocks in the RigL updates indicate a collapse due to outliers along the input dimension,
suggesting that the effect arises from the activation part of the dense gradient update and could be
mitigated by reducing the influence of the activations during the DynSparse training update.


for sparsity 0.9 translate to sparsity 0.99. We have found that increasing the pruning ratio
to pr = 0.75 can lead to small performance gains, as shown in Table A.6.

• Total number of pruned parameters: The pruning ratio pr and the number of updates n
jointly control the total number of pruned and re-allocated parameters, which is proportional
to their product. We obtain an optimal value of this product in terms of task performance,
as shown in Figure 10.

• Re-allocation criteria: We found that random re-allocation outperforms gradient-based re-allocation.
While the pruning criterion leads to a compression of the network topology, the
growing criterion directs the evolution of the network topology and distinguishes DynSparse
training, as a form of neural architecture search during training, from mere gradual pruning
approaches. Understanding the requirements for efficient joint exploration of the
parameter and network topology spaces with DynSparse training will be essential to scale
towards larger language models. In Figure 11, we show that for gradient-based re-allocation
the dense gradient is dominated by outliers in the activations, e.g. along the input dimension
of each layer, which imposes a strong bias on the available degrees of freedom during the
update step. In agreement with this observation, we find that random re-allocation explores
a significantly larger fraction of the network parameters during training, while
gradient-based re-allocation remains constrained to a small subset of all
network parameters (left panel of Figure 11).

• Block size metric: The importance of blocks of parameters is assessed by evaluating the
L1-norm of the corresponding blocks (see Table A.7 and Figure 7); a minimal sketch is given below.
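As an illustration of this block metric, the sketch below scores each B×B block of a weight matrix by its L1-norm (NumPy; it assumes both matrix dimensions are divisible by the block size, and the function name is illustrative):

```python
import numpy as np

def block_importance_l1(weights, block_size=16):
    # Reshape a [rows, cols] weight matrix into (rows/B, B, cols/B, B) blocks
    # and score each block by the L1 norm (sum of absolute values) of its entries.
    rows, cols = weights.shape
    blocks = weights.reshape(rows // block_size, block_size,
                             cols // block_size, block_size)
    return np.abs(blocks).sum(axis=(1, 3))
```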



Table A.8: Task performance of DynSparse
BERT-Medium with sparsity 0.9 for various
block sizes B, compared to dense BERT-Mini
with a similar number of FLOPs and to a linear
interpolation of the baseline values ("Matched")
with exactly the same number of FLOPs. Hyperparameters
are not specifically tuned for the
different block sizes. See also the BERT-Base
results in Table 1.

**Model** _B_ **MLM** **FLOPs**

Mini - 2.614 2.617 · 10⁹
Matched (dense) - 2.603 2.738 · 10⁹
Medium (s = 0.9) 16 2.621 2.738 · 10⁹
Medium (s = 0.9) 8 2.591 2.738 · 10⁹
Medium (s = 0.9) 4 2.546 2.738 · 10⁹
Medium (s = 0.9) 1 **2.408** 2.738 · 10⁹


Table A.7: Task performance of DynSparse
training of BERT-Medium with sparsity 0.9
for block size B = 16 for various block size
metrics.

**Block metric** **MLM loss** **NSP loss**

L2-norm 2.611 0.684
L2-norm 2.623 0.684
L2-norm 2.627 0.664

L∞-norm 2.632 0.686
L∞-norm 2.635 0.720
L∞-norm 2.637 0.670

L1-norm 2.603 0.665
L1-norm 2.606 0.650
L1-norm 2.615 0.731


Table A.9: MLM validation loss of BERT-Small for results given in Figure 1.

**s** **type** _η_ **MLM loss** **NSP loss**

0.25 zero 0.000343 2.390 0.653
0.50 zero 0.000589 2.485 0.687
0.75 zero 0.001011 2.637 0.737
0.90 zero 0.001397 2.829 0.802
0.99 zero 0.001697 3.244 0.907

0.25 untrained 0.000686 2.375 0.681
0.50 untrained 0.001178 2.491 0.675
0.75 untrained 0.002021 2.645 0.731
0.90 untrained 0.002795 2.850 0.829
0.99 untrained 0.003394 3.243 0.827

• Block size dependence: The block size dependence of BERT-Medium with sparsity 0.9 is
given in Table A.8.

• Untrainable vs. zero-valued parameters: Numerical values for the results shown in the left
panel of Figure 1 are given in Table A.9.

A.2 SPARSE FLOPS: FLOPS ESTIMATES FOR SPARSE MULTIPLICATION WITH DENSE INPUT

Throughout this report we assume the FLOPs for training a dense layer with sparse weight elements
to scale approximately as O(3 × 2 × I × B × O × f), where B is the batch dimension, I the
input dimension, O the output dimension, and f the density of the sparsity pattern imposed on the
corresponding dense layer, which is related to the sparsity ratio s by f = 1 − s. The FLOPs estimate can be
divided into the following components (a numeric sketch follows the list):

1. FLOPs estimate for the sparse forward pass: Assuming a sparse matrix M has a sparsity
ratio s, or a density f = 1 − s, the required matrix multiplication for a given dense input x
and output y is

   y_{bi} = Σ_{j | M_{ij} ≠ 0} M_{ij} x_{bj},    (4)

   where M has dimension [O, I] and dim(y) = [B, O], dim(x) = [B, I].

   (a) Sparse multiplication: performing the products z_{bij} = M_{ij} x_{bj} for all i, j with M_{ij} ≠ 0,
   once for each of the B batch elements, reduces the total number of FLOPs by the fraction of
   non-zero elements, leading to B × O × I × f FLOPs.



   (b) Sparse addition: performing the sum Σ_j z_{bij} requires the exact number of non-zeros
   along the input dimension, giving B × O × prob(out) × I × prob(in) − B × O × prob(out) FLOPs,
   where prob(out) and prob(in) denote the probability of non-zero values along the output and
   input dimension, respectively. Assuming a uniform distribution, we estimate the FLOPs count
   to scale approximately linearly with the sparsity ratio, as B × O × I × f − B × O × f / prob(in)
   to first order.

The total FLOPs of sparse multiplication used in the forward pass scales approximately
linearly in the number of non-zeros, i.e. O(2I × B × O × f ).

2. FLOPs estimate for the recursive propagation of the error through the network: This involves a
multiplication of the dense error with the transposed sparse matrix, leading to O(2I × B × O × f)
additional FLOPs.

3. FLOPs estimate for the outer product: The weight update itself is formed by a sparse
outer product, where only the sparse components need to be updated, which leads to a
further reduction in the number of FLOPs that scales linearly with the density of the matrix.
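Putting the three contributions together, the sketch below evaluates the overall estimate O(3 × 2 × I × B × O × f) for a single layer. The layer dimensions and batch size in the example are purely illustrative and not taken from the paper:

```python
def sparse_training_flops(batch, d_in, d_out, sparsity):
    # Forward pass, error back-propagation and weight update each contribute
    # roughly 2 * I * B * O * f FLOPs for a layer with a sparse weight matrix.
    density = 1.0 - sparsity
    return 3 * 2 * d_in * batch * d_out * density

# Illustrative numbers only: one projection with input 512, output 2048,
# micro-batch 128, at sparsity 0.9 (i.e. density 0.1).
print(f"{sparse_training_flops(batch=128, d_in=512, d_out=2048, sparsity=0.9):.3e}")
```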

A.3 RELATING IMPROVEMENTS IN FLOPS EFFICIENCY TO IMPLEMENTATION REQUIREMENTS

To relate the algorithmic improvement in FLOPs efficiency to implementation requirements in a general,
hardware-agnostic way, we consider the sparse extra cost ε, defined as

ε := Δt^sparse / Δt^dense,    (5)

where Δt^i (i ∈ {sparse, dense}) is the average time it takes to execute a FLOP of type i for a
specific model size and sparsity ratio. For a given fixed number of training steps and the same task
performance, DynSparse training with theoretical FLOPs F^sparse (defined in Appendix A.2) is only
faster than dense training with FLOPs F^dense if the time to execute a sparse training step t^sparse is
smaller than the time to execute a dense training step t^dense. Formally:

t^sparse < t^dense,  i.e.  F^sparse Δt^sparse < F^dense Δt^dense.    (6)

In other words, the utilization of fewer but "slower" FLOPs in the sparse model still translates to a
faster execution of the sparse model overall. Note that this comparison is performed at equal task
performance and for the same number of training steps.

Using this formalism, we can view improvements in task performance in the context of the throughput
requirements of a given algorithm, independent of the exact hardware implementation.
Setting t^sparse = t^dense, we can derive the maximum critical extra cost that a DynSparse implementation
can afford before DynSparse training loses its time-to-train advantage over dense computation.
Specifically, for a given fixed number of training steps and the same task performance, the critical cost
factor is given by

ε_critical = F^dense / F^sparse,    (7)

where F^sparse corresponds to the DynSparse training and F^dense to the (interpolated) dense BERT
family. We emphasize that, beyond this requirement for a time-to-train win, sparse training also allows
running models with larger model dimensions for a given number of parameters.
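A minimal sketch of Eq. 7, with purely illustrative FLOPs numbers (not values from the paper):

```python
def critical_extra_cost(dense_flops, sparse_flops):
    # Eq. 7: the largest per-FLOP slowdown a sparse implementation can afford
    # before the time-to-train advantage over dense training disappears.
    return dense_flops / sparse_flops

# Illustrative: if the sparse model needs 10x fewer FLOPs than the (interpolated)
# dense model at equal task performance, each sparse FLOP may be up to 10x
# slower to execute and still break even on time-to-train.
print(critical_extra_cost(dense_flops=1.0, sparse_flops=0.1))  # -> 10.0
```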

A.4 LEARNING RATE FOR SPARSE AND DENSE MODELS

The results of the learning rate sweep of BERT-Medium with various sparsities are given in Table A.10. The
corresponding learning rate sweep for the dense BERT family is given in Table A.11. We confirmed
that the optimal learning rates for static sparsity agree with those for DynSparse training (Table A.14).
We also confirmed that the predicted learning rate dependence of the DynSparse model generalizes to
block sparsity and across multiple model sizes, as given in Table A.12 for 16×16 block sparsity.

A.4.1 LEARNING RATE FOR SPARSE MODELS

In Figure 12, we show the learning rate sweep of the BERT-Medium model with static sparsity, for
various sparsity ratios. We estimate the optimal learning rate for sparse models through the minimum
of a cubic interpolation of the task performance vs learning rates for a given sparsity ratio, as indicated



Table A.10: Learning rate (η) sweep of BERT-Medium with static unstructured sparsity,
s = 0, 0.25, 0.5, 0.75, 0.9.

_η_ **sparsity** **MLM loss** **NSP loss**

0.0001 0.00 2.179 0.610
0.0002 0.00 2.115 0.598
0.0002 0.00 2.115 0.605
0.0004 0.00 2.116 0.606
0.0008 0.00 2.164 0.633

0.0001 0.25 2.278 0.627
0.0002 0.25 2.204 0.642
0.0004 0.25 2.186 0.596
0.0008 0.25 2.223 0.638

0.0001 0.50 2.412 0.679
0.0002 0.50 2.338 0.671
0.0004 0.50 2.283 0.631
0.0008 0.50 2.298 0.648

0.0002 0.75 2.551 0.741
0.0004 0.75 2.483 0.685
0.0008 0.75 2.446 0.671
0.0016 0.75 2.449 0.647
0.0032 0.75 2.547 0.707

0.0004 0.90 2.723 0.758
0.0008 0.90 2.677 0.711
0.0016 0.90 2.648 0.706
0.0032 0.90 2.669 0.697

Table A.11: Learning rate (η) sweep for the dense BERT family consisting of BERT-Tiny, Mini,
Small, Medium and Base.

**model** _η_ **MLM loss** **NSP loss**

Mini 0.000050 3.062 0.839
Mini 0.000100 2.833 0.811
Mini 0.000400 2.625 0.742
Mini 0.000800 2.606 0.775
Mini 0.001600 2.628 0.779
Mini 0.003200 2.665 0.783

Small 0.000200 2.329 0.635
Small 0.000400 2.310 0.621
Small 0.000800 2.326 0.644
Small 0.001600 2.418 0.768

Medium 0.000100 2.179 0.610
Medium 0.000200 2.115 0.605
Medium 0.000400 2.116 0.606
Medium 0.000800 2.164 0.633

Base 0.000025 2.115 0.599
Base 0.000050 1.972 0.569
Base 0.000100 1.878 0.542
Base 0.000200 1.843 0.488

Figure 12: MLM validation loss as a function of learning rate for (Top panel) the dense BERT family
and (Bottom panel) static-sparsity BERT with different sparsity ratios between 0 and 0.9. The solid
lines correspond to a cubic fit for all data with the same sparsity ratio. The minimum of the resulting
fit corresponds to the optimal learning rate for a given sparsity and is indicated by the black triangles
connected by blue lines.





Table A.13: Learning rate sweep of BERT-Small
alternating between dense and sparse training,
with the non-active parameters either non-trainable
or zero-valued, corresponding to sparsity s = 0.9, 10
epochs phase I, for various pruning methods. Optimal
values are given in Table A.16. We switch
n = 160 times between the sparse/non-trainable
parameters and dense training.

**non active** **pruning** _η_ **MLM** **NSP**

non-train fixed 0.0002 2.366 0.671
non-train fixed 0.0004 **2.358** 0.668
non-train fixed 0.0008 7.242 0.693

non-train magnitude 0.0002 2.379 0.658
non-train magnitude 0.0004 **2.354** 0.675
non-train magnitude 0.0008 11.160 0.766

non-train random 0.0001 2.431 0.733
non-train random 0.0002 2.365 0.669
non-train random 0.0004 **2.349** 0.693
non-train random 0.0008 7.272 0.693

zero fixed 2.5e-05 3.317 0.967
zero fixed 5e-05 **3.199** 0.817
zero fixed 0.0001 3.277 0.819
zero fixed 0.0002 3.329 0.884
zero fixed 0.0004 3.358 0.964
zero fixed 0.0008 3.424 0.799

zero magnitude 0.0002 2.746 0.756
zero magnitude 0.0004 **2.685** 0.711
zero magnitude 0.0008 3.056 0.834
zero magnitude 0.0016 6.538 1.217

zero random 0.0001 6.232 1.142
zero random 0.0002 6.132 1.273
zero random 0.0004 **6.094** 1.185
zero random 0.0008 6.284 0.987


Table A.12: Learning rate (η) sweep for
the DynSparse BERT family (BERT-Mini,
Small, Medium and Base) for sparsity 0.9
with block size 16×16.

**model** _η_ **MLM** **NSP**

Base 0.000160 2.520 0.692
Base 0.000320 2.429 0.693
Base 0.000640 2.340 0.647
Base 0.001280 2.328 0.603
Base 0.002560 2.369 0.656

Medium 0.000125 2.878 0.892
Medium 0.000250 2.760 0.720
Medium 0.000500 2.670 0.730
Medium 0.002000 2.640 0.715

Mini 0.001250 3.184 0.882
Mini 0.002500 3.145 0.871
Mini 0.005000 3.147 0.869
Mini 0.005120 3.195 0.907

Small 0.000313 2.927 0.865
Small 0.000625 2.841 0.773
Small 0.001250 2.788 0.861
Small 0.005000 2.826 0.797


Table A.14: Learning rate sweep of DynSparse BERT-Medium, sparsity s = 0.9, 10 epochs phase
I, used to confirm that the optimal learning rates for static sparsity from Table A.10 translate into
optimal learning rates for DynSparse.

_η_ **MLM loss** **NSP loss**

0.00064 2.467 0.647
0.00128 2.410 0.670
0.0026 2.429 0.674
0.0051 2.521 0.654




Figure 13: (Left panel) Fits to the optimal learning rate, estimated as the position of the black
triangles from Figure 12, for BERT-Medium with various sparsities and for the dense BERT family, as
a function of the number of trainable parameters N for various model sizes (indicated by symbol
style and color) and sparsity ratios (colored crosses). The black lines indicate linear fits that are
best approximated by log(η) = −0.8383(±0.05) log(N) + 6.13(±0.7) for the sparse models and
log(η) = −0.44(±0.05) log(N) − 0.47(±0.9) for the dense models. (Right panel) Testing the
prediction of the optimal sparse learning rate from Eq. 3 (marker style "+") on the BERT family with
sparsity 0.9 and block size 16×16 (values given in Table A.12).


Table A.15: Learning rate sweep of the DynSparse BERT-Small unfreeze (freeze) experiment with an
initial (final) fraction of non-trainable parameters of 0.9, 10 epochs phase I.

**type** _η_ **MLM loss** **NSP loss**

freeze 0.000087 2.467 0.715
freeze 0.000175 **2.407** 0.703
freeze 0.000349 2.420 0.685
freeze 0.000699 2.540 0.695

unfreeze 0.000175 2.933 0.666
unfreeze 0.000349 2.598 0.676
unfreeze 0.000699 **2.440** 0.703
unfreeze 0.001397 7.251 0.693
unfreeze 0.002795 7.520 0.784

Table A.16: MLM validation loss of BERT-Small trained by alternating between dense training and
training only a fraction of 0.1 of the non-embedding weights, with the non-trainable parameters
either set to zero or left untrainable without modification. We pick the best learning rate for each
experiment using a grid search over 2.5 · 10⁻⁵ · 2^m with m = 0, 1, 2, ... (Table A.13)
(number of updates n = 160).

_η_ **non-train** **selection** **MLM**

5e-05 zero fixed 3.199
0.0004 zero magnitude **2.685**
0.0004 zero random 6.094

0.0004 untrained fixed 2.358
0.0004 untrained magnitude 2.354
0.0004 untrained random **2.349**


by the triangle markers in Figure 12. We find that the optimal learning rate η calculated from the
interpolation is best approximated by

log(η(s)) ≈ 1.969(±0.2) s² + 0.2905(±0.2) s − 8.175(±0.04)    (8)

as a function of the sparsity s, or equivalently as (see Figure 13)

log(η(N)) ≈ −0.838(±0.05) log(N) + 6.13(±0.73)    (9)

for the number of parameters N. Interestingly, a linear learning rate vs. logarithmic memory fit as
used in Kaplan et al. (2020) (η(N) ≈ 0.003239 − 0.0001395 log(N) from their Eq. (D1)) leads to
qualitatively worse agreement, which might be explained by our optimization for a fixed number of
training steps.
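For convenience, the fits of Eqs. 8 and 9 can be evaluated as in the sketch below (assuming natural logarithms in both equations; the coefficient values are copied from the text, the function names are illustrative):

```python
import numpy as np

def optimal_lr_from_sparsity(s):
    # Eq. 8: fitted optimal learning rate of the sparse model as a
    # function of the sparsity ratio s.
    return np.exp(1.969 * s**2 + 0.2905 * s - 8.175)

def optimal_lr_from_params(n_params):
    # Eq. 9: the same fit expressed through the number of trainable
    # (non-embedding) parameters N.
    return np.exp(-0.838 * np.log(n_params) + 6.13)

print(optimal_lr_from_sparsity(0.9))  # predicted optimal learning rate at s = 0.9
```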

A.5 ROLE OF SELECTION CRITERIA


To understand the role of the magnitude pruning criterion in the DynSparse training dynamics, we have
disentangled the pruning step from the parameter re-allocation step by temporarily replacing the
always-sparse training algorithm with an alternation between dense and sparse training phases (Peste
et al., 2021). The dense training intervals remove the influence of the regrowing selection, as all
network parameters are periodically activated without preference. We have found that magnitude
pruning ("magnitude") outperforms both pruning into a fixed subspace chosen randomly at initialization
("fixed") and pruning into a changing random subspace re-drawn at each update ("random"), as shown in the top
half of Table A.16. The strong performance degradation of the randomly re-drawn sparsity patterns
illustrates the importance of the large-magnitude parameters for preserving task performance.

This picture changes if, instead of setting parameters to zero, we make them non-trainable,
which avoids the information loss associated with the pruning step. In this case, we find
that randomly selecting the subset of trainable parameters outperforms both selecting the
parameters with the largest magnitude ("magnitude") and training a fixed subset of parameters
randomly chosen at initialization ("fixed"), as shown in the bottom part of Table A.16. Our results show
that magnitude pruning gives performance advantages because it preserves information. However, the
increased exploration coming from random parameter selection benefits task performance once it is
no longer accompanied by the information loss due to pruning.
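The following NumPy sketch illustrates the three selection criteria compared in Table A.16; the function name and signature are illustrative, and whether the resulting mask is used to zero parameters or merely to freeze them corresponds to the "zero" vs. "non-trainable" settings discussed above:

```python
import numpy as np

def select_active_mask(weights, density, criterion, fixed_mask=None, rng=None):
    # "magnitude": keep the largest-magnitude weights active.
    # "random":    re-draw a random active subset at every sparse phase.
    # "fixed":     reuse a mask drawn once at initialization (must be provided).
    if criterion == "fixed":
        return fixed_mask
    rng = rng or np.random.default_rng()
    k = int(density * weights.size)
    if criterion == "magnitude":
        idx = np.argsort(np.abs(weights), axis=None)[-k:]
    elif criterion == "random":
        idx = rng.choice(weights.size, size=k, replace=False)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    mask = np.zeros(weights.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(weights.shape)
```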

