# BWCP: PROBABILISTIC LEARNING-TO-PRUNE CHANNELS FOR CONVNETS VIA BATCH WHITENING

**Anonymous authors**
Paper under double-blind review

ABSTRACT

This work presents a probabilistic channel pruning method to accelerate Convolutional Neural Networks (CNNs). Previous pruning methods often zero out
unimportant channels in training in a deterministic manner, which reduces CNN’s
learning capacity and results in suboptimal performance. To address this problem,
we develop a probability-based pruning algorithm, called batch whitening channel
pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated. BWCP has several merits. (1) It
simultaneously trains and prunes CNNs from scratch in a probabilistic way, exploring larger network space than deterministic methods. (2) BWCP is empowered by
the proposed batch whitening tool, which is able to empirically and theoretically
increase the activation probability of useful channels while keeping unimportant
channels unchanged without adding any extra parameters and computational cost
in inference. (3) Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet
with various network architectures show that BWCP outperforms its counterparts
by achieving better accuracy given limited computational budgets. For example,
ResNet50 pruned by BWCP has only 0.58% Top-1 accuracy drop on ImageNet,
while reducing 42.9% FLOPs of the plain ResNet50.

1 INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved superior performance in a variety of
computer vision tasks such as image recognition (He et al., 2016), object detection (Ren et al.,
2017), and semantic segmentation (Chen et al., 2018). However, despite their great success, deep
CNN models often have massive demand on storage, memory bandwidth, and computational power
(Han & Dally, 2018), making them difficult to be plugged onto resource-limited platforms, such as
portable and mobile devices (Deng et al., 2020). Therefore, proposing efficient and effective model
compression methods has become a hot research topic in the deep learning community.

Model pruning, as one of the vital model compression techniques, has been extensively investigated.
It reduces model size and computational cost by removing unnecessary or unimportant weights or
channels in a CNN (Han et al., 2016). For example, many recent works (Wen et al., 2016; Guo et al.,
2016) prune fine-grained weights of filters. Han et al. (2015) proposes to discard the weights that
have magnitude less than a predefined threshold. Guo et al. (2016) further utilizes a sparse mask on
a weight basis to achieve pruning. Although these unstructured pruning methods achieve optimal
pruning schedule, they do not take the structure of CNNs into account, preventing them from being
accelerated on hardware such as GPU for parallel computations (Liu et al., 2018).

To achieve efficient model storage and computations, we focus on structured channel pruning (Wen
et al., 2016; Yang et al., 2019a; Liu et al., 2017), which removes entire structures in a CNN such as
filter or channel. A typical structured channel pruning approach commonly contains three stages,
including pre-training a full model, pruning unimportant channels by the predefined criteria such as
_ℓp norm, and fine-tuning the pruned model (Liu et al., 2017; Luo et al., 2017), as shown in Fig.1 (a)._
However, it is usually hard to find a global pruning threshold to select unimportant channels, because
the norm deviation between channels is often too small (He et al., 2019). More importantly, as some
channels are permanently zeroed out in the pruning stage, such a multi-stage procedure usually not
only relies on hand-crafted heuristics but also limits the learning capacity (He et al., 2018a; 2019).



(Figure 1 diagram: (a) Norm-based method, where channel norms are compared against a threshold to produce 0-1 masks, followed by pruning and fine-tuning; (b) Our proposed BWCP, where channels receive activation probabilities that yield soft masks, batch whitening enlarges the probabilities of useful channels, and channels are then pruned by probability.)

Figure 1: Illustration of our proposed BWCP. (a) Previous channel pruning methods utilize a hard criterion such as the norm of channels (Liu et al., 2017) to deterministically remove unimportant channels, which deteriorates performance and needs an extra fine-tuning process (Frankle & Carbin, 2018). (b) Our proposed BWCP is a probability-based pruning framework where unimportant channels are stochastically pruned according to their activation probability, thus maintaining the learning capacity of the original CNN. In particular, our proposed batch whitening (BW) tool can increase the activation probability of useful channels while keeping the activation probability of unimportant channels unchanged, enabling BWCP to identify unimportant channels reliably.

To tackle the above issues, we propose a simple but effective probability-based channel pruning
framework, named batch-whitening channel pruning (BWCP), where unimportant channels are
pruned in a stochastic manner, thus preserving the channel space of CNNs in training (i.e. the
diversity of CNN architectures is preserved). To be specific, as shown in Fig.1 (b), we assign each
channel with an activation probability (i.e. the probability of a channel being activated), by exploring
the properties of the batch normalization layer (Ioffe & Szegedy, 2015; Arpit et al., 2016). A larger
activation probability indicates that the corresponding channel is more likely to be preserved.

We also introduce a capable tool, termed batch whitening (BW), which can increase the activation
probability of useful channels, while keeping the unnecessary channels unchanged. By doing so,
the deviation of the activation probability between channels is explicitly enlarged, enabling BWCP
to identify unimportant channels during training easily. Such an appealing property is justified by
theoretical analysis and experiments. Furthermore, we exploit activation probability adjusted by
BW to generate a set of differentiable masks by a soft sampling procedure with Gumbel-Softmax
technique, allowing us to train BWCP in an online “pruning-from-scratch” fashion stably. After
training, we obtain the final compact model by directly discarding the channels with zero masks.

The main contributions of this work are three-fold. (1) We propose a probability-based channel
pruning framework BWCP, which explores a larger network space than deterministic methods. (2)
BWCP can easily identify unimportant channels by adjusting their activation probabilities without
adding any extra model parameters and computational cost in inference. (3) Extensive experiments on
CIFAR-10, CIFAR-100 and ImageNet datasets with various network architectures show that BWCP
can achieve better recognition performance than existing approaches given a comparable amount of resources. For example, BWCP reduces 68.08% of FLOPs and compresses 93.12% of the parameters of VGG-16 with a negligible accuracy drop, and ResNet-50 pruned by BWCP has only a 0.58% top-1 accuracy drop on ImageNet while reducing 42.9% of FLOPs.

2 RELATED WORK

**Weight Pruning. Early network pruning methods mainly remove the unimportant weights in the**
network. For instance, Optimal Brain Damage (LeCun et al., 1990) measures the importance of
weights by evaluating the impact of weight on the loss function and prunes less important ones.
However, it is not applicable to modern network structures due to the heavy computation of the Hessian
matrix. Recent works instead assess the importance of weights through their magnitude. Specifically,
Guo et al. (2016) prune the network by encouraging weights to become exactly zero; the computation
involving zero-valued weights can then be discarded. However, a major drawback of
weight pruning techniques is that they do not take the structure of CNNs into account, thus failing to
help scale pruned models on commodity hardware such as GPUs (Liu et al., 2018; Wen et al., 2016).

**Channel Pruning. Channel pruning approaches directly prune feature maps or filters of CNNs,**
making them amenable to hardware-friendly implementation. For instance, relaxed ℓ0 regularization (Louizos et al., 2017)



and the group regularizer (Yang et al., 2019a) impose channel-level sparsity, and filters with small values are selected to be pruned. Some recent works also propose to rank the importance of filters by different criteria, including the ℓ1 norm (Liu et al., 2017; Li et al., 2017), the ℓ2 norm (Frankle & Carbin, 2018), and high-rank channels (Lin et al., 2020). For example, Liu et al. (2017) explore the importance of filters through the scale parameter γ in batch normalization. Although these approaches introduce minimal overhead to the training process, they are not trained in an end-to-end manner and usually either operate on a pre-trained model or require an extra fine-tuning procedure.

Recent works tackle this issue by pruning CNNs from scratch. For example, FPGM (He et al., 2019) zeroes out unimportant channels and continues training them after each training epoch. Furthermore, both SSS (Huang & Wang, 2018) and DSA (Ning et al., 2020) learn a differentiable binary mask generated from channel importance and do not require any additional fine-tuning. Our proposed BWCP is most related to Variational Pruning (Zhao et al., 2019) and SCP (Kang & Han, 2020), as they also exploit the properties of the normalization layer and associate the importance of a channel with a probability. The main difference is that our method adopts the idea of whitening to perform channel pruning. We will show that the proposed batch whitening (BW) technique can adjust the activation probability of different channels according to their importance, making it easy to identify unimportant channels. Although previous works SPP (Wang et al., 2017) and DynamicCP (Gao et al., 2018) also attempt to boost salient channels and skip unimportant ones, they do not exploit the natural properties of the normalization layer and design the activation probability empirically.

3 PRELIMINARY

**Notation. We use regular letters to denote scalars such as ‘x’, bold letters to denote vectors (including vectors,**
matrices, and tensors) such as ‘**x**’, and capital letters to denote random variables such as ‘X’.

We begin with introducing a building layer in recent deep neural nets which typically consists of
a convolution layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) (Ioffe &
Szegedy, 2015; He et al., 2016). Formally, it can be written by

$$\mathbf{x}_c = \mathbf{w}_c * \mathbf{z}, \qquad \tilde{\mathbf{x}}_c = \gamma_c \bar{\mathbf{x}}_c + \beta_c, \qquad \mathbf{y}_c = \max\{0, \tilde{\mathbf{x}}_c\} \qquad (1)$$

where $c \in [C]$ denotes the channel index and $C$ is the channel size. In Eqn.(1), ‘$*$’ indicates the convolution operation and $\mathbf{w}_c$ is the filter weight corresponding to the $c$-th output channel, i.e., $\mathbf{x}_c \in \mathbb{R}^{N \times H \times W}$. To perform normalization, $\mathbf{x}_c$ is first standardized to $\bar{\mathbf{x}}_c$ through $\bar{\mathbf{x}}_c = (\mathbf{x}_c - \mathrm{E}[\mathbf{x}_c])/\sqrt{\mathrm{D}[\mathbf{x}_c]}$, where $\mathrm{E}[\cdot]$ and $\mathrm{D}[\cdot]$ indicate calculating the mean and variance over a batch of samples, and $\bar{\mathbf{x}}_c$ is then re-scaled to $\tilde{\mathbf{x}}_c$ by the scale parameter $\gamma_c$ and bias $\beta_c$. Moreover, the output feature $\mathbf{y}_c$ is obtained by the ReLU activation, which discards the negative part of $\tilde{\mathbf{x}}_c$.

**Criterion-based channel pruning. For channel pruning, previous methods usually employ a ‘small-**
norm-less-important’ criterion to measure the importance of channels. For example, BN layer can
be applied in channel pruning (Liu et al., 2017), where a channel with a small value of γc would
be removed. The reason is that the c-th output channel ˜xc contributes little to the learned feature
representation when γc is small. Hence, the convolution in Eqn.(1) can be discarded safely, and filter
**wc can thus be pruned. Unlike these criterion-based methods that deterministically prune unimportant**
filters and rely on a heuristic pruning procedure as shown in Fig.1(a), we explore a probability-based
channel pruning framework where less important channels are pruned in a stochastic manner.

**Activation probability. To this end, we define an activation probability of a channel by exploring the**
property of the BN layer. Those channels with a larger activation probability could be preserved with
a higher probability. To be specific, since $\bar{\mathbf{x}}_c$ is acquired by subtracting the sample mean and dividing by the sample standard deviation, we can treat each channel feature as a random variable following a standard Normal distribution (Arpit et al., 2016), denoted as $\bar{X}_c$. Note that only the positive part can be activated by the ReLU function. Proposition 1 gives the activation probability of the $c$-th channel, i.e., $P(\tilde{X}_c > 0)$.

**Proposition 1** *Let a random variable $\bar{X}_c \sim \mathcal{N}(0, 1)$ and $Y_c = \max\{0, \gamma_c \bar{X}_c + \beta_c\}$. Then we have (1) $P(Y_c > 0) = P(\tilde{X}_c > 0) = \big(1 + \mathrm{Erf}\big(\beta_c/(\sqrt{2}\,|\gamma_c|)\big)\big)/2$, where $\mathrm{Erf}(x) = \int_0^x \frac{2}{\sqrt{\pi}} e^{-t^2}\, dt$, and (2) $P(\tilde{X}_c > 0) = 0 \Leftrightarrow \beta_c \leq 0$ and $\gamma_c \to 0$.*
Note that a pruned channel can be modelled by $P(\tilde{X}_c > 0) = 0$. With Proposition 1 (see proof in Appendix A.2), we know that the unnecessary channels satisfy that γc approaches 0 and βc is negative.


(Figure 2 diagram: BN → BW (covariance Σ and Newton iteration for Σ^{-1/2}) → soft gating module (activation probability P(x̂ > 0) → Gumbel-Softmax → soft mask) → ReLU.)

Figure 2: A schematic of the proposed Batch Whitening Channel Pruning (BWCP) algorithm that
consists of a BW module and a soft sampling procedure. By modifying BN layer with a whitening
operator, the proposed BW technique adjusts activation probabilities of different channels. These
activation probabilities are then utilized by a soft sampling procedure.

To achieve channel pruning, previous compression techniques (Li et al., 2017; Zhao et al., 2019) merely impose a regularization on γc, which deteriorates the representation power of unpruned channels (Perez et al., 2018; Wang et al., 2020). Instead, we adopt the idea of whitening to build a probabilistic channel pruning framework where unnecessary channels are stochastically discarded with a small activation probability while important channels are preserved with a large activation probability.
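To make Proposition 1 concrete, the sketch below (our own illustration, not code from the paper) computes the activation probability for every channel directly from the parameters of a standard `torch.nn.BatchNorm2d` layer; the toy γ and β values are hypothetical.

```python
import torch

def activation_probability(bn: torch.nn.BatchNorm2d, eps: float = 1e-8) -> torch.Tensor:
    """Proposition 1: P(X_tilde_c > 0) = (1 + erf(beta_c / (sqrt(2) * |gamma_c|))) / 2."""
    gamma, beta = bn.weight.detach(), bn.bias.detach()
    return 0.5 * (1.0 + torch.erf(beta / (2.0 ** 0.5 * gamma.abs() + eps)))

# Toy check: a channel whose gamma -> 0 and beta <= 0 gets probability close to 0.
bn = torch.nn.BatchNorm2d(4)
with torch.no_grad():
    bn.weight.copy_(torch.tensor([1.3, 0.9, 0.3, 0.01]))
    bn.bias.copy_(torch.tensor([0.5, 0.1, -0.2, -0.3]))
print(activation_probability(bn))
```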

4 BATCH WHITENING CHANNEL PRUNING

This section introduces the proposed batch whitening channel pruning (BWCP) algorithm, which
contains a batch whitening module that can adjust the activation probability of channels, and a soft
sampling module that stochastically prunes channels with the activation probability adjusted by BW.
The whole pipeline of BWCP is illustrated in Fig.2.

By modifying the BN layer in Eqn.(1), we have the formulation of BWCP,
$$\mathbf{x}_c^{\text{out}} = \underbrace{\hat{\mathbf{x}}_c}_{\text{batch whitening}} \odot \underbrace{m_c\big(P(\hat{X}_c > 0)\big)}_{\text{soft sampling}} \qquad (2)$$

where $\mathbf{x}_c^{\text{out}}, \hat{\mathbf{x}}_c \in \mathbb{R}^{N \times H \times W}$ denote the output of the proposed BWCP algorithm and of the BW module, respectively, and ‘$\odot$’ denotes broadcast multiplication. $m_c \in [0, 1]$ denotes a soft mask produced by a soft sampling procedure that takes the activation probability of the output features of BW (i.e., $P(\hat{X}_c > 0)$). The closer the activation probability is to 0 or 1, the more likely the mask is to be hard. To distinguish important channels from unimportant ones, BW is proposed to increase the activation probability of useful channels while keeping the probability of unimportant channels unchanged during training. Since Eqn.(2) always retains all channels in the network, BWCP preserves the learning capacity of the original network during training (He et al., 2018a). The following sections present the BW and soft sampling modules in detail.
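As a concrete illustration of Eqn.(2), the hypothetical snippet below applies a per-channel soft mask to whitened feature maps via broadcast multiplication; `x_hat` and `mask` stand in for the BW output and the soft-sampling output defined in the following sections.

```python
import torch

def bwcp_output(x_hat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eqn.(2): x_out[:, c] = x_hat[:, c] * m_c, broadcast over (N, H, W).

    x_hat: (N, C, H, W) output of the BW module.
    mask:  (C,) soft mask in [0, 1], one value per channel.
    """
    return x_hat * mask.view(1, -1, 1, 1)

# Toy usage: 4 images, 8 channels, 16x16 feature maps.
x_hat = torch.randn(4, 8, 16, 16)
mask = torch.rand(8)          # placeholder for m_c(P(X_hat_c > 0))
x_out = bwcp_output(x_hat, mask)
print(x_out.shape)            # torch.Size([4, 8, 16, 16])
```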

4.1 BATCH WHITENING

Unlike previous works (Zhao et al., 2019; Kang & Han, 2020) that simply measure the importance of channels by the parameters in the BN layer, we whiten the features after the BN layer with the proposed BW module. We show that BW can change the activation probability of channels according to their importance without adding additional parameters or computational overhead in inference.

As shown in Fig.2, BW acts after the BN layer. By rewriting Eqn.(1) into a vector form, we have the
formulation of BW,
$$\hat{\mathbf{x}}_{nij} = \Sigma^{-\frac{1}{2}} (\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij} + \boldsymbol{\beta}) \qquad (3)$$

where $\hat{\mathbf{x}}_{nij} \in \mathbb{R}^{C \times 1}$ is a vector whose elements denote the output of BW for the $n$-th sample at location $(i, j)$ for all channels, $\Sigma^{-\frac{1}{2}}$ is a whitening operator, and $\Sigma \in \mathbb{R}^{C \times C}$ is the covariance matrix of the channel features $\{\tilde{\mathbf{x}}_c\}_{c=1}^{C}$. Moreover, $\boldsymbol{\gamma} \in \mathbb{R}^{C \times 1}$ and $\boldsymbol{\beta} \in \mathbb{R}^{C \times 1}$ are two vectors obtained by stacking $\gamma_c$ and $\beta_c$ of all channels, respectively, and $\bar{\mathbf{x}}_{nij} \in \mathbb{R}^{C \times 1}$ is a vector obtained by stacking the elements $\bar{x}_{ncij}$ from all channels into a column vector.

**Training and inference. Note that BW in Eqn.(3) requires computing a root inverse of a covariance**
matrix of channel features after the BN layer. Towards this end, we calculate the covariance matrix Σ
within a batch of samples during each training step as given by



$$\Sigma = \frac{1}{NHW} \sum_{n,i,j=1}^{N,H,W} (\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij})(\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij})^{\mathsf{T}} = (\boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}}) \odot \boldsymbol{\rho} \qquad (4)$$


where $\boldsymbol{\rho}$ is the $C \times C$ correlation matrix of the channel features $\{\bar{\mathbf{x}}_c\}_{c=1}^{C}$ (see details in Appendix A.1). The Newton iteration is further employed to calculate its root inverse $\Sigma^{-\frac{1}{2}}$, as given by the following iterations:

$$\Sigma_k = \frac{1}{2}\big(3\Sigma_{k-1} - \Sigma_{k-1}^{3}\Sigma\big), \quad k = 1, 2, \cdots, T \qquad (5)$$

where $k$ and $T$ are the iteration index and the number of iterations, respectively, and $\Sigma_0 = I$ is the identity matrix. Note that when $\|I - \Sigma\|_2 < 1$, Eqn.(5) converges to $\Sigma^{-\frac{1}{2}}$ (Bini et al., 2005). To satisfy this condition, $\Sigma$ can be normalized by $\Sigma/\mathrm{tr}(\Sigma)$ following Huang et al. (2019), where $\mathrm{tr}(\cdot)$ is the trace operator. In this way, the normalized covariance matrix can be written as $\Sigma_N = \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}} \odot \boldsymbol{\rho} / \|\boldsymbol{\gamma}\|_2^2$.
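The following is a minimal sketch of Eqns.(4)-(5) under our reading of the text: build the normalized covariance $\Sigma_N = \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}} \odot \boldsymbol{\rho} / \|\boldsymbol{\gamma}\|_2^2$ from BN-standardized features, then run the Newton iteration to approximate its inverse square root. The function and variable names are ours, not from a released implementation.

```python
import torch

def normalized_covariance(x_bar: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """Sigma_N = (gamma gamma^T) * rho / ||gamma||_2^2, with rho the correlation of x_bar."""
    n, c, h, w = x_bar.shape
    flat = x_bar.permute(1, 0, 2, 3).reshape(c, -1)       # (C, N*H*W), already standardized by BN
    rho = flat @ flat.t() / flat.shape[1]                 # correlation matrix of standardized features
    return torch.outer(gamma, gamma) * rho / gamma.pow(2).sum()

def newton_inverse_sqrt(sigma: torch.Tensor, num_iters: int = 5) -> torch.Tensor:
    """Eqn.(5): Sigma_k = 0.5 * (3 * Sigma_{k-1} - Sigma_{k-1}^3 @ Sigma), with Sigma_0 = I."""
    s_k = torch.eye(sigma.shape[0], dtype=sigma.dtype)
    for _ in range(num_iters):
        s_k = 0.5 * (3.0 * s_k - torch.matrix_power(s_k, 3) @ sigma)
    return s_k                                            # approximates Sigma^{-1/2}

x_bar = torch.randn(8, 4, 16, 16)                         # pretend BN-standardized features
gamma = torch.tensor([1.3, 0.9, 0.3, 0.1])
sigma_n = normalized_covariance(x_bar, gamma)
root_inv = newton_inverse_sqrt(sigma_n)
print(root_inv @ sigma_n @ root_inv)                      # closer to the identity as num_iters grows
```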

During inference, we use a moving average to calculate the population estimate of $\hat{\Sigma}_N^{-\frac{1}{2}}$ following the update rule $\hat{\Sigma}_N^{-\frac{1}{2}} = (1 - g)\,\hat{\Sigma}_N^{-\frac{1}{2}} + g\,\Sigma_N^{-\frac{1}{2}}$, where $\Sigma_N$ is the covariance calculated within each mini-batch at each training step and $g$ denotes the momentum of the moving average. Since $\hat{\Sigma}_N^{-\frac{1}{2}}$ is fixed during inference, the proposed BW does not introduce extra memory or computation cost, because $\hat{\Sigma}_N^{-\frac{1}{2}}$ can be viewed as a convolution kernel of size 1 that can be absorbed into the previous convolutional layer. For completeness, we also analyze the training overhead of BWCP in Appendix Sec.A.3, where we see that BWCP introduces only a small amount of extra training overhead.
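The claim that $\hat{\Sigma}_N^{-\frac{1}{2}}$ adds no inference cost can be made concrete by folding it, together with the BN affine transform, into the preceding convolution. The sketch below is our own derivation under standard BN-folding assumptions (inference-time running statistics, a preceding convolution with `groups=1` and default dilation); `fold_bw_into_conv` and its arguments are hypothetical names, not an API from the paper, and removal of masked-out channels is left out for brevity.

```python
import torch

def fold_bw_into_conv(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d,
                      sigma_root_inv: torch.Tensor) -> torch.nn.Conv2d:
    """Merge conv -> BN -> BW (Eqn.(3)) into a single conv for inference.

    BW output per location: x_hat = S (gamma * (x - mu)/std + beta), S = Sigma_hat^{-1/2}.
    This is a channel-mixing linear map A = S diag(gamma/std) plus bias b = S (beta - gamma*mu/std).
    """
    gamma, beta = bn.weight, bn.bias
    std = (bn.running_var + bn.eps).sqrt()
    a = sigma_root_inv * (gamma / std)                    # A[c, d] = S[c, d] * gamma_d / std_d
    b = sigma_root_inv @ (beta - gamma * bn.running_mean / std)

    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        # W'[c] = sum_d A[c, d] * W[d]: mix the output filters of the original kernel.
        fused.weight.copy_(torch.einsum('cd,dikl->cikl', a, conv.weight))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(a @ conv_bias + b)
    return fused
```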

4.2 ANALYSIS OF BWCP

In this section, we show that BWCP can easily identify unimportant channels by increasing the
difference of activation between important and unimportant channels.

**Proposition 2** *Let a random variable $\bar{X} \sim \mathcal{N}(0, 1)$ and $Y_c = \max\{0, [\hat{\Sigma}_N^{-\frac{1}{2}}(\boldsymbol{\gamma} \odot \bar{X} + \boldsymbol{\beta})]_c\}$. Then we have $P(Y_c > \delta) = P(\hat{X}_c > \delta) = \big(1 + \mathrm{Erf}\big((\hat{\beta}_c - \delta)/(\sqrt{2}\,|\hat{\gamma}_c|)\big)\big)/2$, where $\delta$ is a small positive constant, and $\hat{\gamma}_c$ and $\hat{\beta}_c$ are the equivalent scale parameter and bias defined by the BW module. Taking $T = 1$ in Eqn.(5) as an example, we have $\hat{\gamma}_c = \frac{1}{2}\big(3\gamma_c - \sum_{d=1}^{C} \gamma_d^2 \gamma_c \rho_{dc} / \|\boldsymbol{\gamma}\|_2^2\big)$ and $\hat{\beta}_c = \frac{1}{2}\big(3\beta_c - \sum_{d=1}^{C} \beta_d \gamma_d \gamma_c \rho_{dc} / \|\boldsymbol{\gamma}\|_2^2\big)$, where $\rho_{dc}$ is the Pearson correlation between the channel features $\bar{\mathbf{x}}_c$ and $\bar{\mathbf{x}}_d$.*

By Proposition 2, BWCP can adjust the activation probability by changing the values of γc and βc in Proposition 1 through the BW module (see details in Appendix A.4). Here we introduce a small positive constant δ to avoid small activation feature values. To see how BW changes the activation probability of different channels, we consider two cases, as shown in Proposition 3. **Case 1: βc ≤ 0 and γc → 0.** In this case, the c-th channel of the BN layer would be activated with a small activation probability, as it sufficiently approaches zero. We can see from Proposition 3 that the activation probability of the c-th channel still approaches zero after BW is applied, showing that the proposed BW module keeps unimportant channels unchanged in this case. **Case 2: |γc| > 0.** In this case, the c-th channel of the BN layer would be activated with a high activation probability. From Proposition 3, the activation probability of the c-th channel is enlarged after BW is applied. Therefore, our proposed BW module can increase the activation probability of important channels. A detailed proof of Proposition 3 can be found in Appendix A.5. We also empirically verify Proposition 3 in Sec. 5.3. Note that we neglect a trivial case in which the channel can also be activated (i.e., βc > 0 and |γc| → 0). In fact, the channel can be removed in this case because its feature is always constant, which can be deemed a bias.
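To illustrate Propositions 2 and 3 numerically, the hypothetical snippet below computes the equivalent parameters $\hat{\gamma}_c$ and $\hat{\beta}_c$ for $T = 1$ and the corresponding activation probabilities; the toy γ, β, and correlation matrix are ours, and note that the comparison in Proposition 3 is stated at a small threshold δ rather than at zero.

```python
import torch

def act_prob(gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """P(X > 0) = (1 + erf(beta / (sqrt(2) * |gamma|))) / 2 (Propositions 1 and 2)."""
    return 0.5 * (1.0 + torch.erf(beta / (2.0 ** 0.5 * gamma.abs() + eps)))

def equivalent_params(gamma: torch.Tensor, beta: torch.Tensor, rho: torch.Tensor):
    """Proposition 2 with T = 1:
    gamma_hat_c = 0.5 * (3*gamma_c - sum_d gamma_d^2 gamma_c rho_dc / ||gamma||_2^2)
    beta_hat_c  = 0.5 * (3*beta_c  - sum_d beta_d gamma_d gamma_c rho_dc / ||gamma||_2^2)
    """
    norm_sq = gamma.pow(2).sum()
    gamma_hat = 0.5 * (3.0 * gamma - (gamma.pow(2) @ rho) * gamma / norm_sq)
    beta_hat = 0.5 * (3.0 * beta - ((beta * gamma) @ rho) * gamma / norm_sq)
    return gamma_hat, beta_hat

gamma = torch.tensor([1.3, 0.9, 0.3, 0.01])
beta = torch.tensor([0.5, 0.1, -0.2, -0.3])
rho = torch.eye(4) * 0.8 + 0.2                  # a toy correlation matrix (diagonal 1, off-diagonal 0.2)
gamma_hat, beta_hat = equivalent_params(gamma, beta, rho)
print(act_prob(gamma, beta))                    # activation probabilities of the BN output
print(act_prob(gamma_hat, beta_hat))            # activation probabilities after BW (equivalent parameters)
```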

4.3 SOFT SAMPLING MODULE

The soft sampling procedure samples the output of BW through a set of differentiable masks. To be
specific, as shown in Fig.2, we leverage the Gumbel-Softmax sampling (Jang et al., 2017) that takes
the activation probability generated by BW and produces a soft mask as given by
$$m_c = \mathrm{GumbelSoftmax}\big(P(\hat{X}_c > 0);\, \tau\big) \qquad (6)$$

where τ is the temperature. By Eqn.(2) and Eqn.(6), BWCP stochastically prunes unimportant channels according to their activation probability. A smaller activation probability makes mc more likely to be close to 0.



Hence, our proposed BW can help identify less important channels by enlarging the activation probability of important channels, as mentioned in Sec.4.2. Note that mc converges to a 0-1 mask when τ approaches 0. In experiments, we find that setting τ = 0.5 is enough for BWCP to achieve hard pruning at test time.
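A minimal sketch of the soft sampling step in Eqn.(6), assuming the binary-concrete form of Gumbel-Softmax over the two outcomes {keep, drop}; the names are ours and the exact parameterization used by the authors may differ.

```python
import torch

def gumbel_softmax_mask(p_act: torch.Tensor, tau: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Relaxed binary Gumbel-Softmax sample m_c in [0, 1] driven by P(X_hat_c > 0)."""
    logits = torch.log(p_act + eps) - torch.log(1.0 - p_act + eps)    # log p / (1 - p)
    u = torch.rand_like(p_act)
    noise = torch.log(u + eps) - torch.log(1.0 - u + eps)             # logistic noise = difference of two Gumbels
    return torch.sigmoid((logits + noise) / tau)

p_act = torch.tensor([0.95, 0.46, 0.15, 0.03])
mask = gumbel_softmax_mask(p_act, tau=0.5)    # soft mask; approaches a 0-1 mask as tau -> 0
print(mask)
```

PyTorch's built-in `torch.nn.functional.gumbel_softmax` applied to the two-class logits `[log p, log(1-p)]` would give an equivalent relaxation.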

**Proposition 3** *Let $\delta = \|\boldsymbol{\gamma}\|_2 \sqrt{\sum_{j=1}^{C} (\gamma_j \beta_c - \gamma_c \beta_j)^2 \rho_{cj}^2} \,/\, \big(\|\boldsymbol{\gamma}\|_2^2 - \sum_{j=1}^{C} \gamma_j^2 \rho_{cj}\big)$. With $\hat{\boldsymbol{\gamma}}$ and $\hat{\boldsymbol{\beta}}$ defined in Proposition 2, we have (1) $P(\hat{X}_c > \delta) = 0$ if $|\gamma_c| \to 0$ and $\beta_c \leq 0$, and (2) $P(\hat{X}_c > \delta) \geq P(\tilde{X}_c \geq \delta)$ if $|\gamma_c| > 0$.*
**Solution to the residual issue. Note that the number of channels in the last convolution layer must**
be the same as in previous blocks due to the element-wise summation in recent advanced CNN
architectures (He et al., 2016; Huang et al., 2017). We solve this problem by letting the BW layer of the
last convolution layer and the shortcut share the same mask, as discussed in Appendix A.6.

4.4 TRAINING OF BWCP

This section introduces a sparsity regularization, which makes the model compact, and then describes
the training algorithm of BWCP.

**Sparse Regularization. With Proposition 1, we see that a main characteristic of pruned channels in the BN**
layer is that γc sufficiently approaches 0 and βc is negative. By Proposition 3, we find that this is also a necessary condition for a channel to be pruned after the BW module is applied. Hence, we obtain unnecessary channels by directly imposing a regularization on γc and βc as given by

$$\mathcal{L}_{\mathrm{sparse}} = \sum_{c=1}^{C} \big(\lambda_1 |\gamma_c| + \lambda_2 \beta_c\big) \qquad (7)$$

where the first term makes γc small, and the second term encourages βc to be negative. The above sparse regularizer is imposed on all BN layers of the network. By changing the strength of the regularization (i.e., λ1 and λ2), we can achieve different pruning ratios. In fact, βc and |γc| represent the mean and standard deviation of a Normal distribution, respectively. Following the empirical rule of the Normal distribution, setting λ1 to double or triple λ2 is a good choice for encouraging sparse channels in implementation. Moreover, we observe that 42.2% and 41.3% of channels have βc ≤ 0, while only 0.47% and 5.36% of channels have |γc| < 0.05, on trained plain ResNet-34 and ResNet-50, respectively. Hence, changing the strength of the regularization on γc affects FLOPs more than that on βc. If one wants to pursue a more compact model, increasing λ1 is more effective than increasing λ2.
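As a sketch of Eqn.(7), the hypothetical helper below sums the regularizer over all BN layers of a model; λ1 and λ2 are the strengths discussed above, and the function name is ours.

```python
import torch

def sparse_regularization(model: torch.nn.Module, lambda1: float, lambda2: float):
    """L_sparse = sum over BN layers of sum_c (lambda1 * |gamma_c| + lambda2 * beta_c), as in Eqn.(7)."""
    loss = 0.0
    for module in model.modules():
        if isinstance(module, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            loss = loss + lambda1 * module.weight.abs().sum() + lambda2 * module.bias.sum()
    return loss
```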

**Training Algorithm. BWCP can be easily plugged into a CNN by modifying the traditional BN**
operations. Hence, the training of BWCP can be simply implemented in existing software platforms
such as PyTorch and TensorFlow. In other words, the forward propagation of BWCP can be
represented by Eqn.(2-3) and Eqn.(6), all of which define differentiable transformations. Therefore,
our proposed BWCP can train and prune deep models in an end-to-end manner. Appendix A.7 also
provides the explicit gradient back-propagation of BWCP. On the other hand, we do not introduce
extra parameters to learn the pruning mask mc. Instead, mc in Eqn.(6) is totally determined by the
parameters in BN layers including γ, β and Σ. Hence, we can perform joint training of pruning mask
_mc and model parameters. The full BWCP procedure is provided in Algorithm 1 of Appendix Sec.A.6._
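The end-to-end training described above can be summarized by a conventional training step, sketched below under our assumptions (the soft masks are computed inside the model's modified BN layers, and the `sparse_regularization` helper sketched after Eqn.(7) is reused); the default λ values merely echo the magnitudes used in Table 4 and are not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, images, labels, optimizer, lambda1=1.2e-4, lambda2=0.6e-4):
    """One joint update of the network weights; no extra mask parameters are optimized."""
    optimizer.zero_grad()
    logits = model(images)                                   # forward: BN -> BW -> soft mask -> ReLU
    loss = F.cross_entropy(logits, labels) + sparse_regularization(model, lambda1, lambda2)
    loss.backward()                                          # Eqns.(2)-(3) and (6) are differentiable
    optimizer.step()
    return loss.item()
```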

**Final architecture. The final architecture is fixed at the end of training. During training, we use the**
Gumbel-Softmax procedure of Eqn.(6) to produce a soft mask. At test time, we instead use a hard 0-1 mask obtained by thresholding the activation probability at 0.5 (i.e., mc = sign(P(X̂c > 0) − 0.5)) to compute the network's output. To make the inference stage stable, we use a sigmoid-like transformation to push the activation probability towards 0 or 1 during training. With this strategy, we find that both the training and inference stages are stable and we obtain a fixed compact model. After training, we obtain the final compact model by directly pruning channels whose mask value is 0. Therefore, our proposed BWCP does not need an extra fine-tuning procedure.

5 EXPERIMENTS

In this section, we extensively experiment with the proposed BWCP on CIFAR-10/100 and ImageNet.
We show the advantages of BWCP in both recognition performance and FLOPs reduction comparing
with existing channel pruning methods. We also provide an ablation study to analyze the proposed
framework. The details of datasets and training configurations are provided in Appendix B.



Table 1: Performance comparison between our proposed approach BWCP and other methods on
CIFAR-10. “Baseline Acc.” and “Acc.” denote the accuracies of the original and pruned models,
respectively. “Acc. Drop” means the accuracy of the base model minus that of pruned models (smaller
is better). “Channels ↓”, “Model Size ↓”, and “FLOPs ↓” denote the relative reductions in individual
metrics compared to the unpruned networks (larger is better). ‘*’ indicates the method needs an extra
fine-tuning step to recover performance. The best-performing results are highlighted in bold.

| Model | Method | Baseline Acc. (%) | Acc. (%) | Acc. Drop | Channels ↓ (%) | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-56 | DCP* (Zhuang et al., 2018) | 93.80 | 93.49 | 0.31 | - | 49.24 | 50.25 |
| | AMC* (He et al., 2018b) | 92.80 | 91.90 | 0.90 | - | - | 50.00 |
| | SFP (He et al., 2018a) | 93.59 | 92.26 | 1.33 | 40 | - | **52.60** |
| | FPGM (He et al., 2019) | 93.59 | 92.93 | 0.66 | 40 | - | **52.60** |
| | SCP (Kang & Han, 2020) | 93.69 | 93.23 | 0.46 | **45** | **46.47** | 51.20 |
| | BWCP (Ours) | 93.64 | 93.37 | **0.27** | 40 | 44.42 | 50.35 |
| DenseNet-40 | Slimming* (Liu et al., 2017) | 94.39 | 92.59 | 1.80 | 80 | 73.53 | 68.95 |
| | Variational Pruning (Zhao et al., 2019) | 94.11 | 93.16 | 0.95 | 60 | 59.76 | 44.78 |
| | SCP (Kang & Han, 2020) | 94.39 | 93.77 | 0.62 | 81 | 75.41 | 70.77 |
| | BWCP (Ours) | 94.21 | 93.82 | **0.39** | **82** | **76.03** | **71.72** |
| VGGNet-16 | Slimming* (Liu et al., 2017) | 93.85 | 92.91 | 0.94 | 70 | 87.97 | 48.12 |
| | Variational Pruning (Zhao et al., 2019) | 93.25 | 93.18 | 0.07 | 62 | 73.34 | 39.10 |
| | SCP (Kang & Han, 2020) | 93.85 | 93.79 | 0.06 | 75 | 93.05 | 66.23 |
| | BWCP (Ours) | 93.85 | 93.82 | **0.03** | **76** | **93.12** | **68.08** |
| MobileNet-V2 | DCP* (Zhuang et al., 2018) | 94.47 | 94.69 | -0.22 | - | 23.6 | 27.0 |
| | MDP (Guo et al., 2020) | 95.02 | 95.14 | -0.12 | - | - | 28.7 |
| | BWCP (Ours) | 94.56 | 94.90 | **-0.36** | - | **32.3** | **37.7** |


5.1 RESULTS ON CIFAR-10

For CIFAR-10 dataset, we evaluate our BWCP on ResNet-56, DenseNet-40 and VGG-16 and compare
our approach with Slimming (Liu et al., 2017), Variational Pruning (Zhao et al., 2019) and SCP (Kang
& Han, 2020). These methods prune redundant channels using BN layers like our algorithm. We
also compare BWCP with previous strong baselines such as AMC (He et al., 2018b) and DCP
(Zhuang et al., 2018). The results of slimming are obtained from SCP (Kang & Han, 2020). As
mentioned in Sec.4.2, our BWCP adjusts the activation probability of different channels. Therefore,
it presents better recognition accuracy with comparable computation consumption by fully exploiting important channels. As shown in Table 1, our BWCP achieves the lowest accuracy drop and a comparable FLOPs reduction compared with existing channel pruning methods on all tested base networks. For example, although our model is not fine-tuned, the accuracy drop of the pruned network given by BWCP based on DenseNet-40 and VGG-16 outperforms Slimming with fine-tuning by 1.41 and 0.91 percentage points, respectively. ResNet-56 pruned by BWCP attains better classification accuracy than the previous strong baselines AMC (He et al., 2018b) and DCP (Zhuang et al., 2018) without an extra fine-tuning stage. Besides, our method achieves superior accuracy compared to Variational Pruning, even with significantly smaller model sizes, on DenseNet-40 and VGGNet-16, demonstrating its effectiveness. We also test BWCP with MobileNet-V2 on the CIFAR-10 dataset. From Table 1, we see that BWCP achieves better classification accuracy while reducing more FLOPs. We also report results of BWCP on CIFAR-100 in Appendix B.3.

5.2 RESULTS ON IMAGENET

For the ImageNet dataset, we test our proposed BWCP on two representative base models, ResNet-34 and ResNet-50. The proposed BWCP is compared with SFP (He et al., 2018a), FPGM (He et al., 2019), SSS (Huang & Wang, 2018), SCP (Kang & Han, 2020), HRank (Lin et al., 2020), and DSA (Ning et al., 2020), since they prune channels without an extra fine-tuning stage. As shown in Table 2, BWCP consistently outperforms its counterparts in recognition accuracy under comparable FLOPs. For ResNet-34, FPGM (He et al., 2019) and SFP (He et al., 2018a) without fine-tuning accelerate ResNet-34 with a 41.1% speedup ratio at the cost of 2.13% and 2.09% accuracy drops, respectively, while our BWCP without fine-tuning achieves almost the same speedup ratio with only a 1.16% top-1 accuracy drop. On the other hand, BWCP also significantly outperforms FPGM (He et al., 2019) by 1.07% top-1 accuracy after going through a fine-tuning stage. For ResNet-50, BWCP still achieves better performance than other approaches. For instance, at the level of 40% FLOPs reduction, the top-1 accuracy of BWCP exceeds SSS (Huang & Wang, 2018) by 3.72%. Moreover, BWCP outperforms DSA (Ning et al., 2020) in top-1 accuracy by 0.34% and 0.21% at the levels of 40% and 50% FLOPs reduction, respectively. However, BWCP has slightly lower top-5 accuracy than DSA (Ning et al., 2020).

**Inference Acceleration. We analyze the realistic hardware acceleration in terms of GPU and CPU**
running time during inference. The CPU type is Intel Xeon CPU E5-2682 v4, and the GPU is an NVIDIA GTX 1080Ti.



Table 2: Performance of our proposed BWCP and other pruning methods on ImageNet using base
models ResNet-34 and ResNet-50. ’*’ indicates the pruned model is fine-tuned.

| Model | Method | Baseline Top-1 Acc. (%) | Baseline Top-5 Acc. (%) | Top-1 Acc. Drop | Top-5 Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|
| ResNet-34 | FPGM* (He et al., 2019) | 73.92 | 91.62 | 1.38 | 0.49 | 41.1 |
| | BWCP* (Ours) | 73.72 | 91.64 | **0.31** | **0.34** | **41.0** |
| | SFP (He et al., 2018a) | 73.92 | 91.62 | 2.09 | 1.29 | 41.1 |
| | FPGM (He et al., 2019) | 73.92 | 91.62 | 2.13 | 0.92 | 41.1 |
| | BWCP (Ours) | 73.72 | 91.64 | **1.16** | **0.83** | **41.0** |
| ResNet-50 | FPGM* (He et al., 2019) | 76.15 | 92.87 | 1.32 | 0.55 | **53.5** |
| | BWCP* (Ours) | 76.20 | 93.15 | **0.48** | **0.40** | 51.2 |
| | SSS (Huang & Wang, 2018) | 76.12 | 92.86 | 4.30 | 2.07 | 43.0 |
| | DSA (Ning et al., 2020) | - | - | 0.92 | 0.41 | 40.0 |
| | HRank* (Lin et al., 2020) | 76.15 | 92.87 | 1.17 | 0.64 | **43.7** |
| | ThiNet* (Luo et al., 2017) | 72.88 | 91.14 | 0.84 | 0.47 | 36.8 |
| | BWCP (Ours) | 76.20 | 93.15 | **0.58** | **0.40** | 42.9 |
| | FPGM (He et al., 2019) | 76.15 | 92.87 | 2.02 | 0.93 | 53.5 |
| | SCP (Kang & Han, 2020) | 75.89 | 92.98 | 1.69 | 0.98 | **54.3** |
| | DSA (Ning et al., 2020) | - | - | 1.33 | 0.80 | 50.0 |
| | BWCP (Ours) | 76.20 | 93.15 | **1.02** | **0.60** | 51.2 |


Table 3: Effect of BW, Gumbel-Softmax (GS),
and sparse Regularization in BWCP. The results
are obtained by training ResNet-56 on CIFAR-10
dataset. ‘BL’ denotes baseline model.

| Cases | BW | GS | Reg | Acc. (%) | Model Size ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| BL | | | | 93.64 | - | - |
| (1) | ✓ | | | **94.12** | - | - |
| (2) | | | ✓ | 93.46 | - | - |
| (3) | | ✓ | ✓ | 92.84 | **46.37** | 51.16 |
| (4) | ✓ | ✓ | | 94.10 | 7.78 | 6.25 |
| (5) | ✓ | | ✓ | 92.70 | 45.22 | **51.80** |
| BWCP | ✓ | ✓ | ✓ | 93.37 | 44.42 | 50.35 |


Table 4: Effect of regularization strength λ1 and
_λ2 with magnitude 1e −_ 4 for the sparsity loss in
Eqn.(7). The results are obtained using VGG-16
on CIFAR-100 dataset.

| λ1 | λ2 | Acc. (%) | Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|
| 1.2 | 0.6 | 73.85 | -0.34 | 33.53 |
| 1.2 | 1.2 | 73.66 | -0.15 | 35.92 |
| 1.2 | 2.4 | 73.33 | 0.18 | 54.19 |
| 0.6 | 1.2 | 74.27 | -0.76 | 30.67 |
| 2.4 | 1.2 | 71.73 | 1.78 | 60.75 |


We evaluate the inference time of ResNet-50 with a mini-batch size of 32 on GPU and 1 on CPU. The GPU batch size is larger than the CPU one to emphasize our method's acceleration on highly parallel platforms as a structured pruning method. We see that BWCP achieves a 29.2% inference time reduction on GPU, from 48.7 ms for the base ResNet-50 to 34.5 ms for the pruned ResNet-50, and a 21.2% inference time reduction on CPU, from 127.1 ms for the base ResNet-50 to 100.2 ms for the pruned ResNet-50.

5.3 ABLATION STUDY

**Effect of BWCP on activation probability. From the analysis in Sec. 4.2, we have shown that**
BWCP can increase the activation probability of useful channels while keeping the activation probability of unimportant channels unchanged through the BW technique. Here we demonstrate this using ResNet-34 and ResNet-50 trained on the ImageNet dataset. We calculate the activation probability of channels in the BN and BW layers. It can be seen from Fig.3 (a-d) that (1) BW increases the activation probability of important channels when |γc| > 0; and (2) BW keeps the activation probability of unimportant channels unchanged when βc ≤ 0 and γc → 0. Therefore, BW works by making useful channels more important and unnecessary channels less important. In this way, BWCP can identify unimportant channels reliably.

**Effect of BW, Gumbel-Softmax (GS), and sparse Regularization (Reg). The proposed BWCP**
consists of three components: the BW module (i.e., Eqn. (3)), the soft sampling module with Gumbel-Softmax (i.e., Eqn. (6)), and the sparse regularization (i.e., Eqn. (7)). Here we investigate the effect of each component. To this end, five variants of BWCP are considered: (1) only the BW module is used; (2) only the sparse regularization is imposed; (3) BWCP w/o the BW module; (4) BWCP w/o the sparse regularization; and (5) BWCP with Gumbel-Softmax replaced by the Straight-Through Estimator (STE) (Bengio et al., 2013). For case (5), we select channels by a hard 0-1 mask generated with mc = sign(P(X̂c > 0) − 0.5) [1], and the gradient is back-propagated through STE. From the results in Table 3, we can make the following conclusions: (a) BW improves the recognition performance, implying that it can enhance the representation of channels; (b) the sparse regularization on γ and β slightly harms the classification accuracy of the original model, but it encourages channels to be sparse, as also shown in Proposition 3; (c) BWCP with Gumbel-Softmax achieves higher accuracy than with STE, showing that a soft sampling technique is better than a deterministic one, as reported in (Jang et al., 2017).

[1] y = sign(x) = 1 if x ≥ 0 and 0 if x < 0.



(Figure 3 panels: (a)-(d) per-channel activation probability of the BN and BW outputs (ResNet-34 and ResNet-50, layer1.0.bn1) plotted against channel index; (e)-(f) correlation score versus training iterations for the original BN and BWCP in VGGNet layers 1 and 12.)


Figure 3: ((a) & (b)) and ((c) & (d)) show the effect of BWCP on the activation probability with trained ResNet-34 and ResNet-50 on ImageNet, respectively. The proposed batch whitening (BW) can increase the activation probability of useful channels when |γc| > 0 while keeping the unimportant channels unchanged when βc ≤ 0 and γc → 0. (e) & (f) show the correlation score of the output response maps in shallow and deep BWCP modules during the whole training period. BWCP has a lower correlation score among feature channels than the original BN baseline.

**Impact of regularization strength λ1 and λ2. We analyze the effect of the regularization strengths λ1**
and λ2 for the sparsity loss on CIFAR-100. The trade-off between accuracy and FLOPs reduction is investigated using VGG-16. Table 4 shows that the network becomes more compact as λ1 and λ2 increase, implying that both terms in Eqn.(7) can make channel features sparse. Moreover, the FLOPs metric is more sensitive to the regularization on γ, which validates our analysis in Sec.4.4. Besides, we need to search for proper values of λ1 and λ2 to trade off accuracy against FLOPs reduction, which is a drawback of our method.

**Effect of the number of BW modules. Here the effect of the number of BW modules in BWCP is**
investigated on CIFAR-10 using ResNet-56, which consists of a series of bottleneck structures. Note that there are three BN layers in each bottleneck. We study three variants of BWCP: (a) we use BW to modify the last BN in each bottleneck module, giving a total of 18 BW layers in ResNet-56; (b) the last two BN layers of each bottleneck are modified by our BW technique (36 BW layers); (c) all BN layers in the bottlenecks are replaced by BW (54 BW layers), which is our proposed method. The results are reported in Table 5. We can see that BWCP achieves the best top-1 accuracy when BW acts on all BN layers, given comparable FLOPs and model size. This indicates that BWCP benefits from more BW layers in the network.

Table 5: Effect of the number of BW modules on the CIFAR-10 dataset trained with ResNet-56. ‘# BW’ indicates the number of BW modules. More BW modules in the network lead to a lower recognition accuracy drop with comparable computation consumption.

| # BW | Acc. (%) | Acc. Drop | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|
| 18 | 93.01 | 0.63 | 44.70 | 50.77 |
| 36 | 93.14 | 0.50 | 45.29 | 50.45 |
| 54 | 93.37 | 0.27 | 44.42 | 50.35 |

**BWCP selects representative channel features. It is worth noting that BWCP whitens channel**
features after BN through BW, as shown in Eqn.(3). Therefore, BW can learn diverse channel features by reducing the correlations among channels (Yang et al., 2019b). We investigate this using VGGNet-16 with BN and with the proposed BWCP trained on CIFAR-10. The correlation score is calculated by taking the average over the absolute values of the correlation matrix of channel features; a larger value indicates redundancy in the encoded features. We plot the correlation score among channels at different depths of the network. As shown in Fig.3 (e & f), channel features after the BW block have significantly smaller correlations, implying that the channels selected by BWCP are representative. This also accounts for the effectiveness of the proposed scheme.
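The correlation score used in Fig.3 (e & f) can be computed as sketched below; whether the diagonal of the correlation matrix is included is not stated in the text, so excluding it (self-correlation is always 1) is our assumption.

```python
import torch

def correlation_score(features: torch.Tensor) -> float:
    """Average absolute pairwise correlation between channel features.

    features: (N, C, H, W) activations from one layer.
    The diagonal (self-correlation) is excluded here; this is our assumption.
    """
    n, c, h, w = features.shape
    flat = features.permute(1, 0, 2, 3).reshape(c, -1)
    corr = torch.corrcoef(flat)                          # (C, C) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().sum().item() / (c * (c - 1))
```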

6 DISCUSSION AND CONCLUSION

This paper presented an effective and efficient pruning technique, termed Batch Whitening Channel
Pruning (BWCP). We show BWCP increases the activation probability of useful channels while
keeping unimportant channels unchanged, making it appealing to pursue a compact model. Particularly, BWCP can be easily applied to prune various CNN architectures by modifying the batch
normalization layer. However, to achieve different levels of FLOPs reduction, the proposed BWCP
needs to search for the strength of sparse regularization. With probabilistic formulation in BWCP,
the expected FLOPs can be modeled. The multiplier method can be used to encourage the model to
attain target FLOPs. For future work, an advanced Pareto optimization algorithm can be designed to
tackle such multi-objective joint minimization. We hope that the analyses of BWCP could bring a
new perspective for future work in channel pruning.



**Ethics Statement. We aim at compressing neural networks with the proposed BWCP framework. It could**
improve the energy efficiency of neural network models and reduce carbon dioxide emissions. We note that deep neural networks trained with BWCP can be deployed on portable or edge devices such as mobile phones; hence, our work shares the potential negative ethical impacts of AI on edge devices in general. Moreover, network pruning may affect different classes differently, thus possibly producing unfair models. We will carefully investigate the effect of our method on the fairness of the model output in the future.

**Reproducibility Statement. For the theoretical results, clear explanations of the assumptions and complete**
proofs of Propositions 1-3 are included in the Appendix. To reproduce the experimental results, we provide training details and hyper-parameters in Appendix Sec.B. Moreover, we will also make our code available via a link to an anonymous repository during the discussion stage.

REFERENCES

Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization propagation:
A parametric technique for removing internal covariate shift in deep networks. International
_Conference in Machine Learning, 2016._

Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients
through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL
[http://arxiv.org/abs/1308.3432.](http://arxiv.org/abs/1308.3432)

Dario A Bini, Nicholas J Higham, and Beatrice Meini. Algorithms for the matrix pth root. Numerical
_Algorithms, 39(4):349–378, 2005._

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille.
Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully
connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848,
2018.

Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware
acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):
485–532, 2020.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural
networks. arXiv preprint arXiv:1803.03635, 2018.

Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic channel
pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331, 2018.

Jinyang Guo, Wanli Ouyang, and Dong Xu. Multi-dimensional pruning: A unified framework for
model compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
_Recognition, pp. 1508–1517, 2020._

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances
_in neural information processing systems, pp. 1379–1387, 2016._

Song Han and William J. Dally. Bandwidth-efficient deep learning. In Proceedings of the 55th
_Annual Design Automation Conference on, pp. 147, 2018._

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for
efficient neural network. In Advances in neural information processing systems, pp. 1135–1143,
2015.

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. In ICLR 2016 : International Conference
_on Learning Representations 2016, 2016._

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
770–778, 2016.



Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating
deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018a.

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median
for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on
_Computer Vision and Pattern Recognition, pp. 4340–4349, 2019._

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model
compression and acceleration on mobile devices. In Proceedings of the European Conference on
_Computer Vision (ECCV), pp. 784–800, 2018b._

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
_recognition, pp. 4700–4708, 2017._

Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization
towards efficient whitening. In Proceedings of the IEEE Conference on Computer Vision and
_Pattern Recognition, pp. 4874–4883, 2019._

Zehao Huang and Naiyan Wang. Data-Driven Sparse Structure Selection for Deep Neural Networks.
In ECCV, 2018.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2017.

Minsoo Kang and Bohyung Han. Operation-aware soft channel pruning using differentiable masks.
_arXiv preprint arXiv:2007.03938, 2020._

A Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Yann LeCun, John S Denker, and Sara A Solla. Optimal Brain Damage. In NIPS, 1990.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for
Efficient ConvNets. In ICLR, 2017.

Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling
Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF
_Conference on Computer Vision and Pattern Recognition, pp. 1529–1538, 2020._

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE
_International Conference on Computer Vision, pp. 2736–2744, 2017._

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of
network pruning. arXiv preprint arXiv:1810.05270, 2018.

Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through
_l 0 regularization. International Conference on Learning Representation, 2017._

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural
network compression. In Proceedings of the IEEE international conference on computer vision,
pp. 5058–5066, 2017.

Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. DSA: More
efficient budgeted pruning via differentiable sparsity allocation. arXiv preprint arXiv:2004.02164,
2020.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual
reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial
_Intelligence, volume 32, 2018._


-----

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object
detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine
_Intelligence, 39(6):1137–1149, 2017._

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition
challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.

Huan Wang, Qiming Zhang, Yuehai Wang, and Haoji Hu. Structured probabilistic pruning for
convolutional neural network acceleration. arXiv preprint arXiv:1709.06994, 2017.

Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep
multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems,
33, 2020.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in
deep neural networks. In Proceedings of the 30th International Conference on Neural Information
_Processing Systems, pp. 2074–2082, 2016._

Huanrui Yang, Wei Wen, and Hai Li. Deephoyer: Learning sparser neural network with differentiable
scale-invariant sparsity measures. arXiv preprint arXiv:1908.09979, 2019a.

Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, and Devi Parikh. Cross-channel communication networks. In Advances in Neural Information Processing Systems, pp. 1295–1304,
2019b.

Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, and Qiang Liu. Good subnetworks
provably exist: Pruning via greedy forward selection. In International Conference on Machine
_Learning, pp. 10820–10830. PMLR, 2020._

Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational
convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision
_and Pattern Recognition, pp. 2780–2789, 2019._

Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang,
and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. arXiv preprint
_arXiv:1810.11809, 2018._


-----

The appendix provides more details about the approach and experiments of our proposed batch whitening
channel pruning (BWCP) framework. The broader impact of this work is also discussed.

A MORE DETAILS ABOUT APPROACH

A.1 CALCULATION OF COVARIANCE MATRIX Σ

By Eqn.(1) in the main text, the output of BN is $\tilde{x}_{ncij} = \gamma_c \bar{x}_{ncij} + \beta_c$. Hence, we have $\mathbb{E}[\tilde{x}_c] = \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}(\gamma_c \bar{x}_{ncij} + \beta_c) = \beta_c$. Then the entry in the $c$-th row and $d$-th column of the covariance matrix $\Sigma$ of $\tilde{x}$ is calculated as follows:

$$\Sigma_{cd} = \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}\big(\gamma_c \bar{x}_{ncij} + \beta_c - \mathbb{E}[\tilde{x}_c]\big)\big(\gamma_d \bar{x}_{ndij} + \beta_d - \mathbb{E}[\tilde{x}_d]\big) = \gamma_c \gamma_d \rho_{cd} \quad (8)$$

where $\rho_{cd}$ is the element in the $c$-th row and $d$-th column of the correlation matrix of $\bar{x}$. Hence, we have $\rho_{cd} \in [-1, 1]$. Furthermore, we can write $\Sigma$ in vector form: $\Sigma = \gamma\gamma^{T} \odot \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}\bar{x}_{nij}\bar{x}_{nij}^{T} = \gamma\gamma^{T} \odot \rho$.
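For concreteness, the identity $\Sigma = \gamma\gamma^T \odot \rho$ can be checked numerically. The following PyTorch sketch is only an illustration (tensor shapes and names are our own assumptions, not the authors' code); it computes the covariance of the BN output both directly and via Eqn.(8):

```python
import torch

# illustrative shapes; gamma and beta play the role of the BN affine parameters
N, C, H, W = 8, 4, 16, 16
x = torch.randn(N, C, H, W)
# standardized activations x_bar (zero mean, unit variance per channel)
x_bar = (x - x.mean(dim=(0, 2, 3), keepdim=True)) / x.std(dim=(0, 2, 3), unbiased=False, keepdim=True)
gamma, beta = torch.randn(C), torch.randn(C)

# BN output: x_tilde_c = gamma_c * x_bar_c + beta_c
x_tilde = gamma.view(1, C, 1, 1) * x_bar + beta.view(1, C, 1, 1)

# direct covariance of the BN output over the N*H*W locations
flat = x_tilde.permute(1, 0, 2, 3).reshape(C, -1)
centered = flat - flat.mean(dim=1, keepdim=True)
cov_direct = centered @ centered.t() / flat.shape[1]

# closed form of Eqn.(8): Sigma = (gamma gamma^T) * rho, with rho the correlation of x_bar
flat_bar = x_bar.permute(1, 0, 2, 3).reshape(C, -1)
rho = flat_bar @ flat_bar.t() / flat_bar.shape[1]
cov_closed = torch.outer(gamma, gamma) * rho

print(torch.allclose(cov_direct, cov_closed, atol=1e-4))  # expected: True
```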

A.2 PROOF OF PROPOSITION 1

For (1), we notice that we can redefine $\gamma_c \leftarrow -\gamma_c$ and $\bar{X}_c \leftarrow -\bar{X}_c \sim \mathcal{N}(0, 1)$ if $\gamma_c < 0$. Hence, we can assume $\gamma_c > 0$ without loss of generality. Then, we have

$$\begin{aligned}
P(Y_c > 0) = P(\tilde{X}_c > 0) &= P\Big(\bar{X}_c > -\frac{\beta_c}{\gamma_c}\Big) \\
&= \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\,dt \\
&= \int_{-\frac{\beta_c}{\gamma_c}}^{0} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\,dt + \int_{0}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\,dt \\
&= \int_{0}^{\frac{\beta_c}{\gamma_c}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\,dt + \int_{0}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\,dt \\
&= \frac{\mathrm{Erf}\big(\frac{\beta_c}{\sqrt{2}\gamma_c}\big) + 1}{2}
\end{aligned} \quad (9)$$

When $\gamma_c < 0$, we can apply the redefinition $\gamma_c \leftarrow -\gamma_c$ above. Hence, we arrive at

$$P(Y_c > 0) = P(\tilde{X}_c > 0) = \frac{\mathrm{Erf}\big(\frac{\beta_c}{\sqrt{2}|\gamma_c|}\big) + 1}{2} \quad (10)$$
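As a quick numerical check of Eqn.(10) (an illustration with made-up values of $\gamma_c$ and $\beta_c$, not part of the original derivation), the closed form can be compared against a Monte-Carlo estimate:

```python
import torch

def activation_prob(gamma, beta):
    # P(Y_c > 0) = (Erf(beta_c / (sqrt(2) * |gamma_c|)) + 1) / 2, i.e. Eqn.(10)
    return 0.5 * (torch.erf(beta / (2.0 ** 0.5 * gamma.abs())) + 1.0)

gamma = torch.tensor([0.8, -0.5, 1.2])
beta = torch.tensor([0.3, -0.2, 0.0])

# Monte-Carlo estimate: Y_c = max(0, gamma_c * X + beta_c), X ~ N(0, 1)
x = torch.randn(1_000_000, 3)
prob_mc = (gamma * x + beta > 0).float().mean(dim=0)

print(activation_prob(gamma, beta))  # closed form
print(prob_mc)                       # should agree up to Monte-Carlo noise
```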


For (2), let us denote $\bar{X}_c \sim \mathcal{N}(0, 1)$, $\tilde{X}_c = \gamma_c \bar{X}_c + \beta_c$, and $Y_c = \max\{0, \tilde{X}_c\}$, where $Y_c$ represents a random variable corresponding to the output feature $y_c$ in Eqn.(1) in the main text. Firstly, it is easy to see that $P(\tilde{X}_c > 0) = 0 \Leftrightarrow \mathbb{E}_{\bar{X}_c}[Y_c] = 0$ and $\mathbb{E}_{\bar{X}_c}[Y_c^2] = 0$. In the following we show that $\mathbb{E}_{\bar{X}_c}[Y_c] = 0$ and $\mathbb{E}_{\bar{X}_c}[Y_c^2] = 0 \Leftrightarrow \beta_c \leq 0$ and $\gamma_c \to 0$. Similar to (1), we assume $\gamma_c > 0$ without loss of generality.

For the sufficiency, we have

$$\mathbb{E}_{\bar{X}_c}[Y_c] = \int_{-\infty}^{-\frac{\beta_c}{\gamma_c}} 0 \cdot \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)\,d\bar{x}_c + \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty} (\gamma_c \bar{x}_c + \beta_c)\,\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)\,d\bar{x}_c = \frac{\gamma_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \frac{\beta_c}{2}\Big(1 + \mathrm{Erf}\big[\tfrac{\beta_c}{\sqrt{2}\gamma_c}\big]\Big), \quad (11)$$

where $\mathrm{Erf}[x] = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$ is the error function. From Eqn.(11), we have

$$\lim_{\gamma_c \to 0^+} \mathbb{E}_{\bar{X}_c}[Y_c] = \lim_{\gamma_c \to 0^+} \frac{\gamma_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \lim_{\gamma_c \to 0^+} \frac{\beta_c}{2}\Big(1 + \mathrm{Erf}\big[\tfrac{\beta_c}{\sqrt{2}\gamma_c}\big]\Big) = 0 \quad (12)$$

where the second limit vanishes because $\beta_c \leq 0$ implies $\mathrm{Erf}[\beta_c/(\sqrt{2}\gamma_c)] \to -1$ (or $\beta_c = 0$) as $\gamma_c \to 0^+$.


-----

Table 6: Running time comparison during training between BWCP, vanilla BN, and SCP. The proposed BWCP achieves a better trade-off between FLOPs reduction and accuracy drop although it introduces a little extra computational cost during training. 'F' denotes forward running time (s) while 'F+B' denotes forward and backward running time (s). The results are averaged over 100 iterations. The GPU is an NVIDIA GTX 1080Ti and the CPU is an Intel Xeon E5-2682 v4.

| Model | Method | CPU (F) (s) | CPU (F+B) (s) | GPU (F) (s) | GPU (F+B) (s) | Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-50 | vanilla BN | 0.184 | 0.478 | 0.015 | 0.031 | 0 | 0 |
| ResNet-50 | SCP | 0.193 | 0.495 | 0.034 | 0.067 | 1.69 | 54.3 |
| ResNet-50 | BWCP (Ours) | 0.239 | 0.610 | 0.053 | 0.104 | 1.02 | 51.2 |

In the same way, we can calculate

$$\mathbb{E}_{\bar{x}_c}[Y_c^2] = \int_{-\infty}^{-\frac{\beta_c}{\gamma_c}} 0 \cdot \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)\,d\bar{x}_c + \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty} (\gamma_c \bar{x}_c + \beta_c)^2\,\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)\,d\bar{x}_c = \frac{\gamma_c\beta_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \frac{\gamma_c^2 + \beta_c^2}{2}\Big(1 + \mathrm{Erf}\big[\tfrac{\beta_c}{\sqrt{2}\gamma_c}\big]\Big). \quad (13)$$

From Eqn.(13), we have

$$\lim_{\gamma_c \to 0^+} \mathbb{E}_{\bar{x}_c}[Y_c^2] = \lim_{\gamma_c \to 0^+} \frac{\gamma_c\beta_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \lim_{\gamma_c \to 0^+} \frac{\gamma_c^2 + \beta_c^2}{2}\Big(1 + \mathrm{Erf}\big[\tfrac{\beta_c}{\sqrt{2}\gamma_c}\big]\Big) = 0 \quad (14)$$

For the necessity, we show that if $\mathbb{E}_{\bar{x}_c}[Y_c] = 0$ and $\mathbb{E}_{\bar{x}_c}[Y_c^2] = 0$, then $\gamma_c \to 0^+$ and $\beta_c \leq 0$. This can be acquired by solving Eqn.(11) and Eqn.(13). To be specific, $\beta_c \cdot$ Eqn.(11) $-$ Eqn.(13) gives us $\gamma_c = 0^+$. Substituting it into Eqn.(11), we can obtain $\beta_c \leq 0$. This completes the proof.

A.3 TRAINING OVERHEAD OF BWCP

The proposed BWCP introduces a little extra computational cost during training. To see this, we evaluate the computational complexity of SCP and BWCP for ResNet-50 on ImageNet with an input image size of 224 × 224. As shown in Table 6, training BWCP is slightly slower on both CPU and GPU than the plain ResNet with vanilla BN and than SCP. The extra computational burden mainly comes from calculating the covariance matrix and its root inverse. In our paper, we compute the root inverse of the covariance matrix by Newton's iteration, which is fast and efficient. Although BWCP brings extra training overhead, it achieves a smaller top-1 accuracy drop under a comparable FLOPs reduction.
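As a sketch of how such a root inverse can be obtained cheaply, the snippet below runs the Newton (Newton-Schulz) update $\Sigma_k = \frac{1}{2}(3\Sigma_{k-1} - \Sigma_{k-1}^3\Sigma_N)$ on a covariance rescaled to unit trace, which mirrors the unit-trace normalization of $\Sigma_N$ in BWCP. This is our own illustration, not the paper's implementation:

```python
import torch

def newton_inv_sqrt(sigma, T=20, eps=1e-5):
    """Approximate sigma^{-1/2} with the Newton (Newton-Schulz) update
    S_0 = I, S_k = 0.5 * (3 * S_{k-1} - S_{k-1}^3 @ sigma_n),
    where sigma_n is sigma rescaled to unit trace so that the iteration converges."""
    c = sigma.shape[0]
    trace = sigma.diagonal().sum() + eps
    sigma_n = sigma / trace
    s = torch.eye(c, dtype=sigma.dtype)
    for _ in range(T):
        s = 0.5 * (3.0 * s - torch.matrix_power(s, 3) @ sigma_n)
    # undo the rescaling: (sigma / trace)^{-1/2} / sqrt(trace) = sigma^{-1/2}
    return s / trace.sqrt()

# quick check on a random SPD matrix
a = torch.randn(8, 8, dtype=torch.float64)
sigma = a @ a.t() + 0.1 * torch.eye(8, dtype=torch.float64)
inv_sqrt = newton_inv_sqrt(sigma)
print(torch.allclose(inv_sqrt @ sigma @ inv_sqrt, torch.eye(8, dtype=torch.float64), atol=1e-4))
```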

A.4 PROOF OF PROPOSITION 2

First, we can derive that $\hat{X} = \Sigma_N^{-\frac{1}{2}}(\gamma \odot \bar{X} + \beta) = \Sigma_N^{-\frac{1}{2}}(\gamma \odot \bar{X}) + \Sigma_N^{-\frac{1}{2}}\beta = (\Sigma_N^{-\frac{1}{2}}\gamma) \odot \bar{X} + \Sigma_N^{-\frac{1}{2}}\beta$. Hence, the newly defined scale and bias parameters are $\hat{\gamma} = \Sigma_N^{-\frac{1}{2}}\gamma$ and $\hat{\beta} = \Sigma_N^{-\frac{1}{2}}\beta$. When $T = 1$, we have $\Sigma_N^{-\frac{1}{2}} = \frac{1}{2}(3I - \Sigma_N)$ by Eqn.(5) in the main text. Hence we obtain

$$\begin{aligned}
\hat{\gamma} &= \frac{1}{2}(3I - \Sigma_N)\gamma = \frac{1}{2}\Big(3I - \frac{\gamma\gamma^T}{\|\gamma\|_2^2} \odot \rho\Big)\gamma \\
&= \frac{1}{2}\Big(3\gamma - \Big[\sum_{j=1}^{C}\frac{\gamma_1\gamma_j\rho_{1j}\gamma_j}{\|\gamma\|_2^2}, \cdots, \sum_{j=1}^{C}\frac{\gamma_C\gamma_j\rho_{Cj}\gamma_j}{\|\gamma\|_2^2}\Big]^T\Big) \\
&= \frac{1}{2}\Big[\Big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{1j}}{\|\gamma\|_2^2}\Big)\gamma_1, \cdots, \Big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{Cj}}{\|\gamma\|_2^2}\Big)\gamma_C\Big]^T
\end{aligned} \quad (15)$$


-----

Similarly, $\hat{\beta}$ can be given by

$$\begin{aligned}
\hat{\beta} &= \frac{1}{2}(3I - \Sigma_N)\beta = \frac{1}{2}\Big(3I - \frac{\gamma\gamma^T}{\|\gamma\|_2^2} \odot \rho\Big)\beta \\
&= \frac{1}{2}\Big(3\beta - \Big[\sum_{j=1}^{C}\frac{\gamma_1\gamma_j\rho_{1j}\beta_j}{\|\gamma\|_2^2}, \cdots, \sum_{j=1}^{C}\frac{\gamma_C\gamma_j\rho_{Cj}\beta_j}{\|\gamma\|_2^2}\Big]^T\Big) \\
&= \frac{1}{2}\Big[3\beta_1 - \Big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{1j}}{\|\gamma\|_2^2}\Big)\gamma_1, \cdots, 3\beta_C - \Big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{Cj}}{\|\gamma\|_2^2}\Big)\gamma_C\Big]^T
\end{aligned} \quad (16)$$

Taking each component of the vectors in Eqn.(15)-(16) gives the expressions of $\hat{\gamma}_c$ and $\hat{\beta}_c$ in Proposition 2.
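The closed forms in Eqn.(15)-(16) for $T = 1$ are easy to check numerically. A minimal sketch with random $\gamma$, $\beta$, and a synthetic correlation matrix (all names are our own choices for illustration) is:

```python
import torch

C = 4
gamma = torch.randn(C)
beta = torch.randn(C)

# a synthetic correlation matrix rho with unit diagonal
a = torch.randn(C, C)
rho = a @ a.t()
d = rho.diagonal().rsqrt()
rho = d.view(-1, 1) * rho * d.view(1, -1)

# normalized covariance: Sigma_N = (gamma gamma^T / ||gamma||_2^2) * rho
g2 = gamma.pow(2).sum()
sigma_n = torch.outer(gamma, gamma) / g2 * rho

# one Newton step (T = 1): Sigma_N^{-1/2} ~= 0.5 * (3I - Sigma_N)
w = 0.5 * (3.0 * torch.eye(C) - sigma_n)
gamma_hat = w @ gamma                       # vector form of Eqn.(15)
beta_hat = w @ beta                         # vector form of Eqn.(16)

# component-wise forms, as stated in Proposition 2
gamma_hat_c = 0.5 * (3.0 - (rho * gamma.pow(2).view(1, -1)).sum(dim=1) / g2) * gamma
beta_hat_c = 0.5 * (3.0 * beta - (rho * (gamma * beta).view(1, -1)).sum(dim=1) / g2 * gamma)

print(torch.allclose(gamma_hat, gamma_hat_c, atol=1e-5),
      torch.allclose(beta_hat, beta_hat_c, atol=1e-5))  # expected: True True
```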

A.5 PROOF OF PROPOSITION 3


For (1), through Eqn.(15), we acquire $|\hat{\gamma}_c| = \frac{1}{2}\big|3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}\big|\,|\gamma_c|$. Therefore, $|\hat{\gamma}_c| \to 0$ if $|\gamma_c| \to 0$. On the other hand, by Eqn.(16), we have $\hat{\beta}_c \approx \frac{1}{2}\big(3 - \frac{\gamma_c^2}{\|\gamma\|_2^2}\big)\beta_c < \beta_c \leq 0$. Here we assume that $\rho_{cd} = 1$ if $c = d$ and $0$ otherwise. Note that this assumption is plausible by Fig.3 (e & f) in the main text, from which we see that the correlation among channel features gradually decreases during training. We also empirically verify these two conclusions by Fig.4. From Fig.4 we can see that $|\hat{\gamma}_c| \geq |\gamma_c|$, where the equality holds iff $|\gamma_c| = 0$, and that $\hat{\beta}_c$ is larger than $\beta_c$ if $\beta_c$ is positive, and vice versa. By Proposition 1, we arrive at

$$P(\hat{X}_c > \delta) \leq P(\hat{X}_c > 0) = 0 \quad (17)$$

where the first '$\leq$' holds since $\delta$ is a small positive constant and '$=$' follows from $|\hat{\gamma}_c| \to 0$ and $\hat{\beta}_c \leq 0$. For (2), to show $P(\hat{X}_c > \delta) > P(\tilde{X}_c > \delta)$, we only need to prove $P\big(\bar{X}_c > \frac{\delta - \hat{\beta}_c}{|\hat{\gamma}_c|}\big) > P\big(\bar{X}_c > \frac{\delta - \beta_c}{|\gamma_c|}\big)$, which is equivalent to $\frac{\delta - \hat{\beta}_c}{|\hat{\gamma}_c|} < \frac{\delta - \beta_c}{|\gamma_c|}$. To this end, we calculate

$$\begin{aligned}
\frac{|\hat{\gamma}_c|\beta_c - |\gamma_c|\hat{\beta}_c}{|\hat{\gamma}_c| - |\gamma_c|}
&= \frac{\frac{1}{2}\big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}\big)|\gamma_c|\beta_c - |\gamma_c|\,\frac{1}{2}\big(3\beta_c - \big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{cj}}{\|\gamma\|_2^2}\big)\gamma_c\big)}{\frac{1}{2}\big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}\big)|\gamma_c| - |\gamma_c|} \\
&= \frac{\sum_{j=1}^{C}\frac{\gamma_j\beta_j\gamma_c\rho_{cj}}{\|\gamma\|_2^2} - \sum_{j=1}^{C}\frac{\gamma_j^2\beta_c\rho_{cj}}{\|\gamma\|_2^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}}
= \frac{\sum_{j=1}^{C}\frac{\gamma_j(\beta_j\gamma_c - \gamma_j\beta_c)\rho_{cj}}{\|\gamma\|_2^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}}
\leq \frac{\frac{1}{\|\gamma\|_2}\sqrt{\sum_{j=1}^{C}(\beta_j\gamma_c - \gamma_j\beta_c)^2\rho_{cj}^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}} \\
&= \frac{\|\gamma\|_2\sqrt{\sum_{j=1}^{C}(\beta_j\gamma_c - \gamma_j\beta_c)^2\rho_{cj}^2}}{\|\gamma\|_2^2 - \sum_{j=1}^{C}\gamma_j^2\rho_{cj}} = \delta
\end{aligned} \quad (18)$$

where the '$\leq$' holds due to the Cauchy-Schwarz inequality. By Eqn.(18), we derive that $|\gamma_c|(\delta - \hat{\beta}_c) \leq |\hat{\gamma}_c|(\delta - \beta_c)$, which is exactly what we want. Lastly, we empirically verify that $\delta$ defined in Proposition 3 is a small positive constant. In fact, $\delta$ represents the minimal activation feature value (i.e. $\hat{X}_c = \hat{\gamma}_c\bar{X}_c + \hat{\beta}_c \geq \delta$ by definition). We visualize the value of $\delta$ in shallow and deep layers of ResNet-34 during the whole training stage, as well as the value of $\delta$ for each layer of a trained ResNet-34, on the ImageNet dataset in Fig.5. As we can see, $\delta$ is always a small positive number during training. We thus empirically set $\delta$ to 0.05 in all experiments.

To conclude, by Eqn.(17), BWCP keeps the activation probability of unimportant channels unchanged; by Eqn.(18), BWCP increases the activation probability of important channels. In this way, the proposed BWCP can pursue a compact deep model with good performance.


-----

Figure 4: Experimental observation of how our proposed BWCP changes the values of γc and βc in BN layers through the proposed BW technique. Results are obtained by training ResNet-50 on the ImageNet dataset. We investigate γc and βc at different depths of the network, including layer1.0.bn1, layer2.0.bn1, layer3.0.bn1, and layer4.0.bn1 (the horizontal axis in each panel is the channel index). (a-d) show that BW enlarges βc when βc > 0 while reducing βc when βc ≤ 0. (e-h) show that BW consistently increases the magnitude of γc across the network.

Figure 5: Experimental observation of the values of δ defined in Proposition 3. Results are obtained by training ResNet-34 on the ImageNet dataset. (a & b) investigate δ at different depths of the network, i.e. layer1.0.bn1 and layer4.0.bn1 respectively, over the training epochs. (c) visualizes δ for each layer of ResNet-34. We see that δ in Proposition 3 is always a small positive constant.

Figure 6: Illustration of BWCP with shortcut in the basic block structure of ResNet. For shortcuts with Conv-BN modules, we use a simple strategy that lets the BW layer in the last convolution layer and the shortcut share the same mask. For shortcuts with identity mappings, we use the mask from the previous layer.


-----

**Algorithm 1 Forward Propagation of the proposed BWCP.**

1: Input: mini-batch inputs $x \in \mathbb{R}^{N\times C\times H\times W}$.
2: Hyperparameters: momentum $g$ for calculating the root inverse of the covariance matrix, iteration number $T$.
3: Output: the activations $x^{\mathrm{out}}$ obtained by BWCP.
4: calculate standardized activations $\{\bar{x}_c\}_{c=1}^{C}$ in Eqn.(1).
5: calculate the output of the BN layer: $\tilde{x}_c = \gamma_c \bar{x}_c + \beta_c$.
6: calculate the normalized covariance matrix: $\Sigma_N = \frac{\gamma\gamma^T}{\|\gamma\|_2^2} \odot \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W}\bar{x}_{nij}\bar{x}_{nij}^T$.
7: $\Sigma_0 = I$.
8: for $k = 1$ to $T$ do
9:   $\Sigma_k = \frac{1}{2}(3\Sigma_{k-1} - \Sigma_{k-1}^3\Sigma_N)$
10: end for
11: calculate the whitening matrix for training: $\Sigma_N^{-\frac{1}{2}} = \Sigma_T$.
12: calculate the whitening matrix for inference: $\hat{\Sigma}_N^{-\frac{1}{2}} \leftarrow (1-g)\,\hat{\Sigma}_N^{-\frac{1}{2}} + g\,\Sigma_N^{-\frac{1}{2}}$.
13: calculate the whitened output: $\hat{x}_{nij} = \Sigma_N^{-\frac{1}{2}}\tilde{x}_{nij}$.
14: calculate the equivalent scale and bias defined by BW: $\hat{\gamma} = \Sigma_N^{-\frac{1}{2}}\gamma$ and $\hat{\beta} = \Sigma_N^{-\frac{1}{2}}\beta$.
15: calculate the activation probability by Proposition 2 with $\hat{\gamma}$ and $\hat{\beta}$; obtain soft masks $\{m_c\}_{c=1}^{C}$ by Eqn.(6).
16: calculate the output of BWCP: $x^{\mathrm{out}}_c = \hat{x}_c \odot m_c$.
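To make the steps of Algorithm 1 concrete, the following PyTorch sketch implements a training-mode forward pass under simplifying assumptions: it omits the group-wise whitening and the running whitening matrix used at inference (step 12), and it replaces the Gumbel-Softmax sampling of Eqn.(6) with a deterministic temperature-scaled sigmoid of the activation-probability logit. The class name `BWCPSketch` and all variable names are our own; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BWCPSketch(nn.Module):
    """Illustrative training-mode forward pass following the steps of Algorithm 1."""

    def __init__(self, num_channels, T=3, tau=0.5, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))
        self.T, self.tau, self.eps = T, tau, eps

    def forward(self, x):
        n, c, h, w = x.shape
        # step 4: standardize activations per channel
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_bar = (x - mean) / torch.sqrt(var + self.eps)

        # step 6: normalized covariance Sigma_N = (gamma gamma^T / ||gamma||^2) * rho
        flat = x_bar.permute(1, 0, 2, 3).reshape(c, -1)
        rho = flat @ flat.t() / flat.shape[1]
        sigma_n = torch.outer(self.gamma, self.gamma) / self.gamma.pow(2).sum() * rho

        # steps 7-11: Newton iterations for Sigma_N^{-1/2}
        s = torch.eye(c, device=x.device, dtype=x.dtype)
        for _ in range(self.T):
            s = 0.5 * (3.0 * s - torch.matrix_power(s, 3) @ sigma_n)

        # steps 5 and 13-14: BN output, whitened output, and equivalent scale/bias
        x_tilde = self.gamma.view(1, c, 1, 1) * x_bar + self.beta.view(1, c, 1, 1)
        x_hat = torch.einsum('cd,ndhw->nchw', s, x_tilde)
        gamma_hat = s @ self.gamma
        beta_hat = s @ self.beta

        # step 15: activation probability (Proposition 1 with gamma_hat, beta_hat)
        # and a soft mask; a sigmoid over the probability logit stands in for Eqn.(6)
        p = 0.5 * (torch.erf(beta_hat / (2.0 ** 0.5 * gamma_hat.abs() + self.eps)) + 1.0)
        mask = torch.sigmoid((torch.log(p + self.eps) - torch.log(1.0 - p + self.eps)) / self.tau)

        # step 16: apply the soft mask channel-wise
        return x_hat * mask.view(1, c, 1, 1)


# usage: drop the module in place of a BN layer
layer = BWCPSketch(num_channels=16)
out = layer(torch.randn(4, 16, 8, 8))
print(out.shape)  # torch.Size([4, 16, 8, 8])
```

In a full implementation, the soft mask would additionally involve Gumbel noise and the running whitening matrix would be used at test time, as described in Algorithm 1.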

**(a) BN** **(b) BWCP**

Figure 7: Illustration of the forward propagation of (a) BN and (b) BWCP. The proposed BWCP prunes CNNs by replacing the original BN layer with the BWCP module (BN, followed by BW with covariance computation and Newton iteration, and a soft gating module that turns the activation probability into a soft mask via Gumbel-Softmax).

A.6 SOLUTION TO RESIDUAL ISSUE

The recent advanced CNN architectures usually have residual blocks with shortcut connections (He et al., 2016; Huang et al., 2017). As shown in Fig.6, the number of channels in the last convolution layer must be the same as in previous blocks due to the element-wise summation. Basically, there are two types of residual connections, i.e. a shortcut with a downsampling layer consisting of Conv-BN modules, and a shortcut with identity. For shortcuts with Conv-BN modules, the proposed BW technique is utilized in the downsampling layer to generate a pruning mask $m_c^s$. Furthermore, we use a simple strategy that lets the BW layer in the last convolution layer and the shortcut share the same mask, given by $m_c = m_c^s \cdot m_c^{\mathrm{last}}$, where $m_c^{\mathrm{last}}$ and $m_c^s$ denote the masks of the last convolution layer and the shortcut, respectively. For shortcuts with identity mappings, we use the mask from the previous layer. In doing so, their activated output channels must be the same.
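A minimal sketch of this mask-sharing rule (function and variable names are our own; the inputs are assumed to be the channel-wise soft masks produced by BWCP layers) is shown below.

```python
import torch

def block_output_mask(mask_last, mask_shortcut=None, mask_prev=None):
    """Combine channel masks at the output of a residual block.

    mask_last:     soft mask of the last Conv-BW layer in the block.
    mask_shortcut: soft mask of the Conv-BN (downsampling) shortcut, if present.
    mask_prev:     mask of the previous layer, reused for identity shortcuts.
    """
    if mask_shortcut is not None:
        # shortcut with Conv-BN modules: m_c = m_c^s * m_c^last, shared by both branches
        return mask_last * mask_shortcut
    # identity shortcut: reuse the mask of the previous layer
    return mask_prev

m_last = torch.tensor([0.9, 0.1, 0.8, 0.0])
m_short = torch.tensor([1.0, 0.2, 0.7, 0.5])
print(block_output_mask(m_last, mask_shortcut=m_short))  # tensor([0.9000, 0.0200, 0.5600, 0.0000])
```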

A.7 BACK-PROPAGATION OF BWCP


The forward propagation of BWCP can be represented by Eqn.(3-4) and Eqn.(9) in the main text (see details in Table 1), all of which define differentiable transformations. Here we provide the back-propagation of BWCP. By comparing the forward representations of BN and BWCP in Fig.7, we need to back-propagate the gradient $\frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}_{nij}}$ to $\frac{\partial\mathcal{L}}{\partial \bar{x}_{nij}}$ for the backward propagation of BWCP. For simplicity, we neglect the subscript 'nij'.

By the chain rule, we have

$$\frac{\partial\mathcal{L}}{\partial\bar{x}} = \hat{\gamma} \odot m \odot \frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}} + \frac{\partial\mathcal{L}}{\partial\bar{x}}\big(\Sigma_N^{-\frac{1}{2}}\big) \quad (19)$$


-----

Table 7: Performance of our BWCP on different base models compared with other approaches on the CIFAR-100 dataset.

| Model | Method | Baseline Acc. (%) | Acc. (%) | Acc. Drop | Channels ↓ (%) | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-164 | Slimming* (Liu et al., 2017) | 77.24 | 74.52 | 2.72 | 60 | 29.26 | 47.92 |
| ResNet-164 | SCP (Kang & Han, 2020) | 77.24 | 76.62 | 0.62 | 57 | 28.89 | 45.36 |
| ResNet-164 | BWCP (Ours) | 77.24 | 76.77 | **0.47** | 41 | 21.58 | 39.84 |
| DenseNet-40 | Slimming* (Liu et al., 2017) | 74.24 | 73.53 | 0.71 | **60** | 54.99 | **50.32** |
| DenseNet-40 | Variational Pruning (Zhao et al., 2019) | 74.64 | 72.19 | 2.45 | 37 | 37.73 | 22.67 |
| DenseNet-40 | SCP (Kang & Han, 2020) | 74.24 | 73.84 | 0.40 | **60** | **55.22** | 46.25 |
| DenseNet-40 | BWCP (Ours) | 74.24 | 74.18 | **0.06** | 54 | 53.53 | 40.40 |
| VGGNet-19 | Slimming* (Liu et al., 2017) | 72.56 | 73.01 | -0.45 | **50** | **76.47** | **38.23** |
| VGGNet-19 | BWCP (Ours) | 72.56 | 73.20 | **-0.64** | 23 | 41.00 | 22.09 |
| VGGNet-16 | Slimming* (Liu et al., 2017) | 73.51 | 73.45 | 0.06 | **40** | **66.30** | 27.86 |
| VGGNet-16 | Variational Pruning (Zhao et al., 2019) | 73.26 | 73.33 | -0.07 | 32 | 37.87 | 18.05 |
| VGGNet-16 | BWCP (Ours) | 73.51 | 73.60 | **-0.09** | 34 | 58.16 | **34.46** |

where $\frac{\partial\mathcal{L}}{\partial\bar{x}}\big(\Sigma_N^{-\frac{1}{2}}\big)$ denotes the gradient w.r.t. $\bar{x}$ back-propagated through $\Sigma_N^{-\frac{1}{2}}$. To calculate it, we first obtain the gradient w.r.t. $\Sigma_N^{-\frac{1}{2}}$ as given by

$$\frac{\partial\mathcal{L}}{\partial\Sigma_N^{-\frac{1}{2}}} = \gamma\Big(\frac{\partial\mathcal{L}}{\partial\hat{\gamma}}\Big)^T + \beta\Big(\frac{\partial\mathcal{L}}{\partial\hat{\beta}}\Big)^T \quad (20)$$


where

$$\frac{\partial\mathcal{L}}{\partial\hat{\gamma}} = \bar{x} \odot m \odot \frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}} + \frac{\partial m}{\partial\hat{\gamma}}\Big(\hat{x} \odot \frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}}\Big) \quad (21)$$

and

$$\frac{\partial\mathcal{L}}{\partial\hat{\beta}} = m \odot \frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}} + \frac{\partial m}{\partial\hat{\beta}}\Big(\hat{x} \odot \frac{\partial\mathcal{L}}{\partial x^{\mathrm{out}}}\Big) \quad (22)$$

The remaining thing is to calculate $\frac{\partial m}{\partial\hat{\gamma}}$ and $\frac{\partial m}{\partial\hat{\beta}}$. Based on the Gumbel-Softmax transformation, we arrive at

$$\frac{\partial m_c}{\partial\hat{\gamma}_d} = \begin{cases}\dfrac{m_c(1-m_c)\,f(\hat{\gamma}_c,\hat{\beta}_c)}{\tau\,P(\hat{X}_c>0)\big(1-P(\hat{X}_c>0)\big)}\cdot\Big(-\dfrac{\beta_c\gamma_c}{|\gamma_c|^2}\Big), & \text{if } d = c \\ 0, & \text{otherwise}\end{cases} \quad (23)$$

$$\frac{\partial m_c}{\partial\hat{\beta}_d} = \begin{cases}\dfrac{m_c(1-m_c)\,f(\hat{\gamma}_c,\hat{\beta}_c)}{\tau\,P(\hat{X}_c>0)\big(1-P(\hat{X}_c>0)\big)}, & \text{if } d = c \\ 0, & \text{otherwise}\end{cases} \quad (24)$$

where $f(\hat{\gamma}_c,\hat{\beta}_c)$ is the probability density function of the random variable $\hat{X}_c$, as written in Eqn.(2) of the main text.


To proceed, we deliver the gradient w.r.t. $\Sigma_N^{-\frac{1}{2}}$ in Eqn.(20) to $\Sigma_N$ by the Newton iteration in Eqn.(6) of the main text. Noting that $\Sigma_N^{-\frac{1}{2}} = \Sigma_T$ (so that $\frac{\partial\mathcal{L}}{\partial\Sigma_T} = \frac{\partial\mathcal{L}}{\partial\Sigma_N^{-1/2}}$), we have

$$\frac{\partial\mathcal{L}}{\partial\Sigma_N} = -\frac{1}{2}\sum_{k=1}^{T}\big(\Sigma_{k-1}^3\big)^T\frac{\partial\mathcal{L}}{\partial\Sigma_k} \quad (25)$$

where $\frac{\partial\mathcal{L}}{\partial\Sigma_k}$ can be calculated by the following iterations:

$$\frac{\partial\mathcal{L}}{\partial\Sigma_{k-1}} = \frac{3}{2}\frac{\partial\mathcal{L}}{\partial\Sigma_k} - \frac{1}{2}\frac{\partial\mathcal{L}}{\partial\Sigma_k}\big(\Sigma_{k-1}^2\Sigma_N\big)^T - \frac{1}{2}\big(\Sigma_{k-1}^2\big)^T\frac{\partial\mathcal{L}}{\partial\Sigma_k}\Sigma_N^T - \frac{1}{2}\Sigma_{k-1}^T\frac{\partial\mathcal{L}}{\partial\Sigma_k}\big(\Sigma_{k-1}\Sigma_N\big)^T, \quad k = T, \cdots, 1.$$


Given the gradient w.r.t. $\Sigma_N$ in Eqn.(25), we can calculate the gradient w.r.t. $\bar{x}$ back-propagated through $\Sigma_N^{-\frac{1}{2}}$ in Eqn.(19) as follows:

$$\frac{\partial\mathcal{L}}{\partial\bar{x}}\big(\Sigma_N^{-\frac{1}{2}}\big) = \Big(\frac{\gamma\gamma^T}{\|\gamma\|_2^2} \odot \Big(\frac{\partial\mathcal{L}}{\partial\Sigma_N} + \Big(\frac{\partial\mathcal{L}}{\partial\Sigma_N}\Big)^T\Big)\Big)\bar{x} \quad (26)$$

Based on Eqn.(19)-(26), we obtain the back-propagation of BWCP.
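Because every transformation above is built from differentiable primitives, these gradients can also be obtained by automatic differentiation instead of by hand. The toy sketch below (our own, with a diagonal correlation assumed for brevity) back-propagates through the Newton iteration, implicitly realizing the recursions of Eqn.(25):

```python
import torch

def newton_inv_sqrt(sigma_n, T=3):
    s = torch.eye(sigma_n.shape[0], dtype=sigma_n.dtype)
    for _ in range(T):
        s = 0.5 * (3.0 * s - torch.matrix_power(s, 3) @ sigma_n)
    return s

# toy check: gradients w.r.t. gamma and beta flow through the Newton iteration
gamma = torch.randn(4, dtype=torch.float64, requires_grad=True)
beta = torch.randn(4, dtype=torch.float64, requires_grad=True)
rho = torch.eye(4, dtype=torch.float64)              # assume decorrelated features for brevity
sigma_n = torch.outer(gamma, gamma) / gamma.pow(2).sum() * rho

w = newton_inv_sqrt(sigma_n)
gamma_hat, beta_hat = w @ gamma, w @ beta            # equivalent scale and bias of BW
loss = (gamma_hat.pow(2) + beta_hat.pow(2)).sum()    # any scalar objective
loss.backward()
print(gamma.grad, beta.grad)
```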


-----

B MORE DETAILS ABOUT EXPERIMENT

B.1 DATASET AND METRICS

We evaluate the performance of our proposed BWCP on various image classification benchmarks,
including CIFAR-10/100 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015). The CIFAR-10
and CIFAR-100 datasets have 10 and 100 categories, respectively, and both contain 60k color images
with a size of 32 × 32, split into 50k training images and 10k test images. The ImageNet dataset
consists of 1.28M training images and 50k validation images. Top-1 accuracy is used to evaluate the
recognition performance of models on CIFAR-10/100, while Top-1 and Top-5 accuracies are reported
on ImageNet. We follow the common protocol, i.e. the number of parameters and floating point
operations (FLOPs), to measure model size and computational cost.

For CIFAR-10/100, we use ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and VGGNet (Simonyan & Zisserman, 2014) as our base models. For ImageNet, we use ResNet-34 and ResNet-50.
We compare our algorithm with other channel pruning methods without a fine-tuning procedure.
Note that an extra fine-tuning process can lead to a remarkable improvement in performance (Ye et al.,
2020). For a fair comparison, we also fine-tune our BWCP to compare with those pruning methods.
The training configurations are provided in Appendix B.2. The base networks and BWCP are trained
together from scratch for all of our models.

B.2 TRAINING CONFIGURATION

**Training Setting on ImageNet.** All networks are trained using 8 GPUs with a mini-batch of 32 per
GPU. We train all the architectures from scratch for 120 epochs using stochastic gradient descent
(SGD) with momentum 0.9 and weight decay 1e-4. Following (Ning et al., 2020), we perform normal
training without the sparse regularization in Eqn.(7) on the original networks for the first 20 epochs.
The base learning rate is set to 0.1 and is multiplied by 0.1 after 50, 80, and 110 epochs. During
fine-tuning, we use the standard SGD optimizer with Nesterov momentum 0.9 and weight decay
5e-5 to fine-tune the pruned network for 150 epochs, and we decay the learning rate using a cosine
schedule with an initial learning rate of 0.01. The coefficients of the sparse regularization λ1 and λ2
are set to 7e-5 and 3.5e-5 to achieve a FLOPs reduction at a level of 40%, while λ1 and λ2 are set to
9e-5 and 3.5e-5, respectively, to achieve a FLOPs reduction at a level of 40%. Besides, the covariance
matrix in the proposed BW technique is calculated within each GPU. Like (Huang et al., 2019), we also
use group-wise decorrelation with a group size of 16 across the network to improve the efficiency of BW.
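For reference, the ImageNet schedule described above roughly corresponds to the following PyTorch setup. This is only a sketch: `model` is a stand-in for a ResNet-50 whose BN layers have been replaced by BWCP modules, and the sparsity regularizer of Eqn.(7) with its 20-epoch warm-up is indicated only by comments.

```python
import torch
import torch.nn as nn

# `model` stands in for a ResNet-50 with BWCP modules in place of BN layers
model = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 80, 110], gamma=0.1)

for epoch in range(120):
    # ... iterate over ImageNet mini-batches (8 GPUs x 32 images each) and update `model` ...
    # for epoch < 20 the sparse regularization of Eqn.(7) is switched off (normal training);
    # afterwards the lambda_1 / lambda_2 penalty terms are added to the classification loss.
    scheduler.step()
```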

**Training setting on CIFAR-10 and CIFAR-100.** We train all models on CIFAR-10 and CIFAR-100
with a batch size of 64 on a single GPU for 160 epochs with momentum 0.9 and weight decay 1e-4.
The initial learning rate is 0.1 and is decreased by a factor of 10 at epochs 80 and 120. The coefficients
of the sparse regularization λ1 and λ2 are set to 4e-5 and 8e-5 for the CIFAR-10 dataset and to 7e-6
and 1.4e-5 for the CIFAR-100 dataset.

B.3 MORE RESULTS OF BWCP

The results of BWCP on the CIFAR-100 dataset are reported in Table 7. As we can see, our approach
BWCP achieves the lowest accuracy drop with a comparable FLOPs reduction compared with existing
channel pruning methods on all tested base models.


-----