|
# BWCP: PROBABILISTIC LEARNING-TO-PRUNE CHANNELS FOR CONVNETS VIA BATCH WHITENING
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
This work presents a probabilistic channel pruning method to accelerate Convolutional Neural Networks (CNNs). Previous pruning methods often zero out |
|
unimportant channels in training in a deterministic manner, which reduces CNN’s |
|
learning capacity and results in suboptimal performance. To address this problem, |
|
we develop a probability-based pruning algorithm, called batch whitening channel |
|
pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated. BWCP has several merits. (1) It |
|
simultaneously trains and prunes CNNs from scratch in a probabilistic way, exploring larger network space than deterministic methods. (2) BWCP is empowered by |
|
the proposed batch whitening tool, which is able to empirically and theoretically |
|
increase the activation probability of useful channels while keeping unimportant channels unchanged, without adding any extra parameters or computational cost
|
in inference. (3) Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet |
|
with various network architectures show that BWCP outperforms its counterparts |
|
by achieving better accuracy given limited computational budgets. For example, |
|
ResNet50 pruned by BWCP has only 0.58% Top-1 accuracy drop on ImageNet, |
|
while reducing 42.9% FLOPs of the plain ResNet50. |
|
|
|
1 INTRODUCTION |
|
|
|
Deep convolutional neural networks (CNNs) have achieved superior performance in a variety of |
|
computer vision tasks such as image recognition (He et al., 2016), object detection (Ren et al., |
|
2017), and semantic segmentation (Chen et al., 2018). However, despite their great success, deep |
|
CNN models often place massive demands on storage, memory bandwidth, and computational power (Han & Dally, 2018), making them difficult to deploy on resource-limited platforms, such as
|
portable and mobile devices (Deng et al., 2020). Therefore, proposing efficient and effective model |
|
compression methods has become a hot research topic in the deep learning community. |
|
|
|
Model pruning, as one of the vital model compression techniques, has been extensively investigated. |
|
It reduces model size and computational cost by removing unnecessary or unimportant weights or |
|
channels in a CNN (Han et al., 2016). For example, many recent works (Wen et al., 2016; Guo et al., |
|
2016) prune fine-grained weights of filters. Han et al. (2015) proposes to discard the weights that |
|
have magnitude less than a predefined threshold. Guo et al. (2016) further utilizes a sparse mask on |
|
a weight basis to achieve pruning. Although these unstructured pruning methods can achieve an optimal pruning schedule, they do not take the structure of CNNs into account, preventing them from being
|
accelerated on hardware such as GPU for parallel computations (Liu et al., 2018). |
|
|
|
To achieve efficient model storage and computations, we focus on structured channel pruning (Wen |
|
et al., 2016; Yang et al., 2019a; Liu et al., 2017), which removes entire structures in a CNN such as filters or channels. A typical structured channel pruning approach commonly contains three stages: pre-training a full model, pruning unimportant channels by a predefined criterion such as the ℓp norm, and fine-tuning the pruned model (Liu et al., 2017; Luo et al., 2017), as shown in Fig.1 (a).
|
However, it is usually hard to find a global pruning threshold to select unimportant channels, because |
|
the norm deviation between channels is often too small (He et al., 2019). More importantly, as some |
|
channels are permanently zeroed out in the pruning stage, such a multi-stage procedure usually not |
|
only relies on hand-crafted heuristics but also limits the learning capacity (He et al., 2018a; 2019). |
|
|
|
|
|
|
|
|
[Figure 1 graphic: (a) Norm-based method — channels → norm → threshold → 0-1 masks → prune → fine-tune; (b) Our proposed BWCP — channels → activation probability → soft masks → batch whitening → prune.]
|
|
|
Figure 1: Illustration of our proposed BWCP. (a) Previous channel pruning methods utilize a hard criterion such as the norm (Liu et al., 2017) of channels to deterministically remove unimportant channels, which deteriorates performance and needs an extra fine-tuning process (Frankle & Carbin, 2018). (b) Our proposed BWCP is a probability-based pruning framework where unimportant channels are stochastically pruned with activation probability, thus maintaining the learning capacity of original CNNs. In particular, our proposed batch whitening (BW) tool can increase the activation probability of useful channels while keeping the activation probability of unimportant channels unchanged, enabling BWCP to identify unimportant channels reliably.
|
|
|
To tackle the above issues, we propose a simple but effective probability-based channel pruning |
|
framework, named batch-whitening channel pruning (BWCP), where unimportant channels are |
|
pruned in a stochastic manner, thus preserving the channel space of CNNs in training (i.e. the |
|
diversity of CNN architectures is preserved). To be specific, as shown in Fig.1 (b), we assign each |
|
channel with an activation probability (i.e. the probability of a channel being activated), by exploring |
|
the properties of the batch normalization layer (Ioffe & Szegedy, 2015; Arpit et al., 2016). A larger |
|
activation probability indicates that the corresponding channel is more likely to be preserved. |
|
|
|
We also introduce a capable tool, termed batch whitening (BW), which can increase the activation |
|
probability of useful channels, while keeping the unnecessary channels unchanged. By doing so, |
|
the deviation of the activation probability between channels is explicitly enlarged, enabling BWCP |
|
to identify unimportant channels during training easily. Such an appealing property is justified by |
|
theoretical analysis and experiments. Furthermore, we exploit activation probability adjusted by |
|
BW to generate a set of differentiable masks by a soft sampling procedure with Gumbel-Softmax |
|
technique, allowing us to train BWCP in an online “pruning-from-scratch” fashion stably. After |
|
training, we obtain the final compact model by directly discarding the channels with zero masks. |
|
|
|
The main contributions of this work are three-fold. (1) We propose a probability-based channel |
|
pruning framework BWCP, which explores a larger network space than deterministic methods. (2) |
|
BWCP can easily identify unimportant channels by adjusting their activation probabilities without |
|
adding any extra model parameters and computational cost in inference. (3) Extensive experiments on |
|
CIFAR-10, CIFAR-100 and ImageNet datasets with various network architectures show that BWCP |
|
can achieve better recognition performance given the comparable amount of resources compared |
|
to existing approaches. For example, BWCP reduces 68.08% FLOPs by compressing 93.12% of the parameters of VGG-16 with a negligible accuracy drop, and ResNet-50 pruned by BWCP has only a 0.58% top-1 accuracy drop on ImageNet while reducing 42.9% FLOPs.
|
|
|
2 RELATED WORK |
|
|
|
**Weight Pruning. Early network pruning methods mainly remove the unimportant weights in the** |
|
network. For instance, Optimal Brain Damage (LeCun et al., 1990) measures the importance of |
|
weights by evaluating the impact of weight on the loss function and prunes less important ones. |
|
However, it is not applicable to modern network structures due to the heavy computation of the Hessian matrix. Recent work assesses the importance of the weights through the magnitude of the weights themselves. Specifically, Guo et al. (2016) prune the network by encouraging weights to become exactly zero; the computation involving zero-valued weights can then be discarded. However, a major drawback of
|
weight pruning techniques is that they do not take the structure of CNNs into account, thus failing to |
|
help scale pruned models on commodity hardware such as GPUs (Liu et al., 2018; Wen et al., 2016). |
|
|
|
**Channel Pruning. Channel pruning approaches directly prune feature maps or filters of CNNs,** |
|
making it easy for hardware-friendly implementation. For instance, relaxed ℓ0 regularization (Louizos |
|
|
|
|
|
|
|
|
et al., 2017) and group regularizer (Yang et al., 2019a) impose channel-level sparsity, and filters with |
|
small values are selected to be pruned. Some recent works also propose to rank the importance of
|
filters by different criteria including ℓ1 norm (Liu et al., 2017; Li et al., 2017), ℓ2 norm (Frankle & |
|
Carbin, 2018) and High Rank channels (Lin et al., 2020). For example, (Liu et al., 2017) explores the |
|
importance of filters through scale parameter γ in batch normalization. Although these approaches |
|
introduce minimum overhead to the training process, they are not trained in an end-to-end manner |
|
and usually either apply on a pre-trained model or require an extra fine-tuning procedure. |
|
|
|
Recent works tackle this issue by pruning CNNs from scratch. For example, FPGM (He et al., 2019) |
|
zeros out unimportant channels and continues training them after each training epoch. Furthermore, both SSS (Huang & Wang, 2018) and DSA (Ning et al., 2020) learn a differentiable binary mask that is generated by channel importance and does not require any additional fine-tuning. Our proposed BWCP is most related to variational pruning (Zhao et al., 2019) and SCP (Kang & Han, 2020), as they also employ the property of the normalization layer and associate the importance of a channel with a probability. The main difference is that our method adopts the idea of whitening to perform channel pruning. We will show that the proposed batch whitening (BW) technique can adjust the activation probability of different channels according to their importance, making it easy to identify unimportant channels. Although previous works SPP (Wang et al., 2017) and DynamicCP (Gao et al., 2018) also attempt to boost salient channels and skip unimportant ones, they fail to consider the natural property inside the normalization layer and design the activation probability empirically.
|
|
|
3 PRELIMINARY |
|
|
|
**Notation.** We use regular letters to denote scalars such as ‘x’, bold letters to denote vectors, matrices, and tensors such as ‘**x**’, and capital letters to denote random variables such as ‘X’.
|
|
|
We begin with introducing a building layer in recent deep neural nets which typically consists of |
|
a convolution layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) (Ioffe & |
|
Szegedy, 2015; He et al., 2016). Formally, it can be written by |
|
|
|
$$\mathbf{x}_c = \mathbf{w}_c * \mathbf{z}, \quad \tilde{\mathbf{x}}_c = \gamma_c \bar{\mathbf{x}}_c + \beta_c, \quad \mathbf{y}_c = \max\{0, \tilde{\mathbf{x}}_c\} \qquad (1)$$
|
|
|
where $c \in [C]$ denotes the channel index and $C$ is the channel size. In Eqn.(1), ‘$*$’ indicates the convolution operation and $\mathbf{w}_c$ is the filter weight corresponding to the $c$-th output channel, i.e. $\mathbf{x}_c \in \mathbb{R}^{N \times H \times W}$. To perform normalization, $\mathbf{x}_c$ is first standardized to $\bar{\mathbf{x}}_c$ through $\bar{\mathbf{x}}_c = (\mathbf{x}_c - \mathrm{E}[\mathbf{x}_c])/\sqrt{\mathrm{D}[\mathbf{x}_c]}$, where $\mathrm{E}[\cdot]$ and $\mathrm{D}[\cdot]$ indicate calculating the mean and variance over a batch of samples, and then $\bar{\mathbf{x}}_c$ is re-scaled to $\tilde{\mathbf{x}}_c$ by the scale parameter $\gamma_c$ and bias $\beta_c$. Moreover, the output feature $\mathbf{y}_c$ is obtained by the ReLU activation that discards the negative part of $\tilde{\mathbf{x}}_c$.
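For concreteness, the building layer in Eqn.(1) corresponds to the standard conv-BN-ReLU pattern; a minimal PyTorch-style sketch is given below. The layer sizes and names are our own illustration, not part of the paper's implementation.

```python
import torch
import torch.nn as nn

# A conv-BN-ReLU block matching Eqn.(1): x_c = w_c * z (convolution),
# x_tilde_c = gamma_c * x_bar_c + beta_c (BN), y_c = max{0, x_tilde_c} (ReLU).
block = nn.Sequential(
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),    # holds the per-channel scale gamma (weight) and bias beta
    nn.ReLU(inplace=True),
)

z = torch.randn(8, 16, 14, 14)   # a mini-batch of N = 8 input feature maps
y = block(z)                     # output y has shape (8, 32, 14, 14)
```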
|
|
|
**Criterion-based channel pruning. For channel pruning, previous methods usually employ a ‘small-** |
|
norm-less-important’ criterion to measure the importance of channels. For example, BN layer can |
|
be applied in channel pruning (Liu et al., 2017), where a channel with a small value of γc would |
|
be removed. The reason is that the c-th output channel ˜xc contributes little to the learned feature |
|
representation when γc is small. Hence, the convolution in Eqn.(1) can be discarded safely, and filter |
|
**wc can thus be pruned. Unlike these criterion-based methods that deterministically prune unimportant** |
|
filters and rely on a heuristic pruning procedure as shown in Fig.1(a), we explore a probability-based |
|
channel pruning framework where less important channels are pruned in a stochastic manner. |
|
|
|
**Activation probability. To this end, we define an activation probability of a channel by exploring the** |
|
property of the BN layer. Those channels with a larger activation probability could be preserved with |
|
a higher probability. To be specific, since $\bar{\mathbf{x}}_c$ is acquired by subtracting the sample mean and dividing by the sample standard deviation, we can treat each channel feature as a random variable following a standard Normal distribution (Arpit et al., 2016), denoted as $\bar{X}_c$. Note that only the positive part can be activated by the ReLU function. Proposition 1 gives the activation probability of the $c$-th channel, i.e. $P(\tilde{X}_c > 0)$.
|
|
|
**Proposition 1** Let a random variable $\bar{X}_c \sim \mathcal{N}(0, 1)$ and $Y_c = \max\{0, \gamma_c\bar{X}_c + \beta_c\}$. Then we have (1) $P(Y_c > 0) = P(\tilde{X}_c > 0) = (1 + \mathrm{Erf}(\beta_c/(\sqrt{2}|\gamma_c|)))/2$, where $\mathrm{Erf}(x) = \int_0^x \frac{2}{\sqrt{\pi}}\exp(-t^2)\,dt$, and (2) $P(\tilde{X}_c > 0) = 0 \Leftrightarrow \beta_c \leq 0$ and $\gamma_c \to 0$.
|
Note that a pruned channel can be modelled by $P(\tilde{X}_c > 0) = 0$. With Proposition 1 (see proof in Appendix A.2), we know that the unnecessary channels satisfy that γc approaches 0 and βc is negative.
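A minimal sketch of how the activation probability in Proposition 1 can be evaluated from the BN parameters is shown below; the function name and example values are ours and only illustrative.

```python
import math
import torch

def activation_probability(gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # P(X_tilde_c > 0) = (1 + Erf(beta_c / (sqrt(2) * |gamma_c|))) / 2  (Proposition 1)
    return 0.5 * (1.0 + torch.erf(beta / (math.sqrt(2.0) * gamma.abs() + eps)))

# A channel with gamma -> 0 and beta <= 0 is almost never activated (a pruned channel),
# whereas channels with |gamma| > 0 keep a non-trivial activation probability.
gamma = torch.tensor([1e-6, 0.5, 1.0])
beta = torch.tensor([-0.2, -0.2, 0.3])
print(activation_probability(gamma, beta))   # approx. [0.00, 0.34, 0.62]
```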
|
|
|
|
|
[Figure 2 graphic: BN → BW (covariance Σ and Newton iteration for Σ^(-1/2)) → Soft Gating Module (activation probability P(x̂ > 0) → Gumbel-Softmax → soft mask) → ReLU.]
|
|
|
Figure 2: A schematic of the proposed Batch Whitening Channel Pruning (BWCP) algorithm that |
|
consists of a BW module and a soft sampling procedure. By modifying the BN layer with a whitening
|
operator, the proposed BW technique adjusts activation probabilities of different channels. These |
|
activation probabilities are then utilized by a soft sampling procedure. |
|
|
|
To achieve channel pruning, previous compression techniques (Li et al., 2017; Zhao et al., 2019) merely impose a regularization on γc, which would deteriorate the representation power of unpruned channels (Perez et al., 2018; Wang et al., 2020). Instead, we adopt the idea of whitening to build a probabilistic channel pruning framework where unnecessary channels are stochastically discarded with a small activation probability while important channels are preserved with a large activation probability.
|
|
|
4 BATCH WHITENING CHANNEL PRUNING |
|
|
|
This section introduces the proposed batch whitening channel pruning (BWCP) algorithm, which |
|
contains a batch whitening module that can adjust the activation probability of channels, and a soft |
|
sampling module that stochastically prunes channels with the activation probability adjusted by BW. |
|
The whole pipeline of BWCP is illustrated in Fig.2. |
|
|
|
By modifying the BN layer in Eqn.(1), we have the formulation of BWCP,
$$\mathbf{x}_c^{\mathrm{out}} = \underbrace{\hat{\mathbf{x}}_c}_{\text{batch whitening}} \odot \underbrace{m_c(P(\hat{X}_c > 0))}_{\text{soft sampling}} \qquad (2)$$
where $\mathbf{x}_c^{\mathrm{out}}, \hat{\mathbf{x}}_c \in \mathbb{R}^{N \times H \times W}$ denote the output of the proposed BWCP algorithm and of the BW module, respectively, ‘$\odot$’ denotes broadcast multiplication, and $m_c \in [0, 1]$ denotes a soft mask produced by a soft sampling procedure that takes the activation probability of the output features of BW (i.e. $P(\hat{X}_c > 0)$) and returns a soft mask. The closer
|
the activation probability is to 0 or 1, the more likely the mask is to be hard. To distinguish important |
|
channels from unimportant ones, BW is proposed to increase the activation probability of useful |
|
channels while keeping the probability of unimportant channels unchanged during training. Since |
|
Eqn.(2) always retains all channels in the network, our BWCP can preserve the learning capacity of
|
the original network during training (He et al., 2018a). The following sections present BW and soft |
|
sampling module in detail. |
|
|
|
4.1 BATCH WHITENING |
|
|
|
Unlike previous works (Zhao et al., 2019; Kang & Han, 2020) that simply measure the importance of |
|
channels by parameters in BN layer, we attempt to whiten features after BN layer by the proposed |
|
BW module. We show that BW can change the activation probability of channels according to their |
|
importance without adding additional parameters or computational overhead at inference.
|
|
|
As shown in Fig.2, BW acts after the BN layer. By rewriting Eqn.(1) into a vector form, we have the formulation of BW,
$$\hat{\mathbf{x}}_{nij} = \mathbf{\Sigma}^{-\frac{1}{2}} (\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij} + \boldsymbol{\beta}) \qquad (3)$$
where $\hat{\mathbf{x}}_{nij} \in \mathbb{R}^{C \times 1}$ is a vector whose elements denote the output of BW for the $n$-th sample at location $(i, j)$ for all channels, $\mathbf{\Sigma}^{-\frac{1}{2}}$ is a whitening operator, and $\mathbf{\Sigma} \in \mathbb{R}^{C \times C}$ is the covariance matrix of the channel features $\{\tilde{\mathbf{x}}_c\}_{c=1}^{C}$. Moreover, $\boldsymbol{\gamma} \in \mathbb{R}^{C \times 1}$ and $\boldsymbol{\beta} \in \mathbb{R}^{C \times 1}$ are two vectors obtained by stacking $\gamma_c$ and $\beta_c$ of all the channels, respectively, and $\bar{\mathbf{x}}_{nij} \in \mathbb{R}^{C \times 1}$ is a vector obtained by stacking the elements $\bar{x}_{ncij}$ from all channels into a column vector.
|
**Training and inference. Note that BW in Eqn.(3) requires computing a root inverse of a covariance** |
|
matrix of channel features after the BN layer. Towards this end, we calculate the covariance matrix Σ |
|
within a batch of samples during each training step as given by |
|
|
|
|
|
$$\mathbf{\Sigma} = \frac{1}{NHW} \sum_{n,i,j=1}^{N,H,W} (\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij})(\boldsymbol{\gamma} \odot \bar{\mathbf{x}}_{nij})^{\mathsf{T}} = (\boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}}) \odot \boldsymbol{\rho} \qquad (4)$$
where $\boldsymbol{\rho}$ is a $C$-by-$C$ correlation matrix of the channel features $\{\bar{\mathbf{x}}_c\}_{c=1}^{C}$ (see details in Appendix A.1).
|
The Newton iteration is further employed to calculate its root inverse, $\mathbf{\Sigma}^{-\frac{1}{2}}$, as given by the following iterations
$$\mathbf{\Sigma}_k = \frac{1}{2}\left(3\mathbf{\Sigma}_{k-1} - \mathbf{\Sigma}_{k-1}^{3}\mathbf{\Sigma}\right), \quad k = 1, 2, \cdots, T, \qquad (5)$$
where $k$ and $T$ are the iteration index and iteration number respectively, and $\mathbf{\Sigma}_0 = \mathbf{I}$ is an identity matrix. Note that when $\|\mathbf{I} - \mathbf{\Sigma}\|_2 < 1$, Eqn.(5) converges to $\mathbf{\Sigma}^{-\frac{1}{2}}$ (Bini et al., 2005). To satisfy this condition, $\mathbf{\Sigma}$ can be normalized by $\mathbf{\Sigma}/\mathrm{tr}(\mathbf{\Sigma})$ following (Huang et al., 2019), where $\mathrm{tr}(\cdot)$ is the trace operator. In this way, the normalized covariance matrix can be written as $\mathbf{\Sigma}_N = \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}} \odot \boldsymbol{\rho} / \|\boldsymbol{\gamma}\|_2^2$.
|
|
|
During inference, we use a moving average to calculate the population estimate of $\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}}$ following the updating rule $\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}} = (1 - g)\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}} + g\mathbf{\Sigma}_N^{-\frac{1}{2}}$. Here $\mathbf{\Sigma}_N$ is the covariance calculated within each mini-batch at each training step, and $g$ denotes the momentum of the moving average. Since $\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}}$ is fixed during inference, the proposed BW does not introduce extra costs in memory or computation, because $\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}}$ can be viewed as a convolution kernel with a size of 1, which can be absorbed into the previous convolutional layer. For completeness, we also analyze the training overhead of BWCP in Appendix Sec.A.3, where we see that BWCP introduces a little extra training overhead.
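The computation behind Eqn.(4-5) can be sketched as follows; this is our own illustrative implementation (tensor layout and function name assumed), not the authors' code.

```python
import torch

def bw_root_inverse(x_tilde: torch.Tensor, T: int = 5) -> torch.Tensor:
    """Estimate Sigma_N^{-1/2} from BN outputs x_tilde of shape (N, C, H, W) via Eqn.(4-5)."""
    N, C, H, W = x_tilde.shape
    x = x_tilde.permute(1, 0, 2, 3).reshape(C, -1)     # (C, N*H*W) channel features
    x = x - x.mean(dim=1, keepdim=True)                # centering recovers gamma ⊙ x_bar (E[x_tilde_c] = beta_c)
    sigma = x @ x.t() / x.shape[1]                     # Sigma = (gamma gamma^T) ⊙ rho, Eqn.(4)
    sigma_n = sigma / torch.trace(sigma)               # normalize so that Newton's iteration converges
    s = torch.eye(C)                                   # Sigma_0 = I
    for _ in range(T):                                 # Sigma_k = 0.5 * (3 Sigma_{k-1} - Sigma_{k-1}^3 Sigma_N), Eqn.(5)
        s = 0.5 * (3.0 * s - s @ s @ s @ sigma_n)
    return s

# Eqn.(3) then whitens each spatial location: x_hat[n, :, i, j] = s @ x_tilde[n, :, i, j].
# At inference the running estimate of s is fixed and acts as a 1x1 linear map, so it can be
# folded into the preceding convolution, adding no extra test-time cost.
```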
|
|
|
4.2 ANALYSIS OF BWCP |
|
|
|
In this section, we show that BWCP can easily identify unimportant channels by increasing the |
|
difference in activation probability between important and unimportant channels.
|
|
|
**Proposition 2** Let a random variable $\bar{X} \sim \mathcal{N}(0, 1)$ and $Y_c = \max\{0, [\hat{\mathbf{\Sigma}}_N^{-\frac{1}{2}}(\boldsymbol{\gamma} \odot \bar{X} + \boldsymbol{\beta})]_c\}$. Then we have $P(Y_c > \delta) = P(\hat{X}_c > \delta) = (1 + \mathrm{Erf}((\hat{\beta}_c - \delta)/(\sqrt{2}|\hat{\gamma}_c|)))/2$, where $\delta$ is a small positive constant, and $\hat{\gamma}_c$ and $\hat{\beta}_c$ are the equivalent scale parameter and bias defined by the BW module. Taking $T = 1$ in Eqn.(5) as an example, we have $\hat{\gamma}_c = \frac{1}{2}(3\gamma_c - \sum_{d=1}^{C}\gamma_d^2\gamma_c\rho_{dc}/\|\boldsymbol{\gamma}\|_2^2)$ and $\hat{\beta}_c = \frac{1}{2}(3\beta_c - \sum_{d=1}^{C}\beta_d\gamma_d\gamma_c\rho_{dc}/\|\boldsymbol{\gamma}\|_2^2)$, where $\rho_{dc}$ is the Pearson correlation between channel features $\bar{\mathbf{x}}_c$ and $\bar{\mathbf{x}}_d$.
|
|
|
By Proposition 2, BWCP can adjust the activation probability by changing the values of γc and βc in Proposition 1 through the BW module (see details in Appendix A.4). Here we introduce a small positive constant δ to avoid small activation feature values. To see how BW changes the activation probability of different channels, we consider two cases as shown in Proposition 3.
|
|
|
**Case 1:** βc ≤ 0 and γc → 0. In this case, the c-th channel of the BN layer would be activated with a small activation probability, as it sufficiently approaches zero. We can see from Proposition 3 that the activation probability of the c-th channel still approaches zero after BW is applied, showing that the proposed BW module keeps the unimportant channels unchanged in this case. **Case 2:** |γc| > 0. For this case, the c-th channel of the BN layer would be activated with a high activation probability. From Proposition 3, the activation probability of the c-th channel is enlarged after BW is applied. Therefore, our proposed BW module can increase the activation probability of important channels. Detailed proof of Proposition 3 can be found in Appendix A.5. We also empirically verify Proposition 3 in Sec. 5.3. Notice that we neglect a trivial case in which the channel can also be activated (i.e. βc > 0 and |γc| → 0). In fact, the channels can be removed in this case because the channel feature is always constant, which can be deemed as a bias.
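As a small numerical sanity check of the closed-form γ̂c and β̂c in Proposition 2 (T = 1), the sketch below compares them against the direct matrix computation ½(3I − Σ_N)γ and ½(3I − Σ_N)β; the correlation matrix and parameter values are arbitrary and only for illustration.

```python
import torch

C = 4
gamma = torch.tensor([1.0, 0.8, 0.3, 1e-6])            # last channel is an unimportant one (gamma -> 0)
beta = torch.tensor([0.2, -0.1, 0.4, -0.3])
rho = torch.tensor([[1.0, 0.5, 0.2, 0.0],
                    [0.5, 1.0, 0.1, 0.0],
                    [0.2, 0.1, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])             # an illustrative correlation matrix

# Direct computation with T = 1: Sigma_N^{-1/2} ~= 0.5 * (3I - Sigma_N).
sigma_n = (gamma[:, None] * gamma[None, :]) * rho / gamma.pow(2).sum()
s = 0.5 * (3.0 * torch.eye(C) - sigma_n)
gamma_hat_direct, beta_hat_direct = s @ gamma, s @ beta

# Closed form from Proposition 2.
norm2 = gamma.pow(2).sum()
gamma_hat = 0.5 * (3.0 * gamma - gamma * (rho @ gamma.pow(2)) / norm2)
beta_hat = 0.5 * (3.0 * beta - gamma * (rho @ (beta * gamma)) / norm2)

print(torch.allclose(gamma_hat, gamma_hat_direct), torch.allclose(beta_hat, beta_hat_direct))  # True True
# The pair (gamma_hat_c, beta_hat_c) can then be plugged into Proposition 1 to obtain the
# adjusted activation probability of each channel after BW.
```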
|
|
|
4.3 SOFT SAMPLING MODULE |
|
|
|
The soft sampling procedure samples the output of BW through a set of differentiable masks. To be |
|
specific, as shown in Fig.2, we leverage the Gumbel-Softmax sampling (Jang et al., 2017) that takes |
|
the activation probability generated by BW and produces a soft mask as given by |
|
$$m_c = \mathrm{GumbelSoftmax}(P(\hat{X}_c > 0); \tau) \qquad (6)$$
|
where τ is the temperature. By Eqn.(2) and Eqn.(6), BWCP stochastically prunes unimportant |
|
channels with activation probability. A smaller activation probability makes mc more likely to be |
|
|
|
|
|
close to 0. Hence, our proposed BW can help identify less important channels by enlarging the activation probability of important channels, as mentioned in Sec.4.2. Note that mc converges to a 0-1 mask when τ approaches 0. In the experiment, we find that setting τ = 0.5 is enough for BWCP to achieve hard pruning at test time.
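A minimal sketch of the binary Gumbel-Softmax relaxation behind the soft sampling in Eqn.(6) is shown below; we write the Gumbel/logistic reparameterization explicitly rather than calling a library routine, and the names and example values are ours.

```python
import torch

def soft_mask(p_act: torch.Tensor, tau: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Relaxed binary mask m_c in (0, 1) sampled from the activation probability p_act (Eqn. 6)."""
    logits = torch.log(p_act + eps) - torch.log(1.0 - p_act + eps)   # log-odds of keeping channel c
    u = torch.rand_like(p_act)
    noise = torch.log(u + eps) - torch.log(1.0 - u + eps)            # Logistic noise (difference of two Gumbels)
    return torch.sigmoid((logits + noise) / tau)

p = torch.tensor([0.95, 0.46, 0.15, 0.03])   # activation probabilities after BW (cf. Fig. 1b)
m = soft_mask(p, tau=0.5)                     # values near 1 for likely-kept channels, near 0 otherwise
```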
|
|
|
**Proposition 3** Let $\delta = \|\boldsymbol{\gamma}\|_2 \sqrt{\sum_{j=1}^{C}(\gamma_j\beta_c - \gamma_c\beta_j)^2\rho_{cj}^2} \,/\, \big(\|\boldsymbol{\gamma}\|_2^2 - \sum_{j=1}^{C}\gamma_j^2\rho_{cj}\big)$. With $\hat{\gamma}$ and $\hat{\beta}$ defined in Proposition 2, we have (1) $P(\hat{X}_c > \delta) = 0$ if $|\gamma_c| \to 0$ and $\beta_c \leq 0$, and (2) $P(\hat{X}_c > \delta) \geq P(\tilde{X}_c \geq \delta)$ if $|\gamma_c| > 0$.
|
**Solution to residual issue.** Note that the number of channels in the last convolution layer must be the same as that of the previous blocks due to the element-wise summation in recent advanced CNN architectures (He et al., 2016; Huang et al., 2017). We solve this problem by letting the BW layer in the last convolution layer and the shortcut share the same mask, as discussed in Appendix A.6.
|
|
|
4.4 TRAINING OF BWCP |
|
|
|
This section introduces a sparsity regularization, which makes the model compact, and then describes |
|
the training algorithm of BWCP. |
|
|
|
**Sparse Regularization.** With Proposition 1, we see that a main characteristic of pruned channels in the BN layer is that γc sufficiently approaches 0 and βc is negative. By Proposition 3, we find that this is also a necessary condition for a channel to be pruned after the BW module is applied. Hence, we obtain unnecessary channels by directly imposing a regularization on γc and βc as given by
$$\mathcal{L}_{\mathrm{sparse}} = \sum_{c=1}^{C} \lambda_1|\gamma_c| + \lambda_2\beta_c \qquad (7)$$
where the first term makes γc small, and the second term encourages βc to be negative. The
|
above sparse regularizer is imposed on all BN layers of the network. By changing the strength of |
|
regularization (i.e. λ1 and λ2), we can achieve different pruning ratios. In fact, βc and |γc| represent |
|
the mean and standard deviation of a Normal distribution, respectively. Following the empirical rule |
|
of Normal distribution, setting λ1 as triple or double λ2 would be a good choice to encourage sparse |
|
channels in implementation. Moreover, we observe that 42.2% and 41.3% of channels have βc ≤ 0, while 0.47% and 5.36% of channels have |γc| < 0.05, on trained plain ResNet-34 and ResNet-50, respectively.
|
Hence, changing the strength of regularization on γc will affect FLOPs more than that of βc. If one |
|
wants to pursue a more compact model, increasing λ1 is more effective than λ2. |
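A sketch of the sparsity term in Eqn.(7), accumulated over all BN layers of a network, is given below; the default strengths are only illustrative values in the 1e-4 range discussed above.

```python
import torch
import torch.nn as nn

def sparse_regularization(model: nn.Module, lambda1: float = 1.2e-4, lambda2: float = 0.6e-4) -> torch.Tensor:
    """L_sparse = sum_c lambda1 * |gamma_c| + lambda2 * beta_c over every BN layer (Eqn. 7)."""
    loss = torch.zeros(())
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            loss = loss + lambda1 * m.weight.abs().sum() + lambda2 * m.bias.sum()
    return loss

# During training the term is simply added to the task loss, e.g.
# total_loss = criterion(model(images), labels) + sparse_regularization(model)
```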
|
|
|
**Training Algorithm. BWCP can be easily plugged into a CNN by modifying the traditional BN** |
|
operations. Hence, the training of BWCP can be simply implemented in existing software platforms |
|
such as PyTorch and TensorFlow. In other words, the forward propagation of BWCP can be |
|
represented by Eqn.(2-3) and Eqn.(6), all of which define differentiable transformations. Therefore, |
|
our proposed BWCP can train and prune deep models in an end-to-end manner. Appendix A.7 also |
|
provides the explicit gradient back-propagation of BWCP. On the other hand, we do not introduce |
|
extra parameters to learn the pruning mask mc. Instead, mc in Eqn.(6) is totally determined by the |
|
parameters in BN layers including γ, β and Σ. Hence, we can perform joint training of pruning mask |
|
$m_c$ and model parameters. The BWCP framework is provided in Algorithm 1 of Appendix Sec. A.6.
|
|
|
**Final architecture. The final architecture is fixed at the end of training. During training, we use the** |
|
Gumbel-Softmax procedure by Eqn.(6) to produce a soft mask. At test time, we instead use a hard |
|
0-1 mask obtained by a sign function (i.e. mc = sign(P(X̂c > 0) − 0.5)) to obtain the network’s output.
|
To make the inference stage stable, we use a sigmoid-like transformation to make the activation
|
probability approach 0 or 1 in training. By this strategy, we find that both the training and inference |
|
stage are stable and obtain a fixed compact model. After training, we obtain the final compact model |
|
by directly pruning channels with a mask value of 0. Therefore, our proposed BWCP does not need |
|
an extra fine-tuning procedure. |
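At test time the soft masks become hard 0-1 masks, which determine the final compact architecture; a minimal sketch of this thresholding step is shown below (the helper name and example values are ours).

```python
import torch

def hard_mask(p_act: torch.Tensor) -> torch.Tensor:
    # m_c = sign(P(X_hat_c > 0) - 0.5), with sign(x) = 1 if x >= 0 and 0 otherwise (footnote in Sec. 5.3)
    return (p_act >= 0.5).float()

p = torch.tensor([0.95, 0.46, 0.15, 0.03])
keep = hard_mask(p).bool()        # tensor([True, False, False, False])
# Channels with keep == False are physically removed from the corresponding conv/BN/BW
# parameters to obtain the final compact model, so no extra fine-tuning is required.
```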
|
|
|
5 EXPERIMENTS |
|
|
|
In this section, we extensively experiment with the proposed BWCP on CIFAR-10/100 and ImageNet. |
|
We show the advantages of BWCP in both recognition performance and FLOPs reduction comparing |
|
with existing channel pruning methods. We also provide an ablation study to analyze the proposed |
|
framework. The details of datasets and training configurations are provided in Appendix B. |
|
|
|
|
|
|
|
|
Table 1: Performance comparison between our proposed approach BWCP and other methods on |
|
CIFAR-10. “Baseline Acc.” and “Acc.” denote the accuracies of the original and pruned models, |
|
respectively. “Acc. Drop” means the accuracy of the base model minus that of pruned models (smaller |
|
is better). “Channels ↓”, “Model Size ↓”, and “FLOPs ↓” denote the relative reductions in individual |
|
metrics compared to the unpruned networks (larger is better). ‘*’ indicates the method needs an extra
|
fine-tuning to recover performance. The best-performing results are highlighted in bold. |
|
|
|
| Model | Method | Baseline Acc. (%) | Acc. (%) | Acc. Drop | Channels ↓ (%) | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-56 | DCP* (Zhuang et al., 2018) | 93.80 | 93.49 | 0.31 | - | 49.24 | 50.25 |
| ResNet-56 | AMC* (He et al., 2018b) | 92.80 | 91.90 | 0.90 | - | - | 50.00 |
| ResNet-56 | SFP (He et al., 2018a) | 93.59 | 92.26 | 1.33 | 40 | - | **52.60** |
| ResNet-56 | FPGM (He et al., 2019) | 93.59 | 92.93 | 0.66 | 40 | - | **52.60** |
| ResNet-56 | SCP (Kang & Han, 2020) | 93.69 | 93.23 | 0.46 | **45** | **46.47** | 51.20 |
| ResNet-56 | BWCP (Ours) | 93.64 | 93.37 | **0.27** | 40 | 44.42 | 50.35 |
| DenseNet-40 | Slimming* (Liu et al., 2017) | 94.39 | 92.59 | 1.80 | 80 | 73.53 | 68.95 |
| DenseNet-40 | Variational Pruning (Zhao et al., 2019) | 94.11 | 93.16 | 0.95 | 60 | 59.76 | 44.78 |
| DenseNet-40 | SCP (Kang & Han, 2020) | 94.39 | 93.77 | 0.62 | 81 | 75.41 | 70.77 |
| DenseNet-40 | BWCP (Ours) | 94.21 | 93.82 | **0.39** | **82** | **76.03** | **71.72** |
| VGGNet-16 | Slimming* (Liu et al., 2017) | 93.85 | 92.91 | 0.94 | 70 | 87.97 | 48.12 |
| VGGNet-16 | Variational Pruning (Zhao et al., 2019) | 93.25 | 93.18 | 0.07 | 62 | 73.34 | 39.10 |
| VGGNet-16 | SCP (Kang & Han, 2020) | 93.85 | 93.79 | 0.06 | 75 | 93.05 | 66.23 |
| VGGNet-16 | BWCP (Ours) | 93.85 | 93.82 | **0.03** | **76** | **93.12** | **68.08** |
| MobileNet-V2 | DCP* (Zhuang et al., 2018) | 94.47 | 94.69 | -0.22 | - | 23.6 | 27.0 |
| MobileNet-V2 | MDP (Guo et al., 2020) | 95.02 | 95.14 | -0.12 | - | - | 28.7 |
| MobileNet-V2 | BWCP (Ours) | 94.56 | 94.90 | **-0.36** | - | **32.3** | **37.7** |
|
|
|
5.1 RESULTS ON CIFAR-10 |
|
|
|
For CIFAR-10 dataset, we evaluate our BWCP on ResNet-56, DenseNet-40 and VGG-16 and compare |
|
our approach with Slimming (Liu et al., 2017), Variational Pruning (Zhao et al., 2019) and SCP (Kang |
|
& Han, 2020). These methods prune redundant channels using BN layers like our algorithm. We |
|
also compare BWCP with previous strong baselines such as AMC (He et al., 2018b) and DCP |
|
(Zhuang et al., 2018). The results of slimming are obtained from SCP (Kang & Han, 2020). As |
|
mentioned in Sec.4.2, our BWCP adjusts the activation probability of different channels. Therefore,
|
it would present better recognition accuracy with comparable computation consumption by entirely |
|
exploiting important channels. As shown in Table 1, our BWCP achieves the lowest accuracy drops |
|
and comparable FLOPs reduction compared with existing channel pruning methods in all tested base |
|
networks. For example, although our model is not fine-tuned, the accuracy drop of the pruned network |
|
given by BWCP based on DenseNet-40 and VGG-16 outperforms Slimming with fine-tuning by |
|
1.41% and 0.91% points, respectively. And ResNet-56 pruned by BWCP attains better classification |
|
accuracy than the previous strong baselines AMC (He et al., 2018b) and DCP (Zhuang et al., 2018)
|
without an extra fine-tuning stage. Besides, our method achieves superior accuracy compared to the |
|
Variational Pruning even with significantly smaller model sizes on DensNet-40 and VGGNet-16, |
|
demonstrating its effectiveness. We also test BWCP with MobileNet-V2 on the CIFAR10 dataset. |
|
From Table 1, we see that BWCP achieves better classification accuracy while reducing more FLOPs. We also report results of BWCP on CIFAR-100 in Appendix B.3.
|
|
|
5.2 RESULTS ON IMAGENET |
|
|
|
For ImageNet dataset, we test our proposed BWCP on two representative base models ResNet-34 and |
|
ResNet-50. The proposed BWCP is compared with SFP (He et al., 2018a), FPGM (He et al., 2019), SSS (Huang & Wang, 2018), SCP (Kang & Han, 2020), HRank (Lin et al., 2020) and DSA (Ning
|
et al., 2020) since they prune channels without an extra fine-tuning stage. As shown in Table 2, we |
|
see that BWCP consistently outperforms its counterparts in recognition accuracy under comparable |
|
FLOPs. For ResNet-34, FPGM (He et al., 2019) and SFP (He et al., 2018a) without fine-tuning |
|
accelerates ResNet-34 by 41.1% speedup ratio with 2.13% and 2.09% accuracy drop respectively, but |
|
our BWCP without finetuning achieve almost the same speedup ratio with only 1.16% top-1 accuracy |
|
drop. On the other hand, BWCP also significantly outperforms FPGM (He et al., 2019) by 1.07% |
|
top-1 accuracy after going through a fine-tuning stage. For ResNet-50, BWCP still achieves better |
|
performance compared with other approaches. For instance, at the level of 40% FLOPs reduction, |
|
the top-1 accuracy of BWCP exceeds SSS (Huang & Wang, 2018) by 3.72%. Moreover, BWCP |
|
outperforms DSA (Ning et al., 2020) by top-1 accuracy of 0.34% and 0.21% at level of 40% and |
|
50% FLOPs respectively. However, BWCP has slightly lower top-5 accuracy than DSA (Ning et al., |
|
2020). |
|
|
|
**Inference Acceleration. We analyze the realistic hardware acceleration in terms of GPU and CPU** |
|
running time during inference. The CPU type is Intel Xeon CPU E5-2682 v4, and the GPU is |
|
|
|
|
|
Table 2: Performance of our proposed BWCP and other pruning methods on ImageNet using base models ResNet-34 and ResNet-50. ’*’ indicates the pruned model is fine-tuned.

| Model | Method | Baseline Top-1 Acc. (%) | Baseline Top-5 Acc. (%) | Top-1 Acc. Drop | Top-5 Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|
| ResNet-34 | FPGM* (He et al., 2019) | 73.92 | 91.62 | 1.38 | 0.49 | 41.1 |
| ResNet-34 | BWCP* (Ours) | 73.72 | 91.64 | **0.31** | **0.34** | **41.0** |
| ResNet-34 | SFP (He et al., 2018a) | 73.92 | 91.62 | 2.09 | 1.29 | 41.1 |
| ResNet-34 | FPGM (He et al., 2019) | 73.92 | 91.62 | 2.13 | 0.92 | 41.1 |
| ResNet-34 | BWCP (Ours) | 73.72 | 91.64 | **1.16** | **0.83** | **41.0** |
| ResNet-50 | FPGM* (He et al., 2019) | 76.15 | 92.87 | 1.32 | 0.55 | **53.5** |
| ResNet-50 | BWCP* (Ours) | 76.20 | 93.15 | **0.48** | **0.40** | 51.2 |
| ResNet-50 | SSS (Huang & Wang, 2018) | 76.12 | 92.86 | 4.30 | 2.07 | 43.0 |
| ResNet-50 | DSA (Ning et al., 2020) | – | – | 0.92 | 0.41 | 40.0 |
| ResNet-50 | HRank* (Lin et al., 2020) | 76.15 | 92.87 | 1.17 | 0.64 | **43.7** |
| ResNet-50 | ThiNet* (Luo et al., 2017) | 72.88 | 91.14 | 0.84 | 0.47 | 36.8 |
| ResNet-50 | BWCP (Ours) | 76.20 | 93.15 | **0.58** | **0.40** | 42.9 |
| ResNet-50 | FPGM (He et al., 2019) | 76.15 | 92.87 | 2.02 | 0.93 | 53.5 |
| ResNet-50 | SCP (Kang & Han, 2020) | 75.89 | 92.98 | 1.69 | 0.98 | **54.3** |
| ResNet-50 | DSA (Ning et al., 2020) | – | – | 1.33 | 0.80 | 50.0 |
| ResNet-50 | BWCP (Ours) | 76.20 | 93.15 | **1.02** | **0.60** | 51.2 |
|
|
|
|
|
Table 3: Effect of BW, Gumbel-Softmax (GS), |
|
and sparse Regularization in BWCP. The results |
|
are obtained by training ResNet-56 on CIFAR-10 |
|
dataset. ‘BL’ denotes baseline model. |
|
|
|
| Cases | BW | GS | Reg | Acc. (%) | Model Size ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| BL | | | | 93.64 | - | - |
| (1) | ✓ | | | **94.12** | - | - |
| (2) | | | ✓ | 93.46 | - | - |
| (3) | | ✓ | ✓ | 92.84 | **46.37** | 51.16 |
| (4) | ✓ | ✓ | | 94.10 | 7.78 | 6.25 |
| (5) | ✓ | STE | ✓ | 92.70 | 45.22 | **51.80** |
| BWCP | ✓ | ✓ | ✓ | 93.37 | 44.42 | 50.35 |
|
|
|
|
|
Table 4: Effect of regularization strength λ1 and |
|
_λ2 with magnitude 1e −_ 4 for the sparsity loss in |
|
Eqn.(7). The results are obtained using VGG-16 |
|
on CIFAR-100 dataset. |
|
|
|
| λ1 | λ2 | Acc. (%) | Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|
| 1.2 | 0.6 | 73.85 | -0.34 | 33.53 |
| 1.2 | 1.2 | 73.66 | -0.15 | 35.92 |
| 1.2 | 2.4 | 73.33 | 0.18 | 54.19 |
| 0.6 | 1.2 | 74.27 | -0.76 | 30.67 |
| 2.4 | 1.2 | 71.73 | 1.78 | 60.75 |
|
|
|
|
|
NVIDIA GTX1080Ti. We evaluate the inference time using ResNet-50 with a mini-batch size of 32 on GPU and 1 on CPU. The GPU batch size is larger than the CPU one to emphasize our method’s acceleration on a highly parallel platform as a structured pruning method. We see that BWCP has 29.2% inference
|
time reduction on GPU, from 48.7ms for base ResNet-50 to 34.5ms for pruned ResNet-50, and |
|
21.2% inference time reduction on CPU, from 127.1ms for base ResNet-50 to 100.2ms for pruned |
|
ResNet-50. |
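As a rough guide, such latency numbers can be measured with a simple timing loop like the one below; this is our own sketch (torchvision's stock ResNet-50 stands in for the pruned model), not the exact benchmarking script used for the reported numbers.

```python
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()
x = torch.randn(1, 3, 224, 224)          # batch size 1 for the CPU setting; use 32 and .cuda() for GPU

with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)                         # call torch.cuda.synchronize() around the timed region on GPU
    print((time.perf_counter() - start) / 100 * 1000, "ms per forward pass")
```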
|
|
|
5.3 ABLATION STUDY |
|
|
|
**Effect of BWCP on activation probability. From the analysis in Sec. 4.2, we have shown that** |
|
BWCP can increase the activation probability of useful channels while keeping the activation |
|
probability of unimportant channels unchanged through BW technique. Here we demonstrate this |
|
using Resnet-34 and Resnet-50 trained on ImageNet dataset. We calculate the activation probability |
|
of channels of the BN and BW layers. It can be seen from Fig.3 (a-d) that (1) BW increases the activation probability of important channels when |γc| > 0; (2) BW keeps the activation probability of unimportant channels unchanged when βc ≤ 0 and γc → 0. Therefore, BW indeed works by making useful channels more important and unnecessary channels less important, respectively. In this way, BWCP can identify unimportant channels reliably.
|
|
|
**Effect of BW, Gumbel-Softmax (GS) and sparse Regularization (Reg). The proposed BWCP** |
|
consists of three components: the BW module (i.e. Eqn. (3)), the soft sampling module with Gumbel-Softmax (i.e. Eqn. (6)), and a sparse regularization (i.e. Eqn. (7)). Here we investigate the effect of each component. To this end, five variants of BWCP are considered: (1) only the BW module is used; (2) only sparse regularization is imposed; (3) BWCP w/o the BW module; (4) BWCP w/o sparse regularization; and (5) BWCP with Gumbel-Softmax replaced by the Straight-Through Estimator (STE) (Bengio et al., 2013). For case (5), we select channels by a hard 0-1 mask generated with mc = sign(P(X̂c > 0) − 0.5)¹. The gradient is back-propagated through the STE. From the results in
|
Table 3, we can make the following conclusions: (a) BW improves the recognition performance, |
|
implying that it can enhance the representation of channels; (b) sparse regularization on γ and β |
|
slightly harms the classification accuracy of the original model but encourages channels to be sparse, as
|
also shown in Proposition 3; (c) BWCP with Gumbel-Softmax achieves higher accuracy than STE, |
|
showing that a soft sampling technique is better than the deterministic ones as reported in (Jang et al., |
|
2017). |
|
|
|
¹ y = sign(x) = 1 if x ≥ 0 and 0 if x < 0.
|
|
|
|
|
[Figure 3 graphic: panels (a)-(b) ResNet-34-layer1.0.bn1 and (c)-(d) ResNet-50-layer1.0.bn1 plot BN vs. BW activation probability against channel index; panels (e) VGGNet-layer1 and (f) VGGNet-layer12 plot the correlation score of Original-BN, BWCP-BN, and BWCP-BW against training iterations.]
|
|
|
|
|
Figure 3: ((a) & (b)) and ((c) & (d)) show the effect of BWCP on activation probability with trained ResNet-34 and ResNet-50 on ImageNet, respectively. The proposed batch whitening (BW) can increase the activation probability of useful channels when |γc| > 0 while keeping the unimportant channels unchanged when βc ≤ 0 and γc → 0. (e) & (f) show the correlation score of the output response maps in shallow and deep BWCP modules during the whole training period. BWCP has a lower correlation score among feature channels than the original BN baseline.
|
|
|
**Impact of regularization strength λ1 and λ2. We analyze the effect of regularization strength λ1** |
|
and λ2 for sparsity loss on CIFAR-100. The trade-off between accuracy and FLOPs reduction is |
|
investigated using VGG-16. Table 4 illustrates that the network becomes more compact as λ1 and λ2 |
|
increase, implying that both terms in Eqn.(7) can make channel features sparse. Moreover, the FLOPs metric is more sensitive to the regularization on γ, which validates our analysis in Sec.4.2. Besides,
|
we should search for proper values for λ1 and λ2 to trade off between accuracy and FLOPs reduction, |
|
which is a drawback for our method. |
|
|
|
**Effect of the number of BW.** Here the effect of the number of BW modules in BWCP is investigated on CIFAR-10 using ResNet-56, which consists of a series of bottleneck structures. Note that there are three BN layers in each bottleneck. We study three variants of BWCP: (a) we use BW to modify the last BN in each bottleneck module, giving a total of 18 BW layers in ResNet-56; (b) the last two BN layers of each bottleneck are modified by our BW technique (36 BW layers); (c) all BN layers in the bottlenecks are replaced by BW (54 BW layers), which is our proposed method. The results are reported in Table 5. We can see that BWCP achieves the best top-1 accuracy when BW acts on all BN layers, given comparable FLOPs and model size. This indicates that the proposed BWCP benefits from more BW layers in the network.

Table 5: Effect of the number of BW modules on CIFAR-10 dataset trained with ResNet-56. ‘# BW’ indicates the number of BW layers. More BW modules in the network lead to a lower recognition accuracy drop with comparable computation consumption.

| # BW | Acc. (%) | Acc. Drop | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|
| 18 | 93.01 | 0.63 | 44.70 | 50.77 |
| 36 | 93.14 | 0.50 | 45.29 | 50.45 |
| 54 | 93.37 | 0.27 | 44.42 | 50.35 |
|
|
|
**BWCP selects representative channel features.** It is worth noting that BWCP can whiten channel features after BN through BW, as shown in Eqn.(3). Therefore, BW can learn diverse channel features by reducing the correlations among channels (Yang et al., 2019b). We investigate this using VGGNet-16 with BN and the proposed BWCP trained on CIFAR-10. The correlation score can be calculated by taking the average over the absolute values of the correlation matrix of channel features. A larger value indicates that there is redundancy in the encoded features. We plot the correlation score among channels at different depths of the network. As shown in Fig.3 (e & f), channel features after the BW block have significantly smaller correlations, implying that channels selected by BWCP are representative. This also accounts for the effectiveness of the proposed scheme.
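The correlation score used in Fig.3 (e & f) can be computed as sketched below, assuming channel responses of shape (N, C, H, W); the function name and example shapes are ours.

```python
import torch

def correlation_score(feat: torch.Tensor) -> torch.Tensor:
    """Average absolute value of the C x C correlation matrix of channel features."""
    N, C, H, W = feat.shape
    x = feat.permute(1, 0, 2, 3).reshape(C, -1)   # each channel is a variable with N*H*W observations
    rho = torch.corrcoef(x)                       # C x C Pearson correlation matrix
    return rho.abs().mean()                       # larger value -> more redundant channels

feat = torch.randn(8, 64, 16, 16)                 # e.g. responses taken after a BN or BW block
print(correlation_score(feat))
```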
|
|
|
6 DISCUSSION AND CONCLUSION |
|
|
|
This paper presented an effective and efficient pruning technique, termed Batch Whitening Channel |
|
Pruning (BWCP). We show BWCP increases the activation probability of useful channels while |
|
keeping unimportant channels unchanged, making it appealing to pursue a compact model. Particularly, BWCP can be easily applied to prune various CNN architectures by modifying the batch |
|
normalization layer. However, to achieve different levels of FLOPs reduction, the proposed BWCP |
|
needs to search for the strength of sparse regularization. With probabilistic formulation in BWCP, |
|
the expected FLOPs can be modeled. The multiplier method can be used to encourage the model to |
|
attain target FLOPs. For future work, an advanced Pareto optimization algorithm can be designed to |
|
tackle such multi-objective joint minimization. We hope that the analyses of BWCP could bring a |
|
new perspective for future work in channel pruning. |
|
|
|
|
|
|
|
|
**Ethics Statement. We aim at compressing neural nets by the proposed BWCP framework. It could** |
|
improve the energy efficiency of neural network models and reduce the emission of carbon dioxide. |
|
We notice that deep neural networks trained with BWCP can be plugged into portable or edge devices |
|
such as mobile phones. Hence, our work shares the same potential negative ethical impacts as AI on edge devices in general. Moreover, network pruning may have different effects on different classes, thus producing
|
unfair models as a result. We will carefully investigate the results of our method on the fairness of the |
|
model output in the future. |
|
|
|
**Reproducibility Statement.** For theoretical results, clear explanations of assumptions and complete proofs of Propositions 1-3 are included in the Appendix. To reproduce the experimental results, we provide
|
training details and hyper-parameters in Appendix Sec.B. Moreover, we will also make our code |
|
available by a link to an anonymous repository during the discussion stage. |
|
|
|
REFERENCES |
|
|
|
Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization propagation: |
|
A parametric technique for removing internal covariate shift in deep networks. International |
|
_Conference in Machine Learning, 2016._ |
|
|
|
Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL
|
[http://arxiv.org/abs/1308.3432.](http://arxiv.org/abs/1308.3432) |
|
|
|
Dario A Bini, Nicholas J Higham, and Beatrice Meini. Algorithms for the matrix pth root. Numerical |
|
_Algorithms, 39(4):349–378, 2005._ |
|
|
|
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. |
|
Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully |
|
connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, |
|
2018. |
|
|
|
Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware |
|
acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4): |
|
485–532, 2020. |
|
|
|
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural |
|
networks. arXiv preprint arXiv:1803.03635, 2018. |
|
|
|
Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic channel |
|
pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331, 2018. |
|
|
|
Jinyang Guo, Wanli Ouyang, and Dong Xu. Multi-dimensional pruning: A unified framework for |
|
model compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern |
|
_Recognition, pp. 1508–1517, 2020._ |
|
|
|
Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances |
|
_in neural information processing systems, pp. 1379–1387, 2016._ |
|
|
|
Song Han and William J. Dally. Bandwidth-efficient deep learning. In Proceedings of the 55th |
|
_Annual Design Automation Conference on, pp. 147, 2018._ |
|
|
|
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for |
|
efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, |
|
2015. |
|
|
|
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks |
|
with pruning, trained quantization and huffman coding. In ICLR 2016 : International Conference |
|
_on Learning Representations 2016, 2016._ |
|
|
|
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image |
|
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. |
|
770–778, 2016. |
|
|
|
|
|
|
|
|
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating |
|
deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018a. |
|
|
|
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median |
|
for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on |
|
_Computer Vision and Pattern Recognition, pp. 4340–4349, 2019._ |
|
|
|
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model |
|
compression and acceleration on mobile devices. In Proceedings of the European Conference on |
|
_Computer Vision (ECCV), pp. 784–800, 2018b._ |
|
|
|
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected |
|
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern |
|
_recognition, pp. 4700–4708, 2017._ |
|
|
|
Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization |
|
towards efficient whitening. In Proceedings of the IEEE Conference on Computer Vision and |
|
_Pattern Recognition, pp. 4874–4883, 2019._ |
|
|
|
Zehao Huang and Naiyan Wang. Data-Driven Sparse Structure Selection for Deep Neural Networks. |
|
In ECCV, 2018. |
|
|
|
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by |
|
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. |
|
|
|
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2017. |
|
|
|
Minsoo Kang and Bohyung Han. Operation-aware soft channel pruning using differentiable masks. |
|
_arXiv preprint arXiv:2007.03938, 2020._ |
|
|
|
A Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. |
|
|
|
Yann LeCun, John S Denker, and Sara A Solla. Optimal Brain Damage. In NIPS, 1990. |
|
|
|
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for |
|
Efficient ConvNets. In ICLR, 2017. |
|
|
|
Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling |
|
Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF |
|
_Conference on Computer Vision and Pattern Recognition, pp. 1529–1538, 2020._ |
|
|
|
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE |
|
_International Conference on Computer Vision, pp. 2736–2744, 2017._ |
|
|
|
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of |
|
network pruning. arXiv preprint arXiv:1810.05270, 2018. |
|
|
|
Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through |
|
_l 0 regularization. International Conference on Learning Representation, 2017._ |
|
|
|
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural |
|
network compression. In Proceedings of the IEEE international conference on computer vision, |
|
pp. 5058–5066, 2017. |
|
|
|
Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. Dsa: More |
|
efficient budgeted pruning via differentiable sparsity allocation. arXiv preprint arXiv:2004.02164, |
|
2020. |
|
|
|
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual |
|
reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial |
|
_Intelligence, volume 32, 2018._ |
|
|
|
|
|
|
|
|
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object |
|
detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine |
|
_Intelligence, 39(6):1137–1149, 2017._ |
|
|
|
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, |
|
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition |
|
challenge. International Journal of Computer Vision, 115(3):211–252, 2015. |
|
|
|
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image |
|
recognition. arXiv preprint arXiv:1409.1556, 2014. |
|
|
|
Huan Wang, Qiming Zhang, Yuehai Wang, and Haoji Hu. Structured probabilistic pruning for |
|
convolutional neural network acceleration. arXiv preprint arXiv:1709.06994, 2017. |
|
|
|
Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep |
|
multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems, |
|
33, 2020. |
|
|
|
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in |
|
deep neural networks. In Proceedings of the 30th International Conference on Neural Information |
|
_Processing Systems, pp. 2074–2082, 2016._ |
|
|
|
Huanrui Yang, Wei Wen, and Hai Li. Deephoyer: Learning sparser neural network with differentiable |
|
scale-invariant sparsity measures. arXiv preprint arXiv:1908.09979, 2019a. |
|
|
|
Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, and Devi Parikh. Cross-channel communication networks. In Advances in Neural Information Processing Systems, pp. 1295–1304, |
|
2019b. |
|
|
|
Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, and Qiang Liu. Good subnetworks |
|
provably exist: Pruning via greedy forward selection. In International Conference on Machine |
|
_Learning, pp. 10820–10830. PMLR, 2020._ |
|
|
|
Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational |
|
convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision |
|
_and Pattern Recognition, pp. 2780–2789, 2019._ |
|
|
|
Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, |
|
and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. arXiv preprint |
|
_arXiv:1810.11809, 2018._ |
|
|
|
|
|
|
|
|
The appendix provides more details about approach and experiments of our proposed batch whitening |
|
channel pruning (BWCP) framework. The broader impact of this work is also discussed. |
|
|
|
A MORE DETAILS ABOUT APPROACH |
|
|
|
A.1 CALCULATION OF COVARIANCE MATRIX Σ |
|
|
|
By Eqn.(1) in the main text, the output of BN is $\tilde{x}_{ncij} = \gamma_c\bar{x}_{ncij} + \beta_c$. Hence, we have $\mathrm{E}[\tilde{\mathbf{x}}_c] = \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}(\gamma_c\bar{x}_{ncij} + \beta_c) = \beta_c$. Then the entry in the $c$-th row and $d$-th column of the covariance matrix $\mathbf{\Sigma}$ of $\tilde{\mathbf{x}}$ is calculated as follows:
$$\Sigma_{cd} = \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}(\gamma_c\bar{x}_{ncij} + \beta_c - \mathrm{E}[\tilde{\mathbf{x}}_c])(\gamma_d\bar{x}_{ndij} + \beta_d - \mathrm{E}[\tilde{\mathbf{x}}_d]) = \gamma_c\gamma_d\rho_{cd} \qquad (8)$$
where $\rho_{cd}$ is the element in the $c$-th row and $d$-th column of the correlation matrix of $\bar{\mathbf{x}}$. Hence, we have $\rho_{cd} \in [-1, 1]$. Furthermore, we can write $\mathbf{\Sigma}$ in vector form: $\mathbf{\Sigma} = \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}} \odot \frac{1}{NHW}\sum_{n,i,j}^{N,H,W}\bar{\mathbf{x}}_{nij}\bar{\mathbf{x}}_{nij}^{\mathsf{T}} = \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathsf{T}} \odot \boldsymbol{\rho}$.
|
|
|
A.2 PROOF OF PROPOSITION 1 |
|
|
|
For (1), we notice that we can define $\gamma_c = -\gamma_c$ and $\bar{X}_c = -\bar{X}_c \sim \mathcal{N}(0, 1)$ if $\gamma_c < 0$. Hence, we can assume $\gamma_c > 0$ without loss of generality. Then, we have
$$P(Y_c > 0) = P(\tilde{X}_c > 0) = P\Big(\bar{X}_c > -\frac{\beta_c}{\gamma_c}\Big) = \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt = \int_{-\frac{\beta_c}{\gamma_c}}^{0} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt + \int_{0}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt = \int_{0}^{\frac{\beta_c}{\gamma_c}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt + \int_{0}^{+\infty} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt = \frac{\mathrm{Erf}\big(\frac{\beta_c}{\sqrt{2}\gamma_c}\big) + 1}{2} \qquad (9)$$
When $\gamma_c < 0$, we can set $\gamma_c = -\gamma_c$. Hence, we arrive at
$$P(Y_c > 0) = P(\tilde{X}_c > 0) = \frac{\mathrm{Erf}\big(\frac{\beta_c}{\sqrt{2}|\gamma_c|}\big) + 1}{2} \qquad (10)$$
For (2), let us denote $\bar{X}_c \sim \mathcal{N}(0, 1)$, $\tilde{X}_c = \gamma_c\bar{X}_c + \beta_c$ and $Y_c = \max\{0, \tilde{X}_c\}$, where $Y_c$ represents a random variable corresponding to the output feature $y_c$ in Eqn.(1) in the main text. Firstly, it is easy to see that $P(\tilde{X}_c > 0) = 0 \Leftrightarrow \mathrm{E}_{\bar{X}_c}[Y_c] = 0$ and $\mathrm{E}_{\bar{X}_c}[Y_c^2] = 0$. In the following we show that $\mathrm{E}_{\bar{X}_c}[Y_c] = 0$ and $\mathrm{E}_{\bar{X}_c}[Y_c^2] = 0 \Leftrightarrow \beta_c \leq 0$ and $\gamma_c \to 0$. Similar to (1), we assume $\gamma_c > 0$ without loss of generality.

For the sufficiency, we have
$$\mathrm{E}_{\bar{X}_c}[Y_c] = \int_{-\infty}^{-\frac{\beta_c}{\gamma_c}} 0 \cdot \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)d\bar{x}_c + \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty} (\gamma_c\bar{x}_c + \beta_c) \cdot \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{\bar{x}_c^2}{2}\Big)d\bar{x}_c = \frac{\gamma_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \frac{\beta_c}{2}\Big(1 + \mathrm{Erf}\Big[\frac{\beta_c}{\sqrt{2}\gamma_c}\Big]\Big), \qquad (11)$$
where $\mathrm{Erf}[x] = \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$ is the error function. From Eqn.(11), we have
$$\lim_{\gamma_c\to 0^+} \mathrm{E}_{\bar{X}_c}[Y_c] = \lim_{\gamma_c\to 0^+} \frac{\gamma_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \lim_{\gamma_c\to 0^+} \frac{\beta_c}{2}\Big(1 + \mathrm{Erf}\Big[\frac{\beta_c}{\sqrt{2}\gamma_c}\Big]\Big) = 0 \qquad (12)$$
|
Table 6: Running time comparison during training between BWCP, vanilla BN, and SCP. The proposed BWCP achieves a better trade-off between FLOPs reduction and accuracy drop although it introduces a little extra computational cost during training. ‘F’ denotes forward running time (s) while ‘F+B’ denotes forward and backward running time (s). The results are averaged over 100 iterations. The GPU is NVIDIA GTX1080Ti. The CPU type is Intel Xeon E5-2682 v4.

| Model | Method | CPU (F) (s) | CPU (F+B) (s) | GPU (F) (s) | GPU (F+B) (s) | Acc. Drop | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-50 | vanilla BN | 0.184 | 0.478 | 0.015 | 0.031 | 0 | 0 |
| ResNet-50 | SCP | 0.193 | 0.495 | 0.034 | 0.067 | 1.69 | 54.3 |
| ResNet-50 | BWCP (Ours) | 0.239 | 0.610 | 0.053 | 0.104 | 1.02 | 51.2 |
|
|
|
In the same way, we can calculate |
|
|
|
|
|
$$
\begin{aligned}
\mathbb{E}_{\bar{x}_c}[Y_c^2] &= \int_{-\infty}^{-\frac{\beta_c}{\gamma_c}} 0\cdot\frac{1}{\sqrt{2\pi}}e^{-\frac{\bar{x}_c^2}{2}}\,d\bar{x}_c + \int_{-\frac{\beta_c}{\gamma_c}}^{+\infty}(\gamma_c\bar{x}_c + \beta_c)^2\cdot\frac{1}{\sqrt{2\pi}}e^{-\frac{\bar{x}_c^2}{2}}\,d\bar{x}_c \\
&= \frac{\gamma_c\beta_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \frac{\gamma_c^2 + \beta_c^2}{2}\Big(1 + \mathrm{Erf}\Big[\frac{\beta_c}{\sqrt{2}\gamma_c}\Big]\Big),
\end{aligned} \quad (13)
$$

From Eqn.(13), we have

$$\lim_{\gamma_c\to 0^{+}}\mathbb{E}_{\bar{x}_c}[Y_c^2] = \lim_{\gamma_c\to 0^{+}}\frac{\gamma_c\beta_c\exp\big(-\frac{\beta_c^2}{2\gamma_c^2}\big)}{\sqrt{2\pi}} + \lim_{\gamma_c\to 0^{+}}\frac{\gamma_c^2 + \beta_c^2}{2}\Big(1 + \mathrm{Erf}\Big[\frac{\beta_c}{\sqrt{2}\gamma_c}\Big]\Big) = 0 \quad (14)$$
|
|
|
|
|
For the necessity, we show that if $\mathbb{E}_{\bar{x}_c}[Y_c] = 0$ and $\mathbb{E}_{\bar{x}_c}[Y_c^2] = 0$, then $\gamma_c \to 0$ and $\beta_c \leq 0$. This can be acquired by solving Eqn.(11) and Eqn.(13). To be specific, $\beta_c \cdot$ Eqn.(11) $-$ Eqn.(13) gives us $\gamma_c = 0^{+}$. Substituting it into Eqn.(11), we can obtain $\beta_c \leq 0$. This completes the proof.
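As a further sanity check, the closed forms in Eqn.(11) and Eqn.(13) are the first two moments of a rectified Gaussian and can be verified by sampling; the sketch below is ours.

```python
import math
import numpy as np

def moments_closed_form(gamma_c, beta_c):
    """E[Y_c] and E[Y_c^2] from Eqn.(11) and Eqn.(13), assuming gamma_c > 0."""
    erf_term = 1.0 + math.erf(beta_c / (math.sqrt(2.0) * gamma_c))
    gauss = math.exp(-beta_c ** 2 / (2.0 * gamma_c ** 2)) / math.sqrt(2.0 * math.pi)
    ey = gamma_c * gauss + 0.5 * beta_c * erf_term
    ey2 = gamma_c * beta_c * gauss + 0.5 * (gamma_c ** 2 + beta_c ** 2) * erf_term
    return ey, ey2

gamma_c, beta_c = 0.8, -0.4
y = np.maximum(0.0, gamma_c * np.random.randn(2_000_000) + beta_c)   # Y_c = max(0, X_tilde_c)
print(moments_closed_form(gamma_c, beta_c))
print(y.mean(), (y ** 2).mean())                                     # should match to ~3 decimals
```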
|
|
|
A.3 TRAINING OVERHEAD OF BWCP |
|
|
|
The proposed BWCP introduces a little extra computational cost during training. To see this, we evaluate the computational complexity of SCP and BWCP for ResNet-50 on ImageNet with an input image size of 224 × 224. We can see from Table 6 that training with BWCP is slightly slower on both CPU and GPU than the plain ResNet with vanilla BN and SCP. In fact, the computational burden mainly comes from calculating the covariance matrix and its root inverse. In our paper, we calculate the root inverse of the covariance matrix by Newton's iteration, which is fast and efficient. Although BWCP brings extra training overhead, it achieves a smaller top-1 accuracy drop under the same FLOPs consumption.
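For reference, a minimal NumPy sketch (ours) of the Newton iteration used to approximate the root inverse of the covariance matrix; the trace normalization below plays the same role as the unit trace of $\Sigma_N$ and keeps the iteration stable.

```python
import numpy as np

def newton_root_inverse(sigma_n, T):
    """Approximate sigma_n^{-1/2} with T Newton iterations:
    Sigma_0 = I, Sigma_k = 0.5 * (3 * Sigma_{k-1} - Sigma_{k-1}^3 @ sigma_n)."""
    sigma_k = np.eye(sigma_n.shape[0])
    for _ in range(T):
        sigma_k = 0.5 * (3.0 * sigma_k - np.linalg.matrix_power(sigma_k, 3) @ sigma_n)
    return sigma_k

A = np.random.randn(16, 256)
sigma = A @ A.T / 256
sigma /= np.trace(sigma)             # unit trace keeps the spectral norm <= 1, so the iteration is stable
for T in (1, 3, 5, 10):
    inv_sqrt = newton_root_inverse(sigma, T)
    err = np.linalg.norm(inv_sqrt @ sigma @ inv_sqrt - np.eye(16))
    print(T, err)                    # the residual shrinks toward 0 as T grows
```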
|
|
|
A.4 PROOF OF PROPOSITION 2 |
|
|
|
First, we can derive that $\hat{\mathbf{X}} = \Sigma_N^{-\frac{1}{2}}(\gamma\odot\bar{\mathbf{X}} + \beta) = \Sigma_N^{-\frac{1}{2}}(\gamma\odot\bar{\mathbf{X}}) + \Sigma_N^{-\frac{1}{2}}\beta = (\Sigma_N^{-\frac{1}{2}}\gamma)\odot\bar{\mathbf{X}} + \Sigma_N^{-\frac{1}{2}}\beta$.

Hence, the newly defined scale and bias parameters are $\hat{\gamma} = \Sigma_N^{-\frac{1}{2}}\gamma$ and $\hat{\beta} = \Sigma_N^{-\frac{1}{2}}\beta$. When $T = 1$, we have $\Sigma_N^{-\frac{1}{2}} = \frac{1}{2}(3I - \Sigma_N)$ by Eqn.(5) in the main text. Hence we obtain

$$
\begin{aligned}
\hat{\gamma} &= \frac{1}{2}(3I - \Sigma_N)\gamma = \frac{1}{2}\Big(3I - \frac{\gamma\gamma^{T}}{\|\gamma\|_2^2}\odot\rho\Big)\gamma \\
&= \frac{1}{2}\Big(3\gamma - \Big[\sum_{j=1}^{C}\frac{\gamma_1\gamma_j\rho_{1j}\gamma_j}{\|\gamma\|_2^2},\ \cdots,\ \sum_{j=1}^{C}\frac{\gamma_C\gamma_j\rho_{Cj}\gamma_j}{\|\gamma\|_2^2}\Big]^{T}\Big) \\
&= \frac{1}{2}\Big[\Big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{1j}}{\|\gamma\|_2^2}\Big)\gamma_1,\ \cdots,\ \Big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{Cj}}{\|\gamma\|_2^2}\Big)\gamma_C\Big]^{T}
\end{aligned} \quad (15)
$$
|
|
|
Similarly, $\hat{\beta}$ can be given by

$$
\begin{aligned}
\hat{\beta} &= \frac{1}{2}(3I - \Sigma_N)\beta = \frac{1}{2}\Big(3I - \frac{\gamma\gamma^{T}}{\|\gamma\|_2^2}\odot\rho\Big)\beta \\
&= \frac{1}{2}\Big(3\beta - \Big[\sum_{j=1}^{C}\frac{\gamma_1\gamma_j\rho_{1j}\beta_j}{\|\gamma\|_2^2},\ \cdots,\ \sum_{j=1}^{C}\frac{\gamma_C\gamma_j\rho_{Cj}\beta_j}{\|\gamma\|_2^2}\Big]^{T}\Big) \\
&= \frac{1}{2}\Big[3\beta_1 - \Big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{1j}}{\|\gamma\|_2^2}\Big)\gamma_1,\ \cdots,\ 3\beta_C - \Big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{Cj}}{\|\gamma\|_2^2}\Big)\gamma_C\Big]^{T}
\end{aligned} \quad (16)
$$
|
|
|
|
|
|
|
Taking each component of the vectors in Eqn.(15)-(16) gives the expressions of $\hat{\gamma}_c$ and $\hat{\beta}_c$ in Proposition 2.
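A small numerical sketch (ours) of the $T = 1$ case: it forms $\Sigma_N = \frac{\gamma\gamma^{T}}{\|\gamma\|_2^2}\odot\rho$ and evaluates $\hat{\gamma}$ and $\hat{\beta}$ both by the matrix form $\frac{1}{2}(3I - \Sigma_N)$ and by the per-channel expressions read off from Eqn.(15)-(16).

```python
import numpy as np

C = 6
gamma, beta = np.random.randn(C), np.random.randn(C)
rho = np.corrcoef(np.random.randn(C, 200))              # a valid correlation matrix for the check

sigma_n = np.outer(gamma, gamma) / (gamma ** 2).sum() * rho
whiten_t1 = 0.5 * (3.0 * np.eye(C) - sigma_n)            # Sigma_N^{-1/2} when T = 1

gamma_hat = whiten_t1 @ gamma
beta_hat = whiten_t1 @ beta

# per-channel forms from Eqn.(15)-(16)
s = (gamma ** 2) * rho / (gamma ** 2).sum()              # s[c, j] = gamma_j^2 rho_cj / ||gamma||^2
gamma_hat_c = 0.5 * (3.0 - s.sum(1)) * gamma
beta_hat_c = 0.5 * (3.0 * beta - (gamma * beta * rho / (gamma ** 2).sum()).sum(1) * gamma)
print(np.allclose(gamma_hat, gamma_hat_c), np.allclose(beta_hat, beta_hat_c))
```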
|
|
|
A.5 PROOF OF PROPOSITION 3 |
|
|
|
|
|
For (1), through Eqn.(15), we acquire |γˆc| = [1]2 _[|][3]_ _[−]_ [P]j[C]=1 _γ∥j[2]γ[ρ]∥[cj]2[2]_ _γc| →_ 0 if |γc| → 0. |
|
|
|
_γc[2]_ _[||][γ][c][|][. Therefore,][ |][ˆ]_ |
|
|
|
On the other hand, by Eqn.(16), we have _β[ˆ]c ≈_ 2[1] [(3][ −] _∥γ∥2[2]_ [)][β][c][ < β][c][ ≤] [0][. Here we assume that] |
|
|
|
_ρcd = 1 if c = d and 0 otherwise. Note that the assumption is plausible by Fig.3 (e & f) in main_ |
|
text from which we see that the correlation among channel features will gradually decrease during |
|
training. We also empirically verify these two conclusions by Fig.4. From Fig.4 we can see that |
|
_|γˆc| ≥|γc| where the equality holds iff |γc| = 0, and_ _β[ˆ]c is larger than βc if βc is positive, and vice_ |
|
versa. By Proposition 1, we arrive at |
|
|
|
$$P(\hat{X}_c > \delta) \leq P(\hat{X}_c > 0) = 0 \quad (17)$$

where the first '$\leq$' holds since $\delta$ is a small positive constant and the '$=$' follows from $|\hat{\gamma}_c| \to 0$ and $\hat{\beta}_c \leq 0$. For (2), to show $P(\hat{X}_c > \delta) > P(\tilde{X}_c > \delta)$, we only need to prove $P\big(\bar{X}_c > \frac{\delta - \hat{\beta}_c}{|\hat{\gamma}_c|}\big) > P\big(\bar{X}_c > \frac{\delta - \beta_c}{|\gamma_c|}\big)$, which is equivalent to $\frac{\delta - \hat{\beta}_c}{|\hat{\gamma}_c|} < \frac{\delta - \beta_c}{|\gamma_c|}$. To this end, we calculate
|
|
|
$$
\begin{aligned}
\frac{|\hat{\gamma}_c|\beta_c - |\gamma_c|\hat{\beta}_c}{|\hat{\gamma}_c| - |\gamma_c|}
&= \frac{\frac{1}{2}\big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}\big)|\gamma_c|\beta_c - |\gamma_c|\cdot\frac{1}{2}\big(3\beta_c - \big(\sum_{j=1}^{C}\frac{\gamma_j\beta_j\rho_{cj}}{\|\gamma\|_2^2}\big)\gamma_c\big)}{\frac{1}{2}\big(3 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}\big)|\gamma_c| - |\gamma_c|} \\
&= \frac{\sum_{j=1}^{C}\frac{\gamma_j\beta_j\gamma_c\rho_{cj}}{\|\gamma\|_2^2} - \sum_{j=1}^{C}\frac{\gamma_j^2\beta_c\rho_{cj}}{\|\gamma\|_2^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}}
= \frac{\sum_{j=1}^{C}\frac{\gamma_j(\beta_j\gamma_c - \gamma_j\beta_c)\rho_{cj}}{\|\gamma\|_2^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}}
\leq \frac{\frac{1}{\|\gamma\|_2}\sqrt{\sum_{j=1}^{C}(\beta_j\gamma_c - \gamma_j\beta_c)^2\rho_{cj}^2}}{1 - \sum_{j=1}^{C}\frac{\gamma_j^2\rho_{cj}}{\|\gamma\|_2^2}} \\
&= \frac{\|\gamma\|_2\sqrt{\sum_{j=1}^{C}(\beta_j\gamma_c - \gamma_j\beta_c)^2\rho_{cj}^2}}{\|\gamma\|_2^2 - \sum_{j=1}^{C}\gamma_j^2\rho_{cj}} = \delta
\end{aligned} \quad (18)
$$
|
|
|
where the '$\leq$' holds due to the Cauchy–Schwarz inequality. By Eqn.(18), we derive that $|\gamma_c|(\delta - \hat{\beta}_c) \leq |\hat{\gamma}_c|(\delta - \beta_c)$, which is exactly what we want. Lastly, we empirically verify that $\delta$ defined in Proposition 3 is a small positive constant. In fact, $\delta$ represents the minimal activation feature value (i.e. $\hat{X}_c = \hat{\gamma}_c\bar{X}_c + \hat{\beta}_c \geq \delta$ by definition). We visualize the value of $\delta$ in shallow and deep layers of ResNet-34 during the whole training stage, as well as the value of $\delta$ for each layer of the trained ResNet-34, on the ImageNet dataset in Fig.5. As we can see, $\delta$ is always a small positive number during training. We thus empirically set $\delta$ to 0.05 in all experiments.
|
|
|
To conclude, by Eqn.(17), BWCP keeps the activation probability of unimportant channels unchanged; by Eqn.(18), BWCP increases the activation probability of important channels. In this way, the proposed BWCP can pursue a compact deep model with good performance.
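These two conclusions can also be checked numerically. The sketch below (ours, using the $T = 1$ expressions of Eqn.(15)-(16) and a generic correlation matrix) computes $\delta$ from Eqn.(18) and verifies $|\gamma_c|(\delta - \hat{\beta}_c) \leq |\hat{\gamma}_c|(\delta - \beta_c)$ for every channel.

```python
import numpy as np

C = 8
gamma, beta = np.random.randn(C), np.random.randn(C)
rho = np.corrcoef(np.random.randn(C, 500))
norm_sq = (gamma ** 2).sum()

whiten_t1 = 0.5 * (3.0 * np.eye(C) - np.outer(gamma, gamma) / norm_sq * rho)
gamma_hat, beta_hat = whiten_t1 @ gamma, whiten_t1 @ beta

for c in range(C):
    num = np.sqrt(((beta * gamma[c] - gamma * beta[c]) ** 2 * rho[c] ** 2).sum())
    delta = np.sqrt(norm_sq) * num / (norm_sq - (gamma ** 2 * rho[c]).sum())   # Eqn.(18)
    lhs = abs(gamma[c]) * (delta - beta_hat[c])
    rhs = abs(gamma_hat[c]) * (delta - beta[c])
    assert lhs <= rhs + 1e-8, (c, lhs, rhs)
print("|gamma_c|(delta - beta_hat_c) <= |gamma_hat_c|(delta - beta_c) holds for all channels")
```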
|
|
|
|
|
Figure 4: Experimental observation of how our proposed BWCP changes the values of $\gamma_c$ and $\beta_c$ in BN layers through the proposed BW technique. Results are obtained by training ResNet-50 on the ImageNet dataset. We investigate $\gamma_c$ and $\beta_c$ at different depths of the network, including layer1.0.bn1, layer2.0.bn1, layer3.0.bn1, and layer4.0.bn1, plotted against channel index. (a-d) show that BW enlarges $\beta_c$ when $\beta_c > 0$ while reducing $\beta_c$ when $\beta_c \leq 0$. (e-h) show that BW consistently increases the magnitude of $\gamma_c$ across the network.
|
|
|
Figure 5: Experimental observation of the values of $\delta$ defined in Proposition 3. Results are obtained by training ResNet-34 on the ImageNet dataset. (a & b) investigate $\delta$ over training epochs at different depths of the network, including layer1.0.bn1 and layer4.0.bn1, respectively. (c) visualizes $\delta$ for each layer of ResNet-34. We see that $\delta$ in Proposition 3 is always a small positive constant.
|
|
|
Figure 6: Illustration of BWCP with a shortcut in the basic block structure of ResNet. For a shortcut with Conv-BN modules, we use a simple strategy that lets the BW layer in the last convolution layer and the shortcut share the same mask. For a shortcut with identity mapping, we use the mask of the previous layer.
|
|
|
|
|
|
|
|
**Algorithm 1 Forward Propagation of the proposed BWCP.**

1: **Input:** mini-batch inputs $\mathbf{x} \in \mathbb{R}^{N\times C\times H\times W}$.
2: **Hyperparameters:** momentum $g$ for calculating the root inverse of the covariance matrix, iteration number $T$.
3: **Output:** the activations $\mathbf{x}^{\mathrm{out}}$ obtained by BWCP.
4: calculate standardized activations $\{\bar{\mathbf{x}}_c\}_{c=1}^{C}$ by Eqn.(1).
5: calculate the output of the BN layer: $\tilde{\mathbf{x}}_c = \gamma_c\bar{\mathbf{x}}_c + \beta_c$.
6: calculate the normalized covariance matrix: $\Sigma_N = \frac{\gamma\gamma^{T}}{\|\gamma\|_2^2} \odot \frac{1}{NHW}\sum_{n,i,j=1}^{N,H,W}\bar{\mathbf{x}}_{nij}\bar{\mathbf{x}}_{nij}^{T}$.
7: $\Sigma_0 = I$.
8: **for** $k = 1$ to $T$ **do**
9: $\quad \Sigma_k = \frac{1}{2}(3\Sigma_{k-1} - \Sigma_{k-1}^3\Sigma_N)$
10: **end for**
11: calculate the whitening matrix for training: $\Sigma_N^{-\frac{1}{2}} = \Sigma_T$.
12: calculate the whitening matrix for inference: $\hat{\Sigma}_N^{-\frac{1}{2}} \leftarrow (1 - g)\hat{\Sigma}_N^{-\frac{1}{2}} + g\Sigma_N^{-\frac{1}{2}}$.
13: calculate the whitened output: $\hat{\mathbf{x}}_{nij} = \Sigma_N^{-\frac{1}{2}}\tilde{\mathbf{x}}_{nij}$.
14: calculate the equivalent scale and bias defined by BW: $\hat{\gamma} = \Sigma_N^{-\frac{1}{2}}\gamma$ and $\hat{\beta} = \Sigma_N^{-\frac{1}{2}}\beta$.
15: calculate the activation probability by Proposition 2 with $\hat{\gamma}$ and $\hat{\beta}$, and obtain soft masks $\{m_c\}_{c=1}^{C}$ by Eqn.(6).
16: calculate the output of BWCP: $\mathbf{x}^{\mathrm{out}}_c = \hat{\mathbf{x}}_c \odot m_c$.
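As a reference, here is a condensed PyTorch sketch of Algorithm 1 (training-time path only). The module name, the default values of $\tau$ and the momentum, and the way the two-class logits are built from the activation probability are our own choices; the soft mask of Eqn.(6) is approximated with the standard Gumbel-Softmax relaxation rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BWCPSketch(nn.Module):
    """A condensed sketch of Algorithm 1 (training-time forward pass only)."""

    def __init__(self, num_channels, T=3, tau=0.5, momentum=0.1, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))
        self.T, self.tau, self.momentum, self.eps = T, tau, momentum, eps
        self.register_buffer("running_whiten", torch.eye(num_channels))

    def forward(self, x):                                    # x: (N, C, H, W)
        N, C, H, W = x.shape
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
        x_bar = (x - mean) / (var + self.eps).sqrt()                                   # step 4
        x_tilde = self.gamma.view(1, C, 1, 1) * x_bar + self.beta.view(1, C, 1, 1)     # step 5

        flat = x_bar.permute(1, 0, 2, 3).reshape(C, -1)
        rho = flat @ flat.t() / flat.shape[1]                                          # correlation of x_bar
        sigma_n = torch.outer(self.gamma, self.gamma) / self.gamma.pow(2).sum() * rho  # step 6

        sigma_k = torch.eye(C, device=x.device)                                        # steps 7-11
        for _ in range(self.T):
            sigma_k = 0.5 * (3 * sigma_k - torch.matrix_power(sigma_k, 3) @ sigma_n)
        with torch.no_grad():                                                          # step 12
            self.running_whiten.mul_(1 - self.momentum).add_(self.momentum * sigma_k)

        x_hat = torch.einsum("cd,ndhw->nchw", sigma_k, x_tilde)                        # step 13
        gamma_hat, beta_hat = sigma_k @ self.gamma, sigma_k @ self.beta                # step 14
        p_act = 0.5 * (1 + torch.erf(beta_hat / (2 ** 0.5 * gamma_hat.abs() + self.eps)))
        p_act = p_act.clamp(1e-6, 1 - 1e-6)
        logits = torch.stack([p_act.log(), (1 - p_act).log()], dim=-1)                 # step 15
        mask = F.gumbel_softmax(logits, tau=self.tau, hard=False)[..., 0]
        return x_hat * mask.view(1, C, 1, 1)                                           # step 16
```

The inference path, which would reuse `running_whiten` and a hard 0/1 mask, and the sparsity regularizer are omitted for brevity.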
|
|
|
Figure 7: Illustration of the forward propagation of (a) BN and (b) BWCP. The proposed BWCP prunes CNNs by replacing the original BN layer with the BWCP module, whose diagram blocks include BN, BW (covariance estimation and Newton iteration), and a soft gating module (activation probability, Gumbel-Softmax, and soft mask), followed by ReLU.
|
|
|
A.6 SOLUTION TO RESIDUAL ISSUE |
|
|
|
The recent advanced CNN architectures usually have residual blocks with shortcut connections (He et al., 2016; Huang et al., 2017). As shown in Fig.6, the number of channels in the last convolution layer must be the same as in previous blocks due to the element-wise summation. Basically, there are two types of residual connections, i.e. a shortcut with a downsampling layer consisting of Conv-BN modules, and a shortcut with identity mapping. For a shortcut with Conv-BN modules, the proposed BW technique is utilized in the downsampling layer to generate a pruning mask $m_c^s$. Furthermore, we use a simple strategy that lets the BW layer in the last convolution layer and the shortcut share the same mask, given by $m_c = m_c^s \cdot m_c^{\mathrm{last}}$, where $m_c^{\mathrm{last}}$ and $m_c^s$ denote the masks of the last convolution layer and the shortcut, respectively. For a shortcut with identity mapping, we use the mask of the previous layer. In doing so, their activated output channels must be the same, as sketched below.
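A small sketch (ours) of this mask-sharing rule; the function and argument names are hypothetical.

```python
import torch

def block_mask(mask_last, mask_shortcut=None, mask_prev=None):
    """Mask applied to both summed branches of a residual block (a sketch of the rule above).

    mask_last:     soft mask of the BW layer after the block's last convolution, shape (C,).
    mask_shortcut: soft mask of the downsampling Conv-BN shortcut, or None for identity shortcuts.
    mask_prev:     mask of the previous layer, reused when the shortcut is an identity mapping.
    """
    if mask_shortcut is not None:              # Conv-BN shortcut: m_c = m_c^s * m_c^last
        return mask_last * mask_shortcut
    return mask_prev                           # identity shortcut: reuse the previous layer's mask

# usage: gate the residual-branch output and the shortcut output with the same mask
shared = block_mask(torch.rand(64), mask_shortcut=torch.rand(64))
```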
|
|
|
A.7 BACK-PROPAGATION OF BWCP |
|
|
|
|
|
The forward propagation of BWCP can be represented by Eqn.(3-4) and Eqn.(9) in the main text (see details in Table 1), all of which define differentiable transformations. Here we provide the back-propagation of BWCP. By comparing the forward representations of BN and BWCP in Fig.7, we need to back-propagate the gradient $\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}_{nij}}$ to $\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{x}}_{nij}}$ for the backward propagation of BWCP. For simplicity, we neglect the subscript '$nij$'.

By the chain rule, we have

$$\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{x}}} = \hat{\gamma}\odot\mathbf{m}\odot\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}} + \frac{\partial\mathcal{L}}{\partial\bar{\mathbf{x}}}\big(\Sigma_N^{-\frac{1}{2}}\big) \quad (19)$$
|
|
|
|
|
|
|
|
Table 7: Performance of our BWCP on different base models compared with other approaches on the CIFAR-100 dataset.

| Model | Method | Baseline Acc. (%) | Acc. (%) | Acc. Drop | Channels ↓ (%) | Model Size ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet-164 | Slimming* (Liu et al., 2017) | 77.24 | 74.52 | 2.72 | 60 | 29.26 | 47.92 |
| ResNet-164 | SCP (Kang & Han, 2020) | 77.24 | 76.62 | 0.62 | 57 | 28.89 | 45.36 |
| ResNet-164 | BWCP (Ours) | 77.24 | 76.77 | **0.47** | 41 | 21.58 | 39.84 |
| DenseNet-40 | Slimming* (Liu et al., 2017) | 74.24 | 73.53 | 0.71 | **60** | 54.99 | **50.32** |
| DenseNet-40 | Variational Pruning (Zhao et al., 2019) | 74.64 | 72.19 | 2.45 | 37 | 37.73 | 22.67 |
| DenseNet-40 | SCP (Kang & Han, 2020) | 74.24 | 73.84 | 0.40 | **60** | **55.22** | 46.25 |
| DenseNet-40 | BWCP (Ours) | 74.24 | 74.18 | **0.06** | 54 | 53.53 | 40.40 |
| VGGNet-19 | Slimming* (Liu et al., 2017) | 72.56 | 73.01 | -0.45 | **50** | **76.47** | **38.23** |
| VGGNet-19 | BWCP (Ours) | 72.56 | 73.20 | **-0.64** | 23 | 41.00 | 22.09 |
| VGGNet-16 | Slimming* (Liu et al., 2017) | 73.51 | 73.45 | 0.06 | **40** | **66.30** | 27.86 |
| VGGNet-16 | Variational Pruning (Zhao et al., 2019) | 73.26 | 73.33 | -0.07 | 32 | 37.87 | 18.05 |
| VGGNet-16 | BWCP (Ours) | 73.51 | 73.60 | **-0.09** | 34 | 58.16 | **34.46** |
|
|
|
where $\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{x}}}\big(\Sigma_N^{-\frac{1}{2}}\big)$ denotes the gradient w.r.t. $\bar{\mathbf{x}}$ back-propagated through $\Sigma_N^{-\frac{1}{2}}$. To calculate it, we first obtain the gradient w.r.t. $\Sigma_N^{-\frac{1}{2}}$, as given by

$$\frac{\partial\mathcal{L}}{\partial\Sigma_N^{-\frac{1}{2}}} = \gamma\Big(\frac{\partial\mathcal{L}}{\partial\hat{\gamma}}\Big)^{T} + \beta\Big(\frac{\partial\mathcal{L}}{\partial\hat{\beta}}\Big)^{T} \quad (20)$$
|
|
|
|
|
where

$$\frac{\partial\mathcal{L}}{\partial\hat{\gamma}} = \bar{\mathbf{x}}\odot\mathbf{m}\odot\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}} + \frac{\partial\mathbf{m}}{\partial\hat{\gamma}}\Big(\hat{\mathbf{x}}\odot\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}}\Big) \quad (21)$$

and

$$\frac{\partial\mathcal{L}}{\partial\hat{\beta}} = \mathbf{m}\odot\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}} + \frac{\partial\mathbf{m}}{\partial\hat{\beta}}\Big(\hat{\mathbf{x}}\odot\frac{\partial\mathcal{L}}{\partial\mathbf{x}^{\mathrm{out}}}\Big) \quad (22)$$
|
|
|
The remaining thing is to calculate $\frac{\partial\mathbf{m}}{\partial\hat{\gamma}}$ and $\frac{\partial\mathbf{m}}{\partial\hat{\beta}}$. Based on the Gumbel-Softmax transformation, we arrive at

$$\frac{\partial m_c}{\partial\hat{\gamma}_d} = \begin{cases}\dfrac{m_c(1 - m_c)f(\hat{\gamma}_c, \hat{\beta}_c)}{\tau P(\hat{X}_c > 0)\big(1 - P(\hat{X}_c > 0)\big)}\cdot\dfrac{\beta_c\gamma_c}{|\gamma_c|^2}, & \text{if } d = c \\[2mm] 0, & \text{otherwise}\end{cases} \quad (23)$$

$$\frac{\partial m_c}{\partial\hat{\beta}_d} = \begin{cases}\dfrac{m_c(1 - m_c)f(\hat{\gamma}_c, \hat{\beta}_c)}{\tau P(\hat{X}_c > 0)\big(1 - P(\hat{X}_c > 0)\big)}, & \text{if } d = c \\[2mm] 0, & \text{otherwise}\end{cases} \quad (24)$$

where $f(\hat{\gamma}_c, \hat{\beta}_c)$ is the probability density function of the random variable $\hat{X}_c$ as written in Eqn.(2) of the main text.
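In practice these Jacobians need not be coded by hand: when the soft mask is produced with a differentiable Gumbel-Softmax relaxation, autograd recovers the corresponding gradients. A minimal sketch (ours; function names and the logits construction are illustrative) of the soft-mask step and its gradients:

```python
import torch
import torch.nn.functional as F

def soft_channel_mask(gamma_hat, beta_hat, tau=0.5, eps=1e-6):
    """Soft mask m_c from the activation probability P(X_hat_c > 0) via Gumbel-Softmax.
    gamma_hat, beta_hat: equivalent scale/bias after batch whitening, shape (C,)."""
    p = 0.5 * (1 + torch.erf(beta_hat / (2 ** 0.5 * gamma_hat.abs() + eps)))
    p = p.clamp(eps, 1 - eps)
    logits = torch.stack([p.log(), (1 - p).log()], dim=-1)        # two-class logits per channel
    return F.gumbel_softmax(logits, tau=tau, hard=False)[..., 0]  # differentiable w.r.t. gamma_hat, beta_hat

gamma_hat = torch.randn(8, requires_grad=True)
beta_hat = torch.randn(8, requires_grad=True)
m = soft_channel_mask(gamma_hat, beta_hat)
m.sum().backward()                 # gradients dm/dgamma_hat, dm/dbeta_hat, cf. Eqn.(23)-(24)
print(gamma_hat.grad, beta_hat.grad)
```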
|
|
|
|
|
To proceed, we deliver the gradient w.r.t. $\Sigma_N^{-\frac{1}{2}}$ in Eqn.(20) to $\Sigma_N$ by the Newton iteration in Eqn.(6) of the main text. Noting that $\Sigma_N^{-\frac{1}{2}} = \Sigma_T$, we have

$$\frac{\partial\mathcal{L}}{\partial\Sigma_N} = -\frac{1}{2}\sum_{k=1}^{T}(\Sigma_{k-1}^3)^{T}\frac{\partial\mathcal{L}}{\partial\Sigma_k} \quad (25)$$

where $\frac{\partial\mathcal{L}}{\partial\Sigma_k}$ can be calculated by the following iteration:

$$\frac{\partial\mathcal{L}}{\partial\Sigma_{k-1}} = \frac{3}{2}\frac{\partial\mathcal{L}}{\partial\Sigma_k} - \frac{1}{2}\frac{\partial\mathcal{L}}{\partial\Sigma_k}(\Sigma_{k-1}^2\Sigma_N)^{T} - \frac{1}{2}(\Sigma_{k-1}^2)^{T}\frac{\partial\mathcal{L}}{\partial\Sigma_k}\Sigma_N^{T} - \frac{1}{2}(\Sigma_{k-1})^{T}\frac{\partial\mathcal{L}}{\partial\Sigma_k}(\Sigma_{k-1}\Sigma_N)^{T}, \quad k = T, \cdots, 1.$$
|
|
|
|
|
Given the gradient w.r.t. $\Sigma_N$ in Eqn.(25), we can calculate the gradient w.r.t. $\bar{\mathbf{x}}$ back-propagated through $\Sigma_N^{-\frac{1}{2}}$ in Eqn.(19) as follows:

$$\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{x}}}\big(\Sigma_N^{-\frac{1}{2}}\big) = \Big(\frac{\gamma\gamma^{T}}{\|\gamma\|_2^2}\odot\Big(\frac{\partial\mathcal{L}}{\partial\Sigma_N} + \Big(\frac{\partial\mathcal{L}}{\partial\Sigma_N}\Big)^{T}\Big)\Big)\bar{\mathbf{x}} \quad (26)$$

Based on Eqn.(19)-(26), we obtain the back-propagation of BWCP.
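The manual recursion can be checked against automatic differentiation. The sketch below (ours; the toy loss and matrix sizes are illustrative) runs the Newton iteration forward, then accumulates $\frac{\partial\mathcal{L}}{\partial\Sigma_N}$ with Eqn.(25) and the recursion above, and compares the result with autograd.

```python
import torch

C, T = 6, 4
sigma_n = 0.01 * torch.randn(C, C, dtype=torch.float64)
sigma_n = 0.5 * (sigma_n + sigma_n.t()) + 0.1 * torch.eye(C, dtype=torch.float64)
sigma_n = sigma_n / torch.trace(sigma_n)             # unit trace, like Sigma_N
sigma_n.requires_grad_(True)

sigmas = [torch.eye(C, dtype=torch.float64)]         # Sigma_0 = I
for _ in range(T):
    prev = sigmas[-1]
    sigmas.append(0.5 * (3 * prev - prev @ prev @ prev @ sigma_n))
sigmas[-1].sum().backward()                           # autograd gradient w.r.t. Sigma_N

# Manual gradient: Eqn.(25) accumulates -1/2 (Sigma_{k-1}^3)^T dL/dSigma_k over k,
# while dL/dSigma_k is propagated backwards with the recursion above.
grad_k = torch.ones(C, C, dtype=torch.float64)        # dL/dSigma_T for L = sum of entries
grad_sigma_n = torch.zeros(C, C, dtype=torch.float64)
s = sigma_n.detach()
for k in range(T, 0, -1):
    prev = sigmas[k - 1].detach()
    prev2 = prev @ prev
    grad_sigma_n += -0.5 * (prev2 @ prev).t() @ grad_k
    grad_k = (1.5 * grad_k
              - 0.5 * grad_k @ (prev2 @ s).t()
              - 0.5 * prev2.t() @ grad_k @ s.t()
              - 0.5 * prev.t() @ grad_k @ (prev @ s).t())
print(torch.allclose(grad_sigma_n, sigma_n.grad))     # the two gradients should coincide
```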
|
|
|
|
|
|
|
|
B MORE DETAILS ABOUT EXPERIMENTS
|
|
|
B.1 DATASET AND METRICS |
|
|
|
We evaluate the performance of our proposed BWCP on various image classification benchmarks, including CIFAR-10/100 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015). The CIFAR-10 and CIFAR-100 datasets have 10 and 100 categories, respectively; both contain 60k color images with a size of 32 × 32, split into 50k training images and 10k test images. The ImageNet dataset consists of 1.28M training images and 50k validation images. Top-1 accuracy is used to evaluate the recognition performance of models on CIFAR-10/100, while Top-1 and Top-5 accuracies are reported on ImageNet. We use the common protocols, i.e. the number of parameters and floating point operations (FLOPs), to measure model size and computational cost.
|
|
|
For CIFAR-10/100, we use ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and VGGNet (Simonyan & Zisserman, 2014) as our base models. For ImageNet, we use ResNet-34 and ResNet-50. We compare our algorithm with other channel pruning methods without a fine-tuning procedure. Note that an extra fine-tuning process would lead to a remarkable improvement of performance (Ye et al., 2020). For a fair comparison, we also fine-tune our BWCP to compare with those pruning methods. The training configurations are provided in Appendix B.2. The base networks and BWCP are trained together from scratch for all of our models.
|
|
|
B.2 TRAINING CONFIGURATION |
|
|
|
**Training setting on ImageNet.** All networks are trained using 8 GPUs with a mini-batch of 32 per GPU. We train all the architectures from scratch for 120 epochs using stochastic gradient descent (SGD) with momentum 0.9 and weight decay 1e-4. We perform normal training without the sparse regularization in Eqn.(7) on the original networks for the first 20 epochs, following (Ning et al., 2020). The base learning rate is set to 0.1 and is multiplied by 0.1 after 50, 80, and 110 epochs. During fine-tuning, we use the standard SGD optimizer with Nesterov momentum 0.9 and weight decay 0.00005 to fine-tune the pruned network for 150 epochs. We decay the learning rate using a cosine schedule with an initial learning rate of 0.01. The coefficients of the sparse regularization, λ1 and λ2, are set to 7e-5 and 3.5e-5 to achieve a FLOPs reduction at a level of 40%, while λ1 and λ2 are set to 9e-5 and 3.5e-5 respectively to achieve a FLOPs reduction at a level of 40%. Besides, the covariance matrix in the proposed BW technique is calculated within each GPU. Like (Huang et al., 2019), we also use group-wise decorrelation with a group size of 16 across the network to improve the efficiency of BW.
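For reference, a schematic PyTorch configuration matching the schedule above (optimizer and learning-rate schedule only; the model and the sparse-regularization term are assumed to be defined elsewhere and are not part of this sketch).

```python
import torch

def build_train_optimizer(model, epochs=120):
    """Training from scratch: SGD with step decay at epochs 50, 80, and 110."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 80, 110], gamma=0.1)
    return optimizer, scheduler

def build_finetune_optimizer(model, epochs=150):
    """Fine-tuning: Nesterov SGD with cosine decay from lr = 0.01."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                                nesterov=True, weight_decay=5e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```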
|
|
|
**Training setting on CIFAR-10 and CIFAR-100.** We train all models on CIFAR-10 and CIFAR-100 with a batch size of 64 on a single GPU for 160 epochs, with momentum 0.9 and weight decay 1e-4. The initial learning rate is 0.1 and is divided by 10 at epochs 80 and 120. The coefficients of the sparse regularization, λ1 and λ2, are set to 4e-5 and 8e-5 for the CIFAR-10 dataset and 7e-6 and 1.4e-5 for the CIFAR-100 dataset.
|
|
|
B.3 MORE RESULTS OF BWCP |
|
|
|
The results of BWCP on the CIFAR-100 dataset are reported in Table 7. As we can see, our approach BWCP achieves the lowest accuracy drop and a comparable FLOPs reduction compared with existing channel pruning methods for all tested base models.
|
|
|
|
|
|
|
|
|