|
# TOWARDS A UNIFIED VIEW OF PARAMETER-EFFICIENT TRANSFER LEARNING |
|
|
|
|
|
**Junxian He[∗]**
Carnegie Mellon University
junxianh@cs.cmu.edu

**Chunting Zhou[∗]**
Carnegie Mellon University
chuntinz@cs.cmu.edu

**Xuezhe Ma**
University of Southern California
xuezhema@isi.edu

**Taylor Berg-Kirkpatrick**
UC San Diego
tberg@eng.ucsd.edu

**Graham Neubig**
Carnegie Mellon University
gneubig@cs.cmu.edu
|
|
|
ABSTRACT |
|
|
|
|
|
Fine-tuning large pretrained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pretrained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connections among the various methods are poorly understood. In this paper, we break down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections between them. Specifically, we re-frame them as modifications to specific hidden states in pretrained models, and define a set of design dimensions along which different methods vary, such as the function to compute the modification and the position to apply the modification. Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, we utilize the unified view to identify important design choices in previous methods. Furthermore, our unified framework enables the transfer of design elements across different approaches, and as a result we are able to instantiate new parameter-efficient fine-tuning methods that tune fewer parameters than previous methods while being more effective, achieving comparable results to fine-tuning all parameters on all four tasks.[1]
|
|
|
1 INTRODUCTION |
|
|
|
Transfer learning from pre-trained language models (PLMs) is now the prevalent paradigm in natural |
|
language processing, yielding strong performance on many tasks (Peters et al., 2018; Devlin et al., |
|
2019; Qiu et al., 2020). The most common way to adapt general-purpose PLMs to downstream tasks |
|
is to fine-tune all the model parameters (full fine-tuning). However, this results in a separate copy of |
|
fine-tuned model parameters for each task, which is prohibitively expensive when serving models |
|
that perform a large number of tasks. This issue is particularly salient with the ever-increasing size |
|
of PLMs, which now range from hundreds of millions (Radford et al., 2019; Lewis et al., 2020) to |
|
hundreds of billions (Brown et al., 2020) or even trillions of parameters (Fedus et al., 2021). |
|
|
|
To mitigate this issue, a few lightweight alternatives have been proposed to update only a small number of extra parameters while keeping most pretrained parameters frozen. For example, _adapter tuning_ (Houlsby et al., 2019) inserts small neural modules called adapters into each layer of the pretrained network, and only the adapters are trained at fine-tuning time. Inspired by the success of prompting methods that control PLMs through textual prompts (Brown et al., 2020; Liu et al., 2021a), _prefix tuning_ (Li & Liang, 2021) and _prompt tuning_ (Lester et al., 2021) prepend an additional $l$ tunable prefix tokens to the input or hidden layers and only train these soft prompts when fine-tuning on downstream tasks.
|
|
|
_∗Equal Contribution. Order determined by random dice rolling._ |
|
[1] Code is available at [https://github.com/jxhe/unify-parameter-efficient-tuning](https://github.com/jxhe/unify-parameter-efficient-tuning).
|
|
|
|
|
|
|
|
[Figure 1 diagram: a transformer block (Multi-Head Attention, Feed Forward, Add & Layer Norm, ×L, Hidden States) annotated with the added Adapter, Prefix Tuning, and LoRA modules.]
|
|
|
Figure 1: Illustration of the transformer architecture and several state-of-the-art parameter-efficient tuning methods. We use blocks with dashed borderlines to represent the modules added by those methods.
|
|
|
[Figure 2 scatter plot: full fine-tuning 21.94, Ours 21.90, Adapter 20.98, LoRA 20.50, Prefix Tuning 20.46, BitFit 17.32; x-axis: Fine-tuned Parameters (%).]
|
|
|
Figure 2: Performance of different methods on the |
|
XSum (Narayan et al., 2018) summarization task. |
|
The number of fine-tuned parameters is relative to |
|
the tuned parameters in full fine-tuning. |
|
|
|
|
|
More recently, Hu et al. (2021) learn low-rank matrices to approximate parameter updates. We illustrate these methods in Figure 1. These approaches have all been reported to
|
demonstrate comparable performance to full fine-tuning on different sets of tasks, often through updating less than 1% of the original model parameters. Besides parameter savings, parameter-efficient |
|
tuning makes it possible to quickly adapt to new tasks without catastrophic forgetting (Pfeiffer et al., |
|
2021) and often exhibits superior robustness in out-of-distribution evaluation (Li & Liang, 2021). |
|
|
|
However, we contend that the important ingredients that contribute to the success of these parameter-efficient tuning methods are poorly understood, and the connections between them are still unclear.
|
In this paper, we aim to answer three questions: (1) How are these methods connected? (2) Do these |
|
methods share design elements that are essential for their effectiveness, and what are they? (3) Can |
|
the effective ingredients of each method be transferred to others to yield more effective variants? |
|
|
|
In order to answer these questions, we first derive an alternative form of prefix tuning that reveals |
|
prefix tuning’s close connections with adapters (§3.1). Based on this we then devise a unified framework that frames the aforementioned methods as different ways to modify the hidden representations |
|
of frozen PLMs (§3.2). Our unified framework decomposes previous methods along a shared set |
|
of design dimensions, such as the function used to perform the modification, the position in which |
|
to impose this modification, and how to integrate the modification. This framework allows us to |
|
transfer design choices across approaches to propose new variants such as adapters with multiple |
|
heads (§3.3). In experiments, we first show that existing parameter-efficient tuning methods still |
|
lag behind full fine-tuning on higher-resource and challenging tasks (§4.2), as exemplified in Figure 2. Then we utilize the unified framework to identify critical design choices and validate the |
|
proposed variants empirically (§4.3-4.6). Our experiments on four NLP benchmarks covering text |
|
summarization, machine translation (MT), text classification, and general language understanding, |
|
demonstrate that the proposed variant uses fewer parameters than existing methods while being more
|
effective, matching full fine-tuning results on all four tasks. |
|
|
|
2 PRELIMINARIES |
|
|
|
|
|
2.1 RECAP OF THE TRANSFORMER ARCHITECTURE |
|
|
|
The transformer model (Vaswani et al., 2017) is now the workhorse architecture behind most state-of-the-art PLMs. In this section we recap the equations of this model for completeness. Transformer models are composed of $L$ stacked blocks, where each block (Figure 1) contains two types of sub-layers:
|
|
|
|
|
|
multi-head self-attention and a fully connected feed-forward network (FFN).[2] The conventional attention function maps queries $Q \in \mathbb{R}^{n \times d_k}$ and key-value pairs $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$:
|
|
|
$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad (1)$$
|
|
|
where $n$ and $m$ are the number of queries and key-value pairs respectively. Multi-head attention performs the attention function in parallel over $N_h$ heads, where each head is separately parameterized by $W_q^{(i)}, W_k^{(i)}, W_v^{(i)} \in \mathbb{R}^{d \times d_h}$ to project inputs to queries, keys, and values. Given a sequence of $m$ vectors $C \in \mathbb{R}^{m \times d}$ over which we would like to perform attention and a query vector $x \in \mathbb{R}^d$, multi-head attention (MHA) computes the output on each head and concatenates them:[3]
|
|
|
$$\text{MHA}(C, x) = \text{Concat}(\text{head}_1, \ldots, \text{head}_{N_h})\,W_o, \qquad \text{head}_i = \text{Attn}(xW_q^{(i)}, CW_k^{(i)}, CW_v^{(i)}), \qquad (2)$$
|
|
|
where $W_o \in \mathbb{R}^{d \times d}$. Here $d$ is the model dimension, and in MHA $d_h$ is typically set to $d/N_h$ to save parameters, which indicates that each attention head is operating on a lower-dimensional space. The other important sublayer is the fully connected feed-forward network (FFN), which consists of two linear transformations with a ReLU activation function in between:
|
|
|
$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)\,W_2 + b_2, \qquad (3)$$
|
|
|
where $W_1 \in \mathbb{R}^{d \times d_m}$, $W_2 \in \mathbb{R}^{d_m \times d}$. Transformers typically use a large $d_m$, e.g. $d_m = 4d$. Finally, a residual connection is used, followed by layer normalization (Ba et al., 2016).
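To make this recap concrete, the following is a minimal PyTorch sketch of Eqs. 1–3 (our own simplified illustration, not code from the paper or any PLM library; names such as `d_model`, `n_heads`, and `d_ff` are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Eqs. 1-2: scaled dot-product attention over N_h heads, followed by W_o."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, context):
        # x: (B, n, d_model) queries; context: (B, m, d_model) vectors attended over
        B, n, _ = x.shape
        m = context.shape[1]
        split = lambda t, L: t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.W_q(x), n)
        k, v = split(self.W_k(context), m), split(self.W_v(context), m)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # Eq. 1
        out = (attn @ v).transpose(1, 2).reshape(B, n, -1)                      # concatenate heads
        return self.W_o(out)                                                    # Eq. 2

class FeedForward(nn.Module):
    """Eq. 3: two linear maps with a ReLU in between; d_ff is typically 4 * d_model."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

class TransformerBlock(nn.Module):
    """Each sub-layer is wrapped in a residual connection followed by LayerNorm."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x))  # self-attention: queries attend over the same sequence
        return self.ln2(x + self.ffn(x))
```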
|
|
|
2.2 OVERVIEW OF PREVIOUS PARAMETER-EFFICIENT TUNING METHODS |
|
|
|
Below and in Figure 1, we introduce several state-of-the-art parameter-efficient tuning methods. |
|
Unless otherwise specified, they only tune the added parameters while the PLM’s parameters remain frozen.
|
|
|
**Adapters (Houlsby et al., 2019):** The adapter approach inserts small modules (adapters) between transformer layers. The adapter layer generally uses a down-projection with $W_{\text{down}} \in \mathbb{R}^{d \times r}$ to project the input $h$ to a lower-dimensional space specified by bottleneck dimension $r$, followed by a nonlinear activation function $f(\cdot)$, and an up-projection with $W_{\text{up}} \in \mathbb{R}^{r \times d}$. These adapters are surrounded by a residual connection, leading to a final form:
|
|
|
$$h \leftarrow h + f(hW_{\text{down}})W_{\text{up}}. \qquad (4)$$
|
|
|
Houlsby et al. (2019) places two adapters sequentially within one layer of the transformer, one after |
|
the multi-head attention and one after the FFN sub-layer. Pfeiffer et al. (2021) have proposed a more |
|
efficient adapter variant that is inserted only after the FFN “add & layer norm” sub-layer. |
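As a rough sketch of Eq. 4 (our own illustration, not the authors' implementation; the module and argument names are ours):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter of Eq. 4: h <- h + f(h W_down) W_up."""
    def __init__(self, d_model, bottleneck_r):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_r)  # W_down in R^{d x r}
        self.up = nn.Linear(bottleneck_r, d_model)    # W_up   in R^{r x d}
        self.act = nn.ReLU()                          # the nonlinearity f(.)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))    # residual connection around the bottleneck
```

In the Houlsby et al. (2019) configuration, such a module would be applied after both the attention and FFN sub-layers of every frozen transformer block, and only its parameters receive gradients.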
|
|
|
**Prefix Tuning (Li & Liang, 2021):** Inspired by the success of textual prompting methods (Liu et al., 2021a), prefix tuning prepends $l$ tunable prefix vectors to the keys and values of the multi-head attention at every layer. Specifically, two sets of prefix vectors $P_k, P_v \in \mathbb{R}^{l \times d}$ are concatenated with the original key $K$ and value $V$. Then multi-head attention is performed on the new prefixed keys and values. The computation of $\text{head}_i$ in Eq. 2 becomes:
|
|
|
$$\text{head}_i = \text{Attn}\big(xW_q^{(i)}, \text{concat}(P_k^{(i)}, CW_k^{(i)}), \text{concat}(P_v^{(i)}, CW_v^{(i)})\big), \qquad (5)$$
|
|
|
where $P_k$ and $P_v$ are split into $N_h$ head vectors respectively and $P_k^{(i)}, P_v^{(i)} \in \mathbb{R}^{l \times d/N_h}$ denote the $i$-th head vector. Prompt-tuning (Lester et al., 2021) simplifies prefix-tuning by only prepending to the input word embeddings in the first layer; similar work also includes P-tuning (Liu et al., 2021b).
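A minimal single-head sketch of the modification in Eq. 5, treating the prefixes as trainable parameters (our own illustrative code; in practice the projections below come from the frozen PLM):

```python
import torch
import torch.nn as nn

class PrefixAttentionHead(nn.Module):
    """One attention head with l trainable prefix vectors prepended to its keys and values (Eq. 5)."""
    def __init__(self, d_model, d_head, prefix_len):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)  # frozen PLM projections in practice
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)
        self.P_k = nn.Parameter(torch.randn(prefix_len, d_head))  # trainable prefix keys P_k^(i)
        self.P_v = nn.Parameter(torch.randn(prefix_len, d_head))  # trainable prefix values P_v^(i)

    def forward(self, x, context):
        # x: (n, d_model) query positions; context: (m, d_model) sequence attended over
        q = self.W_q(x)
        k = torch.cat([self.P_k, self.W_k(context)], dim=0)  # (l + m, d_head)
        v = torch.cat([self.P_v, self.W_v(context)], dim=0)  # (l + m, d_head)
        attn = torch.softmax(q @ k.T / self.P_k.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```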
|
|
|
**LoRA (Hu et al., 2021):** LoRA injects trainable low-rank matrices into transformer layers to approximate the weight updates. For a pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA represents its update with a low-rank decomposition $W + \Delta W = W + W_{\text{down}}W_{\text{up}}$, where $W_{\text{down}} \in \mathbb{R}^{d \times r}$, $W_{\text{up}} \in \mathbb{R}^{r \times k}$ are tunable parameters. LoRA applies this update to the query and value projection matrices ($W_q$, $W_v$) in the multi-head attention sub-layer, as shown in Figure 1. For a specific input $x$ to the linear projection in multi-head attention, LoRA modifies the projection output $h$ as:
|
|
|
$$h \leftarrow h + s \cdot xW_{\text{down}}W_{\text{up}}, \qquad (6)$$

where $s \geq 1$ is a tunable scalar hyperparameter.[4]
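A minimal sketch of Eq. 6 for one projection matrix (our own illustration, not the official LoRA implementation; `rank` and `scale` stand for $r$ and $s$):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen projection W plus a trainable low-rank update, applied as in Eq. 6."""
    def __init__(self, d_in, d_out, rank, scale=1.0):
        super().__init__()
        # Stand-in for the pretrained weight; in practice it is loaded from the PLM and frozen.
        self.W = nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
        self.W_down = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # trainable, R^{d x r}
        self.W_up = nn.Parameter(torch.zeros(rank, d_out))          # trainable, R^{r x k}
        self.scale = scale                                          # the scalar s

    def forward(self, x):
        h = x @ self.W                                          # original projection output
        return h + self.scale * (x @ self.W_down @ self.W_up)   # Eq. 6
```

In the setting described above, such a module would stand in for the query and value projections ($W_q$, $W_v$) of each attention sub-layer, while every pretrained weight stays frozen.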
|
|
|
[2] In an encoder-decoder architecture, the transformer decoder usually has another multi-head cross-attention module between the self-attention and FFN, which we omit here for simplicity.

[3] Below, we sometimes ignore the head index $i$ to simplify notation when there is no confusion.
|
|
|
|
|
|
|
|
[Figure 3 panels: (a) Adapter, (b) Prefix Tuning, (c) LoRA, (d) Parallel Adapter, (e) Scaled PA.]
|
|
|
Figure 3: Graphical illustration of existing methods and the proposed variants. “PLM module” represents a |
|
certain sublayer of the PLM (e.g. attention or FFN) that is frozen. “Scaled PA” denotes scaled parallel adapter. |
|
We do not include multi-head parallel adapter here to save space. |
|
|
|
|
|
|
**Others:** Other parameter-efficient tuning methods include BitFit (Ben Zaken et al., 2021), which |
|
only fine-tunes bias vectors in the pre-trained model, and diff-pruning (Guo et al., 2021), which |
|
learns a sparse parameter update vector. |
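As a rough sketch of the BitFit idea (our own helper, not the authors' code; it simply keys on parameter names ending in `bias`):

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module):
    """Freeze all parameters except bias terms, in the spirit of BitFit (sketch only)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
```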
|
|
|
3 BRIDGING THE GAP – A UNIFIED VIEW |
|
|
|
We first derive an equivalent form of prefix tuning to establish its connection with adapters. We |
|
then propose a unified framework for parameter-efficient tuning that includes several state-of-the-art |
|
methods as instantiations. |
|
|
|
3.1 A CLOSER LOOK AT PREFIX TUNING |
|
|
|
Eq. 5 describes the mechanism of prefix tuning, which changes the attention module by prepending $l$ learnable vectors to the original attention keys and values. Here, we derive an equivalent form of Eq. 5 and provide an alternative view of prefix tuning:[5]
|
|
|
|
|
$$\begin{aligned}
\text{head} &= \text{Attn}\big(xW_q, \text{concat}(P_k, CW_k), \text{concat}(P_v, CW_v)\big) \\
&= \text{softmax}\big(xW_q\,\text{concat}(P_k, CW_k)^\top\big)\begin{bmatrix} P_v \\ CW_v \end{bmatrix}
\end{aligned}$$
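The softmax over the concatenated keys distributes its probability mass between the prefix positions and the original positions, which is what connects prefix tuning to the other methods. Below is a small numerical sanity check of this split (our own code, single head, softmax scaling factor omitted):

```python
import torch

torch.manual_seed(0)
d, l, m = 16, 4, 10                       # head dimension, prefix length, sequence length
x = torch.randn(1, d)                     # a single query position
C = torch.randn(m, d)                     # the sequence attended over
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
P_k, P_v = torch.randn(l, d), torch.randn(l, d)

q = x @ W_q
# Prefix-tuned attention exactly as written above.
K = torch.cat([P_k, C @ W_k], dim=0)      # (l + m, d)
V = torch.cat([P_v, C @ W_v], dim=0)      # (l + m, d)
prefix_head = torch.softmax(q @ K.T, dim=-1) @ V

# Split the softmax mass: lam is the total attention weight on the l prefix slots.
weights = torch.softmax(q @ K.T, dim=-1)
lam = weights[:, :l].sum(dim=-1, keepdim=True)
attn_over_C = torch.softmax(q @ (C @ W_k).T, dim=-1) @ (C @ W_v)   # standard attention over C
attn_over_P = torch.softmax(q @ P_k.T, dim=-1) @ P_v               # attention over the prefix alone
recombined = (1 - lam) * attn_over_C + lam * attn_over_P

print(torch.allclose(prefix_head, recombined, atol=1e-5))          # True
```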
|
|
|
|
|
|
|
|