UNIST-Eunchan committed
Commit: 7f93a8e
Parent(s): 4490d89

Update README.md
README.md
CHANGED
@@ -258,6 +258,90 @@ widget:
     can benefit model development itself (section 8).
     Question, Answer:
   example_title: NLG-Eval (2202.06935)
+- text: >-
+    Generate Question, Answer pair correspond to the following research paper.
+    [Abstract] Humans have harbored a longstanding desire to acquire additional abilities through
+    absorption. Super Mario serves as an embodiment of this human dream, which
+    can collect items to gain extra skills such as throwing fireballs and being temporarily
+    invincible. In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of
+    homologous models without the need for retraining or GPUs. Typically, new
+    abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in
+    the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters).
+    We initially observe that by introducing a novel operation called DARE (Drop And
+    REscale), most of the delta parameters can be directly set to zeros without affecting
+    the capabilities of SFT LMs, and larger models can tolerate a higher proportion
+    of discarded parameters. Based on this observation, we further sparsify delta
+    parameters of multiple SFT homologous models with DARE and subsequently
+    merge them into a single model by parameter averaging. We conduct experiments
+    on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also
+    merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental
+    results show that: (1) The delta parameter value ranges for SFT models are typically
+    small, often within 0.005, and DARE can eliminate 99% of them effortlessly.
+    However, once the models are continuously pre-trained, the value ranges can grow
+    to around 0.03, making DARE impractical. We have also tried to remove fine-tuned
+    instead of delta parameters and find that a 10% reduction can lead to drastically
+    decreased performance (even to 0.0). This highlights that SFT merely stimulates
+    the abilities via delta parameters rather than injecting new abilities into LMs; (2)
+    DARE can merge multiple task-specific LMs into one LM with diverse abilities.
+    For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following
+    ability while surpassing WizardMath’s original 64.2 performance. All resources
+    are available at https://github.com/yule-BUAA/MergeLM.
+    [Introduction] Human beings have always expressed their ambition to acquire additional abilities through various
+    ways such as movies and games. For example, in X-Men’s Apocalypse, the character can absorb the
+    powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games
+    can gain superpowers like throwing fireballs by absorbing in-game items. Large Language Models
+    (LLMs), such as GPT-4 [45], can reasonably be considered as early iterations of artificial general
+    intelligence systems, given their performance is remarkably close to human-level capabilities. In this paper, we astonishingly find that LMs, similar to Apocalypse and Super Mario, can enhance their
+    capabilities by absorbing other models without the need for training or GPUs.
+    Formally, Supervised Fine-Tuning (SFT) is the most widely adopted strategy for assigning task-specific capabilities to LMs by optimizing their parameters [13, 67]. The effectiveness of SFT is
+    fully evident in the alteration of the model parameters before and after SFT, referred to as delta
+    parameters [12]. We initially demonstrate that SFT LM (either encoder- or decoder-based) always
+    tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which
+    randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the
+    remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE,
+    when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with
+    minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the
+    larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank
+    structures akin to LoRA [25]. Thus, even when most of these structures are removed, resulting in a
+    low-rank and extremely sparse delta parameter set, the LM can still retain its capabilities.
+    Based on this observation, we can confidently merge multiple homologous SFT LMs (pre-trained
+    from the same backbone) without significant concerns about the decrease in their capabilities. As
+    long as a small portion of the delta parameters remains unaffected in the merging process, the abilities
+    of LMs unlocked by SFT can still be preserved. We first employ DARE to eliminate redundant
+    delta parameters in each model before merging, which can potentially mitigate the interference of
+    parameters among multiple models [62]. Then, we apply established model merging techniques
+    [59, 26, 44, 27, 62] to the parameters with reduced redundancy to create a single model with diverse
+    capabilities. We conduct extensive experiments on encoder-based LMs on eight datasets from the
+    GLUE benchmark, and decoder-based Llama 2 with three distinct abilities: instruction-following,
+    mathematical reasoning, and code-generating. We observe that:
+    (1) SFT LMs exhibit a substantial number of redundant delta parameters whether they are based on
+    BERT, RoBERTa, or Llama 2. DARE allows the removal of approximately 90% or even 99% delta
+    parameters without significantly affecting the performance of downstream tasks. The rescale operation
+    in DARE is a crucial component to guarantee effective ablations of delta parameters. Without
+    rescaling, removing only 10% delta parameters would noticeably affect performance. We attribute
+    this phenomenon to the fact that rescaling helps preserve the connectivity of model parameters [46].
+    (2) DARE is able to enhance the performance of most existing model merging methods when merging
+    encoder-based LMs on the eight datasets from GLUE. When it comes to larger LMs based on Llama
+    2, the simple parameter averaging method can already produce surprisingly good results. As shown
+    in Figure 1(b), we merge WizardLM and WizardMath by combining DARE and parameter averaging,
+    leading to a significant improvement of WizardLM’s mathematical reasoning ability from 2.2 to 64.2
+    accuracy on GSM8K, while also modestly enhancing its instruction-following ability with win rate
+    from 67.2 to 67.5 on AlpacaEval. It is worth noticing that all these benefits are achieved by solely
+    using CPUs without further training. Similar improvements can also be observed when merging
+    code-generating models.
+    (3) DARE is applicable to SFT delta parameters whose value ranges are relatively small. Different
+    from the observations of delta parameters, dropping only 10% fine-tuned parameters would lead to a
+    catastrophic decrease in performance, even approaching zero. We also find that the delta parameters
+    of SFT LMs usually stay within a range of 0.005 or less, indicating minimal modifications to the
+    pre-trained LM. However, once we continue pre-training, the delta parameters can rapidly reach
+    around 0.03, making DARE infeasible. This further confirms that SFT primarily unlocks the abilities
+    of the pre-trained LM, rather than introducing additional abilities.
+    Last but not least, we have implemented an open-sourced codebase at https://github.com/yule-BUAA/MergeLM, which integrates existing popular model merging methods and supports both
+    encoder- and decoder-based language models. We hope this work can advance the understanding of
+    how alignment works from the perspective of parameters.
+
+    Question, Answer:
+  example_title: LM-SuperMario (2311.03099)


 datasets:
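The widget text added above describes a two-step recipe: DARE, which zeroes each delta parameter (the difference between fine-tuned and pre-trained weights) with drop rate p and rescales the survivors by 1/(1 − p), followed by plain parameter averaging across homologous models. Below is a minimal PyTorch sketch of that recipe, not the MergeLM implementation; it assumes the checkpoints are state dicts of float tensors fine-tuned from the same backbone:

```python
import torch

def dare(pretrained, finetuned, p=0.9):
    """Drop And REscale: sparsify the delta parameters of one SFT model."""
    merged = {}
    for name, base in pretrained.items():
        delta = finetuned[name] - base                # delta parameters
        keep = torch.rand_like(delta.float()) >= p    # drop each entry w.p. p
        merged[name] = base + delta * keep.to(delta.dtype) / (1.0 - p)
    return merged

def dare_then_average(pretrained, sft_models, p=0.9):
    """Sparsify each homologous SFT model with DARE, then average parameters."""
    sparsified = [dare(pretrained, sd, p) for sd in sft_models]
    return {
        name: torch.stack([sd[name] for sd in sparsified]).mean(dim=0)
        for name in pretrained
    }
```

The 1/(1 − p) rescale keeps the expected value of each delta unchanged, which matches the quoted observation that dropping without rescaling hurts performance even at p = 0.1.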
@@ -411,6 +495,23 @@ output= [' What was the size of each untrained model?[SEP] The size of the model
 
 ```
 
+## Inference Examples
+If the Inference API output is poor, you can call `model.generate()` in your own code for better results (see the sketch below).
+
+- (1) Attention is All You Need (https://arxiv.org/abs/1706.03762)
+- (2) The Power of Scale for Parameter-Efficient Prompt Tuning (https://arxiv.org/abs/2104.08691)
+- (3) (LK-99 paper, not an NLP paper) The First Room-Temperature Ambient-Pressure Superconductor (https://arxiv.org/abs/2307.12008)
+- (4) Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text (https://arxiv.org/abs/2202.06935)
+- (5) Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (https://arxiv.org/abs/2311.03099)
+
 
 ## Training and evaluation data
 - Used Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset.
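For the `model.generate()` route mentioned in the hunk above, here is a rough sketch with the `transformers` library. The checkpoint id is a placeholder for this model card's repository, the checkpoint is assumed to be seq2seq (per the `output=` example earlier in the README), and the generation settings are illustrative rather than the card's official ones:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: substitute this model card's actual checkpoint id.
model_id = "UNIST-Eunchan/<this-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Same prompt format as the widget examples; paste the paper's text
# in place of the ellipses.
prompt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    "[Abstract] ... [Introduction] ... "
    "Question, Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,             # beam search usually improves on the widget defaults
    no_repeat_ngram_size=3,  # suppress repeated phrases in the generated QA pair
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```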