UNIST-Eunchan committed
Commit: 7f93a8e
Parent(s): 4490d89

Update README.md
README.md
CHANGED
@@ -258,6 +258,90 @@ widget:
     can benefit model development itself (section 8).
     Question, Answer:
   example_title: NLG-Eval (2202.06935)
+- text: >-
+    Generate Question, Answer pair correspond to the following research paper.
+    [Abstract] Humans have harbored a longstanding desire to acquire additional abilities through
+    absorption. Super Mario serves as an embodiment of this human dream, which
+    can collect items to gain extra skills such as throwing fireballs and being temporarily
+    invincible. In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of
+    homologous models without the need for retraining or GPUs. Typically, new
+    abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in
+    the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters).
+    We initially observe that by introducing a novel operation called DARE (Drop And
+    REscale), most of the delta parameters can be directly set to zeros without affecting
+    the capabilities of SFT LMs, and larger models can tolerate a higher proportion
+    of discarded parameters. Based on this observation, we further sparsify delta
+    parameters of multiple SFT homologous models with DARE and subsequently
+    merge them into a single model by parameter averaging. We conduct experiments
+    on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also
+    merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental
+    results show that: (1) The delta parameter value ranges for SFT models are typically
+    small, often within 0.005, and DARE can eliminate 99% of them effortlessly.
+    However, once the models are continuously pre-trained, the value ranges can grow
+    to around 0.03, making DARE impractical. We have also tried to remove fine-tuned
+    instead of delta parameters and find that a 10% reduction can lead to drastically
+    decreased performance (even to 0.0). This highlights that SFT merely stimulates
+    the abilities via delta parameters rather than injecting new abilities into LMs; (2)
+    DARE can merge multiple task-specific LMs into one LM with diverse abilities.
+    For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following
+    ability while surpassing WizardMath’s original 64.2 performance. All resources
+    are available at https://github.com/yule-BUAA/MergeLM.
+    [Introduction] Human beings have always expressed their ambition to acquire additional abilities through various
+    ways such as movies and games. For example, in X-Men’s Apocalypse, the character can absorb the
+    powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games
+    can gain superpowers like throwing fireballs by absorbing in-game items. Large Language Models
+    (LLMs), such as GPT-4 [45], can reasonably be considered as early iterations of artificial general
+    intelligence systems, given their performance is remarkably close to human-level capabilities. In this paper, we astonishingly find that LMs, similar to Apocalypse and Super Mario, can enhance their
+    capabilities by absorbing other models without the need for training or GPUs.
+    Formally, Supervised Fine-Tuning (SFT) is the most widely adopted strategy for assigning task-specific capabilities to LMs by optimizing their parameters [13, 67]. The effectiveness of SFT is
+    fully evident in the alteration of the model parameters before and after SFT, referred to as delta
+    parameters [12]. We initially demonstrate that SFT LM (either encoder- or decoder-based) always
+    tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which
+    randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the
+    remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE,
+    when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with
+    minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the
+    larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank
+    structures akin to LoRA [25]. Thus, even when most of these structures are removed, resulting in a
+    low-rank and extremely sparse delta parameter set, the LM can still retain its capabilities.
+    Based on this observation, we can confidently merge multiple homologous SFT LMs (pre-trained
+    from the same backbone) without significant concerns about the decrease in their capabilities. As
+    long as a small portion of the delta parameters remains unaffected in the merging process, the abilities
+    of LMs unlocked by SFT can still be preserved. We first employ DARE to eliminate redundant
+    delta parameters in each model before merging, which can potentially mitigate the interference of
+    parameters among multiple models [62]. Then, we apply established model merging techniques
+    [59, 26, 44, 27, 62] to the parameters with reduced redundancy to create a single model with diverse
+    capabilities. We conduct extensive experiments on encoder-based LMs on eight datasets from the
+    GLUE benchmark, and decoder-based Llama 2 with three distinct abilities: instruction-following,
+    mathematical reasoning, and code-generating. We observe that:
+    (1) SFT LMs exhibit a substantial number of redundant delta parameters whether they are based on
+    BERT, RoBERTa, or Llama 2. DARE allows the removal of approximately 90% or even 99% delta
+    parameters without significantly affecting the performance of downstream tasks. The rescale operation
+    in DARE is a crucial component to guarantee effective ablations of delta parameters. Without
+    rescaling, removing only 10% delta parameters would noticeably affect performance. We attribute
+    this phenomenon to the fact that rescaling helps preserve the connectivity of model parameters [46].
+    (2) DARE is able to enhance the performance of most existing model merging methods when merging
+    encoder-based LMs on the eight datasets from GLUE. When it comes to larger LMs based on Llama
+    2, the simple parameter averaging method can already produce surprisingly good results. As shown
+    in Figure 1(b), we merge WizardLM and WizardMath by combining DARE and parameter averaging,
+    leading to a significant improvement of WizardLM’s mathematical reasoning ability from 2.2 to 64.2
+    accuracy on GSM8K, while also modestly enhancing its instruction-following ability with win rate
+    from 67.2 to 67.5 on AlpacaEval. It is worth noticing that all these benefits are achieved by solely
+    using CPUs without further training. Similar improvements can also be observed when merging
+    code-generating models.
+    (3) DARE is applicable to SFT delta parameters whose value ranges are relatively small. Different
+    from the observations of delta parameters, dropping only 10% fine-tuned parameters would lead to a
+    catastrophic decrease in performance, even approaching zero. We also find that the delta parameters
+    of SFT LMs usually stay within a range of 0.005 or less, indicating minimal modifications to the
+    pre-trained LM. However, once we continue pre-training, the delta parameters can rapidly reach
+    around 0.03, making DARE infeasible. This further confirms that SFT primarily unlocks the abilities
+    of the pre-trained LM, rather than introducing additional abilities.
+    Last but not least, we have implemented an open-sourced codebase at https://github.com/yule-BUAA/MergeLM, which integrates existing popular model merging methods and supports both
+    encoder- and decoder-based language models. We hope this work can advance the understanding of
+    how alignment works from the perspective of parameters.
+
+    Question, Answer:
+  example_title: LM-SuperMario (2311.03099)


 datasets:
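The widget text added above describes a two-step recipe: DARE, which zeroes each delta parameter (the difference between fine-tuned and pre-trained weights) with drop rate p and rescales the survivors by 1/(1 − p), followed by plain parameter averaging across homologous models. Below is a minimal PyTorch sketch of that recipe, not the MergeLM implementation; it assumes the checkpoints are state dicts of float tensors fine-tuned from the same backbone:

```python
import torch

def dare(pretrained, finetuned, p=0.9):
    """Drop And REscale: sparsify the delta parameters of one SFT model."""
    merged = {}
    for name, base in pretrained.items():
        delta = finetuned[name] - base                # delta parameters
        keep = torch.rand_like(delta.float()) >= p    # drop each entry w.p. p
        merged[name] = base + delta * keep.to(delta.dtype) / (1.0 - p)
    return merged

def dare_then_average(pretrained, sft_models, p=0.9):
    """Sparsify each homologous SFT model with DARE, then average parameters."""
    sparsified = [dare(pretrained, sd, p) for sd in sft_models]
    return {
        name: torch.stack([sd[name] for sd in sparsified]).mean(dim=0)
        for name in pretrained
    }
```

The 1/(1 − p) rescale keeps the expected value of each delta unchanged, which matches the quoted observation that dropping without rescaling hurts performance even at p = 0.1.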
@@ -411,6 +495,23 @@ output= [' What was the size of each untrained model?[SEP] The size of the model
 
 ```
 
+## Inference Examples
+If the Inference API output is poor, you can call `model.generate()` in your own code for better results (see the sketch below).
+
+- (1) Attention is All You Need (https://arxiv.org/abs/1706.03762)
+- (2) The Power of Scale for Parameter-Efficient Prompt Tuning (https://arxiv.org/abs/2104.08691)
+- (3) (LK-99 paper, not an NLP paper) The First Room-Temperature Ambient-Pressure Superconductor (https://arxiv.org/abs/2307.12008)
+- (4) Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text (https://arxiv.org/abs/2202.06935)
+- (5) Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (https://arxiv.org/abs/2311.03099)
+
 
 ## Training and evaluation data
 - Used Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset.
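For the `model.generate()` route mentioned in the hunk above, here is a rough sketch with the `transformers` library. The checkpoint id is a placeholder for this model card's repository, the checkpoint is assumed to be seq2seq (per the `output=` example earlier in the README), and the generation settings are illustrative rather than the card's official ones:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: substitute this model card's actual checkpoint id.
model_id = "UNIST-Eunchan/<this-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Same prompt format as the widget examples; paste the paper's text
# in place of the ellipses.
prompt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    "[Abstract] ... [Introduction] ... "
    "Question, Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,             # beam search usually improves on the widget defaults
    no_repeat_ngram_size=3,  # suppress repeated phrases in the generated QA pair
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```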