# Ahrefs/flan-llama-7b-delta

## NOTE: This "delta model" cannot be used directly.
To obtain the actual flan-llama weights, apply the delta on top of the original LLaMA weights, as shown in the example below.

## How to Use:

```python
import transformers
from collections import OrderedDict

device = 0  # GPU device index to run on
llama_path = ''  # Path to the original llama-7b weights (Hugging Face checkpoint format)

# Load the original LLaMA model and tokenizer, then the delta weights
model_llama = transformers.AutoModelForCausalLM.from_pretrained(llama_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(llama_path)
model_flan_llama = transformers.AutoModelForCausalLM.from_pretrained("Ahrefs/flan-llama-7b-delta")

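# Add the original LLaMA weights to the delta weights, parameter by parameter.
# The state dicts are fetched once instead of on every loop iteration.
llama_state_dict = model_llama.state_dict()
delta_state_dict = model_flan_llama.state_dict()
model_state_dict = OrderedDict(
    (key, delta + llama_state_dict[key]) for key, delta in delta_state_dict.items()
)
model_flan_llama.load_state_dict(model_state_dict)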

model_flan_llama = model_flan_llama.to(device)
model_flan_llama.eval()

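def generate(prompt, model, device):
    """Tokenize the prompt, generate up to 512 new tokens, and return the decoded text."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # generate() returns a batch of sequences; take the first (and only) one
    gen_output = model.generate(input_ids.to(device), max_new_tokens=512, early_stopping=True)[0]
    answer_cot = tokenizer.decode(gen_output, skip_special_tokens=True)
    return answer_cot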

prompt = "Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering."
print(generate(prompt, model_flan_llama, device))
```

Output:
```
Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering. Geoffrey Hinton is a living person. George Washington was not alive when Geoffrey Hinton was born. The final answer: no.
```
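The merge step in the example above is plain element-wise addition per parameter. The following toy sketch illustrates that arithmetic with ordinary Python lists standing in for tensors. It assumes the delta was produced as `finetuned - base`, which is consistent with the addition used above but is not stated explicitly here; the parameter names and the helpers `make_delta` and `apply_delta` are hypothetical.

```python
# Toy stand-ins for state dicts: parameter name -> flat list of weights.
# Names and values are made up for illustration only.
base = {"layer.weight": [0.5, -1.0, 2.0], "layer.bias": [0.25]}
finetuned = {"layer.weight": [0.75, -1.5, 2.0], "layer.bias": [0.0]}


def make_delta(finetuned, base):
    """Assumed way a delta is built: finetuned minus base, per parameter."""
    return {
        key: [f - b for f, b in zip(finetuned[key], base[key])]
        for key in finetuned
    }


def apply_delta(delta, base):
    """Recover finetuned weights by adding the base weights back to the delta."""
    if delta.keys() != base.keys():
        raise KeyError("delta and base must contain the same parameter names")
    return {
        key: [d + b for d, b in zip(delta[key], base[key])]
        for key in delta
    }


delta = make_delta(finetuned, base)
recovered = apply_delta(delta, base)
print(recovered == finetuned)
```

Because both sides must share the same parameter names and shapes, the delta only makes sense when applied to the matching llama-7b checkpoint.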

## Dataset and Training:
We finetune the original llama-7b model on an extracted and sampled subset of the [Flan-2022](https://github.com/google-research/FLAN) dataset. Samples are filtered to a maximum source sequence length of 1536 and a maximum target sequence length of 512, which leaves roughly 5.5 million samples. (The sampled and extracted unfiltered dataset will be published on Hugging Face Datasets soon.)
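A length filter of this kind might look like the sketch below. It is not the actual preprocessing code: the field names `inputs` and `targets`, the helper `filter_by_length`, the inclusive bounds, and the whitespace-based token counter are all assumptions made for illustration. In practice the lengths would presumably be measured with the LLaMA tokenizer.

```python
MAX_SOURCE_LEN = 1536  # maximum source sequence length from the text above
MAX_TARGET_LEN = 512   # maximum target sequence length from the text above


def whitespace_token_count(text):
    """Stand-in tokenizer: counts whitespace-separated words (an assumption)."""
    return len(text.split())


def filter_by_length(samples, count_tokens,
                     max_source_len=MAX_SOURCE_LEN,
                     max_target_len=MAX_TARGET_LEN):
    """Yield only samples whose source and target both fit within the limits."""
    for sample in samples:
        source_ok = count_tokens(sample["inputs"]) <= max_source_len
        target_ok = count_tokens(sample["targets"]) <= max_target_len
        if source_ok and target_ok:
            yield sample


samples = [
    {"inputs": "Translate to French: hello", "targets": "bonjour"},
    {"inputs": "word " * 2000, "targets": "source is too long"},
    {"inputs": "short source", "targets": "word " * 600},
]
kept = list(filter_by_length(samples, whitespace_token_count))
print(len(kept))
```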

We finetune the original llama-7b model on 8 A100 GPUs using PyTorch's FSDP, with a learning rate of 2e-5, a warmup ratio of 0.03 followed by cosine learning-rate decay, and a batch size of 128.
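To make the schedule concrete, here is a minimal sketch of a learning-rate curve with a 0.03 warmup ratio and cosine decay from a peak of 2e-5. The linear warmup shape, the decay to zero, the helper name `lr_at_step`, and the total step count are assumptions; the actual training configuration may differ in these details.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical value, chosen only for illustration
for step in (0, 15, 30, 515, 1000):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

With these assumptions the rate climbs during the first 3% of steps, peaks at 2e-5, and then falls smoothly toward zero by the final step.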

## Evaluation Results

We ran [EleutherAI's evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.3.0 using the same benchmarks and parametrization as the [HF Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):

| arc_challenge (acc_norm, 25-shot) | hellaswag (acc_norm, 10-shot) | mmlu (acc, 5-shot) | truthfulqa_mc (mc2, 0-shot) |
| --------------------------------- | ----------------------------- | ------------------ | --------------------------- |
|                              40.2 |                          64.2 |               50.0 |                        31.7 |

## Reference

* [Finetuned language models are zero-shot learners](https://arxiv.org/abs/2109.01652)
  ```
@article{wei2021finetuned,
  title={Finetuned language models are zero-shot learners},
  author={Wei, Jason and Bosma, Maarten and Zhao, Vincent Y and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M and Le, Quoc V},
  journal={arXiv preprint arXiv:2109.01652},
  year={2021}
}
  ```
* [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
  ```
@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
  ```