Commit 7e35312 by kazemnejad (1 parent: bf94f38)
Create README.md
README.md ADDED
@@ -0,0 +1,112 @@
---
license: apache-2.0
datasets:
- bigcode/starcoderdata
language:
- en
---
# McGill-NLP/codellm_1b_rotary

This model is a 1B-scale decoder-only transformer built to study how positional encoding affects length generalization. This variant is trained with **Rotary** positional encoding.

## Usage Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "McGill-NLP/codellm_1b_rotary"

# Important: `trust_remote_code=True` is required because the model uses
# a custom architecture that supports different positional encodings,
# so the model implementation is downloaded from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model.config.position_encoding_type)
# Outputs: `rotary`

prompt = "def print_hello_world():"
input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
# Prepend the <bos> token, creating it on the same device as `input_ids`
# so that `torch.cat` does not fail with a device mismatch.
input_ids = torch.cat([
    torch.tensor([[tokenizer.bos_token_id]], device=input_ids.device), input_ids
], dim=1)

# `max_length` counts the prompt tokens as well as the generated ones.
output = model.generate(input_ids, do_sample=True, temperature=0.2, max_length=16)
print(tokenizer.decode(output[0]))
```

## Model Details

### Model Description

- **Developed by:** McGill NLP Group
- **Model type:** Decoder-only transformer
- **Language(s) (NLP):** Primarily English, with potential application across various programming languages as demonstrated by its training on a code dataset.
- **License:** Apache 2.0
- **Finetuned from model:** This model is pretrained from scratch.

### Model Sources

- **Repository:** [McGill-NLP/Length-Generalization GitHub Repository](https://github.com/McGill-NLP/length-generalization)
- **Paper:** [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)

## Uses

### Direct Use

The model can be used directly for tasks that require understanding and generating text. Because it was trained on source code, it is best suited to tasks such as code completion, bug fixing, and code generation.

## Bias, Risks, and Limitations

Given the model's training on source code, it might inherit biases present in the underlying dataset, including, but not limited to, biases toward more commonly used programming languages or coding styles. Users should be cautious when applying this model to diverse or underrepresented programming languages and contexts.
This model has not undergone safety training and was produced for research purposes only. The user is solely responsible for the outputs of this model.

### Recommendations

Users should consider the context and diversity of the application domain when employing this model, especially in critical systems. Further evaluation and fine-tuning might be necessary to mitigate any potential biases or limitations for specific use cases.

## How to Get Started with the Model

Use the example in the Usage Example section above to get started with generating text or code. Make sure the required dependencies, including `torch` and `transformers`, are installed in your environment.
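
Before running the usage example, it can help to confirm that both packages are importable. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `missing_packages` is hypothetical and is not part of this model's repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The two dependencies named in this README.
    missing = missing_packages(["torch", "transformers"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the packages can be located; it does not verify versions or GPU support.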

## Training Details

### Training Data

The model was pretrained on 30M source code files from the StarCoder corpus, amounting to 30B tokens. The training data mix is:

- 40% Python
- 25% Java
- 25% JavaScript
- 5% GitHub issues
- 5% GitHub commits
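
To get a feel for the scale of each source, the mix above can be turned into approximate token counts. This is a rough sketch that assumes the percentages are shares of the 30B tokens; the README does not say whether they refer to tokens or to files, so treat the numbers as illustrative. The function name `tokens_per_source` is hypothetical.

```python
TOTAL_TOKENS = 30_000_000_000  # 30B tokens, as stated above

# Mixture percentages copied from the list above.
MIX_PERCENT = {
    "Python": 40,
    "Java": 25,
    "JavaScript": 25,
    "GitHub issues": 5,
    "GitHub commits": 5,
}


def tokens_per_source(total_tokens, mix_percent):
    """Split a token budget by percentage (assumes percentages are token shares)."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


if __name__ == "__main__":
    for name, count in tokens_per_source(TOTAL_TOKENS, MIX_PERCENT).items():
        print(f"{name}: ~{count / 1e9:.1f}B tokens")
```

Under that assumption, Python would account for roughly 12B tokens, Java and JavaScript for about 7.5B each, and GitHub issues and commits for about 1.5B each.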

### Training Procedure

The model follows a decoder-only architecture with 1.3 billion parameters and was trained to predict the next token in the sequence. For more detailed information on the training procedure, refer to the paper linked above.
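
As an illustration of the next-token objective, the sketch below computes the average cross-entropy of predicting token `t+1` from the scores at position `t`. It is written in plain Python for readability and is not the training code used for this model; the function name and the toy inputs are assumptions made for the example.

```python
import math


def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t.

    logits: list of per-position score lists (one score per vocabulary entry).
    token_ids: the token sequence that produced those logits.
    """
    if len(token_ids) < 2 or len(logits) != len(token_ids):
        raise ValueError("need at least two tokens and one logit row per token")
    losses = []
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Uniform scores over a 4-entry vocabulary give a loss of log(4).
    uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
    print(next_token_loss(uniform, [0, 1, 2]))
```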

## Technical Specifications

### Model Architecture and Objective

The model uses a decoder-only transformer architecture with Rotary positional encoding and is trained with a next-token prediction objective.
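
For readers unfamiliar with rotary positional encoding, the sketch below shows the general idea: each consecutive pair of dimensions in a query or key vector is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative distance between positions. This is a simplified, standalone illustration and not this model's implementation; the pairing of adjacent dimensions, the `base` value of 10000, and the function names are assumptions made for the example.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position rotation to an even-length vector.

    Pair j, i.e. (vec[2j], vec[2j + 1]), is rotated by the angle
    position * base ** (-2j / d), where d = len(vec).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):  # i = 2j is the index of the first element of pair j
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 2.0]
    k = [0.2, -1.0, 0.7, 0.4]
    # Both pairs of positions are 2 apart, so the two scores should match
    # up to floating-point error.
    print(dot(rotate(q, 3), rotate(k, 1)))
    print(dot(rotate(q, 10), rotate(k, 8)))
```

Because the rotation preserves vector norms and only the position difference affects the score, attention computed this way encodes relative rather than absolute positions.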

## Citation

Please cite the following paper if you use this model in your work:

```bibtex
@inproceedings{kazemnejad2023:ImpactOfPeOnLengthGen,
  title={The Impact of Positional Encoding on Length Generalization in Transformers},
  author={Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan and Payel Das and Siva Reddy},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=Drrl2gcjzl}
}
```

## More Information

For further details about the model's architecture, training, and applications, please refer to the paper and the GitHub repository linked above.