McGill-NLP/codellm_1b_alibi
This model is a 1B-scale decoder-only transformer built to study the impact of positional encoding on length generalization. This variant is trained with ALiBi positional encoding to assess how well ALiBi performs on length generalization tasks.
Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "McGill-NLP/codellm_1b_alibi"
# Important: `trust_remote_code=True` is required because the model uses a
# custom architecture that supports different positional encodings, so its
# implementation must be downloaded from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(model.config.position_encoding_type)
# Outputs: `alibi`
prompt = "def print_hello_world():"
input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
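# Prepend the <bos> token, creating it on the same device as `input_ids`
# so that torch.cat does not fail with a CPU/GPU device mismatch.
input_ids = torch.cat([
    torch.tensor([[tokenizer.bos_token_id]], device=input_ids.device), input_ids
], dim=1)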
output = model.generate(input_ids, do_sample=True, temperature=0.2, max_length=16)
print(tokenizer.decode(output[0]))
Model Details
Model Description
- Developed by: McGill NLP Group
- Model type: Decoder-only transformer
- Language(s) (NLP): Primarily English, along with the programming languages covered by its code training data.
- License: Apache 2.0
- Finetuned from model: This model is pretrained from scratch.
Model Sources
- Repository: McGill-NLP/Length-Generalization GitHub Repository
- Paper: The Impact of Positional Encoding on Length Generalization in Transformers
Uses
Direct Use
The model can be used directly for tasks that require understanding and generating text. It is especially suited to source code, for example code completion, bug fixing, or code generation.
Bias, Risks, and Limitations
Because the model was trained on source code, it may inherit biases present in the underlying dataset, including but not limited to biases toward more commonly used programming languages or coding styles. Users should be cautious when applying this model to diverse or underrepresented programming languages and contexts. This model has not undergone safety training and is released for research purposes only. The user is solely responsible for the outputs of this model.
Recommendations
Users should consider the context and diversity of the application domain when employing this model, especially in critical systems. Further evaluation and fine-tuning might be necessary to mitigate any potential biases or limitations for specific use cases.
How to Get Started with the Model
Use the example provided in the README to get started with generating text or code. Ensure you have the necessary dependencies installed, including torch and transformers, and follow the guidelines for setting up your environment.
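A quick way to confirm that the dependencies named above are present is to check whether each package can be located before loading the model. The sketch below uses only the Python standard library; the helper name `missing_packages` is hypothetical and is not part of torch or transformers.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages named in this model card; extend the list as needed.
missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only checks that the packages can be found, not that their versions are compatible with the model's custom code.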
Training Details
Training Data
The model was pretrained on a dataset comprising 30M source code files from the StarCoder corpus, amounting to 30B tokens. The training data mix is:
- 40% Python
- 25% Java
- 25% JavaScript
- 5% GitHub issues
- 5% GitHub commits
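As a rough illustration of what this mix implies, the sketch below converts the percentages into approximate token counts. It assumes the percentages apply to the 30B training tokens rather than to the 30M files, which this card does not state; the names `TOTAL_TOKENS`, `MIX_PERCENT`, and `tokens_per_source` are illustrative.

```python
TOTAL_TOKENS = 30_000_000_000  # 30B tokens, as stated above

# Data mix from the list above, in percent.
MIX_PERCENT = {
    "Python": 40,
    "Java": 25,
    "JavaScript": 25,
    "GitHub issues": 5,
    "GitHub commits": 5,
}

def tokens_per_source(total, mix_percent):
    """Split a token budget across sources (assumes percentages are by token)."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}

for name, count in tokens_per_source(TOTAL_TOKENS, MIX_PERCENT).items():
    print(f"{name}: ~{count / 1e9:.1f}B tokens")
```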
Training Procedure
The model follows a decoder-only architecture with 1.3 billion parameters and was trained to predict the next token in the sequence. For more detailed information on the training procedure, refer to the paper linked above.
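To make the next-token objective concrete, here is a minimal sketch of the standard causal language modeling loss in plain Python. It is a generic formulation, not code from the model's repository; the function name `next_token_loss` and its list-based inputs are assumptions made for illustration.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits[t] is a list of unnormalized scores over the vocabulary at position t;
    token_ids[t] is the token actually observed at position t.
    """
    if len(logits) != len(token_ids) or len(token_ids) < 2:
        raise ValueError("need at least two tokens and one logit row per token")
    total = 0.0
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the target
    return total / (len(token_ids) - 1)
```

With uniform scores over a vocabulary of size V, the loss equals log(V), which is a handy sanity check for an untrained model.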
Technical Specifications
Model Architecture and Objective
The model leverages a decoder-only transformer architecture with ALiBi positional encoding.
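For readers unfamiliar with ALiBi, the sketch below shows the commonly described formulation: instead of adding position embeddings, each attention head adds a linear penalty to its attention logits that grows with the distance between query and key. This is an illustrative sketch in plain Python, not this model's implementation; the function names are hypothetical, and the head count used is an arbitrary example rather than this model's configuration.

```python
import math

def alibi_slopes(num_heads):
    """Head-specific slopes for a power-of-two head count (geometric sequence)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    """bias[h][i][j] is added to the attention logit of query i and key j.

    Keys in the past (j <= i) are penalized in proportion to their distance;
    keys in the future (j > i) are masked out for causal attention.
    """
    return [
        [
            [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for slope in alibi_slopes(num_heads)
    ]

# Example with 8 heads (illustrative only): slopes are 1/2, 1/4, ..., 1/256.
bias = alibi_bias(num_heads=8, seq_len=4)
print(bias[0][3])  # penalties seen by the last query in the first head
```

Because the penalty depends only on relative distance, the same rule applies to sequences longer than those seen during training, which is what makes ALiBi a natural candidate for length generalization experiments.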
Citation
Please cite the following paper if you use this model in your work:
@inproceedings{kazemnejad2023:ImpactOfPeOnLengthGen,
title={The Impact of Positional Encoding on Length Generalization in Transformers},
author={Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Payel Das and Siva Reddy},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=Drrl2gcjzl}
}
More Information
For further details about the model's architecture, training, and applications, please refer to the paper and the GitHub repository linked above.