|
--- |
|
library_name: transformers |
|
tags: |
|
- CodonTransformer |
|
- Computational Biology |
|
- Machine Learning |
|
- Bioinformatics |
|
- Synthetic Biology |
|
license: apache-2.0 |
|
pipeline_tag: token-classification |
|
--- |
|
|
|
![image/png](https://github.com/Adibvafa/CodonTransformer/raw/main/src/banner_final.png) |
|
|
|
**CodonTransformer** is a tool for codon optimization that transforms protein sequences into DNA sequences optimized for your target organism. Whether you are a researcher or a practitioner in genetic engineering, CodonTransformer provides a comprehensive suite of features to support your work. By combining the Transformer architecture with a user-friendly Jupyter notebook, it reduces the complexity of codon optimization, saving you time and effort.
|
<br> |
|
|
|
**This is the pretrained model. For best results, please use the [finetuned model](https://huggingface.co/adibvafa/CodonTransformer).**
|
|
|
## Authors |
|
Adibvafa Fallahpour<sup>1,2</sup>\*, Vincent Gureghian<sup>3</sup>\*, Guillaume J. Filion<sup>2</sup>‡, Ariel B. Lindner<sup>3</sup>‡, Amir Pandi<sup>3</sup>‡ |
|
|
|
<sup>1</sup> Vector Institute for Artificial Intelligence, Toronto ON, Canada |
|
<sup>2</sup> University of Toronto Scarborough; Department of Biological Science; Scarborough ON, Canada |
|
<sup>3</sup> Université Paris Cité, INSERM U1284, Center for Research and Interdisciplinarity, F-75006 Paris, France |
|
\* These authors contributed equally to this work. |
|
‡ To whom correspondence should be addressed: <br> |
|
guillaume.filion@utoronto.ca, ariel.lindner@inserm.fr, amir.pandi@cri-paris.org |
|
<br> |
|
|
|
## Use Case |
|
**For a guide on finetuning CodonTransformer, check out our [GitHub.](https://github.com/Adibvafa/CodonTransformer/tree/main?tab=readme-ov-file#finetuning-codontransformer)** |
|
<br>**For an interactive demo, check out our [Google Colab Notebook.](https://adibvafa.github.io/CodonTransformer/GoogleColab)** |
|
<br>
|
After installing CodonTransformer, you can use: |
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, BigBirdForMaskedLM |
|
from CodonTransformer.CodonPrediction import predict_dna_sequence |
|
from CodonTransformer.CodonJupyter import format_model_output |
|
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
# Load model and tokenizer |
|
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer") |
|
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer-base").to(DEVICE) |
|
|
|
|
|
# Set your input data |
|
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG" |
|
organism = "Escherichia coli general" |
|
|
|
|
|
# Predict with CodonTransformer |
|
output = predict_dna_sequence( |
|
protein=protein, |
|
organism=organism, |
|
device=DEVICE, |
|
tokenizer=tokenizer, |
|
model=model, |
|
attention_type="original_full", |
|
) |
|
print(format_model_output(output)) |
|
``` |
|
The output is: |
|
<br> |
|
|
|
|
|
```text
|
----------------------------- |
|
| Organism | |
|
----------------------------- |
|
Escherichia coli general |
|
|
|
----------------------------- |
|
| Input Protein | |
|
----------------------------- |
|
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG |
|
|
|
----------------------------- |
|
| Processed Input | |
|
----------------------------- |
|
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK |
|
|
|
----------------------------- |
|
| Predicted DNA | |
|
----------------------------- |
|
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA |
|
``` |
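
As a quick sanity check, the predicted DNA can be translated back to amino acids and compared with the input protein. The sketch below is not part of the CodonTransformer API; it is a minimal, self-contained helper using only the standard genetic code:

```python
# Sanity-check helper: translate a DNA coding sequence back to protein
# using the standard genetic code. Pure Python, no dependencies.

CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(dna: str) -> str:
    """Translate a DNA sequence to protein, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# The first four codons of the predicted DNA above decode to the first
# four residues of the input protein:
print(translate("ATGGCTTTATGG"))  # MALW
```

Applied to the full predicted sequence, the translation should reproduce the input protein exactly, since the model emits one codon per residue followed by a stop codon.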
|
|
|
|
|
## Additional Resources |
|
- **Project Website** <br> |
|
https://adibvafa.github.io/CodonTransformer/ |
|
|
|
- **GitHub Repository** <br> |
|
https://github.com/Adibvafa/CodonTransformer |
|
|
|
- **Google Colab Demo** <br> |
|
https://adibvafa.github.io/CodonTransformer/GoogleColab |
|
|
|
- **PyPI Package** <br> |
|
https://pypi.org/project/CodonTransformer/ |
|
|
|
- **Paper** <br> |
|
https://www.biorxiv.org/content/10.1101/2024.09.13.612903 |