metadata

library_name: transformers
tags:
  - CodonTransformer
  - Computational Biology
  - Machine Learning
  - Bioinformatics
  - Synthetic Biology
license: apache-2.0
pipeline_tag: token-classification

CodonTransformer is a tool for codon optimization: it transforms protein sequences into DNA sequences optimized for a target organism. Whether you are a researcher or a practitioner in genetic engineering, CodonTransformer provides a suite of features to support your work. It builds on the Transformer architecture and comes with a user-friendly Jupyter notebook, which reduces the complexity of codon optimization and saves you time and effort.

This is the pretrained model. For best results, please use the finetuned model.

Authors

Adibvafa Fallahpour¹,²*, Vincent Gureghian³*, Guillaume J. Filion²‡, Ariel B. Lindner³‡, Amir Pandi³‡

¹ Vector Institute for Artificial Intelligence, Toronto ON, Canada
² University of Toronto Scarborough; Department of Biological Science; Scarborough ON, Canada
³ Université Paris Cité, INSERM U1284, Center for Research and Interdisciplinarity, F-75006 Paris, France
* These authors contributed equally to this work.
‡ To whom correspondence should be addressed:
guillaume.filion@utoronto.ca, ariel.lindner@inserm.fr, amir.pandi@cri-paris.org

Use Case

For a guide on finetuning CodonTransformer, check out our GitHub.
For an interactive demo, check out our Google Colab Notebook.

After installing CodonTransformer from PyPI, you can predict a DNA sequence for a protein as follows:

import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output

# Run on the GPU when one is available, otherwise fall back to the CPU
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer-base").to(DEVICE)


# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"


# Predict with CodonTransformer
output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=DEVICE,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
)
print(format_model_output(output))

The output is:

-----------------------------
|          Organism         |
-----------------------------
Escherichia coli general

-----------------------------
|       Input Protein       |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG

-----------------------------
|      Processed Input      |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK

-----------------------------
|       Predicted DNA       |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA
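
One way to sanity-check the output above is to translate the predicted DNA back into protein and compare it with the input. The sketch below is not part of the CodonTransformer package. It uses only the Python standard library and assumes the standard genetic code (NCBI translation table 1). The names CODON_TABLE, translate, and to_processed_input are hypothetical, and to_processed_input only mirrors the "Processed Input" format as it appears in the printed output, which is an inference rather than documented library behavior.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acids ('*' marks a stop)."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def to_processed_input(protein: str) -> str:
    """Hypothetical helper that mirrors the 'Processed Input' format shown above."""
    # Assumption inferred from the printed output: every residue, plus a trailing
    # "_" end symbol, is written as "<symbol>_UNK" (amino acid with unknown codon).
    return " ".join(f"{symbol}_UNK" for symbol in protein + "_")


# Input protein and predicted DNA copied from the example output above
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
predicted_dna = (
    "ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCG"
    "TTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTC"
    "TTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA"
)

# The predicted DNA should encode the input protein followed by a stop codon
matches = translate(predicted_dna) == protein + "*"
print("Predicted DNA encodes the input protein:", matches)
print(to_processed_input(protein[:3]))
```

A check like this only confirms that the prediction is a valid encoding of the protein. It says nothing about how well the codon choice suits the target organism, and organisms with a non-standard genetic code would need a different table.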

Additional Resources

Project Website: https://adibvafa.github.io/CodonTransformer/
GitHub Repository: https://github.com/Adibvafa/CodonTransformer
Google Colab Demo: https://adibvafa.github.io/CodonTransformer/GoogleColab
PyPI Package: https://pypi.org/project/CodonTransformer/
Paper: https://www.biorxiv.org/content/10.1101/2024.09.13.612903