ZhiyuanChen committed on
Commit
b641ec4
1 Parent(s): a185d4c

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +276 -0
  2. config.json +52 -0
  3. model.safetensors +3 -0
  4. pytorch_model.bin +3 -0
  5. special_tokens_map.json +12 -0
  6. tokenizer_config.json +68 -0
  7. vocab.txt +131 -0
README.md ADDED
@@ -0,0 +1,276 @@
---
language: dna
tags:
  - Biology
  - RNA
license: agpl-3.0
datasets:
  - multimolecule/ena
library_name: multimolecule
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
  - example_title: "PRNP"
    text: "CTG<mask>AAGCGGCCCACGCGGACTGACGGGCGGGGG"
    output:
      - label: "GUG"
        score: 0.010724939405918121
      - label: "GNC"
        score: 0.010476444847881794
      - label: "AUC"
        score: 0.010415051132440567
      - label: "GGG"
        score: 0.010389575734734535
      - label: "AAU"
        score: 0.01017767284065485
---

# CaLM

Pre-trained model on protein-coding DNA (cDNA) using a masked language modeling (MLM) objective.

## Statement

_Codon language embeddings provide strong signals for use in protein engineering_ is published in [Nature Machine Intelligence](https://doi.org/10.1038/s42256-024-00791-0), which is a Closed Access / Author-Fee journal.

> Machine learning has been at the forefront of the movement for free and open access to research.
>
> We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the [Statement on Nature Machine Intelligence](https://openaccess.engineering.oregonstate.edu).

## Disclaimer

This is an UNOFFICIAL implementation of the [Codon language embeddings provide strong signals for use in protein engineering](https://doi.org/10.1101/2022.12.15.519894) by Carlos Outeiral and Charlotte M. Deane.

The OFFICIAL repository of CaLM is at [oxpig/CaLM](https://github.com/oxpig/CaLM).

> [!WARNING]
> The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation.
> This is because the proposed method is published in a Closed Access / Author-Fee journal.

**The team releasing CaLM did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

CaLM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style model pre-trained on a large corpus of protein-coding DNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of DNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | --------- | ----------------- | ------------------ | --------- | -------- | -------------- |
| 12         | 768         | 12        | 3072              | 85.75              | 22.36     | 11.17    | 1024           |

### Links

- **Code**: [multimolecule.calm](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/calm)
- **Weights**: [multimolecule/calm](https://huggingface.co/multimolecule/calm)
- **Data**: [European Nucleotide Archive](https://ebi.ac.uk/ena)
- **Paper**: [Codon language embeddings provide strong signals for use in protein engineering](https://doi.org/10.1101/2022.12.15.519894)
- **Developed by**: Carlos Outeiral, Charlotte M. Deane
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
- **Original Repository**: [https://github.com/oxpig/CaLM](https://github.com/oxpig/CaLM)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> import multimolecule  # you must import multimolecule to register models
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='multimolecule/calm')
>>> unmasker("ctg<mask>aagcggcccacgcggactgacgggcggggg")

[{'score': 0.010724939405918121,
  'token': 73,
  'token_str': 'GUG',
  'sequence': 'CUG GUG AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
 {'score': 0.010476444847881794,
  'token': 77,
  'token_str': 'GNC',
  'sequence': 'CUG GNC AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
 {'score': 0.010415051132440567,
  'token': 22,
  'token_str': 'AUC',
  'sequence': 'CUG AUC AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
 {'score': 0.010389575734734535,
  'token': 68,
  'token_str': 'GGG',
  'sequence': 'CUG GGG AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
 {'score': 0.01017767284065485,
  'token': 9,
  'token_str': 'AAU',
  'sequence': 'CUG AAU AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'}]
```

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import RnaTokenizer, CaLmModel


tokenizer = RnaTokenizer.from_pretrained('multimolecule/calm')
model = CaLmModel.from_pretrained('multimolecule/calm')

text = "GCCAGTCGCTGACAGCCGCGG"
input = tokenizer(text, return_tensors='pt')

output = model(**input)
```

#### Sequence Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune it for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, CaLmForSequencePrediction


tokenizer = RnaTokenizer.from_pretrained('multimolecule/calm')
model = CaLmForSequencePrediction.from_pretrained('multimolecule/calm')

text = "GCCAGTCGCTGACAGCCGCGG"
input = tokenizer(text, return_tensors='pt')
label = torch.tensor([1])

output = model(**input, labels=label)
```

#### Nucleotide Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune it for a nucleotide-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, CaLmForNucleotidePrediction


tokenizer = RnaTokenizer.from_pretrained('multimolecule/calm')
model = CaLmForNucleotidePrediction.from_pretrained('multimolecule/calm')

text = "GCCAGTCGCTGACAGCCGCGG"
input = tokenizer(text, return_tensors='pt')
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```

#### Contact Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune it for a contact-level task in PyTorch:

```python
import torch
from multimolecule import RnaTokenizer, CaLmForContactPrediction


tokenizer = RnaTokenizer.from_pretrained('multimolecule/calm')
model = CaLmForContactPrediction.from_pretrained('multimolecule/calm')

text = "GCCAGTCGCTGACAGCCGCGG"
input = tokenizer(text, return_tensors='pt')
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
```

## Training Details

CaLM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 25% of the tokens in the input, then runs the entire masked sequence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

### Training Data

The CaLM model was pre-trained on coding sequences of all organisms available on the [European Nucleotide Archive (ENA)](https://ebi.ac.uk/ena). The European Nucleotide Archive provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

CaLM collected the coding sequences of all organisms from ENA in April 2022, totalling 114,214,475 sequences. Only high-level assembly information (dataclass CON) was used. Sequences matching any of the following criteria were filtered out (see the sketch after this list):

- contain unknown nucleotides (`N`, `Y`, `R`)
- do not start with the `ATG` start codon
- contain interstitial stop codons
- have a length that is not a multiple of three

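These filters reduce to a few string checks. The sketch below is an illustrative reimplementation rather than the original CaLM pipeline; the helper name `keep_cds` and the exact treatment of ambiguity codes are assumptions:

```python
# Illustrative sketch of the cDNA filtering criteria described above;
# not the original CaLM data-processing code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
AMBIGUOUS = set("NYR")  # unknown/ambiguity codes mentioned above


def keep_cds(seq: str) -> bool:
    """Return True if a coding sequence passes all four filters."""
    seq = seq.upper()
    if AMBIGUOUS & set(seq):       # contains unknown nucleotides
        return False
    if not seq.startswith("ATG"):  # start codon must be ATG
        return False
    if len(seq) % 3 != 0:          # length must be a multiple of three
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # interstitial stop codons: any stop codon before the final codon
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False
    return True


print(keep_cds("ATGGCCAGTCGCTGA"))  # True
print(keep_cds("ATGTAAGGCTGA"))     # False: interstitial stop codon (TAA)
```
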
To reduce redundancy, CaLM grouped the entries by organism and applied CD-HIT (CD-HIT-EST) with a cut-off at 40% sequence identity to the translated protein sequences.

The final dataset contains 9,858,385 cDNA sequences.

Note that the alphabet in the original implementation is RNA instead of DNA; we therefore use [`RnaTokenizer`][multimolecule.RnaTokenizer] to tokenize the sequences. The `RnaTokenizer` of `multimolecule` converts "T"s to "U"s for you; you may disable this behaviour by passing `replace_T_with_U=False`.

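A minimal illustration of this behaviour; the output shown in the comment is indicative, based on the codon vocabulary shipped with this checkpoint:

```python
# DNA input is converted to RNA codons by default (T -> U, split into 3-mers).
from multimolecule import RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained('multimolecule/calm')
print(tokenizer.tokenize("GCCAGTCGCTGA"))
# expected: ['GCC', 'AGU', 'CGC', 'UGA'] -- each T becomes U and the sequence
# is split into codons; pass replace_T_with_U=False to keep the DNA alphabet.
```
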
### Training Procedure

#### Preprocessing

CaLM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT (a sketch follows the list):

- 25% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the remaining 10% of the cases, the masked tokens are left as is.

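The sketch below illustrates this 25% / 80-10-10 scheme on a batch of token ids. It is an illustration of the procedure described above, not the original training code; the special-token ids and vocabulary size follow this repository's `tokenizer_config.json` and `config.json`:

```python
# Illustrative 25% / 80-10-10 masking sketch (not the original CaLM training code).
# Ids follow this checkpoint: mask token id 4, codon tokens start at id 6, vocab size 131.
import torch

MASK_TOKEN_ID = 4
VOCAB_SIZE = 131
MASK_PROB = 0.25


def mask_tokens(input_ids: torch.Tensor, special_tokens_mask: torch.Tensor):
    labels = input_ids.clone()

    # Sample which positions are masked; never mask special tokens (<cls>, <eos>, <pad>, ...).
    probability_matrix = torch.full(labels.shape, MASK_PROB)
    probability_matrix.masked_fill_(special_tokens_mask.bool(), value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked positions

    # 80% of the masked positions are replaced by <mask>.
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = MASK_TOKEN_ID

    # 10% are replaced by a random codon token; the remaining 10% are left unchanged.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_tokens = torch.randint(6, VOCAB_SIZE, labels.shape, dtype=input_ids.dtype)
    input_ids[randomized] = random_tokens[randomized]

    return input_ids, labels
```
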
#### PreTraining

The model was trained on 4 NVIDIA Quadro RTX 4000 GPUs with 8 GiB of memory each, using the following hyperparameters (a rough optimizer setup is sketched after the list):

- Learning rate: 1e-4
- Optimizer: AdamW
- Learning rate scheduler: cosine
- Learning rate warm-up: 1,000 steps
- Epochs: 14
- Batch size: 1,000

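A rough PyTorch/Transformers sketch of the optimizer and schedule implied by these hyperparameters. This is not the original training script: it assumes the `CaLmForPreTraining` class listed in `config.json` is importable from `multimolecule`, and the step count is only an estimate derived from the dataset size and batch size above:

```python
# Rough sketch of the optimizer/schedule implied by the hyperparameters above;
# not the original CaLM training script.
import torch
from transformers import get_cosine_schedule_with_warmup
from multimolecule import CaLmForPreTraining  # class name taken from config.json

model = CaLmForPreTraining.from_pretrained('multimolecule/calm')

# ~9,858,385 sequences / batch size 1,000 ~= 9,859 steps per epoch, for 14 epochs
steps_per_epoch = 9_858_385 // 1_000 + 1
num_training_steps = 14 * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)
```
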
## Citation

**BibTeX**:

```bibtex
@article{outeiral2022codon,
  author = {Outeiral, Carlos and Deane, Charlotte M.},
  title = {Codon language embeddings provide strong signals for protein engineering},
  elocation-id = {2022.12.15.519894},
  year = {2022},
  doi = {10.1101/2022.12.15.519894},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models{\textquoteright} capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology. Competing Interest Statement: The authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894},
  eprint = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894.full.pdf},
  journal = {bioRxiv}
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [CaLM paper](https://doi.org/10.1101/2022.12.15.519894) for questions or comments on the paper/model.

## License

This model is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```
config.json ADDED
@@ -0,0 +1,52 @@
{
  "architectures": [
    "CaLmForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "codon": true,
  "emb_layer_norm_before": false,
  "eos_token_id": 2,
  "head": {
    "act": null,
    "bias": true,
    "dropout": 0.0,
    "hidden_size": null,
    "layer_norm_eps": 1e-12,
    "num_labels": null,
    "output_name": null,
    "problem_type": null,
    "transform": null,
    "transform_act": "gelu"
  },
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "lm_head": {
    "act": null,
    "bias": true,
    "dropout": 0.0,
    "hidden_size": 768,
    "layer_norm_eps": 1e-12,
    "output_name": null,
    "transform": "nonlinear",
    "transform_act": "gelu"
  },
  "mask_token_id": 4,
  "max_position_embeddings": 1026,
  "model_type": "calm",
  "null_token_id": 5,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "rotary",
  "token_dropout": false,
  "torch_dtype": "float32",
  "transformers_version": "4.44.0",
  "unk_token_id": 3,
  "use_cache": true,
  "vocab_size": 131
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5b5780b63a779f62092b9f4824b1bfc7858cfdc20d6dff5fa9fe530dbef77de5
size 343021604
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60d95021c255b5c972bca21e8c021c457537d19da6d13689f12cefd5ccdaf29c
size 343066714
special_tokens_map.json ADDED
@@ -0,0 +1,12 @@
{
  "additional_special_tokens": [
    "<null>"
  ],
  "bos_token": "<cls>",
  "cls_token": "<cls>",
  "eos_token": "<eos>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "<eos>",
  "unk_token": "<unk>"
}
tokenizer_config.json ADDED
@@ -0,0 +1,68 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<cls>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<eos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<mask>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<null>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<null>"
  ],
  "bos_token": "<cls>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<cls>",
  "codon": true,
  "eos_token": "<eos>",
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "nmers": 3,
  "pad_token": "<pad>",
  "replace_T_with_U": true,
  "sep_token": "<eos>",
  "tokenizer_class": "RnaTokenizer",
  "unk_token": "<unk>"
}
vocab.txt ADDED
@@ -0,0 +1,131 @@
<pad>
<cls>
<eos>
<unk>
<mask>
<null>
AAA
AAC
AAG
AAU
AAN
ACA
ACC
ACG
ACU
ACN
AGA
AGC
AGG
AGU
AGN
AUA
AUC
AUG
AUU
AUN
ANA
ANC
ANG
ANU
ANN
CAA
CAC
CAG
CAU
CAN
CCA
CCC
CCG
CCU
CCN
CGA
CGC
CGG
CGU
CGN
CUA
CUC
CUG
CUU
CUN
CNA
CNC
CNG
CNU
CNN
GAA
GAC
GAG
GAU
GAN
GCA
GCC
GCG
GCU
GCN
GGA
GGC
GGG
GGU
GGN
GUA
GUC
GUG
GUU
GUN
GNA
GNC
GNG
GNU
GNN
UAA
UAC
UAG
UAU
UAN
UCA
UCC
UCG
UCU
UCN
UGA
UGC
UGG
UGU
UGN
UUA
UUC
UUG
UUU
UUN
UNA
UNC
UNG
UNU
UNN
NAA
NAC
NAG
NAU
NAN
NCA
NCC
NCG
NCU
NCN
NGA
NGC
NGG
NGU
NGN
NUA
NUC
NUG
NUU
NUN
NNA
NNC
NNG
NNU
NNN
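
For reference, the 131 entries above are the 6 special tokens followed by every 3-mer over the alphabet `A`, `C`, `G`, `U`, `N` (5^3 = 125 codons). A minimal sketch that regenerates the vocabulary in the same order:

```python
# Regenerates the 131-entry vocabulary: 6 special tokens + all 125 codons
# over the alphabet A, C, G, U, N, in the same order as vocab.txt.
from itertools import product

SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
CODONS = ["".join(p) for p in product("ACGUN", repeat=3)]

vocab = SPECIAL_TOKENS + CODONS
assert len(vocab) == 131
print("\n".join(vocab))
```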