Update README.md
README.md
CHANGED
@@ -66,6 +66,11 @@ The dataset consists of approximately 10.5 million genomic sequences across 48 d
 then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter
 N was present or if the sequence length was not a multiple of 6).
 
+**Tokenization example**
+
+nucleotide sequence: ```ATCCCGGNNTCGACACN```\
+tokens: ```<CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>```
+
 #### Training
 The MLM objective was used to pre-train AgroNT in a self-supervised manner. In a self-supervised learning setting annotations (supervision) for each sequence
 are not needed as we can mask some proportion of the sequence and use the information contained in the unmasked portion of the sequence to predict the masked locations.
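
The tokenization rule that this change illustrates (scan left to right, emit a 6-mer token when possible, otherwise fall back to a standalone token) can be sketched in a few lines of Python. This is a minimal illustration only, not AgroNT's actual tokenizer: the function name `tokenize_kmers`, the `<...>` token formatting, and the assumption that only `A`, `C`, `G`, `T` may appear inside a 6-mer are all hypothetical choices made so the sketch reproduces the example in the README.

```python
from typing import List

# Assumption: only these letters may appear inside a 6-mer token.
KMER_ALPHABET = set("ACGT")


def tokenize_kmers(sequence: str, k: int = 6) -> List[str]:
    """Greedy left-to-right k-mer tokenization (hypothetical sketch)."""
    tokens = ["<CLS>"]  # the README example starts with a <CLS> token
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= KMER_ALPHABET:
            # A full k-mer made only of A/C/G/T: emit it as one token.
            tokens.append(f"<{chunk}>")
            i += k
        else:
            # An N is in the window, or fewer than k letters remain:
            # emit a single standalone token and advance by one.
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens


print(" ".join(tokenize_kmers("ATCCCGGNNTCGACACN")))
# -> <CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>
```

The greedy fallback advances one character at a time whenever a full 6-mer cannot be formed, which is why the `G` before `NN` and the trailing `C` and `N` come out as standalone tokens.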