etrop committed on
Commit
9be6ae0
1 Parent(s): 239d34a

Update README.md

  1. README.md +5 -0
README.md CHANGED
@@ -66,6 +66,11 @@ The dataset consists of approximately 10.5 million genomic sequences across 48 d
  then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter
  N was present or if the sequence length was not a multiple of 6).
 
+ **Tokenization example**
+
+ nucleotide sequence: ```ATCCCGGNNTCGACACN```\
+ tokens: ```<CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>```
+
  #### Training
  The MLM objective was used to pre-train AgroNT in a self-supervised manner. In a self-supervised learning setting annotations (supervision) for each sequence
  are not needed as we can mask some proportion of the sequence and use the information contained in the unmasked portion of the sequence to predict the masked locations.
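
For readers of this diff, the tokenization rule and the masking idea described in the hunk can be sketched in plain Python. This is only an illustration inferred from the README text and the example added in this commit, not the AgroNT tokenizer or training code. The function names `tokenize` and `mask_tokens`, the `<MASK>` token string, and the default masking probability of 0.15 are all assumptions made for this sketch, and the masking shown is a simplified variant that always substitutes `<MASK>`.

```python
import random

NUCLEOTIDES = set("ACGT")
K = 6  # k-mer length stated in the README


def tokenize(sequence, k=K):
    """Greedy left-to-right tokenization (hypothetical helper).

    Emits a k-mer token when the next k characters are all A/C/G/T,
    otherwise emits a single-character token and advances by one.
    """
    tokens = ["<CLS>"]
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= NUCLEOTIDES:
            tokens.append(f"<{chunk}>")
            i += k
        else:
            # Falls back to a standalone token, e.g. for N or a short tail.
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of non-<CLS> tokens with <MASK>.

    mask_prob=0.15 and the "<MASK>" string are assumed values,
    not taken from the README.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok != "<CLS>" and rng.random() < mask_prob:
            masked.append("<MASK>")
            labels.append(tok)   # target the model would have to predict
        else:
            masked.append(tok)
            labels.append(None)  # position not scored
    return masked, labels


tokens = tokenize("ATCCCGGNNTCGACACN")
print(" ".join(tokens))
# → <CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>
```

The printed tokens match the example added in this commit. `mask_tokens(tokens)` returns the masked sequence together with per-position labels, which is the input/target pairing an MLM objective relies on.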