# GPT2-Spanish

GPT2-Spanish is a language generation model trained from scratch with 11.5GB of Spanish texts and with a Byte Pair Encoding (BPE) tokenizer that was trained for this purpose. The parameters used are the same as the small version of the original OpenAI GPT2 model.
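The snippet below is a minimal usage sketch with the Hugging Face `transformers` library, assuming the model is published on the Hub; the repository identifier used here is a placeholder, not the actual model name.

```python
from transformers import pipeline

# Placeholder repository id; replace with the actual GPT2-Spanish repo on the Hub.
MODEL_ID = "your-username/gpt2-spanish"

# Text-generation pipeline; the model uses the GPT2-small configuration
# (12 layers, hidden size 768, 12 attention heads).
generator = pipeline("text-generation", model=MODEL_ID)

# Generate a short Spanish continuation for a prompt.
outputs = generator("Había una vez", max_length=50, num_return_sequences=1)
print(outputs[0]["generated_text"])
```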
## Corpus

This model was trained with a corpus of 11.5GB of texts, corresponding to 3.5GB of Wikipedia articles and 8GB of books (narrative, short stories, theater, poetry, essays, and popularization).
## Tokenizer

The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for Unicode characters) with a vocabulary size of 50257. The inputs are sequences of 1024 consecutive tokens.
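Below is a minimal sketch of loading the tokenizer with `transformers`, again assuming a Hub-hosted repository (the identifier is a placeholder); it checks the vocabulary size and truncates inputs to the 1024-token limit.

```python
from transformers import GPT2TokenizerFast

# Placeholder repository id; replace with the actual GPT2-Spanish repo on the Hub.
MODEL_ID = "your-username/gpt2-spanish"

tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_ID)

# The byte-level BPE vocabulary should hold 50257 entries.
print(len(tokenizer))

# Encode a Spanish sentence; sequences longer than 1024 tokens are truncated
# to the model's maximum input length.
encoding = tokenizer(
    "La inteligencia artificial avanza rápidamente en el procesamiento del lenguaje.",
    truncation=True,
    max_length=1024,
)
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][:10]))
```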