marksverdhei committed on
Commit
4c6e94e
2 Parent(s): 7ca9fdf d3a0093

Merge branch 'main' of https://huggingface.co/marksverdhei/t5-deshuffle into main

Files changed (1)
  1. README.md +31 -0
README.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ language: en
+ widget:
+ - text: ' brown dog fox jumped lazy over quick the the '
+ datasets:
+ - 'stas/c4-en-10k'
+ ---
+
+ # T5-deshuffle
+
+ Bag of words (BOW) is a simple and common encoding that lets statistical models discover patterns in language.
+ However, BOW is a lossy compression that discards a very important feature of text: word order.
+
+ This model is trained to predict the most probable order of an unordered token sequence,
+ using a subset of the c4 dataset, and can thus be seen as a "bag-of-words decoder".
+
+ Currently, it does not perform well. I'm planning to re-train on a larger subset of c4 later (after May).
+
+ How to run:
+ ```python
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+ tokenizer = T5Tokenizer.from_pretrained("marksverdhei/t5-deshuffle")
+ model = T5ForConditionalGeneration.from_pretrained("marksverdhei/t5-deshuffle")
+
+ prompt = ' brown dog fox jumped lazy over quick the the '
+
+ ids = tokenizer(prompt, return_tensors="pt").input_ids
+ generated_tokens, = model.generate(ids)
+ print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
+ ```
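
The README's example prompt looks like a lowercased, alphabetically sorted copy of a sentence, padded with a space on each side. The sketch below shows one way such an input could be built from ordinary text. The helper name `to_bag_of_words` and the exact preprocessing (lowercasing, whitespace splitting, sorting, padding) are assumptions inferred from that single example, not the documented training procedure.

```python
import random
from typing import Optional


def to_bag_of_words(sentence: str, sort: bool = True, seed: Optional[int] = None) -> str:
    """Hypothetical helper: turn a sentence into an unordered-word prompt.

    Assumption: the model expects lowercased, space-separated words with a
    leading and trailing space, as in the README's example prompt.
    """
    words = sentence.lower().split()
    if sort:
        # Alphabetical order removes all information about the original order.
        words = sorted(words)
    else:
        # Alternatively, shuffle with an optional seed for reproducibility.
        random.Random(seed).shuffle(words)
    return " " + " ".join(words) + " "


prompt = to_bag_of_words("The quick brown fox jumped over the lazy dog")
print(repr(prompt))
```

The resulting string can be passed to the tokenizer in place of the hard-coded `prompt` in the snippet above.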