Text2Text Generation
Transformers
PyTorch
English
t5
text-generation-inference
Inference Endpoints
nreimers committed on
Commit
86aa170
1 Parent(s): 57d2cbe

update readme

Files changed (1)
  1. README.md +68 -1
README.md CHANGED
@@ -1,3 +1,70 @@
+# doc2query/all-with_prefix-t5-base-v1
+
+## Usage
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+model_name = 'doc2query/all-with_prefix-t5-base-v1'
+tokenizer = T5Tokenizer.from_pretrained(model_name)
+model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+# The prefix selects which type of output the model should generate.
+prefix = "answer2question: "
+text = prefix + "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
+
+# Truncate the input to 384 word pieces and sample up to 64 output word pieces.
+input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')
+outputs = model.generate(
+    input_ids=input_ids,
+    max_length=64,
+    do_sample=True,
+    top_p=0.95,
+    num_return_sequences=5)
+
+print("Text:")
+print(text)
+
+print("\nGenerated Queries:")
+for i in range(len(outputs)):
+    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
+    print(f'{i + 1}: {query}')
+```
 
 ## Training
-575k train steps
+This model was created by fine-tuning [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) for 575k training steps. For the training script, see `train_script.py` in this repository.
+
+The input text was truncated to 384 word pieces. The output text was generated with up to 64 word pieces.
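+
+The snippet below is only an illustrative sketch of a seq2seq fine-tuning step with these truncation settings, not the repository's actual `train_script.py`; the (source, target) pair and the single gradient step are placeholders:
+
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+tokenizer = T5Tokenizer.from_pretrained('google/t5-v1_1-base')
+model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
+
+# One hypothetical (prefixed input, target query) training pair.
+source = "text2query: Python is an interpreted, high-level programming language."
+target = "what is python?"
+
+# Inputs are truncated to 384 word pieces, targets to 64.
+inputs = tokenizer(source, max_length=384, truncation=True, return_tensors='pt')
+labels = tokenizer(target, max_length=64, truncation=True, return_tensors='pt').input_ids
+
+loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss
+loss.backward()  # a real training script would run an optimizer step here
+```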
+
+This model was trained on a large collection of datasets. For the exact dataset names and weights, see `data_config.json` in this repository. Most of the datasets are available at [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers).
+
+The datasets include, among others:
+- (title, body) pairs from [Reddit](https://huggingface.co/datasets/sentence-transformers/reddit-title-body)
+- (title, body) pairs and (title, answer) pairs from StackExchange and Yahoo! Answers
+- (title, review) pairs from Amazon reviews
+- (query, paragraph) pairs from MS MARCO, NQ, and GooAQ
+- (question, duplicate_question) pairs from Quora and WikiAnswers
+- (title, abstract) pairs from S2ORC
+
+## Prefix
+
+This model was trained **with prefixes**: you start the input text with a specific prefix that defines what type of output text you would like to receive. Depending on the prefix, the output is different.
+
+E.g., the above text about Python produces the following output:
+
+| Prefix | Output |
+| --- | --- |
+| answer2question | Why should I use python in my business? ; What is the difference between Python and.NET? ; what is the python design philosophy? |
+| review2title | Python a powerful and useful language ; A new and improved programming language ; Object-oriented, practical and accessibl |
+| abstract2title | Python: A Software Development Platform ; A Research Guide for Python X: Conceptual Approach to Programming ; Python : Language and Approach |
+| text2query | is python a low level language? ; what is the primary idea of python? ; is python a programming language? |
+
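+A comparison like the one in this table can be reproduced with a loop such as the following sketch, which reuses the `tokenizer` and `model` from the usage example above; the shortened `passage` variable is just an illustration:
+
+```python
+passage = "Python is an interpreted, high-level and general-purpose programming language."
+
+for prefix in ['answer2question', 'review2title', 'abstract2title', 'text2query']:
+    # Prepend the prefix, then generate a few sampled outputs for it.
+    input_ids = tokenizer.encode(f"{prefix}: {passage}", max_length=384, truncation=True, return_tensors='pt')
+    outputs = model.generate(input_ids=input_ids, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3)
+    queries = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
+    print(f"{prefix}: {' ; '.join(queries)}")
+```
+
+Because of sampling (`do_sample=True`), the generated outputs will vary between runs.
+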
+These are all available prefixes:
+- text2reddit
+- question2title
+- answer2question
+- abstract2title
+- review2title
+- news2title
+- text2query
+- question2question
+
+For the datasets and weights of the different prefixes, see `data_config.json` in this repository.
+
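+As a quick way to inspect that configuration, the file can be fetched directly from the Hub; the snippet below only downloads and prints it, without assuming anything about its internal layout:
+
+```python
+import json
+from huggingface_hub import hf_hub_download
+
+# Download data_config.json from this model repository and print it.
+path = hf_hub_download(repo_id='doc2query/all-with_prefix-t5-base-v1', filename='data_config.json')
+with open(path) as f:
+    print(json.dumps(json.load(f), indent=2))
+```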