update readme
README.md
@@ -1,3 +1,70 @@
# doc2query/all-with_prefix-t5-base-v1

## Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/all-with_prefix-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The prefix selects the type of output to generate (see the Prefix section).
prefix = "answer2question: "
text = prefix + "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# Truncate the input to 384 word pieces, the limit used during training.
input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')

# Sample 5 outputs with nucleus sampling, each up to 64 word pieces long.
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i, output in enumerate(outputs):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i + 1}: {query}')
```

## Training
This model was fine-tuned from [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) for 575k training steps. For the training script, see `train_script.py` in this repository.

Input text was truncated to 384 word pieces. Output text was generated up to 64 word pieces.

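
Because input beyond 384 word pieces is cut off, it can help to check a document's token count before generating queries. The following is a minimal sketch, not part of the training code: the helper name `will_be_truncated` and the constant `MAX_INPUT_TOKENS` are introduced here for illustration, and the sketch assumes a tokenizer whose `encode` method returns a list of token ids, as the `T5Tokenizer` in the Usage section does.

```python
MAX_INPUT_TOKENS = 384  # input limit stated in this README


def will_be_truncated(tokenizer, text, max_length=MAX_INPUT_TOKENS):
    """Return True if `text` encodes to more than `max_length` token ids.

    `tokenizer` is any object with an `encode(text)` method that returns a
    list of token ids (hypothetical helper, for illustration only).
    """
    return len(tokenizer.encode(text)) > max_length


# Example with the tokenizer from the Usage section (not executed here):
# tokenizer = T5Tokenizer.from_pretrained('doc2query/all-with_prefix-t5-base-v1')
# if will_be_truncated(tokenizer, "text2query: " + document):
#     print("Input exceeds 384 word pieces and will be cut off.")
```

The prefix counts toward the 384 word pieces, so the check should be run on the full prefixed string.
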

This model was trained on a large collection of datasets. For the exact dataset names and weights, see `data_config.json` in this repository. Most of the datasets are available at [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers).

The datasets include, among others:
- (title, body) pairs from [Reddit](https://huggingface.co/datasets/sentence-transformers/reddit-title-body)
- (title, body) pairs and (title, answer) pairs from StackExchange and Yahoo Answers
- (title, review) pairs from Amazon reviews
- (query, paragraph) pairs from MS MARCO, NQ, and GooAQ
- (question, duplicate_question) pairs from Quora and WikiAnswers
- (title, abstract) pairs from S2ORC

## Prefix

This model was trained **with prefixes**: you start the text with a specific prefix that defines what type of output text you would like to receive. Depending on the prefix, the output is different.

For example, the text about Python in the Usage section produces the following output:

| Prefix | Output |
| --- | --- |
| answer2question | Why should I use python in my business? ; What is the difference between Python and.NET? ; what is the python design philosophy? |
| review2title | Python a powerful and useful language ; A new and improved programming language ; Object-oriented, practical and accessibl |
| abstract2title | Python: A Software Development Platform ; A Research Guide for Python X: Conceptual Approach to Programming ; Python : Language and Approach |
| text2query | is python a low level language? ; what is the primary idea of python? ; is python a programming language? |

These are all the available prefixes:
- text2reddit
- question2title
- answer2question
- abstract2title
- review2title
- news2title
- text2query
- question2question

For the datasets and weights used for each prefix, see `data_config.json` in this repository.
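
To compare what the different prefixes produce for the same passage, the prefixed inputs can be built in a loop. The sketch below is illustrative only: `PREFIXES` copies the prefix list from this README, and `build_input` and `generate_for_prefixes` are hypothetical helper names that do not exist in this repository. It assumes the `prefix: text` input format and the generation settings shown in the Usage section.

```python
# Prefixes listed in this README.
PREFIXES = (
    "text2reddit",
    "question2title",
    "answer2question",
    "abstract2title",
    "review2title",
    "news2title",
    "text2query",
    "question2question",
)


def build_input(prefix, text):
    """Prepend a task prefix to `text` in the 'prefix: text' format."""
    if prefix not in PREFIXES:
        raise ValueError(f"Unknown prefix: {prefix!r}")
    return f"{prefix}: {text}"


def generate_for_prefixes(model, tokenizer, text, prefixes=PREFIXES, num_outputs=3):
    """Return a dict mapping each prefix to a list of generated strings.

    `model` and `tokenizer` are the objects loaded in the Usage section
    (hypothetical helper, for illustration only).
    """
    results = {}
    for prefix in prefixes:
        input_ids = tokenizer.encode(
            build_input(prefix, text),
            max_length=384,
            truncation=True,
            return_tensors="pt",
        )
        outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_outputs,
        )
        results[prefix] = [
            tokenizer.decode(output, skip_special_tokens=True) for output in outputs
        ]
    return results
```

Because sampling is enabled, the generated strings vary from run to run.
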