+# doc2query/all-with_prefix-t5-base-v1
+## Usage
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+model_name = 'doc2query/all-with_prefix-t5-base-v1'
+tokenizer = T5Tokenizer.from_pretrained(model_name)
+model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+# The prefix selects the type of output; see the "Prefix" section for all options.
+prefix = "answer2question: "
+text = prefix + "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
+
+# Inputs longer than 384 word pieces are truncated, matching the training setup.
+input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')
+
+# Sample 5 outputs of up to 64 word pieces each using nucleus (top-p) sampling.
+outputs = model.generate(
+    input_ids=input_ids,
+    max_length=64,
+    do_sample=True,
+    top_p=0.95,
+    num_return_sequences=5)
+
+print("Text:")
+print(text)
+
+print("\nGenerated Queries:")
+for i, output in enumerate(outputs):
+    query = tokenizer.decode(output, skip_special_tokens=True)
+    print(f'{i + 1}: {query}')
+```
 ## Training
+This model is a fine-tuned version of [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base), trained for 575k steps. For the training script, see `train_script.py` in this repository.
+The input text was truncated to 384 word pieces. The output text was generated up to a length of 64 word pieces.
+This model was trained on a large collection of datasets. For the exact dataset names and weights, see `data_config.json` in this repository. Most of the datasets are available at [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers).
+The datasets include, among others:
+- (title, body) pairs from [Reddit](https://huggingface.co/datasets/sentence-transformers/reddit-title-body)
+- (title, body) pairs and (title, answer) pairs from StackExchange and Yahoo! Answers
+- (title, review) pairs from Amazon reviews
+- (query, paragraph) pairs from MS MARCO, NQ, and GooAQ
+- (question, duplicate_question) pairs from Quora and WikiAnswers
+- (title, abstract) pairs from S2ORC
+## Prefix
+This model was trained **with a prefix**: you start the text with a specific prefix that defines what type of output text you would like to receive. Depending on the prefix, the output is different.
+For example, the text about Python above produces the following outputs:
+| Prefix | Output |
+| --- | --- |
+| answer2question | Why should I use python in my business? ; What is the difference between Python and.NET? ;  what is the python design philosophy? |
+| review2title | Python a powerful and useful language ; A new and improved programming language ; Object-oriented, practical and accessibl |
+| abstract2title | Python: A Software Development Platform ; A Research Guide for Python X: Conceptual Approach to Programming ; Python : Language and Approach |
+| text2query |  is python a low level language? ; what is the primary idea of python? ; is python a programming language? |
+These are all available prefixes:
+- text2reddit
+- question2title
+- answer2question
+- abstract2title
+- review2title
+- news2title
+- text2query
+- question2question
+For the datasets and weights used for each prefix, see `data_config.json` in this repository.
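To compare what the different prefixes produce for the same text, the generation call from the Usage section can be wrapped in a small loop. The following is a minimal sketch, not part of this repository: `AVAILABLE_PREFIXES`, `build_input` and `generate_per_prefix` are hypothetical names introduced here, and the `"<prefix>: <text>"` format is assumed from the Usage example.

```python
from typing import Dict, Iterable, List

# Prefixes listed in this model card.
AVAILABLE_PREFIXES = (
    "text2reddit",
    "question2title",
    "answer2question",
    "abstract2title",
    "review2title",
    "news2title",
    "text2query",
    "question2question",
)


def build_input(prefix: str, text: str) -> str:
    """Prepend `prefix` to `text` in the "<prefix>: <text>" form (assumed from the Usage example)."""
    if prefix not in AVAILABLE_PREFIXES:
        raise ValueError(f"Unknown prefix {prefix!r}; expected one of {AVAILABLE_PREFIXES}")
    return f"{prefix}: {text}"


def generate_per_prefix(model, tokenizer, text: str,
                        prefixes: Iterable[str] = AVAILABLE_PREFIXES,
                        num_queries: int = 3) -> Dict[str, List[str]]:
    """Sample `num_queries` outputs for each prefix (hypothetical helper).

    `model` and `tokenizer` are expected to be the objects loaded in the Usage section.
    """
    results = {}
    for prefix in prefixes:
        # Same truncation and sampling settings as in the Usage example.
        input_ids = tokenizer.encode(
            build_input(prefix, text),
            max_length=384, truncation=True, return_tensors="pt")
        outputs = model.generate(
            input_ids=input_ids, max_length=64,
            do_sample=True, top_p=0.95, num_return_sequences=num_queries)
        results[prefix] = [
            tokenizer.decode(output, skip_special_tokens=True) for output in outputs
        ]
    return results
```

With `model` and `tokenizer` loaded as in the Usage section, calling `generate_per_prefix(model, tokenizer, text, prefixes=["text2query", "review2title"])` on the unprefixed text would return a dictionary mapping each prefix to its sampled outputs. Because sampling is enabled, the outputs differ between runs.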