---
language:
- en
tags:
- generated_from_trainer
- question-answering
- text-generation
model-index:
- name: LaMini-Flan-T5-77M-qa-generation
  results: []
---
# LaMini-Flan-T5-77M-qa-generation

## Model Description
This model is a fine-tuned version of [MBZUAI/LaMini-Flan-T5-77M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-77M), trained to generate question-answer pairs from raw text. It is based on the FLAN-T5 architecture and tuned specifically for question-answer generation.
## Key Features
- Base Model: MBZUAI/LaMini-Flan-T5-77M
- Task: Question and answer pair generation
- Training Data: agentlans/finewebedu-sft
- Added Tokens: `[QUESTION_END]`, `[ANSWER_END]` (see the tokenizer check after this list)
- Evaluation Loss: 1.3572
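
Because `[QUESTION_END]` and `[ANSWER_END]` are added tokens, you can verify that the tokenizer resolves each marker to a single vocabulary ID rather than splitting it into subwords. A minimal sketch using only the public tokenizer API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("agentlans/LaMini-Flan-T5-77M-qa-generation")

# Each marker should encode to exactly one ID; a multi-ID result would mean
# the token was not registered and is being split into subwords.
for marker in ["[QUESTION_END]", "[ANSWER_END]"]:
    ids = tokenizer(marker, add_special_tokens=False)["input_ids"]
    print(marker, "->", ids)
```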
## Usage
To use this model for generating question-answer pairs:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "agentlans/LaMini-Flan-T5-77M-qa-generation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode the source passage and generate question-answer pairs from it.
input_text = "Your input text here..."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)

# The decoded text keeps the [QUESTION_END]/[ANSWER_END] markers,
# which the parsing code below relies on.
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
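
Greedy decoding with `max_length` alone can yield repetitive pairs (see Limitations below). Standard `generate()` options such as beam search may help; the values here are illustrative, not tuned settings from this card:

```python
# Illustrative decoding options, not values from the original training run.
outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # discourage near-duplicate questions
    early_stopping=True,
)
```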
## Output Processing

The model generates output in the following format:

```
Question[QUESTION_END]Answer[ANSWER_END]Question[QUESTION_END]Answer[ANSWER_END]...
```
To parse this output into a structured format:
```python
import re

def clean_text(text):
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

def parse_qa_pairs(input_text):
    # Splitting on the captured delimiter leaves the QA segments at even
    # indices and the literal '[ANSWER_END]' markers at odd indices.
    qa_blocks = re.split(r'(\[ANSWER_END\])', input_text)
    pairs = []
    for i in range(0, len(qa_blocks) - 1, 2):
        parts = qa_blocks[i].split('[QUESTION_END]')
        if len(parts) == 2:  # skip malformed or truncated segments
            question, answer = map(clean_text, parts)
            if question and answer:
                pairs.append({"question": question, "answer": answer})
    return pairs

qa_pairs = parse_qa_pairs(decoded_output)
```
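
As a quick check, `parse_qa_pairs` can be run on a hand-written string in the model's output format (the text here is illustrative):

```python
sample = (
    "What covers most of Earth's surface?[QUESTION_END]"
    "The ocean covers over 70% of the planet's surface.[ANSWER_END]"
)
print(parse_qa_pairs(sample))
# [{'question': "What covers most of Earth's surface?",
#   'answer': "The ocean covers over 70% of the planet's surface."}]
```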
## Example

Input:

```
The ocean, covering over 70% of our planet's surface, is a vast and mysterious realm teeming with life and beauty. From the vibrant coral reefs that serve as bustling underwater cities to the deep, dark trenches that house some of the most bizarre creatures on Earth, the ocean is a treasure trove of biodiversity. It plays a crucial role in regulating the global climate, absorbing carbon dioxide and producing oxygen through its phytoplankton. Moreover, the ocean's depths remain largely unexplored, holding countless secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science. As we continue to learn more about this incredible ecosystem, it becomes increasingly clear that protecting our oceans is essential for the health of our planet and future generations.
```
Output:

```json
[
  {
    "question": "What is the ocean's role in regulating the global climate?",
    "answer": "The ocean plays a crucial role in regulating the global climate by absorbing carbon dioxide and producing oxygen through its phytoplankton."
  },
  {
    "question": "What are some of the key discoveries that could revolutionize our understanding of the ocean?",
    "answer": "The ocean's depths remain largely unexplored, holding secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science."
  },
  {
    "question": "What is the significance of protecting our oceans for future generations?",
    "answer": "Protecting our oceans is essential for the health of our planet and future generations because it is a vital part of our ecosystem and a vital resource for our survival and well-being."
  }
]
```
## Training Procedure

### Training Hyperparameters
The following hyperparameters were used during training:
- Learning rate: 0.0003
- Train batch size: 16
- Eval batch size: 16
- Seed: 42
- Gradient accumulation steps: 2
- Total train batch size: 32
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- LR scheduler type: linear
- LR scheduler warmup steps: 500
- Number of epochs: 10.0
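
For reference, these settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a minimal sketch, since the actual training script is not part of this card, and `output_dir` is a hypothetical path:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters listed above. The effective train batch size
# is per_device_train_batch_size * gradient_accumulation_steps = 16 * 2 = 32.
args = Seq2SeqTrainingArguments(
    output_dir="LaMini-Flan-T5-77M-qa-generation",  # hypothetical
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=2,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10.0,
)
```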
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.6321 | 1.2361 | 500 | 1.4333 |
| 1.5305 | 2.4722 | 1000 | 1.4013 |
| 1.4754 | 3.7083 | 1500 | 1.3719 |
| 1.4425 | 4.9444 | 2000 | 1.3693 |
| 1.3781 | 6.1805 | 2500 | 1.3647 |
| 1.3687 | 7.4166 | 3000 | 1.3572 |
| 1.3413 | 8.6527 | 3500 | 1.3596 |
| 1.3539 | 9.8888 | 4000 | 1.3594 |
## Limitations
- The model's performance may vary depending on the complexity and domain of the input text.
- The quality of generated questions and answers can be inconsistent across different topics.
- The model may occasionally generate irrelevant or repetitive question-answer pairs.
## Framework Versions
- Transformers 4.44.0
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.19.1