---
language:
  - en
tags:
  - generated_from_trainer
  - question-answering
  - text-generation
model-index:
  - name: LaMini-Flan-T5-77M-qa-generation
    results: []
---

LaMini-Flan-T5-77M-qa-generation

Model Description

This model is a fine-tuned version of MBZUAI/LaMini-Flan-T5-77M trained to generate question and answer pairs from raw text. It is based on the FLAN-T5 architecture and has been optimized for question-answer generation tasks.

Key Features

  • Base Model: MBZUAI/LaMini-Flan-T5-77M
  • Task: Question and answer pair generation
  • Training Data: agentlans/finewebedu-sft
  • Added Tokens: [QUESTION_END] and [ANSWER_END], which mark the end of each generated question and answer (see the check after this list)
  • Evaluation Loss: 1.3572
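
The two marker tokens delimit each generated question and answer. As a quick sanity check (a minimal sketch; the exact ids depend on how the tokens were added to the vocabulary), you can confirm that the markers are known to the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("agentlans/LaMini-Flan-T5-77M-qa-generation")

# Both markers should resolve to valid token ids rather than the unknown token.
for token in ["[QUESTION_END]", "[ANSWER_END]"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))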

Usage

To use this model for generating question-answer pairs:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model_name = "agentlans/LaMini-Flan-T5-77M-qa-generation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode the source passage and generate question-answer pairs
input_text = "Your input text here..."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
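
For passages longer than the model's input window, you may want to truncate when tokenizing (a minimal variant of the call above; truncation and max_length are standard tokenizer arguments, and 512 is an assumed limit you can adjust):

inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)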

Output Processing

The model generates output in the following format:

Question[QUESTION_END]Answer[ANSWER_END]Question[QUESTION_END]Answer[ANSWER_END]...

To parse this output into a structured format:

import re

def clean_text(text):
    # Collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

def parse_qa_pairs(input_text):
    # Splitting on the captured [ANSWER_END] marker leaves the
    # question-answer blocks at the even indices of the result.
    qa_blocks = re.split(r'(\[ANSWER_END\])', input_text)
    pairs = []
    for i in range(0, len(qa_blocks) - 1, 2):
        qa_block = qa_blocks[i]
        # Each block should contain exactly one [QUESTION_END] separator
        parts = qa_block.split('[QUESTION_END]')
        if len(parts) == 2:
            question, answer = map(clean_text, parts)
            if question and answer:
                pairs.append({"question": question, "answer": answer})
    return pairs

qa_pairs = parse_qa_pairs(decoded_output)
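
To inspect the result, the parsed pairs can be printed as JSON; this is a small usage example that produces output in the same shape as the example in the next section:

import json

print(json.dumps(qa_pairs, indent=4))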

Example

Input:

The ocean, covering over 70% of our planet's surface, is a vast and mysterious realm teeming with life and beauty. From the vibrant coral reefs that serve as bustling underwater cities to the deep, dark trenches that house some of the most bizarre creatures on Earth, the ocean is a treasure trove of biodiversity. It plays a crucial role in regulating the global climate, absorbing carbon dioxide and producing oxygen through its phytoplankton. Moreover, the ocean's depths remain largely unexplored, holding countless secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science. As we continue to learn more about this incredible ecosystem, it becomes increasingly clear that protecting our oceans is essential for the health of our planet and future generations.

Output:

[
    {
        "question": "What is the ocean's role in regulating the global climate?",
        "answer": "The ocean plays a crucial role in regulating the global climate by absorbing carbon dioxide and producing oxygen through its phytoplankton."
    },
    {
        "question": "What are some of the key discoveries that could revolutionize our understanding of the ocean?",
        "answer": "The ocean's depths remain largely unexplored, holding secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science."
    },
    {
        "question": "What is the significance of protecting our oceans for future generations?",
        "answer": "Protecting our oceans is essential for the health of our planet and future generations because it is a vital part of our ecosystem and a vital resource for our survival and well-being."
    }
]

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • Learning rate: 0.0003
  • Train batch size: 16
  • Eval batch size: 16
  • Seed: 42
  • Gradient accumulation steps: 2
  • Total train batch size: 32
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • LR scheduler type: linear
  • LR scheduler warmup steps: 500
  • Number of epochs: 10.0
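
These settings correspond roughly to the following Seq2SeqTrainingArguments. This is a hedged sketch only: the actual training script, dataset preprocessing, and Seq2SeqTrainer setup are not included in this card, and the output_dir is assumed.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="LaMini-Flan-T5-77M-qa-generation",  # assumed output directory
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,  # effective train batch size of 32
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10.0,
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Transformers defaults
)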

Training Results

Training Loss   Epoch    Step   Validation Loss
1.6321          1.2361    500   1.4333
1.5305          2.4722   1000   1.4013
1.4754          3.7083   1500   1.3719
1.4425          4.9444   2000   1.3693
1.3781          6.1805   2500   1.3647
1.3687          7.4166   3000   1.3572
1.3413          8.6527   3500   1.3596
1.3539          9.8888   4000   1.3594

Limitations

  • The model's performance may vary depending on the complexity and domain of the input text.
  • The quality of generated questions and answers can be inconsistent across different topics.
  • The model may occasionally generate irrelevant or repetitive question-answer pairs.

Framework Versions

  • Transformers 4.44.0
  • PyTorch 2.2.2+cu121
  • Datasets 2.18.0
  • Tokenizers 0.19.1