# flan-t5-base-paragrapher
This model preprocesses, cleans, and reformats text chunks that contain stray line breaks, hyphenated word breaks, and inline references into coherent plain-text paragraphs. The resulting paragraphs can then be fed to other models such as agentlans/flan-t5-small-title and agentlans/text-summarization.
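For instance, the cleaned paragraphs can be chained into those downstream models. The sketch below assumes both downstream checkpoints are standard seq2seq models loadable with `AutoModelForSeq2SeqLM`; their exact prompt formats are not documented here, so treat this as an illustration rather than a verified recipe:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def run_seq2seq(model_name: str, text: str) -> str:
    """Run a seq2seq checkpoint on a single input string."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=512)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# A messy chunk with hard line breaks and a hyphenated word break
raw_chunk = "Stories play a cru-\ncial role in shaping personal\nand collective identities."

paragraph = run_seq2seq("agentlans/flan-t5-base-paragrapher", raw_chunk)
title = run_seq2seq("agentlans/flan-t5-small-title", paragraph)    # title generation
summary = run_seq2seq("agentlans/text-summarization", paragraph)   # summarization
```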
## Model description
flan-t5-base-paragrapher is a fine-tuned version of google/flan-t5-base, trained on a dataset of open-source introductory social science textbooks. Although it was trained on academic texts, it should also handle other educational and similarly structured content well.
The model achieves the following results on the evaluation set:
- Loss: 1.5175
- Number of input tokens seen: 49,815,380
## Intended uses & limitations
This model is intended for preprocessing and reformatting text chunks into coherent paragraphs. It can be particularly useful for:
- Cleaning up text extracted from PDFs or OCR systems
- Reformatting text with irregular line breaks or word breaks
- Preparing text for further processing or analysis
Limitations:
- The model may not perform optimally on highly specialized or technical texts outside its training domain.
- Very long input sequences may be truncated, since the model's maximum sequence length is 512 tokens (a simple chunking workaround is sketched below).
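One way to work around the 512-token limit is to split long documents into token-bounded windows and run the model on each piece. Below is a minimal sketch using fixed-size windows; a real pipeline would more likely split on sentence or paragraph boundaries, and the `max_tokens=480` margin is an arbitrary choice to leave room for special tokens:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")


def chunk_text(text: str, max_tokens: int = 480) -> list[str]:
    """Split text into windows that fit the model's 512-token input limit."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
```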
## Training and evaluation data
The model was trained on a dataset compiled from open-source textbooks. Due to licensing constraints, the specific training data is not published.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- LR scheduler type: linear
- Number of epochs: 10.0
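For reference, these settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a reconstruction from the list above, not the author's actual training script, and the `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-base-paragrapher",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10.0,
)
```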
### Training results
<details>
<summary>Click to expand training results</summary>

| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|---|---|---|---|---|
2.0748 | 0.1126 | 500 | 1.7587 | 562752 |
1.9699 | 0.2251 | 1000 | 1.7031 | 1119424 |
1.9177 | 0.3377 | 1500 | 1.6701 | 1676620 |
1.9179 | 0.4502 | 2000 | 1.6647 | 2244928 |
1.8908 | 0.5628 | 2500 | 1.6502 | 2806840 |
1.8666 | 0.6754 | 3000 | 1.6427 | 3364792 |
1.8456 | 0.7879 | 3500 | 1.6245 | 3925172 |
1.8542 | 0.9005 | 4000 | 1.6218 | 4490100 |
1.8305 | 1.0131 | 4500 | 1.6211 | 5052066 |
1.7588 | 1.1256 | 5000 | 1.6040 | 5607258 |
1.7606 | 1.2382 | 5500 | 1.6020 | 6165278 |
1.7426 | 1.3507 | 6000 | 1.5993 | 6727290 |
1.7477 | 1.4633 | 6500 | 1.5869 | 7292338 |
1.7413 | 1.5759 | 7000 | 1.5791 | 7849466 |
1.7342 | 1.6884 | 7500 | 1.5792 | 8415302 |
1.7247 | 1.8010 | 8000 | 1.5759 | 8970490 |
1.7423 | 1.9136 | 8500 | 1.5744 | 9529290 |
1.7138 | 2.0261 | 9000 | 1.5655 | 10091652 |
1.6719 | 2.1387 | 9500 | 1.5630 | 10650544 |
1.6637 | 2.2512 | 10000 | 1.5584 | 11208648 |
1.6415 | 2.3638 | 10500 | 1.5609 | 11776396 |
1.6565 | 2.4764 | 11000 | 1.5558 | 12338500 |
1.6597 | 2.5889 | 11500 | 1.5530 | 12897552 |
1.6709 | 2.7015 | 12000 | 1.5477 | 13460052 |
1.648 | 2.8140 | 12500 | 1.5424 | 14021984 |
1.642 | 2.9266 | 13000 | 1.5433 | 14586256 |
1.6258 | 3.0392 | 13500 | 1.5419 | 15140609 |
1.6067 | 3.1517 | 14000 | 1.5415 | 15700397 |
1.5946 | 3.2643 | 14500 | 1.5450 | 16265849 |
1.5835 | 3.3769 | 15000 | 1.5415 | 16827557 |
1.5996 | 3.4894 | 15500 | 1.5411 | 17384857 |
1.5834 | 3.6020 | 16000 | 1.5382 | 17945909 |
1.5956 | 3.7145 | 16500 | 1.5351 | 18507721 |
1.5825 | 3.8271 | 17000 | 1.5356 | 19069425 |
1.6001 | 3.9397 | 17500 | 1.5294 | 19631905 |
1.5677 | 4.0522 | 18000 | 1.5369 | 20185192 |
1.5415 | 4.1648 | 18500 | 1.5318 | 20739888 |
1.5362 | 4.2774 | 19000 | 1.5311 | 21304584 |
1.5251 | 4.3899 | 19500 | 1.5323 | 21862856 |
1.5388 | 4.5025 | 20000 | 1.5307 | 22427236 |
1.5508 | 4.6150 | 20500 | 1.5282 | 22985184 |
1.5692 | 4.7276 | 21000 | 1.5265 | 23548396 |
1.5391 | 4.8402 | 21500 | 1.5276 | 24111452 |
1.5431 | 4.9527 | 22000 | 1.5270 | 24673344 |
1.5147 | 5.0653 | 22500 | 1.5292 | 25236559 |
1.4908 | 5.1778 | 23000 | 1.5288 | 25799675 |
1.5153 | 5.2904 | 23500 | 1.5288 | 26352767 |
1.5099 | 5.4030 | 24000 | 1.5250 | 26916707 |
1.5064 | 5.5155 | 24500 | 1.5259 | 27483639 |
1.5146 | 5.6281 | 25000 | 1.5249 | 28040307 |
1.4938 | 5.7407 | 25500 | 1.5233 | 28600639 |
1.5034 | 5.8532 | 26000 | 1.5237 | 29164539 |
1.5091 | 5.9658 | 26500 | 1.5219 | 29730199 |
1.4853 | 6.0783 | 27000 | 1.5241 | 30286010 |
1.4797 | 6.1909 | 27500 | 1.5201 | 30840802 |
1.466 | 6.3035 | 28000 | 1.5238 | 31403710 |
1.4666 | 6.4160 | 28500 | 1.5226 | 31962730 |
1.4732 | 6.5286 | 29000 | 1.5199 | 32518854 |
1.4756 | 6.6412 | 29500 | 1.5219 | 33083634 |
1.4778 | 6.7537 | 30000 | 1.5195 | 33644482 |
1.4674 | 6.8663 | 30500 | 1.5182 | 34207738 |
1.4813 | 6.9788 | 31000 | 1.5202 | 34772050 |
1.4543 | 7.0914 | 31500 | 1.5211 | 35331657 |
1.4389 | 7.2040 | 32000 | 1.5221 | 35888749 |
1.4534 | 7.3165 | 32500 | 1.5215 | 36455101 |
1.4401 | 7.4291 | 33000 | 1.5208 | 37016889 |
1.4435 | 7.5416 | 33500 | 1.5212 | 37570517 |
1.4443 | 7.6542 | 34000 | 1.5205 | 38134577 |
1.4533 | 7.7668 | 34500 | 1.5209 | 38700917 |
1.4589 | 7.8793 | 35000 | 1.5218 | 39259257 |
1.4548 | 7.9919 | 35500 | 1.5185 | 39819093 |
1.4322 | 8.1045 | 36000 | 1.5207 | 40382907 |
1.4271 | 8.2170 | 36500 | 1.5220 | 40938983 |
1.4165 | 8.3296 | 37000 | 1.5203 | 41498811 |
1.4273 | 8.4421 | 37500 | 1.5197 | 42053427 |
1.4281 | 8.5547 | 38000 | 1.5195 | 42615135 |
1.4372 | 8.6673 | 38500 | 1.5197 | 43173055 |
1.4374 | 8.7798 | 39000 | 1.5175 | 43737723 |
1.4278 | 8.8924 | 39500 | 1.5211 | 44300547 |
1.442 | 9.0050 | 40000 | 1.5189 | 44864787 |
1.4235 | 9.1175 | 40500 | 1.5226 | 45418155 |
1.413 | 9.2301 | 41000 | 1.5220 | 45985195 |
1.4193 | 9.3426 | 41500 | 1.5201 | 46538675 |
1.414 | 9.4552 | 42000 | 1.5202 | 47101815 |
1.4084 | 9.5678 | 42500 | 1.5191 | 47655583 |
1.408 | 9.6803 | 43000 | 1.5207 | 48217371 |
1.4207 | 9.7929 | 43500 | 1.5200 | 48781351 |
1.4293 | 9.9054 | 44000 | 1.5198 | 49345155

</details>
### Framework versions
- Transformers 4.44.2
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.19.1
## Usage
Here's an example of how to use the model:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model (device_map="auto" places the model on GPU if available)
tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")
model = T5ForConditionalGeneration.from_pretrained(
    "agentlans/flan-t5-base-paragrapher", device_map="auto"
)

# Define the input texts
# Note: these aren't real citations; they are for demonstration purposes only.
input_texts = [
    """ge with a narrative—whether through books, films, or oral traditions—we are invited into another person's experience (Brown & Thompson, 2023). This immersion allows us to see the world through different perspectives, breaking down barriers of misunderstanding and prejudice. For example, novels like Harper Lee's "To Kill a Mockingbird" challenge readers to confront issues of racism and injustice through the eyes of a child (Williams, 2018). Similarly, contemporary works such as Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world (Nguyen & Roberts, 2020). By sharing these experiences through storytelling, authors can cultivate empathy in their audiences, encouraging them to reflect on their own beliefs and biases.
Shaping Identity Through Narratives
Stories also play a crucial role in shaping personal and collective identities. From childhood tales told by parents to the myths and legends that define cultural heritage, narratives help individuals understand their place in the world (Anderson & White, 2021). They provide frameworks thro""",
    """cia, M., & Patel, R. (2022). Cultural insights through literature: A comparative analysis. International Journal of Cultural Studies, 15(3), 201-215. Johnson, L., & Lee, H. (2019). Oral traditions: Preserving culture through storytelling. Anthropology Today Journal, 34(4), 56-60. Kumar, P. (2021). Epic tales: Literature as a reflection of society. Literary Critique Review, 29(1), 34-50. Lee, J., & Martinez, F. (2021). Voices unheard: Marginalized narratives in digital spaces. Journal of Digital Culture Studies, 7(2), 45-67. Martinez, C., & Chen, Y. (2022). Cultural navigation: Identity in a globalized world. Global Studies Review Jou""",
]

# Tokenize the input texts and move them to the model's device
input_ids = tokenizer(
    input_texts, return_tensors="pt", padding=True, truncation=True
).input_ids.to(model.device)

# Generate the reformatted paragraphs
outputs = model.generate(input_ids, max_length=512)

# Decode and print each output
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True) + "\n")
```
Example output (note that the first output is cut off mid-sentence, presumably because generation hit the 512-token limit):
Through storytelling, we are invited into another person's experience, breaking down barriers of misunderstanding and prejudice. This immersion allows us to see the world through different perspectives, fostering empathy and re-evaluating our own beliefs and biases. For instance, Harper Lee's "To Kill a Mockingbird" challenges readers to confront issues of racism and injustice through the eyes of a child, while contemporary works like Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world. By sharing these experiences through storytelling, authors
The study of cultural insights through literature has yielded valuable insights into the world. Ci and Patel (2022) conducted a comparative analysis of cultural insights through literature, highlighting the importance of cultural storytelling in preserving culture. Kumar (2021) argued that oral traditions can preserve culture through storytelling, highlighting the importance of storytelling in preserving culture. Lee and Martinez (2021) explored marginalized narratives in digital spaces, highlighting the need for cultural navigation in a globalized world. These studies collectively demonstrate the importance of cultural navigation in fostering identity and identity in a globalized world.