flan-t5-base-paragrapher

This model preprocesses, cleans, and reformats text chunks containing line breaks, word breaks, and references into coherent plain-text paragraphs. The resulting paragraphs can then be passed to other models such as agentlans/flan-t5-small-title and agentlans/text-summarization.

Model description

The flan-t5-base-paragrapher is a fine-tuned version of google/flan-t5-base, trained on a dataset of open-source introductory social science textbooks. Although it was trained only on these textbooks, it should also work well with other kinds of educational and academic content.

The model achieves the following results on the evaluation set:

Loss: 1.5175
Number of Input Tokens Seen: 49,815,380
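
As a rough aid to interpretation, the loss can be converted to a perplexity. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for sequence-to-sequence models in Transformers but is not stated in this card.

```python
import math

eval_loss = 1.5175  # evaluation loss reported above

# Assumption: the loss is mean per-token cross-entropy in nats,
# in which case perplexity is simply exp(loss).
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

Under that assumption the perplexity comes out at roughly 4.56.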

Intended uses & limitations

This model is intended for preprocessing and reformatting text chunks into coherent paragraphs. It can be particularly useful for:

Cleaning up text extracted from PDFs or OCR systems
Reformatting text with irregular line breaks or word breaks
Preparing text for further processing or analysis

Limitations:

The model may not perform optimally on highly specialized or technical texts outside its training domain.
Input sequences longer than the model's maximum sequence length (512 tokens) will be truncated, so text beyond that point is dropped.
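
One way to work around the 512-token limit is to split long text into smaller chunks before passing them to the model. The following is a minimal sketch, not part of the model's own tooling: the function name split_into_chunks and the count_tokens parameter are hypothetical, and the default counter simply counts whitespace-separated words, which undercounts real subword tokens.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack words into chunks of at most max_tokens tokens.

    Original whitespace (including line breaks) is preserved inside chunks.
    A single word that exceeds the budget on its own still becomes one chunk.
    """
    # Each piece is one word together with the whitespace that follows it.
    pieces = re.findall(r"\S+\s*", text)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and count_tokens(current + piece) > max_tokens:
            chunks.append(current)  # budget reached: close the current chunk
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


# Tiny demonstration with an artificially small budget of 2 "tokens".
demo_chunks = split_into_chunks("one two\nthree four five", max_tokens=2)
print(demo_chunks)
```

For a budget that reflects the model's real limit, a tokenizer-based counter such as count_tokens=lambda s: len(tokenizer(s).input_ids) can be passed instead, where tokenizer is the one loaded in the usage example. Chunks cut at arbitrary word boundaries may start or end mid-sentence, so some overlap or sentence-aware splitting may give better results.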

Training and evaluation data

The model was trained on a dataset compiled from open-source textbooks. Due to licensing constraints, the specific training data is not published.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 5e-05
Train batch size: 8
Eval batch size: 8
Seed: 42
Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
LR scheduler type: linear
Number of epochs: 10.0
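
To illustrate what the linear scheduler does to the learning rate, here is a minimal sketch. It assumes no warmup steps (none are listed above), and total_steps is an illustrative round figure rather than a value stated in this card. The helper linear_lr is hypothetical and only mirrors the general shape of a linear decay schedule.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05) -> float:
    """Learning rate after `step` optimizer steps under linear decay, no warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Assumption: an illustrative total number of optimizer steps.
total_steps = 44_000

for step in (0, 22_000, 44_000):
    print(step, linear_lr(step, total_steps))
```

The rate starts at the configured 5e-05, is halved at the midpoint, and reaches zero at the final step.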

Training results

Training Loss	Epoch	Step	Validation Loss	Input Tokens Seen
2.0748	0.1126	500	1.7587	562752
1.9699	0.2251	1000	1.7031	1119424
1.9177	0.3377	1500	1.6701	1676620
1.9179	0.4502	2000	1.6647	2244928
1.8908	0.5628	2500	1.6502	2806840
1.8666	0.6754	3000	1.6427	3364792
1.8456	0.7879	3500	1.6245	3925172
1.8542	0.9005	4000	1.6218	4490100
1.8305	1.0131	4500	1.6211	5052066
1.7588	1.1256	5000	1.6040	5607258
1.7606	1.2382	5500	1.6020	6165278
1.7426	1.3507	6000	1.5993	6727290
1.7477	1.4633	6500	1.5869	7292338
1.7413	1.5759	7000	1.5791	7849466
1.7342	1.6884	7500	1.5792	8415302
1.7247	1.8010	8000	1.5759	8970490
1.7423	1.9136	8500	1.5744	9529290
1.7138	2.0261	9000	1.5655	10091652
1.6719	2.1387	9500	1.5630	10650544
1.6637	2.2512	10000	1.5584	11208648
1.6415	2.3638	10500	1.5609	11776396
1.6565	2.4764	11000	1.5558	12338500
1.6597	2.5889	11500	1.5530	12897552
1.6709	2.7015	12000	1.5477	13460052
1.648	2.8140	12500	1.5424	14021984
1.642	2.9266	13000	1.5433	14586256
1.6258	3.0392	13500	1.5419	15140609
1.6067	3.1517	14000	1.5415	15700397
1.5946	3.2643	14500	1.5450	16265849
1.5835	3.3769	15000	1.5415	16827557
1.5996	3.4894	15500	1.5411	17384857
1.5834	3.6020	16000	1.5382	17945909
1.5956	3.7145	16500	1.5351	18507721
1.5825	3.8271	17000	1.5356	19069425
1.6001	3.9397	17500	1.5294	19631905
1.5677	4.0522	18000	1.5369	20185192
1.5415	4.1648	18500	1.5318	20739888
1.5362	4.2774	19000	1.5311	21304584
1.5251	4.3899	19500	1.5323	21862856
1.5388	4.5025	20000	1.5307	22427236
1.5508	4.6150	20500	1.5282	22985184
1.5692	4.7276	21000	1.5265	23548396
1.5391	4.8402	21500	1.5276	24111452
1.5431	4.9527	22000	1.5270	24673344
1.5147	5.0653	22500	1.5292	25236559
1.4908	5.1778	23000	1.5288	25799675
1.5153	5.2904	23500	1.5288	26352767
1.5099	5.4030	24000	1.5250	26916707
1.5064	5.5155	24500	1.5259	27483639
1.5146	5.6281	25000	1.5249	28040307
1.4938	5.7407	25500	1.5233	28600639
1.5034	5.8532	26000	1.5237	29164539
1.5091	5.9658	26500	1.5219	29730199
1.4853	6.0783	27000	1.5241	30286010
1.4797	6.1909	27500	1.5201	30840802
1.466	6.3035	28000	1.5238	31403710
1.4666	6.4160	28500	1.5226	31962730
1.4732	6.5286	29000	1.5199	32518854
1.4756	6.6412	29500	1.5219	33083634
1.4778	6.7537	30000	1.5195	33644482
1.4674	6.8663	30500	1.5182	34207738
1.4813	6.9788	31000	1.5202	34772050
1.4543	7.0914	31500	1.5211	35331657
1.4389	7.2040	32000	1.5221	35888749
1.4534	7.3165	32500	1.5215	36455101
1.4401	7.4291	33000	1.5208	37016889
1.4435	7.5416	33500	1.5212	37570517
1.4443	7.6542	34000	1.5205	38134577
1.4533	7.7668	34500	1.5209	38700917
1.4589	7.8793	35000	1.5218	39259257
1.4548	7.9919	35500	1.5185	39819093
1.4322	8.1045	36000	1.5207	40382907
1.4271	8.2170	36500	1.5220	40938983
1.4165	8.3296	37000	1.5203	41498811
1.4273	8.4421	37500	1.5197	42053427
1.4281	8.5547	38000	1.5195	42615135
1.4372	8.6673	38500	1.5197	43173055
1.4374	8.7798	39000	1.5175	43737723
1.4278	8.8924	39500	1.5211	44300547
1.442	9.0050	40000	1.5189	44864787
1.4235	9.1175	40500	1.5226	45418155
1.413	9.2301	41000	1.5220	45985195
1.4193	9.3426	41500	1.5201	46538675
1.414	9.4552	42000	1.5202	47101815
1.4084	9.5678	42500	1.5191	47655583
1.408	9.6803	43000	1.5207	48217371
1.4207	9.7929	43500	1.5200	48781351
1.4293	9.9054	44000	1.5198	49345155
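
The evaluation loss reported near the top of this card (1.5175) matches the lowest validation loss in the table, at step 39000, rather than the last logged row. That suggests the best checkpoint was kept, although the card does not say so. A small sketch of picking the best row, using a handful of (step, validation loss) pairs copied from the table:

```python
# A few rows copied from the training results table: (step, validation loss).
rows = [
    (500, 1.7587),
    (22000, 1.5270),
    (30500, 1.5182),
    (39000, 1.5175),
    (44000, 1.5198),
]

# Select the row with the lowest validation loss.
best_step, best_loss = min(rows, key=lambda row: row[1])
print(f"Best step: {best_step}, validation loss: {best_loss}")
```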

Framework versions

Transformers 4.44.2
PyTorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.19.1

Usage

Here's an example of how to use the model:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model (device_map="auto" requires the accelerate package)
tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")
model = T5ForConditionalGeneration.from_pretrained(
    "agentlans/flan-t5-base-paragrapher", device_map="auto"
)

# Define input texts
# Note: These aren't real citations. They are for demonstration purposes only.
input_texts = [
    """ge with a narrative—whether through books, films, or oral traditions—we are invited into another person's experience (Brown & Thompson, 2023). This immersion allows us to see the world through different perspectives, breaking down barriers of misunderstanding and prejudice. For example, novels like Harper Lee's "To Kill a Mockingbird" challenge readers to confront issues of racism and injustice through the eyes of a child (Williams, 2018). Similarly, contemporary works such as Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world (Nguyen & Roberts, 2020). By sharing these experiences through storytelling, authors can cultivate empathy in their audiences, encouraging them to reflect on their own beliefs and biases.
    Shaping Identity Through Narratives
    Stories also play a crucial role in shaping personal and collective identities. From childhood tales told by parents to the myths and legends that define cultural heritage, narratives help individuals understand their place in the world (Anderson & White, 2021). They provide frameworks thro""",
    """cia, M., & Patel, R. (2022). Cultural insights through literature: A comparative analysis. International Journal of Cultural Studies, 15(3), 201-215. Johnson, L., & Lee, H. (2019). Oral traditions: Preserving culture through storytelling. Anthropology Today Journal, 34(4), 56-60. Kumar, P. (2021). Epic tales: Literature as a reflection of society. Literary Critique Review, 29(1), 34-50. Lee, J., & Martinez, F. (2021). Voices unheard: Marginalized narratives in digital spaces. Journal of Digital Culture Studies, 7(2), 45-67. Martinez, C., & Chen, Y. (2022). Cultural navigation: Identity in a globalized world. Global Studies Review Jou""",
]

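# Tokenize input texts and move the tensors to the same device as the model
inputs = tokenizer(
    input_texts, return_tensors="pt", padding=True, truncation=True
).to(model.device)

# Generate outputs; passing the attention mask lets the model ignore padding tokens
outputs = model.generate(**inputs, max_length=512)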

# Print generated outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True) + "\n")

Example output:

Through storytelling, we are invited into another person's experience, breaking down barriers of misunderstanding and prejudice. This immersion allows us to see the world through different perspectives, fostering empathy and re-evaluating our own beliefs and biases. For instance, Harper Lee's "To Kill a Mockingbird" challenges readers to confront issues of racism and injustice through the eyes of a child, while contemporary works like Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world. By sharing these experiences through storytelling, authors

The study of cultural insights through literature has yielded valuable insights into the world. Ci and Patel (2022) conducted a comparative analysis of cultural insights through literature, highlighting the importance of cultural storytelling in preserving culture. Kumar (2021) argued that oral traditions can preserve culture through storytelling, highlighting the importance of storytelling in preserving culture. Lee and Martinez (2021) explored marginalized narratives in digital spaces, highlighting the need for cultural navigation in a globalized world. These studies collectively demonstrate the importance of cultural navigation in fostering identity and identity in a globalized world.
