license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - AAC
  - assistive-technology
  - spoken
datasets:
  - jfleg
  - daily_dialog
  - leslyarun/c4_200m_gec_train100k_test25k

t5-small-spoken-typo

This model is a fine-tuned version of T5-small, adapted for correcting typographical errors and missing spaces in text. It has been trained on a combination of spoken corpora, including DailyDialog and BNC, with a focus on short utterances common in conversational English.

Task

The primary task of this model is Text Correction, with a focus on:

  • Sentence Correction: Enhancing readability by correcting sentences with missing spaces or typographical errors.
  • Text Normalization: Standardizing text by converting informal or irregular forms into more grammatically correct formats, mostly for sentences written without spaces.

This model is intended to support processing of user-generated content where informal language, abbreviations, and typos are prevalent, improving text clarity for further processing or human reading.

Usage

from happytransformer import HappyTextToText, TTSettings

happy_tt = HappyTextToText("T5", "willwade/t5-small-spoken-typo")

args = TTSettings(num_beams=5, min_length=1)

# Add the prefix "grammar: " before each input 
result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)

print(result.text)  # Prints the corrected sentence

Or, using vanilla transformers:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model
model_name = "willwade/t5-small-spoken-typo"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prepare the input text with the "grammar: " prefix
input_text = "grammar: Hihowareyoudoingtaday?."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text
# Adjust num_beams and min_length to your needs
output = model.generate(input_ids, num_beams=5, min_length=1, max_new_tokens=50, early_stopping=True)

# Decode the generated text
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

print(decoded_output)

Model Details

Model Description

The t5-small-spoken-typo model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects missing spaces, removes unnecessary punctuation, fixes typos (which were deliberately introduced into the training data), and normalizes text by replacing informal contractions and abbreviations with their full forms. It has been trained on short utterances from the DailyDialog and BNC corpora.

Typos were then injected from a range of sources:

  • Using NLPAUG: We made some typos in Comm2 using this library: https://github.com/makcedward/nlpaug
  • Typo lists, Birkbeck, etc.: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
    • Find these resources here.
  • TOEFL Spell: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
  • Homonyms: We occasionally replace words in BNC and DailyDialog with homonyms from this list: https://github.com/pimentel/homophones/
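
The homonym step in the list above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the `HOMOPHONES` table is a hypothetical hand-written sample rather than the linked list, and `swap_homophones` and its `rate` parameter are names assumed for this sketch.

```python
import random

# Hypothetical mini table for illustration only; the real list is in the
# homophones repository linked above.
HOMOPHONES = {
    "there": ["their", "they're"],
    "to": ["too", "two"],
    "your": ["you're"],
    "here": ["hear"],
}

def swap_homophones(sentence, rate=0.3, rng=None):
    """Replace some words with a homophone, leaving other words untouched."""
    rng = rng or random.Random()
    out = []
    for word in sentence.split():
        options = HOMOPHONES.get(word.lower())
        # Only swap when the word has known homophones and the coin flip hits.
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

noisy = swap_homophones("is your bag over there", rate=1.0, rng=random.Random(0))
print(noisy)
```

A low `rate` keeps most sentences unchanged, which matches the "occasionally" in the description; the sketch ignores capitalization and attached punctuation for brevity.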

We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to the dataset. This addresses the case where some people write without spaces. Note that we use a "grammar: " prefix for each sentence in training.
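
As a rough sketch of the compression and prefixing step, the snippet below builds (input, target) pairs for a single clean sentence. The function names `compress` and `build_pairs` are hypothetical, and keeping the uncompressed variants as inputs alongside the compressed ones is an assumption of this sketch, not something the card confirms.

```python
PREFIX = "grammar: "

def compress(sentence):
    """Remove all whitespace, mimicking text typed without spaces."""
    return "".join(sentence.split())

def build_pairs(clean, noisy_variants=()):
    """Return (input, target) pairs for one clean sentence.

    Each variant (the clean sentence plus any typo'd versions) is added
    both as-is and compressed; the target is always the clean sentence.
    Assumption: uncompressed inputs are kept too.
    """
    pairs = []
    for variant in (clean, *noisy_variants):
        for source in (variant, compress(variant)):
            pairs.append((PREFIX + source, clean))
    return pairs

pairs = build_pairs("How are you doing today?", ["How are you doing taday?"])
for source, target in pairs:
    print(repr(source), "->", repr(target))
```

Because every input carries the same prefix used in training, inference inputs should be prefixed the same way, as in the Usage examples.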

Full script to build the dataset is here

Developed by:

  • Name: Will Wade
  • Affiliation: Research & Innovation Manager, Occupational Therapist, Ace Centre, UK
  • Contact Info: wwade@acecentre.org.uk

Model type:

  • Language model fine-tuned for text correction tasks.

Language(s) (NLP):

  • English (en)

License:

  • apache-2.0

Parent Model:

  • The model is fine-tuned from t5-small.

Resources for more information:

Uses

Direct Use

This model can be directly applied for correcting text in various applications, including but not limited to, enhancing the quality of user-generated content, preprocessing text for NLP tasks, and supporting assistive technologies.

Out-of-Scope Use

The model might not perform well on text significantly longer than the training examples (2-5 words), highly formal documents, or languages other than English. Use in sensitive contexts should be approached with caution due to potential biases. Our typical use case is AAC users, i.e. people who use technology to communicate face to face with others.

Bias, Risks, and Limitations

The model may inherit biases present in its training data, potentially reflecting or amplifying societal stereotypes. Given its training on conversational English, it may not generalize well to formal text or other dialects and languages.

Recommendations

Users are encouraged to critically assess the model's output, especially when used in sensitive or impactful contexts. Further fine-tuning with diverse and representative datasets could mitigate some limitations.

Training Details

Training Data

The model was trained on a curated subset of the DailyDialog and BNC corpora (2014 spoken), focusing on sentences 2-5 words in length, with typos introduced and spaces removed for robustness in text correction tasks. You can see the code to pre-process this here
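
The 2-5 word length filter could look something like the sketch below. It assumes words are counted by splitting on whitespace; the actual pre-processing code may count differently, and `is_short_utterance` is a name made up for this example.

```python
def is_short_utterance(sentence, min_words=2, max_words=5):
    """True when the sentence has between min_words and max_words words."""
    return min_words <= len(sentence.split()) <= max_words

corpus = [
    "Hello",
    "How are you?",
    "See you at the station",
    "I was wondering whether you could help me move house",
]
# Keep only utterances in the 2-5 word range described above.
short = [s for s in corpus if is_short_utterance(s)]
print(short)
```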

Training Procedure

Preprocessing

Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.
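
For illustration, the punctuation stripping and programmatic typo injection might be approximated as below. This is a simplified stand-in using only the standard library, not the nlpaug-based pipeline itself; `strip_punctuation`, `inject_typo`, and the three edit types are assumptions chosen for the sketch.

```python
import random

def strip_punctuation(sentence):
    """Drop apostrophes and commas, as described in the preprocessing step."""
    return sentence.replace("'", "").replace(",", "")

def inject_typo(word, rng):
    """Apply one random edit: swap neighbours, drop a letter, or double one."""
    if len(word) < 3:
        return word  # too short to edit without destroying the word
    i = rng.randrange(len(word) - 1)
    edit = rng.choice(["swap", "drop", "double"])
    if edit == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if edit == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # duplicate the letter at position i

rng = random.Random(42)
cleaned = strip_punctuation("Well, I don't know")
noisy = " ".join(inject_typo(w, rng) for w in cleaned.split())
print(cleaned)
print(noisy)
```

Seeding the random generator keeps the generated typos reproducible between dataset builds.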

Speeds, Sizes, Times

  • Training was conducted on Lambda Labs, taking approximately 4 hours to complete.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The evaluation was performed on a held-out test set derived from the same corpora and similar sentences, ensuring a diverse range of sentence structures and error types were represented.

Results

The model demonstrates high efficacy in correcting short, erroneous sentences, with particular strength in handling real-world, conversational text.

It performs nearly on par with GPTTurbo16k, at around 93% sentence similarity, but there are gaps.

Take, for example, the output below, where I've bolded the parts that I feel are incorrect.

Original: Didyoucatchthegamelastnight? Corrected: Did you catch the game last night?

Original: Wannagrabcoffeetomorrow? Corrected: Wanna grab coffee tomorrow?

Original: ImdyingsomeonecancellsoIcandogsitter! Corrected: I'm dying someone cancell so I can dogsitter!

Original: Hahahahahahahathats hilarious! Corrected: Haha ha ha ha that's hilarious!

Original: OMGyouneedtoseethelatestmeme! Corrected: OMG you need to see the latest meme!

Original: Seriouslythisweatherissocrazy! Corrected: Seriously this weather is so crazy!

Original: Whatchauptomefriend? Corrected: What's his friend?

Original: Feelingburntoutaftettodayhelp! Corrected: Feeling burnt out today help!

Original: Guesswhosingleagain! Corrected: Guess who single again!

Original: Youwontyoubelievewhatjusthappened! Corrected: You want you believe what just happened!

Original: Moviemarathonatmyplacethisweekend? Corrected: Movie Marathon at my place this weekend?

Original: Needstudymotivationanyideas? Corrected: Need study motivation any ideas?

Original: Sostressedaboutthispresentation! Corrected: So stressed about this presentation!

Original: Finallyfinishedthatbookyourecommended! Corrected: Finally finished that book you're recommended!

Original: Anygoodshowsbingeablelately? Corrected: Any good shows biteable recently?
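
The card does not state how sentence similarity was computed. As a rough stand-in only, the sketch below scores a few of the outputs above against reference sentences using the character-level ratio from Python's `difflib`. Both the metric and the reference sentences are assumptions made for this example.

```python
from difflib import SequenceMatcher

def similarity(predicted, reference):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, predicted, reference).ratio()

# Model outputs are taken from the examples above; the references are
# hypothetical intended sentences written for this sketch.
examples = [
    ("Did you catch the game last night?", "Did you catch the game last night?"),
    ("Guess who single again!", "Guess who's single again!"),
    ("What's his friend?", "Whatcha up to me friend?"),
]
for predicted, reference in examples:
    print(f"{similarity(predicted, reference):.2f}  {predicted}")
```

A character-level ratio rewards near misses such as a dropped "'s" and penalizes outputs that change the meaning, but an embedding-based score would likely behave differently.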

We hope to build on this in time by further fine-tuning on a real corpus from individuals using AAC.

Evaluation result: EvalResult(loss=0.8066404461860657)

Technical Specifications

Model Architecture and Objective

The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.

Compute Infrastructure

  • Hardware: T4 GPU (Google Colab)
  • Software: PyTorch 1.8.1 with Transformers 4.8.2

Citation

BibTeX:

@misc{t5_small_spoken_typo_2021,
  title={T5-small Spoken Typo Corrector},
  author={Will Wade},
  year={2021},
  howpublished={\url{https://huggingface.co/willwade/t5-small-spoken-typo}},
}