willwade/t5-small-spoken-typo

@@ -8,6 +8,9 @@ tags:
 - AAC
 - assistive-technology
 - spoken
 ---
 # t5-small-spoken-typo
@@ -16,14 +19,44 @@ This model is a fine-tuned version of T5-small, adapted for correcting typograph
 ## Task
 The primary task of this model is **Text Correction**, with a focus on:
 - **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
-- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats.
 This model is intended to support processing of user-generated content where informal language, abbreviations, and typos are prevalent, with the goal of improving text clarity for further processing or human reading.
 # Model Details
 ## Model Description
 The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, introduces and then corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
 ## Developed by:
 - **Name**: Will Wade

 - AAC
 - assistive-technology
 - spoken
+datasets:
+  - jfleg
+  - daily_dialog
 ---
 # t5-small-spoken-typo
 ## Task
 The primary task of this model is **Text Correction**, with a focus on:
 - **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
+- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats, largely for sentences written with no spaces.
 This model is intended to support processing of user-generated content where informal language, abbreviations, and typos are prevalent, with the goal of improving text clarity for further processing or human reading.
+## Usage
+```python
+from happytransformer import HappyTextToText, TTSettings
+# Load this model from the Hugging Face Hub
+happy_tt = HappyTextToText("T5", "willwade/t5-small-spoken-typo")
+# Beam search with 5 beams; allow very short outputs
+args = TTSettings(num_beams=5, min_length=1)
+# Add the prefix "grammar: " before each input
+result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)
+print(result.text)  # the corrected sentence
+```
 # Model Details
 ## Model Description
 The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, introduces and then corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
+It has been trained on:
+- BNC 2014 Spoken
+- [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
+Typos were then injected from a range of sources:
+- **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
+  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
+- **TOEFL Spell**: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
+  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master).
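The typo-injection step can be illustrated with a minimal sketch. It assumes a typo list has already been parsed into a dictionary mapping each correct word to its known misspellings. The tiny `TYPO_LIST`, the `inject_typos` helper, and the `rate` parameter are hypothetical names for illustration, not the script actually used to build the training data.

```python
import random

# Hypothetical miniature typo list: correct word -> observed misspellings.
# A real list (for example one parsed from the Birkbeck corpora) would be far larger.
TYPO_LIST = {
    "doing": ["doign", "dooing"],
    "today": ["taday", "todya"],
    "you": ["yuo"],
}

def inject_typos(sentence, typo_list, rate=0.5, rng=None):
    """Replace some known words with one of their misspellings, chosen at random."""
    rng = rng or random.Random()
    out = []
    for word in sentence.split():
        key = word.lower()
        # Only words present in the typo list can be corrupted.
        if key in typo_list and rng.random() < rate:
            out.append(rng.choice(typo_list[key]))
        else:
            out.append(word)
    return " ".join(out)

# With rate=1.0 every word found in the list is replaced.
print(inject_typos("how are you doing today", TYPO_LIST, rate=1.0, rng=random.Random(0)))
```

This sketch matches whole whitespace-separated tokens only, so a word with attached punctuation (such as `today?`) is left unchanged; a fuller version would need to strip and restore punctuation.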
+Finally, compressed versions of the sentences (i.e. with spaces removed) were added, for both the correct and the typo-injected text.
+Next, we would like to train on the C4 200M dataset, or at least a subset of it.
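The compression step described above can be sketched as follows. This is an illustrative assumption about how input/target pairs might be assembled, not the author's actual pipeline: `compress` and `make_pairs` are hypothetical helpers, and the `grammar: ` prefix is borrowed from the usage snippet (whether the training data used the same prefix is assumed here).

```python
def compress(sentence):
    """Remove all whitespace, mimicking text typed without spaces."""
    return "".join(sentence.split())

def make_pairs(clean, typod, prefix="grammar: "):
    """Build (input, target) pairs from compressed clean and typo-injected text.

    The target is always the clean, correctly spaced sentence.
    """
    pairs = []
    for source in (compress(clean), compress(typod)):
        pairs.append((prefix + source, clean))
    return pairs

for source, target in make_pairs("how are you doing today", "how are yuo doing taday"):
    print(source, "->", target)
```

Each clean sentence therefore yields two training inputs: one that only needs spaces restored, and one that needs both spacing and spelling corrected.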
 ## Developed by:
 - **Name**: Will Wade