willwade committed on
Commit
da75cc0
1 Parent(s): cfe7f6c

Upload README.md

adding some more info

Files changed (1)
  1. README.md +34 -1
README.md CHANGED
@@ -8,6 +8,9 @@ tags:
8
  - AAC
9
  - assistive-technology
10
  - spoken
11
  ---
12
  # t5-small-spoken-typo
13
 
@@ -16,14 +19,44 @@ This model is a fine-tuned version of T5-small, adapted for correcting typograph
16
  ## Task
17
  The primary task of this model is **Text Correction**, with a focus on:
18
  - **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
19
- - **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats.
20
 
21
  This model is intended to support the processing of user-generated content where informal language, abbreviations, and typos are prevalent, improving text clarity for further processing or human reading.
22
 
23
  # Model Details
24
 
25
  ## Model Description
26
  The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, introduces and then corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
27
 
28
  ## Developed by:
29
  - **Name**: Will Wade
 
8
  - AAC
9
  - assistive-technology
10
  - spoken
11
+ datasets:
12
+ - jfleg
13
+ - daily_dialog
14
  ---
15
  # t5-small-spoken-typo
16
 
 
19
  ## Task
20
  The primary task of this model is **Text Correction**, with a focus on:
21
  - **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
22
+ - **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats, largely for sentences typed without spaces.
23
 
24
  This model is intended to support the processing of user-generated content where informal language, abbreviations, and typos are prevalent, improving text clarity for further processing or human reading.
25
 
26
+
27
+ ## Usage
28
+
29
+ ```python
30
+ from happytransformer import HappyTextToText, TTSettings
31
+
32
+ happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
33
+
34
+ args = TTSettings(num_beams=5, min_length=1)
35
+
36
+ # Add the prefix "grammar: " before each input
37
+ result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)
38
+
39
+ print(result.text)  # Prints the model's corrected version of the input sentence
40
+ ```
41
+
42
  # Model Details
43
 
44
  ## Model Description
45
  The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, introduces and then corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
46
+ It has been trained on:
47
+ - BNC 2014 Spoken
48
+ - [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
49
+
50
+ Typos were then injected from a range of sources:
51
+ - **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
52
+ - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
53
+ - **TOEFL Spell**: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
54
+ - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
55
+
56
+ Finally, compressed versions of the sentences were created (i.e. with spaces removed), for both the correct and the typo-injected variants.
57
+
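The data preparation described above (inject typos, then strip spaces) might look roughly like the following. This is a minimal sketch, not the actual training script: the `MISSPELLINGS` table and the function names `inject_typos`, `compress`, and `make_pairs` are hypothetical, and the table entries are illustrative stand-ins for a real resource such as the Birkbeck lists or TOEFL-Spell.

```python
import random

# Hypothetical misspelling table: correct word -> known misspellings.
# In practice this would be loaded from a typo resource; these entries
# are made up for illustration.
MISSPELLINGS = {
    "today": ["taday", "todya"],
    "doing": ["doign"],
    "because": ["becuase", "becouse"],
}


def inject_typos(sentence, table, rate, rng):
    """Swap each word found in `table` for a misspelling with probability `rate`.

    Words are matched case-insensitively; words with attached punctuation
    (e.g. "today?") are left untouched in this simple version.
    """
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in table and rng.random() < rate:
            out.append(rng.choice(table[key]))
        else:
            out.append(word)
    return " ".join(out)


def compress(sentence):
    """Remove all whitespace, simulating text typed without spaces."""
    return "".join(sentence.split())


def make_pairs(sentence, table, rate=0.5, seed=0):
    """Build (noisy input, clean target) training pairs for one sentence."""
    rng = random.Random(seed)
    typod = inject_typos(sentence, table, rate, rng)
    return [
        (compress(sentence), sentence),  # compressed, correctly spelled
        (compress(typod), sentence),     # compressed, with injected typos
    ]


if __name__ == "__main__":
    for noisy, clean in make_pairs("how are you doing today", MISSPELLINGS, rate=1.0):
        print(repr(noisy), "->", repr(clean))
```

Each clean sentence yields two inputs that map to the same target, so a model fine-tuned on such pairs sees both the spacing problem on its own and the spacing problem combined with spelling errors.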
58
+ Next, we would like to use the C4 200M dataset, or at least a subset of it.
59
+
60
 
61
  ## Developed by:
62
  - **Name**: Will Wade