willwade committed
Commit 3eab091
Parent: a30282f

updating with new datasets

Files changed (1): README.md (+62 -11)
README.md CHANGED
@@ -45,23 +45,27 @@ print(result.text) # This sentence has bad grammar and is comrpessed.
## Model Description
The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction in user-generated content, particularly short, conversation-like sentences. It corrects missing spaces, removes unnecessary punctuation, fixes typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
It has been trained on:
- - BNC 2014 Spoken
+ - [BNC 2014 Spoken](http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/)
  - [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
+ - [Comm2 - AAC Text](https://www.aactext.org/comm2/)
+ - [C4-200M - 25K Subset](https://huggingface.co/datasets/leslyarun/c4_200m_gec_train100k_test25k)
+ - [JFLEG](https://huggingface.co/datasets/jhu-clsp/jfleg)

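For the Hugging Face-hosted sources, a minimal loading sketch (dataset IDs are taken from the links above; split and field names vary by `datasets` version and are assumptions here):

```python
# Sketch: pull the Hugging Face-hosted sources listed above.
# Split and field names are assumptions and may differ by version.
from datasets import load_dataset

# DailyDialog is a script-based dataset; recent `datasets` releases may
# require trust_remote_code=True.
daily_dialog = load_dataset("daily_dialog", split="train")
jfleg = load_dataset("jhu-clsp/jfleg", split="validation")
c4_gec = load_dataset("leslyarun/c4_200m_gec_train100k_test25k", split="train")

# DailyDialog stores each example as a list of utterances under "dialog".
sentences = [turn for dialog in daily_dialog["dialog"] for turn in dialog]
```
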
Then injecting typos from a range of places:
+ - **Using NLPAUG**: We introduce typos into the Comm2 sentences with this library: https://github.com/makcedward/nlpaug (see the sketch after this list)
  - **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
    - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
  - **TOEFL Spell**: A dataset of spelling annotations for English-language-learner essays written for TOEFL exams.
    - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master).
+ - **Homonyms**: We occasionally replace words in BNC and Daily Dialog with homophones from this list: https://github.com/pimentel/homophones/ (also covered in the sketch below)
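
A minimal sketch of this injection step, assuming nlpaug's `KeyboardAug` and a homophone file derived from the list above (the exact augmenters, settings, and file format used for the released dataset are not specified in this README):

```python
# Sketch of typo injection. KeyboardAug simulates adjacent-key typos;
# the augmenter choice and settings here are assumptions, not the
# documented pipeline.
import csv
import random

import nlpaug.augmenter.char as nac

keyboard_typos = nac.KeyboardAug(aug_char_max=1, aug_word_max=1)

def add_keyboard_typo(sentence: str) -> str:
    out = keyboard_typos.augment(sentence)
    # nlpaug >= 1.1.11 returns a list; older versions return a string.
    return out[0] if isinstance(out, list) else out

# Homophone swaps. Assumes a hypothetical homophones.csv with one
# comma-separated homophone group per line, derived from
# https://github.com/pimentel/homophones/.
with open("homophones.csv") as f:
    groups = [row for row in csv.reader(f) if row]

homophones = {w: [x for x in g if x != w] for g in groups for w in g}

def swap_homophone(sentence: str, p: float = 0.1) -> str:
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in homophones and random.random() < p:
            words[i] = random.choice(homophones[w.lower()])
    return " ".join(words)
```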

- And then compressing versions of the sentences (i.e. removing spaces)- both correct and typod
-
- We have also provided the C4-200M-250K subset data and the JFLEG dataset for base grammar correction
+ We then add compressed versions of the sentences (i.e. with all spaces removed) to the dataset - both the correct and the typo-injected variants. (This addresses a common pattern where some people write without spaces.)
+ Note that we use a ``grammar: `` prefix for each input sentence in training, as shown in the sketch below.

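A sketch of how the compressed variants and the ``grammar: `` prefix combine into training pairs (`add_keyboard_typo` comes from the sketch above; the exact mix of variants per sentence is an assumption):

```python
# Sketch: build (input, target) training pairs with the "grammar: " prefix.
def compress(sentence: str) -> str:
    # Simulate users who write without spaces.
    return sentence.replace(" ", "")

def make_pairs(sentence: str):
    clean = sentence.strip()
    typoed = add_keyboard_typo(clean)  # from the injection sketch above
    for variant in (clean, typoed, compress(clean), compress(typoed)):
        yield f"grammar: {variant}", clean

for source, target in make_pairs("Did you catch the game last night?"):
    print(source, "->", target)
```
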
  Full script to build the [dataset is here](https://colab.research.google.com/drive/1VkKU9KKIWkWQZ-pPzdDFLeRnwFxdWUtq?usp=sharing)

## Developed by:
- **Name**: Will Wade
- **Affiliation**: Research & Innovation Manager, Occupational Therapist, Ace Centre, UK
 

@@ -108,24 +112,71 @@ The model was trained on a curated subset of the DailyDialog and BNC corpora (20
Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.

### Speeds, Sizes, Times
- - Training was conducted on Google Colab, taking approximately 11 hrs to complete.
+ - Training was conducted on Lambda Labs, taking approximately 4 hrs to complete.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data
- Evaluation was performed on a held-out test set derived from the same corpora and similar sentences, ensuring a diverse range of sentence structures and error types were represented.
+ The evaluation was performed on a held-out test set derived from the same corpora and similar sentences, ensuring a diverse range of sentence structures and error types were represented.
-
- ### Metrics
- Performance was measured using the accuracy of space insertion and typo correction alongside qualitative assessments of text normalisation.

## Results
The model demonstrates high efficacy in correcting short, erroneous sentences, with particular strength in handling real-world, conversational text.

- # Environmental Impact
-
- The training was conducted with an emphasis on efficiency and minimising carbon emissions. Users leveraging cloud compute resources are encouraged to consider the environmental impact of large-scale model training and inference.
+ It performs nearly on par with GPT-3.5-Turbo-16k at around 93% sentence similarity, but there are gaps.
+
+ Take, for example, these outputs:
+
+ Original: Didyoucatchthegamelastnight?
+ Corrected: Did you catch the game last night?
+
+ Original: Wannagrabcoffeetomorrow?
+ Corrected: Wanna grab coffee tomorrow?
+
+ Original: ImdyingsomeonecancellsoIcandogsitter!
+ Corrected: I'm dying someone cancell so I can dogsitter!
+
+ Original: Hahahahahahahathats hilarious!
+ Corrected: Haha ha ha ha that's hilarious!
+
+ Original: OMGyouneedtoseethelatestmeme!
+ Corrected: OMG you need to see the latest meme!
+
+ Original: Seriouslythisweatherissocrazy!
+ Corrected: Seriously this weather is so crazy!
+
+ Original: Whatchauptomefriend?
+ Corrected: What's his friend?
+
+ Original: Feelingburntoutaftettodayhelp!
+ Corrected: Feeling burnt out today help!
+
+ Original: Guesswhosingleagain!
+ Corrected: Guess who single again!
+
+ Original: Youwontyoubelievewhatjusthappened!
+ Corrected: You want you believe what just happened!
+
+ Original: Moviemarathonatmyplacethisweekend?
+ Corrected: Movie Marathon at my place this weekend?
+
+ Original: Needstudymotivationanyideas?
+ Corrected: Need study motivation any ideas?
+
+ Original: Sostressedaboutthispresentation!
+ Corrected: So stressed about this presentation!
+
+ Original: Finallyfinishedthatbookyourecommended!
+ Corrected: Finally finished that book you're recommended!
+
+ Original: Anygoodshowsbingeablelately?
+ Corrected: Any good shows biteable recently?
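
The README does not state how the sentence-similarity score was computed. As an illustrative proxy only, a character-level ratio such as difflib's gives the flavour of the 93% figure:

```python
# Sketch: score a correction against a reference. The actual metric behind
# the 93% figure is not documented; difflib's ratio is a stand-in.
from difflib import SequenceMatcher

def similarity(prediction: str, reference: str) -> float:
    return SequenceMatcher(None, prediction, reference).ratio()

print(similarity("Did you catch the game last night?",
                 "Did you catch the game last night?"))  # 1.0
print(similarity("Guess who single again!",
                 "Guess who's single again!"))           # ~0.96
```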
 
 
# Technical Specifications