willwade committed on
Commit
d6f324e
1 Parent(s): cccfd9c

updating with new data and eval results

Files changed (1)
  1. README.md +79 -30
README.md CHANGED
@@ -79,6 +79,9 @@ It has been training on
79
  - [Comm2 - AAC Text](https://www.aactext.org/comm2/)
80
  - [C4-200M - 25K Subset](https://huggingface.co/datasets/leslyarun/c4_200m_gec_train100k_test25k)
81
  - [JFLEG](https://huggingface.co/datasets/jhu-clsp/jfleg)
 
 
 
82
 
83
 
84
  Then injecting typos from a range of places
@@ -88,6 +91,7 @@ Then injecting typos from a range of places
88
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
89
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
90
  - **Homonyms** We occasionally replace words in BNC and DailyDialog with homonyms from this list: https://github.com/pimentel/homophones/
 
91
 
92
  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to our dataset. (This addresses the problem that some people write without spaces.)
93
  Note we use a ``grammar: `` prefix for each sentence in training.
@@ -113,7 +117,7 @@ Full script to build the [dataset is here](https://colab.research.google.com/dri
113
  - The model is fine-tuned from `t5-small`.
114
 
115
  ## Resources for more information:
116
- - [GitHub Repo](https://github.com/willwade/dailyDialogCorrections/)
117
 
118
  # Uses
119
 
@@ -141,10 +145,23 @@ The model was trained on a curated subset of the DailyDialog and BNC corpora (20
141
  Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.
142
 
143
  ### Speeds, Sizes, Times
144
- - Training was conducted on Lambda Labs, taking approximately 4 hrs to complete.
145
 
146
  # Evaluation
147
148
  ## Testing Data, Factors & Metrics
149
 
150
  ### Testing Data
@@ -156,59 +173,91 @@ The model demonstrates high efficacy in correcting short, erroneous sentences, w
156
 
157
  It performs nearly on par with GPTTurbo16k at around 93% sentence similarity. But there are gaps.
158
 
159
- Take, for example, this output; I've bolded the parts that I feel are incorrect.
160
 
161
 
 
 
162
  Original: Didyoucatchthegamelastnight?
163
  Corrected: Did you catch the game last night?
164
-
165
  Original: Wannagrabcoffeetomorrow?
166
  Corrected: Wanna grab coffee tomorrow?
167
-
168
  Original: ImdyingsomeonecancellsoIcandogsitter!
169
- Corrected: I'm dying someone **cancell** so I can dogsitter!
170
-
171
  Original: Hahahahahahahathats hilarious!
172
- Corrected: Haha ha ha ha that's hilarious!
173
-
174
  Original: OMGyouneedtoseethelatestmeme!
175
- Corrected: OMG you need to see the latest meme!
176
-
177
  Original: Seriouslythisweatherissocrazy!
178
- Corrected: Seriously this weather is so crazy!
179
-
180
  Original: Whatchauptomefriend?
181
- Corrected: What's **his** friend?
182
-
183
  Original: Feelingburntoutaftettodayhelp!
184
- Corrected: Feeling burnt out today help!
185
-
186
  Original: Guesswhosingleagain!
187
  Corrected: Guess who single again!
188
-
189
  Original: Youwontyoubelievewhatjusthappened!
190
- Corrected: You **want** you believe what just happened!
191
-
192
  Original: Moviemarathonatmyplacethisweekend?
193
- Corrected: Movie Marathon at my place this weekend?
194
-
195
  Original: Needstudymotivationanyideas?
196
- Corrected: Need study motivation any ideas?
197
-
198
  Original: Sostressedaboutthispresentation!
199
  Corrected: So stressed about this presentation!
200
-
201
  Original: Finallyfinishedthatbookyourecommended!
202
- Corrected: Finally finished that book **you're** recommended!
203
-
204
  Original: Anygoodshowsbingeablelately?
205
- Corrected: Any good shows **biteable** recently?
206
 
207
  We hope to build on this in time by further fine-tuning on a real corpus from individuals using AAC.
208
 
209
 
210
- #EvalResult(loss=0.8066404461860657)
211
-
212
 
213
  # Technical Specifications
214
 
 
79
  - [Comm2 - AAC Text](https://www.aactext.org/comm2/)
80
  - [C4-200M - 25K Subset](https://huggingface.co/datasets/leslyarun/c4_200m_gec_train100k_test25k)
81
  - [JFLEG](https://huggingface.co/datasets/jhu-clsp/jfleg)
82
+ - [Coedit](https://huggingface.co/datasets/grammarly/coedit)
83
+ - [Conversation Enders](https://huggingface.co/Chakshu/conversation_ender)
84
+ - [Conversation Starters](https://huggingface.co/Langame/conversation-starters)
85
 
86
 
87
  Then injecting typos from a range of places
 
91
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
92
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
93
  - **Homonyms** We occasionally replace words in BNC and DailyDialog with homonyms from this list: https://github.com/pimentel/homophones/
94
+ - **Our own typo augment function** This introduces errors that are likely on an English QWERTY layout, as well as substitutions, deletions, etc.
95
 
96
  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to our dataset. (This addresses the problem that some people write without spaces.)
97
  Note we use a ``grammar: `` prefix for each sentence in training.
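
The augmentation steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual augment function: the `QWERTY_NEIGHBOURS` map, the `inject_typo` and `make_example` names, and the inclusion of a transposition edit are all assumptions made for the example. Only the space removal and the `grammar: ` prefix come from the description above.

```python
import random

# Hypothetical, deliberately incomplete map of adjacent keys on an English QWERTY layout.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "t": "rfgy", "n": "bhjm", "s": "awedxz",
}


def inject_typo(sentence: str, rng: random.Random) -> str:
    """Apply one random character-level edit: substitution, deletion, or transposition."""
    if len(sentence) < 2:
        return sentence
    chars = list(sentence)
    i = rng.randrange(len(chars) - 1)  # leaves room for a swap with position i + 1
    op = rng.choice(["substitute", "delete", "swap"])
    if op == "substitute":
        # Only keys present in the map are substituted; the replacement is lowercase.
        neighbours = QWERTY_NEIGHBOURS.get(chars[i].lower())
        if neighbours:
            chars[i] = rng.choice(neighbours)
    elif op == "delete":
        del chars[i]
    else:
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def make_example(clean: str, rng: random.Random) -> tuple[str, str]:
    """Build one (source, target) pair: typo'd, space-free input with the task prefix."""
    noisy = inject_typo(clean, rng).replace(" ", "")
    return "grammar: " + noisy, clean
```

Calling `make_example("Did you catch the game last night?", random.Random(0))` returns a prefixed, space-free source string paired with the untouched clean sentence as the target.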
 
117
  - The model is fine-tuned from `t5-small`.
118
 
119
  ## Resources for more information:
120
+ - [GitHub Repo](https://github.com/AceCentre/Correct-A-Sentence/tree/main/helper-scripts/)
121
 
122
  # Uses
123
 
 
145
  Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.
146
 
147
  ### Speeds, Sizes, Times
148
+ - Training was conducted on Lambda Labs, taking approximately 6 hrs to complete.
149
 
150
  # Evaluation
151
 
152
+ | Phase | Metric | Value |
153
+ |------------|----------------------------------|---------------|
154
+ | Train | Loss | 0.1642 |
155
+ | Train | Global Step | 375876 |
156
+ | Train | Total FLOPs | 8.33E+15 |
157
+ | Train | Training Time | ~6.5 hr |
158
+ | Eval | Loss | 0.1199 |
159
+ | Eval | Samples per Second | 1159.375 |
160
+ | Eval | Steps per Second | 72.462 |
161
+ | Hyperparam | Learning Rate | 5.02E-06 |
162
+ | Hyperparam | Grad Norm | 1.1725 |
163
+ | Hyperparam | Epoch | 3 |
164
+
165
  ## Testing Data, Factors & Metrics
166
 
167
  ### Testing Data
 
173
 
174
  It performs nearly on par with GPTTurbo16k at around 93% sentence similarity. But there are gaps.
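
The README does not say which sentence-similarity measure produced the 93% figure. As a rough sketch of how such a score could be computed, the example below uses the character-level ratio from Python's `difflib`; this metric is an assumption for illustration, and the reference sentences in `pairs` are assumed intended outputs, not data from the evaluation.

```python
from difflib import SequenceMatcher


def sentence_similarity(predicted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, predicted, reference).ratio()


def mean_similarity(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over (predicted, reference) pairs."""
    return sum(sentence_similarity(p, r) for p, r in pairs) / len(pairs)


# (model output, assumed intended sentence)
pairs = [
    ("Did you catch the game last night?", "Did you catch the game last night?"),
    ("Any good shows being possible lately?", "Any good shows bingeable lately?"),
]
score = mean_similarity(pairs)
```

An embedding-based similarity would reward paraphrases more than this character-level stand-in does, so the two are not interchangeable.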
175
 
176
+ Take for example this output.
177
 
178
 
179
+ Original: Howwasyafternoonanyway?
180
+ Corrected: How was my afternoon anyway?
181
  Original: Didyoucatchthegamelastnight?
182
  Corrected: Did you catch the game last night?
 
183
  Original: Wannagrabcoffeetomorrow?
184
  Corrected: Wanna grab coffee tomorrow?
 
185
  Original: ImdyingsomeonecancellsoIcandogsitter!
186
+ Corrected: I'm dying someone cancell so I can do dogsitter!
 
187
  Original: Hahahahahahahathats hilarious!
188
+ Corrected: Hahahahahahahahaha that's hilarious!
 
189
  Original: OMGyouneedtoseethelatestmeme!
190
+ Corrected: OMG, you need to see the latest me!
 
191
  Original: Seriouslythisweatherissocrazy!
192
+ Corrected: Seriously, this weather is so crazy!
 
193
  Original: Whatchauptomefriend?
194
+ Corrected: What's your friend?
 
195
  Original: Feelingburntoutaftettodayhelp!
196
+ Corrected: Feeling burnt out aftet today, help!
 
197
  Original: Guesswhosingleagain!
198
  Corrected: Guess who single again!
 
199
  Original: Youwontyoubelievewhatjusthappened!
200
+ Corrected: You wont you believe what just happened!
 
201
  Original: Moviemarathonatmyplacethisweekend?
202
+ Corrected: Movie marathon at my place this weekend?
 
203
  Original: Needstudymotivationanyideas?
204
+ Corrected: Need study motivation, any ideas?
 
205
  Original: Sostressedaboutthispresentation!
206
  Corrected: So stressed about this presentation!
 
207
  Original: Finallyfinishedthatbookyourecommended!
208
+ Corrected: Finally finished that book you recommended!
 
209
  Original: Anygoodshowsbingeablelately?
210
+ Corrected: Any good shows being possible lately?
211
+ Original: Justsawthecraziestthingonthebus!
212
+ Corrected: Just saw the craziest thing on the bus!
213
+ Original: Sendhelpfoodistrappedintheoven!
214
+ Corrected: Send help food is wrapped in the oven!
215
+ Original: Cantwaittoseeyouattheparty!
216
+ Corrected: Cant wait to see you at the party!
217
+ Original: Missyoutonsalreadyletshangsoon!
218
+ Corrected: Miss youtons already let's hang soon!
219
+ Original: CantbelieveImissedthelastbus!
220
+ Corrected: Can't believe I missed the last bus!
221
+ Original: Needanysuggestionsforagoodmovieatnight?
222
+ Corrected: Need any suggestions for a good movie night?
223
+ Original: Feelingproudaccomplishedsomethingbigtoday!
224
+ Corrected: Feeling proud of something big today!
225
+ Original: Wishcouldteleportmyselftothebeachrightnow.
226
+ Corrected: Wish could teleport myself to the beach right now.
227
+ Original: Justsawthecutestaudiofapuppylearningtotalk.
228
+ Corrected: Just saw the cutest audio from puppy learning to talk.
229
+ Original: Excitedtostartafreshnewprojectthistoday.
230
+ Corrected: Excited to start a fresh new project this today.
231
+ Original: Havingtroubledecidingwhichoptionistobetter.
232
+ Corrected: Having trouble deciding which option is to better.
233
+ Original: Finallyfinishedorganizingmyclosetfeelsamazing!
234
+ Corrected: Finally finished organizing my closet feels amazing!
235
+ Original: Learnedsomethingnewtodayitssoneversadtoolate!
236
+ Corrected: Learned something new today it's so wonderful too late!
237
+ Original: Cravingapizzabuttryingtoresisttemptation.
238
+ Corrected: Craving a pizza, but trying to resist temptation.
239
+ Original: Planningweekendgettogetheranyonesinterested?
240
+ Corrected: Planning weekend get together anyone's interested?
241
+ Original: Canwaittousethisnewlyacquiredskillsoon.
242
+ Corrected: Can wait to use this newly acquired skill soon.
243
+ Original: Feelinggratefulforallofthesupportivepeopleinmylife.
244
+ Corrected: Feeling grateful for all of the supportive people in my life.
245
+ Original: Whatshappeningonthelatestseasonofyourfavoriteshow?
246
+ Corrected: What's happening on the latest season of your favorites, though?
247
+ Original: Anyoneelseafraidofthedarkadmitnojudgement.
248
+ Corrected: Anyone else afraid of the dark admit no judgement.
249
+ Original: Justreadafascinatingarticleaboutancientcivilizations.
250
+ Corrected: Just read a fascinating article about ancient civilizations.
251
+ Original: Feelingaccomplishedcrossedofeverythingontmylisttoday.
252
+ Corrected: Feeling accomplished crossed of everything on my list today.
253
+ Original: Strugglingtofindmotivationanyadviceisappreciated.
254
+ Corrected: Struggling to find motivation any advice is appreciated.
255
+ Original: Cantw waittoseeyouagainletsmakesoonplans!
256
+ Corrected: Can't wait to see you again, let's make soon plans!
257
 
258
  We hope to build on this in time by further fine-tuning on a real corpus from individuals using AAC.
259
 
260
 
 
 
261
 
262
  # Technical Specifications
263