---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text2text-generation
tags:
- AAC
- assistive-technology
- spoken
datasets:
- jfleg
- daily_dialog
- leslyarun/c4_200m_gec_train100k_test25k
- grammarly/coedit
- willwade/UNL-AAC-Phrases
- Chakshu/conversation_ender
- Langame/conversation-starters
widget:
- inference:
    parameters:
      num_beams: 5
      min_length: 1
      max_new_tokens: 50
      early_stopping: true
    examples:
    - text: 'grammar: Didyoucatchthegamelastnight'
      example_title: Writing without spaces
    - text: 'grammar: My naame is Sysan and my favoritefood iis an brueger'
      example_title: Typos
---
# t5-small-spoken-typo

This model is a fine-tuned version of T5-small, adapted for correcting typographical errors and missing spaces in text. It has been trained on a combination of spoken corpora, including DailyDialog and BNC, with a focus on short utterances common in conversational English.

## Task
The primary task of this model is **Text Correction**, with a focus on:
- **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct ones, mostly for sentences written without spaces.

This model is intended to support processing of user-generated content where informal language, abbreviations, and typos are prevalent, with the goal of improving text clarity for further processing or human reading.


## Usage

```python
from happytransformer import HappyTextToText, TTSettings

happy_tt = HappyTextToText("T5", "willwade/t5-small-spoken-typo")

args = TTSettings(num_beams=5, min_length=1)

# Add the prefix "grammar: " before each input 
result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)

print(result.text)  # Prints the corrected sentence
```

Or, using vanilla transformers:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model
model_name = "willwade/t5-small-spoken-typo"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prepare the input text with the "grammar: " prefix
input_text = "grammar: Hihowareyoudoingtaday?."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text
# Adjust num_beams and min_length to your needs
output = model.generate(input_ids, num_beams=5, min_length=1, max_new_tokens=50, early_stopping=True)

# Decode the generated text
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

print(decoded_output)

```

# Model Details

## Model Description
The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
It has been trained on:
- [BNC 2014 Spoken](http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/)
- [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
- [Comm2 - AAC Text](https://www.aactext.org/comm2/)
- [C4-200M - 25K Subset](https://huggingface.co/datasets/leslyarun/c4_200m_gec_train100k_test25k)
- [JFLEG](https://huggingface.co/datasets/jhu-clsp/jfleg)
- [Coedit](https://huggingface.co/datasets/grammarly/coedit)
- [Conversation Enders](https://huggingface.co/Chakshu/conversation_ender)
- [Conversation Starters](https://huggingface.co/Langame/conversation-starters)


Typos were then injected from a range of sources:
- **Using NLPAUG**: We introduced typos into Comm2 using this library: https://github.com/makcedward/nlpaug
- **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
- **TOEFL Spell**: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master).
- **Homonyms**: We occasionally replace words in BNC and Daily Dialog with homonyms from this list: https://github.com/pimentel/homophones
- **Our own typo augment function**: This produces errors that are likely on an English QWERTY layout, as well as substitutions, deletions, etc.

We then add compressed versions of the sentences (i.e. with spaces removed), both correct and with typos, to our dataset. This addresses the problem of some people writing without spaces.
Note that we use a ``grammar: `` prefix for each sentence in training.

Full script to build the [dataset is here](https://colab.research.google.com/drive/1VkKU9KKIWkWQZ-pPzdDFLeRnwFxdWUtq?usp=sharing)
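
As a rough illustration of the augmentation steps described above (not the linked script itself), the sketch below injects a single keyboard-style typo, removes spaces, and attaches the `grammar: ` prefix. The `QWERTY_NEIGHBOURS` map, the function names, and the choice of exactly three source variants per target are all assumptions made for this example.

```python
import random

# Hypothetical, partial map of QWERTY neighbours; the real script may differ.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "r": "edft",
}


def inject_typo(sentence, rng):
    """Apply at most one random edit: neighbour substitution, deletion, or transposition."""
    chars = list(sentence)
    positions = [i for i, c in enumerate(chars) if c.isalpha()]
    if not positions:
        return sentence
    i = rng.choice(positions)
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "substitute" and chars[i].lower() in QWERTY_NEIGHBOURS:
        chars[i] = rng.choice(QWERTY_NEIGHBOURS[chars[i].lower()])
    elif op == "delete":
        del chars[i]
    elif op == "transpose" and i + 1 < len(chars):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # If no branch applied (e.g. no neighbours known), the sentence is unchanged.
    return "".join(chars)


def compress(sentence):
    """Remove all whitespace, simulating a user who types without spaces."""
    return "".join(sentence.split())


def make_pairs(target, rng, prefix="grammar: "):
    """Return (source, target) pairs: typo'd, compressed, and compressed typo'd."""
    typod = inject_typo(target, rng)
    sources = [typod, compress(target), compress(typod)]
    return [(prefix + source, target) for source in sources]


rng = random.Random(0)  # fixed seed so the augmentation is repeatable
for source, target in make_pairs("Did you catch the game last night?", rng):
    print(source, "->", target)
```

Every pair keeps the clean sentence as the target, so the model sees the same correction goal for each noisy variant.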


## Developed by:
- **Name**: Will Wade
- **Affiliation**: Research & Innovation Manager, Occupational Therapist, Ace Centre, UK
- **Contact Info**: wwade@acecentre.org.uk

## Model type: 
- Language model fine-tuned for text correction tasks.

## Language(s) (NLP): 
- English (`en`)

## License:
- apache-2.0

## Parent Model:
- The model is fine-tuned from `t5-small`.

## Resources for more information:
- [GitHub Repo](https://github.com/AceCentre/Correct-A-Sentence/tree/main/helper-scripts/)

# Uses

## Direct Use
This model can be directly applied for correcting text in various applications, including but not limited to, enhancing the quality of user-generated content, preprocessing text for NLP tasks, and supporting assistive technologies.

## Out-of-Scope Use
The model might not perform well on text significantly longer than the training examples (2-5 words), highly formal documents, or languages other than English. Use in sensitive contexts should be approached with caution due to potential biases. **Our typical use case here is AAC users, i.e. people who use technology to communicate face to face with others.**

# Bias, Risks, and Limitations

The model may inherit biases present in its training data, potentially reflecting or amplifying societal stereotypes. Given its training on conversational English, it may not generalize well to formal text or other dialects and languages.

## Recommendations
Users are encouraged to critically assess the model's output, especially when used in sensitive or impactful contexts. Further fine-tuning with diverse and representative datasets could mitigate some limitations.

# Training Details

## System

- System configuration: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Runtime: Python 3.10.12
- Hardware: NVIDIA A10 GPU with 24GB GDDR6 dedicated memory
- CPU Cores: 30 logical cores @ 2.59GHz
- Disk Space: Approximately 1.3TB

## Training Data
The model was trained on a curated subset of the DailyDialog and BNC (2014 spoken) corpora, focusing on sentences 2-5 words in length, with manual introduction of typos and removal of spaces for robustness in text correction tasks. You can see the code to pre-process this [here](https://github.com/willwade/dailyDialogCorrections/tree/main).

## Training Procedure

### Preprocessing
Sentences were stripped of apostrophes and commas, spaces were removed, and typos were introduced programmatically to simulate common errors in user-generated content.
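
A minimal sketch of this filtering and stripping step follows. It assumes the 2-5 word limit from the Training Data section is applied by whitespace word count, and that the stripped, space-free form is the model input while the original sentence is the target; the function names are hypothetical and the linked repository may do this differently.

```python
def strip_punctuation(sentence):
    """Remove apostrophes (straight and curly) and commas, nothing else."""
    return sentence.translate(str.maketrans("", "", "',\u2019"))


def is_short_utterance(sentence, min_words=2, max_words=5):
    """Keep only sentences within the assumed 2-5 word range."""
    return min_words <= len(sentence.split()) <= max_words


def preprocess(sentences):
    """Yield (noisy_input, target) pairs for sentences that pass the length filter."""
    for sentence in sentences:
        if is_short_utterance(sentence):
            stripped = strip_punctuation(sentence)
            yield stripped.replace(" ", ""), sentence


examples = ["Hello", "Can't wait, honestly", "See you at the party tonight then"]
for noisy, target in preprocess(examples):
    print(noisy, "->", target)
```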

### Speeds, Sizes, Times
- Training was conducted on Lambda Labs, taking approximately 12 hrs to complete.

# Evaluation

| Phase      | Metric             | Value    |
|------------|--------------------|----------|
| Train      | Loss               | 0.1642   |
| Train      | Global Step        | 375876   |
| Train      | Total FLOPs        | 8.33E+15 |
| Train      | Training Time      | ~6.5 hr  |
| Eval       | Loss               | 0.1199   |
| Eval       | Samples per Second | 1159.375 |
| Eval       | Steps per Second   | 72.462   |
| Hyperparam | Learning Rate      | 5.02E-06 |
| Hyperparam | Grad Norm          | 1.1725   |
| Hyperparam | Epoch              | 3        |
## Testing Data, Factors & Metrics

### Testing Data
The evaluation was performed on a held-out test set derived from the same corpora and similar sentences, ensuring a diverse range of sentence structures and error types were represented.


## Results 
The model demonstrates high efficacy in correcting short, erroneous sentences, with particular strength in handling real-world, conversational text.

It performs nearly on par with GPTTurbo16k, at around 93% sentence similarity, but there are gaps.

Take, for example, this output:


|    | Input                                                 | Output                                                           |
|---:|:------------------------------------------------------|:-----------------------------------------------------------------|
|  0 | Didyoucatchthegamelastnight?                          | Did you catch the game last night?                               |
|  1 | Wannagrabcoffeetomorrow?                              | Wanna grab coffee tomorrow?                                      |
|  2 | ImdyingsomeonecancellsoIcandogsitter!                 | I'm dying someone cancell so I can do dogsitter!                 |
|  3 | Hahahahahahahathats hilarious!                        | Hahahahahahahahaha that's hilarious!                             |
|  4 | OMGyouneedtoseethelatestmeme!                         | OMG, you need to see the latest me!                              |
|  5 | Seriouslythisweatherissocrazy!                        | Seriously, this weather is so crazy!                             |
|  6 | Whatchauptomefriend?                                  | What's your friend?                                              |
|  7 | Feelingburntoutaftettodayhelp!                        | Feeling burnt out aftet today, help!                             |
|  8 | Guesswhosingleagain!                                  | Guess who single again!                                          |
|  9 | Youwontyoubelievewhatjusthappened!                    | You wont you believe what just happened!                         |
| 10 | Moviemarathonatmyplacethisweekend?                    | Movie marathon at my place this weekend?                         |
| 11 | Needstudymotivationanyideas?                          | Need study motivation, any ideas?                                |
| 12 | Sostressedaboutthispresentation!                      | So stressed about this presentation!                             |
| 13 | Finallyfinishedthatbookyourecommended!                | Finally finished that book you recommended!                      |
| 14 | Anygoodshowsbingeablelately?                          | Any good shows being possible lately?                            |
| 15 | Justsawthecraziestthingonthebus!                      | Just saw the craziest thing on the bus!                          |
| 16 | Sendhelpfoodistrappedintheoven!                       | Send help food is wrapped in the oven!                           |
| 17 | Cantwaittoseeyouattheparty!                           | Cant wait to see you at the party!                               |
| 18 | Missyoutonsalreadyletshangsoon!                       | Miss youtons already let's hang soon!                            |
| 19 | CantbelieveImissedthelastbus!                         | Can't believe I missed the last bus!                             |
| 20 | Needanysuggestionsforagoodmovieatnight?               | Need any suggestions for a good movie night?                     |
| 21 | Feelingproudaccomplishedsomethingbigtoday!            | Feeling proud of something big today!                            |
| 22 | Wishcouldteleportmyselftothebeachrightnow.            | Wish could teleport myself to the beach right now.               |
| 23 | Justsawthecutestaudiofapuppylearningtotalk.           | Just saw the cutest audio from puppy learning to talk.           |
| 24 | Excitedtostartafreshnewprojectthistoday.              | Excited to start a fresh new project this today.                 |
| 25 | Havingtroubledecidingwhichoptionistobetter.           | Having trouble deciding which option is to better.               |
| 26 | Finallyfinishedorganizingmyclosetfeelsamazing!        | Finally finished organizing my closet feels amazing!             |
| 27 | Learnedsomethingnewtodayitssoneversadtoolate!         | Learned something new today it's so wonderful too late!          |
| 28 | Cravingapizzabuttryingtoresisttemptation.             | Craving a pizza, but trying to resist temptation.                |
| 29 | Planningweekendgettogetheranyonesinterested?          | Planning weekend get together anyone's interested?               |
| 30 | Canwaittousethisnewlyacquiredskillsoon.               | Can wait to use this newly acquired skill soon.                  |
| 31 | Feelinggratefulforallofthesupportivepeopleinmylife.   | Feeling grateful for all of the supportive people in my life.    |
| 32 | Whatshappeningonthelatestseasonofyourfavoriteshow?    | What's happening on the latest season of your favorites, though? |
| 33 | Anyoneelseafraidofthedarkadmitnojudgement.            | Anyone else afraid of the dark admit no judgement.               |
| 34 | Justreadafascinatingarticleaboutancientcivilizations. | Just read a fascinating article about ancient civilizations.     |
| 35 | Feelingaccomplishedcrossedofeverythingontmylisttoday. | Feeling accomplished crossed of everything on my list today.     |
| 36 | Strugglingtofindmotivationanyadviceisappreciated.     | Struggling to find motivation any advice is appreciated.         |
| 37 | Cantw waittoseeyouagainletsmakesoonplans!             | Can't wait to see you again, let's make soon plans!              |

We hope to build on this by further fine-tuning over time on a real corpus from individuals using AAC.
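
To compare outputs like those in the table against reference sentences, one option is a simple character-level score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not the metric behind the 93% figure above, and the reference sentences are hand-written assumptions for this example.

```python
from difflib import SequenceMatcher


def similarity(prediction, reference):
    """Character-level similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, prediction, reference).ratio()


def score(pairs):
    """Summarize (prediction, reference) pairs by mean similarity and exact-match rate."""
    ratios = [similarity(pred, ref) for pred, ref in pairs]
    exact = sum(pred == ref for pred, ref in pairs)
    return {
        "mean_similarity": sum(ratios) / len(ratios),
        "exact_match": exact / len(pairs),
    }


# Predictions come from the table above; references are assumed for illustration.
pairs = [
    ("Did you catch the game last night?", "Did you catch the game last night?"),
    ("OMG, you need to see the latest me!", "OMG, you need to see the latest meme!"),
]
print(score(pairs))
```

A character-level ratio rewards near misses such as a truncated word, so it is worth reading alongside the exact-match rate rather than on its own.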


# Technical Specifications

## Model Architecture and Objective
The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.


# Citation

**BibTeX:**

```bibtex
@misc{t5_small_spoken_typo_2021,
  title={T5-small Spoken Typo Corrector},
  author={Will Wade},
  year={2021},
  howpublished={\url{https://huggingface.co/willwade/t5-small-spoken-typo}},
}
```