Upload README.md
adding some more info
README.md
CHANGED
@@ -8,6 +8,9 @@ tags:
 - AAC
 - assistive-technology
 - spoken
+datasets:
+- jfleg
+- daily_dialog
 ---
 # t5-small-spoken-typo
 
@@ -16,14 +19,44 @@ This model is a fine-tuned version of T5-small, adapted for correcting typograph
 ## Task
 The primary task of this model is **Text Correction**, with a focus on:
 - **Sentence Correction**: Enhancing readability by correcting sentences with missing spaces or typographical errors.
-- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats.
+- **Text Normalization**: Standardizing text by converting informal or irregular forms into more grammatically correct formats, largely for sentences written without spaces.
 
 This model aims to support the processing of user-generated content where informal language, abbreviations, and typos are prevalent, improving text clarity for further processing or human reading.
 
+
+## Usage
+
+```python
+from happytransformer import HappyTextToText, TTSettings
+
+happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")  # checkpoint as given in this card; substitute this model's repo id to load t5-small-spoken-typo
+
+args = TTSettings(num_beams=5, min_length=1)
+
+# Add the prefix "grammar: " before each input
+result = happy_tt.generate_text("grammar: Hihowareyoudoingtaday?.", args=args)
+
+print(result.text) # This sentence has bad grammar and is compressed.
+```
+
 # Model Details
 
 ## Model Description
 The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It restores missing spaces, removes unnecessary punctuation, corrects typos (typos are injected into the training data so the model learns to fix them), and normalizes text by replacing informal contractions and abbreviations with their full forms.
+It has been trained on:
+- BNC 2014 Spoken
+- [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
+
+Typos were then injected from a range of sources:
+- **Typo lists, Birkbeck, etc.**: These datasets contain lists of commonly misspelled words, making them invaluable for training models to recognize and correct spelling errors.
+  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
+- **TOEFL Spell**: A dataset of spelling annotations for English language learner essays written for TOEFL exams.
+  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master).
+
+Finally, compressed versions of the sentences (i.e. with spaces removed) were created, for both the correct and the typo-injected text.
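
The data preparation described above (inject typos, then strip spaces) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's actual pipeline: the `MISSPELLINGS` table is a tiny hypothetical stand-in for lists such as Birkbeck or TOEFL-Spell, and the function names, the per-word replacement `rate`, and the choice to pair every noisy input with the clean sentence as target are all assumptions.

```python
import random

# Hypothetical stand-in for a misspelling list (e.g. Birkbeck, TOEFL-Spell):
# maps a correctly spelled word to misspellings observed for it.
MISSPELLINGS = {
    "doing": ["doin", "doign"],
    "today": ["taday", "todya"],
    "you": ["yuo"],
}


def inject_typos(sentence, misspellings, rate, rng):
    """Swap each word found in the list for a listed misspelling with probability `rate`."""
    words = []
    for word in sentence.split():
        options = misspellings.get(word.lower())
        if options and rng.random() < rate:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def compress(sentence):
    """Remove all whitespace, mimicking text typed without spaces."""
    return "".join(sentence.split())


def make_pairs(sentence, misspellings, rate=0.5, seed=0):
    """Return (noisy_input, clean_target) pairs for one clean sentence."""
    rng = random.Random(seed)
    typo_version = inject_typos(sentence, misspellings, rate, rng)
    # Compressed forms of both the correct and the typo-injected sentence.
    sources = {compress(sentence), compress(typo_version)}
    return [(source, sentence) for source in sorted(sources)]


if __name__ == "__main__":
    for source, target in make_pairs("Hi how are you doing today", MISSPELLINGS, rate=1.0):
        print(f"{source!r} -> {target!r}")
```

Using a set for the sources means that a sentence with no injectable words yields a single pair rather than two identical ones.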
+
+Next we would like to train on C4 200M, or at least a subset of it.
+
 
 ## Developed by:
 - **Name**: Will Wade