adding new facts

README.md CHANGED
@@ -40,6 +40,7 @@ The primary task of this model is **Text Correction**, with a focus on:

This model supports processing of user-generated content where informal language, abbreviations, and typos are prevalent, with the aim of improving text clarity for further processing or human reading.

**Note: as of 15 March, this model is primarily tuned to fix positional errors on a QWERTY keyboard.**

## Usage

@@ -94,6 +95,7 @@ It has been trained on:

- [Coedit](https://huggingface.co/datasets/grammarly/coedit)
- [Conversation Enders](https://huggingface.co/Chakshu/conversation_ender)
- [Conversation Starters](https://huggingface.co/Langame/conversation-starters)
- 5% AAC-like Open Subtitles (private dataset, with thanks to Keith Vertanen)

Then we inject typos from a range of sources:

@@ -103,13 +105,16 @@ Then injecting typos from a range of places

- **TOEFL Spell**: a dataset of spelling annotations for essays written by English-language learners for TOEFL exams. Find it [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master).
- **Homonyms**: we occasionally replace words in BNC and Daily Dialog with homophones from this list: https://github.com/pimentel/homophones
- **Our own typo-augmentation function**: this produces likely errors for an English QWERTY layout (adjacent-key presses) as well as substitutions, deletions, etc. (see the sketch after this list). Full weightings can be seen in our training script.
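
As a rough illustration of the homophone swap and the typo-augmentation step, here is a minimal sketch in Python. The adjacency map, homophone table, error mix, and probabilities are illustrative assumptions, not the exact weightings from our training script.

```python
import random

# Illustrative subset of a QWERTY adjacency map (assumption: the real
# script covers the full keyboard and uses tuned per-error weightings).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr", "t": "rfgy",
    "o": "iklp", "n": "bhjm", "i": "ujko", "r": "edft",
}

# Tiny illustrative homophone table; the real pairs come from
# https://github.com/pimentel/homophones
HOMOPHONES = {"their": "there", "to": "too", "hear": "here"}

def swap_homophone(word: str, p: float = 0.05) -> str:
    """Occasionally replace a word with one of its homophones."""
    return HOMOPHONES.get(word, word) if random.random() < p else word

def inject_typo(word: str, p: float = 0.1) -> str:
    """Corrupt a word with a neighbour-key press, substitution, or deletion."""
    if len(word) < 2 or random.random() > p:
        return word
    i = random.randrange(len(word))
    kind = random.choice(["neighbour", "substitution", "deletion"])
    if kind == "neighbour":
        c = word[i].lower()
        if c in QWERTY_NEIGHBOURS:
            # Positional error: a key adjacent to the intended one was hit.
            return word[:i] + random.choice(QWERTY_NEIGHBOURS[c]) + word[i + 1:]
        return word
    if kind == "substitution":
        return word[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]
    return word[:i] + word[i + 1:]  # deletion

clean = "i want to hear their answer"
noisy = " ".join(inject_typo(swap_homophone(w)) for w in clean.split())
```

Each noisy sentence would then be paired with its clean original to form a correction example.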

We then also add compressed versions of the sentences (i.e. with all spaces removed), both correct and typo-injected, to our dataset. (This addresses the fact that some people write without spaces.)

Note that we use a ``grammar: `` prefix for each sentence in training.
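
As a minimal inference sketch using that prefix (assuming a T5-style seq2seq checkpoint loadable with `transformers`; the model id below is a placeholder, not the real name):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "ORG/MODEL-ID"  # placeholder -- substitute this model's actual id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Apply the same "grammar: " prefix used during training.
inputs = tokenizer("grammar: i realy liek this modle", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```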

Full script to build the [dataset is here](https://colab.research.google.com/drive/1VkKU9KKIWkWQZ-pPzdDFLeRnwFxdWUtq?usp=sharing).

## To Do:

We really want to be able to handle errors from switch scanning, which may be linear (ABC), frequency-ordered, or block scanning. Simulating these errors is relatively straightforward; we are still deciding on the best way forward.
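
Purely as an illustrative sketch of one possible direction (nothing here is decided), a mistimed press in linear ABC scanning could be simulated as the selection landing one step early or late in the scan order; frequency or block scanning would change only the ordering or geometry:

```python
import random

ABC_SCAN_ORDER = "abcdefghijklmnopqrstuvwxyz "

def scan_slip(char: str, p: float = 0.05) -> str:
    """Simulate a mistimed switch press in linear ABC scanning:
    the selection lands one step before or after the target."""
    if char not in ABC_SCAN_ORDER or random.random() > p:
        return char
    i = ABC_SCAN_ORDER.index(char)
    return ABC_SCAN_ORDER[(i + random.choice([-1, 1])) % len(ABC_SCAN_ORDER)]

noisy = "".join(scan_slip(c) for c in "hello world")
```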

## Developed by:

- **Name**: Will Wade