willwade committed
Commit 2e0a5fa
1 Parent(s): d6f324e

small tweaks

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -72,7 +72,7 @@ print(decoded_output)
  # Model Details

  ## Model Description
- The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, introduces and then corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
+ The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
  It has been trained on
  - [BNC 2014 Spoken](http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/)
  - [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
@@ -90,7 +90,7 @@ Then injecting typos from a range of places
  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
- - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list https://github.com/pimentel/homophones/
+ - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list https://github.com/pimentel/homophones
  - **Our own typo augment function** This generates errors that are likely on an English QWERTY layout, as well as substitutions, deletions, etc.

  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to our dataset. (This addresses the problem that some people write without spaces.)
@@ -136,6 +136,14 @@ Users are encouraged to critically assess the model's output, especially when us

  # Training Details

+ ## System
+
+ - System configuration: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
+ - Runtime: Python 3.10.12
+ - Hardware: NVIDIA A10 GPU with 24GB GDDR6 dedicated memory
+ - CPU Cores: 30 logical cores @ 2.59GHz
+ - Disk Space: Approximately 1.3TB
+
  ## Training Data
  The model was trained on a curated subset of the DailyDialog and BNC corpora (2014 spoken), focusing on sentences 2-5 words in length, with manual introduction of typos and removal of spaces for robustness in text correction tasks. You can see the pre-processing code [here](https://github.com/willwade/dailyDialogCorrections/tree/main)

@@ -264,9 +272,6 @@ We hope to build on this by further fine-tuning in time on real corpous of indvi
  ## Model Architecture and Objective
  The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.

- ## Compute Infrastructure
- - **Hardware**: T4 GPU (Google Colab)
- - **Software**: PyTorch 1.8.1 with Transformers 4.8.2

  # Citation

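The README text touched by this commit describes the training data as short sentences (2-5 words) paired with noisy variants: typos injected, and spaces removed from both the correct and the typo'd versions. The sketch below is one possible way to build such pairs. It is not the project's actual pre-processing code (that lives in the linked dailyDialogCorrections repository). The names `QWERTY_NEIGHBOURS`, `keep_short`, `inject_typo`, `remove_spaces` and `build_pairs` are hypothetical, and the keyboard-adjacency table is a small illustrative subset.

```python
import random

# Hypothetical, partial QWERTY adjacency table (an assumption for illustration,
# not the table used by the project).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "s": "awedxz", "t": "rfgy", "n": "bhjm", "r": "edft",
}


def keep_short(sentences, min_words=2, max_words=5):
    """Keep sentences whose whitespace-delimited word count is within range."""
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


def inject_typo(text, rng):
    """Apply one random edit at a letter position.

    The edit is a transposition with the next character, a substitution with
    a neighbouring key, or a deletion (also used as the fallback).
    """
    letters = [i for i, ch in enumerate(text) if ch.isalpha()]
    if not letters:
        return text
    i = rng.choice(letters)
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "transpose" and i + 1 < len(text):
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if op == "substitute" and text[i].lower() in QWERTY_NEIGHBOURS:
        wrong_key = rng.choice(QWERTY_NEIGHBOURS[text[i].lower()])
        return text[:i] + wrong_key + text[i + 1:]
    return text[:i] + text[i + 1:]


def remove_spaces(text):
    """Compress a sentence by dropping every space character."""
    return text.replace(" ", "")


def build_pairs(sentences, seed=0):
    """Return (noisy_source, clean_target) pairs for seq2seq fine-tuning."""
    rng = random.Random(seed)
    pairs = []
    for target in keep_short(sentences):
        noisy = inject_typo(target, rng)
        for source in (noisy, remove_spaces(target), remove_spaces(noisy)):
            pairs.append((source, target))
    return pairs
```

Calling `build_pairs(["how are you today"])` would yield three sources for the same clean target: a typo'd sentence, the correct sentence without spaces, and the typo'd sentence without spaces. Homonym replacement and the external misspelling corpora mentioned in the README are left out of this sketch.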