German or Swiss German?
Your German turbo model works very well* in faster-whisper:
- no hallucinations
- no repetitions
- no errors in upper/lower case letters
- very fast, almost as fast as the new OpenAI Turbo Model
But there is a “ß” problem. Some examples:
- wrong “Massstab”, correct would be “Maßstab”
- wrong “Grösse”, correct would be “Größe”
- ...
I found about 150 different words misspelled this way in about 15 hours of video.
Is it possible that your additional dataset was not German, but Swiss** German?
* with "condition_on_previous_text" = True :)
** In Switzerland there is no “ß”; it is replaced by “ss”.
Yeah, I think that could be due to a mistake in the preprocessing of the training dataset.
This can occur when everything is cast to Unicode characters and some characters like ß, ü, ö, ä etc. are converted back.
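To illustrate how a normalization step could produce exactly this symptom, here is a minimal Python sketch. It is only an assumption about the kind of step that might be involved, not the actual preprocessing code used for the dataset. It shows that plain Unicode normalization keeps "ß", while case folding or an upper/lower round trip silently turns it into "ss" and leaves the umlauts alone.

```python
import unicodedata

text = "Maßstab, Größe, außen"

# Unicode normalization (NFC) alone keeps "ß" intact.
nfc = unicodedata.normalize("NFC", text)

# Case folding, sometimes used instead of lower(), maps "ß" to "ss".
folded = text.casefold()

# An upper/lower round trip does the same, because "ß".upper() is "SS".
round_trip = text.upper().lower()

print(nfc)
print(folded)
print(round_trip)
```

Note that in both lossy variants "ö" survives while "ß" does not, which would match spellings like "Grösse".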
I had a look at the dataset you used, "asr-german-mixed-evals":
- common_voice_19_0
- multilingual librispeech
- Tuda-De
All three datasets contain wrong "ss" spellings, and parts of the first two sometimes use (very) outdated German.
In the first 35 entries (column "references") I found:
wrong "ss", should be "ß":
- liess -> ließ
- verschloss -> verschloß
- dass -> daß
- mussten -> mußten
- dass -> daß
- Schweiss -> Schweiß
- Grosse -> Große
- aussen -> außen
- äusserlicher -> äußerlicher
wrong spelling (missing character) and outdated German:
- wusst -> wußte ('e' missing)
- müsst -> müßte ('e' missing)
- offenliess -> offen ließ (new spelling rules)
correct "ss":
- müssen
- Sessel
- unablässig
- fallenzulassen
- umfasste (ok under the new spelling rules)
All 9799 entries would have to be corrected! Since this will probably not happen, a model fine-tuned on this data is unfortunately not really suitable. A real pity.
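For anyone who wants to estimate how many entries are affected, a rough scan could look like the sketch below. The mapping `SS_TO_ESZETT` and the function `find_candidates` are hypothetical names of my own, and the word list only contains a few examples from this thread. A blanket "ss" to "ß" replacement is not an option, because words like "müssen" or "Sessel" are correctly spelled with "ss", so this only flags known word forms for manual review.

```python
import re

# Hypothetical, hand-maintained mapping from "ss" spellings to the
# expected "ß" spellings. Only a few examples from this thread.
SS_TO_ESZETT = {
    "liess": "ließ",
    "Schweiss": "Schweiß",
    "Grosse": "Große",
    "aussen": "außen",
    "Massstab": "Maßstab",
    "Grösse": "Größe",
}

WORD_RE = re.compile(r"\w+")


def find_candidates(references):
    """Return (entry_index, found_word, suggested_word) for known misspellings."""
    hits = []
    for idx, sentence in enumerate(references):
        for word in WORD_RE.findall(sentence):
            if word in SS_TO_ESZETT:
                hits.append((idx, word, SS_TO_ESZETT[word]))
    return hits


# Made-up sample sentences standing in for the "references" column.
sample = [
    "Er liess die Tür offen.",
    "Wir müssen im Sessel sitzen.",
    "Der Massstab ist falsch.",
]

hits = find_candidates(sample)
for hit in hits:
    print(hit)
```

Words that are not in the mapping are simply not reported, so the result is a lower bound and depends entirely on how complete the word list is.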