primeline/whisper-large-v3-turbo-german

12 days ago

Your German turbo model works very well* in faster-whisper:

no hallucinations
no repetitions
no errors in upper/lower case letters
very fast, almost as fast as the new OpenAI Turbo Model

But there is a “ß” problem. Some examples:

wrong “Massstab”, correct would be “Maßstab”
wrong “Grösse”, correct would be “Größe”
...

I found about 150 different misspelled words of this kind in about 15 hours of video.
Is it possible that your additional dataset was not German, but Swiss**-German?

* with "condition_on_previous_text" = True :)
** in Switzerland there is no “ß”, it is replaced by “ss”
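A count like the one above could be approximated with a small script. This is only a rough sketch: the lexicon `ESZETT_WORDS` and the function `suspect_words` are made up for illustration, and a real check would need a full German word list.

```python
import re
from collections import Counter

# Tiny illustrative lexicon of correct "ß" spellings (an assumption;
# a real check would load a complete German word list).
ESZETT_WORDS = {"maßstab", "größe", "straße", "außen"}

WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def suspect_words(transcript, lexicon=ESZETT_WORDS):
    """Count words that turn into a known "ß" word when one "ss" is replaced."""
    hits = Counter()
    for word in WORD_RE.findall(transcript):
        lower = word.lower()
        # Try every "ss" position, since "Massstab" contains "sss".
        for match in re.finditer(r"(?=ss)", lower):
            i = match.start()
            candidate = lower[:i] + "ß" + lower[i + 2:]
            if candidate in lexicon:
                hits[word] += 1
                break
    return hits


print(suspect_words("Der Massstab und die Grösse müssen stimmen"))
```

Words such as "müssen" are left alone because "müßen" is not in the lexicon, so correct "ss" spellings are not flagged.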

flozi00

primeLine AI Services org 11 days ago

Yeah, I think that can be due to a typo in the preprocessing of the training dataset.
This can occur while casting everything into Unicode characters and then converting some chars like ß, ü, ö, ä etc. back.
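To illustrate how such a round trip can lose the "ß", here is a hypothetical sketch. It is not the actual preprocessing code; the tables `TO_ASCII` and `FROM_ASCII` and the function names are assumptions made up for this example.

```python
# Hypothetical transliteration tables, NOT the real preprocessing code.
TO_ASCII = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

# The reverse table restores the umlauts but has no entry for "ss",
# because "ss" is ambiguous (compare "Masse" and "Maße").
FROM_ASCII = {"ae": "ä", "oe": "ö", "ue": "ü"}


def to_ascii(text):
    """Replace umlauts and "ß" with their ASCII transliterations."""
    for src, dst in TO_ASCII.items():
        text = text.replace(src, dst)
    return text


def from_ascii(text):
    """Map the ASCII digraphs back to umlauts; "ss" stays untouched."""
    for src, dst in FROM_ASCII.items():
        text = text.replace(src, dst)
    return text


def round_trip(text):
    return from_ascii(to_ascii(text))


print(round_trip("Größe"))    # Grösse
print(round_trip("Maßstab"))  # Massstab
```

With these assumed tables the umlauts survive the round trip while every "ß" comes back as "ss", which is the same pattern as in the examples reported above.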

jschoene

10 days ago

edited 10 days ago

I had a look at the dataset you used, "asr-german-mixed-evals":

common_voice_19_0
multilingual librispeech
Tuda-De

All three datasets contain wrong "ss" chars, and parts of the first two sometimes use (very) outdated German.

In the first 35 entries (column "references") I found:

wrong "ss", should be "ß":

liess -> ließ
verschloss -> verschloß
dass -> daß
mussten -> mußten
dass -> daß
Schweiss -> Schweiß
Grosse -> Große
aussen -> außen
äusserlicher -> äußerlicher

wrong spelling (missing char) and outdated German:

wusst -> wußte ('e' missing)
müsst -> müßte ('e' missing)
offenliess -> offen ließ (neue Rechtschreibung)

correct "ss":

müssen
Sessel
unablässig
fallenzulassen
umfasste (ok, neue Rechtschreibung)

One would have to correct all 9799 entries! Since this will probably not happen, a model fine-tuned on this data is unfortunately not really suitable. A real pity.
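For what it is worth, a first automated pass over the "references" column could look roughly like the sketch below. The `CORRECTIONS` map only holds a few pairs from the list above, and `fix_reference` / `fix_all` are hypothetical helpers; a real cleanup would need a complete dictionary and manual review.

```python
import re

# A few illustrative pairs taken from the list above (not a complete dictionary).
CORRECTIONS = {
    "liess": "ließ",
    "Schweiss": "Schweiß",
    "Grosse": "Große",
    "aussen": "außen",
}

# Match only whole words so that correct spellings like "müssen" stay untouched.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b")


def fix_reference(text):
    """Replace known wrong "ss" spellings in one reference string."""
    return PATTERN.sub(lambda m: CORRECTIONS[m.group(1)], text)


def fix_all(references):
    """Return the corrected references and the number of changed entries."""
    fixed = [fix_reference(r) for r in references]
    changed = sum(a != b for a, b in zip(references, fixed))
    return fixed, changed


fixed, changed = fix_all(["Er liess die Tür aussen offen", "Wir müssen gehen"])
print(fixed, changed)
```

The changed-entry count gives a rough idea of how many of the 9799 entries such a word list would touch before any manual review.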


german or swiss-german