Under-100M-parameter model for detecting 20 Marathi numbers?

#1
by MartialTerran - opened

I read some of your comments in the Whisper community. I am wondering whether you achieved your objective of running a number detector on a small device. Have you produced or obtained a Whisper-type model with fewer than 100 million parameters that can detect the 20 Marathi numbers?

I have some questions:

  1. For whisper-tiny-mr, it seems you worked with (fine-tuned?) a pretrained model downloaded from somewhere else. Where did you get the pretrained "whisper-tiny-mr"? URL?

  2. Why did you pretrain/fine-tune a model with a 50,000+ token vocabulary? It seems that you only needed a vocabulary of a little more than 20 tokens (your 20 Marathi numbers). Did you answer the other guy's question: "Or is this in the context of speech recognition? i.e. someone speaks a full sentence, one word of which is a number that you want to transcribe correctly." You said [@sanchit-gandhi our case is simple earlier one of audio classification. This is the only thing we are doing with the model so don't know any other thing about the model.] Does that mean you only want a 20-logit classification model, used only for detecting the 20 Marathi numbers?

  3. What are all of those two-letter "Special Tokens" in https://huggingface.co/shripadbhat/whisper-tiny-mr/blob/main/special_tokens_map.json for? Are those special tokens somehow representing your 20 Marathi numbers? If so, why did you implement the 20 Marathi numbers as Special Tokens?

  4. Have you found, produced or obtained a Whisper-type model with fewer than 100 million parameters that can detect the 20 Marathi numbers? Does this model also reliably refuse to output a number token when the input sound is not one of the 20 Marathi numbers (to prevent false positives)? Have you published code and weights for a small Whisper-type model that has both of these features?

  5. Why did you stop training at Loss: 0.4618? That does not seem to be a highly reliable model checkpoint if you are only verifying loss for 20 tokens. Why not train until the loss is below 0.01?

Loss: 0.4618
Wer: 41.6451
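
For context on the two numbers above, here is a rough sketch of what they measure. It assumes the reported loss is the mean per-token cross-entropy in nats and that WER is the usual word-level edit distance divided by the reference length, in percent. The helper names are hypothetical and are not taken from the training script.

```python
import math


def mean_token_probability(cross_entropy_loss: float) -> float:
    """Geometric-mean probability given to the correct token,
    assuming the loss is a mean cross-entropy measured in nats."""
    return math.exp(-cross_entropy_loss)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the words seen so far in ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(round(mean_token_probability(0.4618), 2))  # loss reported above
print(round(mean_token_probability(0.01), 2))    # loss proposed in the question
```

Under those assumptions, a loss of 0.4618 corresponds to an average probability of roughly 0.63 on the correct token, and a loss of 0.01 to roughly 0.99.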

Copied from the other chat on Hugging Face:

Hey @SameerMahajan ! To clarify, are you simply performing audio classification? i.e. you have an audio input where someone says a number, and you want to predict the number that they said. Or is this in the context of speech recognition? i.e. someone speaks a full sentence, one word of which is a number that you want to transcribe correctly.

Is the performance of the model otherwise good on Marathi? Am wondering whether you can fine-tune it for audio classification or speech recognition to boost Marathi performance as required (https://huggingface.co/blog/fine-tune-whisper)

SameerMahajan
Feb 17, 2023

@sanchit-gandhi our case is simple earlier one of audio classification. This is the only thing we are doing with the model so don't know any other thing about the model. Thanks for sharing your blog on fine tuning which we will take a look at to see whether it helps in our case.

sanchit-gandhi
Feb 22, 2023

edited Feb 22, 2023

Hey @SameerMahajan ! If you have a couple of hours of labelled audio-text data then you should definitely be able to fine-tune Whisper for this task!

One other thing you can try is using a Marathi model that has already been fine-tuned and see if it's any better: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=mr&split=test

The number one model has a demo built for it: https://huggingface.co/spaces/DrishtiSharma/Whisper-Marathi-Transcriber

IMO it's worth trying this out first and seeing how it performs.

You can also use the Python API to directly use the model in a Python script:

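import torch
from transformers import pipeline

# Fine-tuned Marathi Whisper checkpoint
model_id = "DrishtiSharma/whisper-large-v2-marathi"

# Use the first GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=device,
)

# Replace with the path to your audio file
audio = "PATH/TO/YOUR/AUDIO"

# Returns a dict; the transcription is under out["text"]
out = pipe(audio)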
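
If the pipeline is used for the number-detection case, its transcript still has to be mapped to a number. Below is a minimal sketch, assuming the transcription is available as `out["text"]` and that the numbers appear as Marathi words in Devanagari. The `MARATHI_NUMBERS` table (only 1 to 10 are shown) and the `find_number` helper are illustrative and are not part of the model or the thread.

```python
from typing import Optional

# Illustrative subset: the Marathi words for 1 to 10.
MARATHI_NUMBERS = {
    "एक": 1, "दोन": 2, "तीन": 3, "चार": 4, "पाच": 5,
    "सहा": 6, "सात": 7, "आठ": 8, "नऊ": 9, "दहा": 10,
}


def find_number(transcript: str) -> Optional[int]:
    """Return the first known number word in the transcript, or None."""
    for word in transcript.split():
        # Drop surrounding punctuation, including the Devanagari danda.
        token = word.strip(".,!?\"'।")
        if token in MARATHI_NUMBERS:
            return MARATHI_NUMBERS[token]
    return None


print(find_number("तीन"))       # a number word on its own
print(find_number("नमस्कार"))    # no number word, so None
```

Returning `None` when no number word is present gives a simple way to avoid reporting a number for unrelated speech, although it depends entirely on the transcript being accurate.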

SameerMahajan
Feb 22, 2023

Thanks @sanchit-gandhi, we will take a look. Do you have any example of retraining / fine-tuning the model with our own custom data (rather than your datasets, which have a somewhat complex structure)? The reason I am asking is that we have only numbers and some pre-recorded audio files (30-40 samples per number). We can just label them as "1", "2", etc., try retraining / fine-tuning, and see what we get.
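
One simple way to organise such custom data is sketched below. It assumes (this is not stated in the thread) that the recordings are stored as `<root>/<label>/<file>.wav`, for example `numbers/1/sample_01.wav`. The `build_manifest` and `split_manifest` helpers and the folder layout are hypothetical.

```python
import random
from pathlib import Path
from typing import List, Tuple

Manifest = List[Tuple[str, str]]


def build_manifest(root: str) -> Manifest:
    """Return (audio_path, label) pairs, using each sub-folder name as the label."""
    pairs: Manifest = []
    for label_dir in sorted(Path(root).iterdir()):
        if label_dir.is_dir():
            for wav in sorted(label_dir.glob("*.wav")):
                pairs.append((str(wav), label_dir.name))
    return pairs


def split_manifest(
    pairs: Manifest, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[Manifest, Manifest]:
    """Shuffle once with a fixed seed, then hold out a fraction for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

With only 30-40 samples per number, it may be worth splitting per label instead, so that every number appears in both the training and the evaluation set.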

SameerMahajan
Feb 22, 2023

@sanchit-gandhi one problem with this fine-tuned Marathi model is that it is very large (6.17 GB as shown during download). Our use case (https://youtu.be/L3L4mEszzTs) requires us to build an offline Android app, which typically cannot exceed a couple of hundred MB.
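
A back-of-the-envelope check of the size concern: on-disk size is roughly the parameter count times the bytes per parameter. The parameter counts below are the approximate published figures for Whisper (about 1.55 billion for large-v2 and about 39 million for tiny). The 100M rows are the budget from the thread title, and the int8 row assumes one byte per weight and ignores any quantization overhead.

```python
def model_size_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate weight size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


estimates = {
    "whisper-large-v2, fp32": model_size_mb(1.55e9),
    "whisper-tiny, fp32": model_size_mb(39e6),
    "100M-parameter budget, fp32": model_size_mb(100e6),
    "100M-parameter budget, int8": model_size_mb(100e6, bytes_per_param=1),
}

for name, size_mb in estimates.items():
    print(f"{name}: ~{size_mb:,.0f} MB")
```

The fp32 estimate for large-v2 (about 6,200 MB) is consistent with the 6.17 GB download mentioned above, while a tiny-sized model at fp32 would come to roughly 156 MB.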

We are getting good results with custom trained models.

Subject: Inquiry about a Minimal Audio Tokenization Model for 20-Word Vocabulary

Hello!

I'm exploring the possibility of using audio tokenization for a small, battery-powered "remote control" application. My initial goal is to recognize a limited vocabulary of just 20 words (specifically, numerals). I'm new to audio tokenization, so I'd greatly appreciate your expertise in helping me get started.

To make this feasible for a resource-constrained device, I'm interested in the smallest possible model that can achieve reasonable accuracy for this task. Could you please share some information on a minimal model setup suitable for this purpose? In particular, I'm looking for the following details:

  1. Hyperparameters and Training Data: What are the recommended hyperparameters and approximate training data size needed to effectively train a model for this 20-word recognition task? Specifically, could you share details on the audio tokenization parameters (e.g., sampling rate, window size, stride, etc.)?
  2. Model Architecture and Dimensions:
    • Are you basing this on a standard Whisper_model.py architecture, or is there a more streamlined approach for such a small vocabulary? If it's a modified Whisper, could you elaborate on the changes or provide some example code snippets?
    • What is the input embedding dimension of the recommended model?
    • What is the output lm_head dimension, and is it tied to the input embedding dimension?
    • What is the overall vocabulary size, including special tokens like start-of-sequence, end-of-sequence, padding, etc.?
  3. PyTorch Model and TorchScript Conversion: I'd like to deploy the model on my device, so it would be fantastic if you could provide a link to a standalone model.py file (or equivalent) containing the PyTorch model definition. Ideally, this model would be easily convertible to TorchScript for optimized on-device inference.

Thank you in advance for your time and assistance! I'm excited to learn more about this area and build a functional prototype. Any guidance you can provide would be extremely helpful.
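
As a reference point for the questions above, here is a sketch of the standard Whisper audio front-end and of what the token embedding costs. The values are the commonly documented Whisper defaults (16 kHz audio, a 400-sample window, a 160-sample hop, 80 mel bins, and for the tiny multilingual model a width of 384 with a 51,865-token vocabulary). The 24-entry small vocabulary (20 numbers plus 4 special tokens) is a hypothetical choice for illustration only.

```python
SAMPLE_RATE = 16_000   # samples per second
N_FFT = 400            # analysis window, in samples
HOP_LENGTH = 160       # stride between windows, in samples
N_MELS = 80            # mel filterbank channels

window_ms = 1000 * N_FFT / SAMPLE_RATE      # window length in milliseconds
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE    # stride in milliseconds


def num_frames(duration_s: float) -> int:
    """Number of spectrogram frames produced for a clip of this duration."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a vocab_size x d_model token embedding table."""
    return vocab_size * d_model


print(window_ms, hop_ms)               # window and stride in ms
print(num_frames(1.0))                 # frames for a one-second clip
print(embedding_params(51_865, 384))   # full Whisper vocabulary at width 384
print(embedding_params(24, 384))       # hypothetical 24-token vocabulary
```

Whisper reuses the token embedding as its output projection, so shrinking the vocabulary shrinks both ends at once. At a width of 384 the full vocabulary accounts for about 19.9M parameters, compared with about 9K for a 24-token vocabulary.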

@MartialTerran Architecturally, you can start with the Google Speech Commands model. You can check out our models built on it at https://www.kaggle.com/models/sameersmahajan/marathi-numbers to see how you can go about it. You can find a more detailed discussion of the problem, along with some code, in my GitHub repo https://github.com/sameermahajan/ML-Audio-Models
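
For a classifier of this kind, the false-positive question raised earlier in the thread can be handled at the output: add an extra "unknown" class and refuse to answer unless a number class wins with high confidence. Below is a minimal sketch, assuming the model emits 21 logits (20 numbers plus one hypothetical unknown class at index 20) and using an arbitrary threshold of 0.8. None of these values come from the linked models.

```python
import math
from typing import List, Optional

NUM_CLASSES = 21      # assumption: 20 number classes + 1 "unknown" class
UNKNOWN_INDEX = 20    # assumption: the unknown class is the last logit


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_number(logits: List[float], threshold: float = 0.8) -> Optional[int]:
    """Return the winning class index (0-19), or None to reject the input."""
    if len(logits) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(NUM_CLASSES), key=probs.__getitem__)
    if best == UNKNOWN_INDEX or probs[best] < threshold:
        return None
    return best
```

The threshold trades missed detections against false positives and would need tuning on held-out recordings, including audio that contains none of the 20 numbers.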
