---
language:
- mn
base_model: openai/whisper-medium
library_name: transformers
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: Common Voice sample 1
  src: sample1.flac
- example_title: Common Voice sample 2
  src: sample2.flac
model-index:
- name: whisper-medium-mn
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0
      type: common_voice_17_0
      config: mn
      split: test
      args:
        language: mn
    metrics:
    - name: Test WER
      type: wer
      value: 12.9580
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---


	# Whisper Medium Mn - Erkhembayar Gantulga

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) for Mongolian automatic speech recognition, trained on the Common Voice 17.0 and Google Fleurs datasets.
It achieves the following results on the evaluation set:
- Loss: 0.1083
- WER: 12.9580
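
For reference, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for this model. It assumes whitespace tokenization with no text normalization, and that the reported value is scaled to a percentage; the function name `wer_percent` is hypothetical.

```python
def wer_percent(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: 100 * (substitutions + deletions + insertions) / N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Under this definition, one wrong word in a four-word reference gives a WER of 25.0.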

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed
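
As a starting point, a minimal transcription sketch with the Transformers `pipeline` API is shown below. It is an illustration under stated assumptions, not a documented usage recipe for this model: `MODEL_ID` is a hypothetical placeholder that must be replaced with the actual Hub repository id or a local checkpoint path, the helper name `transcribe` is hypothetical, and decoding an audio file from a path typically requires `ffmpeg` to be installed.

```python
# Hypothetical placeholder: replace with the real Hub id or a local checkpoint directory.
MODEL_ID = "path/to/whisper-medium-mn"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file to Mongolian text with a Whisper checkpoint."""
    # Imported lazily so this module can be loaded without transformers/torch installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second windows
    )
    # Force Mongolian transcription instead of relying on language detection.
    result = asr(
        audio_path,
        generate_kwargs={"language": "mongolian", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample1.flac")` would then return the transcript as a string, assuming the file exists locally and `MODEL_ID` points at a valid checkpoint.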

	## Training and evaluation data

	Datasets used for training:
	- [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
	- [Google Fleurs](https://huggingface.co/datasets/google/fleurs)

The Mongolian subsets of both datasets were combined into a single training set and a single test set:

```python
from datasets import Audio, DatasetDict, concatenate_datasets, load_dataset

# Common Voice 17.0, Mongolian ("mn"). `token=True` uses the locally stored
# Hugging Face token (it replaces the deprecated `use_auth_token` argument).
common_voice = DatasetDict()
common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_17_0", "mn", split="train+validation+validated", token=True
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_17_0", "mn", split="test", token=True
)

# Resample to 16 kHz, the sampling rate Whisper expects.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

# Drop metadata columns, keeping only "audio" and "sentence".
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes", "variant"]
)

# Google Fleurs, Mongolian ("mn_mn").
google_fleurs = DatasetDict()
google_fleurs["train"] = load_dataset("google/fleurs", "mn_mn", split="train+validation", token=True)
google_fleurs["test"] = load_dataset("google/fleurs", "mn_mn", split="test", token=True)

google_fleurs = google_fleurs.remove_columns(
    ["id", "num_samples", "path", "raw_transcription", "gender", "lang_id", "language", "lang_group_id"]
)
# Match the Common Voice transcript column name so the datasets can be concatenated.
google_fleurs = google_fleurs.rename_column("transcription", "sentence")

# Merge the two corpora split by split.
dataset = DatasetDict()
dataset["train"] = concatenate_datasets([common_voice["train"], google_fleurs["train"]])
dataset["test"] = concatenate_datasets([common_voice["test"], google_fleurs["test"]])
```

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 100
	- training_steps: 4000
	- mixed_precision_training: Native AMP
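
To make the schedule concrete, the sketch below computes the learning rate at a given step from the values listed above. It assumes the `linear` scheduler has the usual shape of a linear warmup from zero followed by a linear decay to zero at the final step; the function name `linear_lr` is hypothetical and this is not the training code itself.

```python
LEARNING_RATE = 1e-5     # learning_rate
WARMUP_STEPS = 100       # lr_scheduler_warmup_steps
TRAINING_STEPS = 4000    # training_steps


def linear_lr(step: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp from 0 up to the peak learning rate.
        scale = step / max(1, WARMUP_STEPS)
    else:
        # Decay: ramp from the peak down to 0 at the last training step.
        scale = max(0.0, (TRAINING_STEPS - step) / max(1, TRAINING_STEPS - WARMUP_STEPS))
    return LEARNING_RATE * scale
```

Under these assumptions the rate peaks at 1e-05 at step 100 and falls back to half of that at step 2050, midway through the decay phase.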

	### Training results

| Training Loss | Epoch  | Step | Validation Loss | WER     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.2986        | 0.4912 | 500  | 0.3557          | 40.1515 |
| 0.2012        | 0.9823 | 1000 | 0.2310          | 28.3512 |
| 0.099         | 1.4735 | 1500 | 0.1864          | 23.4453 |
| 0.0733        | 1.9646 | 2000 | 0.1405          | 18.3024 |
| 0.0231        | 2.4558 | 2500 | 0.1308          | 16.5645 |
| 0.0191        | 2.9470 | 3000 | 0.1155          | 14.5569 |
| 0.0059        | 3.4381 | 3500 | 0.1122          | 13.4728 |
| 0.006         | 3.9293 | 4000 | 0.1083          | 12.9580 |
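
The step and epoch columns also allow a rough estimate of the training set size. The short calculation below is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (neither is stated in this card), so that each step consumes one batch of 16 examples. All variable names are hypothetical.

```python
TRAIN_BATCH_SIZE = 16  # train_batch_size from the hyperparameter list

# (step, epoch, WER) for the first and last evaluation rows of the results table.
first_row = (500, 0.4912, 40.1515)
last_row = (4000, 3.9293, 12.9580)

# Optimizer steps per pass over the training data.
steps_per_epoch = round(last_row[0] / last_row[1])

# Approximate number of training examples (the final batch of an epoch may be partial).
approx_train_examples = steps_per_epoch * TRAIN_BATCH_SIZE

# Relative WER reduction between the first and last evaluation.
relative_wer_reduction = (first_row[2] - last_row[2]) / first_row[2]
```

This gives about 1018 steps per epoch, or roughly 16,000 training examples under the stated assumptions, and a relative WER reduction of about 68% between step 500 and step 4000.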


	### Framework versions

	- Transformers 4.44.0
- PyTorch 2.3.1+cu121
	- Datasets 2.21.0
	- Tokenizers 0.19.1