About the dataset
Hello brother, it would be really helpful if you could share the details of the dataset you used for finetuning. I have been struggling to find a good Malayalam dataset.
Hey!
There are some really awesome Malayalam datasets available.
Here's a list of a few:
- https://huggingface.co/datasets/oscar-corpus/oscar
The Malayalam portion of the OSCAR dataset is a fun find!
- https://huggingface.co/ai4bharat
AI4Bharat has a really good collection of datasets, a portion of which is in Malayalam.
- https://huggingface.co/datasets/VishnuPJ/Alpaca_Instruct_Malayalam
Vishnu P J's Alpaca Malayalam Instruction dataset is another treasure trove.
- https://huggingface.co/datasets/VishnuPJ/Malayalam_CultureX_IndicCorp_SMC
Another one from Vishnu; this is by far the biggest Malayalam dataset I've seen.
- https://huggingface.co/datasets/uonlp/CulturaX/tree/main/ml
CulturaX is another dataset with a Malayalam subset.
- https://gitlab.com/smc/corpus
The corpus from SMC (Swathanthra Malayalam Computing) is another awesome dataset. You'll need to parse it to suit your own needs.
- https://huggingface.co/datasets/animaRegem/bad_malayalam_dataset
My bad_malayalam_dataset is a parsed version of the SMC corpus available on GitLab. It lacks considerable preprocessing, hence the name.
- https://www.kaggle.com/datasets/disisbig/malyalam-news-dataset
Even this dataset of Malayalam news headlines is a great find.
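Since some of these corpora need parsing or extra preprocessing before they're usable, here's a minimal sketch of one possible first cleanup pass: keep only lines that are mostly Malayalam script, normalize them, and drop exact duplicates. This is not the pipeline used for any of the datasets above. The helper names (`malayalam_ratio`, `clean_lines`) and the thresholds (`min_ratio=0.6`, `min_chars=10`) are assumptions for illustration, so tune them on your own data.

```python
import unicodedata

# The Malayalam Unicode block spans U+0D00 to U+0D7F.
MALAYALAM_START, MALAYALAM_END = 0x0D00, 0x0D7F


def malayalam_ratio(text: str) -> float:
    """Fraction of letters and combining marks that are in the Malayalam block."""
    # Only letters (L*) and marks (M*) count; digits, spaces and punctuation are ignored.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    in_block = sum(MALAYALAM_START <= ord(ch) <= MALAYALAM_END for ch in relevant)
    return in_block / len(relevant)


def clean_lines(lines, min_ratio=0.6, min_chars=10):
    """Yield normalized, de-duplicated lines that are mostly Malayalam.

    min_ratio and min_chars are illustrative defaults, not recommended values.
    """
    seen = set()
    for line in lines:
        # NFC normalization so canonically equivalent sequences compare equal.
        text = unicodedata.normalize("NFC", line).strip()
        if len(text) < min_chars:
            continue  # too short to be useful
        if malayalam_ratio(text) < min_ratio:
            continue  # mostly non-Malayalam (e.g. English or markup)
        if text in seen:
            continue  # exact duplicate
        seen.add(text)
        yield text


if __name__ == "__main__":
    sample = [
        "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്",
        "This is an English sentence.",
        "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്",
    ]
    for kept in clean_lines(sample):
        print(kept)
```

Because `clean_lines` is a generator, you can feed it a file handle or a streamed dataset column without loading everything into memory, although the `seen` set still grows with the number of unique lines kept.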
I hope this helps!
Thank you very much. Looking forward to using the dataset you have provided.
But don't you think Malayalam lacks real-world data, like different slangs?
To be fair, yes. Open data that covers different slangs and such is quite hard to get, though not necessarily impossible for huge companies. I guess if we could collaborate with Manglish keyboard or something else that has awesome data, it'd be great. Overall, the lack of real-world data is a huge problem, especially considering the inherent diversity of the Malayalam language. Still, I'd say it would be an awesome win if we could, at the very least, make an excellent model that works on print language itself.
True