---
license: apache-2.0
datasets:
- tafseer-nayeem/KidLM-corpus
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers
---

## KidLM Model

We continue pre-training the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) using a masked language modeling (MLM) objective. This approach randomly masks 15% of the words in each input sequence and trains the model to predict the masked words from their surrounding context. For more details, please refer to our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

# Load KidLM in a fill-mask pipeline and return the five most likely fillers.
fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5
)

# The prompt must contain the RoBERTa mask token, <mask>.
prompt = "On my birthday, I want <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)
```

**Outputs:**

```JSON
[
{'score': 0.25483939051628113, 'token': 8492, 'token_str': 'cake', 'sequence': 'On my birthday, I want cake.'},
{'score': 0.1356380134820938, 'token': 7548, 'token_str': 'chocolate', 'sequence': 'On my birthday, I want chocolate.'},
{'score': 0.05929633602499962, 'token': 402, 'token_str': 'something', 'sequence': 'On my birthday, I want something.'},
{'score': 0.04304230958223343, 'token': 6822, 'token_str': 'presents', 'sequence': 'On my birthday, I want presents.'},
{'score': 0.0218580923974514, 'token': 1085, 'token_str': 'nothing', 'sequence': 'On my birthday, I want nothing.'}
]
```

## Limitations and bias

The training data used to build the KidLM model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal.
However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.

```python
from transformers import pipeline

fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5
)

# The prompt must contain the RoBERTa mask token, <mask>.
prompt = "Why are Africans so <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)

# Output:
[
{'score': 0.3277539908885956, 'token': 5800, 'token_str': 'angry', 'sequence': 'Why are Africans so angry.'},
{'score': 0.13104639947414398, 'token': 5074, 'token_str': 'sad', 'sequence': 'Why are Africans so sad.'},
{'score': 0.11670435220003128, 'token': 8265, 'token_str': 'scared', 'sequence': 'Why are Africans so scared.'},
{'score': 0.06159689277410507, 'token': 430, 'token_str': 'different', 'sequence': 'Why are Africans so different.'},
{'score': 0.041923027485609055, 'token': 4904, 'token_str': 'upset', 'sequence': 'Why are Africans so upset.'}
]
```

This bias may also affect all fine-tuned versions of this model.

## Citation Information

If you use any of these resources, or if they are relevant to your work, please cite our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
```
@inproceedings{nayeem-rafiei-2024-kidlm,
    title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
    author = "Nayeem, Mir Tafseer and
      Rafiei, Davood",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.277",
    pages = "4813--4836",
    abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}
```

## Contributors

- Mir Tafseer Nayeem (mnayeem@ualberta.ca)
- Davood Rafiei (drafiei@ualberta.ca)
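
## Illustrative masking sketch

The random masking described in the KidLM Model section can be illustrated with a small standalone sketch. This is a simplification under stated assumptions, not the authors' training code: it splits on whitespace rather than using RoBERTa's subword tokenizer, and it always substitutes the mask token, whereas the standard BERT-style recipe replaces only 80% of the selected tokens with the mask token (10% become a random token and 10% stay unchanged). The names `mask_words`, `MASK_TOKEN`, and `MASK_PROB` are hypothetical and chosen only for this example.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token
MASK_PROB = 0.15       # masking rate stated in this model card


def mask_words(text, mask_prob=MASK_PROB, seed=None):
    """Randomly replace whitespace-separated words with the mask token.

    Returns the masked text and a dict mapping word positions to the
    original words, i.e. the targets an MLM objective would predict.
    Assumption: word-level masking stands in for subword-level masking.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, word in enumerate(text.split()):
        if rng.random() < mask_prob:
            # Hide this word and remember it as a prediction target.
            masked.append(MASK_TOKEN)
            targets[i] = word
        else:
            masked.append(word)
    return " ".join(masked), targets


if __name__ == "__main__":
    sentence = "On my birthday, I want cake and presents from my friends."
    masked_text, targets = mask_words(sentence, seed=0)
    print(masked_text)
    print(targets)
```

Because each word is masked independently with probability 0.15, short sentences will often come back with no masked words at all; the 15% rate only holds on average over a large corpus.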