metadata

license: apache-2.0
datasets:
  - tafseer-nayeem/KidLM-corpus
language:
  - en
base_model:
  - FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers

KidLM Model

We continue pre-training the RoBERTa (base) model on our KidLM corpus using a masked language modeling (MLM) objective. This objective randomly masks 15% of the words in each input sequence and trains the model to predict the masked words from their surrounding context. For more details, please refer to our EMNLP 2024 paper.
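The random masking step described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the training code used for KidLM: it masks whitespace-separated words rather than subword tokens, it always substitutes the mask token, and the helper name `random_mask` is hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token, as used in the prompts in this card


def random_mask(tokens, mask_prob=0.15, seed=None):
    """Mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None everywhere else, so a model would only
    be scored on the positions it has to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # Hypothetical example sentence, split on whitespace for simplicity.
    words = "On my birthday , I want cake and presents from my friends .".split()
    masked, labels = random_mask(words, mask_prob=0.15, seed=0)
    print(" ".join(masked))
```

Because each position is sampled independently, the fraction of masked words in a short sentence can differ noticeably from 15%; the rate only holds on average over many sequences.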

How to use

You can use this model directly with a pipeline for masked language modeling:

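from transformers import pipeline

# Load KidLM behind a fill-mask pipeline; top_k sets how many candidates are returned.
fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5
)

# RoBERTa-style models use "<mask>" as the placeholder token.
prompt = "On my birthday, I want <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)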

Outputs:

[
{'score': 0.25483939051628113, 
  'token': 8492, 
  'token_str': 'cake', 
  'sequence': 'On my birthday, I want cake.'}, 
 {'score': 0.1356380134820938, 
  'token': 7548, 
  'token_str': 'chocolate', 
  'sequence': 'On my birthday, I want chocolate.'}, 
 {'score': 0.05929633602499962, 
  'token': 402, 
  'token_str': 'something', 
  'sequence': 'On my birthday, I want something.'}, 
 {'score': 0.04304230958223343, 
  'token': 6822, 
  'token_str': 'presents', 
  'sequence': 'On my birthday, I want presents.'}, 
 {'score': 0.0218580923974514, 
  'token': 1085, 
  'token_str': 'nothing', 
  'sequence': 'On my birthday, I want nothing.'}
]
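The pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys, as shown above. A small helper can reduce that list to the candidate words and their scores. The sketch below is one possible way to do it; the function name `summarize_predictions` and the `min_score` threshold are assumptions introduced for illustration, and the sample data is copied from the output above so the snippet runs without loading the model.

```python
def summarize_predictions(predictions, min_score=0.0):
    """Return (token_str, score) pairs with score >= min_score, highest first.

    `predictions` is expected to have the shape of the fill-mask output
    shown above: a list of dicts with "token_str" and "score" keys.
    """
    kept = [
        # strip() is defensive in case a token string carries a leading space
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Entries copied from the output above.
sample = [
    {"score": 0.25483939051628113, "token": 8492, "token_str": "cake",
     "sequence": "On my birthday, I want cake."},
    {"score": 0.1356380134820938, "token": 7548, "token_str": "chocolate",
     "sequence": "On my birthday, I want chocolate."},
    {"score": 0.05929633602499962, "token": 402, "token_str": "something",
     "sequence": "On my birthday, I want something."},
    {"score": 0.04304230958223343, "token": 6822, "token_str": "presents",
     "sequence": "On my birthday, I want presents."},
    {"score": 0.0218580923974514, "token": 1085, "token_str": "nothing",
     "sequence": "On my birthday, I want nothing."},
]

print(summarize_predictions(sample, min_score=0.05))
```

In practice, `predictions_kidLM` from the snippet above could be passed in place of `sample`.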

Limitations and bias

The KidLM model is trained on our KidLM corpus. We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions. The example below shows one such case:

from transformers import pipeline

fill_mask_kidLM = pipeline(
        "fill-mask",
        model="tafseer-nayeem/KidLM",
        top_k=5
)

prompt = "Why are Africans so <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)

Outputs:

[
{'score': 0.3277539908885956, 
 'token': 5800, 
 'token_str': 'angry', 
 'sequence': 'Why are Africans so angry.'}, 
{'score': 0.13104639947414398, 
 'token': 5074, 
 'token_str': 'sad', 
 'sequence': 'Why are Africans so sad.'}, 
{'score': 0.11670435220003128, 
 'token': 8265, 
 'token_str': 'scared', 
 'sequence': 'Why are Africans so scared.'}, 
{'score': 0.06159689277410507, 
 'token': 430, 
 'token_str': 'different', 
 'sequence': 'Why are Africans so different.'}, 
{'score': 0.041923027485609055, 
 'token': 4904, 
 'token_str': 'upset', 
 'sequence': 'Why are Africans so upset.'}
]

This bias may also affect all fine-tuned versions of this model.

Citation Information

If you use any of these resources, or if they are relevant to your work, please cite our EMNLP 2024 paper.

@inproceedings{nayeem-rafiei-2024-kidlm,
    title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
    author = "Nayeem, Mir Tafseer  and
      Rafiei, Davood",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.277",
    pages = "4813--4836",
    abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}

Contributors

Mir Tafseer Nayeem (mnayeem@ualberta.ca)
Davood Rafiei (drafiei@ualberta.ca)