---
license: apache-2.0
datasets:
- tafseer-nayeem/KidLM-corpus
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers
---

## KidLM Model

We continually pre-train the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) using a masked language modeling (MLM) objective: 15% of the words in each input sequence are randomly masked, and the model is trained to predict them from the surrounding context. For more details, please refer to our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
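
For reference, the snippet below is a minimal sketch of how such continued pre-training can be set up with the 🤗 Transformers `Trainer`. It assumes the corpus exposes a plain `text` column and a `train` split; the hyperparameters shown are illustrative placeholders, not the exact configuration used for the released checkpoint.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the original RoBERTa (base) checkpoint.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/roberta-base")

# Assumes a "train" split with a "text" column (illustrative).
dataset = load_dataset("tafseer-nayeem/KidLM-corpus", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamically mask 15% of tokens in each batch (standard MLM objective).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters below are placeholders, not the values used for this model.
args = TrainingArguments(
    output_dir="kidlm-mlm",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```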

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

fill_mask_kidLM = pipeline(
        "fill-mask",
        model="tafseer-nayeem/KidLM",
        top_k=5
)

prompt = "On my birthday, I want <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)
```

**Outputs:**

```JSON
[
{'score': 0.25483939051628113, 
  'token': 8492, 
  'token_str': 'cake', 
  'sequence': 'On my birthday, I want cake.'}, 
 {'score': 0.1356380134820938, 
  'token': 7548, 
  'token_str': 'chocolate', 
  'sequence': 'On my birthday, I want chocolate.'}, 
 {'score': 0.05929633602499962, 
  'token': 402, 
  'token_str': 'something', 
  'sequence': 'On my birthday, I want something.'}, 
 {'score': 0.04304230958223343, 
  'token': 6822, 
  'token_str': 'presents', 
  'sequence': 'On my birthday, I want presents.'}, 
 {'score': 0.0218580923974514, 
  'token': 1085, 
  'token_str': 'nothing', 
  'sequence': 'On my birthday, I want nothing.'}
]
```
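
If you prefer to work with the model directly rather than through the pipeline (for example, to inspect the raw logits), a sketch along the following lines should behave equivalently; variable names here are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("tafseer-nayeem/KidLM")
model = AutoModelForMaskedLM.from_pretrained("tafseer-nayeem/KidLM")

inputs = tokenizer("On my birthday, I want <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the five highest-scoring tokens.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_tokens = logits[0, mask_index].topk(5).indices[0]
print([tokenizer.decode(t).strip() for t in top_tokens])
```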

## Limitations and bias

The training data used to build the KidLM model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.

```python
from transformers import pipeline

fill_mask_kidLM = pipeline(
        "fill-mask",
        model="tafseer-nayeem/KidLM",
        top_k=5
)

prompt = "Why are Africans so <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)

print(predictions_kidLM)
```

**Outputs:**

```JSON
[
{'score': 0.3277539908885956, 
  'token': 5800, 
  'token_str': 'angry', 
  'sequence': 'Why are Africans so angry.'}, 
 {'score': 0.13104639947414398, 
  'token': 5074, 
  'token_str': 'sad', 
  'sequence': 'Why are Africans so sad.'}, 
 {'score': 0.11670435220003128, 
  'token': 8265, 
  'token_str': 'scared', 
  'sequence': 'Why are Africans so scared.'}, 
 {'score': 0.06159689277410507, 
  'token': 430, 
  'token_str': 'different', 
  'sequence': 'Why are Africans so different.'}, 
 {'score': 0.041923027485609055, 
  'token': 4904, 
  'token_str': 'upset', 
  'sequence': 'Why are Africans so upset.'}
]
```

This bias may also affect all fine-tuned versions of this model.


## Citation Information

If you use any of these resources or they are relevant to your work, please cite our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).

```
@inproceedings{nayeem-rafiei-2024-kidlm,
    title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
    author = "Nayeem, Mir Tafseer  and
      Rafiei, Davood",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.277",
    pages = "4813--4836",
    abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}
```

## Contributors
- Mir Tafseer Nayeem (mnayeem@ualberta.ca)
- Davood Rafiei (drafiei@ualberta.ca)