
Model Arriving Soon, Still Training

Model Card: bert-dutch-finetuned

Model Description

Model Name: bert-dutch-finetuned
Model Type: BERT (Bidirectional Encoder Representations from Transformers)
Base Model: bert-base-cased
Language: Dutch (Nederlands)
Task: Masked Language Modeling (MLM), Text Classification, and other NLP tasks.

This model is a fine-tuned version of the bert-base-cased model, adapted specifically for the Dutch language. It was trained on a large Dutch corpus drawn from OSCAR and other Dutch datasets. The model can understand and generate text in Dutch and can be fine-tuned further for specific downstream NLP tasks such as Named Entity Recognition (NER) and Sentiment Analysis.

Intended Use

The bert-dutch-finetuned model can be used for various NLP tasks in Dutch, including:

  • Masked Language Modeling (MLM)
  • Text Classification
  • Named Entity Recognition (NER)
  • Question Answering (QA)
  • Text Summarization

This model is ideal for researchers and practitioners working on Dutch NLP applications.

How to Use

To use this model with the Hugging Face transformers library:

from transformers import BertTokenizer, BertForMaskedLM

# Load model and tokenizer
tokenizer = BertTokenizer.from_pretrained("your-username/bert-dutch-finetuned")
model = BertForMaskedLM.from_pretrained("your-username/bert-dutch-finetuned")

# Example usage: run a forward pass; outputs.logits holds the model's
# token predictions over the vocabulary
inputs = tokenizer("Dit is een voorbeeldzin in het Nederlands.", return_tensors="pt")
outputs = model(**inputs)
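
The forward pass above only returns raw logits. As a minimal sketch of actually filling a masked token with the model and tokenizer loaded above (the prediction in the final comment is only what a well-trained Dutch model would ideally produce):

import torch

# Sentence with one masked token: "Amsterdam is de [MASK] van Nederland."
text = f"Amsterdam is de {tokenizer.mask_token} van Nederland."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # ideally something like "hoofdstad"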

Training Data

The model was trained on a large Dutch corpus consisting of various publicly available datasets, such as:

  • OSCAR (Open Super-large Crawled ALMAnaCH Corpus): A multilingual corpus obtained by language classification and filtering of the Common Crawl dataset.
  • Dutch Wikipedia Dumps: A collection of Dutch Wikipedia pages.

The training data includes diverse text types, covering a wide range of topics to ensure robust language understanding.
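
For illustration, the two corpora above can be loaded with the datasets library. The exact dump versions and configurations used for training are not specified in this card, so the dataset identifiers below are assumptions:

from datasets import load_dataset, concatenate_datasets

# Assumed dataset ids/configs; the OSCAR script may additionally require
# trust_remote_code=True depending on your datasets version.
oscar_nl = load_dataset("oscar", "unshuffled_deduplicated_nl", split="train")
wiki_nl = load_dataset("wikimedia/wikipedia", "20231101.nl", split="train")

# Combine into one Dutch text corpus (both datasets expose a "text" column).
corpus = concatenate_datasets(
    [oscar_nl.select_columns(["text"]), wiki_nl.select_columns(["text"])]
)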

Training Procedure

The model was fine-tuned using the following setup (a code sketch of this configuration follows the list):

  • Base Model: bert-base-cased
  • Training Objective: Masked Language Modeling (MLM)
  • Optimizer: AdamW
  • Learning Rate: 5e-5
  • Batch Size: 8
  • Epochs: 3
  • Hardware Used: A GPU-enabled environment (NVIDIA V100)
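
A minimal reconstruction of this setup with the Trainer API might look as follows. It assumes the combined `corpus` from the Training Data sketch above and is illustrative rather than the exact script used:

from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the English bert-base-cased checkpoint, as stated above.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# `corpus` is the combined Dutch dataset from the Training Data sketch.
tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# The collator implements the MLM objective by randomly masking tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-dutch-finetuned",
    learning_rate=5e-5,              # AdamW is the Trainer default optimizer
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()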

Evaluation

The model was evaluated on a validation set split from the same training corpus. The evaluation metrics included:

  • Perplexity for Masked Language Modeling
  • Accuracy for Text Classification tasks (if applicable)

The model performs well on standard Dutch text understanding tasks but might require further fine-tuning for specific downstream applications.
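
For reference, MLM perplexity is usually reported as the exponential of the average validation loss. A short sketch using the Trainer from the previous section, assuming a tokenized held-out split named eval_dataset:

import math

# eval_dataset is an assumed tokenized validation split from the same corpus.
metrics = trainer.evaluate(eval_dataset=eval_dataset)
perplexity = math.exp(metrics["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")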

Limitations and Biases

  • The model may exhibit biases present in the training data. This includes potential social biases or stereotypes embedded in large web-scraped datasets like OSCAR.
  • The model's performance is optimized for Dutch and may not generalize well to other languages.
  • It may not perform well on domain-specific tasks without additional fine-tuning.

Ethical Considerations

Users should be aware of the biases that might be present in the model outputs. It is recommended to conduct a bias assessment before deploying the model in sensitive applications, especially those related to decision-making.

Acknowledgments

This model was built using the Hugging Face transformers library and fine-tuned on the OSCAR and Dutch Wikipedia datasets. Special thanks to the creators and maintainers of these resources.

Citation

If you use this model in your research or applications, please consider citing:

@misc{bert-dutch-finetuned,
  author       = {DJ Ober},
  title        = {BERT Fine-Tuned for Dutch Language},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dober123/bert-dutch-finetuned}}
}

License: WTFPL
