metadata
license: apache-2.0
base_model: distilbert-base-uncased
tags:
  - generated_from_trainer
  - fill-mask
  - imdb
  - movie-reviews
  - sentiment-analysis
datasets:
  - imdb
metrics:
  - accuracy
  - loss
model-index:
  - name: distilbert-base-uncased-finetuned-imdb
    results: []
library_name: transformers
pipeline_tag: fill-mask

Model Description

This model is a fine-tuned version of distilbert-base-uncased, adapted to the IMDb movie reviews dataset so that it better predicts the vocabulary and expressions common in movie reviews. It is intended for masked language modeling (MLM): one token in a sentence is replaced with [MASK], and the model predicts the most likely tokens to fill the blank.
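To make the masking step concrete, here is a minimal sketch of how an input for the model could be prepared from an ordinary sentence. The helper name `mask_word` and the example word "superb" are illustrative assumptions, not part of this model or of the transformers library; `[MASK]` is the mask token used by the uncased BERT-style tokenizer.

```python
import re

MASK_TOKEN = "[MASK]"  # mask token of BERT-style uncased tokenizers


def mask_word(sentence: str, target: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first whole-word occurrence of `target` with the mask token.

    Hypothetical helper for illustration only.
    """
    pattern = r"\b" + re.escape(target) + r"\b"
    masked, n_replaced = re.subn(pattern, mask_token, sentence, count=1)
    if n_replaced == 0:
        raise ValueError(f"{target!r} not found in sentence")
    return masked


# "superb" is an arbitrary example word chosen for this sketch.
original = "The acting was superb, truly a remarkable performance."
print(mask_word(original, "superb"))
```

The resulting string contains exactly one `[MASK]` token, which is the form of input a fill-mask model expects.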

Intended Uses & Limitations

Intended Uses:

  • Text Completion: Predicting missing words in sentences from movie reviews or similar domains.
  • Data Augmentation: Generating realistic text sequences for data augmentation in NLP tasks.
  • Sentiment Analysis: The model is not a sentiment classifier itself, but it can serve as a domain-adapted starting point for further fine-tuning on sentiment analysis.

Limitations:

  • Domain Specificity: The model is fine-tuned on IMDb reviews and may not generalize well to other domains or types of text.
  • Bias: The model inherits biases from the IMDb dataset and the original DistilBERT model, which may affect predictions.

How to Use

You can use this model with the Hugging Face transformers library:

from transformers import pipeline

# Load the fill-mask pipeline (the model is downloaded on first use)
mask_filler = pipeline("fill-mask", model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate")

# Example usage: the input must contain the [MASK] token
text = "The movie was an absolute [MASK], leaving the audience in tears."
predictions = mask_filler(text)

# Each prediction is a dict with the filled-in sequence and its score
for pred in predictions:
    print(f"{pred['score']:.3f}  {pred['sequence']}")
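The fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. As a rough sketch of post-processing, the hypothetical helper below keeps only the candidates above a score threshold. The helper name, the threshold, and the sample values are assumptions made for illustration; the scores shown are placeholders, not real output from this model.

```python
from typing import Dict, List


def confident_fills(predictions: List[Dict], min_score: float = 0.1) -> List[str]:
    """Return candidate tokens scoring at least `min_score`, best first.

    Hypothetical helper; `predictions` is expected to have the shape of the
    fill-mask pipeline output (dicts with "score" and "token_str").
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked if p["score"] >= min_score]


# Placeholder data in the pipeline's output shape.
# Tokens and scores are invented for illustration, NOT real model output.
example_predictions = [
    {"score": 0.05, "token_str": "mess",
     "sequence": "the movie was an absolute mess, leaving the audience in tears."},
    {"score": 0.40, "token_str": "masterpiece",
     "sequence": "the movie was an absolute masterpiece, leaving the audience in tears."},
    {"score": 0.20, "token_str": "delight",
     "sequence": "the movie was an absolute delight, leaving the audience in tears."},
]

print(confident_fills(example_predictions))
```

In real use, `example_predictions` would be replaced by the list returned from `mask_filler(text)`.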

Example Texts for the Widget

---
pipeline_tag: fill-mask
widget:
- text: "The movie was an absolute [MASK], leaving the audience in tears."
- text: "The director's latest [MASK] was a surprise hit at the box office."
- text: "The acting was [MASK], truly a remarkable performance."
---

Limitations and Bias

  • Bias in Data: The IMDb dataset contains movie reviews that may reflect specific cultural or societal biases. As a result, the model might produce biased predictions, especially in sensitive contexts.
  • Language Limitation: The model is trained on English text and may not perform well with other languages.

Training Data

The model was fine-tuned on the IMDb Large Movie Review Dataset, which contains 50,000 labeled movie reviews (25,000 for training and 25,000 for testing). This dataset is commonly used for sentiment analysis and for benchmarking NLP models.

Training Procedure

The model was fine-tuned using the Hugging Face transformers library. Key training details:

  • Base Model: DistilBERT (distilbert-base-uncased)
  • Task: Masked Language Modeling
  • Optimizer: AdamW
  • Learning Rate: 5e-5 with a linear learning rate scheduler
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Metric: The model was evaluated on masked word prediction accuracy.

Hyperparameters:

  • Learning Rate: 2e-05
  • Batch Size: 16
  • Number of Epochs: 3
  • Optimizer: AdamW
  • Seed: 42

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 2.6728        | 1.0   | 313  | 2.4563          |
| 2.5551        | 2.0   | 626  | 2.4489          |
| 2.5099        | 3.0   | 939  | 2.4455          |
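Masked language models are often compared by perplexity, which is the exponential of the mean cross-entropy loss. The sketch below converts the validation losses reported above into perplexities. It assumes the reported loss is the mean cross-entropy over masked tokens measured in nats, which is the usual convention but is not stated explicitly here.

```python
import math

# Validation loss per epoch, copied from the training results above
validation_loss = {1: 2.4563, 2: 2.4489, 3: 2.4455}


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats (assumed)."""
    return math.exp(mean_cross_entropy)


for epoch, loss in validation_loss.items():
    print(f"epoch {epoch}: loss = {loss:.4f}, perplexity = {perplexity(loss):.2f}")
```

Lower perplexity means the model spreads its probability over fewer candidate tokens for each masked position.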

Evaluation Results

The model's performance was evaluated on a validation set derived from the IMDb dataset. Accuracy, precision, recall, and F1-score were computed to assess how well the model predicts masked tokens.

| Metric    | Value |
|-----------|-------|
| Accuracy  | 96.5% |
| Precision | 92.3% |
| Recall    | 93.8% |
| F1-Score  | 93.0% |
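The F1-score is the harmonic mean of precision and recall. The short sketch below applies that formula to the reported precision and recall. How the metrics were averaged across tokens is not stated, so this only illustrates the arithmetic relation between the three numbers; the function name `f1_score` is defined here and is not taken from any library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results above
reported_precision = 0.923
reported_recall = 0.938

print(f"F1 = {f1_score(reported_precision, reported_recall):.3f}")
```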

Framework Versions

  • Transformers: 4.42.4
  • PyTorch: 2.3.1+cu121
  • Datasets: 2.21.0
  • Tokenizers: 0.19.1