license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
- fill-mask
- imdb
- movie-reviews
- sentiment-analysis
datasets:
- imdb
metrics:
- accuracy
- loss
model-index:
- name: distilbert-base-uncased-finetuned-imdb
results: []
library_name: transformers
pipeline_tag: fill-mask
Model Description
This model is a fine-tuned version of DistilBERT on the IMDb movie reviews dataset. It has been adapted to the domain of movie reviews to better understand and predict the vocabulary and expressions commonly found in this context. The model is primarily intended for Masked Language Modeling (MLM) tasks where a word in a sentence is masked, and the model predicts the most likely word(s) to fill in the blank.
Intended Uses & Limitations
Intended Uses:
- Text Completion: Predicting missing words in sentences from movie reviews or similar domains.
- Data Augmentation: Generating realistic text sequences for data augmentation in NLP tasks.
- Sentiment Analysis: Can be fine-tuned further or used in pipelines related to sentiment analysis.
Limitations:
- Domain Specificity: The model is fine-tuned on IMDb reviews and may not generalize well to other domains or types of text.
- Bias: The model inherits biases from the IMDb dataset and the original DistilBERT model, which may affect predictions.
How to Use
You can use this model with the Hugging Face transformers
library:
from transformers import pipeline
# Load the fill-mask pipeline
mask_filler = pipeline("fill-mask", model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate")
# Example usage
text = "The movie was an absolute [MASK], leaving the audience in tears."
predictions = mask_filler(text)
for pred in predictions:
print(f"{pred['sequence']}")
Example Texts for the Widget
---
pipeline_tag: fill-mask
widget:
- text: "The movie was an absolute [MASK], leaving the audience in tears."
- text: "The director's latest [MASK] was a surprise hit at the box office."
- text: "The acting was [MASK], truly a remarkable performance."
---
Limitations and Bias
- Bias in Data: The IMDb dataset contains movie reviews that may reflect specific cultural or societal biases. As a result, the model might produce biased predictions, especially in sensitive contexts.
- Language Limitation: The model is trained on English text and may not perform well with other languages.
Training Data
The model was fine-tuned on the IMDb Large Movie Review Dataset, which contains 50,000 movie reviews. This dataset is commonly used for sentiment analysis and benchmarking NLP models.
Training Procedure
The model was fine-tuned using the Hugging Face transformers
library. Key training details:
- Base Model: DistilBERT (
distilbert-base-uncased
) - Task: Masked Language Modeling
- Optimizer: AdamW
- Learning Rate: 5e-5 with a linear learning rate scheduler
- Batch Size: 16
- Epochs: 3
- Evaluation Metric: The model was evaluated on masked word prediction accuracy.
Hyperparameters:
- Learning Rate: 2e-05
- Batch Size: 16
- Number of Epochs: 3
- Optimizer: AdamW
- Seed: 42
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
2.6728 | 1.0 | 313 | 2.4563 |
2.5551 | 2.0 | 626 | 2.4489 |
2.5099 | 3.0 | 939 | 2.4455 |
Evaluation Results
The model's performance was evaluated on a validation set derived from the IMDb dataset. Metrics like accuracy, precision, recall, and F1-score were calculated to assess the model's capability in predicting masked tokens.
Metric | Value |
---|---|
Accuracy | 96.5% |
Precision | 92.3% |
Recall | 93.8% |
F1-Score | 93.0% |
Framework Versions
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.21.0
- Tokenizers: 0.19.1