DistilBERT IMDb Sentiment Classifier
Model Description
This is a fine-tuned version of DistilBERT for sentiment analysis on the IMDb movie review dataset. DistilBERT is a smaller, faster, and lighter variant of BERT that retains most of BERT's natural language understanding performance (roughly 97% on standard benchmarks) with about 40% fewer parameters.
The model is trained to classify movie reviews as expressing either positive or negative sentiment, making it well suited to applications such as analyzing customer feedback, social media posts, or product reviews.
Intended Use
This model is intended for text classification tasks, specifically sentiment analysis. It can be used to automatically label a piece of text as having either positive or negative sentiment.
Use Cases
- Movie review sentiment analysis
- Customer feedback analysis
- Social media sentiment monitoring
- Product review classification
How to Use
Here is how you can use this model with the Hugging Face transformers library:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)
# Example text
text = "The movie was absolutely fantastic! The acting was superb and the story was gripping."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.softmax(logits, dim=1)
# Get the predicted label
predicted_label = torch.argmax(predictions).item()
labels = ["Negative", "Positive"]
print(f"Predicted sentiment: {labels[predicted_label]}")
Training Data
This model was trained on the IMDb movie review dataset, a large benchmark for binary sentiment classification containing 50,000 highly polarized movie reviews, balanced between 25,000 positive and 25,000 negative examples.
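For reference, the same data can be loaded directly from the Hugging Face Hub with the datasets library. This is a minimal sketch, assuming the standard "imdb" dataset identifier (25,000 training and 25,000 test reviews).

from datasets import load_dataset

# Load the IMDb Large Movie Review Dataset (binary labels: 0 = negative, 1 = positive)
imdb = load_dataset("imdb")

print(imdb["train"].num_rows)                    # 25000
example = imdb["train"][0]
print(example["label"], example["text"][:100])   # label and the start of the review text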
Training Procedure
The model was fine-tuned using the IMDb dataset with the following configuration:
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 2
- Max Sequence Length: 512 tokens
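The original training script is not reproduced in this card; the snippet below is a rough sketch of an equivalent Trainer setup using the hyperparameters listed above. The output directory and the per-epoch evaluation schedule are illustrative assumptions, not details taken from the original run.

from datasets import load_dataset
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_dataset("imdb")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    # Truncate to the 512-token maximum sequence length used for fine-tuning
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="imdb-distilbert",   # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch",          # evaluate after each epoch (assumed)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,            # enables dynamic padding via the default data collator
)

trainer.train()

AdamW with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's default optimizer, so it does not need to be configured explicitly.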
Training Results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| 0.2239        | 1.0   | 1563 | 0.2026          | 0.9227   |
| 0.1468        | 2.0   | 3126 | 0.2319          | 0.9320   |
Final results on the evaluation set:
- Loss: 0.2319
- Accuracy: 0.9320
Limitations
- The model is specifically trained on the IMDb dataset, so its effectiveness may be reduced when applied to other domains or types of text.
- Sentiment detection is binary (positive or negative). Neutral sentiments or more nuanced emotions are not captured.
- The model may not perform well on text that is highly sarcastic, contains slang, or is very short (e.g., one-word reviews).
Ethical Considerations
- Bias: The model may reflect biases present in the IMDb dataset. Users should be cautious about applying this model to sensitive applications.
- Content: Since the IMDb dataset includes movie reviews, the model might not generalize well to text outside of this context.
Acknowledgments
- The original DistilBERT model was developed by Hugging Face.
- The IMDb dataset (Large Movie Review Dataset) was created by Maas et al. at Stanford.
Framework versions
- Transformers 4.42.4
- PyTorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
Base model
- distilbert/distilbert-base-uncased