---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: imdb-distilbert-funetuned
  results: []
datasets:
- ajaykarthick/imdb-movie-reviews
language:
- en
library_name: transformers
pipeline_tag: text-classification
---

# DistilBERT IMDb Sentiment Classifier

## Model Description

This is a fine-tuned version of [DistilBERT](https://huggingface.co/distilbert-base-uncased) for sentiment analysis on the IMDb movie review dataset. DistilBERT is a smaller, faster, and lighter variant of BERT, designed to run efficiently while retaining most of BERT's strength in natural language understanding.

The model classifies movie reviews as expressing either **positive** or **negative** sentiment, making it well suited to applications such as analyzing customer feedback, social media posts, or product reviews.

## Intended Use

This model is intended for text classification tasks, specifically sentiment analysis. It can be used to automatically label a piece of text as having either a positive or negative sentiment.

### Use Cases

- **Movie review sentiment analysis**
- **Customer feedback analysis**
- **Social media sentiment monitoring**
- **Product review classification**

## How to Use

Here is how you can use this model with the Hugging Face `transformers` library:

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The movie was absolutely fantastic! The acting was superb and the story was gripping."

# Tokenize (truncating to the model's 512-token limit) and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.softmax(logits, dim=1)

# Get the predicted label
predicted_label = torch.argmax(predictions).item()
labels = ["Negative", "Positive"]
print(f"Predicted sentiment: {labels[predicted_label]}")
```

## Training Data

This model was trained on the IMDb movie review dataset, a large benchmark for binary sentiment classification. The dataset contains 50,000 highly polarized movie reviews and is balanced, with 25,000 positive and 25,000 negative reviews.

## Training Procedure

The model was fine-tuned on the IMDb dataset with the following configuration (a code sketch of this setup follows the list):

- **Optimizer**: AdamW (Adam with betas=(0.9, 0.999) and epsilon=1e-08)
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 2
- **Max Sequence Length**: 512 tokens
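The sketch below shows, for illustration only, how this configuration maps onto the `Trainer` API. The hyperparameters come from the list above; the `imdb` dataset id, its `text`/`label` column names, and the evaluation setup are assumptions rather than the exact training script.

```python
# Minimal sketch of the fine-tuning setup above; not the exact training script.
# The `imdb` dataset id and its `text`/`label` columns are assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate to the 512-token maximum listed above; padding is applied
    # per batch by the Trainer's default data collator.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="imdb-distilbert-funetuned",
    learning_rate=2e-5,             # AdamW is the Trainer's default optimizer
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch",          # `evaluation_strategy` on older releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,            # enables dynamic padding by default
    compute_metrics=compute_metrics,
)
trainer.train()
```

With a batch size of 16 over the 25,000-example training split, each epoch is 1,563 optimizer steps, consistent with the step counts in the results table below.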
### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 0.2239        | 1.0   | 1563 | 0.2026          | 0.9227   |
| 0.1468        | 2.0   | 3126 | 0.2319          | 0.9320   |

Final metrics on the validation split (a sketch for re-checking the accuracy appears at the end of this card):

- **Loss:** 0.2319
- **Accuracy:** 0.9320

## Limitations

- The model is trained specifically on the IMDb dataset, so its effectiveness may be reduced on other domains or types of text.
- Sentiment detection is binary (positive or negative); neutral sentiments and more nuanced emotions are not captured.
- The model may not perform well on text that is highly sarcastic, contains slang, or is very short (e.g., one-word reviews).

## Ethical Considerations

- **Bias**: The model may reflect biases present in the IMDb dataset. Users should be cautious about applying this model to sensitive applications.
- **Content**: Since the IMDb dataset consists of movie reviews, the model might not generalize well to text outside of this context.

## Acknowledgments

- The original [DistilBERT](https://huggingface.co/distilbert-base-uncased) model was developed by Hugging Face.
- The IMDb dataset was created at Stanford (Maas et al.) and can be found [here](https://ai.stanford.edu/~amaas/data/sentiment/).

## Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
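## Evaluation Example

As referenced in the results section, the snippet below is a minimal, hedged sketch of how the reported test accuracy could be re-checked with the `pipeline` API. The `imdb` dataset id, the batch size, and the assumption that `LABEL_1` is the positive class are illustrative choices, not details confirmed by this card.

```python
# Hedged sketch: re-scoring the model on the IMDb test split.
# Dataset id, batch size, and label mapping are assumptions.
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Ashaduzzaman/imdb-distilbert-funetuned",
    truncation=True,  # clip reviews longer than the model's 512-token limit
)

test_set = load_dataset("imdb", split="test")

correct = 0
for example, pred in zip(test_set, classifier(test_set["text"], batch_size=16)):
    # Assumes default config labels LABEL_0/LABEL_1, with LABEL_1 = positive.
    predicted = 1 if pred["label"].endswith("1") else 0
    correct += int(predicted == example["label"])

print(f"Test accuracy: {correct / len(test_set):.4f}")
```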