---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
- fill-mask
- imdb
- movie-reviews
- sentiment-analysis
datasets:
- imdb
metrics:
- accuracy
- loss
model-index:
- name: distilbert-base-uncased-finetuned-imdb
  results: []
library_name: transformers
pipeline_tag: fill-mask
---

## Model Description
This model is a version of [DistilBERT](https://huggingface.co/distilbert-base-uncased) fine-tuned on the IMDb movie reviews dataset. Fine-tuning adapts the base model to the vocabulary and expressions common in movie reviews. The model is intended primarily for Masked Language Modeling (MLM): a word in a sentence is replaced with `[MASK]`, and the model predicts the most likely word(s) to fill in the blank.
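
To make the masking step concrete, here is a minimal pure-Python sketch of how a sentence can be turned into an MLM training example. The helper `mask_words`, the whitespace splitting, and the masking probability are illustrative assumptions, not the procedure used to train this model; the data collator in `transformers` works on subword tokens and is more involved.

```python
import random

MASK_TOKEN = "[MASK]"  # mask token used by distilbert-base-uncased


def mask_words(sentence, mask_probability=0.15, seed=0):
    """Replace a random subset of whitespace-separated words with [MASK].

    Returns the masked sentence and a dict mapping each masked position
    to the original word, which is what the model learns to recover.
    Assumption: the probability and word-level splitting are illustrative.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, word in enumerate(sentence.split()):
        if rng.random() < mask_probability:
            masked.append(MASK_TOKEN)
            labels[position] = word
        else:
            masked.append(word)
    return " ".join(masked), labels


masked_sentence, labels = mask_words(
    "the movie was an absolute masterpiece and the acting was superb",
    mask_probability=0.3,
)
print(masked_sentence)
print(labels)
```

The `labels` dictionary holds the targets: during training, the loss is computed only at the masked positions.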

## Intended Uses & Limitations
**Intended Uses:**
- **Text Completion:** Predicting missing words in sentences from movie reviews or similar domains.
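- **Data Augmentation:** Producing variants of review sentences by replacing masked words with predicted alternatives, for use as augmented data in NLP tasks.
- **Sentiment Analysis:** The model does not predict sentiment labels itself, but it can be fine-tuned further or used as a component in sentiment analysis pipelines.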

**Limitations:**
- **Domain Specificity:** The model is fine-tuned on IMDb reviews and may not generalize well to other domains or types of text.
- **Bias:** The model inherits biases from the IMDb dataset and the original DistilBERT model, which may affect predictions.

## How to Use
You can use this model with the Hugging Face `transformers` library:

```python
from transformers import pipeline

# Load the fill-mask pipeline
mask_filler = pipeline("fill-mask", model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate")

# Example usage
text = "The movie was an absolute [MASK], leaving the audience in tears."
predictions = mask_filler(text)

# Each prediction is a dict; 'sequence' holds the sentence with the mask filled in
for pred in predictions:
    print(pred["sequence"])
```
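
If you also want the predicted tokens and their scores, a small wrapper like the one below may help. This is a sketch: `top_predictions` and `load_mask_filler` are hypothetical helper names, and the code assumes the pipeline returns a list of dicts with `token_str` and `score` keys for text containing a single `[MASK]`.

```python
def top_predictions(mask_filler, text, k=3):
    """Return (token, score) pairs for the k most likely fills of [MASK].

    `mask_filler` is any callable that behaves like a fill-mask pipeline:
    given text with one [MASK], it returns a list of dicts that include
    'token_str' and 'score' keys.
    """
    predictions = mask_filler(text, top_k=k)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked]


def load_mask_filler():
    """Load the pipeline from the Hub (needs `transformers` and network access)."""
    from transformers import pipeline

    return pipeline(
        "fill-mask",
        model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate",
    )
```

For example, `top_predictions(load_mask_filler(), "The acting was [MASK], truly a remarkable performance.")` returns the three highest-scoring candidate words with their probabilities.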

### Example Texts for the Widget
```yaml
---
pipeline_tag: fill-mask
widget:
- text: "The movie was an absolute [MASK], leaving the audience in tears."
- text: "The director's latest [MASK] was a surprise hit at the box office."
- text: "The acting was [MASK], truly a remarkable performance."
---
```

## Limitations and Bias
- **Bias in Data**: The IMDb dataset contains movie reviews that may reflect specific cultural or societal biases. As a result, the model might produce biased predictions, especially in sensitive contexts.
- **Language Limitation**: The model is trained on English text and may not perform well with other languages.


## Training Data
The model was fine-tuned on the [IMDb Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which contains 50,000 movie reviews. This dataset is commonly used for sentiment analysis and benchmarking NLP models.

## Training Procedure
The model was fine-tuned using the Hugging Face `transformers` library. Key training details:
- **Base Model:** DistilBERT (`distilbert-base-uncased`)
- **Task:** Masked Language Modeling
- **Optimizer:** AdamW
- **Learning Rate:** 5e-5 with a linear learning rate scheduler
- **Batch Size:** 16
- **Epochs:** 3
- **Evaluation Metric:** The model was evaluated on masked word prediction accuracy.
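
As a rough illustration of what a linear learning rate scheduler does, the sketch below computes the learning rate at a given step. It assumes no warmup and decay to zero, neither of which is stated in this card; `linear_lr` is a hypothetical helper, and `base_lr` is a parameter so that any learning rate listed in this card can be plugged in.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Assumption: warmup_steps=0 and decay to zero are illustrative defaults.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 939  # final step count reported in the training results table

for step in (0, 313, 626, 939):
    print(f"step {step:>3}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the rate starts at `base_lr`, falls by roughly a third per epoch, and reaches zero at the last step.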

### Hyperparameters
- **Learning Rate:** 2e-05
- **Batch Size:** 16
- **Number of Epochs:** 3
- **Optimizer:** AdamW
- **Seed:** 42

### Training Results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 2.6728        | 1.0   | 313  | 2.4563          |
| 2.5551        | 2.0   | 626  | 2.4489          |
| 2.5099        | 3.0   | 939  | 2.4455          |
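
Masked language models are often compared by perplexity, which is the exponential of the mean cross-entropy loss. The sketch below converts the validation losses from the table, assuming they are mean per-token losses in nats (the usual convention for the `Trainer`, but not stated in this card).

```python
import math

# Validation losses copied from the table above
# (assumed to be mean cross-entropy in nats).
validation_loss = {1: 2.4563, 2: 2.4489, 3: 2.4455}

perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}

for epoch, value in perplexity.items():
    print(f"epoch {epoch}: loss = {validation_loss[epoch]:.4f}, perplexity = {value:.2f}")
```

Under that assumption, the final validation loss corresponds to a perplexity of about 11.5.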


## Evaluation Results
The model was evaluated on a validation set derived from the IMDb dataset. Accuracy, precision, recall, and F1-score were computed to assess how well the model predicts masked tokens.

| Metric     | Value   |
|------------|---------|
| Accuracy   | 96.5%   |
| Precision  | 92.3%   |
| Recall     | 93.8%   |
| F1-Score   | 93.0%   |

## Framework Versions
- **Transformers:** 4.42.4
- **PyTorch:** 2.3.1+cu121
- **Datasets:** 2.21.0
- **Tokenizers:** 0.19.1