|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- lmsys/toxic-chat |
|
metrics: |
|
- perplexity |
|
--- |
|
|
|
# Model Card for tci_plus
|
|
|
This model is `facebook/bart-large` fine-tuned on the non-toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.
|
|
|
## Model Details |
|
|
|
This model is not intended to be used for plain inference, even though it is unlikely to generate toxic content.
|
It is instead intended to be used as a "utility model" for detecting and fixing toxic content: its token probability distributions will likely differ from those of comparable models that were not trained/fine-tuned on non-toxic data.
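As a rough illustration of the underlying idea (not TMaRCo's actual implementation), the disagreement between an expert and a base model can be measured per token, e.g. with the KL divergence between their next-token distributions; the distributions below are made up for the example:

```python
import numpy as np

def token_disagreement(p_base, p_expert):
    """KL divergence between two next-token probability distributions.

    A high value suggests the expert (fine-tuned on non-toxic text)
    scores this position very differently from the base model,
    flagging the token as a candidate for revision.
    """
    p_base = np.asarray(p_base, dtype=float)
    p_expert = np.asarray(p_expert, dtype=float)
    return float(np.sum(p_expert * np.log(p_expert / p_base)))

# Illustrative distributions over a tiny 3-token vocabulary
base = [0.5, 0.3, 0.2]
expert = [0.1, 0.6, 0.3]
print(token_disagreement(base, expert))  # > 0: the models disagree here
```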
|
|
|
Its name, `tci_plus`, refers to the _G+_ model in [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://aclanthology.org/2023.acl-short.21.pdf).
|
|
|
It can be used within `TrustyAI`'s `TMaRCo` tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.
|
|
|
### Model Description |
|
|
|
|
|
|
- **Developed by:** tteofili

- **Shared by:** tteofili

- **License:** Apache 2.0

- **Finetuned from model:** `facebook/bart-large`
|
|
|
## Uses |
|
|
|
This model is intended to be used as a "utility model" for detecting and fixing toxic content, as its token probability distributions will likely differ from those of comparable models not trained/fine-tuned on non-toxic data.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
This model is fine-tuned on the non-toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset and is therefore unlikely to produce toxic content; nevertheless, it should only be used in combination with other models for the purpose of detecting / fixing toxic content.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to start using the model for text detoxification. |
|
|
|
```python
from trustyai.detoxify import TMaRCo

# Negative weight for the anti-expert (tci_minus), positive weight for the expert (tci_plus)
tmarco = TMaRCo(expert_weights=[-1, 3])
tmarco.load_models(["trustyai/tci_minus", "trustyai/tci_plus"])
tmarco.rephrase(["white men can't jump"])
```
|
|
|
## Training Details |
|
|
|
This model has been trained on non-toxic inputs from the `lmsys/toxic-chat` dataset. |
|
|
|
### Training Data |
|
|
|
Training data from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.
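A minimal sketch of the kind of filtering involved (assuming, as in `toxic-chat`, that a `toxicity` label of 0 marks a non-toxic row); the in-memory records here stand in for the real dataset:

```python
# Keep only non-toxic user inputs, mirroring how the subset used for
# fine-tuning can be selected from toxic-chat rows.
records = [
    {"user_input": "How do I bake bread?", "toxicity": 0},
    {"user_input": "<offensive text>", "toxicity": 1},
    {"user_input": "Explain photosynthesis.", "toxicity": 0},
]

non_toxic = [r["user_input"] for r in records if r["toxicity"] == 0]
print(non_toxic)  # the two non-toxic inputs
```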
|
|
|
|
|
### Training Procedure |
|
|
|
This model has been fine-tuned with the following code:
|
|
|
```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo()

dataset_name = 'lmsys/toxic-chat'
data_dir = ''
perc = 100
td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id', 'jailbreaking',
              'openai_moderation', 'toxicity']

target_feature = 'toxicity'
content_feature = 'user_input'
model_prefix = 'toxic_chat_input_'
tmarco.train_models(perc=perc, dataset_name=dataset_name, expert_feature=target_feature,
                    model_prefix=model_prefix, data_dir=data_dir,
                    content_feature=content_feature, td_columns=td_columns)
```
|
|
|
#### Training Hyperparameters |
|
|
|
This model has been trained with the following hyperparameters:
|
|
|
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)
```
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
Test data from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.
|
|
|
#### Metrics |
|
|
|
The model was evaluated using the perplexity metric.
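Perplexity is the exponentiated average negative log-likelihood per token, so a model that assigns high probability to every reference token scores close to 1. A minimal sketch of the computation from per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# High per-token probabilities yield a perplexity close to 1.
print(perplexity([0.97, 0.95, 0.99, 0.96]))
```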
|
|
|
### Results |
|
|
|
Perplexity: 1.04 |
|
|
|
|