	---
	license: apache-2.0
	datasets:
	- lmsys/toxic-chat
	metrics:
	- perplexity
	---

	# Model Card for tci_plus

	This model is a `facebook/bart-large` fine-tuned on non-toxic inputs from the `lmsys/toxic-chat` dataset.

	## Model Details

	This model is not intended for plain inference, even though it is unlikely to generate toxic content.
	It is intended instead as a "utility model" for detecting and fixing toxic content, because its token probability distributions will likely differ from those of comparable models that were not trained or fine-tuned on non-toxic data.

	Its name tci_plus refers to the _G+_ model in [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://aclanthology.org/2023.acl-short.21.pdf).

	It can be used within `TrustyAI`'s `TMaRCo` tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.

	### Model Description

	- Developed by: tteofili
	- Shared by: tteofili
	- License: Apache-2.0
	- Finetuned from model: `facebook/bart-large`

	## Uses

	This model is intended to be used as a "utility model" for detecting and fixing toxic content, because its token probability distributions will likely differ from those of comparable models that were not trained or fine-tuned on non-toxic data.
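
One way to make "token probability distributions will differ" concrete is to measure, position by position, how far apart the next-token distributions of a non-toxic and a toxic model are, and to flag positions where they disagree strongly. The sketch below uses toy distributions and the Jensen-Shannon divergence. The distributions, the threshold value, and the choice of divergence are assumptions for illustration and are not taken from the `TMaRCo` implementation.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy next-token distributions over a 3-word vocabulary at two positions.
# These numbers are made up; real ones would come from the two models.
non_toxic_model = [[0.70, 0.20, 0.10], [0.34, 0.33, 0.33]]
toxic_model = [[0.10, 0.20, 0.70], [0.33, 0.34, 0.33]]

THRESHOLD = 0.1  # hypothetical cut-off, not a TMaRCo default

scores = [js_divergence(p, q) for p, q in zip(non_toxic_model, toxic_model)]
flagged = [i for i, s in enumerate(scores) if s > THRESHOLD]
print(flagged)
```

Positions where the two models disagree strongly are candidates for revision, while positions where they agree are left alone.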

	## Bias, Risks, and Limitations

	This model is fine-tuned on non-toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset, so it is unlikely to produce toxic content on its own. It is nonetheless not meant for standalone generation and should only be used in combination with other models to detect or fix toxic content.

	## How to Get Started with the Model

	Use the code below to start using the model for text detoxification.

	```python
	from trustyai.detoxify import TMaRCo
	tmarco = TMaRCo(expert_weights=[-1, 3])
	tmarco.load_models(["trustyai/tci_minus", "trustyai/tci_plus"])
	tmarco.rephrase(["white men can't jump"])
	```
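
The `expert_weights=[-1, 3]` argument above suggests that the toxic model is weighted negatively and this model positively. A minimal sketch of that idea, in the spirit of the product-of-experts formulation in the MaRCo paper, is shown below. How `TMaRCo` actually combines the models, and whether the weights follow the order of `load_models`, are assumptions here; the logits are toy values and `combine_logits` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(base, experts, weights):
    """Add each expert's logits to the base logits, scaled by its weight."""
    combined = list(base)
    for w, expert in zip(weights, experts):
        combined = [c + w * e for c, e in zip(combined, expert)]
    return combined


# Toy next-token logits over a 3-word vocabulary (made-up values).
base = [1.0, 1.0, 1.0]
toxic = [0.5, 0.0, 2.0]      # stand-in for trustyai/tci_minus
non_toxic = [2.0, 0.5, 0.0]  # stand-in for trustyai/tci_plus

combined = combine_logits(base, [toxic, non_toxic], [-1, 3])
probs = softmax(combined)
best = max(range(len(probs)), key=probs.__getitem__)
print(combined, best)
```

With these weights, tokens favored by the non-toxic model are boosted and tokens favored by the toxic model are suppressed.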

	## Training Details

	This model has been trained on non-toxic inputs from the `lmsys/toxic-chat` dataset.

	### Training Data

	The training data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.


	### Training Procedure

	This model has been fine-tuned with the following code:

	```python
	from trustyai.detoxify import TMaRCo

	# Assumption: the original snippet used `tmarco` without creating it;
	# a default-constructed TMaRCo instance is assumed here.
	tmarco = TMaRCo()

	dataset_name = 'lmsys/toxic-chat'
	data_dir = ''
	perc = 100
	td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id',
	              'jailbreaking', 'openai_moderation', 'toxicity']

	target_feature = 'toxicity'    # column holding the toxicity label
	content_feature = 'user_input' # column holding the text to train on
	model_prefix = 'toxic_chat_input_'
	tmarco.train_models(perc=perc, dataset_name=dataset_name,
	                    expert_feature=target_feature, model_prefix=model_prefix,
	                    data_dir=data_dir, content_feature=content_feature,
	                    td_columns=td_columns)
	```

	#### Training Hyperparameters

	This model has been trained with the following hyperparameters:

	```python
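	from transformers import TrainingArguments

	# Only the arguments listed in the original are shown; depending on the
	# transformers version, an output directory may also be required.
	training_args = TrainingArguments(
	    evaluation_strategy="epoch",
	    learning_rate=2e-5,
	    weight_decay=0.01,
	)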
	```

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	The test data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

	#### Metrics

	The model was evaluated using the perplexity metric.

	### Results

	Perplexity: 1.04
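
For reference, perplexity is the exponential of the average negative log-likelihood per token, so a value of 1.04 corresponds to an average per-token probability of roughly 0.96. The sketch below illustrates that relationship with made-up log-probabilities. It does not reproduce the evaluation that produced the number above, and the exact procedure used for this model is not stated in this card.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Hypothetical log-probabilities the model assigns to four reference tokens.
log_probs = [math.log(0.96)] * 4
ppl = perplexity(log_probs)
print(round(ppl, 2))
```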