Model Card for deberta-v3-large-self-disclosure-detection

The model detects self-disclosures (personal information) in a sentence. The task is framed as binary token classification: each word is labeled either "DISCLOSURE" or "O". For example, "I am 22 years old and ..." has the labels ["DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "O", ...].
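
As a minimal sketch of how this label format can be consumed, the snippet below groups consecutive "DISCLOSURE" words into text spans. The helper name `disclosure_spans` and the continuation of the example sentence are hypothetical, and the labels are written by hand for illustration; they are not model output.

```python
from itertools import groupby


def disclosure_spans(words, labels):
    """Join each run of consecutive words labeled "DISCLOSURE" into one span."""
    spans = []
    # groupby yields one group per run of identical adjacent labels.
    for label, group in groupby(zip(words, labels), key=lambda pair: pair[1]):
        if label == "DISCLOSURE":
            spans.append(" ".join(word for word, _ in group))
    return spans


# Hand-written example in the format described above (not model output).
words = "I am 22 years old and I like tea".split()
labels = ["DISCLOSURE"] * 5 + ["O"] * 4

print(disclosure_spans(words, labels))
```

Because the labels are word-level and binary, a span is simply a maximal run of "DISCLOSURE" words; no begin/inside tagging is assumed here.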

The model detects self-disclosures from the following 17 categories: "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance", "Gender", "Health", "Husband_BF", "Location", "Mental_Health", "Occupation", "Pet", "Race_Nationality", "Relationship_Status", "Sexual_Orientation", "Wife_GF".

For more details, please read the paper: Reducing Privacy Risks in Online Self-Disclosures with Language Models.

Accessing this model implies automatic agreement to the following guidelines:

  1. Only use the model for research purposes.
  2. No redistribution without the author's agreement.
  3. Any derivative works created using this model must acknowledge the original author.

Model Description

  • Model type: A fine-tuned binary token-classification model that detects self-disclosures
  • Language(s) (NLP): English
  • License: Creative Commons Attribution-NonCommercial
  • Finetuned from model: microsoft/deberta-v3-large

Example Code

import torch
from torch.utils.data import DataLoader, Dataset

from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification

model_path = "douy/deberta-v3-large-self-disclosure-detection-binary"

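config = AutoConfig.from_pretrained(model_path)
label2id = config.label2id
id2label = config.id2label

# Binary task: the two labels are "DISCLOSURE" and "O".
config.num_labels = 2

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Use the first GPU when one is available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForTokenClassification.from_pretrained(model_path, config=config).to(device).eval()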

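def tokenize_and_align_labels(words):
    tokenized_inputs = tokenizer(
        words,
        padding=False,
        is_split_into_words=True,
    )

    # There are no gold labels at inference time, so every word gets the placeholder label "O".
    # Positions whose label is not -100 mark the tokens whose predictions are kept later.
    word_ids = tokenized_inputs.word_ids(0)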
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id that is None. We set the label to -100 so they are automatically
        # ignored in the loss function.
        if word_idx is None:
            label_ids.append(-100)
        # We set the label for the first token of each word.
        elif word_idx != previous_word_idx:
            label_ids.append(label2id["O"])
        # For the other tokens in a word, we set the label to -100
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = label_ids
    return tokenized_inputs

class DisclosureDataset(Dataset):
    def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function):
        self.inputs = inputs
        self.tokenizer = tokenizer
        self.tokenize_and_align_labels_function = tokenize_and_align_labels_function

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        words = self.inputs[idx]
        tokenized_inputs = self.tokenize_and_align_labels_function(words)
        return tokenized_inputs


sentences = [
    "I am a 23-year-old who is currently going through the last leg of undergraduate school.",
    "We also partnered with news and data providers to add up-to-date information and new visual designs for categories like weather, stocks, sports, news, and maps.",
    "My husband and I live in US.",
    "I was messing with advanced voice the other day and I was like, 'Oh, I can do this.'",
]

inputs = [sentence.split() for sentence in sentences]

data_collator = DataCollatorForTokenClassification(tokenizer)

dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels)

dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2)

total_predictions = []
for batch in dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.inference_mode():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(-1)
    # The placeholder labels serve only as a mask: label != -100 marks the first sub-token of a word.
    labels = batch["labels"]

    predictions = predictions.cpu().tolist()
    labels = labels.cpu().tolist()

    true_predictions = []
    for i, label in enumerate(labels):
        true_pred = []
        for j, m in enumerate(label):
            if m != -100:
                true_pred.append(id2label[predictions[i][j]])
        true_predictions.append(true_pred)
    total_predictions.extend(true_predictions)
    

for word, pred in zip(inputs, total_predictions):
    for w, p in zip(word, pred):
        print(w, p)
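
The alignment used in tokenize_and_align_labels can be illustrated without loading the model. The sketch below keeps one prediction per word by selecting the first sub-token of each word, which is what filtering on label != -100 does in the loop above. The function names, the `word_ids` list, and the token predictions are hypothetical values chosen for illustration; real values come from the tokenizer and the model.

```python
def first_subtoken_mask(word_ids):
    """Return True at each position holding the first sub-token of a word."""
    mask = []
    previous = None
    for word_id in word_ids:
        # Special tokens have word id None; later sub-tokens repeat the previous id.
        mask.append(word_id is not None and word_id != previous)
        previous = word_id
    return mask


def word_level_predictions(word_ids, token_predictions):
    """Keep only the prediction made at the first sub-token of each word."""
    mask = first_subtoken_mask(word_ids)
    return [pred for pred, keep in zip(token_predictions, mask) if keep]


# Hypothetical tokenization: special token, three words (the third split into
# three sub-tokens), a fourth word, and a closing special token.
word_ids = [None, 0, 1, 2, 2, 2, 3, None]
token_predictions = ["O", "DISCLOSURE", "DISCLOSURE", "DISCLOSURE", "O", "O", "O", "O"]

print(word_level_predictions(word_ids, token_predictions))
```

Predictions at special tokens and at non-initial sub-tokens are discarded, so the output has exactly one label per input word.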

Citation

@article{dou2023reducing,
  title={Reducing Privacy Risks in Online Self-Disclosures with Language Models},
  author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei},
  journal={arXiv preprint arXiv:2311.09538},
  year={2023}
}