# DAEDRA: Determining Adverse Event Disposition for Regulatory Affairs

DAEDRA is a language model that predicts the disposition (outcome) of an adverse event from the text of the event report. It is designed to classify reports in passive reporting systems and is trained on the [VAERS](https://vaers.hhs.gov/) dataset, which contains reports of adverse events following vaccination in the United States.

In [1]:
%pip install accelerate -U

Note: you may need to restart the kernel to use updated packages.


In [17]:
%pip install transformers datasets shap watermark wandb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting wandb
  Using cached wandb-0.16.2-py3-none-any.whl (2.2 MB)
Collecting sentry-sdk>=1.0.0
  Using cached sentry_sdk-1.39.2-py2.py3-none-any.whl (254 kB)
Collecting docker-pycreds>=0.4.0
  Using cached docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting setproctitle
  Using cached setproctitle-1.3.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31 kB)
Collecting appdirs>=1.4.3
  Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Installing collected packages: appdirs, setproctitle, sentry-sdk, docker-pycreds, wandb
Successfully installed appdirs-1.4.4 docker-pycreds-0.4.0 sentry-sdk-1.39.2 setproctitle-1.3.3 wandb-0.16.2
Note: you may need to restart the kernel to use updated packages.
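
The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs before `transformers` is first imported and that giving up tokenizer parallelism is acceptable for this notebook:

```python
import os

# Set this before the first `transformers`/`tokenizers` import so that
# forked worker processes do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` instead keeps parallel tokenisation, but that is exactly the fork scenario the warning is about.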


In [99]:
import pandas as pd
import numpy as np
import torch
import os
from typing import List
from sklearn.metrics import f1_score, accuracy_score, classification_report
from transformers import AutoTokenizer, Trainer, AutoModelForSequenceClassification, TrainingArguments, pipeline
from datasets import load_dataset, Dataset, DatasetDict
from pyarrow import Table
import shap

%load_ext watermark

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [87]:
device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

SEED: int = 42

BATCH_SIZE: int = 8
EPOCHS: int = 1
model_ckpt: str = "distilbert-base-uncased"

CLASS_NAMES: List[str] = ["DIED",
                          "ER_VISIT",
                          "HOSPITAL",
                          "OFC_VISIT",
                          #"X_STAY",      # pruned
                          #"DISABLE",     # pruned
                          #"D_PRESENTED"  # pruned
                          ]




# WandB configuration
os.environ["WANDB_PROJECT"] = "DAEDRA model training"  # name your W&B project
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # log all model checkpoints

In [5]:
%watermark --iversion

re     : 2.2.1
numpy  : 1.23.5
logging: 0.5.1.2
pandas : 2.0.2
torch  : 1.12.0
shap   : 0.44.1



In [6]:
!nvidia-smi

Sun Jan 28 02:27:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              37W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off | 00000002:00:0

## Loading the data set

In [105]:
dataset = load_dataset("chrisvoncsefalvay/vaers-outcomes")

We prune things down to the first four keys: `DIED`, `ER_VISIT`, `HOSPITAL`, `OFC_VISIT`.

In [106]:
ds = DatasetDict()

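for split in ["test", "train", "val"]:
    # Keep only the first four entries of every label vector
    pruned_labels = [labels[:4] for labels in dataset[split]["labels"]]
    tab = Table.from_arrays([dataset[split]["id"], dataset[split]["text"], pruned_labels], names=["id", "text", "labels"])
    ds[split] = Dataset(tab)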

dataset = ds
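
To make the pruning step concrete, here is a small sketch that runs without the dataset. It assumes each `labels` entry is a multi-hot vector whose columns follow the order of the full key list in the configuration cell (`DIED`, `ER_VISIT`, `HOSPITAL`, `OFC_VISIT`, `X_STAY`, `DISABLE`, `D_PRESENTED`). The helper `prune_labels` and the example rows are hypothetical.

```python
from typing import List

# Assumed column order of the unpruned label vectors
ALL_KEYS: List[str] = ["DIED", "ER_VISIT", "HOSPITAL", "OFC_VISIT",
                       "X_STAY", "DISABLE", "D_PRESENTED"]


def prune_labels(rows: List[List[int]], keep: int = 4) -> List[List[int]]:
    """Keep only the first `keep` columns of every multi-hot label row."""
    return [row[:keep] for row in rows]


# Two made-up reports: one ER visit with a positive D_PRESENTED flag,
# one death in hospital with a DISABLE flag.
example_rows = [[0, 1, 0, 0, 0, 0, 1],
                [1, 0, 1, 0, 0, 1, 0]]

pruned_rows = prune_labels(example_rows)
kept_keys = ALL_KEYS[:4]
```

After pruning, a report that only carried one of the dropped flags becomes an all-zero row, i.e. a report with none of the four retained outcomes.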

### Tokenisation and encoding

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [9]:
def tokenize_and_encode(examples):
  return tokenizer(examples["text"], truncation=True)

In [10]:
cols = dataset["train"].column_names
cols.remove("labels")
ds_enc = dataset.map(tokenize_and_encode, batched=True, remove_columns=cols)

Map: 100%|██████████| 15786/15786 [00:01<00:00, 10990.82 examples/s]


### Training

In [11]:
class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss
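
`BCEWithLogitsLoss` treats every label as an independent binary decision, which is what makes the classifier multi-label rather than multi-class. The following is a plain-Python sketch of the quantity it computes with the default mean reduction, using the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`. The function name `bce_with_logits` is illustrative and not part of the notebook.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[Sequence[float]],
                    targets: Sequence[Sequence[float]]) -> float:
    """Mean binary cross-entropy over every (sample, label) cell."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for x, y in zip(logit_row, target_row):
            # Stable equivalent of -[y*log(s(x)) + (1-y)*log(1-s(x))]
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# All-zero logits mean p = 0.5 for every label, so the loss is ln(2)
untrained_loss = bce_with_logits([[0.0, 0.0, 0.0, 0.0]], [[1, 0, 0, 0]])
confident_loss = bce_with_logits([[10.0]], [[1]])
```

With all logits at zero the loss equals ln 2 ≈ 0.693, which is roughly in line with the `eval_loss` of about 0.715 reported further down for the freshly initialised classification head.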

In [12]:
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=len(CLASS_NAMES)).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
def accuracy_threshold(y_pred, y_true, threshold=.5, sigmoid=True):
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)

    if sigmoid:
        y_pred = y_pred.sigmoid()

    return ((y_pred > threshold) == y_true.bool()).float().mean().item()
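
Note that `accuracy_threshold` averages over individual label cells, not over whole reports. A dependency-free sketch of the same calculation, with a hypothetical helper name and made-up logits:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def accuracy_threshold_py(logits, labels, threshold=0.5):
    """Fraction of (sample, label) cells where the thresholded
    sigmoid output agrees with the true label."""
    hits = total = 0
    for logit_row, label_row in zip(logits, labels):
        for x, y in zip(logit_row, label_row):
            hits += int((sigmoid(x) > threshold) == bool(y))
            total += 1
    return hits / total


logits = [[2.0, -1.0, -3.0, 0.5],
          [-2.0, -2.0, 1.5, -0.5]]
labels = [[1, 0, 0, 0],
          [0, 0, 1, 1]]

# Each row gets three of its four labels right: 6 / 8 cells
score = accuracy_threshold_py(logits, labels)
```

Here the score is 0.75 even though neither report has all four labels correct, so this metric will generally read higher than an exact-match accuracy on the same predictions.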

In [14]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return {'accuracy_thresh': accuracy_threshold(predictions, labels)}

In [15]:
args = TrainingArguments(
    output_dir="vaers",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=.01,
    report_to=["wandb"]
)

In [18]:
multi_label_trainer = MultiLabelTrainer(
    model, 
    args, 
    train_dataset=ds_enc["train"], 
    eval_dataset=ds_enc["test"], 
    compute_metrics=compute_metrics, 
    tokenizer=tokenizer
)


In [19]:
multi_label_trainer.evaluate()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: chrisvoncsefalvay. Use `wandb login --relogin` to force relogin

{'eval_loss': 0.7153111100196838,
 'eval_accuracy_thresh': 0.2938227355480194,
 'eval_runtime': 82.3613,
 'eval_samples_per_second': 191.668,
 'eval_steps_per_second': 11.984}

In [21]:
multi_label_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy Thresh |
|---|---|---|---|
| 1 | 0.0867 | 0.093388 | 0.962897 |


Checkpoint destination directory vaers/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
wandb: Adding directory to artifact (./vaers/checkpoint-500)... Done. 15.9s
Checkpoint destination directory vaers/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
wandb: Adding directory to artifact (./vaers/checkpoint-1000)... Done. 12.5s
Checkpoint destination directory vaers/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
wandb: Adding directory to artifact (./vaers/checkpoint-1500)... Done. 21.9s
Checkpoint destination directory vaers/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
wandb: Adding directory to artifact (./vaers/checkpoint-2000)... Done. 13.8s
Checkpoint destination directory vaers/checkpoint-2500 already exists and is n

TrainOutput(global_step=4605, training_loss=0.09062977189220382, metrics={'train_runtime': 1223.2444, 'train_samples_per_second': 60.223, 'train_steps_per_second': 3.765, 'total_flos': 9346797199425174.0, 'train_loss': 0.09062977189220382, 'epoch': 1.0})
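
The `global_step` figure can be sanity-checked from the other numbers in `TrainOutput`. The sketch below rests on two assumptions that the notebook does not state outright: that both V100s listed by `nvidia-smi` were used, making the effective batch size 2 × `BATCH_SIZE`, and that the training set size can be estimated as throughput × runtime.

```python
import math


def steps_per_epoch(n_samples: int, per_device_batch: int, n_devices: int) -> int:
    """Optimiser steps needed to see every sample once."""
    return math.ceil(n_samples / (per_device_batch * n_devices))


# Estimate of the training set size from the reported
# train_samples_per_second and train_runtime (an approximation).
est_samples = round(60.223 * 1223.2444)

# Assumed: per-device batch of 8 on two GPUs
steps = steps_per_epoch(est_samples, per_device_batch=8, n_devices=2)
```

Under these assumptions the estimate works out to about 73,667 training samples and 4,605 steps, matching the reported `global_step=4605` for one epoch.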

### Evaluation

We instantiate a classifier `pipeline` and push it to CUDA.

In [24]:
classifier = pipeline("text-classification", 
                      model, 
                      tokenizer=tokenizer, 
                      device="cuda:0")

We use the same tokenizer used for training to tokenize/encode the validation set.

In [26]:
test_encodings = tokenizer.batch_encode_plus(dataset["val"]["text"], 
                                             max_length=None, 
                                             padding='max_length', 
                                             return_token_type_ids=True, 
                                             truncation=True)

The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).


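Once we've made the data loadable by putting it into a `DataLoader`, we run the model over it batch by batch and collect the logits, the sigmoid outputs and the true labels.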

In [29]:
test_data = torch.utils.data.TensorDataset(torch.tensor(test_encodings['input_ids']), 
                                           torch.tensor(test_encodings['attention_mask']), 
                                           torch.tensor(ds_enc["val"]["labels"]), 
                                           torch.tensor(test_encodings['token_type_ids']))
test_dataloader = torch.utils.data.DataLoader(test_data, 
                                              sampler=torch.utils.data.SequentialSampler(test_data), 
                                              batch_size=BATCH_SIZE)

In [30]:
model.eval()

logit_preds, true_labels, pred_labels, tokenized_texts = [], [], [], []

for i, batch in enumerate(test_dataloader):
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels, b_token_types = batch
  
  with torch.no_grad():
    outs = model(b_input_ids, attention_mask=b_input_mask)
    b_logit_pred = outs[0]
    pred_label = torch.sigmoid(b_logit_pred)

    b_logit_pred = b_logit_pred.detach().cpu().numpy()
    pred_label = pred_label.to('cpu').numpy()
    b_labels = b_labels.to('cpu').numpy()

  tokenized_texts.append(b_input_ids)
  logit_preds.append(b_logit_pred)
  true_labels.append(b_labels)
  pred_labels.append(pred_label)

# Flatten outputs
tokenized_texts = [item for sublist in tokenized_texts for item in sublist]
pred_labels = [item for sublist in pred_labels for item in sublist]
true_labels = [item for sublist in true_labels for item in sublist]

# Converting flattened binary values to boolean values
true_bools = [tl == 1 for tl in true_labels]
pred_bools = [pl > 0.50 for pl in pred_labels] 

We create a classification report:

In [31]:
print('Test F1 Accuracy: ', f1_score(true_bools, pred_bools, average='micro'))
print('Test Flat Accuracy: ', accuracy_score(true_bools, pred_bools), '\n')
clf_report = classification_report(true_bools, pred_bools, target_names=CLASS_NAMES)
print(clf_report)

Test F1 Accuracy:  0.8148841961852862
Test Flat Accuracy:  0.8456129236617042 

              precision    recall  f1-score   support

        DIED       0.98      0.83      0.90       312
    ER_VISIT       0.75      0.57      0.65      1143
    HOSPITAL       0.94      0.90      0.92      2361
   OFC_VISIT       0.77      0.66      0.71      2835
      X_STAY       0.00      0.00      0.00         9
     DISABLE       0.62      0.28      0.39       313
 D_PRESENTED       0.89      0.85      0.87      5392

   micro avg       0.86      0.77      0.81     12365
   macro avg       0.71      0.59      0.63     12365
weighted avg       0.85      0.77      0.81     12365
 samples avg       0.29      0.28      0.28     12365



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.
Recall and F-score are ill-defined and being set to 0.0 in samples with no true labels. Use `zero_division` parameter to control this behavior.
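
The two headline numbers measure different things: `f1_score(..., average='micro')` pools true positives, false positives and false negatives across all labels, whereas `accuracy_score` on multi-label input is an exact-match (subset) accuracy per report. A dependency-free sketch with made-up predictions follows; the helper names are illustrative.

```python
def micro_f1(y_true, y_pred) -> float:
    """F1 computed from counts pooled over every (sample, label) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t and p)
            fp += int((not t) and p)
            fn += int(t and (not p))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred) -> float:
    """Share of samples whose whole label vector is predicted exactly."""
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [[True, False, False, False],
          [False, False, True, True],
          [False, False, False, False]]
y_pred = [[True, False, False, True],    # one false positive
          [False, False, True, False],   # one false negative
          [False, False, False, False]]  # exact match, no labels at all

f1 = micro_f1(y_true, y_pred)          # tp=2, fp=1, fn=1
acc = subset_accuracy(y_true, y_pred)  # only the third row matches
```

Reports with no true labels count towards subset accuracy when the model predicts nothing, but they contribute nothing to micro-F1, which is one reason the two figures can diverge.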


Finally, we render a 'head to head' comparison table that pairs each report text with its actual and predicted labels.

In [32]:
# Creating a map of class names from class numbers
idx2label = dict(zip(range(len(CLASS_NAMES)), CLASS_NAMES))

In [33]:
true_label_idxs, pred_label_idxs = [], []

for vals in true_bools:
  true_label_idxs.append(np.where(vals)[0].flatten().tolist())
for vals in pred_bools:
  pred_label_idxs.append(np.where(vals)[0].flatten().tolist())

In [34]:
true_label_texts, pred_label_texts = [], []

for vals in true_label_idxs:
  if vals:
    true_label_texts.append([idx2label[val] for val in vals])
  else:
    true_label_texts.append(vals)

for vals in pred_label_idxs:
  if vals:
    pred_label_texts.append([idx2label[val] for val in vals])
  else:
    pred_label_texts.append(vals)
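
The index lookup and name mapping in the cells above could also be written as a single helper. A sketch, assuming every entry of `true_bools` and `pred_bools` is a boolean sequence aligned with `CLASS_NAMES`; `decode_labels` is a hypothetical name:

```python
from typing import List, Sequence

CLASS_NAMES: List[str] = ["DIED", "ER_VISIT", "HOSPITAL", "OFC_VISIT"]


def decode_labels(flags: Sequence[bool],
                  class_names: Sequence[str] = CLASS_NAMES) -> List[str]:
    """Return the names of all classes whose flag is set."""
    return [name for name, flag in zip(class_names, flags) if flag]


decoded = decode_labels([True, False, True, False])
nothing = decode_labels([False, False, False, False])
```

An all-`False` row decodes to an empty list, which matches the `[]` entries in the comparison table.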

In [35]:
symptom_texts = [tokenizer.decode(text,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False) for text in tokenized_texts]

In [36]:
comparisons_df = pd.DataFrame({'symptom_text': symptom_texts, 
                               'true_labels': true_label_texts, 
                               'pred_labels':pred_label_texts})
comparisons_df.to_csv('comparisons.csv')
comparisons_df

| | symptom_text | true_labels | pred_labels |
|---|---|---|---|
| 0 | pt was due for hepb, hib, ipv. i gave pentacel... | [] | [] |
| 1 | cold ; covid - 19 twice, he tested positive ; ... | [] | [] |
| 2 | patient described pain in both shoulders and r... | [] | [] |
| 3 | error : improper storage ( ex. temp. / locatio... | [] | [] |
| 4 | vaccine was stored in as unapproved storage unit | [] | [] |
| ... | ... | ... | ... |
| 15780 | allergic reaction ; this is a spontaneous repo... | [] | [] |
| 15781 | immediate side effects were in line with expec... | [] | [] |
| 15782 | anaphylaxis immediately after administration o... | [] | [] |
| 15783 | no additional ae ' s were reported ; the hcp r... | [] | [] |


### Shapley analysis

In [160]:
explainer = shap.Explainer(classifier, output_names=CLASS_NAMES)

#### Sampling correct predictions

First, let's look at some correct predictions of deaths:

In [153]:
is_died = lambda col: comparisons_df[col].astype(str) == "['DIED']"  # rows labelled exactly ['DIED']
correct_death_predictions = comparisons_df[is_died('true_labels') & is_died('pred_labels')]

In [161]:
texts = [i[:512] for i in correct_death_predictions.sample(n=6).symptom_text]
idxs = list(range(len(texts)))

d_s = Dataset(Table.from_arrays([idxs, texts], names=["idx", "texts"]))

In [162]:
shap_values = explainer(d_s["texts"])

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
PartitionExplainer explainer: 7it [00:14,  3.70s/it]                       


In [163]:
shap.plots.text(shap_values)