# DAEDRA: Determining Adverse Event Disposition for Regulatory Affairs

DAEDRA is a language model that predicts the disposition (outcome) of an adverse event from the text of the event report. It is intended for classifying reports in passive reporting systems and is trained on the [VAERS](https://vaers.hhs.gov/) dataset, which contains reports of adverse events following vaccination in the United States.

In [1]:
%pip install accelerate -U

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install transformers datasets shap watermark wandb evaluate codecarbon

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import torch
import os
from typing import List, Union
from transformers import AutoTokenizer, Trainer, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, pipeline
from datasets import load_dataset, Dataset, DatasetDict
import shap
import wandb
import evaluate
from codecarbon import EmissionsTracker
import logging

wandb.finish()

logging.getLogger('codecarbon').propagate = False

os.environ["TOKENIZERS_PARALLELISM"] = "false"
tracker = EmissionsTracker()

%load_ext watermark

  from .autonotebook import tqdm as notebook_tqdm
2024-01-29 04:43:58.191236: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 04:43:59.182154: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2024-01-29 04:43:59.182291: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
[codecarbon INFO @ 04:44:02] [setup] RAM Tracking...
[codecarbon INFO @ 04:44:02] [setup] GPU Tracking...
[codecarbo

In [4]:
device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

SEED: int = 42

BATCH_SIZE: int = 32
EPOCHS: int = 5
model_ckpt: str = "distilbert-base-uncased"

# WandB configuration
os.environ["WANDB_PROJECT"] = "DAEDRA multiclass model training" 
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # log all model checkpoints
os.environ["WANDB_NOTEBOOK_NAME"] = "DAEDRA.ipynb"

In [5]:
%watermark --iversion

shap    : 0.44.1
numpy   : 1.23.5
pandas  : 2.0.2
logging : 0.5.1.2
torch   : 1.12.0
evaluate: 0.4.1
wandb   : 0.16.2
re      : 2.2.1



In [6]:
!nvidia-smi

Mon Jan 29 04:44:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   26C    P0              25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off | 00000002:00:0

## Loading the data set

In [7]:
dataset = load_dataset("chrisvoncsefalvay/vaers-outcomes")

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 1270444
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 272238
    })
    val: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 272238
    })
})

In [9]:
SUBSAMPLING = 1.0

if SUBSAMPLING < 1:
    _ = DatasetDict()
    for each in dataset.keys():
        _[each] = dataset[each].shuffle(seed=SEED).select(range(int(len(dataset[each]) * SUBSAMPLING)))

    dataset = _
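
As a quick sanity check on the subsampling logic, the sketch below works out how many rows each split would keep for a given fraction, using the split sizes printed by the `dataset` cell. The helper name `subsampled_sizes` is made up for illustration and is not part of the notebook.

```python
# Split sizes as reported by the DatasetDict printed earlier.
split_sizes = {"train": 1_270_444, "test": 272_238, "val": 272_238}


def subsampled_sizes(sizes: dict, fraction: float) -> dict:
    """Rows kept per split when selecting range(int(len(split) * fraction))."""
    return {name: int(n * fraction) for name, n in sizes.items()}


# Hypothetical 10% subsample; the notebook itself runs with SUBSAMPLING = 1.0.
ten_percent = subsampled_sizes(split_sizes, 0.1)
print(ten_percent)
```

Because `int()` truncates, fractional row counts are rounded down, and since every split is shuffled with the same `SEED`, a given fraction always selects the same rows.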

## Tokenisation and encoding

In [10]:
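def encode_ds(ds: Union[Dataset, DatasetDict], tokenizer_model: str = model_ckpt) -> Union[Dataset, DatasetDict]:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)  # body reconstructed from the tokenisation step used in the Training section
    ds_enc = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
    return ds_enc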

## Evaluation metrics

In [11]:
accuracy = evaluate.load("accuracy")
precision, recall = evaluate.load("precision"), evaluate.load("recall")
f1 = evaluate.load("f1")

In [12]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        'precision_macroaverage': precision.compute(predictions=predictions, references=labels, average='macro')["precision"],
        'precision_microaverage': precision.compute(predictions=predictions, references=labels, average='micro')["precision"],
        'recall_macroaverage': recall.compute(predictions=predictions, references=labels, average='macro')["recall"],
        'recall_microaverage': recall.compute(predictions=predictions, references=labels, average='micro')["recall"],
        'f1_microaverage': f1.compute(predictions=predictions, references=labels, average='micro')["f1"]
    }
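
The difference between the macro- and micro-averaged metrics above is easiest to see on a small imbalanced example. The sketch below is a plain-Python illustration with made-up labels; it is not how `evaluate` computes the scores internally, and the names `precision_scores`, `toy_refs` and `toy_preds` are hypothetical.

```python
from collections import Counter


def precision_scores(references, predictions):
    """Return (macro, micro) precision for single-label multiclass predictions."""
    labels = sorted(set(references) | set(predictions))
    true_pos = Counter(p for r, p in zip(references, predictions) if r == p)
    predicted = Counter(predictions)
    # Macro: average the per-class precision, so every class counts equally.
    per_class = [true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    macro = sum(per_class) / len(per_class)
    # Micro: pool true positives and predictions over all classes first.
    micro = sum(true_pos.values()) / len(predictions)
    return macro, micro


# Made-up data: class 0 dominates, classes 1 and 2 are rare.
toy_refs = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
toy_preds = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

toy_macro, toy_micro = precision_scores(toy_refs, toy_preds)
toy_accuracy = sum(r == p for r, p in zip(toy_refs, toy_preds)) / len(toy_refs)
print(toy_macro, toy_micro, toy_accuracy)
```

In a single-label multiclass setting the micro-averaged precision coincides with accuracy, because every prediction is counted exactly once. The macro average is the more informative of the two when rare outcomes matter as much as common ones.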

## Training

We specify the label map manually. Although `Datasets` has a function for this, `AutoModelForSequenceClassification` requires an object with a length, so we build a plain dictionary from the label names.

In [13]:
label_map = {i: label for i, label in enumerate(dataset["test"].features["label"].names)}
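
For illustration, this is the shape of the two mappings that are later passed to the model as `id2label` and `label2id`. The label names below are placeholders invented for this sketch; the real ones come from `dataset["test"].features["label"].names`.

```python
# Hypothetical label names, for illustration only.
names = ["outcome_a", "outcome_b", "outcome_c"]

# Integer class ID -> label name, built the same way as label_map above.
id2label = {i: label for i, label in enumerate(names)}

# Inverse mapping: label name -> integer class ID.
label2id = {label: i for i, label in id2label.items()}

print(id2label)
print(label2id)
```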

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

cols = dataset["train"].column_names
cols.remove("label")
ds_enc = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True, remove_columns=cols)


Map: 100%|██████████| 272238/272238 [01:45<00:00, 2592.04 examples/s]
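
Note that the tokenizer call truncates long reports but does not pad short ones. When a tokenizer is handed to `Trainer`, as in the training cell, batches are, to my understanding, padded at collation time by a `DataCollatorWithPadding`. The sketch below imitates that dynamic padding in plain Python; `pad_batch` and the token IDs are made up for illustration and this is not the library's implementation.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs of unequal length, standing in for tokenised reports.
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(toy_batch)
print(padded)
```

Padding per batch rather than to a fixed global length keeps batches of short reports small, which matters on a data set of this size.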


In [15]:

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, 
    num_labels=len(dataset["test"].features["label"].names), 
    id2label=label_map, 
    label2id={v:k for k,v in label_map.items()})

args = TrainingArguments(
    output_dir="vaers",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=.01,
    logging_steps=1,
    load_best_model_at_end=True,
    run_name="daedra-training",
    report_to=["wandb"])

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_enc["train"],
        eval_dataset=ds_enc["val"],  # evaluate each epoch on the validation split; keep the test split held out
        tokenizer=tokenizer,
        compute_metrics=compute_metrics)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
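
To get a feel for the length of this run, the following sketch estimates the number of optimizer steps implied by the training arguments. It assumes no gradient accumulation (the default) and that the last partial batch of each epoch is kept. The device count is an assumption, since the `nvidia-smi` output above is cut off, so both one and two GPUs are shown. The helper name is hypothetical.

```python
import math


def total_optimizer_steps(n_rows: int, per_device_batch: int, n_devices: int, epochs: int) -> int:
    """Steps per epoch (last partial batch kept) times the number of epochs."""
    steps_per_epoch = math.ceil(n_rows / (per_device_batch * n_devices))
    return steps_per_epoch * epochs


TRAIN_ROWS = 1_270_444  # train split size reported when the data set was loaded

# BATCH_SIZE = 32 and EPOCHS = 5, as configured earlier in the notebook.
steps_one_gpu = total_optimizer_steps(TRAIN_ROWS, 32, 1, 5)
steps_two_gpus = total_optimizer_steps(TRAIN_ROWS, 32, 2, 5)
print(steps_one_gpu, steps_two_gpus)
```

With `logging_steps=1`, every one of these steps produces a log entry, so raising that value may be worth considering for a full-sample run.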


In [16]:
wandb_tag: List[str] = [f"subsample-{SUBSAMPLING}" if SUBSAMPLING != 1.0 else "full_sample"]

wandb_tag.append(f"batch_size-{BATCH_SIZE}")
wandb_tag.append(f"base:{model_ckpt}")
    
wandb.init(name="daedra_training_run", tags=wandb_tag, magic=True)

[34m[1mwandb[0m: Currently logged in as: [33mchrisvoncsefalvay[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [17]:
tracker.start()
trainer.train()
tracker.stop()


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


Epoch,Training Loss,Validation Loss


[codecarbon INFO @ 04:46:20] Energy consumed for RAM : 0.000690 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:46:20] Energy consumed for all GPUs : 0.001499 kWh. Total GPU Power : 359.1829830586385 W
[codecarbon INFO @ 04:46:20] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 04:46:20] 0.002366 kWh of electricity used since the beginning.
[codecarbon INFO @ 04:46:35] Energy consumed for RAM : 0.001378 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:46:35] Energy consumed for all GPUs : 0.004078 kWh. Total GPU Power : 619.6193403526773 W
[codecarbon INFO @ 04:46:35] Energy consumed for all CPUs : 0.000355 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 04:46:35] 0.005811 kWh of electricity used since the beginning.
[codecarbon INFO @ 04:46:50] Energy consumed for RAM : 0.002066 kWh. RAM Power : 165.33123922348022 W
[codecarbon INFO @ 04:46:50] Energy consumed for all GPUs : 0.006632 kWh. Total GPU Power : 613.6554096062
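
The cumulative energy figures in the codecarbon log can be checked by hand: energy is power multiplied by time, and the log lines are 15 seconds apart. The sketch below assumes the power draw stays constant between two readings, which is a simplification; `energy_kwh` is a hypothetical helper, not a codecarbon function.

```python
def energy_kwh(power_watts: float, seconds: float) -> float:
    """Energy in kWh for a constant power draw over a time interval."""
    return power_watts * seconds / 3_600_000  # 1 kWh = 3.6e6 J


LOG_INTERVAL_S = 15  # spacing between the log timestamps above

# Power readings copied from the log output above.
ram_step = energy_kwh(165.33123922348022, LOG_INTERVAL_S)
cpu_step = energy_kwh(42.5, LOG_INTERVAL_S)
print(f"RAM: {ram_step:.6f} kWh per interval, CPU: {cpu_step:.6f} kWh per interval")
```

These per-interval values are close to the increments between consecutive RAM and CPU lines in the log, for example 0.001378 - 0.000690 = 0.000688 kWh for RAM.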

In [None]:
wandb.finish()

In [None]:
variant = "full_sample" if SUBSAMPLING == 1.0 else f"subsample-{SUBSAMPLING}"
tokenizer._tokenizer.save("tokenizer.json")
tokenizer.push_to_hub("chrisvoncsefalvay/daedra")
sample = "full sample" if SUBSAMPLING == 1.0 else f"{SUBSAMPLING * 100}% of the full sample"

model.push_to_hub("chrisvoncsefalvay/daedra", 
                  variant=variant,
                  commit_message=f"DAEDRA model trained on {sample} of the VAERS dataset (training set size: {dataset['train'].num_rows:,})")
