Model card for RadEdit

Model description

RadEdit is a deep learning approach for stress testing biomedical vision models to discover failure cases. It uses a generative text-to-image model to “edit” chest X-rays by using a text description to add or remove abnormalities from a masked region of the image. These edited images can subsequently be used to test whether existing models (e.g. those for disease classification or anatomy segmentation) perform as expected under these different conditions.

To enable this, a text-to-image latent diffusion model is trained from scratch to generate chest X-rays from either the impression section of a radiology report (a short clinically actionable outline of the main findings) or a list of radiographic observations.

RadEdit is described in detail in RadEdit: stress-testing biomedical vision models via diffusion image editing (F. Pérez-García, S. Bond-Taylor, et al., 2024).

We release the weights for the RadEdit model as well as the editing pipeline for stress-testing models.

Developed by: Microsoft Health Futures
Model type: Latent Diffusion Model
License: Model weights in the unet subfolder are licensed under MSRLA. Editing pipeline in pipeline.py is licensed under MIT.
Components: Text encoder and tokenizer: BioViL-T. Autoencoder: SDXL-VAE.

Model Uses
Data
Biases, Risks and Limitations
Model Capabilities
Getting Started
- Sampling Chest X-Rays
- Editing
Training Details
Citation

Uses

Intended Use

The model checkpoints are intended to be used solely for (I) future research on chest X-ray generation and model stress-testing and (II) reproducibility of the experimental results reported in the reference paper. The code and model checkpoints should not be used to provide medical or clinical opinions, and are not designed to replace the role of qualified medical professionals in appropriately identifying, assessing, diagnosing or managing medical conditions. Users remain responsible for any outputs generated by the model.

Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. RadEdit and its associated models should be helpful for exploring various biomedical stress-testing tasks via image editing or generation.

Out-of-Scope Use

Any deployed use case of the model, commercial or otherwise, is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases.

Data

RadEdit was trained on the following public deidentified chest X-ray datasets. Only the frontal view chest X-rays are used, totalling 487,680 training images. For MIMIC-CXR, the impression section of the radiology report (a short clinically actionable outline of the main findings) is used as the input text to the model. For NIH-CXR and CheXpert, the input text is a list of all abnormalities present in an image as indicated by the labels, e.g., “Cardiomegaly. Pneumothorax.”
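The label-to-text conversion for NIH-CXR and CheXpert can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name labels_to_prompt is hypothetical, and the handling of uncertain labels and of images with no positive label is not described here, so the sketch simply ignores any value other than 1 and returns an empty string when nothing is positive.

```python
def labels_to_prompt(labels: dict[str, int]) -> str:
    """Join the names of all positive labels into a prompt such as
    "Cardiomegaly. Pneumothorax." (hypothetical helper)."""
    # Assumption: only labels equal to 1 count as present.
    positive = [name for name, value in labels.items() if value == 1]
    # Each finding becomes its own short "sentence" ending in a full stop.
    return " ".join(f"{name}." for name in positive)


example = {"Cardiomegaly": 1, "Edema": 0, "Pneumothorax": 1}
print(labels_to_prompt(example))  # Cardiomegaly. Pneumothorax.
```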

MIMIC-CXR

The MIMIC-CXR dataset contains 377,110 image-report pairs from 227,827 radiology studies. A patient may have multiple studies, and each study may contain multiple chest X-ray (CXR) images taken from different views. We follow the standard partition and use the first nine subsets (P10-P18) for training and validation, while reserving the last (P19) for testing.
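As a rough sketch of the partition above, a file can be assigned to a subset from its top-level patient folder. The helper name mimic_partition and the example paths are hypothetical; the sketch assumes the usual MIMIC-CXR layout in which files are grouped under folders p10 to p19.

```python
import re

TEST_FOLDER = "p19"  # last subset, reserved for testing


def mimic_partition(path: str) -> str:
    """Return "train_val" for subsets p10-p18 and "test" for p19."""
    # Match a path component that is exactly p10 ... p19.
    match = re.search(r"(?:^|/)(p1[0-9])/", path)
    if match is None:
        raise ValueError(f"No p10-p19 folder found in path: {path}")
    return "test" if match.group(1) == TEST_FOLDER else "train_val"


# Hypothetical example paths, for illustration only.
print(mimic_partition("files/p10/p10000032/s50414267/image.jpg"))  # train_val
print(mimic_partition("files/p19/p19999999/s50000000/image.jpg"))  # test
```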

NIH-CXR

The NIH-CXR dataset contains 112,120 X-ray images with 8 automatically generated disease labels from 30,805 unique patients. Since there is no official validation split, we create a random train/validation split, ensuring that no patient appears in both sets.
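A patient-level split of this kind can be sketched as below. The function name, the record layout (a patient_id key), the validation fraction and the seed are all assumptions made for illustration; the fraction actually used is not stated here.

```python
import random


def split_by_patient(records, val_fraction=0.1, seed=0):
    """Randomly split records so that no patient appears in both sets."""
    # Shuffle unique patient IDs rather than individual images.
    patient_ids = sorted({record["patient_id"] for record in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    # Assumption: keep at least one patient for validation.
    n_val = max(1, round(len(patient_ids) * val_fraction))
    val_patients = set(patient_ids[:n_val])

    train = [r for r in records if r["patient_id"] not in val_patients]
    val = [r for r in records if r["patient_id"] in val_patients]
    return train, val
```

Splitting on patient IDs instead of images is what prevents images of the same patient from leaking between the two sets.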

CheXpert

The CheXpert dataset contains 224,316 chest X-ray images from 65,240 patients together with automatically generated labels indicating the presence of 14 observations in radiology reports. We use the official train/validation split.

Biases, Risks and Limitations

The model was developed using English corpora, and thus may be considered English-only. The model is evaluated on a narrow set of biomedical benchmark tasks, described in the RadEdit paper. As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, the model is likely to carry many of the limitations of the models from which it is derived: Stable Diffusion v1.5, BioViL-T, and SDXL-VAE. For example, the SDXL-VAE (which is used to compress images prior to training the diffusion model) can exhibit artefacts in its reconstructions, which can make generated images distinguishable from real images. See Figure 12 in this paper for examples of such artefacts. While evaluation has included clinical input, this is not exhaustive; model performance will vary in different settings, and the model is intended for research use only.

Further, the model inherits the biases of the training datasets. These datasets come from hospitals in the United States; therefore, the model might be biased towards the population represented in the training data. Underlying biases of the training datasets may not be well characterized. A substantial proportion of the training data comes from inpatient medical records; samples from the model are thus reflective of this population. Due to the automated procedure used to obtain pathology labels, erroneous labels may have been used to train the model, which may affect its performance.

The RadEdit editing pipeline is not applicable to all stress testing scenarios. For example, testing how segmentation models respond to cardiomegaly (enlarged heart) is not possible, as this would require the segmentation masks to be changed. Other limitations of the editing procedure are discussed in the RadEdit paper.

Other limitations:

The model does not achieve perfect photorealism.
Model outputs may include errors.
The model can fail to produce aligned outputs for more complex prompts.
The model can fail to produce outputs matching the text input, particularly if the text differs substantially from the training data.
When using the model for image editing, unwanted visual changes may be made.

Getting Started

This repository provides the weights for the U-Net model. The VAE, text encoder, tokenizer, and scheduler have to be loaded separately and combined into the generation pipeline:

from transformers import AutoModel, AutoTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline, UNet2DConditionModel

# Load the UNet model
unet_loaded = UNet2DConditionModel.from_pretrained("microsoft/radedit", subfolder="unet")

# Load all other components of the stable diffusion pipeline
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
text_encoder = AutoModel.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    model_max_length=128,
    trust_remote_code=True,
)
scheduler = DDIMScheduler(
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
    timestep_spacing="trailing",
    steps_offset=1,
)

generation_pipeline = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet_loaded,
    scheduler=scheduler,
    safety_checker=None,
    requires_safety_checker=False,
    feature_extractor=None,
)
generation_pipeline.to("cuda")

Sampling Chest X-Rays

The generation pipeline can be used to sample images as follows:

import torch

prompts = [
    "Small right-sided pleural effusion",
    "No acute cardiopulmonary process",
    "Small left-sided pleural effusion",
    "Large right-sided pleural effusion",
    "Bilateral pleural effusions",
    "Large left-sided pleural effusion",
]

torch.manual_seed(0)
images = generation_pipeline(
    prompts,
    num_inference_steps=100,
    guidance_scale=7.5,
).images
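To inspect the samples, they can be written to disk. The sketch below assumes the pipeline returns its default output of PIL images, each of which has a save method; the helper name save_samples, the output folder and the file naming scheme are hypothetical.

```python
import re
from pathlib import Path


def save_samples(images, prompts, out_dir="samples"):
    """Save each image as a PNG named after its index and prompt."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, (image, prompt) in enumerate(zip(images, prompts)):
        # Turn the prompt into a file-system-friendly slug.
        slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
        file_path = out_path / f"{index:02d}_{slug}.png"
        image.save(file_path)
        saved.append(file_path)
    return saved


# Usage with the variables defined above:
# save_samples(images, prompts)
```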

Editing

To load the RadEdit editing pipeline, we convert the generation pipeline into the custom pipeline in pipeline.py:

from diffusers import DiffusionPipeline
radedit_pipeline = DiffusionPipeline.from_pipe(
    generation_pipeline,
    custom_pipeline="microsoft/radedit",
)

Following this, RadEdit can be used to edit an input image using two masks: the edit mask (passed as edit_mask), which defines the region the editing prompt is applied to, and the fixed mask (passed as keep_mask), which defines the region where any edits are prevented from taking place.

# input_img, input_mask and fixed_mask must be prepared beforehand
prompt = 'No acute cardiopulmonary process'
arrays = radedit_pipeline(
    prompt,
    weights=[7.5],
    image=input_img,
    edit_mask=input_mask,
    keep_mask=fixed_mask,
    num_inference_steps=200,
    invert_prompt='',
    skip_ratio=0.3,
)
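The relationship between the two masks can be illustrated with a small, library-free sketch. It builds rectangular binary masks as nested lists and checks that the edit and keep regions do not overlap. The helper names are hypothetical, and the array type, shape and value range that pipeline.py actually expects are not stated here, so the masks would need converting accordingly. Whether the pipeline itself rejects overlapping masks is also an assumption left unchecked.

```python
def rectangle_mask(height, width, top, left, bottom, right):
    """Binary mask that is 1 inside the half-open box
    [top, bottom) x [left, right) and 0 elsewhere."""
    return [
        [1 if top <= row < bottom and left <= col < right else 0
         for col in range(width)]
        for row in range(height)
    ]


def masks_overlap(mask_a, mask_b):
    """True if any pixel is set in both masks."""
    return any(
        a and b
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
    )


# Toy 8x8 example: edit a small box, keep a strip on the right fixed.
edit_region = rectangle_mask(8, 8, top=2, left=1, bottom=5, right=4)
keep_region = rectangle_mask(8, 8, top=0, left=5, bottom=8, right=8)
print(masks_overlap(edit_region, keep_region))  # False
```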

Training details

We train the U-Net for 300 epochs, monitoring validation loss to avoid overfitting. During training we regularly evaluate a number of different metrics which assess the quality, diversity and alignment between prompt and generation, including FID, precision/recall/density/coverage, and CLIP score to ensure that samples are high quality and diverse.

Environmental impact

Hardware type: NVIDIA V100 GPUs
Hours used: 318 hours/GPU × 1 node × 8 GPUs/node = 2544 GPU-hours
Cloud provider: Azure
Compute region: West US 2
Carbon emitted: 229 kg CO₂ eq.
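The figures above can be checked with a few lines of arithmetic. The per-GPU-hour emission factor below is only what the reported totals imply; it is not a number given by the authors.

```python
hours_per_gpu = 318
nodes = 1
gpus_per_node = 8

# Total compute, as reported under "Hours used".
gpu_hours = hours_per_gpu * nodes * gpus_per_node
print(gpu_hours)  # 2544

# Implied emissions per GPU-hour (derived, not reported).
carbon_kg = 229
kg_per_gpu_hour = carbon_kg / gpu_hours
print(round(kg_per_gpu_hour, 3))  # 0.09
```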

Compute infrastructure

RadEdit was trained on Azure Machine Learning.

Software

We used SimpleITK and Pydicom for processing of DICOM files.

Citation

BibTeX:

@inproceedings{perezgarcia2024radedit,
      title={{RadEdit}: stress-testing biomedical vision models via diffusion image editing},
      author={P{\'e}rez-Garc{\'i}a, Fernando and Bond-Taylor, Sam and Sanchez, Pedro P and van Breugel, Boris and Castro, Daniel C and Sharma, Harshita and Salvatelli, Valentina and Wetscherek, Maria TA and Richardson, Hannah and Lungren, Matthew P and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan and Ilse, Maximilian},
      year={2024},
      booktitle={European Conference on Computer Vision}
}

APA:

Pérez-García, F., Bond-Taylor, S., Sanchez, P. P., van Breugel, B., Castro, D. C., Sharma, H., ... & Ilse, M. (2024). RadEdit: stress-testing biomedical vision models via diffusion image editing. European Conference on Computer Vision.

Model card contact

Sam Bond-Taylor (sbondtaylor@microsoft.com).
