---
license: cc-by-sa-4.0
tags:
  - common-canvas
  - stable-diffusion
  - text-to-image
datasets:
  - common-canvas/commoncatalog-cc-by-sa
  - common-canvas/commoncatalog-cc-by
language:
  - en
---

CommonCanvas-S-C

Version Number: 0.1

Summary

CommonCanvas is a family of latent diffusion models capable of generating images from a given text prompt. Models in the family vary in size and are trained on different subsets of the CommonCatalog dataset (see Data Card), a large dataset of Creative Commons-licensed images with synthetic captions produced by a pre-trained BLIP-2 captioning model. CommonCanvas-S-C is the small (S) model based on the Stable Diffusion 2 architecture and trained on the commercial (C) subset of CommonCatalog.

The goal of this work is to produce a high-quality text-to-image model using an easily accessible dataset of known provenance. The exact training recipe of the model can be found in the paper: https://arxiv.org/abs/2310.16825

Training Overview

Input: CommonCatalog Text Captions
Output: CommonCatalog Images
Architecture: Stable Diffusion 2

Performance Limitations

CommonCanvas under-performs in several categories, including faces, general photography, and paintings (see paper, Figure 8). These evaluation datasets all originate from the Conceptual Captions dataset, which relies on web-scraped data. Web-sourced captions, while abundant, may not always align with the nuances of human-generated language. Transitioning to synthetic captions introduces certain performance challenges; however, the drop in performance is not as dramatic as one might assume.

Training Dataset Limitations

The model is trained on roughly ten-year-old YFCC data, so modern concepts and recent events may be missing from its training corpus. Performance will be worse on certain proper nouns and specific celebrities, but this is a feature, not a bug: due to the autogenerated nature of the caption data, the model may not generate known artwork, individual celebrities, or specific locations.

Note: The non-commercial (NC) variants of this model are explicitly not intended to be used for commercial purposes.

  • It is trained on data derived from the YFCC100M (Flickr) dataset. The information is dated and known to be biased toward internet-connected Western countries; some areas, such as the Global South, lack representation.

Associated Risks

  • Text in images produced by the model will likely be difficult to read.
  • The model struggles with more complex tasks that require compositional understanding.
  • It may not accurately generate faces or representations of specific people.
  • CommonCatalog (the training dataset) contains synthetic captions that are primarily English-language text; our models may not perform as effectively when prompted in other languages.
  • The autoencoder aspect of the model introduces some information loss.
  • It may be possible to guide the model to generate harmful content, e.g., nudity or other NSFW material.

Intended Uses

  • Using the model for generative AI research
  • Safe deployment of models which have the potential to generate harmful content.
  • Probing and understanding the limitations and biases of generative models.
  • Generation of artworks and use in design and other artistic processes.
  • Applications in educational or creative tools.
  • Research on generative models.

Usage

We recommend using the MosaicML Diffusion Repo to finetune / train the model: https://github.com/mosaicml/diffusion. Example finetuning code coming soon.

Spaces demo

Try the model demo on Hugging Face Spaces

Inference with 🧨 diffusers

```python
import torch
from diffusers import StableDiffusionPipeline

# Use a GPU when available; the pipeline also runs (slowly) on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionPipeline.from_pretrained(
    "common-canvas/CommonCanvas-S-C",
    custom_pipeline="hyoungwoncho/sd_perturbed_attention_guidance",  # read more at https://huggingface.co/hyoungwoncho/sd_perturbed_attention_guidance
    torch_dtype=torch.float16,
).to(device)

prompt = "a cat sitting in a car seat"
image = pipe(prompt, num_inference_steps=25).images[0]
```
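For repeatable outputs, you can pass a seeded `torch.Generator` to the pipeline and fall back to full precision on CPU-only machines, where fp16 is poorly supported. A minimal sketch (the seed, guidance scale, and output filename are illustrative; running it downloads the model weights):

```python
import torch
from diffusers import StableDiffusionPipeline

# fp16 on GPU, fp32 on CPU (half precision is poorly supported on CPU).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "common-canvas/CommonCanvas-S-C",
    torch_dtype=dtype,
).to(device)

# A fixed seed makes generations repeatable across runs on the same hardware.
generator = torch.Generator(device=device).manual_seed(42)

image = pipe(
    "a cat sitting in a car seat",
    num_inference_steps=25,
    guidance_scale=7.5,  # illustrative classifier-free guidance strength
    generator=generator,
).images[0]
image.save("cat.png")
```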

Inference with ComfyUI / AUTOMATIC1111

Download safetensors ⬇️
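From the command line, the checkpoint can be fetched with the `huggingface-cli` tool that ships with `huggingface_hub` (the `--include` pattern below is an assumption; check the repository's file list for the exact checkpoint name):

```shell
# Download the .safetensors checkpoint(s) from the model repo
# into the current directory, for use with ComfyUI / AUTOMATIC1111.
huggingface-cli download common-canvas/CommonCanvas-S-C \
    --include "*.safetensors" --local-dir .
```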

Evaluation/Validation

We validated the model against Stability AI's SD2 model and compared the two in a human user study; see the paper for details.

Acknowledgements

We thank @multimodalart, @Wauplin, and @lhoestq at Hugging Face for helping us host the dataset and model weights.

Citation

```bibtex
@article{gokaslan2023commoncanvas,
  title={CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images},
  author={Gokaslan, Aaron and Cooper, A Feder and Collins, Jasmine and Seguin, Landan and Jacobson, Austin and Patel, Mihir and Frankle, Jonathan and Stephenson, Cory and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2310.16825},
  year={2023}
}
```