Migrated from GitHub
- 000000000285.jpg +0 -0
- 000000000724.jpg +0 -0
- 000000007991.jpg +0 -0
- 000000018837.jpg +0 -0
- 000000122962.jpg +0 -0
- 000000295478.jpg +0 -0
- ORIGINAL_README.md +128 -0
- eval_controlnet.py +148 -0
- eval_controlnet.sh +19 -0
- eval_controlnet_sdxl_light.py +284 -0
- eval_controlnet_sdxl_light.sh +44 -0
- eval_controlnet_sdxl_light_single.py +390 -0
- eval_controlnet_sdxl_light_single.sh +20 -0
- example/UUColor_results/Hollywood-Sign.jpeg +0 -0
- example/legacy_images/Big-Ben-vintage.jpg +0 -0
- example/legacy_images/Central-Park.jpg +0 -0
- example/legacy_images/Hollywood-Sign.jpg +0 -0
- example/legacy_images/Little-Mermaid.jpg +0 -0
- example/legacy_images/Migrant-Mother.jpg +0 -0
- example/legacy_images/Mount-Everest.jpg +0 -0
- example/legacy_images/Tower-of-Pisa.jpg +0 -0
- example/legacy_images/Wasatch-Mountains-Summit-County-Utah.jpg +0 -0
- gradio_ui.py +356 -0
- images/000000022935_gray.jpg +0 -0
- images/000000022935_green_shirt_on_right_girl.jpeg +0 -0
- images/000000022935_purple_shirt_on_right_girl.jpeg +0 -0
- images/000000022935_red_shirt_on_right_girl.jpeg +0 -0
- images/000000025560_color.jpg +0 -0
- images/000000025560_gray.jpg +0 -0
- images/000000025560_gt.jpg +0 -0
- images/000000041633_black_car.jpeg +0 -0
- images/000000041633_bright_red_car.jpeg +0 -0
- images/000000041633_dark_blue_car.jpeg +0 -0
- images/000000041633_gray.jpg +0 -0
- images/000000065736_color.jpg +0 -0
- images/000000065736_gray.jpg +0 -0
- images/000000065736_gt.jpg +0 -0
- images/000000091779_color.jpg +0 -0
- images/000000091779_gray.jpg +0 -0
- images/000000091779_gt.jpg +0 -0
- images/000000092177_color.jpg +0 -0
- images/000000092177_gray.jpg +0 -0
- images/000000092177_gt.jpg +0 -0
- images/000000166426_color.jpg +0 -0
- images/000000166426_gray.jpg +0 -0
- images/000000166426_gt.jpg +0 -0
- images/000000286708_gray.jpg +0 -0
- images/000000286708_orange_hat.jpeg +0 -0
- images/000000286708_pink_hat.jpeg +0 -0
- images/000000286708_yellow_hat.jpeg +0 -0
000000000285.jpg
ADDED
000000000724.jpg
ADDED
000000007991.jpg
ADDED
000000018837.jpg
ADDED
000000122962.jpg
ADDED
000000295478.jpg
ADDED
ORIGINAL_README.md
ADDED
@@ -0,0 +1,128 @@
# Text-Guided-Image-Colorization

This project utilizes the power of **Stable Diffusion (SDXL/SDXL-Light)** and the **BLIP (Bootstrapping Language-Image Pre-training)** captioning model to provide an interactive image colorization experience. Users can influence the generated colors of objects within images, making the colorization process more personalized and creative.

## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Dataset Usage](#dataset-usage)
- [Training](#training)
- [Evaluation](#evaluation)
- [Results](#results)
- [License](#license)

## Features

- **Interactive Colorization**: Users can specify desired colors for different objects in the image.
- **ControlNet Approach**: Enhanced colorization capabilities through retraining with ControlNet, allowing SDXL to better adapt to the image colorization task.
- **High-Quality Outputs**: Leverage the latest advancements in diffusion models to generate vibrant and realistic colorizations.
- **User-Friendly Interface**: Easy-to-use interface for seamless interaction with the model.

## Installation

To set up the project locally, follow these steps:

1. **Clone the Repository**:

   ```bash
   git clone https://github.com/nick8592/text-guided-image-colorization.git
   cd text-guided-image-colorization
   ```

2. **Install Dependencies**:
   Make sure you have Python 3.7 or higher installed. Then, install the required packages:

   ```bash
   pip install -r requirements.txt
   ```
   Install `torch` and `torchvision` matching your CUDA version:
   ```bash
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cuXXX
   ```
   Replace `XXX` with your CUDA version (e.g., `118` for CUDA 11.8). For more info, see [PyTorch Get Started](https://pytorch.org/get-started/locally/).

3. **Download Pre-trained Models**:

   | Models | Hugging Face (Recommended) | Other |
   |:---:|:---:|:---:|
   |SDXL-Lightning Caption|[link](https://huggingface.co/nickpai/sdxl_light_caption_output)|[link](https://gofile.me/7uE8s/FlEhfpWPw) (2kNJfV)|
   |SDXL-Lightning Custom Caption (Recommended)|[link](https://huggingface.co/nickpai/sdxl_light_custom_caption_output)|[link](https://gofile.me/7uE8s/AKmRq5sLR) (KW7Fpi)|

   Place the checkpoint under the project root with the following layout:

   ```bash
   text-guided-image-colorization/sdxl_light_caption_output
   └── checkpoint-30000
       ├── controlnet
       │   ├── diffusion_pytorch_model.safetensors
       │   └── config.json
       ├── optimizer.bin
       ├── random_states_0.pkl
       ├── scaler.pt
       └── scheduler.bin
   ```
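   If you prefer to script the download, the sketch below pulls the recommended checkpoint from Hugging Face into the layout shown above. This is a minimal sketch using `huggingface_hub`; the target directory name is an assumption and should match whatever path you pass to the evaluation scripts.

   ```python
   # Hedged sketch: fetch an SDXL-Lightning caption checkpoint into the expected folder.
   # Assumes the Hugging Face repo mirrors the "checkpoint-30000/controlnet/..." layout above.
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="nickpai/sdxl_light_caption_output",  # or nickpai/sdxl_light_custom_caption_output
       local_dir="sdxl_light_caption_output",        # assumed target folder under the project root
   )
   ```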

## Quick Start

1. Run the `gradio_ui.py` script:

   ```bash
   python gradio_ui.py
   ```

2. Open the provided URL in your web browser to access the Gradio-based user interface.

3. Upload an image and use the interface to control the colors of specific objects in the image. The model can also colorize images without a specific prompt.

4. The model will generate a colorized version of the image based on your input (or automatically, when no prompt is given). See the [demo video](https://x.com/weichenpai/status/1829513077588631987).
![Gradio UI](images/gradio_ui.png)

## Dataset Usage

You can find more details about the dataset usage in the [Dataset-for-Image-Colorization](https://github.com/nick8592/Dataset-for-Image-Colorization) repository.

## Training

For training, you can use one of the following scripts:

- `train_controlnet.sh`: Trains a model using [Stable Diffusion v2](https://huggingface.co/stabilityai/stable-diffusion-2-1)
- `train_controlnet_sdxl.sh`: Trains a model using [SDXL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
- `train_controlnet_sdxl_light.sh`: Trains a model using [SDXL-Lightning](https://huggingface.co/ByteDance/SDXL-Lightning)

Although the training code for SDXL is provided, I wasn't able to train that model myself due to a lack of GPU resources, so you may run into errors when trying to train it.

## Evaluation

For evaluation, you can use one of the following scripts:

- `eval_controlnet.sh`: Evaluates the model using [Stable Diffusion v2](https://huggingface.co/stabilityai/stable-diffusion-2-1) on a folder of images.
- `eval_controlnet_sdxl_light.sh`: Evaluates the model using [SDXL-Lightning](https://huggingface.co/ByteDance/SDXL-Lightning) on a folder of images.
- `eval_controlnet_sdxl_light_single.sh`: Evaluates the model using [SDXL-Lightning](https://huggingface.co/ByteDance/SDXL-Lightning) on a single image.

## Results
### Prompt-Guided
| Caption | Condition 1 | Condition 2 | Condition 3 |
|:---:|:---:|:---:|:---:|
| ![000000022935_gray.jpg](images/000000022935_gray.jpg) | ![000000022935_green_shirt_on_right_girl.jpeg](images/000000022935_green_shirt_on_right_girl.jpeg) | ![000000022935_purple_shirt_on_right_girl.jpeg](images/000000022935_purple_shirt_on_right_girl.jpeg) | ![000000022935_red_shirt_on_right_girl.jpeg](images/000000022935_red_shirt_on_right_girl.jpeg) |
| a photography of a woman in a soccer uniform kicking a soccer ball | + "green shirt" | + "purple shirt" | + "red shirt" |
| ![000000041633_gray.jpg](images/000000041633_gray.jpg) | ![000000041633_bright_red_car.jpeg](images/000000041633_bright_red_car.jpeg) | ![000000041633_dark_blue_car.jpeg](images/000000041633_dark_blue_car.jpeg) | ![000000041633_black_car.jpeg](images/000000041633_black_car.jpeg) |
| a photography of a photo of a truck | + "bright red car" | + "dark blue car" | + "black car" |
| ![000000286708_gray.jpg](images/000000286708_gray.jpg) | ![000000286708_orange_hat.jpeg](images/000000286708_orange_hat.jpeg) | ![000000286708_pink_hat.jpeg](images/000000286708_pink_hat.jpeg) | ![000000286708_yellow_hat.jpeg](images/000000286708_yellow_hat.jpeg) |
| a photography of a cat wearing a hat on his head | + "orange hat" | + "pink hat" | + "yellow hat" |

### Prompt-Free
Ground truth images are provided solely for reference purposes in the image colorization task.
| Grayscale Image | Colorized Result | Ground Truth |
|:---:|:---:|:---:|
| ![000000025560_gray.jpg](images/000000025560_gray.jpg) | ![000000025560_color.jpg](images/000000025560_color.jpg) | ![000000025560_gt.jpg](images/000000025560_gt.jpg) |
| ![000000065736_gray.jpg](images/000000065736_gray.jpg) | ![000000065736_color.jpg](images/000000065736_color.jpg) | ![000000065736_gt.jpg](images/000000065736_gt.jpg) |
| ![000000091779_gray.jpg](images/000000091779_gray.jpg) | ![000000091779_color.jpg](images/000000091779_color.jpg) | ![000000091779_gt.jpg](images/000000091779_gt.jpg) |
| ![000000092177_gray.jpg](images/000000092177_gray.jpg) | ![000000092177_color.jpg](images/000000092177_color.jpg) | ![000000092177_gt.jpg](images/000000092177_gt.jpg) |
| ![000000166426_gray.jpg](images/000000166426_gray.jpg) | ![000000166426_color.jpg](images/000000166426_color.jpg) | ![000000166426_gt.jpg](images/000000166426_gt.jpg) |

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
eval_controlnet.py
ADDED
@@ -0,0 +1,148 @@
import os
import time
import torch
import shutil
import argparse
import numpy as np

from tqdm import tqdm
from PIL import Image
from datasets import load_dataset
from diffusers.utils import load_image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Define the function to parse arguments
def parse_args(input_args=None):
    parser = argparse.ArgumentParser(description="Simple example of a ControlNet evaluation script.")

    parser.add_argument("--model_dir", type=str, default="sd_v2_caption_free_output/checkpoint-22500",
                        help="Directory of the model checkpoint")
    parser.add_argument("--model_id", type=str, default="stabilityai/stable-diffusion-2-base",
                        help="ID of the model (tested with runwayml/stable-diffusion-v1-5 and stabilityai/stable-diffusion-2-base)")
    parser.add_argument("--dataset", type=str, default="nickpai/coco2017-colorization",
                        help="Dataset used")
    parser.add_argument("--revision", type=str, default="caption-free",
                        choices=["main", "caption-free"],
                        help="Revision option (main/caption-free)")

    if input_args is not None:
        args = parser.parse_args(input_args)
    else:
        args = parser.parse_args()

    return args

def apply_color(image, color_map):
    # Convert input images to LAB color space
    image_lab = image.convert('LAB')
    color_map_lab = color_map.convert('LAB')

    # Split LAB channels
    l, a, b = image_lab.split()
    _, a_map, b_map = color_map_lab.split()

    # Merge LAB channels with color map
    merged_lab = Image.merge('LAB', (l, a_map, b_map))

    # Convert merged LAB image back to RGB color space
    result_rgb = merged_lab.convert('RGB')

    return result_rgb

def main(args):
    generator = torch.manual_seed(0)

    # MODEL_DIR = "sd_v2_caption_free_output/checkpoint-22500"
    # # MODEL_ID="runwayml/stable-diffusion-v1-5"
    # MODEL_ID="stabilityai/stable-diffusion-2-base"
    # DATASET = "nickpai/coco2017-colorization"
    # REVISION = "caption-free" # option: main/caption-free

    # Path to the eval_results folder
    eval_results_folder = os.path.join(args.model_dir, "results")

    # Remove eval_results folder if it exists
    if os.path.exists(eval_results_folder):
        shutil.rmtree(eval_results_folder)

    # Create directory for eval_results
    os.makedirs(eval_results_folder)

    # Create subfolders for compare and colorized images
    compare_folder = os.path.join(eval_results_folder, "compare")
    colorized_folder = os.path.join(eval_results_folder, "colorized")
    os.makedirs(compare_folder)
    os.makedirs(colorized_folder)

    # Load the validation split of the colorization dataset
    val_dataset = load_dataset(args.dataset, split="validation", revision=args.revision)

    controlnet = ControlNetModel.from_pretrained(f"{args.model_dir}/controlnet", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        args.model_id, controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    pipe.safety_checker = None

    # Counter for processed images
    processed_images = 0

    # Record start time
    start_time = time.time()

    # Iterate through the validation dataset
    for example in tqdm(val_dataset, desc="Processing Images"):
        image_path = example["file_name"]

        prompt = []
        for caption in example["captions"]:
            if isinstance(caption, str):
                prompt.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take the first caption if there are multiple
                prompt.append(caption[0])
            else:
                raise ValueError(
                    "Caption column `captions` should contain either strings or lists of strings."
                )

        # Generate image
        ground_truth_image = load_image(image_path).resize((512, 512))
        control_image = load_image(image_path).convert("L").convert("RGB").resize((512, 512))
        image = pipe(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0]

        # Apply color mapping
        image = apply_color(ground_truth_image, image)

        # Concatenate images into a row
        row_image = np.hstack((np.array(control_image), np.array(image), np.array(ground_truth_image)))
        row_image = Image.fromarray(row_image)

        # Save row image in the compare folder
        compare_output_path = os.path.join(compare_folder, f"{image_path.split('/')[-1]}")
        row_image.save(compare_output_path)

        # Save colorized image in the colorized folder
        colorized_output_path = os.path.join(colorized_folder, f"{image_path.split('/')[-1]}")
        image.save(colorized_output_path)

        # Increment processed images counter
        processed_images += 1

    # Record end time
    end_time = time.time()

    # Calculate total time taken
    total_time = end_time - start_time

    # Calculate FPS
    fps = processed_images / total_time

    print("All images processed.")
    print(f"Total time taken: {total_time:.2f} seconds")
    print(f"FPS: {fps:.2f}")

# Entry point of the script
if __name__ == "__main__":
    args = parse_args()
    main(args)
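The `apply_color` step above is the same LAB-space trick used by every script in this diff: only the a/b chroma channels are taken from the diffusion output, while the L (lightness) channel is kept from the reference image, which preserves the original structure. A minimal sketch of exercising it on its own (the file names are placeholders, not files from this repo):

```python
# Hedged sketch: reuse apply_color from eval_controlnet.py on two local files.
# "gray.jpg" and "colorized.jpg" are placeholder paths.
from PIL import Image
from eval_controlnet import apply_color

source = Image.open("gray.jpg").convert("RGB").resize((512, 512))          # supplies the L channel
color_map = Image.open("colorized.jpg").convert("RGB").resize((512, 512))  # supplies the a/b channels
apply_color(source, color_map).save("recolored.jpg")
```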
eval_controlnet.sh
ADDED
@@ -0,0 +1,19 @@
# Define default values for parameters

# # sdv2 with BCE loss
# MODEL_DIR="sd_v2_caption_bce_output/checkpoint-22500"
# MODEL_ID="stabilityai/stable-diffusion-2-base"
# DATASET="nickpai/coco2017-colorization"
# REVISION="main"

# sdv2 with KL loss
MODEL_DIR="sd_v2_caption_kl_output/checkpoint-22500"
MODEL_ID="stabilityai/stable-diffusion-2-base"
DATASET="nickpai/coco2017-colorization"
REVISION="main"

accelerate launch eval_controlnet.py \
  --model_dir=$MODEL_DIR \
  --model_id=$MODEL_ID \
  --dataset=$DATASET \
  --revision=$REVISION
eval_controlnet_sdxl_light.py
ADDED
@@ -0,0 +1,284 @@
import os
import time
import torch
import shutil
import argparse
import numpy as np

from tqdm import tqdm
from PIL import Image
from datasets import load_dataset
from accelerate import Accelerator
from diffusers.utils import load_image
from diffusers import (
    AutoencoderKL,
    StableDiffusionXLControlNetPipeline,
    ControlNetModel,
    UNet2DConditionModel,
)
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Define the function to parse arguments
def parse_args(input_args=None):
    parser = argparse.ArgumentParser(description="Simple example of a ControlNet evaluation script.")

    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--pretrained_vae_model_name_or_path",
        type=str,
        default=None,
        help="Path to an improved VAE to stabilize training. For more details check out: https://github.com/huggingface/diffusers/pull/4038.",
    )
    parser.add_argument(
        "--controlnet_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained controlnet model.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        required=True,
        help="Path to output results.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="nickpai/coco2017-colorization",
        help="Dataset used",
    )
    parser.add_argument(
        "--dataset_revision",
        type=str,
        default="caption-free",
        choices=["main", "caption-free", "custom-caption"],
        help="Revision option (main/caption-free/custom-caption)",
    )
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default=None,
        choices=["no", "fp16", "bf16"],
        help=(
            "Whether to use mixed precision. Choose between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >="
            " 1.10 and an Nvidia Ampere GPU. Defaults to the value of the accelerate config of the current system or the"
            " flag passed with the `accelerate.launch` command. Use this argument to override the accelerate config."
        ),
    )
    parser.add_argument(
        "--variant",
        type=str,
        default=None,
        help="Variant of the model files of the pretrained model identifier from huggingface.co/models, e.g. fp16",
    )
    parser.add_argument(
        "--revision",
        type=str,
        default=None,
        required=False,
        help="Revision of pretrained model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--num_inference_steps",
        type=int,
        default=8,
        help="1-step, 2-step, 4-step, or 8-step distilled models",
    )
    parser.add_argument(
        "--repo",
        type=str,
        default="ByteDance/SDXL-Lightning",
        required=True,
        help="Repository from huggingface.co",
    )
    parser.add_argument(
        "--ckpt",
        type=str,
        default="sdxl_lightning_4step_unet.safetensors",
        required=True,
        help="Available checkpoints from the repository",
    )
    parser.add_argument(
        "--negative_prompt",
        action="store_true",
        help="The prompt or prompts not to guide the image generation",
    )

    if input_args is not None:
        args = parser.parse_args(input_args)
    else:
        args = parser.parse_args()

    return args

def apply_color(image, color_map):
    # Convert input images to LAB color space
    image_lab = image.convert('LAB')
    color_map_lab = color_map.convert('LAB')

    # Split LAB channels
    l, a, b = image_lab.split()
    _, a_map, b_map = color_map_lab.split()

    # Merge LAB channels with color map
    merged_lab = Image.merge('LAB', (l, a_map, b_map))

    # Convert merged LAB image back to RGB color space
    result_rgb = merged_lab.convert('RGB')

    return result_rgb

def main(args):
    generator = torch.manual_seed(0)

    # Path to the eval_results folder
    eval_results_folder = os.path.join(args.output_dir, "results")

    # Remove eval_results folder if it exists
    if os.path.exists(eval_results_folder):
        shutil.rmtree(eval_results_folder)

    # Create directory for eval_results
    os.makedirs(eval_results_folder)

    # Create subfolders for compare and colorized images
    compare_folder = os.path.join(eval_results_folder, "compare")
    colorized_folder = os.path.join(eval_results_folder, "colorized")
    os.makedirs(compare_folder)
    os.makedirs(colorized_folder)

    # Load the validation split of the colorization dataset
    val_dataset = load_dataset(args.dataset, split="validation", revision=args.dataset_revision)

    accelerator = Accelerator(
        mixed_precision=args.mixed_precision,
    )

    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    vae_path = (
        args.pretrained_model_name_or_path
        if args.pretrained_vae_model_name_or_path is None
        else args.pretrained_vae_model_name_or_path
    )
    vae = AutoencoderKL.from_pretrained(
        vae_path,
        subfolder="vae" if args.pretrained_vae_model_name_or_path is None else None,
        revision=args.revision,
        variant=args.variant,
    )
    unet = UNet2DConditionModel.from_config(
        args.pretrained_model_name_or_path,
        subfolder="unet",
        revision=args.revision,
        variant=args.variant,
    )
    unet.load_state_dict(load_file(hf_hub_download(args.repo, args.ckpt)))

    # Move vae, unet and text_encoder to device and cast to weight_dtype
    # The VAE is kept in float32 to avoid NaN losses.
    if args.pretrained_vae_model_name_or_path is not None:
        vae.to(accelerator.device, dtype=weight_dtype)
    else:
        vae.to(accelerator.device, dtype=torch.float32)
    unet.to(accelerator.device, dtype=weight_dtype)

    controlnet = ControlNetModel.from_pretrained(args.controlnet_model_name_or_path, torch_dtype=weight_dtype)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        args.pretrained_model_name_or_path,
        vae=vae,
        unet=unet,
        controlnet=controlnet,
    )
    pipe.to(accelerator.device, dtype=weight_dtype)

    # Prepare everything with our `accelerator`.
    pipe, val_dataset = accelerator.prepare(pipe, val_dataset)

    pipe.safety_checker = None

    # Counter for processed images
    processed_images = 0

    # Record start time
    start_time = time.time()

    # Iterate through the validation dataset
    for example in tqdm(val_dataset, desc="Processing Images"):
        image_path = example["file_name"]

        prompt = []
        for caption in example["captions"]:
            if isinstance(caption, str):
                prompt.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take the first caption if there are multiple
                prompt.append(caption[0])
            else:
                raise ValueError(
                    "Caption column `captions` should contain either strings or lists of strings."
                )

        negative_prompt = None
        if args.negative_prompt:
            negative_prompt = [
                "low quality, bad quality, low contrast, black and white, bw, monochrome, grainy, blurry, historical, restored, desaturate"
            ]

        # Generate image
        ground_truth_image = load_image(image_path).resize((512, 512))
        control_image = load_image(image_path).convert("L").convert("RGB").resize((512, 512))
        image = pipe(prompt=prompt,
                     negative_prompt=negative_prompt,
                     num_inference_steps=args.num_inference_steps,
                     generator=generator,
                     image=control_image).images[0]

        # Apply color mapping
        image = apply_color(ground_truth_image, image)

        # Concatenate images into a row
        row_image = np.hstack((np.array(control_image), np.array(image), np.array(ground_truth_image)))
        row_image = Image.fromarray(row_image)

        # Save row image in the compare folder
        compare_output_path = os.path.join(compare_folder, f"{image_path.split('/')[-1]}")
        row_image.save(compare_output_path)

        # Save colorized image in the colorized folder
        colorized_output_path = os.path.join(colorized_folder, f"{image_path.split('/')[-1]}")
        image.save(colorized_output_path)

        # Increment processed images counter
        processed_images += 1

    # Record end time
    end_time = time.time()

    # Calculate total time taken
    total_time = end_time - start_time

    # Calculate FPS
    fps = processed_images / total_time

    print("All images processed.")
    print(f"Total time taken: {total_time:.2f} seconds")
    print(f"FPS: {fps:.2f}")

# Entry point of the script
if __name__ == "__main__":
    args = parse_args()
    main(args)
eval_controlnet_sdxl_light.sh
ADDED
@@ -0,0 +1,44 @@
# Define default values for parameters

# # sdxl light without negative prompt
# export BASE_MODEL="stabilityai/stable-diffusion-xl-base-1.0"
# export REPO="ByteDance/SDXL-Lightning"
# export INFERENCE_STEP=8
# export CKPT="sdxl_lightning_8step_unet.safetensors" # caution: the checkpoint's N-step must match INFERENCE_STEP
# export CONTROLNET_MODEL="sdxl_light_custom_caption_output/checkpoint-12500/controlnet"
# export DATASET="nickpai/coco2017-colorization"
# export DATASET_REVISION="custom-caption"
# export OUTPUT_DIR="sdxl_light_custom_caption_output/checkpoint-12500"

# accelerate launch eval_controlnet_sdxl_light.py \
#   --pretrained_model_name_or_path=$BASE_MODEL \
#   --repo=$REPO \
#   --ckpt=$CKPT \
#   --num_inference_steps=$INFERENCE_STEP \
#   --controlnet_model_name_or_path=$CONTROLNET_MODEL \
#   --dataset=$DATASET \
#   --dataset_revision=$DATASET_REVISION \
#   --mixed_precision="fp16" \
#   --output_dir=$OUTPUT_DIR

# sdxl light with negative prompt
export BASE_MODEL="stabilityai/stable-diffusion-xl-base-1.0"
export REPO="ByteDance/SDXL-Lightning"
export INFERENCE_STEP=8
export CKPT="sdxl_lightning_8step_unet.safetensors" # caution: the checkpoint's N-step must match INFERENCE_STEP
export CONTROLNET_MODEL="sdxl_light_caption_output/checkpoint-22500/controlnet"
export DATASET="nickpai/coco2017-colorization"
export DATASET_REVISION="custom-caption"
export OUTPUT_DIR="sdxl_light_caption_output/checkpoint-22500"

accelerate launch eval_controlnet_sdxl_light.py \
  --pretrained_model_name_or_path=$BASE_MODEL \
  --repo=$REPO \
  --ckpt=$CKPT \
  --num_inference_steps=$INFERENCE_STEP \
  --controlnet_model_name_or_path=$CONTROLNET_MODEL \
  --dataset=$DATASET \
  --dataset_revision=$DATASET_REVISION \
  --mixed_precision="fp16" \
  --output_dir=$OUTPUT_DIR \
  --negative_prompt
eval_controlnet_sdxl_light_single.py
ADDED
@@ -0,0 +1,390 @@
import os
import PIL
import time
import torch
import argparse

from typing import Optional, Union
from accelerate import Accelerator
from diffusers import (
    AutoencoderKL,
    StableDiffusionXLControlNetPipeline,
    ControlNetModel,
    UNet2DConditionModel,
)
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
)
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Define the function to parse arguments
def parse_args(input_args=None):
    parser = argparse.ArgumentParser(description="Simple example of a ControlNet evaluation script.")
    parser.add_argument(
        "--image_path",
        type=str,
        default="example/legacy_images/Hollywood-Sign.jpg",
        required=True,
        help="Path to the image",
    )
    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--pretrained_vae_model_name_or_path",
        type=str,
        default=None,
        help="Path to an improved VAE to stabilize training. For more details check out: https://github.com/huggingface/diffusers/pull/4038.",
    )
    parser.add_argument(
        "--controlnet_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained controlnet model.",
    )
    parser.add_argument(
        "--caption_model_name",
        type=str,
        default="blip-image-captioning-large",
        choices=["blip-image-captioning-large", "blip-image-captioning-base"],
        help="Name of the BLIP captioning model.",
    )
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default=None,
        choices=["no", "fp16", "bf16"],
        help=(
            "Whether to use mixed precision. Choose between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >="
            " 1.10 and an Nvidia Ampere GPU. Defaults to the value of the accelerate config of the current system or the"
            " flag passed with the `accelerate.launch` command. Use this argument to override the accelerate config."
        ),
    )
    parser.add_argument(
        "--variant",
        type=str,
        default=None,
        help="Variant of the model files of the pretrained model identifier from huggingface.co/models, e.g. fp16",
    )
    parser.add_argument(
        "--revision",
        type=str,
        default=None,
        required=False,
        help="Revision of pretrained model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--num_inference_steps",
        type=int,
        default=8,
        help="1-step, 2-step, 4-step, or 8-step distilled models",
    )
    parser.add_argument(
        "--repo",
        type=str,
        default="ByteDance/SDXL-Lightning",
        required=True,
        help="Repository from huggingface.co",
    )
    parser.add_argument(
        "--ckpt",
        type=str,
        default="sdxl_lightning_4step_unet.safetensors",
        required=True,
        help="Available checkpoints from the repository",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=123,
        help="Random seed",
    )
    parser.add_argument(
        "--positive_prompt",
        type=str,
        help="Text for positive prompt",
    )
    parser.add_argument(
        "--negative_prompt",
        type=str,
        default="low quality, bad quality, low contrast, black and white, bw, monochrome, grainy, blurry, historical, restored, desaturate",
        help="Text for negative prompt",
    )

    if input_args is not None:
        args = parser.parse_args(input_args)
    else:
        args = parser.parse_args()

    return args

def apply_color(image, color_map):
    # Convert input images to LAB color space
    image_lab = image.convert('LAB')
    color_map_lab = color_map.convert('LAB')

    # Split LAB channels
    l, a, b = image_lab.split()
    _, a_map, b_map = color_map_lab.split()

    # Merge LAB channels with color map
    merged_lab = PIL.Image.merge('LAB', (l, a_map, b_map))

    # Convert merged LAB image back to RGB color space
    result_rgb = merged_lab.convert('RGB')

    return result_rgb

def remove_unlikely_words(prompt: str) -> str:
    """
    Removes unlikely words from a prompt.

    Args:
        prompt: The text prompt to be cleaned.

    Returns:
        The cleaned prompt with unlikely words removed.
    """
    unlikely_words = []

    a1_list = [f'{i}s' for i in range(1900, 2000)]
    a2_list = [f'{i}' for i in range(1900, 2000)]
    a3_list = [f'year {i}' for i in range(1900, 2000)]
    a4_list = [f'circa {i}' for i in range(1900, 2000)]
    b1_list = [f"{year[0]} {year[1]} {year[2]} {year[3]} s" for year in a1_list]
    b2_list = [f"{year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]
    b3_list = [f"year {year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]
    b4_list = [f"circa {year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]

    words_list = [
        "black and white,", "black and white", "black & white,", "black & white", "circa",
        "balck and white,", "monochrome,", "black-and-white,", "black-and-white photography,",
        "black - and - white photography,", "monochrome bw,", "black white,", "black an white,",
        "grainy footage,", "grainy footage", "grainy photo,", "grainy photo", "b&w photo",
        "back and white", "back and white,", "monochrome contrast", "monochrome", "grainy",
        "grainy photograph,", "grainy photograph", "low contrast,", "low contrast", "b & w",
        "grainy black-and-white photo,", "bw", "bw,", "grainy black-and-white photo",
        "b & w,", "b&w,", "b&w!,", "b&w", "black - and - white,", "bw photo,", "grainy photo,",
        "black-and-white photo,", "black-and-white photo", "black - and - white photography",
        "b&w photo,", "monochromatic photo,", "grainy monochrome photo,", "monochromatic",
        "blurry photo,", "blurry,", "blurry photography,", "monochromatic photo",
        "black - and - white photograph,", "black - and - white photograph", "black on white,",
        "black on white", "black-and-white", "historical image,", "historical picture,",
        "historical photo,", "historical photograph,", "archival photo,", "taken in the early",
        "taken in the late", "taken in the", "historic photograph,", "restored,", "restored",
        "historical photo", "historical setting,",
        "historic photo,", "historic", "desaturated!!,", "desaturated!,", "desaturated,", "desaturated",
        "taken in", "shot on leica", "shot on leica sl2", "sl2",
        "taken with a leica camera", "taken with a leica camera", "leica sl2", "leica", "setting",
        "overcast day", "overcast weather", "slight overcast", "overcast",
        "picture taken in", "photo taken in",
        ", photo", ", photo", ", photo", ", photo", ", photograph",
        ",,", ",,,", ",,,,", " ,", " ,", " ,", " ,",
    ]

    unlikely_words.extend(a1_list)
    unlikely_words.extend(a2_list)
    unlikely_words.extend(a3_list)
    unlikely_words.extend(a4_list)
    unlikely_words.extend(b1_list)
    unlikely_words.extend(b2_list)
    unlikely_words.extend(b3_list)
    unlikely_words.extend(b4_list)
    unlikely_words.extend(words_list)

    for word in unlikely_words:
        prompt = prompt.replace(word, "")
    return prompt

def blip_image_captioning(image: PIL.Image.Image,
                          model_backbone: str,
                          weight_dtype: type,
                          device: str,
                          conditional: bool) -> str:
    # https://huggingface.co/Salesforce/blip-image-captioning-large
    # https://huggingface.co/Salesforce/blip-image-captioning-base
    if weight_dtype == torch.bfloat16:  # in case the model does not accept the bfloat16 data type
        weight_dtype = torch.float16

    processor = BlipProcessor.from_pretrained(f"Salesforce/{model_backbone}")
    model = BlipForConditionalGeneration.from_pretrained(
        f"Salesforce/{model_backbone}", torch_dtype=weight_dtype).to(device)

    valid_backbones = ["blip-image-captioning-large", "blip-image-captioning-base"]
    if model_backbone not in valid_backbones:
        raise ValueError(f"Invalid model backbone '{model_backbone}'. "
                         f"Valid options are: {', '.join(valid_backbones)}")

    if conditional:
        text = "a photography of"
        inputs = processor(image, text, return_tensors="pt").to(device, weight_dtype)
    else:
        inputs = processor(image, return_tensors="pt").to(device)
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

import matplotlib.pyplot as plt

def display_images(input_image, output_image, ground_truth):
    """
    Displays a grid of input, output, and ground truth images with a caption at the bottom.

    Args:
        input_image: A grayscale image as a NumPy array.
        output_image: A grayscale image (result) as a NumPy array.
        ground_truth: A grayscale image (ground truth) as a NumPy array.
    """
    fig, axes = plt.subplots(1, 3, figsize=(20, 8))

    axes[0].imshow(input_image, cmap='gray')
    axes[0].set_title('Input')
    axes[0].axis('off')

    axes[1].imshow(output_image)
    axes[1].set_title('Output')
    axes[1].axis('off')

    axes[2].imshow(ground_truth)
    axes[2].set_title('Ground Truth')
    axes[2].axis('off')

    plt.tight_layout()
    plt.show()

# Define a function to process the image with the loaded model
def process_image(image_path: str,
                  controlnet_model_name_or_path: str,
                  caption_model_name: str,
                  positive_prompt: Optional[str],
                  negative_prompt: Optional[str],
                  seed: int,
                  num_inference_steps: int,
                  mixed_precision: str,
                  pretrained_model_name_or_path: str,
                  pretrained_vae_model_name_or_path: Optional[str],
                  revision: Optional[str],
                  variant: Optional[str],
                  repo: str,
                  ckpt: str) -> PIL.Image.Image:
    # Seed
    generator = torch.manual_seed(seed)

    # Accelerator setting
    accelerator = Accelerator(
        mixed_precision=mixed_precision,
    )

    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    vae_path = (
        pretrained_model_name_or_path
        if pretrained_vae_model_name_or_path is None
        else pretrained_vae_model_name_or_path
    )
    vae = AutoencoderKL.from_pretrained(
        vae_path,
        subfolder="vae" if pretrained_vae_model_name_or_path is None else None,
        revision=revision,
        variant=variant,
    )
    unet = UNet2DConditionModel.from_config(
        pretrained_model_name_or_path,
        subfolder="unet",
        revision=revision,
        variant=variant,
    )
    unet.load_state_dict(load_file(hf_hub_download(repo, ckpt)))

    # Move vae, unet and text_encoder to device and cast to weight_dtype
    # The VAE is kept in float32 to avoid NaN losses.
    if pretrained_vae_model_name_or_path is not None:
        vae.to(accelerator.device, dtype=weight_dtype)
    else:
        vae.to(accelerator.device, dtype=torch.float32)
    unet.to(accelerator.device, dtype=weight_dtype)

    controlnet = ControlNetModel.from_pretrained(controlnet_model_name_or_path, torch_dtype=weight_dtype)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        pretrained_model_name_or_path,
        vae=vae,
        unet=unet,
        controlnet=controlnet,
    )
    pipe.to(accelerator.device, dtype=weight_dtype)

    image = PIL.Image.open(image_path)

    # Prepare everything with our `accelerator`.
    pipe, image = accelerator.prepare(pipe, image)
    pipe.safety_checker = None

    # Convert image into grayscale
    original_size = image.size
    control_image = image.convert("L").convert("RGB").resize((512, 512))

    # Image captioning
    if caption_model_name in ("blip-image-captioning-large", "blip-image-captioning-base"):
        caption = blip_image_captioning(control_image, caption_model_name,
                                        weight_dtype, accelerator.device, conditional=True)
    # elif caption_model_name == "ViT-L-14/openai" or "ViT-H-14/laion2b_s32b_b79k":
    #     caption = clip_image_captioning(control_image, caption_model_name, accelerator.device)
    # elif caption_model_name == "vit-gpt2-image-captioning":
    #     caption = vit_gpt2_image_captioning(control_image, accelerator.device)
    caption = remove_unlikely_words(caption)

    print("================================================================")
    print(f"Positive prompt: \n>>> {positive_prompt}")
    print(f"Negative prompt: \n>>> {negative_prompt}")
    print(f"Caption results: \n>>> {caption}")
    print("================================================================")

    # Combine positive prompt and captioning result
    prompt = [positive_prompt + ", " + caption]

    # Image colorization
    image = pipe(prompt=prompt,
                 negative_prompt=negative_prompt,
                 num_inference_steps=num_inference_steps,
                 generator=generator,
                 image=control_image).images[0]

    # Apply color mapping
    result_image = apply_color(control_image, image)
    result_image = result_image.resize(original_size)
    return result_image, caption

def main(args):
    output_image, output_caption = process_image(image_path=args.image_path,
                                                 controlnet_model_name_or_path=args.controlnet_model_name_or_path,
                                                 caption_model_name=args.caption_model_name,
                                                 positive_prompt=args.positive_prompt,
                                                 negative_prompt=args.negative_prompt,
                                                 seed=args.seed,
                                                 num_inference_steps=args.num_inference_steps,
                                                 mixed_precision=args.mixed_precision,
                                                 pretrained_model_name_or_path=args.pretrained_model_name_or_path,
                                                 pretrained_vae_model_name_or_path=args.pretrained_vae_model_name_or_path,
                                                 revision=args.revision,
                                                 variant=args.variant,
                                                 repo=args.repo,
                                                 ckpt=args.ckpt)
    input_image = PIL.Image.open(args.image_path)
    display_images(input_image.convert("L"), output_image, input_image)
    return output_image, output_caption

# Entry point of the script
if __name__ == "__main__":
    args = parse_args()
    main(args)
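As a quick illustration of the caption-cleaning step above, the sketch below runs `remove_unlikely_words` on a BLIP-style caption. The input string is a made-up example, and the import assumes the script's directory is on the Python path.

```python
# Hedged sketch: see what the caption filter strips from a typical BLIP-style caption.
from eval_controlnet_sdxl_light_single import remove_unlikely_words

caption = "a photography of a black and white photo of a dog, grainy photo, circa 1950"
print(remove_unlikely_words(caption))
# The year, "circa", "black and white", and "grainy photo" fragments are removed;
# leftover spaces and commas are not collapsed.
```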
eval_controlnet_sdxl_light_single.sh
ADDED
@@ -0,0 +1,20 @@
# sdxl light for single image
export BASE_MODEL="stabilityai/stable-diffusion-xl-base-1.0"
export REPO="ByteDance/SDXL-Lightning"
export INFERENCE_STEP=8
export CKPT="sdxl_lightning_8step_unet.safetensors" # caution: the checkpoint's N-step must match INFERENCE_STEP
export CONTROLNET_MODEL="sdxl_light_caption_output/checkpoint-30000/controlnet"
export CAPTION_MODEL="blip-image-captioning-large"
export IMAGE_PATH="example/legacy_images/Hollywood-Sign.jpg"
# export POSITIVE_PROMPT="blue shirt"

accelerate launch eval_controlnet_sdxl_light_single.py \
  --pretrained_model_name_or_path=$BASE_MODEL \
  --repo=$REPO \
  --ckpt=$CKPT \
  --num_inference_steps=$INFERENCE_STEP \
  --controlnet_model_name_or_path=$CONTROLNET_MODEL \
  --caption_model_name=$CAPTION_MODEL \
  --mixed_precision="fp16" \
  --image_path=$IMAGE_PATH \
  --positive_prompt="red car"
example/UUColor_results/Hollywood-Sign.jpeg
ADDED
example/legacy_images/Big-Ben-vintage.jpg
ADDED
example/legacy_images/Central-Park.jpg
ADDED
example/legacy_images/Hollywood-Sign.jpg
ADDED
example/legacy_images/Little-Mermaid.jpg
ADDED
example/legacy_images/Migrant-Mother.jpg
ADDED
example/legacy_images/Mount-Everest.jpg
ADDED
example/legacy_images/Tower-of-Pisa.jpg
ADDED
example/legacy_images/Wasatch-Mountains-Summit-County-Utah.jpg
ADDED
gradio_ui.py
ADDED
@@ -0,0 +1,356 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import PIL
|
2 |
+
import torch
|
3 |
+
import subprocess
|
4 |
+
import gradio as gr
|
5 |
+
|
6 |
+
from typing import Optional
|
7 |
+
from accelerate import Accelerator
|
8 |
+
from diffusers import (
|
9 |
+
AutoencoderKL,
|
10 |
+
StableDiffusionXLControlNetPipeline,
|
11 |
+
ControlNetModel,
|
12 |
+
UNet2DConditionModel,
|
13 |
+
)
|
14 |
+
from transformers import (
|
15 |
+
BlipProcessor, BlipForConditionalGeneration,
|
16 |
+
VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
|
17 |
+
)
|
18 |
+
from huggingface_hub import hf_hub_download
|
19 |
+
from safetensors.torch import load_file
|
20 |
+
from clip_interrogator import Interrogator, Config, list_clip_models
|
21 |
+
|
22 |
+
def apply_color(image: PIL.Image.Image, color_map: PIL.Image.Image) -> PIL.Image.Image:
|
23 |
+
# Convert input images to LAB color space
|
24 |
+
image_lab = image.convert('LAB')
|
25 |
+
color_map_lab = color_map.convert('LAB')
|
26 |
+
|
27 |
+
# Split LAB channels
|
28 |
+
l, a , b = image_lab.split()
|
29 |
+
_, a_map, b_map = color_map_lab.split()
|
30 |
+
|
31 |
+
# Merge LAB channels with color map
|
32 |
+
merged_lab = PIL.Image.merge('LAB', (l, a_map, b_map))
|
33 |
+
|
34 |
+
# Convert merged LAB image back to RGB color space
|
35 |
+
result_rgb = merged_lab.convert('RGB')
|
36 |
+
return result_rgb
|
37 |
+
|
38 |
+
def remove_unlikely_words(prompt: str) -> str:
    """
    Removes unlikely words from a prompt.

    "Unlikely" words are grayscale- and era-related phrases (e.g. "black and white",
    "monochrome", decades such as "1950s") that caption models tend to produce for
    legacy photos and that would bias the colorization toward desaturated results.

    Args:
        prompt: The text prompt to be cleaned.

    Returns:
        The cleaned prompt with unlikely words removed.
    """
    unlikely_words = []

    a1_list = [f'{i}s' for i in range(1900, 2000)]
    a2_list = [f'{i}' for i in range(1900, 2000)]
    a3_list = [f'year {i}' for i in range(1900, 2000)]
    a4_list = [f'circa {i}' for i in range(1900, 2000)]
    b1_list = [f"{year[0]} {year[1]} {year[2]} {year[3]} s" for year in a1_list]
    b2_list = [f"{year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]
    b3_list = [f"year {year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]
    b4_list = [f"circa {year[0]} {year[1]} {year[2]} {year[3]}" for year in a1_list]

    words_list = [
        "black and white,", "black and white", "black & white,", "black & white", "circa",
        "balck and white,", "monochrome,", "black-and-white,", "black-and-white photography,",
        "black - and - white photography,", "monochrome bw,", "black white,", "black an white,",
        "grainy footage,", "grainy footage", "grainy photo,", "grainy photo", "b&w photo",
        "back and white", "back and white,", "monochrome contrast", "monochrome", "grainy",
        "grainy photograph,", "grainy photograph", "low contrast,", "low contrast", "b & w",
        "grainy black-and-white photo,", "bw", "bw,", "grainy black-and-white photo",
        "b & w,", "b&w,", "b&w!,", "b&w", "black - and - white,", "bw photo,", "grainy photo,",
        "black-and-white photo,", "black-and-white photo", "black - and - white photography",
        "b&w photo,", "monochromatic photo,", "grainy monochrome photo,", "monochromatic",
        "blurry photo,", "blurry,", "blurry photography,", "monochromatic photo",
        "black - and - white photograph,", "black - and - white photograph", "black on white,",
        "black on white", "black-and-white", "historical image,", "historical picture,",
        "historical photo,", "historical photograph,", "archival photo,", "taken in the early",
        "taken in the late", "taken in the", "historic photograph,", "restored,", "restored",
        "historical photo", "historical setting,",
        "historic photo,", "historic", "desaturated!!,", "desaturated!,", "desaturated,", "desaturated",
        "taken in", "shot on leica", "shot on leica sl2", "sl2",
        "taken with a leica camera", "taken with a leica camera", "leica sl2", "leica", "setting",
        "overcast day", "overcast weather", "slight overcast", "overcast",
        "picture taken in", "photo taken in",
        ", photo", ", photo", ", photo", ", photo", ", photograph",
        ",,", ",,,", ",,,,", " ,", " ,", " ,", " ,",
    ]

    unlikely_words.extend(a1_list)
    unlikely_words.extend(a2_list)
    unlikely_words.extend(a3_list)
    unlikely_words.extend(a4_list)
    unlikely_words.extend(b1_list)
    unlikely_words.extend(b2_list)
    unlikely_words.extend(b3_list)
    unlikely_words.extend(b4_list)
    unlikely_words.extend(words_list)

    for word in unlikely_words:
        prompt = prompt.replace(word, "")
    return prompt

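# Caption the (grayscale) input with a BLIP captioning model; the cleaned caption is later
# appended to the user's positive prompt to guide colorization.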
def blip_image_captioning(image: PIL.Image.Image,
                          model_backbone: str,
                          weight_dtype: torch.dtype,
                          device: str,
                          conditional: bool) -> str:
    # https://huggingface.co/Salesforce/blip-image-captioning-large
    # https://huggingface.co/Salesforce/blip-image-captioning-base
    valid_backbones = ["blip-image-captioning-large", "blip-image-captioning-base"]
    if model_backbone not in valid_backbones:
        raise ValueError(f"Invalid model backbone '{model_backbone}'. "
                         f"Valid options are: {', '.join(valid_backbones)}")

    if weight_dtype == torch.bfloat16:  # the BLIP checkpoints may not accept bfloat16 inputs
        weight_dtype = torch.float16

    processor = BlipProcessor.from_pretrained(f"Salesforce/{model_backbone}")
    model = BlipForConditionalGeneration.from_pretrained(
        f"Salesforce/{model_backbone}", torch_dtype=weight_dtype).to(device)

    if conditional:
        text = "a photography of"
        inputs = processor(image, text, return_tensors="pt").to(device, weight_dtype)
    else:
        inputs = processor(image, return_tensors="pt").to(device, weight_dtype)
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

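# Alternative captioning backends (ViT-GPT2 and CLIP Interrogator) are kept below for
# reference but are currently disabled; only the BLIP models are exposed in the UI.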
# def vit_gpt2_image_captioning(image: PIL.Image.Image, device: str) -> str:
#     # https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
#     model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)
#     feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
#     tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

#     max_length = 16
#     num_beams = 4
#     gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

#     pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
#     pixel_values = pixel_values.to(device)

#     output_ids = model.generate(pixel_values, **gen_kwargs)

#     preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
#     caption = [pred.strip() for pred in preds]

#     return caption[0]

# def clip_image_captioning(image: PIL.Image.Image,
#                           clip_model_name: str,
#                           device: str) -> str:
#     # validate clip model name
#     models = list_clip_models()
#     if clip_model_name not in models:
#         raise ValueError(f"Could not find CLIP model {clip_model_name}! \
#                          Available models: {models}")
#     config = Config(device=device, clip_model_name=clip_model_name)
#     config.apply_low_vram_defaults()
#     ci = Interrogator(config)
#     caption = ci.interrogate(image)
#     return caption

# Define a function to process the image with the loaded model
def process_image(image_path: str,
                  controlnet_model_name_or_path: str,
                  caption_model_name: str,
                  positive_prompt: Optional[str],
                  negative_prompt: Optional[str],
                  seed: int,
                  num_inference_steps: int,
                  mixed_precision: str,
                  pretrained_model_name_or_path: str,
                  pretrained_vae_model_name_or_path: Optional[str],
                  revision: Optional[str],
                  variant: Optional[str],
                  repo: str,
                  ckpt: str,
                  ) -> tuple[PIL.Image.Image, str]:
    # Seed
    generator = torch.manual_seed(seed)

    # Accelerator Setting
    accelerator = Accelerator(
        mixed_precision=mixed_precision,
    )

    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    vae_path = (
        pretrained_model_name_or_path
        if pretrained_vae_model_name_or_path is None
        else pretrained_vae_model_name_or_path
    )
    vae = AutoencoderKL.from_pretrained(
        vae_path,
        subfolder="vae" if pretrained_vae_model_name_or_path is None else None,
        revision=revision,
        variant=variant,
    )
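    # SDXL-Lightning: instantiate the UNet from the base SDXL config, then load the
    # distilled N-step weights (e.g. sdxl_lightning_8step_unet.safetensors) from the
    # ByteDance/SDXL-Lightning repository.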
    unet = UNet2DConditionModel.from_config(
        pretrained_model_name_or_path,
        subfolder="unet",
        revision=revision,
        variant=variant,
    )
    unet.load_state_dict(load_file(hf_hub_download(repo, ckpt)))

    # Move vae and unet to device and cast to weight_dtype.
    # The base model's VAE is kept in float32 to avoid NaN losses.
    if pretrained_vae_model_name_or_path is not None:
        vae.to(accelerator.device, dtype=weight_dtype)
    else:
        vae.to(accelerator.device, dtype=torch.float32)
    unet.to(accelerator.device, dtype=weight_dtype)

    controlnet = ControlNetModel.from_pretrained(controlnet_model_name_or_path, torch_dtype=weight_dtype)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        pretrained_model_name_or_path,
        vae=vae,
        unet=unet,
        controlnet=controlnet,
    )
    pipe.to(accelerator.device, dtype=weight_dtype)

    image = PIL.Image.open(image_path)

    # Prepare everything with our `accelerator`.
    pipe, image = accelerator.prepare(pipe, image)
    pipe.safety_checker = None

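    # The ControlNet conditioning image is the input converted to grayscale, replicated to
    # three channels, and resized to 512x512; the original resolution is restored at the end.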
    # Convert image into grayscale
    original_size = image.size
    control_image = image.convert("L").convert("RGB").resize((512, 512))

    # Image captioning
    if caption_model_name in ("blip-image-captioning-large", "blip-image-captioning-base"):
        caption = blip_image_captioning(control_image, caption_model_name,
                                        weight_dtype, accelerator.device, conditional=True)
    # elif caption_model_name in ("ViT-L-14/openai", "ViT-H-14/laion2b_s32b_b79k"):
    #     caption = clip_image_captioning(control_image, caption_model_name, accelerator.device)
    # elif caption_model_name == "vit-gpt2-image-captioning":
    #     caption = vit_gpt2_image_captioning(control_image, accelerator.device)
    caption = remove_unlikely_words(caption)

    # Combine positive prompt and captioning result
    prompt = [positive_prompt + ", " + caption]

    # Image colorization
    image = pipe(prompt=prompt,
                 negative_prompt=negative_prompt,
                 num_inference_steps=num_inference_steps,
                 generator=generator,
                 image=control_image).images[0]

    # Apply color mapping
    result_image = apply_color(control_image, image)
    result_image = result_image.resize(original_size)
    return result_image, caption

# Define the image gallery based on folder path
def get_image_paths(folder_path):
    import os
    image_paths = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            image_paths.append([os.path.join(folder_path, filename)])
    return image_paths

# Create the Gradio interface
def create_interface():
    controlnet_model_dict = {
        "sdxl-light-caption-30000": "sdxl_light_caption_output/checkpoint-30000/controlnet",
        "sdxl-light-custom-caption-30000": "sdxl_light_custom_caption_output/checkpoint-30000/controlnet",
    }
    images = get_image_paths("example/legacy_images")  # Replace with your folder path

    interface = gr.Interface(
        fn=process_image,
        inputs=[
            gr.Image(label="Upload image",
                     value="example/legacy_images/Hollywood-Sign.jpg",
                     type='filepath'),
            gr.Dropdown(choices=[controlnet_model_dict[key] for key in controlnet_model_dict],
                        value=controlnet_model_dict["sdxl-light-caption-30000"],
                        label="Select ControlNet Model"),
            gr.Dropdown(choices=["blip-image-captioning-large",
                                 "blip-image-captioning-base"],
                        value="blip-image-captioning-large",
                        label="Select Image Captioning Model"),
            gr.Textbox(label="Positive Prompt", placeholder="Text for positive prompt"),
            gr.Textbox(value="low quality, bad quality, low contrast, black and white, bw, monochrome, grainy, blurry, historical, restored, desaturate",
                       label="Negative Prompt", placeholder="Text for negative prompt"),
        ],
        outputs=[
            gr.Image(label="Colorized image",
                     value="example/UUColor_results/Hollywood-Sign.jpeg",
                     format="jpeg"),
            gr.Textbox(label="Captioning Result", show_copy_button=True)
        ],
        examples=images,
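        # These additional inputs are passed to process_image after the main inputs, in
        # order: seed, num_inference_steps, mixed_precision, pretrained_model_name_or_path,
        # pretrained_vae_model_name_or_path, revision, variant, repo, ckpt.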
        additional_inputs=[
            # gr.Radio(choices=["Original", "Square"], value="Original",
            #          label="Output resolution"),
            # gr.Slider(minimum=128, maximum=512, value=256, step=128,
            #           label="Height & Width",
            #           info='Only takes effect if "Square" output resolution is selected'),
            gr.Slider(0, 1000, 123, label="Seed"),
            gr.Radio(choices=[1, 2, 4, 8],
                     value=8,
                     label="Inference Steps",
                     info="1-step, 2-step, 4-step, or 8-step distilled models"),
            gr.Radio(choices=["no", "fp16", "bf16"],
                     value="fp16",
                     label="Mixed Precision",
                     info="Whether to use mixed precision. Choose between fp16 and bf16 (bfloat16)."),
            gr.Dropdown(choices=["stabilityai/stable-diffusion-xl-base-1.0"],
                        value="stabilityai/stable-diffusion-xl-base-1.0",
                        label="Base Model",
                        info="Path to pretrained model or model identifier from huggingface.co/models."),
            gr.Dropdown(choices=["None"],
                        value=None,
                        label="VAE Model",
                        info="Path to an improved VAE to stabilize training. For more details check out: https://github.com/huggingface/diffusers/pull/4038."),
            gr.Dropdown(choices=["None"],
                        value=None,
                        label="Revision",
                        info="Revision of the pretrained model identifier from huggingface.co/models."),
            gr.Dropdown(choices=["None"],
                        value=None,
                        label="Variant",
                        info="Variant of the model files of the pretrained model identifier from huggingface.co/models, e.g. fp16."),
            gr.Dropdown(choices=["ByteDance/SDXL-Lightning"],
                        value="ByteDance/SDXL-Lightning",
                        label="Repository",
                        info="Repository from huggingface.co"),
            gr.Dropdown(choices=["sdxl_lightning_1step_unet.safetensors",
                                 "sdxl_lightning_2step_unet.safetensors",
                                 "sdxl_lightning_4step_unet.safetensors",
                                 "sdxl_lightning_8step_unet.safetensors"],
                        value="sdxl_lightning_8step_unet.safetensors",
                        label="Checkpoint",
                        info="Available checkpoints from the repository. Caution: the checkpoint's N-step variant must match the selected number of Inference Steps."),
        ],
        title="Text-Guided Image Colorization",
        description="Upload an image and select a model to colorize it."
    )
    return interface

def main():
    # Launch the Gradio interface
    interface = create_interface()
    interface.launch()

if __name__ == "__main__":
    main()
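# To try the interface locally, run this script (gradio_ui.py) with Python and open the
# local URL that Gradio prints, e.g.: python gradio_ui.py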
images/000000022935_gray.jpg
ADDED
images/000000022935_green_shirt_on_right_girl.jpeg
ADDED
images/000000022935_purple_shirt_on_right_girl.jpeg
ADDED
images/000000022935_red_shirt_on_right_girl.jpeg
ADDED
images/000000025560_color.jpg
ADDED
images/000000025560_gray.jpg
ADDED
images/000000025560_gt.jpg
ADDED
images/000000041633_black_car.jpeg
ADDED
images/000000041633_bright_red_car.jpeg
ADDED
images/000000041633_dark_blue_car.jpeg
ADDED
images/000000041633_gray.jpg
ADDED
images/000000065736_color.jpg
ADDED
images/000000065736_gray.jpg
ADDED
images/000000065736_gt.jpg
ADDED
images/000000091779_color.jpg
ADDED
images/000000091779_gray.jpg
ADDED
images/000000091779_gt.jpg
ADDED
images/000000092177_color.jpg
ADDED
images/000000092177_gray.jpg
ADDED
images/000000092177_gt.jpg
ADDED
images/000000166426_color.jpg
ADDED
images/000000166426_gray.jpg
ADDED
images/000000166426_gt.jpg
ADDED
images/000000286708_gray.jpg
ADDED
images/000000286708_orange_hat.jpeg
ADDED
images/000000286708_pink_hat.jpeg
ADDED
images/000000286708_yellow_hat.jpeg
ADDED