rsanjaykamath committed on
Commit
7fc7f3d
1 Parent(s): eb43f71
This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
Files changed (50)
  1. .DS_Store +0 -0
  2. .idea/.gitignore +3 -0
  3. LICENSE.txt +12 -0
  4. README 2.md +46 -0
  5. README.md +40 -6
  6. __pycache__/run_code.cpython-38.pyc +0 -0
  7. app.py +232 -0
  8. app_run.ipynb +400 -0
  9. configs/caption_coco.yaml +33 -0
  10. configs/med_config.json +21 -0
  11. configs/nlvr.yaml +21 -0
  12. configs/nocaps.yaml +15 -0
  13. configs/pretrain.yaml +27 -0
  14. configs/retrieval_coco.yaml +34 -0
  15. configs/retrieval_flickr.yaml +34 -0
  16. configs/vqa.yaml +25 -0
  17. data/__init__.py +101 -0
  18. data/coco_karpathy_dataset.py +126 -0
  19. data/flickr30k_dataset.py +93 -0
  20. data/nlvr_dataset.py +78 -0
  21. data/nocaps_dataset.py +32 -0
  22. data/pretrain_dataset.py +59 -0
  23. data/utils.py +112 -0
  24. data/vqa_dataset.py +88 -0
  25. elephant.jpg +0 -0
  26. eval_nocaps.py +118 -0
  27. examples/ex1.jpg +0 -0
  28. examples/ex2.jpg +0 -0
  29. examples/ex3.jpg +0 -0
  30. extras/.DS_Store +0 -0
  31. extras/sample-images/0.JPG +0 -0
  32. extras/sample-images/1.JPG +0 -0
  33. extras/sample-images/10.jpg +0 -0
  34. extras/sample-images/2.jpg +0 -0
  35. extras/sample-images/3.jpg +0 -0
  36. extras/sample-images/4.jpg +0 -0
  37. extras/sample-images/5.jpg +0 -0
  38. extras/sample-images/6.JPG +0 -0
  39. extras/sample-images/7.JPG +0 -0
  40. extras/sample-images/8.jpg +0 -0
  41. extras/sample-images/9.jpg +0 -0
  42. foo.png +0 -0
  43. gradio_cached_examples/log.csv +2 -0
  44. local_run.ipynb +347 -0
  45. model-data/.DS_Store +0 -0
  46. model-data/weights/pictor-ppe-v302-a1-yolo-v3-weights.h5 +3 -0
  47. model-data/weights/pictor-ppe-v302-a2-yolo-v3-weights.h5 +3 -0
  48. model-data/weights/pictor-ppe-v302-a3-yolo-v3-weights.h5 +3 -0
  49. model-data/weights/readme.md +1 -0
  50. modelsn/__init__.py +0 -0
.DS_Store ADDED
Binary file (10.2 kB).
 
.idea/.gitignore ADDED
@@ -0,0 +1,3 @@
+
+ # Default ignored files
+ /workspace.xml
LICENSE.txt ADDED
@@ -0,0 +1,12 @@
+ Copyright (c) 2022, Salesforce.com, Inc.
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+ * Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README 2.md ADDED
@@ -0,0 +1,46 @@
+ ---
+ title: PPE_Detection
+ emoji: 💩
+ colorFrom: pink
+ colorTo: indigo
+ sdk: gradio
+ app_file: app.py
+ pinned: false
+ license: other
+ ---
+
+ # Configuration
+
+ `title`: _string_
+ Display title for the Space
+
+ `emoji`: _string_
+ Space emoji (emoji-only character allowed)
+
+ `colorFrom`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `colorTo`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `sdk`: _string_
+ Can be either `gradio`, `streamlit`, or `static`
+
+ `sdk_version` : _string_
+ Only applicable for `streamlit` SDK.
+ See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.
+
+ `app_file`: _string_
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code, or `static` html code).
+ Path is relative to the root of the repository.
+
+ `models`: _List[string]_
+ HF model IDs (like "gpt2" or "deepset/roberta-base-squad2") used in the Space.
+ Will be parsed automatically from your code if not specified here.
+
+ `datasets`: _List[string]_
+ HF dataset IDs (like "common_voice" or "oscar-corpus/OSCAR-2109") used in the Space.
+ Will be parsed automatically from your code if not specified here.
+
+ `pinned`: _boolean_
+ Whether the Space stays on top of your list.
README.md CHANGED
@@ -1,12 +1,46 @@
  ---
- title: Safeworld_Captioning_Spaces
- emoji: 📚
- colorFrom: indigo
- colorTo: purple
+ title: BLIP
+ emoji: 🦀
+ colorFrom: red
+ colorTo: blue
  sdk: gradio
  app_file: app.py
  pinned: false
- license: other
+ license: bsd-3-clause
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Configuration
+
+ `title`: _string_
+ Display title for the Space
+
+ `emoji`: _string_
+ Space emoji (emoji-only character allowed)
+
+ `colorFrom`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `colorTo`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `sdk`: _string_
+ Can be either `gradio`, `streamlit`, or `static`
+
+ `sdk_version` : _string_
+ Only applicable for `streamlit` SDK.
+ See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.
+
+ `app_file`: _string_
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code, or `static` html code).
+ Path is relative to the root of the repository.
+
+ `models`: _List[string]_
+ HF model IDs (like "gpt2" or "deepset/roberta-base-squad2") used in the Space.
+ Will be parsed automatically from your code if not specified here.
+
+ `datasets`: _List[string]_
+ HF dataset IDs (like "common_voice" or "oscar-corpus/OSCAR-2109") used in the Space.
+ Will be parsed automatically from your code if not specified here.
+
+ `pinned`: _boolean_
+ Whether the Space stays on top of your list.
__pycache__/run_code.cpython-38.pyc ADDED
Binary file (3.18 kB).
 
app.py ADDED
@@ -0,0 +1,232 @@
+ import os
+
+ os.system(
+     "wget https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg/1920px-Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg -O starry.jpg")
+
+ from PIL import Image
+ import requests
+ import torch
+ from torchvision import transforms
+ from torchvision.transforms.functional import InterpolationMode
+
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # MDETR Code
+ import torchvision.transforms as T
+ import matplotlib.pyplot as plt
+ from collections import defaultdict
+ import torch.nn.functional as F
+ import numpy as np
+ from skimage.measure import find_contours
+
+ from matplotlib import patches, lines
+ from matplotlib.patches import Polygon
+ import gradio as gr
+
+ torch.hub.download_url_to_file('https://cdn.pixabay.com/photo/2014/03/04/15/10/elephants-279505_1280.jpg',
+                                'elephant.jpg')
+
+ model2, postprocessor = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5', pretrained=True,
+                                        return_postprocessor=True)
+ model2 = model2.cpu()
+ model2.eval()
+
+ torch.set_grad_enabled(False);
+ # standard PyTorch mean-std input image normalization
+ transform = T.Compose([
+     T.Resize(800),
+     T.ToTensor(),
+     T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+ ])
+
+
+ # for output bounding box post-processing
+ def box_cxcywh_to_xyxy(x):
+     x_c, y_c, w, h = x.unbind(1)
+     b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
+          (x_c + 0.5 * w), (y_c + 0.5 * h)]
+     return torch.stack(b, dim=1)
+
+
+ def rescale_bboxes(out_bbox, size):
+     img_w, img_h = size
+     b = box_cxcywh_to_xyxy(out_bbox)
+     b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
+     return b
+
+
+ # colors for visualization
+ COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
+           [0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]
+
+
+ def apply_mask(image, mask, color, alpha=0.5):
+     """Apply the given mask to the image.
+     """
+     for c in range(3):
+         image[:, :, c] = np.where(mask == 1,
+                                   image[:, :, c] *
+                                   (1 - alpha) + alpha * color[c] * 255,
+                                   image[:, :, c])
+     return image
+
+
+ def plot_results(pil_img, scores, boxes, labels, masks=None):
+     plt.figure(figsize=(16, 10))
+     np_image = np.array(pil_img)
+     ax = plt.gca()
+     colors = COLORS * 100
+     if masks is None:
+         masks = [None for _ in range(len(scores))]
+     assert len(scores) == len(boxes) == len(labels) == len(masks)
+     for s, (xmin, ymin, xmax, ymax), l, mask, c in zip(scores, boxes.tolist(), labels, masks, colors):
+         ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
+                                    fill=False, color=c, linewidth=3))
+         text = f'{l}: {s:0.2f}'
+         ax.text(xmin, ymin, text, fontsize=15, bbox=dict(facecolor='white', alpha=0.8))
+
+         if mask is None:
+             continue
+         np_image = apply_mask(np_image, mask, c)
+
+         padded_mask = np.zeros((mask.shape[0] + 2, mask.shape[1] + 2), dtype=np.uint8)
+         padded_mask[1:-1, 1:-1] = mask
+         contours = find_contours(padded_mask, 0.5)
+         for verts in contours:
+             # Subtract the padding and flip (y, x) to (x, y)
+             verts = np.fliplr(verts) - 1
+             p = Polygon(verts, facecolor="none", edgecolor=c)
+             ax.add_patch(p)
+
+     plt.imshow(np_image)
+     plt.axis('off')
+     plt.savefig('foo.png', bbox_inches='tight')
+     return 'foo.png'
+
+
+ def add_res(results, ax, color='green'):
+     # for tt in results.values():
+     if True:
+         bboxes = results['boxes']
+         labels = results['labels']
+         scores = results['scores']
+         # keep = scores >= 0.0
+         # bboxes = bboxes[keep].tolist()
+         # labels = labels[keep].tolist()
+         # scores = scores[keep].tolist()
+         # print(torchvision.ops.box_iou(tt['boxes'].cpu().detach(), torch.as_tensor([[xmin, ymin, xmax, ymax]])))
+
+     colors = ['purple', 'yellow', 'red', 'green', 'orange', 'pink']
+
+     for i, (b, ll, ss) in enumerate(zip(bboxes, labels, scores)):
+         ax.add_patch(plt.Rectangle((b[0], b[1]), b[2] - b[0], b[3] - b[1], fill=False, color=colors[i], linewidth=3))
+         cls_name = ll if isinstance(ll, str) else CLASSES[ll]
+         text = f'{cls_name}: {ss:.2f}'
+         print(text)
+         ax.text(b[0], b[1], text, fontsize=15, bbox=dict(facecolor='white', alpha=0.8))
+
+
+ def plot_inference(im, caption, approaches):
+     choices = {"Worker Helmet Separately": 1, "Worker Helmet Vest": 2, "Workers only": 3}
+
+     # mean-std normalize the input image (batch-size: 1)
+     img = transform(im).unsqueeze(0).cpu()
+
+     # propagate through the model
+     memory_cache = model2(img, [caption], encode_and_save=True)
+     outputs = model2(img, [caption], encode_and_save=False, memory_cache=memory_cache)
+
+     # keep only predictions with 0.7+ confidence
+     probas = 1 - outputs['pred_logits'].softmax(-1)[0, :, -1].cpu()
+     keep = (probas > 0.7).cpu()
+
+     # convert boxes from [0; 1] to image scales
+     bboxes_scaled = rescale_bboxes(outputs['pred_boxes'].cpu()[0, keep], im.size)
+
+     # Extract the text spans predicted by each box
+     positive_tokens = (outputs["pred_logits"].cpu()[0, keep].softmax(-1) > 0.1).nonzero().tolist()
+     predicted_spans = defaultdict(str)
+     for tok in positive_tokens:
+         item, pos = tok
+         if pos < 255:
+             span = memory_cache["tokenized"].token_to_chars(0, pos)
+             predicted_spans[item] += " " + caption[span.start:span.end]
+
+     labels = [predicted_spans[k] for k in sorted(list(predicted_spans.keys()))]
+     caption = 'Caption: ' + caption
+     return (sepia_call(caption, im, plot_results(im, probas[keep], bboxes_scaled, labels), choices[approaches]))
+
+
+ # BLIP Code
+
+
+ from modelsn.blip import blip_decoder
+
+ image_size = 384
+ transform = transforms.Compose([
+     transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
+     transforms.ToTensor(),
+     transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
+ ])
+
+ model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth'
+
+ model = blip_decoder(pretrained=model_url, image_size=384, vit='base')
+ model.eval()
+ model = model.to(device)
+
+ from modelsn.blip_vqa import blip_vqa
+
+ image_size_vq = 480
+ transform_vq = transforms.Compose([
+     transforms.Resize((image_size_vq, image_size_vq), interpolation=InterpolationMode.BICUBIC),
+     transforms.ToTensor(),
+     transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
+ ])
+
+ model_url_vq = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_vqa.pth'
+
+ model_vq = blip_vqa(pretrained=model_url_vq, image_size=480, vit='base')
+ model_vq.eval()
+ model_vq = model_vq.to(device)
+
+
+ def inference(raw_image, approaches, question):
+     image = transform(raw_image).unsqueeze(0).to(device)
+     with torch.no_grad():
+         caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
+
+     return (plot_inference(raw_image, caption[0], approaches))
+     # return 'caption: '+caption[0]
+
+
+ # PPE Detection code
+ import numpy as np
+ import run_code
+ import gradio as gr
+
+
+ def sepia_call(caption, Input_Image, MDETR_im, Approach):
+     pil_image = Input_Image
+     open_cv_image = np.asarray(pil_image)
+     sepia_img = run_code.run(open_cv_image, Approach)
+     images = sepia_img['img']
+     texts = sepia_img['text']
+
+     return (caption, MDETR_im, images, texts)
+
+
+ inputs = [gr.inputs.Image(type='pil'),
+           gr.inputs.Radio(choices=["Worker Helmet Separately", "Worker Helmet Vest", "Workers only"], type="value",
+                           default="Worker Helmet Vest", label="Model"), "textbox"]
+ outputs = [gr.outputs.Textbox(label="Output"), "image", "image", gr.outputs.Textbox(label="Output")]
+
+ title = "BLIP + MDETR + PPE Detection"
+
+ description = "Gradio demo for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Salesforce Research. To use it, simply upload your image, or click one of the examples to load them. Read more at the links below."
+
+ article = "<p style='text-align: center'><a href='https://arxiv.org/abs/2201.12086' target='_blank'>BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</a> | <a href='https://github.com/salesforce/BLIP' target='_blank'>Github Repo</a></p>"
+
+ gr.Interface(inference, inputs, outputs, title=title, description=description, article=article,
+              examples=[['starry.jpg', "Image Captioning", "None"]]).launch(share=True, enable_queue=True,
+                                                                            cache_examples=False)
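
The box post-processing in `app.py` (`box_cxcywh_to_xyxy` / `rescale_bboxes`) is self-contained, so it can be sanity-checked without loading MDETR. A minimal sketch, assuming only PyTorch is installed; the two helpers are re-implemented here rather than imported from `app.py`, since importing the module would trigger the model and image downloads:

```python
import torch

# Mirror of the two helpers in app.py: MDETR emits (cx, cy, w, h) boxes
# normalized to [0, 1]; convert them to (x0, y0, x1, y1) and scale to pixels.
def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(1)
    return torch.stack([x_c - 0.5 * w, y_c - 0.5 * h,
                        x_c + 0.5 * w, y_c + 0.5 * h], dim=1)

def rescale_bboxes(out_bbox, size):
    img_w, img_h = size
    return box_cxcywh_to_xyxy(out_bbox) * torch.tensor([img_w, img_h, img_w, img_h],
                                                       dtype=torch.float32)

# A centered box covering half of a 640x480 image (illustrative numbers only):
print(rescale_bboxes(torch.tensor([[0.5, 0.5, 0.5, 0.5]]), (640, 480)))
# tensor([[160., 120., 480., 360.]])
```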
app_run.ipynb ADDED
@@ -0,0 +1,400 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 3,
6
+ "id": "15468c81",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stderr",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "--2022-02-15 18:26:17-- https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg/1920px-Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg\n",
14
+ "Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208\n",
15
+ "Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.\n",
16
+ "HTTP request sent, awaiting response... 200 OK\n",
17
+ "Length: 1388211 (1.3M) [image/jpeg]\n",
18
+ "Saving to: ‘starry.jpg’\n",
19
+ "\n",
20
+ " 0K .......... .......... .......... .......... .......... 3% 776K 2s\n",
21
+ " 50K .......... .......... .......... .......... .......... 7% 877K 2s\n",
22
+ " 100K .......... .......... .......... .......... .......... 11% 2.93M 1s\n",
23
+ " 150K .......... .......... .......... .......... .......... 14% 2.28M 1s\n",
24
+ " 200K .......... .......... .......... .......... .......... 18% 4.04M 1s\n",
25
+ " 250K .......... .......... .......... .......... .......... 22% 5.46M 1s\n",
26
+ " 300K .......... .......... .......... .......... .......... 25% 6.40M 1s\n",
27
+ " 350K .......... .......... .......... .......... .......... 29% 2.41M 0s\n",
28
+ " 400K .......... .......... .......... .......... .......... 33% 3.18M 0s\n",
29
+ " 450K .......... .......... .......... .......... .......... 36% 3.03M 0s\n",
30
+ " 500K .......... .......... .......... .......... .......... 40% 8.30M 0s\n",
31
+ " 550K .......... .......... .......... .......... .......... 44% 3.31M 0s\n",
32
+ " 600K .......... .......... .......... .......... .......... 47% 3.10M 0s\n",
33
+ " 650K .......... .......... .......... .......... .......... 51% 12.3M 0s\n",
34
+ " 700K .......... .......... .......... .......... .......... 55% 4.20M 0s\n",
35
+ " 750K .......... .......... .......... .......... .......... 59% 1.93M 0s\n",
36
+ " 800K .......... .......... .......... .......... .......... 62% 6.28M 0s\n",
37
+ " 850K .......... .......... .......... .......... .......... 66% 3.09M 0s\n",
38
+ " 900K .......... .......... .......... .......... .......... 70% 22.7M 0s\n",
39
+ " 950K .......... .......... .......... .......... .......... 73% 4.43M 0s\n",
40
+ " 1000K .......... .......... .......... .......... .......... 77% 4.16M 0s\n",
41
+ " 1050K .......... .......... .......... .......... .......... 81% 2.29M 0s\n",
42
+ " 1100K .......... .......... .......... .......... .......... 84% 1.81M 0s\n",
43
+ " 1150K .......... .......... .......... .......... .......... 88% 6.20M 0s\n",
44
+ " 1200K .......... .......... .......... .......... .......... 92% 2.03M 0s\n",
45
+ " 1250K .......... .......... .......... .......... .......... 95% 23.5M 0s\n",
46
+ " 1300K .......... .......... .......... .......... .......... 99% 5.04M 0s\n",
47
+ " 1350K ..... 100% 9.95M=0.5s\n",
48
+ "\n",
49
+ "2022-02-15 18:26:17 (2.89 MB/s) - ‘starry.jpg’ saved [1388211/1388211]\n",
50
+ "\n"
51
+ ]
52
+ },
53
+ {
54
+ "data": {
55
+ "application/vnd.jupyter.widget-view+json": {
56
+ "model_id": "02b7655f0b2b404b952b7c152a3a1661",
57
+ "version_major": 2,
58
+ "version_minor": 0
59
+ },
60
+ "text/plain": [
61
+ " 0%| | 0.00/262k [00:00<?, ?B/s]"
62
+ ]
63
+ },
64
+ "metadata": {},
65
+ "output_type": "display_data"
66
+ },
67
+ {
68
+ "name": "stderr",
69
+ "output_type": "stream",
70
+ "text": [
71
+ "Using cache found in /Users/sanjaykamath/.cache/torch/hub/ashkamath_mdetr_main\n",
72
+ "Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.weight']\n",
73
+ "- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
74
+ "- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
75
+ ]
76
+ },
77
+ {
78
+ "name": "stdout",
79
+ "output_type": "stream",
80
+ "text": [
81
+ "load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth\n",
82
+ "load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_vqa.pth\n",
83
+ "Running on local URL: http://127.0.0.1:7862/\n",
84
+ "Running on public URL: https://13389.gradio.app\n",
85
+ "\n",
86
+ "This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)\n"
87
+ ]
88
+ },
89
+ {
90
+ "data": {
91
+ "text/html": [
92
+ "\n",
93
+ " <iframe\n",
94
+ " width=\"900\"\n",
95
+ " height=\"500\"\n",
96
+ " src=\"https://13389.gradio.app\"\n",
97
+ " frameborder=\"0\"\n",
98
+ " allowfullscreen\n",
99
+ " \n",
100
+ " ></iframe>\n",
101
+ " "
102
+ ],
103
+ "text/plain": [
104
+ "<IPython.lib.display.IFrame at 0x7fce90855f40>"
105
+ ]
106
+ },
107
+ "metadata": {},
108
+ "output_type": "display_data"
109
+ },
110
+ {
111
+ "data": {
112
+ "text/plain": [
113
+ "(<fastapi.applications.FastAPI at 0x7fcfa3376fd0>,\n",
114
+ " 'http://127.0.0.1:7862/',\n",
115
+ " 'https://13389.gradio.app')"
116
+ ]
117
+ },
118
+ "execution_count": 3,
119
+ "metadata": {},
120
+ "output_type": "execute_result"
121
+ },
122
+ {
123
+ "name": "stderr",
124
+ "output_type": "stream",
125
+ "text": [
126
+ "2022-02-15 18:27:19.011924: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n",
127
+ "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
128
+ ]
129
+ }
130
+ ],
131
+ "source": [
132
+ "import os\n",
133
+ "os.system(\"wget https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg/1920px-Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg -O starry.jpg\")\n",
134
+ "\n",
135
+ "from PIL import Image\n",
136
+ "import requests\n",
137
+ "import torch\n",
138
+ "from torchvision import transforms\n",
139
+ "from torchvision.transforms.functional import InterpolationMode\n",
140
+ "\n",
141
+ "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
142
+ "\n",
143
+ "\n",
144
+ "\n",
145
+ " \n",
146
+ "#MDETR Code \n",
147
+ "import torchvision.transforms as T\n",
148
+ "import matplotlib.pyplot as plt\n",
149
+ "from collections import defaultdict\n",
150
+ "import torch.nn.functional as F\n",
151
+ "import numpy as np\n",
152
+ "from skimage.measure import find_contours\n",
153
+ "\n",
154
+ "from matplotlib import patches, lines\n",
155
+ "from matplotlib.patches import Polygon\n",
156
+ "import gradio as gr\n",
157
+ "\n",
158
+ "torch.hub.download_url_to_file('https://cdn.pixabay.com/photo/2014/03/04/15/10/elephants-279505_1280.jpg', 'elephant.jpg')\n",
159
+ "\n",
160
+ "\n",
161
+ "model2, postprocessor = torch.hub.load('ashkamath/mdetr:main', 'mdetr_efficientnetB5', pretrained=True, return_postprocessor=True)\n",
162
+ "model2 = model2.cpu()\n",
163
+ "model2.eval()\n",
164
+ "\n",
165
+ "\n",
166
+ "\n",
167
+ "\n",
168
+ "torch.set_grad_enabled(False);\n",
169
+ "# standard PyTorch mean-std input image normalization\n",
170
+ "transform = T.Compose([\n",
171
+ " T.Resize(800),\n",
172
+ " T.ToTensor(),\n",
173
+ " T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])\n",
174
+ "])\n",
175
+ "\n",
176
+ "# for output bounding box post-processing\n",
177
+ "def box_cxcywh_to_xyxy(x):\n",
178
+ " x_c, y_c, w, h = x.unbind(1)\n",
179
+ " b = [(x_c - 0.5 * w), (y_c - 0.5 * h),\n",
180
+ " (x_c + 0.5 * w), (y_c + 0.5 * h)]\n",
181
+ " return torch.stack(b, dim=1)\n",
182
+ "\n",
183
+ "def rescale_bboxes(out_bbox, size):\n",
184
+ " img_w, img_h = size\n",
185
+ " b = box_cxcywh_to_xyxy(out_bbox)\n",
186
+ " b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)\n",
187
+ " return b\n",
188
+ "# colors for visualization\n",
189
+ "COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],\n",
190
+ " [0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]\n",
191
+ "\n",
192
+ "def apply_mask(image, mask, color, alpha=0.5):\n",
193
+ " \"\"\"Apply the given mask to the image.\n",
194
+ " \"\"\"\n",
195
+ " for c in range(3):\n",
196
+ " image[:, :, c] = np.where(mask == 1,\n",
197
+ " image[:, :, c] *\n",
198
+ " (1 - alpha) + alpha * color[c] * 255,\n",
199
+ " image[:, :, c])\n",
200
+ " return image\n",
201
+ "\n",
202
+ "def plot_results(pil_img, scores, boxes, labels, masks=None):\n",
203
+ " plt.figure(figsize=(16,10))\n",
204
+ " np_image = np.array(pil_img)\n",
205
+ " ax = plt.gca()\n",
206
+ " colors = COLORS * 100\n",
207
+ " if masks is None:\n",
208
+ " masks = [None for _ in range(len(scores))]\n",
209
+ " assert len(scores) == len(boxes) == len(labels) == len(masks)\n",
210
+ " for s, (xmin, ymin, xmax, ymax), l, mask, c in zip(scores, boxes.tolist(), labels, masks, colors):\n",
211
+ " ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,\n",
212
+ " fill=False, color=c, linewidth=3))\n",
213
+ " text = f'{l}: {s:0.2f}'\n",
214
+ " ax.text(xmin, ymin, text, fontsize=15, bbox=dict(facecolor='white', alpha=0.8))\n",
215
+ "\n",
216
+ " if mask is None:\n",
217
+ " continue\n",
218
+ " np_image = apply_mask(np_image, mask, c)\n",
219
+ "\n",
220
+ " padded_mask = np.zeros((mask.shape[0] + 2, mask.shape[1] + 2), dtype=np.uint8)\n",
221
+ " padded_mask[1:-1, 1:-1] = mask\n",
222
+ " contours = find_contours(padded_mask, 0.5)\n",
223
+ " for verts in contours:\n",
224
+ " # Subtract the padding and flip (y, x) to (x, y)\n",
225
+ " verts = np.fliplr(verts) - 1\n",
226
+ " p = Polygon(verts, facecolor=\"none\", edgecolor=c)\n",
227
+ " ax.add_patch(p)\n",
228
+ "\n",
229
+ "\n",
230
+ " plt.imshow(np_image)\n",
231
+ " plt.axis('off')\n",
232
+ " plt.savefig('foo.png',bbox_inches='tight')\n",
233
+ " return 'foo.png'\n",
234
+ "\n",
235
+ "\n",
236
+ "def add_res(results, ax, color='green'):\n",
237
+ " #for tt in results.values():\n",
238
+ " if True:\n",
239
+ " bboxes = results['boxes']\n",
240
+ " labels = results['labels']\n",
241
+ " scores = results['scores']\n",
242
+ " #keep = scores >= 0.0\n",
243
+ " #bboxes = bboxes[keep].tolist()\n",
244
+ " #labels = labels[keep].tolist()\n",
245
+ " #scores = scores[keep].tolist()\n",
246
+ " #print(torchvision.ops.box_iou(tt['boxes'].cpu().detach(), torch.as_tensor([[xmin, ymin, xmax, ymax]])))\n",
247
+ " \n",
248
+ " colors = ['purple', 'yellow', 'red', 'green', 'orange', 'pink']\n",
249
+ " \n",
250
+ " for i, (b, ll, ss) in enumerate(zip(bboxes, labels, scores)):\n",
251
+ " ax.add_patch(plt.Rectangle((b[0], b[1]), b[2] - b[0], b[3] - b[1], fill=False, color=colors[i], linewidth=3))\n",
252
+ " cls_name = ll if isinstance(ll,str) else CLASSES[ll]\n",
253
+ " text = f'{cls_name}: {ss:.2f}'\n",
254
+ " print(text)\n",
255
+ " ax.text(b[0], b[1], text, fontsize=15, bbox=dict(facecolor='white', alpha=0.8))\n",
256
+ "\n",
257
+ "\n",
258
+ "def plot_inference(im, caption, approaches):\n",
259
+ " \n",
260
+ " choices = {\"Worker Helmet Separately\" : 1,\"Worker Helmet Vest\":2, \"Workers only\":3}\n",
261
+ " \n",
262
+ " \n",
263
+ "# mean-std normalize the input image (batch-size: 1)\n",
264
+ " img = transform(im).unsqueeze(0).cpu()\n",
265
+ "\n",
266
+ " # propagate through the model\n",
267
+ " memory_cache = model2(img, [caption], encode_and_save=True)\n",
268
+ " outputs = model2(img, [caption], encode_and_save=False, memory_cache=memory_cache)\n",
269
+ "\n",
270
+ " # keep only predictions with 0.7+ confidence\n",
271
+ " probas = 1 - outputs['pred_logits'].softmax(-1)[0, :, -1].cpu()\n",
272
+ " keep = (probas > 0.7).cpu()\n",
273
+ "\n",
274
+ " # convert boxes from [0; 1] to image scales\n",
275
+ " bboxes_scaled = rescale_bboxes(outputs['pred_boxes'].cpu()[0, keep], im.size)\n",
276
+ "\n",
277
+ " # Extract the text spans predicted by each box\n",
278
+ " positive_tokens = (outputs[\"pred_logits\"].cpu()[0, keep].softmax(-1) > 0.1).nonzero().tolist()\n",
279
+ " predicted_spans = defaultdict(str)\n",
280
+ " for tok in positive_tokens:\n",
281
+ " item, pos = tok\n",
282
+ " if pos < 255:\n",
283
+ " span = memory_cache[\"tokenized\"].token_to_chars(0, pos)\n",
284
+ " predicted_spans [item] += \" \" + caption[span.start:span.end]\n",
285
+ "\n",
286
+ " labels = [predicted_spans [k] for k in sorted(list(predicted_spans .keys()))]\n",
287
+ " caption = 'Caption: '+ caption\n",
288
+ " return (sepia_call(caption, im, plot_results(im, probas[keep], bboxes_scaled, labels), choices[approaches]))\n",
289
+ " \n",
290
+ "\n",
291
+ "\n",
292
+ " \n",
293
+ "#BLIP Code\n",
294
+ "\n",
295
+ "\n",
296
+ "from modelsn.blip import blip_decoder\n",
297
+ "\n",
298
+ "image_size = 384\n",
299
+ "transform = transforms.Compose([\n",
300
+ " transforms.Resize((image_size,image_size),interpolation=InterpolationMode.BICUBIC),\n",
301
+ " transforms.ToTensor(),\n",
302
+ " transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))\n",
303
+ " ]) \n",
304
+ "\n",
305
+ "model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth'\n",
306
+ " \n",
307
+ "model = blip_decoder(pretrained=model_url, image_size=384, vit='base')\n",
308
+ "model.eval()\n",
309
+ "model = model.to(device)\n",
310
+ "\n",
311
+ "\n",
312
+ "from modelsn.blip_vqa import blip_vqa\n",
313
+ "\n",
314
+ "image_size_vq = 480\n",
315
+ "transform_vq = transforms.Compose([\n",
316
+ " transforms.Resize((image_size_vq,image_size_vq),interpolation=InterpolationMode.BICUBIC),\n",
317
+ " transforms.ToTensor(),\n",
318
+ " transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))\n",
319
+ " ]) \n",
320
+ "\n",
321
+ "model_url_vq = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_vqa.pth'\n",
322
+ " \n",
323
+ "model_vq = blip_vqa(pretrained=model_url_vq, image_size=480, vit='base')\n",
324
+ "model_vq.eval()\n",
325
+ "model_vq = model_vq.to(device)\n",
326
+ "\n",
327
+ "\n",
328
+ "\n",
329
+ "def inference(raw_image, approaches, question):\n",
330
+ " \n",
331
+ "\n",
332
+ " image = transform(raw_image).unsqueeze(0).to(device) \n",
333
+ " with torch.no_grad():\n",
334
+ " caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)\n",
335
+ "\n",
336
+ " return (plot_inference(raw_image, caption[0], approaches))\n",
337
+ " #return 'caption: '+caption[0]\n",
338
+ "\n",
339
+ " \n",
340
+ "\n",
341
+ " \n",
342
+ "#PPE Detection code\n",
343
+ "import numpy as np\n",
344
+ "import run_code\n",
345
+ "import gradio as gr\n",
346
+ " \n",
347
+ "\n",
348
+ "def sepia_call(caption, Input_Image, MDETR_im, Approach):\n",
349
+ " pil_image = Input_Image\n",
350
+ " open_cv_image = np.asarray(pil_image)\n",
351
+ " sepia_img = run_code.run(open_cv_image, Approach)\n",
352
+ " images = sepia_img['img']\n",
353
+ " texts= sepia_img['text']\n",
354
+ "\n",
355
+ " return (caption, MDETR_im, images, texts)\n",
356
+ "\n",
357
+ "\n",
358
+ "inputs = [gr.inputs.Image(type='pil'),gr.inputs.Radio(choices=[\"Worker Helmet Separately\",\"Worker Helmet Vest\", \"Workers only\"], type=\"value\", default=\"Worker Helmet Vest\", label=\"Model\"),\"textbox\"]\n",
359
+ "outputs = [gr.outputs.Textbox(label=\"Output\"), \"image\", \"image\", gr.outputs.Textbox(label=\"Output\")]\n",
360
+ "\n",
361
+ "\n",
362
+ "title = \"BLIP + MDETR + PPE Detection\"\n",
363
+ "\n",
364
+ "description = \"Gradio demo for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Salesforce Research. To use it, simply upload your image, or click one of the examples to load them. Read more at the links below.\"\n",
365
+ "\n",
366
+ "article = \"<p style='text-align: center'><a href='https://arxiv.org/abs/2201.12086' target='_blank'>BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</a> | <a href='https://github.com/salesforce/BLIP' target='_blank'>Github Repo</a></p>\"\n",
367
+ "\n",
368
+ "\n",
369
+ "gr.Interface(inference, inputs, outputs, title=title, description=description, article=article, examples=[['starry.jpg',\"Image Captioning\",\"None\"]]).launch(share=True,enable_queue=True,cache_examples=False)"
370
+ ]
371
+ },
372
+ {
373
+ "cell_type": "raw",
374
+ "id": "b2729aa9",
375
+ "metadata": {},
376
+ "source": []
377
+ }
378
+ ],
379
+ "metadata": {
380
+ "kernelspec": {
381
+ "display_name": "Python 3 (ipykernel)",
382
+ "language": "python",
383
+ "name": "python3"
384
+ },
385
+ "language_info": {
386
+ "codemirror_mode": {
387
+ "name": "ipython",
388
+ "version": 3
389
+ },
390
+ "file_extension": ".py",
391
+ "mimetype": "text/x-python",
392
+ "name": "python",
393
+ "nbconvert_exporter": "python",
394
+ "pygments_lexer": "ipython3",
395
+ "version": "3.8.12"
396
+ }
397
+ },
398
+ "nbformat": 4,
399
+ "nbformat_minor": 5
400
+ }
configs/caption_coco.yaml ADDED
@@ -0,0 +1,33 @@
+ image_root: '/export/share/datasets/vision/coco/images/'
+ ann_root: 'annotation'
+ coco_gt_root: 'annotation/coco_gt'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth'
+
+ # size of vit model; base or large
+ vit: 'base'
+ vit_grad_ckpt: False
+ vit_ckpt_layer: 0
+ batch_size: 32
+ init_lr: 1e-5
+
+ # vit: 'large'
+ # vit_grad_ckpt: True
+ # vit_ckpt_layer: 5
+ # batch_size: 16
+ # init_lr: 2e-6
+
+ image_size: 384
+
+ # generation configs
+ max_length: 20
+ min_length: 5
+ num_beams: 3
+ prompt: 'a picture of '
+
+ # optimizer
+ weight_decay: 0.05
+ min_lr: 0
+ max_epoch: 5
+
configs/med_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 30524,
+   "encoder_width": 768,
+   "add_cross_attention": true
+ }
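
This JSON is the text-side (BERT-style) configuration consumed when the BLIP models are built. A minimal sketch of inspecting it, assuming the Hugging Face `transformers` package is available; the Space itself constructs the models through `modelsn.blip` / `modelsn.blip_vqa`, not through this snippet:

```python
from transformers import BertConfig

# Load the MED (multimodal encoder-decoder) config shipped with the Space.
cfg = BertConfig.from_json_file('configs/med_config.json')

# vocab_size is 30524 rather than BERT-base's 30522 because two extra special
# tokens are added; add_cross_attention enables the image-conditioned layers.
print(cfg.vocab_size, cfg.add_cross_attention, cfg.hidden_size)
```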
configs/nlvr.yaml ADDED
@@ -0,0 +1,21 @@
+ image_root: '/export/share/datasets/vision/NLVR2/'
+ ann_root: 'annotation'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_nlvr.pth'
+
+ #size of vit model; base or large
+ vit: 'base'
+ batch_size_train: 16
+ batch_size_test: 64
+ vit_grad_ckpt: False
+ vit_ckpt_layer: 0
+ max_epoch: 15
+
+ image_size: 384
+
+ # optimizer
+ weight_decay: 0.05
+ init_lr: 3e-5
+ min_lr: 0
+
configs/nocaps.yaml ADDED
@@ -0,0 +1,15 @@
+ image_root: '/export/share/datasets/vision/nocaps/'
+ ann_root: 'annotation'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth'
+
+ vit: 'base'
+ batch_size: 32
+
+ image_size: 384
+
+ max_length: 20
+ min_length: 5
+ num_beams: 3
+ prompt: 'a picture of '
configs/pretrain.yaml ADDED
@@ -0,0 +1,27 @@
+ train_file: ['/export/share/junnan-li/VL_pretrain/annotation/coco_karpathy_train.json',
+              '/export/share/junnan-li/VL_pretrain/annotation/vg_caption.json',
+             ]
+ laion_path: ''
+
+ # size of vit model; base or large
+ vit: 'base'
+ vit_grad_ckpt: False
+ vit_ckpt_layer: 0
+
+ image_size: 224
+ batch_size: 75
+
+ queue_size: 57600
+ alpha: 0.4
+
+ # optimizer
+ weight_decay: 0.05
+ init_lr: 3e-4
+ min_lr: 1e-6
+ warmup_lr: 1e-6
+ lr_decay_rate: 0.9
+ max_epoch: 20
+ warmup_steps: 3000
+
+
+
configs/retrieval_coco.yaml ADDED
@@ -0,0 +1,34 @@
+ image_root: '/export/share/datasets/vision/coco/images/'
+ ann_root: 'annotation'
+ dataset: 'coco'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth'
+
+ # size of vit model; base or large
+
+ vit: 'base'
+ batch_size_train: 32
+ batch_size_test: 64
+ vit_grad_ckpt: True
+ vit_ckpt_layer: 4
+ init_lr: 1e-5
+
+ # vit: 'large'
+ # batch_size_train: 16
+ # batch_size_test: 32
+ # vit_grad_ckpt: True
+ # vit_ckpt_layer: 12
+ # init_lr: 5e-6
+
+ image_size: 384
+ queue_size: 57600
+ alpha: 0.4
+ k_test: 256
+ negative_all_rank: True
+
+ # optimizer
+ weight_decay: 0.05
+ min_lr: 0
+ max_epoch: 6
+
configs/retrieval_flickr.yaml ADDED
@@ -0,0 +1,34 @@
+ image_root: '/export/share/datasets/vision/flickr30k/'
+ ann_root: 'annotation'
+ dataset: 'flickr'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_flickr.pth'
+
+ # size of vit model; base or large
+
+ vit: 'base'
+ batch_size_train: 32
+ batch_size_test: 64
+ vit_grad_ckpt: True
+ vit_ckpt_layer: 4
+ init_lr: 1e-5
+
+ # vit: 'large'
+ # batch_size_train: 16
+ # batch_size_test: 32
+ # vit_grad_ckpt: True
+ # vit_ckpt_layer: 10
+ # init_lr: 5e-6
+
+ image_size: 384
+ queue_size: 57600
+ alpha: 0.4
+ k_test: 128
+ negative_all_rank: False
+
+ # optimizer
+ weight_decay: 0.05
+ min_lr: 0
+ max_epoch: 6
+
configs/vqa.yaml ADDED
@@ -0,0 +1,25 @@
+ vqa_root: '/export/share/datasets/vision/VQA/Images/mscoco/' #followed by train2014/
+ vg_root: '/export/share/datasets/vision/visual-genome/' #followed by image/
+ train_files: ['vqa_train','vqa_val','vg_qa']
+ ann_root: 'annotation'
+
+ # set pretrained as a file path or an url
+ pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_vqa.pth'
+
+ # size of vit model; base or large
+ vit: 'base'
+ batch_size_train: 16
+ batch_size_test: 32
+ vit_grad_ckpt: False
+ vit_ckpt_layer: 0
+ init_lr: 2e-5
+
+ image_size: 480
+
+ k_test: 128
+ inference: 'rank'
+
+ # optimizer
+ weight_decay: 0.05
+ min_lr: 0
+ max_epoch: 10
data/__init__.py ADDED
@@ -0,0 +1,101 @@
+ import torch
+ from torch.utils.data import DataLoader
+ from torchvision import transforms
+ from torchvision.transforms.functional import InterpolationMode
+
+ from data.coco_karpathy_dataset import coco_karpathy_train, coco_karpathy_caption_eval, coco_karpathy_retrieval_eval
+ from data.nocaps_dataset import nocaps_eval
+ from data.flickr30k_dataset import flickr30k_train, flickr30k_retrieval_eval
+ from data.vqa_dataset import vqa_dataset
+ from data.nlvr_dataset import nlvr_dataset
+ from data.pretrain_dataset import pretrain_dataset
+ from transform.randaugment import RandomAugment
+
+ def create_dataset(dataset, config, min_scale=0.5):
+
+     normalize = transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
+
+     transform_train = transforms.Compose([
+         transforms.RandomResizedCrop(config['image_size'],scale=(min_scale, 1.0),interpolation=InterpolationMode.BICUBIC),
+         transforms.RandomHorizontalFlip(),
+         RandomAugment(2,5,isPIL=True,augs=['Identity','AutoContrast','Brightness','Sharpness','Equalize',
+                                            'ShearX', 'ShearY', 'TranslateX', 'TranslateY', 'Rotate']),
+         transforms.ToTensor(),
+         normalize,
+     ])
+     transform_test = transforms.Compose([
+         transforms.Resize((config['image_size'],config['image_size']),interpolation=InterpolationMode.BICUBIC),
+         transforms.ToTensor(),
+         normalize,
+     ])
+
+     if dataset=='pretrain':
+         dataset = pretrain_dataset(config['train_file'], config['laion_path'], transform_train)
+         return dataset
+
+     elif dataset=='caption_coco':
+         train_dataset = coco_karpathy_train(transform_train, config['image_root'], config['ann_root'], prompt=config['prompt'])
+         val_dataset = coco_karpathy_caption_eval(transform_test, config['image_root'], config['ann_root'], 'val')
+         test_dataset = coco_karpathy_caption_eval(transform_test, config['image_root'], config['ann_root'], 'test')
+         return train_dataset, val_dataset, test_dataset
+
+     elif dataset=='nocaps':
+         val_dataset = nocaps_eval(transform_test, config['image_root'], config['ann_root'], 'val')
+         test_dataset = nocaps_eval(transform_test, config['image_root'], config['ann_root'], 'test')
+         return val_dataset, test_dataset
+
+     elif dataset=='retrieval_coco':
+         train_dataset = coco_karpathy_train(transform_train, config['image_root'], config['ann_root'])
+         val_dataset = coco_karpathy_retrieval_eval(transform_test, config['image_root'], config['ann_root'], 'val')
+         test_dataset = coco_karpathy_retrieval_eval(transform_test, config['image_root'], config['ann_root'], 'test')
+         return train_dataset, val_dataset, test_dataset
+
+     elif dataset=='retrieval_flickr':
+         train_dataset = flickr30k_train(transform_train, config['image_root'], config['ann_root'])
+         val_dataset = flickr30k_retrieval_eval(transform_test, config['image_root'], config['ann_root'], 'val')
+         test_dataset = flickr30k_retrieval_eval(transform_test, config['image_root'], config['ann_root'], 'test')
+         return train_dataset, val_dataset, test_dataset
+
+     elif dataset=='vqa':
+         train_dataset = vqa_dataset(transform_train, config['ann_root'], config['vqa_root'], config['vg_root'],
+                                     train_files = config['train_files'], split='train')
+         test_dataset = vqa_dataset(transform_test, config['ann_root'], config['vqa_root'], config['vg_root'], split='test')
+         return train_dataset, test_dataset
+
+     elif dataset=='nlvr':
+         train_dataset = nlvr_dataset(transform_train, config['image_root'], config['ann_root'],'train')
+         val_dataset = nlvr_dataset(transform_test, config['image_root'], config['ann_root'],'val')
+         test_dataset = nlvr_dataset(transform_test, config['image_root'], config['ann_root'],'test')
+         return train_dataset, val_dataset, test_dataset
+
+
+ def create_sampler(datasets, shuffles, num_tasks, global_rank):
+     samplers = []
+     for dataset,shuffle in zip(datasets,shuffles):
+         sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=num_tasks, rank=global_rank, shuffle=shuffle)
+         samplers.append(sampler)
+     return samplers
+
+
+ def create_loader(datasets, samplers, batch_size, num_workers, is_trains, collate_fns):
+     loaders = []
+     for dataset,sampler,bs,n_worker,is_train,collate_fn in zip(datasets,samplers,batch_size,num_workers,is_trains,collate_fns):
+         if is_train:
+             shuffle = (sampler is None)
+             drop_last = True
+         else:
+             shuffle = False
+             drop_last = False
+         loader = DataLoader(
+             dataset,
+             batch_size=bs,
+             num_workers=n_worker,
+             pin_memory=True,
+             sampler=sampler,
+             shuffle=shuffle,
+             collate_fn=collate_fn,
+             drop_last=drop_last,
+         )
+         loaders.append(loader)
+     return loaders
+
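
Taken together, `create_dataset`, `create_sampler`, and `create_loader` are meant to be driven by the YAML files under `configs/`. A minimal single-process sketch for the COCO captioning task, assuming the repository root is on `PYTHONPATH`, PyYAML is installed, and the paths inside `configs/caption_coco.yaml` have been adjusted to point at local data:

```python
import yaml

from data import create_dataset, create_loader

# Parse the task config; keys such as image_size, batch_size and prompt
# come straight from configs/caption_coco.yaml.
config = yaml.safe_load(open('configs/caption_coco.yaml', 'r'))

train_ds, val_ds, test_ds = create_dataset('caption_coco', config)

# Single-process run: pass None per dataset instead of a DistributedSampler;
# create_sampler would supply samplers under torch.distributed.
train_loader, val_loader, test_loader = create_loader(
    [train_ds, val_ds, test_ds],
    samplers=[None, None, None],
    batch_size=[config['batch_size']] * 3,
    num_workers=[4, 4, 4],
    is_trains=[True, False, False],
    collate_fns=[None, None, None],
)
```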
data/coco_karpathy_dataset.py ADDED
@@ -0,0 +1,126 @@
+ import os
+ import json
+
+ from torch.utils.data import Dataset
+ from torchvision.datasets.utils import download_url
+
+ from PIL import Image
+
+ from data.utils import pre_caption
+
+ class coco_karpathy_train(Dataset):
+     def __init__(self, transform, image_root, ann_root, max_words=30, prompt=''):
+         '''
+         image_root (string): Root directory of images (e.g. coco/images/)
+         ann_root (string): directory to store the annotation file
+         '''
+         url = 'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json'
+         filename = 'coco_karpathy_train.json'
+
+         download_url(url,ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filename),'r'))
+         self.transform = transform
+         self.image_root = image_root
+         self.max_words = max_words
+         self.prompt = prompt
+
+         self.img_ids = {}
+         n = 0
+         for ann in self.annotation:
+             img_id = ann['image_id']
+             if img_id not in self.img_ids.keys():
+                 self.img_ids[img_id] = n
+                 n += 1
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image_path = os.path.join(self.image_root,ann['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         caption = self.prompt+pre_caption(ann['caption'], self.max_words)
+
+         return image, caption, self.img_ids[ann['image_id']]
+
+
+ class coco_karpathy_caption_eval(Dataset):
+     def __init__(self, transform, image_root, ann_root, split):
+         '''
+         image_root (string): Root directory of images (e.g. coco/images/)
+         ann_root (string): directory to store the annotation file
+         split (string): val or test
+         '''
+         urls = {'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json',
+                 'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json'}
+         filenames = {'val':'coco_karpathy_val.json','test':'coco_karpathy_test.json'}
+
+         download_url(urls[split],ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filenames[split]),'r'))
+         self.transform = transform
+         self.image_root = image_root
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image_path = os.path.join(self.image_root,ann['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         img_id = ann['image'].split('/')[-1].strip('.jpg').split('_')[-1]
+
+         return image, int(img_id)
+
+
+ class coco_karpathy_retrieval_eval(Dataset):
+     def __init__(self, transform, image_root, ann_root, split, max_words=30):
+         '''
+         image_root (string): Root directory of images (e.g. coco/images/)
+         ann_root (string): directory to store the annotation file
+         split (string): val or test
+         '''
+         urls = {'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json',
+                 'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json'}
+         filenames = {'val':'coco_karpathy_val.json','test':'coco_karpathy_test.json'}
+
+         download_url(urls[split],ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filenames[split]),'r'))
+         self.transform = transform
+         self.image_root = image_root
+
+         self.text = []
+         self.image = []
+         self.txt2img = {}
+         self.img2txt = {}
+
+         txt_id = 0
+         for img_id, ann in enumerate(self.annotation):
+             self.image.append(ann['image'])
+             self.img2txt[img_id] = []
+             for i, caption in enumerate(ann['caption']):
+                 self.text.append(pre_caption(caption,max_words))
+                 self.img2txt[img_id].append(txt_id)
+                 self.txt2img[txt_id] = img_id
+                 txt_id += 1
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         image_path = os.path.join(self.image_root, self.annotation[index]['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         return image, index
data/flickr30k_dataset.py ADDED
@@ -0,0 +1,93 @@
+ import os
+ import json
+
+ from torch.utils.data import Dataset
+ from torchvision.datasets.utils import download_url
+
+ from PIL import Image
+
+ from data.utils import pre_caption
+
+ class flickr30k_train(Dataset):
+     def __init__(self, transform, image_root, ann_root, max_words=30, prompt=''):
+         '''
+         image_root (string): Root directory of images (e.g. flickr30k/)
+         ann_root (string): directory to store the annotation file
+         '''
+         url = 'https://storage.googleapis.com/sfr-vision-language-research/datasets/flickr30k_train.json'
+         filename = 'flickr30k_train.json'
+
+         download_url(url,ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filename),'r'))
+         self.transform = transform
+         self.image_root = image_root
+         self.max_words = max_words
+         self.prompt = prompt
+
+         self.img_ids = {}
+         n = 0
+         for ann in self.annotation:
+             img_id = ann['image_id']
+             if img_id not in self.img_ids.keys():
+                 self.img_ids[img_id] = n
+                 n += 1
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image_path = os.path.join(self.image_root,ann['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         caption = self.prompt+pre_caption(ann['caption'], self.max_words)
+
+         return image, caption, self.img_ids[ann['image_id']]
+
+
+ class flickr30k_retrieval_eval(Dataset):
+     def __init__(self, transform, image_root, ann_root, split, max_words=30):
+         '''
+         image_root (string): Root directory of images (e.g. flickr30k/)
+         ann_root (string): directory to store the annotation file
+         split (string): val or test
+         '''
+         urls = {'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/flickr30k_val.json',
+                 'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/flickr30k_test.json'}
+         filenames = {'val':'flickr30k_val.json','test':'flickr30k_test.json'}
+
+         download_url(urls[split],ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filenames[split]),'r'))
+         self.transform = transform
+         self.image_root = image_root
+
+         self.text = []
+         self.image = []
+         self.txt2img = {}
+         self.img2txt = {}
+
+         txt_id = 0
+         for img_id, ann in enumerate(self.annotation):
+             self.image.append(ann['image'])
+             self.img2txt[img_id] = []
+             for i, caption in enumerate(ann['caption']):
+                 self.text.append(pre_caption(caption,max_words))
+                 self.img2txt[img_id].append(txt_id)
+                 self.txt2img[txt_id] = img_id
+                 txt_id += 1
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         image_path = os.path.join(self.image_root, self.annotation[index]['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         return image, index
data/nlvr_dataset.py ADDED
@@ -0,0 +1,78 @@
+ import os
+ import json
+ import random
+
+ from torch.utils.data import Dataset
+ from torchvision.datasets.utils import download_url
+
+ from PIL import Image
+
+ from data.utils import pre_caption
+
+ class nlvr_dataset(Dataset):
+     def __init__(self, transform, image_root, ann_root, split):
+         '''
+         image_root (string): Root directory of images
+         ann_root (string): directory to store the annotation file
+         split (string): train, val or test
+         '''
+         urls = {'train':'https://storage.googleapis.com/sfr-vision-language-research/datasets/nlvr_train.json',
+                 'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/nlvr_dev.json',
+                 'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/nlvr_test.json'}
+         filenames = {'train':'nlvr_train.json','val':'nlvr_dev.json','test':'nlvr_test.json'}
+
+         download_url(urls[split],ann_root)
+         self.annotation = json.load(open(os.path.join(ann_root,filenames[split]),'r'))
+
+         self.transform = transform
+         self.image_root = image_root
+
+
+     def __len__(self):
+         return len(self.annotation)
+
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image0_path = os.path.join(self.image_root,ann['images'][0])
+         image0 = Image.open(image0_path).convert('RGB')
+         image0 = self.transform(image0)
+
+         image1_path = os.path.join(self.image_root,ann['images'][1])
+         image1 = Image.open(image1_path).convert('RGB')
+         image1 = self.transform(image1)
+
+         sentence = pre_caption(ann['sentence'], 40)
+
+         if ann['label']=='True':
+             label = 1
+         else:
+             label = 0
+
+         words = sentence.split(' ')
+
+         if 'left' not in words and 'right' not in words:
+             if random.random()<0.5:
+                 return image0, image1, sentence, label
+             else:
+                 return image1, image0, sentence, label
+         else:
+             if random.random()<0.5:
+                 return image0, image1, sentence, label
+             else:
+                 new_words = []
+                 for word in words:
+                     if word=='left':
+                         new_words.append('right')
+                     elif word=='right':
+                         new_words.append('left')
+                     else:
+                         new_words.append(word)
+
+                 sentence = ' '.join(new_words)
+                 return image1, image0, sentence, label
+
+
+
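
The left/right handling in `__getitem__` is the only non-obvious part of this dataset: when a sentence mentions `left` or `right` and the two images are swapped, those words are swapped as well so the label stays valid. A standalone sketch of that string logic (no data needed):

```python
# Mirrors the word-swap branch of nlvr_dataset.__getitem__.
def swap_left_right(sentence):
    flip = {'left': 'right', 'right': 'left'}
    return ' '.join(flip.get(word, word) for word in sentence.split(' '))

print(swap_left_right('the bottle on the left is empty'))
# the bottle on the right is empty
```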
data/nocaps_dataset.py ADDED
@@ -0,0 +1,32 @@
+ import os
+ import json
+
+ from torch.utils.data import Dataset
+ from torchvision.datasets.utils import download_url
+
+ from PIL import Image
+
+ class nocaps_eval(Dataset):
+     def __init__(self, transform, image_root, ann_root, split):
+         urls = {'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/nocaps_val.json',
+                 'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/nocaps_test.json'}
+         filenames = {'val':'nocaps_val.json','test':'nocaps_test.json'}
+
+         download_url(urls[split],ann_root)
+
+         self.annotation = json.load(open(os.path.join(ann_root,filenames[split]),'r'))
+         self.transform = transform
+         self.image_root = image_root
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image_path = os.path.join(self.image_root,ann['image'])
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         return image, int(ann['img_id'])
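nocaps_eval yields (image, img_id) pairs, which is exactly what the captioning evaluation loop in eval_nocaps.py consumes. A hedged sketch of batching it (image root and size are placeholders):

```python
# Hedged sketch: batching the NoCaps validation split; paths and sizes are placeholders.
from torchvision import transforms
from torch.utils.data import DataLoader
from data.nocaps_dataset import nocaps_eval

transform = transforms.Compose([transforms.Resize((384, 384)), transforms.ToTensor()])
val_set = nocaps_eval(transform,
                      image_root='/path/to/nocaps/images',  # placeholder
                      ann_root='annotation',
                      split='val')

loader = DataLoader(val_set, batch_size=32, shuffle=False)
images, img_ids = next(iter(loader))  # (32, 3, 384, 384) tensor + int64 image ids
```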
data/pretrain_dataset.py ADDED
@@ -0,0 +1,59 @@
+ import json
+ import os
+ import random
+
+ from torch.utils.data import Dataset
+
+ from PIL import Image
+ from PIL import ImageFile
+ ImageFile.LOAD_TRUNCATED_IMAGES = True
+ Image.MAX_IMAGE_PIXELS = None
+
+ from data.utils import pre_caption
+ import os,glob
+
+ class pretrain_dataset(Dataset):
+     def __init__(self, ann_file, laion_path, transform):
+
+         self.ann_pretrain = []
+         for f in ann_file:
+             print('loading '+f)
+             ann = json.load(open(f,'r'))
+             self.ann_pretrain += ann
+
+         self.laion_path = laion_path
+         if self.laion_path:
+             self.laion_files = glob.glob(os.path.join(laion_path,'*.json'))
+
+             print('loading '+self.laion_files[0])
+             with open(self.laion_files[0],'r') as f:
+                 self.ann_laion = json.load(f)
+
+             self.annotation = self.ann_pretrain + self.ann_laion
+         else:
+             self.annotation = self.ann_pretrain
+
+         self.transform = transform
+
+
+     def reload_laion(self, epoch):
+         n = epoch%len(self.laion_files)
+         print('loading '+self.laion_files[n])
+         with open(self.laion_files[n],'r') as f:
+             self.ann_laion = json.load(f)
+
+         self.annotation = self.ann_pretrain + self.ann_laion
+
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         image = Image.open(ann['image']).convert('RGB')
+         image = self.transform(image)
+         caption = pre_caption(ann['caption'],30)
+
+         return image, caption
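pretrain_dataset keeps the human-annotated pairs fixed and appends one LAION shard at a time; calling reload_laion(epoch) swaps in shard epoch % num_shards. A hedged training-loop sketch (annotation paths, transform and batch size are placeholders, not values from this repo's configs):

```python
# Hedged sketch of cycling LAION shards across epochs; all paths and sizes
# below are placeholders, not values taken from this repo.
from torchvision import transforms
from torch.utils.data import DataLoader
from data.pretrain_dataset import pretrain_dataset

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = pretrain_dataset(ann_file=['annotation/pretrain_caption.json'],  # placeholder
                           laion_path='/path/to/laion/shards',             # placeholder
                           transform=transform)

for epoch in range(3):
    dataset.reload_laion(epoch)  # swap the LAION shard; dataset length may change
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
    for images, captions in loader:
        pass  # model forward/backward would go here
```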
data/utils.py ADDED
@@ -0,0 +1,112 @@
+ import re
+ import json
+ import os
+
+ import torch
+ import torch.distributed as dist
+
+ import utils
+
+ def pre_caption(caption,max_words=50):
+     caption = re.sub(
+         r"([.!\"()*#:;~])",
+         ' ',
+         caption.lower(),
+     )
+     caption = re.sub(
+         r"\s{2,}",
+         ' ',
+         caption,
+     )
+     caption = caption.rstrip('\n')
+     caption = caption.strip(' ')
+
+     #truncate caption
+     caption_words = caption.split(' ')
+     if len(caption_words)>max_words:
+         caption = ' '.join(caption_words[:max_words])
+
+     return caption
+
+ def pre_question(question,max_ques_words=50):
+     question = re.sub(
+         r"([.!\"()*#:;~])",
+         '',
+         question.lower(),
+     )
+     question = question.rstrip(' ')
+
+     #truncate question
+     question_words = question.split(' ')
+     if len(question_words)>max_ques_words:
+         question = ' '.join(question_words[:max_ques_words])
+
+     return question
+
+
+ def save_result(result, result_dir, filename, remove_duplicate=''):
+     result_file = os.path.join(result_dir, '%s_rank%d.json'%(filename,utils.get_rank()))
+     final_result_file = os.path.join(result_dir, '%s.json'%filename)
+
+     json.dump(result,open(result_file,'w'))
+
+     dist.barrier()
+
+     if utils.is_main_process():
+         # combine results from all processes
+         result = []
+
+         for rank in range(utils.get_world_size()):
+             result_file = os.path.join(result_dir, '%s_rank%d.json'%(filename,rank))
+             res = json.load(open(result_file,'r'))
+             result += res
+
+         if remove_duplicate:
+             result_new = []
+             id_list = []
+             for res in result:
+                 if res[remove_duplicate] not in id_list:
+                     id_list.append(res[remove_duplicate])
+                     result_new.append(res)
+             result = result_new
+
+         json.dump(result,open(final_result_file,'w'))
+         print('result file saved to %s'%final_result_file)
+
+     return final_result_file
+
+
+
+ from pycocotools.coco import COCO
+ from pycocoevalcap.eval import COCOEvalCap
+ from torchvision.datasets.utils import download_url
+
+ def coco_caption_eval(coco_gt_root, results_file, split):
+     urls = {'val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val_gt.json',
+             'test':'https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test_gt.json'}
+     filenames = {'val':'coco_karpathy_val_gt.json','test':'coco_karpathy_test_gt.json'}
+
+     download_url(urls[split],coco_gt_root)
+     annotation_file = os.path.join(coco_gt_root,filenames[split])
+
+     # create coco object and coco_result object
+     coco = COCO(annotation_file)
+     coco_result = coco.loadRes(results_file)
+
+     # create coco_eval object by taking coco and coco_result
+     coco_eval = COCOEvalCap(coco, coco_result)
+
+     # evaluate on a subset of images by setting
+     # coco_eval.params['image_id'] = coco_result.getImgIds()
+     # please remove this line when evaluating the full validation set
+     # coco_eval.params['image_id'] = coco_result.getImgIds()
+
+     # evaluate results
+     # SPICE will take a few minutes the first time, but speeds up due to caching
+     coco_eval.evaluate()
+
+     # print output evaluation scores
+     for metric, score in coco_eval.eval.items():
+         print(f'{metric}: {score:.3f}')
+
+     return coco_eval
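pre_caption lowercases, replaces the listed punctuation with spaces, collapses repeated whitespace and truncates to max_words; pre_question is similar but deletes the punctuation outright and leaves '?' untouched. A quick illustration:

```python
# Quick illustration of the text normalisation helpers above.
from data.utils import pre_caption, pre_question

print(pre_caption('A DOG;  running!! (fast)', max_words=3))
# -> 'a dog running'   (punctuation stripped, lowercased, truncated to 3 words)

print(pre_question('What color is the dog?'))
# -> 'what color is the dog?'   (note: '?' is not in the stripped character set)
```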
data/vqa_dataset.py ADDED
@@ -0,0 +1,88 @@
+ import os
+ import json
+ import random
+ from PIL import Image
+
+ import torch
+ from torch.utils.data import Dataset
+ from data.utils import pre_question
+
+ from torchvision.datasets.utils import download_url
+
+ class vqa_dataset(Dataset):
+     def __init__(self, transform, ann_root, vqa_root, vg_root, train_files=[], split="train"):
+         self.split = split
+
+         self.transform = transform
+         self.vqa_root = vqa_root
+         self.vg_root = vg_root
+
+         if split=='train':
+             urls = {'vqa_train':'https://storage.googleapis.com/sfr-vision-language-research/datasets/vqa_train.json',
+                     'vqa_val':'https://storage.googleapis.com/sfr-vision-language-research/datasets/vqa_val.json',
+                     'vg_qa':'https://storage.googleapis.com/sfr-vision-language-research/datasets/vg_qa.json'}
+
+             self.annotation = []
+             for f in train_files:
+                 download_url(urls[f],ann_root)
+                 self.annotation += json.load(open(os.path.join(ann_root,'%s.json'%f),'r'))
+         else:
+             download_url('https://storage.googleapis.com/sfr-vision-language-research/datasets/vqa_test.json',ann_root)
+             self.annotation = json.load(open(os.path.join(ann_root,'vqa_test.json'),'r'))
+
+             download_url('https://storage.googleapis.com/sfr-vision-language-research/datasets/answer_list.json',ann_root)
+             self.answer_list = json.load(open(os.path.join(ann_root,'answer_list.json'),'r'))
+
+
+     def __len__(self):
+         return len(self.annotation)
+
+     def __getitem__(self, index):
+
+         ann = self.annotation[index]
+
+         if ann['dataset']=='vqa':
+             image_path = os.path.join(self.vqa_root,ann['image'])
+         elif ann['dataset']=='vg':
+             image_path = os.path.join(self.vg_root,ann['image'])
+
+         image = Image.open(image_path).convert('RGB')
+         image = self.transform(image)
+
+         if self.split == 'test':
+             question = pre_question(ann['question'])
+             question_id = ann['question_id']
+             return image, question, question_id
+
+
+         elif self.split=='train':
+
+             question = pre_question(ann['question'])
+
+             if ann['dataset']=='vqa':
+                 answer_weight = {}
+                 for answer in ann['answer']:
+                     if answer in answer_weight.keys():
+                         answer_weight[answer] += 1/len(ann['answer'])
+                     else:
+                         answer_weight[answer] = 1/len(ann['answer'])
+
+                 answers = list(answer_weight.keys())
+                 weights = list(answer_weight.values())
+
+             elif ann['dataset']=='vg':
+                 answers = [ann['answer']]
+                 weights = [0.2]
+
+             return image, question, answers, weights
+
+
+ def vqa_collate_fn(batch):
+     image_list, question_list, answer_list, weight_list, n = [], [], [], [], []
+     for image, question, answer, weights in batch:
+         image_list.append(image)
+         question_list.append(question)
+         weight_list += weights
+         answer_list += answer
+         n.append(len(answer))
+     return torch.stack(image_list,dim=0), question_list, answer_list, torch.Tensor(weight_list), n
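vqa_collate_fn flattens the variable-length answer lists of a batch into single answer/weight lists and records in n how many answers each question contributed. A hedged sketch of wiring it into a DataLoader (image roots and size are placeholders):

```python
# Hedged sketch: training batches with the custom VQA collate function.
# The image roots are placeholders; 'vqa_train' is one of the keys above.
from torchvision import transforms
from torch.utils.data import DataLoader
from data.vqa_dataset import vqa_dataset, vqa_collate_fn

transform = transforms.Compose([transforms.Resize((480, 480)), transforms.ToTensor()])
train_set = vqa_dataset(transform,
                        ann_root='annotation',
                        vqa_root='/path/to/coco/images',  # placeholder
                        vg_root='/path/to/vg/images',     # placeholder
                        train_files=['vqa_train'],
                        split='train')

loader = DataLoader(train_set, batch_size=8, shuffle=True, collate_fn=vqa_collate_fn)
images, questions, answers, weights, n = next(iter(loader))
# answers/weights are flat; n[i] answers belong to question i,
# so sum(n) == len(answers) == len(weights).
```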
elephant.jpg ADDED
eval_nocaps.py ADDED
@@ -0,0 +1,118 @@
+ '''
+  * Copyright (c) 2022, salesforce.com, inc.
+  * All rights reserved.
+  * SPDX-License-Identifier: BSD-3-Clause
+  * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+  * By Junnan Li
+ '''
+ import argparse
+ import os
+ import ruamel_yaml as yaml
+ import numpy as np
+ import random
+ import time
+ import datetime
+ import json
+ from pathlib import Path
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import torch.backends.cudnn as cudnn
+ import torch.distributed as dist
+ from torch.utils.data import DataLoader
+
+ from models.blip import blip_decoder
+ import utils
+ from data import create_dataset, create_sampler, create_loader
+ from data.utils import save_result
+
+ @torch.no_grad()
+ def evaluate(model, data_loader, device, config):
+     # evaluate
+     model.eval()
+
+     metric_logger = utils.MetricLogger(delimiter=" ")
+     header = 'Evaluation:'
+     print_freq = 10
+
+     result = []
+     for image, image_id in metric_logger.log_every(data_loader, print_freq, header):
+
+         image = image.to(device)
+
+         captions = model.generate(image, sample=False, num_beams=config['num_beams'], max_length=config['max_length'],
+                                   min_length=config['min_length'], repetition_penalty=1.1)
+
+         for caption, img_id in zip(captions, image_id):
+             result.append({"image_id": img_id.item(), "caption": caption})
+
+     return result
+
+
+ def main(args, config):
+     utils.init_distributed_mode(args)
+
+     device = torch.device(args.device)
+
+     # fix the seed for reproducibility
+     seed = args.seed + utils.get_rank()
+     torch.manual_seed(seed)
+     np.random.seed(seed)
+     random.seed(seed)
+     cudnn.benchmark = True
+
+     #### Dataset ####
+     print("Creating captioning dataset")
+     val_dataset, test_dataset = create_dataset('nocaps', config)
+
+     if args.distributed:
+         num_tasks = utils.get_world_size()
+         global_rank = utils.get_rank()
+         samplers = create_sampler([val_dataset,test_dataset], [False,False], num_tasks, global_rank)
+     else:
+         samplers = [None,None]
+
+     val_loader, test_loader = create_loader([val_dataset, test_dataset],samplers,
+                                             batch_size=[config['batch_size']]*2,num_workers=[4,4],
+                                             is_trains=[False, False], collate_fns=[None,None])
+
+     #### Model ####
+     print("Creating model")
+     model = blip_decoder(pretrained=config['pretrained'], image_size=config['image_size'], vit=config['vit'],
+                          prompt=config['prompt'])
+
+     model = model.to(device)
+
+     model_without_ddp = model
+     if args.distributed:
+         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
+         model_without_ddp = model.module
+
+     val_result = evaluate(model_without_ddp, val_loader, device, config)
+     val_result_file = save_result(val_result, args.result_dir, 'val', remove_duplicate='image_id')
+     test_result = evaluate(model_without_ddp, test_loader, device, config)
+     test_result_file = save_result(test_result, args.result_dir, 'test', remove_duplicate='image_id')
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--config', default='./configs/nocaps.yaml')
+     parser.add_argument('--output_dir', default='output/NoCaps')
+     parser.add_argument('--device', default='cuda')
+     parser.add_argument('--seed', default=42, type=int)
+     parser.add_argument('--world_size', default=1, type=int, help='number of distributed processes')
+     parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
+     parser.add_argument('--distributed', default=True, type=bool)
+     args = parser.parse_args()
+
+     config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
+
+     args.result_dir = os.path.join(args.output_dir, 'result')
+
+     Path(args.output_dir).mkdir(parents=True, exist_ok=True)
+     Path(args.result_dir).mkdir(parents=True, exist_ok=True)
+
+     yaml.dump(config, open(os.path.join(args.output_dir, 'config.yaml'), 'w'))
+
+     main(args, config)
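evaluate() only needs a captioning model and a loader that yields (image, img_id) batches, so it can also be driven on a single GPU without the distributed entry point. A hedged sketch reusing the same config keys read above (the checkpoint, image size and decoding values come from configs/nocaps.yaml and are not reproduced here):

```python
# Hedged single-GPU sketch of calling evaluate() directly; config values are
# read from configs/nocaps.yaml and are not reproduced here.
import torch
import ruamel_yaml as yaml
from models.blip import blip_decoder
from data import create_dataset, create_loader
from eval_nocaps import evaluate

config = yaml.load(open('./configs/nocaps.yaml', 'r'), Loader=yaml.Loader)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

val_dataset, test_dataset = create_dataset('nocaps', config)
val_loader, test_loader = create_loader([val_dataset, test_dataset], [None, None],
                                        batch_size=[config['batch_size']] * 2,
                                        num_workers=[4, 4], is_trains=[False, False],
                                        collate_fns=[None, None])

model = blip_decoder(pretrained=config['pretrained'], image_size=config['image_size'],
                     vit=config['vit'], prompt=config['prompt']).to(device)

val_result = evaluate(model, val_loader, device, config)
print(val_result[:2])  # [{'image_id': ..., 'caption': ...}, ...]
```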
examples/ex1.jpg ADDED
examples/ex2.jpg ADDED
examples/ex3.jpg ADDED
extras/.DS_Store ADDED
Binary file (6.15 kB). View file
 
extras/sample-images/0.JPG ADDED
extras/sample-images/1.JPG ADDED
extras/sample-images/10.jpg ADDED
extras/sample-images/2.jpg ADDED
extras/sample-images/3.jpg ADDED
extras/sample-images/4.jpg ADDED
extras/sample-images/5.jpg ADDED
extras/sample-images/6.JPG ADDED
extras/sample-images/7.JPG ADDED
extras/sample-images/8.jpg ADDED
extras/sample-images/9.jpg ADDED
foo.png ADDED
gradio_cached_examples/log.csv ADDED
@@ -0,0 +1,2 @@
+ Output
+ caption: a painting of a starry night over a city
local_run.ipynb ADDED
@@ -0,0 +1,347 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Running on local URL: http://127.0.0.1:7860/\n",
+ "\n",
+ "To create a public link, set `share=True` in `launch()`.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <iframe\n",
+ " width=\"900\"\n",
+ " height=\"500\"\n",
+ " src=\"http://127.0.0.1:7860/\"\n",
+ " frameborder=\"0\"\n",
+ " allowfullscreen\n",
+ " \n",
+ " ></iframe>\n",
+ " "
+ ],
+ "text/plain": [
+ "<IPython.lib.display.IFrame at 0x7fbca787f520>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(<fastapi.applications.FastAPI at 0x7fbcc67ceeb0>,\n",
+ " 'http://127.0.0.1:7860/',\n",
+ " None)"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2022-02-09 14:10:22.417549: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n",
+ "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Number of Helmets: 4\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 5\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 5, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Number of Helmets: 4\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "WARNING:tensorflow:5 out of the last 5 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7fbc729998b0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 5\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 5, 'WHV': 0, 'WV': 0}\n",
+ "WARNING:tensorflow:6 out of the last 6 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7fbc979e9ee0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "Workers wearing helmet and vest: 3\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 0\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 3, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "Number of Helmets: 3\n",
+ "Number of Vests: 1\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Number of Helmets: 4\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 6\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 4\n",
+ "Workers not wearing helmet and vest: 2\n",
+ "\n",
+ "\n",
+ "dict vals:\n",
+ "{'W': 6, 'WH': 4, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 6\n",
+ "dict vals:\n",
+ "{'W': 6, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Number of Helmets: 4\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 6\n",
+ "dict vals:\n",
+ "{'W': 6, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 6\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 4\n",
+ "Workers not wearing helmet and vest: 2\n",
+ "\n",
+ "\n",
+ "dict vals:\n",
+ "{'W': 6, 'WH': 4, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 1\n",
+ "Number of Helmets: 1\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 1, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 1\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 1\n",
+ "dict vals:\n",
+ "{'W': 1, 'WH': 1, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 1\n",
+ "dict vals:\n",
+ "{'W': 1, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 1\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 1\n",
+ "dict vals:\n",
+ "{'W': 1, 'WH': 1, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 5\n",
+ "Number of Helmets: 4\n",
+ "Number of Vests: 0\n",
+ "dict vals:\n",
+ "{'W': 5, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 6\n",
+ "Workers wearing helmet and vest: 0\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 4\n",
+ "Workers not wearing helmet and vest: 2\n",
+ "\n",
+ "\n",
+ "dict vals:\n",
+ "{'W': 6, 'WH': 4, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "Workers wearing helmet and vest: 3\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 0\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 3, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "Number of Helmets: 3\n",
+ "Number of Vests: 1\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 0, 'WV': 0}\n",
+ "\n",
+ "\n",
+ "\n",
+ "Total workers: 3\n",
+ "Workers wearing helmet and vest: 3\n",
+ "Workers wearing only vest: 0\n",
+ "Workers wearing only helmet: 0\n",
+ "dict vals:\n",
+ "{'W': 3, 'WH': 0, 'WHV': 3, 'WV': 0}\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "import run_code\n",
+ "import cv2\n",
+ "import gradio as gr\n",
+ "\n",
+ "\n",
+ "def sepia(Input_Image, Approach):\n",
+ " pil_image = Input_Image\n",
+ " open_cv_image = np.asarray(pil_image)\n",
+ " # Convert RGB to BGR\n",
+ " #open_cv_image = open_cv_image[:, :, ::-1].copy()\n",
+ " #Approach = 3\n",
+ " sepia_img = run_code.run(open_cv_image, Approach)\n",
+ " images = sepia_img['img']\n",
+ " texts= sepia_img['text']\n",
+ " #print (labels)\n",
+ " return images, texts\n",
+ "\n",
+ "image = [gr.inputs.Image(type=\"pil\"), gr.inputs.Radio([1, 2, 3])]\n",
+ "#output = [\"image\", gr.outputs.Label(num_top_classes=4)]\n",
+ "output = [\"image\", gr.outputs.Textbox(type=\"auto\")]\n",
+ "#output = gr.outputs.Label(num_top_classes=4)\n",
+ "\n",
+ "title=\"Real-time Detection of Personal-Protective-Equipment (PPE)\"\n",
+ "description=\"This demo is the implementation of Real-time Detection of Personal-Protective-Equipment (PPE) paper https://github.com/ciber-lab/pictor-ppe\" \\\n",
+ " \" - by Sanjay Kamath \"\n",
+ "examples = [[\"examples/ex1.jpg\", 1], [\"examples/ex2.jpg\", 2], [\"examples/ex3.jpg\", 3]]\n",
+ "\n",
+ "#iface = gr.Interface(sepia , [ gr.inputs.Image(shape=(200, 200)), gr.inputs.Radio([1, 2, 3])], \"image\", title=title,\n",
+ "# examples = [[\"examples/ex1.jpg\"], [\"examples/ex2.jpg\"], [\"examples/ex3.jpg\"]],\n",
+ "# description=description)\n",
+ "\n",
+ "iface = gr.Interface(fn=sepia, inputs=image, outputs=output, title=title, description=description, examples=examples)\n",
+ "\n",
+ "iface.launch()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
model-data/.DS_Store ADDED
Binary file (6.15 kB). View file
 
model-data/weights/pictor-ppe-v302-a1-yolo-v3-weights.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ec800aa5acdd9719ff5e63b34d1374e5c8a31e17f38f3a8250bf1aeeac1a972
+ size 246910096
model-data/weights/pictor-ppe-v302-a2-yolo-v3-weights.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:317831ba378b8ec02e24e57859876eb0348284c8a75155143c9df85ee478c47b
+ size 246931600
model-data/weights/pictor-ppe-v302-a3-yolo-v3-weights.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d06d4956d0f6b3ac71f02e103e9efdc4b222ce83aeae232f65ee6c04ee1dd2d7
+ size 246867088
model-data/weights/readme.md ADDED
@@ -0,0 +1 @@
+ Download the trained weights of YOLO models ([Google Drive folder](https://drive.google.com/drive/folders/13tCdROHnS0c5VibW1VO8pOEj0rXEvvGj?usp=sharing)) and put in this folder.
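If a scripted download is preferred over the manual step above, a hedged sketch using the third-party gdown package (not a dependency of this repo) against the same Google Drive folder:

```python
# Hedged sketch: gdown is a third-party package (pip install gdown), not part
# of this repo; the output directory is assumed to be model-data/weights/.
import gdown

gdown.download_folder(
    url='https://drive.google.com/drive/folders/13tCdROHnS0c5VibW1VO8pOEj0rXEvvGj?usp=sharing',
    output='model-data/weights',
    quiet=False,
)
```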
modelsn/__init__.py ADDED
File without changes