{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d0b72877",
   "metadata": {},
   "source": [
    "# vqgan-jax-encoding-yfcc100m"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba7b31e6",
   "metadata": {},
   "source": [
    "Same as `vqgan-jax-encoding-with-captions`, but for YFCC100M.\n",
    "\n",
    "This dataset was prepared by @borisdayma in Json lines format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "3b59489e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import requests\n",
    "from PIL import Image\n",
    "import numpy as np\n",
    "from tqdm import tqdm\n",
    "\n",
    "import torch\n",
    "import torchvision.transforms as T\n",
    "import torchvision.transforms.functional as TF\n",
    "from torchvision.transforms import InterpolationMode\n",
    "from torch.utils.data import Dataset, DataLoader\n",
    "from torchvision.datasets.folder import default_loader\n",
    "import os\n",
    "\n",
    "import jax\n",
    "from jax import pmap"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "511c3b9e",
   "metadata": {},
   "source": [
    "## VQGAN-JAX model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb408f6c",
   "metadata": {},
   "source": [
    "`dalle_mini` is a local package that contains the VQGAN-JAX model and other utilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "2ca50dc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dalle_mini.vqgan_jax.modeling_flax_vqgan import VQModel"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b60da9a",
   "metadata": {},
   "source": [
    "We'll use a VQGAN trained by using Taming Transformers and converted to a JAX model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "id": "29ce8b15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Working with z of shape (1, 256, 16, 16) = 65536 dimensions.\n"
     ]
    }
   ],
   "source": [
    "model = VQModel.from_pretrained(\"flax-community/vqgan_f16_16384\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7c4c1e6",
   "metadata": {},
   "source": [
    "## Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "33861477",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "id": "81b19eca",
   "metadata": {},
   "outputs": [],
   "source": [
    "yfcc100m = Path('/home/khali/TPU-Test/YFCC100M_OpenAI_subset')\n",
    "# Images are 'sharded' from the following directory\n",
    "yfcc100m_images = yfcc100m/'data'/'data'/'images'\n",
    "yfcc100m_metadata = yfcc100m/'metadata_YFCC100M.jsonl'\n",
    "yfcc100m_output = yfcc100m/'metadata_encoded.tsv'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c58bb4a",
   "metadata": {},
   "source": [
    "### Cleanup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a14ae3d",
   "metadata": {},
   "source": [
    "We need to select entries with images that exist. Otherwise we can't build batches because `Dataloader` does not support `None` in batches. We use Huggingface Datasets, I understand they support threaded reading of jsonl files, and I was running out of memory when using pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "7811648c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets\n",
    "from datasets import Dataset, load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "4811a230",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "tcmalloc: large alloc 1254047744 bytes == 0xb2b08000 @  0x7f9e78632680 0x7f9e78653824 0x585b92 0x504d56 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56 0x56acb6 0x5f5956 0x56aadf 0x5f5956 0x56acb6 0x568d9a 0x5f5b33 0x50b7f8 0x5f2702 0x56c332\n",
      "tcmalloc: large alloc 1254047744 bytes == 0xfd74e000 @  0x7f9e78632680 0x7f9e78653824 0x590214 0x586f90 0x56e1f3 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56 0x56acb6 0x5f5956 0x56aadf 0x5f5956 0x56acb6 0x568d9a 0x5f5b33 0x50b7f8 0x5f2702 0x56c332\n",
      "tcmalloc: large alloc 5016190976 bytes == 0x148b42000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5019099136 bytes == 0x273f12000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5019811840 bytes == 0x39f9a8000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5024571392 bytes == 0x4cb4ec000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5021097984 bytes == 0x4cb4ec000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5022818304 bytes == 0x4cb4ec000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5020794880 bytes == 0x4cb4ec000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5019451392 bytes == 0x39f9a8000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5020565504 bytes == 0x4cb4ec000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5012561920 bytes == 0x273f12000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5021835264 bytes == 0x5f6cba000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n",
      "tcmalloc: large alloc 5017436160 bytes == 0x273f12000 @  0x7f9e78632680 0x7f9e78653824 0x5b9144 0x7f9b2929127e 0x7f9b29291a19 0x7f9b29291886 0x7f9b29291cef 0x7f9b2928f204 0x5f2cc9 0x5f30ff 0x5705f6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x56acb6 0x5f5956 0x5a8cb3 0x56ae94 0x568d9a 0x68cdc7 0x5ff5d4 0x5c3cb0 0x56aadf 0x501148 0x56c422 0x501148 0x56c422 0x501148 0x504d56\n"
     ]
    }
   ],
   "source": [
    "# The metadata is too bog to load into memory at once, so chopping it into chunks\n",
    "chunk_size=1000000\n",
    "batch_no=1\n",
    "for chunk in pd.read_json(yfcc100m_metadata, orient=\"records\", lines=True,chunksize=chunk_size):\n",
    "    chunk.to_csv('./chunks/chunk'+str(batch_no)+'.tsv', sep=\"\\t\", index=False)\n",
    "    batch_no+=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "46b2f083",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>photoid</th>\n",
       "      <th>uid</th>\n",
       "      <th>unickname</th>\n",
       "      <th>datetaken</th>\n",
       "      <th>dateuploaded</th>\n",
       "      <th>capturedevice</th>\n",
       "      <th>title</th>\n",
       "      <th>description</th>\n",
       "      <th>usertags</th>\n",
       "      <th>machinetags</th>\n",
       "      <th>...</th>\n",
       "      <th>licenseurl</th>\n",
       "      <th>serverid</th>\n",
       "      <th>farmid</th>\n",
       "      <th>secret</th>\n",
       "      <th>secretoriginal</th>\n",
       "      <th>ext</th>\n",
       "      <th>marker</th>\n",
       "      <th>key</th>\n",
       "      <th>title_clean</th>\n",
       "      <th>description_clean</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>137943</td>\n",
       "      <td>48600072071@N01</td>\n",
       "      <td>doctor+paradox</td>\n",
       "      <td>2004-08-01 18:13:06.0</td>\n",
       "      <td>1091409186</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A+Picture+Share%21</td>\n",
       "      <td>Antenna</td>\n",
       "      <td>cameraphone,cayugaheights,green,hydrant,ithaca...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1650c7cdc6</td>\n",
       "      <td>1650c7cdc6</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>d29e7c6a3028418c64eb15e3cf577c2</td>\n",
       "      <td>A Picture Share!</td>\n",
       "      <td>Antenna</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1246361</td>\n",
       "      <td>44124324682@N01</td>\n",
       "      <td>mharrsch</td>\n",
       "      <td>2004-11-03 23:04:02.0</td>\n",
       "      <td>1099523042</td>\n",
       "      <td>NaN</td>\n",
       "      <td>An+ornate+Roman+urn</td>\n",
       "      <td>Photographed+at+the+%3Ca+href%3D%22http%3A%2F%...</td>\n",
       "      <td>ancient,baltimore,burial,death,empire,funeral,...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>cf37054610</td>\n",
       "      <td>cf37054610</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>d29f01b149167d683f9ddde464bb3db</td>\n",
       "      <td>An ornate Roman urn</td>\n",
       "      <td>Photographed at the Walters Art Museum, Baltim...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1251599</td>\n",
       "      <td>51035803024@N01</td>\n",
       "      <td>bmitd67</td>\n",
       "      <td>2004-10-30 17:09:32.0</td>\n",
       "      <td>1099538888</td>\n",
       "      <td>Canon+PowerShot+S30</td>\n",
       "      <td>Jai+%26+Tara+on+the+Cumberland</td>\n",
       "      <td>Another+trip+for+the+happy+couple.</td>\n",
       "      <td>blue+heron,cumberland+river,jai,tara,tennessee</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4a4234e32c</td>\n",
       "      <td>4a4234e32c</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>d296e9e34bdae41edb6c679ff824ab2a</td>\n",
       "      <td>Jai &amp; Tara on the Cumberland</td>\n",
       "      <td>Another trip for the happy couple.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2348587</td>\n",
       "      <td>73621375@N00</td>\n",
       "      <td>Thom+Watson</td>\n",
       "      <td>2004-12-18 21:08:09.0</td>\n",
       "      <td>1103497228</td>\n",
       "      <td>SONY+DSC-W1</td>\n",
       "      <td>Castle+gate+-+%22lite-brited%22</td>\n",
       "      <td>Taken+at+the+Miracle+of+Lights+display+in+Cent...</td>\n",
       "      <td>bullrunpark,castle,centreville,christmas,decor...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>7162c974c3</td>\n",
       "      <td>7162c974c3</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>d29ce96395848478b1e8396e44899</td>\n",
       "      <td>Castle gate - \"lite-brited\"</td>\n",
       "      <td>Taken at the Miracle of Lights display in Cent...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3516047</td>\n",
       "      <td>48600072071@N01</td>\n",
       "      <td>doctor+paradox</td>\n",
       "      <td>2005-01-18 16:44:18.0</td>\n",
       "      <td>1106084658</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A+Picture+Share%21</td>\n",
       "      <td>Tabular</td>\n",
       "      <td>cameraphone,moblog,unfound</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>663e0d8b3d</td>\n",
       "      <td>663e0d8b3d</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>d29abf32c4e12ff881f975b70e0cec0</td>\n",
       "      <td>A Picture Share!</td>\n",
       "      <td>Tabular</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999995</th>\n",
       "      <td>4648651054</td>\n",
       "      <td>24511045@N04</td>\n",
       "      <td>mtfrazier</td>\n",
       "      <td>2010-05-02 15:47:45.0</td>\n",
       "      <td>1275083371</td>\n",
       "      <td>Canon+EOS+50D</td>\n",
       "      <td>U.S.+Navy+Blue+Angels%3A+2010</td>\n",
       "      <td>2+May+2010%0ASunday%0ASt.+Joseph%2C+Missouri</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-nd/2.0/</td>\n",
       "      <td>4072</td>\n",
       "      <td>5</td>\n",
       "      <td>2d12d73fb0</td>\n",
       "      <td>dd5856ea42</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>60fa2911cb81eb25b356e9fee978aef</td>\n",
       "      <td>U.S. Navy Blue Angels: 2010</td>\n",
       "      <td>2 May 2010 Sunday St. Joseph, Missouri</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999996</th>\n",
       "      <td>4652130996</td>\n",
       "      <td>21963865@N04</td>\n",
       "      <td>GRAB1.0</td>\n",
       "      <td>2010-05-29 19:23:10.0</td>\n",
       "      <td>1275200833</td>\n",
       "      <td>SONY+DSLR-A230</td>\n",
       "      <td>Attempts+on+Her+Life</td>\n",
       "      <td>BAPA+1+production+of+Martin+Crimp%27s+Attempts...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-nd/2.0/</td>\n",
       "      <td>4003</td>\n",
       "      <td>5</td>\n",
       "      <td>8889121579</td>\n",
       "      <td>2f46599456</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>60f5ef5ce4c2d24566226abebd67d4</td>\n",
       "      <td>Attempts on Her Life</td>\n",
       "      <td>BAPA 1 production of Martin Crimp's Attempts o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999997</th>\n",
       "      <td>4652568339</td>\n",
       "      <td>64025277@N00</td>\n",
       "      <td>1Sock</td>\n",
       "      <td>2010-05-13 15:38:37.0</td>\n",
       "      <td>1275234267</td>\n",
       "      <td>Canon+EOS+DIGITAL+REBEL+XT</td>\n",
       "      <td>Carlsbad+Caverns+3</td>\n",
       "      <td>%E2%99%A5%E2%99%A5%E2%99%A5%E2%99%A5%E2%99%A5%...</td>\n",
       "      <td>carlsbad,carlsbad+caverns,cave,faa,new+mexico,...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-nd/2.0/</td>\n",
       "      <td>4010</td>\n",
       "      <td>5</td>\n",
       "      <td>0a1808a69e</td>\n",
       "      <td>cf6d348e3d</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>60f029482d1d1028fda5281daf498f</td>\n",
       "      <td>Carlsbad Caverns 3</td>\n",
       "      <td>♥♥♥♥♥♥♥ Interested in purchasing this photogra...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999998</th>\n",
       "      <td>4653110895</td>\n",
       "      <td>20483509@N00</td>\n",
       "      <td>subberculture</td>\n",
       "      <td>2010-05-30 15:37:05.0</td>\n",
       "      <td>1275245596</td>\n",
       "      <td>Canon+DIGITAL+IXUS+40</td>\n",
       "      <td>Want</td>\n",
       "      <td>Isn%27t+that+gorgeous%3F</td>\n",
       "      <td>2010,edinburgh+museum,may,phonebox,wood</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-sa/2.0/</td>\n",
       "      <td>4066</td>\n",
       "      <td>5</td>\n",
       "      <td>77c3b3a254</td>\n",
       "      <td>c4697e1511</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>60f72775f433cf8de3efaeb431866153</td>\n",
       "      <td>Want</td>\n",
       "      <td>Isn't that gorgeous?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999999</th>\n",
       "      <td>4655503987</td>\n",
       "      <td>8457193@N07</td>\n",
       "      <td>zackojones</td>\n",
       "      <td>2010-05-30 15:34:58.0</td>\n",
       "      <td>1275310230</td>\n",
       "      <td>Canon+EOS+7D</td>\n",
       "      <td>Summertime</td>\n",
       "      <td>You+gotta+love+it%21</td>\n",
       "      <td>georgia,savannah,united+states,us</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>http://creativecommons.org/licenses/by-nc-sa/2.0/</td>\n",
       "      <td>4043</td>\n",
       "      <td>5</td>\n",
       "      <td>caff543bfe</td>\n",
       "      <td>f60952ac4d</td>\n",
       "      <td>jpg</td>\n",
       "      <td>0</td>\n",
       "      <td>60f687e11b913bce461e9525d8047e0</td>\n",
       "      <td>Summertime</td>\n",
       "      <td>You gotta love it!</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000000 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           photoid              uid       unickname              datetaken  \\\n",
       "0           137943  48600072071@N01  doctor+paradox  2004-08-01 18:13:06.0   \n",
       "1          1246361  44124324682@N01        mharrsch  2004-11-03 23:04:02.0   \n",
       "2          1251599  51035803024@N01         bmitd67  2004-10-30 17:09:32.0   \n",
       "3          2348587     73621375@N00     Thom+Watson  2004-12-18 21:08:09.0   \n",
       "4          3516047  48600072071@N01  doctor+paradox  2005-01-18 16:44:18.0   \n",
       "...            ...              ...             ...                    ...   \n",
       "999995  4648651054     24511045@N04       mtfrazier  2010-05-02 15:47:45.0   \n",
       "999996  4652130996     21963865@N04         GRAB1.0  2010-05-29 19:23:10.0   \n",
       "999997  4652568339     64025277@N00           1Sock  2010-05-13 15:38:37.0   \n",
       "999998  4653110895     20483509@N00   subberculture  2010-05-30 15:37:05.0   \n",
       "999999  4655503987      8457193@N07      zackojones  2010-05-30 15:34:58.0   \n",
       "\n",
       "        dateuploaded               capturedevice  \\\n",
       "0         1091409186                         NaN   \n",
       "1         1099523042                         NaN   \n",
       "2         1099538888         Canon+PowerShot+S30   \n",
       "3         1103497228                 SONY+DSC-W1   \n",
       "4         1106084658                         NaN   \n",
       "...              ...                         ...   \n",
       "999995    1275083371               Canon+EOS+50D   \n",
       "999996    1275200833              SONY+DSLR-A230   \n",
       "999997    1275234267  Canon+EOS+DIGITAL+REBEL+XT   \n",
       "999998    1275245596       Canon+DIGITAL+IXUS+40   \n",
       "999999    1275310230                Canon+EOS+7D   \n",
       "\n",
       "                                  title  \\\n",
       "0                    A+Picture+Share%21   \n",
       "1                   An+ornate+Roman+urn   \n",
       "2        Jai+%26+Tara+on+the+Cumberland   \n",
       "3       Castle+gate+-+%22lite-brited%22   \n",
       "4                    A+Picture+Share%21   \n",
       "...                                 ...   \n",
       "999995    U.S.+Navy+Blue+Angels%3A+2010   \n",
       "999996             Attempts+on+Her+Life   \n",
       "999997               Carlsbad+Caverns+3   \n",
       "999998                             Want   \n",
       "999999                       Summertime   \n",
       "\n",
       "                                              description  \\\n",
       "0                                                 Antenna   \n",
       "1       Photographed+at+the+%3Ca+href%3D%22http%3A%2F%...   \n",
       "2                      Another+trip+for+the+happy+couple.   \n",
       "3       Taken+at+the+Miracle+of+Lights+display+in+Cent...   \n",
       "4                                                 Tabular   \n",
       "...                                                   ...   \n",
       "999995       2+May+2010%0ASunday%0ASt.+Joseph%2C+Missouri   \n",
       "999996  BAPA+1+production+of+Martin+Crimp%27s+Attempts...   \n",
       "999997  %E2%99%A5%E2%99%A5%E2%99%A5%E2%99%A5%E2%99%A5%...   \n",
       "999998                           Isn%27t+that+gorgeous%3F   \n",
       "999999                               You+gotta+love+it%21   \n",
       "\n",
       "                                                 usertags machinetags  ...  \\\n",
       "0       cameraphone,cayugaheights,green,hydrant,ithaca...         NaN  ...   \n",
       "1       ancient,baltimore,burial,death,empire,funeral,...         NaN  ...   \n",
       "2          blue+heron,cumberland+river,jai,tara,tennessee         NaN  ...   \n",
       "3       bullrunpark,castle,centreville,christmas,decor...         NaN  ...   \n",
       "4                              cameraphone,moblog,unfound         NaN  ...   \n",
       "...                                                   ...         ...  ...   \n",
       "999995                                                NaN         NaN  ...   \n",
       "999996                                                NaN         NaN  ...   \n",
       "999997  carlsbad,carlsbad+caverns,cave,faa,new+mexico,...         NaN  ...   \n",
       "999998            2010,edinburgh+museum,may,phonebox,wood         NaN  ...   \n",
       "999999                  georgia,savannah,united+states,us         NaN  ...   \n",
       "\n",
       "                                               licenseurl  serverid  farmid  \\\n",
       "0       http://creativecommons.org/licenses/by-nc-sa/2.0/         1       1   \n",
       "1       http://creativecommons.org/licenses/by-nc-sa/2.0/         1       1   \n",
       "2       http://creativecommons.org/licenses/by-nc-sa/2.0/         1       1   \n",
       "3       http://creativecommons.org/licenses/by-nc-sa/2.0/         2       1   \n",
       "4       http://creativecommons.org/licenses/by-nc-sa/2.0/         3       1   \n",
       "...                                                   ...       ...     ...   \n",
       "999995  http://creativecommons.org/licenses/by-nc-nd/2.0/      4072       5   \n",
       "999996  http://creativecommons.org/licenses/by-nc-nd/2.0/      4003       5   \n",
       "999997  http://creativecommons.org/licenses/by-nc-nd/2.0/      4010       5   \n",
       "999998     http://creativecommons.org/licenses/by-sa/2.0/      4066       5   \n",
       "999999  http://creativecommons.org/licenses/by-nc-sa/2.0/      4043       5   \n",
       "\n",
       "            secret secretoriginal  ext marker  \\\n",
       "0       1650c7cdc6     1650c7cdc6  jpg      0   \n",
       "1       cf37054610     cf37054610  jpg      0   \n",
       "2       4a4234e32c     4a4234e32c  jpg      0   \n",
       "3       7162c974c3     7162c974c3  jpg      0   \n",
       "4       663e0d8b3d     663e0d8b3d  jpg      0   \n",
       "...            ...            ...  ...    ...   \n",
       "999995  2d12d73fb0     dd5856ea42  jpg      0   \n",
       "999996  8889121579     2f46599456  jpg      0   \n",
       "999997  0a1808a69e     cf6d348e3d  jpg      0   \n",
       "999998  77c3b3a254     c4697e1511  jpg      0   \n",
       "999999  caff543bfe     f60952ac4d  jpg      0   \n",
       "\n",
       "                                     key                   title_clean  \\\n",
       "0        d29e7c6a3028418c64eb15e3cf577c2              A Picture Share!   \n",
       "1        d29f01b149167d683f9ddde464bb3db           An ornate Roman urn   \n",
       "2       d296e9e34bdae41edb6c679ff824ab2a  Jai & Tara on the Cumberland   \n",
       "3          d29ce96395848478b1e8396e44899   Castle gate - \"lite-brited\"   \n",
       "4        d29abf32c4e12ff881f975b70e0cec0              A Picture Share!   \n",
       "...                                  ...                           ...   \n",
       "999995   60fa2911cb81eb25b356e9fee978aef   U.S. Navy Blue Angels: 2010   \n",
       "999996    60f5ef5ce4c2d24566226abebd67d4          Attempts on Her Life   \n",
       "999997    60f029482d1d1028fda5281daf498f            Carlsbad Caverns 3   \n",
       "999998  60f72775f433cf8de3efaeb431866153                          Want   \n",
       "999999   60f687e11b913bce461e9525d8047e0                    Summertime   \n",
       "\n",
       "                                        description_clean  \n",
       "0                                                 Antenna  \n",
       "1       Photographed at the Walters Art Museum, Baltim...  \n",
       "2                      Another trip for the happy couple.  \n",
       "3       Taken at the Miracle of Lights display in Cent...  \n",
       "4                                                 Tabular  \n",
       "...                                                   ...  \n",
       "999995             2 May 2010 Sunday St. Joseph, Missouri  \n",
       "999996  BAPA 1 production of Martin Crimp's Attempts o...  \n",
       "999997  ♥♥♥♥♥♥♥ Interested in purchasing this photogra...  \n",
       "999998                               Isn't that gorgeous?  \n",
       "999999                                 You gotta love it!  \n",
       "\n",
       "[1000000 rows x 26 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# looking up at a chunk\n",
    "pd.read_csv(\"./chunks/chunk1.tsv\", sep=\"\\t\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "c51c5597",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>key</th>\n",
       "      <th>title_clean</th>\n",
       "      <th>description_clean</th>\n",
       "      <th>ext</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>d29e7c6a3028418c64eb15e3cf577c2</td>\n",
       "      <td>A Picture Share!</td>\n",
       "      <td>Antenna</td>\n",
       "      <td>jpg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>d29f01b149167d683f9ddde464bb3db</td>\n",
       "      <td>An ornate Roman urn</td>\n",
       "      <td>Photographed at the Walters Art Museum, Baltim...</td>\n",
       "      <td>jpg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>d296e9e34bdae41edb6c679ff824ab2a</td>\n",
       "      <td>Jai &amp; Tara on the Cumberland</td>\n",
       "      <td>Another trip for the happy couple.</td>\n",
       "      <td>jpg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d29ce96395848478b1e8396e44899</td>\n",
       "      <td>Castle gate - \"lite-brited\"</td>\n",
       "      <td>Taken at the Miracle of Lights display in Cent...</td>\n",
       "      <td>jpg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>d29abf32c4e12ff881f975b70e0cec0</td>\n",
       "      <td>A Picture Share!</td>\n",
       "      <td>Tabular</td>\n",
       "      <td>jpg</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                key                   title_clean  \\\n",
       "0   d29e7c6a3028418c64eb15e3cf577c2              A Picture Share!   \n",
       "1   d29f01b149167d683f9ddde464bb3db           An ornate Roman urn   \n",
       "2  d296e9e34bdae41edb6c679ff824ab2a  Jai & Tara on the Cumberland   \n",
       "3     d29ce96395848478b1e8396e44899   Castle gate - \"lite-brited\"   \n",
       "4   d29abf32c4e12ff881f975b70e0cec0              A Picture Share!   \n",
       "\n",
       "                                   description_clean  ext  \n",
       "0                                            Antenna  jpg  \n",
       "1  Photographed at the Walters Art Museum, Baltim...  jpg  \n",
       "2                 Another trip for the happy couple.  jpg  \n",
       "3  Taken at the Miracle of Lights display in Cent...  jpg  \n",
       "4                                            Tabular  jpg  "
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Looking at a chunk with only the relevant columns that we need\n",
    "df = pd.read_csv(\"./chunks/chunk1.tsv\", sep=\"\\t\")[[\"key\", \"title_clean\", \"description_clean\", \"ext\"]]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc1668f8",
   "metadata": {},
   "source": [
    "### Grabbing each chunks from the folder, cleaning it up, only taking the entries which image exist and appending it to the global df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abbcccf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# the function that helps us to decide whether an image with certain id exists in storage, we only take the ones that we have the images for\n",
    "def image_exists(item):\n",
    "    name, _, _, ext, _ = item\n",
    "    root=str(yfcc100m_images)\n",
    "    image_path = (Path(root)/name[0:3]/name[3:6]/name).with_suffix(\".\"+ext)\n",
    "    if image_path.exists():\n",
    "        return True\n",
    "    else:\n",
    "        return None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "44fa86ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This cell does it all, grabs each chunk, cleans it up based on image existing condition, etc.\n",
    "global_df = pd.DataFrame()\n",
    "chunks_dir = \"./chunks\"\n",
    "for filename in os.listdir(chunks_dir):\n",
    "        df = pd.read_csv(f\"./chunks/{str(filename)}\", sep=\"\\t\")[[\"key\", \"title_clean\", \"description_clean\", \"ext\"]]\n",
    "        df['caption'] = df[\"title_clean\"]+\". \"+df['description_clean']\n",
    "        df['is_exist'] = df.apply(image_exists, axis=1)\n",
    "        df = df.dropna()[[\"key\", \"caption\"]]\n",
    "        df.columns = ['image_file', 'caption']\n",
    "        global_df = global_df.append(df, ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "45024fdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# saving the tsv to disk\n",
    "global_df.to_csv('./chunks/YFCC_subset_clean.tsv', sep=\"\\t\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "dca4eb73",
   "metadata": {},
   "outputs": [],
   "source": [
    "# loading the tsv from disk (for explicitness, also my electricity was gone, glad it happened after I saved to the disk :( )\n",
    "\n",
    "dataset = pd.read_csv(f\"./chunks/YFCC_subset_clean.tsv\", sep=\"\\t\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "id": "a511264a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Luke Melas-Kyriazi's dataset.py's modified version for YFCC\n",
    "\"\"\"\n",
    "import warnings\n",
    "from typing import Optional, Callable\n",
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import torch\n",
    "import pandas as pd\n",
    "from torch.utils.data import Dataset\n",
    "from torchvision.datasets.folder import default_loader\n",
    "from PIL import ImageFile\n",
    "from PIL.Image import DecompressionBombWarning\n",
    "ImageFile.LOAD_TRUNCATED_IMAGES = True\n",
    "warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
    "warnings.filterwarnings(\"ignore\", category=DecompressionBombWarning)\n",
    "\n",
    "\n",
    "class CaptionDataset(Dataset):\n",
    "    \"\"\"\n",
    "    A PyTorch Dataset class for (image, texts) tasks. Note that this dataset \n",
    "    returns the raw text rather than tokens. This is done on purpose, because\n",
    "    it's easy to tokenize a batch of text after loading it from this dataset.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, *, images_root: str, captions_path: str, text_transform: Optional[Callable] = None, \n",
    "                 image_transform: Optional[Callable] = None, image_transform_type: str = 'torchvision',\n",
    "                 include_captions: bool = True):\n",
    "        \"\"\"\n",
    "        :param images_root: folder where images are stored\n",
    "        :param captions_path: path to csv that maps image filenames to captions\n",
    "        :param image_transform: image transform pipeline\n",
    "        :param text_transform: image transform pipeline\n",
    "        :param image_transform_type: image transform type, either `torchvision` or `albumentations`\n",
    "        :param include_captions: Returns a dictionary with `image`, `text` if `true`; otherwise returns just the images.\n",
    "        \"\"\"\n",
    "\n",
    "        # Base path for images\n",
    "        self.images_root = Path(images_root)\n",
    "\n",
    "        # Load captions as DataFrame\n",
    "        self.captions = pd.read_csv(f\"./chunks/YFCC_subset_clean.tsv\", sep=\"\\t\")\n",
    "        self.captions['image_file'] = self.captions['image_file'].astype(str)\n",
    "\n",
    "        # PyTorch transformation pipeline for the image (normalizing, etc.)\n",
    "        self.text_transform = text_transform\n",
    "        self.image_transform = image_transform\n",
    "        self.image_transform_type = image_transform_type.lower()\n",
    "        assert self.image_transform_type in ['torchvision', 'albumentations']\n",
    "\n",
    "        # Total number of datapoints\n",
    "        self.size = len(self.captions)\n",
    "\n",
    "        # Return image+captions or just images\n",
    "        self.include_captions = include_captions\n",
    "    \n",
    "    def image_exists(item):\n",
    "        name, caption = item\n",
    "        root=str(self.images_root)\n",
    "        image_path = (Path(root)/name[0:3]/name[3:6]/name).with_suffix(\".jpg\")\n",
    "\n",
    "        return image_path.exists()\n",
    "\n",
    "    def verify_that_all_images_exist(self):\n",
    "        for image_file in self.captions['image_file']:\n",
    "            if not image_exists:\n",
    "                print(f'file does not exist: {p}')\n",
    "\n",
    "    def _get_raw_image(self, i):\n",
    "        name = self.captions.iloc[i]['image_file']\n",
    "        image_path = (Path(self.images_root)/name[0:3]/name[3:6]/name).with_suffix(\".jpg\")\n",
    "        image = default_loader(image_path)\n",
    "        return image\n",
    "\n",
    "    def _get_raw_text(self, i):\n",
    "        return self.captions.iloc[i]['caption']\n",
    "\n",
    "    def __getitem__(self, i):\n",
    "        image = self._get_raw_image(i)\n",
    "        caption = self._get_raw_text(i)\n",
    "        if self.image_transform is not None:\n",
    "            if self.image_transform_type == 'torchvision':\n",
    "                image = self.image_transform(image)\n",
    "            elif self.image_transform_type == 'albumentations':\n",
    "                image = self.image_transform(image=np.array(image))['image']\n",
    "            else:\n",
    "                raise NotImplementedError(f\"{self.image_transform_type=}\")\n",
    "        return {'image': image, 'text': caption} if self.include_captions else image\n",
    "\n",
    "    def __len__(self):\n",
    "        return self.size\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    import albumentations as A\n",
    "    from albumentations.pytorch import ToTensorV2\n",
    "    from transformers import AutoTokenizer\n",
    "    \n",
    "\n",
    "    images_root = \"/home/khali/TPU-Test/YFCC100M_OpenAI_subset/data/data/images\"\n",
    "    captions_path = './YFCC_subset_clean.tsv'\n",
    "    image_size = 256\n",
    "    \n",
    "    # Create transforms\n",
    "    def image_transform(image):\n",
    "        s = min(image.size)\n",
    "        r = image_size / s\n",
    "        s = (round(r * image.size[1]), round(r * image.size[0]))\n",
    "        image = TF.resize(image, s, interpolation=InterpolationMode.LANCZOS)\n",
    "        image = TF.center_crop(image, output_size = 2 * [image_size])\n",
    "        image = torch.unsqueeze(T.ToTensor()(image), 0)\n",
    "        image = image.permute(0, 2, 3, 1).numpy()\n",
    "        return image\n",
    "    \n",
    "    # Create dataset\n",
    "    dataset = CaptionDataset(\n",
    "        images_root=images_root,\n",
    "        captions_path=captions_path,\n",
    "        image_transform=image_transform,\n",
    "        image_transform_type='torchvision',\n",
    "        include_captions=False\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "id": "cc922704",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2483316"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "id": "6e47ba46",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataloader = DataLoader(dataset, batch_size=32, num_workers=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c8a130eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# looking at a batch\n",
    "next(iter(dataloader))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c192fd44",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import matplotlib.pyplot as plt\n",
    "# for tensor_image, _ in dataloader:\n",
    "#     print(tensor_image)\n",
    "#     plt.imshow(tensor_image.permute(1, 2, 0))\n",
    "#     break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62ad01c3",
   "metadata": {},
   "source": [
    "## Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "id": "88f36d0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode(model, batch):\n",
    "#     print(\"jitting encode function\")\n",
    "    _, indices = model.encode(batch)\n",
    "    return indices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "id": "1f35f0cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "def superbatch_generator(dataloader, num_tpus):\n",
    "    iter_loader = iter(dataloader)\n",
    "    for batch in iter_loader:\n",
    "        superbatch = [batch.squeeze(1)]\n",
    "        try:\n",
    "            for b in range(num_tpus-1):\n",
    "                batch = next(iter_loader)\n",
    "                if batch is None:\n",
    "                    break\n",
    "                # Skip incomplete last batch\n",
    "                if batch.shape[0] == dataloader.batch_size:\n",
    "                    superbatch.append(batch.squeeze(1))\n",
    "        except StopIteration:\n",
    "            pass\n",
    "        superbatch = torch.stack(superbatch, axis=0)\n",
    "        yield superbatch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "id": "2210705b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "def encode_captioned_dataset(dataset, output_tsv, batch_size=32, num_workers=16):\n",
    "    if os.path.isfile(output_tsv):\n",
    "        print(f\"Destination file {output_tsv} already exists, please move away.\")\n",
    "        return\n",
    "    \n",
    "    num_tpus = 8    \n",
    "    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)\n",
    "    superbatches = superbatch_generator(dataloader, num_tpus=num_tpus)\n",
    "    \n",
    "    p_encoder = pmap(lambda batch: encode(model, batch))\n",
    "\n",
    "    # We save each superbatch to avoid reallocation of buffers as we process them.\n",
    "    # We keep the file open to prevent excessive file seeks.\n",
    "    with open(output_tsv, \"w\") as file:\n",
    "        iterations = len(dataset) // (batch_size * num_tpus)\n",
    "        for n in tqdm(range(iterations)):\n",
    "            superbatch = next(superbatches)\n",
    "            encoded = p_encoder(superbatch.numpy())\n",
    "            encoded = encoded.reshape(-1, encoded.shape[-1])\n",
    "\n",
    "            # Extract fields from the dataset internal `captions` property, and save to disk\n",
    "            start_index = n * batch_size * num_tpus\n",
    "            end_index = (n+1) * batch_size * num_tpus\n",
    "            paths = dataset.captions[\"image_file\"][start_index:end_index].values\n",
    "            captions = dataset.captions[\"caption\"][start_index:end_index].values\n",
    "            encoded_as_string = list(map(lambda item: np.array2string(item, separator=',', max_line_width=50000, formatter={'int':lambda x: str(x)}), encoded))\n",
    "            batch_df = pd.DataFrame.from_dict({\"image_file\": paths, \"caption\": captions, \"encoding\": encoded_as_string})\n",
    "            batch_df.to_csv(file, sep='\\t', header=(n==0), index=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "id": "7704863d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4850/4850 [2:27:51<00:00,  1.83s/it]\n"
     ]
    }
   ],
   "source": [
    "encode_captioned_dataset(dataset, yfcc100m_output, batch_size=64, num_workers=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8953dd84",
   "metadata": {},
   "source": [
    "----"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.9.0 64-bit ('Python39')"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  },
  "interpreter": {
   "hash": "db471c52d602b4f5f40ecaf278e88ccfef85c29d0a1a07185b0d51fc7acf4e26"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}