{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Web Scraping" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The first part of the project consists of using web scraping techniques to obtain:\n", "\n", "* A set of mushroom images to train a multiclass Deep Learning classification model.\n", "* A dataset with the data associated with different mushroom species, to train a multiclass Machine Learning classification model.\n", "\n", "All the information will be obtained from the following websites:\n", "\n", "\n", "* https://www.mushroom.world/\n", "* https://www.wildfooduk.com/mushroom-guide/\n", "* https://www.fungipedia.org/hongos/\n", "\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Importing libraries" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.241675Z", "start_time": "2022-02-05T11:38:50.637941Z" } }, "outputs": [], "source": [ "# Import libraries\n", "import requests # HTTP library\n", "from bs4 import BeautifulSoup # Extract data from HTML and XML files\n", "import re # Regular expressions\n", "import os # Paths and directories\n", "import pathlib # Paths and directories\n", "import pandas as pd # DataFrame handling\n", "import numpy as np # Mathematical, algebraic and other functions\n", "from PIL import Image # Image editing\n", "from resizeimage import resizeimage # Image resizing" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Function definitions" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise, we define a set of functions that simplify its structure and execution:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### HTML parser\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we define a general function to ***parse* the content of the given URL**: " ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.257201Z", "start_time": "2022-02-05T11:38:51.242692Z" } }, "outputs": [], "source": [ "def getdata(url, parser='html.parser'):\n", "    # Set the headers of the HTTP request so that the server does not block the response:\n", "    headers = {\n", "        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',\n", "        'Accept-Language': 'es-ES,es;q=0.9',\n", "        'Cache-Control': 'max-age=0',\n", "        'Referer': 'https://google.com',\n", "        'DNT': '1',\n", "    }\n", "    r = requests.get(url, headers=headers)\n", "    soup = BeautifulSoup(r.text, parser)\n", "    return soup" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Directory switcher" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We also create a function to create and switch directories under the root folder for our images:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.273252Z", "start_time": "2022-02-05T11:38:51.258172Z" } }, "outputs": [], "source": [ "directory = os.getcwd()\n", "image_directory = os.path.join(directory, 'Images')\n", "# Create the root image folder if it does not exist yet\n", "os.makedirs(image_directory, exist_ok=True)" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.289268Z", "start_time": "2022-02-05T11:38:51.274220Z" } }, "outputs": [], "source": [ "def getdirectory(folder):\n", "    # Create the class folder inside the image directory (if needed) and move into it\n", "    target = os.path.join(image_directory, str(folder))\n", "    os.makedirs(target, exist_ok=True)\n", "    os.chdir(target)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Image downloader" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The following functions **download all the images from the websites** listed above:" ] },
{ "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2022-01-21T19:56:14.547984Z", "start_time": "2022-01-21T19:56:14.529008Z" } }, "source": [ "1. The **first download function** is for https://www.mushroom.world/.\n", "\n", "    Here we use regular expressions (*RE*) to get the \"href\" attribute of each image and build its URL. In this case, all of them have the form *https://www.mushroom.world/data/...*. We then iterate over this list and write each image to the directory." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.304866Z", "start_time": "2022-02-05T11:38:51.290267Z" } }, "outputs": [], "source": [ "def imagedown_1(soup):\n", "    images = soup.find_all(href=re.compile(\"data\"))\n", "    for image in images:\n", "        href = image['href']\n", "        link = 'https://www.mushroom.world' + href[3:]\n", "        name = href[15:-4] + ' mw' + '.jpg'\n", "        name = name.replace('/','-')\n", "        if not os.path.exists('./' + name): # Only download images we do not have yet\n", "            with open(name, 'wb') as f:\n", "                im = requests.get(link)\n", "                f.write(im.content)\n", "            # Resize the image to 512x512 pixels\n", "            with open(name, 'r+b') as f:\n", "                try:\n", "                    with Image.open(f) as img:\n", "                        cover = resizeimage.resize_cover(img, [512, 512])\n", "                        cover.save(name, img.format)\n", "                except Exception:\n", "                    # Keep the file as downloaded if it cannot be opened or resized\n", "                    pass" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. We now create a **function for the second website**: https://www.wildfooduk.com/mushroom-guide/\n", "\n", "    In this case, we use BS4 and the HTML structure to get the images from each page. Note that we first need a list of all the individual links."
] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.319987Z", "start_time": "2022-02-05T11:38:51.305866Z" } }, "outputs": [], "source": [ "def imagedown_2(soup):\n", "    image_set = soup.find('ul', {'class': 'mush-thumbs'})\n", "    images = image_set.find_all('a', {'id': re.compile(\"image-\")})\n", "    contador = 0\n", "    for image in images:\n", "        contador += 1\n", "        link = image['href']\n", "        name = soup.find('table').find_all('td')[5].string.strip() + \" \" + str(contador) + ' wf' + '.jpg'\n", "        name = name.replace('/','-')\n", "        if not os.path.exists('./' + name): # Only download images we do not have yet\n", "            with open(name, 'wb') as f:\n", "                im = requests.get(link)\n", "                f.write(im.content)\n", "            # Resize the image to 512x512 pixels\n", "            with open(name, 'r+b') as f:\n", "                try:\n", "                    with Image.open(f) as img:\n", "                        cover = resizeimage.resize_cover(img, [512, 512])\n", "                        cover.save(name, img.format)\n", "                except Exception:\n", "                    # Keep the file as downloaded if it cannot be opened or resized\n", "                    pass\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. We now create a function for the **third and last website**: https://www.fungipedia.org/hongos.html\n", "\n", "    In this case we use CSS selectors to get the images, since they are embedded in a \"Simple Image Gallery Pro\" plugin. As in the previous case, we first need a list of all the individual links:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.335814Z", "start_time": "2022-02-05T11:38:51.320673Z" } }, "outputs": [], "source": [ "def imagedown_3(soup):\n", "    # We use headers again; otherwise the server returns a 403 Forbidden error.\n", "    headers = {\n", "        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',\n", "        'Accept-Language': 'es-ES,es;q=0.9',\n", "        'Cache-Control': 'max-age=0',\n", "        'Referer': 'https://google.com',\n", "        'DNT': '1',\n", "    }\n", "    images = soup.select('.sigProLinkWrapper a[href]:not([href=\"\"])')\n", "    domain = 'https://www.fungipedia.org'\n", "    contador = 0\n", "    for image in images:\n", "        contador += 1\n", "        link = image.attrs.get('href')\n", "        name = soup.find('h1', {'class': 'itemTitle'}).string.strip() + \" \" + str(contador) + ' fp' + '.jpg'\n", "        name = name.replace('/','-')\n", "        if not os.path.exists('./' + name): # Only download images we do not have yet\n", "            with open(name, 'wb') as f:\n", "                im = requests.get(domain + link, allow_redirects = True, headers = headers)\n", "                f.write(im.content)\n", "            # Resize the image to 512x512 pixels\n", "            with open(name, 'r+b') as f:\n", "                try:\n", "                    with Image.open(f) as img:\n", "                        cover = resizeimage.resize_cover(img, [512, 512])\n", "                        cover.save(name, img.format)\n", "                except Exception:\n", "                    # Keep the file as downloaded if it cannot be opened or resized\n", "                    pass" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Pager" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we define the functions to **move to the next page** for each website's structure:" ] },
{ "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2022-01-21T20:00:04.349396Z", "start_time": "2022-01-21T20:00:04.339424Z" } }, "source": [ "1. For https://www.mushroom.world:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.351406Z", "start_time": "2022-02-05T11:38:51.336529Z" } }, "outputs": [], "source": [ "def getnextpage_1(soup):\n", "    # Starting from the pager div (id=pager), we can move to the next page inside a loop\n", "    page = soup.find('div', {'id': 'pager'})\n", "    next_link = page.find('a', string=re.compile(\"Next Page\")) if page else None\n", "    if next_link:\n", "        return 'https://www.mushroom.world' + str(next_link['href'])\n", "    # No pager or no \"Next Page\" link: this was the last page\n", "    return None" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "2. For https://www.wildfooduk.com/mushroom-guide/: in this case all the links are on a single URL, so no pager is needed." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. For https://www.fungipedia.org/:" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.366585Z", "start_time": "2022-02-05T11:38:51.352439Z" } }, "outputs": [], "source": [ "def getnextpage_2(soup):\n", "    pager = soup.find('div', {'class': 'pagination'})\n", "    next_link = pager.find('a', {'class': 'next'}) if pager else None\n", "    if next_link:\n", "        return 'https://www.fungipedia.org/' + str(next_link['href'])\n", "    # No pagination block or no \"next\" link: this was the last page\n", "    return None" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Image collection" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Mushroom World " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The URLs in the following dictionary correspond to the classes into which we will split the mushroom images:\n", "\n", "* **Edible**\n", "* **Inedible**\n", "* **Poisonous**\n", "\n", "Each class will be saved in its own folder in our directory."
] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:38:51.381814Z", "start_time": "2022-02-05T11:38:51.368582Z" } }, "outputs": [], "source": [ "edibility = {\"Edible\" : 'https://www.mushroom.world/mushrooms/edible',\n", "             \"Inedible\": 'https://www.mushroom.world/mushrooms/inedible',\n", "             \"Poisonous\": 'https://www.mushroom.world/mushrooms/poisonous'}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With the functions defined above, the following loop downloads all the images:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:48:33.300282Z", "start_time": "2022-02-05T11:38:51.382815Z" } }, "outputs": [], "source": [ "for i in edibility:\n", "    url = edibility[i]\n", "    getdirectory(i)\n", "    # getnextpage_1 returns None after the last page, which ends the loop\n", "    while isinstance(url, str):\n", "        soup = getdata(url,'html.parser')\n", "        imagedown_1(soup)\n", "        url = getnextpage_1(soup)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Wild Food UK" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We set the URLs again:" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T11:48:33.316268Z", "start_time": "2022-02-05T11:48:33.301281Z" } }, "outputs": [], "source": [ "edibility = {\"Edible\" : 'https://www.wildfooduk.com/mushroom-guide/?mushroom_type=edible',\n", "             \"Inedible\": 'https://www.wildfooduk.com/mushroom-guide/?mushroom_type=inedible',\n", "             \"Poisonous\": 'https://www.wildfooduk.com/mushroom-guide/?mushroom_type=poisonous'}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We write the loop that walks the site structure and downloads the images:" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T12:13:26.101999Z", "start_time": "2022-02-05T11:48:33.317238Z" } }, "outputs": [], "source": [ "for i in edibility:\n", "    url = edibility[i]\n", "    getdirectory(i)\n", "    soup = getdata(url)\n", "    # Collect the link to each individual mushroom page from the listing\n", "    mushroom_table = soup.find_all('td', {'class': 'mushroom-image'})\n", "    mushroom_links = []\n", "    for mushroom in mushroom_table:\n", "        mushroom_links.append(mushroom.find('a')['href'])\n", "    for link in mushroom_links:\n", "        soup = getdata(link,'html.parser')\n", "        imagedown_2(soup)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Fungipedia " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We set the URLs again. In this case, the filters applied on the website are reflected directly in the URLs." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T12:13:26.117978Z", "start_time": "2022-02-05T12:13:26.102991Z" } }, "outputs": [], "source": [ "edibility = {\"Edible\" : 'https://www.fungipedia.org/hongos/itemlist/filter.html?array12%5B%5D=buen-comestible&array12%5B%5D=buen-comestible-precaucion&array12%5B%5D=comestible&array12%5B%5D=comestible-precaucion&array12%5B%5D=excelente-comestible&array12%5B%5D=excelente-comestible-precaucion&moduleId=95&Itemid=337',\n", "             \"Inedible\": 'https://www.fungipedia.org/hongos/itemlist/filter.html?array12%5B%5D=no-comestible&array12%5B%5D=sin-valor&moduleId=95&Itemid=337',\n", "             \"Poisonous\": 'https://www.fungipedia.org/hongos/itemlist/filter.html?array12%5B%5D=mortal&array12%5B%5D=toxica&moduleId=95&Itemid=337'}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We write the loop that walks the site structure and downloads the images:" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2022-02-05T12:36:46.899216Z", "start_time": "2022-02-05T12:13:26.118948Z" } }, "outputs": [], "source": [ "for i in edibility:\n", "    url_main = edibility[i]\n", "    getdirectory(i)\n", "    while True:\n", "        soup_main = getdata(url_main,'html.parser')\n", "        # Collect the link to each individual mushroom page from the listing\n", "        mushroom_elements = soup_main.find_all('a', {'class': 'gris'})\n", "        mushroom_links = []\n", "        for element in mushroom_elements:\n", "            mushroom_links.append('https://www.fungipedia.org' + element['href'])\n", "        for link in mushroom_links:\n", "            soup_link = getdata(link,'html.parser')\n", "            imagedown_3(soup_link)\n", "        # Move to the next listing page; stop when there is none\n", "        url_main = getnextpage_2(soup_main)\n", "        if not url_main:\n", "            break" ] } ],
"metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "eRTY-COfOwwD", "ip1P14xN-uSX", "VaNHXO2N_Hv-", "mhPpAK2bMyQZ", "rEI-mXrkU4ku" ], "name": "PrevioTFM3.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "877px", "left": "70px", "top": "111.125px", "width": "165px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }