SaiedAlshahrani committed
Commit 360dd3d
1 Parent(s): 8a8b943
Upload 7 files
- LICENSE +21 -0
- README.md +48 -13
- packages.txt +2 -0
- report.py +313 -0
- requirements.txt +8 -0
- update-daemon.sh +20 -0
- update-metadata.py +196 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Saied Alshahrani
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,13 +1,48 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
+# Wikipedia Corpora Meta Report
+This repository shares the Python scripts behind the “**Wikipedia Corpora Meta Report**”, an online metadata report (dashboard) that sheds light on how bots and humans generate and edit Wikipedia editions. It gives the NLP community detailed metadata about each Wikipedia edition’s articles so that researchers can make informed decisions about using these articles to train their NLP systems.
+
+This dashboard interactively displays the metadata of each Wikipedia edition using a sunburst visualization, and lets users view the metadata in a tabular format and download the displayed metadata as a CSV file. The dashboard is open-sourced on GitHub under an MIT license and publicly hosted on Streamlit Community Cloud at [https://wikipedia-corpora-report.streamlit.app](https://wikipedia-corpora-report.streamlit.app/).
+
+This dashboard was presented as a *transparency* tool in our **accepted** paper, [**Performance Implications of Using Unrepresentative Corpora in Arabic Natural Language Processing**](https://aclanthology.org/2023.arabicnlp-1.19.pdf), at [*The First Arabic Natural Language Processing Conference (ArabicNLP 2023)*](https://sites.google.com/view/wanlp2023), co-located with [EMNLP 2023](https://2023.emnlp.org/) in Singapore (hybrid conference), December 7, 2023.
+
+
+### Local Run of Dashboard
+The dashboard is publicly hosted on Streamlit Community Cloud, but you can also run it locally on your machine by following these steps.
+
+1- Clone the dashboard's GitHub repository to your machine. Use this command in your terminal:
+
+```bash
+git clone https://github.com/SaiedAlshahrani/Wikipedia-Corpora-Report.git
+cd Wikipedia-Corpora-Report
+```
+
+2- Install the required Python packages. Use this command in your terminal:
+
+```bash
+pip install -r requirements.txt
+```
+
+3- Run the Streamlit local server. Use this command in your terminal:
+
+```bash
+streamlit run report.py
+```
+
+
+### BibTeX Citation:
+
+```bibtex
+@inproceedings{alshahrani-etal-2023-performance,
+    title = "{Performance Implications of Using Unrepresentative Corpora in {A}rabic Natural Language Processing}",
+    author = "Alshahrani, Saied and Alshahrani, Norah and Dey, Soumyabrata and Matthews, Jeanna",
+    booktitle = "Proceedings of the First Arabic Natural Language Processing Conference (ArabicNLP 2023)",
+    month = dec,
+    year = "2023",
+    address = "Singapore (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.arabicnlp-1.19",
+    doi = "10.18653/v1/2023.arabicnlp-1.19",
+    pages = "218--231",
+    abstract = "Wikipedia articles are a widely used source of training data for Natural Language Processing (NLP) research, particularly as corpora for low-resource languages like Arabic. However, it is essential to understand the extent to which these corpora reflect the representative contributions of native speakers, especially when many entries in a given language are directly translated from other languages or automatically generated through automated mechanisms. In this paper, we study the performance implications of using inorganic corpora that are not representative of native speakers and are generated through automated techniques such as bot generation or automated template-based translation. The case of the Arabic Wikipedia editions gives a unique case study of this since the Moroccan Arabic Wikipedia edition (ARY) is small but representative, the Egyptian Arabic Wikipedia edition (ARZ) is large but unrepresentative, and the Modern Standard Arabic Wikipedia edition (AR) is both large and more representative. We intrinsically evaluate the performance of two main NLP upstream tasks, namely word representation and language modeling, using word analogy evaluations and fill-mask evaluations using our two newly created datasets: Arab States Analogy Dataset (ASAD) and Masked Arab States Dataset (MASD). We demonstrate that for good NLP performance, we need both large and organic corpora; neither alone is sufficient. We show that producing large corpora through automated means can be a counter-productive, producing models that both perform worse and lack cultural richness and meaningful representation of the Arabic language and its native speakers.",
+}
+```
packages.txt
ADDED
@@ -0,0 +1,2 @@
+wget
+firefox-esr
report.py
ADDED
@@ -0,0 +1,313 @@
1 |
+
import ssl
|
2 |
+
import warnings
|
3 |
+
import datasets
|
4 |
+
import subprocess
|
5 |
+
import pandas as pd
|
6 |
+
import urllib.request
|
7 |
+
from time import sleep
|
8 |
+
import streamlit as st
|
9 |
+
from datetime import date
|
10 |
+
import plotly.express as px
|
11 |
+
from urllib.error import HTTPError
|
12 |
+
|
13 |
+
|
14 |
+
warnings.simplefilter("ignore", UserWarning)
|
15 |
+
warnings.simplefilter("ignore", FutureWarning)
|
16 |
+
pd.options.display.float_format = '{:.2f}'.format
|
17 |
+
ssl._create_default_https_context = ssl._create_unverified_context
|
18 |
+
|
19 |
+
st.set_page_config(page_title="Wikipedia Corpora Report", page_icon="https://webspace.clarkson.edu/~alshahsf/images/wikipedia1.png")
|
20 |
+
|
21 |
+
st.markdown("""
|
22 |
+
<h1 style='text-align: center;'>Wikipedia Corpora Meta Report</h1>
|
23 |
+
<h5 style='text-align: center;'>A Metadata Report of How Wikipedia Editions Are Generated and Edited</h5>
|
24 |
+
""", unsafe_allow_html=True)
|
25 |
+
|
26 |
+
|
27 |
+
def fetch_wikis_codes():
|
28 |
+
try:
|
29 |
+
urls = [r'https://en.wikipedia.org/wiki/Statistics_of_Wikipedias',
|
30 |
+
r'https://meta.wikimedia.org/wiki/List_of_Wikipedias']
|
31 |
+
|
32 |
+
for url in urls:
|
33 |
+
try: tables = pd.read_html(url)
|
34 |
+
except urllib.error.HTTPError: continue
|
35 |
+
|
36 |
+
for i in range(len(tables)):
|
37 |
+
dataframe = tables[i]
|
38 |
+
columns = list(dataframe.columns.values)
|
39 |
+
|
40 |
+
if(set(['Language', 'Wiki']).issubset(set(columns))):
|
41 |
+
wikis_codes = tables[i]
|
42 |
+
break
|
43 |
+
|
44 |
+
wikis_codes = wikis_codes[['Wiki', 'Language']]
|
45 |
+
wikis_codes = wikis_codes[wikis_codes["Language"].str.contains("(closed)", regex=False) == False]  # match the literal text "(closed)", not a regex group
|
46 |
+
wikis_codes = wikis_codes.set_index('Wiki').to_dict()['Language']
|
47 |
+
return wikis_codes
|
48 |
+
|
49 |
+
except Exception:
|
50 |
+
wikis_codes = {'en': 'English', 'ceb': 'Cebuano', 'de': 'German', 'sv': 'Swedish', 'fr': 'French', 'nl': 'Dutch', 'ru': 'Russian',
|
51 |
+
'es': 'Spanish', 'it': 'Italian', 'arz': 'Egyptian Arabic', 'pl': 'Polish', 'ja': 'Japanese', 'zh': 'Chinese', 'vi':
|
52 |
+
'Vietnamese', 'uk': 'Ukrainian', 'war': 'Waray', 'ar': 'Arabic', 'pt': 'Portuguese', 'fa': 'Persian', 'ca': 'Catalan',
|
53 |
+
'sr': 'Serbian', 'id': 'Indonesian', 'ko': 'Korean', 'no': 'Norwegian (Bokmål)', 'ce': 'Chechen', 'fi': 'Finnish', 'cs':
|
54 |
+
'Czech', 'tr': 'Turkish', 'hu': 'Hungarian', 'tt': 'Tatar', 'sh': 'Serbo-Croatian', 'ro': 'Romanian', 'zh-min-nan':
|
55 |
+
'Southern Min', 'eu': 'Basque', 'ms': 'Malay', 'eo': 'Esperanto', 'he': 'Hebrew', 'hy': 'Armenian', 'da': 'Danish', 'bg':
|
56 |
+
'Bulgarian', 'cy': 'Welsh', 'sk': 'Slovak', 'azb': 'South Azerbaijani', 'uz': 'Uzbek', 'et': 'Estonian', 'simple':
|
57 |
+
'Simple English', 'be': 'Belarusian', 'kk': 'Kazakh', 'min': 'Minangkabau', 'el': 'Greek', 'hr': 'Croatian', 'lt': 'Lithuanian',
|
58 |
+
'gl': 'Galician', 'az': 'Azerbaijani', 'ur': 'Urdu', 'sl': 'Slovene', 'lld': 'Ladin', 'ka': 'Georgian', 'nn': 'Norwegian (Nynorsk)',
|
59 |
+
'hi': 'Hindi', 'th': 'Thai', 'ta': 'Tamil', 'bn': 'Bengali', 'la': 'Latin', 'mk': 'Macedonian', 'zh-yue': 'Cantonese', 'ast':
|
60 |
+
'Asturian', 'lv': 'Latvian', 'af': 'Afrikaans', 'tg': 'Tajik', 'my': 'Burmese', 'mg': 'Malagasy', 'mr': 'Marathi', 'sq': 'Albanian',
|
61 |
+
'bs': 'Bosnian', 'oc': 'Occitan', 'te': 'Telugu', 'ml': 'Malayalam', 'nds': 'Low German', 'be-tarask': 'Belarusian (Taraškievica)',
|
62 |
+
'br': 'Breton', 'ky': 'Kyrgyz', 'sw': 'Swahili', 'jv': 'Javanese', 'lmo': 'Lombard', 'new': 'Newar', 'pnb': 'Western Punjabi', 'vec':
|
63 |
+
'Venetian', 'ht': 'Haitian Creole', 'pms': 'Piedmontese', 'ba': 'Bashkir', 'lb': 'Luxembourgish', 'su': 'Sundanese', 'ku': 'Kurdish (Kurmanji)',
|
64 |
+
'ga': 'Irish', 'szl': 'Silesian', 'is': 'Icelandic', 'fy': 'West Frisian', 'cv': 'Chuvash', 'ckb': 'Kurdish (Sorani)', 'pa': 'Punjabi', 'tl':
|
65 |
+
'Tagalog', 'an': 'Aragonese', 'wuu': 'Wu Chinese', 'diq': 'Zaza', 'io': 'Ido', 'sco': 'Scots', 'vo': 'Volapük', 'yo': 'Yoruba', 'ne': 'Nepali',
|
66 |
+
'ia': 'Interlingua', 'kn': 'Kannada', 'gu': 'Gujarati', 'als': 'Alemannic German', 'ha': 'Hausa', 'avk': 'Kotava', 'bar': 'Bavarian', 'crh':
|
67 |
+
'Crimean Tatar', 'scn': 'Sicilian', 'bpy': 'Bishnupriya Manipuri', 'qu': 'Quechua (Southern Quechua)', 'nv': 'Navajo', 'mn': 'Mongolian', 'xmf':
|
68 |
+
'Mingrelian', 'ban': 'Balinese', 'si': 'Sinhala', 'tum': 'Tumbuka', 'ps': 'Pashto', 'frr': 'North Frisian', 'os': 'Ossetian', 'mzn': 'Mazanderani',
|
69 |
+
'bat-smg': 'Samogitian', 'or': 'Odia', 'ig': 'Igbo', 'sah': 'Yakut', 'cdo': 'Eastern Min', 'gd': 'Scottish Gaelic', 'bug': 'Buginese', 'yi': 'Yiddish',
|
70 |
+
'sd': 'Sindhi', 'ilo': 'Ilocano', 'am': 'Amharic', 'nap': 'Neapolitan', 'li': 'Limburgish', 'bcl': 'Central Bikol', 'fo': 'Faroese', 'gor': 'Gorontalo',
|
71 |
+
'hsb': 'Upper Sorbian', 'map-bms': 'Banyumasan', 'mai': 'Maithili', 'shn': 'Shan', 'eml': 'Emilian-Romagnol', 'ace': 'Acehnese', 'zh-classical':
|
72 |
+
'Classical Chinese', 'sa': 'Sanskrit', 'as': 'Assamese', 'wa': 'Walloon', 'ie': 'Interlingue', 'hyw': 'Western Armenian', 'lij': 'Ligurian', 'mhr':
|
73 |
+
'Meadow Mari', 'zu': 'Zulu', 'sn': 'Shona', 'hif': 'Fiji Hindi', 'mrj': 'Hill Mari', 'bjn': 'Banjarese', 'mni': 'Meitei', 'km': 'Khmer', 'hak':
|
74 |
+
'Hakka Chinese', 'roa-tara': 'Tarantino', 'pam': 'Kapampangan', 'sat': 'Santali', 'rue': 'Rusyn', 'nso': 'Northern Sotho', 'bh': 'Bihari (Bhojpuri)',
|
75 |
+
'so': 'Somali', 'mi': 'Māori', 'se': 'Northern Sámi', 'myv': 'Erzya', 'vls': 'West Flemish', 'nds-nl': 'Dutch Low Saxon', 'dag': 'Dagbani', 'sc':
|
76 |
+
'Sardinian', 'ary': 'Moroccan Arabic', 'co': 'Corsican', 'kw': 'Cornish', 'bo': 'Lhasa Tibetan', 'vep': 'Veps', 'glk': 'Gilaki', 'tk': 'Turkmen', 'kab':
|
77 |
+
'Kabyle', 'gan': 'Gan Chinese', 'rw': 'Kinyarwanda', 'fiu-vro': 'Võro', 'ab': 'Abkhaz', 'gv': 'Manx', 'ug': 'Uyghur', 'nah': 'Nahuatl', 'zea': 'Zeelandic',
|
78 |
+
'skr': 'Saraiki', 'frp': 'Franco-Provençal', 'udm': 'Udmurt', 'pcd': 'Picard', 'mt': 'Maltese', 'kv': 'Komi', 'csb': 'Kashubian', 'gn': 'Guarani', 'smn':
|
79 |
+
'Inari Sámi', 'ay': 'Aymara', 'nrm': 'Norman', 'ks': 'Kashmiri', 'lez': 'Lezgian', 'lfn': 'Lingua Franca Nova', 'olo': 'Livvi-Karelian', 'mwl': 'Mirandese',
|
80 |
+
'stq': 'Saterland Frisian', 'lo': 'Lao', 'ang': 'Old English', 'mdf': 'Moksha', 'fur': 'Friulian', 'rm': 'Romansh', 'lad': 'Judaeo-Spanish', 'kaa': 'Karakalpak',
|
81 |
+
'gom': 'Konkani (Goan Konkani)', 'ext': 'Extremaduran', 'koi': 'Permyak', 'tyv': 'Tuvan', 'pap': 'Papiamento', 'av': 'Avar', 'dsb': 'Lower Sorbian', 'ln':
|
82 |
+
'Lingala', 'dty': 'Doteli', 'tw': 'Twi', 'cbk-zam': 'Chavacano (Zamboanga)', 'dv': 'Maldivian', 'ksh': 'Ripuarian', 'za': 'Zhuang (Standard Zhuang)', 'gag':
|
83 |
+
'Gagauz', 'bxr': 'Buryat (Russia Buriat)', 'pfl': 'Palatine German', 'lg': 'Luganda', 'szy': 'Sakizaya', 'pag': 'Pangasinan', 'blk': "Pa'O", 'pi': 'Pali',
|
84 |
+
'tay': 'Atayal', 'haw': 'Hawaiian', 'awa': 'Awadhi', 'inh': 'Ingush', 'krc': 'Karachay-Balkar', 'xal': 'Kalmyk Oirat', 'pdc': 'Pennsylvania Dutch', 'to':
|
85 |
+
'Tongan', 'atj': 'Atikamekw', 'tcy': 'Tulu', 'arc': 'Aramaic (Syriac)', 'mnw': 'Mon', 'jam': 'Jamaican Patois', 'shi': 'Shilha', 'kbp': 'Kabiye', 'wo':
|
86 |
+
'Wolof', 'anp': 'Angika', 'kbd': 'Kabardian', 'nia': 'Nias', 'nov': 'Novial', 'om': 'Oromo', 'ki': 'Kikuyu', 'nqo': "N'Ko", 'bi': 'Bislama', 'xh': 'Xhosa',
|
87 |
+
'tpi': 'Tok Pisin', 'tet': 'Tetum', 'ff': 'Fula', 'roa-rup': 'Aromanian', 'jbo': 'Lojban', 'fj': 'Fijian', 'kg': 'Kongo (Kituba)', 'lbe': 'Lak', 'ty': 'Tahitian',
|
88 |
+
'guw': 'Gun', 'cu': 'Old Church Slavonic', 'trv': 'Seediq', 'ami': 'Amis', 'srn': 'Sranan Tongo', 'sm': 'Samoan', 'mad': 'Madurese', 'alt': 'Southern Altai',
|
89 |
+
'ltg': 'Latgalian', 'gcr': 'French Guianese Creole', 'chr': 'Cherokee', 'tn': 'Tswana', 'ny': 'Chewa', 'st': 'Sotho', 'pih': 'Norfuk', 'rmy': 'Romani (Vlax Romani)',
|
90 |
+
'got': 'Gothic', 'ee': 'Ewe', 'pcm': 'Nigerian Pidgin', 'bm': 'Bambara', 'ss': 'Swazi', 'ts': 'Tsonga', 've': 'Venda', 'kcg': 'Tyap', 'chy': 'Cheyenne', 'rn':
|
91 |
+
'Kirundi', 'ch': 'Chamorro', 'gur': 'Frafra', 'ik': 'Iñupiaq', 'ady': 'Adyghe', 'pnt': 'Pontic Greek', 'guc': 'Wayuu', 'iu': 'Inuktitut', 'pwn': 'Paiwan', 'sg':
|
92 |
+
'Sango', 'din': 'Dinka', 'ti': 'Tigrinya', 'kl': 'Greenlandic', 'dz': 'Dzongkha', 'cr': 'Cree', 'ak': 'Akan'}
|
93 |
+
return wikis_codes
|
94 |
+
|
95 |
+
|
96 |
+
def run_daemon(args):
|
97 |
+
result = subprocess.run(args, capture_output=True, text=True)
|
98 |
+
try: result.check_returncode()
|
99 |
+
except subprocess.CalledProcessError as exception: raise exception
|
100 |
+
|
101 |
+
|
102 |
+
labels = []
|
103 |
+
wiki_codes = fetch_wikis_codes()
|
104 |
+
for key, value in wiki_codes.items():
|
105 |
+
labels.append(f"{value} ({key})")
|
106 |
+
|
107 |
+
# st.markdown("<br>",unsafe_allow_html=True)
|
108 |
+
|
109 |
+
selected_language = st.selectbox("Select or Search for a Wikipedia language:", labels, placeholder="Select or Search for a Wikipedia language")
|
110 |
+
|
111 |
+
|
112 |
+
@st.cache_data
|
113 |
+
def fetch_metadata_dataset():
|
114 |
+
# HF_TOKEN = st.secrets["HF_TOKEN"]
|
115 |
+
dataset = datasets.load_dataset("SaiedAlshahrani/Wikipedia-Corpora-Report", split="train")#, use_auth_token=HF_TOKEN)
|
116 |
+
dataset = dataset.to_pandas()
|
117 |
+
return dataset
|
118 |
+
|
119 |
+
dataset = fetch_metadata_dataset()
|
120 |
+
|
121 |
+
metadata = dataset[dataset['Wiki'] == selected_language]
|
122 |
+
|
123 |
+
retrieval_date = metadata['Retrieval-Date'].iloc[0]
|
124 |
+
|
125 |
+
now_date = date.today()
|
126 |
+
data_date = date(*(int(part) for part in retrieval_date.split('-')))  # expects 'YYYY-MM-DD'
|
127 |
+
delta = now_date - data_date
|
128 |
+
|
129 |
+
# if delta.days > 45: run_daemon(["bash", "update-daemon.sh"])
|
130 |
+
|
131 |
+
pages_content_bots = metadata['Values'].iloc[0]
|
132 |
+
pages_content_humans = metadata['Values'].iloc[1]
|
133 |
+
pages_non_content_bots = metadata['Values'].iloc[2]
|
134 |
+
pages_non_content_humans = metadata['Values'].iloc[3]
|
135 |
+
|
136 |
+
edits_content_bots = metadata['Values'].iloc[4]
|
137 |
+
edits_content_humans = metadata['Values'].iloc[5]
|
138 |
+
edits_non_content_bots = metadata['Values'].iloc[6]
|
139 |
+
edits_non_content_humans = metadata['Values'].iloc[7]
|
140 |
+
|
141 |
+
pages_content_pages = pages_content_bots+pages_content_humans
|
142 |
+
pages_non_content_pages = pages_non_content_bots+pages_non_content_humans
|
143 |
+
total_pages = pages_content_pages+pages_non_content_pages
|
144 |
+
|
145 |
+
edits_content_pages = edits_content_bots+edits_content_humans
|
146 |
+
edits_non_content_pages = edits_non_content_bots+edits_non_content_humans
|
147 |
+
total_edits = edits_content_pages + edits_non_content_pages
|
148 |
+
|
149 |
+
wiki_metadata = pd.DataFrame(metadata).reset_index(drop=True)
|
150 |
+
|
151 |
+
col1 , cc, col2 = st.columns([1.5, 1.75, 1], gap="small")
|
152 |
+
|
153 |
+
with col1:
|
154 |
+
display_data_table = st.checkbox('Display metadata in a table.', value=False)
|
155 |
+
|
156 |
+
with cc:
|
157 |
+
st.markdown(f'<p style="color:lightgray;font-family:\'IBM Plex Sans\',sans-serif;font-size:18px;"> ⓘ Latest Metadata Update: {retrieval_date}</p>', unsafe_allow_html=True)
|
158 |
+
|
159 |
+
with col2:
|
160 |
+
download_button = st.download_button(label="Download Metadata", data=wiki_metadata.to_csv().encode('utf-8'),
|
161 |
+
file_name=f'{selected_language.split("(")[0].strip(" ")}-Metadata-{retrieval_date}.csv', mime='text/csv',)
|
162 |
+
|
163 |
+
fig = px.sunburst(data_frame=wiki_metadata,
|
164 |
+
path=['Wiki','Metric', 'Sub-Metric', 'Editors'],
|
165 |
+
values='Values',
|
166 |
+
branchvalues="total",
|
167 |
+
color_discrete_sequence=['darkgray', 'black'],
|
168 |
+
template='xgridoff')
|
169 |
+
|
170 |
+
fig.update_traces(textinfo='label+percent parent')
|
171 |
+
fig.update_traces(hovertemplate="Label=%{label}<br>Value=%{value}<br>Parent=%{parent}")
|
172 |
+
fig.update_layout(margin=dict(t=0, l=0, r=0, b=0))
|
173 |
+
# fig.update_layout(uniformtext=dict(minsize=12, mode='hide'))
|
174 |
+
fig.add_layout_image(dict(x=.430, y=.615, sizex=0.23, sizey=0.23, opacity=0.22, layer="below",
|
175 |
+
source="https://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png"))
|
176 |
+
|
177 |
+
# st.markdown("<br>",unsafe_allow_html=True)
|
178 |
+
|
179 |
+
st.plotly_chart(fig, theme=None, use_container_width=True, config={'displayModeBar': False})
|
180 |
+
|
181 |
+
# st.markdown("##")
|
182 |
+
# st.markdown("<br>",unsafe_allow_html=True)
|
183 |
+
|
184 |
+
|
185 |
+
if display_data_table:
|
186 |
+
table_st_style = """
|
187 |
+
<style>
|
188 |
+
table {
|
189 |
+
border-collapse: collapse;
|
190 |
+
border: 1px solid black;
|
191 |
+
border-spacing: 0;
|
192 |
+
margin-left: 0;
|
193 |
+
margin-right: 0;
|
194 |
+
width: 100%;}
|
195 |
+
|
196 |
+
page {
|
197 |
+
border-collapse: collapse;}
|
198 |
+
|
199 |
+
td, th, tr {
|
200 |
+
border: 1px solid black;
|
201 |
+
padding: 0;}
|
202 |
+
|
203 |
+
.contentTableHeader {
|
204 |
+
background-color: lightgray;
|
205 |
+
color: black;}
|
206 |
+
</style>
|
207 |
+
"""
|
208 |
+
st.markdown(table_st_style, unsafe_allow_html=True)
|
209 |
+
|
210 |
+
st.markdown(f"""
|
211 |
+
<table border="1" width="100%" cellpadding="0" cellspacing="0">
|
212 |
+
<thead class="contentTableHeader">
|
213 |
+
<tr>
|
214 |
+
<td style="text-align:center"><b>Wikipedia</b></td>
|
215 |
+
<td style="text-align:center"><b>Totals</b></td>
|
216 |
+
<td style="text-align:center"><b>Pages</b></td>
|
217 |
+
<td style="text-align:center"><b>Editors</b></td>
|
218 |
+
</tr>
|
219 |
+
</thead>
|
220 |
+
<tbody style="margin: 0;padding: 0">
|
221 |
+
<tr>
|
222 |
+
<td style="text-align:center" rowspan="8">{selected_language}</td>
|
223 |
+
<td style="text-align:center" rowspan="4">Pages ({total_pages:,})</td>
|
224 |
+
<td style="text-align:center" rowspan="2">Articles ({pages_content_pages:,})</td>
|
225 |
+
<td style="text-align:center">Bots ({pages_content_bots:,})</td>
|
226 |
+
</tr>
|
227 |
+
<tr>
|
228 |
+
<td style="text-align:center">Humans ({pages_content_humans:,})</td>
|
229 |
+
</tr>
|
230 |
+
<tr>
|
231 |
+
<td style="text-align:center" rowspan="2">Non-Articles ({pages_non_content_pages:,})</td>
|
232 |
+
<td style="text-align:center">Bots ({pages_non_content_bots:,})</td>
|
233 |
+
</tr>
|
234 |
+
<tr>
|
235 |
+
<td style="text-align:center">Humans ({pages_non_content_humans:,})</td>
|
236 |
+
</tr>
|
237 |
+
<tr>
|
238 |
+
<td style="text-align:center" rowspan="4">Edits ({total_edits:,})</td>
|
239 |
+
<td style="text-align:center" rowspan="2">Articles ({edits_content_pages:,})</td>
|
240 |
+
<td style="text-align:center">Bots ({edits_content_bots:,})</td>
|
241 |
+
</tr>
|
242 |
+
<tr>
|
243 |
+
<td style="text-align:center">Humans ({edits_content_humans:,})</td>
|
244 |
+
</tr>
|
245 |
+
<tr>
|
246 |
+
<td style="text-align:center" rowspan="2">Non-Articles ({edits_non_content_pages:,})</td>
|
247 |
+
<td style="text-align:center">Bots ({edits_non_content_bots:,})</td>
|
248 |
+
</tr>
|
249 |
+
<tr>
|
250 |
+
<td style="text-align:center">Humans ({edits_non_content_humans:,})</td>
|
251 |
+
</tr>
|
252 |
+
</tbody>
|
253 |
+
</table>
|
254 |
+
""", unsafe_allow_html=True)
|
255 |
+
|
256 |
+
fonts_style = """
|
257 |
+
<style>
|
258 |
+
@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@200&display=swap');
|
259 |
+
</style>
|
260 |
+
"""
|
261 |
+
st.markdown(fonts_style, unsafe_allow_html=True)
|
262 |
+
|
263 |
+
hide_st_style = """
|
264 |
+
<style>
|
265 |
+
#MainMenu {visibility: hidden;}
|
266 |
+
header {visibility: hidden;}
|
267 |
+
footer {visibility: hidden;}
|
268 |
+
button[title="View fullscreen"]{visibility: hidden;}
|
269 |
+
</style>
|
270 |
+
"""
|
271 |
+
st.markdown(hide_st_style, unsafe_allow_html=True)
|
272 |
+
|
273 |
+
footer="""
|
274 |
+
<style>
|
275 |
+
.footer {
|
276 |
+
position: fixed;
|
277 |
+
left: 0;
|
278 |
+
bottom: 0;
|
279 |
+
width: 100%;
|
280 |
+
background-color: white;
|
281 |
+
color: #737373;
|
282 |
+
text-align: center;}
|
283 |
+
|
284 |
+
.p1 {
|
285 |
+
font-family: 'IBM Plex Sans', sans-serif;
|
286 |
+
font-size: 12px}
|
287 |
+
|
288 |
+
</style>
|
289 |
+
|
290 |
+
<div class="footer"> <p class="p1">Copyright © 2023 by Saied Alshahrani<br>Hosted with Streamlit Community Cloud</p> </div>
|
291 |
+
|
292 |
+
"""
|
293 |
+
st.markdown(footer, unsafe_allow_html=True)
|
294 |
+
|
295 |
+
st.markdown("""
|
296 |
+
<style>
|
297 |
+
.block-container {
|
298 |
+
padding-top: 0rem;
|
299 |
+
padding-bottom: 0rem;
|
300 |
+
padding-left: 0rem;
|
301 |
+
padding-right: 0rem;
|
302 |
+
}
|
303 |
+
</style>
|
304 |
+
""", unsafe_allow_html=True)
|
305 |
+
|
306 |
+
st.markdown("""
|
307 |
+
<style>
|
308 |
+
.br {
|
309 |
+
display: block;
|
310 |
+
margin: 0px 0;
|
311 |
+
}
|
312 |
+
</style>
|
313 |
+
""", unsafe_allow_html=True)
|
requirements.txt
ADDED
@@ -0,0 +1,8 @@
+lxml==4.9.1
+pandas==1.4.3
+plotly==5.15.0
+datasets==2.14.6
+streamlit==1.30.0
+selenium==3.141.0
+geckodriver-autoinstaller==0.1.0
+
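report.py computes how old the loaded metadata is (`delta = now_date - data_date`) and carries a commented-out check that would call `update-daemon.sh` once the data is more than 45 days old. A minimal sketch of that staleness test as a standalone function, assuming the retrieval date is a `YYYY-MM-DD` string as the split on `'-'` in report.py suggests. The name `is_stale` and its parameters are hypothetical and do not appear in the commit:

```python
from datetime import date


def is_stale(retrieval_date: str, today: date, max_age_days: int = 45) -> bool:
    """Return True when the metadata is more than `max_age_days` days old.

    `retrieval_date` is assumed to look like '2023-11-01' (year-month-day).
    """
    year, month, day = (int(part) for part in retrieval_date.split("-"))
    return (today - date(year, month, day)).days > max_age_days


# Fixed reference date so the example is reproducible.
reference_day = date(2023, 12, 7)
print(is_stale("2023-11-01", today=reference_day))  # metadata is 36 days old
print(is_stale("2023-10-01", today=reference_day))  # metadata is 67 days old
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check easy to test; the dashboard itself could pass `date.today()`.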
update-daemon.sh
ADDED
@@ -0,0 +1,20 @@
+python update-metadata.py
+
+# git lfs install
+git clone https://huggingface.co/datasets/SaiedAlshahrani/Wikipedia-Corpora-Report
+cd Wikipedia-Corpora-Report/
+
+head -n1 ../English--Wikipedia--Metadata.csv > Wikipedia-Corpora-Report.csv
+sed -i '' 1d ../*--Wikipedia--Metadata.csv
+cat ../*--Wikipedia--Metadata.csv >> Wikipedia-Corpora-Report.csv
+# cp -r ../all-metadata .
+
+git add .
+git status
+git commit -m "Update Wikipedia-Corpora-Report.csv"
+git push
+
+rm ../*--Wikipedia--Metadata.csv
+cp Wikipedia-Corpora-Report.csv ..
+cd ..
+
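update-daemon.sh builds `Wikipedia-Corpora-Report.csv` by taking the header row from the English file, deleting the first line of every `*--Wikipedia--Metadata.csv` file in place, and concatenating what remains. The `sed -i '' 1d` form is the BSD/macOS spelling; GNU sed most likely needs `sed -i 1d` instead. A minimal Python sketch of the same merge that leaves the source files untouched. The function name `merge_metadata_csvs` and its parameters are hypothetical, and it assumes every per-edition file shares the same header:

```python
import csv
from pathlib import Path


def merge_metadata_csvs(src_dir: Path, out_path: Path,
                        pattern: str = "*--Wikipedia--Metadata.csv") -> int:
    """Concatenate per-edition CSV files into one file with a single header.

    Returns the number of data rows written (header excluded).
    """
    rows_written = 0
    header = None
    with out_path.open("w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        # Sort for a deterministic row order across runs.
        for csv_path in sorted(src_dir.glob(pattern)):
            with csv_path.open(newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # Keep the first header seen; later headers are dropped.
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `merge_metadata_csvs(Path(".."), Path("Wikipedia-Corpora-Report.csv"))` would mirror the shell steps; the git commands and the cleanup of the per-edition files would still be run separately.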
update-metadata.py
ADDED
@@ -0,0 +1,196 @@
1 |
+
import selenium
|
2 |
+
import os, warnings
|
3 |
+
import urllib.request
|
4 |
+
from time import sleep
|
5 |
+
import pandas as pd, ssl
|
6 |
+
from selenium import webdriver
|
7 |
+
from urllib.error import HTTPError
|
8 |
+
|
9 |
+
warnings.simplefilter("ignore", UserWarning)
|
10 |
+
warnings.simplefilter("ignore", FutureWarning)
|
11 |
+
pd.options.display.float_format = '{:.2f}'.format
|
12 |
+
ssl._create_default_https_context = ssl._create_unverified_context
|
13 |
+
|
14 |
+
def fetch_wikis_codes():
|
15 |
+
try:
|
16 |
+
urls = [r'https://en.wikipedia.org/wiki/Statistics_of_Wikipedias',
|
17 |
+
r'https://meta.wikimedia.org/wiki/List_of_Wikipedias']
|
18 |
+
|
19 |
+
for url in urls:
|
20 |
+
try: tables = pd.read_html(url)
|
21 |
+
except urllib.error.HTTPError: continue
|
22 |
+
|
23 |
+
for i in range(len(tables)):
|
24 |
+
dataframe = tables[i]
|
25 |
+
columns = list(dataframe.columns.values)
|
26 |
+
|
27 |
+
if(set(['Language', 'Wiki']).issubset(set(columns))):
|
28 |
+
wikis_codes = tables[i]
|
29 |
+
break
|
30 |
+
|
31 |
+
wikis_codes = wikis_codes[['Wiki', 'Language']]
|
32 |
+
wikis_codes = wikis_codes[wikis_codes["Language"].str.contains("(closed)", regex=False) == False]  # match the literal text "(closed)", not a regex group
|
33 |
+
wikis_codes = wikis_codes.set_index('Wiki').to_dict()['Language']
|
34 |
+
return wikis_codes
|
35 |
+
|
36 |
+
except Exception:
|
37 |
+
wikis_codes = {'en': 'English', 'ceb': 'Cebuano', 'de': 'German', 'sv': 'Swedish', 'fr': 'French', 'nl': 'Dutch', 'ru': 'Russian',
|
38 |
+
'es': 'Spanish', 'it': 'Italian', 'arz': 'Egyptian Arabic', 'pl': 'Polish', 'ja': 'Japanese', 'zh': 'Chinese', 'vi':
|
39 |
+
'Vietnamese', 'uk': 'Ukrainian', 'war': 'Waray', 'ar': 'Arabic', 'pt': 'Portuguese', 'fa': 'Persian', 'ca': 'Catalan',
|
40 |
+
'sr': 'Serbian', 'id': 'Indonesian', 'ko': 'Korean', 'no': 'Norwegian (Bokmål)', 'ce': 'Chechen', 'fi': 'Finnish', 'cs':
|
41 |
+
'Czech', 'tr': 'Turkish', 'hu': 'Hungarian', 'tt': 'Tatar', 'sh': 'Serbo-Croatian', 'ro': 'Romanian', 'zh-min-nan':
|
42 |
+
'Southern Min', 'eu': 'Basque', 'ms': 'Malay', 'eo': 'Esperanto', 'he': 'Hebrew', 'hy': 'Armenian', 'da': 'Danish', 'bg':
|
43 |
+
'Bulgarian', 'cy': 'Welsh', 'sk': 'Slovak', 'azb': 'South Azerbaijani', 'uz': 'Uzbek', 'et': 'Estonian', 'simple':
|
44 |
+
'Simple English', 'be': 'Belarusian', 'kk': 'Kazakh', 'min': 'Minangkabau', 'el': 'Greek', 'hr': 'Croatian', 'lt': 'Lithuanian',
|
45 |
+
'gl': 'Galician', 'az': 'Azerbaijani', 'ur': 'Urdu', 'sl': 'Slovene', 'lld': 'Ladin', 'ka': 'Georgian', 'nn': 'Norwegian (Nynorsk)',
|
46 |
+
'hi': 'Hindi', 'th': 'Thai', 'ta': 'Tamil', 'bn': 'Bengali', 'la': 'Latin', 'mk': 'Macedonian', 'zh-yue': 'Cantonese', 'ast':
|
47 |
+
'Asturian', 'lv': 'Latvian', 'af': 'Afrikaans', 'tg': 'Tajik', 'my': 'Burmese', 'mg': 'Malagasy', 'mr': 'Marathi', 'sq': 'Albanian',
|
48 |
+
'bs': 'Bosnian', 'oc': 'Occitan', 'te': 'Telugu', 'ml': 'Malayalam', 'nds': 'Low German', 'be-tarask': 'Belarusian (Taraškievica)',
|
49 |
+
'br': 'Breton', 'ky': 'Kyrgyz', 'sw': 'Swahili', 'jv': 'Javanese', 'lmo': 'Lombard', 'new': 'Newar', 'pnb': 'Western Punjabi', 'vec':
|
50 |
+
        'Venetian', 'ht': 'Haitian Creole', 'pms': 'Piedmontese', 'ba': 'Bashkir', 'lb': 'Luxembourgish', 'su': 'Sundanese', 'ku': 'Kurdish (Kurmanji)',
        'ga': 'Irish', 'szl': 'Silesian', 'is': 'Icelandic', 'fy': 'West Frisian', 'cv': 'Chuvash', 'ckb': 'Kurdish (Sorani)', 'pa': 'Punjabi', 'tl':
        'Tagalog', 'an': 'Aragonese', 'wuu': 'Wu Chinese', 'diq': 'Zaza', 'io': 'Ido', 'sco': 'Scots', 'vo': 'Volapük', 'yo': 'Yoruba', 'ne': 'Nepali',
        'ia': 'Interlingua', 'kn': 'Kannada', 'gu': 'Gujarati', 'als': 'Alemannic German', 'ha': 'Hausa', 'avk': 'Kotava', 'bar': 'Bavarian', 'crh':
        'Crimean Tatar', 'scn': 'Sicilian', 'bpy': 'Bishnupriya Manipuri', 'qu': 'Quechua (Southern Quechua)', 'nv': 'Navajo', 'mn': 'Mongolian', 'xmf':
        'Mingrelian', 'ban': 'Balinese', 'si': 'Sinhala', 'tum': 'Tumbuka', 'ps': 'Pashto', 'frr': 'North Frisian', 'os': 'Ossetian', 'mzn': 'Mazanderani',
        'bat-smg': 'Samogitian', 'or': 'Odia', 'ig': 'Igbo', 'sah': 'Yakut', 'cdo': 'Eastern Min', 'gd': 'Scottish Gaelic', 'bug': 'Buginese', 'yi': 'Yiddish',
        'sd': 'Sindhi', 'ilo': 'Ilocano', 'am': 'Amharic', 'nap': 'Neapolitan', 'li': 'Limburgish', 'bcl': 'Central Bikol', 'fo': 'Faroese', 'gor': 'Gorontalo',
        'hsb': 'Upper Sorbian', 'map-bms': 'Banyumasan', 'mai': 'Maithili', 'shn': 'Shan', 'eml': 'Emilian-Romagnol', 'ace': 'Acehnese', 'zh-classical':
        'Classical Chinese', 'sa': 'Sanskrit', 'as': 'Assamese', 'wa': 'Walloon', 'ie': 'Interlingue', 'hyw': 'Western Armenian', 'lij': 'Ligurian', 'mhr':
        'Meadow Mari', 'zu': 'Zulu', 'sn': 'Shona', 'hif': 'Fiji Hindi', 'mrj': 'Hill Mari', 'bjn': 'Banjarese', 'mni': 'Meitei', 'km': 'Khmer', 'hak':
        'Hakka Chinese', 'roa-tara': 'Tarantino', 'pam': 'Kapampangan', 'sat': 'Santali', 'rue': 'Rusyn', 'nso': 'Northern Sotho', 'bh': 'Bihari (Bhojpuri)',
        'so': 'Somali', 'mi': 'Māori', 'se': 'Northern Sámi', 'myv': 'Erzya', 'vls': 'West Flemish', 'nds-nl': 'Dutch Low Saxon', 'dag': 'Dagbani', 'sc':
        'Sardinian', 'ary': 'Moroccan Arabic', 'co': 'Corsican', 'kw': 'Cornish', 'bo': 'Lhasa Tibetan', 'vep': 'Veps', 'glk': 'Gilaki', 'tk': 'Turkmen', 'kab':
        'Kabyle', 'gan': 'Gan Chinese', 'rw': 'Kinyarwanda', 'fiu-vro': 'Võro', 'ab': 'Abkhaz', 'gv': 'Manx', 'ug': 'Uyghur', 'nah': 'Nahuatl', 'zea': 'Zeelandic',
        'skr': 'Saraiki', 'frp': 'Franco-Provençal', 'udm': 'Udmurt', 'pcd': 'Picard', 'mt': 'Maltese', 'kv': 'Komi', 'csb': 'Kashubian', 'gn': 'Guarani', 'smn':
        'Inari Sámi', 'ay': 'Aymara', 'nrm': 'Norman', 'ks': 'Kashmiri', 'lez': 'Lezgian', 'lfn': 'Lingua Franca Nova', 'olo': 'Livvi-Karelian', 'mwl': 'Mirandese',
        'stq': 'Saterland Frisian', 'lo': 'Lao', 'ang': 'Old English', 'mdf': 'Moksha', 'fur': 'Friulian', 'rm': 'Romansh', 'lad': 'Judaeo-Spanish', 'kaa': 'Karakalpak',
        'gom': 'Konkani (Goan Konkani)', 'ext': 'Extremaduran', 'koi': 'Permyak', 'tyv': 'Tuvan', 'pap': 'Papiamento', 'av': 'Avar', 'dsb': 'Lower Sorbian', 'ln':
        'Lingala', 'dty': 'Doteli', 'tw': 'Twi', 'cbk-zam': 'Chavacano (Zamboanga)', 'dv': 'Maldivian', 'ksh': 'Ripuarian', 'za': 'Zhuang (Standard Zhuang)', 'gag':
        'Gagauz', 'bxr': 'Buryat (Russia Buriat)', 'pfl': 'Palatine German', 'lg': 'Luganda', 'szy': 'Sakizaya', 'pag': 'Pangasinan', 'blk': "Pa'O", 'pi': 'Pali',
        'tay': 'Atayal', 'haw': 'Hawaiian', 'awa': 'Awadhi', 'inh': 'Ingush', 'krc': 'Karachay-Balkar', 'xal': 'Kalmyk Oirat', 'pdc': 'Pennsylvania Dutch', 'to':
        'Tongan', 'atj': 'Atikamekw', 'tcy': 'Tulu', 'arc': 'Aramaic (Syriac)', 'mnw': 'Mon', 'jam': 'Jamaican Patois', 'shi': 'Shilha', 'kbp': 'Kabiye', 'wo':
        'Wolof', 'anp': 'Angika', 'kbd': 'Kabardian', 'nia': 'Nias', 'nov': 'Novial', 'om': 'Oromo', 'ki': 'Kikuyu', 'nqo': "N'Ko", 'bi': 'Bislama', 'xh': 'Xhosa',
        'tpi': 'Tok Pisin', 'tet': 'Tetum', 'ff': 'Fula', 'roa-rup': 'Aromanian', 'jbo': 'Lojban', 'fj': 'Fijian', 'kg': 'Kongo (Kituba)', 'lbe': 'Lak', 'ty': 'Tahitian',
        'guw': 'Gun', 'cu': 'Old Church Slavonic', 'trv': 'Seediq', 'ami': 'Amis', 'srn': 'Sranan Tongo', 'sm': 'Samoan', 'mad': 'Madurese', 'alt': 'Southern Altai',
        'ltg': 'Latgalian', 'gcr': 'French Guianese Creole', 'chr': 'Cherokee', 'tn': 'Tswana', 'ny': 'Chewa', 'st': 'Sotho', 'pih': 'Norfuk', 'rmy': 'Romani (Vlax Romani)',
        'got': 'Gothic', 'ee': 'Ewe', 'pcm': 'Nigerian Pidgin', 'bm': 'Bambara', 'ss': 'Swazi', 'ts': 'Tsonga', 've': 'Venda', 'kcg': 'Tyap', 'chy': 'Cheyenne', 'rn':
        'Kirundi', 'ch': 'Chamorro', 'gur': 'Frafra', 'ik': 'Iñupiaq', 'ady': 'Adyghe', 'pnt': 'Pontic Greek', 'guc': 'Wayuu', 'iu': 'Inuktitut', 'pwn': 'Paiwan', 'sg':
        'Sango', 'din': 'Dinka', 'ti': 'Tigrinya', 'kl': 'Greenlandic', 'dz': 'Dzongkha', 'cr': 'Cree', 'ak': 'Akan'}
    return wikis_codes


def fetch_wiki_metadata(wiki, metric, submetric, timeout):
    # Run Firefox headless and save downloads to the working directory without prompting.
    # Note: the calls below use the Selenium 3-style API (`firefox_profile=`, `executable_path=`,
    # `find_element_by_class_name`), which Selenium 4 deprecates or removes.
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.download.dir", os.getcwd())
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream")
    driver = webdriver.Firefox(options=options, firefox_profile=profile, executable_path='geckodriver', service_log_path=os.devnull)

    try:
        if metric == 'pages':
            base_url = f'https://stats.wikimedia.org/#/{wiki}.wikipedia.org/content/pages-to-date/full|table|'
        elif metric == 'edits':
            base_url = f'https://stats.wikimedia.org/#/{wiki}.wikipedia.org/contributing/edits/full|table|'
        else:
            # Without this branch an unknown metric would fail later with a NameError on `base_url`.
            raise ValueError(f'Error: this metric: {metric} is not supported!')

        # Monthly table, split by editor type and filtered to the requested page type.
        parameters = f'1-month|editor_type~anonymous*group-bot*name-bot*user+(page_type)~{submetric}|monthly'
        request_url = "".join([base_url, parameters])

        driver.implicitly_wait(3)
        driver.get(request_url)
        driver.page_source  # The returned HTML is discarded.

        # Give the dashboard time to render before looking for the download button.
        sleep(timeout)

        csvFilename = f"{wiki}--{metric}--{submetric}.csv"
        csvFilename = csvFilename.replace(' ', '-')
        driver.find_element_by_class_name("ui.icon.button.tooltipped.tooltipped-n").click()

        # The script expects the download to be saved as `undefined.csv`; wait briefly, then rename it.
        sleep(3)
        os.rename("undefined.csv", csvFilename)
    finally:
        # Always shut the browser down, even if the click or the rename fails.
        driver.quit()

    print(f' [+] Metadata Exported to `{wiki}/{csvFilename}`.')

    return csvFilename


wiki_codes = fetch_wikis_codes()
labels = []
for key, value in wiki_codes.items():
    labels.append(f"{value} ({key})")

wikis = list(wiki_codes.keys())
metrics = ['pages', 'edits']
submetrics = ['content', 'non-content']

timeout = 3
counter = 1

for wiki in wikis:

    print(f'{counter}## {wiki_codes[wiki]} Wikipedia Files:')
    os.makedirs(wiki, exist_ok=True)
    os.makedirs('all-metadata', exist_ok=True)

    for metric in metrics:

        for submetric in submetrics:

            try:
                csvFilename = fetch_wiki_metadata(wiki, metric, submetric, timeout)
                dataframe = pd.read_csv(csvFilename).iloc[-1]

            except selenium.common.exceptions.ElementClickInterceptedException:
                # The download button was not clickable yet: retry once with a doubled wait,
                # and keep the longer wait for all later requests.
                timeout *= 2
                csvFilename = fetch_wiki_metadata(wiki, metric, submetric, timeout)
                dataframe = pd.read_csv(csvFilename).iloc[-1]

            # Only the last (most recent) row of the exported table is used.
            retrieval_date = pd.to_datetime(dataframe['timeRange.end']).strftime('%Y-%m-%d')

            # Bots = group-bot + name-bot; humans = registered users + anonymous editors.
            if metric == 'pages':
                if submetric == 'content':
                    pages_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
                    pages_content_humans = dataframe['total.user']+dataframe['total.anonymous']
                elif submetric == 'non-content':
                    pages_non_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
                    pages_non_content_humans = dataframe['total.user']+dataframe['total.anonymous']
                else:
                    print(f'Error: this submetric: {submetric} is not supported!')

            elif metric == 'edits':
                if submetric == 'content':
                    edits_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
                    edits_content_humans = dataframe['total.user']+dataframe['total.anonymous']
                elif submetric == 'non-content':
                    edits_non_content_bots = dataframe['total.group-bot']+dataframe['total.name-bot']
                    edits_non_content_humans = dataframe['total.user']+dataframe['total.anonymous']
                else:
                    print(f'Error: this submetric: {submetric} is not supported!')

            else:
                print(f'Error: this metric: {metric} is not supported!')

            # Move the raw export into this wiki's folder.
            os.system(f'mv {wiki}--{metric}--{submetric}.csv {wiki}/{wiki}--{metric}--{submetric}.csv')

    selected_language = f'{wiki_codes[wiki]} ({wiki})'

    # One row per (metric, sub-metric, editor type) combination: 2 x 2 x 2 = 8 rows.
    metadata = {'Wiki' : [selected_language] * 8,

                'Metric' : ['Pages', 'Pages', 'Pages', 'Pages', 'Edits', 'Edits', 'Edits', 'Edits'],

                'Sub-Metric' : ['Articles', 'Articles', 'Non-Articles', 'Non-Articles',
                                'Articles', 'Articles', 'Non-Articles', 'Non-Articles'],

                'Editors' : ['Bots', 'Humans', 'Bots', 'Humans', 'Bots', 'Humans', 'Bots', 'Humans'],

                'Values' : [pages_content_bots, pages_content_humans, pages_non_content_bots, pages_non_content_humans,
                            edits_content_bots, edits_content_humans, edits_non_content_bots, edits_non_content_humans]}

    wiki_metadata = pd.DataFrame(metadata)
    wiki_metadata['Retrieval-Date'] = retrieval_date
    wiki_metadata.to_csv(f'{wiki_codes[wiki].replace(" ","-")}--Wikipedia--Metadata.csv', index=False)

    # Archive this wiki's raw exports under `all-metadata/`.
    os.system(f'mv {wiki} all-metadata/')
    counter = counter + 1
    sleep(1)
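`fetch_wiki_metadata` waits a fixed three seconds after clicking the download button and then renames `undefined.csv`. If the download takes longer, the rename fails. A possible alternative, sketched below under stated assumptions, is to poll for the file until its size stops changing. The helper names `wait_for_download` and `claim_download`, and their default timeouts, are hypothetical and not part of the original script; the sketch also assumes the browser writes the file in place under its final name.

```python
import os
import time


def wait_for_download(path, timeout=30.0, poll_interval=0.25):
    """Return True once `path` exists and its size is stable, else False.

    Hypothetical helper (not in the original script). "Stable" means two
    consecutive polls saw the same non-zero size.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False


def claim_download(downloaded_name, target_name, timeout=30.0):
    """Wait for `downloaded_name`, then rename it to `target_name`.

    Hypothetical helper: raises TimeoutError instead of failing inside
    os.rename when the file never shows up.
    """
    if not wait_for_download(downloaded_name, timeout=timeout):
        raise TimeoutError(f"{downloaded_name} did not appear within {timeout} seconds")
    os.replace(downloaded_name, target_name)
    return target_name
```

Under these assumptions, the `sleep(3)` and `os.rename("undefined.csv", csvFilename)` pair could become `claim_download("undefined.csv", csvFilename)`. Fast downloads then finish sooner, and slow ones fail with an explicit timeout rather than a missing-file error.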