Querying Datasets with the Datasets Explorer Chrome Extension

Community Article Published July 19, 2024

With the new modalities filter on the HuggingFace Datasets, we will take a look at some interesting geospatial datasets on Hugging Face and perform different SQL queries in the browser with the Datasets Explorer.

The Datasets Explorer Chrome Extension

With the Hugging Face Datasets Explorer Chrome Extension, we can get access to a drop in SQL editor on top of the dataset. Under the hood, it uses DuckDB and loads the different configs and splits as views for you to query. The extension comes with some neat features:

Support for Private Datasets 🔒
Infinite Scroll Lazy Loading for Large Query Results ♾️
Command + Enter Shortcut 💻
Full DuckDB + Extensions Support 🤯
Powered by DuckDB WASM entirely in the browser 🦆

Install and Load the DuckDB Spatial Extension

DuckDB has a spatial extension which you can read more about here which gives us the ability to perform very powerful geospatial queries. The beautiful thing about the Datasets Explorer is that we can leverage these extensions automatically.

Let's run the following queries:

INSTALL spatial;

LOAD spatial;

Case Study 1: US Electric Vehicle Charging Locations Dataset

Let's take a look at a dataset of Electric Vehicle Charging locations which can be found here.

Let's Query Data 🦆

Now that we are setup, we can query our dataset. Let's start with finding the different operators of chargers. We can query for the top operators with the SQL below:

Top Operators with Most Locations

select metadata.operator as operator, count(*) as num_locations 
from default_view
group by operator
order by num_locations desc

We can see ChargePoint having the most locations with Tesla not very far behind. There are also many other location specific operators.

Now, let's do some geospatial queries in DuckDB to find the closest n chargers to a given location.

Finding the 100 Closest Chargers to the NY Times Square 🏙️⚡

This is an oddly specific example, but showcases what level of analysis we can do without ever leaving our browser. Even though metadata is in a dictionary structure, we can use the . operator to select different fields.

100 Closest EV Chargers to Times Square

SELECT 
  metadata.name,
  metadata.operator,
  lat,
  lng,
  ST_Distance(ST_POINT(lng, lat), ST_Point(-73.9865695115598, 40.75595717084848)) AS distance,
FROM default_view
ORDER BY distance
LIMIT 100

Results

If we plot those on a map we can see all kinds of EV Charging stations around New York City.

Case Study 2: AirBnb Accommodation Offer Dataset 🏠

The kraina/airbnb is very interesting as it has a couple different unique features. Namely:

private and shared rooms
types of rooms
cost for two people and two nights in Euros
city (one of 10 European cities)]
person capacity
superhost
bedrooms
guest satisfaction
cleanliness rating
dist: distance from city centre in km
metro_dist: distance from nearest metro station in km

Cities with Most Listing Concentration

WITH listings_with_location AS (
    SELECT *, ST_Point(lng, lat) AS location
    FROM all_view
),
city_centers AS (
    SELECT city, ST_Point(AVG(lng), AVG(lat)) AS center
    FROM listings_with_location
    GROUP BY city
)
SELECT 
    al.city,
    COUNT(*) AS listings_within_1km
FROM 
    listings_with_location al
JOIN 
    city_centers cc ON al.city = cc.city
WHERE 
    ST_DWithin(al.location, cc.center, 1000) 
GROUP BY 
    al.city
ORDER BY 
    listings_within_1km DESC
LIMIT 5

City	listings_within_1km
London	9993
Rome	9027
Paris	6688
Lisbon	5763
Athens	5280

Attraction Index vs Proximity to City Center 🍽️🎉

SELECT 
    city,
    CASE 
        WHEN dist <= 1 THEN '0-1km'
        WHEN dist <= 3 THEN '1-3km'
        WHEN dist <= 5 THEN '3-5km'
        ELSE '5km+'
    END as distance_category,
    AVG(attr_index_norm) as avg_attraction_index,
    COUNT(*) as listings_count
FROM 
    all_view
GROUP BY 
    city, distance_category
ORDER BY 
    city, distance_category;

City	Distance Category	Average Value	Count
Amsterdam	0-1km	29.35	319
Amsterdam	1-3km	15.75	1010
Amsterdam	3-5km	6.94	507
Amsterdam	5km+	3.44	244

Relationship between Price and Proximity to Metro 💸🚆

WITH proximity_prices AS (
    SELECT 
        city,
        CASE 
            WHEN metro_dist <= 0.5 THEN 'very_close'
            WHEN metro_dist <= 1 THEN 'close'
            WHEN metro_dist <= 2 THEN 'moderate'
            ELSE 'far'
        END as metro_proximity,
        realSum as price
    FROM 
        all_view
)
SELECT 
    city,
    AVG(CASE WHEN metro_proximity = 'very_close' THEN price END) as avg_price_very_close,
    AVG(CASE WHEN metro_proximity = 'close' THEN price END) as avg_price_close,
    AVG(CASE WHEN metro_proximity = 'moderate' THEN price END) as avg_price_moderate,
    AVG(CASE WHEN metro_proximity = 'far' THEN price END) as avg_price_far,
    COUNT(*) as total_listings
FROM 
    proximity_prices
GROUP BY 
    city
ORDER BY 
    city;

We can see there is definitely a relationship between average price and proximity to metro.

It's Open Source 🤗

Check it out, it's open source and on the Chrome Web Store.

Upvote