Querying Datasets with the Datasets Explorer Chrome Extension
With the new modalities filter on the HuggingFace Datasets, we will take a look at some interesting geospatial datasets on Hugging Face and perform different SQL queries in the browser with the Datasets Explorer.
The Datasets Explorer Chrome Extension
With the Hugging Face Datasets Explorer Chrome Extension, we can get access to a drop in SQL editor on top of the dataset. Under the hood, it uses DuckDB and loads the different configs and splits as views for you to query. The extension comes with some neat features:
- Support for Private Datasets π
- Infinite Scroll Lazy Loading for Large Query Results βΎοΈ
- Command + Enter Shortcut π»
- Full DuckDB + Extensions Support π€―
- Powered by DuckDB WASM entirely in the browser π¦
Install and Load the DuckDB Spatial Extension
DuckDB has a spatial
extension which you can read more about here which gives us the ability to perform very powerful geospatial queries. The beautiful thing about the Datasets Explorer is that we can leverage these extensions automatically.
Let's run the following queries:
INSTALL spatial;
LOAD spatial;
Case Study 1: US Electric Vehicle Charging Locations Dataset
Let's take a look at a dataset of Electric Vehicle Charging locations which can be found here.
Let's Query Data π¦
Now that we are setup, we can query our dataset. Let's start with finding the different operators of chargers. We can query for the top operators with the SQL below:
Top Operators with Most Locations
select metadata.operator as operator, count(*) as num_locations
from default_view
group by operator
order by num_locations desc
We can see ChargePoint having the most locations with Tesla not very far behind. There are also many other location specific operators.
Now, let's do some geospatial queries in DuckDB to find the closest n chargers to a given location.
Finding the 100 Closest Chargers to the NY Times Square ποΈβ‘
This is an oddly specific example, but showcases what level of analysis we can do without ever leaving our browser. Even though metadata is in a dictionary structure, we can use the .
operator to select different fields.
100 Closest EV Chargers to Times Square
SELECT
metadata.name,
metadata.operator,
lat,
lng,
ST_Distance(ST_POINT(lng, lat), ST_Point(-73.9865695115598, 40.75595717084848)) AS distance,
FROM default_view
ORDER BY distance
LIMIT 100
Results
If we plot those on a map we can see all kinds of EV Charging stations around New York City.
Case Study 2: AirBnb Accommodation Offer Dataset π
The kraina/airbnb is very interesting as it has a couple different unique features. Namely:
- private and shared rooms
- types of rooms
- cost for two people and two nights in Euros
- city (one of 10 European cities)]
- person capacity
- superhost
- bedrooms
- guest satisfaction
- cleanliness rating
- dist: distance from city centre in km
- metro_dist: distance from nearest metro station in km
Cities with Most Listing Concentration
WITH listings_with_location AS (
SELECT *, ST_Point(lng, lat) AS location
FROM all_view
),
city_centers AS (
SELECT city, ST_Point(AVG(lng), AVG(lat)) AS center
FROM listings_with_location
GROUP BY city
)
SELECT
al.city,
COUNT(*) AS listings_within_1km
FROM
listings_with_location al
JOIN
city_centers cc ON al.city = cc.city
WHERE
ST_DWithin(al.location, cc.center, 1000)
GROUP BY
al.city
ORDER BY
listings_within_1km DESC
LIMIT 5
City | listings_within_1km |
---|---|
London | 9993 |
Rome | 9027 |
Paris | 6688 |
Lisbon | 5763 |
Athens | 5280 |
Attraction Index vs Proximity to City Center π½οΈπ
SELECT
city,
CASE
WHEN dist <= 1 THEN '0-1km'
WHEN dist <= 3 THEN '1-3km'
WHEN dist <= 5 THEN '3-5km'
ELSE '5km+'
END as distance_category,
AVG(attr_index_norm) as avg_attraction_index,
COUNT(*) as listings_count
FROM
all_view
GROUP BY
city, distance_category
ORDER BY
city, distance_category;
City | Distance Category | Average Value | Count |
---|---|---|---|
Amsterdam | 0-1km | 29.35 | 319 |
Amsterdam | 1-3km | 15.75 | 1010 |
Amsterdam | 3-5km | 6.94 | 507 |
Amsterdam | 5km+ | 3.44 | 244 |
Relationship between Price and Proximity to Metro πΈπ
WITH proximity_prices AS (
SELECT
city,
CASE
WHEN metro_dist <= 0.5 THEN 'very_close'
WHEN metro_dist <= 1 THEN 'close'
WHEN metro_dist <= 2 THEN 'moderate'
ELSE 'far'
END as metro_proximity,
realSum as price
FROM
all_view
)
SELECT
city,
AVG(CASE WHEN metro_proximity = 'very_close' THEN price END) as avg_price_very_close,
AVG(CASE WHEN metro_proximity = 'close' THEN price END) as avg_price_close,
AVG(CASE WHEN metro_proximity = 'moderate' THEN price END) as avg_price_moderate,
AVG(CASE WHEN metro_proximity = 'far' THEN price END) as avg_price_far,
COUNT(*) as total_listings
FROM
proximity_prices
GROUP BY
city
ORDER BY
city;
We can see there is definitely a relationship between average price and proximity to metro.
It's Open Source π€
Check it out, it's open source and on the Chrome Web Store.