Simon Willison's Weblog

Subscribe

Joining CSV files in your browser using Datasette Lite

20th June 2022

I added a new feature to Datasette Lite--my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them--including joins across data from multiple files.

Your CSV file needs to be hosted somewhere with access-control-allow-origin: * CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the "Raw" version.

Loading CSV data from a URL

Here's the URL to a CSV file of college fight songs collected by FiveThirtyEight in their data repo as part of the reporting for this story a few years ago:

https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv

You can pass this to Datasette Lite in two ways:

Once Datasette has loaded, a data database will be available with a single table called fight-songs.

As you navigate around in Datasette the URL bar will update to reflect current state--which means you can deep-link to table views with applied filters and facets:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&_facet=student_writer&_facet=official_song

Or even link to the result of a custom SQL query:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101

Loading multiple files and joining data

You can pass the ?csv= parameter more than once to load data from multiple CSV files into the same virtual data database. Each CSV file will result in a separate table.

For this demo I'll use two CSV files.

The first is us-counties-recent.csv from the NY Times covid-19-data repository, which lists the most recent numbers for Covid cases for every US county.

The second is us_census_county_populations_2019.csv, a CSV file listing the population of each county according to the 2019 US Census which I extracted from this page on the US Census website.

Both of those tables include a column called fips, representing the FIPS county code for each county. These 4-5 digit codes are ideal for joining the two tables.

Here's a SQL query which joins the two tables, filters for the data for the most recent date represented (using where date = (select max(date) from [us-counties-recent])) and calculates cases_per_million using the cases and the population:

select
[us-counties-recent].*,
us_census_county_populations_2019.population,
1.0 * [us-counties-recent].cases / us_census_county_populations_2019.population * 1000000 as cases_per_million
from
[us-counties-recent]
join us_census_county_populations_2019 on us_census_county_populations_2019.fips = [us-counties-recent].fips
where
date = (select max(date) from [us-counties-recent])
order by
cases_per_million desc

And since everything in Datasette Lite can be bookmarked, here's the super long URL (clickable version here) that executes that query against those two CSV files:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc

Posted 20th June 2022 at 9:20 pm * Follow me on Mastodon, Bluesky, Twitter or subscribe to my newsletter

More recent articles

This is Joining CSV files in your browser using Datasette Lite by Simon Willison, posted on 20th June 2022.

Part of series Datasette Lite

  1. Datasette Lite: a server-side Python web application running in a browser - May 4, 2022, 3:16 p.m.
  2. Joining CSV files in your browser using Datasette Lite - June 20, 2022, 9:20 p.m.
  3. Plugin support for Datasette Lite - Aug. 17, 2022, 6:20 p.m.
  4. Analyzing ScotRail audio announcements with Datasette - from prototype to production - Aug. 21, 2022, 2:04 a.m.
  5. Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more - Sept. 16, 2022, 2:55 a.m.
csv 32 projects 517 sql 109 datasette 454 webassembly 85 datasette-lite 19 cors 24

Next: First impressions of DALL-E, generating images from text

Previous: Weeknotes: datasette-socrata, and the last 10%...

Monthly briefing

Sponsor me for $10/month and get a curated email digest of the month's most important LLM developments.

Pay me to send you less!

Sponsor & subscribe

Joining CSV files in your browser using Datasette Lite

I added a way to use Datasette Lite (Datasette running entirely in your browser using WebAssembly) to load data from multiple CSV files by URL and run SQL joins across themhttps://t.co/nfSdAm0bkD

-- Simon Willison (@simonw) June 20, 2022