OECD Life Expectancy vs Private Health Costs

This example attempts to reproduce the Guardian Datablog article America's broken healthcare system – in one simple chart using the Muck build system. The demo code can be found at https://github.com/gwk/muck-demos/tree/master/oecd-health, and the result is hosted at https://gwk.github.io/muck/demos/oecd-health.

Creating a Muck project is as simple as creating an empty source document - in this case, index.html.wu. Simply run muck -serve from the project directory, and it will produce index.html from index.html.wu, generating any additional dependencies on an as-needed basis.

If you look at the implementation of this project, you will see that all of the code boxes in this page are just references to the scripts that generate the actual results - change the code and Muck will bring the entire project up to date. This allows for an iterative style of development, and makes it easier to document the technical process. Most news articles would not show the code of course, but one can easily imagine producing both an article for publication as well as a companion "How it was made" post in tandem.

The Data

The article cites as its data source http://stats.oecd.org/index.aspx?DataSetCode=HEALTH_STAT. That link points to the interactive explorer for the entire OECD health data repository, but we can guess that the numbers came from the subsections located at Health → Health expenditure and financing → Health expenditure and financing and Health → Health Status → Life Expectancy.

Navigating to each of these sections updates the dynamic spreadsheet at the center of the page. To obtain the datasets:

At this time it does not appear possible to link to these CSV files directly. For the purposes of this demo, the downloaded files have been renamed health-expenditure-and-financing.csv (109MB) and life-expectancy.csv (4MB) respectively, and checked into the git repository, so that readers can run the code without going through the download process.

Exploration

Since the data consists of multiple tables it will be convenient to use relational queries. As a first step we will create an SQLite database from the two tables using a convenient little tool called csv-to-sqlite. This is nice because it automatically infers column names and affinities (data types) for us. (Alternatively we could use plain Python, the sqlite3 shell's CSV features or CSV virtual table extension, or CSVKit's csvsql).

data.sqlite3.sh
csv-to-sqlite health-expenditure-and-financing.csv.bz2 expenditures life-expectancy.csv.bz2 expectancies -output $MUCK_OUT

This command creates a database of two tables, 'expenditures' and 'expectancies'. Now that we have the data in database format, we can easily query it. First we query the schema:

schema.txt.sql
.open 'data.sqlite3' .schema

The output shows us the structure of each table:

schema.txt
CREATE TABLE expenditures (HF TEXT, Financing_scheme TEXT, HC TEXT, Function TEXT, HP TEXT, Provider TEXT, MEASURE TEXT, Measure_1 TEXT, LOCATION TEXT, Country TEXT, TIME INTEGER, Year INTEGER, Unit_Code TEXT, Unit TEXT, PowerCode_Code INTEGER, PowerCode TEXT, Reference_Period_Code INTEGER, Reference_Period INTEGER, Value REAL, Flag_Codes TEXT, Flags TEXT); CREATE TABLE expectancies (VAR TEXT, Variable TEXT, UNIT TEXT, Measure TEXT, COU TEXT, Country TEXT, YEA INTEGER, Year INTEGER, Value REAL, Flag_Codes TEXT, Flags TEXT);

Both tables are structured so that some columns form pairs consisting of a short coded version followed by a longer human-readable version. Browsing through the tables it is clear that there are many more measurements than what is presented in the original article. We want to find the data points that match to the units presented in the original, namely "private health spending per person in 2016 (US$)" and "average life expectancy at birth".

We can see the different units of measure in each table by selecting distinct values for just the columns that specify the nature of the measurements (I identified these by looking at a sample of the data and then querying just the columns that looked relevant).

Health Expenditures

For expenditures, there are several column pairs indicating "nature of measurement".

HF / Financing_scheme

expenditures-schemes.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT DISTINCT HF, Financing_scheme FROM expenditures;

expenditures-schemes.csv
HFFinancing_scheme
HFTOTAll financing schemes
HF1Government/compulsory schemes
HF2HF3Voluntary schemes/household out-of-pocket payments

The only candidate for "private expenditure" seems to be "Voluntary schemes/household out-of-pocket payments". The table in the article also shows "Total expenditure, incl. government," which probably came from "All financing schemes".

HC / Function

expenditures-functions.csv
HCFunction
HCTOTCurrent expenditure on health (all functions)
HC1HC2Curative and rehabilitative care
HC11HC21Inpatient curative and rehabilitative care
HC12HC22Day curative and rehabilitative care
HC13HC23Outpatient curative and rehabilitative care
HC14HC24Home-based curative and rehabilitative care
HC3Long-term care (health)
HC4Ancillary services (non-specified by function)
HC5Medical goods (non-specified by function)
HC6Preventive care
HC7Governance and health system and financing administration

Clearly we want "Current expenditure on health (all functions)".

HP / Provider

expenditures-providers.csv
HPProvider
HPTOTAll providers
HP1Hospitals
HP2Residential long-term care facilities
HP3Providers of ambulatory health care
HP4Providers of ancillary services
HP5Retailers and other providers of medical goods
HP6Providers of preventive care
HP7Providers of health care system administration and financing
HP8Rest of the economy
HP9Rest of the world

Similarly, the article probably used "All providers".

Measure and Unit

Choice of measure and unit is trickier. Let's query the remaining relevant columns together to try to get a sense of what "measure" and "unit" mean.

expenditures-measures-units.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT DISTINCT MEASURE, Measure_1, Unit_Code, Unit, PowerCode, Reference_Period FROM expenditures ORDER BY MEASURE;

(Note: the _1 suffix was added to the Measure column name by csv-to-sqlite to make the names distinct in SQLite, which is case-insensitive).

expenditures-measures-units.csv
MEASUREMeasure_1Unit_CodeUnitPowerCodeReference_Period
MLLNCUCurrent pricesAUDAustralian DollarMillions
MLLNCUCurrent pricesEUREuroMillions
MLLNCUCurrent pricesCADCanadian DollarMillions
MLLNCUCurrent pricesCZKCzech KorunaMillions
MLLNCUCurrent pricesDKKDanish KroneMillions
MLLNCUCurrent pricesISKIceland KronaMillions
MLLNCUCurrent pricesHUFForintMillions
MLLNCUCurrent pricesILSNew Israeli SheqelMillions
MLLNCUCurrent pricesJPYYenMillions
MLLNCUCurrent pricesNOKNorwegian KroneMillions
MLLNCUCurrent pricesGBPPound SterlingMillions
MLLNCUCurrent pricesCHFSwiss FrancMillions
MLLNCUCurrent pricesPLNZlotyMillions
MLLNCUCurrent pricesTRYTurkish LiraMillions
MLLNCUCurrent pricesSEKSwedish KronaMillions
MLLNCUCurrent pricesKRWWonMillions
MLLNCUCurrent pricesNZDNew Zealand DollarMillions
MLLNCUCurrent pricesMXNMexican PesoMillions
MLLNCUCurrent pricesCLPChilean PesoMillions
MLLNCUCurrent pricesUSDUS DollarMillions
MTMOPPCurrent prices, current PPPsUSDUS DollarMillions
PARCURShare of current expenditure on healthPCPercentageUnits
PARHCShare of functionPCPercentageUnits
PARHFShare of financing schemePCPercentageUnits
PARHPShare of providerPCPercentageUnits
PARPIBShare of gross domestic productPCPercentageUnits
PPPPERPer capita, current prices, current PPPsUSDUS DollarUnits
REPPERPer capita, constant prices, OECD base yearEUREuroUnits2010
REPPERPer capita, constant prices, OECD base yearAUDAustralian DollarUnits2010
REPPERPer capita, constant prices, OECD base yearDKKDanish KroneUnits2010
REPPERPer capita, constant prices, OECD base yearCZKCzech KorunaUnits2010
REPPERPer capita, constant prices, OECD base yearCADCanadian DollarUnits2010
REPPERPer capita, constant prices, OECD base yearHUFForintUnits2010
REPPERPer capita, constant prices, OECD base yearISKIceland KronaUnits2010
REPPERPer capita, constant prices, OECD base yearJPYYenUnits2010
REPPERPer capita, constant prices, OECD base yearKRWWonUnits2010
REPPERPer capita, constant prices, OECD base yearNOKNorwegian KroneUnits2010
REPPERPer capita, constant prices, OECD base yearPLNZlotyUnits2010
REPPERPer capita, constant prices, OECD base yearSEKSwedish KronaUnits2010
REPPERPer capita, constant prices, OECD base yearCHFSwiss FrancUnits2010
REPPERPer capita, constant prices, OECD base yearTRYTurkish LiraUnits2010
REPPERPer capita, constant prices, OECD base yearGBPPound SterlingUnits2010
REPPERPer capita, constant prices, OECD base yearILSNew Israeli SheqelUnits2010
REPPERPer capita, constant prices, OECD base yearNZDNew Zealand DollarUnits2010
REPPERPer capita, constant prices, OECD base yearMXNMexican PesoUnits2010
REPPERPer capita, constant prices, OECD base yearCLPChilean PesoUnits2010
REPPERPer capita, constant prices, OECD base yearUSDUS DollarUnits2010
UNPPERPer capita, current pricesAUDAustralian DollarUnits
UNPPERPer capita, current pricesHUFForintUnits
UNPPERPer capita, current pricesCADCanadian DollarUnits
UNPPERPer capita, current pricesEUREuroUnits
UNPPERPer capita, current pricesDKKDanish KroneUnits
UNPPERPer capita, current pricesCZKCzech KorunaUnits
UNPPERPer capita, current pricesISKIceland KronaUnits
UNPPERPer capita, current pricesJPYYenUnits
UNPPERPer capita, current pricesKRWWonUnits
UNPPERPer capita, current pricesNOKNorwegian KroneUnits
UNPPERPer capita, current pricesPLNZlotyUnits
UNPPERPer capita, current pricesSEKSwedish KronaUnits
UNPPERPer capita, current pricesCHFSwiss FrancUnits
UNPPERPer capita, current pricesTRYTurkish LiraUnits
UNPPERPer capita, current pricesGBPPound SterlingUnits
UNPPERPer capita, current pricesILSNew Israeli SheqelUnits
UNPPERPer capita, current pricesMXNMexican PesoUnits
UNPPERPer capita, current pricesNZDNew Zealand DollarUnits
UNPPERPer capita, current pricesCLPChilean PesoUnits
UNPPERPer capita, current pricesUSDUS DollarUnits
VALRELConstant prices, OECD base yearAUDAustralian DollarMillions2010
VALRELConstant prices, OECD base yearEUREuroMillions2010
VALRELConstant prices, OECD base yearCADCanadian DollarMillions2010
VALRELConstant prices, OECD base yearCZKCzech KorunaMillions2010
VALRELConstant prices, OECD base yearDKKDanish KroneMillions2010
VALRELConstant prices, OECD base yearJPYYenMillions2010
VALRELConstant prices, OECD base yearHUFForintMillions2010
VALRELConstant prices, OECD base yearNOKNorwegian KroneMillions2010
VALRELConstant prices, OECD base yearILSNew Israeli SheqelMillions2010
VALRELConstant prices, OECD base yearISKIceland KronaMillions2010
VALRELConstant prices, OECD base yearCHFSwiss FrancMillions2010
VALRELConstant prices, OECD base yearTRYTurkish LiraMillions2010
VALRELConstant prices, OECD base yearSEKSwedish KronaMillions2010
VALRELConstant prices, OECD base yearPLNZlotyMillions2010
VALRELConstant prices, OECD base yearGBPPound SterlingMillions2010
VALRELConstant prices, OECD base yearKRWWonMillions2010
VALRELConstant prices, OECD base yearCLPChilean PesoMillions2010
VALRELConstant prices, OECD base yearMXNMexican PesoMillions2010
VALRELConstant prices, OECD base yearNZDNew Zealand DollarMillions2010
VALRELConstant prices, OECD base yearUSDUS DollarMillions2010
VRPPPRPer capita, constant prices, constant PPPs, OECD base yearUSDUS DollarUnits2010
VRPPPTConstant prices, constant PPPs, OECD base yearUSDUS DollarMillions2010

The article clearly uses per capita measurement, and the statement "In 2016, the average American spent $4,571 on their health" suggests that we want actual dollars in 2016. "Constant prices, OECD Base Year" seems to refer to 2010, so "Current prices" seems to be the better guess. PPP stands for https://en.wikipedia.org/wiki/Purchasing_power_parity and is an economic theory that normalizes purchasing power to currencies. This seems like a reasonable choice, but at this point I'm out of my depth; if this were original research I would go back to the OECD site to read the documentation.

Perhaps we can eliminate choices by requiring US Dollars expressed in single dollars (as opposed to millions).

expenditures-measures-usd.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT DISTINCT MEASURE, Measure_1, Reference_Period AS 'Base Year' FROM expenditures WHERE Unit_Code == 'USD' AND PowerCode == 'Units' ORDER BY MEASURE;

expenditures-measures-usd.csv
MEASUREMeasure_1Base Year
PPPPERPer capita, current prices, current PPPs
REPPERPer capita, constant prices, OECD base year2010
UNPPERPer capita, current prices
VRPPPRPer capita, constant prices, constant PPPs, OECD base year2010

This still leaves us with the choice between current or constant prices, and PPP or non-PPP. Since we are just trying to figure out what the original author did, we can cheat and go looking for values matching the top four rows in the original table. Note that I had to expand the ranges several times until I got sufficient results for countries other than the United States.

expenditures-measures-cheat.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT DISTINCT MEASURE, Measure_1, Country, Value, Year, Reference_Period AS 'Base Year' FROM expenditures WHERE HF == 'HF2HF3' AND HC == 'HCTOT' AND HP == 'HPTOT' AND Unit_Code == 'USD' AND PowerCode == 'Units' AND Year = 2016 AND ( (Location == 'USA' AND Value BETWEEN 4100 AND 5100) -- article: 4,570.50 OR (Location == 'CHE' AND Value BETWEEN 2600 AND 3600) -- article: 3,097.20 OR (Location == 'AUS' AND Value BETWEEN 1300 AND 2300) -- article: 1,783.90 OR (Location == 'CAN' AND Value BETWEEN 1000 AND 2000) -- article: 1,531.70 ) ORDER BY Value DESC;

expenditures-measures-cheat.csv
MEASUREMeasure_1CountryValueYearBase Year
UNPPERPer capita, current pricesUnited States5032.1382016
PPPPERPer capita, current prices, current PPPsUnited States5032.1382016
VRPPPRPer capita, constant prices, constant PPPs, OECD base yearUnited States4570.49820162010
REPPERPer capita, constant prices, OECD base yearUnited States4570.49820162010
PPPPERPer capita, current prices, current PPPsSwitzerland2880.5932016
PPPPERPer capita, current prices, current PPPsAustralia1518.22016
VRPPPRPer capita, constant prices, constant PPPs, OECD base yearAustralia1414.87820162010
PPPPERPer capita, current prices, current PPPsCanada1394.8872016
VRPPPRPer capita, constant prices, constant PPPs, OECD base yearCanada1315.1520162010

We see the USA value as stated exactly in the article, using constant prices. This is encouraging, and suggests that we simply have to choose whether to use PPP. The only hits for constant price are Australia and Canada, using PPP. Our methodology here is questionable since we just picked an arbitrary "fudge factor" of +/- $500, but since this is a technical demonstration rather than a serious audit of the original we can move on. It is worth taking a moment though to speculate as to why our numbers are different. Since the USA value remains exact, it perhaps the discrepancy is due to changing exchange rates. This explanation is rather unsatisfactory though, because out of all the measure descriptions, I would expect costs for 2016 expressed in 2010 dollars not to change once calculated, so perhaps we are still missing something.

In summary, we will use HF=='HF2HF3', HC=='HCTOT', HP=='HPTOT', and Measure=='VRPPPR'.

Life Expectancy

For life expectancies, the task is a little simpler.

VAR / Variable

expectancies-variables.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT DISTINCT VAR, Variable FROM expectancies;

expectancies-variables.csv
VARVariable
EVIEFE00Females at birth
EVIEFE40Females at age 40
EVIEFE60Females at age 60
EVIEFE65Females at age 65
EVIEFE80Females at age 80
EVIEHO00Males at birth
EVIEHO40Males at age 40
EVIEHO60Males at age 60
EVIEHO65Males at age 65
EVIEHO80Males at age 80
EVIETOTATotal population at birth

We will use EVIETOTA / Total population at birth.

UNIT / Measure

expectancies-measures.csv
UNITMeasure
EVIDUREVYears
EVIFHOEVDifference female-male (years)
EVIHFEEVDifference male-female (years)

Similarly, EVIDUREV / Years is the obvious choice.

Latest Year

Lastly, there is a problem that I did not discover until very late: life expectancy data is not available for 2016 across all countries. The best we can do is select the last year for which it is available:

expectancies-latest-year.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT country, MAX(year) FROM expectancies WHERE var == 'EVIETOTA' AND unit == 'EVIDUREV' GROUP BY country;

expectancies-latest-year.csv
CountryMAX(year)
Australia2015
Austria2015
Belgium2015
Brazil2015
Canada2013
Chile2016
China (People's Republic of)2015
Colombia2015
Costa Rica2015
Czech Republic2015
Denmark2015
Estonia2015
Finland2015
France2015
Germany2015
Greece2015
Hungary2015
Iceland2015
India2015
Indonesia2015
Ireland2015
Israel2015
Italy2015
Japan2015
Korea2015
Latvia2015
Lithuania2015
Luxembourg2015
Mexico2016
Netherlands2015
New Zealand2015
Norway2015
Poland2015
Portugal2015
Russia2015
Slovak Republic2015
Slovenia2015
South Africa2015
Spain2015
Sweden2015
Switzerland2015
Turkey2015
United Kingdom2015
United States2015

We will have to join 'latest year' to the expectancy values in the final query.

The Table

Now that we know which subset of the data we want, we can formulate a JOIN query to recreate the table. We use two LEFT JOIN clauses to effectively add additional columns to the query. First we join expenditures against itself, distinguishing between the two with the names privateExpend and totalExpend. This allows us to select the separate HF2HF3 (private) and HFTOT (total) values, pair them by country, and show them in the same row. Then we join expectancies to do the same for the life expectancy values, choosing the latest year for each country.

table.csv.sql
.open 'data.sqlite3' .mode csv .headers on SELECT privateExpend.country AS Country, printf("%7.2f", privateExpend.value) AS 'Private Cost', printf("%7.2f", totalExpend.value) AS 'Total Cost', latestExpectancies.value as 'Life Expectancy', latestExpectancies.year as 'LE Year' FROM expenditures AS privateExpend LEFT JOIN expenditures AS totalExpend ON privateExpend.country == totalExpend.country LEFT JOIN (SELECT value, country, MAX(year) as year FROM expectancies WHERE var == 'EVIETOTA' AND unit == 'EVIDUREV' GROUP BY country) AS latestExpectancies ON privateExpend.country == latestExpectancies.country WHERE privateExpend.HF == 'HF2HF3' AND privateExpend.HC == 'HCTOT' AND privateExpend.HP == 'HPTOT' AND privateExpend.measure == 'VRPPPR' AND privateExpend.year == 2016 AND totalExpend.HF == 'HFTOT' AND totalExpend.HC == 'HCTOT' AND totalExpend.HP == 'HPTOT' AND totalExpend.measure == 'VRPPPR' AND totalExpend.year == 2016 ORDER BY privateExpend.value DESC;

table.csv
CountryPrivate CostTotal CostLife ExpectancyLE Year
United States4570.508984.7578.82015
Switzerland2443.766718.1483.02015
Australia1414.884387.6882.52015
Ireland1401.524696.4881.52015
Canada1315.154378.2381.72013
Korea1170.492682.4882.12015
Austria1086.244471.1281.32015
Luxembourg1062.266245.6982.42015
Belgium 970.204269.4581.12015
Finland 932.423618.9181.62015
Netherlands 930.684857.6681.62015
Israel 906.582344.0482.12015
Norway 870.855887.0782.42015
France 865.554087.9082.42015
Spain 864.402939.7083.02015
Portugal 818.022423.3181.22015
Sweden 804.524992.9782.32015
Greece 789.631937.9381.12015
United Kingdom 782.083758.7581.02015
Germany 748.114851.7880.72015
Italy 747.682995.2982.62015
Denmark 739.244656.2880.82015
Slovenia 697.112473.5080.92015
Iceland 693.993875.6882.52015
Chile 689.971755.6679.22016
Japan 645.854063.8683.92015
New Zealand 627.143167.9181.72015
Hungary 591.161862.7175.72015
Latvia 566.501300.6874.62015
Poland 503.351622.1877.62015
Mexico 469.90 972.0175.02016
Estonia 421.991763.2077.72015
Slovak Republic 403.681995.9876.72015
Czech Republic 362.892181.9378.72015
Turkey 207.441006.5078.02015

The Visualization

The chart in the article is not a typical chart, so let's start off by rendering a more conventional XY scatterplot, showing both private and total expenditures. Because we are going to try several visualizations, we will first factor out the data loading logic into a tiny python module (I actually did this after some amount of experimentation, for the sake of deduplicating code and clear presentation).

chart_data.py
from muck import * from typing import NamedTuple header = ('Country', 'Private Cost', 'Total Cost', 'Life Expectancy', 'LE Year') class Row(NamedTuple): country: str private_cost: float total_cost: float expectancy: float le_year: int with load('table.csv', header=header) as reader: rows = [Row(c, float(p), float(t), float(e), int(y)) for c, p, t, e, y in reader]

Let's build a scatterplot with leather, a simple charting library for Python. We add a little hack to make the country names show up as tooltips when the reader mouses over the dots:

scatterplot.svg.py
from xml.etree.ElementTree import Element from muck import * from leather import Chart, Series, Dots from chart_data import rows class TitledDots(Dots): 'Hack to get tooltips (SVG title elements) into dot plots.' def to_svg(self, width, height, x_scale, y_scale, series, palette): group = super().to_svg(width, height, x_scale, y_scale, series, palette) for datum, circle in zip(series.data(), group.findall('circle')): title = Element('title') title.text = datum.row['title'] circle.append(title) return group chart = Chart('Private Health Care Expenditure vs Life Expectancy') chart.add_series( Series([{'x': r.private_cost, 'y': r.expectancy, 'title': r.country} for r in rows], x='x', y='y', name='Private Expenditure'), TitledDots(fill_color='black', radius='4')) chart.add_series( Series([{'x': r.total_cost, 'y': r.expectancy, 'title': r.country} for r in rows], x='x', y='y', name='Total Expenditure'), TitledDots(fill_color='#C00000', radius='4')) print('<div>', chart.to_svg(), '</div>', sep='\n') # div is necessary to get block layout.

scatterplot.svg
Private Health Care Expenditure vs Life ExpectancyPrivate ExpenditureTotal Expenditure250050007500100000747678808284United StatesSwitzerlandAustraliaIrelandCanadaKoreaAustriaLuxembourgBelgiumFinlandNetherlandsIsraelNorwayFranceSpainPortugalSwedenGreeceUnited KingdomGermanyItalyDenmarkSloveniaIcelandChileJapanNew ZealandHungaryLatviaPolandMexicoEstoniaSlovak RepublicCzech RepublicTurkeyUnited StatesSwitzerlandAustraliaIrelandCanadaKoreaAustriaLuxembourgBelgiumFinlandNetherlandsIsraelNorwayFranceSpainPortugalSwedenGreeceUnited KingdomGermanyItalyDenmarkSloveniaIcelandChileJapanNew ZealandHungaryLatviaPolandMexicoEstoniaSlovak RepublicCzech RepublicTurkey

Now let's try to reproduce the original visualization. It is unusual because the lines do not represent data series; rather, each line represents a single (expenditure, expectancy) datapoint, and the slope of the line is what conveys magnitude visually. We could hunt around for a library to do this, but let's just draw it ourselves using raw SVG. To do this, we need to scale the values for each dimension and create a line with text at either endpoint.

vis.svg.py
from muck import * from xml.etree.ElementTree import Element, ElementTree from sys import stdout, stderr from chart_data import rows def add_el(parent, tag, **kwargs): try: text = kwargs.pop('text') except KeyError: text = None e = Element(tag, **dict((k.replace('_', '-'), str(v)) for k, v in kwargs.items())) if text is not None: e.text = text if parent is not None: parent.append(e) return e w = 600 h = 600 svg = add_el(None, 'svg', xmlns='http://www.w3.org/2000/svg', display='block', width=f'100%', height=f'auto', viewBox=f'0 0 {w} {h}') add_el(svg, 'rect', x=0, y=0, width=w, height=h, fill='none', stroke_width='1px', stroke='#808080') inset_l = 100 inset_r = 60 inset_y = 10 vis_w = w - (inset_l + inset_r) vis_h = h - 2 * inset_y max_private_cost = max(r[1] for r in rows) min_expectancy = min(r[3] for r in rows) max_expectancy = max(r[3] for r in rows) rng_expectancy = max_expectancy - min_expectancy add_el(svg, 'text', x=10, y=h*0.5, alignment_baseline='middle', font_size=12, text_anchor='middle', text='PRIVATE HEALTH SPENDING PER PERSON IN 2016 (US$)', transform=f'rotate(-90 10 {h*0.5})') add_el(svg, 'text', x=w-10, y=h*0.5, alignment_baseline='middle', font_size=12, text_anchor='middle', text='AVERAGE LIFE EXPECTANCY AT BIRTH', transform=f'rotate(90 {w-10} {h*0.5})') g = add_el(svg, 'g', transform=f'translate({inset_l}, {inset_y})') labeled_countries = { 'United States', 'Switzerland', 'Australia', 'Canada', 'Ireland', 'United Kingdom', 'Turkey', } for r in sorted(rows, key=lambda r: r.private_cost): y1 = vis_h * (1 - r.private_cost / max_private_cost) y2 = vis_h * (1 - (r.expectancy - min_expectancy) / rng_expectancy) labeled = r.country in labeled_countries color = ('#FF6000' if r.country == 'United States' else ('#4040E0' if labeled else '#404040')) add_el(g, 'line', x1='0', y1=y1, x2=vis_w, y2=y2, stroke=color, stroke_width=2) if labeled: add_el(g, 'text', x=-2, y=y1, alignment_baseline='middle', font_size=8, text_anchor='end', text=f'{r.country} ${round(r.private_cost)}', fill=color) add_el(g, 'text', x=vis_w+2, y=y2, alignment_baseline='middle', font_size=8, text_anchor='start', text=f'{r.expectancy:0.1f} years', fill=color) ElementTree(svg).write(stdout.buffer)

vis.svg
PRIVATE HEALTH SPENDING PER PERSON IN 2016 (US$)AVERAGE LIFE EXPECTANCY AT BIRTHTurkey $20778.0 yearsUnited Kingdom $78281.0 yearsCanada $131581.7 yearsIreland $140281.5 yearsAustralia $141582.5 yearsSwitzerland $244483.0 yearsUnited States $457078.8 years

We could probably work on the styling a little more, but it's pretty close to the original.

Conclusion

This demo shows how a data journalist can use Muck to produce a complete article, including charts and tables. The entire reproduction was constructed with Muck, SQLite, and the csv-to-sqlite utility, plus a text editor, a terminal program (to run Muck), and a browser to inspect the results.

This work also shows that by using Muck data journalists can make their projects completely reproducible. To guarantee that everyone sees the same results, I checked the downloaded data into the repository. The OECD terms and conditions are here. Note that while they grant permission to redistribute the data, they also state that they might not actually own portions of it, and claim that it is the user's responsibility to make such determinations. Doing so is really not so feasible for an individual like myself, and given the complexity of modern intellectual property law, republishing anything is something of a calculated risk.

Alternatively I could have left intrepid readers to download the OECD data themselves, but it could change in the future, undergo a change in format that might break our code, or even become unavailable entirely (the license makes disclaimers for each of these possibilities).

The fact that our numbers only partially match those in the original article suggests that the data has indeed changed in the intervening three months, although we cannot be sure. In any case, the demonstration suggests just how difficult reproducible data journalism really is. If the dataset were proprietary, we would not have the option of publishing a copy at all.

Finally, I would like to note that this project is neither a critique nor endorsement of the original article. While I wish the author had provided more specific information about which data was used, I suspect that the core problem has to do with an update to the dataset since the original publication date on 2017/07/07. The discrepancies around latest year of life expectancy data per country is slightly problematic; we could investigate further by looking at how much the life expectancies changed in prior years. Whether or not the comparison between private health expenditures and life expectancy truly summarizes the nature of the US health care problem is another question entirely.