public

Comet charts in Python: visualizing statistical mix effects and Simpson's paradox with Altair

Beter inzicht in statistische mix effecten met Simpson's paradox met comet charts.

3 jaar geleden

Meest recente artikel Deze website is in archief modus door Daniel Kapitan public

Zan Armstrong’s comet chart has been on my list of hobby projects for a while now. I think it is an elegant solution to visualize statistical mix effects and address Simpson’s paradox, and particularly useful when working with longitudinal data involving different sub-populations. Recently I found a good excuse to spend some time to actually use it as part of a exploratory data analysis on a project.

Since I mostly work in Python and have recently fallen in love with Altair — for the same reasons as Fernando explains here — I wondered how the comet chart could be implemented using the grammar of interactive graphics. It took me a while to figure out how to actually plot the comets. In a previous version, I had drawn glyphs using Bokeh. While Altair allows you to plot any SVG path in a graph, this felt a bit hacky and not quite in line with the philosophy of using a grammar of graphics.

Thankfully Mattijn was quick to suggest using trail-marks, after which it was almost as easy as pie. So here’s an example using a dataset of 20,000 flights for 59 destination airports.

In the example shown here, each comet represents one destination airport. The head of the comet corresponds to the most recent observation of the number of flight arrivals (x-axis, shown as logarithmic scale to accommodate the wide range of observations) against the mean delay of those flights (y-axis). The tail of the comet represents a similar (x,y) datum, but from an earlier point in time. Finally, the colour of the comet is encoded to show the change in the mean delay for each airport. A tooltip with a summary of the data is shown when hovering over the head of the comet.

So-called mix effects can often lead to misinterpretation of aggregate numbers. In the example of flight delays, the fact that only a small change is observed in the mean delay across all airports — visualized with the right-most comet outlined in black — hides the underlying variance between airports. Note that in this example the size of each sub-population (number of flights per airport) remains relatively constant, hence the comets here only go up and down. As explained in the original article, mix effects become harder to interpret when the relative size of the sub-populations change as well as their relative values. In the most extreme case this may lead to Simpson’s paradox.

With this base implementation of comet charts in Altair, you can really go to town and combine it with other interactive graphs. Using the overview-detail pattern, you could plot an accompanying density plot of all the flights for a given airport. That way you can quickly zoom in to the lowest level of detail and get a better understanding of the underlying mix effects.

For now, I will leave you with the Python code to make the plot.


{
import altair as alt
import pandas as pd
import vega_datasets


# Use airline data to assess statistical mix effects of delays
flights = vega_datasets.data.flights_20k()
aggregation = dict(
    number_of_flights=("destination", "count"),
    mean_delay=("delay", "mean"),
    mean_distance=("distance", "mean"),
)

# Compare delays by destination between month 1 and 3
grouped = flights.groupby(by=[flights.destination, flights.date.dt.month])
df = (
    grouped.agg(**aggregation)
    .loc[(slice(None), [1, 3]), :]
    .assign(
        change_mean_delay=lambda df: df.groupby("destination")["mean_delay"].diff(),
    )
    .fillna(method="bfill")
    .reset_index()
    .round(2)
)

# Calculate weigthed average of delays for month 1 and 3
total = (
    flights.groupby(flights.date.dt.month)
    .agg(**aggregation)
    .loc[[1, 3], :]
    .assign(
        change_mean_delay=lambda df: df.mean_delay.diff(),
        destination='TOTAL'
    )
    .fillna(method="bfill")
    .round(2)
    .reset_index()
    .loc[:, df.columns]
)


def comet_chart(df, stroke="white"):
    return (
    alt.Chart(df, width=700, height=500)
    .mark_trail(stroke=stroke)
    .encode(
        x=alt.X("number_of_flights", scale=alt.Scale(type="log")),
        y=alt.Y("mean_delay"),
        detail="destination",
        size=alt.Size("date", scale=alt.Scale(range=[0, 10]), legend=None),
        tooltip=[
            "destination",
            "number_of_flights",
            "mean_delay",
            "change_mean_delay",
            "mean_distance",
        ],
        # trails don't support continuous color, see https://github.com/vega/vega/issues/1187
        # hence use bins
        color=alt.Color(
            "change_mean_delay:Q",
            bin=alt.Bin(step=2),
            scale=alt.Scale(scheme="blueorange"),
            legend=alt.Legend(orient="top"),
        ),
    )
)


comet_chart(df) + comet_chart(total, stroke="black")
}
Daniel Kapitan

Gepubliceerd 3 jaar geleden