Cruise Control, or Why Katie Ledecky Dominates Distance Freestyle

I began paying attention to Katie Ledecky, the American distance swimmer, after Brian Phillips wrote a fantastic profile on her for Grantland in the fall of 2014. Ledecky was seventeen then; she had been the reigning world record-holder in the women's 800-meter freestyle for over two years. Actually, reigning, which suggests stasis, isn't quite the right word for Ledecky, who, after whittling nine seconds from her prelims time to seize the gold at the 2012 Olympics*, kept whittling. 8 minutes, 14 seconds. 8 minutes, 13 seconds. 8 minutes, 11 seconds. Since Phillips wrote his piece, Ledecky has continued to get faster and faster; in addition to the 800, she now holds world records in the 400 and the mile — the latter by 17(!!!) seconds. It's in the mile, as Phillips noted, that Ledecky laps her opponents**. At the elite level, this is, let's just say, unusual. More unusual: holding world records in both mid-distance and distance events (and in Rio, she'll be racing the 200 — nearly a sprint! — as well).

[Image: Ledecky's world-record swim, via the New York Times]

In 2014, Phillips was one of the few non-niche sportswriters to cover Ledecky. These days, with Rio upon us and Ledecky our best, surest chance at Olympic aqua-gold, features abound. Most of them hammer home the following two points: 1) Katie Ledecky is "the greatest athlete in America"/"peerless"/"better at swimming than anyone is at anything", and 2) Katie Ledecky is really nice. While I LOVE seeing swimming get some mainstream lip service, calling someone unbeatable without offering more than a few grasped-at straws as to why they are unbeatable is boring, and calling an athlete "nice" is just lazy. Basically, some of these pieces come off like pre-Kimye T-Swift features, is what I'm saying.

Nonetheless, Katie Ledecky, while perhaps not "better at swimming than anyone is at anything," is going into Rio with a wider margin between her and her competitors than maybe any other athlete. There are a few explanations for this, some to do entirely with Ledecky herself, and some to do with the mechanics of distance swimming.

First, timing. At 11, Ledecky was still a multisport athlete; two years later, she and her then-coach, Yuri Suguiyama, were mapping out plans for London. This might seem like a crazy leap, but it is fairly common for previously casual swimmers to go through a period of staggering improvement when they hit adolescence.*** For some, this period may be short and followed by a plateau before improvement continues, but for others, like Ledecky, the period can last for years with little abatement.

Second, margins. The press, not particularly hip to the vagaries of distance swimming, is astounded by the number of seconds Ledecky can drop in an 800 or 1500 between meets. For someone in a stretch of rapid improvement, dropping two seconds in the 800 over a month is like dropping maybe a tenth of a second in the 100. Think of it as less of a sprint, more of, I dunno, a half-marathon. Even at the elite level, there's some cushion, especially before you figure out how you want to swim it. (The length of the race is also why Ledecky is able to establish multi-second leads: being a little faster over the course of a 50 multiplies into being a lot faster over the course of an 800 or a 1500; half a second per 50 adds up to eight seconds over the 800's sixteen lengths.) And these margins are why I think distance swimming is so fun to watch: unlike the 50 or the 100, which are decided almost at the dive, or even the 200, the distance events offer a lot of uncertainty — not only in who will win, but in how.

I was never a true distance swimmer, so I'm not the best suited to discuss distance race strategies. That being said, there are three common ones:

  • Go out fast and try to hold on. (When the place matters more than the time, this can be a decent strategy for messing with your competitors' heads, at least.)
  • Go out easy and build. (If you’re not as susceptible to head games and you feed off pressure, this strategy can pay off. Bonus: it’s more exciting to watch.)
  • Hold a maintainable speed for pretty much the entire race, save for the first and last fifty.

The last method is the trickiest to manage, because the swimmer needs to know exactly what her maintainable speed is. At the elite level, this method also requires the most endurance and raw strength, because the swimmers are holding nearly race pace for over eight minutes. Ledecky follows this strategy, as do some of the elites whom she'll most likely be swimming against in the prelims and finals at Rio.

Below are the splits for each successive 50m in the 800 for the world’s fastest swimmers in this event, using their best times of the past two years.****

[Chart: successive 50m splits in the 800 for the world's fastest swimmers, best times from the past two years]

Ledecky swims a nearly perfect steppe: fast out, steady on, fast close. For the middle 700, the standard deviation of her splits is 0.12 seconds. The average standard deviation for the other elites listed here is 0.27 seconds.
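(If you want to recreate that number from a splits chart, it's nothing fancier than a plain standard deviation over the fourteen middle 50s. A quick Python sketch with made-up splits, not Ledecky's actual ones:)

# hypothetical 50m splits for an 800, in seconds -- not real race data
splits = [29.5, 31.0, 31.1, 30.9, 31.0, 31.1, 31.0, 30.9,
          31.0, 31.1, 30.9, 31.0, 31.1, 31.0, 30.9, 30.2]
middle = splits[1:15]  # drop the opening and closing 50s, leaving the middle 700
mean = sum(middle) / len(middle)
stdev = (sum((s - mean) ** 2 for s in middle) / (len(middle) - 1)) ** 0.5
print round(stdev, 2)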

The New Zealand swimmer Lauren Boyle, the German Sarah Kohler, and Ledecky's teammate Leah Smith swim their middle 700 somewhat evenly (stdevs of .18, .27, and .27, respectively), but they don't close like Ledecky, whose final 50 is within half a second of her first 50. Only Jessica Ashwood (AUS) has a final 50 that is closer to her starting 50 than Ledecky's, but Ashwood's strategy is more of a come-from-behind than any other in this chart.

What this all boils down to — and the reason Ledecky dominates this event — is that she is able to swim fifteen 50s at a pace that is, on average, a half second faster than what any other female swimmer can do.


This isn't a predictions post, as, barring deus ex machina, Ledecky, whose fastest time this year is 12 seconds faster than that of Jazz Carlin, the runner-up, will win this handily. That said, looking at the data, I'd say that Carlin and Jessica Ashwood might have the best chance at coming within ~7 seconds of Ledecky. Not because their swims are steppe-flat, but because of the opposite: they have relatively weaker middles, and the middle is the most straightforward target for time whittling.

 

 

*Wresting crowns from the Brits is always a bit of fun, especially when it’s done under the gaze of the king and queen.

**Though it’s unlikely that you’ll get to see her lap anyone in Rio, as women, ridiculously, are forbidden from racing the mile.

***My own childhood coach attributed this phenomenon, in part, to not knowing enough to overthink, and I see echoes of this in many a profile on top American swimmers.

****Missing Zhang Yuhan, of China, as I was not able to find her splits.


Do the Jitterbug (an analysis of cold brew efficacy using D3.js)

Screen Shot 2016-07-24 at 9.09.47 PM

I write this sitting in the kitchen sink… Just kidding! My kitchen sink barely holds one wriggling (fat) pug. In truth, I write this melting, Elmira-style, in the peak of #heatdome2016. Temperatures like these (mid-90s, egads!) merit rewards merely for enduring them, and what is a five-dollar vat of caffeine if not a reward?

There are three coffee shops within 1-2 blocks of my apartment, and another handful scattered round my office. I buy cold brew most mornings, but where I buy it depends on mood, lateness, grogginess, general aesthetic preference, etc. In my head, they all cost around $4, but I was vaguely curious to see how the price stacked up versus size, and, more importantly, jitter quotient. Call me a New Yorker (it’s been eight years — I’m close!), but I prefer to let the ice cubes do all the dilution.

I created the following visualization by modifying Michele Weigle’s scatterplot so that it rendered images instead of circles. Specifically, I replaced

 svg.selectAll(".dot")
      .data(data)
    .enter().append("circle")
      .attr("class", "dot")
      .attr("r", 3.5)
      .attr("cx", xMap)
      .attr("cy", yMap)
      .style("fill", function(d) { return color(cValue(d));}) 
      .on("mouseover", function(d) {
          tooltip.transition()
               .duration(200)
               .style("opacity", .9);
          tooltip.html(d["Cereal Name"] + "<br/> (" + xValue(d) 
	        + ", " + yValue(d) + ")")
               .style("left", (d3.event.pageX + 5) + "px")
               .style("top", (d3.event.pageY - 28) + "px");
      })
      .on("mouseout", function(d) {
          tooltip.transition()
               .duration(500)
               .style("opacity", 0);
      });

with:

 svg.selectAll(".dot")
 .data(data)
 .enter().append("image")
 .attr('xlink:href',function(d){
 if (d.neighborhood =='Fidi'){
 return 'icedcoffee_blue.png'
 }
 else { 
 return ('icedcoffee_purple.png')
 }
 })
 .attr('class', 'dot')
 .attr('x', xMap)
 .attr('y', yMap)
 .attr('height', '50')
 .attr('width', '50')
 .on("mouseover", function(d) {
 tooltip.transition()
 .duration(200)
 .style("opacity", .9);
 tooltip.html("jitter-factor: "+ xValue(d)
 + ", " + "price/oz: $"+ yValue(d))
 .style("left", (d3.event.pageX + 5) + "px")
 .style("top", (d3.event.pageY - 28) + "px");
 })
 .on("mouseout", function(d) {
 tooltip.transition()
 .duration(500)
 .style("opacity", 0);
 });

The resulting viz indicates that our own cafeteria gives by far the best jitter bang for buck, while The Wooly Daily ought to be avoided (though the vibes are decent). Also, Fort Greene caffeine tends to be both stronger and pricier than its FiDi counterpart.

The code for this viz is up on GitHub. Feel free to plug your own coffee places into the jitterbug.csv file and see how they shake up!


So You Want to Be a Data Analyst, Part 4: Visualizing Your Data with D3.js

A picture is worth 1,000 words, or so throw pillows tell me. A good data visualization can be worth millions of rows (or thousands, or billions, even, depending on the size of the data you're illustrating). Conversely, a bad data visualization, put into the wrong hands, can be just as harmful as a photoshopped Sandy doomsday photo. The previous two posts in this series have addressed acquiring and making sense of your data; today, I'll take you through the process of turning that data into a visual story using the D3.js library. First, though, I want to cover the reasons why you'd want or need to visualize your data, and explain when it makes sense to create the visualization programmatically instead of using a ready-made chart builder.

Talk Graphics to Me

A graphic/viz/chart is to data what symbols are to words: a succinct means of conveying a large amount of information. Like symbols, visualizations tend to be more eye-catching than the data they encapsulate. Use them when you're reporting on data that has more than one facet (e.g., quantities over time, quantities by category, or quantities by category over time).

Likely, you’ve created data visualizations before, in our old pal MS Excel (or, if you’re like me and LOATHE Excel, Google Spreadsheets). Common viz include bar charts (horizontal and vertical — good for showing a breakdown of quantities by category), line charts (good for showing change in one or more categories over time), scatter plots (good for showing the relationship between two quantities), and pie charts (good for nothing. JK — they’re good for showing breakdowns of quantities by category, provided you have a small number of categories).

If your data is already in a spreadsheet, using a built-in chart to summarize it in one of the ways discussed above is totally fine. I use Google Spreadsheets’ built-in charts often — they have a decent amount of customization, and the added bonus of being embeddable. So when should you use something else?

Taking Measures into Your Own Hands

  1. The first reason you might want to create your own data visualization relates to what I said about Google Spreadsheets being embeddable: shareability. No one should ever have to open an Excel attachment (or any attachment) to view a chart. You can insert your charts in the body of your email, but it's nice, I think, to have a backup, especially if that backup is interactive (so really, the inserted image is the sneak preview). Also, if you have a data blog or some sort of internal company site, you can slap your custom-made data visualization in there and, provided you've styled it right, not have to worry about it breaking on your CEO's iPad.
  2. The second reason to DIY is customization. Your dataset might not fit in Excel's default charts, or maybe it does, but you want to show it off in something nicer. A little couture never hurt anyone. D3 shines here — once you get good at it, you can visualize so many data structures (for some great examples, check out Kantar's Information Is Beautiful 2015 finalists, or the New York Times' Upshot blog).
  3. The third reason is that your data isn’t in a spreadsheet, and you don’t want to export it into one. This is the dashboard scenario: hook your visualization straight into your database, schedule your query so that the data refreshes, and let ‘er rip.
  4. The fourth reason is replicability. Chances are, you’re going to be reporting on the same or similar data (or data with the same or similar structure) more than once. If you create your visualization layer programmatically, you can easily swap in new data for the old data, et voila. To create the D3 example below, I modified a script I’d already created for the Met Gala. The original took me well over an hour to create. This version took maybe ten minutes.

<Caveat Lector>

People not on the creation side of visualizations tend to trust the result implicitly, and this trust can be deliberately or mistakenly abused. The best way to prevent bad visualizations is to understand your data, and its conclusions, before you translate them. Where possible, you should provide your dataset along with your visualization, so that your audience can a) pursue their own deeper understanding of the data and b) validate your work.

</Caveat>

Time to Bind

D3 is certainly not the only javascript library designed to bind data to the DOM, but it's one of the best (IMHO), buzziest (indisputable), and most-used. It was created by Mike Bostock, who went on to become a graphics editor at the New York Times. Remember when the NYT started gathering acclaim for visual articles like "512 Paths to the White House"? That was Bostock's doing (though the NYT as a whole deserves a lot of kudos for supporting a very different means of information display long before anyone else).

I’m not going to spend too much time explaining D3 upfront, because others, including Bostock himself, have done so extensively. Here’s what you need to know before we get graphing:

  • D3.js is a javascript library that allows you to attach data to the DOM of a web page.
  • Once attached, that data can be manipulated by manipulating the DOM elements it's bound to.

Okay, let's get crackin'. Today, you're going to create a bubble graph that shows the wax and wane of music genres' popularity over time. The dataset you'll be using contains information about music genre popularity in the United States between 1960 and 2010. The popularity of a given genre is simply the number of times it appeared in the Billboard Hot 100 in a given year. This dataset was created by aggregating Matthias Mauch's song-level data, which he used in "The Evolution of Popular Music: USA 1960-2010." You can download the data, as well as all of the visualization code, on GitHub.

Step 1: Get your data.

Honestly, this can be the hardest part, depending on how you approach D3, and also what kind of data you’re working with. In an ideal world, you already have access to the data you need, and you load it, untouched, into D3 and start manipulating. If replicability matters to you and/or you’re building a dashboard, you’re going to want to do as much of the manipulation directly in D3 as possible. That being said, when you’re starting out, you’re probably going to find yourself doing the following:

  1. Deciding, from the shape of your result set, which visualization(s) would best explain it.
  2. Finding an example D3 visualization online.
  3. Realizing that the data it relies upon is either hard-coded or shaped differently from yours.
  4. Retrofitting your data to resemble the example data.

This is okay, in the beginning. As you learn D3, you’ll become more comfortable with transforming your data into the shape you need in D3, using D3’s built-in nest and mapping functions. Today, however, you’re going to do a minimal amount of manipulation in D3 because the music data you’ll be using is already in the necessary format.
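(For a taste of what that reshaping can look like, and this is not something today's chart needs, D3 v3 ships with a nest operator that groups flat rows by a key. A sketch, assuming your rows are already loaded into a variable called data:)

// group the flat {year, genre, score} rows by genre
var byGenre = d3.nest()
    .key(function(d) { return d.genre; })
    .entries(data);
// byGenre now looks like [{key: "ALTERNATIVE", values: [rows...]}, {key: "SOUL", values: [rows...]}, ...]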

Let’s look at the head of the data:

"year","genre","score"
1960,"ALTERNATIVE",1
1961,"ALTERNATIVE",0
1962,"ALTERNATIVE",0

Big surprise — there was not much alternative music in 1960! Shall we check out the tail?

2004,"SOUL",23
2005,"SOUL",21
2006,"SOUL",25

See, Dad — soul didn’t die in the seventies! (That would be Motown.)

The point of all this inspecting is to know the shape of your data: three columns, one a date, one a category, and one an integer. As you start doing more D3, you’ll want to compare your data to the data in the examples. One way or another, yours will need to match the example shape before you start drawing your visualization.

Step 2: Set up your html, css, and D3 scaffold.

By which I mean, a basic html web page with a <body> tag to bind to, css to dictate any over-arching styles, and the start of your javascript, which will contain your visualization’s width, height, and margins. Typically, you’ll break these out into their own files, but for brevity’s sake here, I’m going to put everything into the HTML.

<!DOCTYPE html>
 <meta charset="utf-8">
 <style>

body {
 font: 12px sans-serif;
 }

.axis path,
 .axis line {
 fill: none;
 stroke: grey;
 shape-rendering: crispEdges;
 }

.dot {
 stroke: none;
 fill: steelblue;
 }

.grid .tick {display: none;}

.grid path {
 stroke-width: 0;
 }

div.tooltip {
 position: absolute;
 text-align: center;
 width: 80px;
 height: 42px;
 padding: 2px;
 font: 12px sans-serif;
 background: #ddd;
 border: solid 0px #aaa;
 border-radius: 8px;
 pointer-events: none;
 }

</style>
 <body>
 <script src="http://d3js.org/d3.v3.min.js"></script>
 <script>

 //We kick off any d3 javascript with basic sizing
 var margin = {top: 100, right: 20, bottom: 30, left: 100},
     width = window.innerWidth - margin.left - margin.right,
     height = 500 - margin.top - margin.bottom;

Step 3: Add your data stipulations.

By stipulations, I mean the specific properties of your dataset. Since this dataset contains time data, you’ll want a parse date function that tells D3 “hey, this column is a date and it’s in this format.” You might want a format date function so that you can control what the date displayed on your axis looks like. We’re going to be drawing a categorical bubble chart on a time-series, so you’ll need a time-scaled X axis and an ordinal (non-linear/aka non-quantitative) Y axis. You’ll also want a color scale to differentiate between the genres. All these javascript stipulations will look like this:

//if we have date data, we need to tell d3 how to read it.
 //in the music data case, all we have is the year
 var parsedate = d3.time.format("%Y").parse;
 var formatDay_Time = d3.time.format("%Y");

//we're going to want to color our scores based on the genre. Here, we're just setting a color variable.
 var color = d3.scale.category10();
 //We need to tell d3 what kind of axes we want.
 //In the music data case, we want our dates on the x axis and our genres on the y.
 //Because the genres are categorical, we will explicitly tell d3 how far apart to space them, using rangeRoundBands
 var x = d3.time.scale().range([0, width]);
 var y = d3.scale.ordinal()
 .rangeRoundBands([0, height], .1);

//tell d3 to orient the y axis on the lefthand side of the graph
 var yAxis = d3.svg.axis()
 .scale(y)
 .outerTickSize(0)
 .innerTickSize(0)
 .orient("left");

//put the x axis on the bottom
 var xAxis = d3.svg.axis()
 .scale(x)
 .orient("bottom");

Step 4: Create a visualization object and attach it to the DOM.

The visualization object is called an SVG, aka "scalable vector graphic." This is the thing onto which your data will attach and come to life. Declare and attach it to your <body> tag like this:

var svg = d3.select("body")
 .append("svg")
 .attr("width", width + margin.left + margin.right)
 .attr("height", height + margin.top + margin.bottom)
 .append("g")
 .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

Step 5: Load your data and attach it to your SVG.

You can hard code your data, but that’s super onerous and also not replicable, so I’d advise against it. Instead, use one of D3’s loading methods. In our case, we’ll use d3.csv, because our data is in a csv format. Once we load it, we loop through it and format it correctly.

d3.csv("data/history_of_music.csv", function(error, data) {
 data.forEach(function(d) {
 d.year = parsedate(d.year);
 d.score = +d.score;
 });

 

Step 6: Define/create your axes

 

The domain methods tell D3 which data should be attached to your axes. In our case, the d.year data goes to the x axis, the d.genre goes to the y axis, and the d.score dictates the radii of the bubbles. Defining the x axis is easy:

// Set our x axis domain with the earliest and latest years
 x.domain([new Date(1959, 01, 01), new Date(2009, 12, 01)]);

To define the y axis, we want the unique genre names. We can get them by looping through all of the genres and appending each new one to an array.

//Now we'll create an array of unique genres, which we'll use to create our y axis
 //to do this, we just loop through the genres data and append each name
 //that doesn't already exist in the array
 genres_filtered = []

data.forEach(function (d) {
 if (genres_filtered.indexOf(d.genre) === -1) {
 genres_filtered.push(d.genre);
 }
 });

With that array in hand, we pass it to both the y scale and the color scale (y.domain(genres_filtered) and color.domain(genres_filtered); you'll see both calls in the full script below). To define the radius scale, we set its range from zero to the height divided by the number of unique genres (so that no bubble will be too big for its row), and its domain from zero to the maximum score.

//Now we set the radius to be between zero and the height divided by the count of unique genres
 var r = d3.scale.linear().range([0, height / genres_filtered.length]);
 r.domain([0, d3.max(data, function(d) { return d.score})]);

Last, we'll append our x and y axes to the SVG using the call function.

//actually append our y axis to the svg 
 svg.append("g")
 .attr("class", "y axis")
 .call(yAxis)
 .selectAll("text")
 .attr("dx", 0)
 .attr("dy",-5)
 .style("text-anchor", "end");

//append our xaxis to the svg
 svg.append("g")
 .attr("class", "x axis")
 .attr("transform", "translate(0," + height + ")")
 .call(xAxis);

Step 7: Draw the bubbles.

Here's the fun part! To draw the bubbles, we'll use the year and genre to place them on the axes, the score to size them, and genre again to color them. We'll also add a little mouseover action to display the year, genre, and score for each bubble (and a mouseout to fade the tooltip away).

// attach our tooltip
var div = d3.select("body").append("div")
 .attr("class", "tooltip")
 .style("opacity", 1e-6);
//draw our circles based on the scores, and attach them to the svg
svg.selectAll(".dot")
 .data(data)
.enter().append("circle")
 .attr("class", "dot")
 .attr("r", function(d) { 
 return r(d.score); 
 })
 .style("opacity", 0.25)
 .attr("cx", function(d) { return x(d.year); })
 .attr("cy", function(d) { return y(d.genre); })
 .style('fill', function (d) { return color(d.genre); })
 .on("mouseover", function(d) {
 div.transition()
 .duration(200)
 .style("opacity", .7);
 div .html(
 formatDay_Time(d.year) + "<br/>" +
 d.genre + "<br/>" + 
 d.score) 
 .style("left", (d3.event.pageX) + "px")
 .style("top", (d3.event.pageY - 42) + "px");
 }) 
 .on("mouseout", function(d) {
 div.transition()
 .duration(500)
 .style("opacity", 1e-6);
 });

Lastly, we’ll append a title.

svg.append("text")
 .attr("x", (width / 2)) 
 .attr("y", 0 - (margin.top / 2)) 
 .attr("text-anchor", "middle") 
 .style("font-size", "16px") 
 .text("Are We Human or Are We Dancer: the Rise and Fall of Musical Genres, 1960 - 2009, per Volume of Billboard Hot 100 Hits");

});

The full page with the script will look like this:

<!DOCTYPE html>
<meta charset="utf-8">
<style>

body {
 font: 12px sans-serif;
}

.axis path,
.axis line {
 fill: none;
 stroke: grey;
 shape-rendering: crispEdges;
}

.dot {
 stroke: none;
 fill: steelblue;
}

.grid .tick {display: none;}

.grid path {
 stroke-width: 0;
}

div.tooltip {
 position: absolute;
 text-align: center;
 width: 80px;
 height: 42px;
 padding: 2px;
 font: 12px sans-serif;
 background: #ddd;
 border: solid 0px #aaa;
 border-radius: 8px;
 pointer-events: none;
}

</style>
<body>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>


//We kick off any d3 javascript with basic sizing
var margin = {top: 100, right: 20, bottom: 30, left: 100},
 width = window.innerWidth - margin.left - margin.right,
 height = 500- margin.top - margin.bottom;

//if we have date data, we need to tell d3 how to read it. 
//in the music data case, all we have is the year 
var parsedate = d3.time.format("%Y").parse;
var formatDay_Time = d3.time.format("%Y");

//we're going to want to color our scores based on the genre. Here, we're just setting a color variable. 
var color = d3.scale.category10();


//We need to tell d3 what kind of axes we want.
//In the music data case, we want our dates on the x axis and our genres on the y.
//Because the genres are categorical, we will explicitly tell d3 how far apart to space them, using rangeRoundBands
var x = d3.time.scale().range([0, width]);
var y = d3.scale.ordinal()
 .rangeRoundBands([0, height], .1);

//tell d3 to orient the y axis on the lefthand side of the graph
var yAxis = d3.svg.axis()
 .scale(y)
 .outerTickSize(0)
 .innerTickSize(0)
 .orient("left");

//put the x axis on the bottom
var xAxis = d3.svg.axis()
 .scale(x)
 .orient("bottom");

//here we create our graph object, attach it to the body element of our html
//and append the sizing attributes we specified earlier
var svg = d3.select("body")
 .append("svg")
 .attr("width", width + margin.left + margin.right)
 .attr("height", height + margin.top + margin.bottom)
 .append("g")
 .attr("transform", "translate(" + margin.left + "," + margin.top + ")");


//Now, we read in our music data.
//We'll parse our dates and tell d3 that our scores are numeric 
d3.csv("data/history_of_music.csv", function(error, data) {
 data.forEach(function(d) {
 d.year = parsedate(d.year);
 d.score = +d.score;
 });

//check to make sure the data was read in correctly
 console.log(data);

// Set our x axis domain with the earliest and latest years
 x.domain([new Date(1959, 01, 01), new Date(2009, 12, 01)]);

//Now we'll create an array of unique genres, which we'll use to create our y axis
 //to do this, we just loop through the genres data and append each name
 //that doesn't already exist in the array
 genres_filtered = []

data.forEach(function (d) {
 if (genres_filtered.indexOf(d.genre) === -1) {
 genres_filtered.push(d.genre);
 }
 });


 //Now we set the radius to be between zero and the height divided by the count of unique genres
 var r = d3.scale.linear().range([0, height / genres_filtered.length]);
 r.domain([0, d3.max(data, function(d) { return d.score})]);


 //add the genre names to the y axis
 y.domain(genres_filtered);

//color our bubbles based on the genre names
 color.domain(genres_filtered);

//actually append our y axis to the svg 
 svg.append("g")
 .attr("class", "y axis")
 .call(yAxis)
 .selectAll("text")
 .attr("dx", 0)
 .attr("dy",-5)
 .style("text-anchor", "end");

//append our xaxis to the svg
 svg.append("g")
 .attr("class", "x axis")
 .attr("transform", "translate(0," + height + ")")
 .call(xAxis);

// attach our tooltip
var div = d3.select("body").append("div")
 .attr("class", "tooltip")
 .style("opacity", 1e-6);

//draw our circles based on the scores, and attach them to the svg
svg.selectAll(".dot")
 .data(data)
.enter().append("circle")
 .attr("class", "dot")
 .attr("r", function(d) { 
 return r(d.score); 
 })
 .style("opacity", 0.25)
 .attr("cx", function(d) { return x(d.year); })
 .attr("cy", function(d) { return y(d.genre); })
 .style('fill', function (d) { return color(d.genre); })
 .on("mouseover", function(d) {
 div.transition()
 .duration(200)
 .style("opacity", .7);
 div .html(
 formatDay_Time(d.year) + "<br/>" +
 d.genre + "<br/>" + 
 d.score) 
 .style("left", (d3.event.pageX) + "px")
 .style("top", (d3.event.pageY - 42) + "px");
 }) 
 .on("mouseout", function(d) {
 div.transition()
 .duration(500)
 .style("opacity", 1e-6);
 });

// Add the title
 svg.append("text")
 .attr("x", (width / 2)) 
 .attr("y", 0 - (margin.top / 2)) 
 .attr("text-anchor", "middle") 
 .style("font-size", "16px") 
 .text("Are We Human or Are We Dancer: the Rise and Fall of Musical Genres, 1960 - 2009, per Volume of Billboard Hot 100 Hits");

});

</script>
</body>

 

CONGRATULATIONS! A dataviz iz here! Though, if you've never built a web page with javascript before, you might be like…where? To view this guy, I recommend Python's very lightweight SimpleHTTPServer module; it ships with Python 2, so there's nothing to install, and you just need to launch it from the same folder as your viz file.
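If you haven't done this before, the commands look like the following (run them from whatever folder your HTML file lives in):

cd path/to/your/viz/folder
python -m SimpleHTTPServer 8000
# on Python 3, the module was renamed: python -m http.server 8000

Then open http://localhost:8000/ in your browser, click through to your HTML file, and you should see: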

[Screenshot: the finished bubble chart]

This post isn't really about analyzing this particular dataset, but there are some trends that stand out to me. Namely, that pop peaked in the mid-80s, country didn't pick up much mainstream steam until the late nineties, and rap has been waning as hip-hop has risen. Also, I should say that these ten genres were the ten most popular out of a much larger genre set, so lots of the more interesting ones, like New Wave and Post-punk, are not represented here.

Anyways, I'm going to rap this up because it's getting kinda Joycean in its length (though hopefully not in its parseability!). The code, including that for transforming Mauch's data into the needed shape, is on GitHub. Have at it!


The Kind You Find in a Second Hand Store

The year I turned eight, I listened to Prince every Wednesday night. Wednesdays were my mother's night to drive me and two other girls to and from swim practice. The practices were held at Wellesley College, and in my memory they started late — after dinner, certainly, though before dessert. The trip to the pool from my home was, in total, around twenty minutes, or five Prince songs. The album must have been a Best Of, for it contained Little Red Corvette and Raspberry Beret and When Doves Cry and 1999. I loved them all, but the first two I loved the best, particularly Raspberry Beret, with its mix of Springsteen nostalgia and peacock exoticism.

I loved them, and yet they inspired a nagging dread, for they were always played on the way to practice, with its endless, burning laps, and never on the way back. (The way back was reserved for The Beatles, whom I to this day associate with roast chicken.)

I don't know if it was because of the chlorinated dread his songs inspired, but I have not listened to Prince much in the twenty years since those carpool nights. And then tonight, I heard licks of Little Red Corvette as I was walking home from the subway. In front of Spike Lee's studio on South Elliott Street, a block party was underway. A DJ played 1999 and Around the World in a Day and Sign o' the Times while Lee himself grinned maniacally and exhorted the crowd with two lit iPhones. Small children swayed on shoulders. An older man, portly in a blue blazer, told his son that this night was about respecting his generation. Two women talked about the time they saw Prince in Savannah. A man reminisced about dancing to 1999 in its pre-Y2K heyday. On a parked car, two small girls took selfies.

I have lived in Fort Greene for nearly four years now. I like the trees and the plenitude of cafes and transportation, and I love the park, but it's never, until tonight, felt like a place I belonged to.


So You Want to Be A Data Analyst, Chapter 3: The Relational Database

In his short story Funes, the Memorious, the Argentine writer Jorge Luis Borges writes of a teenager, Funes, whose mind contains the world, and all of its history. To access this information, Funes devises a system of one-to-one enumeration.

"In place of seven thousand thirteen, he would say (for example) Máximo Pérez; in place of seven thousand fourteen, The Train; other numbers were Luis Melián Lafinur, Olimar, Brimstone, Clubs, The Whale, Gas, The Cauldron, Napoleon, Agustín de Vedia. In lieu of five hundred, he would say nine."

This system, the narrator points out, is ridiculously inefficient — indeed, it is a degradation of the grouping systems we already have, the hundreds and tens and birds and planets.

The original databases, magnetic tape that stored references to information, similar to punch cards, were more efficient than Funes' enumeration system, but only slightly. The actual data and its reference were stored hierarchically: sunflower growth statistics, let's say, rolls into sunflowers, which in turn rolls into flowers. Any time a single record needed to be updated, the entire tape would be rewritten. And finding that record necessitated first finding its reference in a vast spool of tape — not quite needle:haystack, but close.


I won't dive much further into database history, except to say that in 1969, a Brit named E.F. Codd published a paper that introduced the concept of relational databases and explained their advantage over hierarchical and network databases: namely, that they allow you and me to care only about what we are looking for, and not about where we must go to find it.

While these days databases come in a variety of architectures, some hyper-structured, some basically conjoined jellyfish, the relational database is still the pre-eminent database for most mid-size and large companies (though, if they are in the business of content or social network creation, it's likely they have a few jellyfish as well).

As an analyst, I frequently want to uncover the relationship between two or more sets of data; relational databases are designed to expedite that discovery, by joining tables of data on a common key. The key is to data tables what mutual friends are to you and the guy you swiped right for. I like relational databases, because getting data out of them is very straightforward.

[Photo: an automat in New York. Source: National Geographic]

In this post, I'm going to illustrate this straightforwardness (but with dogs, not Tinder. Wompwomp). We'll go through what a relational data table looks like, how to create one in your RDB of choice, and how to query its contents and join it to a second table.

Let’s get started.

First, I want you to picture a spreadsheet. (Humor me, budding analyst.) Let's say it's a spreadsheet on dog adoptions, called "Adoption." Adoption has three columns: Breed, Puppy Price, and Average Lifespan (years).

Breed                 Puppy Price   Average Lifespan (years)
Labrador Retriever    950           11
St. Bernard           1,000         9
Irish Wolfhound       2,000         8
German Shepherd       700           11
Yorkie                750           14.5
Pug                   800           13
French Bulldog        3,100         11
Chihuahua             500           17

 

You also have a second spreadsheet, “Health,” with Breed and Average Annual Cost.

Breed                 Average Annual Cost
Labrador Retriever    2569
St. Bernard           3321
Irish Wolfhound       3321
German Shepherd       2569
Yorkie                1831
Pug                   2268
French Bulldog        2268
Chihuahua             1831

 

Let’s say you want to calculate the total cost per breed, because you’re a pragmatic person and you want to convince your husband that an Irish Wolfhound is a bad investment, compared to a chihuahua. In a spreadsheet scenario, these two tables would live in separate tabs, and if you wanted to calculate the total cost per breed, you might first use a VLOOKUP to join the average annual cost data to the average puppy price and lifespan data.

In a relational database scenario, these tables would live in schemas (an overarching container — akin to a top-level report folder, if you’re the organized sort). You would relate them to one another using a key column, in this case, “Breed,” and then multiply the Lifespan column from the Adoption table by the Average Annual Cost Column from the Health table.

from_relational_model
Source: Neo4j

A relational database can hold many schemas, each with dozens if not hundreds of tables, and any of them can be linked by their common columns.

In this silly 8 row example, there’s no need to have the data anywhere but a spreadsheet. But, when:

  1. You have millions of rows of data and/or
  2. You have nested data and/or
  3. You want to keep adding to the dataset programmatically and/or
  4. You want others to be able to access and possibly even update the dataset,

a relational database is far more efficient than a spreadsheet.

Time to talk about getting data into a relational database. The relational database I’m using is MySQL, because that’s what our database team set up for me, but the processes I’ll be going through will be similar for Postgres and Oracle. I’ll be connecting to MySQL using a free GUI tool called “MySQL Workbench,” which you can download here. The language I’ll be using to write to and read from it is Structured Query Language, aka “SQL.” I’ll write more about SQL in an upcoming post, but if you’re unfamiliar with it, just think of it as a way to specify the type and aggregation of information you want. 

Data goes into a database in one of two ways. If the data already exists in a file format, you create a table for it in the appropriate schema, and then you upload it to that table. If the data is not yet exposed to you, you pull it down from its source location, format it appropriately, and push it to the table you’ve created for it. Typically, this would be done in a script, and you could have that script run on a schedule, so that your data table(s) fill up on the reg.

In this tutorial, I’m making the assumption that you already have access to a database and a schema within that database, and that you’ve opened up a connection to that database, either through a GUI like MySQL Workbench or through the command line (your terminal app). A schema, by the way, is like a bureau, only it holds tables, instead of shelves. (Ack, mixing furniture metaphors here!)

All set? Cool, let’s build a table.

First, connect to your database using the connection strings provided to you. In MySQL Workbench, you connect in the Home panel, by clicking on the (+) button.

[Screenshot: the MySQL Workbench Home panel]

After you've connected to your database, you need to specify the schema you'll be using. If you already had a "dogs" schema, you'd run:

USE dogs;

As there is no dogs schema, you'll need to create one:

CREATE SCHEMA IF NOT EXISTS dogs;

In our money-efficient pets scenario, we want the adoption table and the health table. As the dogs schema is empty, we’ll need to create both. To create a table in SQL, you first give it a name, and then you list each of the columns, along with their type (integer, character, date etc), and other options like “NOT NULL” if the column fields can’t be blank, and “PRIMARY KEY” if you want the column to be/contribute to the uniqueness and matchiness of the table. (More on this later!)

CREATE TABLE adoption (Breed VARCHAR(45), Puppy_Price INT, Avg_Lifespan DECIMAL(3,1));

The above creates your table, adoption, and specifies the names and data types of its columns. VARCHAR holds varied character data (aka strings), INT is for numbers, DATE is for dates, DATETIME is for date+time stamps, BOOL is True/False (False=0, True>0), DECIMAL is for decimals (aka floats). Full definitions of these types and more are here.

Using the same structure used to create your adoption table, create your health table.

CREATE TABLE health (Breed VARCHAR(45), Avg_Annual_Cost INT);

Now that you’ve created your table structures, you can fill them with data. This might be done programmatically, by hitting up an API, or manually, by loading a file. In the pets case, let’s assume the adoption data and health data each live in CSVs, located in your local Pets folder. Load them with the following:

LOAD DATA LOCAL INFILE '~/Pets/adoption.csv' INTO TABLE adoption
  FIELDS TERMINATED BY ',' IGNORE 1 LINES;
LOAD DATA LOCAL INFILE '~/Pets/health.csv' INTO TABLE health
  FIELDS TERMINATED BY ',' IGNORE 1 LINES;

Check to see the data has loaded properly with a SELECT * (aka select ALL THE THINGS):

SELECT * FROM adoption;
SELECT * FROM health;

You should see your full tables returned.

Now that we have our tables, we can start grilling ’em.

The SELECT statement used above starts pretty much any basic SQL query you might have. All it does is tell your not-yet-specified table, “hey, I’m going to be asking for these not-yet-specified things.” Let’s add some specificity!

The big, burning question you have regarding your pets data is which pets are the most and least cost-efficient. To answer it, you'll need to multiply the average annual cost for each pet from the health table by its lifespan from the adoption table, and add the puppy price from the adoption table to that total. What this means, in SQL terms, is that the time has come for JOINs and a little arithmetic.

To get there, you'll first JOIN the two tables on their shared Breed column, pulling the puppy price, average annual cost, and lifespan into one result:

SELECT adoption.Breed, adoption.Puppy_Price, health.Avg_Annual_Cost, adoption.Avg_Lifespan
FROM health
INNER JOIN adoption
ON health.Breed = adoption.Breed;

Breed                 Puppy Price   Average Annual Cost   Average Lifespan (years)
Labrador Retriever    $950          $2,569                11
St. Bernard           $1,000        $3,321                9
Irish Wolfhound       $2,000        $3,321                8
German Shepherd       $700          $2,569                11
Yorkie                $750          $1,831                14.5
Pug                   $800          $2,268                13
French Bulldog        $2,000        $2,268                11
Chihuahua             $500          $1,831                17

Now you have all your pertinent data in one result set, but what you really want is to calculate the total health cost (average annual cost * lifespan) and add that to your puppy price. You can do this in the same JOIN statement.

SELECT health.Breed,
       ((health.Avg_Annual_Cost * adoption.Avg_Lifespan) + adoption.Puppy_Price) AS Total_Cost
FROM health
INNER JOIN adoption
ON health.Breed = adoption.Breed
ORDER BY 2 DESC;

And the results, which we’ve ordered by price, from most to least expensive:

 

Breed                 Total Cost
Chihuahua             $31,627.00
St. Bernard           $30,889.00
Pug                   $30,284.00
Labrador Retriever    $29,209.00
German Shepherd       $28,959.00
Irish Wolfhound       $28,568.00
Yorkie                $27,299.50
French Bulldog        $26,948.00

Looks like the Chihuahua, by dint of its longevity, ends up costing the most. Go drop $3k on a Frenchie!

To recap: in this post, you learned about the reasons for using relational databases instead of spreadsheets, how to set up a schema and tables in a relational database, how to fill the tables with data from a file, and how to perform aggregations and joins on this data.

In the next post, I’ll dive further into SQL, and walk through inserting data into an RDB by other means (chiefly, through an API).

 


So You Want to Be a Data Analyst, Chapter 2: The API (Is Just an ATM in The Sky)

Before you can extract any insights from data, you need to have the data in the first place. If you're lucky, you do have the data sitting pretty in a database or can easily export it from its source. If you don't, however, you can hit up an API to get it. You also may want to use an API even if the export function is available, for efficiency's sake — if, for example, you're pulling data from multiple Facebook pages, or you want three months' worth of data and the export only allows for one month.

So, the answer to the question of why you’d use an API is: it gives you access to data that is otherwise unavailable or inefficiently available.

Wait, What Is an API?

Technically, it's an Application Program Interface, but I think of it as a soul sister of the ATM. The typical ATM allows you to deposit funds, withdraw funds, and get information about your accounts, provided you have a debit card and know your PIN. An API allows you to post data (deposit) and fetch data (balance inquiry) to and from endpoints (bank accounts), provided you have an API key (debit card and PIN).

As a data analyst, my interactions with APIs are mainly on the fetching side. In this post, I'll be walking you through fetching the authors of the most-shared articles on the New York Times, using the NYT's Most Popular API.

Fetching Data from an API, Step 1: Obtain an API Key

An API key is typically a long string of numbers and letters that is issued to you by the application’s creators. It gives you varying degrees of access to the application’s data, with fetching (or making a GET request, in API parlance) being the lowest/most easily obtained level of access. The process of API key issuance differs from one API to another; to access the New York Times’ Most Popular API, you just need to register with their developer network, name your quote-un-quote application (in this case, just a script that fetches data), and select the correct API checkbox, and presto, the NYT will issue you a key.


Fetching Data from an API, Step 2: Create a Get Request

Once you have your API key, you can use it to request data from the API, much as you might insert your debit card into an ATM and request balance info. API requests are made to the API's URI, and typically contain the endpoint (specific data type), API key, and parameters like date that allow you to drill into a subset of the endpoint's data. Requests to the NYT's Most Popular API take the following format: http://api.nytimes.com/svc/mostpopular/v2/[mostshared|mostemailed|mostviewed]/[all-sections or a specific section]/[days back: 1, 7, or 30].json?api-key=[your key].

In your case, you want the most shared endpoint, and you want a month of data, so your request will look like this: http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key].

Put that into your url bar and see what gets returned. You should see a bunch of json objects, each containing metadata associated with one of the most shared articles.

Fetching Data From an API, Step 3: Get and Store the Response Data

Sure, you could copy all that json you see below the URL bar into a text document and manually locate the information you want (the authors), but that would be boring, time-consuming, and error-prone. Instead, you'll make the request programmatically and store the response data in an object you'll deconstruct in step 4.

You can use any programming language you like to interact with APIs, but this example uses python and its requests library, mainly because python is really handy for analysis. Prereqs for this step are:

  1. a text editor (I like Sublime Text)
  2. python (ships with the default build on Macs)
  3. admin access to your computer
  4. pip (or your python package installer of choice)
  5. the requests library (which you'll use to hit the Most Popular API's most shared endpoint)

Provided you have admin access, you can get no. 4 by following these instructions. Once you have no. 4, you can get no. 5 by opening Terminal (Spotlight → Terminal) and typing "pip install requests."

Prereqs all taken care of? Open a new file in Sublime Text and save it as "most_popular.py". This will be the python script that fetches and formats the most shared articles.

First, import the needed modules:

  • import requests
  • import json
  • import csv

Next, create the request and response objects:

pop_request = 'http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key]'
pop = requests.get(pop_request)

"pop" holds the entire response, but if you print its variables (print vars(pop)), you'll see that what you're really after is the body of the response, specifically its "results" field. Store the body in its own variable and convert it from JSON into a dictionary:

pop_content = pop._content
pop_content = json.loads(pop_content)

Fetching Data from an API, Step 4: Extract the Necessary Data and Save It to a File

Now that you have your most shared article data stored in a big ol’ object, you’re going to parse it and extract what you need. First, in the python interpreter, you can take a closer look at pop_content to find the names of the fields you’re interested in:

import requests
import json
import csv
pop_request = 'http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key]'
pop = requests.get(pop_request)
pop_content = pop._content
pop_content = json.loads(pop_content)

for key, value in pop_content.items():
  if (key == "results"):
    for article in value:
      for field_name,field_value in article.items():
        print field_name

As the goal of this exercise is to find which authors contribute the most most-shared articles, you're going to want the "byline" field for sure, along with the "published_date" field and the "total_shares" field — though the latter is only useful for inference, as it doesn't actually give share counts, only where the article ranks in the top 20.

Create an empty list for each of the three useful keys. To fill them, you'll run through the results list within the pop_content dictionary, save each value that corresponds to a byline, published_date, or total_shares field to a variable, and append that variable to its corresponding list.

dater = []
shares = []
authors = []
for k, v in pop_content.items():
  if (k == "results"):
    for a in v:
      pubdate = a["published_date"]
      dater.append(pubdate)
      author = a["byline"]
      authors.append(author)
      sharecount = a["total_shares"]
      shares.append(sharecount)

In the python IDE, you can inspect the contents of each list if you like, by typing “print [list name]”.

The final part of this process is exporting your cleaned and extracted data to a csv file for easypeasy analysis (later, I’ll go through how to analyze this data within the same python script as well as how to push it to a database and analyze it with SQL).

First, create the empty csv file and column headers:

csv_out = open('nyt_mostpop_week.csv', 'wb')
mywriter = csv.writer(csv_out)
mywriter.writerow(['byline', 'published_date', 'total_shares'])

Then stitch the lists together, write each row to your csv object, and close the csv file.

for row in zip(authors,dater,shares):
 mywriter.writerow(row)
csv_out.close()

Now, open Terminal, cd into the folder you saved your most_popular.py script to, and run it ("python most_popular.py"). In Finder, go to that same folder and presto, you'll see your new csv file.

As the file only has 20 rows, determining the author with the most appearances can be accomplished with a simple sort (or the FREQUENCY function). In my case, pulling data between December 19th, 2015 and January 18th, 2016, I get only one author appearing twice: Mr. Noam Scheiber. Congratulations, Noam!
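(If your result set were bigger than 20 rows, you could also skip the spreadsheet and count the bylines at the end of the same script. A quick sketch, reusing the authors list built above:)

from collections import Counter

byline_counts = Counter(authors)
print byline_counts.most_common(5)  # the five most frequent bylines and their counts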

The complete script for accessing NYT’s most shared articles can be viewed and downloaded here.

In chapter 3 of this series, I’ll address more sophisticated methods of data storage.

 


So You Want to Be a Data Analyst, Chapter 1: What You Need to Know If You Want to Work with Data and Are Not a Computer Scientist, Statistician, or Mathematician

A few years ago, I began the process of changing my career from marketing to data science. The company I was working for formed a sort of internal innovation lab to work on big data products; when I joined it, I stopped marketing enterprise reporting solutions for .NET developers and started marketing an email application that used machine learning to intelligently prioritize inboxes. What this meant, in practice, was that I got to abandon the arcane and chubby depths of the Microsoft BI stack for the open waters of text mining and AI and personalized recommendations, all of it built on reams and reams of data being chunked up and digested in parallel by hundreds of servers. I didn’t understand any of it any better than I understood the Microsoft stuff, but I wanted to.


The path from nascent, toodling interest in big data to employment as a data analyst at a large media company was not exactly direct, and the knowledge that I acquired along the way is not exactly comprehensive (if I'm being generous, I have maybe 1/5th the skillset of a data scientist). It worked, though, because while there are quite a lot of companies today that are generating or are capable of generating massive amounts of data, there are comparatively few people capable of making sense of that data.

For the rest of this series, I'll be laboring under the assumption that you want to become one of those comparatively few. In each chapter, I'll explain one of the four aspects of quote-unquote "big data" analysis that I've found necessary to being an effective analyst. By the end, you should be able to, at a basic level, collect, organize, analyze, and visualize data. But first, I want you to know why each of these functions matters.


Let’s go back to the penultimate sentence of the last paragraph: you, dear budding data analyst, should be able to collect data, organize data, analyze data, and visualize data. I fumbled along for quite some time before I realized that. For a while, I thought that in order to get an actual data job, I would need to be capable of building predictive and recommendation systems. And it’s true that a fair number of people do spend their time building one or both of these things (typically, these are the data scientists). But here’s another (not sexy enough to get written about) truth: for most companies, predictions and recommendations are the cherry on a sundae they haven’t yet bought. That is to say, in order to get to the point where predictions or recommendations are useful, let alone critical, a company first needs to a) have a lot of users, b) generate a lot of user data, c) store that data in a way that it can be accessed and analyzed, d) employ people who can clean and analyze and report on that data. Plenty of companies don’t even have a, and if they have a, they don’t have b, and they certainly don’t have c or d. And that is where you, with your new abilities to (say it with me) collect, organize, analyze, and visualize data, come in.

What You Need to Know Before You Begin and What You’ll Come to Know as You Learn


Before you start dealing with big data sets, you should be comfortable getting basic summaries from small ones. E.g., given a table of newsletter opens and clicks for the past year, you should be able to get the monthly average, median, and standard deviation of opens, clicks, and click rates (the latter two illustrate how regular your data is, or isn't). In general, you should know what a table of data looks like. An understanding of Excel functions like SUMIF/AVERAGEIF, FREQUENCY, and the LOOKUP family will be helpful as well, as you'll be essentially replicating them in SQL and/or R.
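To make that concrete, here's a toy sketch of the kind of summary I mean (made-up numbers, plain Python):

# twelve made-up months of newsletter opens
opens = [1200.0, 980.0, 1105.0, 1310.0, 1255.0, 990.0,
         1040.0, 1500.0, 1380.0, 1220.0, 1175.0, 1290.0]
mean = sum(opens) / len(opens)
ordered = sorted(opens)
median = (ordered[5] + ordered[6]) / 2  # middle two of the twelve months
stdev = (sum((x - mean) ** 2 for x in opens) / (len(opens) - 1)) ** 0.5
print mean, median, stdev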

True (predictive) data science requires a solid grasp of calculus and statistics, as well as basic computer programming. Data analysis, however, requires only that you be willing to learn a bit of calculus (slope of a curve) and a bit of statistics (summary and significance of your results), and a bit of computer programming (interacting with APIs and cleaning the data those APIs return) — emphasis on the “willing to learn” bit. The following is a sample skillset you’ll amass as you become a data analyst.

  1. Collection
    1. write programmatic requests to specific API endpoints
      1. knowing what all of the above means!
  2. Organization
    1. parse the data object returned from the API into a structure that you can push out into a csv
    2. or shove into a table in a database
    3. setting up said table in said database
  3. Analysis
    1. use Excel, R, or SQL to obtain aggregate, trended, and/or summary statistics on a filtered dataset
  4. Visualization
    1. using Python, R, Tableau, or Google Charts to graph or otherwise visually represent the output of your query

 

Sound good? Continue on to Chapter 2: An API Is Just an ATM in The Sky
