So You Want to Be a Data Analyst, Part 4: Visualizing Your Data with D3js

A picture is worth 1,000 words, or so throw pillows tell me. A good data visualization can be worth millions of rows (or thousands, or billions, even, depending on the size of the data you’re illustrating). Conversely, a bad data visualization, put into the wrong hands, can be just as harmful as a photoshopped Sandy doomsday photo. The previous two posts in this series have addressed acquiring and making sense of your data; today, I’ll take you through the process of turning that data into a visual story using the D3js library. First, though, I want to cover the reasons why you’d want or need to visualize your data, and explain when it makes sense to create the visualization programmatically instead of using a ready-made chart builder.

Talk Graphics to Me

A graphic/viz/chart is to data what symbols are to words: a succinct means of conveying a large amount of information. Like symbols, visualizations tend to be more eye-catching than the data they encapsulate. Use them when you’re reporting on data that has more than one facet (e.g., quantities over time, quantities by category, or quantities by category over time).

Likely, you’ve created data visualizations before, in our old pal MS Excel (or, if you’re like me and LOATHE Excel, Google Spreadsheets). Common viz include bar charts (horizontal and vertical — good for showing a breakdown of quantities by category), line charts (good for showing change in one or more categories over time), scatter plots (good for showing the relationship between two quantities), and pie charts (good for nothing. JK — they’re good for showing breakdowns of quantities by category, provided you have a small number of categories).

If your data is already in a spreadsheet, using a built-in chart to summarize it in one of the ways discussed above is totally fine. I use Google Spreadsheets’ built-in charts often — they have a decent amount of customization, and the added bonus of being embeddable. So when should you use something else?

Taking Measures into Your Own Hands

  1. The first reason you might want to create your own data visualization relates to what I said about Google Spreadsheets being embeddable: shareability. No one should ever have to open an Excel attachment (or any attachment) to view a chart. You can insert your charts in the body of your email, but it’s nice, I think, to have a backup, especially if that backup is interactive (so really, the inserted image is the sneak preview). Also, if you have a data blog or some sort of internal company site, you can slap your custom-made data visualization in there and–provided you’ve styled it right–not have to worry about it breaking on your CEO’s iPad.
  2. The second reason to DIY is customization. Your dataset might not fit in Excel’s default charts, or maybe it does, but you want to show it off in something nicer. A little couture never hurt anyone. D3 shines here: once you get good at it, you can visualize so many data structures (for some great examples, check out Kantar’s Information Is Beautiful 2015 finalists, or the New York Times’ Upshot blog).
  3. The third reason is that your data isn’t in a spreadsheet, and you don’t want to export it into one. This is the dashboard scenario: hook your visualization straight into your database, schedule your query so that the data refreshes, and let ‘er rip.
  4. The fourth reason is replicability. Chances are, you’re going to be reporting on the same or similar data (or data with the same or similar structure) more than once. If you create your visualization layer programmatically, you can easily swap in new data for the old data, et voila. To create the D3 example below, I modified a script I’d already created for the Met Gala. The original took me well over an hour to create. This version took maybe ten minutes.

<Caveat Lector>

People not on the creation side of visualizations tend to trust the result implicitly, and this trust can be deliberately or mistakenly abused. The best way to prevent bad visualizations is to understand your data, and its conclusions, before you translate them. Where possible, you should provide your dataset along with your visualization, so that your audience can a) pursue their own deeper understanding of the data and b) validate your work.

</Caveat>

Time to Bind

D3 is certainly not the only JavaScript library designed to bind data to the DOM, but it’s one of the best (IMHO), buzziest (indisputable), and most-used. It was created by Mike Bostock, who went on to become a graphics editor at the New York Times. Remember when the NYT started gathering acclaim for its visual articles like “512 Paths to the White House”? That was Bostock’s doing (though the NYT as a whole deserves a lot of kudos for supporting a very different means of information display long before anyone else).

I’m not going to spend too much time explaining D3 upfront, because others, including Bostock himself, have done so extensively. Here’s what you need to know before we get graphing:

  • D3.js is a JavaScript library that allows you to attach data to the DOM of a web page.
  • Once attached, that data can be manipulated by manipulating the DOM elements it’s bound to.

Okay, let’s get crackin’. Today, you’re going to create a bubble graph that shows the wax and wane of music genres’ popularity over time. The dataset you’ll be using contains information about music genre popularity in the United States between 1960 and 2010. The popularity of a given genre is simply a sum of the times it appeared in the Billboard Hot 100 in a given year. This dataset was created by aggregating Matthias Mauch’s song-level data, which he used in “Evolution of Popular Music: USA 1960-2010.” You can download the data, as well as all of the visualization code, on Github.
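
To make the "sum of the times it appeared" idea concrete, here is a rough sketch of that kind of aggregation in plain JavaScript. The song rows and their field names are made up for illustration (they are not Mauch's actual columns), and this is not the script used to build the real dataset.

```javascript
// Hypothetical song-level rows: one entry per Hot 100 appearance.
// Field names are assumptions, not the real column names.
var songs = [
  { year: 1960, genre: "SOUL" },
  { year: 1960, genre: "SOUL" },
  { year: 1960, genre: "POP" },
  { year: 1961, genre: "POP" }
];

// Count appearances per (year, genre) pair.
function scoreByYearAndGenre(rows) {
  var counts = {};
  rows.forEach(function (row) {
    var key = row.year + "|" + row.genre;
    counts[key] = (counts[key] || 0) + 1;
  });
  // Turn the lookup back into year/genre/score rows.
  return Object.keys(counts).map(function (key) {
    var parts = key.split("|");
    return { year: +parts[0], genre: parts[1], score: counts[key] };
  });
}

var scores = scoreByYearAndGenre(songs);
console.log(scores);
```

The output has the same three columns (year, genre, score) as the file you'll load later.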

Step 1: Get your data.

Honestly, this can be the hardest part, depending on how you approach D3, and also what kind of data you’re working with. In an ideal world, you already have access to the data you need, and you load it, untouched, into D3 and start manipulating. If replicability matters to you and/or you’re building a dashboard, you’re going to want to do as much of the manipulation directly in D3 as possible. That being said, when you’re starting out, you’re probably going to find yourself doing the following:

  1. Deciding, from the shape of your result set, which visualization(s) would best explain it.
  2. Finding an example D3 visualization online.
  3. Realizing that the data it relies upon is either hard-coded or shaped differently from yours.
  4. Retrofitting your data to resemble the example data.

This is okay, in the beginning. As you learn D3, you’ll become more comfortable with transforming your data into the shape you need in D3, using D3’s built-in nest and mapping functions. Today, however, you’re going to do a minimal amount of manipulation in D3 because the music data you’ll be using is already in the necessary format.
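
If you do end up retrofitting, the most common chore is turning "wide" data (one column per category) into "long" data (one row per category per date). Here is a minimal sketch in plain JavaScript, assuming a hypothetical wide table with one column per genre; the column names and numbers are invented for the example.

```javascript
// Hypothetical wide rows, as a CSV loader might hand them over (all strings).
var wideRows = [
  { year: "1960", POP: "30", SOUL: "12" },
  { year: "1961", POP: "28", SOUL: "15" }
];

// Emit one { year, genre, score } row for every non-id column.
function wideToLong(rows, idColumn) {
  var longRows = [];
  rows.forEach(function (row) {
    Object.keys(row).forEach(function (column) {
      if (column === idColumn) { return; }
      longRows.push({
        year: row[idColumn],
        genre: column,
        score: +row[column] // coerce the string to a number
      });
    });
  });
  return longRows;
}

var longRows = wideToLong(wideRows, "year");
console.log(longRows);
```

Two wide rows with two genre columns become four long rows, which is the shape the bubble chart in this post expects.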

Let’s look at the head of the data:

"year","genre","score"
1960,"ALTERNATIVE",1
1961,"ALTERNATIVE",0
1962,"ALTERNATIVE",0

Big surprise: there was not much alternative music in 1960! Shall we check out the tail?

2004,"SOUL",23
2005,"SOUL",21
2006,"SOUL",25

See, Dad — soul didn’t die in the seventies! (That would be Motown.)

The point of all this inspecting is to know the shape of your data: three columns, one a date, one a category, and one an integer. As you start doing more D3, you’ll want to compare your data to the data in the examples. One way or another, yours will need to match the example shape before you start drawing your visualization.

Step 2: Set up your html, css, and D3 scaffold.

By which I mean, a basic html web page with a <body> tag to bind to, css to dictate any over-arching styles, and the start of your javascript, which will contain your visualization’s width, height, and margins. Typically, you’ll break these out into their own files, but for brevity’s sake here, I’m going to put everything into the HTML.

<!DOCTYPE html>
 <meta charset="utf-8">
 <style>

body {
 font: 12px sans-serif;
 }

.axis path,
 .axis line {
 fill: none;
 stroke: grey;
 shape-rendering: crispEdges;
 }

.dot {
 stroke: none;
 fill: steelblue;
 }

.grid .tick {display: none;}

.grid path {
 stroke-width: 0;
 }

div.tooltip {
 position: absolute;
 text-align: center;
 width: 80px;
 height: 42px;
 padding: 2px;
 font: 12px sans-serif;
 background: #ddd;
 border: solid 0px #aaa;
 border-radius: 8px;
 pointer-events: none;
 }

</style>
 <body>
 <script src="http://d3js.org/d3.v3.min.js"></script>
 <script>

 //We kick off any d3 javascript with basic sizing
 var margin = {top: 100, right: 20, bottom: 30, left: 100},
 width = window.innerWidth - margin.left - margin.right,
 height = 500- margin.top - margin.bottom;

Step 3: Add your data stipulations.

By stipulations, I mean the specific properties of your dataset. Since this dataset contains time data, you’ll want a parse date function that tells D3 “hey, this column is a date and it’s in this format.” You might want a format date function so that you can control what the date displayed on your axis looks like. We’re going to be drawing a categorical bubble chart on a time-series, so you’ll need a time-scaled X axis and an ordinal (non-linear/aka non-quantitative) Y axis. You’ll also want a color scale to differentiate between the genres. All these javascript stipulations will look like this:

//if we have date data, we need to tell d3 how to read it.
 //in the music data case, all we have is the year
 var parsedate = d3.time.format("%Y").parse;
 var formatDay_Time = d3.time.format("%Y");

//we're going to want to color our scores based on the genre. Here, we're just setting a color variable.
 var color = d3.scale.category10();
 //We need to tell d3 what kind of axes we want.
 //In the music data case, we want our dates on the x axis and our genres on the y.
 //Because the genres are categorical, we will explicitly tell d3 how far apart to space them, using rangeRoundBands
 var x = d3.time.scale().range([0, width]);
 var y = d3.scale.ordinal()
 .rangeRoundBands([0, height], .1);

//tell d3 to orient the y axis on the lefthand side of the graph
 var yAxis = d3.svg.axis()
 .scale(y)
 .outerTickSize(0)
 .innerTickSize(0)
 .orient("left");

//put the x axis on the bottom
 var xAxis = d3.svg.axis()
 .scale(x)
 .orient("bottom");
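
If scales feel abstract, it can help to see what they boil down to: functions that take a data value and return a pixel position. The sketch below uses simplified stand-ins written in plain JavaScript. They are not D3's implementations (the band version ignores the padding and rounding that rangeRoundBands applies), and the function names are my own.

```javascript
// Simplified stand-in for a time scale: dates in, pixels out.
function makeTimeScale(startDate, endDate, width) {
  var start = startDate.getTime();
  var span = endDate.getTime() - start;
  return function (date) {
    return ((date.getTime() - start) / span) * width;
  };
}

// Simplified stand-in for an ordinal band scale: each category gets an
// equal slice of the height, and the scale returns where its slice starts.
function makeBandScale(names, height) {
  var band = height / names.length;
  return function (name) {
    return names.indexOf(name) * band;
  };
}

var x = makeTimeScale(new Date(1960, 0, 1), new Date(2010, 0, 1), 800);
var y = makeBandScale(["POP", "ROCK", "SOUL"], 300);

console.log(x(new Date(1960, 0, 1))); // 0
console.log(x(new Date(2010, 0, 1))); // 800
console.log(y("ROCK"));               // 100
```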

Step 4: Create a visualization object and attach it to the DOM.

The visualization object is an SVG, short for “Scalable Vector Graphics.” This is the thing onto which your data will attach and come to life. Declare and attach it to your <body> tag like this:

var svg = d3.select("body")
 .append("svg")
 .attr("width", width + margin.left + margin.right)
 .attr("height", height + margin.top + margin.bottom)
 .append("g")
 .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
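
The margin arithmetic is easier to trust once you've run the numbers. Here's a quick sketch in plain JavaScript, assuming a browser window 960 pixels wide (a made-up figure standing in for window.innerWidth).

```javascript
// Hypothetical outer size: 960 stands in for window.innerWidth.
var outerWidth = 960;
var outerHeight = 500;
var margin = { top: 100, right: 20, bottom: 30, left: 100 };

// The drawing area is whatever is left once the margins are taken out.
var width = outerWidth - margin.left - margin.right;   // 840
var height = outerHeight - margin.top - margin.bottom; // 370

// The translate() on the <g> shifts the origin, so a point drawn at (0, 0)
// inside the group lands at (margin.left, margin.top) on the page.
var originOnPage = { x: margin.left + 0, y: margin.top + 0 };

console.log(width, height, originOnPage);
```

The svg element itself gets the full outer size back (width plus the horizontal margins, height plus the vertical ones), which is why those sums show up in the attr calls.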

Step 5: Load your data and attach it to your SVG.

You can hard code your data, but that’s super onerous and also not replicable, so I’d advise against it. Instead, use one of D3’s loading methods. In our case, we’ll use d3.csv, because our data is in a csv format. Once we load it, we loop through it and format it correctly.

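d3.csv("data/history_of_music.csv", function(error, data) {
 //stop here if the file couldn't be loaded
 if (error) throw error;

 //parse each year into a date and coerce each score from a string to a number
 data.forEach(function(d) {
   d.year = parsedate(d.year);
   d.score = +d.score;
 });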
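
Why the formatting loop? Values read from a csv arrive as strings, and strings misbehave the moment you do math on them. The sketch below shows the problem with a tiny hand-rolled parser in plain JavaScript. It's a toy for illustration (it doesn't handle quoted fields or embedded commas) and not a substitute for d3.csv.

```javascript
// A two-row sample in the same shape as the music data (quotes omitted).
var csvText = "year,genre,score\n1960,ALTERNATIVE,1\n1961,ALTERNATIVE,0";

// Toy parser: split on newlines and commas, keep every cell as a string.
function parseSimpleCsv(text) {
  var lines = text.trim().split("\n");
  var header = lines[0].split(",");
  return lines.slice(1).map(function (line) {
    var cells = line.split(",");
    var row = {};
    header.forEach(function (name, i) { row[name] = cells[i]; });
    return row;
  });
}

var rows = parseSimpleCsv(csvText);
var beforeCoercion = rows[0].score + rows[1].score; // "10": string concatenation

// Same idea as the loop in the post: dates become Dates, scores become numbers.
rows.forEach(function (d) {
  d.year = new Date(+d.year, 0, 1);
  d.score = +d.score;
});
var afterCoercion = rows[0].score + rows[1].score; // 1: numeric addition

console.log(beforeCoercion, afterCoercion);
```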

 

Step 6: Define/create your axes

 

The domain methods tell D3 which data should be attached to your axes. In our case, the d.year data goes to the x axis, the d.genre goes to the y axis, and the d.score dictates the radii of the bubbles. Defining the x axis is easy:

// Set our x axis domain to run from just before the earliest year to the end of the latest
 //(JavaScript months are zero-based: 0 is January, 11 is December)
 x.domain([new Date(1959, 0, 1), new Date(2009, 11, 1)]);
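
A word of warning about hand-written dates: JavaScript's Date constructor counts months from zero, and it quietly rolls over instead of complaining when you overshoot. A quick sketch you can paste into a console:

```javascript
// Month 1 is February, not January.
var feb1959 = new Date(1959, 1, 1);

// Month 12 doesn't exist, so the date rolls over into the next year.
var rolledOver = new Date(2009, 12, 1);

console.log(feb1959.getMonth());       // 1
console.log(rolledOver.getFullYear()); // 2010
console.log(rolledOver.getMonth());    // 0
```

For a domain that only needs a little breathing room on either side, an off-by-one month is harmless, but it will bite you on charts with finer-grained dates.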

To define the y axis, we want the unique genre names. We can get them by looping through all of the genres and appending each new one to an array.

//Now we'll create an array of unique genres, which we'll use to create our y axis
 //to do this, we just loop through the genres data and append each name
 //that doesn't already exist in the array
 var genres_filtered = [];

 data.forEach(function (d) {
   if (genres_filtered.indexOf(d.genre) === -1) {
     genres_filtered.push(d.genre);
   }
 });

 //add the genre names to the y axis
 y.domain(genres_filtered);

 //color our bubbles based on the genre names
 color.domain(genres_filtered);
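
Here's the same de-duplication loop run on a few sample rows, so you can see what comes out. The sample rows are invented. In browsers that support it, a Set gets you the same answer in one line; that's an alternative I'm offering, not what the original script does.

```javascript
var sample = [
  { genre: "ALTERNATIVE" },
  { genre: "ALTERNATIVE" },
  { genre: "SOUL" },
  { genre: "ALTERNATIVE" }
];

// Keep each genre the first time we see it, in order of appearance.
function uniqueGenres(rows) {
  var seen = [];
  rows.forEach(function (d) {
    if (seen.indexOf(d.genre) === -1) {
      seen.push(d.genre);
    }
  });
  return seen;
}

var viaLoop = uniqueGenres(sample);

// Alternative: a Set also keeps first-seen order.
var viaSet = Array.from(new Set(sample.map(function (d) { return d.genre; })));

console.log(viaLoop); // ["ALTERNATIVE", "SOUL"]
console.log(viaSet);  // ["ALTERNATIVE", "SOUL"]
```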

To define the radius scale r, we map scores (from zero up to the maximum score) onto radii between zero and the chart height divided by the number of genres, so that no bubble will be too big.

//Now we set the radius to be between zero and the height divided by the count of unique genres
 var r = d3.scale.linear().range([0, height / genres_filtered.length]);
 r.domain([0, d3.max(data, function(d) { return d.score})]);
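
To put numbers on that, here's the radius math worked through in plain JavaScript. The height of 370 comes from the sizing in Step 2 and the ten genres match this dataset, but the maximum score of 50 is a made-up figure, and the scale function is a simplified stand-in for d3.scale.linear.

```javascript
var height = 370;     // 500 minus the top and bottom margins
var genreCount = 10;  // ten genres in this dataset
var maxScore = 50;    // hypothetical: the real maximum comes from d3.max

// Largest radius allowed: one row's worth of height.
var maxRadius = height / genreCount; // 37

// Simplified linear scale: score in, radius out.
function r(score) {
  return (score / maxScore) * maxRadius;
}

console.log(r(0));  // 0
console.log(r(25)); // 18.5
console.log(r(50)); // 37
```

One thing worth noticing: a bubble at the maximum score has a diameter of two rows, so the very biggest bubbles will spill into their neighbors. If that bothers you, halving the top of the range would keep every bubble inside its own row.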

Last, we'll append our x and y axes to the SVG using the call function.

//actually append our y axis to the svg 
 svg.append("g")
 .attr("class", "y axis")
 .call(yAxis)
 .selectAll("text")
 .attr("dx", 0)
 .attr("dy",-5)
 .style("text-anchor", "end");

//append our xaxis to the svg
 svg.append("g")
 .attr("class", "x axis")
 .attr("transform", "translate(0," + height + ")")
 .call(xAxis);

Step 7: Draw the bubbles.

Here’s the fun part! To draw the bubbles, we’ll use the year and genre to place them on the axes, the score to size them, and genre again to color them. We’ll also add a little mouseover action to display the year, genre, and sum of songs for that year.

// attach our tooltip
var div = d3.select("body").append("div")
 .attr("class", "tooltip")
 .style("opacity", 1e-6);
//draw our circles based on the scores, and attach them to the svg
svg.selectAll(".dot")
 .data(data)
.enter().append("circle")
 .attr("class", "dot")
 .attr("r", function(d) { 
 return r(d.score); 
 })
 .style("opacity", 0.25)
 .attr("cx", function(d) { return x(d.year); })
 .attr("cy", function(d) { return y(d.genre); })
 .style('fill', function (d) { return color(d.genre); })
 .on("mouseover", function(d) {
 div.transition()
 .duration(200)
 .style("opacity", .7);
 div .html(
 formatDay_Time(d.year) + "<br/>" +
 d.genre + "<br/>" + 
 d.score) 
 .style("left", (d3.event.pageX) + "px")
 .style("top", (d3.event.pageY - 42) + "px");
 }) 
 .on("mouseout", function(d) {
 div.transition()
 .duration(500)
 .style("opacity", 1e-6);
 });
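
If the data/enter/append chain reads like an incantation, think of it as a map: every row of data becomes one circle, and each attr call computes one property of that circle from the row. Here's a rough plain-JavaScript sketch of that idea, with toy scale functions standing in for the real ones. It's a mental model, not how D3 works under the hood.

```javascript
var rows = [
  { year: 1960, genre: "POP", score: 10 },
  { year: 1961, genre: "SOUL", score: 20 }
];

// Toy scales, invented for the example.
var scales = {
  x: function (year) { return (year - 1960) * 16; },
  y: function (genre) { return genre === "POP" ? 0 : 37; },
  r: function (score) { return score / 2; },
  color: function (genre) { return genre === "POP" ? "steelblue" : "orange"; }
};

// One circle description per data row.
function toCircles(data, scales) {
  return data.map(function (d) {
    return {
      cx: scales.x(d.year),
      cy: scales.y(d.genre),
      r: scales.r(d.score),
      fill: scales.color(d.genre)
    };
  });
}

var circles = toCircles(rows, scales);
console.log(circles);
```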

Lastly, we’ll append a title.

svg.append("text")
 .attr("x", (width / 2)) 
 .attr("y", 0 - (margin.top / 2)) 
 .attr("text-anchor", "middle") 
 .style("font-size", "16px") 
 .text("Are We Human or Are We Dancer: the Rise and Fall of Musical Genres, 1960 - 2009, per Volume of Billboard Hot 100 Hits");

});
</script>

The full page with the script will look like this:

<!DOCTYPE html>
<meta charset="utf-8">
<style>

body {
 font: 12px sans-serif;
}

.axis path,
.axis line {
 fill: none;
 stroke: grey;
 shape-rendering: crispEdges;
}

.dot {
 stroke: none;
 fill: steelblue;
}

.grid .tick {display: none;}

.grid path {
 stroke-width: 0;
}

div.tooltip {
 position: absolute;
 text-align: center;
 width: 80px;
 height: 42px;
 padding: 2px;
 font: 12px sans-serif;
 background: #ddd;
 border: solid 0px #aaa;
 border-radius: 8px;
 pointer-events: none;
}

</style>
<body>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>

//We kick off any d3 javascript with basic sizing
var margin = {top: 100, right: 20, bottom: 30, left: 100},
    width = window.innerWidth - margin.left - margin.right,
    height = 500 - margin.top - margin.bottom;

//if we have date data, we need to tell d3 how to read it.
//in the music data case, all we have is the year
var parsedate = d3.time.format("%Y").parse;
var formatDay_Time = d3.time.format("%Y");

//we're going to want to color our scores based on the genre. Here, we're just setting a color variable.
var color = d3.scale.category10();

//We need to tell d3 what kind of axes we want.
//In the music data case, we want our dates on the x axis and our genres on the y.
//Because the genres are categorical, we will explicitly tell d3 how far apart to space them, using rangeRoundBands
var x = d3.time.scale().range([0, width]);
var y = d3.scale.ordinal()
    .rangeRoundBands([0, height], .1);

//tell d3 to orient the y axis on the lefthand side of the graph
var yAxis = d3.svg.axis()
    .scale(y)
    .outerTickSize(0)
    .innerTickSize(0)
    .orient("left");

//put the x axis on the bottom
var xAxis = d3.svg.axis()
    .scale(x)
    .orient("bottom");

//here we create our graph object, attach it to the body element of our html
//and append the sizing attributes we specified earlier
var svg = d3.select("body")
    .append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

//Now, we read in our music data.
//We'll parse our dates and tell d3 that our scores are numeric
d3.csv("data/history_of_music.csv", function(error, data) {
  //stop here if the file couldn't be loaded
  if (error) throw error;

  data.forEach(function(d) {
    d.year = parsedate(d.year);
    d.score = +d.score;
  });

  //check to make sure the data was read in correctly
  console.log(data);

  // Set our x axis domain to run from just before the earliest year to the end of the latest
  //(JavaScript months are zero-based: 0 is January, 11 is December)
  x.domain([new Date(1959, 0, 1), new Date(2009, 11, 1)]);

  //Now we'll create an array of unique genres, which we'll use to create our y axis
  //to do this, we just loop through the genres data and append each name
  //that doesn't already exist in the array
  var genres_filtered = [];

  data.forEach(function (d) {
    if (genres_filtered.indexOf(d.genre) === -1) {
      genres_filtered.push(d.genre);
    }
  });

  //Now we set the radius to be between zero and the height divided by the count of unique genres
  var r = d3.scale.linear().range([0, height / genres_filtered.length]);
  r.domain([0, d3.max(data, function(d) { return d.score; })]);

  //add the genre names to the y axis
  y.domain(genres_filtered);

  //color our bubbles based on the genre names
  color.domain(genres_filtered);

  //actually append our y axis to the svg
  svg.append("g")
      .attr("class", "y axis")
      .call(yAxis)
    .selectAll("text")
      .attr("dx", 0)
      .attr("dy", -5)
      .style("text-anchor", "end");

  //append our xaxis to the svg
  svg.append("g")
      .attr("class", "x axis")
      .attr("transform", "translate(0," + height + ")")
      .call(xAxis);

  // attach our tooltip
  var div = d3.select("body").append("div")
      .attr("class", "tooltip")
      .style("opacity", 1e-6);

  //draw our circles based on the scores, and attach them to the svg
  svg.selectAll(".dot")
      .data(data)
    .enter().append("circle")
      .attr("class", "dot")
      .attr("r", function(d) { return r(d.score); })
      .style("opacity", 0.25)
      .attr("cx", function(d) { return x(d.year); })
      .attr("cy", function(d) { return y(d.genre); })
      .style("fill", function(d) { return color(d.genre); })
      //show the tooltip with the year, genre, and score on hover
      .on("mouseover", function(d) {
        div.transition()
            .duration(200)
            .style("opacity", .7);
        div.html(
              formatDay_Time(d.year) + "<br/>" +
              d.genre + "<br/>" +
              d.score)
            .style("left", (d3.event.pageX) + "px")
            .style("top", (d3.event.pageY - 42) + "px");
      })
      //fade the tooltip back out when the mouse leaves
      .on("mouseout", function(d) {
        div.transition()
            .duration(500)
            .style("opacity", 1e-6);
      });

  // Add the title
  svg.append("text")
      .attr("x", (width / 2))
      .attr("y", 0 - (margin.top / 2))
      .attr("text-anchor", "middle")
      .style("font-size", "16px")
      .text("Are We Human or Are We Dancer: the Rise and Fall of Musical Genres, 1960 - 2009, per Volume of Billboard Hot 100 Hits");

});

</script>

CONGRATULATIONS! A dataviz iz here! Though, if you’ve never built a web page with javascript before, you might be like…where? To view this guy, I recommend using the very lightweight SimpleHTTPServer. It ships with Python 2 (in Python 3, the equivalent module is http.server), so there’s nothing extra to install. Just make sure you’re in the same folder as your viz file when you launch the server with python -m SimpleHTTPServer, then open localhost:8000 in your browser. Once you’ve done that, you should see:

[Screenshot: the finished bubble chart]

This post isn’t really about analyzing this particular dataset, but there are some trends that stand out to me. Namely, that pop peaked in the mid-80s, country didn’t pick up much mainstream steam until the late nineties, and rap has been waning as hiphop has risen. Also, I should say that these ten genres were the ten most popular genres out of a much larger genre set, so lots of the more interesting ones, like New Wave and Postpunk, are not represented here.

Anyways, I’m going to rap this up because it’s getting kinda Joycean in its length (though hopefully not in its parseability!). The code, including that for transforming Mauch’s data into the needed shape, is on Github. Have at it!
