So You Want to Be a Data Analyst, Chapter 2: The API (Is Just an ATM in The Sky)

Before you can extract any insights from data, you need to have the data in the first place. If you're lucky, you have the data sitting pretty in a database, or you can easily export it from its source. If you don't, however, you can hit up an API to get it. You may also want to use an API even if an export function is available, for efficiency's sake: if, for example, you're pulling data from multiple Facebook pages, or you want three months' worth of data and the export only allows for one month.

So, the answer to the question of why you’d use an API is: it gives you access to data that is otherwise unavailable or inefficiently available.

Wait, What Is an API?

Technically, it's an Application Programming Interface, but I think of it as a soul sister of the ATM. The typical ATM allows you to deposit funds, withdraw funds, and get information about your accounts, provided you have a debit card and know your PIN. An API allows you to post data to endpoints (deposit) and fetch data from them (balance inquiry), provided you have an API key (debit card and PIN).
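To make the analogy concrete, here's roughly what those two operations look like in Python's requests library (which you'll install in Step 3). The https://api.example.com address is just a hypothetical stand-in:

import requests

# Balance inquiry: fetch data from an endpoint with a GET request
response = requests.get('https://api.example.com/articles?api-key=YOUR_KEY')

# Deposit: send data to an endpoint with a POST request
response = requests.post('https://api.example.com/articles', data={'title': 'Hello'})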

As a data analyst, my interactions with APIs are mainly on the fetching side. In this post, I'll be walking you through fetching the authors of the most-shared articles on the New York Times, using the NYT's Most Popular API.

Fetching Data from an API, Step 1: Obtain an API Key

An API key is typically a long string of numbers and letters issued to you by the application's creators. It gives you varying degrees of access to the application's data, with fetching (or making a GET request, in API parlance) being the lowest and most easily obtained level of access. The process of API key issuance differs from one API to another; to access the New York Times' Most Popular API, you just need to register with their developer network, name your quote-unquote application (in this case, just a script that fetches data), and select the correct API checkbox. Presto, the NYT will issue you a key.

Fetching Data from an API, Step 2: Create a GET Request

Once you have your API key, you can use it to request data from the API, much as you might insert your debit card into an ATM and request balance info. API requests are made to the API's URI, and typically contain the endpoint (the specific data type), your API key, and parameters, like date ranges, that let you drill into a subset of the endpoint's data. Requests to the NYT's Most Popular API take the following format: http://api.nytimes.com/svc/mostpopular/v2/[mostshared/mostemailed/mostviewed]/[all-sections/specific section]/[days back: 1/7/30].json?api-key=[your key].

In your case, you want the most shared endpoint, and you want a month of data, so your request will look like this: http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key].

Put that into your URL bar and see what gets returned. You should see a bunch of JSON objects, each containing metadata associated with one of the most-shared articles.
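Trimmed down to the fields you'll use later, each response looks roughly like this (the values here are illustrative, and the other fields are omitted):

{
  "results": [
    {
      "byline": "By NOAM SCHEIBER",
      "published_date": "2016-01-12",
      "total_shares": 1,
      ...
    },
    ...
  ],
  ...
}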

Fetching Data from an API, Step 3: Get and Store the Response Data

Sure, you could copy all that JSON you see below the URL bar into a text document and manually locate the information you want (the authors), but that would be boring, time-consuming, and error-prone. Instead, you'll make the request programmatically and store the response data in an object you'll deconstruct in Step 4.

You can use any programming language you like to interact with APIs, but this example uses Python and its requests library, mainly because Python is really handy for analysis. Prereqs for this step are:

  1. a text editor (I like Sublime Text)
  2. Python (Macs ship with a default build, though the examples here assume Python 3)
  3. admin access to your computer
  4. pip (or your Python package installer of choice)
  5. the requests library (which you'll use to hit the Most Popular API's most-shared endpoint)

Provided you have admin access, you can get no. 4 by following these instructions. Once you have no. 4, you can get no. 5 by opening Terminal (Spotlight → Terminal) and typing "pip install requests."

Prereqs all taken care of? Open a new file in Sublime Text and save it as "most_popular.py". This will be the Python script that fetches and formats the most-shared articles.

First, import the needed modules:

import requests
import json
import csv

Next, create the request and response objects:

pop_request = 'http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key]'
pop = requests.get(pop_request)

"pop" holds the entire response: status code, headers, body, and more (try print(vars(pop)) to see everything). You're just interested in the JSON body, so store it in its own variable and convert it to a dictionary:

pop_content = json.loads(pop.text)
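It's also worth adding a sanity check between the request and the parse, since a bad key or malformed URL returns an error status instead of articles. A minimal sketch:

# Stop early if the API returned an error status (e.g. 401 for a bad key)
if pop.status_code != 200:
  raise SystemExit('Request failed with status ' + str(pop.status_code))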

Fetching Data from an API, Step 4: Extract the Necessary Data and Save It to a File

Now that you have your most shared article data stored in a big ol' object, you're going to parse it and extract what you need. First, in the Python interpreter, you can take a closer look at pop_content to find the names of the fields you're interested in:

import requests
import json
import csv
pop_request = 'http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key]'
pop = requests.get(pop_request)
pop_content = json.loads(pop.text)

# Print the name of every field in every article under "results"
for key, value in pop_content.items():
  if (key == "results"):
    for article in value:
      for field_name, field_value in article.items():
        print(field_name)

As the goal of this exercise is to find which authors appear most often among the most-shared articles, you're going to want the "byline" field for sure, along with the "published_date" and "total_shares" fields (though the latter is only useful for inference, as it doesn't actually give share counts, only where the article ranks in the top 20).

Create an empty list for each of the three useful fields. To fill them, you'll run through the results within the pop_content dictionary, save each value that corresponds to a byline, published_date, or total_shares field to a variable, and append that variable to its corresponding list.

dater = []
shares = []
authors = []
# Walk the "results" list and append each article's three fields
for k, v in pop_content.items():
  if (k == "results"):
    for a in v:
      pubdate = a["published_date"]
      dater.append(pubdate)
      author = a["byline"]
      authors.append(author)
      sharecount = a["total_shares"]
      shares.append(sharecount)

In the Python interpreter, you can inspect the contents of each list if you like, by typing print(list_name), e.g. print(authors).

The final part of this process is exporting your cleaned and extracted data to a csv file for easy-peasy analysis (later, I'll go through how to analyze this data within the same Python script, as well as how to push it to a database and analyze it with SQL).

First, create the empty csv file, a writer object, and the column headers:

csv_out = open('nyt_mostpop_week.csv', 'w', newline='')
mywriter = csv.writer(csv_out)
mywriter.writerow(['byline', 'published_date', 'total_shares'])

Then stitch the lists together, write each row to your csv object, and close the csv file.

for row in zip(authors, dater, shares):
  mywriter.writerow(row)
csv_out.close()

Now, open Terminal, cd into the folder you saved your most_popular.py script to, and run it ("python most_popular.py"). In Finder, go to that same folder and presto, you'll see your new csv file.

As the file only has 20 rows, determining the author with the most appearances can be accomplished with a simple sort (or a spreadsheet FREQUENCY formula). In my case, pulling data between December 19th, 2015 and January 18th, 2016, I get only one author appearing twice: Mr. Noam Scheiber. Congratulations, Noam!
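If you'd rather do the tally in Python than in a spreadsheet, here's a minimal sketch you could tack onto the end of the script, using the standard library's collections.Counter on the authors list you built above:

from collections import Counter

# Count how many of the 20 most-shared articles carry each byline
byline_counts = Counter(authors)
print(byline_counts.most_common(5))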

The complete script for accessing NYT’s most shared articles can be viewed and downloaded here.
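For convenience, here it is again, assembled from the snippets above (with [your key] left as a placeholder):

import requests
import json
import csv

# Request the most-shared articles, all sections, last 30 days
pop_request = 'http://api.nytimes.com/svc/mostpopular/v2/mostshared/all-sections/30.json?api-key=[your key]'
pop = requests.get(pop_request)
pop_content = json.loads(pop.text)

# Pull the three fields of interest out of each article
dater = []
shares = []
authors = []
for k, v in pop_content.items():
  if (k == "results"):
    for a in v:
      dater.append(a["published_date"])
      authors.append(a["byline"])
      shares.append(a["total_shares"])

# Write everything to a csv, headers first
csv_out = open('nyt_mostpop_week.csv', 'w', newline='')
mywriter = csv.writer(csv_out)
mywriter.writerow(['byline', 'published_date', 'total_shares'])
for row in zip(authors, dater, shares):
  mywriter.writerow(row)
csv_out.close()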

In chapter 3 of this series, I’ll address more sophisticated methods of data storage.

 
