In his short story “Funes the Memorious,” the Argentine writer Jorge Luis Borges writes of a teenager, Funes, whose mind contains the world and all of its history. To access this information, Funes devises a system of one-to-one enumeration.
“In place of seven thousand thirteen, he would say (for example) Máximo Pérez; in place of seven thousand fourteen, The Train; other numbers were Luis Melián Lafinur, Olimar, Brimstone, Clubs, The Whale, Gas, The Cauldron, Napoleon, Agustín de Vedia. In lieu of five hundred, he would say nine.”
This system, the narrator points out, is ridiculously inefficient; indeed, it is a degradation of the grouping systems we already have, the hundreds and tens and birds and planets.
The original databases, magnetic tape that stored references to information, similar to punch cards, were more efficient than Funes’ enumeration system, but only slightly. The actual data and its reference were stored hierarchically: sunflower growth statistics, let’s say, rolls up into sunflowers, which in turn rolls up into flowers. Any time a single record needed to be updated, the entire tape would be rewritten. And finding that record meant first finding its reference in a vast spool of tape: not quite needle-in-haystack, but close.
I won’t dive much further into database history, except to say that in 1970, a British computer scientist named E.F. Codd published a paper that introduced the concept of relational databases and explained their advantage over hierarchical and network databases: namely, that they allow you and me to care only about what we are looking for, and not about where we must go to find it.
While these days databases come in a variety of architectures, some hyper-structured, some basically conjoined jellyfish, the relational database is still the pre-eminent database for most mid-sized and large companies (though, if they’re in the business of content or social-network creation, it’s likely they have a few jellyfish as well).
As an analyst, I frequently want to uncover the relationship between two or more sets of data; relational databases are designed to expedite that discovery, by joining tables of data on a common key. The key is to data tables what mutual friends are to you and the guy you swiped right for. I like relational databases, because getting data out of them is very straightforward.
In this post, I’m going to illustrate this straightforwardness (but with dogs, not Tinder. Wompwomp). We’ll go through what a relational data table looks like, how to create one in your RDB of choice, and how to query its contents and join it on a second table.
Let’s get started.
First, I want you to picture a spreadsheet. (Humor me, budding analyst.) Let’s say it’s a spreadsheet on dog adoptions, called “Adoption.” Adoption has three columns: Breed, Puppy Price, and Average Lifespan.
| Breed | Puppy Price | Average Lifespan (years) |
You also have a second spreadsheet, “Health,” with Breed and Average Annual Cost.
| Breed | Average Annual Cost |
Let’s say you want to calculate the total cost per breed, because you’re a pragmatic person and you want to convince your husband that an Irish Wolfhound is a bad investment, compared to a chihuahua. In a spreadsheet scenario, these two tables would live in separate tabs, and if you wanted to calculate the total cost per breed, you might first use a VLOOKUP to join the average annual cost data to the average puppy price and lifespan data.
In a relational database scenario, these tables would live in schemas (an overarching container, akin to a top-level report folder, if you’re the organized sort). You would relate them to one another using a key column, in this case “Breed,” and then multiply the Average Lifespan column from the Adoption table by the Average Annual Cost column from the Health table.
A relational database can hold many schemas, each with dozens if not hundreds of tables, and any of them can be linked by their common columns.
In this silly eight-row example, there’s no need to have the data anywhere but a spreadsheet. But, when:
- You have millions of rows of data and/or
- You have nested data and/or
- You want to keep adding to the dataset programmatically and/or
- You want others to be able to access and possibly even update the dataset,
a relational database is far more efficient than a spreadsheet.
Time to talk about getting data into a relational database. The relational database I’m using is MySQL, because that’s what our database team set up for me, but the processes I’ll be going through will be similar for Postgres and Oracle. I’ll be connecting to MySQL using a free GUI tool called “MySQL Workbench,” which you can download here. The language I’ll be using to write to and read from it is Structured Query Language, aka “SQL.” I’ll write more about SQL in an upcoming post, but if you’re unfamiliar with it, just think of it as a way to specify the type and aggregation of information you want.
Data goes into a database in one of two ways. If the data already exists in a file format, you create a table for it in the appropriate schema, and then you upload it to that table. If the data is not yet exposed to you, you pull it down from its source location, format it appropriately, and push it to the table you’ve created for it. Typically, this would be done in a script, and you could have that script run on a schedule, so that your data table(s) fill up on the reg.
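To make the “push it to the table” half concrete: the workhorse statement for writing rows from a script is INSERT. A minimal sketch, using the dog-adoption table we’ll create in a moment, with made-up breeds and prices:

```sql
-- Minimal sketch: pushing rows into a table, as a script would.
-- The breeds and numbers here are made-up illustration values.
INSERT INTO adoption (Breed, Puppy_Price, Avg_Lifespan)
VALUES ('Chihuahua', 800, 15),
       ('Irish Wolfhound', 1900, 7);
```

A scheduled script would typically build statements like this from whatever its source (an API, a log, another table) hands back.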
In this tutorial, I’m making the assumption that you already have access to a database and a schema within that database, and that you’ve opened up a connection to that database, either through a GUI like MySQL Workbench or through the command line (your terminal app). A schema, by the way, is like a bureau, only it holds tables, instead of shelves. (Ack, mixing furniture metaphors here!)
All set? Cool, let’s build a table.
First, connect to your database using the connection strings provided to you. In MySQL Workbench, you connect in the Home panel, by clicking on the (+) button.
After you’ve connected to your database, you need to specify the schema you’ll be using. If you already had a “dogs” schema, you’d run:

USE dogs;

As there is no dogs schema yet, you’ll need to create one:

CREATE SCHEMA IF NOT EXISTS dogs;

Then run USE dogs; so that the tables you create next land in it.
In our money-efficient pets scenario, we want the adoption table and the health table. As the dogs schema is empty, we’ll need to create both. To create a table in SQL, you first give it a name, and then you list each of the columns, along with their type (integer, character, date, etc.), and other options like “NOT NULL” if the column fields can’t be blank, and “PRIMARY KEY” if you want the column (or combination of columns) to uniquely identify each row and serve as the table’s matching key. (More on this later!)
CREATE TABLE adoption (Breed VARCHAR(45), Puppy_Price INT, Avg_Lifespan INT);
The above creates your table, adoption, and specifies the names and data types of its columns. VARCHAR holds varied character data (aka strings), INT is for whole numbers, DATE is for dates, DATETIME is for date+time stamps, BOOL is True/False (False=0, True=nonzero), and DECIMAL is for exact fractional numbers (FLOAT and DOUBLE hold approximate ones). Full definitions of these types and more are here.
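If you wanted to use the NOT NULL and PRIMARY KEY options mentioned above, the statement might look like this (a variant sketch, assuming each breed appears in the table at most once):

```sql
-- Variant sketch: the same table, with constraints.
-- Assumes Breed uniquely identifies each row, so it can be the primary key.
CREATE TABLE adoption (
  Breed VARCHAR(45) NOT NULL PRIMARY KEY,  -- must be present and unique
  Puppy_Price INT,
  Avg_Lifespan INT
);
```

The primary key is what guarantees you won’t end up with two conflicting rows for the same breed when you join later.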
Using the same structure used to create your adoption table, create your health table.
CREATE TABLE health (Breed VARCHAR(45), Avg_Annual_Cost INT);
Now that you’ve created your table structures, you can fill them with data. This might be done programmatically, by hitting up an API, or manually, by loading a file. In the pets case, let’s assume the adoption data and health data each live in CSVs, located in your local Pets folder. Load them with the following (FIELDS TERMINATED BY tells MySQL the files are comma-delimited, and IGNORE 1 LINES skips each file’s header row):

LOAD DATA LOCAL INFILE '~/Pets/adoption.csv' INTO TABLE adoption
FIELDS TERMINATED BY ',' IGNORE 1 LINES;

LOAD DATA LOCAL INFILE '~/Pets/health.csv' INTO TABLE health
FIELDS TERMINATED BY ',' IGNORE 1 LINES;
Check to see the data has loaded properly with a SELECT * (aka select ALL THE THINGS):
SELECT * FROM adoption;

SELECT * FROM health;
You should see your full tables returned.
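On a table this size, SELECT * is harmless, but on a million-row table it will try to return everything. A common habit is to peek at just a handful of rows instead:

```sql
-- Peek at the first five rows rather than pulling the whole table
SELECT * FROM adoption LIMIT 5;
```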
Now that we have our tables, we can start grilling ’em.
The SELECT statement used above starts pretty much any basic SQL query you might have. All it does is tell your not-yet-specified table, “hey, I’m going to be asking for these not-yet-specified things.” Let’s add some specificity!
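For instance, even before joining anything, you can filter a single table with a WHERE clause and sort the result with ORDER BY. A small sketch, using a made-up price cutoff:

```sql
-- Breeds with a puppy price under $1,000 (the cutoff is arbitrary),
-- cheapest first
SELECT Breed, Puppy_Price
FROM adoption
WHERE Puppy_Price < 1000
ORDER BY Puppy_Price;
```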
The big, burning question you have regarding your pets data is which pets are the most and least cost-efficient. To answer it, you’ll need to multiply the average annual cost for each pet from the health table by its lifespan from the adoption table, and add the puppy price from the adoption table to that total. What this means, in SQL terms, is that the time has come for JOINs and a little arithmetic.
To get your first answer, all the pertinent columns side by side, you’ll need to JOIN the health table to the adoption table on their shared Breed column.
SELECT health.Breed, health.Avg_Annual_Cost, adoption.Puppy_Price, adoption.Avg_Lifespan
FROM health
INNER JOIN adoption
ON health.Breed = adoption.Breed
ORDER BY 1;
| Breed | Average Annual Cost | Puppy Price | Average Lifespan (years) |
Now you have all your pertinent data in one temporary table, but what you really want is to calculate the total health costs (average annual cost * lifespan) and add that to your puppy price. You can do this in your original JOIN statement.
SELECT health.Breed, ((health.Avg_Annual_Cost * adoption.Avg_Lifespan) + adoption.Puppy_Price) AS Total_Cost
FROM health
INNER JOIN adoption
ON health.Breed = adoption.Breed
ORDER BY 2 DESC;
And the results, ordered by total cost, from most to least expensive:
Looks like the Chihuahua, by dint of its longevity, ends up costing the most. Go drop $3k on a Frenchie!
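One last variation: the same join takes a WHERE clause too, so you could rerun the comparison under a budget cap on the upfront price (the $1,500 cutoff below is a made-up example):

```sql
-- Same join, restricted to breeds with a puppy price under $1,500
SELECT health.Breed,
       ((health.Avg_Annual_Cost * adoption.Avg_Lifespan) + adoption.Puppy_Price) AS Total_Cost
FROM health
INNER JOIN adoption
  ON health.Breed = adoption.Breed
WHERE adoption.Puppy_Price < 1500
ORDER BY Total_Cost;
```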
To recap: in this post, you learned the reasons for using relational databases instead of spreadsheets, how to set up a schema and tables in a relational database, how to fill the tables with data from a file, and how to perform aggregations and joins on this data.
In the next post, I’ll dive further into SQL, and walk through inserting data into an RDB by other means (chiefly, through an API).