A few years ago, I began the process of changing my career from marketing to data science. The company I was working for formed a sort of internal innovation lab to work on big data products; when I joined it, I stopped marketing enterprise reporting solutions for .NET developers and started marketing an email application that used machine learning to intelligently prioritize inboxes. What this meant, in practice, was that I got to abandon the arcane and chubby depths of the Microsoft BI stack for the open waters of text mining and AI and personalized recommendations, all of it built on reams and reams of data being chunked up and digested in parallel by hundreds of servers. I didn’t understand any of it any better than I understood the Microsoft stuff, but I wanted to.
The path from nascent, tootling interest in big data to employment as a data analyst at a large media company was not exactly direct, and the knowledge that I acquired along the way is not exactly comprehensive (if I’m being generous, I have maybe one-fifth of the skillset of a data scientist). It worked, though, because while there are quite a lot of companies today that are generating or are capable of generating massive amounts of data, there are comparatively few people capable of making sense of that data.
For the rest of this series, I’ll be laboring under the assumption that you want to become one of those comparative few. In each chapter, I’ll explain one of the four aspects of quote-unquote “big data” analysis that I’ve found necessary to being an effective analyst. By the end, you should be able to, at a basic level, collect, organize, analyze, and visualize data. But first, I want you to know why each of these functions matters.
Let’s go back to the penultimate sentence of the last paragraph: you, dear budding data analyst, should be able to collect data, organize data, analyze data, and visualize data. I fumbled along for quite some time before I realized that. For a while, I thought that in order to get an actual data job, I would need to be capable of building predictive and recommendation systems. And it’s true that a fair number of people do spend their time building one or both of these things (typically, these are the data scientists). But here’s another (not sexy enough to get written about) truth: for most companies, predictions and recommendations are the cherry on a sundae they haven’t yet bought. That is to say, in order to get to the point where predictions or recommendations are useful, let alone critical, a company first needs to a) have a lot of users, b) generate a lot of user data, c) store that data in a way that lets it be accessed and analyzed, and d) employ people who can clean and analyze and report on that data. Plenty of companies don’t even have a, and if they have a, they don’t have b, and they certainly don’t have c or d. And that is where you, with your new abilities to (say it with me) collect, organize, analyze, and visualize data, come in.
What You Need to Know Before You Begin and What You’ll Come to Know as You Learn
Before you start dealing with big data sets, you should be comfortable getting basic summaries from small ones. E.g., given a table of newsletter opens and clicks for the past year, you should be able to get the monthly average, median, and standard deviation of opens, clicks, and click rates (the median and standard deviation illustrate how regular your data is, or isn’t). In general, you should know what a table of data looks like. An understanding of Excel functions like SUMIF/AVERAGEIF, FREQUENCY, and LOOKUP will be helpful as well, as you’ll be essentially replicating them in SQL and/or R.
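If it helps to see what those basic summaries look like outside of a spreadsheet, here is a minimal sketch in Python using only the standard library. The numbers and the `(month, opens, clicks)` row layout are made up for illustration, and `monthly_summary` is a hypothetical helper, not something from a particular tool.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical rows, one per newsletter send: (month, opens, clicks).
sends = [
    ("2023-01", 1200, 90),
    ("2023-01", 1000, 110),
    ("2023-01", 1100, 100),
    ("2023-02", 900, 45),
    ("2023-02", 1500, 150),
    ("2023-02", 1200, 105),
]


def monthly_summary(rows):
    """Return mean, median, and standard deviation per month and metric."""
    by_month = defaultdict(list)
    for month, opens, clicks in rows:
        # Click rate here is clicks divided by opens for a single send.
        by_month[month].append((opens, clicks, clicks / opens))

    summary = {}
    for month, values in sorted(by_month.items()):
        opens, clicks, rates = zip(*values)
        summary[month] = {
            name: {"mean": mean(col), "median": median(col), "stdev": stdev(col)}
            for name, col in (
                ("opens", opens),
                ("clicks", clicks),
                ("click_rate", rates),
            )
        }
    return summary


summary = monthly_summary(sends)
for month, metrics in summary.items():
    print(month, metrics["opens"])
```

In this toy data, both months have a mean that matches the median, but February's standard deviation of opens is three times January's, which is the kind of "how regular is this?" signal the summaries are meant to surface.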
True (predictive) data science requires a solid grasp of calculus and statistics, as well as basic computer programming. Data analysis, however, requires only that you be willing to learn a bit of calculus (slope of a curve), a bit of statistics (summary and significance of your results), and a bit of computer programming (interacting with APIs and cleaning the data those APIs return), with the emphasis on the “willing to learn” bit. The following is a sample skillset you’ll amass as you become a data analyst.
- Collection
  - write programmatic requests to specific API endpoints
  - know what all of the above means!
- Organization
  - parse the data object returned from the API into a structure that you can push out into a CSV or shove into a table in a database
  - set up said table in said database
- Analysis
  - use Excel, R, or SQL to obtain aggregate, trended, and/or summary statistics on a filtered dataset
- Visualization
  - use Python, R, Tableau, or Google Charts to graph or otherwise visually represent the output of your query
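To make that list a little less abstract, here is a rough sketch of the organization and analysis steps strung together in Python. It is only an illustration under some loud assumptions: `RAW_RESPONSE` is a made-up stand-in for what an API might return (a real script would fetch it over HTTP), the field names and the `sends` table are hypothetical, and the database is an in-memory SQLite one so nothing needs to be installed. Visualization is left out; the `report` rows at the end are what you would hand to a charting tool.

```python
import csv
import io
import json
import sqlite3

# Stand-in for the body of an API response. The structure and field
# names are invented for this example.
RAW_RESPONSE = """
{"results": [
  {"campaign": "weekly", "sent_on": "2023-01-05", "opens": 1200, "clicks": 90},
  {"campaign": "weekly", "sent_on": "2023-01-12", "opens": 1000, "clicks": 110},
  {"campaign": "promo",  "sent_on": "2023-01-20", "opens": 400,  "clicks": 60}
]}
"""

COLUMNS = ["campaign", "sent_on", "opens", "clicks"]


def parse_rows(raw):
    """Organization: flatten the returned JSON object into plain tuples."""
    payload = json.loads(raw)
    return [tuple(item[col] for col in COLUMNS) for item in payload["results"]]


def to_csv(rows):
    """Organization: push the rows out as CSV text (header row first)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


def load_into_db(rows):
    """Organization: set up a table and shove the rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sends "
        "(campaign TEXT, sent_on TEXT, opens INTEGER, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO sends VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def click_rate_by_campaign(conn):
    """Analysis: aggregate opens and clicks per campaign with SQL."""
    query = """
        SELECT campaign,
               SUM(opens),
               SUM(clicks),
               1.0 * SUM(clicks) / SUM(opens)
        FROM sends
        GROUP BY campaign
        ORDER BY campaign
    """
    return conn.execute(query).fetchall()


rows = parse_rows(RAW_RESPONSE)
csv_text = to_csv(rows)
conn = load_into_db(rows)
report = click_rate_by_campaign(conn)
for line in report:
    print(line)
```

The `GROUP BY` query is doing roughly what a SUMIF would do in Excel: one total per campaign, plus a click rate computed from those totals.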
Sound good? Continue on to Chapter 2: An API Is Just an ATM in the Sky