That the internet homogenizes language is a matter of debate, but that it provides vast quantities of state-of-the-nation linguistic data is not. I was curious about linguistic diversity among American Twitter users, and specifically about whether that diversity, should it exist, could be regionally based. McLuhan thought electronic communications would fashion a global village; I looked through one electronic communication to see whether this was so.
Part 1: Getting and visualizing the raw data
There are a number of ways to grab public Twitter data. I chose the easiest way of all: using data someone else had already grabbed. In this case, those someones were a team of computer science postdocs from Carnegie Mellon, led by Jacob Eisenstein. In the course of analyzing lexical variation on Twitter, Eisenstein et al. amassed a dataset tailor-made for my purpose. The data consists of a week's worth of tweets from users who lived within the 48 contiguous states, had fewer than 1,000 followers, and had posted at least 20 times over that week; that is to say, ordinary, frequent users of the service.
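To give a sense of that last criterion, here is a sketch in R of how the "at least 20 tweets" filter could be reproduced on a data frame of tweets. The `UserID` column name and the toy data are my own illustration, not the repo's; Eisenstein et al. applied this filter before releasing the data, so this is only a demonstration.

```r
# toy data frame standing in for a raw tweet table (UserID is an assumed name)
full_tweets <- data.frame(
  UserID = c(rep("u1", 25), rep("u2", 5)),
  Tweet  = "hello",
  stringsAsFactors = FALSE
)

# count tweets per user, then keep only users with at least 20 posts
tweet_counts   <- table(full_tweets$UserID)
frequent_users <- names(tweet_counts)[tweet_counts >= 20]
full_tweets    <- full_tweets[full_tweets$UserID %in% frequent_users, ]

unique(full_tweets$UserID)  # only "u1" survives the filter
```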
In its original form, the data is a TXT file (full-text.txt in the GeoTwitter repo) with fields containing the following information types:
1. Anonymized user ID
2. Publish date
3. UTM coordinates
4. Tweet body
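A minimal way to load a file like this into R, assuming the fields are tab-separated; the sample lines, column names, and lat/long layout below are illustrative stand-ins, not the repo's documented format.

```r
# two sample lines in an assumed layout: tab-separated user ID, publish
# date, latitude, longitude, and tweet body (illustrative only)
sample_lines <- c(
  "USER_1\t2010-03-04\t40.71\t-74.00\tgood morning nyc",
  "USER_2\t2010-03-04\t34.05\t-118.24\thello from la"
)
writeLines(sample_lines, "sample_tweets.txt")

# read the file into a data frame, one row per tweet
full_tweets <- read.delim("sample_tweets.txt", header = FALSE, sep = "\t",
                          quote = "", stringsAsFactors = FALSE,
                          col.names = c("UserID", "Date", "Lat", "Long", "Tweet"))

nrow(full_tweets)  # 2 rows, one per tweet
```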
There were two aspects of this dataset that I was interested in visualizing. The first was the number of tweets per region. I’ll get to the second in part 3.
The first visualization was really simple. In the original full_text dataset, locations are given as longitude/latitude points. To see frequency, I just used the maps library to plot the latitude and longitude onto a blank map of the US.
library(maps)
library(sp)

# plot tweet coordinates onto a blank US map
map("usa", col="#f2f2f2", fill=TRUE, bg="white", lwd=0.05)
points(x=full_tweets$Long, y=full_tweets$Lat, col="red")

# promote to sp's spatial form (sp expects the x coordinate, longitude, first)
coordinates(full_tweets) <- c("Long", "Lat")