Occasionally, I need a map. I’m not talking about a beautiful, polished work of art. I just need something that gives me some spatial context and that will be acceptable in a customer presentation. Most frequently I need to see points on a map… sometimes lots of them.
If you Google ‘plot points on a map’ you will have dozens of options: http://multiplottr.com, https://batchgeo.com, and https://www.mapcustomizer.com are the first few I see pop up. These will work, especially if you only need to plot a few points. However, when I experiment with these I find myself getting frustrated. Instead, I use R. FYI, there is ArcGIS if you need to do some heavy-duty mapping work…
I have a list of zip codes and a corresponding number of orders shipped to that zip code over the past year. The top of the file looks like this:
id | zip | orders |
---|---|---|
1 | 10001 | 2 |
2 | 10002 | 344 |
3 | 10003 | 4 |
4 | 10004 | 5 |
5 | 10005 | 98 |
6 | 10006 | 5 |
Before I map anything, I need to do two things: fix zip codes that read into R with fewer than five digits because their leading zeros are dropped (for instance Holtsville, NY with its zip code 00501, which reads in as 501), and add the lat / lon coordinates of each zip code to each row.
    library(dplyr)    # helps us do sql-type joins (will also use to filter data later)
    library(zipcode)  # gives us lat / lon and city / state names for each zipcode
    data(zipcode)     # loads the lat / lon data

    # read in your data
    orig <- read.csv("order.csv")

    # fix zip codes that start with zero and read in with fewer than 5 digits
    # left-padding to width 5 handles any number of leading zeros in one pass
    # and returns character values, which is what the join below needs
    orig$zip <- sprintf("%05d", as.integer(orig$zip))

    # combine your zip code data with lat / lon and city / state name data
    data <- left_join(orig, zipcode, by = "zip")
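One thing worth checking after the join: a left join keeps every order row, so any zip code missing from the lookup table comes through with NA coordinates and would quietly vanish from the map. Here is a minimal sketch of that check. The `toy_orders` and `toy_lookup` data frames are made-up stand-ins (with rough, illustrative coordinates), not the real order file or the zipcode package data.

```r
library(dplyr)

# hypothetical stand-ins for orig and the zipcode lookup table
toy_orders <- data.frame(zip = c(501, 10001, 99999), orders = c(3, 2, 7))
toy_lookup <- data.frame(
  zip       = c("00501", "10001"),
  latitude  = c(40.81, 40.75),    # illustrative values only
  longitude = c(-73.04, -74.00),  # illustrative values only
  stringsAsFactors = FALSE
)

# pad to five digits so the keys match the lookup's character zips
toy_orders$zip <- sprintf("%05d", as.integer(toy_orders$zip))

joined <- left_join(toy_orders, toy_lookup, by = "zip")

# rows with no match keep their orders but have NA coordinates
unmatched <- filter(joined, is.na(latitude))
print(unmatched)
```

If `unmatched` has rows, those orders will not show up as dots, so it is worth a look before reading too much into the map.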
Great. Now the top of our data looks like this…
id | zip | orders | city | state | latitude | longitude |
---|---|---|---|---|---|---|
1 | 10001 | 2 | New York | NY | 40.75074 | -73.99653 |
2 | 10002 | 344 | New York | NY | 40.71704 | -73.98700 |
3 | 10003 | 4 | New York | NY | 40.73251 | -73.98935 |
4 | 10004 | 5 | New York | NJ | 40.69923 | -74.04118 |
5 | 10005 | 98 | New York | NY | 40.70602 | -74.00858 |
6 | 10006 | 5 | New York | NY | 40.70790 | -74.01342 |
Well, that’s strange… New York, NJ? I had to detour and look at Google Maps. The coordinates land on Ellis Island, which sits in Upper New York Bay, at the mouth of the Hudson River between New York and New Jersey. In fact, on the map, it says both New York and New Jersey: dual ownership. I still feel good about the data. Now, I always start with a basic, basic map leveraging R’s maps package…
    library(maps)  # basic r map package (basic is a misleading word bc it's still pretty amazing)

    # plot a map of US states (no dots yet), with gray lines, and don't fill the map with any color
    map(database = "state", col = gray(0.5), fill = FALSE)

    # now plot a single dark green point at each latitude / longitude location
    points(data$longitude, data$latitude, pch = 20, col = "darkgreen", cex = 0.1)
And that’s not a terrible start. LA, San Francisco, Portland, and Seattle on the west coast have decent order volume… and the Northeast is saturated. Often, that’s good enough for what I do… but I can never resist getting into the ggplot2 package. Next, I’ll make it look just a little bit prettier and vary the size of the dot by the relative order volume for each zip code…
    library(ggplot2)  # helps us plot beautiful visualizations

    # set up the basic ggplot US state map with grid lines and all
    states <- map_data("state")
    p <- ggplot() + coord_fixed(1.3) + xlab("") + ylab("")
    base_state_map <- p +
      geom_polygon(data = states, aes(x = long, y = lat, group = group),
                   colour = "black", fill = "steelblue")

    # create a theme to strip out background shading and gridlines
    cleanup <- theme(panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     panel.background = element_rect(fill = "white", colour = "white"),
                     axis.line = element_line(colour = "white"),
                     legend.position = "none",
                     axis.ticks = element_blank(),
                     axis.text.x = element_blank(),
                     axis.text.y = element_blank())
    states_map <- base_state_map + cleanup

    # print the cleaned-up base map
    states_map
Pretty neat. Let’s add the orders. I’m going to ignore Alaska and Hawaii for now because including them pushes the continental US into the bottom right-hand corner of the window…
    # filter out orders to the state of AK or HI
    data <- filter(data, state != "AK", state != "HI")

    # plot dots on states_map created above, sized by the orders column
    # the alpha parameter makes the dots somewhat transparent
    map_with_orders <- states_map +
      geom_point(data = data, aes(x = longitude, y = latitude, size = orders),
                 colour = "yellow", pch = 20, alpha = I(0.5)) +
      scale_size_continuous(range = c(0.01, 3.0))

    # print the map
    map_with_orders
Works for me. As you can imagine, from here it is pretty easy to zoom in and create a map for a particular state or county.
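As a rough sketch of that zooming step (my assumption of how it could be done, not code from the analysis above): `map_data()` accepts a `region` argument, so you can pull the outline for a single state and plot only the points that fall in it. The `ny_points` data frame is made up for illustration, reusing a few coordinates from the table earlier.

```r
library(ggplot2)  # map_data() also needs the maps package installed

# outline for one state only
ny <- map_data("state", region = "new york")

# hypothetical points; in practice this might be filter(data, state == "NY")
ny_points <- data.frame(
  longitude = c(-73.99653, -73.98700, -73.98935),
  latitude  = c(40.75074, 40.71704, 40.73251),
  orders    = c(2, 344, 4)
)

ny_map <- ggplot() +
  geom_polygon(data = ny, aes(x = long, y = lat, group = group),
               colour = "black", fill = "steelblue") +
  geom_point(data = ny_points,
             aes(x = longitude, y = latitude, size = orders),
             colour = "yellow", alpha = 0.5) +
  coord_fixed(1.3)

ny_map
```

Swapping in `map_data("county", region = "new york")` should give county outlines for the same state, if you want more detail than the state border.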
One of my favorite things about working with ggplot2 is the faceting feature that allows us to create visual comparisons. For demo purposes, I took my zip code data and replicated it over 4 years so that the top of my raw data resembled this…
id | zip | 2012 | 2013 | 2014 | 2015 |
---|---|---|---|---|---|
1 | 10001 | 2 | 20 | 25 | 25 |
2 | 10002 | 344 | 300 | 350 | 400 |
3 | 10003 | 4 | 4 | 14 | 14 |
4 | 10004 | 5 | 15 | 50 | 45 |
5 | 10005 | 98 | 150 | 102 | 90 |
6 | 10006 | 5 | 5 | 5 | 5 |
Before creating four plots, each representing a calendar year, we need to use the reshape package to put the data in a ‘usable’ long format: one row per zip code per year.
    # assuming zip code data already loaded...
    # read in your revised data, fix the zip codes, rename the columns
    annual_data <- read.csv("annual_orders.csv")
    annual_data$zip <- sprintf("%05d", as.integer(annual_data$zip))
    names(annual_data) <- c("zip", "2012", "2013", "2014", "2015")

    # melting the data will create one row per zipcode for each year
    # then I marry it up with the zip code library data
    library(reshape)
    temp <- melt(annual_data, id.vars = "zip")
    ann_map_data <- left_join(temp, zipcode, by = "zip")
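For what it’s worth, the same wide-to-long reshaping can be sketched with tidyr’s `pivot_longer()`. This is a swapped-in alternative to reshape’s `melt()`, not what was used above. The `annual_toy` data frame just retypes the first three rows of the table, and `names_to` / `values_to` are set so the output columns carry the same ‘variable’ and ‘value’ names that `melt()` produces.

```r
library(tidyr)

# hypothetical stand-in for annual_data (first three rows of the table)
annual_toy <- data.frame(
  zip    = c("10001", "10002", "10003"),
  `2012` = c(2, 344, 4),
  `2013` = c(20, 300, 4),
  `2014` = c(25, 350, 14),
  `2015` = c(25, 400, 14),
  check.names = FALSE,       # keep the year column names as-is
  stringsAsFactors = FALSE
)

# every column except zip becomes a (variable, value) pair
long_toy <- pivot_longer(annual_toy, cols = -zip,
                         names_to = "variable", values_to = "value")
print(head(long_toy, 4))
```

One difference to be aware of: `pivot_longer()` keeps the rows grouped by zip code, while `melt()` stacks them year by year. That does not matter for the join or the plot.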
Now the top of our data looks like this. Notice the new column called ‘variable’ which represents the year for the data.
id | zip | variable | value | city | state | latitude | longitude |
---|---|---|---|---|---|---|---|
1 | 10001 | 2012 | 2 | New York | NY | 40.75074 | -73.99653 |
2 | 10002 | 2012 | 344 | New York | NY | 40.71704 | -73.98700 |
3 | 10003 | 2012 | 4 | New York | NY | 40.73251 | -73.98935 |
4 | 10004 | 2012 | 5 | New York | NJ | 40.69923 | -74.04118 |
5 | 10005 | 2012 | 98 | New York | NY | 40.70602 | -74.00858 |
6 | 10006 | 2012 | 5 | New York | NY | 40.70790 | -74.01342 |
Finally, we remove AK and HI again and plot using ggplot2’s facet_wrap to split the data on our ‘variable’ variable.
    # filter out orders to the state of AK or HI
    ann_map_data <- filter(ann_map_data, state != "AK", state != "HI")

    # plot one map per year. Smaller maps require smaller dots... so reduced top end of scale_size_continuous
    new_map_with_orders <- states_map +
      geom_point(data = ann_map_data, aes(x = longitude, y = latitude, size = value),
                 colour = "yellow", pch = 20, alpha = I(0.5)) +
      scale_size_continuous(range = c(0.01, 1.0)) +
      facet_wrap(~variable)

    # print the four maps
    new_map_with_orders
And now it’s easy for me to see how growth occurred between 2012 and 2015: more orders going to the same places (oversimplified). I manipulated this data on purpose to make the year-over-year growth very noticeable in the four maps above, but I find all sorts of uses in real life for faceting with facet_wrap and facet_grid.
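The maps show the trend visually; a quick numeric cross-check is to total the orders per year from the same long-format data. Here is a minimal sketch with dplyr, using only the six sample rows of the wide table shown earlier. `long_sample` is a hypothetical stand-in for ann_map_data, with just the columns needed.

```r
library(dplyr)

# hypothetical long-format sample: six zip codes across four years
long_sample <- data.frame(
  zip      = rep(c("10001", "10002", "10003", "10004", "10005", "10006"), times = 4),
  variable = rep(c("2012", "2013", "2014", "2015"), each = 6),
  value    = c(2, 344, 4, 5, 98, 5,      # 2012
               20, 300, 4, 15, 150, 5,   # 2013
               25, 350, 14, 50, 102, 5,  # 2014
               25, 400, 14, 45, 90, 5),  # 2015
  stringsAsFactors = FALSE
)

# total orders per year
yearly <- long_sample %>%
  group_by(variable) %>%
  summarise(total_orders = sum(value))
print(yearly)
```

For these six rows the totals climb each year (458, 494, 546, 579), which is the same story the faceted maps tell.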
If you occasionally need a quick map, either to plot customer locations or to visualize comparisons (R is also great with choropleths), R is a great option. If you want to try some of the stuff above but can’t figure it out, I’d love to help. franciscorrigan3 AT g mail.