Formatting Flow Data for Sankey Diagram

At first sight, the Sankey diagram appeared to me as a very cool data visualization but impractical for discovering impactful insights. That’s probably because I didn’t know how to build them quickly to explore my own data. Over the past few weeks, I’ve been able to take the time to build some of these charts with supply chain flow data helping me to explore product & money flows and brainstorm what future alternatives might look like. Here, I wrap up all that time and learning into a few r functions to help me very quickly turn origin/destination data into Sankey flow diagrams.

To demo, I’ll use this Kaggle dataset from the DOT called “Freight Analysis Framework.” The specific file I’m reading from is “FAF4.4_State.csv.” Here’s a look at the top of the dataset (I’m only taking 4 columns which are origin state, destination state, transportation mode, and estimated tons moved in the year 2020)…

data <- read.csv("FAF4.4_State.csv") %>% select(dms_orig, dms_dest, dms_mode, tons_2020)

  dms_orig dms_dest dms_mode tons_2020
1        1        1        1 3689.9802
2        1       12        1    7.5912
3        1       13        1   85.9058
4        1       20        1   32.7515
5        1       22        1   28.5412
6        1       28        1  103.0098

There is documentation on mode type coding here, but I couldn’t locate the codes for states. However, this Kaggle user did identify those codings so I borrowed appreciatively. I’ll convert our numerical state identifiers into state abbreviations, filter the big dataset into something more focused and manageable, and we’ll be ready for Sankey. FYI on Sankey if you haven’t looked it up yet.

states_dict_1 = c( 1,  2,  4,  5,  6,  8,  9, 10, 
                  11, 12, 13, 15, 16, 17, 18, 19, 
                  20, 21, 22, 23, 24, 25, 26, 27, 
                  28, 29, 30, 31, 32, 33, 34, 35, 
                  36, 37, 38, 39, 40, 41, 42, 44, 
                  45, 46, 47, 48, 49, 50, 51, 53, 
                  54, 55, 56)

states_dict_2 = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE",
                  "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA",
                  "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN",
                  "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM",
                  "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
                  "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA",
                  "WV", "WI", "WY")

states_df <-, states_dict_2))
states_df$states_dict_1 <- as.integer(as.character(states_df$states_dict_1))

data_2 <- left_join(data, states_df, by=c("dms_orig"="states_dict_1")) %>%
  rename("origin_state"=states_dict_2) %>%
  left_join(states_df, by=c("dms_dest"="states_dict_1")) %>%

# # modes
# 1=Truck, 2=Rail, 3=Water, 4=Air, 5=Intermodel, 6=Pipeline, 7=Unknown, 8=No Domestic Mode

flows <- %>%
  filter(origin_state == "TX" | dest_state == "TX" & dms_mode == 6) %>%
  group_by(origin_state, dest_state) %>%
  summarise(value = sum(tons_2020)) %>%
  # filter(origin_state != dest_state) %>%
  `colnames<-`(c("origin", "destination", "value"))

rm(data); gc()

We now have a data frame showing Texas pipeline flows only. Note that if you have way too much flow data (like if we tried to plot all mode types between all US states) two consequences probably ensue. First, the algorithm to build the Sankey diagram takes way too long. Even if it runs, you’ll end up with a mess of a chart telling you nothing. With the flows data frame just created we can use the below 2 functions to obtain the properly formatted datasets needed for the r networkD3 package which created Sankey diagrams with D3.

# flows need to be zero indexed...
# assumption for the input dataset into format_sankey_links is that it contains columns names 'origin', 'destination', and 'value'
format_sankey_links <- function(dataset) {
  # create vector of unique nodes
  flow_indices <- unique(c(as.character(dataset$origin), as.character(dataset$destination)))
  # create list containing nodes and identifier number for each node
  new_index <- list(state=NULL, index=NULL)
  integer_holder <- seq(1, 100, 1)
  for (i in 1:length(flow_indices)) {
    new_index$state[i] = flow_indices[i]
    if (i == 1) { new_index$index[i] = 0 }
    else {
      if (flow_indices[i] %in% new_index$state[1:i-1]) {
        new_index$index[i] = new_index$index[match(flow_indices[i], new_index$state)]
      } else {
        new_index$index[i] = integer_holder[1]
        integer_holder <- integer_holder[2:length(integer_holder)]
  # add new identifier number to dataset
  dataset$new_origin_index <- new_index$index[match(dataset$origin, new_index$state)]
  dataset$new_dest_index <- new_index$index[match(dataset$destination, new_index$state)]

# nodes (states in this case) need an identifier...
# as an input format_sankey_nodes takes the output dataframe of format_sankey_links
format_sankey_nodes <- function(links_data) {
  # determine node names / ID match
  node_names <- c(as.character(links_data$origin), as.character(links_data$destination))
  node_indices <- c(as.character(links_data$new_origin_index), as.character(links_data$new_dest_index))
  nodes <- unique(, node_indices)))
  names(nodes) <- c("names", "ID")
  nodes$names <- as.character(nodes$names)
  nodes$ID <- as.numeric(as.character(nodes$ID))

With functions defined, we’ll leverage them and plot our Sankey diagram with networkD3.

test_flows <- format_sankey_links(flows)
test_nodes <- format_sankey_nodes(test_flows)

sankeyNetwork(Links = test_flows, Nodes = test_nodes, Source = "new_origin_index",
              Target = "new_dest_index", Value = "value", NodeID = "names",
              NodeGroup = "names", fontSize = 13)

Note that in order to get this visualization on this WordPress powered site — this gave me the right idea — I saved the visualization output in rstudio to an html file using the command

saveNetwork(sank_test, test_sankey.html, selfcontained = TRUE)

, uploaded that file to my WordPress Media library, and the got it directly on this page by using the iframe plugin.

Now, our Texas pipeline flow diagram isn’t all that useful. It makes sense that a lot of that refined oil is piped down to Louisiana for exporting (remember I’ve removed intrastate flows because the majority of transportation flow is short haul and as such engulfs these diagrams if included). It’s also not shocking that a lot of crude oil from OK is piped to Texas for refining. So, where’s the power? I’ve had the most amount of success with these charts when using derivative type metrics for example… ‘$ per mph.’ This shows me the relative tradeoff between speed through supply chain and cost by mode. Additionally, in a gigantic supply chain, it shows me lot’s of the one-off movements that could be avoided with better planning.

Leave a Reply

Your email address will not be published. Required fields are marked *