A few weeks back I submitted a project to Udacity’s Intro to Data Science course that analyzed a dataset of NYC subway entries and weather. It was an attempt to explore, visualize, and build a model to predict hourly subway entries. It was almost an end-to-end data project… except that they supplied the dataset for us. What I am doing here is pulling data from three different sources to almost* replicate that dataset, thereby completing the end-to-end aspect of the exercise.
We need data on hourly MTA subway entries per location, the geographical coordinates of those locations, and the weather at those locations during each hour. Here is where I am getting that data, which I then parse, clean, and combine into a pandas DataFrame in Python.
- http://web.mta.info/developers/turnstile.html: For hourly entries
- https://github.com/chriswhong/nycturnstiles/blob/master/geocoded.csv: For geographic coordinates. Also good to look at this: https://vimeo.com/64740217
- http://api.wunderground.com/api: For weather per time and location
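Joining the geocoded station list onto the entries data is a simple keyed merge. Here is a minimal sketch of that join; the column order I assign to the geocoded CSV is my assumption (the raw file ships without a header row, so verify it against the file itself), and I stand in a two-row snippet for the real file:

```python
import io
import pandas as pd

# Two-row stand-in for chriswhong's geocoded.csv; the real file is larger.
# The column order here is an assumption -- check the raw CSV, which has no header.
geo_csv = io.StringIO(
    "R062,R625,CROWN HTS-UTICA,34,IRT,40.669279,-73.932967\n"
    "R141,N333A,FOREST HILLS-71,EFMR,IND,40.721681,-73.844390\n"
)
geo = pd.read_csv(
    geo_csv, header=None,
    names=["UNIT", "C/A", "STATION", "LINENAME", "DIVISION", "LAT", "LON"],
)

# Entries data is keyed by the same remote-unit identifier
entries = pd.DataFrame({"UNIT": ["R062"], "ENTRIES": [4449129]})

# Left-join so entries with no geocoded match surface as NaN lat/lon
merged = entries.merge(geo[["UNIT", "LAT", "LON"]], on="UNIT", how="left")
```

A left join (rather than inner) is deliberate: it keeps every entries row and makes the location mismatches visible as NA coordinates instead of silently dropping them.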
I am breaking this process down into three main steps: (1) scrape the MTA site to identify, download, read, and clean the past month of subway entries data; (2) read the subway locations (geo coordinates) and combine them with the hourly entries; and (3) make the necessary API calls to Weather Underground for the time and location of each entry, adding the weather data we want to the entries dataframe as we go. Easy 🙂
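Step one can be sketched without scraping at all. Instead of parsing the HTML link list, this assumes the turnstile files follow the `turnstile_YYMMDD.txt` naming scheme and are posted weekly on Saturdays, which should be verified against the developer page itself:

```python
from datetime import date, timedelta

BASE = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{:%y%m%d}.txt"

def weekly_urls(end, weeks=5):
    # Walk back to the most recent Saturday (weekday 5), then step back a
    # week at a time; weeks=5 covers roughly the past month of data.
    sat = end - timedelta(days=(end.weekday() - 5) % 7)
    return [BASE.format(sat - timedelta(weeks=w)) for w in range(weeks)]

urls = weekly_urls(date(2015, 8, 8))
# Each file is a plain CSV, so reading them is then just:
# frames = [pd.read_csv(u) for u in urls]
# raw = pd.concat(frames, ignore_index=True)
```

Scraping the page for anchor tags works too and is more robust to schedule changes; the date arithmetic above is just the shorter sketch.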
Here is my full code for this. Of the multiple challenges involved in this process, two stood out as things to look out for in the future. First, the entries and location data have several idiosyncrasies that make them difficult to work with. For instance, sometimes the cumulative entries count went down… perhaps there is a mismatch with exits? There were also 38 instances where the four-hour change was too big to be realistic. In fact, one value was 1,504,093,997, which would mean people entered that turnstile at a rate of roughly 104,000 per second over that four-hour window. Hmm. Likewise, there were some mismatches with the location data where the lat/lon came back as NA. This is definitely an area that warrants further investigation.
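One hedged way to handle those counter idiosyncrasies is to difference the cumulative counter per turnstile and null out impossible steps. The toy data and the 10,000-entries-per-period cutoff below are my own illustrations, not values from the source data:

```python
import pandas as pd

# Toy cumulative-counter data for one turnstile; real values come from the
# MTA files. Note the negative step and the absurd jump at the end.
df = pd.DataFrame({
    "UUNIT": ["A"] * 4,
    "ENTRIES": [1000, 1201, 1150, 1504093997],
})

# Entries per period = per-turnstile difference of the cumulative counter
df["HOURLY"] = df.groupby("UUNIT")["ENTRIES"].diff()

# Null out impossible values: negative diffs (counter resets or mismatches)
# and jumps too large for a four-hour window (the cutoff is a judgment call)
bad = (df["HOURLY"] < 0) | (df["HOURLY"] > 10000)
df.loc[bad, "HOURLY"] = float("nan")
```

Grouping by the unique turnstile ID before differencing matters: without it, the diff at each turnstile boundary would compare counters from two unrelated machines.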
Second, the Weather Underground free developer API allows for only 5,000 calls per day or 10 calls per minute. For the month's worth of data we use, this would require 31 days x 374 stations = 11,594 API calls. If I were more patient I would collect all the data I need over a several-day period… or of course I could always upgrade my Weather Underground subscription. Instead, for now, I sampled 20 observations from the dataset and made the calls on that data. However, the code is available for those who want to follow this through to the end and actually do something with the dataset thereafter. Here is what the final sample dataframe looks like…
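The sampled calls can be sketched as follows. The API key is a placeholder, and the history-endpoint URL shape is my reading of the Wunderground docs at the time, so treat it as an assumption:

```python
import time
from datetime import date

API_KEY = "YOUR_KEY"  # placeholder -- requires a free Wunderground developer key

def history_url(key, d, lat, lon):
    # History endpoint for one date at one point; the URL shape is my
    # reading of the Wunderground docs and should be double-checked.
    return ("http://api.wunderground.com/api/{}/history_{:%Y%m%d}/q/{},{}.json"
            .format(key, d, lat, lon))

url = history_url(API_KEY, date(2015, 7, 7), 40.669279, -73.932967)

# In the real loop: sample 20 rows, fetch each URL, parse out temp/precip,
# and sleep between calls to stay under the 10-calls-per-minute cap, e.g.:
# sample = entries.sample(20)
# for _, row in sample.iterrows():
#     resp = requests.get(history_url(API_KEY, row_date, row["LAT"], row["LON"]))
#     time.sleep(6)
```

Sleeping six seconds per call caps the loop at 10 calls per minute, which is the binding constraint on a 20-row sample; the daily quota only bites on the full dataset.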
          C/A  UNIT       SCP          STATION     LINENAME DIVISION  \
819476   R625  R062  01-00-01  CROWN HTS-UTICA           34      IRT
361994  N333A  R141  00-03-01  FOREST HILLS-71         EFMR      IND
5995     A010  R080  00-00-05      57 ST-7 AVE          NQR      BMT
772292   R519  R223  00-03-01   46 ST-BLISS ST            7      IRT
794934   R533  R055  00-00-00          MAIN ST            7      IRT
477940  PTH04  R551  00-00-05     GROVE STREET            1      PTH
577295  R161A  R452  01-00-04            72 ST          123      IRT
69818    B021  R228  00-03-00            AVE J           BQ      BMT
656659   R238  R046  00-06-00  42 ST-GRD CNTRL        4567S      IRT
192154   N025  R102  01-06-00           125 ST         ACBD      IND
24659    A038  R085  00-00-04   8 ST-B'WAY NYU           NR      BMT
674498   R248  R178  00-00-01            77 ST            6      IRT
284726   N138  R355  01-04-00    GREENWOOD-111            A      IND
538656   R119  R320  00-00-01         CANAL ST            1      IRT
532682   R114  R028  02-00-00        FULTON ST     2345ACJZ      IRT
121528   G001  R151  00-06-00    STILLWELL AVE         DFNQ      BMT
149165   J001  R460  01-00-02        MARCY AVE          JMZ      BMT
94205    C017  R455  00-00-00            25 ST            R      BMT
11698    A021  R032  01-00-05   42 ST-TIMES SQ  ACENQRS1237      BMT
398821   N500  R020  00-00-01    47-50 ST-ROCK         BDFM      IND

              DATE      TIME     DESC   ENTRIES  HOURLY                UUNIT  \
819476  07/07/2015  20:00:00  REGULAR   4449129     201   R625_R062_01-00-01
361994  07/26/2015  13:00:00  REGULAR  11814675     633  N333A_R141_00-03-01
5995    07/20/2015  16:00:00  REGULAR  10327752     381   A010_R080_00-00-05
772292  07/08/2015  20:00:00  REGULAR   7322603     212   R519_R223_00-03-01
794934  08/06/2015  16:00:00  REGULAR   7044705     352   R533_R055_00-00-00
477940  08/01/2015  11:48:30  REGULAR    525440     132  PTH04_R551_00-00-05
577295  07/27/2015  13:00:00  REGULAR   4449536     433  R161A_R452_01-00-04
69818   08/06/2015  12:00:00  REGULAR     61550      11   B021_R228_00-03-00
656659  08/03/2015  08:00:00  REGULAR    673286     102   R238_R046_00-06-00
192154  07/11/2015  12:00:00  REGULAR   2061628      87   N025_R102_01-06-00
24659   07/31/2015  04:00:00  REGULAR   3990474      23   A038_R085_00-00-04
674498  07/27/2015  17:00:00  REGULAR  13037082    1174   R248_R178_00-00-01
284726  07/26/2015  09:00:00  REGULAR  50331650       0   N138_R355_01-04-00
538656  07/20/2015  13:00:00  REGULAR   7995434     157   R119_R320_00-00-01
532682  08/05/2015  03:00:00  REGULAR    432989      36   R114_R028_02-00-00
121528  07/14/2015  09:00:00  REGULAR    297881       4   G001_R151_00-06-00
149165  07/14/2015  01:00:00  REGULAR   7377460      81   J001_R460_01-00-02
94205   07/13/2015  04:00:00  REGULAR   3589402       4   C017_R455_00-00-00
11698   07/15/2015  04:00:00  REGULAR    653643       0   A021_R032_01-00-05
398821  07/12/2015  04:00:00  REGULAR  16219017      86   N500_R020_00-00-01

              LAT        LON  TEMP   PRECIP  RAIN
819476  40.669279 -73.932967  82.0 -9999.00     0
361994  40.721681 -73.844390  87.1 -9999.00     0
5995    40.764755 -73.980646  91.9 -9999.00     0
772292  40.743079 -73.918419  80.1 -9999.00     0
794934  40.759578 -73.830056  81.0 -9999.00     0
477940  40.719876 -74.042616  77.0 -9999.00     0
577295  40.778575 -73.981912  75.9 -9999.00     0
69818   40.625028 -73.960819  80.1 -9999.00     0
656659  40.751849 -73.976945  78.1 -9999.00     0
192154  40.811056 -73.952386  84.0 -9999.00     0
24659   40.730348 -73.992705  75.2 -9999.00     0
674498  40.773636 -73.959875  84.0 -9999.00     0
284726  40.684364 -73.832181  81.0 -9999.00     0
538656  40.722819 -74.006267  80.6 -9999.00     0
532682  40.709938 -74.007983  73.9 -9999.00     0
121528  40.577423 -73.981225  77.0 -9999.00     0
149165  40.708377 -73.957751  73.0 -9999.00     0
94205   40.660430 -73.997944  75.0 -9999.00     0
11698   40.755905 -73.986504  75.9 -9999.00     0
398821  40.758652 -73.981311  73.0 -9999.00     0
Cool project. I now view the Udacity subway entry project as an end-to-end data science project. Moving on 🙂
* I say almost because I don’t do things like turn the date into a datetime object, create a day-of-the-week variable, or grab all the weather elements from the Weather Underground API. That stuff can be done after we have the data… which is the goal of this side project.