End to End NYC Subway Entries Data Project

A few weeks back I submitted a project to Udacity’s Intro to Data Science course that analyzed a dataset of NYC subway entries and weather. It was an attempt to explore, visualize, and build a model to predict hourly subway entries. It was almost an end-to-end data project… except that they supplied the dataset for us. What I am doing here is pulling data from three different sources to almost* replicate that dataset, thereby completing the end-to-end aspect of the exercise.

We need data on hourly MTA subway entries per location, the geographical coordinates of those locations, and the weather at those locations during each hour. Here is where I am getting that data, which I then parse, clean, and combine into a pandas DataFrame in Python.

I am breaking this process down into three main steps: (1) scrape the MTA site to identify, download, read, and clean the past month of subway entries data; (2) read the subway station locations (geographic coordinates) and combine them with the hourly entries; and (3) make the necessary API calls to Weather Underground for the time and location of each entry, adding the weather fields we want to the entries dataframe as we go. Easy 🙂
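To make step two concrete, here is a minimal sketch of combining entries with station coordinates. It is not my full code: the toy rows, the made-up `X999`/`R999` turnstile, and the choice of `UNIT` as the join key are assumptions for illustration. The `UUNIT` id is built the same way as in the sample output further down (`C/A`, `UNIT`, and `SCP` joined with underscores).

```python
import pandas as pd

# Toy stand-ins for the two tables; the real ones come from the downloaded files.
entries = pd.DataFrame({
    "C/A": ["R625", "A010", "X999"],   # X999 is a made-up unit with no location
    "UNIT": ["R062", "R080", "R999"],
    "SCP": ["01-00-01", "00-00-05", "00-00-00"],
    "ENTRIES": [4449129, 10327752, 100],
})
# Assumption: the location table can be keyed on UNIT.
locations = pd.DataFrame({
    "UNIT": ["R062", "R080"],
    "LAT": [40.669279, 40.764755],
    "LON": [-73.932967, -73.980646],
})

# One id per physical turnstile: C/A, UNIT and SCP joined with underscores.
entries["UUNIT"] = entries["C/A"] + "_" + entries["UNIT"] + "_" + entries["SCP"]

# A left merge keeps every entries row; units without a match get NaN coordinates.
merged = entries.merge(locations, on="UNIT", how="left")
missing = merged[merged["LAT"].isna()]

print(merged)
print("rows without coordinates:", len(missing))
```

Using a left merge rather than an inner one keeps the unmatched rows visible, so missing coordinates can be counted instead of silently dropped.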

Here is my full code for this. Of the multiple challenges involved in this process, two stood out as things to look out for in the future. First, the entries and location data have several idiosyncrasies that make them difficult to work with. For instance, sometimes the cumulative entries count went down… perhaps there is a mismatch with exits? There were 38 instances where the four-hour change was too big to be realistic. In fact, one value was 1,504,093,997, which means that in that four-hour window people entered that turnstile at a rate of about 104,000 per second. Hmm. Likewise, there were some mismatches with the location data where the lat/lon came back as NA. This is definitely an area that warrants further investigation.
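The counter problems above are easy to reproduce on toy data. The sketch below is an illustration rather than my actual cleaning code: the `add_hourly` helper, the `DATETIME` and `SUSPECT` column names, and the 20,000 cutoff are all assumptions. It differences the cumulative `ENTRIES` counter within each turnstile and flags readings that go down or jump implausibly.

```python
import pandas as pd


def add_hourly(df, max_per_reading=20_000):
    """Turn the cumulative ENTRIES counter into per-reading counts.

    max_per_reading is an illustrative cutoff, not a value from the post.
    """
    df = df.sort_values(["UUNIT", "DATETIME"]).copy()
    # Difference within each turnstile so counters from different devices never mix.
    df["HOURLY"] = df.groupby("UUNIT")["ENTRIES"].diff()
    # Flag decreasing counters and jumps too large to be real riders.
    df["SUSPECT"] = (df["HOURLY"] < 0) | (df["HOURLY"] > max_per_reading)
    return df


# Toy readings for two hypothetical turnstiles, four hours apart.
toy = pd.DataFrame({
    "UUNIT": ["A", "A", "A", "A", "B", "B", "B"],
    "DATETIME": pd.to_datetime([
        "2015-07-07 00:00", "2015-07-07 04:00", "2015-07-07 08:00", "2015-07-07 12:00",
        "2015-07-07 00:00", "2015-07-07 04:00", "2015-07-07 08:00",
    ]),
    "ENTRIES": [100, 300, 250, 1_504_094_247, 10, 15, 40],
})
checked = add_hourly(toy)
print(checked[["UUNIT", "DATETIME", "ENTRIES", "HOURLY", "SUSPECT"]])

# Sanity check on the outlier quoted above: entries per second over four hours.
rate_per_second = 1_504_093_997 / (4 * 60 * 60)
print(f"{rate_per_second:,.0f} entries per second")
```

The first reading of each turnstile has no predecessor, so its `HOURLY` value is NaN and it is not flagged.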

Second, the Weather Underground free developer API allows for only 5,000 calls per day and 10 calls per minute. For the month's worth of data we use, one call per station per day would require 31 days x 374 stations = 11,594 API calls. If I were more patient I would collect all the data I need over a period of several days… or of course I could always upgrade my Weather Underground subscription. Instead, for now, I sampled 20 observations from the dataset and made the calls for those. However, the code is available for those who want to follow this through to the end and actually do something with the dataset afterwards. Here is what the head of the final sample dataframe looks like…

          C/A  UNIT       SCP          STATION     LINENAME DIVISION  \
819476   R625  R062  01-00-01  CROWN HTS-UTICA           34      IRT   
361994  N333A  R141  00-03-01  FOREST HILLS-71         EFMR      IND   
5995     A010  R080  00-00-05      57 ST-7 AVE          NQR      BMT   
772292   R519  R223  00-03-01   46 ST-BLISS ST            7      IRT   
794934   R533  R055  00-00-00          MAIN ST            7      IRT   
477940  PTH04  R551  00-00-05     GROVE STREET            1      PTH   
577295  R161A  R452  01-00-04            72 ST          123      IRT   
69818    B021  R228  00-03-00            AVE J           BQ      BMT   
656659   R238  R046  00-06-00  42 ST-GRD CNTRL        4567S      IRT   
192154   N025  R102  01-06-00           125 ST         ACBD      IND   
24659    A038  R085  00-00-04   8 ST-B'WAY NYU           NR      BMT   
674498   R248  R178  00-00-01            77 ST            6      IRT   
284726   N138  R355  01-04-00    GREENWOOD-111            A      IND   
538656   R119  R320  00-00-01         CANAL ST            1      IRT   
532682   R114  R028  02-00-00        FULTON ST     2345ACJZ      IRT   
121528   G001  R151  00-06-00    STILLWELL AVE         DFNQ      BMT   
149165   J001  R460  01-00-02        MARCY AVE          JMZ      BMT   
94205    C017  R455  00-00-00            25 ST            R      BMT   
11698    A021  R032  01-00-05   42 ST-TIMES SQ  ACENQRS1237      BMT   
398821   N500  R020  00-00-01    47-50 ST-ROCK         BDFM      IND   

              DATE      TIME     DESC   ENTRIES  HOURLY                UUNIT  \
819476  07/07/2015  20:00:00  REGULAR   4449129     201   R625_R062_01-00-01   
361994  07/26/2015  13:00:00  REGULAR  11814675     633  N333A_R141_00-03-01   
5995    07/20/2015  16:00:00  REGULAR  10327752     381   A010_R080_00-00-05   
772292  07/08/2015  20:00:00  REGULAR   7322603     212   R519_R223_00-03-01   
794934  08/06/2015  16:00:00  REGULAR   7044705     352   R533_R055_00-00-00   
477940  08/01/2015  11:48:30  REGULAR    525440     132  PTH04_R551_00-00-05   
577295  07/27/2015  13:00:00  REGULAR   4449536     433  R161A_R452_01-00-04   
69818   08/06/2015  12:00:00  REGULAR     61550      11   B021_R228_00-03-00   
656659  08/03/2015  08:00:00  REGULAR    673286     102   R238_R046_00-06-00   
192154  07/11/2015  12:00:00  REGULAR   2061628      87   N025_R102_01-06-00   
24659   07/31/2015  04:00:00  REGULAR   3990474      23   A038_R085_00-00-04   
674498  07/27/2015  17:00:00  REGULAR  13037082    1174   R248_R178_00-00-01   
284726  07/26/2015  09:00:00  REGULAR  50331650       0   N138_R355_01-04-00   
538656  07/20/2015  13:00:00  REGULAR   7995434     157   R119_R320_00-00-01   
532682  08/05/2015  03:00:00  REGULAR    432989      36   R114_R028_02-00-00   
121528  07/14/2015  09:00:00  REGULAR    297881       4   G001_R151_00-06-00   
149165  07/14/2015  01:00:00  REGULAR   7377460      81   J001_R460_01-00-02   
94205   07/13/2015  04:00:00  REGULAR   3589402       4   C017_R455_00-00-00   
11698   07/15/2015  04:00:00  REGULAR    653643       0   A021_R032_01-00-05   
398821  07/12/2015  04:00:00  REGULAR  16219017      86   N500_R020_00-00-01   

              LAT        LON  TEMP    PRECIP RAIN  
819476  40.669279 -73.932967  82.0  -9999.00    0  
361994  40.721681 -73.844390  87.1  -9999.00    0  
5995    40.764755 -73.980646  91.9  -9999.00    0  
772292  40.743079 -73.918419  80.1  -9999.00    0  
794934  40.759578 -73.830056  81.0  -9999.00    0  
477940  40.719876 -74.042616  77.0  -9999.00    0  
577295  40.778575 -73.981912  75.9  -9999.00    0  
69818   40.625028 -73.960819  80.1  -9999.00    0  
656659  40.751849 -73.976945  78.1  -9999.00    0  
192154  40.811056 -73.952386  84.0  -9999.00    0  
24659   40.730348 -73.992705  75.2  -9999.00    0  
674498  40.773636 -73.959875  84.0  -9999.00    0  
284726  40.684364 -73.832181  81.0  -9999.00    0  
538656  40.722819 -74.006267  80.6  -9999.00    0  
532682  40.709938 -74.007983  73.9  -9999.00    0  
121528  40.577423 -73.981225  77.0  -9999.00    0  
149165  40.708377 -73.957751  73.0  -9999.00    0  
94205   40.660430 -73.997944  75.0  -9999.00    0  
11698   40.755905 -73.986504  75.9  -9999.00    0  
398821  40.758652 -73.981311  73.0  -9999.00    0  
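For anyone who wants to go beyond the 20-row sample, the rate-limit arithmetic and a simple throttle look roughly like this. It is a sketch under assumptions: `throttled` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real weather request that I am not reproducing here. The call limits and station count are the ones quoted earlier in the post.

```python
import math
import time

import pandas as pd

CALLS_PER_DAY = 5_000     # limits as quoted in the post
CALLS_PER_MINUTE = 10

calls_needed = 31 * 374   # days x stations
days_needed = math.ceil(calls_needed / CALLS_PER_DAY)
print(calls_needed, "calls, spread over at least", days_needed, "days")


def throttled(items, fetch, calls_per_minute=CALLS_PER_MINUTE, sleep=time.sleep):
    """Call fetch(item) for each item, pausing between calls to respect the cap."""
    pause = 60.0 / calls_per_minute
    results = []
    for i, item in enumerate(items):
        if i > 0:
            sleep(pause)   # no pause needed before the very first call
        results.append(fetch(item))
    return results


def fake_fetch(row):
    # Hypothetical placeholder for a real weather request.
    return {"TEMP": 80.0}


# Toy dataframe; the real one would be the merged entries and locations.
df = pd.DataFrame({"LAT": [40.7] * 100, "LON": [-73.9] * 100})
sample = df.sample(n=20, random_state=0)
rows = [row for _, row in sample.iterrows()]

# sleep is swapped for a no-op so the demo finishes instantly.
weather = throttled(rows, fake_fetch, sleep=lambda seconds: None)
sample = sample.assign(TEMP=[w["TEMP"] for w in weather])
print(sample.head())
```

With the real `time.sleep`, 20 calls take about two minutes; the full 11,594 would be bounded by the daily cap rather than the per-minute one.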

Cool project. I now view the Udacity subway entry project as an end-to-end data science project. Moving on 🙂

* I say almost because I don’t do things like turn the date into a datetime object, create a day-of-the-week variable, or grab all the weather elements from the Weather Underground API. That stuff can be done after we have the data… which is the goal of this side project.
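For reference, the date handling mentioned in the footnote is only a few lines of pandas. This is a sketch on two rows copied from the sample output; the `DATETIME` and `WEEKDAY` column names are my own choice, and treating -9999 in `PRECIP` as a missing-value marker is an assumption.

```python
import numpy as np
import pandas as pd

# Two rows copied from the sample output above.
df = pd.DataFrame({
    "DATE": ["07/07/2015", "07/26/2015"],
    "TIME": ["20:00:00", "13:00:00"],
    "PRECIP": [-9999.00, -9999.00],
})

# Combine the DATE and TIME strings into one datetime column.
df["DATETIME"] = pd.to_datetime(
    df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
)
# Day-of-the-week variable derived from the datetime.
df["WEEKDAY"] = df["DATETIME"].dt.day_name()
# Assumption: -9999 means "no reading", so replace it with NaN.
df["PRECIP"] = df["PRECIP"].replace(-9999.00, np.nan)

print(df)
```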
