The National Oceanic and Atmospheric Administration (NOAA) regularly publishes data on storm occurrences in the US. The annual data goes back to 1950 and includes time-series, geographic, and financial damage information as well as storm characteristics (event type, tornado width, wind gust estimates, etc.). While you could do thousands of things with this dataset, I wanted to 1) determine the most harmful storm type in the US and 2) predict storm type using location, time of year, and storm characteristics.
You can find the individual annual data files here. To get the data into an analysis-ready form, I scraped that source page and combined the individual annual files into a single dataset of approximately 1.3 million rows. After some initial cleanup (setting NaN values to 0, converting financial damage strings like 4.5K into usable numbers like 4,500, consolidating event types (for example, sleet counts as winter weather), and calculating storm duration from the storm begin and end times), I was ready to determine the most harmful storm.
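The damage cleanup step can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the function name `parse_damage` is hypothetical, and the `M` and `B` suffixes are assumed by analogy with the `K` example above.

```python
# Multipliers for magnitude suffixes. "K" comes from the 4.5K example in
# the post; "M" and "B" are assumptions made by analogy.
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_damage(raw):
    """Convert a damage string such as '4.5K' to a float; missing values become 0."""
    if raw is None:
        return 0.0
    text = str(raw).strip().upper()
    if text in ("", "NAN"):
        return 0.0
    suffix = text[-1]
    if suffix in SUFFIX_MULTIPLIERS:
        number = text[:-1] or "0"  # a bare suffix with no digits is treated as zero
        return float(number) * SUFFIX_MULTIPLIERS[suffix]
    return float(text)


print(parse_damage("4.5K"))
print(parse_damage(float("nan")))
```

In a pandas workflow, a function like this could be applied to the damage columns with `Series.apply`.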
Defining ‘harmful’ is a big part of the problem. Each observation (row) is a storm event, and one or more events make up an episode. For instance, a tornado that rips across Wichita and Andover, Kansas will be counted as two storm events but one episode. Falling hail followed by a tornado is likewise two storm events but a single episode. So do we care about storm events or episodes? I focused on storm events, and within those on property and crop damage along with direct and indirect deaths.
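To make the event versus episode distinction concrete, here is a small sketch with made-up rows. The episode identifiers and the tuple layout are assumptions for illustration only; they are not taken from the actual dataset.

```python
# Hypothetical rows: (episode_id, event_type). Episode 101 is one tornado
# recorded in two places; episode 102 is hail followed by a tornado.
rows = [
    (101, "Tornado"),
    (101, "Tornado"),
    (102, "Hail"),
    (102, "Tornado"),
]

event_count = len(rows)                                   # every row is one storm event
episode_count = len({episode for episode, _ in rows})     # distinct episode ids

print(event_count, episode_count)
```

Counting by event gives four storms here, while counting by episode gives two, which is why the choice matters for any "most harmful" ranking.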
To start, I calculated total damage as (property damage + crop damage) * (1 + direct deaths + indirect deaths). That equation gives much greater weight to storms with many deaths: deadlier storms count as more harmful storms. Under this definition, tornadoes are the most harmful storm type on an aggregate basis.
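As a quick worked example of that measure, the sketch below applies it to two made-up storm events. The function name `total_damage` and the input values are hypothetical.

```python
def total_damage(property_damage, crop_damage, direct_deaths, indirect_deaths):
    """Death-weighted damage: financial damage scaled by (1 + total deaths)."""
    financial = property_damage + crop_damage
    deaths = direct_deaths + indirect_deaths
    return financial * (1 + deaths)


# Two hypothetical events with the same financial damage
no_deaths = total_damage(1_000_000, 0, 0, 0)
three_deaths = total_damage(1_000_000, 0, 2, 1)

print(no_deaths, three_deaths)
```

With the same $1M of financial damage, three deaths quadruple the score. One consequence of this weighting is that an event with deaths but no recorded financial damage still scores zero.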
Most Harmful Weather by Storm Type 1950-2014 (billions of dollars)
However, if we look at total damage per occurrence, things change a bit. When the total damage for each storm type is divided by the number of occurrences of that type, hurricanes come out as the most harmful storms.
Most Harmful Weather per Occurrence by Storm Type 1950-2014
Type | Occurrences | Damage per Storm ($M) | Total Damage ($M)
---|---|---|---
Hurricane | 3070 | 123 | 377000 |
Coastal Flood | 470 | 64 | 30369 |
Tornado | 13166 | 43 | 578849 |
Tsunami | 13 | 27 | 290 |
Heat | 152 | 18 | 2803 |
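The per-occurrence ranking above comes down to a group-by: sum the damage for each storm type, count the events, and divide. Here is a minimal sketch on made-up data; the event list and the function name `damage_by_type` are hypothetical, and the real analysis presumably used pandas on the full dataset.

```python
from collections import defaultdict

# Hypothetical (event_type, total_damage) pairs
events = [
    ("Tornado", 40.0),
    ("Tornado", 60.0),
    ("Tornado", 20.0),
    ("Hurricane", 100.0),
]


def damage_by_type(events):
    """Return count, total damage, and damage per storm for each event type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event_type, damage in events:
        totals[event_type] += damage
        counts[event_type] += 1
    return {
        event_type: {
            "count": counts[event_type],
            "total": totals[event_type],
            "per_storm": totals[event_type] / counts[event_type],
        }
        for event_type in totals
    }


summary = damage_by_type(events)

# Rank the types two ways: by aggregate damage and by damage per occurrence
by_total = sorted(summary, key=lambda t: summary[t]["total"], reverse=True)
by_per_storm = sorted(summary, key=lambda t: summary[t]["per_storm"], reverse=True)

print(by_total)
print(by_per_storm)
```

Even in this toy data the two rankings disagree: the frequent storm type leads on aggregate damage, while the rare but severe type leads per occurrence.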
And if we were concerned only with deaths per occurrence for each storm type (leaving out reported monetary damage), the ranking looks different still:
Deadliest Storm per Occurrence by Type 1950-2014
Type | Occurrences | Deaths | Deaths per Storm
---|---|---|---
Tsunami | 13 | 33 | 2.53 |
Astronomical Low Tide | 2 | 1 | 0.5 |
Heat | 152 | 76 | 0.5 |
Avalanche | 58 | 25 | 0.43 |
Dense Fog | 165 | 65 | 0.39 |
So, rather than naming a single most harmful storm type, we need to say that the most harmful storm types are tornadoes, hurricanes, and tsunamis, depending on the measure. I could also agree with you if you think floods are the most harmful, since floods were split into flood, coastal flood, and flash flood in this analysis. In fact, if you look at the most harmful episode (using the same damage measure, financial damage * (1 + deaths)), it was a flood that occurred in California in 2006. No deaths, but a gargantuan amount of financial damage that can be attributed to all the Napa/Sonoma Valley vineyards that were destroyed.
Now let’s make some predictions. Using a random forest classification algorithm, I want to predict storm type based on location (state), time of year (month), duration of storm (in minutes), and property damage (in dollars). I set up the model, made predictions on the test set, and calculated accuracy like so…
```python
import numpy as np
import pandas as pd
import sklearn.metrics as skm
from sklearn.ensemble import RandomForestClassifier

# hd is the cleaned storm-events DataFrame built earlier.
# The random forest classifier needs numeric inputs, so encode the
# string columns as integer category codes.
hd['EVENT_TYPE'] = pd.Categorical(hd['EVENT_TYPE']).codes
hd['MONTH_NAME'] = pd.Categorical(hd['MONTH_NAME']).codes
hd['STATE'] = pd.Categorical(hd['STATE']).codes

# Split training and test data: roughly 30% of rows train, the rest test
train_mask = np.random.uniform(0, 1, len(hd)) <= 0.3
train = hd[train_mask]
test = hd[~train_mask]

# Train the model
features = ['STATE', 'MONTH_NAME', 'DURATION', 'DAMAGE_PROPERTY']
train_target = train['EVENT_TYPE']
train_data = train.loc[:, features]
rfc = RandomForestClassifier(n_estimators=500, oob_score=True)
rfc.fit(train_data, train_target)

# Declare the test set and make storm type predictions
test_target = test['EVENT_TYPE']
test_data = test.loc[:, features]
test_pred = rfc.predict(test_data)

# Print accuracy of predictions
print("Accuracy = %f" % skm.accuracy_score(test_target, test_pred))
```
Accuracy = 0.758946. An accuracy of 76% is not terrible considering that a) this is a ‘for-fun’ project and b) there are 31 unique storm types. Looking at the feature importances, the strongest predictor is storm duration, followed by state. So if I pick out the first storm event in the test set, I have roughly a 76% chance of correctly predicting its type.
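```python
# First observation in the test set
single_test_target = test['EVENT_TYPE'].iloc[0]

# Features in training order: STATE, MONTH_NAME, DURATION, DAMAGE_PROPERTY
# State is Indiana and Month is October
single_test_data = [26, 9, 2040, 5000]

# predict expects a 2D input (one row per sample), so wrap the single row in a list
single_test_pred = rfc.predict([single_test_data])

# single_test_target: 32           (Winter Storm)
# single_test_pred:   array([32])  (Winter Storm)
```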
The actual event type is a winter storm, and the model predicts a winter storm. That looks good for the first observation in the test set, so we’ll conclude there. All the code for this project can be found on my GitHub page at this link.