In this project we will perform a basic analysis of the effect of severe weather events, in both the public health and economic senses, on the community. For our analysis we will be using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We will attempt to answer the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
First load the libraries required for the analysis.
if (!require('reshape2')) {
install.packages('reshape2');
require('reshape2');
}
## Loading required package: reshape2
Then download the data if it isn’t already present.
if (!file.exists('./repdata-data-StormData.csv.bz2')) {
download.file(
url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
destfile = './repdata-data-StormData.csv.bz2',
method = 'curl'
);
}
Next read the data, taking care to read it in as-is. The column names are listed below.
storm_data <- read.csv(bzfile('repdata-data-StormData.csv.bz2'), as.is = TRUE);
names(storm_data);
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Now that the data has been read in, begin the process of cleaning it. First, change the column names to lowercase, then retain only the columns that will be used in the analysis, and finally convert the event type column to a factor.
names(storm_data) <- tolower(names(storm_data));
storm_data <- storm_data[, c('evtype', 'fatalities', 'injuries', 'propdmg', 'propdmgexp', 'cropdmg', 'cropdmgexp')];
storm_data$evtype <- as.factor(storm_data$evtype);
We see that the casualty data is split across two columns - fatalities and injuries. Because of this, for the purpose of facilitating casualty analysis, I will add a column that combines the two columns into a total.
storm_data$casualties <- storm_data$fatalities + storm_data$injuries;
And taking a look at the cost data, we see each cost category is split into two - amount and exponent. Because of this, for the purpose of facilitating cost analysis, the two columns will be combined into a total, after converting the exponent into a numeric value. To achieve this we make use of a custom function (see below).
clean_numeric_data <- function(data) {
data[data %in% c('', '+', '?', '-')] <- '0';
data[data %in% c('h', 'H')] <- '2';
data[data %in% c('k', 'K')] <- '3';
data[data %in% c('m', 'M')] <- '6';
data[data %in% c('b', 'B')] <- '9';
data <- as.numeric(data);
return(data);
}
storm_data$propdmgexp <- clean_numeric_data(storm_data$propdmgexp);
storm_data$cropdmgexp <- clean_numeric_data(storm_data$cropdmgexp);
storm_data$propdmg <- storm_data$propdmg * 10^storm_data$propdmgexp;
storm_data$cropdmg <- storm_data$cropdmg * 10^storm_data$cropdmgexp;
storm_data$totaldmg <- storm_data$propdmg + storm_data$cropdmg;
Now that the data has been put in a useful form we are in a position to proceed.
Now we address two questions with the help of some summaries and plots.
Using melt and dcast functions from the reshape2 library we can calculate aggregate sums for both fatalities and injuries by event type, and in both cases we retain only the top 5 rows.
fatalities_long <- melt(storm_data, id.vars = "evtype", measure.vars = "fatalities", na.rm = TRUE);
fatalities_wide <- dcast(fatalities_long, evtype ~ variable, sum);
fatalities <- fatalities_wide[order(fatalities_wide$fatalities, decreasing = TRUE), ][1:5, ];
fatalities;
## evtype fatalities
## 830 TORNADO 5633
## 123 EXCESSIVE HEAT 1903
## 147 FLASH FLOOD 978
## 269 HEAT 937
## 452 LIGHTNING 816
We can see the top five severe weather event causes of fatalities above.
injuries_long <- melt(storm_data, id.vars = "evtype", measure.vars = "injuries", na.rm = TRUE);
injuries_wide <- dcast(injuries_long, evtype ~ variable, sum);
injuries <- injuries_wide[order(injuries_wide$injuries, decreasing = TRUE), ][1:5, ];
injuries;
## evtype injuries
## 830 TORNADO 91346
## 854 TSTM WIND 6957
## 164 FLOOD 6789
## 123 EXCESSIVE HEAT 6525
## 452 LIGHTNING 5230
And we can see the top five severe weather event causes of injuries above.
casualties_long <- melt(storm_data, id.vars = "evtype", measure.vars = "casualties", na.rm = TRUE);
casualties_wide <- dcast(casualties_long, evtype ~ variable, sum);
casualties <- casualties_wide[order(casualties_wide$casualties, decreasing = TRUE), ][1:5, ];
casualties;
## evtype casualties
## 830 TORNADO 96979
## 123 EXCESSIVE HEAT 8428
## 854 TSTM WIND 7461
## 164 FLOOD 7259
## 452 LIGHTNING 6046
And we can see the top five severe weather event causes of casualties above. Next we plot all of the above.
colors <- c('white', 'lightblue', 'pink', 'lightcyan', 'plum');
par(mfrow = c(2, 2));
fatalities_plot <- barplot(
fatalities$fatalities,
main = 'Fatalities By Event Type',
col = colors
);
text(fatalities_plot, par('usr')[3], labels = fatalities$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
injuries_plot <- barplot(
injuries$injuries,
main = 'Injuries By Event Type',
col = colors
);
text(injuries_plot, par('usr')[3], labels = injuries$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
casualties_plot <- barplot(
casualties$casualties,
main = 'Casualties By Event Type',
col = colors
);
text(casualties_plot, par('usr')[3], labels = casualties$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
casualties_pie <- pie(
casualties$casualties,
labels = casualties$evtype,
main = 'Casualties By Event Type',
cex = .75,
col = colors
);
As we can see from the above plots, there is some consistency in the event type across fatalities and injuries - in fact three of the top five in both categories are consistent.
Once again we use the melt and dcast functions from the reshape2 library to calculate aggregate sums of the cost of damage done to property, crops, and both by event type, and in both cases we retain only the top 5 rows.
propdmg_long <- melt(storm_data, id.vars = "evtype", measure.vars = "propdmg", na.rm = TRUE);
propdmg_wide <- dcast(propdmg_long, evtype ~ variable, sum);
propdmg <- propdmg_wide[order(propdmg_wide$propdmg, decreasing = TRUE), ][1:5, ];
propdmg;
## evtype propdmg
## 164 FLOOD 144657709807
## 406 HURRICANE/TYPHOON 69305840000
## 830 TORNADO 56947380676
## 666 STORM SURGE 43323536000
## 147 FLASH FLOOD 16822673978
We can see the top five severe weather events ordered by cost of property damage above.
cropdmg_long <- melt(storm_data, id.vars = "evtype", measure.vars = "cropdmg", na.rm = TRUE);
cropdmg_wide <- dcast(cropdmg_long, evtype ~ variable, sum);
cropdmg <- cropdmg_wide[order(cropdmg_wide$cropdmg, decreasing = TRUE), ][1:5, ];
cropdmg;
## evtype cropdmg
## 88 DROUGHT 13972566000
## 164 FLOOD 5661968450
## 586 RIVER FLOOD 5029459000
## 424 ICE STORM 5022113500
## 238 HAIL 3025954473
And we can see the top five severe weather events ordered by cost of crop damage above.
totaldmg_long <- melt(storm_data, id.vars = "evtype", measure.vars = "totaldmg", na.rm = TRUE);
totaldmg_wide <- dcast(totaldmg_long, evtype ~ variable, sum);
totaldmg <- totaldmg_wide[order(totaldmg_wide$totaldmg, decreasing = TRUE), ][1:5, ];
totaldmg;
## evtype totaldmg
## 164 FLOOD 150319678257
## 406 HURRICANE/TYPHOON 71913712800
## 830 TORNADO 57362333946
## 666 STORM SURGE 43323541000
## 238 HAIL 18761221986
And we can see the top five severe weather events ordered by cost of total damage above. Next we plot all of the above.
par(mfrow = c(2, 2));
propdmg_plot <- barplot(
propdmg$propdmg/10^9,
main = 'Property Damage By Event Type',
ylab = 'Billions of Dollars',
col = colors
);
text(propdmg_plot, par('usr')[3], labels = propdmg$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
cropdmg_plot <- barplot(
cropdmg$cropdmg/10^9,
main = 'Crop Damage By Event Type',
ylab = 'Billions of Dollars',
col = colors
);
text(cropdmg_plot, par('usr')[3], labels = cropdmg$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
totaldmg_plot <- barplot(
totaldmg$totaldmg/10^9,
main = 'Total Damage By Event Type',
ylab = 'Billions of Dollars',
col = colors
);
text(totaldmg_plot, par('usr')[3], labels = totaldmg$evtype, srt = 45, adj = c(1.1, 1.1), xpd = TRUE, cex = .75);
axis(2);
totaldmg_pie <- pie(
totaldmg$totaldmg,
labels = totaldmg$evtype,
main = 'Total Damage By Event Type',
cex = .75,
col = colors
);
From the above data analysis we can see that:
sessionInfo();
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-9 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.7 plyr_1.8.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.10 stringr_0.6.2 tools_3.1.2 yaml_2.1.11