California Wildfire Prediction using Suitability Model Analysis

Bolstering California wildfire preparedness using geospatial analysis

Collaborators: Peter Larcheveque and Akshay Bhide

Abstract

If you have ever lived in California, you have probably experienced the adverse effects of a wildfire at some point. Hazy skies, the smell of ash and fire in your nostrils, and dangerous AQI (Air Quality Index) levels interrupt your daily life for weeks at a time. At ground zero of the wildfires, people lose homes (many of which are uninsured), businesses, and, tragically, their lives. Only with the help of brave firefighters and a LOT of taxpayer dollars are we able to stave off such powerful forces. Using topographical rasters and historical wildfire data, we build a suitability model that predicts the areas most susceptible to wildfires and highlights under-serviced areas that have the potential to be ground zero for the next massive California wildfire.

Introduction

The question we address in this project is: What factors contribute to California wildfires, and can we analyze these factors to find high-risk areas for wildfires? The business case for answering this question rests on the following:

California spends about $2.5B annually on fighting wildfires (according to Michael Wara, Ph.D.)
Recent California wildfires (2014-2020) have total estimated damages of $40B
Wildfires are a Federal-level concern because extreme wildfire events can bankrupt insurance and reinsurance firms, and the burden then falls on the Federal government to aid and subsidize (e.g., Hurricane Katrina)

Key Geospatial Terminology

Raster: A data layer stored as a grid of georeferenced cells (pixels), where each cell holds a value (for example rainfall or slope) and the cell's position in the grid tells you where that value applies on the ground

GeoDataFrame: Equivalent to a Pandas DataFrame, but contains a mandatory “geometry” feature

Suitability Model: Methodology that identifies the best locations to site something or preserve an area. Utilizes multiple raster layers that are weighted to give a location a final suitability score. In this case, we analyze the suitability of different locations for wildfire susceptibility.

Extent: The boundaries of the given raster layer

ArcGIS: Software used to source data layers and conduct analyses on these data layers

Analysis

We begin our investigation of historical wildfire locations by importing a wildfire dataset and clipping the extent to California state lines.

We first want to understand whether wildfire areas are spatially autocorrelated. Simply put, do wildfires that burn near one another tend to consume similar amounts of area? If the answer is no, our study is pointless because this would imply wildfire sizes are spatially random. If the answer is yes, we can claim that there are some underlying location-based factors influencing wildfire size and a predictive analysis is worthwhile.

We use a hypothesis test as follows:

H₀: Wildfire areas (sq. miles) are randomly distributed

Hₐ: Wildfire areas (sq. miles) are not randomly distributed

Testing autocorrelation for geospatial features is tricky and non-conventional because there is the extra dimension of space. We use something called Moran’s I to compute the autocorrelation of wildfire areas. There is a ton of fancy math behind this statistic, which you can read here. TLDR: Moran’s I uses feature values and locations to compute autocorrelation. It is used to test for global autocorrelation for a continuous attribute.
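To make the statistic a bit more concrete, here is a minimal sketch of global Moran's I with a permutation-based pseudo p-value. This is not the code used in this project: the binary distance-band weights, the `threshold` parameter, the helper names, and the toy 4x4 grid of "fire areas" are all assumptions chosen for illustration.

```python
import math
import random


def distance_band_weights(points, threshold):
    """Binary spatial weights: 1 if two distinct points lie within `threshold`."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= threshold:
                w[i][j] = 1.0
    return w


def morans_i(values, w):
    """Global Moran's I: (n / S0) * sum_ij w_ij * d_i * d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def permutation_p_value(values, w, n_perm=999, seed=0):
    """One-sided pseudo p-value: shuffle values across locations and count
    how often the shuffled statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (hypothetical): large fires on the left half of a grid, small on the right.
points = [(x, y) for x in range(4) for y in range(4)]
areas = [10.0 if x < 2 else 1.0 for x, y in points]
w = distance_band_weights(points, threshold=1.0)  # edge-sharing neighbors only
i_obs, p_value = permutation_p_value(areas, w)
print(f"Moran's I = {i_obs:.3f}, pseudo p-value = {p_value:.3f}")
```

A positive I means similar values cluster together in space, while a value near zero means no spatial pattern. Note that with 999 permutations, the smallest pseudo p-value this procedure can return is 1/1000 = 0.001.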

We compute Moran’s I for wildfire area:

Using Moran’s I, we obtain a p-value of 0.001.

Conclusion: Because the p-value is less than the accepted threshold value of 5%, we reject the null hypothesis in favor of the alternative, thus concluding that California wildfire areas are not random and that there is likely spatial autocorrelation between the areas of California wildfires.

Rainfall r: -0.0776

Temperature r: -0.0400

Biomass r: -0.1593

Slope r: -0.1439

Rainfall: 10% - rainfall gives indication of soil moisture and vegetation growth, but is a worse indicator than biomass; low weight

Temperature: 10% - although temperature drives humidity, it had the lowest correlation score; low weight

Biomass: 50% - vegetation is the primary driver of wildfire fuel and had the highest correlation score; highest weight

Slope: 30% - correlation analysis shows there is some predictive power and research shows spread potential; middle weight

Suitability Score = (0.1 * rainfall value) + (0.1 * temperature value) + (0.5 * biomass value) + (0.3 * slope value)
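As a rough illustration of the formula above, the sketch below applies the weights cell by cell to four tiny 2x2 grids. The grid values are made up, and we assume each raster has already been normalized to the 0-1 range and oriented so that a higher value means higher risk; the function name `suitability_score` is ours, not part of any library.

```python
# Weights taken from the formula above.
WEIGHTS = {"rainfall": 0.1, "temperature": 0.1, "biomass": 0.5, "slope": 0.3}


def suitability_score(layers, weights=WEIGHTS):
    """Cell-by-cell weighted sum of equally shaped 2D grids (lists of rows)."""
    names = list(weights)
    rows = len(layers[names[0]])
    cols = len(layers[names[0]][0])
    return [
        [sum(weights[name] * layers[name][r][c] for name in names) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical normalized rasters (0 = lowest risk, 1 = highest risk).
layers = {
    "rainfall": [[0.2, 0.8], [0.5, 0.0]],
    "temperature": [[0.6, 0.4], [0.5, 1.0]],
    "biomass": [[1.0, 0.0], [0.5, 0.2]],
    "slope": [[0.5, 0.0], [0.5, 1.0]],
}
scores = suitability_score(layers)
for row in scores:
    print([round(value, 2) for value in row])
```

For the top-left cell this works out to (0.1 * 0.2) + (0.1 * 0.6) + (0.5 * 1.0) + (0.3 * 0.5) = 0.73. Because the weights sum to 1, the score stays in the same 0-1 range as the inputs.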

Examination of results:

Darker areas indicate more susceptibility to wildfires
Areas with high susceptibility generally reside inside National Parks
Areas with low susceptibility reside inside the San Joaquin Valley (farmland)

Optimizing Fire Station Locations

Computing these buffers for all fire stations in California we get:

Underserviced Areas & Recommendations

We recommend adding additional preventative measures in the areas of Klamath National Forest, Modoc National Forest, and Stanislaus National Forest. This is because our suitability model has identified these areas as high risk and there are no fire stations within 15 mile buffers of these locations. If a wildfire were to break out in these areas, fire stations might not be able to react fast enough, and the wildfire may become too large too quickly. We recommend adding fire stations or supply depots to cover the locations marked by the red pins.

Next Steps

Our analysis covers many of the fundamentals behind wildfire inception and prevention. However, there are a few things we can improve in the future.

Include a wind speed raster: one of the most crucial aspects of wildfire intensity is wind speed because it increases flame length and carries embers that can ignite other biomass
Perform location allocation of fire stations and supply depots: we can optimize the placement of recommended fire stations to maximize land coverage, road coverage, and civilian coverage
Consider metrics for assessing how effective these recommended fire station and supply depot locations would be in mitigating wildfires

Conclusion

Image: Historical California Wildfires

Above Image: Left plot shows reference distribution versus the observed statistic (tiny red dot), right plot shows focal value against its spatial lag, for which the Moran I statistic is the slope

Image: P-value

Raster Suitability Model

Now that we have concluded that there are some underlying factors contributing to California wildfire area, let’s build a suitability model!

Our methodology is as follows:

Import topographical raster data
Clip rasters to the extent of California
Normalize rasters
Correlate rasters to fire size
Use correlation coefficient + outside research + assumptions to finalize raster weighting
Use raster weights and data values to obtain a suitability score (higher score = higher wildfire risk)
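The normalization step matters because the rasters come in different units (inches of rain, degrees of slope, and so on), so they need a common scale before they can be weighted. The write-up doesn't say which normalization was used, so the sketch below assumes simple min-max scaling to the 0-1 range; the function name, the `nodata` convention, and the sample slope values are hypothetical.

```python
def min_max_normalize(grid, nodata=None):
    """Rescale valid cells of a 2D grid to [0, 1]; leave nodata cells untouched."""
    valid = [v for row in grid for v in row if v != nodata]
    lo, hi = min(valid), max(valid)
    span = hi - lo
    if span == 0:
        # A constant raster carries no information; map every valid cell to 0.
        return [[nodata if v == nodata else 0.0 for v in row] for row in grid]
    return [[nodata if v == nodata else (v - lo) / span for v in row] for row in grid]


# Hypothetical slope raster in degrees, with -9999.0 marking cells outside California.
slope_degrees = [[0.0, 15.0], [30.0, -9999.0]]
normalized = min_max_normalize(slope_degrees, nodata=-9999.0)
print(normalized)
```

Skipping the nodata cells is important here: if the placeholder value were included in the minimum, it would squash every real value toward 1.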

Based on outside research and available data, we have decided to use four topographical rasters: rainfall, temperature, biomass, and slope.

Image: sample fire stations with 15 minute drive time buffers

Image: Rainfall, temperature, biomass, and slope rasters

Correlating rasters to the 100 largest wildfires:

We can see that the r values are low for all of our raster layers, so none of them has quite the predictive power we were hoping for. Biomass and slope have larger correlation coefficients (in absolute value) than rainfall and temperature, so this will still have some effect on our final weighting determinations. However, due to the low signal, we have to take these results with a grain of salt.
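For readers who want to see what a correlation coefficient like this measures, here is a minimal sketch of Pearson's r between a raster value sampled at each fire and the fire's size. We are assuming Pearson's r is the statistic in question (the write-up only says "r"), and the sample numbers below are invented, not taken from the project data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical samples: normalized biomass at four fires and their burned areas (sq. miles).
biomass_at_fire = [0.9, 0.7, 0.4, 0.2]
fire_area = [100.0, 300.0, 200.0, 400.0]
r = pearson_r(biomass_at_fire, fire_area)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

Squaring r gives the share of variance in fire size that a straight-line fit on that raster would explain. For the biomass value reported in this analysis, (-0.1593)² ≈ 0.025, or about 2.5%, which is why the signal is described as low.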

Final Weights + Justification:

Suitability Model - Results

Image: Final Suitability Model Raster

We want to answer the question: “Are fire stations located in areas with high susceptibility?” To answer this, we overlay fire stations on our final suitability model raster and compute 15-minute drive time buffers and 15-mile buffers around each fire station. If a mildly susceptible area is not covered by a fire station buffer, we claim that the area is underserviced.
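The 15-mile buffer check can be sketched in a few lines. The example below is an approximation under stated assumptions, not the ArcGIS workflow used here: it measures straight-line (haversine) distance between points, the `min_score` cutoff for "susceptible" is a hypothetical value, and the station and cell coordinates are invented. Drive time buffers need a road network and are not covered.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def underserviced(cells, stations, min_score=0.6, buffer_miles=15.0):
    """Cells at or above `min_score` that lie outside every station's buffer."""
    return [
        cell
        for cell in cells
        if cell["score"] >= min_score
        and all(
            haversine_miles(cell["lat"], cell["lon"], s_lat, s_lon) > buffer_miles
            for s_lat, s_lon in stations
        )
    ]


# Hypothetical station and raster cell centers.
stations = [(38.0, -120.0)]
cells = [
    {"name": "A", "lat": 38.1, "lon": -120.0, "score": 0.8},  # about 7 miles away
    {"name": "B", "lat": 38.5, "lon": -120.0, "score": 0.9},  # about 35 miles away
    {"name": "C", "lat": 39.0, "lon": -121.0, "score": 0.2},  # far away but low risk
]
gaps = underserviced(cells, stations)
print([cell["name"] for cell in gaps])
```

Only cell B is flagged: A is inside the buffer, and C is too low-risk to count even though no station is nearby.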

Image: Underserviced areas in Klamath and Modoc National Forests

Image: all fire stations with 15 mile buffers

Image: Underserviced area in Stanislaus National Forest

Wildfires in California have had a huge detrimental impact in recent years. In this analysis, we attempted to answer the question, “Can we predict California wildfire size and provide recommendations for mitigating California wildfires?” To answer this, we used statistics and geospatial analyses to first determine whether wildfire areas are randomly distributed. Using a hypothesis test, we rejected the null hypothesis in favor of the alternative, inclining us to believe that there are underlying factors affecting California wildfire areas. This justified building a suitability model. We correlated rainfall, temperature, biomass, and slope rasters with wildfire area, and found that these rasters did not correlate very well with wildfire area. Still, we used these correlation coefficients, outside research, and assumptions to come up with the final weightings: rainfall - 10%, temperature - 10%, biomass - 50%, slope - 30%. We combined these weights with the raster data values to obtain a final suitability raster layer. In this layer, darker shades indicate higher susceptibility to wildfire.

We proceeded to compute 15-mile buffers for all fire stations in California in order to find areas that have high susceptibility and are also not within the fire station buffers. We claim these areas are underserviced. Examples of underserviced areas are Klamath, Modoc, and Stanislaus National Forests. We recommend building fire stations in these areas in order to mitigate potential future wildfires.

Thank you for reading! The code, paper, and presentation for this project can be found here.