
Predicting California Wildfire Size with Neural Networks (Part 1)

Building A Machine Learning Project From Problem to Solution

"You could travel the world, but nothing comes close to the golden coast." – Katy Perry

California’s a beautiful place.

Rolling hills of the Bay Area (source: me)

Its dry, sunlit Mediterranean climate makes for some of the best outdoor activities anywhere.

But because of these same natural factors and the human activity that accompanies them, the state also has over 2 million homes at high or extreme risk of wildfire damage. In 2018 alone, over 1.8 million acres burned in California wildfires.

When fire burns at this scale, it not only poses an immediate threat to those at the ignition site, but can also cause wide-reaching effects many miles away, as I experienced firsthand with the 2018 Camp Fire: California’s deadliest, most destructive wildfire.

Heavy smoke blew down from the northeast part of the state to the Bay Area, closing schools and creating a local shortage of pollutant-blocking N95 masks. For those affected, it meant at best two weeks of isolation from the outdoors, as it did for me, and at worst death, homelessness, and the destruction of entire communities.

Downtown San Francisco, despite the fire being ~200 miles northeast (source: me)
Yikes! Actual photos of San Jose at the time (source: me)

Having access to tools that can provide an early indication of risk is important, since they allow officials to make more informed, proactive decisions and help communities prepare for these potentially devastating events.

Thus, I’d like to see if we can use historical wildfire data, statistics, and modern Machine Learning techniques to construct a model that predicts the approximate area of land burned in a wildfire event, with the hope of productionizing such a system to aid officials in real time.

Before I continue, I’d like to mention that this project is currently the lone continuation of an undergraduate research study previously conducted alongside my former research partner Michael L., who helped formulate the problem, design the data schema, and guide the model analysis while we built both decision-tree-based risk classification models and deep regression models (like those that will be explored in this series).

Data Sources

Using only meteorological (climate) data bound to the time and space relevant to each event is a reasonable starting point for model inputs given that factors like temperature, wind speed, and humidity have been used for many years in occurrence forecasting via fire indexes. Having a tool based on inputs like these could give fire specialists a sense of how far-reaching a new fire event could be post-ignition. We’ll discuss other potentially useful factors one might consider for model development and try to bring them in as we iterate on different model architectures later on.

In order to come up with a preliminary dataset useful for training, we have to marry two different data sources: wildfire data and climate data.

Wildfire Data Source

Through a bit of searching, I came across a REST API run by the United States Geological Survey (USGS), an organization dedicated to studying natural hazards in the United States. This API powers ArcGIS map systems used in tracking live wildfire events and exposes useful information about these events including each fire ignition’s date of discovery, latitude, longitude, and burn area. The API is publicly accessible, spans incidents going back to the year 2002, and appears actively updated through 2019 as of this writing.

Climate Data Source

Given that each wildfire event obtained by the USGS API contains a corresponding spatiotemporal (time and space) coordinate, we should be able to map each coordinate through some weather service to retroactively inspect the scene of the ignition site corresponding to each event.

After searching and playing with different climate data sources with varying access limitations, I stumbled upon Dark Sky’s Time Machine API. While not completely free (nor eternal 😢 ), this REST API allows users to query the hour by hour observed weather conditions including temperature, wind speed, and humidity for a given spatiotemporal coordinate described by a latitude, longitude, and historical date. So we can obtain the date of discovery, latitude, and longitude of a fire event by querying the USGS API, come up with a time range to represent the context and duration of the fire event, then feed that information in as inputs to requests made to the Dark Sky API.
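For a sense of what one of these calls looks like, here is a minimal sketch of a single Time Machine request, assuming the URL format documented by Dark Sky at the time; the coordinates and timestamp are made up, and DARKSKY_API_KEY is the key we’ll store in an environment file shortly.

// Illustrative only: one Dark Sky Time Machine request for one spatiotemporal coordinate.
const latitude = 39.81;       // example ignition latitude
const longitude = -121.44;    // example ignition longitude
const unixTime = 1541660400;  // example historical timestamp (seconds since epoch)

const url = `https://api.darksky.net/forecast/${process.env.DARKSKY_API_KEY}` +
  `/${latitude},${longitude},${unixTime}`;
// A GET on this URL returns the observed conditions for that day, including
// hourly temperature, wind speed, and humidity.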

Great! Now that we have our data sources and know what we want to accomplish with them, let’s see what tools we’ll need to get our hands on the actual data.

Data Tech Stack

Decisions, decisions.

Web Automation

Because of the number of API requests we’ll need to make, and given that each Dark Sky request incurs an actual dollar cost beyond a certain number of requests, I decided to go with my weapon of choice for web automation: Node.js. While a language like Python would work just as well, creating chains of asynchronous tasks like web requests is something best accomplished in a Node environment, as we’ll see later when creating the data transformation pipeline.

Great, so we’ve established the tool to get the data we need, but where will we store the results of these API requests? We need a database!

Database

Since each API request may generate a JSON response object that varies by field names, i.e. has a varying schema, a natural choice would be a document-based NoSQL solution. Also, since we don’t anticipate knowing the structure of these response objects completely prior to requesting them, it behooves us to use a schema-less solution so that we can dump all response objects first, then analyze and clean any unexpected attribute arrangements or degenerate responses. (Note: this is a choice of convenience; finding all theoretical attribute variations a priori is time-consuming and not very rewarding, since the cost of not using some data is minimal.) Because of its ease of use, large community support, and strong integration with Node, MongoDB seems like the right choice.

Data sources ✔ Automation tool ✔ Database ✔

Let’s set up our environment for using these tools.

Environment

I happened to use MongoDB Community Server 4.0, Node.js v8.11.3, and Windows 10 from a desktop environment while working on this project, so you should be able to follow along with that setup without issues. Once you have those installed, we’ll be working directly from the command line and following the README.

First, we make sure to start the MongoDB server (this will take up a single terminal window).

> mongod

After cloning the project, in a new terminal inside the scripts folder, we’ll install the project dependencies using npm. This could take some time.

> npm install

Now we’ll set up an environment file so that we can abstract some commonly used variables from the rest of the code that will change depending on the deployment mode. Make sure to replace YOUR_MONGODB_URL with the URL of your local server instance, most likely "mongodb://localhost:27017". If you choose to deploy to a remote MongoDB server, make sure to change this URL accordingly. Also, make sure to replace YOUR_DARKSKY_API_KEY with the API key obtained after registering for access to the Dark Sky API.

> touch .env
> echo export PRIMARY_MONGODB_URL=YOUR_MONGODB_URL >> .env
> echo export DARKSKY_API_KEY=YOUR_DARKSKY_API_KEY >> .env
> source .env

Data sources ✔ Automation tool ✔ Database ✔ Environment ✔

Alright, let’s see some code.

Data Collection Code

We’ll start by gathering a collection of historical wildfire events from the USGS REST API.

Wildfire Data

Here’s the base URL we’ll be querying:

https://wildfire.cr.usgs.gov/ArcGIS/rest/services/geomac_dyn/MapServer/${yearId}/query?where=1%3D1&outFields=*&outSR=4326&f=json

The API distinguishes the historical wildfire event dataset for each year via an integer id identifying the ArcGIS layer holding that data, starting with 10 for the latest year and increasing toward the earliest year, 2002. For example, as of this writing, the layer with id=10 holds 2019 fire data, the layer with id=11 holds 2018 fire data, … , and the layer with id=27 holds 2002 fire data. We can hardcode these ids in an array, iterate through the array replacing the ${yearId} parameter of the query URL with each id, and perform an API request on each resulting URL, thus mapping each id to a response object and getting fire data for every year.
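As a rough sketch of that mapping (using axios purely for illustration; the repository’s actual HTTP client and helper names may differ):

// Map each ArcGIS layer id (one per year, 10 = 2019 ... 27 = 2002) to its query response.
const axios = require('axios');

const YEAR_IDS = Array.from({ length: 18 }, (_, i) => 10 + i); // [10, 11, ..., 27]

const buildUrl = (yearId) =>
  `https://wildfire.cr.usgs.gov/ArcGIS/rest/services/geomac_dyn/MapServer/${yearId}` +
  `/query?where=1%3D1&outFields=*&outSR=4326&f=json`;

// One request per layer; resolves to an array of response objects, one per year.
const downloadAllYears = () =>
  Promise.all(YEAR_IDS.map((yearId) => axios.get(buildUrl(yearId)).then((res) => res.data)));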

We can now use the resultant array to populate our database. First, we connect to the database using the PRIMARY_MONGODB_URL field we set earlier, then we execute the above code upon successful connection. Once the API requests have completed, the resultant array of response objects will be in the array mapRes. We can store this array as a collection "raw" in a database "arcgis" using a function called saveToDBBuffer. This entire process gets wrapped in a Promise object and is returned by the downloadRaw function.
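Sketched out, the stage could look something like this; downloadAllYears is the helper from the previous sketch, and saveToDBBuffer is simplified here to a straight insertMany, so treat it as an outline of the idea rather than the project’s exact code.

// downloadRaw (sketch): fetch every year's fire data, then persist the
// responses to the "raw" collection of the "arcgis" database.
const { MongoClient } = require('mongodb');

const saveToDBBuffer = (client, dbName, collectionName, docs) =>
  client.db(dbName).collection(collectionName).insertMany(docs);

const downloadRaw = () =>
  new Promise((resolve, reject) => {
    MongoClient.connect(process.env.PRIMARY_MONGODB_URL)
      .then((client) =>
        downloadAllYears()                                        // mapRes: one response object per year
          .then((mapRes) => saveToDBBuffer(client, 'arcgis', 'raw', mapRes))
          .then(() => client.close()))
      .then(() => resolve('downloadRaw: complete'))               // success status for this pipeline step
      .catch((err) => reject(err));                               // pass the failure on to the pipeline
  });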

Using Promises, we can encapsulate asynchronous operations, like processing network requests, into intuitive, synchronous blocks of code that we can then sequence to build out a data transformation pipeline that can be efficiently logged and tested. On successful processing we resolve the Promise, and on unsuccessful processing we reject the Promise, passing on an error status for the pipeline step that failed. The value in setting up such a structure will be manifest when adding additional stages to the data processing pipeline.

What we’ve done here is modularize the asynchronous parts of the pipeline (e.g. constructing and exporting the wildfireStages Promise), allowing any stage to be imported from another script into an execution sequence (the exports array) in pipeline.js.
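A sketch of what pipeline.js could look like; here each stage is exported as a function that returns a Promise so nothing runs until the main script kicks it off (the actual repository may export the constructed Promises directly, and the module paths below are illustrative):

// pipeline.js (sketch): the execution sequence for the data pipeline.
const { wildfireStages } = require('./wildfire');
const { climateStages } = require('./climate');
const { combineStages } = require('./combine');

// Order matters: the climate and combine stages read state written by earlier stages.
module.exports = [wildfireStages, climateStages, combineStages];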

This array of Promises representing our pipeline can then be imported via require('./pipeline') and executed by our main script…

…which can now be run automatically whenever we’d like to reproduce a pipeline, e.g. whenever we need more data.
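Something along these lines, assuming the stages are promise-returning functions as sketched above:

// main script (sketch): run each pipeline stage in sequence, logging its status.
const stages = require('./pipeline');

stages
  .reduce(
    (chain, stage) => chain.then(() => stage()).then((status) => console.log(status)),
    Promise.resolve())
  .then(() => console.log('pipeline complete'))
  .catch((err) => console.error('pipeline failed:', err));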

Sweet!

As you can see in pipeline.js, we have two other Promises: climateStages and combineStages. These stages will make use of the updated state of our databases after processing the wildfireStages Promise, and thus are sequenced after in the exported array. They respectively download the climate data for each event and join the resulting datasets into a single training set.

The result of executing the wildfireStages sequence is a few new collections in the arcgis database containing useful snapshots of the raw dataset as we transform it into a more useful working set, the final one being the "training" collection. Now let’s get the climate data.

Climate Data

Similar to how we mapped years to fire event response objects with the USGS API, we are going to map attributes from fire event objects in the arcgis training collection to climate data response objects using the Dark Sky API, only now streaming the resultant response objects directly into the database instead of downloading to a buffer in memory. This is done to prevent data loss in case we get a lot of data back from the API and can’t fit it all in memory.

We kick-start the streaming process by creating a read stream of fire events from the training collection and piping each fire event into a transform stream that executes a Dark Sky API request for that event, which in turn pipes the resultant response object into a second transform stream that uploads the document to the climate database. The function getClimateDataEach starts this off by producing a transform stream object customized by the CLIMATE_CONFIG object. Inside this transform stream, we execute the getHistoricalClimateDataEach function, produce the resultant climate response object "res", and push it on to the next transform stream in the pipe. That next transform stream is customized by a database config object and is the result of saveToDBEach; inside it lies the code that executes the database upload.
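Here’s a rough sketch of that chain; the transform streams are simplified, the config shapes are illustrative, and getHistoricalClimateDataEach is sketched just below.

// Streaming leg (sketch): fire events -> Dark Sky request -> climate database.
const { Transform } = require('stream');

// One fire event in, one climate response object out.
const getClimateDataEach = (config) =>
  new Transform({
    objectMode: true,
    transform(event, _encoding, callback) {
      getHistoricalClimateDataEach(event, config)
        .then((res) => callback(null, res))   // push "res" to the next stream in the pipe
        .catch((err) => callback(err));
    },
  });

// One climate document in, one database upload out.
const saveToDBEach = (dbConfig) =>
  new Transform({
    objectMode: true,
    transform(doc, _encoding, callback) {
      dbConfig.collection.insertOne(doc)
        .then(() => callback(null, doc))
        .catch((err) => callback(err));
    },
  });

// Wiring (assuming open connections): a cursor stream over the arcgis
// "training" collection feeds the two transforms.
// arcgisClient.db('arcgis').collection('training').find().stream()
//   .pipe(getClimateDataEach(CLIMATE_CONFIG))
//   .pipe(saveToDBEach({ collection: climateClient.db('climate').collection('training') }));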

Within the getHistoricalClimateDataEach function, a time window is constructed around the ignition time of a wildfire event using an hourly or daily interval as specified in the CLIMATE_CONFIG object. The resultant start and end dates are then passed to the download function, which executes an API request for each time unit, starting from the start time and recursing toward the end time (the base case) while aggregating the results into a single result object.
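A sketch of that function, assuming a daily interval; the event field names (ignitionDate, latitude, longitude, Event) and the fetchDarkSky helper are hypothetical stand-ins for the project’s actual schema and request wrapper.

// getHistoricalClimateDataEach (sketch): build a window around the ignition
// time, then recursively request one time unit at a time, aggregating results.
const DAY_MS = 24 * 60 * 60 * 1000;

const getHistoricalClimateDataEach = (event, config) => {
  // e.g. config = { interval: 'daily', units: 14 } -> 14 days before and after ignition
  const ignition = new Date(event.ignitionDate);
  const start = new Date(ignition.getTime() - config.units * DAY_MS);
  const end = new Date(ignition.getTime() + config.units * DAY_MS);

  // Walk from the start time toward the end time (the base case), one request per unit.
  const download = (current, acc) => {
    if (current > end) return Promise.resolve(acc);
    return fetchDarkSky(event.latitude, event.longitude, current)   // hypothetical Time Machine wrapper
      .then((unit) => download(new Date(current.getTime() + DAY_MS), acc.concat(unit)));
  };

  return download(start, []).then((units) => ({ Event: event.Event, units }));
};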

Cool! Now we have climate data for each fire event from the arcgis training collection saved in the "training" collection in the climate database. We just need to combine both training datasets from the two databases we’ve created into one master dataset that’ll be used for model development.

Training Data

We can load each fire event from the arcgis training collection, load each corresponding climate response object linked by the "Event" attribute, then join the desired attributes and format the columns to fit a typical relational schema. Having a relational schema will allow the dataset to be easily interpretable and conveniently exportable into Python as we’ll soon see.
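As a sketch of that join-and-flatten step (the field names here are illustrative, not the repository’s exact schema):

// Combine (sketch): join one fire event with its climate document (matched on
// the "Event" attribute) and flatten the measurements into relational columns.
const buildTrainingRow = (fireEvent, climateDoc) => {
  const row = {
    Event: fireEvent.Event,
    latitude: fireEvent.latitude,
    longitude: fireEvent.longitude,
    burnArea: fireEvent.burnArea,               // the value we'll try to predict
  };
  // hourlyByOffset: hour offset relative to ignition (-336 ... 336) -> observed conditions
  Object.entries(climateDoc.hourlyByOffset).forEach(([offset, obs]) => {
    row[`temperature_${offset}`] = obs.temperature;
    row[`windSpeed_${offset}`] = obs.windSpeed;
    row[`humidity_${offset}`] = obs.humidity;
  });
  return row;
};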

A subset of the master dataset

For each fire event, the column names are labeled with the climate property and an index representing the hour relative to that fire’s ignition time, e.g. temperature_-336, temperature_0, and temperature_336 represent the temperature of the ignition site 336 hours prior to, exactly at, and 336 hours after the fire ignition time, respectively. Fourteen days of hourly measurements around the ignition time gives 336 data points before and 336 after ignition per climate property, plus one additional data point per climate property for the ignition hour itself.

Okay, we got the data. Now let’s see how we can use this dataset to build some regression models in part 2 of this series. (Coming soon!)

