Predicting spatial distributions of ecological species, leveraging Python's ever-strengthening machine learning stack.
Author: Daniel Furman | November 2020
Species Distribution Models (SDMs) are an important and widely used tool for ecologists, agricultural scientists, conservation biologists, and many other geospatial science enthusiasts. While dozens of tutorials leverage the traditional R stack for SDMs, with packages such as raster, implementations of SDMs in Python are, surprisingly, rather limited. To bridge this gap, we explore an SDM workflow powered by Python's machine learning capabilities. The methods employed barely scratch the surface of available techniques; hopefully this introduction can serve as a springboard for further exploration.
If you are completely new to SDMs, it may be prudent to start here. SDMs associate presence locations of a species with climate variables, giving you the power to predict species suitability across an entire landscape. First, environmental variables are sampled at the presence coordinates. Second, a statistical model (here, scikit-learn classifiers) defines a species-environment relationship. Third, the species-environment relationship is mapped across the study space, denoting a potential distribution of the species (referred to as interpolation). Projecting to future/past climates or to novel geographic areas is referred to as extrapolation. A typical workflow is as follows:
conceptualization -> data pre-processing -> model training/assessment -> interpolate/extrapolate -> iterate

We begin by creating directories to organize the workflow's inputs and outputs:
import os

# create directories for the data inputs and model outputs
os.mkdir("inputs")
os.mkdir("outputs")
We now install the additional dependencies we will need for our SDMs. The four primary libraries are geopandas, pyimpute, rasterio, and scikit-learn (with xgboost and lightgbm supplying two extra classifiers later on).
These can be installed at the terminal using pip install LIBRARY, but you may find it cleaner to create a conda virtual environment from requirements-py.txt (see the Git repo).
We first need to download a geodatabase (here we use a .shp file) denoting presence/absence coordinates, which can be directly loaded into Python as a GeoPandas GeoDataFrame (a tabular data structure for geometric data types). Here, the CLASS column is a binary indicator of the presence/absence of the species. For this tutorial, we are using Joshua trees (Yucca brevifolia) as the example species:
To follow along chunk by chunk, clone the Git repo and open Intro-to-SDMs-Py.ipynb in your working directory of choice.
import geopandas as gpd
import shutil
import glob

# grab the jtree data after cloning the Git repo
for f in sorted(glob.glob('data/jtree*')):
    shutil.copy(f, 'inputs/')
# or grab your data of choice and move it to 'inputs/'

pa = gpd.GeoDataFrame.from_file("inputs/jtree.shp")
pa.sample(5)  # GeoDataFrame for the species
We now check that there are no duplicate or NaN coordinates, and inspect the shapefile's attributes.
print("number of duplicates: ", pa.duplicated(subset='geometry', keep='first').sum())
print("number of NA's: ", pa['geometry'].isna().sum())
print("Coordinate reference system is: {}".format(pa.crs))
print("{} observations with {} columns".format(*pa.shape))
We can map the species presences (CLASS == 1).
pa[pa.CLASS == 1].plot(marker='*', color='green', markersize=8)
And we can map the background points (CLASS == 0).
pa[pa.CLASS == 0].plot(marker='+', color='black', markersize=4)
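To see both classes in a single figure, you can overlay the two plots on a shared matplotlib axis. A minimal sketch (the label and legend calls are optional styling, not part of the original workflow):

# plot presences and background points on one shared axis
ax = pa[pa.CLASS == 1].plot(marker='*', color='green', markersize=8, label='presence')
pa[pa.CLASS == 0].plot(ax=ax, marker='+', color='black', markersize=4, label='background')
ax.legend()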
However, if you don't have a geospatial database with presence/absence coordinates, there are some easy steps to create one for virtually any species of interest! You can start by searching the open-data Global Biodiversity Information Facility (GBIF), downloading a species database to .csv, and migrating to R to pipe the database to .shp (e.g., see data-pre-processing.R in the Git repo or the additional information section below).
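If you would rather stay in Python for that conversion, the same step can be sketched with pandas and GeoPandas. This is a hypothetical example: gbif_download.csv is a placeholder filename, and it assumes the standard GBIF column names decimalLongitude/decimalLatitude with WGS84 coordinates:

import pandas as pd
import geopandas as gpd

# hypothetical GBIF export; GBIF occurrence downloads are tab-separated
df = pd.read_csv("gbif_download.csv", sep="\t")

# build point geometries from GBIF's coordinate columns (WGS84)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.decimalLongitude, df.decimalLatitude),
    crs="EPSG:4326",
)
gdf.to_file("inputs/species.shp")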
In this section we will train our machine learning classifiers and make spatial predictions of the species distribution over current conditions (1970-2000).
First, we load 19 bioclimatic features (here we use 2.5 arc-minute resolution) from the publicly available WorldClim database (v. 2.1, Fick & Hijmans, 2017).
# grab climate features - cropped to the Joshua tree study area
for f in sorted(glob.glob('data/bioclim/bclim*.asc')):
    shutil.copy(f, 'inputs/')

raster_features = sorted(glob.glob('inputs/bclim*.asc'))
# check the number of features
print('\nThere are', len(raster_features), 'raster features.')
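As an optional sanity check (not part of the original workflow), you can open one layer with rasterio and confirm the grids share a resolution and extent; note that .asc files may report no CRS unless a .prj sidecar file is present:

import rasterio

# inspect the first bioclim layer's grid metadata
with rasterio.open(raster_features[0]) as src:
    print(src.crs, src.res, src.width, src.height)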
We are now ready to use pyimpute to generate the raster maps of suitability. We first prep the pyimpute workflow:
from pyimpute import load_training_vector
from pyimpute import load_targets
train_xs, train_y = load_training_vector(pa, raster_features, response_field='CLASS')
target_xs, raster_info = load_targets(raster_features)
train_xs.shape, train_y.shape  # check shapes - do they match the number of observations above?
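One caveat worth guarding against (my suggestion, not a step in the original tutorial): points that fall outside the raster extent can yield missing feature values, which scikit-learn will reject. A minimal filtering sketch, assuming train_xs and train_y behave as numpy arrays:

import numpy as np

# keep only rows whose sampled raster values are all finite
valid = np.isfinite(train_xs).all(axis=1)
train_xs, train_y = train_xs[valid], train_y[valid]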
Next, we implement several classifiers that follow the scikit-learn API:
# import machine learning classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

CLASS_MAP = {
    'rf': RandomForestClassifier(),
    'et': ExtraTreesClassifier(),
    'xgb': XGBClassifier(),
    'lgbm': LGBMClassifier(),
}
from pyimpute import impute
from sklearn import model_selection

# model fitting and spatial range prediction
for name, model in CLASS_MAP.items():
    # cross validation for accuracy scores (displayed as a percentage)
    k = 5  # k-fold
    kf = model_selection.KFold(n_splits=k)
    accuracy_scores = model_selection.cross_val_score(
        model, train_xs, train_y, cv=kf, scoring='accuracy')
    print(name + " %d-fold Cross Validation Accuracy: %0.2f (+/- %0.2f)"
          % (k, accuracy_scores.mean() * 100, accuracy_scores.std() * 200))

    # spatial prediction
    model.fit(train_xs, train_y)
    os.mkdir('outputs/' + name + '-images')
    impute(target_xs, model, raster_info, outdir='outputs/' + name + '-images',
           class_prob=True, certainty=True)
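Accuracy can be flattering when presences and background points are unbalanced. If you want a threshold-independent check (a suggested extra, not part of the original tutorial), the same cross-validation call accepts scoring='roc_auc':

# threshold-independent score (shown here for the last-fitted model in CLASS_MAP)
auc_scores = model_selection.cross_val_score(
    model, train_xs, train_y, cv=kf, scoring='roc_auc')
print(name + " %d-fold Cross Validation AUC: %0.2f" % (k, auc_scores.mean()))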
All done! We now have a responses.tif raster, which is the predicted class (0 or 1), and a probability_1.0.tif raster with a continuous suitability scale. Let's average the continuous output across the four models and plot our map.
import matplotlib.pyplot as plt

# define a spatial plotting utility
def plotit(x, title, cmap="Blues"):
    plt.imshow(x, cmap=cmap, interpolation='nearest')
    plt.colorbar()
    plt.title(title, fontweight='bold')
import rasterio

# read each model's probability-of-presence surface (band 1)
distr_rf = rasterio.open("outputs/rf-images/probability_1.0.tif").read(1)
distr_et = rasterio.open("outputs/et-images/probability_1.0.tif").read(1)
distr_xgb = rasterio.open("outputs/xgb-images/probability_1.0.tif").read(1)
distr_lgbm = rasterio.open("outputs/lgbm-images/probability_1.0.tif").read(1)
distr_averaged = (distr_rf + distr_et + distr_xgb + distr_lgbm) / 4
plotit(distr_averaged, "Joshua Tree Range, averaged", cmap="Greens")
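If you want to keep the ensemble map for later GIS work (a suggested extra step, not in the original tutorial), you can write it to disk by reusing the profile of one of the model rasters; averaged-probability.tif is a hypothetical output name:

# reuse one model's raster profile so the georeferencing carries over
with rasterio.open("outputs/rf-images/probability_1.0.tif") as src:
    profile = src.profile
with rasterio.open("outputs/averaged-probability.tif", "w", **profile) as dst:
    dst.write(distr_averaged.astype(profile['dtype']), 1)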
Lastly, let's zoom in to Joshua Tree National Park and inspect the suitability there.
plotit(distr_averaged[100:150, 100:150], "Joshua Tree National Park Suitability", cmap="Greens")