Skip to main content

Environmental predictability & Population trends: Linking environmental predictability to assemblage population trends of Afro-Palearctic migrants

Table of contents

  • Introduction
  • Data Preparation
  • 1 GEE: Harmonic Regression
  • 2 Preprocess bird trends and distribution maps
  • 3 Species assemblage trends
  • 4 Ecoregions & Biomes
  • 5 Environmental predictability & trends
  • Analysis: Global model
  • 6 Model assemblage trends
  • 7 Quantify model uncertainty
  • Analysis: Local model
  • 8 Model local assemblage trends
  • Figures
  • 9 Environmental variables
  • 10 Trends-Variability method
  • Other
  • 11 References

View book source

6 Model assemblage trends

In previous chapters we have prepared a dataset for modelling, which we can now start to use.

6.1 Correlation among predictors

We first assess correlations among predictors. As we also intend to compare models using environmental predictability with those of long-term trends we separate the correlation matrices for both.

6.1.1 Pearson’s correlation matrix

unused_predictors <- c("count_long", "count_short",
                       "def", "tmmx", "tmmn", "aet", "pet", "pr", "ndvi", "x", "y",
                       lut$ecoregions, lut$biomes, lut$realms)
predictors <- colnames(data)[!colnames(data) %in% unused_predictors]
predictors_trends <- c("trend_long", "trend_short", colnames(dplyr::select(data, ends_with("_trend"))))
predictors_resids <- c("trend_long", "trend_short", colnames(dplyr::select(data, ends_with("_resid"))))
predictors_all <- c(predictors_trends, predictors_resids)

ggcorrmat(data, cor.vars = all_of(predictors_trends), type = "parametric")
ggcorrmat(data, cor.vars = all_of(predictors_resids), type = "parametric")
ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric")

This indicates environmental residuals are generally correlated more strongly than long-term trends. Of the covariates, def, tmmx and tmmn are most strongly correlated with each other or other variables. We can reduce correlation for temperature measures by calculating the average temperature from both tmmx and tmmn and will remove def from our modeling dataset.

data %>%
  mutate(tm_resid = rowMeans(dplyr::select(., tmmn_resid, tmmx_resid)),
         tm_trend = rowMeans(dplyr::select(., tmmn_trend, tmmx_trend))) %>%
  ungroup() %>%
  identity() -> data
predictors_trends <- c(predictors_trends[!predictors_trends %in% c("tmmx_trend", "tmmn_trend", "def_trend")], "tm_trend")
predictors_resids <- c(predictors_resids[!predictors_resids %in% c("tmmx_resid", "tmmn_resid", "def_resid")], "tm_resid")
predictors_all <- c(predictors_trends, predictors_resids)

And we calculate a new correlation matrix.

ggcorrmat(data, cor.vars = all_of(predictors_trends), type = "parametric", maxtrix.type = "upper")
ggcorrmat(data, cor.vars = all_of(predictors_resids), type = "parametric", maxtrix.type = "upper")
ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric", maxtrix.type = "upper")

Let’s calculate how trends and variability of the same variable are correlated.

ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric", output = "dataframe") %>%
  rowwise() %>%
  mutate(var1 = str_split(parameter1, "_", simplify = TRUE)[[1]],
         var2 = str_split(parameter2, "_", simplify = TRUE)[[1]]) %>%
  ungroup() %>%
  filter(var1 == var2, var1 != "trend") -> corr_trend_var

corr_trend_var[, c("parameter1", "parameter2", "r", "ci_low", "ci_high", "p")]
parameter1 parameter2 r ci_low ci_high p
ndvi_trend ndvi_resid 0.2932629 0.2900672 0.2964521 0
pr_trend pr_resid 0.2207327 0.2174075 0.2240529 0
pet_trend pet_resid -0.1890657 -0.1924315 -0.1856955 0
aet_trend aet_resid 0.2944051 0.2912117 0.2975920 0
tm_trend tm_resid 0.2422414 0.2389507 0.2455266 0

So, clearly, these are only weakly positively correlated, with the exception of pet potential evapotranspiration. The mean Pearson’s r for the correlation between environmental trends and respective variability is only 0.1723153.

We continue to see strong correlation between pr_resid and ndvi_resid, but Pearson correlation coefficients are < 0.7, and as both original variables (ndvi and pr) are effectively uncorrelated in ndvi_trend and pr_trend and we want the trends and predictability model to be comparable, we keep both in.

6.2 Pairplot

While we’re at it, it’s useful to make a pairplot of all predictor variables.

set.seed(42)
data %>%
  slice_sample(n = 1000) -> data_tiny

ggpairs(data_tiny, columns = all_of(c(predictors_trends)), progress = FALSE)
ggpairs(data_tiny, columns = all_of(c(predictors_resids)), progress = FALSE)

6.3 Optimal dataset size

The current dataset contains 314868 rows of data, which is quite large for fast and efficient ML workflows. We will first determine how the performance of a few suitable learners depends on dataset size and then determine an optimal balance between model performance and dataset size. In other words, we determine at what point increasing the dataset size (the resampled fraction) does not improve the model performance substantially anymore. Rather than optimizing this in several steps, we will tune the hyperparameter subsample.frac by constructing a GraphLearner objects in an mlr3 pipeline which tunes dataset size. The following chunk implements this procedure. As it may take very long to run, it’s more convenient to run it only when strictly necessary from the R/hyperparm_subsample_frac.R file. As we can use this procedure to test suitable ML learners, we use mboost::gamboost, gbm::gbm and xgboost::xgboost learners in our GraphLearner.

task <- TaskRegrST$new(id = "data", backend = data, target = "trend_long", extra_args = list(coordinate_names = c("x", "y")))
task$select(all_of(predictors))

learners <- list(
  lrn("regr.gamboost", id = "gamboost"),
  lrn("regr.gbm", id = "gbm"),
  lrn("regr.xgboost", id = "xgboost")
)
learners_ids <- sapply(learners, function(x) x$id)

gr_subsample <- po("subsample") %>>%
  po("branch", options = learners_ids) %>>%
  gunion(lapply(learners, po)) %>>%
  po("unbranch")

grlrn_subsample <- GraphLearner$new(gr_subsample)

min_nrows <- 100
max_nrows <- 50000

min_nrows_s <- log10(min_nrows) / log10(2)
max_nrow_s <- log10(max_nrows) / log10(2)

ps_subsample <- ParamSet$new(list(
  ParamDbl$new("subsample.frac", lower = min_nrows_s, upper = max_nrow_s),
  ParamFct$new("branch.selection", levels = learners_ids)
))

# Distribution of subsample.frac in grid can be visualised as follows
# x <- seq(from = min_nrows_s, to = max_nrow_s, length.out = 20)
# y <- 2^x
# plot(x, y)
# Beware: division by nrow(data) has to be hard-coded in the function below, somehow...

ps_subsample$trafo <- function(x, param_set) {
  x$subsample.frac <- (2^x$subsample.frac) / 314868
  return(x)
}

resample_inner <- rsmp("cv", folds = 5)
resample_outer <- rsmp("cv", folds = 5)
measure <- msr("regr.rmse")
terminator <- trm("none")
tuner <- tnr("grid_search", resolution = 20)

at <- AutoTuner$new(
  learner = grlrn_subsample,
  resampling = resample_inner,
  measure = measure,
  search_space = ps_subsample,
  terminator = terminator,
  tuner = tuner,
  store_tuning_instance = TRUE,
  store_benchmark_result = TRUE
)

future::plan("multisession", workers = 3)
rr <- resample(task = task, learner = at, resampling = resample_outer, store_models = TRUE)

saveRDS(as.data.table(rr), file = "data/processed/rr_dataset_size.RDS")

rr_data <- bind_rows(lapply(rr$learners, function(x) x$model$tuning_instance$archive$data()))
saveRDS(rr_data, file = "data/processed/rr_dataset_size_df.RDS")

We can now plot the results of the previous workflow and see how cross-validated model performance depends on the subsampled size of the training dataset.

if (!exists("rr_data")) rr_data <- readRDS("data/processed/rr_dataset_size_df.RDS")

rr_data %>%
  group_by(branch.selection, subsample.frac) %>%
  summarise(mean_rmse = mean(regr.rmse),
            q05 = quantile(regr.rmse, 0.05),
            q95 = quantile(regr.rmse, 0.95),
            .groups = "drop_last") %>%
  ungroup() %>%
  mutate(nrows = 2^subsample.frac) %>%
  rename(Learner = branch.selection) %>%
  ggplot(aes(color = Learner, fill = Learner)) +
  geom_path(aes(x = nrows, y = mean_rmse)) +
  geom_ribbon(aes(x = nrows, ymin = q05, ymax = q95), alpha = 0.5) +
  scale_x_continuous(trans = "log10", breaks = c(100, 500, 1000, 5000, 10000, 50000)) +
  labs(x = "Subsample size (# rows)", y = "Model RMSE")

It is clear beyond 10,000 rows performance only continues to improve marginally for untuned xgboost learners and model improvement for the other learners has clearly hit a plateau. So, we can stick to resampling 10,000 rows from the 314868 rows in the original dataframe for our continued statistical modelling.

nrow_subsample <- 10000

6.4 Modeling assemblage trends

We can now finally start to model long-term assemblage trends using a 10^{4}-row subsample of the original dataset. We use a gamboost model as it strikes a balance between predictive accuracy and interpretability, without introducing a lot of ‘jaggedness’ of boosted regression trees (e.g. from gbm and xgboost) that hamper interpretability to those unfamiliar with the data.

6.4.1 Data filtering

Exploratory modeling showed that model errors are largest in the Sahara. As only few species included in our assemblage trends inhabit this area, the assemblage trend is also least reliable (i.e. it is composed of population trends of few species). It thus makes sense to exclude these areas. We set a minimum number of species that contribute to an assemblage trend to >5 and limit our analysis strictly to the areas south of the Sahara, which is the Afrotropic realm in our ecoregion dataset. To illustrate this, let’s plot the number of species that are included in the assemblage trends.

set.seed(42)
data %>%
  slice_sample(n = nrow_subsample) %>% 
  mutate(included = if_else((count_long > 5 & Afrotropic == 1), 1, 0, 0)) %>%
  as.data.frame() -> data_subset

xlim <- c(-20, 60)
ylim <- c(-40, 40)
clim <- c(min(data_subset$count_long), max(data_subset$count_long))

ggplot(data_subset, aes(x = x, y = y, color = count_long)) +
  geom_point() +
  scale_color_viridis_c(limits = clim) +
  labs(subtitle = "Original data",
       x = "Longitude", y = "Latitude", color = "# species") +
  theme(legend.position = "bottom") +
  coord_fixed(xlim = xlim, ylim = ylim) -> original

data_subset %>%
  filter(included == 1) %>%
  ggplot(aes(x = x, y = y, color = count_long)) +
  geom_point() +
  scale_color_viridis_c(limits = clim) +
  labs(subtitle = "Filtered data",
       x = "Longitude", y = "Latitude", color = "# species") +
  theme(legend.position = "bottom") +
  coord_fixed(xlim = xlim, ylim = ylim) -> filtered

original + filtered + plot_annotation(title = "Species in long-term assemblage trend") +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")

6.4.2 Training boosted GAMs

The training of boosted GAMs (implemented in mboost) will go as follows:

  1. We train models predicting long term assemblage trends using long-term (linear) trends in environmental predictors and environmental predictability (model residuals from harmonic regression).
  2. We then determine the optimal number of boosting iterations using 10-fold cross-validation, as implemented in the cvrisk function, varying the number of boosts between 250 and 30000.
  3. We calculate the autocovariate of the residuals to correct for (strong) effects of spatial autocorrelation,4 which violates the independence assumption of regression.
  4. We retrain the model but now whilst including the residual autocovariate as a parameter.
  5. We, once again, determine the optimal number of boosting iterations on the new models. Most likely this does not differ substantially from the previous optimum, but it is good to make sure that is the case.

Let’s start with training the models.

set.seed(42)
data %>%
  filter(count_long > 5,
         Afrotropic == 1) %>%
  slice_sample(n = nrow_subsample) %>%
  as.data.frame() -> data_subset

saveRDS(data_subset, file = "data/processed/data_subset.RDS")

model_formula <- function(predictors, response, interactions = FALSE) {
  formula <- paste0(response, " ~ ", paste0("bbs(", predictors, ")", collapse = " + "))
  
  if (interactions) {
    envir_vars <- unlist(unique(lapply(str_split(predictors, "_"), function(x) x[[1]])))
    interactions <- lapply(envir_vars, function(x) {
      paste0("bspatial(", predictors[str_starts(predictors, paste0(x, "_"))][1], ", ", 
             predictors[str_starts(predictors, paste0(x, "_"))][2], ")")
    })
    formula <- paste0(formula, " + ", paste0(interactions, collapse = " + "))
  }
  
  formula <- as.formula(formula)
  return(formula)
}

formula_trends <- model_formula(predictors_trends[!predictors_trends %in% c("trend_long", "trend_short")], "trend_long")
formula_resids <- model_formula(predictors_resids[!predictors_resids %in% c("trend_long", "trend_short")], "trend_long")
formula_all <- model_formula(predictors_all[!predictors_all %in% c("trend_long", "trend_short")], "trend_long", 
                             interactions = TRUE)

ctrl <- boost_control(trace = FALSE, mstop = 10000)
mod_trends <- gamboost(formula_trends, data = data_subset, family = Gaussian(), control = ctrl)
mod_resids <- gamboost(formula_resids, data = data_subset, family = Gaussian(), control = ctrl)
mod_all <- gamboost(formula_all, data = data_subset, family = Gaussian(), control = ctrl)

saveRDS(mod_trends, file = "data/processed/models/mod_trends.RDS")
saveRDS(mod_resids, file = "data/processed/models/mod_resids.RDS")
saveRDS(mod_all, file = "data/processed/models/mod_all.RDS")

6.4.2.1 Determine the optimal number of boosting iterations

Having trained the models, we can now determine the optimal number of boosting iterations. As this is a time-consuming process, the chunk below does not run by default and it’s probably best to run this cross-validation procedure separately through the dedicated R/hyperparm_mboost_mstop.R script.

grid <- seq(from = 250, to = 30000, by = 250)
cv10f_trends <- cv(model.weights(mod_trends), type = "kfold")
saveRDS(cv10f_trends, file = "data/processed/models/cv10f_trends.RDS")
cvm_trends <- cvrisk(mod_trends, folds = cv10f_trends, grid = grid)
saveRDS(cvm_trends, file = "data/processed/models/cvm_trends.RDS")

cv10f_resids <- cv(model.weights(mod_resids), type = "kfold")
saveRDS(cv10f_resids, file = "data/processed/models/cv10f_resids.RDS")
cvm_resids <- cvrisk(mod_resids, folds = cv10f_resids, grid = grid)
saveRDS(cvm_resids, file = "data/processed/models/cvm_resids.RDS")

cv10f_all <- cv(model.weights(mod_all), type = "kfold")
saveRDS(cv10f_all, file = "data/processed/models/cv10f_all.RDS")
cvm_all <- cvrisk(mod_all, folds = cv10f_all, grid = grid)
saveRDS(cvm_all, file = "data/processed/models/cvm_all.RDS")

We can load the results and plot the outputs.

cvm_trends <- readRDS("data/processed/models/cvm_trends.RDS")
cvm_resids <- readRDS("data/processed/models/cvm_resids.RDS")
cvm_all <- readRDS("data/processed/models/cvm_all.RDS")

plot_cvm <- function(cvm, plot_id = "") {
  m <- cvm
  class(m) <- NULL
  as.data.frame(m) %>%
    rownames_to_column(var = "fold") %>%
    pivot_longer(-c(fold), names_to = "boosts", values_to = "risk") %>%
    mutate(boosts = as.numeric(boosts),
           fold = as.factor(fold)) %>%
    identity() %>%
    ggplot(aes(x = boosts, y = risk, group = fold, color = fold)) +
      geom_line() +
      labs(title = paste0(attr(cvm, "type"), ": ", plot_id), x = "boosts (mstop)", y = attr(cvm, "risk"))
}
plot_cvm_trends <- plot_cvm(cvm_trends, "trends")
plot_cvm_resids <- plot_cvm(cvm_resids, "resids")
plot_cvm_all <- plot_cvm(cvm_all, "all")

plot_cvm_trends + plot_cvm_resids + plot_cvm_all + plot_layout(guides = "collect") & theme(legend.position = "bottom")

We can see that most cross-validated models seem to show stagnating model improvement beyond 10000 boosts, while an optimum is mostly not reached before 30000 boosts. Given the stagnating improvement, limited compute time and no need to optimize predictive performance ad infinitum, we limit the number of boosts to 10000.

6.4.2.2 Correct for spatial autocorrelation

A quick plot of model residuals will show there is strong spatial autocorrelation in the residuals, violating the independence assumption of regression modelling.

if (!exists("mod_trends")) mod_trends <- readRDS("data/processed/models/mod_trends.RDS")
if (!exists("mod_resids")) mod_resids <- readRDS("data/processed/models/mod_resids.RDS")
if (!exists("mod_all")) mod_all <- readRDS("data/processed/models/mod_all.RDS")
data_subset$norac_trends <- resid(mod_trends)
data_subset$norac_resids <- resid(mod_resids)
data_subset$norac_all <- resid(mod_all)

data_subset %>%
  dplyr::select(x, y, norac_trends, norac_resids, norac_all) %>%
  rename(c(Predictability = norac_resids, Trends = norac_trends, `All predictors` = norac_all)) %>%
  pivot_longer(-c(x, y), names_to = "model", values_to = "residuals") %>%
  ggplot(aes(x = x, y = y, color = residuals)) +
  geom_point() +
  scale_color_gradient2() +
  coord_fixed() +
  facet_wrap(vars(model)) +
  labs(title = "Spatial pattern of model residuals", x = "Longitude", y = "Latitude", color = "Residual")

There’s no need to even quantify spatial autocorrelation (SAC) in residuals here, as it’s so obvious. We will correct for it using the residual autocovariate method introduced by Crase et al. 4. We calculate the autocovariate of the residuals using a neighborhood radius of 100 kilometers. This radius could be optimized to reflect actual SAC distances in our dataset, but for now it seems sufficient to dramatically reduce residual SAC.

neighborhood_dist <- 100  # Kilometers if longlat = TRUE
acov_trends <- autocov_dist(resid(mod_trends), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE, 
                            nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_trends), as.matrix(cbind(data_subset$x, : With
## value 100 some points have no neighbours
## Warning in nb2listw(nb, glist = gl, style = style, zero.policy = zero.policy):
## zero sum general weights
acov_resids <- autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE,
                            nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, : With
## value 100 some points have no neighbours

## Warning in autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, : zero
## sum general weights
acov_all <- autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE,
                         nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x,
## data_subset$y)), : With value 100 some points have no neighbours

## Warning in autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x,
## data_subset$y)), : zero sum general weights

The chunk above may throw some warnings if some points are not within 100 kilometers from other points, but that will most certainly not be many points in our dataset, so we can accept that some of these autocovariates are not as accurate as others.

Now we will add the calculated autocovariates of residuals to our earlier dataset and retrain the models.

data_subset$autocov_trends <- acov_trends
data_subset$autocov_resids <- acov_resids
data_subset$autocov_all <- acov_all

formula_rac <- function(formula, var) {
  as.formula(paste0(paste(formula[2], formula[3], sep = " ~ "), " + bbs(", var, ")"))
}

formula_trends <- formula_rac(formula_trends, "autocov_trends")
formula_resids <- formula_rac(formula_resids, "autocov_resids")
formula_all <- formula_rac(formula_all, "autocov_all")

ctrl <- boost_control(trace = FALSE, mstop = 10000)
mod_trends_rac <- gamboost(formula_trends, data = data_subset, family = Gaussian(), control = ctrl)
mod_resids_rac <- gamboost(formula_resids, data = data_subset, family = Gaussian(), control = ctrl)
mod_all_rac <- gamboost(formula_all, data = data_subset, family = Gaussian(), control = ctrl)
saveRDS(mod_trends_rac, file = "data/processed/models/mod_trends_rac.RDS")
saveRDS(mod_resids_rac, file = "data/processed/models/mod_resids_rac.RDS")
saveRDS(mod_all_rac, file = "data/processed/models/mod_all_rac.RDS")

And now let’s reassess residual autocorrelation from our retrained model. If all went well, that should be visibly and substantially less than before.

if (!exists("mod_trends_rac")) mod_trends_rac <- readRDS("data/processed/models/mod_trends_rac.RDS")
if (!exists("mod_resids_rac")) mod_resids_rac <- readRDS("data/processed/models/mod_resids_rac.RDS")
if (!exists("mod_all_rac")) mod_all_rac <- readRDS("data/processed/models/mod_all_rac.RDS")
data_subset$rac_trends <- resid(mod_trends_rac)
data_subset$rac_resids <- resid(mod_resids_rac)
data_subset$rac_all <- resid(mod_all_rac)

data_subset %>%
  dplyr::select(x, y, rac_trends, rac_resids, rac_all) %>%
  rename(c(Predictability = rac_resids, Trends = rac_trends, `All predictors` = rac_all)) %>%
  pivot_longer(-c(x, y), names_to = "model", values_to = "residuals") %>%
  ggplot(aes(x = x, y = y, color = residuals)) +
  geom_point() +
  scale_color_gradient2() +
  coord_fixed() +
  facet_wrap(vars(model)) +
  labs(title = "Spatial pattern of model residuals", subtitle = "after including residual autocovariates",
       x = "Longitude", y = "Latitude", color = "Residual")

That is much better!

6.4.2.3 Determine the optimal number of boosting iterations of RAC model

Once again, we should make sure the number of boosts used (10000) is ‘optimal’ in our newly trained model. As before, it’s probably best to run this using the dedicated file in R/hyperparm_mboost_mstop.R, but the chunk below can do it as well.

grid <- seq(from = 250, to = 30000, by = 250)
cv10f_trends_rac <- cv(model.weights(mod_trends_rac), type = "kfold")
saveRDS(cv10f_trends_rac, file = "data/processed/models/cv10f_trends_rac.RDS")
cvm_trends_rac <- cvrisk(mod_trends_rac, folds = cv10f_trends_rac, grid = grid)
saveRDS(cvm_trends_rac, file = "data/processed/models/cvm_trends_rac.RDS")

cv10f_resids_rac <- cv(model.weights(mod_resids_rac), type = "kfold")
saveRDS(cv10f_resids_rac, file = "data/processed/models/cv10f_resids_rac.RDS")
cvm_resids_rac <- cvrisk(mod_resids_rac, folds = cv10f_resids_rac, grid = grid)
saveRDS(cvm_resids_rac, file = "data/processed/models/cvm_resids_rac.RDS")

cv10f_all_rac <- cv(model.weights(mod_all_rac), type = "kfold")
saveRDS(cv10f_all_rac, file = "data/processed/models/cv10f_all_rac.RDS")
cvm_all_rac <- cvrisk(mod_all_rac, folds = cv10f_all_rac, grid = grid)
saveRDS(cvm_all_rac, file = "data/processed/models/cvm_all_rac.RDS")

We can load these results as well and plot the outputs.

cvm_trends_rac <- readRDS("data/processed/models/cvm_trends_rac.RDS")
cvm_resids_rac <- readRDS("data/processed/models/cvm_resids_rac.RDS")
cvm_all_rac <- readRDS("data/processed/models/cvm_all_rac.RDS")

plot_cvm <- function(cvm, plot_id = "") {
  m <- cvm
  class(m) <- NULL
  as.data.frame(m) %>%
    rownames_to_column(var = "fold") %>%
    pivot_longer(-c(fold), names_to = "boosts", values_to = "risk") %>%
    mutate(boosts = as.numeric(boosts),
           fold = as.factor(fold)) %>%
    identity() %>%
    ggplot(aes(x = boosts, y = risk, group = fold, color = fold)) +
      geom_line() +
      labs(title = paste0(attr(cvm, "type"), ": ", plot_id), x = "boosts (mstop)", y = attr(cvm, "risk"))
}
plot_cvm_trends_rac <- plot_cvm(cvm_trends_rac, "trends")
plot_cvm_resids_rac <- plot_cvm(cvm_resids_rac, "resids")
plot_cvm_all_rac <- plot_cvm(cvm_all_rac, "all")

plot_cvm_trends_rac + plot_cvm_resids_rac + plot_cvm_all_rac + 
  plot_layout(guides = "collect") & theme(legend.position = "bottom")

Although model improvement did not stagnate as fast in this model, stopping boosting iterations at 10000 boosts again continues to remain reasonable. We could increase it to 20000, but that would (probably) at least double the time required for bootstrapping to quantify model uncertainty in further steps. As predictive performance is not the ultimate goal of this analysis, that trade-off is fine.

6.4.2.4 Quantifying residual autocorrelation

Now is a good moment to quantify residual autocorrelation (RSAC) in our model. We do so by calculating Moran’s I across a number of distance classes (8) using the sp.correlogram function from the spdep package. As this is computationally very intensive, we run this calculation once in the chunk below and save the results for fast plotting during knitting of this document.

nclass <- 8  # Number of distance classes
neighbors <- dnearneigh(as.matrix(cbind(data_subset$x, data_subset$y)), d1 = 0, d2 = neighborhood_dist, longlat = TRUE)

spc_norac_trends <- sp.correlogram(neighbors, data_subset$norac_trends, order = nclass, method = "I", zero.policy = TRUE)
spc_norac_resids <- sp.correlogram(neighbors, data_subset$norac_resids, order = nclass, method = "I", zero.policy = TRUE)
spc_norac_all <- sp.correlogram(neighbors, data_subset$norac_all, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_trends <- sp.correlogram(neighbors, data_subset$rac_trends, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_resids <- sp.correlogram(neighbors, data_subset$rac_resids, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_all <- sp.correlogram(neighbors, data_subset$rac_all, order = nclass, method = "I", zero.policy = TRUE)

saveRDS(spc_norac_trends, file = "data/processed/models/spc_norac_trends.RDS")
saveRDS(spc_norac_resids, file = "data/processed/models/spc_norac_resids.RDS")
saveRDS(spc_norac_all, file = "data/processed/models/spc_norac_all.RDS")
saveRDS(spc_rac_trends, file = "data/processed/models/spc_rac_trends.RDS")
saveRDS(spc_rac_resids, file = "data/processed/models/spc_rac_resids.RDS")
saveRDS(spc_rac_all, file = "data/processed/models/spc_rac_all.RDS")

And we can now plot the RSAC correlograms.

if (!exists("spc_norac_trends")) spc_norac_trends <- readRDS("data/processed/models/spc_norac_trends.RDS")
if (!exists("spc_norac_resids")) spc_norac_resids <- readRDS("data/processed/models/spc_norac_resids.RDS")
if (!exists("spc_norac_all")) spc_norac_all <- readRDS("data/processed/models/spc_norac_all.RDS")
if (!exists("spc_rac_trends")) spc_rac_trends <- readRDS("data/processed/models/spc_rac_trends.RDS")
if (!exists("spc_rac_resids")) spc_rac_resids <- readRDS("data/processed/models/spc_rac_resids.RDS")
if (!exists("spc_rac_all")) spc_rac_all <- readRDS("data/processed/models/spc_rac_all.RDS")

par(mfrow = c(2, 3))
plot(spc_norac_trends, main = "RSAC Trends mdl")
plot(spc_norac_resids, main = "RSAC Predictability mdl")
plot(spc_norac_all, main = "RSAC All preds mdl")
plot(spc_rac_trends, main = "RSAC Trends RAC mdl")
plot(spc_rac_resids, main = "RSAC Predictability RAC mdl")
plot(spc_rac_all, main = "RSAC All preds RAC mdl")

Adding the autocovariate of residuals seems to have helped a lot to counter residual spatial autocorrelation. It’s possible this can be improved further by tweaking the radius of the neighborhood over which the autocovariate is calculated, but that’s some tweaking we can do later.

6.4.3 Model outputs

6.4.3.1 Model performance

We use good ol’ RMSE, % of deviance explained and Moran’s I to compare the performance of all these models.

perf_metrics <- function(actual, models) {
  rmse <- lapply(models, function(x) rmse(actual, predict(x)))
  deviance_explained <- lapply(models, function(x) (1 - (tail(risk(x), 1) / head(risk(x), 1))) * 100)
  m <- rbind(rmse, deviance_explained)
  mode(m) <- "numeric"
  m
}

residual_columns <- c("norac_trends", "norac_resids", "norac_all", "rac_trends", "rac_resids", "rac_all")
if (!exists("neighbors")) {
  neighbors <- dnearneigh(as.matrix(cbind(data_subset$x, data_subset$y)), d1 = 0, d2 = neighborhood_dist, longlat = TRUE)
}
sar_metrics <- lapply(residual_columns, function(x) {
  mi <- moran.test(data_subset[, x], nb2listw(neighbors, zero.policy = TRUE), zero.policy = TRUE)
  c(round(mi$estimate[1], 2), mi$p.value)
})
sar_metrics <- data.frame(matrix(unlist(sar_metrics), ncol = length(sar_metrics)))

pm <- as.data.frame(perf_metrics(data_subset$trend_long, 
                                 list(mod_trends, mod_resids, mod_all, mod_trends_rac, mod_resids_rac, mod_all_rac)))
colnames(sar_metrics) <- colnames(pm)
pm <- rbind(pm, sar_metrics)
colnames(pm) <- c("mod_trends", "mod_resids", "mod_all", "mod_trends_rac", "mod_resids_rac", "mod_all_rac")
rownames(pm) <- c("RMSE", "% Deviance explained", "Moran's I", "Moran's I test p-val")
pm <- as.data.frame(t(pm))
pm
RMSE % Deviance explained Moran’s I Moran’s I test p-val
mod_trends 16.820661 28.84072 0.83 0
mod_resids 13.346926 55.19692 0.73 0
mod_all 11.230939 68.27676 0.64 0
mod_trends_rac 6.834131 88.25341 0.24 0
mod_resids_rac 6.227944 90.24483 0.18 0
mod_all_rac 5.869456 91.33555 0.13 0

6.4.3.2 Variable importance

We can quantify the importance of variables within our model using the varimp function of the mboost package. This returns variable importance, a measure of the total improvement to model deviance a variable is responsible for. This is weighted by the number of times a predictor is selected in the boosted models.

6.4.3.2.1 Trends model
plot(varimp(mod_trends_rac), main = "Varimp Trends RAC mdl")
as.data.frame(varimp(mod_trends_rac)) %>% 
  arrange(desc(reduction)) %>%
  mutate(imp = reduction / sum(reduction) * 100,
         imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
         imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
         imp = round(imp, 2))
reduction blearner variable selfreq imp imp_biol
269.422461 bbs(autocov_trends) autocov_trends 0.2410 76.78 0.00
46.902638 bbs(tm_trend) tm_trend 0.2565 13.37 57.56
21.384352 bbs(pr_trend) pr_trend 0.1996 6.09 26.24
4.796977 bbs(aet_trend) aet_trend 0.1274 1.37 5.89
4.594185 bbs(pet_trend) pet_trend 0.1399 1.31 5.64
3.801587 bbs(ndvi_trend) ndvi_trend 0.0356 1.08 4.67
6.4.3.2.2 Predictability model
plot(varimp(mod_resids_rac), main = "Varimp Predictability RAC mdl")
as.data.frame(varimp(mod_resids_rac)) %>% 
  arrange(desc(reduction)) %>%
  mutate(imp = reduction / sum(reduction) * 100,
         imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
         imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
         imp = round(imp, 2))
reduction blearner variable selfreq imp imp_biol
177.3754837 bbs(autocov_resids) autocov_resids 0.1691 49.43 0.00
94.7949396 bbs(ndvi_resid) ndvi_resid 0.5856 26.42 52.24
57.5502643 bbs(tm_resid) tm_resid 0.0712 16.04 31.72
17.5826413 bbs(pet_resid) pet_resid 0.0631 4.90 9.69
10.6466394 bbs(aet_resid) aet_resid 0.0616 2.97 5.87
0.8702888 bbs(pr_resid) pr_resid 0.0494 0.24 0.48
6.4.3.2.3 All predictor models
plot(varimp(mod_all_rac), main = "Varimp All pred RAC mdl")
as.data.frame(varimp(mod_all_rac)) %>% 
  arrange(desc(reduction)) %>%
  mutate(imp = reduction / sum(reduction) * 100,
         imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
         imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
         imp = round(imp, 2)) %>%
  head(20)
reduction blearner variable selfreq imp imp_biol
139.9040016 bbs(autocov_all) autocov_all 0.0352 38.52 0.00
127.0774984 bspatial(tm_trend, tm_resid) tm_resid, tm_trend 0.2324 34.99 56.92
56.2974251 bbs(ndvi_resid) ndvi_resid 0.1593 15.50 25.22
18.0412086 bspatial(aet_trend, aet_resid) aet_resid, aet_trend 0.2036 4.97 8.08
12.2534894 bspatial(pet_trend, pet_resid) pet_resid, pet_trend 0.1641 3.37 5.49
4.1345608 bspatial(ndvi_trend, ndvi_resid) ndvi_resid, ndvi_trend 0.0977 1.14 1.85
2.6364365 bbs(pr_trend) pr_trend 0.0058 0.73 1.18
1.1339132 bbs(pet_resid) pet_resid 0.0082 0.31 0.51
0.8446243 bbs(aet_resid) aet_resid 0.0073 0.23 0.38
0.4523058 bspatial(pr_trend, pr_resid) pr_resid, pr_trend 0.0610 0.12 0.20
0.1833741 bbs(pet_trend) pet_trend 0.0095 0.05 0.08
0.0976552 bbs(tm_trend) tm_trend 0.0123 0.03 0.04
0.0790772 bbs(aet_trend) aet_trend 0.0017 0.02 0.04
0.0196674 bbs(tm_resid) tm_resid 0.0015 0.01 0.01
0.0017870 bbs(pr_resid) pr_resid 0.0004 0.00 0.00
0.0000000 bbs(ndvi_trend) ndvi_trend 0.0000 0.00 0.00

6.4.3.3 Partial/Marginal Effects

And, finally, we can have a look at the modelled effects of predictors on long-term assemblage trends. We first compare the output of the original models.

Because the full model included interaction terms, and by default the partial dependence plots are calculated by evaluating the estimated value over every unique value of the two involved variables, this can get extremely computationally intensive. So instead, we use a tweaked version of the mboost.plot function, which lets use set a gridsize, the number of values over which the partial dependence is evaluated on a quantile grid. See the R/mboost_plot_grid.R file for details.

source('~/envirpred/R/mboost_plot_grid.R')
6.4.3.3.1 Partial effects RAC models
par(mfrow = c(2, 3))
plot(mod_trends_rac)
rm(mod_trends_rac)

par(mfrow = c(2, 3))
plot(mod_resids_rac)
rm(mod_resids_rac)
plot_all <- function(model, gridsize) {
  preds <- names(model.frame(model))
  single <- which(str_starts(preds, "bbs"))
  interx <- which(str_starts(preds, "bspatial"))
  
  for (id in single) {
    w <- which(names(variable.names(model)) == preds[id])
    plot(model, which = w)
  }
  
  for (id in interx) {
    w <- which(names(variable.names(model)) == preds[id])
    print(plot.mboost_grid(model, which = w, gridsize = gridsize))
  }  
}
plot_all(mod_all_rac, 200)

And because we focus specifically on temperature, let’s prettify that plot a bit.

preds <- names(model.frame(mod_all_rac))
  single <- which(str_starts(preds, "bbs"))
  interx <- which(str_starts(preds, "bspatial"))
w <- which(names(variable.names(mod_all_rac)) == preds[14])
pal <- rev(RColorBrewer::brewer.pal('RdBu', n=11))
theme <- rasterTheme(region = pal)
plot.mboost_grid(mod_all_rac, which = w, gridsize = 200, at = seq(from = -40, to = 40, by = 2.5),
                 xlab = expression("Temp. trend [0.1"~degree*C~y^-1*"]"), 
                 ylab = expression("Temp. variability [0.1"~degree*C*"]"), 
                 par.settings = theme)
rm(mod_all_rac)
5 Environmental predictability & trends
7 Quantify model uncertainty

On this page

  • 6 Model assemblage trends
  • 6.1 Correlation among predictors
    • 6.1.1 Pearson’s correlation matrix
  • 6.2 Pairplot
  • 6.3 Optimal dataset size
  • 6.4 Modeling assemblage trends
    • 6.4.1 Data filtering
    • 6.4.2 Training boosted GAMs
    • 6.4.3 Model outputs
  • View source
  • Edit this page

"Environmental predictability & Population trends: Linking environmental predictability to assemblage population trends of Afro-Palearctic migrants" was written by Bart Hoekstra.

This book was built by the bookdown R package.