6 Model assemblage trends
In previous chapters we have prepared a dataset for modelling, which we can now start to use.
6.1 Correlation among predictors
We first assess correlations among predictors. As we also intend to compare models using environmental predictability with models using long-term trends, we compute separate correlation matrices for both.
6.1.1 Pearson’s correlation matrix
unused_predictors <- c("count_long", "count_short",
"def", "tmmx", "tmmn", "aet", "pet", "pr", "ndvi", "x", "y",
lut$ecoregions, lut$biomes, lut$realms)
predictors <- colnames(data)[!colnames(data) %in% unused_predictors]
predictors_trends <- c("trend_long", "trend_short", colnames(dplyr::select(data, ends_with("_trend"))))
predictors_resids <- c("trend_long", "trend_short", colnames(dplyr::select(data, ends_with("_resid"))))
predictors_all <- c(predictors_trends, predictors_resids)
ggcorrmat(data, cor.vars = all_of(predictors_trends), type = "parametric")
ggcorrmat(data, cor.vars = all_of(predictors_resids), type = "parametric")
ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric")
This indicates environmental residuals are generally more strongly correlated than long-term trends. Of the covariates, `def`, `tmmx` and `tmmn` are most strongly correlated with each other or with other variables. We can reduce correlation among the temperature measures by calculating the average temperature from `tmmx` and `tmmn`, and we will remove `def` from our modeling dataset.
data %>%
mutate(tm_resid = rowMeans(dplyr::select(., tmmn_resid, tmmx_resid)),
tm_trend = rowMeans(dplyr::select(., tmmn_trend, tmmx_trend))) %>%
ungroup() %>%
identity() -> data
predictors_trends <- c(predictors_trends[!predictors_trends %in% c("tmmx_trend", "tmmn_trend", "def_trend")], "tm_trend")
predictors_resids <- c(predictors_resids[!predictors_resids %in% c("tmmx_resid", "tmmn_resid", "def_resid")], "tm_resid")
predictors_all <- c(predictors_trends, predictors_resids)
And we calculate a new correlation matrix.
ggcorrmat(data, cor.vars = all_of(predictors_trends), type = "parametric", matrix.type = "upper")
ggcorrmat(data, cor.vars = all_of(predictors_resids), type = "parametric", matrix.type = "upper")
ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric", matrix.type = "upper")
Let’s calculate how trends and variability of the same variable are correlated.
ggcorrmat(data, cor.vars = all_of(predictors_all), type = "parametric", output = "dataframe") %>%
rowwise() %>%
mutate(var1 = str_split(parameter1, "_", simplify = TRUE)[[1]],
var2 = str_split(parameter2, "_", simplify = TRUE)[[1]]) %>%
ungroup() %>%
filter(var1 == var2, var1 != "trend") -> corr_trend_var
corr_trend_var[, c("parameter1", "parameter2", "r", "ci_low", "ci_high", "p")]
parameter1 | parameter2 | r | ci_low | ci_high | p |
---|---|---|---|---|---|
ndvi_trend | ndvi_resid | 0.2932629 | 0.2900672 | 0.2964521 | 0 |
pr_trend | pr_resid | 0.2207327 | 0.2174075 | 0.2240529 | 0 |
pet_trend | pet_resid | -0.1890657 | -0.1924315 | -0.1856955 | 0 |
aet_trend | aet_resid | 0.2944051 | 0.2912117 | 0.2975920 | 0 |
tm_trend | tm_resid | 0.2422414 | 0.2389507 | 0.2455266 | 0 |
So, clearly, these are only weakly positively correlated, with the exception of `pet` (potential evapotranspiration), which is weakly negatively correlated. The mean Pearson's r for the correlation between environmental trends and the respective variability is only 0.1723153.
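The reported mean can be reproduced directly from the correlation dataframe; a minimal base-R sketch using the r values from the table above:

```r
# Pearson's r for each trend-vs-residual pair, taken from the table above
r <- c(ndvi = 0.2932629, pr = 0.2207327, pet = -0.1890657,
       aet = 0.2944051, tm = 0.2422414)
mean(r)  # the reported mean correlation of ~0.172
```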
We continue to see strong correlation between `pr_resid` and `ndvi_resid`, but the Pearson correlation coefficient is < 0.7. As the original variables (`ndvi` and `pr`) are effectively uncorrelated in their trends (`ndvi_trend` and `pr_trend`), and we want the trends and predictability models to be comparable, we keep both in.
6.2 Pairplot
While we’re at it, it’s useful to make a pairplot of all predictor variables.
set.seed(42)
data %>%
slice_sample(n = 1000) -> data_tiny
ggpairs(data_tiny, columns = all_of(c(predictors_trends)), progress = FALSE)
ggpairs(data_tiny, columns = all_of(c(predictors_resids)), progress = FALSE)
6.3 Optimal dataset size
The current dataset contains 314,868 rows of data, which is quite large for fast and efficient ML workflows. We will first determine how the performance of a few suitable learners depends on dataset size, and then determine an optimal balance between model performance and dataset size. In other words, we determine at what point increasing the dataset size (the resampled fraction) no longer substantially improves model performance. Rather than optimizing this in several steps, we tune the hyperparameter `subsample.frac` by constructing a `GraphLearner` object in an `mlr3` pipeline which tunes dataset size. The following chunk implements this procedure. As it may take very long to run, it's more convenient to run it only when strictly necessary, from the `R/hyperparm_subsample_frac.R` file. As we can use this procedure to test suitable ML learners, we use `mboost::gamboost`, `gbm::gbm` and `xgboost::xgboost` learners in our `GraphLearner`.
task <- TaskRegrST$new(id = "data", backend = data, target = "trend_long", extra_args = list(coordinate_names = c("x", "y")))
task$select(all_of(predictors))
learners <- list(
lrn("regr.gamboost", id = "gamboost"),
lrn("regr.gbm", id = "gbm"),
lrn("regr.xgboost", id = "xgboost")
)
learners_ids <- sapply(learners, function(x) x$id)
gr_subsample <- po("subsample") %>>%
po("branch", options = learners_ids) %>>%
gunion(lapply(learners, po)) %>>%
po("unbranch")
grlrn_subsample <- GraphLearner$new(gr_subsample)
min_nrows <- 100
max_nrows <- 50000
min_nrows_s <- log2(min_nrows)
max_nrow_s <- log2(max_nrows)
ps_subsample <- ParamSet$new(list(
ParamDbl$new("subsample.frac", lower = min_nrows_s, upper = max_nrow_s),
ParamFct$new("branch.selection", levels = learners_ids)
))
# Distribution of subsample.frac in grid can be visualised as follows
# x <- seq(from = min_nrows_s, to = max_nrow_s, length.out = 20)
# y <- 2^x
# plot(x, y)
# Beware: division by nrow(data) has to be hard-coded in the function below, somehow...
ps_subsample$trafo <- function(x, param_set) {
x$subsample.frac <- (2^x$subsample.frac) / 314868
return(x)
}
resample_inner <- rsmp("cv", folds = 5)
resample_outer <- rsmp("cv", folds = 5)
measure <- msr("regr.rmse")
terminator <- trm("none")
tuner <- tnr("grid_search", resolution = 20)
at <- AutoTuner$new(
learner = grlrn_subsample,
resampling = resample_inner,
measure = measure,
search_space = ps_subsample,
terminator = terminator,
tuner = tuner,
store_tuning_instance = TRUE,
store_benchmark_result = TRUE
)
future::plan("multisession", workers = 3)
rr <- resample(task = task, learner = at, resampling = resample_outer, store_models = TRUE)
saveRDS(as.data.table(rr), file = "data/processed/rr_dataset_size.RDS")
rr_data <- bind_rows(lapply(rr$learners, function(x) x$model$tuning_instance$archive$data()))
saveRDS(rr_data, file = "data/processed/rr_dataset_size_df.RDS")
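The trafo in the chunk above maps a tuning grid that is uniform on the log2 scale back to subsample fractions; its behaviour can be sketched in isolation (with the row count hard-coded, as in the chunk):

```r
# How the log2-scale search space translates to subsample fractions
n_total <- 314868                 # rows in the full dataset (hard-coded)
min_nrows_s <- log2(100)          # lower bound on the log2 scale
max_nrows_s <- log2(50000)        # upper bound on the log2 scale
# grid_search with resolution 20 evaluates 20 equally spaced values
x <- seq(from = min_nrows_s, to = max_nrows_s, length.out = 20)
frac <- 2^x / n_total             # what the trafo passes to po("subsample")
range(2^x)   # spans 100 to 50000 rows
range(frac)  # corresponding subsample fractions
```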
We can now plot the results of the previous workflow and see how cross-validated model performance depends on the subsampled size of the training dataset.
if (!exists("rr_data")) rr_data <- readRDS("data/processed/rr_dataset_size_df.RDS")
rr_data %>%
group_by(branch.selection, subsample.frac) %>%
summarise(mean_rmse = mean(regr.rmse),
q05 = quantile(regr.rmse, 0.05),
q95 = quantile(regr.rmse, 0.95),
.groups = "drop_last") %>%
ungroup() %>%
mutate(nrows = 2^subsample.frac) %>%
rename(Learner = branch.selection) %>%
ggplot(aes(color = Learner, fill = Learner)) +
geom_path(aes(x = nrows, y = mean_rmse)) +
geom_ribbon(aes(x = nrows, ymin = q05, ymax = q95), alpha = 0.5) +
scale_x_continuous(trans = "log10", breaks = c(100, 500, 1000, 5000, 10000, 50000)) +
labs(x = "Subsample size (# rows)", y = "Model RMSE")
It is clear that beyond 10,000 rows performance continues to improve only marginally for the untuned `xgboost` learner, while model improvement for the other learners has clearly hit a plateau. So we can stick to resampling 10,000 rows from the 314,868 rows in the original dataframe for our continued statistical modelling.
nrow_subsample <- 10000
6.4 Modeling assemblage trends
We can now finally start to model long-term assemblage trends using a 10,000-row subsample of the original dataset. We use a `gamboost` model as it strikes a balance between predictive accuracy and interpretability, without introducing a lot of the 'jaggedness' of boosted regression trees (e.g. from `gbm` and `xgboost`) that hampers interpretability for those unfamiliar with the data.
6.4.1 Data filtering
Exploratory modeling showed that model errors are largest in the Sahara. As only few of the species included in our assemblage trends inhabit this area, the assemblage trend there is also least reliable (i.e. it is composed of population trends of few species). It thus makes sense to exclude these areas. We set the minimum number of species contributing to an assemblage trend to >5 and limit our analysis strictly to the areas south of the Sahara, which form the `Afrotropic` realm in our ecoregion dataset. To illustrate this, let's plot the number of species that are included in the assemblage trends.
set.seed(42)
data %>%
slice_sample(n = nrow_subsample) %>%
mutate(included = if_else((count_long > 5 & Afrotropic == 1), 1, 0, 0)) %>%
as.data.frame() -> data_subset
xlim <- c(-20, 60)
ylim <- c(-40, 40)
clim <- c(min(data_subset$count_long), max(data_subset$count_long))
ggplot(data_subset, aes(x = x, y = y, color = count_long)) +
geom_point() +
scale_color_viridis_c(limits = clim) +
labs(subtitle = "Original data",
x = "Longitude", y = "Latitude", color = "# species") +
theme(legend.position = "bottom") +
coord_fixed(xlim = xlim, ylim = ylim) -> original
data_subset %>%
filter(included == 1) %>%
ggplot(aes(x = x, y = y, color = count_long)) +
geom_point() +
scale_color_viridis_c(limits = clim) +
labs(subtitle = "Filtered data",
x = "Longitude", y = "Latitude", color = "# species") +
theme(legend.position = "bottom") +
coord_fixed(xlim = xlim, ylim = ylim) -> filtered
original + filtered + plot_annotation(title = "Species in long-term assemblage trend") +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
6.4.2 Training boosted GAMs
The training of boosted GAMs (implemented in mboost
) will go as follows:
- We train models predicting long-term assemblage trends using long-term (linear) trends in environmental predictors and environmental predictability (model residuals from harmonic regression).
- We then determine the optimal number of boosting iterations using 10-fold cross-validation, as implemented in the `cvrisk` function, varying the number of boosts between 250 and 30000.
- We calculate the autocovariate of the residuals to correct for (strong) effects of spatial autocorrelation [4], which violates the independence assumption of regression.
- We retrain the model, now including the residual autocovariate as a predictor.
- We once again determine the optimal number of boosting iterations on the new models. Most likely this does not differ substantially from the previous optimum, but it is good to make sure that is the case.
Let’s start with training the models.
set.seed(42)
data %>%
filter(count_long > 5,
Afrotropic == 1) %>%
slice_sample(n = nrow_subsample) %>%
as.data.frame() -> data_subset
saveRDS(data_subset, file = "data/processed/data_subset.RDS")
model_formula <- function(predictors, response, interactions = FALSE) {
formula <- paste0(response, " ~ ", paste0("bbs(", predictors, ")", collapse = " + "))
if (interactions) {
envir_vars <- unlist(unique(lapply(str_split(predictors, "_"), function(x) x[[1]])))
interactions <- lapply(envir_vars, function(x) {
paste0("bspatial(", predictors[str_starts(predictors, paste0(x, "_"))][1], ", ",
predictors[str_starts(predictors, paste0(x, "_"))][2], ")")
})
formula <- paste0(formula, " + ", paste0(interactions, collapse = " + "))
}
formula <- as.formula(formula)
return(formula)
}
formula_trends <- model_formula(predictors_trends[!predictors_trends %in% c("trend_long", "trend_short")], "trend_long")
formula_resids <- model_formula(predictors_resids[!predictors_resids %in% c("trend_long", "trend_short")], "trend_long")
formula_all <- model_formula(predictors_all[!predictors_all %in% c("trend_long", "trend_short")], "trend_long",
interactions = TRUE)
ctrl <- boost_control(trace = FALSE, mstop = 10000)
mod_trends <- gamboost(formula_trends, data = data_subset, family = Gaussian(), control = ctrl)
mod_resids <- gamboost(formula_resids, data = data_subset, family = Gaussian(), control = ctrl)
mod_all <- gamboost(formula_all, data = data_subset, family = Gaussian(), control = ctrl)
saveRDS(mod_trends, file = "data/processed/models/mod_trends.RDS")
saveRDS(mod_resids, file = "data/processed/models/mod_resids.RDS")
saveRDS(mod_all, file = "data/processed/models/mod_all.RDS")
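For reference, a minimal base-R sketch of what `model_formula` produces in the non-interaction case (the variable names here are just illustrative ones from the chunk above):

```r
# Build "response ~ bbs(p1) + bbs(p2) + ..." as model_formula() does
# for the non-interaction case
sketch_formula <- function(predictors, response) {
  as.formula(paste0(response, " ~ ",
                    paste0("bbs(", predictors, ")", collapse = " + ")))
}
sketch_formula(c("tm_trend", "pr_trend"), "trend_long")
# trend_long ~ bbs(tm_trend) + bbs(pr_trend)
```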
6.4.2.1 Determine the optimal number of boosting iterations
Having trained the models, we can now determine the optimal number of boosting iterations. As this is a time-consuming process, the chunk below does not run by default; it's probably best to run this cross-validation procedure separately through the dedicated `R/hyperparm_mboost_mstop.R` script.
grid <- seq(from = 250, to = 30000, by = 250)
cv10f_trends <- cv(model.weights(mod_trends), type = "kfold")
saveRDS(cv10f_trends, file = "data/processed/models/cv10f_trends.RDS")
cvm_trends <- cvrisk(mod_trends, folds = cv10f_trends, grid = grid)
saveRDS(cvm_trends, file = "data/processed/models/cvm_trends.RDS")
cv10f_resids <- cv(model.weights(mod_resids), type = "kfold")
saveRDS(cv10f_resids, file = "data/processed/models/cv10f_resids.RDS")
cvm_resids <- cvrisk(mod_resids, folds = cv10f_resids, grid = grid)
saveRDS(cvm_resids, file = "data/processed/models/cvm_resids.RDS")
cv10f_all <- cv(model.weights(mod_all), type = "kfold")
saveRDS(cv10f_all, file = "data/processed/models/cv10f_all.RDS")
cvm_all <- cvrisk(mod_all, folds = cv10f_all, grid = grid)
saveRDS(cvm_all, file = "data/processed/models/cvm_all.RDS")
We can load the results and plot the outputs.
cvm_trends <- readRDS("data/processed/models/cvm_trends.RDS")
cvm_resids <- readRDS("data/processed/models/cvm_resids.RDS")
cvm_all <- readRDS("data/processed/models/cvm_all.RDS")
plot_cvm <- function(cvm, plot_id = "") {
m <- cvm
class(m) <- NULL
as.data.frame(m) %>%
rownames_to_column(var = "fold") %>%
pivot_longer(-c(fold), names_to = "boosts", values_to = "risk") %>%
mutate(boosts = as.numeric(boosts),
fold = as.factor(fold)) %>%
identity() %>%
ggplot(aes(x = boosts, y = risk, group = fold, color = fold)) +
geom_line() +
labs(title = paste0(attr(cvm, "type"), ": ", plot_id), x = "boosts (mstop)", y = attr(cvm, "risk"))
}
plot_cvm_trends <- plot_cvm(cvm_trends, "trends")
plot_cvm_resids <- plot_cvm(cvm_resids, "resids")
plot_cvm_all <- plot_cvm(cvm_all, "all")
plot_cvm_trends + plot_cvm_resids + plot_cvm_all + plot_layout(guides = "collect") & theme(legend.position = "bottom")
We can see that most cross-validated models seem to show stagnating model improvement beyond 10000 boosts, while an optimum is mostly not reached before 30000 boosts. Given the stagnating improvement, limited compute time and no need to optimize predictive performance ad infinitum, we limit the number of boosts to 10000.
6.4.2.2 Correct for spatial autocorrelation
A quick plot of model residuals will show there is strong spatial autocorrelation in the residuals, violating the independence assumption of regression modelling.
if (!exists("mod_trends")) mod_trends <- readRDS("data/processed/models/mod_trends.RDS")
if (!exists("mod_resids")) mod_resids <- readRDS("data/processed/models/mod_resids.RDS")
if (!exists("mod_all")) mod_all <- readRDS("data/processed/models/mod_all.RDS")
data_subset$norac_trends <- resid(mod_trends)
data_subset$norac_resids <- resid(mod_resids)
data_subset$norac_all <- resid(mod_all)
data_subset %>%
dplyr::select(x, y, norac_trends, norac_resids, norac_all) %>%
rename(c(Predictability = norac_resids, Trends = norac_trends, `All predictors` = norac_all)) %>%
pivot_longer(-c(x, y), names_to = "model", values_to = "residuals") %>%
ggplot(aes(x = x, y = y, color = residuals)) +
geom_point() +
scale_color_gradient2() +
coord_fixed() +
facet_wrap(vars(model)) +
labs(title = "Spatial pattern of model residuals", x = "Longitude", y = "Latitude", color = "Residual")
There's no need to even quantify spatial autocorrelation (SAC) in the residuals here, as it's so obvious. We will correct for it using the residual autocovariate method introduced by Crase et al. [4]. We calculate the autocovariate of the residuals using a neighborhood radius of 100 kilometers. This radius could be optimized to reflect actual SAC distances in our dataset, but for now it seems sufficient to dramatically reduce residual SAC.
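Conceptually, a point's autocovariate is a distance-weighted average of the residuals of its neighbours within the radius. A simplified planar sketch of that idea follows; note that `spdep::autocov_dist` (used below) additionally handles great-circle distances and other weighting styles, and the toy coordinates and residuals here are made up:

```r
# Toy planar coordinates (km) and model residuals
xy  <- cbind(x = c(0, 50, 60, 300), y = c(0, 0, 40, 0))
res <- c(1.0, -0.5, 0.2, 2.0)
radius <- 100                      # neighbourhood radius
d <- as.matrix(dist(xy))           # pairwise distances
autocov <- sapply(seq_len(nrow(xy)), function(i) {
  nb <- which(d[i, ] > 0 & d[i, ] <= radius)
  if (length(nb) == 0) return(0)   # isolated point: no neighbours
  w <- 1 / d[i, nb]                # inverse-distance weights
  sum(w * res[nb]) / sum(w)        # weighted mean of neighbours' residuals
})
autocov  # point 4 is isolated and gets 0
```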
neighborhood_dist <- 100 # Kilometers if longlat = TRUE
acov_trends <- autocov_dist(resid(mod_trends), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE,
nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_trends), as.matrix(cbind(data_subset$x, : With
## value 100 some points have no neighbours
## Warning in nb2listw(nb, glist = gl, style = style, zero.policy = zero.policy):
## zero sum general weights
acov_resids <- autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE,
nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, : With
## value 100 some points have no neighbours
## Warning in autocov_dist(resid(mod_resids), as.matrix(cbind(data_subset$x, : zero
## sum general weights
acov_all <- autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x, data_subset$y)), longlat = TRUE,
nbs = neighborhood_dist, zero.policy = TRUE)
## Warning in autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x,
## data_subset$y)), : With value 100 some points have no neighbours
## Warning in autocov_dist(resid(mod_all), as.matrix(cbind(data_subset$x,
## data_subset$y)), : zero sum general weights
The chunk above may throw warnings when some points have no neighbours within 100 kilometers, but that will almost certainly concern few points in our dataset, so we can accept that some of these autocovariates are less accurate than others.
Now we will add the calculated autocovariates of residuals to our earlier dataset and retrain the models.
data_subset$autocov_trends <- acov_trends
data_subset$autocov_resids <- acov_resids
data_subset$autocov_all <- acov_all
formula_rac <- function(formula, var) {
as.formula(paste0(paste(formula[2], formula[3], sep = " ~ "), " + bbs(", var, ")"))
}
formula_trends <- formula_rac(formula_trends, "autocov_trends")
formula_resids <- formula_rac(formula_resids, "autocov_resids")
formula_all <- formula_rac(formula_all, "autocov_all")
ctrl <- boost_control(trace = FALSE, mstop = 10000)
mod_trends_rac <- gamboost(formula_trends, data = data_subset, family = Gaussian(), control = ctrl)
mod_resids_rac <- gamboost(formula_resids, data = data_subset, family = Gaussian(), control = ctrl)
mod_all_rac <- gamboost(formula_all, data = data_subset, family = Gaussian(), control = ctrl)
saveRDS(mod_trends_rac, file = "data/processed/models/mod_trends_rac.RDS")
saveRDS(mod_resids_rac, file = "data/processed/models/mod_resids_rac.RDS")
saveRDS(mod_all_rac, file = "data/processed/models/mod_all_rac.RDS")
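The `formula_rac` helper just appends one extra `bbs()` smooth to the right-hand side of a formula; an equivalent base-R sketch (using `[[` indexing rather than the `[` used above):

```r
# Append a bbs() smooth for the residual autocovariate to a formula
add_rac_term <- function(f, var) {
  as.formula(paste0(deparse(f[[2]]), " ~ ", deparse(f[[3]]),
                    " + bbs(", var, ")"))
}
add_rac_term(trend_long ~ bbs(tm_trend), "autocov_trends")
# trend_long ~ bbs(tm_trend) + bbs(autocov_trends)
```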
And now let’s reassess residual autocorrelation from our retrained model. If all went well, that should be visibly and substantially less than before.
if (!exists("mod_trends_rac")) mod_trends_rac <- readRDS("data/processed/models/mod_trends_rac.RDS")
if (!exists("mod_resids_rac")) mod_resids_rac <- readRDS("data/processed/models/mod_resids_rac.RDS")
if (!exists("mod_all_rac")) mod_all_rac <- readRDS("data/processed/models/mod_all_rac.RDS")
data_subset$rac_trends <- resid(mod_trends_rac)
data_subset$rac_resids <- resid(mod_resids_rac)
data_subset$rac_all <- resid(mod_all_rac)
data_subset %>%
dplyr::select(x, y, rac_trends, rac_resids, rac_all) %>%
rename(c(Predictability = rac_resids, Trends = rac_trends, `All predictors` = rac_all)) %>%
pivot_longer(-c(x, y), names_to = "model", values_to = "residuals") %>%
ggplot(aes(x = x, y = y, color = residuals)) +
geom_point() +
scale_color_gradient2() +
coord_fixed() +
facet_wrap(vars(model)) +
labs(title = "Spatial pattern of model residuals", subtitle = "after including residual autocovariates",
x = "Longitude", y = "Latitude", color = "Residual")
That is much better!
6.4.2.3 Determine the optimal number of boosting iterations of RAC model
Once again, we should make sure the number of boosts used (10000) is 'optimal' for our newly trained models. As before, it's probably best to run this using the dedicated script `R/hyperparm_mboost_mstop.R`, but the chunk below can do it as well.
grid <- seq(from = 250, to = 30000, by = 250)
cv10f_trends_rac <- cv(model.weights(mod_trends_rac), type = "kfold")
saveRDS(cv10f_trends_rac, file = "data/processed/models/cv10f_trends_rac.RDS")
cvm_trends_rac <- cvrisk(mod_trends_rac, folds = cv10f_trends_rac, grid = grid)
saveRDS(cvm_trends_rac, file = "data/processed/models/cvm_trends_rac.RDS")
cv10f_resids_rac <- cv(model.weights(mod_resids_rac), type = "kfold")
saveRDS(cv10f_resids_rac, file = "data/processed/models/cv10f_resids_rac.RDS")
cvm_resids_rac <- cvrisk(mod_resids_rac, folds = cv10f_resids_rac, grid = grid)
saveRDS(cvm_resids_rac, file = "data/processed/models/cvm_resids_rac.RDS")
cv10f_all_rac <- cv(model.weights(mod_all_rac), type = "kfold")
saveRDS(cv10f_all_rac, file = "data/processed/models/cv10f_all_rac.RDS")
cvm_all_rac <- cvrisk(mod_all_rac, folds = cv10f_all_rac, grid = grid)
saveRDS(cvm_all_rac, file = "data/processed/models/cvm_all_rac.RDS")
We can load these results as well and plot the outputs.
cvm_trends_rac <- readRDS("data/processed/models/cvm_trends_rac.RDS")
cvm_resids_rac <- readRDS("data/processed/models/cvm_resids_rac.RDS")
cvm_all_rac <- readRDS("data/processed/models/cvm_all_rac.RDS")
plot_cvm <- function(cvm, plot_id = "") {
m <- cvm
class(m) <- NULL
as.data.frame(m) %>%
rownames_to_column(var = "fold") %>%
pivot_longer(-c(fold), names_to = "boosts", values_to = "risk") %>%
mutate(boosts = as.numeric(boosts),
fold = as.factor(fold)) %>%
identity() %>%
ggplot(aes(x = boosts, y = risk, group = fold, color = fold)) +
geom_line() +
labs(title = paste0(attr(cvm, "type"), ": ", plot_id), x = "boosts (mstop)", y = attr(cvm, "risk"))
}
plot_cvm_trends_rac <- plot_cvm(cvm_trends_rac, "trends")
plot_cvm_resids_rac <- plot_cvm(cvm_resids_rac, "resids")
plot_cvm_all_rac <- plot_cvm(cvm_all_rac, "all")
plot_cvm_trends_rac + plot_cvm_resids_rac + plot_cvm_all_rac +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
Although model improvement did not stagnate as quickly for these models, stopping at 10,000 boosting iterations remains reasonable. We could increase it to 20,000, but that would (probably) at least double the time required for the bootstrapping used to quantify model uncertainty in later steps. As predictive performance is not the ultimate goal of this analysis, that trade-off is fine.
6.4.2.4 Quantifying residual autocorrelation
Now is a good moment to quantify residual spatial autocorrelation (RSAC) in our models. We do so by calculating Moran's I across a number of distance classes (8) using the `sp.correlogram` function from the `spdep` package. As this is computationally very intensive, we run this calculation once in the chunk below and save the results for fast plotting during knitting of this document.
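As a reminder, Moran's I for values x with spatial weights w_ij is I = (n/W) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2, where W is the sum of all weights. A minimal base-R sketch on a toy line of four points with binary adjacency weights:

```r
# Moran's I from first principles
morans_i <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)                     # deviations from the mean
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}
# Four points on a line, binary adjacency (neighbours share an edge)
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1
morans_i(c(1, 2, 3, 4), w)  # monotone values give strongly positive I
```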
nclass <- 8 # Number of distance classes
neighbors <- dnearneigh(as.matrix(cbind(data_subset$x, data_subset$y)), d1 = 0, d2 = neighborhood_dist, longlat = TRUE)
spc_norac_trends <- sp.correlogram(neighbors, data_subset$norac_trends, order = nclass, method = "I", zero.policy = TRUE)
spc_norac_resids <- sp.correlogram(neighbors, data_subset$norac_resids, order = nclass, method = "I", zero.policy = TRUE)
spc_norac_all <- sp.correlogram(neighbors, data_subset$norac_all, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_trends <- sp.correlogram(neighbors, data_subset$rac_trends, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_resids <- sp.correlogram(neighbors, data_subset$rac_resids, order = nclass, method = "I", zero.policy = TRUE)
spc_rac_all <- sp.correlogram(neighbors, data_subset$rac_all, order = nclass, method = "I", zero.policy = TRUE)
saveRDS(spc_norac_trends, file = "data/processed/models/spc_norac_trends.RDS")
saveRDS(spc_norac_resids, file = "data/processed/models/spc_norac_resids.RDS")
saveRDS(spc_norac_all, file = "data/processed/models/spc_norac_all.RDS")
saveRDS(spc_rac_trends, file = "data/processed/models/spc_rac_trends.RDS")
saveRDS(spc_rac_resids, file = "data/processed/models/spc_rac_resids.RDS")
saveRDS(spc_rac_all, file = "data/processed/models/spc_rac_all.RDS")
And we can now plot the RSAC correlograms.
if (!exists("spc_norac_trends")) spc_norac_trends <- readRDS("data/processed/models/spc_norac_trends.RDS")
if (!exists("spc_norac_resids")) spc_norac_resids <- readRDS("data/processed/models/spc_norac_resids.RDS")
if (!exists("spc_norac_all")) spc_norac_all <- readRDS("data/processed/models/spc_norac_all.RDS")
if (!exists("spc_rac_trends")) spc_rac_trends <- readRDS("data/processed/models/spc_rac_trends.RDS")
if (!exists("spc_rac_resids")) spc_rac_resids <- readRDS("data/processed/models/spc_rac_resids.RDS")
if (!exists("spc_rac_all")) spc_rac_all <- readRDS("data/processed/models/spc_rac_all.RDS")
par(mfrow = c(2, 3))
plot(spc_norac_trends, main = "RSAC Trends mdl")
plot(spc_norac_resids, main = "RSAC Predictability mdl")
plot(spc_norac_all, main = "RSAC All preds mdl")
plot(spc_rac_trends, main = "RSAC Trends RAC mdl")
plot(spc_rac_resids, main = "RSAC Predictability RAC mdl")
plot(spc_rac_all, main = "RSAC All preds RAC mdl")
Adding the autocovariate of residuals seems to have helped a lot to counter residual spatial autocorrelation. It’s possible this can be improved further by tweaking the radius of the neighborhood over which the autocovariate is calculated, but that’s some tweaking we can do later.
6.4.3 Model outputs
6.4.3.1 Model performance
We use good ol’ RMSE, % of deviance explained and Moran’s I to compare the performance of all these models.
perf_metrics <- function(actual, models) {
rmse <- lapply(models, function(x) rmse(actual, predict(x)))
deviance_explained <- lapply(models, function(x) (1 - (tail(risk(x), 1) / head(risk(x), 1))) * 100)
m <- rbind(rmse, deviance_explained)
mode(m) <- "numeric"
m
}
residual_columns <- c("norac_trends", "norac_resids", "norac_all", "rac_trends", "rac_resids", "rac_all")
if (!exists("neighbors")) {
neighbors <- dnearneigh(as.matrix(cbind(data_subset$x, data_subset$y)), d1 = 0, d2 = neighborhood_dist, longlat = TRUE)
}
sar_metrics <- lapply(residual_columns, function(x) {
mi <- moran.test(data_subset[, x], nb2listw(neighbors, zero.policy = TRUE), zero.policy = TRUE)
c(round(mi$estimate[1], 2), mi$p.value)
})
sar_metrics <- data.frame(matrix(unlist(sar_metrics), ncol = length(sar_metrics)))
pm <- as.data.frame(perf_metrics(data_subset$trend_long,
list(mod_trends, mod_resids, mod_all, mod_trends_rac, mod_resids_rac, mod_all_rac)))
colnames(sar_metrics) <- colnames(pm)
pm <- rbind(pm, sar_metrics)
colnames(pm) <- c("mod_trends", "mod_resids", "mod_all", "mod_trends_rac", "mod_resids_rac", "mod_all_rac")
rownames(pm) <- c("RMSE", "% Deviance explained", "Moran's I", "Moran's I test p-val")
pm <- as.data.frame(t(pm))
pm
| | RMSE | % Deviance explained | Moran's I | Moran's I test p-val |
---|---|---|---|---|
mod_trends | 16.820661 | 28.84072 | 0.83 | 0 |
mod_resids | 13.346926 | 55.19692 | 0.73 | 0 |
mod_all | 11.230939 | 68.27676 | 0.64 | 0 |
mod_trends_rac | 6.834131 | 88.25341 | 0.24 | 0 |
mod_resids_rac | 6.227944 | 90.24483 | 0.18 | 0 |
mod_all_rac | 5.869456 | 91.33555 | 0.13 | 0 |
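The '% Deviance explained' column follows from the boosting risk path, comparing the final in-sample risk to the initial (null-model) risk, as in `perf_metrics` above. A sketch with a hypothetical risk path:

```r
# % deviance explained from an mboost-style risk path
# (risk(model) returns the empirical risk at each boosting iteration;
#  the values below are made up for illustration)
risk_path <- c(100, 60, 45, 35, 30)
deviance_explained <- (1 - tail(risk_path, 1) / head(risk_path, 1)) * 100
deviance_explained  # boosting removed 70% of the initial risk
```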
6.4.3.2 Variable importance
We can quantify the importance of variables within our models using the `varimp` function of the `mboost` package. This returns the variable importance: a measure of the total improvement in model deviance that a variable is responsible for, alongside the frequency with which each predictor was selected across boosting iterations (`selfreq` in the tables below).
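The `imp` and `imp_biol` columns in the tables below simply renormalise the deviance reductions: once over all base-learners, and once after zeroing out the autocovariate so only ecological predictors are compared. Using the reduction values of the trends model:

```r
# Deviance reductions from varimp(mod_trends_rac), as tabulated below
reduction <- c(autocov_trends = 269.422461, tm_trend = 46.902638,
               pr_trend = 21.384352, aet_trend = 4.796977,
               pet_trend = 4.594185, ndvi_trend = 3.801587)
imp <- reduction / sum(reduction) * 100          # share of total reduction
biol <- replace(reduction, "autocov_trends", 0)  # drop the autocovariate
imp_biol <- biol / sum(biol) * 100               # share among ecological terms
round(imp, 2)       # autocov_trends dominates at ~76.8%
round(imp_biol, 2)  # tm_trend leads the ecological predictors at ~57.6%
```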
6.4.3.2.1 Trends model
plot(varimp(mod_trends_rac), main = "Varimp Trends RAC mdl")
as.data.frame(varimp(mod_trends_rac)) %>%
arrange(desc(reduction)) %>%
mutate(imp = reduction / sum(reduction) * 100,
imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
imp = round(imp, 2))
reduction | blearner | variable | selfreq | imp | imp_biol |
---|---|---|---|---|---|
269.422461 | bbs(autocov_trends) | autocov_trends | 0.2410 | 76.78 | 0.00 |
46.902638 | bbs(tm_trend) | tm_trend | 0.2565 | 13.37 | 57.56 |
21.384352 | bbs(pr_trend) | pr_trend | 0.1996 | 6.09 | 26.24 |
4.796977 | bbs(aet_trend) | aet_trend | 0.1274 | 1.37 | 5.89 |
4.594185 | bbs(pet_trend) | pet_trend | 0.1399 | 1.31 | 5.64 |
3.801587 | bbs(ndvi_trend) | ndvi_trend | 0.0356 | 1.08 | 4.67 |
6.4.3.2.2 Predictability model
plot(varimp(mod_resids_rac), main = "Varimp Predictability RAC mdl")
as.data.frame(varimp(mod_resids_rac)) %>%
arrange(desc(reduction)) %>%
mutate(imp = reduction / sum(reduction) * 100,
imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
imp = round(imp, 2))
reduction | blearner | variable | selfreq | imp | imp_biol |
---|---|---|---|---|---|
177.3754837 | bbs(autocov_resids) | autocov_resids | 0.1691 | 49.43 | 0.00 |
94.7949396 | bbs(ndvi_resid) | ndvi_resid | 0.5856 | 26.42 | 52.24 |
57.5502643 | bbs(tm_resid) | tm_resid | 0.0712 | 16.04 | 31.72 |
17.5826413 | bbs(pet_resid) | pet_resid | 0.0631 | 4.90 | 9.69 |
10.6466394 | bbs(aet_resid) | aet_resid | 0.0616 | 2.97 | 5.87 |
0.8702888 | bbs(pr_resid) | pr_resid | 0.0494 | 0.24 | 0.48 |
6.4.3.2.3 All predictor models
plot(varimp(mod_all_rac), main = "Varimp All pred RAC mdl")
as.data.frame(varimp(mod_all_rac)) %>%
arrange(desc(reduction)) %>%
mutate(imp = reduction / sum(reduction) * 100,
imp_biol = if_else(str_detect(blearner, "autocov"), 0, imp),
imp_biol = round((imp_biol / sum(imp_biol) * 100), 2),
imp = round(imp, 2)) %>%
head(20)
reduction | blearner | variable | selfreq | imp | imp_biol |
---|---|---|---|---|---|
139.9040016 | bbs(autocov_all) | autocov_all | 0.0352 | 38.52 | 0.00 |
127.0774984 | bspatial(tm_trend, tm_resid) | tm_resid, tm_trend | 0.2324 | 34.99 | 56.92 |
56.2974251 | bbs(ndvi_resid) | ndvi_resid | 0.1593 | 15.50 | 25.22 |
18.0412086 | bspatial(aet_trend, aet_resid) | aet_resid, aet_trend | 0.2036 | 4.97 | 8.08 |
12.2534894 | bspatial(pet_trend, pet_resid) | pet_resid, pet_trend | 0.1641 | 3.37 | 5.49 |
4.1345608 | bspatial(ndvi_trend, ndvi_resid) | ndvi_resid, ndvi_trend | 0.0977 | 1.14 | 1.85 |
2.6364365 | bbs(pr_trend) | pr_trend | 0.0058 | 0.73 | 1.18 |
1.1339132 | bbs(pet_resid) | pet_resid | 0.0082 | 0.31 | 0.51 |
0.8446243 | bbs(aet_resid) | aet_resid | 0.0073 | 0.23 | 0.38 |
0.4523058 | bspatial(pr_trend, pr_resid) | pr_resid, pr_trend | 0.0610 | 0.12 | 0.20 |
0.1833741 | bbs(pet_trend) | pet_trend | 0.0095 | 0.05 | 0.08 |
0.0976552 | bbs(tm_trend) | tm_trend | 0.0123 | 0.03 | 0.04 |
0.0790772 | bbs(aet_trend) | aet_trend | 0.0017 | 0.02 | 0.04 |
0.0196674 | bbs(tm_resid) | tm_resid | 0.0015 | 0.01 | 0.01 |
0.0017870 | bbs(pr_resid) | pr_resid | 0.0004 | 0.00 | 0.00 |
0.0000000 | bbs(ndvi_trend) | ndvi_trend | 0.0000 | 0.00 | 0.00 |
6.4.3.3 Partial/Marginal Effects
And, finally, we can have a look at the modelled effects of predictors on long-term assemblage trends. We first compare the output of the original models.
Because the full model included interaction terms, and by default the partial dependence is evaluated at every unique combination of values of the two variables involved, this can get extremely computationally intensive. So instead we use a tweaked version of the `plot.mboost` function, which lets us set a `gridsize`: the number of values at which the partial dependence is evaluated on a quantile grid. See the `R/mboost_plot_grid.R` file for details.
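A quantile grid places the evaluation points at equally spaced quantiles of the observed values, so sparse tails get fewer grid points than a uniform grid would give them. A base-R sketch of the idea (the `gridsize` and data here are illustrative):

```r
# Evaluate effects on a grid of equally spaced quantiles of x
quantile_grid <- function(x, gridsize) {
  quantile(x, probs = seq(0, 1, length.out = gridsize), names = FALSE)
}
set.seed(1)
x <- rexp(1000)            # a skewed variable
g <- quantile_grid(x, 5)
g  # grid points are dense where the data are dense
```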
source('~/envirpred/R/mboost_plot_grid.R')
6.4.3.3.1 Partial effects RAC models
rm(mod_resids_rac)
plot_all <- function(model, gridsize) {
preds <- names(model.frame(model))
single <- which(str_starts(preds, "bbs"))
interx <- which(str_starts(preds, "bspatial"))
for (id in single) {
w <- which(names(variable.names(model)) == preds[id])
plot(model, which = w)
}
for (id in interx) {
w <- which(names(variable.names(model)) == preds[id])
print(plot.mboost_grid(model, which = w, gridsize = gridsize))
}
}
plot_all(mod_all_rac, 200)
And because we focus specifically on temperature, let’s prettify that plot a bit.
preds <- names(model.frame(mod_all_rac))
single <- which(str_starts(preds, "bbs"))
interx <- which(str_starts(preds, "bspatial"))
w <- which(names(variable.names(mod_all_rac)) == preds[14])
pal <- rev(RColorBrewer::brewer.pal('RdBu', n=11))
theme <- rasterTheme(region = pal)
plot.mboost_grid(mod_all_rac, which = w, gridsize = 200, at = seq(from = -40, to = 40, by = 2.5),
xlab = expression("Temp. trend [0.1"~degree*C~y^-1*"]"),
ylab = expression("Temp. variability [0.1"~degree*C*"]"),
par.settings = theme)
rm(mod_all_rac)