One of the most distinctive classes I took in graduate school, taught by a PhD student, was called “Machine Learning Concepts and Applications” (FOR 796). Machine learning refers to the process by which a model uses data to make predictions about a y-variable of interest. This differs from traditional statistical modeling in that machine learning is not interested in the true relationship between a set of variables (nor in statistical significance, for that matter); instead, it is interested only in using a set of x-variables to predict a y-variable. Put another way, it is concerned only with the accuracy of its predictions, regardless of the strength of correlations between variables. The course covered machine learning algorithms such as random forests and gradient boosting machines, as well as how to quantify the accuracy of different algorithms using loss functions such as root mean square error (in other words, how close are our predictions to the true y-values?). We also discussed modeling approaches for both continuous and categorical data, how to tune the hyperparameters of different models (values that control the learning process) to improve predictive power via techniques like cross-validation, and how to use a hold-out test set to see how well trained models predict new data. All analyses were conducted in R. Because I, like many wildlife biology students, have always been more interested in the ecology side of research than in the coding, this course was good exposure to implementing entirely new modeling approaches in R. Machine learning has been, and will continue to be, extremely useful as humans seek to predict how their activities will affect ecological systems across the globe.
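To make the idea of a loss function concrete, here is a minimal sketch of root mean square error, alongside mean absolute error (the second metric reported in the abstract). It is written in Python purely for illustration and is not code from the course, which used R; the function names and the four made-up observations are assumptions, not values from my project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up hold-out values, for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.0, 9.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

Both metrics are in the units of the y-variable and lower is better. Because RMSE squares each residual before averaging, it penalizes a few large misses more heavily than MAE does.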
Our grade in the course was determined by a final project in which we applied two “complex” machine learning algorithms and one “simple” traditional model to a dataset of our choosing to determine which had the highest predictive power, and then wrote up our results in a short report. Though it had a small number of observations (47), I decided to use the dataset from my undergraduate research to predict the path tortuosity (broadly defined as the number of twists and turns in a trail) of translocated woodrats from several explanatory variables (vegetation type, age, and sex, among others). For the simple model, I used a linear mixed model with animal ID as a random effect, whereas for the complex models I used a random forest and a stochastic gradient boosting machine. Because of the small sample size, I tuned the hyperparameters of the complex models using leave-one-out cross-validation. My findings indicated that the two complex models performed similarly and both predicted more accurately than the simple model, which was a surprise given the small size of my dataset (machine learning methods generally perform best on large datasets containing hundreds to thousands of observations). Overall, my project demonstrated the value of machine learning algorithms in making predictions even with small datasets. The full abstract is attached below.
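For readers unfamiliar with leave-one-out cross-validation, the sketch below shows the general idea: each observation is held out once, the model is fit to the rest, and the held-out prediction errors are pooled into a single score for each candidate hyperparameter value. This is not my project code. It is a Python illustration that swaps in a deliberately simple model (k-nearest-neighbors regression on one predictor, with k as the hyperparameter) in place of the random forest and gradient boosting machine, and the data and function names are made up.

```python
import math


def knn_predict(train_x, train_y, x_new, k):
    """Predict by averaging the y-values of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def loocv_rmse(xs, ys, k):
    """Leave-one-out cross-validated RMSE for a given k."""
    squared_errors = []
    for i in range(len(xs)):
        # Hold out observation i and fit on everything else.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = knn_predict(train_x, train_y, xs[i], k)
        squared_errors.append((ys[i] - pred) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def tune_k(xs, ys, candidates):
    """Return the candidate k with the lowest LOOCV RMSE, plus all scores."""
    scores = {k: loocv_rmse(xs, ys, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Made-up data, for illustration only (not the woodrat dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2]

best_k, scores = tune_k(xs, ys, candidates=[1, 2, 3])
print("LOOCV RMSE by k:", scores)
print("Selected k:", best_k)
```

With n observations, leave-one-out refits the model n times per candidate value. That is affordable for a dataset of 47 rows and uses nearly all of the data for training in every fold, which is why it suits small samples.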
Abstract: Small mammals are commonly used to determine what effects a disturbance will have on an ecological system. Their small size enables them to perceive alterations in vegetation structure at fine scales, and also makes them easy and inexpensive to capture and study. While most published studies in disturbance ecology have focused on changes in small mammal diversity, movement has been less studied. Movement, especially path tortuosity, is important to consider because it indicates space use, microhabitat selection, foraging behavior, and perceived predation risk, features that affect population-level processes. Being able to predict how tortuosity relates to various extrinsic and intrinsic variables after a disturbance, specifically wildfire, can inform management actions in accordance with changing environmental conditions. Therefore, three models (a linear mixed model, a random forest, and a stochastic gradient boosting machine) were fit to a dataset of Mexican woodrat (Neotoma mexicana) movement collected in the Pinaleño Mountains of southeastern Arizona to assess which best predicts path tortuosity based on nine predictor variables. Results indicated that the random forest and stochastic gradient boosting machine produced similar outcomes and both outperformed the linear mixed model based on root mean square error and mean absolute error values. The similarity in outcome between the two machine learning models could be explained by the small sample size, though the result nevertheless underscores the value of machine learning algorithms even for small datasets. Overall, the findings of this analysis can be used to predict how species will respond to changing environmental conditions and inform management actions.