Date of Graduation

12-2024

Document Type

Thesis

Degree Name

Master of Science in Statistics and Analytics (MS)

Degree Level

Graduate

Department

Statistics and Analytics

Advisor/Mentor

Fernandes, Samuel B.

Committee Member

Vieira, Caio C.

Second Committee Member

Adams, Richard

Third Committee Member

das Gracas Dias, Kaio O.

Abstract

Plant breeding is essential to increasing genetic gain and food production worldwide. This study evaluated new ways to apply machine learning (ML) to plant breeding challenges by testing two ideas. The first chapter focuses on combining genetic and environmental data with ML to improve the prediction of maize grain yield in multi-environment trials. The second chapter centers on coupling feature selection of molecular markers with ML to enhance the prediction of yield in soybean. In both cases, ML approaches were compared to well-established statistical methods widely adopted by the plant breeding community. For the first chapter, using multi-environment trial data from the Genomes To Fields initiative, different models were developed and tested to predict maize grain yield from various input types: genetic data, environmental data, or a combination of both, integrated in either an additive (genetic-and-environmental; G+E) or a multiplicative (genotype-by-environment interaction; GEI) framework. For the second chapter, using two distinct soybean datasets, different feature selection methods were coupled with an ML model to test whether a subset of the molecular markers could yield better prediction ability than the whole set of predictors. Overall, for the maize dataset, incorporating environmental data increased the mean prediction accuracy of ML-based genomic prediction models by up to 7% compared to the established Factor Analytic Multiplicative Mixed Model across the three cross-validation scenarios assessed. Notably, the G+E model demonstrated advantages over the GEI model, offering comparable or superior prediction accuracy, reduced computational demands, and the flexibility to capture interactions by construction.
Additionally, employing feature selection with ML improved prediction ability in one of the soybean datasets across the three cross-validation schemes evaluated. Interestingly, ML using only 60 to 100 predictors achieved higher mean predictive ability than the Ridge Regression Best Linear Unbiased Predictor, which used as many as 1,400 to 2,200 predictors. Taken together, as emerging technologies facilitate the collection of high-dimensional genetic and environmental data, our results suggest that ML-based approaches can enhance the efficiency of plant breeding programs.
