Date of Graduation
5-2025
Document Type
Thesis
Degree Name
Bachelor of Science in Data Science
Degree Level
Undergraduate
Department
Data Science
Advisor/Mentor
Roa, Raj
Committee Member
Schubert, Karl
Second Committee Member
Muldoon, Tim
Abstract
Stem cells have the unique ability to both self-renew and differentiate into specialized cell types, making them fundamental to growth, development, and tissue regeneration throughout the body. However, their behavior and essential functions are influenced by a complex interplay of genetic and environmental factors. Aberrations in their gene expression profiles can lead to dysfunctional or diseased cells, potentially compromising tissue repair and regeneration. Given their promise in regenerative medicine for restoring damaged tissues and treating various conditions, accurately classifying stem cells to detect abnormalities is critical. Such classification helps ensure that only healthy, viable cells are used in therapeutic applications, preventing issues that could limit the effectiveness of stem cell therapies or introduce complications in clinical practice.
One method of classification is machine learning, which allows researchers to process and interpret vast, complex datasets, including stem cell gene expression profiles. By leveraging machine learning, researchers can uncover subtle patterns within these profiles that might otherwise go undetected, offering deeper insight into cell quality and differentiation potential. Machine learning models can analyze thousands of genes simultaneously, allowing them to identify key biomarkers and expression patterns that distinguish normal from abnormal stem cells. This capability is valuable both for this classification task and for future predictive modeling. Furthermore, machine learning allows for high-throughput analysis, making it possible to evaluate large numbers of stem cells quickly, with less bias and greater precision than manual analysis. This not only accelerates the research process but also supports scalable, reproducible insights into stem cell health, ultimately enhancing regenerative medicine approaches and the safe application of stem cell therapies.
After comparing Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting models, we found that Random Forest delivered the most consistent and contextually appropriate results for this study. While Logistic Regression achieved the highest overall accuracy, Random Forest and Logistic Regression met our key performance priorities equally well: a low false negative rate and high recall for Class I. Although Random Forest tended to produce more false positives, this skew reflects a conservative approach that favors the identification of abnormal stem cells, even at the risk of overcalling. In the context of stem cell therapy, this trade-off is desirable: a false negative could allow a harmful cell to slip through, while a false positive simply errs on the side of caution. Ultimately, Random Forest's ability to capture complex, nonlinear relationships, which Logistic Regression inherently lacks, combined with its emphasis on minimizing false negatives, makes it the most suitable choice for our application.
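The comparison described above can be illustrated with a minimal sketch, assuming scikit-learn and a synthetic dataset standing in for the gene expression profiles. The data, the hyperparameters, the helper name `evaluate`, and the use of label 1 for the abnormal class are all assumptions made for illustration; none of this reproduces the thesis's actual pipeline or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix (rows = samples, columns = genes).
# Label 1 is treated as the abnormal class; this labeling is an assumption.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The four model families named in the abstract, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}


def evaluate(model):
    """Fit one model and report the metrics prioritized in the abstract."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": tp / (tp + fn),               # abnormal cells correctly flagged
        "false_negative_rate": fn / (tp + fn),  # abnormal cells missed
        "false_positives": int(fp),             # normal cells flagged as abnormal
    }


results = {name: evaluate(model) for name, model in models.items()}

for name, metrics in results.items():
    print(
        f"{name}: accuracy={metrics['accuracy']:.3f}, "
        f"recall={metrics['recall']:.3f}, "
        f"FNR={metrics['false_negative_rate']:.3f}, "
        f"FP={metrics['false_positives']}"
    )
```

Recall and the false negative rate for the abnormal class always sum to one, so a model selected for high recall is also selected for a low false negative rate. Overall accuracy can rank the models differently because it weights false positives and false negatives equally.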
Keywords
Stem Cell Therapy; Stem Cell Classification; Machine Learning; Random Forest; Abnormal Stem Cell
Citation
Saitta, S. A. (2025). Machine learning-assisted analyses for identification and prediction of genetic abnormalities in human pluripotent stem cell populations. Data Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/dtscuht/24