Date of Graduation


Document Type


Degree Name

Doctor of Philosophy in Engineering (PhD)

Degree Level



Department

Industrial Engineering


Advisor

Xiao Liu

Committee Member

Edward A. Pohl

Second Committee Member

Margaret Bennewitz

Third Committee Member

W. Art Chaovalitwongse


Keywords

Boosting Trees, Ensemble Learning, Feature Selection, Gradient Boosted Trees, Medical Imaging, Random Forests, Spatial Statistics, Statistical Learning


Medical imaging data, such as positron emission tomography (PET), computed tomography (CT), and fluorescence intravital microscopy (IVM), have become prevalent in a wide variety of applications, from diagnosis, tracking disease progression, and monitoring treatment effectiveness to supporting clinical decision-making. The detailed information generated by medical imaging has enabled physicians to provide more comprehensive care. Although numerous machine learning algorithms have been developed, especially for imaging data, handling the unique structures of imaging data remains a major challenge. In this dissertation, we propose novel statistical tree-based methods that deliver more efficient and more accurate predictions for medical imaging applications.

In Chapter 2, we introduce Gradient Boosted Trees for Spatial Data (Boost-S) with covariate information. The main innovation of this chapter is to incorporate the spatial correlation structure into the boosting framework. Boosted trees are among the most successful statistical learning approaches, sequentially growing an ensemble of simple regression trees. However, gradient boosted trees are not yet available for spatially correlated data. Boost-S integrates the spatial correlation structure into the classical framework of gradient boosted trees. Each tree is constructed by solving a regularized optimization problem whose objective function accounts for the underlying spatial correlation and includes two penalty terms on tree complexity. A computationally efficient greedy heuristic algorithm is proposed to obtain the ensemble of trees. The proposed Boost-S is applied to spatially correlated FDG-PET (fluorodeoxyglucose positron emission tomography) imaging data collected from clinical trials of cancer chemoradiotherapy. Quantitatively assessing and monitoring tumor response to therapy is essential for an optimized treatment plan; hence, accurately predicting the change in SUV (standardized uptake value) is critical for treatment optimization and control. Our numerical investigations demonstrate the advantages of the proposed Boost-S over existing approaches for this application.
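The chapter's core idea, fitting each tree to gradients that reflect spatial correlation, can be sketched roughly as follows. This is a hedged illustration, not the Boost-S algorithm itself: the exponential covariance kernel, nugget term, learning rate, and toy data are all assumptions, and the regularized tree construction with its two complexity penalties is omitted.

```python
# Rough sketch of gradient boosting under a generalized-least-squares loss
# (y - f)' Sigma^{-1} (y - f), where Sigma models spatial correlation.
# All modeling choices below (exponential kernel, nugget, learning rate)
# are illustrative assumptions, not the Boost-S specification.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy spatially indexed data: 2-D coordinates, covariates, response.
coords = rng.uniform(0, 1, size=(200, 2))
X = rng.normal(size=(200, 3))
y = np.sin(3 * coords[:, 0]) + X[:, 0] + 0.1 * rng.normal(size=200)

# Assumed exponential spatial covariance plus a nugget (noise) term.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-d / 0.3) + 0.5 * np.eye(len(y))
Sigma_inv = np.linalg.inv(Sigma)

features = np.hstack([coords, X])
pred = np.zeros_like(y)
nu = 0.1  # learning rate
for _ in range(50):
    # Negative gradient of the GLS loss; spatial correlation enters here.
    grad = Sigma_inv @ (y - pred)
    tree = DecisionTreeRegressor(max_depth=2).fit(features, grad)
    pred += nu * tree.predict(features)
```

Replacing `Sigma_inv` with the identity recovers ordinary squared-error gradient boosting, which is why the spatial structure can be folded into the classical framework.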

In Chapter 3, we propose a Structured Adaptive Boosting Trees algorithm (AdaBoost.S) for the edge detection problem in medical images. The main innovation of this chapter is to develop structural learning within an additive boosting model. The algorithm is motivated by the well-known observation that edges over an image mask often exhibit special structures and are highly interdependent. Such structures can be predicted using features extracted from a larger image patch that covers the image mask. We present the details of feature extraction and of constructing structured boosting trees within the classical framework of adaptive boosting. The proposed AdaBoost.S is applied to detect platelet-neutrophil aggregates in a large number of fluorescence IVM images of the pulmonary microcirculation. Platelet-neutrophil aggregates are important for assessing lung injury from e-cigarette exposure; a statistical learning algorithm is therefore needed to detect their edges efficiently and accurately. The predictive capabilities of the proposed approach are demonstrated by comparing its F-score, precision, and recall with those of other methods.
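For readers unfamiliar with the underlying framework, here is a minimal sketch of classical discrete AdaBoost, which AdaBoost.S extends to structured edge labels. The toy data and all parameters are assumptions; the patch-based feature extraction and structured outputs of AdaBoost.S are not reproduced.

```python
# Minimal sketch of classical (discrete) AdaBoost with decision stumps.
# AdaBoost.S builds on this additive reweighting scheme; the structured
# edge labels described in the chapter are not modeled here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # toy binary "edge/non-edge" labels

w = np.full(len(y), 1 / len(y))  # per-sample weights
ensemble = []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y])                   # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)               # up-weight misclassified samples
    w /= w.sum()
    ensemble.append((alpha, stump))

# Final prediction: sign of the weighted vote of all stumps.
F = sum(a * m.predict(X) for a, m in ensemble)
accuracy = np.mean(np.sign(F) == y)
```

Each round focuses subsequent stumps on the examples the current ensemble gets wrong, which is what makes the additive model well suited to highly interdependent labels once structure is added.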

In Chapter 4, we review a variety of feature selection (FS) techniques built around random forests (RF-based wrappers), motivated by the recent proliferation of high-dimensional datasets in many fields, especially gene selection studies. The main goal of such techniques is to identify and eliminate features with little or no predictive power, thereby improving predictive accuracy, enhancing the interpretability of an otherwise complex data structure, and significantly reducing the computational complexity of the predictor. Our review covers the Boruta, RRF, GRRF, GRF, r2VIM, PIMP (Altmann), NTA (vita), varSelRF, VSURF, RF-SRC, AUCRF, and RFE methods. We also apply these methods to three publicly available datasets.
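The common idea behind these wrappers can be shown with a bare-bones sketch: rank features by random-forest importance and keep only the informative ones. This is an assumed, simplified illustration of the generic wrapper idea, not an implementation of any one surveyed method.

```python
# Bare-bones RF-based feature selection: rank features by impurity
# importance and drop the low-ranked ones. The surveyed methods (Boruta,
# VSURF, RFE, ...) refine this idea, e.g. with shadow features or
# permutation-based importance. Data and cutoffs here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 400
informative = rng.normal(size=(n, 3))  # 3 features that drive the label
noise = rng.normal(size=(n, 7))        # 7 pure-noise features
X = np.hstack([informative, noise])
y = (informative.sum(axis=1) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
selected = ranking[:3]                 # keep the top-ranked features
```

In practice the cutoff is chosen by the wrapper itself, e.g. via comparison to permuted "shadow" features (Boruta) or recursive elimination with cross-validation (RFE), rather than fixed in advance as here.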