Date of Graduation

7-2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Mathematics (PhD)

Degree Level

Graduate

Department

Mathematical Sciences

Advisor/Mentor

Qingyang Zhang

Committee Member

Tulin Kaman

Second Committee Member

Jung Ae Lee-Barlett

Keywords

compositional data, distance-based, high dimensional compositional data, microbiome, statistical model

Abstract

Compositional data are data that lie on a simplex; they are common in many scientific domains such as genomics, geology, and economics. Because the components of a composition must sum to one, traditional tests designed for unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data. This dissertation is motivated by statistical problems arising in the analysis of compositional data. In particular, we focus on the high-dimensional and over-dispersed setting, where the dimensionality of the compositions is greater than the sample size and the dispersion parameter is moderate or large. We consider the general problem of testing for a compositional difference between K populations. We propose a new Bayesian hypothesis, together with a nonparametric, distance-based testing method. Furthermore, we use multiple variable-selection models, including LASSO, elastic net, ridge regression, and the cumulative logit model, to identify the most important subset of variables. This dissertation is structured as follows:

Chapter 1 introduces compositional microbiome data and then briefly reviews the statistical tests and models used in our framework, including distance correlation, LASSO, ridge regression, elastic net, the cumulative logit model, and the adjacent-category logit model.

Chapter 2 then presents our new statistical test together with two real-world applications from human microbiome studies. We first formulate a hypothesis from the Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. The distance-based method is more sensitive to compositional differences than mean-based methods, especially when the data are over-dispersed or zero-inflated. Unlike most existing tests for compositional data, it does not rely on any data transformation, sparsity assumption, or regularity conditions on the covariance matrix, but analyzes the compositions directly. The performance of this method is evaluated using simulation studies. We apply this new procedure to two human microbiome datasets: a throat microbiome dataset and an intestinal microbiome dataset.
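
As a rough illustration of what a nonparametric test based on inter-point distances can look like, the Python sketch below runs a two-sample permutation test on an energy-distance statistic. It is not the procedure developed in the dissertation: the choice of statistic, the Euclidean distance, the restriction to two groups, the function names, and the simulated Dirichlet data are all assumptions made for this example.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).mean()


def energy_statistic(x, y):
    """Energy-type statistic: 2*E|X-Y| - E|X-X'| - E|Y-Y'| (assumed, not the thesis statistic)."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


def permutation_test(x, y, n_perm=999, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = len(x)
    observed = energy_statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle group labels by permuting the rows of the pooled sample.
        idx = rng.permutation(len(pooled))
        stat = energy_statistic(pooled[idx[:n]], pooled[idx[n:]])
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical data: 20 compositions per group in 50 dimensions (dimension > sample size).
rng = np.random.default_rng(1)
alpha_x = np.ones(50)
alpha_y = np.ones(50)
alpha_y[:10] = 10.0  # shift mass toward the first ten components in group y
x = rng.dirichlet(alpha_x, size=20)
y = rng.dirichlet(alpha_y, size=20)

stat, p_value = permutation_test(x, y)
print(f"statistic = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

The sketch works on the raw compositions without a log-ratio or other transformation, in the spirit of the description above; a K-population version would need a statistic that pools distances across all groups.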

In addition to the overall test, we also want to identify a small subset of variables that distinguish the populations. Chapter 3 introduces the procedure for selecting the most significant variables (bacteria or genera) using LASSO, ridge regression, elastic net, the cumulative logit model, and the adjacent-category logit model. Chapter 4 validates our findings from Chapter 3 and presents visualizations using multi-dimensional scaling (MDS).

Chapter 5 discusses and concludes the dissertation with some future perspectives.
