Graduate Theses and Dissertations

Statistical Modeling for High-dimensional Compositional data with Applications to the Human Microbiome

Thy Dao, University of Arkansas, Fayetteville

Date of Graduation

7-2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Mathematics (PhD)

Degree Level

Graduate

Department

Mathematical Sciences

Advisor/Mentor

Zhang, Qingyang

Committee Member

Kaman, Tulin

Second Committee Member

Lee-Bartlett, Jung Ae

Keywords

compositional data; distance-based; high dimensional compositional data; microbiome; statistical model

Abstract

Compositional data are data that lie on a simplex and are common in many scientific domains, such as genomics, geology, and economics. Because the components of a composition must sum to one, traditional tests designed for unconstrained data are inappropriate, and new statistical methods are needed to analyze this special type of data. This dissertation is motivated by statistical problems arising in the analysis of compositional data. In particular, we focus on the high-dimensional and over-dispersed setting, where the dimensionality of the compositions exceeds the sample size and the dispersion parameter is moderate or large. We consider the general problem of testing for a compositional difference between K populations. We propose a new Bayesian hypothesis, together with a nonparametric, distance-based testing method. Furthermore, we use multiple variable-selection models, including LASSO, elastic net, ridge regression, and the cumulative logit model, to identify the most important subset of variables. This dissertation is structured as follows:

Chapter 1 introduces compositional microbiome data and then briefly reviews the statistical tests and models used in our framework, including distance correlation, LASSO, ridge regression, elastic net, and the cumulative logit and adjacent-category logit models.

Chapter 2 presents our new statistical test together with two real-world applications from human microbiome studies. We first formulate a hypothesis from the Bayesian point of view and suggest a nonparametric test based on inter-point distances to evaluate statistical significance. The distance-based method is more sensitive to compositional differences than mean-based methods, especially when the data are over-dispersed or zero-inflated. Unlike most existing tests for compositional data, it does not rely on any data transformation, sparsity assumption, or regularity conditions on the covariance matrix, but analyzes the compositions directly. The performance of this method is evaluated using simulation studies. We apply this new procedure to two human microbiome datasets: a throat microbiome dataset and an intestinal microbiome dataset.
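As a rough illustration of what a nonparametric test based on inter-point distances can look like, the following is a minimal sketch of a generic permutation test. It is not the test proposed in the dissertation: the L1 distance, the "mean between-group distance minus mean within-group distance" statistic, the function names, and the default of 999 permutations are all assumptions chosen for illustration.

```python
import random
from itertools import combinations


def l1_distance(x, y):
    """L1 (Manhattan) distance between two compositions (an assumed choice)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def between_minus_within(samples, labels):
    """Mean between-group distance minus mean within-group distance.

    Assumes at least two groups and at least one group with two or more
    samples, so that both averages are defined.
    """
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        d = l1_distance(samples[i], samples[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_test(samples, labels, n_perm=999, seed=0):
    """Return (observed statistic, permutation p-value) for K >= 2 groups.

    Group labels are shuffled to approximate the null distribution of the
    statistic; large values indicate that the groups differ.
    """
    rng = random.Random(seed)
    observed = between_minus_within(samples, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_minus_within(samples, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy usage: each row is a composition (non-negative entries summing to one).
toy_samples = [
    (0.70, 0.20, 0.10), (0.68, 0.22, 0.10), (0.72, 0.18, 0.10),
    (0.10, 0.20, 0.70), (0.12, 0.18, 0.70), (0.08, 0.22, 0.70),
]
toy_labels = ["A", "A", "A", "B", "B", "B"]
statistic, p_value = permutation_test(toy_samples, toy_labels)
print(statistic, p_value)
```

Because the sketch works on the pairwise distances alone, it needs no transformation of the compositions and no estimate of a covariance matrix, which is the property the abstract emphasizes for distance-based testing.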

In addition to the overall test, we also want to identify a small subset of variables that distinguishes the populations. Chapter 3 introduces the procedure for selecting the most significant variables (bacteria or genera) using LASSO, ridge regression, elastic net, and the cumulative logit and adjacent-category logit models. Chapter 4 validates the findings of Chapter 3 and presents visualizations based on multi-dimensional scaling (MDS).

Chapter 5 discusses the results and concludes the dissertation with some future perspectives.

Citation

Dao, T. (2021). Statistical Modeling for High-dimensional Compositional data with Applications to the Human Microbiome. Graduate Theses and Dissertations. Retrieved from https://scholarworks.uark.edu/etd/4137

Included in

Biostatistics Commons, Categorical Data Analysis Commons, Statistical Methodology Commons
