Document Type
Article
Publication Date
12-2020
Keywords
Compositional data; Data type; Data point
Abstract
Background
Compositional data are data that lie on a simplex; they are common in many scientific domains, such as genomics, geology, and economics. Because the components of a composition must sum to one, traditional tests designed for unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data.
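To make the sum-to-one constraint concrete, the following minimal sketch shows how raw counts are typically converted to compositions. The function name `close_composition` and the example counts are hypothetical and are not taken from the article.

```python
import numpy as np

def close_composition(counts):
    """Scale each row of non-negative counts so that it sums to one."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals <= 0):
        raise ValueError("each sample needs a positive total count")
    return counts / totals

# Hypothetical read counts: 2 samples x 3 taxa (illustrative values only).
example_counts = np.array([[10, 30, 60], [5, 0, 15]])
example_props = close_composition(example_counts)
```

Every row of `example_props` lies on the simplex, so the components of a sample cannot vary independently of one another. This dependence is why tests built for unconstrained data do not carry over directly.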
Results
In this paper, we consider the general problem of testing for compositional differences among K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distances to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption, or regularity conditions on the covariance matrix, but analyzes the compositions directly. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.
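The general idea of a nonparametric test built on inter-point distances can be sketched as a permutation test. This is a generic illustration and not the statistic proposed in the article: the Euclidean distance, the "between minus within" statistic, and all function names are assumptions chosen for brevity.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances between all rows of x (assumed metric)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def between_minus_within(dist, labels):
    """Mean between-group distance minus mean within-group distance."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = dist[same & off_diag].mean()
    between = dist[~same].mean()
    return between - within

def permutation_test(x, labels, n_perm=999, seed=0):
    """Permutation p-value for a K-group difference (illustrative only)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    dist = pairwise_distances(x)  # computed once, reused for every permutation
    observed = between_minus_within(dist, labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)  # break any group structure
        if between_minus_within(dist, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Because only the group labels are shuffled, the distance matrix is computed a single time, and no transformation of the compositions is required. The dense distance computation here is meant for small examples; a large-scale analysis would need a more memory-efficient routine.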
Conclusions
Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to compositional differences than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, which facilitates its application to large-scale datasets.
Citation
Zhang, Q., & Dao, T. (2020). A Distance Based Multisample Test for High-dimensional Compositional Data with Applications to the Human Microbiome. BMC Bioinformatics, 21(9). https://doi.org/10.1186/s12859-020-3530-x
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Included in
Artificial Intelligence and Robotics Commons, Mathematics Commons, Mechanical Engineering Commons