Doctor of Philosophy in Mathematics (PhD)
The rise of Big Data in recent years has brought many challenges to modern statistical analysis and modeling. In toxicogenomics, advances in high-throughput screening technologies have facilitated the generation of massive amounts of biological data, a big data phenomenon in biomedical science. Yet researchers still rely heavily on keyword search and/or literature review to navigate the databases, and analyses are often done on a rather small scale. As a result, the rich information in a database is not fully utilized; in particular, the information embedded in the interactions between data points is largely ignored and buried. Over the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning approach for annotating the hidden thematic structure of massive collections of documents. The analogy between a text corpus and large-scale genomic data enables the application of text mining tools, such as probabilistic topic models, to explore hidden patterns in genomic data and, by extension, altered biological functions. In this study, we developed a generalized probabilistic topic model to analyze a toxicogenomics data set consisting of a large number of gene expression profiles from the livers of rats treated with drugs at multiple doses and time points. We discovered hidden patterns in gene expression associated with the effects of dose and time point of treatment. Finally, we illustrated the ability of our model to identify evidence for a potential reduction in animal use.
Online social network services have hundreds of millions, sometimes even billions, of monthly active users. These complex and vast social networks are tremendous resources for understanding human interactions. In particular, characterizing the strength of social interactions has become an essential task for research on, and marketing in, social networks. Instead of the traditional dichotomy of strong and weak ties, we believe that there are more than two types of social ties. We use cosine similarity to measure the strength of social ties and apply an incremental Dirichlet process Gaussian mixture model to group ties into different clusters. Compared to other methods, our approach achieves superior classification accuracy on data with ground truth. The incremental algorithm also allows data to be added or deleted in a dynamic social network at minimal computational cost. In addition, it has been shown that the network constraints of individuals can be used to predict their career success. Under our multiple-tie-type assumption, individuals are profiled based on their surrounding relationships. We demonstrate that the network profile of an individual is directly linked to social significance in the real world.
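As a rough illustration of the tie-strength measure mentioned above, the following minimal sketch computes cosine similarity between two users' connection vectors. The abstract does not state which vectors are compared, so representing each user by a row of a binary adjacency matrix is an assumption made here for illustration, and the names `cosine_similarity`, `tie_strength`, and `adjacency` are hypothetical. The Dirichlet process Gaussian mixture clustering step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A user with no connections has no direction to compare.
        return 0.0
    return dot / (norm_u * norm_v)


def tie_strength(adjacency, i, j):
    """Assumed tie strength: similarity of the two users' neighbor vectors."""
    return cosine_similarity(adjacency[i], adjacency[j])


# Toy undirected network of four users (hypothetical data).
# adjacency[i][j] == 1 means users i and j are connected.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

strength_01 = tie_strength(adjacency, 0, 1)  # users 0 and 1 share two neighbors
strength_02 = tie_strength(adjacency, 0, 2)  # users 0 and 2 share one neighbor
```

In this toy network the tie between users 0 and 1 scores higher than the tie between users 0 and 2, because the first pair has more neighbors in common. Scores of this kind, one per tie, are the sort of input a mixture model could then group into clusters.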
Chung, Ming-Hua, "Probabilistic Graphical Modeling on Big Data" (2015). Theses and Dissertations. 1415.