Date of Graduation

12-2017

Document Type

Thesis

Degree Name

Master of Science in Statistics and Analytics (MS)

Degree Level

Graduate

Department

Graduate School

Advisor

Avishek Chakraborty

Committee Member

Mark Arnold

Second Committee Member

Giovanni Petris

Third Committee Member

Qingyang Zhang

Keywords

Bayesian Analysis, Linear Regression, Statistics

Abstract

Outlier detection is one of the most important challenges in many present-day applications. Outliers can occur due to uncertainty in the data-generating mechanism or due to errors in data recording or processing. Outliers can drastically change a study's results and make predictions less reliable. Detecting outliers in longitudinal studies is particularly challenging because such studies work with observations that change over time. Therefore, the same subject can produce an outlier at one point in time but regular observations at all other time points. A Bayesian hierarchical model assigns parameters that quantify whether each observation is an outlier or not. The purpose of this thesis is to detect outlying observations by developing three approaches and comparing them under different data-generating mechanisms. In the first chapter, we introduce the important concepts in Bayesian inference with three examples. The first two examples (Binomial and Poisson distributions) illustrate the idea behind the Monte Carlo method, while the last example (normal distribution) illustrates Markov chain Monte Carlo (MCMC). We reviewed three different types of MCMC methods: Metropolis-Hastings, the Gibbs sampler, and the slice sampler, which we used in the three outlier detection algorithms. In Chapter Two, we used the Gibbs sampler with the linear regression model. We used simulated data with three covariates and then applied our method to a real dataset: the Strong Rock data. We explained the findings using diagrams. In Chapter Three, we focused on the core problem of identifying outliers using three methods. We applied our methods to four simulated datasets. We found that the first two methods did not work well under assumptions of systematic heteroscedasticity, but the last one did an efficient job, as we expected, even when the functional form of heteroscedasticity was not correctly specified.
Next, we formulated our model for the real data so that we could apply the methods developed in Chapter Three. Once we have access to real data with a large number of observations, we will apply these methods.
