Date of Graduation

5-2024

Document Type

UAF Access Only - Thesis

Degree Name

Bachelor of Science in Data Science

Degree Level

Undergraduate

Department

Data Science

Advisor/Mentor

Schubert, Karl

Committee Member/Reader

Buttle, Casey

Committee Member/Second Reader

Mitchell, Rachael

Abstract

This study is based on an inventory outlier automation problem: identifying inventory-level outliers for beverages and prescribing interventions to help keep shelves stocked. The paper addresses why it is so important to understand the data involved in a data science problem and how planning ahead supports a successful outcome. In this project, Spatiotemporal Outlier Analysis for Inventory Intervention Automation, it was crucial for the team to understand, research, and visualize the data so that we could prepare it correctly for the models. We had to encode many values and handle many null values, then cluster the data using dynamic time warping clustering before feeding it into our random forest model. The goal was results that are accurate without being too granular, yielding an importance list from which outliers can be prescribed. The team hopes to provide our industry partner, The Coca-Cola Company, with an automated system to be run every week that will clean the data, cluster it, feed the clustered data to the random forest model, and output an importance list that can be interpreted to identify which specific products need attention in terms of inventory level. We used Alteryx to build a flow that holds all of the team's code in one cohesive place. This automated process was built with sample, static datasets limited to a year of inventory for only a couple of states. When we then feed our models tens of millions of rows of data, we can train them better and therefore produce better results. With all the code in a single flow, the result is an importance list of the top 10 features contributing to why a certain product had an inventory outage.
This paper covers the importance of the first two phases of the data science cycle (Understand, the Data), how they impacted our project, and why they matter in data science generally.
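
As an illustration of the clustering step mentioned in the abstract, the sketch below computes a dynamic time warping (DTW) distance between two short series. It assumes that the abstract's "time warp clustering" refers to dynamic time warping; the function name `dtw_distance` and the weekly inventory values are hypothetical and are not taken from the thesis. A clustering algorithm would use a distance like this to group products whose inventory curves have a similar shape even when they are shifted in time.

```python
from math import inf


def dtw_distance(a, b):
    """Return the dynamic time warping distance between two numeric sequences.

    Uses the classic O(len(a) * len(b)) dynamic program with absolute
    difference as the point-wise cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical weekly inventory levels for two products (illustrative only).
product_a = [1, 2, 3]
product_b = [1, 1, 2, 3]  # same shape, delayed by one week

print(dtw_distance(product_a, product_b))
```

Because DTW lets one series repeat a point while the other advances, the delayed series above aligns with the original at zero cost, whereas a point-by-point comparison would not even be defined for series of different lengths.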

Keywords

Clustering, Data Visualization, Imputation

Citation

Beard, S. (2024). The Importance of Data Preparation in a Data Science Problem. Data Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/dtscuht/6

Download

Available for download on Tuesday, April 29, 2025

Included in

Business Analytics Commons, Data Science Commons, Operational Research Commons, Operations and Supply Chain Management Commons

Data Science Undergraduate Honors Theses

The Importance of Data Preparation in a Data Science Problem