Date of Graduation
5-2024
Document Type
UAF Access Only - Thesis
Degree Name
Bachelor of Science in Data Science
Degree Level
Undergraduate
Department
Data Science
Advisor/Mentor
Schubert, Karl
Committee Member
Buttle, Casey
Second Committee Member
Mitchell, Rachael
Abstract
This study is based on an inventory outlier automation problem: identifying and prescribing responses to inventory-level outliers to help keep beverage shelves stocked. The paper addresses why it is important to understand the data involved in a data science problem and how planning ahead supports a successful outcome. In this project, Spatiotemporal Outlier Analysis for Inventory Intervention Automation, it was crucial for the team to understand, research, and visualize the data so that we could prepare it correctly for the models. We had to encode many values and handle many null values, and we clustered the data using date time warp clustering before feeding it into our random forest model, aiming for results that were as accurate as possible without being too granular, in order to produce an importance list and prescribe outliers. The team hopes to provide our industry partner, The Coca-Cola Company, with an automated system, to be run every week, that cleans the data, clusters it, feeds the clustered data to the random forest model, and outputs an importance list that can be interpreted to identify which specific products need attention in terms of inventory level. We then used Alteryx to build a flow containing all of the team's code in one cohesive place, since the automated process was built with static sample datasets limited to a year of inventory for only a couple of states. When we then feed our models tens of millions of rows of data, we can better train our models and therefore produce better results. With all the code in a single flow, the result is an importance list of the top 10 features contributing to why a certain product had an inventory outage.
This paper covers the importance of the first two phases of the data science cycle (Understand, the Data), how they affected our project, and why they matter in data science more generally.
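The preparation steps the abstract describes (handling null values, encoding categorical values, and comparing time series ahead of clustering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's code: the function names and toy data are hypothetical, mean imputation and integer label encoding are stand-ins for whatever methods the team used, and the sketch assumes that the time-warp clustering mentioned above refers to dynamic time warping (DTW).

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def label_encode(categories):
    """Map each distinct category to an integer code, assigned in sorted order."""
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories], codes


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric series,
    using absolute difference as the pointwise cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


# Hypothetical toy data: weekly inventory counts with one missing week.
weekly_units = impute_mean([12, None, 9, 15])
states, state_codes = label_encode(["AR", "TX", "AR"])
distance = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

A pairwise DTW distance like this is the kind of measure a time-series clustering step could use to group products with similar inventory patterns before the clustered data is passed to a model such as a random forest.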
Keywords
Clustering; Data Visualization; Imputation
Citation
Beard, S. (2024). The Importance of Data Preparation in a Data Science Problem. Data Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/dtscuht/6
Included in
Business Analytics Commons, Data Science Commons, Operational Research Commons, Operations and Supply Chain Management Commons