Date of Graduation

5-2025

Document Type

Thesis

Degree Name

Bachelor of Science in Computer Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor/Mentor

Gauch, Susan

Committee Member

Patitz, Matthew

Second Committee Member

Le, Thi

Abstract

Synthetic dataset generation addresses the need for large datasets to train models and observe their behavior in a cost- and time-effective way. This project highlights the value of synthetic data for training a neural network model by building a synthetic dataset generator for a specific model so that its behavior can be evaluated in different scenarios. The model used in this project was created to select high-quality papers while maximizing the diversity of authors based on race, gender, and country characteristics, essentially mimicking a review process. The goal of the generator is to provide an efficient way to test the model on relevant data in order to train it and evaluate its effectiveness.

Two experiments were conducted: the first verified that the generator created a dataset with the correct user-provided proportions, and the second tested how well the model behaved when run on a synthetic dataset. The first experiment confirmed that the generator created datasets with the percentages provided by the user, and the second showed that the synthetic data simulated real data well, based on the model's performance results.
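
As a rough illustration of the first experiment, the sketch below generates category labels from user-provided proportions and then measures the proportions actually produced. It is not the thesis's generator: the function names, the group labels, and the largest-remainder rounding are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def generate_labels(n, proportions, seed=0):
    """Return n labels whose counts follow `proportions` as closely as possible.

    Hypothetical helper, not the thesis's generator. Counts are fixed with
    largest-remainder rounding, then the labels are shuffled.
    """
    if not math.isclose(sum(proportions.values()), 1.0, abs_tol=1e-9):
        raise ValueError("proportions must sum to 1")
    raw = {key: share * n for key, share in proportions.items()}
    counts = {key: int(value) for key, value in raw.items()}
    # Hand any leftover slots to the categories with the largest remainders.
    leftover = n - sum(counts.values())
    by_remainder = sorted(raw, key=lambda key: raw[key] - counts[key], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    labels = [key for key, count in counts.items() for _ in range(count)]
    random.Random(seed).shuffle(labels)
    return labels


def observed_proportions(labels):
    """Return the share of each label actually present in the dataset."""
    total = len(labels)
    return {key: count / total for key, count in Counter(labels).items()}


# Assumed example input: three placeholder groups for one author attribute.
requested = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}
labels = generate_labels(1000, requested)
observed = observed_proportions(labels)
```

Comparing `observed` with `requested` for each group, within a tolerance of one record, is one simple way to check that a generated dataset honors the requested percentages.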

Keywords

synthetic data; datasets; model training
