Date of Graduation

5-2024

Document Type

Thesis

Degree Name

Bachelor of Science in Data Science

Degree Level

Undergraduate

Department

Data Science

Advisor/Mentor

Schubert, Karl

Committee Member/Reader

Zhu, Lijun

Committee Member/Second Reader

Liu, Yuyan

Abstract

Text representation is a fundamental factor in the performance of neural networks for natural language processing (NLP). Free-form text fields are used in a growing number of industries: everything from item descriptions on a web store to service-event records for military-grade aircraft is collected as free-form text. The goal of this thesis is to highlight best practices and discuss trends in preparing text data for a neural network. It demonstrates various techniques for representing free-form text in the context of neural networks, focusing on data preparation decisions, embedding techniques, and formatting strategies. The first section covers methods for transforming raw text into a form suitable for embedding and neural network training. This data preparation step includes tokenization, lemmatization, normalization, and cleaning, which reduce noise in the text and ensure quality input for the following stages. Next, the thesis addresses different embedding techniques and when to use them, since the choice of embedding technique is pivotal when transforming text into a vector representation. Traditional methods such as word embeddings are compared with more recent advancements such as contextual embeddings in terms of their effectiveness with neural networks. Finally, the thesis covers formatting considerations for neural networks, including input representation, sequence modeling, and output formatting, along with methods for handling varying input lengths, incorporating positional information, and designing output layers to optimize model performance. Through an in-depth review of and experimentation with these techniques, this thesis aims to provide insight into the decisions a data scientist makes when given raw text.
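As a rough illustration of the data preparation and input-length handling described in the abstract, the sketch below cleans, normalizes, and tokenizes free-form text, then maps it to fixed-length integer sequences. It is not taken from the thesis: the function names, the `<pad>` and `<unk>` tokens, the cleaning rules, and the sample maintenance notes are all assumptions chosen for illustration. Lemmatization is omitted because it would normally rely on an external library.

```python
import re
import unicodedata

# Hypothetical special tokens (assumed, not from the thesis).
PAD, UNK = "<pad>", "<unk>"


def clean(text: str) -> str:
    """Normalize and clean raw text to reduce noise."""
    text = unicodedata.normalize("NFKC", text)   # unify Unicode forms
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML-like tags
    text = text.lower()                          # case normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # strip punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization over the cleaned text."""
    return clean(text).split()


def build_vocab(docs: list[str]) -> dict[str, int]:
    """Assign an integer id to each token in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Map text to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Made-up free-form service notes, for illustration only.
docs = [
    "Replaced the <b>LEFT</b> fuel pump!",
    "Fuel pump inspected; no faults.",
]
vocab = build_vocab(docs)
print(encode("Left pump leaking", vocab, max_len=6))  # [4, 6, 1, 0, 0, 0]
```

In this sketch, unseen words such as "leaking" fall back to the unknown-token id, and shorter inputs are right-padded so every sequence has the same length before it reaches an embedding layer.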

Keywords

Text Representation, Data Preparation, Cleaning, Text Embedding, Natural Language Processing, Neural Networks

Available for download on Friday, April 25, 2025
