Date of Graduation
5-2024
Document Type
Thesis
Degree Name
Bachelor of Science in Data Science
Degree Level
Undergraduate
Department
Data Science
Advisor/Mentor
Schubert, Karl
Committee Member
Zhu, Lijun
Second Committee Member
Liu, Yuyan
Abstract
Text representation is a fundamental aspect of natural language processing (NLP) and a key determinant of neural network performance. Free-form text fields are used in a growing range of industries: everything from product descriptions on web stores to logs tracking service events for military-grade aircraft is collected as free-form text. The goal of this thesis is to highlight best practices and discuss trends in preparing text data for a neural network. It demonstrates various techniques for representing free-form text in the context of neural networks, focusing on data preparation decisions, embedding techniques, and formatting strategies. The first section delves into methodologies for manipulating raw text into a form suitable for embedding and neural network training. The data preparation step includes tokenization, lemmatization, normalization, and cleaning, which reduce noise in the text and ensure quality input for the following stages. Next, the thesis addresses different embedding techniques and when to use them; the embedding technique chosen is pivotal when transforming text into a vector representation. Traditional static word embeddings and more recent advancements such as contextual embeddings are compared in terms of their effectiveness with neural networks. Finally, formatting considerations for neural networks are discussed, including input representation, sequence modeling, and output formatting. Methods for handling varying input lengths, incorporating positional information, and designing output layers are examined to optimize model performance. Through an in-depth review of and experimentation with the techniques listed above, this thesis aims to provide insight into the decision-making process a data scientist might follow when given raw text.
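The data preparation pipeline named in the abstract (cleaning, normalization, tokenization, lemmatization) can be illustrated with a minimal sketch. The function names and the stopword list below are illustrative choices, not taken from the thesis, and the suffix-stripping lemmatizer is a crude stand-in for a real dictionary-based lemmatizer such as WordNet's:

```python
import re

# Illustrative stopword list; real pipelines use a curated set (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "is", "to", "s"}

def clean(text: str) -> str:
    """Normalization and cleaning: lowercase and replace non-alphanumeric noise."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text: str) -> list[str]:
    """Tokenization: split normalized text on whitespace."""
    return text.split()

def lemmatize(token: str) -> str:
    """Crude suffix stripping as a stand-in for true lemmatization."""
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return token

def preprocess(text: str) -> list[str]:
    """Full pipeline: clean -> tokenize -> drop stopwords -> lemmatize."""
    return [lemmatize(t) for t in tokenize(clean(text)) if t not in STOPWORDS]

print(preprocess("The pilot's engines failed!"))  # ['pilot', 'engine', 'fail']
```

Each stage reduces vocabulary size and noise before embedding, which is the motivation the abstract gives for performing preparation ahead of neural network training.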
Keywords
Text Representation; Data Preparation; Cleaning; Text Embedding; Natural Language Processing; Neural Networks
Citation
Parsley, W. (2024). The Importance of Text Representation for Neural Networks through Natural Language Processing Techniques. Data Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/dtscuht/4