Data Science Undergraduate Honors Theses

The Importance of Text Representation for Neural Networks through Natural Language Processing Techniques

William Parsley, University of Arkansas, Fayetteville

Date of Graduation

5-2024

Document Type

Thesis

Degree Name

Bachelor of Science in Data Science

Degree Level

Undergraduate

Department

Data Science

Advisor/Mentor

Schubert, Karl

Committee Member

Zhu, Lijun

Second Committee Member

Liu, Yuyan

Abstract

Text representation is a fundamental factor in how well neural networks perform on natural language processing (NLP) tasks. Free-form text fields are used in a growing number of industries: anything from an item description on a web store to service-event records for military-grade aircraft is collected as free-form text. The goal of this thesis is to highlight best practices and discuss trends in preparing text data for a neural network. It demonstrates various techniques for representing free-form text in the context of neural networks, focusing on data preparation decisions, embedding techniques, and formatting strategies. The first section covers methods for transforming raw text into a form suitable for embedding and neural network training. This data preparation step includes tokenization, lemmatization, normalization, and cleaning, which reduce noise in the text and ensure quality input for the following stages. Next, the thesis addresses different embedding techniques and when to use them. The choice of embedding technique is pivotal when transforming text into a vector representation. Traditional methods such as word embeddings and more recent advancements such as contextual embeddings are compared in terms of their effectiveness with neural networks. Finally, the thesis considers formatting for neural networks, including input representation, sequence modeling, and output formatting. It covers methods for handling varying input lengths, incorporating positional information, and designing output layers to optimize model performance. Through an in-depth review of, and experimentation with, the techniques listed above, this thesis aims to provide insight into the decision-making process a data scientist might go through when given raw text.
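As a rough illustration of the data preparation and input-length handling described in the abstract, the sketch below normalizes and cleans a free-form text field, tokenizes it on whitespace, and pads or truncates the result to a fixed length. It is not taken from the thesis: the function names, the `<pad>` and `<unk>` tokens, and the sample maintenance note are all hypothetical, and lemmatization is left out because it would normally rely on an external NLP library.

```python
import re
import unicodedata


def clean_and_tokenize(text):
    """Normalize, clean, and tokenize one free-form text field."""
    # Normalization: unify Unicode forms and lowercase everything.
    text = unicodedata.normalize("NFKC", text).lower()
    # Cleaning: replace anything that is not a letter, digit, or space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: a simple whitespace split (an assumption; real
    # pipelines often use a library or subword tokenizer instead).
    return text.split()


def build_vocab(token_lists):
    """Map each token to an integer id, reserving ids for padding and unknowns."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Hypothetical example of a free-form service-event note.
tokens = clean_and_tokenize("Replaced LEFT fuel-pump, S/N 4471!")
vocab = build_vocab([tokens])
short_input = encode(["left", "pump", "engine"], vocab, max_len=5)
```

The fixed-length integer sequences produced by `encode` are the kind of input an embedding layer typically expects; the unseen word `engine` maps to the `<unk>` id, and the remaining positions are filled with the `<pad>` id.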

Keywords

Text Representation; Data Preparation; Cleaning; Text Embedding; Natural Language Processing; Neural Networks

Citation

Parsley, W. (2024). The Importance of Text Representation for Neural Networks through Natural Language Processing Techniques. Data Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/dtscuht/4
