Date of Graduation
9-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Engineering (PhD)
Degree Level
Graduate
Department
Electrical Engineering and Computer Science
Advisor/Mentor
Huang, Miaoqing
Committee Member
Alexander Nelson
Third Committee Member
David Andrews
Fourth Committee Member
Zhong Chen
Keywords
DNN, data storage
Abstract
Deep neural networks (DNNs) are widely used in applications such as classification, prediction, and regression. Various DNN architectures, such as convolutional neural networks (CNNs), multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, recurrent neural networks (RNNs), and transformers, have become leading machine learning techniques in these applications. They require significant computational resources and have substantial memory demands due to intensive matrix-matrix multiplications and complex data flows. Hence, efficient utilization of on-chip computational and memory resources is essential to maximize parallelism and minimize latency, and designing an optimal tiling scheme that aligns well with the architecture is equally necessary. Modern FPGAs are equipped with Block Random Access Memory (BRAM) for storing data such as weights, intermediate results, and configurations during processing, and Digital Signal Processing (DSP) blocks for performing high-speed arithmetic operations such as multiplication and accumulation, which are essential for deep learning tasks. Utilizing these resources in parallel while maintaining operating frequency is crucial for achieving low-latency inference. Moreover, achieving high performance and accuracy across diverse applications often requires an optimized network architecture, typically developed through iterative experimentation and evaluation across various network topologies. However, custom hardware accelerators lack scalability and flexibility because they cannot adapt to different topologies at runtime. Additionally, designing custom hardware using FPGA vendor tools and programming languages is a time-intensive process that demands a deep understanding of hardware architecture. This research developed versatile overlay accelerators that address diverse computational and memory demands, supporting various DNNs across multiple applications while striving to maximize processing-unit utilization and achieve low latency. Owing to their high parallelism, efficient tiling, and optimized coding techniques, these accelerators outperformed most custom accelerators and general-purpose processors.
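The tiling of matrix-matrix multiplication emphasized in the abstract can be illustrated with a minimal software sketch. The loop structure below is not taken from the dissertation; the matrix size N, tile size T, and the mapping comments are illustrative assumptions about how such a kernel is typically staged through BRAM and executed on DSP blocks in an FPGA overlay.

#include <stdio.h>

#define N 8   /* matrix dimension (hypothetical, for illustration only) */
#define T 4   /* tile edge length (hypothetical, for illustration only) */

/* Tiled matrix-matrix multiply C = A * B. In an overlay accelerator, each
 * T x T tile of A and B would be staged in BRAM, and the innermost
 * multiply-accumulate loop would be unrolled across DSP blocks. */
static void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;

    for (int it = 0; it < N; it += T)          /* tile row of C */
        for (int jt = 0; jt < N; jt += T)      /* tile column of C */
            for (int kt = 0; kt < N; kt += T)  /* reduction tile */
                for (int i = it; i < it + T; i++)
                    for (int j = jt; j < jt + T; j++) {
                        float acc = C[i][j];
                        for (int k = kt; k < kt + T; k++)
                            acc += A[i][k] * B[k][j];  /* one MAC per DSP */
                        C[i][j] = acc;
                    }
}

int main(void) {
    float A[N][N], B[N][N], C[N][N];
    /* Fill A with row-index values and B with the identity, so C equals A. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 1);
            B[i][j] = (i == j) ? 1.0f : 0.0f;
        }
    matmul_tiled(A, B, C);
    printf("C[0][0]=%.1f C[%d][%d]=%.1f\n", C[0][0], N - 1, N - 1, C[N - 1][N - 1]);
    return 0;
}

Choosing T so that the working tiles of A, B, and C fit in BRAM, and unrolling the k-loop to match the available DSP count, is the kind of resource-aware trade-off the abstract refers to when it calls for a tiling scheme aligned with the architecture.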
Citation
Kabir, E. (2025). FPGA-Based Overlay Accelerators with Massive Parallel Processing Units to Accelerate Deep Neural Networks. Graduate Theses and Dissertations. Retrieved from https://scholarworks.uark.edu/etd/5834