Date of Graduation


Document Type


Degree Name

Doctor of Philosophy in Engineering (PhD)

Degree Level



Computer Science & Computer Engineering


David Andrews

Committee Member

Mark E. Arnold

Second Committee Member

Miaoqing Huang

Third Committee Member

John M. Gauch


hardware design, coding


Low latency inferencing is of paramount importance to a wide range of real time and userfacing Machine Learning (ML) applications. Field Programmable Gate Arrays (FPGAs) offer unique advantages in delivering low latency as well as energy efficient accelertors for low latency inferencing. Unfortunately, creating machine learning accelerators in FPGAs is not easy, requiring the use of vendor specific CAD tools and low level digital and hardware microarchitecture design knowledge that the majority of ML researchers do not possess. The continued refinement of High Level Synthesis (HLS) tools can reduce but not eliminate the need for hardware-specific design knowledge. The designs by these tools can also produce inefficient use of FPGA resources that ultimately limit the performance of the neural network. This research investigated a new FPGA-based software-hardware codesigned overlay architecture that opens the advantages of FPGAs to the broader ML user community. As an overlay, the proposed design allows rapid coding and deployment of different ML network configurations and different data-widths, eliminating the prior barrier of needing to resynthesize each design. This brings important attributes of code portability over different FPGA families. The proposed overlay design is a Single-Instruction-Multiple-Data (SIMD) Processor-In-Memory (PIM) architecture developed as a programmable overlay for FPGAs. In contrast to point designs, it can be programmed to implement different types of machine learning algorithms. The overlay architecture integrates bit-serial Arithmetic Logic Units (ALUs) with distributed Block RAMs (BRAMs). The PIM design increases the size of arithmetic operations and on-chip storage capacity. User-visible inference latencies are reduced by exploiting concurrent accesses to network parameters (weights and biases) and partial results stored throughout the distributed BRAMs. Run-time performance comparisons show that the proposed design achieves a speedup compared to HLS-based or custom-tuned equivalent designs. Notably, the proposed design is programmable, allowing rapid design space exploration without the need to resynthesize when changing ML algorithms on the FPGA.