Abstract
This project presents the design and FPGA implementation of a Hybrid Deep Learning Accelerator Unit (Hybrid DLAU) optimized for high performance and energy efficiency in real-time AI applications. Through a hybrid arithmetic design, the proposed architecture reduces the computation delay, power consumption, and hardware utilization that limit conventional accelerators.
The Hybrid DLAU integrates Carry-Save Adders (CSA) and a Wallace-tree reduction network to minimize carry-propagation delay and increase throughput. It comprises three pipelined modules: the Tiled Matrix Multiplication Unit (TMMU), the Partial Sum Accumulation Unit (PSAU), and the Activation Function Acceleration Unit (AFAU). The AFAU supports multiple activation functions, including ReLU, Linear, Hard-Sigmoid, and Hard-Tanh, using fixed-point approximations.
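As an illustration of the arithmetic named above, the following is a minimal behavioral sketch in Python, not the project's Verilog. It shows how 3:2 carry-save compressors can reduce partial products so that only one carry-propagating addition remains, and how the four activation functions might look in fixed point. The Q8.8 format, the Hard-Sigmoid slope of 0.25, and all function names are assumptions chosen for the example; the actual design may use different word lengths and coefficients.

```python
FRAC_BITS = 8            # assumed Q8.8 fixed-point format (not stated in the source)
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0


def carry_save_add(a, b, c):
    """3:2 compressor: reduce three non-negative ints to a (sum, carry) pair
    without propagating carries between bit positions."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def wallace_reduce(operands):
    """Apply layers of 3:2 compressors until two operands remain,
    then perform a single carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        full = len(ops) - len(ops) % 3
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(*ops[i:i + 3]))
        nxt.extend(ops[full:])   # leftover operands pass to the next layer
        ops = nxt
    return sum(ops)              # the only carry-propagating add


def multiply(a, b, width=8):
    """Unsigned multiply: generate shifted partial products, reduce them."""
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    return wallace_reduce(partials)


def to_fixed(x):
    """Convert a real number to the assumed Q8.8 integer representation."""
    return int(round(x * ONE))


def relu(x):
    return max(x, 0)


def linear(x):
    return x


def hard_sigmoid(x):
    # Assumed form clip(0.25*x + 0.5, 0, 1); the slope is a shift by 2.
    return min(max((x >> 2) + (ONE >> 1), 0), ONE)


def hard_tanh(x):
    # clip(x, -1, 1) in fixed point
    return min(max(x, -ONE), ONE)
```

For example, `multiply(13, 11)` reduces eight partial products through successive compressor layers before the final addition, and `hard_sigmoid(to_fixed(0.0))` returns the fixed-point value of 0.5. In hardware, the benefit comes from each compressor layer having constant delay regardless of word length, so the carry chain appears only once at the end.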
FPGAs were chosen over ASICs, CPUs, and GPUs for their balance of flexibility, parallelism, and low power consumption. They also enable rapid prototyping and reconfiguration for evolving neural network models without the high cost and inflexibility of ASIC fabrication. The architecture was modeled in Verilog HDL and synthesized with Xilinx Vivado 2018, targeting a Zynq-7000 FPGA. Experimental results show a 26.8% reduction in data-path delay, 49.9% lower power consumption, and over 60% fewer logic registers than the baseline DLAU, with identical DSP usage. These results demonstrate that the proposed Hybrid DLAU provides a scalable, reconfigurable, and energy-efficient hardware platform for real-time deep-learning inference on FPGA systems.