In this work we introduce a framework for mapping multiple Long Short-Term Memory (LSTM) models onto FPGA devices. The approach combines Singular Value Decomposition (SVD)-based approximation with structured pruning to execute multiple LSTMs in parallel, improving both performance and memory efficiency. The FPGA accelerator features a custom dataflow architecture with dedicated SVD and non-linear kernels that compute the LSTM gates efficiently. The framework delivers a 3× to 5× speedup over traditional methods with bounded accuracy loss, making it a scalable solution for applications that require high-performance parallel LSTM execution.
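To illustrate the core idea behind SVD-based approximation of LSTM weights, the sketch below compresses a single (hypothetical) gate weight matrix to rank r and reports the resulting parameter count and reconstruction error. The matrix dimensions and rank are illustrative assumptions, not values from this work, and the sketch omits the structured pruning and the FPGA dataflow mapping described above.

```python
import numpy as np

# Hypothetical LSTM gate weight matrix: hidden size 128, input size 64.
# (Dimensions are illustrative, not taken from the paper.)
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))

# Thin SVD, then keep only the top-r singular triplets.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 16
W_approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Storage drops from 128*64 weights to r*(128 + 64 + 1) values
# (the truncated U and V factors plus the r singular values).
full_params = W.size
lowrank_params = r * (W.shape[0] + W.shape[1] + 1)

# Relative Frobenius-norm error of the rank-r approximation.
err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"params: {lowrank_params} vs {full_params}, rel. error: {err:.3f}")
```

In a low-rank design the factors `U[:, :r]`, `s[:r]`, and `Vt[:r, :]` are stored and applied directly (two small matrix-vector products per gate) rather than reconstructing `W_approx`, which is what makes the approach attractive for memory-bound FPGA deployments.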

Abstract