Near Data Computing with FPGA
基於FPGA的近數據計算
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution: City University of Hong Kong
Award date: 3 Dec 2021
Permanent Link: https://scholars.cityu.edu.hk/en/theses/theses(b6ed9116-0c22-44e1-88f1-46fd1ce550f5).html
Abstract
Data have been accumulating exponentially in the internet era, and how to process massive data efficiently has drawn much attention in recent years. Naively increasing core counts and clock frequency is a direct but ineffective solution. Equipped with many parallel processing units, hardware accelerators such as GPUs and FPGAs have proven to be a promising and remarkably effective way to improve system performance. Low power consumption and high parallelism have made the FPGA a popular accelerator that can be deployed in different system frameworks. On the one hand, for computation-bound systems, an FPGA can be attached to the host over PCIe as a co-processor for near-memory task acceleration. On the other hand, for data-transfer-bound applications that can benefit from an in-storage computing strategy, the FPGA's low power consumption also makes it well suited for integration into SSDs.
However, directly porting code written for the CPU would not work, as an FPGA typically runs at only about one-tenth of a CPU's clock frequency; a design must therefore expose enough pipeline and data parallelism to compensate for the slower clock. This makes design optimization for FPGAs critical and challenging. Moreover, applications deployed on such heterogeneous architectures also raise interface challenges, so the communication protocol between host and accelerator must be designed carefully. In this thesis, we exploit FPGA acceleration for data-driven applications that suffer from different bottlenecks and run on different architectures, and we improve system performance by tackling the aforementioned challenges.
First, we propose a novel FPGA-based storage engine for traditional read-intensive DBMSs in the cloud, with a focus on data filtering. We design a hardware data filter that significantly speeds up filtering operations by exploiting the parallelism of a PCIe-attached FPGA, while supporting different queries without partial reconfiguration. This FPGA-based storage engine is integrated with the DBMS to realize end-to-end acceleration. In addition, an intelligent filtering on/off switch adaptively decides, based on selectivity estimation, whether the FPGA-based filter should be employed, as sketched below. Our work outperforms the conventional storage engine in low-selectivity cases and achieves higher energy efficiency than a GPU-based acceleration solution.
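The on/off switch can be pictured as a small piece of planner logic. The following Python sketch is illustrative only; the names, the sampling-based estimator, and the 10% threshold are our assumptions, not the thesis's actual interface.

```python
# A minimal sketch of a selectivity-driven filter switch: offload to the FPGA
# only when few rows are expected to survive, so that shipping results back
# over PCIe stays cheap. All names and the threshold are hypothetical.
from typing import Callable, List, Tuple

Row = Tuple[int, str]
Predicate = Callable[[Row], bool]

SELECTIVITY_THRESHOLD = 0.1  # assumed cutoff below which FPGA offload pays off

def estimate_selectivity(sample: List[Row], predicate: Predicate) -> float:
    """Approximate selectivity by probing a small sample of the table."""
    if not sample:
        return 1.0
    return sum(predicate(r) for r in sample) / len(sample)

def scan(table: List[Row], sample: List[Row], predicate: Predicate,
         fpga_scan: Callable[[List[Row], Predicate], List[Row]]) -> List[Row]:
    if estimate_selectivity(sample, predicate) <= SELECTIVITY_THRESHOLD:
        # Low selectivity: stream pages through the FPGA filter; only the
        # few qualifying tuples cross PCIe back to the host.
        return fpga_scan(table, predicate)
    # High selectivity: most rows survive, so offload would just round-trip
    # the data over PCIe; run the conventional host-side filter instead.
    return [r for r in table if predicate(r)]

# Usage: pretend `fpga_scan` is the driver call into the hardware filter.
rows = [(i, "x" * (i % 5)) for i in range(1000)]
hits = scan(rows, rows[:100], lambda r: r[0] < 50,
            fpga_scan=lambda t, p: [r for r in t if p(r)])
```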
Second, with the same system architecture, we design and implement an FPGA-based compaction engine that accelerates compaction in LSM-tree based key-value stores, which suits write-intensive workloads. To take full advantage of pipelining on the FPGA, we propose key-value separation and index-data block separation strategies, and we fully utilize the FPGA chip's bandwidth to improve compaction performance. The acceleration engine is integrated with a classic LSM-tree based store without modifying the original storage format. The proposed FPGA-based compaction engine substantially improves random-write throughput.
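To see why key-value separation helps pipelining, consider the merge step of compaction: only keys need to flow through the compare network, while variable-length values can bypass it and be re-attached afterwards. The Python sketch below models this idea in software; the layout, the tagging scheme, and the tie-breaking policy (the newer run wins) are assumptions for illustration, not the thesis's design.

```python
# Sketch of key-value separation in a compaction merge: compare fixed-size
# keys in a streaming k-way merge, carry values only as (run, index) refs.
import heapq
from typing import List, Tuple

KV = Tuple[bytes, bytes]  # (key, value)

def separate(run: List[KV]) -> Tuple[List[bytes], List[bytes]]:
    """Split a sorted run into a key stream and a parallel value stream."""
    return [k for k, _ in run], [v for _, v in run]

def merge_runs(runs: List[List[KV]]) -> List[KV]:
    """Merge sorted runs; runs[0] is assumed newest, so it wins on key ties."""
    key_streams, value_streams = zip(*(separate(r) for r in runs))
    # Tag each key with (run id, position) so its value can be re-attached
    # after the merge; this mimics a key-only compare pipeline in hardware.
    tagged = [[(k, rid, i) for i, k in enumerate(ks)]
              for rid, ks in enumerate(key_streams)]
    out: List[KV] = []
    last_key = None
    for key, rid, i in heapq.merge(*tagged):
        if key == last_key:
            continue  # duplicate key: the entry from the newer run already won
        out.append((key, value_streams[rid][i]))
        last_key = key
    return out

# Two sorted runs with one overlapping key; the newer run's value survives.
print(merge_runs([[(b"a", b"1"), (b"c", b"3")], [(b"b", b"2"), (b"c", b"9")]]))
# -> [(b'a', b'1'), (b'b', b'2'), (b'c', b'3')]
```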
Third, we investigate FPGA-accelerated in-storage computing for recommendation systems. Production-scale recommendation systems demand substantial memory capacity, which incurs considerable cost, and future models may no longer fit entirely in DRAM. Naively moving the embedding tables from DRAM to SSD, however, would degrade system performance. We therefore propose to offload both the embedding lookup layer and the MLP layers into the SSD. The Embedding Lookup Engine is optimized with a two-stage, fine-grained reading strategy, and the MLP Acceleration Engine maps the in-memory recommendation model onto FPGA-oriented functional units. In addition, we minimize resource consumption while keeping throughput optimal through a kernel reuse strategy and a kernel search algorithm, sketched below. The proposed SSD-side FPGA solution targets low-end FPGAs and speeds up both embedding-dominated and MLP-dominated models with high resource efficiency.
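The kernel search can be thought of as allocating a fixed pool of compute resources across the MLP layers so that no pipeline stage becomes the bottleneck. The sketch below uses an assumed cost model (cycles = MACs / units) and a brute-force search; the actual algorithm, resource model, and budget in the thesis may differ.

```python
# Illustrative kernel-search sketch: pick how many parallel compute units
# each MLP layer gets so the slowest pipeline stage (the initiation
# interval) is as fast as possible within a low-end FPGA's DSP budget.
from itertools import product
from typing import List, Optional, Tuple

def stage_cycles(macs: int, units: int) -> float:
    """Cycles for one input through a layer given `units` parallel MACs."""
    return macs / units

def search(layer_macs: List[int], dsp_budget: int,
           max_units: int = 8) -> Optional[Tuple[int, ...]]:
    """Exhaustively try unit allocations; return the best one found."""
    best, best_ii = None, float("inf")
    for alloc in product(range(1, max_units + 1), repeat=len(layer_macs)):
        if sum(alloc) > dsp_budget:
            continue  # over the resource budget
        # Pipeline throughput is set by the slowest stage.
        ii = max(stage_cycles(m, u) for m, u in zip(layer_macs, alloc))
        if ii < best_ii:
            best, best_ii = alloc, ii
    return best

# Example: a 3-layer MLP with per-layer multiply-accumulate counts.
print(search([4096, 1024, 256], dsp_budget=12))  # -> (8, 2, 1)
```

The same balancing intuition motivates kernel reuse: if two layers would each leave their units idle most of the time, sharing one set of units between them frees resources without lowering the pipeline's throughput.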
In conclusion, this thesis deepens the understanding of FPGA acceleration. FPGAs can be applied across architectures, but the pipeline design of the offloaded unit must be considered carefully. An FPGA does not always promise the best performance, and a careful analysis of each kernel of the application is vital for deciding whether an FPGA should be deployed.
Keywords: FPGA, Hardware Acceleration, Near-Data Computing