Efficient Application-specific Hardware Architecture for Dense Tensor Computation


Student thesis: Doctoral Thesis



Award date: 10 Dec 2021


A tensor is defined as a multidimensional (or multiway) array; for example, a matrix is a two-dimensional tensor. The most common approach to analysing multidimensional data is to first flatten or vectorise the data and then apply well-developed matrix analysis tools. However, this approach discards the spatial structure that a tensor preserves. A better approach is to analyse the tensor directly using tensor decomposition methods. To perform tensor computation efficiently and conveniently, it is necessary to build a dedicated hardware architecture for tensor computation.
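As a minimal NumPy sketch (an illustration added here, not part of the thesis), flattening a third-order tensor into a matrix makes matrix tools applicable but hides the multiway structure:

```python
import numpy as np

# A third-order tensor of shape 2 x 3 x 4
T = np.arange(24).reshape(2, 3, 4)

# Flattening (here, a mode-1 unfolding) turns the tensor into a
# 2 x 12 matrix; standard matrix analysis can now be applied, but the
# explicit distinction between modes 2 and 3 is no longer visible.
T_flat = T.reshape(T.shape[0], -1)
print(T_flat.shape)  # (2, 12)
```

Tensor decomposition methods, by contrast, operate on `T` itself and so retain the relationship between all three modes.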

In Chapter 1, we introduce tensor algebra and the terms that will be used throughout this thesis, and motivate the significance of tensor computation. In Chapter 2, we review state-of-the-art work on tensor computing across different architectures, including distributed computing systems, graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). We briefly summarise the key contributions of the different works in the literature, and we state the potential challenges and problems limiting current tensor computation architectures.

As we aim to provide a novel, optimised hardware architecture for tensor computation, in Chapter 3 we present a hardware architecture for singular spectrum analysis of Hankel tensors, a class of structured tensors that is useful in signal processing. The proposed design comprises three major modules. To minimise BRAM usage, Hankel tensor entries are computed on the fly in the higher-order singular value decomposition (HOSVD). A fast tensor–matrix multiplication scheme is used to accelerate the core tensor calculation. For tensor reconstruction and Hankelisation, a fully pipelined architecture accelerates the whole process.
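The on-the-fly idea can be sketched in NumPy. This is an illustrative software model only, assuming the common convention that a third-order Hankel tensor built from a 1-D signal satisfies `T[i, j, k] = signal[i + j + k]` (the thesis's exact indexing is not reproduced here):

```python
import numpy as np

def hankel_entry(signal, i, j, k):
    """Entry (i, j, k) of a third-order Hankel tensor built from a 1-D
    signal, under the convention T[i, j, k] = signal[i + j + k].
    Generating each entry on demand means the full tensor never needs
    to be stored -- the software analogue of saving BRAM on an FPGA."""
    return signal[i + j + k]

# For comparison only: materialising the full 3 x 3 x 3 tensor from a
# 10-sample signal, which costs 27 stored values instead of 10.
signal = np.arange(10.0)
T = np.fromfunction(lambda i, j, k: signal[i + j + k], (3, 3, 3), dtype=int)
```

Every entry of the materialised `T` agrees with `hankel_entry`, so downstream steps such as HOSVD can consume entries lazily instead of reading a stored tensor.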

Chapter 4 presents a specific hardware architecture for tensor decomposition. Non-optimised dense tensor decomposition easily consumes a large amount of memory, and frequent, large volumes of off-chip memory access limit the overall performance improvement. Through computation partitioning and rearrangement, data movement between the FPGA and off-chip DDR memory is reduced. To reduce resource usage, an efficient and unified processing element array for three-dimensional real tensor computation is designed. The processing element array is optimised for tall-and-thin tensor–matrix multiplication and two types of tensor-times-matrix chain operations.
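The tensor–matrix (mode-n) product that such a processing element array accelerates can be sketched in NumPy via the standard unfold–multiply–fold identity. This is a software reference model under that standard definition, not the thesis's hardware design:

```python
import numpy as np

def mode_n_product(T, M, n):
    """Mode-n product T x_n M: unfold T along mode n, left-multiply by
    the matrix M, and fold the result back into a tensor whose mode-n
    dimension becomes M.shape[0]."""
    Tn = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)   # unfold along mode n
    out = M @ Tn                                        # plain matrix product
    new_shape = (M.shape[0],) + tuple(s for i, s in enumerate(T.shape) if i != n)
    return np.moveaxis(out.reshape(new_shape), 0, n)    # fold back

T = np.random.rand(2, 3, 4)
M = np.random.rand(5, 3)
Y = mode_n_product(T, M, 1)   # shape (2, 5, 4)
```

A tensor-times-matrix chain, such as the core tensor computation in HOSVD, is simply a sequence of such products applied along successive modes, which is why a single unified processing element array can serve both operation types.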

In Chapter 5, we conclude the thesis and offer our perspective on future work. Developing optimised, domain-specific accelerator architectures is attracting increasing attention. This thesis therefore explores a unified and efficient architecture for dense tensor computation.