An Efficient Parallel Processor for Dense Tensor Computation

Research output: Journal Publications and Reviews · Publication in refereed journal · peer-review


Original language: English
Pages (from-to): 1335-1347
Journal/Publication: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Issue number: 7
Online published: 27 May 2021
Publication status: Published - Jul 2021


Nowadays, much data is multidimensional; such data are called tensors. Tensor computations have been applied in many fields, and various software libraries have been developed for them. However, little attention has been paid to hardware architectures that accelerate tensor computations. In this article, an efficient and unified processing element (PE) array for 3-D tensor computation is demonstrated. Our PE array is optimized for thin-and-tall tensor-matrix multiplication and for two types of tensor-times-matrix chain (TTMc) operations. The design is evaluated in three case studies and compared with the state-of-the-art design. By using computation partition and rearrangement, data movement between the field-programmable gate array (FPGA) and off-chip DDR memory is reduced by O(I²), where I is the maximum range among all the dimensions of the data tensor. For the TTMc implementation, the clock frequency is increased by 18% compared with the state-of-the-art implementation on the same FPGA chip. An experiment on rendering a 3-D volumetric data set by the tensor approximation method is conducted for demonstration. For the brick reconstruction process, the runtime decreases by 50%, i.e., our FPGA implementation is two times faster than the same process running on a GPU. In CANDECOMP/PARAFAC decomposition, the runtime of one iteration decreases by up to 93% compared with programs implemented with TensorLy, a Python tensor library.
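To make the TTMc operation concrete, the sketch below shows a software reference in NumPy of the tensor-times-matrix (mode-n) product and a chain of two such products on a 3-D tensor, which is the kind of kernel the PE array accelerates. This is an illustrative sketch, not the paper's hardware dataflow; the function names `ttm` and `ttmc` and the mode ordering are assumptions for demonstration.

```python
import numpy as np

def ttm(X, U, mode):
    """Mode-n tensor-times-matrix product: contract U (J x I_mode)
    with tensor X along axis `mode`, replacing that dimension by J."""
    # Move the contracted axis to the front, contract, then move it back.
    Xm = np.moveaxis(X, mode, 0)           # shape (I_mode, ...)
    Ym = np.tensordot(U, Xm, axes=(1, 0))  # shape (J, ...)
    return np.moveaxis(Ym, 0, mode)

def ttmc(X, U1, U2):
    """TTMc sketch for a 3-D tensor: multiply along modes 1 and 2,
    leaving mode 0 untouched (one common TTMc variant)."""
    return ttm(ttm(X, U1, 1), U2, 2)

# Usage: a 4x5x6 tensor contracted with 3x5 and 2x6 matrices gives a 4x3x2 tensor.
X = np.random.rand(4, 5, 6)
U1 = np.random.rand(3, 5)
U2 = np.random.rand(2, 6)
Y = ttmc(X, U1, U2)
print(Y.shape)  # (4, 3, 2)
```

Because the intermediate tensor after the first product is smaller than the input when the matrices are thin, the order of contractions matters for memory traffic, which is the kind of consideration the paper's computation partition and rearrangement address.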

Research Area(s)

  • Computer architecture, Field-programmable gate array (FPGA), Hardware, hardware architecture, Matrix decomposition, parallel processor, Task analysis, tensor computation, Tensors, Very large scale integration