SIMD / Vectorization - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

An important aspect of System Partitioning for AI Engine is to consider how the SIMD vector data path may be leveraged for high-performance compute. This usually involves investigating strategies for “vectorization”, that is, how signal samples are assigned to vector lanes. For the $\rho$ computation of the Hough Transform, a workable scheme uses mac16() intrinsics to process four pixels at a time.
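For reference, each accumulator vote produced by the kernel is an evaluation of the normal-form line equation of the Hough Transform,

$$\rho = x\cos\theta + y\sin\theta,$$

so every pixel $(x, y)$ paired with a quantized angle $\theta$ costs two multiply-accumulate operations, which maps naturally onto the MAC lanes of the AI Engine vector data path.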

  • In one vector register, load four copies of each of four different $\theta$ values, so that a single instruction processes four $\theta$ values for each of four different pixels. This produces sixteen histogram outputs per cycle, and two such computes are scheduled per loop body.

  • In a second vector register, load four $(x, y)$ pixel values into lanes aligned with their $\theta$ counterparts in the first register, as illustrated in the sketch after this list.
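The sketch below is a scalar reference model of this lane mapping, written to make the pairing of pixels and $\theta$ values explicit; it is not the tutorial's kernel code. The names (`rho_16lane`, `THETA_PER_OP`, `PIX_PER_OP`, `cos_q`, `sin_q`) are illustrative, and the actual AI Engine kernel would realize the same pairing through mac16() intrinsics and their lane-offset parameters.

```cpp
// Hypothetical scalar reference model of the 16-lane mapping described above.
// Names are illustrative and do not come from the tutorial source.
#include <array>
#include <cstdint>

constexpr int THETA_PER_OP = 4;                          // distinct theta values per vector op
constexpr int PIX_PER_OP   = 4;                          // pixels processed per vector op
constexpr int LANES        = THETA_PER_OP * PIX_PER_OP;  // 16 accumulator lanes

// rho = x*cos(theta) + y*sin(theta), with cos_q/sin_q holding the four trig
// coefficients pre-scaled to a fixed-point format (for example Q15).
// int64_t stands in for the AI Engine's wide accumulator lanes.
std::array<int64_t, LANES>
rho_16lane(const int16_t x[PIX_PER_OP], const int16_t y[PIX_PER_OP],
           const int16_t cos_q[THETA_PER_OP], const int16_t sin_q[THETA_PER_OP])
{
    std::array<int64_t, LANES> rho{};
    for (int p = 0; p < PIX_PER_OP; ++p) {          // pixel index selects a lane group
        for (int t = 0; t < THETA_PER_OP; ++t) {    // theta index selects the lane in that group
            int lane = p * THETA_PER_OP + t;
            // Two multiply-accumulates per lane: the per-output work the
            // vectorized kernel spreads across its mac16() calls.
            rho[lane] = int64_t(x[p]) * cos_q[t] + int64_t(y[p]) * sin_q[t];
        }
    }
    return rho;
}
```

One loop iteration of this model corresponds to a single sixteen-lane vector compute; scheduling two of them per loop body, as described above, covers eight pixels per iteration.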