Frame 7 - WP552

AI Engine Programming: A Kahn Process Network Evolution (WP552)

Document ID
WP552
Release Date
2023-07-20
Revision
1.0 English

The seventh frame of data cannot be written to the input buffer because the buffer is locked. The vadd kernel is waiting to process frame 4, while the fir_32 kernel (node/actor) is still processing frame 1.

Figure 1. Frame 7
Table 1. Sending the Frame 7 – Tokens/Kernels Status
| KPN Terminology | Input Token for Vadd | Input Token for Vadd | Node/Actor | Input Token for addConstant | Input Token for addConstant | Node/Actor | Input Token for fir_32 | Input Token for fir_32 | Input Token for copy_in_out | Input Token for copy_in_out | Node/Actor | Port | Port | Node/Actor | Port | Port |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI Engine | Buffer (ping/pong) | Buffer (ping/pong) | Vadd | Buffer (ping/pong) | Buffer (ping/pong) | addConstant | Buffer (ping/pong) | Buffer (ping/pong) | Buffer (ping/pong) | Buffer (ping/pong) | copy_in_out | Buffer (ping/pong) | Buffer (ping/pong) | fir_32 | Buffer (ping/pong) | Buffer (ping/pong) |
| Buffer name | buf0/buf1 | buf0d/buf1d | | buf2 | buf2d | | buf3 | buf3d | buf4 | buf4d | | buf6 | buf6d | | buf5 | buf5d |
| Frame 1 | Fill | - | Waiting | - | - | Waiting | - | - | - | - | Waiting | - | - | Waiting | - | - |
| Frame 2 | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Waiting | - | - | - | - | Waiting | - | - | Waiting | - | - |
| Frame 3 | Fill | Token ready Frame 2 | Processing (Frame 2) | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Fill | - | Waiting | - | - | Waiting | - | - |
| Frame 4 | Token ready Frame 3 | Fill | Processing (Frame 3) | Fill | Token ready Frame 2 | Processing (Frame 2) | Token ready Frame 1 | Fill | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Processing (Frame 1) | Fill | - |
| Frame 5 | Fill | Token ready Frame 4 | Processing (Frame 4) | - | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Token ready Frame 2 | Processing (Frame 2) | - | Fill | Processing (Frame 1) | Fill | - |
| Frame 6 | Locked | Fill | Waiting | Locked | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Locked | Waiting | Fill | - | Processing (Frame 1) | Fill | - |
| Frame 7 (Wait) | Locked | Locked | Waiting | Locked | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Locked | Waiting | Fill | - | Processing (Frame 1) | Fill | - |

In this case, the fir_32 kernel is implemented on the scalar processor, which executes it very slowly. Implementing fir_32 on the vector processor makes the kernel run much faster and resolves the issue. The key takeaway is that system performance depends not only on a proper data flow but also on the performance of each individual kernel.
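The interaction between buffer locks and kernel speed can be sketched with a small host-side C++ model. This is not AI Engine code and is not taken from the design: the function name `producer_stall_cycles`, the cycle counts, and the reduction to a single producer and a single consumer are all assumptions chosen for illustration. The model gives the producer a ping/pong pair (two buffers) and lets it start writing frame i only after the consumer has released the buffer that held frame i-2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not AI Engine code: one producer writes frames into a
// pool of `buffers` buffers (2 = ping/pong) and one consumer reads them in order.
// Returns the total number of cycles the producer spends blocked on a lock.
long producer_stall_cycles(std::size_t frames, long produce_cycles,
                           long consume_cycles, std::size_t buffers = 2) {
    assert(buffers >= 1 && produce_cycles >= 0 && consume_cycles >= 0);
    std::vector<long> write_end(frames, 0L);  // cycle at which frame i is fully written
    std::vector<long> read_end(frames, 0L);   // cycle at which frame i is fully consumed
    long stalled = 0;
    for (std::size_t i = 0; i < frames; ++i) {
        // The producer is ready as soon as it has finished the previous frame.
        const long ready = (i > 0) ? write_end[i - 1] : 0L;
        // The buffer for frame i is the one that held frame i - buffers; it is
        // locked until the consumer has finished with that frame.
        const long freed = (i >= buffers) ? read_end[i - buffers] : 0L;
        const long write_start = std::max(ready, freed);
        stalled += write_start - ready;  // time spent waiting on the lock
        write_end[i] = write_start + produce_cycles;
        // The consumer starts once the frame is written and it is idle.
        const long prev_read_end = (i > 0) ? read_end[i - 1] : 0L;
        const long read_start = std::max(write_end[i], prev_read_end);
        read_end[i] = read_start + consume_cycles;
    }
    return stalled;
}
```

Under these assumed numbers, a producer that needs 10 cycles per frame feeding a consumer that needs 100 cycles per frame is blocked for 450 cycles over 7 frames (`producer_stall_cycles(7, 10, 100)`), whereas a consumer that needs only 5 cycles per frame never blocks it (`producer_stall_cycles(7, 10, 5)` returns 0). The data flow is identical in both cases; only the consumer's speed changes.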

The following figures show the event trace view of the scalar code and vector code designs.

In the following figure, the addConstant kernel has processed two frames while the fir kernel (fir_32t_scalar) is still processing frame 1. As a result, the buffers upstream of fir stay locked, which stalls the upstream kernels.

Figure 2. Event Trace – Scalar Code Design

As explained in the paragraph following Table 1, the fir kernel has been replaced with the vector version (fir_32t_vector), which runs faster than the scalar version. As a result, the buffer locks no longer stall the upstream kernels.
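For a sense of why a scalar FIR is slow, the following plain C++ sketch shows a direct-form filter that issues one multiply-accumulate at a time. It is not the source of fir_32t_scalar or fir_32t_vector, which this section does not show: the name `fir_scalar`, the data types, and the tap count of 32 (inferred from the kernel name) are assumptions. A vector implementation computes several of these multiply-accumulates per instruction instead of one.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTaps = 32;  // assumption: inferred from the kernel name fir_32

// Hypothetical scalar reference: y[n] = sum over k of h[k] * x[n - k],
// with samples before x[0] treated as zero. If mac_count is non-null, it
// receives the number of multiply-accumulates executed.
std::vector<std::int64_t> fir_scalar(const std::vector<std::int32_t>& x,
                                     const std::array<std::int32_t, kTaps>& h,
                                     std::size_t* mac_count = nullptr) {
    std::vector<std::int64_t> y(x.size(), 0);
    std::size_t macs = 0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        std::int64_t acc = 0;
        for (std::size_t k = 0; k < kTaps; ++k) {
            // One multiply-accumulate per loop iteration.
            const std::int64_t sample = (k <= n) ? x[n - k] : 0;
            acc += static_cast<std::int64_t>(h[k]) * sample;
            ++macs;
        }
        y[n] = acc;
    }
    if (mac_count != nullptr) {
        *mac_count = macs;
    }
    return y;
}
```

Each output sample costs `kTaps` sequential multiply-accumulates in this form, so the work per frame grows as 32 times the frame length.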

Figure 3. Event Trace – Vector Code Design