The seventh frame of data cannot be written to the buffer because the buffer is locked. The vadd kernel is waiting to process frame 4, while the fir_32 kernel (node/actor) is still processing frame 1.
KPN Terminology | Input Token for Vadd | Input Token for Vadd | Node/Actor | Input Token for addConstant | Input Token for addConstant | Node/Actor | Input Token for fir_32 | Input Token for fir_32 | Input Token for copy_in_out | Input Token for copy_in_out | Node/Actor | Port | Port | Node/Actor | Port | Port |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AI Engine | Buffer (ping/pong) | Buffer (ping/pong) | Vadd | Buffer (ping/pong) | Buffer (ping/pong) | addConstant | Buffer (ping/pong) | Buffer (ping/pong) | Buffer (ping/pong) | Buffer (ping/pong) | copy_in_out | Buffer (ping/pong) | Buffer (ping/pong) | fir_32 | Buffer (ping/pong) | Buffer (ping/pong) |
Buffer name | buf0/buf1 | buf0d/buf1d | | buf2 | buf2d | | buf3 | buf3d | buf4 | buf4d | | buf6 | buf6d | | buf5 | buf5d |
Frame 1 | Fill | - | Waiting | - | - | Waiting | - | - | - | - | Waiting | - | - | Waiting | - | - |
Frame 2 | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Waiting | - | - | - | - | Waiting | - | - | Waiting | - | - |
Frame 3 | Fill | Token ready Frame 2 | Processing (Frame 2) | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Fill | - | Waiting | - | - | Waiting | - | - |
Frame 4 | Token ready Frame 3 | Fill | Processing (Frame 3) | Fill | Token ready Frame 2 | Processing (Frame 2) | Token ready Frame 1 | Fill | Token ready Frame 1 | Fill | Processing (Frame 1) | Fill | - | Processing (Frame 1) | Fill | - |
Frame 5 | Fill | Token ready Frame 4 | Processing (Frame 4) | - | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Token ready Frame 2 | Processing (Frame 2) | - | Fill | Processing (Frame 1) | Fill | - |
Frame 6 | Locked | Fill | Waiting | Locked | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Locked | Waiting | Fill | - | Processing (Frame 1) | Fill | - |
Frame 7 (Wait) | Locked | Locked | Waiting | Locked | Locked | Waiting | Locked | Token ready Frame 2 | Locked | Locked | Waiting | Fill | - | Processing (Frame 1) | Fill | - |
In this case, fir_32 is implemented on the scalar processor, which is very slow to execute. Implementing fir_32 on the vector processor resolves the issue because the kernel runs much faster. The key takeaway is that overall system performance depends not only on a proper data flow but also on the performance of each kernel.
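
To make the cost of a scalar implementation concrete, the following is a minimal sketch of a 32-tap FIR written in plain C++ that performs one multiply-accumulate at a time. It is not the source of fir_32: the function name `fir32_scalar`, the 16-bit sample type, and the input layout (history samples followed by the frame) are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of filter taps; 32 is taken from the fir_32 kernel name.
constexpr std::size_t kTaps = 32;

// Hypothetical scalar reference FIR (illustration only, not the fir_32 source).
// `in` holds (kTaps - 1) history samples followed by the frame samples, so
// out[n] = sum over k of coeffs[k] * in[n + kTaps - 1 - k].
std::vector<std::int64_t> fir32_scalar(const std::vector<std::int16_t>& in,
                                       const std::array<std::int16_t, kTaps>& coeffs) {
    assert(in.size() >= kTaps - 1);
    const std::size_t n_out = in.size() - (kTaps - 1);
    std::vector<std::int64_t> out(n_out, 0);
    for (std::size_t n = 0; n < n_out; ++n) {
        std::int64_t acc = 0;
        // One multiply-accumulate per iteration: kTaps iterations per output sample.
        for (std::size_t k = 0; k < kTaps; ++k) {
            acc += static_cast<std::int64_t>(coeffs[k]) * in[n + kTaps - 1 - k];
        }
        out[n] = acc;
    }
    return out;
}

// Multiply-accumulate operations needed for one frame of `samples` outputs.
constexpr std::size_t macs_per_frame(std::size_t samples) { return samples * kTaps; }
```

For an illustrative frame of 256 samples, `macs_per_frame(256)` is 256 × 32 = 8192 multiply-accumulates when they are issued one at a time. A vector unit can compute several of these products per instruction, which is why a vector implementation of the same filter can finish a frame much sooner.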
The following figures show the event trace view of the scalar code and vector code designs.
In the following figure, the addConstant kernel has processed two frames while the fir kernel (fir_32t_scalar) is still processing frame 1. This causes the preceding buffers to be locked in turn and also leads to a kernel stall.
As explained in the paragraph following Table 1, the fir kernel has been replaced with the vector version (fir_32t_vector), which runs faster than the scalar version. As a result, the locks are avoided.
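
The effect of one slow kernel on the upstream buffers can be reproduced with a small toy model. The sketch below is not AI Engine code and makes several simplifying assumptions: the graph is reduced to a linear chain (source, vadd, addConstant, fir), the copy_in_out branch is omitted, every buffer has exactly two slots (ping and pong), and the per-frame costs are arbitrary tick counts. The names `simulate_pipeline` and `PipelineStats` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (assumption, not AI Engine code): a linear chain of stages joined
// by two-slot (ping/pong) buffers. Stage 0 is the data source.
struct PipelineStats {
    int frames_out = 0;       // frames completed by the last stage
    std::vector<int> stalls;  // per stage: ticks spent blocked on a full output buffer
};

PipelineStats simulate_pipeline(const std::vector<int>& ticks_per_frame, int total_ticks) {
    constexpr int kSlots = 2;  // ping + pong
    const std::size_t n = ticks_per_frame.size();
    assert(n >= 2);

    std::vector<int> remaining(n, 0);  // ticks left on the frame in flight (0 = idle)
    std::vector<int> used(n - 1, 0);   // buffer i (stage i -> i+1): slots being written, ready, or being read
    std::vector<int> ready(n - 1, 0);  // buffer i: full slots not yet acquired by the reader
    PipelineStats stats;
    stats.stalls.assign(n, 0);

    for (int t = 0; t < total_ticks; ++t) {
        // Visit consumers first so a slot released this tick is visible upstream.
        for (std::size_t i = n; i-- > 0;) {
            const bool is_source = (i == 0);
            const bool is_last = (i == n - 1);
            if (remaining[i] > 0 && --remaining[i] == 0) {
                if (!is_source) --used[i - 1];  // release the input buffer slot
                if (is_last) ++stats.frames_out;
                else ++ready[i];                // publish the token downstream
            }
            if (remaining[i] == 0) {
                const bool has_input = is_source || ready[i - 1] > 0;
                const bool has_output = is_last || used[i] < kSlots;
                if (has_input && has_output) {
                    if (!is_source) --ready[i - 1];  // acquire the input token
                    if (!is_last) ++used[i];         // lock an output slot
                    remaining[i] = ticks_per_frame[i];
                } else if (has_input) {
                    ++stats.stalls[i];  // data is ready but both output slots are locked
                }
            }
        }
    }
    return stats;
}
```

Running `simulate_pipeline({2, 1, 1, 8}, 60)` (slow last stage) against `simulate_pipeline({2, 1, 1, 1}, 60)` (fast last stage) shows the same pattern as the two trace views: with the slow stage, stalls accumulate in every upstream stage, including the source, and far fewer frames complete; with the fast stage, no stage is ever blocked on a full buffer.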