- One instance achieves roughly a 6x to 14x speedup over a single CPU thread. Here are some examples:
Table 1 Acceleration comparison between CPU and FPGA

| Pictures | Texture complexity | Width (pix) | Height (pix) | -q | Kernel1 latency (ms) | Kernel2 latency (ms) | Freq (MHz) | Throughput FPGA U200 (MB/s) | Throughput CPU (MB/s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| 3840-city.png | complex | 3840 | 2160 | 80 | 95.30 | 87.93 | 250 | 129.82 | 18.46 | 7.03 |
| 1920x1080x4.png | simple | 3840 | 2160 | 80 | 83.90 | 74.96 | 250 | 159.74 | 16.17 | 9.88 |
| 1920x1080.png | simple | 1920 | 1080 | 80 | 21.51 | 18.60 | 250 | 172.54 | 11.85 | 14.56 |
| 853x640.png | simple | 853 | 640 | 80 | 4.13 | 74.96 | 250 | 156.45 | 20.97 | 7.46 |
| lena_c_512.png | middle | 512 | 512 | 80 | 2.90 | 2.84 | 250 | 127.17 | 21.32 | 5.96 |

Platform: CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (single thread)
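
The speedup figures are the ratio of the FPGA throughput to the CPU throughput. As a minimal sketch, the Python snippet below recomputes that ratio from the throughput values listed in the table above; the names `THROUGHPUT`, `speedup` and `SPEEDUPS` are illustrative and not part of the library.

```python
# Throughput figures (MB/s) copied from the table above: (FPGA U200, CPU).
THROUGHPUT = {
    "3840-city.png": (129.82, 18.46),
    "1920x1080x4.png": (159.74, 16.17),
    "1920x1080.png": (172.54, 11.85),
    "853x640.png": (156.45, 20.97),
    "lena_c_512.png": (127.17, 21.32),
}


def speedup(fpga_mbps: float, cpu_mbps: float) -> float:
    """Return the FPGA-over-CPU throughput ratio, rounded to two decimals."""
    return round(fpga_mbps / cpu_mbps, 2)


# Recomputed speedup per picture, keyed by file name.
SPEEDUPS = {name: speedup(f, c) for name, (f, c) in THROUGHPUT.items()}

if __name__ == "__main__":
    for name, ratio in SPEEDUPS.items():
        print(f"{name:16s} {ratio:5.2f}x")
```

The recomputed ratios agree with the reported speedup values, which range from about 6x for the smallest picture to about 14.5x for the 1920x1080 one.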
- One instance uses about 6% of the resources of the U200 acceleration card. The details are as follows:
Table 2 FPGA resource utilization

| Utilization | Kernel-1 | Kernel-2 | Kernel-1 + Kernel-2 |
|---|---|---|---|
| LUT | 52889 | 15866 | 5.37% |
| FF | 68991 | 23039 | 3.30% |
| DSP | 410 | 4 | 6.00% |
| BRAM | 72 | 157 | 4.00% |
| URAM | 10 | 0 | 2.08% |

- Multi-picture processing. The host code supports processing multiple pictures asynchronously, which allows host-device communication, prediction kernel computation, and arithmetic coding kernel computation to overlap. This is shown by the following demonstration picture and profiling result.
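
The benefit of this overlap can be estimated with a simple pipeline model. The Python sketch below is not the library's host code: it treats host-device transfer, the prediction kernel, and the arithmetic coding kernel as three stages that each handle one picture at a time, and compares the serial and overlapped total time. The two kernel latencies are those of 3840-city.png in Table 1; the 10 ms transfer time and the batch of 8 pictures are hypothetical values chosen only for illustration.

```python
from typing import Sequence


def serial_makespan(stage_ms: Sequence[float], n_pictures: int) -> float:
    """Total time when each picture runs all stages before the next starts."""
    return n_pictures * sum(stage_ms)


def overlapped_makespan(stage_ms: Sequence[float], n_pictures: int) -> float:
    """Total time when stages of different pictures may run concurrently.

    Each stage handles one picture at a time. A picture enters a stage once
    it has left the previous stage and the stage has finished the previous
    picture.
    """
    finish = [0.0] * len(stage_ms)  # finish time of the latest picture per stage
    for _ in range(n_pictures):
        ready = 0.0  # time at which the current picture leaves the previous stage
        for s, cost in enumerate(stage_ms):
            start = max(ready, finish[s])
            finish[s] = start + cost
            ready = finish[s]
    return finish[-1]


# Assumed transfer time (hypothetical), then kernel1 and kernel2 latency in ms.
STAGES_MS = (10.0, 95.30, 87.93)
N_PICTURES = 8  # hypothetical batch size

if __name__ == "__main__":
    serial = serial_makespan(STAGES_MS, N_PICTURES)
    overlapped = overlapped_makespan(STAGES_MS, N_PICTURES)
    print(f"serial: {serial:.2f} ms, overlapped: {overlapped:.2f} ms")
```

In this model the steady-state cost per picture is set by the slowest stage, so the overlap pays off most when the two kernel latencies are close to each other and the transfer time is shorter than both.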