Use the Vitis analyzer to visualize the run_summary report.
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/xrt.run_summary
Review the profile reports and compare the metrics with those of the Bloom4x configuration:
Kernels & Compute Unit: Kernel Execution reports 146 ms, compared to 292 ms for Bloom4x. This is exactly half the time, because 8 words are now computed in parallel instead of 4.
Host Data Transfer: The Host Transfer section reports the same delays as before.
The overall gain in the application comes only from the kernel, which now processes 8 words in parallel instead of 4.
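The arithmetic behind this comparison can be checked with a short sketch. The two kernel times are the ones quoted above; the helper names and the `other_ms` value are hypothetical and only illustrate why an unchanged host transfer time keeps the application-level gain below the kernel-level gain. It also assumes transfers and kernel execution run back to back, without overlap.

```python
# Kernel execution times quoted in the profile reports above (milliseconds).
BLOOM4X_KERNEL_MS = 292.0
BLOOM8X_KERNEL_MS = 146.0


def speedup(baseline_ms: float, improved_ms: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_ms / improved_ms


def app_speedup(kernel_old_ms: float, kernel_new_ms: float, other_ms: float) -> float:
    """Application-level speedup when only the kernel time changes.

    other_ms stands for everything that stays the same (for example host
    data transfer). Assumes the phases run sequentially, with no overlap.
    """
    return (kernel_old_ms + other_ms) / (kernel_new_ms + other_ms)


kernel_speedup = speedup(BLOOM4X_KERNEL_MS, BLOOM8X_KERNEL_MS)
words_ratio = 8 / 4  # words processed in parallel: 8 now, 4 before

# Hypothetical placeholder, NOT a measured value: substitute the host
# transfer time shown in your own Host Data Transfer report.
other_ms = 100.0
overall = app_speedup(BLOOM4X_KERNEL_MS, BLOOM8X_KERNEL_MS, other_ms)

print(f"kernel speedup: {kernel_speedup:.2f}x (parallel-word ratio: {words_ratio:.0f}x)")
print(f"application speedup with other_ms={other_ms:.0f}: {overall:.2f}x")
```

The kernel speedup matches the ratio of parallel words, while the application speedup is smaller for any nonzero `other_ms`, since the host transfer portion does not shrink.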