Kernels & Compute Unit: Kernel Execution indicates that the Total Time by kernel enqueue is about 292 ms.
The kernel computes 4 words in parallel, and the accelerator is clocked at 300 MHz. In total, you are computing 350,000,000 words (3,500 words/document * 100,000 documents).
Number of words/(Clock Freq * Parallelization factor in kernel) = 350M/(300M*4) ≈ 291.7 ms. The actual FPGA compute time is almost the same as this theoretical estimate.
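The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and variable names are illustrative and not part of any tool, and it assumes the kernel consumes a fixed number of words on every clock cycle with no stalls.

```python
def kernel_time_ms(total_words: int, clock_hz: float, words_per_cycle: int) -> float:
    """Ideal kernel run time in milliseconds, assuming one batch of
    `words_per_cycle` words is processed on every clock cycle."""
    cycles = total_words / words_per_cycle
    return cycles / clock_hz * 1e3


# Figures taken from the text: 3,500 words/document, 100,000 documents,
# a 300 MHz kernel clock, and 4 words processed in parallel.
total_words = 3_500 * 100_000
estimate_ms = kernel_time_ms(total_words, clock_hz=300e6, words_per_cycle=4)

measured_ms = 292  # Total Time reported under Kernel Execution
print(f"theoretical: {estimate_ms:.1f} ms, measured: {measured_ms} ms")
```

Changing `words_per_cycle` shows how the ideal compute time would scale with a different parallelization factor.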
Host Data Transfer: Host Transfer shows that the Host Write Transfer to DDR takes 145 ms and the Host Read Transfer from DDR takes 36 ms.
Host Write transfer at a theoretical PCIe bandwidth of 9 GB/s should take 1399 MB / 9 GB/s ≈ 155 ms.
Host Read transfer at a theoretical PCIe bandwidth of 12 GB/s should take 350 MB / 12 GB/s ≈ 29 ms.
The reported numbers are close to these estimates, which indicates that the PCIe transfers are occurring at close to the maximum bandwidth.
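The transfer estimates can be checked the same way. The sketch below is illustrative only: the helper name is hypothetical, and it assumes 1 GB = 1000 MB, so the results may differ by a millisecond or so from rounded figures quoted in the text.

```python
def transfer_time_ms(size_mb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds (assumes 1 GB = 1000 MB)."""
    bandwidth_mb_per_s = bandwidth_gb_per_s * 1000
    return size_mb / bandwidth_mb_per_s * 1e3


# Sizes and theoretical bandwidths taken from the text.
write_est_ms = transfer_time_ms(1399, 9)   # host -> DDR
read_est_ms = transfer_time_ms(350, 12)    # DDR -> host

# Measured values reported under Host Transfer.
write_measured_ms = 145
read_measured_ms = 36

print(f"write: estimated {write_est_ms:.1f} ms, measured {write_measured_ms} ms")
print(f"read:  estimated {read_est_ms:.1f} ms, measured {read_measured_ms} ms")
```

Comparing the two pairs side by side makes it easy to see how close each direction is to its theoretical limit.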
Kernels & Compute Unit: Compute Unit Stalls confirms that there are almost no “External Memory Stalls”.