If the datapath cannot be parallelized, or cannot be parallelized enough, consider adding more kernel instances, as described in Creating Multiple Instances of a Kernel. This is usually referred to as using multiple compute units (CUs).
Adding more kernel instances improves application performance by allowing more invocations of the targeted function to execute in parallel, as shown below. The different instances process multiple data sets concurrently. Application performance scales linearly with the number of instances, provided that the host application can keep the kernels busy.
As illustrated in the Using Multiple Compute Units tutorial, the Vitis technology makes it easy to scale performance by adding instances.
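The scaling behavior described above can be sketched with a simple throughput model. This is an illustrative simplification, not part of the Vitis tools: the function name and its parameters (`num_cus`, `cu_rate`, `host_rate`) are hypothetical, and the model assumes the only limit on linear scaling is how fast the host can issue work. It ignores other possible bottlenecks such as memory bandwidth.

```python
def effective_throughput(num_cus: int, cu_rate: float, host_rate: float) -> float:
    """Estimate invocations per second completed by the whole system.

    num_cus   -- number of kernel instances (compute units)
    cu_rate   -- invocations per second one instance can complete (assumed)
    host_rate -- invocations per second the host can dispatch (assumed)
    """
    if num_cus < 1:
        raise ValueError("num_cus must be at least 1")
    # Instances scale linearly until the host can no longer keep them busy.
    return min(num_cus * cu_rate, host_rate)


if __name__ == "__main__":
    # Hypothetical numbers: each CU completes 100 invocations/s,
    # and the host can dispatch at most 400 invocations/s.
    for n in range(1, 7):
        print(n, effective_throughput(n, cu_rate=100.0, host_rate=400.0))
```

With these hypothetical rates, throughput grows with each added instance up to four instances and is flat afterward, because the host dispatch rate becomes the limit.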
At this point, the developer should have a good understanding of how much hardware parallelism is needed to meet the performance goals, and of how a combination of datapath width and kernel instances will achieve it.
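As a rough illustration of how datapath width and instance count combine, the following sketch estimates the number of kernel instances needed for a target throughput. The function and parameter names are hypothetical, and the calculation assumes each instance consumes one datapath word per clock cycle (an initiation interval of 1) with no stalls, which a real design may not achieve.

```python
def required_instances(target_bytes_per_s: int,
                       bytes_per_cycle: int,
                       clock_hz: int) -> int:
    """Estimate how many kernel instances are needed to reach a target throughput.

    Assumes every instance processes bytes_per_cycle bytes on every clock
    cycle, so this is a lower bound on the instance count.
    """
    if min(target_bytes_per_s, bytes_per_cycle, clock_hz) <= 0:
        raise ValueError("all arguments must be positive")
    per_instance = bytes_per_cycle * clock_hz  # bytes/s for one instance
    # Integer ceiling division: round up to a whole number of instances.
    return (target_bytes_per_s + per_instance - 1) // per_instance


if __name__ == "__main__":
    target = 10_000_000_000  # hypothetical goal: 10 GB/s
    clock = 300_000_000      # hypothetical kernel clock: 300 MHz
    for width in (8, 16, 32, 64):
        print(width, required_instances(target, width, clock))
```

Under these assumptions, widening the datapath reduces the number of instances required, which is the trade-off between the two forms of parallelism.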