To balance latency by adding pipeline stages, add the stages to the control path rather than the data path. The data path typically carries wider buses, so each stage added there consumes far more flip-flops.
For example, if you have a 128-bit data path with 2 register stages and a requirement of 5 cycles of latency, inserting the 3 missing register stages on the data path results in an extra 3 x 128 = 384 flip-flops. Alternatively, you can pipeline the control logic that enables the data path. Use 5 stages of single-bit registers to control the enable signal of the data path flip-flops, and apply multicycle path timing exceptions accordingly.
Note: This example is only possible for certain designs. For example, in cases where there is a fanout from the intermediate data path flip-flops, having only 2 stages does not work.
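The flip-flop cost of the two approaches in the example above can be checked with a short calculation. The following is a minimal sketch in Python, not part of any tool flow. The function names and parameters are hypothetical and were chosen for illustration only. It assumes the enable signal is a single bit and that no other logic is added.

```python
def datapath_pipeline_ffs(width_bits: int, existing_stages: int, required_latency: int) -> int:
    """Extra flip-flops when the missing stages are inserted across the full data bus."""
    extra_stages = required_latency - existing_stages
    return extra_stages * width_bits


def control_pipeline_ffs(required_latency: int, enable_width_bits: int = 1) -> int:
    """Extra flip-flops when only the enable signal is delayed to the required latency.

    Assumption: one register per cycle of latency on a single-bit enable.
    """
    return required_latency * enable_width_bits


# Numbers taken from the example: 128-bit bus, 2 existing stages, 5 cycles required.
extra_data_ffs = datapath_pipeline_ffs(width_bits=128, existing_stages=2, required_latency=5)
extra_ctrl_ffs = control_pipeline_ffs(required_latency=5)

print(extra_data_ffs)  # 384
print(extra_ctrl_ffs)  # 5
```

Under these assumptions, the control-path approach adds 5 flip-flops instead of 384, at the cost of requiring multicycle path timing exceptions on the data path.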
Recommended: The optimal LUT:FF ratio in a device is 1:1. Designs with significantly more FFs will increase unrelated logic packing into slices, which will increase routing complexity and can degrade QoR.