A cfloat x cfloat multiplication takes two cycles because there is no post-add, so it is performed in two parts. You can interleave these two parts with the two-cycle latency of the accumulator.
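As a rough illustration of why a complex product splits into two parts, the sketch below writes one cfloat multiply-accumulate as two real-scaled passes in plain C++. It is not AI Engine intrinsic code and does not model pipeline timing; the names `CAcc`, `mac_part1`, `mac_part2`, and `cmul_two_pass` are hypothetical and exist only for this example.

```cpp
#include <complex>

// Hypothetical accumulator holding one complex value as two floats.
struct CAcc {
    float re;
    float im;
};

// Part 1: accumulate b scaled by the real part of a.
//   acc += a.re * (b.re, b.im)
inline void mac_part1(CAcc& acc, std::complex<float> a, std::complex<float> b) {
    acc.re += a.real() * b.real();
    acc.im += a.real() * b.imag();
}

// Part 2: accumulate j*b scaled by the imaginary part of a.
//   acc += a.im * (-b.im, b.re)
inline void mac_part2(CAcc& acc, std::complex<float> a, std::complex<float> b) {
    acc.re -= a.imag() * b.imag();
    acc.im += a.imag() * b.real();
}

// Full complex product built from the two parts above.
inline std::complex<float> cmul_two_pass(std::complex<float> a, std::complex<float> b) {
    CAcc acc{0.0f, 0.0f};
    mac_part1(acc, a, b);  // first pass
    mac_part2(acc, a, b);  // second pass, combined in the accumulator
    return std::complex<float>(acc.re, acc.im);
}
```

Each pass is a real-by-complex multiply-accumulate, and the accumulator combines the two partial results; the sketch assumes this is the role a post-add stage would otherwise play.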
There are still 16 coefficients, but they are now complex, so the coefficient storage doubles in size. The code must update the coefficients four times for a complete iteration. The data transfer is also slightly more involved.
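The coefficient bookkeeping can be sketched in plain C++ as follows. This assumes each coefficient update brings in 8 floats (4 complex taps), which is what makes 16 complex taps require four updates; that load size, the tap ordering, and the names `kFloatsPerLoad` and `fir_one_output` are assumptions for illustration, not taken from the kernel source.

```cpp
#include <array>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;

constexpr std::size_t kTaps = 16;                          // 16 complex coefficients
constexpr std::size_t kFloatsPerLoad = 8;                  // assumed size of one coefficient update
constexpr std::size_t kTapsPerLoad = kFloatsPerLoad / 2;   // 4 complex taps per update
constexpr std::size_t kLoads = kTaps / kTapsPerLoad;       // number of updates per iteration

static_assert(kTaps * 2 == 32, "complex taps occupy twice as many floats");
static_assert(kLoads == 4, "four coefficient updates per complete iteration");

// Computes one output as sum over k of coeffs[k] * x[k], walking the
// coefficients in kLoads blocks to mimic the four updates.
// x must point at kTaps valid samples.
cfloat fir_one_output(const std::array<cfloat, kTaps>& coeffs, const cfloat* x) {
    cfloat acc(0.0f, 0.0f);
    for (std::size_t load = 0; load < kLoads; ++load) {
        // "Update" the working coefficient block with the next 4 complex taps.
        std::array<cfloat, kTapsPerLoad> block;
        for (std::size_t t = 0; t < kTapsPerLoad; ++t) {
            block[t] = coeffs[load * kTapsPerLoad + t];
        }
        // Accumulate this block's contribution.
        for (std::size_t t = 0; t < kTapsPerLoad; ++t) {
            acc += block[t] * x[load * kTapsPerLoad + t];
        }
    }
    return acc;
}
```

With real coefficients, the same assumed load size would cover the 16 taps in two updates, which is why doubling the storage doubles the number of updates.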