To understand what kind of hardware implementation is needed given the performance constraints, you can examine the convolution kernel in some detail:
The core computation is a 4-level nested loop, but you can break it down into the work done per output pixel.
The filter source code shows that a single output pixel is produced each time the two inner loops run to completion.
These two loops compute a sum of products over the coefficient matrix and an image sub-matrix of the same size. That size is set by the coefficient matrix, which is 15x15.
The two inner loops therefore perform a dot product of size 225 (15x15). In other words, they execute 225 multiply-accumulate (MAC) operations for every output pixel produced.
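
The loop structure described above can be sketched as follows. This is a minimal illustration, not the tutorial's actual filter source: the function name `filter2d_count_macs`, the `int` pixel type, the constant names, and the zero-padded border handling are all assumptions made for the example. Only the 4-level loop nest and the 15x15 coefficient matrix come from the text. The function also counts MACs so the per-pixel cost can be checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Coefficient matrix dimensions as stated in the text (15x15).
constexpr int kCoeffRows = 15;
constexpr int kCoeffCols = 15;
constexpr int kMacsPerPixel = kCoeffRows * kCoeffCols;
static_assert(kMacsPerPixel == 225, "15x15 window gives 225 MACs per output pixel");

// Hypothetical reference filter (name and signature are illustrative).
// Pixels outside the image are treated as zero; that border policy is an assumption.
// Returns the total number of multiply-accumulate operations performed.
std::uint64_t filter2d_count_macs(const std::vector<int>& src, int height, int width,
                                  const std::vector<int>& coeffs,
                                  std::vector<long long>& dst)
{
    std::uint64_t macs = 0;
    dst.assign(static_cast<std::size_t>(height) * width, 0);

    // Outer two loops: visit every output pixel.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            long long sum = 0;
            // Inner two loops: one full pass produces exactly one output pixel.
            for (int row = 0; row < kCoeffRows; ++row) {
                for (int col = 0; col < kCoeffCols; ++col) {
                    const int yy = y + row - kCoeffRows / 2;
                    const int xx = x + col - kCoeffCols / 2;
                    int pixel = 0;  // zero padding outside the image (assumption)
                    if (yy >= 0 && yy < height && xx >= 0 && xx < width) {
                        pixel = src[static_cast<std::size_t>(yy) * width + xx];
                    }
                    sum += static_cast<long long>(pixel) *
                           coeffs[static_cast<std::size_t>(row) * kCoeffCols + col];
                    ++macs;  // one multiply-accumulate per coefficient
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = sum;
        }
    }
    return macs;
}
```

For an image of `height` by `width` pixels, the returned count is `height * width * 225`, which is the figure to weigh against the performance constraints when sizing the hardware implementation.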