Before building a kernel to implement FIR filtering, consider the following:
What kind of interface are you using?
How many coefficients do you have?
How does the number of coefficients influence the size of the data register and the coefficient register?
How many lanes can you use in the intrinsics?
When do you schedule data reading and writing?
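
To make these questions concrete, the following is a minimal scalar sketch in plain C++, not an intrinsic-based kernel. It models how the coefficient count and the lane count together determine how many input samples must be held at once. The names `kTaps`, `kLanes`, `kDataWindow`, and `fir_blocked`, and the values 8 and 4, are assumptions chosen for illustration only; they do not come from any particular device or API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters, chosen only for illustration.
constexpr std::size_t kTaps  = 8;  // number of coefficients
constexpr std::size_t kLanes = 4;  // outputs computed per iteration

// Samples that must be resident to produce kLanes outputs at once.
constexpr std::size_t kDataWindow = kLanes + kTaps - 1;

// Scalar model of a lane-parallel FIR: y[n] = sum_k c[k] * x[n - k],
// with x[n] treated as zero outside the input range.
std::vector<std::int32_t> fir_blocked(const std::vector<std::int16_t>& x,
                                      const std::array<std::int16_t, kTaps>& c) {
    std::vector<std::int32_t> y(x.size(), 0);
    const auto size = static_cast<std::ptrdiff_t>(x.size());

    for (std::size_t base = 0; base < x.size(); base += kLanes) {
        // "Data register": window[i] holds x[base - (kTaps - 1) + i].
        std::array<std::int16_t, kDataWindow> window{};
        for (std::size_t i = 0; i < kDataWindow; ++i) {
            const std::ptrdiff_t idx = static_cast<std::ptrdiff_t>(base)
                                     - static_cast<std::ptrdiff_t>(kTaps - 1)
                                     + static_cast<std::ptrdiff_t>(i);
            window[i] = (idx >= 0 && idx < size)
                            ? x[static_cast<std::size_t>(idx)]
                            : std::int16_t{0};
        }

        // Each "lane" produces one output sample from the shared window.
        for (std::size_t lane = 0; lane < kLanes && base + lane < x.size(); ++lane) {
            std::int32_t acc = 0;
            for (std::size_t k = 0; k < kTaps; ++k) {
                const std::size_t i = lane + (kTaps - 1) - k;
                assert(i < kDataWindow);  // the window is exactly large enough
                acc += c[k] * window[i];
            }
            y[base + lane] = acc;
        }
    }
    return y;
}
```

In this model, adding coefficients or lanes grows `kDataWindow`, which is the kind of trade-off the questions above ask you to settle before choosing register sizes. The window refill at the top of each outer iteration stands in for the point where data reads would be scheduled.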