Before building a kernel to implement FIR filtering, consider the following:
What kind of interface will you use?
How many coefficients do you have?
How will the number of coefficients influence the size of the data register and coefficient register?
How many lanes can you use in your intrinsics?
When will you schedule data reading and writing?
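
To make the link between coefficient count, lane count, and register size concrete, here is a minimal scalar sketch in plain C++. It is not AI Engine intrinsic code, and the names `samples_needed` and `fir_reference` are hypothetical helpers introduced only for illustration. It assumes integer samples and the indexing y[n] = sum over k of c[k] * x[n + k]; reverse the coefficient order if your filter is defined as a convolution.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: input samples that must be available to compute
// `lanes` consecutive outputs of a filter with `taps` coefficients.
constexpr std::size_t samples_needed(std::size_t taps, std::size_t lanes) {
    return lanes + taps - 1;
}

// Scalar reference FIR (assumed form): y[n] = sum_k c[k] * x[n + k].
// Useful as a golden model to check a vectorized kernel against.
std::vector<int> fir_reference(const std::vector<int>& x,
                               const std::vector<int>& c,
                               std::size_t out_count) {
    if (c.empty() || x.size() < samples_needed(c.size(), out_count)) {
        throw std::invalid_argument("not enough input samples for requested outputs");
    }
    std::vector<int> y(out_count, 0);
    for (std::size_t n = 0; n < out_count; ++n) {
        for (std::size_t k = 0; k < c.size(); ++k) {
            y[n] += c[k] * x[n + k];  // one multiply-accumulate per tap
        }
    }
    return y;
}
```

For example, `samples_needed(8, 4)` gives 11: producing 4 outputs in parallel from an 8-tap filter requires 11 input samples to be held at once, which is the kind of figure that drives the choice of data register size.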