Some things to try to build on this experiment:
Edit the host code to experiment with the image sizes. How does the run time change as you scale the
image up further? Down further? Where is the crossover point below which it no longer makes sense to use
an accelerator in this case? How much extra latency does the hardware path add?
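
One way to reason about the crossover point is a simple linear cost model. This is a rough sketch, not something taken from the experiment itself. It assumes the accelerator pays a fixed overhead per call (setup, data transfer, kernel launch) plus a per-pixel cost, while the CPU pays only a per-pixel cost. The function names and the numbers in the usage example are hypothetical; substitute the timings you measure from your own runs.

```python
import math


def cpu_time_ns(pixels, cpu_ns_per_pixel):
    """Modeled CPU run time: purely proportional to the pixel count."""
    return cpu_ns_per_pixel * pixels


def accel_time_ns(pixels, fixed_overhead_ns, accel_ns_per_pixel):
    """Modeled accelerator run time: fixed overhead plus a per-pixel cost."""
    return fixed_overhead_ns + accel_ns_per_pixel * pixels


def crossover_pixels(fixed_overhead_ns, accel_ns_per_pixel, cpu_ns_per_pixel):
    """Smallest pixel count at which the accelerator is at least as fast as the CPU.

    Returns None when the accelerator's per-pixel cost is not lower than the
    CPU's, because in that case it never catches up under this model.
    """
    gain_per_pixel = cpu_ns_per_pixel - accel_ns_per_pixel
    if gain_per_pixel <= 0:
        return None
    return math.ceil(fixed_overhead_ns / gain_per_pixel)


if __name__ == "__main__":
    # Hypothetical placeholder numbers; replace with your measured values.
    overhead_ns = 1_000_000   # assumed fixed cost of using the hardware
    accel_rate = 1.0          # assumed accelerator ns per pixel
    cpu_rate = 5.0            # assumed CPU ns per pixel

    n = crossover_pixels(overhead_ns, accel_rate, cpu_rate)
    print("modeled crossover at", n, "pixels")
    for width, height in [(64, 64), (512, 512), (1920, 1080)]:
        pixels = width * height
        cpu = cpu_time_ns(pixels, cpu_rate)
        acc = accel_time_ns(pixels, overhead_ns, accel_rate)
        print(width, "x", height, "cpu_ns:", cpu, "accel_ns:", acc)
```

To estimate the three parameters, time a few image sizes on both paths and fit a line to each. The intercept of the accelerator line approximates the extra hardware latency. Real runs are rarely perfectly linear (caches, DMA burst sizes, and alignment all play a part), so treat the modeled crossover as a starting point for measurement rather than a prediction.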