In general, for acceleration systems we are trying to hit a particular performance target, whether that target is expressed in terms of overall end-to-end latency, frames per second, raw throughput, or something else.
Generally speaking, candidates for acceleration are algorithmic processing blocks that work on large chunks of data in deterministic ways. In 1967, Gene Amdahl famously proposed what has since become known as Amdahl’s Law. It expresses the potential speedup in a system:
$$S_{latency} = \frac{1}{(1 - p) + \frac{p}{s}}$$

In this equation, S_latency represents the theoretical speedup of the overall task, s is the speedup of the portion of the algorithm that benefits from acceleration, and p is the proportion of the overall execution time that portion occupied prior to acceleration.
Amdahl’s Law demonstrates a clear limit to the benefit of acceleration in a given application, as the portion of the task that cannot be accelerated (generally involving decision making, I/O, or other system overhead tasks) will always become the system bottleneck.
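To make the limit concrete, suppose 95% of a task’s runtime can be accelerated (p = 0.95). Even if that portion is made arbitrarily fast, the overall speedup can never exceed a factor of 20:

$$\lim_{s \to \infty} S_{latency} = \frac{1}{1 - p} = \frac{1}{1 - 0.95} = 20$$

With a more modest s = 10, the overall speedup is only 1 / (0.05 + 0.095) ≈ 6.9.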
If Amdahl’s Law is correct, however, then why do so many modern systems use either general-purpose or domain-specific acceleration? The crux lies here: modern systems are processing an ever-increasing amount of data, and Amdahl’s Law only applies to cases where the problem size is fixed. That is to say, the limit is imposed only so long as p is a constant fraction of overall execution time. In 1988, John Gustafson and Edwin Barsis proposed a reformulation that has since become known as Gustafson’s Law. It re-casts the problem:
$$S_{latency} = (1 - p) + s \cdot p$$

Again, S_latency represents the theoretical speedup of the overall task, s is the speedup in latency of the part of the task that benefits from parallelism, and p is the percentage of the overall task latency taken up by that part prior to applying the improvement.
Gustafson’s Law re-frames the question raised by Amdahl’s Law. Rather than more compute resources speeding up the execution of a single task, more compute resources allow more computation in the same amount of time.
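Under Gustafson’s framing the same numbers look quite different. If the portion benefiting from parallelism accounts for p = 0.95 of the (scaled) workload and runs s = 20 times faster, the scaled speedup is:

$$S_{latency} = (1 - 0.95) + 20 \times 0.95 = 19.05$$

Because the problem size is allowed to grow with the available resources, the serial fraction no longer imposes a fixed ceiling on the achievable speedup.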
Both of the laws frame the same thing in different ways: to accelerate an application, we are going to parallelize it. Through parallelization we are attempting to either process more data in the same amount of time, or the same amount of data in less time. Both approaches have mathematical limitations, but both also benefit from more resources (although to different degrees).
In general, FPGAs and ACAPs are useful for accelerating algorithms with lots of “number crunching” on large data sets - things like video transcoding, financial analysis, genomics, machine learning, and other applications with large blocks of data that can be processed in parallel. It’s important to approach acceleration with a firm understanding of how and when to apply external acceleration to a software algorithm. Throwing random code onto any accelerator, FPGA or otherwise, does not generally give you optimal results. This is for two reasons: first and foremost, leveraging acceleration sometimes requires restructuring a sequential algorithm to increase its parallelism. Second, while the FPGA architecture allows you to quickly process parallel data, transferring data over PCIe and between DDR memories has an inherent additive latency. You can think of this as a kind of “acceleration tax” that you must pay to share data with any heterogeneous accelerator.
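One way to reason about this tax before committing to an implementation is a back-of-envelope model that compares host-only execution time against kernel time plus round-trip transfer time. The sketch below is illustrative only: the estimate() helper, the throughput-style rates, and the example numbers are assumptions made for the sake of the calculation, not measurements or any vendor API.

```cpp
// Back-of-envelope model of the "acceleration tax": offload only pays off when
// the kernel speedup outweighs the cost of moving data to and from the device.
// All rates and numbers below are illustrative assumptions, not measured values.
#include <cstdio>

struct OffloadEstimate {
    double cpu_seconds;      // time to process the block on the host CPU
    double kernel_seconds;   // time to process the same block on the accelerator
    double transfer_seconds; // PCIe/DDR transfer time, both directions
};

// Estimate end-to-end times for a given payload using simple throughput figures.
OffloadEstimate estimate(double bytes,
                         double cpu_gbps,   // effective CPU processing rate (GB/s)
                         double accel_gbps, // effective accelerator processing rate (GB/s)
                         double pcie_gbps)  // effective PCIe bandwidth (GB/s)
{
    OffloadEstimate e;
    e.cpu_seconds      = bytes / (cpu_gbps   * 1e9);
    e.kernel_seconds   = bytes / (accel_gbps * 1e9);
    e.transfer_seconds = 2.0 * bytes / (pcie_gbps * 1e9); // payload crosses the bus twice
    return e;
}

int main() {
    // Hypothetical example: a 1 GiB block, CPU at 2 GB/s, accelerator at 20 GB/s,
    // PCIe sustaining 12 GB/s in each direction.
    const double bytes = 1024.0 * 1024.0 * 1024.0;
    const OffloadEstimate e = estimate(bytes, 2.0, 20.0, 12.0);

    const double accel_total = e.kernel_seconds + e.transfer_seconds;
    std::printf("CPU:         %.3f s\n", e.cpu_seconds);
    std::printf("Accelerator: %.3f s (kernel %.3f s + transfer %.3f s)\n",
                accel_total, e.kernel_seconds, e.transfer_seconds);
    std::printf("Offload %s\n", accel_total < e.cpu_seconds ? "pays off" : "does not pay off");
    return 0;
}
```

In practice the assumed rates would come from profiling the actual platform, and the model ignores effects such as overlapping transfers with computation, but it captures why small or short-lived workloads rarely justify the transfer overhead.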
With that in mind, we want to look for areas of code that satisfy several conditions (a sketch of a candidate loop follows this list). They should:
Process large blocks of data in deterministic ways.
Have well-defined data dependencies, preferably with sequential or stream-based access patterns. Random access should be avoided unless it can be bounded.
Take long enough to process that the overhead of transferring the data between the CPU and the accelerator does not dominate the accelerator run time.
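As a deliberately simplified, hypothetical illustration (not drawn from any particular library), the following loop has the shape we are looking for: it makes a single, in-order pass over a large buffer, each output depends only on a small bounded window of inputs, and the work per element is fixed and data-independent.

```cpp
#include <cstddef>
#include <vector>

// Simple 3-tap moving-average filter over a large sample buffer.
// Each output depends only on a bounded window of consecutive inputs,
// so the loop can be pipelined or split across parallel compute units.
std::vector<float> moving_average3(const std::vector<float>& in) {
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 2; i < in.size(); ++i) {
        out[i] = (in[i] + in[i - 1] + in[i - 2]) / 3.0f;
    }
    return out;
}
```

By contrast, a loop whose next memory access depends on the value it just read, such as pointer chasing through a linked structure, has unbounded random access and is a much weaker candidate.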