A good system design model makes it easy to apply hardware acceleration to specific functions in an existing application, with minimal changes needed to instantiate the compute hardware and run it efficiently. In the Vitis HLS-based acceleration flow, the efficiency of the compute hardware still depends on the modeling and coding style and on the pragmas used. In the RTL flow, it depends on the chosen architecture. The invocation of the accelerated function, or compute unit (CU), and its interaction with the host should be automated as much as possible. This includes pipelining data through the hardware and using and composing multiple CUs.
- in the device memory, typically a DDR, local to the accelerator card having the FPGA
- in a SmartSSD connected to the FPGA over PCIe
- arriving at the PE input through one or more AXI4-Stream interfaces.
The CUs must connect to platform ports, which are typically memory-mapped AXI4 (M_AXI) for data transfers to and from a host CPU through a DDR, or AXI4-Lite for low-bandwidth scalar word transfers. The CUs may operate on independent data sets to exploit the macro-level parallelism inherent in the application. VSC provides the ability to use a data-mover (DM) for each M_AXI. The DM is an RTL IP that efficiently implements DDR transfers by automating well-defined protocols such as AXI bursting. The CUs may also transfer data to the CUs of another user-defined accelerator through the device memory.
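The idea of several CUs working on independent data sets can be illustrated with a software-only analogy. The sketch below is not VSC code: `cu_sum` and `parallel_sum` are hypothetical names, and each "CU" is modeled as a CPU thread that processes its own slice of the input with no shared state other than its own result slot.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-CU work: sum one contiguous slice of the input.
long cu_sum(const int* data, std::size_t n) {
    return std::accumulate(data, data + n, 0L);
}

// Split the input into num_cus slices and process each slice independently.
// Threads stand in for CUs here; num_cus is assumed to be greater than zero.
long parallel_sum(const std::vector<int>& input, std::size_t num_cus) {
    std::vector<long> partial(num_cus, 0);
    std::vector<std::thread> cus;
    // Ceiling division so every element belongs to exactly one slice.
    const std::size_t chunk = (input.size() + num_cus - 1) / num_cus;
    for (std::size_t c = 0; c < num_cus; ++c) {
        const std::size_t begin = std::min(c * chunk, input.size());
        const std::size_t end = std::min(begin + chunk, input.size());
        cus.emplace_back([&, c, begin, end] {
            // Each "CU" writes only to its own slot, so no locking is needed.
            partial[c] = cu_sum(input.data() + begin, end - begin);
        });
    }
    for (auto& t : cus) t.join();
    // Combine the independent partial results on the "host" side.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Because the slices do not overlap, the workers never need to synchronize with each other. Only the final reduction runs on the host side.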
VSC provides an application layer interface, as shown on the left side of the figure above. This is a C++ API consisting primarily of two threads for each hardware accelerator, or cluster of CUs. The send-thread controls forwarding data and launching jobs on the accelerator, while the receive-thread gathers results from the accelerator. The send-thread uses a named C function called compute(), which acts as the software interface for launching the corresponding job on the accelerator. The runtime layer automates the details of scheduling such jobs onto the CU group and of managing efficient data transfers for the compute() arguments. These independent threads allow the software to interact asynchronously with the hardware execution, thereby efficiently modeling the application-specific computation and data transfers. The VSC software interface also provides several controls for user-driven synchronization with the hardware.
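The send-thread and receive-thread pattern described above can be sketched in plain C++. This is a minimal software mock, not the VSC API: `MockAccelerator`, `Job`, `receive()`, and `run_jobs()` are hypothetical names, and the "hardware" work (squaring each element) runs inline on the CPU inside compute() so that the example is self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical job descriptor: an id plus the data to process.
struct Job {
    int id;
    std::vector<int> data;
};

// Hypothetical stand-in for an accelerator and its runtime layer.
class MockAccelerator {
public:
    // Called by the send-thread. A real runtime would schedule the job
    // onto a CU; this mock squares the data inline and queues the result.
    void compute(Job job) {
        for (int& v : job.data) v *= v;
        {
            std::lock_guard<std::mutex> lock(m_);
            done_.push(std::move(job));
        }
        cv_.notify_one();
    }

    // Called by the receive-thread. Blocks until a finished job is available.
    Job receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !done_.empty(); });
        Job job = std::move(done_.front());
        done_.pop();
        return job;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Job> done_;
};

// Launch num_jobs jobs from one thread and gather results in another.
// Returns, for each job id, the sum of its squared inputs.
std::vector<int> run_jobs(int num_jobs) {
    MockAccelerator acc;
    std::vector<int> sums(num_jobs, 0);

    std::thread sender([&] {
        for (int i = 0; i < num_jobs; ++i) {
            acc.compute(Job{i, {i, i + 1}});  // job i carries {i, i+1}
        }
    });

    std::thread receiver([&] {
        for (int n = 0; n < num_jobs; ++n) {
            Job job = acc.receive();
            int s = 0;
            for (int v : job.data) s += v;
            sums[job.id] = s;  // only the receiver writes to sums
        }
    });

    sender.join();
    receiver.join();
    return sums;
}
```

The two threads share only the result queue, so the sender can keep launching jobs while the receiver drains earlier results. This is the asynchronous decoupling the paragraph describes.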
VSC provides a unified system composition paradigm in C++ and a runtime layer that allows hardware composition with streamlined data transfer between CUs and device memory, and it efficiently implements hardware-software interactions out of the box.
Because hardware compilation is very time-consuming, it is important that changes to the software code do not trigger recompilation of the hardware. This is achieved through a specific coding style on the user's side, and VSC allows the creation of reusable user-space libraries. These libraries also act as a software stack (of C++ APIs) on top of the hardware accelerator system specified by the user. Such a library can even be used as a dynamic runtime shared library integrated with a third-party software application.