The previous section described the organization, software interfaces, and hardware that VSC compiles from a unified C++ model. The following is a quick summary of key features of the source C++ model that you can use to capture specific design intent:
- Explicitly model data transfer from host to device and back using two C-threads that send/receive data to/from each accelerator and its group of CUs
- Enable performance exploration through guidance parameters: change the number of memory banks or CUs, or enable concurrent (ping-pong style) data transfers between host and device, all with minimal or no changes to host or kernel code
- Replicate CUs to run in parallel and exploit coarse-grained parallelism, using only a single numeric guidance parameter
- Automatic job scheduling on a compute cluster (multiple CUs) using round-robin or free-polling
- Automatic data transfer and accelerator job scheduling on replicated CUs within an accelerator
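The round-robin scheduling mentioned above can be sketched in plain C++. This is a conceptual illustration only; the `Cluster` class and its members are hypothetical and not part of the VSC API. Jobs submitted to a cluster of replicated CUs are assigned in cyclic order:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the VSC API: a cluster of replicated CUs
// with automatic round-robin job dispatch.
struct Cluster {
    explicit Cluster(std::size_t num_cus) : jobs_per_cu(num_cus, 0) {}

    // Assign the next job to CUs in cyclic order and return the CU index.
    std::size_t dispatch() {
        std::size_t cu = next;
        next = (next + 1) % jobs_per_cu.size();
        ++jobs_per_cu[cu];
        return cu;
    }

    std::vector<int> jobs_per_cu;  // how many jobs each CU has received
    std::size_t next = 0;          // next CU in round-robin order
};
```

Dispatching eight jobs on a four-CU cluster gives each CU exactly two jobs; a free-polling scheduler would instead pick whichever CU reports idle first.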
- Automatic pipelining of each accelerator job and other optimizations over PCIe®:
  - Pipelining to send/receive data using multiple buffer objects on the host side
  - A data mover plugin for every CU, generated based on the data movement pattern guidance
  - Pipelining to copy data between global (DDR/HBM) and on-chip (RAM) memory resources
  - Pipelining to pre-fetch the next transaction from global memory before the current one finishes on a CU
- Clustered transactions (a sequence of n data sets transferred as one data block) processed in one shot, amortizing PCIe latency
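The latency-amortization argument can be made concrete with a toy cost model. The functions and numbers here are purely illustrative, not VSC measurements: each PCIe transfer pays a fixed latency plus a per-data-set cost, so sending n data sets as one clustered block pays the fixed latency once instead of n times.

```cpp
// Toy cost model, purely illustrative: `fixed` is the per-transfer PCIe
// latency, `per_set` the transfer time of one data set.
int separate_transfers(int n, int fixed, int per_set) {
    return n * (fixed + per_set);   // n transfers, n latency payments
}
int clustered_transfer(int n, int fixed, int per_set) {
    return fixed + n * per_set;     // one transfer, one latency payment
}
```

For example, 16 data sets with a fixed latency of 10 time units and 2 units per data set cost 192 units as separate transfers but only 42 as one cluster.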
- Improved throughput through automatic concurrent (ping-pong style) data transfers using multiple host and device-memory buffers for each CU
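Similarly, the benefit of ping-pong transfers can be sketched with a two-stage pipeline model (again illustrative, not measured). Without double buffering each job pays transfer plus compute time serially; with two buffers, the transfer of job k+1 overlaps the compute of job k, so only the slower stage is paid per job in steady state.

```cpp
#include <algorithm>

// Illustrative model: t_x = host<->device transfer time per job,
// t_c = CU compute time per job, n = number of jobs.
int serial_time(int n, int t_x, int t_c) {
    return n * (t_x + t_c);  // no overlap: every job pays both stages
}
int pingpong_time(int n, int t_x, int t_c) {
    // Two buffers: the first transfer fills the pipeline, then each job
    // costs only the slower stage; the last compute drains the pipeline.
    return t_x + (n - 1) * std::max(t_x, t_c) + t_c;
}
```

With 8 jobs, a 3-unit transfer, and a 5-unit compute, serial execution takes 64 units while the ping-pong pipeline takes 43.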
- Allow variable payload sizes using peak memory allocation (allocate for the largest data size)
- Dynamic output buffer sizes (allocated at runtime) are supported when:
  - the maximum buffer size is known at compile time, and
  - the dynamic size is determined by the application code
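The two buffer-sizing bullets above amount to the following pattern, shown here as a hypothetical plain C++ sketch (`PeakBuffer` is not a VSC type): storage is allocated once for the compile-time maximum, and the application sets the actual payload size per transaction at run time.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch, not the VSC API: one-time peak allocation at the
// compile-time maximum size, with a runtime payload size per transaction.
template <std::size_t MaxSize>
class PeakBuffer {
public:
    PeakBuffer() : storage_(MaxSize) {}  // allocate once, at peak size

    // The application decides the actual payload size at run time.
    void set_payload(std::size_t n) {
        if (n > MaxSize) throw std::length_error("payload exceeds peak size");
        used_ = n;
    }
    std::size_t capacity() const { return MaxSize; }
    std::size_t size() const { return used_; }
    int* data() { return storage_.data(); }

private:
    std::vector<int> storage_;
    std::size_t used_ = 0;
};
```

A payload within the compile-time maximum is accepted without any reallocation; a payload beyond it is rejected, matching the requirement that the maximum size be known at compile time.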
- System-level composition using a mix of software (host side) and hardware (compute units):
  - Hardware-only composition with direct connection (AXI4-Stream) inference. You can create a PE pipeline or network within each CU, and easily replicate such units
  - Free-running PEs with streaming interfaces within a synchronous pipeline (in a CU)
  - Mixed hardware and software composition for creating a data processing pipeline. Software tasks can process data in between hardware tasks, with different accelerators compiled into the same xclbin
- The entire system design is captured in C++ and can be validated by software compilation and execution of the C++ source
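The two-thread host model from the first bullet, and the fact that the whole design runs as ordinary software, can be sketched with standard C++ threads. Everything here (`JobQueue`, `run_jobs`, and squaring as a stand-in for the accelerator) is hypothetical and not VSC code: one thread sends jobs while a second thread receives results, and the whole program validates by normal compilation and execution.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical sketch, not the VSC API: a thread-safe FIFO standing in
// for the result channel between the accelerator and the host.
class JobQueue {
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    int pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        int v = q_.front(); q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// Run `n` jobs through a stand-in "accelerator" (here: squaring), with
// separate send and receive threads as in the host model above.
std::vector<int> run_jobs(int n) {
    JobQueue results;
    std::thread sender([&] {
        for (int i = 0; i < n; ++i)
            results.push(i * i);       // stand-in for a device round trip
    });
    std::vector<int> out;
    std::thread receiver([&] {
        for (int i = 0; i < n; ++i)
            out.push_back(results.pop());  // receive results in order
    });
    sender.join();
    receiver.join();
    return out;
}
```

Because the sender pushes results in job order and the queue is FIFO, the receiver collects them in order, mirroring an in-order receive on the host side.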