Design Capture and Modeling - 2024.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2024-05-30
Version: 2024.1 English

The previous section described the organization, software interfaces, and hardware compiled by VSC from a unified C++ model. The following is a quick summary of some key features of the source C++ model that you can use to easily capture specific design intent:

  1. Explicitly model data transfer from host to device and back using two C-threads that send data to and receive data from each accelerator and its group of CUs
  2. Enable performance exploration through guidance parameters: changes such as using multiple memory banks, changing the number of CUs, and enabling concurrent (ping-pong style) data transfers between host and device require minimal or no changes to host or kernel code
  3. Replicate CUs to run in parallel and exploit coarse-grained parallelism, using only a single numeric guidance parameter
  4. Automatic job scheduling on a compute cluster (multiple CUs) using round-robin or free-polling
  5. Automatic data transfer and accelerator job scheduling on replicated CUs within an accelerator
  6. Automatic pipelining of each accelerator job and other optimizations over PCIe®:
    1. Sending and receiving data using multiple buffer objects on the host side
    2. A data mover plugin for every CU, based on the data movement pattern guidance
      1. to copy data between global (DDR/HBM) and on-chip (RAM) memory resources
      2. to pre-fetch the next transaction from global memory before the current one finishes on a CU
    3. Clustered transactions (a sequence of n data sets transferred as one data block) processed in one shot to amortize PCIe latency
    4. Improved throughput through automatic concurrent (ping-pong style) data transfers using multiple host and device-memory buffers for each CU
    5. Variable payload sizes, supported through peak memory allocation (buffers are allocated for the largest data size)
    6. Dynamic output buffer sizes (allocated at runtime), supported when:
      1. the maximum buffer size is known at compile time, and
      2. the dynamic size is determined by the application code
  7. System-level composition using a mix of software (host side) and hardware (compute units):
    1. Hardware-only composition with direct connection (AXI4-Stream) inference. You can create a PE pipeline or a network within each CU, and easily replicate such units.
    2. Free-running PEs with streaming interfaces are allowed within a synchronous pipeline (in a CU)
    3. Mixed hardware and software composition for creating a data processing pipeline. Software tasks can process data in between hardware tasks, with different accelerators compiled into the same xclbin
  8. The entire system design is captured in C++, which can be validated by compiling the C++ source in software and executing it
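
To make the two job-scheduling policies named in item 4 more concrete, the following is a minimal software-only sketch in standard C++. It does not use the VSC API. The names `Schedule`, `schedule_round_robin`, and `schedule_free_polling` are hypothetical, and the sketch assumes that free-polling means each job is dispatched to whichever CU becomes idle first; the actual VSC runtime behavior may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: which CU ran each job, and the overall finish time.
struct Schedule {
    std::vector<std::size_t> cu_of_job;  // cu_of_job[i] = index of the CU assigned to job i
    long makespan = 0;                   // time at which the last CU finishes
};

// Round-robin: job i always goes to CU (i % num_cus), regardless of CU load.
Schedule schedule_round_robin(const std::vector<long>& job_time, std::size_t num_cus) {
    assert(num_cus > 0);
    std::vector<long> busy_until(num_cus, 0);
    Schedule s;
    for (std::size_t i = 0; i < job_time.size(); ++i) {
        const std::size_t cu = i % num_cus;
        busy_until[cu] += job_time[i];
        s.cu_of_job.push_back(cu);
    }
    s.makespan = *std::max_element(busy_until.begin(), busy_until.end());
    return s;
}

// Free-polling (as assumed in this sketch): each job goes to the CU that
// becomes idle first; ties are resolved in favor of the lowest CU index.
Schedule schedule_free_polling(const std::vector<long>& job_time, std::size_t num_cus) {
    assert(num_cus > 0);
    std::vector<long> busy_until(num_cus, 0);
    Schedule s;
    for (long t : job_time) {
        const auto it = std::min_element(busy_until.begin(), busy_until.end());
        *it += t;
        s.cu_of_job.push_back(static_cast<std::size_t>(it - busy_until.begin()));
    }
    s.makespan = *std::max_element(busy_until.begin(), busy_until.end());
    return s;
}
```

For example, with job times `{8, 1, 1, 1}` on two CUs, `schedule_round_robin` assigns the jobs to CUs 0, 1, 0, 1 and finishes at time 9, while `schedule_free_polling` assigns them to CUs 0, 1, 1, 1 and finishes at time 8. Under the stated assumption, free-polling tends to help when job durations vary, whereas round-robin is simpler and equally effective when all jobs take about the same time.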