An HLS component is synthesized from a C or C++ function into RTL code for implementation in the programmable logic (PL) region of a Versal Adaptive SoC, Zynq MPSoC, or AMD FPGA device. The HLS component is tightly integrated with both the Vivado Design Suite for synthesis, place, and route, and the Vitis core development kit for heterogeneous system-level design and application acceleration.
The HLS component can be used to develop and export:
- Vivado IP to be integrated into hardware designs using the Vivado Design Suite, and used with provided software drivers for application development in embedded systems
- Vitis kernels for use in the Vitis application acceleration development flow, either with AI Engine graph applications in heterogeneous compute systems or in Data Center acceleration
The HLS component tool flow provides tools to simulate, analyze, implement, and optimize the C/C++ code in programmable logic and to achieve low latency and high throughput. Applying the required pragmas to produce the right interface for your function arguments, and to pipeline the loops and functions within your code, is the foundation of the HLS component flow, whether you work from the command line, a Makefile, or the Vitis unified IDE.
Here are the steps for the development of the HLS component from a C++ function:
- Architect the algorithm based on the Design Principles.
- (C-Simulation) Verify the functionality of the C/C++ code with the C/C++ test bench (see the sketch after this list).
- (Code Analyzer) Analyze the performance, parallelism, and legality of the C/C++ code.
- (C-Synthesis) Generate the RTL using the v++ compiler.
- (C/RTL Co-Simulation) Verify the generated RTL using the C/C++ test bench.
- (Package) Review the HLS synthesis reports and implementation timing reports.
- Re-run previous steps until performance goals are met.
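For the C-Simulation step above, the C/C++ test bench calls the top-level function like any other C++ function, checks the results against a golden reference, and returns 0 on success; a non-zero return value is treated as a failure in both C simulation and C/RTL co-simulation. The following is a minimal sketch in which the top-level function vadd and the data size are illustrative assumptions rather than names taken from this text:

```cpp
#include <cstdio>

// Hypothetical top-level HLS function under test.
extern "C" void vadd(const int *a, const int *b, int *out, int size);

int main() {
    const int N = 32;
    int a[N], b[N], hw[N];

    // Generate stimulus.
    for (int i = 0; i < N; ++i) {
        a[i] = i;
        b[i] = 2 * i;
    }

    // Call the HLS component exactly as a software function.
    vadd(a, b, hw, N);

    // Compare against the golden reference.
    int errors = 0;
    for (int i = 0; i < N; ++i) {
        if (hw[i] != a[i] + b[i]) {
            ++errors;
        }
    }

    if (errors) {
        printf("Test FAILED: %d mismatches\n", errors);
    } else {
        printf("Test PASSED\n");
    }
    return errors; // returning 0 signals a passing test to the tool
}
```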
The tool implements the HLS component based on the target flow, default tool configuration, design constraints, and any optimization pragmas or directives you specify. You can use optimization directives to modify and control the implementation of the internal logic and I/O ports, overriding the default behaviors of the tool.
Here are some key concepts related to coding and synthesizing the C++ functions in your HLS component with details covered in forthcoming sections:
- Hardware Interfaces
- The arguments of the top-level function in an HLS component are synthesized into interfaces and ports that group multiple signals to define the communication protocol between the hardware design and components external to the design. The v++ compiler defines interfaces automatically, using industry standards to specify the protocol used. The default interface protocols differ based on whether the HLS component is targeted for packaging as a Vivado IP or a Vitis kernel. The default interface assignments can be overridden by using the INTERFACE pragma or directive.
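As a hedged sketch (the function vadd and its ports are illustrative, not taken from this text), the following shows how INTERFACE pragmas in a kernel targeting the Vitis flow map pointer arguments to AXI4 master (m_axi) interfaces and scalars to an AXI4-Lite (s_axilite) interface, making the chosen protocols explicit:

```cpp
extern "C" void vadd(const int *a, const int *b, int *out, int size) {
// Pointer arguments map to AXI4 master interfaces for global memory access;
// 'bundle' groups ports onto one interface, 'depth' is used by co-simulation.
#pragma HLS INTERFACE mode=m_axi port=a   bundle=gmem0 depth=1024
#pragma HLS INTERFACE mode=m_axi port=b   bundle=gmem1 depth=1024
#pragma HLS INTERFACE mode=m_axi port=out bundle=gmem0 depth=1024
// Scalar argument and block-level control map to an AXI4-Lite interface.
#pragma HLS INTERFACE mode=s_axilite port=size
#pragma HLS INTERFACE mode=s_axilite port=return

    for (int i = 0; i < size; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```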
- Controlling the Execution of the HLS Component
- The execution mode of an HLS component is specified by the block-level control protocol. The HLS component can have control signals to start and stop execution, or it can be driven only when data is available. As a designer, you need to be aware of how your HLS design can be executed, as described in Execution Modes of HLS Designs.
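As a minimal sketch (the function and its ports are illustrative), the block-level control protocol is selected on the function return: protocols such as ap_ctrl_hs or ap_ctrl_chain add start/done handshake signals, while ap_ctrl_none removes them so the block is driven purely by the arrival of data:

```cpp
#include <hls_stream.h>

// With ap_ctrl_none there is no block-level start/done handshake; the block
// runs whenever data arrives on its input stream.
void add_one(hls::stream<int> &in, hls::stream<int> &out) {
#pragma HLS INTERFACE mode=ap_ctrl_none port=return
    out.write(in.read() + 1);
}
```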
- Task-Level Parallelism
- To achieve high performance in the generated hardware, the tool must infer parallelism from sequential code and exploit it. The Design Principles section introduces the three main paradigms that need to be understood for writing good software for FPGA platforms. The HLS component offers two forms of task-level parallelism (TLP): either by specifying the DATAFLOW pragma or by explicitly creating parallelism with hls::task objects, as described in Abstract Parallel Programming Model for HLS.
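The following is a minimal sketch of the DATAFLOW form of task-level parallelism (the function names and data path are illustrative): the load, compute, and store functions execute as concurrent tasks that communicate through streams instead of running strictly one after the other:

```cpp
#include <hls_stream.h>

static void load(const int *in, hls::stream<int> &s, int n) {
    for (int i = 0; i < n; ++i) s.write(in[i]);
}

static void compute(hls::stream<int> &in, hls::stream<int> &out, int n) {
    for (int i = 0; i < n; ++i) out.write(in.read() * 2);
}

static void store(hls::stream<int> &s, int *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = s.read();
}

extern "C" void scale_by_two(const int *in, int *out, int n) {
#pragma HLS DATAFLOW
    hls::stream<int> s1("s1"), s2("s2");
    load(in, s1, n);      // these three calls become concurrent tasks
    compute(s1, s2, n);
    store(s2, out, n);
}
```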
- Memory Architecture
- On a CPU the memory architecture is fixed, but on programmable logic the developer can create a custom memory architecture to optimize the memory accesses of the running application.
- In a C++ program, arrays are fundamental data structures used to save and manipulate data. In hardware, these arrays are implemented as memories or registers after synthesis. The memory can be implemented as local storage or as global memory, which is often DDR or HBM memory. Access to global memory has a higher latency cost and can take many cycles, while access to local memory is fast and typically takes only one or a few cycles.
- Memory is often allocated and deallocated dynamically in a C++ program, but dynamic allocation cannot be synthesized in hardware, so the designer needs to know the exact amount of memory required by the algorithm.
- Memory accesses should be optimized to reduce the overhead of global memory accesses. The tool may perform memory burst access or coalescing (for example, combining multiple accesses into one, by expanding the data width) as directed by pragmas or when a particular access pattern is detected. Burst access hides the memory access latency and improves the memory bandwidth.
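The sketch below (buffer size, partitioning factor, and function name are assumptions) illustrates these points: data is copied from global memory into a local array with sequential accesses so that a burst transfer can be inferred, and the local array is partitioned so the pipelined compute loop can access it in parallel:

```cpp
#define N 1024

extern "C" void scale(const int *in, int *out, int factor) {
#pragma HLS INTERFACE mode=m_axi port=in  bundle=gmem depth=N
#pragma HLS INTERFACE mode=m_axi port=out bundle=gmem depth=N

    // Local storage: implemented as BRAM/registers after synthesis.
    int buf[N];
#pragma HLS ARRAY_PARTITION variable=buf type=cyclic factor=4

    // Sequential, consecutive reads of 'in' allow a burst access to be inferred.
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }

    // Compute on fast local memory, then write results back to global memory.
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = buf[i] * factor;
    }
}
```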
- Micro Level Optimization
- In C++ programs, there is a frequent need to implement repetitive algorithms that process blocks of data, for example in signal or image processing. The C/C++ source code for such algorithms typically contains several loops or nested loops. The v++ compiler can unroll or pipeline a loop or nested loops when you insert pragmas at the appropriate levels in the source code. For more information, refer to the Loops Primer.
- Once the algorithm is architected according to the design principles and parallelism can be inferred, you still need the right combination of micro-level HLS pragmas such as PIPELINE, UNROLL, and ARRAY_PARTITION. The PERFORMANCE pragma or directive lets you define a single top-level performance goal for a loop or loop nest, and the tool automatically infers the lower-level pragmas needed to meet that goal. With the PERFORMANCE pragma, fewer pragmas are needed to achieve good QoR, and it is an intuitive way to drive the tool.
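As a hedged sketch of this approach (the loop bounds and target value are assumptions, not figures from this text), a single PERFORMANCE pragma placed in the outer loop states a throughput goal and leaves the choice of lower-level PIPELINE, UNROLL, and ARRAY_PARTITION pragmas to the tool:

```cpp
extern "C" void row_mac(const int in[128][64], const int coef[64], int out[128]) {
ROWS:
    for (int r = 0; r < 128; ++r) {
        // One top-level goal (a target transaction interval, in clock cycles);
        // the tool infers the lower-level pragmas needed to try to meet it.
#pragma HLS performance target_ti=128
        int acc = 0;
    COLS:
        for (int c = 0; c < 64; ++c) {
            acc += in[r][c] * coef[c];
        }
        out[r] = acc;
    }
}
```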