The profile summary includes a number of useful statistics for your host application and kernels, and provides a general idea of the functional bottlenecks in your application. The following tables describe the contents of each section of the profile summary.
Settings
This displays the report and XRT configuration settings.
Summary
This displays summary statistics including device execution time and device power.
Kernels & Compute Units
The following table displays the profile summary data for all kernel functions scheduled and executed.
Name | Description |
---|---|
Kernel | Name of kernel |
Enqueues | Number of times kernel is enqueued. When the kernel is enqueued only once, the following stats are all the same. |
Total Time | Sum of runtimes of all enqueues, measured from START to END in the OpenCL execution model (in ms) |
Minimum Time | Minimum runtime of all enqueues (in ms) |
Average Time | Average runtime of all enqueues (in ms): (Total Time) / (Number of Enqueues) |
Maximum Time | Maximum runtime of all enqueues (in ms) |
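The relationship between these statistics can be sketched numerically. A minimal Python example, using made-up per-enqueue runtimes rather than values from a real run:

```python
# Hypothetical per-enqueue runtimes in ms (illustrative values only).
enqueue_times_ms = [1.2, 0.9, 1.5, 1.1]

total_time = sum(enqueue_times_ms)                  # Total Time
minimum_time = min(enqueue_times_ms)                # Minimum Time
maximum_time = max(enqueue_times_ms)                # Maximum Time
# Average Time = (Total Time) / (Number of Enqueues)
average_time = total_time / len(enqueue_times_ms)

print(total_time, minimum_time, average_time, maximum_time)
```

When a kernel is enqueued only once, all four values collapse to that single runtime, as noted in the table.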
The following table displays the profile summary data for top kernel functions.
Name | Description |
---|---|
Kernel | Name of kernel |
Kernel Instance Address | Host address of kernel instance (in hex) |
Context ID | Context ID on host |
Command Queue ID | Command queue ID on host |
Device | Name of device where kernel was executed (format: <device>-<ID> ) |
Start Time | Start time of execution (in ms) |
Duration | Duration of execution (in ms) |
The following table displays the profile summary data for all compute units on the device.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Kernel | Kernel this compute unit is associated with |
Device | Name of the device (format: <device>-<ID> ) |
Calls | Number of times the compute unit is called |
Dataflow Execution | Specifies whether the CU is executed with dataflow |
Max Parallel Executions | Number of executions in the dataflow region |
Dataflow Acceleration | Shows the performance improvement due to dataflow execution |
CU Utilization (%) | Shows the percent of the total kernel runtime that is consumed by the CU |
Total Time | Sum of the runtimes of all calls (in ms) |
Minimum Time | Minimum runtime of all calls (in ms) |
Average Time | Average runtime of all calls (in ms): (Total Time) / (Number of Calls) |
Maximum Time | Maximum runtime of all calls (in ms) |
Clock Frequency | Clock frequency used for a given accelerator (in MHz) |
The following table displays the running times and stalls for compute units on the device.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Execution Count | Execution count of the compute unit |
Running Time | Total time compute unit was running (in µs) |
Intra-Kernel Dataflow Stalls (%) | Percent time the compute unit was stalling from intra-kernel streams |
External Memory Stalls (%) | Percent time the compute unit was stalling from external memory accesses |
Inter-Kernel Pipe Stalls (%) | Percent time the compute unit was stalling from inter-kernel pipe accesses |
Kernel Data Transfers
The following table displays the data transfers from kernels to global memory.
Name | Description |
---|---|
Compute Unit Port | Name of compute unit/port |
Kernel Arguments | List of kernel arguments attached to this port |
Device | Name of device (format: <device>-<ID> ) |
Memory Resources | Memory resource accessed by this port |
Transfer Type | Type of kernel data transfers |
Number of Transfers | Number of kernel data transfers (in AXI transactions). Note: This might contain printf transfers. |
Transfer Rate | Rate of kernel data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
Bandwidth Utilization with regard to Current Port Configuration | Application bandwidth usage on this port with respect to the current configuration: Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max Achievable BW) where Max Achievable BW is based on the bit-width of the port and the clock speed of the kernel in the design |
Maximum Bandwidth with regard to Current Port Configuration | Maximum achievable bandwidth on the current port configuration: Bandwidth (MB/s) = (Current port bit width / 8) * (Running PL clock rate in MHz) |
Bandwidth Utilization with regard to Ideal Port Configuration | Application bandwidth usage against the maximum possible with ideal conditions: Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max Possible BW) where Max Possible BW is based on the max bit-width of a port (512 bits) and the max clock speed of a kernel on this platform |
Maximum Bandwidth with regard to Ideal Port Configuration | Maximum theoretical bandwidth on an ideal port configuration: Bandwidth (MB/s) = (Maximum possible port bit width / 8) * (Highest possible PL clock rate in MHz) |
Avg Size | Average size of kernel data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Avg Latency | Average latency of kernel data transfers (in ns) |
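The transfer-rate and bandwidth-utilization formulas in the table above can be worked through with a small sketch. All numbers below (port width, clock rate, byte counts, execution time) are illustrative assumptions, not values from an actual design:

```python
# Assumed current port configuration (hypothetical values).
port_bit_width = 512     # bits
pl_clock_mhz = 300       # running PL clock rate in MHz

# Maximum Bandwidth (MB/s) = (port bit width / 8) * (PL clock rate in MHz)
max_achievable_bw = (port_bit_width / 8) * pl_clock_mhz

# Assumed totals observed on the port.
total_bytes = 48_000_000   # bytes moved while the CU was active
cu_exec_time_s = 0.005     # total time the CU was active, in seconds

# Transfer Rate (MB/s) = (Total Bytes) / (Total CU Execution Time)
transfer_rate = (total_bytes / 1e6) / cu_exec_time_s

# Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max Achievable BW)
utilization = 100 * transfer_rate / max_achievable_bw
print(round(utilization, 1))
```

The same arithmetic applies to the "ideal port configuration" rows, substituting the maximum port bit width (512 bits) and the highest clock rate the platform supports.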
The following table displays the top data transfers between kernels and global memory.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Device | Name of device |
Number of Transfers | Number of write and read data transfers |
Avg Bytes per Transfer | Average bytes of kernel data transfers: Average Bytes = (Total Bytes) / (Number of Transfers) |
Transfer Efficiency (%) | Efficiency of kernel data transfers: Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096) |
Total Data Transfer | Total data transferred by kernels (in MB): Total Data = (Total Write) + (Total Read) |
Total Write | Total data written by kernels (in MB) |
Total Read | Total data read by kernels (in MB) |
Total Transfer Rate | Average total data transfer rate (in MB/s): Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
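The Transfer Efficiency formula compares the average transfer size against an ideal burst size. A sketch under assumed values (a 64-byte-wide memory interface and made-up transfer totals), expressing the result as a percentage:

```python
# Assumed memory interface width and transfer totals (illustrative only).
memory_byte_width = 64
total_bytes = 2_048_000
num_transfers = 1000

# Average Bytes = (Total Bytes) / (Number of Transfers)
avg_bytes = total_bytes / num_transfers

# Ideal transfer size: (Memory Byte Width * 256), capped at 4096 bytes.
ideal_bytes = min(memory_byte_width * 256, 4096)

# Efficiency = (Average Bytes) / ideal, shown here as a percentage.
efficiency_pct = 100 * avg_bytes / ideal_bytes
print(efficiency_pct)
```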
The following table displays the data transfer streams.
Name | Description |
---|---|
Master Port | Name of master compute unit and port |
Master Kernel Arguments | List of kernel arguments attached to this port |
Slave Port | Name of slave compute unit and port |
Slave Kernel Arguments | List of kernel arguments attached to this port |
Device | Name of device (format: <device>-<ID> ) |
Number of Transfers | Number of stream data packets |
Transfer Rate | Rate of stream data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
Avg Size | Average size of kernel data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Link Utilization (%) | Link utilization (%): Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles) |
Link Starve (%) | Link starve (%): Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles) |
Link Stall (%) | Link stall (%): Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles) |
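The three link percentages above partition the link's busy cycles into useful, stalled, and starved time. A sketch with hypothetical cycle counts:

```python
# Hypothetical cycle counts sampled for one stream link.
busy_cycles = 10_000
stall_cycles = 1_500
starve_cycles = 500

# Link Utilization = 100 * (Busy - Stall - Starve) / Busy
link_utilization = 100 * (busy_cycles - stall_cycles - starve_cycles) / busy_cycles
# Link Stall = 100 * Stall / Busy
link_stall = 100 * stall_cycles / busy_cycles
# Link Starve = 100 * Starve / Busy
link_starve = 100 * starve_cycles / busy_cycles

print(link_utilization, link_stall, link_starve)
```

By construction the three percentages sum to 100% of the busy cycles.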
Host Data Transfers
The following table displays profile data for all write transfers between the host and device memory over the PCI Express® link.
Name | Description |
---|---|
Buffer Address | Specifies the address location for the buffer |
Context ID | OpenCL Context ID on host |
Command Queue ID | OpenCL Command queue ID on host |
Start Time | Start time of write operation (in ms) |
Duration | Duration of write operation (in ms) |
Buffer Size | Amount of data being transferred (in KB) |
Writing Rate | Data transfer rate (in MB/s): (Buffer Size)/(Duration) |
The following table displays profile data for all read transfers between the host and device memory over the PCI Express® link.
Name | Description |
---|---|
Buffer Address | Specifies the address location for the buffer |
Context ID | Context ID on host |
Command Queue ID | Command queue ID on host |
Start Time | Start time of read operation (in ms) |
Duration | Duration of read operation (in ms) |
Buffer Size | Amount of data being transferred (in KB) |
Reading Rate | Data transfer rate (in MB/s): (Buffer Size) / (Duration) |
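Because Buffer Size is reported in KB and Duration in ms, the quotient is already in MB/s (1 KB/ms = 1 MB/s), so no unit conversion is needed. A sketch with illustrative values:

```python
# Hypothetical read transfer (values are illustrative).
buffer_size_kb = 4096.0   # Buffer Size, in KB
duration_ms = 2.0         # Duration, in ms

# Reading Rate (MB/s) = (Buffer Size) / (Duration); KB/ms equals MB/s.
reading_rate_mbps = buffer_size_kb / duration_ms
print(reading_rate_mbps)
```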
The following table displays the data transfers from the host to global memory.
Name | Description |
---|---|
Context:Number of Devices | Context ID and number of devices in context |
Transfer Type | Type of kernel host transfers |
Number of Buffer Transfers | Number of host buffer transfers. Note: This might contain printf transfers. |
Transfer Rate | Rate of host buffer transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total Time in µs) |
Avg Bandwidth Utilization (%) | Average bandwidth of host buffer transfers: Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate) |
Avg Size | Average size of host buffer transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Total Time | Sum of host buffer transfer durations (in ms) |
Avg Time | Average of host buffer transfer durations (in ms) |
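Here the Transfer Rate divides total bytes by total time in µs, which again yields MB/s directly (1 byte/µs = 1 MB/s). A sketch with made-up totals:

```python
# Hypothetical host buffer transfer totals (illustrative only).
total_bytes = 8_388_608    # total bytes across all buffer transfers
total_time_us = 1024.0     # sum of transfer durations, in µs

# Transfer Rate (MB/s) = (Total Bytes) / (Total Time in µs)
transfer_rate_mbps = total_bytes / total_time_us
print(transfer_rate_mbps)
```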
API Calls
The following table displays the profile data for all OpenCL host API function calls executed in the host application. A bar graph at the top shows each API call's time as a percentage of the total time.
Name | Description |
---|---|
API Name | Name of the API function (for example, clCreateProgramWithBinary , clEnqueueNDRangeKernel ) |
Calls | Number of calls to this API made by the host application |
Total Time | Sum of runtimes of all calls (in ms) |
Minimum Time | Minimum runtime of all calls (in ms) |
Average Time | Average runtime of all calls (in ms): (Total Time) / (Number of Calls) |
Maximum Time | Maximum runtime of all calls (in ms) |
Device Power
The following table displays the profile data for device power.
Name | Description |
---|---|
Power Used By Platform | Shows a line graph of the three power rails on a Data Center acceleration card |
Temperature | One chart is created for each device that has non-zero temperature readings. Displays one line for each temperature sensor, with readings in °C. |
Fan Speed | One chart is created for each device that has non-zero fan speed readings. The fan speed is measured in RPM. |
Kernel Internals
The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percentage of the running time.
Name | Description |
---|---|
Compute Unit | Indicates the compute unit instance name |
Running Time | Reports the total running time for the CU (in µs) |
Intra-Kernel Dataflow Stalls (%) | Reports the percentage of running time consumed in stalls when streaming data between kernels |
External Memory Stalls (%) | Reports the percentage of running time consumed in stalls for memory transfers outside the CU |
Inter-Kernel Pipe Stalls (%) | Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU |
The following table displays the data transfers for specific ports on the compute unit.
Name | Description |
---|---|
Port | Indicates the port name on the compute unit |
Compute Unit | Indicates the compute unit instance name |
Write Time | Specifies the total data write time on the port (in µs) |
Outstanding Write (%) | Specifies the percentage of the runtime consumed in the write process |
Read Time | Specifies the total data read time on the port (in µs) |
Outstanding Read (%) | Specifies the percentage of the runtime consumed in the read process |
The following table displays the functional port data transfers on the compute unit.
Name | Description |
---|---|
Port | Name of port |
Function | Name of function |
Compute Unit | Name of compute unit |
Write Time | Total time the port had an outstanding write (in µs) |
Outstanding Write (%) | Percent time the port had an outstanding write |
Read Time | Total time the port had an outstanding read (in µs) |
Outstanding Read (%) | Percent time the port had an outstanding read |
The following table displays the running time and stalls on the compute unit.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Function | Name of function |
Running Time | Total time function was running (in ms) |
Intra-Kernel Dataflow Stalls (%) | Percent time the function was stalling from intra-kernel streams |
External Memory Stalls (%) | Percent time the function was stalling from external memory accesses |
Inter-Kernel Pipe Stalls (%) | Percent time the function was stalling from inter-kernel pipe accesses |
Shell Data Transfers
The following table displays the DMA data transfers.
Name | Description |
---|---|
Device | Name of device (format: <device>-<ID> ) |
Transfer Type | Type of data transfers |
Number of Transfers | Number of data transfers (in AXI transactions) |
Transfer Rate | Rate of data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total Time in µs) |
Total Data Transfer | Total amount of data transferred (in MB) |
Total Time | Total duration of data transfers (in ms) |
Avg Size | Average size of data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Avg Latency | Average latency of data transfers (in ns) |
For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.
NoC Counters
The NoC Counters section displays NoC Counters Read and NoC Counters Write subsections. These sections are displayed only if there is non-zero NoC counter data.
Each section has a table containing summary data with line graphs for transfer rate and latency. The graphs can have multiple NoC counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.
Depending on the design, it might be possible to correlate NoC counters to CU ports. In this case, the CU port appears in the table, and selecting it cross-probes to the system diagram, profile summary, and any other views that include CU ports as selectable objects.
Name | Description |
---|---|
Compute Unit Port | Name of compute unit/port |
Name | Name of NoC port |
Traffic Class | Traffic class type |
Requested QoS | Requested quality of service (in MB/s) |
Min Transfer Rate | Rate of minimum data transfers (in MB/s) |
Avg Transfer Rate | Rate of average data transfers (in MB/s) |
Max Transfer Rate | Rate of maximum data transfers (in MB/s) |
Avg Size | Average size of data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Min Latency | Minimum latency of data transfers (in ns) |
Avg Latency | Average latency of data transfers (in ns) |
Max Latency | Maximum latency of data transfers (in ns) |
AI Engine Counters
The AI Engine Counters section is displayed if there is non-zero AI Engine counter data. If the AI Engine counters have an incompatible configuration, this section displays a message stating that the configuration does not support performance profiling.
This section has a table containing summary data with line graphs for active time and usage. The usage chart is only available if stall profiling is enabled.
The graphs can have multiple AI Engine counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.
It is possible to cross-probe tiles to the AI Engine array and graph views.
Name | Description |
---|---|
Tile | AI Engine Tile [Column, Row] |
Clock Frequency (MHz) | Frequency (in MHz) of clock used for AI Engine tiles |