CCIX Transaction Flows

Versal ACAP CPM CCIX Architecture Manual (AM016)

Document ID: AM016
Release Date: 2020-11-24
Revision: 1.1 English

This section describes the transaction flows for the various transaction types and agents in detail. Using the CPM as a CCIX-RA requires additional logic in the programmable logic (PL) alongside the accelerator kernel. The additional modules required for CCIX-RA mode are discussed first.

Request Agent: A Request Agent (RA) is a CCIX agent that is the source of read and write transactions. Each CCIX RA might have one or more internal initiators, also referred to as acceleration functions (AF).

The following block diagram provides an overview of PL modules required for CCIX-RA.
Figure 1. CPM as CCIX-RA with Programmable Logic Block Diagram
For an application on the host system that programs the user kernel with virtual addresses, the transaction flow is as follows:
  • The system cache IP implements the address translation cache (ATC) and address translation services (ATS). When ATS is enabled, the system cache IP checks the ATC for every AXI4 request it receives. On a hit, it uses the translated address from the ATC. On a miss, it issues an ATS request for the virtual address over the AXI4-Stream interface provided by the PCIe AXI bridge IP, uses the translated address it receives, and caches the translation in the ATC.
  • When the kernel issues a read or write transaction to a virtual address, the system cache performs address translation and then looks up its cache. On a miss, the transaction is sent to the CPM via the CHI interface exposed in the PL.
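
The ATC hit/miss behavior described above can be sketched as a small software model. This is only an illustration, not the system cache IP's implementation: the class name, the `ats_request` callback (standing in for an ATS request over the AXI4-Stream interface), and the 4 KB translation granule are all assumptions that do not come from this manual.

```python
from typing import Callable, Dict

PAGE_SIZE = 4096  # assumed translation granule; the manual does not specify one


class AddressTranslationCache:
    """Toy model of an ATC mapping virtual page numbers to physical ones."""

    def __init__(self, ats_request: Callable[[int], int]) -> None:
        # ats_request is a hypothetical stand-in for issuing an ATS request;
        # it takes a virtual page number and returns a physical page number.
        self._ats_request = ats_request
        self._entries: Dict[int, int] = {}
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_addr: int) -> int:
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn in self._entries:
            # Hit: use the translation already held in the ATC.
            self.hits += 1
        else:
            # Miss: issue an ATS request and cache the returned translation.
            self.misses += 1
            self._entries[vpn] = self._ats_request(vpn)
        return self._entries[vpn] * PAGE_SIZE + offset


def fake_ats(vpn: int) -> int:
    """Made-up translation used only to exercise the model."""
    return vpn + 0x100


atc = AddressTranslationCache(fake_ats)
first = atc.translate(0x1234)   # miss: translation fetched through fake_ats
second = atc.translate(0x1238)  # hit: same page, served from the ATC
```

In this model, the second access to the same page does not call `ats_request` again, which mirrors the point of caching the translation.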

Transaction Flows

This section describes the basic transaction flows. With regard to terminology, 'local' refers to the device itself and 'remote' refers to peer agents accessed over the link.

Local requests to local memory – local cache hit
In this scenario, the kernel in programmable logic generates a memory access that references local memory. This transaction can be a read or a write. It can hit in either the Level-1 cache mapped to the PL (implemented in the system cache IP) or the Level-2 cache (in the CPM), if present and enabled. If the cached copy is unique, no request needs to be propagated beyond the Level-2 cache.
Local requests to local memory – remote cache hit
In this scenario, the reference to local memory generated by the kernel misses in all the local caches. The Home Agent serializes the request with respect to all the other requests in the system. If the snoop filter indicates that a remote node might have a cached copy, a snoop is sent to those caches over PCIe. A cache hit may result in data being returned to the requester (the kernel in programmable logic in this case).
Local requests to local memory – local memory access
In this scenario, the local caches miss and the snoop filter indicates that there are no cached copies in the system. The Home Agent reads the data from the local DDR memory (SBSX via NoC) and returns it to the kernel in programmable logic.
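
The three local-memory scenarios above amount to a lookup order: local caches, then remote caches named by the snoop filter, then local memory. The following is a minimal sketch of that ordering under stated assumptions. The names are hypothetical, a local hit is assumed to be servable locally (the text states this for a unique copy), and falling back to local memory when a snoop returns no data is an assumption rather than something this section states.

```python
from enum import Enum


class DataSource(Enum):
    LOCAL_CACHE = "local cache (Level-1 in PL or Level-2 in CPM)"
    REMOTE_CACHE = "remote cache, snooped over PCIe"
    LOCAL_MEMORY = "local DDR memory via NoC"


def resolve_local_memory_request(
    local_cache_hit: bool,
    snoop_filter_hit: bool,
    snoop_returns_data: bool,
) -> DataSource:
    """Return where the data comes from for a local request to local memory."""
    if local_cache_hit:
        # Served by the Level-1 or Level-2 cache; nothing propagates further.
        return DataSource.LOCAL_CACHE
    if snoop_filter_hit and snoop_returns_data:
        # The snoop filter pointed at a remote cache and the snoop hit.
        return DataSource.REMOTE_CACHE
    # Assumed fallback: no usable cached copy, so read local memory.
    return DataSource.LOCAL_MEMORY
```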
Local requests to remote memory – local cache hit
In this scenario, the local kernel generates a memory access that references remote memory. The transaction can be a read or a write. It can hit in either the Level-1 cache mapped to the PL (implemented in the system cache IP) or the Level-2 cache, if present and enabled. If the cached copy is unique, no request needs to be propagated beyond the Level-2 cache.
Local requests to remote memory – local cache miss
In this scenario, the request is transmitted to the remote home node, where it is serialized at that node's point of coherence (PoC). If a snoop filter is present on the remote home node, the PoC snoops the other caches in the system that the snoop filter indicates have a cached copy. If no snoop filter is present, the PoC sends a broadcast snoop. A cache hit can return data to the remote home node and then to the requester. If all the caches miss, the PoC returns data from remote memory.
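
The choice between filtered and broadcast snoops can be illustrated with a short sketch. The function name, the representation of the snoop filter as a dictionary from address to cache names, and the exclusion of the requester from the target set are all assumptions made for illustration; they are not taken from the CCIX specification or this manual.

```python
from typing import Dict, Optional, Set


def snoop_targets(
    address: int,
    all_caches: Set[str],
    requester: str,
    snoop_filter: Optional[Dict[int, Set[str]]],
) -> Set[str]:
    """Return the caches the remote home node would snoop for `address`."""
    if snoop_filter is None:
        # No snoop filter present: broadcast to every cache in the system.
        targets = set(all_caches)
    else:
        # Snoop filter present: snoop only caches it lists for this address.
        targets = set(snoop_filter.get(address, set()))
    # Assumption: the requester itself is not snooped for its own request.
    targets.discard(requester)
    return targets
```

With a filter entry of `{0x1000: {"RA1"}}` only `RA1` is snooped for address `0x1000`, whereas passing `None` for the filter targets every cache except the requester.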
Remote requests to local memory
In this scenario, requests from the remote RA to the accelerator-attached memory are received. The home node PoC in the CCB serializes each request against all other requests to the same address. The snoop filter is looked up, and snoops are sent to the caches it indicates. In addition to the snoops, the PoC can also access local memory to satisfy the request.
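
Per-address serialization can be pictured as one ordered queue per address: requests to the same address are handled in arrival order, while requests to different addresses do not block each other. The sketch below is only a conceptual model with hypothetical names; it says nothing about how the CCB implements its PoC in hardware.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class HomeNodePoC:
    """Toy model of per-address request ordering at a home node PoC."""

    def __init__(self) -> None:
        # One FIFO of requester names per address.
        self._pending: Dict[int, Deque[str]] = defaultdict(deque)

    def accept(self, requester: str, address: int) -> None:
        """Record a request in arrival order for its address."""
        self._pending[address].append(requester)

    def next_for(self, address: int) -> Optional[str]:
        """Return the oldest pending requester for `address`, if any."""
        queue = self._pending.get(address)
        if not queue:
            return None
        return queue.popleft()


poc = HomeNodePoC()
poc.accept("RA0", 0x40)
poc.accept("RA1", 0x40)  # same address: ordered behind RA0
poc.accept("RA1", 0x80)  # different address: independent queue
```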
Remote snoops
Remote snoops arrive for remote references to addresses cached in the accelerator. The snoops look up the caches, update the state if necessary, and generate responses according to the protocol.
Important: PL system cache to CML WriteUnique (WU) bandwidth discrepancy: WU bandwidth through CML drops from an ideal 16 GB/s to 12.8 GB/s. This drop is observed only when traffic passes through L2. Therefore, if a use case requires high WU bandwidth, the L2 instance can be bypassed.
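
To put the two figures in perspective, the drop from 16 GB/s to 12.8 GB/s is a 20% reduction. The short calculation below works this out; the 64 GB transfer size is an arbitrary example chosen for illustration, and the helper names are hypothetical.

```python
IDEAL_WU_GBPS = 16.0     # ideal WriteUnique bandwidth through CML (from the text)
OBSERVED_WU_GBPS = 12.8  # WriteUnique bandwidth observed through L2 (from the text)


def relative_drop(ideal: float, observed: float) -> float:
    """Fraction of the ideal bandwidth that is lost."""
    return (ideal - observed) / ideal


def transfer_time_s(size_gb: float, bandwidth_gbps: float) -> float:
    """Time in seconds to move size_gb gigabytes at bandwidth_gbps GB/s."""
    return size_gb / bandwidth_gbps


drop = relative_drop(IDEAL_WU_GBPS, OBSERVED_WU_GBPS)
print(f"WU bandwidth drop through L2: {drop:.0%}")

example_size_gb = 64.0  # arbitrary example size, not from the manual
print(f"ideal:    {transfer_time_s(example_size_gb, IDEAL_WU_GBPS):.2f} s")
print(f"observed: {transfer_time_s(example_size_gb, OBSERVED_WU_GBPS):.2f} s")
```

For the example size, the same WriteUnique traffic takes about 5 s through L2 instead of about 4 s at the ideal rate.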