Perceived performance depends on many factors, such as frequency, latency, and throughput; which factor dominates is application-specific. The performance factors are also correlated: for example, achieving a high frequency can add latency, and the wide datapaths needed for throughput can adversely affect frequency.
Read latency is defined as the number of clock cycles from the cycle the read address is accepted by the System Cache core to the cycle when the first read data is available.
Write latency is defined as the number of clock cycles from the cycle the write address is accepted by the System Cache core to the cycle when BRESP is valid. These calculations assume that the start of the write data is aligned with the transaction address.
Snoop latency is defined as the number of clock cycles from the cycle a snoop request is accepted by the System Cache core to the cycle when CRRESP or CDDATA is valid, whichever occurs last. Not all snoops result in a CDDATA transaction.
Maximum Frequencies
For details about performance, see Performance and Resource Utilization.
CCIX Cache Latency
CCIX latency is in principle defined in the same way as in the introduction above, but the time in flight on the PCIe bus is not included. No additional delays due to ATS/ATC virtual address translation are included either.
For the different types of transactions, this means:
- Read: From AXI ARVALID to start of TLP request + start of TLP response to AXI RRESP
- Write: From AXI AWVALID to start of TLP request + start of TLP response to AXI BRESP
- Snoop: From start of TLP request to start of TLP response
- Scrub (automatic): From timer initiation of the scrub to completion
- Scrub (manual): From AXI control write initiation of the scrub to completion
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core.
Type | CCIX Latency |
---|---|
Read Hit | 16 |
Read Miss | 45 + round-trip delay on PCIe for request |
Read Miss Dirty | Maximum of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request |
Write Hit | 16 |
Write Miss | 44 + round-trip delay on PCIe for request |
Write Miss Dirty | Maximum of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request |
Snoop (missing broadcast) | 23 |
Snoop | 26 |
Snoop with data | 33 |
Scrub (automatic) | 11 |
Scrub (manual) | 17 |
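The table entries above compose with the PCIe round-trip delay. The following is a hypothetical sketch of that arithmetic: the fixed cycle counts come from the table, while the round-trip values and function names are illustrative only.

```python
def ccix_read_miss_latency(pcie_round_trip_cycles):
    """Read Miss: 45 cycles in the core plus the PCIe round trip for the request."""
    return 45 + pcie_round_trip_cycles

def ccix_read_miss_dirty_latency(read_rtt_cycles, write_rtt_cycles):
    """Read Miss Dirty: the slower of the refill read and the dirty-eviction write."""
    return max(45 + read_rtt_cycles, 39 + write_rtt_cycles)

# With a 100-cycle PCIe round trip, a read miss costs 145 cycles; in the
# dirty case the refill path (45 + 100) dominates the eviction path (39 + 100).
print(ccix_read_miss_latency(100))             # 145
print(ccix_read_miss_dirty_latency(100, 100))  # 145
```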
CHI Cache Latency
CHI latency is defined in a similar way as CCIX latency, but the time in flight in the CHI domain is not included. No additional delays due to ATS/ATC virtual address translation are included either.
For the different types of transactions, this means:
- Read: From AXI ARVALID to request FLIT + response FLIT to AXI RRESP
- Write: From AXI AWVALID to request FLIT + response FLIT to AXI BRESP
- Snoop: From start of snoop request FLIT to start of response FLIT
- Scrub (automatic): From timer initiation of the scrub to completion
- Scrub (manual): From AXI control write initiation of the scrub to completion
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core.
Type | CHI Latency |
---|---|
Read Hit | 16 |
Read Miss | 34 + round-trip delay on CHI for request |
Read Miss Dirty | Maximum of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request |
Write Hit | 16 |
Write Miss | 32 + round-trip delay on CHI for request |
Write Miss Dirty | Maximum of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request |
Snoop (missing broadcast) | 23 |
Snoop | 26 |
Snoop with data | 33 |
Scrub (automatic) | 11 |
Scrub (manual) | 17 |
ATS/ATC Latency
Including the Address Translation Service (ATS) and the ATC TLB in the AXI port interfaces adds latency to the AXI4/ACE cache latency above. In most cases, address locality keeps the hit latency minimal, but miss latency is expected in restarted systems with no accumulated translations and when the locality context changes.
Read latency is defined from the cycle the read address is accepted by the System Cache core to the cycle when the first read data is available via the Address Translation Service.
Write latency is defined from the cycle the write address is accepted by the System Cache core to the cycle when the write response is valid via the Address Translation Service, assuming that the start of the write data is aligned with the transaction address.
For the best-case transaction latency, it is assumed that previous accesses have already used the address range, so both reads and writes result in ATC TLB hits.
When a miss occurs in the ATC TLB, it is assumed that the ATC table holds a valid copy of the translation; the ATC search then adds latency bounded by the best and worst cases below.
The expected average ATC search latency depends on the temporal locality of the addresses in use in the ATC table: the entries for the n most recently mapped translations are cached, and ATC search hits fall within these n translations.
In the case of an ATC table miss, the latency is extended by the PCIe Root Complex and host TA translation latency, which are system-level latencies outside the scope of the System Cache definition.
The best-case host TA translation time is the round-trip transaction latency from request to response plus the host TA lookup time. This time can be extended by one or more page-request round trips plus host page-management latencies, and finally by another translation-retry round trip with a host TA lookup.
ATC table lookup latency is two clock cycles best case, 256+2 clock cycles worst case, and n/2+2 clock cycles on average (assuming locality for the n last accesses and hit latency within n entries).
Type | AXI4 Port Latency with Address Translation (see CHI/CCIX Master Port Read/Write Latency) |
---|---|
Read Hit/Miss, ATC TLB Hit | 2 + Master port Read latency (Hit/Miss) |
Read Hit/Miss, ATC TLB Miss, ATC Table Hit | 3 + ATC table lookup + Master port Read latency (Hit/Miss) |
Read Hit/Miss, ATC TLB Miss and ATC Table Miss | 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Read latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Hit | 2 + Master port Write burst latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Miss, ATC Table Hit | 3 + ATC table lookup + Master port Write burst latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Miss and ATC Table Miss | 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Write burst latency (Hit/Miss) |
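As a sketch of how the read rows above compose, assuming the ATC table lookup model from the text (2 cycles best case, 256 + 2 cycles worst case, n/2 + 2 cycles on average for locality over the last n entries) and treating the master-port read latency and the PCIe ATS lookup delay as example inputs:

```python
def atc_table_lookup_avg(n):
    """Average ATC table lookup latency, assuming hits within the last n entries."""
    return n // 2 + 2

def read_latency(master_port_read, atc_tlb_hit=True, atc_table_hit=True,
                 n=16, ats_lookup=0):
    """Total AXI read latency under the three ATC scenarios in the table above."""
    if atc_tlb_hit:
        return 2 + master_port_read
    if atc_table_hit:
        # Average-case table lookup; the table above lists best/worst separately.
        return 3 + atc_table_lookup_avg(n) + master_port_read
    # ATC table miss: worst-case table lookup plus the PCIe ATS round trip.
    return 3 + (256 + 2) + ats_lookup + master_port_read

# A 16-cycle master-port read hit with an ATC TLB hit: 2 + 16 = 18 cycles.
print(read_latency(16))  # 18
```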
AXI4/ACE Cache Latency
Here latency is used as described in the introduction.
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. Transactions using a generic AXI4 port incur an additional two clock cycles of latency.
Type | Optimized Port Latency |
---|---|
Read Hit | 6 |
Read Miss | 7 + latency added by memory subsystem |
Read Miss Dirty | Maximum of: 7 + latency added by memory subsystem; 7 + latency added for evicting dirty data (cache line length * 32 / M_AXI data width) |
Write Hit | 3 + burst length |
Write Miss | Non-bufferable transaction: 7 + latency added by memory subsystem for writing data; Bufferable transaction: same as Write Hit |
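As a worked example of the Read Miss Dirty entry, the eviction term can be computed from the cache line length and the M_AXI data width. This is a sketch only: the line length is assumed to be given in 32-bit words, and the memory-subsystem latency is an example input.

```python
def eviction_cycles(line_len_words, m_axi_width_bits):
    """Cycles to write out a dirty line: cache line length * 32 / M_AXI data width."""
    return line_len_words * 32 // m_axi_width_bits

def read_miss_dirty_latency(mem_latency_cycles, line_len_words, m_axi_width_bits):
    """Maximum of the refill path and the dirty-eviction path (7 + each term)."""
    return max(7 + mem_latency_cycles,
               7 + eviction_cycles(line_len_words, m_axi_width_bits))

# A 16-word (64-byte) line over a 128-bit M_AXI needs 4 eviction beats, so a
# 20-cycle memory subsystem dominates: 7 + 20 = 27 cycles.
print(read_miss_dirty_latency(20, 16, 128))  # 27
```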
Enabling optimized port cache coherency affects the latency and also introduces new types of transaction latencies. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. Transactions from a generic port still have two cycles of extra latency.
Type | Coherent Optimized Port Latency |
---|---|
DVM Message | 9 + latency added by snooped masters |
DVM Sync | 12 + latency added by snooped masters |
Read Hit | 9 + latency added by snooped masters |
Read Miss | 10 + latency added by snooped masters + latency added by memory subsystem |
Read Miss Dirty | Maximum of: 10 + latency added by snooped masters + latency added by memory subsystem; 10 + latency added by snooped masters + latency added for evicting dirty data (cache line length * 32 / M_AXI data width) |
Write Hit | Maximum of: 3 + burst length; 6 + latency added by snooped masters |
Write Miss | Non-bufferable transaction: 10 + latency added by snooped masters + latency added by memory subsystem for writing data; Bufferable transaction: same as Write Hit |
Type | Master Port Snoop Latency |
---|---|
Snoop Miss | 3 + latency of any preceding snoop blocking progress; 4 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard |
Snoop Hit | 4 + latency to acquire data access + latency of any preceding snoop blocking progress; 5 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard |
Type | Hit Rate |
---|---|
Read | 99.82% |
Write | 92.93% |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 99.68% | 6 | 290 | 8 | 3 |
Write | 96.63% | 4 | 31 | 4 | 1 |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 9.96% | 5 | 568 | 6 | 2 |
Write | N/A | N/A | N/A | N/A | N/A |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 76.68% | 7 | 388 | 18 | 13 |
Write | 9.78% | 6 | 112 | 24 | 5 |
Throughput
The System Cache core is fully pipelined and has a theoretical maximum transaction rate of one read or write hit data beat concurrent with one read-miss and one write-miss data beat per clock cycle when there are no conflicts with earlier transactions.
This theoretical limit is subject to memory subsystem bandwidth, intra-transaction conflicts, and cache hit-detection overhead, which reduce the achieved throughput to less than three data beats per clock cycle.
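Under the stated ceiling of up to three data beats per clock cycle, a rough upper bound on bandwidth can be sketched as follows; the frequency and data width are example inputs, not characterized values.

```python
def peak_bandwidth_gb_s(freq_mhz, data_width_bits, beats_per_cycle=3):
    """Theoretical bandwidth upper bound in GB/s; real traffic achieves less."""
    return freq_mhz * 1e6 * (data_width_bits // 8) * beats_per_cycle / 1e9

# A 512-bit datapath at 300 MHz gives a 57.6 GB/s theoretical ceiling.
print(peak_bandwidth_gb_s(300, 512))  # 57.6
```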