Perceived performance depends on many factors, such as frequency, latency, and throughput; which factor dominates is application-specific. The performance factors are also correlated: for example, achieving a high frequency can add latency, and the wide datapaths needed for throughput can adversely affect frequency.
Read latency is defined as the number of clock cycles from the cycle the read address is accepted by the System Cache core to the cycle when the first read data is available.
Write latency is defined as the number of clock cycles from the cycle the write address is accepted by the System Cache core to the cycle when BRESP is valid. These calculations assume that the start of the write data is aligned with the transaction address.
Snoop latency is defined as the number of clock cycles from the cycle a snoop request is accepted by the System Cache core to the cycle when CRRESP or CDDATA is valid, whichever occurs last. Not all snoops result in a CDDATA transaction.
Maximum Frequencies
For details about performance, see Performance and Resource Utilization.
CCIX Cache Latency
CCIX latency is in principle defined in the same way as in the introduction above, but the time in flight on the PCIe bus is not included. No additional delays due to ATS/ATC virtual address translation are included either.
For the different types of transactions, this means:
- Read: From AXI ARVALID to start of TLP request + start of TLP response to AXI RRESP
- Write: From AXI AWVALID to start of TLP request + start of TLP response to AXI BRESP
- Snoop: From start of TLP request to start of TLP response
- Scrub (automatic): From timer initiation of the scrub to completion
- Scrub (manual): From AXI control write initiation of the scrub to completion
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core.
Type | CCIX Latency |
---|---|
Read Hit | 16 |
Read Miss | 45 + round-trip delay on PCIe for request |
Read Miss Dirty | Maximum of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request |
Write Hit | 16 |
Write Miss | 44 + round-trip delay on PCIe for request |
Write Miss Dirty | Maximum of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request |
Snoop (missing broadcast) | 23 |
Snoop | 26 |
Snoop with data | 33 |
Scrub (automatic) | 11 |
Scrub (manual) | 17 |
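The table entries above compose with the PCIe round-trip delay. The following is a hypothetical sketch of that arithmetic: the fixed cycle counts come from the table, while the round-trip values and function names are illustrative only.

```python
def ccix_read_miss_latency(pcie_round_trip_cycles):
    """Read Miss: 45 cycles in the core plus the PCIe round trip for the request."""
    return 45 + pcie_round_trip_cycles

def ccix_read_miss_dirty_latency(read_rtt_cycles, write_rtt_cycles):
    """Read Miss Dirty: the slower of the refill read and the dirty-eviction write."""
    return max(45 + read_rtt_cycles, 39 + write_rtt_cycles)

# With a 100-cycle PCIe round trip, a read miss costs 145 cycles; in the
# dirty case the refill path (45 + 100) dominates the eviction path (39 + 100).
print(ccix_read_miss_latency(100))             # 145
print(ccix_read_miss_dirty_latency(100, 100))  # 145
```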
CHI Cache Latency
CHI latency is defined in a similar way as CCIX latency, but the time in flight in the CHI domain is not included. No additional delays due to ATS/ATC virtual address translation are included either.
For the different types of transactions, this means:
- Read: From AXI ARVALID to request FLIT + response FLIT to AXI RRESP
- Write: From AXI AWVALID to request FLIT + response FLIT to AXI BRESP
- Snoop: From start of snoop request FLIT to start of response FLIT
- Scrub (automatic): From timer initiation of the scrub to completion
- Scrub (manual): From AXI control write initiation of the scrub to completion
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core.
Type | CHI Latency |
---|---|
Read Hit | 16 |
Read Miss | 34 + round-trip delay on CHI for request |
Read Miss Dirty | Maximum of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request |
Write Hit | 16 |
Write Miss | 32 + round-trip delay on CHI for request |
Write Miss Dirty | Maximum of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request |
Snoop (missing broadcast) | 23 |
Snoop | 26 |
Snoop with data | 33 |
Scrub (automatic) | 11 |
Scrub (manual) | 17 |
ATS/ATC Latency
Including the Address Translation Service (ATS) and the ATC TLB in the AXI port interfaces adds latency to the AXI4/ACE cache latency above. In most cases, address locality keeps the hit latency minimal, but miss latency is expected in restarted systems with no accumulated translations and when the locality context changes.
Read latency is defined from the cycle the read address is accepted by the System Cache core to the cycle when the first read data is available via the Address Translation Service.
Write latency is defined from the cycle the write address is accepted by the System Cache core to the cycle when the write response is valid via the Address Translation Service, assuming that the start of the write data is aligned with the transaction address.
For the best-case transaction latency, it is assumed that previous accesses have already used the address range, so both reads and writes result in ATC TLB hits.
When a miss occurs in the ATC TLB, it is assumed that the ATC table holds a valid copy of the translation; the ATC search then adds latency bounded by the best and worst cases below.
The expected average ATC search latency depends on the temporal locality of the addresses in use in the ATC table: the entries for the n most recently mapped translations are cached, and ATC search hits fall within these n translations.
In the case of an ATC table miss, the latency is extended by the PCIe Root Complex and host TA translation latency, which are system-level latencies outside the scope of the System Cache definition.
The best-case host TA translation time is the round-trip transaction latency from request to response plus the host TA lookup time. This time can be extended by one or more page-request round trips plus host page-management latencies, and finally by another translation-retry round trip with a host TA lookup.
ATC table lookup latency is two clock cycles best case, 256+2 clock cycles worst case, and n/2+2 clock cycles on average (assuming locality for the n last accesses and hit latency within n entries).
Type | AXI4 Port Latency with Address Translation (see CHI/CCIX Master Port Read/Write Latency) |
---|---|
Read Hit/Miss, ATC TLB Hit | 2 + Master port Read latency (Hit/Miss) |
Read Hit/Miss, ATC TLB Miss, ATC Table Hit | 3 + ATC table lookup + Master port Read latency (Hit/Miss) |
Read Hit/Miss, ATC TLB Miss and ATC Table Miss | 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Read latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Hit | 2 + Master port Write burst latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Miss, ATC Table Hit | 3 + ATC table lookup + Master port Write burst latency (Hit/Miss) |
Write Hit/Miss, ATC TLB Miss and ATC Table Miss | 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Write burst latency (Hit/Miss) |
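As a sketch of how the read rows above compose, assuming the ATC table lookup model from the text (2 cycles best case, 256 + 2 cycles worst case, n/2 + 2 cycles on average for locality over the last n entries) and treating the master-port read latency and the PCIe ATS lookup delay as example inputs:

```python
def atc_table_lookup_avg(n):
    """Average ATC table lookup latency, assuming hits within the last n entries."""
    return n // 2 + 2

def read_latency(master_port_read, atc_tlb_hit=True, atc_table_hit=True,
                 n=16, ats_lookup=0):
    """Total AXI read latency under the three ATC scenarios in the table above."""
    if atc_tlb_hit:
        return 2 + master_port_read
    if atc_table_hit:
        # Average-case table lookup; the table above lists best/worst separately.
        return 3 + atc_table_lookup_avg(n) + master_port_read
    # ATC table miss: worst-case table lookup plus the PCIe ATS round trip.
    return 3 + (256 + 2) + ats_lookup + master_port_read

# A 16-cycle master-port read hit with an ATC TLB hit: 2 + 16 = 18 cycles.
print(read_latency(16))  # 18
```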
AXI4/ACE Cache Latency
Here latency is used as described in the introduction.
The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. Transactions using a generic AXI4 port incur an additional two clock cycles of latency.
Type | Optimized Port Latency |
---|---|
Read Hit | 6 |
Read Miss | 7 + latency added by memory subsystem |
Read Miss Dirty | Maximum of: 7 + latency added by memory subsystem; 7 + latency added for evicting dirty data (cache line length * 32 / M_AXI data width) |
Write Hit | 3 + burst length |
Write Miss | Non-bufferable transaction: 7 + latency added by memory subsystem for writing data; Bufferable transaction: same as Write Hit |
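As a worked example of the Read Miss Dirty entry, the eviction term can be computed from the cache line length and the M_AXI data width. This is a sketch only: the line length is assumed to be given in 32-bit words, and the memory-subsystem latency is an example input.

```python
def eviction_cycles(line_len_words, m_axi_width_bits):
    """Cycles to write out a dirty line: cache line length * 32 / M_AXI data width."""
    return line_len_words * 32 // m_axi_width_bits

def read_miss_dirty_latency(mem_latency_cycles, line_len_words, m_axi_width_bits):
    """Maximum of the refill path and the dirty-eviction path (7 + each term)."""
    return max(7 + mem_latency_cycles,
               7 + eviction_cycles(line_len_words, m_axi_width_bits))

# A 16-word (64-byte) line over a 128-bit M_AXI needs 4 eviction beats, so a
# 20-cycle memory subsystem dominates: 7 + 20 = 27 cycles.
print(read_miss_dirty_latency(20, 16, 128))  # 27
```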
Enabling optimized port cache coherency affects the latency and also introduces new types of transaction latencies. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. Transactions from a generic port still have two cycles of extra latency.
Type | Coherent Optimized Port Latency |
---|---|
DVM Message | 9 + latency added by snooped masters |
DVM Sync | 12 + latency added by snooped masters |
Read Hit | 9 + latency added by snooped masters |
Read Miss | 10 + latency added by snooped masters + latency added by memory subsystem |
Read Miss Dirty | Maximum of: 10 + latency added by snooped masters + latency added by memory subsystem; 10 + latency added by snooped masters + latency added for evicting dirty data (cache line length * 32 / M_AXI data width) |
Write Hit | Maximum of: 3 + burst length; 6 + latency added by snooped masters |
Write Miss | Non-bufferable transaction: 10 + latency added by snooped masters + latency added by memory subsystem for writing data; Bufferable transaction: same as Write Hit |
Type | Master Port Snoop Latency |
---|---|
Snoop Miss | 3 + latency of any preceding snoop blocking progress; 4 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard |
Snoop Hit | 4 + latency to acquire data access + latency of any preceding snoop blocking progress; 5 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard |
Type | Hit Rate |
---|---|
Read | 99.82% |
Write | 92.93% |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 99.68% | 6 | 290 | 8 | 3 |
Write | 96.63% | 4 | 31 | 4 | 1 |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 9.96% | 5 | 568 | 6 | 2 |
Write | N/A | N/A | N/A | N/A | N/A |
Type | Hit Rate | Min | Max | Average | Standard Deviation |
---|---|---|---|---|---|
Read | 76.68% | 7 | 388 | 18 | 13 |
Write | 9.78% | 6 | 112 | 24 | 5 |
Throughput
The System Cache core is fully pipelined and has a theoretical maximum transaction rate of one read or write hit data beat concurrent with one read-miss and one write-miss data beat per clock cycle when there are no conflicts with earlier transactions.
This theoretical limit is subject to memory subsystem bandwidth, intra-transaction conflicts, and cache hit-detection overhead, which reduce the achieved throughput to less than three data beats per clock cycle.
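Under the stated ceiling of up to three data beats per clock cycle, a rough upper bound on bandwidth can be sketched as follows; the frequency and data width are example inputs, not characterized values.

```python
def peak_bandwidth_gb_s(freq_mhz, data_width_bits, beats_per_cycle=3):
    """Theoretical bandwidth upper bound in GB/s; real traffic achieves less."""
    return freq_mhz * 1e6 * (data_width_bits // 8) * beats_per_cycle / 1e9

# A 512-bit datapath at 300 MHz gives a 57.6 GB/s theoretical ceiling.
print(peak_bandwidth_gb_s(300, 512))  # 57.6
```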