To improve branch performance, MicroBlaze provides a branch target cache (BTC) coupled with a branch prediction scheme. With the BTC enabled, a correctly predicted immediate branch or return instruction incurs no overhead.
The BTC operates by saving the target address of each immediate branch and return instruction the first time the instruction is encountered. The next time the instruction is encountered, its entry is usually found in the BTC, and if the branch should be taken, the Instruction Fetch Program Counter is simply changed to the saved target address. Unconditional branches and return instructions are always taken, whereas conditional branches use branch prediction to avoid taking a branch that should not be taken, or not taking one that should.
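The lookup and update behavior described above can be sketched as a small software model. This is only an illustration, not the hardware implementation: the direct-mapped indexing, the full-address tag, the 2-bit saturating counter, and all names (`btc_model`, `btc_predict`, `btc_update`, `btc_model_clear`) are assumptions that do not come from the processor documentation.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTC_ENTRIES 512u /* matches the default size; other sizes are possible */

/* One hypothetical BTC entry; the field layout is an assumption. */
typedef struct {
    bool     valid;
    uint32_t tag;     /* full branch address, kept whole for simplicity */
    uint32_t target;  /* saved branch target address */
    uint8_t  counter; /* assumed 2-bit saturating counter, 0..3 */
} btc_entry;

typedef struct {
    btc_entry entries[BTC_ENTRIES];
} btc_model;

/* Assumed direct-mapped index derived from the word address. */
unsigned btc_index(uint32_t pc)
{
    return (unsigned)((pc >> 2) % BTC_ENTRIES);
}

/* Fetch-time lookup: returns true and writes the saved target to
 * *next_pc when the branch is found and predicted taken. */
bool btc_predict(const btc_model *btc, uint32_t pc, bool conditional,
                 uint32_t *next_pc)
{
    const btc_entry *e = &btc->entries[btc_index(pc)];
    if (!e->valid || e->tag != pc)
        return false;              /* not seen before: no prediction */
    if (conditional && e->counter < 2)
        return false;              /* conditional branch predicted not taken */
    *next_pc = e->target;          /* unconditional and return: always taken */
    return true;
}

/* Execute-time update with the actual outcome and target. */
void btc_update(btc_model *btc, uint32_t pc, bool taken, uint32_t target)
{
    btc_entry *e = &btc->entries[btc_index(pc)];
    if (!e->valid || e->tag != pc) {
        e->valid = true;           /* first encounter: allocate the entry */
        e->tag = pc;
        e->target = target;
        e->counter = taken ? 2 : 1;
        return;
    }
    if (taken) {
        e->target = target;        /* return targets may change between calls */
        if (e->counter < 3)
            e->counter++;
    } else if (e->counter > 0) {
        e->counter--;
    }
}

/* Invalidate every entry, as a memory barrier or synchronizing branch would. */
void btc_model_clear(btc_model *btc)
{
    for (unsigned i = 0; i < BTC_ENTRIES; i++)
        btc->entries[i].valid = false;
}
```

In this model, a branch is never predicted on its first encounter, and an unconditional branch is redirected to the saved target on every later lookup until the model is cleared.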
The BTC is cleared when a memory barrier (MBAR 0) or synchronizing branch (BRI 4) is executed. This also occurs when the memory barrier or synchronizing branch follows immediately after a branch instruction, even if that branch is taken. To avoid inadvertently clearing the BTC, the memory barrier or synchronizing branch should not be placed immediately after a branch instruction.
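When the BTC needs to be cleared deliberately, one option is a small C helper that issues the synchronizing branch. The following is a minimal sketch that assumes a GCC-compatible toolchain with extended inline assembly and that the compiler defines `__MICROBLAZE__` for this target; the function name `btc_flush` is hypothetical.

```c
/* Issue a synchronizing branch (BRI 4) to clear the BTC.
 * Returns 1 if the instruction was issued, or 0 when the code is
 * built for a target other than MicroBlaze (assumed to be detected
 * via the __MICROBLAZE__ macro). */
int btc_flush(void)
{
#if defined(__MICROBLAZE__)
    /* bri 4 branches to the next sequential instruction. */
    __asm__ __volatile__ ("bri 4" : : : "memory");
    return 1;
#else
    return 0; /* host build: there is no BTC to clear */
#endif
}
```

Because executing the instruction always clears the BTC, a helper like this is best called only where a clear is actually intended.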
There are three cases where the branch prediction can cause a mispredict:
- A conditional branch that should not be taken is actually taken.
- A conditional branch that should be taken is not taken.
- The target address of a return instruction is incorrect, which might occur when returning from a function called from different places in the code.
All of these cases are detected and corrected when the branch or return instruction reaches the execute stage, and the branch prediction bits or target address are updated in the BTC, to reflect the actual instruction behavior. This correction incurs a penalty of 2 clock cycles for the 5-stage pipeline and 7 clock cycles (with MMU disabled) or 9 clock cycles (with MMU enabled) for the 8-stage pipeline due to additional instruction fetch pipeline stages.
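To estimate the cost of mispredictions, the penalties above can be put into a small helper. This is a sketch only: the type and function names are hypothetical, and the cycle counts are the ones quoted in the preceding paragraph.

```c
#include <stdbool.h>

typedef enum { PIPELINE_5_STAGE, PIPELINE_8_STAGE } pipeline_depth;

/* Penalty in clock cycles for one corrected misprediction. */
unsigned btc_mispredict_penalty(pipeline_depth depth, bool mmu_enabled)
{
    if (depth == PIPELINE_5_STAGE)
        return 2;                 /* 5-stage pipeline */
    return mmu_enabled ? 9 : 7;   /* 8-stage pipeline, MMU enabled or disabled */
}

/* Total cycles lost to a given number of mispredictions. */
unsigned long btc_mispredict_cycles(unsigned long mispredicts,
                                    pipeline_depth depth, bool mmu_enabled)
{
    return mispredicts * btc_mispredict_penalty(depth, mmu_enabled);
}
```

For example, 1000 mispredictions on the 8-stage pipeline with the MMU enabled cost 9000 clock cycles under these figures.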
The size of the BTC can be selected with C_BRANCH_TARGET_CACHE_SIZE. The default recommended setting uses one block RAM with 32-bit address (C_ADDR_SIZE = 32) and provides 512 entries. When selecting 64 entries or below, distributed RAM is used to implement the BTC; otherwise block RAM is used.
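The RAM selection rule can be expressed as a one-line check. In this sketch the function takes the number of entries directly rather than the encoded parameter value, and the names `btc_ram_type` and `btc_ram_for_entries` are hypothetical.

```c
typedef enum { BTC_RAM_DISTRIBUTED, BTC_RAM_BLOCK } btc_ram_type;

/* 64 entries or fewer use distributed RAM; larger sizes use block RAM. */
btc_ram_type btc_ram_for_entries(unsigned entries)
{
    return entries <= 64 ? BTC_RAM_DISTRIBUTED : BTC_RAM_BLOCK;
}
```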
When the BTC uses block RAM, and C_FAULT_TOLERANT is set to 1, block RAMs are protected by parity. In case of a parity error, the branch is not predicted. To avoid accumulating errors in this case, the BTC should be cleared periodically by a synchronizing branch.
The Branch Target Cache is available when C_USE_BRANCH_TARGET_CACHE is set to 1 and C_AREA_OPTIMIZED is set to 0 (Performance) or 2 (Frequency).
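A configuration script or build-time check could encode the availability condition as follows. This is a sketch under the assumption that both parameters are available as plain integers; the function name `btc_available` is hypothetical.

```c
#include <stdbool.h>

/* The BTC is available only when it is enabled and the processor is
 * optimized for Performance (0) or Frequency (2). */
bool btc_available(int c_use_branch_target_cache, int c_area_optimized)
{
    return c_use_branch_target_cache == 1 &&
           (c_area_optimized == 0 || c_area_optimized == 2);
}
```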