The instructions in the fetch and decode stages (as well as the prefetch buffer) are flushed when executing a taken branch. The fetch pipeline stage is then reloaded with a new instruction from the calculated branch address. A taken branch in MicroBlaze V takes a minimum of three clock cycles to execute, two of which are required for refilling the pipeline. To reduce this latency overhead, MicroBlaze V supports the optional branch target cache (BTC).
To improve branch performance, the branch target cache is coupled with a branch prediction scheme. With the BTC enabled, a correctly predicted branch or jump instruction incurs no overhead.
The BTC operates by saving the target address of each immediate branch and return instruction the first time the instruction is encountered. The next time the instruction is encountered, it is usually found in the BTC, and the instruction fetch program counter is changed to the saved target address if the branch is predicted taken. Jump instructions are always taken, whereas branches rely on branch prediction to avoid taking a branch that should not be taken, and vice versa.
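The lookup and update behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the direct-mapped organization, the index and tag scheme, the 2-bit saturating counter, and all names (`btc_entry_t`, `btc_lookup`, `btc_update`, `btc_clear`) are hypothetical and are not a description of the actual MicroBlaze V hardware.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BTC_ENTRIES 1024u  /* matches the 1024-entry setting; must be a power of two */

/* Hypothetical entry layout, for illustration only. */
typedef struct {
    bool     valid;
    bool     is_jump;  /* jumps and returns are always predicted taken */
    uint8_t  counter;  /* assumed 2-bit saturating counter, 0..3 */
    uint32_t tag;      /* full instruction address, used as the tag for simplicity */
    uint32_t target;   /* saved target address */
} btc_entry_t;

static btc_entry_t btc[BTC_ENTRIES];

/* Assumed index function: low address bits, 2-byte instruction alignment. */
static uint32_t btc_index(uint32_t pc) { return (pc >> 1) & (BTC_ENTRIES - 1u); }

/* Invalidate every entry, modeling the effect of fence.i on the BTC. */
void btc_clear(void) { memset(btc, 0, sizeof btc); }

/* Fetch-stage lookup: returns true and sets *next_pc when the fetch
 * program counter should be redirected to the saved target. */
bool btc_lookup(uint32_t pc, uint32_t *next_pc)
{
    const btc_entry_t *e = &btc[btc_index(pc)];
    if (!e->valid || e->tag != pc)
        return false;                      /* not in the BTC: no prediction */
    if (e->is_jump || e->counter >= 2u) {  /* predicted taken */
        *next_pc = e->target;
        return true;
    }
    return false;                          /* predicted not taken */
}

/* Execute-stage update with the actual outcome of the instruction. */
void btc_update(uint32_t pc, bool is_jump, bool taken, uint32_t target)
{
    btc_entry_t *e = &btc[btc_index(pc)];
    if (!e->valid || e->tag != pc) {       /* first encounter: allocate the entry */
        e->valid = true;
        e->is_jump = is_jump;
        e->counter = taken ? 2u : 1u;
        e->tag = pc;
        e->target = target;
        return;
    }
    if (taken) {
        if (e->counter < 3u)
            e->counter++;
        e->target = target;                /* also corrects a stale return target */
    } else if (e->counter > 0u) {
        e->counter--;
    }
}
```

In this model, a branch at a given address misses on its first lookup, is recorded by `btc_update` once its outcome is known, and redirects fetch on later lookups for as long as its counter predicts taken.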
The BTC is cleared when a fence.i instruction is executed.
Branch prediction can cause a mispredict in the following cases:
- A branch that should not have been taken is taken.
- A branch that should have been taken is not taken.
- The target address of a return instruction is incorrect in the BTC, which might occur when returning from a function called from different places in the code.
All of these cases are detected and corrected when the branch or return instruction reaches the execute stage, and the branch prediction bits or target address are updated in the BTC, to reflect the actual instruction behavior. This correction incurs a penalty of two clock cycles for the 3-stage, 4-stage, and 5-stage pipeline and 7–9 clock cycles for the 8-stage pipeline due to the pipelined instruction fetch.
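The penalties above can be combined into a rough estimate of the average cost of a branch. The sketch below assumes that a correctly predicted branch costs one cycle (inferred from the three-cycle taken branch, two cycles of which are pipeline refill) and that every mispredict pays the full correction penalty. The function name `avg_branch_cycles` is hypothetical.

```c
/* Estimated average cycles per branch.
 *   mispredict_rate: fraction of branches mispredicted, in [0, 1]
 *   base_cycles:     cycles for a correctly predicted branch (assumed 1)
 *   penalty_cycles:  2 for the 3-, 4-, and 5-stage pipelines,
 *                    7 to 9 for the 8-stage pipeline */
double avg_branch_cycles(double mispredict_rate, double base_cycles,
                         double penalty_cycles)
{
    return base_cycles + mispredict_rate * penalty_cycles;
}
```

Under these assumptions, a 10% mispredict rate on a 5-stage pipeline gives about 1.2 cycles per branch, while the same rate on the 8-stage pipeline gives between 1.7 and 1.9 cycles.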
The size of the BTC can be selected with C_BRANCH_TARGET_CACHE_SIZE. The default recommended setting uses two block RAMs and provides 1024 entries. When selecting 64 entries or below, distributed RAM is used to implement the BTC, otherwise block RAM is used.
When the BTC uses block RAM, and C_FAULT_TOLERANT is set to 1, block RAMs are protected by parity. In case of a parity error, the branch is not predicted. To prevent accumulating errors in this case, the BTC should be cleared periodically by a fence.i instruction.
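One way to perform the periodic clear is to execute fence.i from a periodic timer tick. The following is a minimal sketch, assuming a GCC or Clang toolchain with extended inline assembly and a target built with fence.i support enabled. The names `btc_flush_fence_i` and `btc_scrub_tick` and the interval value are hypothetical, and the interval should be chosen to suit the application.

```c
#include <stdint.h>

#define BTC_SCRUB_INTERVAL_TICKS 1000u  /* hypothetical interval; tune per application */

/* Clear the BTC by executing fence.i. On non-RISC-V hosts this reduces to a
 * compiler barrier so that the sketch can still be built and exercised. */
static inline void btc_flush_fence_i(void)
{
#if defined(__riscv)
    __asm__ volatile ("fence.i" ::: "memory");
#else
    __asm__ volatile ("" ::: "memory");
#endif
}

static uint32_t ticks_since_scrub;

/* Call once per timer tick. Returns 1 when a clear was issued on this tick,
 * 0 otherwise. */
int btc_scrub_tick(void)
{
    if (++ticks_since_scrub < BTC_SCRUB_INTERVAL_TICKS)
        return 0;
    ticks_since_scrub = 0;
    btc_flush_fence_i();
    return 1;
}
```

Because fence.i also synchronizes the instruction stream, it may cost more than the BTC clear alone, so the interval is a trade-off between error accumulation and run-time overhead.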