VCU Latency Modes - 2023.1 English

H.264/H.265 Video Codec Unit v1.2 Solutions LogiCORE IP Product Guide (PG252)

The VCU supports four latency modes: normal latency, reduced latency (also called no-reordering mode), low latency, and Xilinx low latency. The instantaneous latency of the pipeline can vary with the frame structure, encoding standard, level, profile, and target bitrate.

Normal Latency
The VCU encoder and decoder work at frame level. All frame types (I, P, and B) are supported, and there is no restriction on the GOP structure. The end-to-end latency depends on the profile/level, the GOP structure, and the number of internal buffers used for processing. This is the standard latency mode and can be used with any control rate mode.
No Reordering (Reduced-Latency)
The VCU encoder works at frame level, as in normal-latency mode. Hardware rate control is used to reduce bitrate variations. The I-only, IPPP, and low-delay-P GOP structures are supported. There is no output reordering, which reduces latency on the decoder side.
Low-Latency
The frame is divided into multiple slices; the VCU encoder output and decoder input are processed in slice mode. The VCU encoder input and decoder output still work in frame mode. The VCU encoder generates a slice-done interrupt at the end of every slice and outputs the stream buffer for that slice, which is immediately available to the next element for processing. With multiple slices, the VCU processing latency can therefore be reduced from one frame to one frame divided by the number of slices. In low-latency mode, a maximum of four streams for the encoder and two streams for the decoder can be run.
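The one-frame to one-frame/num-slices relationship can be illustrated with a small calculation. This is a minimal sketch, not a measurement: it assumes that processing a frame takes exactly one frame period and that all slices are the same size. The function name `slice_latency_ms` is hypothetical and not part of any VCU API.

```python
def slice_latency_ms(fps, num_slices):
    """Idealized hand-off latency of one VCU stage, in milliseconds.

    Assumption: processing one frame takes one frame period, and every
    slice is the same size. Real latency also depends on profile/level,
    GOP structure, and bitrate.
    """
    if fps <= 0 or num_slices < 1:
        raise ValueError("fps must be positive and num_slices at least 1")
    frame_period_ms = 1000.0 / fps
    return frame_period_ms / num_slices


# Compare frame-level hand-off (1 slice) with slice-level hand-off.
for n in (1, 4, 8):
    print(f"{n} slice(s) at 60 fps: {slice_latency_ms(60, n):.2f} ms")
```

Under these assumptions, a 60 fps stream split into 8 slices hands data to the next element after roughly 2.08 ms instead of 16.67 ms.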
Xilinx Low-Latency
In low-latency mode, the VCU encoder and decoder work at subframe (slice) boundaries, but the components at the encoder input and decoder output, namely the capture DMA and display DMA, still work at frame boundaries. This means that the encoder can read input data only after the capture has finished writing a full frame. In Xilinx low-latency mode, the capture and display also work at subframe level, which reduces the pipeline latency significantly. This is made possible by letting the producer (capture DMA) and the consumer (VCU encoder) work on the same input buffer concurrently while keeping the two synchronized: a consumer read request is unblocked only once the producer has finished writing the data required for that request. This synchronization is managed by a separate IP block called the synchronization IP.

Similarly, the decoder and the display are allowed concurrent access to the same buffer, but there is no separate hardware synchronization IP block between them. Instead, software handles the synchronization by making sure that display of a buffer starts only once the decoder has written at least half a frame period of data.

Like low-latency mode, Xilinx low-latency mode supports a maximum of four streams for the encoder and two streams for the decoder. See VCU Sync IP v1.0 for more information.
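The encoder-side rule described in this section (a consumer read is unblocked only once the producer has written the data that read needs) can be modeled in software. The following is a minimal sketch of that gating rule only, using Python threads in place of the capture DMA and the VCU encoder. The class `FrameBufferSync` and its methods are hypothetical names for illustration; they do not reflect the register interface or driver API of the synchronization IP, and tracking progress in whole lines is an assumption.

```python
import threading


class FrameBufferSync:
    """Toy model of the gating rule; not the synchronization IP's interface."""

    def __init__(self, total_lines):
        self.total_lines = total_lines
        self._written = 0                      # lines the producer has written
        self._cond = threading.Condition()

    def producer_wrote(self, num_lines):
        """Called by the producer (capture model) after writing more lines."""
        with self._cond:
            self._written = min(self.total_lines, self._written + num_lines)
            self._cond.notify_all()

    def wait_until_written(self, end_line):
        """Block the consumer until lines [0, end_line) are written."""
        with self._cond:
            self._cond.wait_for(lambda: self._written >= end_line)
            return self._written


def run_demo(total_lines=1080, num_slices=4, chunk_lines=90):
    sync = FrameBufferSync(total_lines)
    slice_lines = total_lines // num_slices
    log = []

    def consumer():
        # Encoder model: read one slice at a time from the shared buffer.
        for index in range(num_slices):
            end_line = (index + 1) * slice_lines
            written = sync.wait_until_written(end_line)
            log.append((index, end_line, written))

    thread = threading.Thread(target=consumer)
    thread.start()
    # Capture model: write the frame into the same buffer in small chunks.
    for _ in range(total_lines // chunk_lines):
        sync.producer_wrote(chunk_lines)
    thread.join()
    return log


for index, end_line, written in run_demo():
    print(f"slice {index}: needed {end_line} lines, {written} were written")
```

Each log entry shows that a slice was read only after at least the lines it needs were present, while the consumer did not have to wait for the whole frame in principle.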

The combined streams must fit within a total bandwidth equivalent to 4kp60. The following stream combinations are possible for each latency mode:

  • Possible combinations for normal and reduced latency:
    • For live and file sources:
      • One instance of 3840x2160p60
      • Two instances of 3840x2160p30
      • Four instances of 1920x1080p60
      • Eight instances of 1920x1080p30
    • For file source only:
      • 32 instances of 640x480@30
      • 32 instances of 720x480@30
  • Possible combinations for low latency and Xilinx low latency:
    • Four instances of 1920x1080p60 streams at encoder and two instances of 1920x1080p60 streams at decoder
    • Two instances of 3840x2160p30 streams at encoder and two instances of 3840x2160p30 streams at decoder
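The combinations above can be sanity-checked against the 4kp60 limit. The sketch below assumes that bandwidth is proportional to pixel rate (width × height × frames per second), which is a simplification, and that the check is applied to one side (encoder or decoder) at a time. It does not model the separate stream-count limits of the low-latency modes. The names `pixel_rate` and `fits_4kp60_budget` are hypothetical.

```python
# Pixel rate of a single 3840x2160p60 stream, used as the budget.
BUDGET_4KP60 = 3840 * 2160 * 60


def pixel_rate(width, height, fps, instances=1):
    """Pixels per second for `instances` identical streams."""
    return width * height * fps * instances


def fits_4kp60_budget(streams):
    """streams: iterable of (width, height, fps, instances) tuples.

    Assumption: pixel rate is used as a proxy for VCU bandwidth.
    """
    return sum(pixel_rate(*s) for s in streams) <= BUDGET_4KP60


combinations = {
    "1 x 3840x2160p60": [(3840, 2160, 60, 1)],
    "2 x 3840x2160p30": [(3840, 2160, 30, 2)],
    "4 x 1920x1080p60": [(1920, 1080, 60, 4)],
    "8 x 1920x1080p30": [(1920, 1080, 30, 8)],
    "32 x 640x480@30": [(640, 480, 30, 32)],
    "32 x 720x480@30": [(720, 480, 30, 32)],
}

for name, streams in combinations.items():
    print(f"{name}: fits = {fits_4kp60_budget(streams)}")
```

Under this proxy, the first four combinations use the budget exactly, while the two 32-instance combinations stay below it.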