GPU IP Architecture - AM026

Versal AI Edge Series Gen 2 and Prime Series Gen 2 Technical Reference Manual (AM026)

Document ID: AM026
Release Date: 2025-12-23
Revision: 1.3 English

The Arm® Mali™-G78AE GPU is part of the Valhall architecture family. The G78AE architecture supports a modular configuration that adapts to both safety-critical and performance-driven applications by organizing its shader cores into core clusters, slices, and partitions. These elements are structured as described below:

  • Core clusters: Each core cluster contains multiple shader cores designed for a variety of tasks, such as vertex shading, pixel processing, and machine learning computations. These clusters form the foundation of the GPU's parallel processing capabilities, with shader cores dedicated to specific functions based on workload requirements.
  • Slices: A slice is the fundamental unit of GPU resource allocation, consisting of shader cores grouped together with their own tiling unit (tiler) for organizing rendering tasks. Each slice operates as an independent hardware unit that can be allocated as needed to optimize performance. When multiple slices are grouped into a partition, only the tiler of the first slice in the partition remains active to manage data processing; the other tilers are disabled to conserve resources.
  • Partitions: Partitions are virtualized, isolated segments of the GPU, composed of one or more slices. Each partition functions as a standalone processing environment, capable of handling independent tasks with dedicated resources. The GPU supports configurations with either two partitions (each containing two slices) or two slices in a 2-core-per-slice structure. This flexible architecture ensures that critical tasks, such as safety applications, remain isolated and secure, operating independently of other GPU activities.
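The slice/partition organization described above can be modeled conceptually. In the sketch below, the class names, the 2-core-per-slice count, and the helper function are illustrative assumptions, not a hardware or driver API; the tiler rule (only the first slice of a partition keeps its tiler active) follows the text.

```python
# Hypothetical model of the Mali-G78AE slice/partition organization.
# Names and counts are illustrative assumptions, not the hardware API.

from dataclasses import dataclass, field

@dataclass
class Slice:
    cores: int = 2           # shader cores grouped in this slice (assumed count)
    tiler_active: bool = True

@dataclass
class Partition:
    slices: list = field(default_factory=list)

    def __post_init__(self):
        # Only the tiler of the first slice in a partition stays active;
        # the tilers of the remaining slices are disabled to save resources.
        for i, s in enumerate(self.slices):
            s.tiler_active = (i == 0)

def two_partition_config():
    """One of the supported configurations: two partitions of two slices each."""
    return [Partition([Slice(), Slice()]) for _ in range(2)]

gpu = two_partition_config()
active_tilers = sum(s.tiler_active for p in gpu for s in p.slices)
print(active_tilers)  # one active tiler per partition -> 2
```

This makes the isolation property concrete: each partition is a self-contained list of slices, and resource accounting (active tilers, total cores) is per-partition.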

Figure 1. Conceptual GPU Device (Mali-G78AE)

Graphics and Compute Pipeline

The GPU pipeline supports both graphics and compute tasks, optimizing for the concurrent execution of shader-based computations. With compute shaders, the GPU can handle complex, parallelized workloads such as image processing, object detection, and real-time autonomous decision-making, ideal for applications in automotive, healthcare, and industrial automation. Each shader core has two parallel data paths for issuing threads to the core, one for non-fragment workloads and one for fragment workloads as shown in the figure below.
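As a conceptual illustration of the two per-core issue paths, the sketch below routes work to a fragment or non-fragment queue. The workload labels and the routing rule are illustrative assumptions for this sketch, not a driver interface.

```python
# Illustrative sketch of the dual thread-issue paths per shader core:
# one queue for non-fragment work (e.g. vertex, compute) and one for
# fragment work, as described in the text.

from collections import deque

FRAGMENT_TYPES = {"fragment"}  # assumed classification for this sketch

def route(workloads):
    """Route each (kind, name) workload to the fragment or non-fragment path."""
    non_fragment, fragment = deque(), deque()
    for kind, name in workloads:
        (fragment if kind in FRAGMENT_TYPES else non_fragment).append(name)
    return non_fragment, fragment

nf, f = route([("vertex", "v0"), ("fragment", "f0"), ("compute", "c0")])
print(list(nf), list(f))  # ['v0', 'c0'] ['f0']
```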

Asynchronous Compute Engines

To boost parallelism, the GPU includes asynchronous compute engines that operate independently from the graphics pipeline. This setup allows compute tasks, such as image recognition and sensor fusion, to run in parallel without disrupting graphics rendering, maximizing the efficiency and throughput of the GPU for multi-application use cases. For shader-bound content, the functional unit with the highest loading is likely to be the bottleneck. To improve performance, reduce the number of operations of that type in the shader. Alternatively, reduce the precision of the operations to use 8-bit and 16-bit types so that multiple operations are performed in parallel. For thermally bound content, reducing the critical path load gives the biggest gain because it allows a lower operating frequency. However, reducing load on any functional unit helps improve energy efficiency.
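The precision advice above can be quantified with back-of-the-envelope arithmetic: each 32-bit lane can instead process two 16-bit operations per clock, as the FMA and CVT pipeline descriptions later in this section state. The four-way 8-bit factor below is an assumption extrapolated from the 16-bit case, not a figure from the source.

```python
# Why narrower types raise arithmetic throughput: a 16-wide-warp pipeline
# issues one 32-bit op, two 16-bit ops, or (assumed) four 8-bit ops per
# thread per clock.

WARP_WIDTH = 16  # threads per warp in each arithmetic pipeline

def ops_per_clock(bits):
    """Operations issued per pipeline per clock for a given operand width."""
    lanes_per_thread = 32 // bits  # 32-bit: 1, 16-bit: 2, 8-bit: 4 (assumed)
    return WARP_WIDTH * lanes_per_thread

for bits in (32, 16, 8):
    print(bits, ops_per_clock(bits))  # 32->16, 16->32, 8->64
```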

Figure 2. Valhall GPU Shader Core

Cache Hierarchy and Error Correction

The GPU's multi-level cache structure improves memory access efficiency, minimizing latency for high-frequency data retrieval. ECC (Error Correction Code) is applied to cache and memory accesses, ensuring reliable data integrity for safety-critical applications.
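To illustrate the kind of protection ECC provides on cache and memory accesses, the sketch below implements a textbook Hamming(7,4) code protecting a 4-bit value. This is a conceptual model only: real GPU ECC protects much wider words and is typically SECDED (it also detects double-bit errors via an extra overall parity bit, which this minimal sketch omits).

```python
# Textbook Hamming(7,4) single-error correction, as a conceptual model of
# ECC-protected storage. Not the actual Mali-G78AE ECC scheme.

def encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def decode(code):
    """Correct up to one flipped bit, then return the original 4-bit value."""
    c = list(code)
    s = (c[0] ^ c[2] ^ c[4] ^ c[6]) \
        + 2 * (c[1] ^ c[2] ^ c[5] ^ c[6]) \
        + 4 * (c[3] ^ c[4] ^ c[5] ^ c[6])
    if s:                      # non-zero syndrome = 1-based error position
        c[s - 1] ^= 1
    return c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)

word = encode(0b1011)
word[4] ^= 1                   # inject a single-bit fault
print(bin(decode(word)))       # → 0b1011, the fault is corrected
```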

Dedicated AXI Bus Interfaces

To further isolate GPU functions, three dedicated AXI interfaces (AXI-A, AXI-B, AXI-C) control data flow across the different partitions.

Mali Valhall Core Architecture

Arm Mali Valhall GPU shader cores have six parallel pipeline classes, comprising three arithmetic pipelines and three fixed-function support pipelines. All Valhall GPUs implement two parallel processing engines, each containing its own set of arithmetic pipelines.

Arithmetic fused multiply accumulate unit (FMA)
The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle. Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.
Arithmetic convert unit (CVT)
The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.
Arithmetic special functions unit (SFU)
The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.
Load/store unit (LS)
The load/store pipeline handles all non-texture memory access, including buffer access, image access, and atomic operations.
Varying unit (V)
The varying pipeline is a dedicated pipeline which implements the varying interpolator.
Texture unit (T)
The texture pipeline handles all texture sampling and filtering operations.
Figure 3. Shader Core - Pipeline Classes
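The FMA figures above (a 16-wide warp issuing one 32-bit op per thread per clock, two processing engines per core) allow a rough peak-arithmetic estimate. In the sketch below, the core count and clock frequency are illustrative assumptions, not Versal device specifications; counting an FMA as two floating-point operations is the usual convention.

```python
# Rough peak FP32 estimate from the pipeline figures above. Core count and
# frequency are assumptions for illustration only.

WARP_WIDTH = 16       # threads per warp in each FMA pipeline
ENGINES_PER_CORE = 2  # processing engines per shader core
FLOPS_PER_FMA = 2     # a fused multiply-add counts as two floating-point ops

def peak_fp32_gflops(shader_cores, freq_ghz):
    fma_issues_per_clock = shader_cores * ENGINES_PER_CORE * WARP_WIDTH
    return fma_issues_per_clock * FLOPS_PER_FMA * freq_ghz

# e.g. an assumed 8-core configuration at an assumed 1.0 GHz:
print(peak_fp32_gflops(8, 1.0))  # 8 * 2 * 16 * 2 * 1.0 = 512.0 GFLOPS
```

Halving the operand width to 16-bit doubles this figure, which is the quantitative basis for the precision-reduction advice given earlier.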

Clock Domain

The GPU uses a single clock source from one of the PLLs in the PS High-Speed Connectivity module. The root clock is propagated to CLK, CLK_CG[0], and CLK_CG[1] on the GPU IP interface. These three clocks can be treated as asynchronous or synchronous to each other.

The GPU interface-based clock domains are listed as follows:

  • GPU host interface ports are synchronous to CLK; all I/Os except the AMBA® 5 ACE‑Lite masters are synchronous to CLK.
  • Each AMBA® 5 ACE‑Lite port is synchronous to its CLK_CG<x> clock.

Refer to the table below for the interface ports and their associated clock domains.

Table 1. GPU interface-based clock domains
Interface Name          Protocol Type  Data Width  Clock Domain     Clock Source
if_gpu_intmmi_acel_0    ACELite4       128         mmi_gpu_cg0_clk  MMI_CRX.MMI_PLL
if_gpu_intmmi_acel_1    ACELite4       128         mmi_gpu_cg0_clk  MMI_CRX.MMI_PLL
if_gpu_intmmi_acelite   ACELite4       256         ps_axi_dma_clk   PSXC_CRL.PPLL/NPLL/FLXPLL/RPLL
if_inimmi_gpu_axil_a    AXI-Lite       32          mmi_gpu_clk      MMI_CRX.MMI_PLL
if_inimmi_gpu_axil_b    AXI-Lite       32          mmi_gpu_clk      MMI_CRX.MMI_PLL
if_inimmi_gpu_axil_c    AXI-Lite       32          mmi_gpu_clk      MMI_CRX.MMI_PLL
if_intmmi_gpu_slcr_apb  APB4           32          mmi_lsbus_clk    PSXC_CRL.PPLL/NPLL/FLXPLL/RPLL

The GPU has two modes of operation:

  1. GPU mode (FAST mode): The GPU runs with mmi_gpu_clk (from the PLL) with a 5% boost in operational frequency.
  2. DC internal (segmented mode): The GPU clock source is limited to the ps_axi_dma_clk clock, while the PLL is used for display controller video and reference clock generation.

Refer to the figure below for the muxes that implement the two modes of operation for GPU and DC.

Figure 4. GPU Clock Mux
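The two-mode clock selection above can be sketched as a simple selector. Treating the 5% boost as a multiplicative factor on the PLL frequency is an assumption for this sketch, and the base frequencies in the example are illustrative, not device specifications.

```python
# Minimal sketch of the GPU clock mux behavior described in the text.
# The 5% FAST-mode boost is modeled as a multiplier (assumption).

def gpu_clock_mhz(mode, pll_mhz, ps_axi_dma_mhz):
    """Select the effective GPU clock per operating mode."""
    if mode == "FAST":        # GPU mode: mmi_gpu_clk from the PLL, +5%
        return pll_mhz * 1.05
    if mode == "SEGMENTED":   # DC internal: limited to ps_axi_dma_clk;
        return ps_axi_dma_mhz # PLL reserved for display controller clocks
    raise ValueError(f"unknown mode: {mode}")

print(gpu_clock_mhz("FAST", 1000, 800))       # 1050.0
print(gpu_clock_mhz("SEGMENTED", 1000, 800))  # 800
```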

Reset Schemes

Hard Reset

The active-low RESETn and RESET_RECOVERYn (which excludes the system error registers) signals, together with their “CHK” counterparts, initialize the whole GPU. The PMC (in the PSXC LPD) provides two MMI-level reset signals:

  • “FPD_PoR”, which is shared with the PSXC FPD.
  • “MMI_SYS_Reset”, which resets all MMI-level features and all MMI IP sub-blocks.

Soft Reset

The partition manager or the partitions can use registers to control software resets. There are two software reset types:

  • software controlled hard resets
    • group manager reset
  • software controlled soft resets
    • slice reset
    • group manager reset
    • GPU control-based reset
    • partition reset