P4 Pipeline Stages

Each P4 stage within a pipeline begins with a table engine (TE). The TE extracts bits and bytes from the packet header vector (PHV) in any combination to build table keys and to provide inputs to actions. Table keys constructed from multiple header fields can be up to 512 bits wide, or keys can be chained together to create extra-wide keys up to 2048 bits wide. The resulting key can be hashed or used as a direct index, then matched against any data structure in TCAM, SRAM, DRAM, or attached host memory.
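The key-building and indexing step can be pictured in software roughly as follows. This is a minimal illustrative C sketch, not the hardware implementation; the byte-granular extraction, the FNV-1a hash, and all names are hypothetical stand-ins for the TE's bit-level extraction and hash units.

#include <stdint.h>
#include <string.h>

#define KEY_BYTES 64            /* up to a 512-bit key */

/* Copy an arbitrary byte range of the PHV into the key buffer.
 * The real hardware extracts at bit granularity; bytes keep the sketch short. */
static void key_append(uint8_t *key, size_t *key_len,
                       const uint8_t *phv, size_t off, size_t len)
{
    memcpy(key + *key_len, phv + off, len);
    *key_len += len;
}

/* Toy FNV-1a hash standing in for the TE hash unit (hypothetical). */
static uint32_t key_hash(const uint8_t *key, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ key[i]) * 16777619u;
    return h;
}

/* A table is looked up either by hashing the key or by using the key itself
 * as a direct index, depending on table configuration. */
static uint32_t table_index(const uint8_t *key, size_t len,
                            int hashed, uint32_t table_size)
{
    if (hashed)
        return key_hash(key, len) % table_size;
    uint32_t idx = 0;
    memcpy(&idx, key, len < sizeof(idx) ? len : sizeof(idx));
    return idx % table_size;
}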

Figure 1. P4 Pipeline Stages

The table data and lookup result (match or no match) are forwarded to a match processing unit (MPU), along with the original key and metadata from the PHV. Architecturally, an important role of the TE is to provide a latency-tolerance mechanism that protects the MPUs from stalls. The TE logic processes multiple PHVs in advance of the MPUs, issuing high-latency table reads and moving to the next PHV before earlier table reads complete. As table responses return to the TE, lookup results and action entry vectors are passed to the MPUs for immediate execution. The MPUs never stall waiting for table results. This latency-tolerance feature of the TE allows tables to be stored in high-latency DRAM, enabling large-scale tables to be processed at high packet rates.
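The latency-tolerance behavior can be thought of as a window of in-flight lookups: the TE keeps issuing reads for later PHVs while earlier reads are still outstanding, and each result is handed to an MPU as soon as it returns. The C sketch below is only a software analogy of that idea; the window size, types, and functions are hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IN_FLIGHT 8   /* PHVs the TE may process ahead of the MPUs (hypothetical) */

struct pending_lookup {
    uint32_t phv_id;
    bool     valid;      /* slot holds an outstanding table read */
    bool     done;       /* table response has returned */
    uint64_t result;     /* lookup result / action entry vector */
};

static struct pending_lookup window[IN_FLIGHT];

/* Issue a table read for the next PHV without waiting for earlier reads. */
static bool te_issue(uint32_t phv_id)
{
    for (int i = 0; i < IN_FLIGHT; i++) {
        if (!window[i].valid) {
            window[i] = (struct pending_lookup){ .phv_id = phv_id, .valid = true };
            /* memory read launched here; the completion arrives later */
            return true;
        }
    }
    return false;            /* window full: only now does issue back-pressure */
}

/* Called when a table response returns; hands the result straight to an MPU. */
static void te_complete(int slot, uint64_t result)
{
    window[slot].done   = true;
    window[slot].result = result;
    printf("dispatch PHV %u to next free MPU, result 0x%llx\n",
           window[slot].phv_id, (unsigned long long)result);
    window[slot].valid = false;   /* slot is free for the next PHV */
}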

As the TE completes a table fetch and match operation, results are delivered to the next available MPU. Subsequent table fetches to the same entry that have serial update dependencies are locked by the TE to support single-flow processing and data ordering. Ordered table lookup results are delivered to an MPU along with an entry program counter (PC), calculated from the current table configuration and a programmable offset that can be stored in a table entry. Multiple MPUs in one stage can work on the same PHV at the same time, provided each MPU is accessing a different table result. MPUs have a dedicated write path to the stage data buffer (SDP), where PHVs are kept, and writes are merged at the bit level to support updating arbitrary header formats and alignments. After the final write completes, the PHV graduates to the next P4 stage.
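The bit-level merge of MPU writes into the stage data buffer can be illustrated as a read-modify-write over a byte array: a write of arbitrary bit offset and width disturbs only the bits it covers, which is what allows several MPUs to update different fields of the same PHV. The helper below is a hypothetical software sketch, not the hardware write path.

#include <stdint.h>

/* Merge 'width' bits of 'value' into the PHV at absolute bit offset 'bit_off'.
 * Bits outside [bit_off, bit_off + width) are left untouched. */
static void phv_merge_write(uint8_t *phv, uint32_t bit_off,
                            uint32_t width, uint64_t value)
{
    for (uint32_t i = 0; i < width; i++) {
        uint32_t bit  = bit_off + i;
        uint8_t  mask = (uint8_t)(1u << (7 - (bit & 7)));   /* big-endian bit order */
        uint8_t  v    = (value >> (width - 1 - i)) & 1u;    /* MSB of the field first */
        if (v)
            phv[bit >> 3] |= mask;
        else
            phv[bit >> 3] &= (uint8_t)~mask;
    }
}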

Starting from the entry PC, the MPU executes a run-to-completion program stored in DRAM. Instructions are fetched and cached in the stage instruction cache, which experiences a high hit rate due to functional code locality, because P4 divides the pipeline into discrete functional stages. The MPU implements a domain-specific instruction set architecture (ISA) with an emphasis on bit-field manipulation and fast header updates. In particular, the MPU ISA focuses on rapid field extraction from multiple sources (tables, registers, or the PHV) and forwards those fields directly to an ALU or straight to the PHV. MPU instruction types include register and field comparisons, branches, Boolean and arithmetic ALU operations, and memory load and store operations. In addition to these familiar CPU instructions, the MPU includes PHV write operations, packet sequence number comparison and modification, queue state reduction, leading 0 and 1 detection, and other special protocol acceleration instructions. A general-purpose, 64-bit-wide register file holds intermediate computational values, while a domain-specific, 512-bit-wide table entry vector and a 512-bit-wide header field vector provide operands directly to ALU instructions.
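As an illustration of the extract-and-forward style of the ISA, the sketch below pulls an arbitrary bit field out of a 512-bit table entry vector held as eight 64-bit words, the kind of operand an MPU would feed directly to an ALU. The word layout and helper are hypothetical and chosen only for this example.

#include <stdint.h>

/* 512-bit table entry vector as eight 64-bit words (word 0 holds the
 * most-significant bits; this layout is assumed only for the sketch). */
typedef struct { uint64_t w[8]; } vec512_t;

/* Extract 'width' bits (width <= 64) starting at bit offset 'off', counted
 * from the most-significant end of the vector. */
static uint64_t vec512_extract(const vec512_t *v, uint32_t off, uint32_t width)
{
    uint64_t out = 0;
    for (uint32_t i = 0; i < width; i++) {
        uint32_t bit  = off + i;                 /* 0 = MSB of the vector */
        uint32_t word = bit / 64;
        uint32_t pos  = 63 - (bit % 64);
        out = (out << 1) | ((v->w[word] >> pos) & 1u);
    }
    return out;
}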

In addition to efficient packet header manipulation, MPUs are used in the P4 pipelines to interpret and create memory descriptors that drive DMA operations based on packet content, flow state, or processor work queue entries. For example, received TCP packets are parsed, classified to a connection or flow, sequence checked, and DMA’d directly to associated data stream buffers. These data stream buffers can subsequently be passed to crypto offload processing for P4 TLS support, higher layer applications running on the host CPUs or the DPU Arm complex, or P4 proxy programs to attach the data stream to a new connection. Processing memory descriptors in the P4 pipeline has the additional advantage of adapting to customized descriptor formats, allowing P4 programs to read and write packets directly from/to Linux mbufs with associated metadata in the format required by Data Plane Development Kit (DPDK) or other drivers.
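A simplified picture of descriptor generation: after the P4 program has classified a TCP segment to a flow and located its payload, it fills in a memory descriptor that a DMA engine or driver then consumes. The descriptor layout and field names below are hypothetical, not the actual hardware or DPDK formats.

#include <stdint.h>

/* Hypothetical DMA descriptor: where the payload sits and where it should go. */
struct dma_desc {
    uint64_t src_addr;     /* packet buffer address of the TCP payload */
    uint64_t dst_addr;     /* data stream buffer chosen from the flow state */
    uint32_t len;          /* payload length in bytes */
    uint32_t flags;        /* e.g. end-of-record, generate completion */
};

/* Build a descriptor from parse results: 'pkt_addr' is the frame's buffer
 * address, 'payload_off' and 'payload_len' come from the parser, and
 * 'stream_buf' comes from the flow/connection state. */
static void build_rx_desc(struct dma_desc *d, uint64_t pkt_addr,
                          uint32_t payload_off, uint32_t payload_len,
                          uint64_t stream_buf, uint32_t flags)
{
    d->src_addr = pkt_addr + payload_off;
    d->dst_addr = stream_buf;
    d->len      = payload_len;
    d->flags    = flags;
}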

In a flow offload scenario, once a packet is received on the uplink interface in the P4 ingress pipeline, a lookup is made with its 5-tuple (source IP, destination IP, source port, destination port, and protocol). If a flow is found, it is used to determine the next hop, and therefore the outgoing interface on which the packet is to be forwarded.
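In software terms, the ingress lookup key and the information a flow hit provides look roughly like the following; the structure and field names are hypothetical stand-ins for the P4 table definitions.

#include <stdint.h>

/* 5-tuple used as the ingress flow key (IPv4 case shown). */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

/* What a flow hit provides to the rest of the pipeline. */
struct flow_info {
    uint32_t session_id;   /* session applicable to this flow */
    uint32_t nexthop_id;   /* picks the outgoing uplink via the next hop table */
};

enum verdict { FORWARD_VIA_NEXTHOP, REDIRECT_TO_CPU };

/* Flow hit: forward using the stored next hop; flow miss: NACL/CPU path. */
static enum verdict ingress_decision(const struct flow_info *hit)
{
    return hit ? FORWARD_VIA_NEXTHOP : REDIRECT_TO_CPU;
}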

Figure 2. Flow Offload Overview

There are two key concepts to understand here: flow hit and flow miss.

Flow hit
A flow that is matched in the table lookup is counted as a flow hit.

In the case of a flow hit, the flow table provides the uplink, the next hop to be used, and the session ID that applies to this flow. The flow table is simply a hash table, with a hash value/hint leading to each of the flow entries. Both iflow and rflow entries are stored in the flow table. To handle hash collisions, an overflow (ohash) table is needed, and if the flow table is placed in DDR memory, additional code must be written to handle the walk from the hash table to the ohash table.
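The hash-to-ohash handling referred to above can be sketched as follows: the primary bucket is selected by the key hash, and on a collision the entry's hint leads into the overflow (ohash) table. The sizes, names, and chaining scheme are hypothetical; the real tables live in DDR and are walked by P4/MPU code.

#include <stdbool.h>
#include <stdint.h>

struct flow_key5 { uint32_t src_ip, dst_ip; uint16_t sport, dport; uint8_t proto; };

struct flow_entry5 {
    struct flow_key5 key;
    uint32_t session_id;
    uint32_t ohash_hint;     /* index into the ohash table; 0 = no overflow */
    bool     valid;
};

#define FLOW_BUCKETS 4096    /* hypothetical sizes */
#define OHASH_SIZE   512

static struct flow_entry5 flow_tbl[FLOW_BUCKETS];
static struct flow_entry5 ohash_tbl[OHASH_SIZE];

static bool key5_eq(const struct flow_key5 *a, const struct flow_key5 *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->sport == b->sport && a->dport == b->dport && a->proto == b->proto;
}

/* Lookup: check the primary bucket first, then follow hints into the ohash table. */
static const struct flow_entry5 *flow_find(uint32_t hash, const struct flow_key5 *k)
{
    const struct flow_entry5 *e = &flow_tbl[hash % FLOW_BUCKETS];
    if (e->valid && key5_eq(&e->key, k))
        return e;                                 /* hit in the primary table */
    uint32_t hint = e->valid ? e->ohash_hint : 0;
    while (hint != 0) {                           /* collision: walk the ohash chain */
        e = &ohash_tbl[hint % OHASH_SIZE];
        if (e->valid && key5_eq(&e->key, k))
            return e;
        hint = e->ohash_hint;
    }
    return NULL;                                  /* flow miss */
}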

A flow-hit packet goes into the egress pipeline directly from ingress. A session lookup and a next hop lookup are performed using the result from the flow table, and the next hop table provides the outgoing uplink port for the packet. The packet is forwarded from egress to the corresponding uplink, so the destination device receives it.

Flow miss
A flow that is not matched in the table lookup is counted as a flow miss.

For a flow-miss packet, where no flow was found for the 5-tuple that was looked up, the NACL table is consulted to find an NACL entry created to redirect flow-miss packets. This entry points to a next hop corresponding to the CPU mnic0 interface. The DP app running on the Arm CPU polls for such packets.

Flow-miss packets redirected by the NACL go into the egress pipeline. The packet is sent to RXDMA with the CPU and mnic0 interface ID that was derived from the NACL. From RXDMA the packet goes to the DP app via DPDK. Once the DP app gets the packet, it creates the flows and sessions required for the packet, and injects it back into P4. Because the flows are now installed, P4 finds the flow when the packet goes through the ingress pipeline, and the next hop in the flow specifies where the packet should be forwarded. See the following figure.

Figure 3. Flow Miss
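The DP app side of the miss path shown in the figure can be pictured as the loop below: receive the redirected packet, install the flow and session, then reinject the packet so that its next pass through ingress is a flow hit. The helper functions are hypothetical placeholders, not the actual DP app or DPDK APIs.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handle for a packet received from RXDMA over the mnic0 path. */
struct pkt { uint8_t *data; uint32_t len; };

/* Hypothetical helpers: parse the 5-tuple, program the iflow/rflow and session
 * into the P4 tables, and reinject the packet toward TXDMA. */
bool parse_five_tuple(const struct pkt *p, void *key_out);
int  install_flow_and_session(const void *key);
int  reinject_packet(struct pkt *p);

/* Poll-loop body sketch for flow-miss packets arriving from the P4 pipeline. */
static void dp_app_poll_once(struct pkt *p)
{
    uint8_t key[16];                      /* room for a packed 5-tuple */
    if (!parse_five_tuple(p, key))
        return;                           /* not a packet we can classify */
    if (install_flow_and_session(key) == 0)
        reinject_packet(p);               /* next ingress pass will be a flow hit */
}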

This results in the following paths:

Flow Hit Path
Ingress → Out
Flow Miss Path
Ingress → Egress → RXDMA → DP App → TXDMA → Ingress → Out

This concludes the description of how flow-hit and flow-miss packets are handled in the pipeline.

The purpose of the DP app here is to create flows and sessions based on the packets coming in from P4. Other functions of the DP app include configuration, packet processing, and flow aging.
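Flow aging, one of the DP app functions mentioned above, is essentially a periodic sweep that removes flows that have been idle longer than a timeout. A minimal sketch, with hypothetical structures and timeout value:

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define MAX_FLOWS     4096
#define IDLE_TIMEOUT  30            /* seconds without traffic (hypothetical) */

struct aged_flow {
    bool   in_use;
    time_t last_seen;               /* updated whenever the flow carries traffic */
};

static struct aged_flow flows[MAX_FLOWS];

/* Remove the flow; in the real system this would also tear down the table
 * entries and session state programmed into hardware. */
static void delete_flow(int idx) { flows[idx].in_use = false; }

/* Periodic aging sweep run by the DP app. */
static void age_flows(void)
{
    time_t now = time(NULL);
    for (int i = 0; i < MAX_FLOWS; i++) {
        if (flows[i].in_use && now - flows[i].last_seen > IDLE_TIMEOUT)
            delete_flow(i);
    }
}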

After traversing all stages of the pipeline, the packet goes back to the packet buffer (PB) and, in this case, back out on an Ethernet port:

[22-10-14 20:27:04] P4 :: PBC-MODEL: RECEIVED PACKET ON PORT 7
[22-10-14 20:27:04] P4 :: PBC-MODEL: SENDING PACKET ON PORT 1