Stage 1 translation is intended to assist the operating system, whether it runs natively or inside a hypervisor. It works similarly to a traditional (single-stage) CPU MMU. Normally, an operating system fragments physical memory over time by continuously allocating and freeing heap memory for both the kernel and applications. Because of this, a virtualized system that includes a fragmented model between the IPA and PA spaces (where multiple guest operating systems share the same physical memory) is not advised.
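As an illustration of the single-stage translation described above, the following minimal sketch shows how scattered physical pages can be presented as one contiguous range of input addresses. It is a simplification under stated assumptions: the flat dictionary `page_table`, the 4 KiB `PAGE_SIZE`, the `translate` helper, and all addresses are hypothetical, and real stage 1 tables are multi-level structures defined by the architecture.

```python
PAGE_SIZE = 4096  # assumed 4 KiB translation granule


def translate(page_table, va):
    """Translate an input address to a physical address via a flat table.

    page_table maps a virtual page number to a physical frame number.
    This is a hypothetical model, not the architectural table format.
    """
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"translation fault at {va:#x}")
    return page_table[vpn] * PAGE_SIZE + offset


# Three scattered physical frames mapped at consecutive virtual pages,
# so the range looks like one contiguous 12 KiB buffer to its user.
page_table = {0x100: 0x8A3, 0x101: 0x012, 0x102: 0x5F0}

base = 0x100 * PAGE_SIZE
physical = [translate(page_table, base + i * PAGE_SIZE) for i in range(3)]
```

The input addresses increase linearly from `base`, while the entries in `physical` land in unrelated frames, which is how translation hides fragmentation from whoever uses the buffer.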
A typical way to obtain large contiguous regions of physical memory is to pre-allocate the buffers. This is inefficient because the memory stays reserved even though the buffer is only required at runtime. In a virtualized system, pre-allocation also requires the hypervisor to allocate any contiguous buffers to the guest operating system, which could require hypervisor modifications.
For a DMA device to operate on fragmented physical memory, a DMA scatter-gather mechanism is typically used, which increases software complexity and adds performance overhead. In addition, some devices cannot access the full memory range, for example 32-bit devices in a 64-bit system. One solution is to provide a bounce buffer: an intermediate area of memory at a low address that acts as a bridge. The operating system allocates pages in an address space visible to the device and uses them as buffer pages for DMA to and from the operating system. Once the I/O completes, the operating system copies the content of the buffer pages to its final destination. This copy adds significant overhead, which can be avoided by using an SMMU. I/O virtualization can be achieved by using stage 1 translation (for native operating systems) and stage 1 or stage 2 translation (for guest operating systems).
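The bounce buffer flow described above can be sketched as follows, to show where the extra copy comes from. This is an illustrative model only: the `Memory` class, `device_dma_write`, `receive_via_bounce_buffer`, and every address are hypothetical names and values, not part of any real driver or operating system API.

```python
DEVICE_ADDR_LIMIT = 1 << 32  # a 32-bit device can address only the low 4 GiB


class Memory:
    """Sparse byte-addressable memory, enough to model low and high regions."""

    def __init__(self):
        self._bytes = {}

    def write(self, addr, data):
        for i, value in enumerate(data):
            self._bytes[addr + i] = value

    def read(self, addr, length):
        return bytes(self._bytes.get(addr + i, 0) for i in range(length))


def device_dma_write(memory, addr, data):
    """Hypothetical 32-bit device writing data to memory by DMA."""
    if addr + len(data) > DEVICE_ADDR_LIMIT:
        raise ValueError("address not reachable by a 32-bit device")
    memory.write(addr, data)


def receive_via_bounce_buffer(memory, bounce_addr, dest_addr, data):
    """DMA into a low bounce buffer, then copy to the final destination."""
    device_dma_write(memory, bounce_addr, data)   # device-visible pages
    staged = memory.read(bounce_addr, len(data))  # extra copy by the OS
    memory.write(dest_addr, staged)
    return len(staged)  # number of bytes the OS had to copy


memory = Memory()
payload = b"disk block"
dest = 0x1_8000_0000  # above 4 GiB, so the device cannot reach it directly
copied = receive_via_bounce_buffer(memory, 0x0010_0000, dest, payload)
```

Every byte transferred is moved twice, once by the device and once by the operating system. With an SMMU, the device could instead be given a low input address that translates to `dest`, so the second copy would not be needed.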