This section describes other performance considerations.
Selecting a PCI Express Slot
AMD Solarflare PCI Express® (PCIe®) adapters are designed for x8 lane operation if they can be fitted to a x8 lane slot, or for x16 lane operation if they can only be fitted to a x16 lane slot.
The PCIe interface on your X4 series adapter can function at different speeds and widths of lanes. The adapter and server negotiate the highest speed and width that is mutually possible for the PCIe slot you are using. For maximum performance, ensure the following:
- Your server uses the same PCIe generation that your adapter uses, or a later generation. This ensures the speed used is the maximum possible.
- The slot you are using can not only physically accept the adapter, but also has at least the number of electrical lanes that your adapter hardware provides. This ensures the width used is the maximum possible.
Note: The negotiated width is independent of the physical slot size used to connect the adapter. On some server motherboards or risers, some slots (including those that are physically x8 or x16 lanes) might only electrically support x4 lanes. X4 series PCIe adapters continue to operate in x4 lane slots, but not at full speed.
The AMD Solarflare network driver warns if it detects that the adapter is placed in a sub-optimal slot, for example because the slot electrically has fewer than x8 lanes. You can view the warning messages with dmesg or in /var/log/messages.
To discover the currently negotiated PCIe lane
width and speed, use the lspci command:
lspci -d 1924: -vv
01:00.0 Ethernet controller: AMD Solarflare Device 0c03
...
LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 <64us
LnkSta: Speed 32GT/s (ok), Width x8 (ok)
Speed might be returned as unknown if the lspci utility you are using is too old to determine that a slot is using a more recent generation of PCIe.
A further consideration when choosing a PCIe slot is that the latency of communications between the host CPUs, system memory and the X4 series PCIe adapter might be PCIe slot dependent. Some slots might be “closer” to the CPU, and therefore have lower latency and higher throughput. If possible, install the adapter in a slot which is local to the desired NUMA node.
Consult your server documentation for more information.
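The comparison between the link capability (LnkCap) and the negotiated link (LnkSta) can be automated. The following is a minimal sketch, not part of the AMD tooling: the function name check_pcie_link and its output format are illustrative assumptions, and it relies only on the LnkCap and LnkSta lines shown in the lspci output above. On some systems lspci must be run as root to display these lines.

```shell
#!/bin/sh
# check_pcie_link (hypothetical helper): read `lspci -vv` output on stdin and
# compare the negotiated link (LnkSta) with the link capability (LnkCap).
check_pcie_link() {
  awk '
    # Device header line, for example "01:00.0 Ethernet controller: ..."
    /^([0-9a-fA-F]+:)?[0-9a-fA-F]+:[0-9a-fA-F]+\.[0-7] / { dev = $1 }

    /LnkCap:/ || /LnkSta:/ {
      # Pick out the values that follow the words "Speed" and "Width".
      speed = ""; width = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Speed") { speed = $(i + 1) }
        if ($i == "Width") { width = $(i + 1) }
      }
      gsub(/,/, "", speed); gsub(/,/, "", width)

      # Remember the capability; report when the status line arrives.
      if ($0 ~ /LnkCap:/) { cap_speed = speed; cap_width = width; next }

      if (speed == "unknown" || cap_speed == "unknown") {
        status = "UNKNOWN"      # lspci too old to decode the speed
      } else if (speed == cap_speed && width == cap_width) {
        status = "OK"
      } else {
        status = "DEGRADED"     # slot limits the speed or the width
      }
      printf "%s: %s %s (capable: %s %s) %s\n", dev, speed, width, cap_speed, cap_width, status
    }
  '
}

# Example:
#   lspci -d 1924: -vv | check_pcie_link
```

A result of DEGRADED suggests that the slot provides a lower PCIe generation or fewer electrical lanes than the adapter supports.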
CPU Speed Service
Some Linux distributions run the cpuspeed
service by default. This service controls the CPU clock speed dynamically according to current
processing demand. For latency sensitive applications, where the application switches between having
packets to process and having periods of idle time waiting to receive a packet, dynamic clock speed
control might increase packet latency. AMD recommend disabling the
cpuspeed service if minimum latency is the main consideration.
To stop the service temporarily:
systemctl stop cpuspeed
To disable the service across reboots:
systemctl disable cpuspeed
CPU Power Service
On more recent Linux distributions, cpuspeed is
replaced with cpupower. AMD
recommend disabling the cpupower service if minimum latency is the
main consideration.
To stop the service temporarily:
systemctl stop cpupower
To disable the service across reboots:
systemctl disable cpupower
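Before stopping or disabling these services, it can be useful to see which of them are present and running. The following sketch assumes a systemd-based distribution; the function name report_latency_services and the output wording are illustrative. A service that is not installed is reported as inactive and not-enabled.

```shell
#!/bin/sh
# report_latency_services (hypothetical helper): print whether each named
# service is currently running and whether it starts at boot.
report_latency_services() {
  for svc in "$@"; do
    if systemctl is-active --quiet "$svc"; then
      active=active
    else
      active=inactive
    fi
    if systemctl is-enabled --quiet "$svc" 2>/dev/null; then
      enabled=enabled
    else
      enabled=not-enabled
    fi
    printf '%s: %s, %s\n' "$svc" "$active" "$enabled"
  done
}

# Example:
#   report_latency_services cpuspeed cpupower
```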
Tuned Service
Some Linux distributions run the tuned service.
If minimum latency is your main consideration, you are advised to experiment:
Busy poll
If the kernel supports the busy poll features
(Linux 3.11 or later), and minimum latency is the main consideration, AMD recommend that you enable the busy_poll socket
options with a value of 50 microseconds as follows:
sysctl net.core.busy_poll=50 && sysctl net.core.busy_read=50
Only sockets having a non-zero value for SO_BUSY_POLL are polled, so you must do one of the following:
- Set the poll timeout with the global busy_read option, as shown above. (Setting busy_read also sets the default value for the SO_BUSY_POLL option.)
- Set the per-socket SO_BUSY_POLL socket option on selected sockets.
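Values set with the sysctl command do not persist across reboots. The following sketch shows one way to inspect the current global settings and to generate a configuration fragment that reapplies them at boot. The function names, the file name 90-busy-poll.conf, and the use of /etc/sysctl.d are assumptions that might not suit every distribution.

```shell
#!/bin/sh
# show_busy_poll (hypothetical helper): print the current global busy poll
# settings. The optional argument is the directory to read from.
show_busy_poll() {
  dir=${1:-/proc/sys/net/core}
  for name in busy_poll busy_read; do
    if [ -r "$dir/$name" ]; then
      printf 'net.core.%s = %s\n' "$name" "$(cat "$dir/$name")"
    else
      # Kernel is too old, or was built without busy poll support.
      printf 'net.core.%s = (not available)\n' "$name"
    fi
  done
}

# busy_poll_conf (hypothetical helper): write a sysctl fragment to stdout
# that sets both options to the given timeout in microseconds (default 50).
busy_poll_conf() {
  usec=${1:-50}
  printf 'net.core.busy_poll = %s\n' "$usec"
  printf 'net.core.busy_read = %s\n' "$usec"
}

# Example (as root; the file name is an assumption):
#   show_busy_poll
#   busy_poll_conf 50 > /etc/sysctl.d/90-busy-poll.conf
#   sysctl --system
```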
Memory bandwidth
Many chipsets use multiple channels to access main system memory. Maximum memory performance is only achieved when the chipset can make use of all channels simultaneously. You must take this into account when selecting the number of memory modules (DIMMs) to populate in the server. For optimal memory bandwidth in the system, it is likely that:
- all DIMM slots are populated
- all NUMA nodes have memory installed.
Consult the motherboard documentation for details.
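As a rough check of DIMM population, the output of the dmidecode utility can be summarized. The following sketch assumes that dmidecode is installed, that it is run as root, and that empty slots are reported with a Size of "No Module Installed"; the function name count_dimms is illustrative. It does not show which memory channel each DIMM uses, so the motherboard documentation remains the authority.

```shell
#!/bin/sh
# count_dimms (hypothetical helper): read `dmidecode -t 17` output on stdin
# and count populated and empty memory slots.
count_dimms() {
  awk '
    /^Memory Device$/ { in_dev = 1; next }   # start of one slot record
    /^$/              { in_dev = 0 }         # blank line ends the record
    in_dev && /^[ \t]*Size:/ {
      if ($0 ~ /No Module Installed/) { empty++ } else { populated++ }
    }
    END { printf "populated=%d empty=%d\n", populated, empty }
  '
}

# Example (as root):
#   dmidecode -t 17 | count_dimms
#
# To see which NUMA nodes have memory, compare these two sysfs files:
#   cat /sys/devices/system/node/online
#   cat /sys/devices/system/node/has_memory
```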
Server Motherboard, Server BIOS, and Chipset Drivers
Tuning or enabling other system capabilities might further enhance adapter performance. Consult the documentation for your server. Possible opportunities include tuning the PCIe memory controller (a PCIe Latency Timer setting is available in some BIOS versions).