Built-in shutdown logic protects the card from damage by removing power from the device when an electrical or thermal limit reaches or exceeds its shutdown threshold. The voltage regulator module (VRM) monitors the VCCINT current and temperature. When either value reaches or exceeds its threshold, card power is removed. A cold reboot of the server hosting the card is then required to reload the device configuration and re-enumerate the card. The following table lists the VRM's thermal and electrical card shutdown thresholds. The temperature threshold applies equally with and without AUX power connected.
Sensor Description | Card Shutdown Threshold |
---|---|
VCCINT Current | |
VCCINT Temperature | 125°C |
Unlike previous Alveo cards, the V80 card has no external system controller that monitors voltage and temperature thresholds. While the VRM protects the card from physical damage, precautions must be taken to stay below these limits and prevent a system failure. These precautions include installing the V80 card in a server that provides sufficient managed airflow, and implementing preventative in-system monitoring of temperature and voltage, either in the design on the Versal device or through communication with the host (PCIe in-band or BMC out-of-band).
The Alveo Versal Example Design includes example RPU firmware that collects telemetry on the card and can be used as a starting point for a thermal and electrical monitoring solution. It is the user's responsibility, however, to ensure that adequate protection measures, such as clock throttling, are implemented for their application to avoid the system failure that results from hitting the shutdown thresholds. Refer to the Alveo Versal Example Design documentation for more information on the example implementation.
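As an illustration of the kind of protection logic described above, the following is a minimal sketch of a clock-throttling decision with hysteresis. It is not taken from the Alveo Versal Example Design. The only value drawn from this document is the 125°C VCCINT temperature shutdown threshold; the `ThrottlePolicy` class, the `next_state` function, and the 105°C and 95°C trip points are hypothetical and must be chosen for the actual application. How the temperature is read and how the clock is throttled are left out because they depend on the design.

```python
from dataclasses import dataclass

# VCCINT temperature shutdown threshold from the table in this section.
VCCINT_TEMP_SHUTDOWN_C = 125.0


@dataclass(frozen=True)
class ThrottlePolicy:
    """Hypothetical trip points; both defaults are illustrative assumptions."""
    throttle_at_c: float = 105.0  # assumption: start throttling at or above this
    release_at_c: float = 95.0    # assumption: stop throttling at or below this

    def __post_init__(self) -> None:
        # Trip points must leave margin below the hardware shutdown threshold.
        if not (self.release_at_c < self.throttle_at_c < VCCINT_TEMP_SHUTDOWN_C):
            raise ValueError("expected release < throttle < shutdown threshold")


def next_state(throttled: bool, temp_c: float, policy: ThrottlePolicy) -> bool:
    """Return True if the clock should be throttled after this reading."""
    if temp_c >= policy.throttle_at_c:
        return True
    if temp_c <= policy.release_at_c:
        return False
    # Between the two trip points, keep the previous state (hysteresis).
    return throttled


if __name__ == "__main__":
    policy = ThrottlePolicy()
    throttled = False
    # Made-up readings for illustration only, not measured data.
    for temp_c in (90.0, 100.0, 106.0, 100.0, 94.0):
        throttled = next_state(throttled, temp_c, policy)
        print(f"{temp_c:5.1f} C -> throttled={throttled}")
```

The gap between the two trip points keeps the clock from toggling rapidly when the temperature hovers near a single limit. A comparable check could be added for VCCINT current once the applicable current threshold is known.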