Determining Device Maximum Temperature from Simulation
A successful simulation usually results in the temperature monitor point (generally at the center of the die) being within the temperature targets set for the simulation. For devices using HBM, there are two monitor points: one for the FPGA die and one for the HBM stacks. Even for devices that are composed of multiple die or multiple HBM stacks, the model simplifies the analysis to a single point for each grouping of die. Analysis of temperature gradients and hot-spot mitigation are generally not required, as long as the designed thermal solution makes proper contact. Because the design is programmable, power density maps can shift and change over time. For this reason, Xilinx requires only a uniform power map for simulation, and requires that the temperature measured by the device thermal diode through the system monitor (SYSMON) remains in specification. Xilinx accounts for any additional temperature gradients through device testing. The end goal of the thermal designer is for thermal simulations to meet targets at the monitor point with adequate thermal margin, and for the device to operate within that same target temperature as measured by the thermal diode, within the measurement tolerances of the circuitry.
Determining Amount of Thermal Margin
The amount of thermal margin you should apply to the simulation results is unique to each design and designer. However, all thermal designs should incorporate some thermal margin to ensure that uncertainties in the simulation do not lead to the device missing thermal targets in actual operation. Uncertainties in thermal design can arise in many places, including:
- Power used during simulation
- Thermal model uncertainties
- Device measurement system monitor (SYSMON) uncertainty
- TIM contact and thickness variations
- Heatsink manufacturing tolerances
- Airflow deviations
- Other device power and thermal exhaust disparities
Designers often incorporate margin by using worst-case, or sometimes worse than worst-case, parameters in simulations. For example, they might use the highest possible power, upper limits of ambient temperature, impeded airflow, improbably high TIM thickness and contact resistance, and more. Sometimes, they apply additional margin on top of that. In general, Xilinx does not recommend this practice, because it can lead to thermal over-design or to near-impossible scenarios that will never be seen. The likelihood of all worst-case parameters occurring at once is very low, so proper judgment must be used when entering simulation constraints and boundaries. Proper judgment is also needed when interpreting the results: the aim is reasonable assurance that the end design will perform as needed without being over-designed, which would add cost, area, weight, and other undesirable characteristics.
It is up to the designer to determine how much margin is needed to gain enough confidence in the thermal performance of the design. In general, the margin should be based on how confident the designer is that the simulation parameters, taken collectively, reflect what will actually occur in the final system. If there is low confidence that the simulation parameters will not be exceeded, more margin is likely to be needed than for a system that is not expected to exceed any of the parameters. Ultimately, the quality and certainty of the simulation parameters and of the simulation itself are the primary guide to how much margin is enough.
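As a minimal sketch of how a chosen margin can be checked against a simulation result, the following Python snippet compares a simulated monitor-point temperature with a target. The function names and all numeric values are hypothetical and for illustration only; the required margin is a designer's choice, not a Xilinx-specified figure.

```python
def thermal_margin_c(target_tj_c, simulated_tj_c):
    """Margin in degC between the target and the simulated monitor-point temperature."""
    return target_tj_c - simulated_tj_c


def meets_target(target_tj_c, simulated_tj_c, required_margin_c):
    """True if the simulation result leaves at least the required margin."""
    return thermal_margin_c(target_tj_c, simulated_tj_c) >= required_margin_c


# Hypothetical values, not taken from any data sheet or simulation.
target_tj_c = 100.0       # target junction temperature for the thermal design
simulated_tj_c = 92.0     # monitor-point temperature reported by the simulation
required_margin_c = 5.0   # margin the designer has decided to hold

margin_c = thermal_margin_c(target_tj_c, simulated_tj_c)
ok = meets_target(target_tj_c, simulated_tj_c, required_margin_c)
print(f"margin = {margin_c:.1f} degC, meets target with margin: {ok}")
```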
Relating Thermal Simulation Results Back to the Power Estimations
For the initial power estimations, an assumed fixed junction temperature is set to the target for the thermal design. After the thermal design becomes more final, it is suggested that you back-annotate the simulation results to the power estimations so that the device junction temperature is calculated dynamically. This allows more accurate power estimations and lets you monitor thermal margins as the internal design evolves. It can also serve as a constraint for the logical design, allowing better understanding and management of the design power so that the thermal design does not later prove to be over-designed or inadequate. To do this, derive a local ambient and an effective θJA. Together, these serve as a simplified representation of the thermal performance of the designed thermal system as it relates to the junction temperature of the Xilinx device.
The local ambient is generally the ambient temperature in the simulation as seen by the device. It can differ from the system-level ambient depending on whether the device is exposed to exhaust heat from other devices. In either case, set the local ambient to the value seen within the simulation. The effective θJA is a simplified thermal resistance value from device junction to ambient. It is used in power calculations to allow simple calculation of junction temperature based on estimated power, and to assess how an increase or decrease in power affects junction temperature given the effectiveness of the thermal system. It is calculated from simulation results by taking the calculated junction temperature as seen by the monitor point in the simulation model, subtracting the local ambient temperature used, and dividing the result by the power applied to the device in simulation: effective θJA = (junction temperature - local ambient) / power, in °C/W.
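The calculation can be sketched in a few lines of Python. This is an illustrative example only: the function names and the simulation values (10 W, 78.6 °C junction, 44.6 °C local ambient) are assumptions chosen for the example, not values from a real simulation. The junction-temperature estimate is a simple linear model and does not account for device power itself changing with temperature.

```python
def effective_theta_ja(t_junction_c, t_local_ambient_c, power_w):
    """Effective junction-to-ambient thermal resistance in degC/W."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (t_junction_c - t_local_ambient_c) / power_w


def junction_temperature(power_w, t_local_ambient_c, theta_ja_c_per_w):
    """Estimated junction temperature in degC for a given device power."""
    return t_local_ambient_c + power_w * theta_ja_c_per_w


# Hypothetical simulation results (assumed for illustration).
sim_power_w = 10.0          # power applied to the device in simulation
sim_tj_c = 78.6             # junction temperature at the monitor point
sim_local_ambient_c = 44.6  # local ambient used in the simulation

theta_ja = effective_theta_ja(sim_tj_c, sim_local_ambient_c, sim_power_w)

# Estimate the junction temperature if the design power grows to 12 W.
tj_at_12w = junction_temperature(12.0, sim_local_ambient_c, theta_ja)

print(f"effective thetaJA = {theta_ja:.2f} degC/W")
print(f"estimated Tj at 12 W = {tj_at_12w:.1f} degC")
```

With these assumed inputs the effective θJA comes out at about 3.4 °C/W, so each additional watt of estimated power raises the estimated junction temperature by roughly 3.4 °C.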
The local ambient and effective θJA from the simulation results can be entered into XPE in the environment settings after the thermal design has been established as shown in the following figure.
The same values should also be applied in the Xilinx Vitis™ or Vivado® software as an XDC constraint so that the capabilities of the thermal design are properly reflected during FPGA design development:
set_operating_conditions -ambient_temp <temp_value> -thetaja <thetaja_value>
The unit for the -ambient_temp argument is °C and the unit for the -thetaja argument is °C/W. To convey the same values as shown in Figure 1, add the following to the Vivado XDC constraint file:
set_operating_conditions -ambient_temp 44.6 -thetaja 3.4
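If the local ambient and effective θJA are derived in a script, the constraint line can be generated rather than typed by hand. The following Python sketch shows one way to do this; the helper name xdc_operating_conditions is hypothetical, and only the set_operating_conditions command and its -ambient_temp and -thetaja arguments come from the text above.

```python
def xdc_operating_conditions(ambient_temp_c, theta_ja_c_per_w):
    """Format a set_operating_conditions line (degC and degC/W) for an XDC file."""
    return (
        f"set_operating_conditions -ambient_temp {ambient_temp_c:g} "
        f"-thetaja {theta_ja_c_per_w:g}"
    )


# Values matching the example constraint above.
line = xdc_operating_conditions(44.6, 3.4)
print(line)
```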
See XPE for Early Thermal Analysis for a video with more details on using thermal simulation results for power estimation.