Determining Device Maximum Temperature from Simulation
A successful simulation usually results in the temperature monitor point (generally at the center of the die) being within the temperature targets determined for the simulation. For devices using HBM, there are two monitor points: one for the FPGA die and one for the HBM stacks. Even for devices that are composed of multiple die or multiple HBM stacks, the model simplifies the analysis to a single point for a grouping of die. Temperature gradient and hot-spot mitigation are generally not required in this analysis as long as the designed thermal solution makes proper contact.
Because the design is programmable, power density maps can shift and change over time. AMD therefore requires only a uniform power map for simulation, and requires that the temperature measured by the device thermal diode through the system monitor (SYSMON) remain in specification. AMD accounts for any additional temperature gradients through device testing. The end goal of the thermal designer is for thermal simulations to meet targets at the monitor point with adequate thermal margin, and for the device to operate within that same target temperature as measured by the thermal diode, within the measurement tolerances of the circuitry.
Determining Amount of Thermal Margin
The amount of thermal margin you should apply to the simulation results is unique to each design and designer. However, all thermal designs should incorporate some thermal margin to ensure that uncertainties in the simulation do not lead to the device missing thermal targets in actual operation. Uncertainties in thermal design can occur in many places, including:
- Power used during simulation
- Thermal model uncertainties
- Device measurement SYSMON uncertainty
- TIM contact and thickness variations
- Heatsink manufacturing tolerances
- Airflow deviations
- Other device power and thermal exhaust disparities
Designers often incorporate margin by using worst-case or sometimes worse than worst-case parameters in simulations. For example, they might include the highest possible power, upper limits of ambient conditions, impeded airflow, and improbably high TIM thickness and contact resistance. Sometimes, they apply additional margin on top of that. In general, this practice is not recommended, as it can lead to worse than worst-case conditions that result in thermal over-design, or to near-impossible scenarios that never occur. The likelihood of all worst-case parameters occurring at once is very low, so proper judgment must be used when entering simulation constraints and boundaries. Proper judgment is also needed when interpreting the results, to gain reasonable assurance that the end design performs as needed without being over-designed, which results in greater cost, area, weight, and other undesirable characteristics.
In the end, it is up to the designer to determine the amount of margin needed to gain enough confidence in the thermal performance of the design. Generally, it should be based on how confident you are that the parameters used in the simulation, taken collectively, represent the final system. If confidence is low and most simulation parameters could be exceeded in the real system, more margin is needed than for a system that is not expected to exceed any of the parameters. Ultimately, the quality and certainty of the simulation parameters and the simulation itself serve as the primary guide to how much margin is enough.
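The margin check itself is simple arithmetic. The following is a minimal sketch, assuming margin is taken as the target temperature minus the simulated monitor-point temperature. The function names and all numeric values are hypothetical and chosen only for illustration; the required margin remains a designer decision.

```python
def thermal_margin(target_temp_c, simulated_temp_c):
    """Margin in °C between the thermal target and the simulated monitor point."""
    return target_temp_c - simulated_temp_c


def meets_target(target_temp_c, simulated_temp_c, required_margin_c):
    """True if the simulated temperature leaves at least the required margin."""
    return thermal_margin(target_temp_c, simulated_temp_c) >= required_margin_c


# Hypothetical values: 100 °C target, 92.5 °C simulated, 5 °C margin required.
margin = thermal_margin(100.0, 92.5)
passes = meets_target(100.0, 92.5, 5.0)
print(f"margin = {margin:.1f} C, meets target with margin: {passes}")
```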
Relating Thermal Simulation Results Back to the Power Estimations
For the initial power estimations, an assumed fixed junction temperature is set to the target for the thermal design. After the thermal design is closer to final, back-annotate the simulation results to the power estimations so that the device junction temperature is calculated dynamically. This allows more accurate power estimations and lets you monitor thermal margin as the internal design evolves. It can also serve as a constraint for the logical design, allowing better understanding and management of the design power so that the thermal design is not over-designed later. To do this, derive a local ambient and an effective θJA, which together serve as a simplified representation of how the designed thermal system performs relative to the junction temperature of the AMD device.
The local ambient is generally the ambient temperature in the simulation as seen by the device. It can differ from the system-level ambient depending on whether the device is exposed to exhaust heat of other devices. However, local ambient should be set to the value as seen within the simulation.
The effective θJA is a simplified thermal resistance value from device junction to ambient. It is used in power calculations to allow simple calculation of junction temperature from estimated power, and to assess how an increase or decrease in power affects junction temperature for the given thermal system. It is calculated from simulation results by taking the junction temperature seen at the monitor point in the simulation model, subtracting the local ambient temperature used, and dividing the difference by the power applied to the device in the simulation, which yields a value in °C/W.
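As a minimal sketch of this back-annotation arithmetic, the following takes the effective θJA as the junction-to-ambient temperature rise divided by the applied power (which gives units of °C/W), then reuses it to estimate junction temperature for a different power estimate. The function names are hypothetical, and the 15 W, 17 W, and 95.6 °C figures are assumed for illustration only.

```python
def effective_theta_ja(junction_temp_c, local_ambient_c, power_w):
    """Effective junction-to-ambient thermal resistance in °C/W."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (junction_temp_c - local_ambient_c) / power_w


def junction_temperature(local_ambient_c, theta_ja_c_per_w, power_w):
    """Estimated junction temperature in °C for a given power estimate."""
    return local_ambient_c + theta_ja_c_per_w * power_w


# Hypothetical simulation result: 15 W applied, monitor point at 95.6 °C,
# local ambient of 44.6 °C as seen by the device in the simulation.
theta_ja = effective_theta_ja(95.6, 44.6, 15.0)

# Estimate the junction temperature if the power estimate grows to 17 W.
tj_at_17w = junction_temperature(44.6, theta_ja, 17.0)

print(f"effective thetaJA = {theta_ja:.2f} C/W")
print(f"Tj at 17 W = {tj_at_17w:.1f} C")
```

With these assumed inputs the effective θJA works out to 3.4 °C/W, and the 2 W increase raises the estimated junction temperature to about 102.4 °C. This linear model is only a simplification of the simulated system and holds only near the conditions that were simulated.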
The local ambient and effective θJA from the simulation results can be entered into XPE or PDM in the environment settings after the thermal design has been established as shown in the following figure.
These values should also be used in the AMD Vitis™ or AMD Vivado™ software as XDC constraints to allow for proper understanding of the thermal design capabilities during FPGA design development:
set_operating_conditions -ambient_temp <temp_value> -thetaja <thetaja_value>
The unit for the -ambient_temp argument is °C, and the unit for the -thetaja argument is °C/W. To convey the same values as shown in Figure 1, add the following to the Vivado XDC constraint file:
set_operating_conditions -ambient_temp 44.6 -thetaja 3.4
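If the local ambient and effective θJA come out of a script, the constraint line can be generated rather than typed by hand. The following is a minimal sketch; the helper name is hypothetical, and only the set_operating_conditions command and its -ambient_temp and -thetaja arguments come from the text above.

```python
def operating_conditions_constraint(ambient_temp_c, theta_ja_c_per_w):
    """Format the XDC line from local ambient (°C) and effective thetaJA (°C/W)."""
    return (
        "set_operating_conditions "
        f"-ambient_temp {ambient_temp_c:g} -thetaja {theta_ja_c_per_w:g}"
    )


# Same values as the constraint shown above.
xdc_line = operating_conditions_constraint(44.6, 3.4)
print(xdc_line)
```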