You can build a clock multiplexer using a combination of parallel and cascaded BUFGCTRLs. The placer finds the optimal placement based on the clock buffer site availability. If possible, the placer places BUFGCTRLs in adjacent sites to take advantage of the dedicated cascade paths. If that is not possible, the placer will attempt to place the BUFGCTRLs from the same level in the adjacent clock regions.
The following figure shows a 4:1 MUX with balanced cascading. The first level of BUFGCTRL buffers are both placed in the directly adjacent sites (X10Y7, X10Y5) of the last BUFGCTRL (X10Y6). This configuration ensures a comparable insertion delay for all the clocks reaching the last BUFGCTRL. You can use an equivalent structure for a 3:1 MUX.
When creating a 5:1 or larger clock MUX structure, it is common to create a symmetrical clock structure as shown in the following figure. However, this is a suboptimal solution, because each BUFGCTRL only has one cascade path to the two adjacent BUFGCTRLs, which does not provide minimal delay for all connections between the BUFGCTRLs.
To support larger clock multiplexers (from 5:1 to 8:1 MUX), Xilinx recommends using cascaded BUFGCTRL buffers as shown in the following figure. This figure shows an optimal 8:1 MUX that uses 7 BUFGCTRL buffers.
When using wide BUFGCTRL-based clock multiplexers, the clock insertion delays cannot be balanced because some paths are longer than other paths in hardware. Therefore, this method is recommended for multiplexing asynchronous clocks only.