Pictures are coded as I-pictures (intra coded), P-pictures (predicted), or B-pictures (bidirectionally predicted). An I-picture is encoded using intra prediction only. A P-picture is encoded using intra prediction and inter prediction, referencing earlier I- or P-pictures. A B-picture uses inter prediction with two reference frames.
An Instantaneous Decoder Refresh (IDR) picture is an I-picture that causes the decoding process to mark all reference pictures as "unused for reference" immediately after the IDR picture is decoded.
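The effect of an IDR picture on reference marking can be illustrated with a minimal sketch. The class `ReferenceStore` and its methods are hypothetical names used only for illustration, and the sketch assumes that only I- and P-pictures are kept as references; it is not the decoder's actual buffer management.

```python
class ReferenceStore:
    """Toy model of reference marking (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps picture id -> True while the picture is still usable as a reference.
        self.used_for_reference = {}

    def decode(self, pic_id, pic_type, is_idr=False):
        if is_idr:
            # An IDR marks every previously stored picture as "unused for reference".
            for key in self.used_for_reference:
                self.used_for_reference[key] = False
        # Assumption: only I- and P-pictures are stored as references.
        self.used_for_reference[pic_id] = pic_type in ("I", "P")

    def usable_references(self):
        return [k for k, used in self.used_for_reference.items() if used]


store = ReferenceStore()
store.decode(0, "I", is_idr=True)
store.decode(1, "P")
store.decode(2, "I", is_idr=True)  # pictures 0 and 1 can no longer be referenced
```

After the second IDR, only that IDR picture itself remains available as a reference, so no later picture can predict from anything that preceded it.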
The following figure shows a basic GOP structure with a GOP length of 12 and 2 B-pictures between reference frames. Next to that, it shows the same GOP using B-pictures instead of P-pictures.
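As a rough illustration of the two layouts described above, the following sketch lists picture types in display order. The function names `basic_gop_types` and `all_b_variant` are hypothetical, and the sketch assumes that every (number of B-pictures + 1)-th picture after the leading I-picture is a reference picture; it does not model coding order or reference selection.

```python
def basic_gop_types(gop_length, num_b):
    """Return display-order picture types for one GOP (illustrative only)."""
    types = []
    for i in range(gop_length):
        if i == 0:
            types.append("I")       # each GOP starts with an I-picture
        elif i % (num_b + 1) == 0:
            types.append("P")       # assumed position of the reference pictures
        else:
            types.append("B")       # pictures between reference frames
    return types


def all_b_variant(types):
    """Same GOP, with B-pictures used in place of P-pictures."""
    return ["B" if t == "P" else t for t in types]


gop = basic_gop_types(12, 2)
print("".join(gop))                 # IBBPBBPBBPBB
print("".join(all_b_variant(gop)))  # IBBBBBBBBBBB
```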
Two low-delay GOP modes are also possible: one using only P-pictures and one using only B-pictures, each after a single I-picture. In the low-delay B mode, each B-picture uses the immediately preceding picture as its first reference. Its second reference lies several pictures back, at a distance defined by the GOP length (see the following figure, which has a GOP length of 3).
A pyramidal GOP has hierarchical B-pictures. The size of the hierarchy depends on the number of B-pictures specified. The following figure shows an example with a GOP length of 15 and 5 B-pictures.
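One common way to build such a hierarchy is to split the run of B-pictures between two reference frames in half repeatedly, assigning a deeper level at each split. The sketch below shows that generic bisection scheme; the function name `pyramid_levels` is hypothetical, and the exact layout used by a given encoder (including the one in the figure) may differ.

```python
def pyramid_levels(num_b):
    """Assign a hierarchy level to each B-picture between two reference frames.

    Level 1 is the top of the pyramid (the middle B-picture); larger numbers
    are deeper levels. Generic bisection, not a specific encoder's layout.
    """
    levels = [0] * num_b

    def assign(lo, hi, level):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        levels[mid] = level          # the middle picture of the run gets this level
        assign(lo, mid - 1, level + 1)
        assign(mid + 1, hi, level + 1)

    assign(0, num_b - 1, 1)
    return levels


print(pyramid_levels(5))  # [2, 3, 1, 2, 3]
```

With 5 B-pictures this bisection gives three levels, which shows how the depth of the hierarchy grows with the number of B-pictures.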