Difference between I-Frames, P-Frames and B-Frames

MPEG-2 is a widely used standard for lossy video compression, developed by the Moving Picture Experts Group (MPEG). An MPEG-2 encoder groups a number of video frames together in what’s called a GOP (Group of Pictures). The length of a GOP is not specified by the MPEG standard, and a video sequence can even contain GOPs of different lengths. A GOP has three types of frames:

I (intra) frames: I frames are intracoded frames. They don’t depend on any other frames.
P (predicted) frames: P frames are predicted frames. They depend only on the previous I or P frame.
B (bi-directional) frames: B frames are bidirectional frames. They can depend on both the previous and the next I or P frames.

Because I and P frames are used to predict other P and B frames, they are called reference frames. Predicting a frame from its reference frames in this way is called motion compensation.
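The dependency rules above can be sketched in a few lines of Python. This is only an illustration: the function name reference_frames and the idea of writing a GOP as a string of frame types in display order are assumptions made for this example, not part of the MPEG standard.

```python
def reference_frames(gop):
    """Map each frame index (display order) to the frames it may predict from.

    gop is a string of frame types in display order, e.g. "IBBPBBP".
    (Hypothetical helper, written only to illustrate the rules above.)
    """
    anchors = [i for i, t in enumerate(gop) if t in "IP"]
    refs = {}
    for i, frame_type in enumerate(gop):
        prev_anchor = max((a for a in anchors if a < i), default=None)
        next_anchor = min((a for a in anchors if a > i), default=None)
        if frame_type == "I":
            refs[i] = []                      # intracoded: no dependencies
        elif frame_type == "P":
            refs[i] = [prev_anchor] if prev_anchor is not None else []
        else:                                 # "B": previous and next anchor
            refs[i] = [a for a in (prev_anchor, next_anchor) if a is not None]
    return refs


gop = "IBBPBBP"
for index, deps in reference_frames(gop).items():
    print(index, gop[index], deps)
```

For the pattern IBBPBBP, each B frame lists the I or P frame on either side of it, while each P frame lists only the anchor before it.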

The interleaving of I, P, and B frames in a video sequence is content dependent. For example, video conferencing applications may employ more B frames since there is little motion in the video. On the other hand, sports content with rapid or frequent motion may require more I frames in order to maintain good video quality.

A GOP with a variable length adheres to the frame pattern but allows the flexibility of inserting an I frame when the video content demands it. For example, when there is a new scene in the content, an I frame can be inserted. This may potentially lead to better compression efficiency than periodically inserting an I frame in a GOP. Video conferencing applications typically do not require an I frame for every group of 10 frames because the content is relatively static with few scene changes. By conserving the I frames, more B and P frames can be used to improve compression efficiency and the GOPs become longer. However, there is a limit on the maximum GOP length because the P frames are dependent on the I frames for referencing.
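A variable-length GOP of this kind might be planned as in the sketch below. The scene_cuts flags, the max_gop limit and the anchor_spacing value are all assumptions chosen for illustration. A real encoder would detect scene changes itself and would also take care not to leave B frames without a following reference frame, which this sketch ignores.

```python
def assign_frame_types(scene_cuts, max_gop=12, anchor_spacing=3):
    """Return one frame type per input frame, as a string.

    scene_cuts[i] is True when frame i starts a new scene (assumed to be
    supplied by some earlier scene-detection step).
    max_gop caps the GOP length; anchor_spacing is the I/P frame distance.
    """
    types = []
    frames_in_gop = 0
    for i, cut in enumerate(scene_cuts):
        if i == 0 or cut or frames_in_gop >= max_gop:
            types.append("I")        # start a new GOP
            frames_in_gop = 1
        elif frames_in_gop % anchor_spacing == 0:
            types.append("P")
            frames_in_gop += 1
        else:
            types.append("B")
            frames_in_gop += 1
    return "".join(types)


# A scene change at frame 5 forces an early I frame.
print(assign_frame_types([False, False, False, False, False, True, False, False]))
```

With no scene changes, the I frames fall back to the periodic pattern set by max_gop.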

This article discusses the differences between I-frames, P-frames and B-frames in video compression.

I Frame

I‑frames are the least compressible but don’t require other video frames to decode. An I‑frame (intra-coded image) is a complete image, like a JPG or BMP image file. P and B frames hold only part of the image information (the part that changes between frames), so they need less space in the output file than an I‑frame.

The keyframe (I-frame) is the full frame of the image in a video. Subsequent frames, the delta frames, only contain the information that has changed. Keyframes will appear multiple times within a stream, depending on how it was created or how it’s being streamed.

I frames are key frames that provide checkpoints for resynchronization or re-entry to support trick modes (e.g., pause, fast forward, rewind) and error recovery. They are commonly inserted at a rate that is a function of the video frame rate. These frames are spatially encoded within themselves (i.e., independently coded or intracoded) and are reconstructed without any reference to other frames. Since frames after the I frame do not reference frames preceding it, I frames restrict the impact of error propagation caused by corrupted frames, thus providing some form of error resiliency.
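As a rough illustration of the re-entry idea, a player that wants to jump to a given frame can start decoding at the nearest I frame at or before it. The function below is a hypothetical helper written for this article, not taken from any real player.

```python
def random_access_point(frame_types, target):
    """Index of the nearest I frame at or before target (display order)."""
    for i in range(target, -1, -1):
        if frame_types[i] == "I":
            return i
    raise ValueError("no I frame at or before the target")


stream = "IBBPBBPIBBP"
print(random_access_point(stream, 5))
print(random_access_point(stream, 9))
```

The closer together the I frames are, the less decoding is needed before the requested frame can be shown.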

P Frame

A P‑frame (Predicted image) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car’s movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P‑frame, thus saving space. P‑frames are also known as delta‑frames.

P frames are temporally encoded using motion estimation and compensation techniques. P frames are first partitioned into blocks before motion-compensated prediction is applied. The prediction is based on a reference to the I or P frame that was most recently encoded and decoded before the P frame (i.e., the reference is a past frame that becomes a forward reference frame). Thus, P frames are forward predicted or extrapolated and the prediction is unidirectional. The residual is calculated between a block in the current frame and the matching block in that previously coded reference frame.
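The residual for one block can be sketched as follows. The tiny 4×6 frames, the 2×2 block size and the function names are assumptions made to keep the example readable; MPEG itself works on 16×16 macroblocks.

```python
def predict_block(reference, top, left, size, mv):
    """Fetch the block that the motion vector mv = (dy, dx) points at."""
    dy, dx = mv
    return [row[left + dx:left + dx + size]
            for row in reference[top + dy:top + dy + size]]


def residual_block(current, reference, top, left, size, mv):
    """Difference between the current block and its motion-compensated prediction."""
    predicted = predict_block(reference, top, left, size, mv)
    return [[current[top + r][left + c] - predicted[r][c] for c in range(size)]
            for r in range(size)]


# A bright 2x2 object moves one pixel to the right between the two frames.
reference = [[0, 0, 0, 0, 0, 0],
             [0, 9, 9, 0, 0, 0],
             [0, 9, 9, 0, 0, 0],
             [0, 0, 0, 0, 0, 0]]
current = [[0, 0, 0, 0, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 0, 0, 0, 0]]

print(residual_block(current, reference, 1, 2, 2, (0, -1)))  # correct vector
print(residual_block(current, reference, 1, 2, 2, (0, 0)))   # no motion assumed
```

With the correct motion vector the residual is all zeros, so almost nothing needs to be stored for this block. With a zero vector, part of the object shows up in the residual.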

B Frame

B‑frames can use both previous and following frames for data reference to get the highest amount of data compression. A B‑frame (bidirectional predicted image) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content. The I and P frames usually serve as references for the B frames (i.e., they are reference frames). The interpolation of two reference frames typically leads to more accurate interprediction (i.e., smaller residuals) than the single-reference prediction used by P frames. The residual is calculated between a block in the current frame and the past/future reference frames. Because a block may need two motion vectors instead of one, B frames incur a higher motion-vector overhead than P frames.
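A small worked example shows why interpolation can help. It assumes a slow fade, where the current block lies halfway between the past and future blocks. The block values and helper names are made up for illustration, and the rounded average used here is one plausible way to combine the two references.

```python
def interpolate(past_block, future_block):
    """Rounded average of the two reference blocks."""
    return [[(p + f + 1) // 2 for p, f in zip(row_p, row_f)]
            for row_p, row_f in zip(past_block, future_block)]


def residual(current_block, prediction):
    """Per-pixel difference between the current block and a prediction."""
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current_block, prediction)]


past = [[10, 12], [14, 16]]
future = [[30, 32], [34, 36]]
current = [[20, 22], [24, 26]]   # halfway through the fade

print(residual(current, past))                       # forward prediction only
print(residual(current, interpolate(past, future)))  # bidirectional prediction
```

Forward prediction alone leaves a residual of 10 in every pixel, while the interpolated prediction leaves a residual of zero.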

The main advantage of using B frames is coding efficiency. In most cases, B frames result in fewer bits being coded overall. Since B frames are not used to predict other frames, any errors they contain are not propagated further within the sequence.

One disadvantage is that the frame reconstruction memory buffers within the encoder and decoder must be doubled in size to accommodate the two anchor frames. B frames are typically omitted for constant bit rate (CBR) encoding (i.e., only I and P frames are employed).
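One way to see why both anchor frames must be held in memory is to look at the order in which frames have to be coded. A B frame cannot be decoded until the reference frame that follows it in display order is available, so that frame is sent first. The sketch below, with the hypothetical helper coded_order, shows this reordering for a frame-type string in display order.

```python
def coded_order(display_types):
    """Return display-order indices rearranged into coding order.

    Each run of B frames is moved behind the I or P frame that follows it,
    because that frame is needed as a backward reference.
    """
    order = []
    pending_b = []
    for i, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(i)       # wait for the next anchor
        else:
            order.append(i)           # anchor frame goes first
            order.extend(pending_b)   # then the B frames that needed it
            pending_b = []
    order.extend(pending_b)           # trailing B frames, if any
    return order


print(coded_order("IBBPBBP"))
```

For IBBPBBP, frame 3 (a P frame) is coded before frames 1 and 2, so the decoder holds frames 0 and 3 while it reconstructs the two B frames between them.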

Motion Estimation

Motion estimation in MPEG operates on macroblocks. A macroblock is a 16×16 pixel region of a frame. There are two primary types of motion estimation:

Forward prediction predicts how a macroblock from the previous reference frame moves forward into the current frame.
Backward prediction predicts how a macroblock from the next reference frame moves back into the current frame.

Examples are shown below, with Forward prediction (left-to-right) and Backward prediction (right-to-left).

[Figure: I, P and B frames]

Motion estimation operates as follows:

Compare a macroblock of the current frame against all 16×16 regions of the frame you are predicting from.
Then, select the 16×16 region with the least mean-squared error relative to the current macroblock. Encode a motion vector, which identifies the region you are predicting from, together with the error value for each pixel in the macroblock.
In practice, encoders usually run this search on the luminance (Y) samples only. The resulting motion vector is then reused for the U and V chroma samples, scaled to match any chroma subsampling.
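The search described in these steps can be sketched as an exhaustive (full) search. This is a minimal illustration in plain Python, not how a production encoder works: real encoders usually restrict the search to a window around the macroblock and use much faster search strategies. The function names and the small test frames are assumptions. The block size is a parameter so that the example can use 2×2 blocks instead of 16×16.

```python
def mse(current, reference, top, left, ref_top, ref_left, size):
    """Mean-squared error between a current block and a reference region."""
    total = 0
    for r in range(size):
        for c in range(size):
            diff = current[top + r][left + c] - reference[ref_top + r][ref_left + c]
            total += diff * diff
    return total / (size * size)


def full_search(current, reference, top, left, size=16):
    """Try every size x size region of the reference frame; keep the best one."""
    height, width = len(reference), len(reference[0])
    best = None
    for ref_top in range(height - size + 1):
        for ref_left in range(width - size + 1):
            error = mse(current, reference, top, left, ref_top, ref_left, size)
            if best is None or error < best[0]:
                best = (error, (ref_top - top, ref_left - left))
    best_error, (dy, dx) = best
    # Per-pixel error values that are coded alongside the motion vector.
    residual = [[current[top + r][left + c] - reference[top + dy + r][left + dx + c]
                 for c in range(size)] for r in range(size)]
    return (dy, dx), best_error, residual


reference = [[0, 0, 0, 0, 0, 0],
             [0, 9, 9, 0, 0, 0],
             [0, 9, 9, 0, 0, 0],
             [0, 0, 0, 0, 0, 0]]
current = [[0, 0, 0, 0, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 9, 9, 0, 0],
           [0, 0, 0, 0, 0, 0]]

motion_vector, error, residual = full_search(current, reference, top=1, left=2, size=2)
print(motion_vector, error, residual)  # (0, -1) 0.0 [[0, 0], [0, 0]]
```

Here the motion vector (0, -1) says the best match lies one pixel to the left in the reference frame, which matches an object that moved one pixel to the right.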

There are four types of macroblocks:

Forward Predicted: (P and B only) Predict from a 16×16 region in the previous reference frame
Backward Predicted: (B only) Predict from a 16×16 region in the next reference frame
Bidirectional Predicted: (B only) Predict from the average of a 16×16 region in the previous reference frame and a 16×16 region in the next reference frame.
Intracoded: (I, P, or B) Not predicted; the actual pixel values are used for the macroblock.

It is important to remember that P and B frames can contain intracoded macroblocks alongside predicted ones: a macroblock is intracoded when there is no efficient way to predict it.
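A simplified mode decision for one macroblock might look like the sketch below. It assumes the best-matching past and future regions have already been found. It compares them using the sum of absolute differences (SAD), a cheaper error measure swapped in here for brevity, and falls back to intracoding when even the best prediction is poor. The intra_threshold value is purely hypothetical, and real encoders use more elaborate cost measures.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def choose_macroblock_type(current, past=None, future=None, intra_threshold=64):
    """Pick a macroblock type from the candidate predictions that are available.

    past / future are the matched regions from the previous / next reference
    frame, or None when that reference does not exist (e.g. in an I frame).
    """
    costs = {}
    if past is not None:
        costs["forward"] = sad(current, past)
    if future is not None:
        costs["backward"] = sad(current, future)
    if past is not None and future is not None:
        average = [[(p + f + 1) // 2 for p, f in zip(row_p, row_f)]
                   for row_p, row_f in zip(past, future)]
        costs["bidirectional"] = sad(current, average)
    if not costs:
        return "intracoded"           # nothing to predict from
    best = min(costs, key=costs.get)
    # Hypothetical rule: give up on prediction when the best match is poor.
    return best if costs[best] <= intra_threshold else "intracoded"


current = [[20, 20], [20, 20]]
print(choose_macroblock_type(current, past=[[10, 10], [10, 10]], future=[[30, 30], [30, 30]]))
print(choose_macroblock_type(current, past=[[200, 200], [200, 200]]))
```

The first call selects bidirectional prediction because the average of the two references matches the block exactly. The second falls back to intracoding because the only available reference is a poor match.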

A residual is necessary because motion prediction is rarely perfect. For instance, when objects in a frame are rotated, a purely translational motion vector (MV) cannot describe the movement exactly, which leaves a residual. The residual corrects the motion-compensated prediction and is normally subdivided into 8 × 8 blocks and sent to the block transform.
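The subdivision step can be sketched as follows for the 16×16 luminance residual of one macroblock. The helper name split_into_blocks and the test data are assumptions for illustration. The block transform itself is not shown.

```python
def split_into_blocks(residual, block=8):
    """Cut a square residual into block x block tiles, row by row."""
    size = len(residual)
    return [[row[left:left + block] for row in residual[top:top + block]]
            for top in range(0, size, block)
            for left in range(0, size, block)]


# Dummy 16x16 residual whose values encode their own position.
residual = [[r * 16 + c for c in range(16)] for r in range(16)]
blocks = split_into_blocks(residual)
print(len(blocks), len(blocks[0]), len(blocks[0][0]))  # 4 8 8
```

Each of the four 8 × 8 tiles would then be passed to the block transform separately.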