Video compression (video coding) is the process of converting digital video into a compact format suitable for transmission or storage. Raw (uncompressed) digital video requires a very high bitrate; compressed video requires far less.
Compression involves a compressor (encoder) and a decompressor (decoder). The encoder converts the source data into a compressed form occupying a reduced number of bits prior to transmission or storage, and the decoder converts the compressed form back into a representation of the original video data.
Data compression is achieved by removing redundancy, i.e. components that are not necessary for faithful reproduction of the data. There are two types of compression:
- In lossless compression, the reconstructed data at the output of the decoder is a perfect copy of the original data.
- In a lossy compression system, the decompressed data is not identical to the source data and much higher compression ratios can be achieved at the expense of a loss of visual quality.
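The contrast between the two can be sketched in a few lines of Python. This is a toy illustration, not a video codec: zlib stands in for a lossless coder, and coarse scalar quantization (with a made-up step size of 16) stands in for a lossy one.

```python
import zlib
import numpy as np

# Hypothetical 8-bit sample data for illustration only.
rng = np.random.default_rng(0)
samples = rng.integers(0, 256, size=64, dtype=np.uint8)

# Lossless: a zlib round trip reproduces the data exactly.
decoded = np.frombuffer(
    zlib.decompress(zlib.compress(samples.tobytes())), dtype=np.uint8
)

# Lossy: coarse quantization leaves fewer distinct values to code,
# but the reconstruction only approximates the original.
step = 16
quantized = samples // step                      # fewer symbols -> fewer bits
reconstructed = quantized * step + step // 2     # mid-point reconstruction
```

Here the lossy reconstruction error is bounded by half the quantization step, which is the usual trade: a larger step gives a higher compression ratio and a larger error.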
Lossy video compression systems are based on the principle of removing subjective redundancy: they discard elements of the image or video sequence that can be removed without significantly affecting the viewer's perception of visual quality.
This article discusses the basic building blocks (concepts) required for video coding.
A video encoder consists of three main functional units:
- Prediction model
- Spatial model
- Entropy encoder
The input to the prediction model is an uncompressed (raw) video sequence. The prediction model attempts to reduce redundancy by exploiting the similarities between neighbouring video frames and/or neighbouring image samples, constructing a prediction of the current video frame or block of video data. The prediction is formed from data in the current frame or in one or more previous and/or future frames:
- Intra prediction, created by spatial extrapolation from neighbouring image samples.
- Inter (motion-compensated) prediction, created by compensating for differences between frames.
The output of the prediction model is a residual frame, created by subtracting the prediction from the actual current frame, and a set of model parameters indicating the intra prediction type or describing how the motion was compensated.
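The residual computation above is just a subtraction, but it is worth seeing why it helps. In this NumPy sketch the block values and the prediction are invented for illustration; the point is that when the prediction is good, the residual values are small and cheap to code, and adding the residual back to the prediction recovers the original exactly.

```python
import numpy as np

# Toy 4x4 block of the current frame (illustrative values).
current = np.array([[52, 55, 61, 66],
                    [70, 61, 64, 73],
                    [63, 59, 55, 90],
                    [67, 61, 68, 104]], dtype=np.int16)

# Suppose the prediction model produced this block (e.g. from a
# neighbouring frame after motion compensation).
prediction = np.array([[50, 54, 60, 65],
                       [69, 60, 66, 75],
                       [62, 60, 54, 88],
                       [66, 60, 70, 100]], dtype=np.int16)

# The residual has a much smaller dynamic range than the source block.
residual = current - prediction
```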
The residual frame forms the input to the spatial model which makes use of similarities between local samples in the residual frame to reduce spatial redundancy. In H.264/AVC this is carried out by applying a transform to the residual samples and quantizing the results. The transform converts the samples into another domain in which they are represented by transform coefficients. The coefficients are quantized to remove insignificant values, leaving a small number of significant coefficients that provide a more compact representation of the residual frame. The output of the spatial model is a set of quantized transform coefficients.
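A minimal sketch of the transform-and-quantize step, using a floating-point 2-D DCT as a stand-in for the integer transform H.264/AVC actually specifies, and a single made-up quantizer step size. After quantization most high-frequency coefficients become zero, which is what makes the representation compact.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (a simplified stand-in for
    the integer transform used in H.264/AVC)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

C = dct_matrix(4)

# Toy residual block: small, mostly low-frequency values.
residual = np.array([[ 5, 4, -3,  1],
                     [ 4, 2,  0, -1],
                     [-2, 1,  0,  0],
                     [ 1, -1, 0,  0]], dtype=float)

coeffs = C @ residual @ C.T        # forward 2-D transform
qstep = 4.0                        # illustrative quantizer step size
levels = np.round(coeffs / qstep)  # quantization zeroes small coefficients
```

Only the nonzero levels (plus their positions) need to be passed on to the entropy encoder.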
The parameters of the prediction model, i.e. intra prediction mode(s) or inter prediction mode(s) and motion vectors, and the spatial model, i.e. coefficients, are compressed by the entropy encoder. This removes statistical redundancy in the data, for example representing commonly occurring vectors and coefficients by short binary codes.
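One concrete example of "short codes for common values" is the unsigned Exp-Golomb code ue(v), which H.264/AVC uses for many syntax elements. A minimal encoder:

```python
def exp_golomb(value):
    """Unsigned Exp-Golomb code ue(v): small (i.e. statistically
    common) values get short binary codewords."""
    v = value + 1
    prefix_zeros = v.bit_length() - 1
    return "0" * prefix_zeros + format(v, "b")

# 0 -> "1", 1 -> "010", 2 -> "011", 3 -> "00100", ...
```

The codeword length grows only logarithmically with the value, so frequent small motion-vector differences and coefficient levels cost very few bits.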
The video decoder reconstructs a video frame from the compressed bitstream. The coefficients and prediction parameters are decoded by an entropy decoder, after which the spatial model is inverted to reconstruct a version of the residual frame. The decoder then uses the prediction parameters, together with previously decoded image samples, to create a prediction of the current frame, and the final frame is reconstructed by adding the residual frame to this prediction.
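The decoder-side steps can be sketched end to end with the same simplifications as before: a floating-point DCT in place of the real integer transform, a toy prediction block, and an invented quantizer step. Rescaling and inverse transforming the levels gives an approximate residual, and adding the prediction yields the reconstructed block, close to (but, because of quantization, not identical to) the original.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (simplified stand-in for the
    H.264/AVC integer transform)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

C = dct_matrix(4)
qstep = 4.0

# Toy data: a flat prediction and a source block that differs slightly.
prediction = np.full((4, 4), 60.0)
current = prediction + np.arange(16).reshape(4, 4) % 5

# Encoder side (for reference): transform and quantize the residual.
levels = np.round(C @ (current - prediction) @ C.T / qstep)

# Decoder side: rescale, inverse transform, add the prediction.
residual_hat = C.T @ (levels * qstep) @ C
reconstructed = prediction + residual_hat
```

Because the transform is orthonormal, the reconstruction error is bounded by the quantization error introduced at the encoder.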