MP4 (MPEG-4 Part 14) is a container format that bundles video, audio, and other related information together in a single file. It is largely based on the QuickTime file format and supports a wide range of codecs.
Structure of MP4 Files
ISO Base Media File Format (ISOBMFF, MPEG-4 Part 12) is the foundation of the MP4 container format. The basic building blocks of ISOBMFF are boxes, which are also called atoms. The standard defines the boxes using classes and an object-oriented approach. An MP4 file is organized as a series of boxes, which may in turn contain other boxes. The standard specifies which types of boxes exist, what information each can contain, and where in the box hierarchy each can appear. Each box type has a four-character identifier.
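To make the box layout concrete, here is a minimal sketch of a box iterator. It relies on the ISOBMFF box header: a 32-bit big-endian size followed by the four-character type, where a size of 1 means a 64-bit size follows and a size of 0 means the box runs to the end of the enclosing data. The names iter_boxes and make_box are hypothetical helpers invented for this illustration, and the sample bytes are synthetic rather than taken from a real file.

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Yield (box_type, payload_start, box_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if pos + 16 > end:
                break
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing range.
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box; stop iterating
        yield raw_type.decode("latin-1"), pos + header, pos + size
        pos += size

def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic two-box stream: an ftyp box followed by a tiny mdat box.
sample = (make_box(b"ftyp", b"isom" + b"\x00\x00\x02\x00" + b"isomiso2")
          + make_box(b"mdat", b"\x00" * 4))
top_level = [(box_type, box_end - payload_start)
             for box_type, payload_start, box_end in iter_boxes(sample)]
```

For a real file, the same function could be applied to the bytes read from disk, although very large files would call for reading headers incrementally instead of loading everything into memory.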
Below is the list of the root-level atoms that can appear in MP4 files:
- ftyp (file type) identifies the file type through a major brand, a minor version, and a list of compatible brands.
- pdin (progressive download information) contains progressive video loading/downloading information.
- moov (movie) is the container for all the movie metadata, including the movie header, the tracks, and references to the data in mdat.
- moof (movie fragment) is the container for the metadata of a single fragment.
- mfra (movie fragment random access) is the container for random access information about the video fragments.
- mdat (media data) is the data container for media. It contains the audio/video payload.
- meta contains metadata information.
The sample table boxes below are not root-level atoms. They are nested inside moov, in the stbl (sample table) box of each track (moov/trak/mdia/minf/stbl):
- stts (time-to-sample table)
- stsc (sample-to-chunk table)
- stsz (sample sizes)
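Second-level atoms used in MP4 (children of moov):
- mvhd (movie header) contains information about the movie as a whole, such as its timescale and duration.
- trak (track) is the container for an individual track.
- iods (initial object descriptor) is the MPEG-4 object descriptor for the file.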
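The nesting of the atoms listed above can be made visible by walking the hierarchy recursively. The sketch below is only an illustration: walk, box, and the CONTAINERS set are hypothetical names, the set is not exhaustive, and only 32-bit box sizes are handled. The meta box is left out of the set on purpose because it carries a version/flags field before its children. The demo data is synthetic, with empty payloads.

```python
import struct

# Box types treated as plain containers in this sketch (not an exhaustive list).
CONTAINERS = {"moov", "trak", "mdia", "minf", "stbl"}

def walk(data, start=0, end=None, depth=0):
    """Return a list of (depth, box_type) pairs in file order."""
    end = len(data) if end is None else end
    found = []
    pos = start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > end:
            break  # 64-bit and open-ended sizes are not handled here
        box_type = raw_type.decode("latin-1")
        found.append((depth, box_type))
        if box_type in CONTAINERS:
            # The payload of a container is itself a sequence of boxes.
            found.extend(walk(data, pos + 8, pos + size, depth + 1))
        pos += size
    return found

def box(box_type, payload=b""):
    """Build a box with a 32-bit size (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic file skeleton with empty leaf boxes.
stbl = box(b"stbl", box(b"stts") + box(b"stsc") + box(b"stsz"))
trak = box(b"trak", box(b"mdia", box(b"minf", stbl)))
demo = box(b"ftyp") + box(b"moov", box(b"mvhd") + trak) + box(b"mdat")

for depth, box_type in walk(demo):
    print("  " * depth + box_type)
```

The printed tree shows ftyp, moov, and mdat at the root, mvhd and trak one level down, and the sample table boxes at the deepest level under stbl.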

moov contains the video and audio tracks, along with movie header information such as the timescale, duration, and display characteristics of the movie. It stores references to locations in the mdat atom, which is where the actual media data is stored.
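As an illustration of the movie header information mentioned above, the following sketch reads the timescale and duration from moov/mvhd and converts them to seconds. It assumes the standard mvhd layout: one version byte and three flag bytes, then creation time, modification time, timescale, and duration, with 64-bit times and duration in version 1. The names find_box, movie_duration_seconds, and box are hypothetical, and the demo mvhd is synthetic and truncated after the duration field.

```python
import struct

def find_box(data, path, start=0, end=None):
    """Return (payload_start, box_end) of the first box matching path, or None."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1 and pos + 16 <= end:
            size = struct.unpack_from(">Q", data, pos + 8)[0]  # 64-bit size
            header = 16
        elif size == 0:
            size = end - pos  # box runs to the end of the enclosing range
        if size < header or pos + size > end:
            return None
        if box_type == path[0]:
            if len(path) == 1:
                return pos + header, pos + size
            return find_box(data, path[1:], pos + header, pos + size)
        pos += size
    return None

def movie_duration_seconds(data):
    """Return the movie duration in seconds, or None if it cannot be read."""
    span = find_box(data, [b"moov", b"mvhd"])
    if span is None:
        return None
    start, _ = span
    version = data[start]
    if version == 1:
        # version(1) flags(3) creation(8) modification(8) timescale(4) duration(8)
        timescale, duration = struct.unpack_from(">IQ", data, start + 20)
    else:
        # version(1) flags(3) creation(4) modification(4) timescale(4) duration(4)
        timescale, duration = struct.unpack_from(">II", data, start + 12)
    return duration / timescale if timescale else None

def box(box_type, payload=b""):
    """Build a box with a 32-bit size (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic version 0 mvhd: timescale of 1000 ticks per second, 5000 ticks long.
mvhd = box(b"mvhd", b"\x00\x00\x00\x00" + struct.pack(">IIII", 0, 0, 1000, 5000))
demo = box(b"ftyp", b"isom") + box(b"moov", mvhd)
seconds = movie_duration_seconds(demo)
```

The duration is stored in units of the timescale, so dividing one by the other gives seconds.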
Fragmented MP4
A fragmented (also known as segmented or chunked) MP4 stream consists of an initialization segment and a sequence of media segments. The initialization segment is similar to the start of an unfragmented file: it consists of an ftyp box and a moov box. The moov box contains additional information to indicate that the stream is fragmented, including an mvex (movie extends) box.
Each media segment consists of a moof (movie fragment) box, containing metadata for just that fragment, and an mdat box, containing a portion of the audio/video payload. A media segment may also have an styp box at the beginning, which is like ftyp but for a segment. In a properly encoded stream, each media segment can be decoded and played back without needing information from any previous segments other than the initialization segment.
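The segment layout described above can be sketched as a function that splits a fragmented stream into its initialization segment and its media segments. This is a simplified illustration under stated assumptions: each media segment is taken to start at an styp box, or at a moof box when no styp directly precedes it, only 32-bit box sizes are handled, and other boxes that real streams may carry between segments are not treated specially. The names top_level_boxes, split_segments, and box are hypothetical, and the demo stream is synthetic.

```python
import struct

def top_level_boxes(data):
    """Return [(box_type, start, end)] for root-level boxes (32-bit sizes only)."""
    boxes, pos = [], 0
    while pos + 8 <= len(data):
        size, raw_type = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > len(data):
            break  # 64-bit and open-ended sizes are not handled in this sketch
        boxes.append((raw_type.decode("latin-1"), pos, pos + size))
        pos += size
    return boxes

def split_segments(data):
    """Return (initialization_segment, [media_segment, ...]) as bytes."""
    boxes = top_level_boxes(data)
    cuts = []  # byte offsets at which a media segment starts
    previous = None
    for box_type, start, _ in boxes:
        if box_type == "styp" or (box_type == "moof" and previous != "styp"):
            cuts.append(start)
        previous = box_type
    if not cuts:
        return data, []  # no fragments found; treat everything as one unit
    bounds = cuts + [boxes[-1][2]]
    media = [data[a:b] for a, b in zip(bounds, bounds[1:])]
    return data[:cuts[0]], media

def box(box_type, payload=b""):
    """Build a box with a 32-bit size (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic stream: initialization segment followed by two media segments.
init = box(b"ftyp") + box(b"moov", box(b"mvex"))
seg1 = box(b"styp") + box(b"moof") + box(b"mdat", b"\x01\x02")
seg2 = box(b"moof") + box(b"mdat", b"\x03")
init_segment, media_segments = split_segments(init + seg1 + seg2)
```

A player following this model would need init_segment once, and could then handle each entry of media_segments independently as it arrives.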
Fragmented MP4 allows a player to download a few seconds of a clip at a time and start playing them, rather than waiting for the full MP4 file to download.
