There are several different definitions of delay, and different applications put emphasis on different aspects of it.
There are many sources of delay in a video codec system. Here is a non-exhaustive list for videoconferencing applications:

1. Camera shutter integration time
2. Camera readout / frame capture
3. Preprocessing
4. Frame reordering at the encoder (B frames)
5. Encoding
6. Rate control buffering at the encoder
7. Implementation-specific buffering at the encoder
8. Channel coding
9. Transmission over the channel
10. Channel decoding
11. Decoding
12. Frame reordering at the decoder (B frames)
13. Postprocessing
14. Compensation for decoder output jitter
15. Audiovisual synchronization
16. Implementation-specific buffering at the decoder
17. Synchronization with the display frame timing
It's hard to get all that to add up to less than a few hundred milliseconds on average, particularly if your video frame rate is relatively low.
High frame rates (which tend to require high bit rates) can reduce delay dramatically. Delay gets pretty awful at a few frames per second.
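To make this concrete, here is a small sketch of how the frame-period-dependent delay components scale with frame rate. The component counts are made-up illustrative numbers, not taken from any real system:

```python
# Many delay components scale with the frame period (1 / frame_rate),
# so low frame rates inflate end-to-end delay.
def frame_period_ms(fps):
    return 1000.0 / fps

# Hypothetical budget: four components of half a frame period each,
# plus one full frame period for transmission (illustrative only).
def rough_delay_ms(fps, half_period_components=4, full_period_components=1):
    p = frame_period_ms(fps)
    return half_period_components * p / 2 + full_period_components * p

for fps in (5, 12.5, 25, 60):
    print(f"{fps:5} fps -> ~{rough_delay_ms(fps):.0f} ms")
```

Going from 5 fps to 25 fps cuts the frame-period-dependent delay by a factor of five in this toy model.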
Higher encoding bit rate does not mean higher delay. In fact one of the best ways to reduce delay is to increase the channel bit rate that you're operating over, which allows frame rate to increase and other sources of delay to be reduced. For example, uncompressed PCM video has very high bit rate and very low delay.
In videoconferencing, sources 4 and 12 are ordinarily eliminated, sources 6 and 14 are greatly reduced (at some cost in quality), and all the others are minimized to the extent feasible for the implementation.
We assume a video transmission system that conveys video frames at PAL resolution (720 x 576 pixels) at a frame rate of 25 frames per second. The following table describes the different sources of delay for a low delay (e.g. videoconferencing) and a higher delay (e.g. broadcast) scenario.
| # | Low delay scenario | Delay | Broadcast scenario | Delay |
|---|---|---|---|---|
| 1 | The mean time between any action in the "real world" and the end of the camera shutter integration time is half a frame period. | 20 ms | see low delay scenario | 20 ms |
| 2 | A typical camera outputs pixels such that all pixels of one frame are transmitted in one frame period, starting with the top left pixel and traversing the image line by line from left to right and top to bottom. A typical video encoder like H.263 or MPEG-2 processes images in 16 x 16 pixel macroblocks, so the encoder needs at least 16 lines of an image to start encoding. In our example this means 16/576 of a frame period (not considering real video timing, e.g. sync periods). | 1.1 ms | For ease of implementation a frame grabber acquires a whole video frame before it hands the video data over to the following processing steps. This means one frame period of delay. | 40 ms |
| 3 | We assume no preprocessing. | 0 ms | We assume one frame period for preprocessing. | 40 ms |
| 4 | B frames are not used, so there is no delay for reordering. | 0 ms | If we use two B frames between P frames, we need to store the input images for the B frames and encode the P frame first. This results in two frame periods of delay. | 80 ms |
| 5 | A balanced system has the computing power to encode a frame in one frame period. It can output the first encoded bits as soon as it has processed the first macroblock. This delay can be very small. | 0.1 ms | For ease of implementation the encoder processes a whole video frame before it hands the encoded data over to the next steps. This means one frame period of delay. | 40 ms |
| 6 | A low delay video encoder tries to spend the same number of bits on each individual video frame. However, some variation must be possible. We assume a variation of half the mean number of bits per frame, which leads to half a frame period of delay. | 20 ms | Broadcast encoders allow the number of bits spent on a video frame to vary in order to achieve constant quality over time. Buffer periods in the range of 0.5 to 1 second and more are common. For more information on delay and rate control see the respective section below. | 1000 ms |
| 7 | There are many places in an implementation of a video encoding system where some extra buffering makes life easier. One example: each individual video frame may need a different amount of time for encoding. We assume half a frame period for a low delay video encoder. | 20 ms | In the broadcast video encoder we allow for some more buffering to make the implementation easier. | 40 ms |
| 8-10 | Transmission delay can vary over a wide range depending on the channel. For our low delay example we assume a low latency channel and no channel coding. If the encoder operates at the channel bit rate, the transmission of one frame takes one frame period. | 40 ms | Transmission latency can get quite high in broadcast systems. There may be elaborate channel coding or a satellite link. Delays of one second or more are not uncommon. | 1000 ms |
| 11 | The decoder can decode one frame in one frame period. It could output parts of the image as soon as the first macroblock is processed, but this would not make much sense, since a whole frame must be displayed. So the decoding delay is one frame period. | 40 ms | see low delay scenario | 40 ms |
| 12 | B frames are not used, so there is no delay for reordering. | 0 ms | We need one frame period for reordering of B frames at the decoder. | 40 ms |
| 13 | We assume no postprocessing. | 0 ms | We assume one frame period for postprocessing. | 40 ms |
| 14 | Some video frames need more computation to encode or decode than others, so the intervals at which the decoder outputs frames will jitter. We allow half a frame period to compensate for this jitter. | 20 ms | In the broadcast scenario there may be larger variations in the computational demands of individual frames. Here we allow a full frame period of delay. | 40 ms |
| 15 | In a low delay application we pay close attention to audiovisual synchronization over the whole transmission chain, so we do not need much extra delay here; half a frame period should be sufficient. | 20 ms | In this scenario we may transmit audio and video over separate channels, e.g. two different UDP ports over IP. This may lead to great differences in the arrival times of audio and video packets. We may need half a second for audiovisual synchronization. | 500 ms |
| 16 | There are many places in an implementation of a video decoding system where some extra buffering makes life easier. We assume half a frame period for a low delay video decoder. | 20 ms | In the broadcast video decoder we allow for some more buffering to make the implementation easier. | 40 ms |
| 17 | We want to synchronize the output of a video frame with the frame timing of the display device. The delay here can be up to one frame period of the display frame rate; we assume a 60 Hz display (16.7 ms frame period). | 16 ms | see low delay scenario | 16 ms |
| Total | | ~ 220 ms | | ~ 3000 ms |
As you can see from the low delay example, you get about 220 ms end to end delay even with quite optimistic assumptions for the individual sources of delay. Many sources of delay depend on the frame period, so you get higher delays when you use lower frame rates. Some of this delay can be reduced with higher computational power, so that encoding and decoding take less time, or with higher network bandwidth, so that the actual transmission of the bits takes less time. However, higher computational power and higher network bandwidth only come at a cost.
The broadcast example shows that several seconds end to end delay are not uncommon in video transmission. In fact, in internet video streaming buffering periods of more than 10 seconds are often employed to compensate for fluctuating channel conditions.
In broadcast, you also need to think about channel acquisition refresh time.
In DVD-style playback, you also need to think about random-access seek time.
The number of bits spent by an encoder on an individual video frame may vary to a great degree; e.g. typically I frames need more bits than P frames, which in turn need more bits than B frames. Furthermore, the "complexity" of a video frame may vary over time. Complexity here means the number of bits the encoder has to spend on a video frame at a given fixed setting. For example, static content has low complexity, because a frame can easily be predicted from the previous frame. Scenes with high motion typically have high complexity.
The following figure shows the bit usage over time for a video sequence. The sequence is coded at QCIF size with 8.33 fps. The scene shows mostly head and shoulder content (which has relatively low complexity) with 2 camera pans (which have higher complexity). The mean bitrate is about 24 kbps. You can see the peaks in bitrate due to the pans.
The encoder rate control can try to code each frame with the same number of bits, but then the video quality would be poor for I frames or for scenes with high complexity. To allow short term fluctuations in bitrate when transmitting over a fixed rate channel, a buffer has to be inserted before and after the channel.
The following figure illustrates a video transmission system. The encoder produces a variable number of bits per frame. These bits are written to the transmit buffer. Reading from the transmit buffer, the actual transmission, and writing to the receive buffer all occur at a constant rate. The decoder again reads a variable number of bits per frame from the receive buffer.
The encoder rate control has to ensure that the buffers never overflow or underflow. The transmission can only start after a delay of half of the buffer capacity (in time: half the buffer size divided by the channel bit rate). Likewise, decoding can only start after another delay of half of the buffer capacity.
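Under this half-buffer start-up rule, the extra end-to-end delay contributed by the buffers can be computed directly; a minimal sketch:

```python
# Buffering delay on a constant-rate channel: the transmitter waits
# until half the buffer is filled before transmission starts, and the
# receiver waits another half buffer before decoding starts.
def buffer_delay_s(buffer_capacity_bits, channel_bps):
    tx_wait = 0.5 * buffer_capacity_bits / channel_bps
    rx_wait = 0.5 * buffer_capacity_bits / channel_bps
    return tx_wait + rx_wait

# A 48 kbit buffer on a 24 kbps channel adds 2 seconds of delay.
print(buffer_delay_s(48_000, 24_000))  # 2.0
```

Note that the total added delay equals the full buffer capacity divided by the channel bit rate, split evenly between transmitter and receiver.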
The following figure shows the situation at the transmitter. Encoding starts at t = 0 s with a mean bitrate of 24 kbps. The buffer capacity is 48 kbit (2 seconds). After a delay of one second the actual transmission starts. The solid and dashed thin lines mark the lower and upper buffer limits.
The size of the buffer determines how much the bit allocation can vary locally. If the buffer is small, there is relatively little ability for the local bit allocation for a segment of the video content to vary. However if the buffer is large, the buffer can be used very effectively (although with high end-to-end delay) to allow the number of bits per frame to vary in different parts of the content while retaining constant bit rate for transmission.
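The overflow/underflow constraint can be illustrated with a minimal leaky-bucket sketch. This is our own simplified model, not a standards-conformant VBV or HRD, and the frame sizes are invented:

```python
# Frame sizes vary, the channel drains the buffer at a constant rate,
# and rate control must keep fullness within [0, capacity].
def simulate_buffer(frame_bits, bits_per_frame_period, capacity):
    fullness = capacity / 2                  # assume we start half full
    trace = []
    for f in frame_bits:
        fullness += f                        # encoder writes one frame
        if fullness > capacity:
            raise OverflowError("buffer overflow: rate control must act")
        fullness -= bits_per_frame_period    # channel drains one period's bits
        fullness = max(fullness, 0.0)        # underflow -> channel stuffing
        trace.append(fullness)
    return trace

# 24 kbps at 8.33 fps -> ~2880 bits drained per frame period.
trace = simulate_buffer([2000, 4000, 3000, 2500], 2880, 48_000)
print(trace)
```

A real rate control would react to the fullness trace, e.g. by coarsening the quantizer as the buffer fills, rather than raising an error.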
Video coding standards define constraints on the buffer size. In MPEG-1, MPEG-2, and MPEG-4 these constraints are defined by means of a Video Buffering Verifier (VBV). H.261, H.263 and H.264 define a Hypothetical Reference Decoder (HRD).
Four examples may be useful to illustrate the situation:
Harald Fuchs: Regelung der Bitrate und Bitverteilung prädiktiver Video-Hybrid-Kodierer, Diplomarbeit, Erlangen, 1997.
This article was inspired by some very insightful postings from Gary Sullivan to the MPEG Industry Forum MP4-tech mailing list.