There are several different definitions of delay, and different applications put emphasis on different aspects of it.
There are many sources of delay in a video codec system. Here is a non-exhaustive list for videoconferencing applications:

1. Camera shutter integration time
2. Camera readout / frame capture
3. Preprocessing
4. Frame reordering at the encoder (B frames)
5. Encoding
6. Rate control buffering at the encoder
7. Implementation-specific buffering at the encoder
8. Channel coding
9. Transmission over the channel
10. Channel decoding
11. Decoding
12. Frame reordering at the decoder (B frames)
13. Postprocessing
14. Compensation for decoder output jitter
15. Audiovisual synchronization
16. Implementation-specific buffering at the decoder
17. Synchronization with the display frame timing
It's hard to get all that to add up to less than a few hundred milliseconds on average, particularly if your video frame rate is relatively low.
High frame rates (which tend to require high bit rates) can reduce delay dramatically. Delay gets pretty awful at a few frames per second.
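To make this concrete, here is a small sketch of how the frame-period-dependent delay components scale with frame rate. The component counts are made-up illustrative numbers, not taken from any real system:

```python
# Many delay components scale with the frame period (1 / frame_rate),
# so low frame rates inflate end-to-end delay.
def frame_period_ms(fps):
    return 1000.0 / fps

# Hypothetical budget: four components of half a frame period each,
# plus one full frame period for transmission (illustrative only).
def rough_delay_ms(fps, half_period_components=4, full_period_components=1):
    p = frame_period_ms(fps)
    return half_period_components * p / 2 + full_period_components * p

for fps in (5, 12.5, 25, 60):
    print(f"{fps:5} fps -> ~{rough_delay_ms(fps):.0f} ms")
```

Going from 5 fps to 25 fps cuts the frame-period-dependent delay by a factor of five in this toy model.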
Higher encoding bit rate does not mean higher delay. In fact one of the best ways to reduce delay is to increase the channel bit rate that you're operating over, which allows frame rate to increase and other sources of delay to be reduced. For example, uncompressed PCM video has very high bit rate and very low delay.
In videoconferencing, sources 4 and 12 are ordinarily eliminated, sources 6 and 14 are greatly reduced (at some cost in quality), and all the others are minimized to the extent feasible for the implementation.
We assume a video transmission system that conveys video frames at PAL resolution (720 x 576 pixels) at a frame rate of 25 frames per second. The following table describes the different sources of delay for a low delay (e.g. videoconferencing) and a higher delay (e.g. broadcast) scenario.
| # | Low delay scenario | Delay | Broadcast scenario | Delay |
|---|---|---|---|---|
| 1 | The mean time between any action in the "real world" and the end of the camera shutter integration time is half a frame period. | 20 ms | see low delay scenario | 20 ms |
| 2 | A typical camera outputs pixels such that all pixels of one frame are transmitted in one frame period, starting with the top left pixel and traversing the image line by line from left to right and top to bottom. A typical video encoder like H.263 or MPEG-2 processes images in 16 x 16 pixel macroblocks, so the encoder needs at least 16 lines of an image to start encoding. In our example this means 16/576 of a frame period (not considering real video timing, e.g. sync periods). | 1.1 ms | For ease of implementation a frame grabber acquires a whole video frame before it hands the video data over to the following processing steps. This means one frame period of delay. | 40 ms |
| 3 | We assume no preprocessing. | 0 ms | We assume one frame period for preprocessing. | 40 ms |
| 4 | B frames are not used, so there is no delay for reordering. | 0 ms | If we use two B frames between P frames, we need to store the input images for the B frames and encode the P frame first. This results in two frame periods of delay. | 80 ms |
| 5 | A balanced system has the computing power to encode a frame in one frame period. It can output the first encoded bits as soon as it has processed the first macroblock. This delay can be very small. | 0.1 ms | For ease of implementation the encoder processes a whole video frame before it hands the encoded data over to the next steps. This means one frame period of delay. | 40 ms |
| 6 | A low delay video encoder tries to spend the same number of bits on each individual video frame. However, some variation must be possible. We assume a variation of half the mean number of bits per frame, which leads to half a frame period of delay. | 20 ms | Broadcast encoders allow the number of bits spent on a video frame to vary in order to achieve constant quality over time. Buffer periods in the range of 0.5 to 1 second and more are common. For more information on delay and rate control see the respective section below. | 1000 ms |
| 7 | There are many places in an implementation of a video encoding system where some extra buffering makes life easier. One example: each individual video frame may need a different amount of time for encoding. We assume half a frame period for a low delay video encoder. | 20 ms | In the broadcast video encoder we allow for some more buffering to make the implementation easier. | 40 ms |
| 8-10 | Transmission delay can vary over a wide range depending on the channel. For our low delay example we assume a low latency channel and no channel coding. If the encoder operates at the channel bit rate, the transmission of one frame takes one frame period. | 40 ms | Transmission latency can get quite high in broadcast systems. There may be elaborate channel coding or a satellite link. Delays of one second or more are not uncommon. | 1000 ms |
| 11 | The decoder can decode one frame in one frame period. It could output parts of the image as soon as the first macroblock is processed, but this would not make much sense, since a whole frame must be displayed. So the decoding delay is one frame period. | 40 ms | see low delay scenario | 40 ms |
| 12 | B frames are not used, so there is no delay for reordering. | 0 ms | We need one frame period for reordering of B frames at the decoder. | 40 ms |
| 13 | We assume no postprocessing. | 0 ms | We assume one frame period for postprocessing. | 40 ms |
| 14 | Some video frames need more computation to encode or decode than others, so the intervals at which the decoder outputs frames will jitter. We allow half a frame period to compensate for this jitter. | 20 ms | In the broadcast scenario there may be larger variations in the computational demands of individual frames. Here we allow a full frame period of delay. | 40 ms |
| 15 | In a low delay application we pay close attention to audiovisual synchronization over the whole transmission chain, so we do not need much extra delay here; half a frame period should be sufficient. | 20 ms | In this scenario we may transmit audio and video over separate channels, e.g. two different UDP ports over IP. This may lead to great differences in the arrival times of audio and video packets. We may need half a second for audiovisual synchronization. | 500 ms |
| 16 | There are many places in an implementation of a video decoding system where some extra buffering makes life easier. We assume half a frame period for a low delay video decoder. | 20 ms | In the broadcast video decoder we allow for some more buffering to make the implementation easier. | 40 ms |
| 17 | We want to synchronize the output of a video frame with the frame timing of the display device. The delay here can be up to one frame period of the display frame rate; we assume a 60 Hz display (16.7 ms frame period). | 16 ms | see low delay scenario | 16 ms |
| Total | | ~ 220 ms | | ~ 3000 ms |
As you can see from the low delay example, you get about 220 ms end to end delay even with quite optimistic assumptions for the individual sources of delay. Many sources of delay depend on the frame period, so you get higher delays when you use lower frame rates. Some of this delay can be reduced with higher computational power, so that encoding and decoding take less time, or with higher network bandwidth, so that the actual transmission of the bits takes less time. However, higher computational power and higher network bandwidth only come at a cost.
The broadcast example shows that several seconds end to end delay are not uncommon in video transmission. In fact, in internet video streaming buffering periods of more than 10 seconds are often employed to compensate for fluctuating channel conditions.
In broadcast, you also need to think about channel acquisition refresh time.
In DVD-style playback, you also need to think about random-access seek time.
The number of bits spent by an encoder on an individual video frame may vary to a great degree; e.g. typically I frames need more bits than P frames, which in turn need more bits than B frames. Furthermore, the "complexity" of a video frame may vary over time. Complexity here means the number of bits the encoder has to spend on a video frame at a given fixed setting. For example, static content has low complexity, because a frame can easily be predicted from the previous frame. Scenes with high motion typically have high complexity.
The following figure shows the bit usage over time for a video sequence. The sequence is coded at QCIF size with 8.33 fps. The scene shows mostly head and shoulder content (which has relatively low complexity) with 2 camera pans (which have higher complexity). The mean bitrate is about 24 kbps. You can see the peaks in bitrate due to the pans.
The encoder rate control can try to code each frame with the same number of bits, but then the video quality would be poor for I frames or for scenes with high complexity. To allow short term fluctuations in bitrate when transmitting over a fixed rate channel, a buffer has to be inserted before and after the channel.
The following figure illustrates a video transmission system. The encoder produces a variable number of bits per frame. These bits are written to the transmit buffer. Reading from the transmit buffer, the actual transmission, and writing to the receive buffer all occur at a constant rate. The decoder again reads a variable number of bits per frame from the receive buffer.
The encoder rate control has to ensure that the buffers never overflow or underflow. The transmission can only start after a delay of half of the buffer capacity (in time: half the buffer size divided by the channel bit rate). Likewise, decoding can only start after another delay of half of the buffer capacity.
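Under this half-buffer start-up rule, the extra end-to-end delay contributed by the buffers can be computed directly; a minimal sketch:

```python
# Buffering delay on a constant-rate channel: the transmitter waits
# until half the buffer is filled before transmission starts, and the
# receiver waits another half buffer before decoding starts.
def buffer_delay_s(buffer_capacity_bits, channel_bps):
    tx_wait = 0.5 * buffer_capacity_bits / channel_bps
    rx_wait = 0.5 * buffer_capacity_bits / channel_bps
    return tx_wait + rx_wait

# A 48 kbit buffer on a 24 kbps channel adds 2 seconds of delay.
print(buffer_delay_s(48_000, 24_000))  # 2.0
```

Note that the total added delay equals the full buffer capacity divided by the channel bit rate, split evenly between transmitter and receiver.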
The following figure shows the situation at the transmitter. Encoding starts at t = 0 s with a mean bitrate of 24 kbps. The buffer capacity is 48 kbit (2 seconds). After a delay of one second the actual transmission starts. The solid and dashed thin lines mark the lower and upper buffer limits.
The size of the buffer determines how much the bit allocation can vary locally. If the buffer is small, there is relatively little ability for the local bit allocation for a segment of the video content to vary. However if the buffer is large, the buffer can be used very effectively (although with high end-to-end delay) to allow the number of bits per frame to vary in different parts of the content while retaining constant bit rate for transmission.
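The overflow/underflow constraint can be illustrated with a minimal leaky-bucket sketch. This is our own simplified model, not a standards-conformant VBV or HRD, and the frame sizes are invented:

```python
# Frame sizes vary, the channel drains the buffer at a constant rate,
# and rate control must keep fullness within [0, capacity].
def simulate_buffer(frame_bits, bits_per_frame_period, capacity):
    fullness = capacity / 2                  # assume we start half full
    trace = []
    for f in frame_bits:
        fullness += f                        # encoder writes one frame
        if fullness > capacity:
            raise OverflowError("buffer overflow: rate control must act")
        fullness -= bits_per_frame_period    # channel drains one period's bits
        fullness = max(fullness, 0.0)        # underflow -> channel stuffing
        trace.append(fullness)
    return trace

# 24 kbps at 8.33 fps -> ~2880 bits drained per frame period.
trace = simulate_buffer([2000, 4000, 3000, 2500], 2880, 48_000)
print(trace)
```

A real rate control would react to the fullness trace, e.g. by coarsening the quantizer as the buffer fills, rather than raising an error.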
Video coding standards define constraints on the buffer size. In MPEG-1, MPEG-2, and MPEG-4 these constraints are defined by means of a Video Buffering Verifier (VBV). H.261, H.263 and H.264 define a Hypothetical Reference Decoder (HRD).
Four examples may be useful to illustrate the situation:
Harald Fuchs: Regelung der Bitrate und Bitverteilung prädiktiver Video-Hybrid-Kodierer, Diplomarbeit, Erlangen, 1997.
This article was inspired by some very insightful postings from Gary Sullivan to the MPEG Industry Forum MP4-tech mailing list.