AV Sync

Video and Audio Syncing

http://www.doom9.org/index.html?/synch.htm

Video and Audio Syncing Problem: Why and How.

Since the first release of Powerip in mid-1999, people have struggled to determine the correct video and audio speed when converting an NTSC mpeg-2 video/audio stream to another format (e.g. mpeg-1, avi, asf, or divx) so that video and audio stay perfectly in sync.

This video and audio syncing problem is the result of an incorrect conversion of the mpeg-2 video stream (whether using Powerip, mpeg2avi, or any other conversion utility out there). This document is not meant to dismiss Squeezer or Flask; it can in fact be considered supporting material, so that PERHAPS the explanation can be applied to perfect Squeezer, Flask, or even the AGrabber plugin. Note that many successful "synced" conversions have been made using utilities such as "SQUEEZER" and "FLASK". But there are some cases where none of the conversion utilities produces fully "synced" video and audio.

Why? Let's see the process of transferring a 35mm film format to an NTSC video format, to see the root of this evil.

35mm Film to NTSC Video Conversion

Movies are usually shot on 35mm FILM negative. This format runs at 24 FRAMES per second. A Frame is the smallest unit of the FILM format. NTSC Video is a "field-based" format running at 59.94 FIELDS per second. A Field is the smallest unit of the Video format. 2 Fields make up 1 FRAME, so 59.94 FIELDS per second equals 29.97 FRAMES per second. Now we can see the difference: 1 second of FILM (24 frames) is NOT equal to 1 second of NTSC Video (29.97 frames).

To "match" the speed of NTSC Video, conversion from FILM to NTSC Video goes through a process called "2:3 pulldown" or TELECINE. In its simplest terms, this process means "add 6 frames so that 24fps becomes 30fps, which is VERY close to 29.97fps". The problem that arises when doing this TELECINE transfer is deciding WHICH 6 FRAMES to add, or rather REPEAT.
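The arithmetic above can be checked with a short Python sketch. It assumes the exact NTSC field rate of 60000/1001 per second (the text rounds this to 59.94); the variable names are only illustrative.

```python
from fractions import Fraction

FILM_FPS = Fraction(24)                   # film frames per second
NTSC_FIELD_RATE = Fraction(60000, 1001)   # assumed exact value of "59.94"
NTSC_FPS = NTSC_FIELD_RATE / 2            # two fields make one frame

# 2:3 pulldown turns every 4 film frames into 10 fields, i.e. 5 video frames.
telecined_fps = FILM_FPS * Fraction(5, 4)
added_frames_per_second = telecined_fps - FILM_FPS

print(round(float(NTSC_FPS), 2))   # 29.97
print(telecined_fps)               # 30
print(added_frames_per_second)     # 6
```

The remaining gap between 30 and 29.97 is what the rest of this document deals with.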

A community of filmmakers, videomakers and engineers created a STANDARD for this TELECINE conversion. Since a Video FRAME consists of 2 Fields, why not split the FILM frames into Fields first, so that the smallest unit of both formats is the same? Let's see the process:

Telecine in MPEG-2 Video

In Mpeg-2 Video, storing 30 frames for each second creates much bigger files than storing 24 frames. If you do the calculation, 1 second at 24 frames is 20% SMALLER in SIZE than 1 second at 30 frames. But, as we have already discussed, NTSC video should be 29.97fps. Does that mean that ALL movies shot on 35mm FILM should be TELECINED, then ENCODED as a 29.97fps Mpeg-2 Video stream? ….. NO!

A good thing about Mpeg-2 Video is that it can contain FLAGS that tell SOFTWARE or HARDWARE to perform the TELECINE when playing the Video. Since the extra fields that make up the 29.97fps stream are just REPEATED fields, they are REDUNDANT and TRASHABLE. Just let the FLAGS tell the player to perform the TELECINE. Really, it CAN do that ;). The benefit is that the movie CAN be stored at its original 24 FRAMES per second, which SAVES 20% of the total filesize!

The FLAGS related to this are REPEAT_FIRST_FIELD and TOP_FIELD_FIRST. The rules for applying these FLAGS follow the STANDARD, so you don't have to worry about the process not meeting it :). Let's see an example:

Adding T_F_F and R_F_F Flags

As we can see, a Value of 1 for both T_F_F and R_F_F ORDERS the player to DISPLAY FRAME A in the sequence Atop Abottom Atop, and a Value of 0 for both T_F_F and R_F_F ORDERS the player to display FRAME B in the sequence Bbottom Btop.

When T_F_F is 0 and R_F_F is 1 (FRAME C), the player will display FRAME C in the sequence Cbottom Ctop Cbottom, and so forth. Since this is a STANDARDIZED conversion, the Values of T_F_F and R_F_F repeat as follows:

T_F_F sequence: 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

R_F_F sequence: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
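As a rough illustration of how a player might expand these flags into displayed fields, here is a small Python sketch. The function name and the list-based representation are assumptions made for this example, not part of any real decoder.

```python
def display_fields(frames, tff, rff):
    """Expand coded frames into the field order a player would display."""
    fields = []
    for name, top_first, repeat_first in zip(frames, tff, rff):
        # T_F_F decides which field of the frame is shown first.
        first, second = ("top", "bottom") if top_first else ("bottom", "top")
        fields.append(name + first)
        fields.append(name + second)
        # R_F_F asks for the first field to be shown once more.
        if repeat_first:
            fields.append(name + first)
    return fields


frames = ["A", "B", "C", "D"]
tff = [1, 0, 0, 1]   # first four values of the T_F_F sequence
rff = [1, 0, 1, 0]   # first four values of the R_F_F sequence
fields = display_fields(frames, tff, rff)
print(" ".join(fields))
# Atop Abottom Atop Bbottom Btop Cbottom Ctop Cbottom Dtop Dbottom
```

Four stored frames become ten displayed fields, i.e. five video frames, which is exactly the 24-to-30 expansion described earlier.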

So now we have an Mpeg-2 Video stream CONTAINING 24 FRAMES per second with the TFF and RFF flags in action. This creates a CONFLICT between 24fps, 30fps, and the EXACT 29.97fps of the NTSC Video standard. To solve this, two other features of the Mpeg-2 Video stream can be applied: the FPS flag and the DROP_FRAME flag.

When the FPS flag value is PROGRAMMED in the header of an Mpeg-2 Video stream, it ORDERS the player to PLAY the stream at an exact SPEED. So, if the FPS flag is set to 29.97fps, the Video stream will play at exactly 29.97 frames per second.

When the DROP_FRAME flag value is 1, it ORDERS the player to REMEMBER that frame numbers 00 and 01 are dropped from the timecode count at the start of each minute, except for minutes that are multiples of 10. The result is much the same as applying the 29.97fps value.
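The numbering rule can be sketched in Python as below. This assumes the flag refers to conventional drop-frame timecode, where only frame numbers are skipped and no picture is discarded; the function name is made up for this example.

```python
def frame_to_dropframe_timecode(frame_number):
    """Convert a running frame count into an hh:mm:ss;ff drop-frame label."""
    frames_per_10_minutes = 30 * 600 - 18   # nine of ten minutes lose 2 numbers
    blocks, rest = divmod(frame_number, frames_per_10_minutes)
    if rest < 1800:
        skipped = 0   # first minute of each 10-minute block skips nothing
    else:
        skipped = 2 * (1 + (rest - 1800) // 1798)
    n = frame_number + 18 * blocks + skipped   # nominal 30 fps frame label
    ff = n % 30
    ss = (n // 30) % 60
    mm = (n // 1800) % 60
    hh = n // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"


print(frame_to_dropframe_timecode(1799))    # 00:00:59;29
print(frame_to_dropframe_timecode(1800))    # 00:01:00;02
print(frame_to_dropframe_timecode(17982))   # 00:10:00;00
```

Note how the label jumps from ;29 straight to ;02 at the one-minute mark, while the ten-minute mark keeps ;00.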

So, THAT is how an Mpeg-2 NTSC video stream is stored at 24 FRAMES per second but played back at 29.97fps. Now that we understand the process, we are ready to REVERSE it, in order to achieve total Video and Audio syncing when converting BACK from a 24-stored-29.97-fps Mpeg-2 Video stream into any video format we want.

How? Let's start with "mpeg2avi", a utility that converts an Mpeg-2 Video stream into .avi format (with the codec of your choosing).

Deinterlacing DVDs

http://vektor.theorem.ca/dvd/tech/

by Billy Biggs 11-Apr-2002

Deinterlacing and 3:2 pulldown inversion are important for playback of DVDs on progressive-scan displays like computer monitors.

A scary issue is mentioned in MPEG document 2820 which indicates that the 'progressive_frame' flag in the MPEG2 header is unreliable for deinterlacing purposes! This moves intelligent deinterlacing almost completely into the image heuristics area, except for reversing 3:2 pulldown when performed using the repeat_first_field flag. Note below that even then we can have problems, see under 'weird 3:2 pulldown encoding'!!

An example of 3:2 pulldown encoding

The following sequence is taken from the NTSC release of Lawrence of Arabia, Title 1, Chapter 15. It is the first 11 frames.

The first 5 frames are marked as interlaced (thanks!) and so the coded framerate is 29.97fps, but the material is clearly from a 24fps source with 3:2 pulldown applied. The DVD then switches into progressive mode, and uses the repeat_first_field flag to offload the pulldown work onto the player. This switches the effective coded framerate down to 23.976fps.

In this DVD, small bits of scenes have been encoded at 29.97fps instead of always coding at 23.976fps. Why would they do this? One thought is that maybe certain scenes were touched up at video speed to remove objectionable artifacts in the pulldown conversion, but that doesn't seem to be the case here. I did notice that often we see some interlaced frames near the beginning of chapters. Maybe they fear some DVD players need time to switch into 24fps mode?

A really weird 3:2 pulldown encoding!!

This came as a complete shocker to me. Here is some output of the first 150-or-so frames of The Good, The Bad, and the Ugly (1966). Take a careful look at all the non-repeat_first_field frames! They're interlaced! Not only that, but the progressive_frame flag is high the whole time!

The conclusion here is that a correct deinterlacer must look at _every_ non-repeat_first_field frame, even if we're clearly in a pulldown sequence! What a mess!
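To give a feel for what "looking at" a frame could mean, here is a toy comb detector in Python. It is only one possible image heuristic, not the method of any particular player or deinterlacer; the function name, the threshold, and the tiny test frames are all invented for this sketch.

```python
def combed_fraction(frame, threshold=100):
    """Return the share of interior pixels that look combed.

    `frame` is a list of rows of luma values. A pixel counts as combed
    when it differs from the line above and the line below in the same
    direction, which is what weaving two different moments produces.
    """
    rows, cols = len(frame), len(frame[0])
    combed = 0
    for y in range(1, rows - 1):
        for x in range(cols):
            up = frame[y][x] - frame[y - 1][x]
            down = frame[y][x] - frame[y + 1][x]
            if up * down > threshold:
                combed += 1
    return combed / ((rows - 2) * cols)


# A smooth vertical gradient (progressive-looking) versus alternating
# dark and bright lines (two very different fields woven together).
smooth = [[10 * y] * 4 for y in range(6)]
woven = [[0] * 4 if y % 2 == 0 else [100] * 4 for y in range(6)]
print(combed_fraction(smooth))   # 0.0
print(combed_fraction(woven))    # 1.0
```

A deinterlacer following the conclusion above would run a check of this kind on every frame that is not covered by repeat_first_field, regardless of what progressive_frame claims.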

My biggest question here is why? Why would they ever do this? One observation we make is that every second frame is a blend of the two beside it. So, maybe the only print they found of the opening credits was at 12fps? Maybe it was originally recorded at 12fps and this is a conversion technique? Maybe the quality was so bad, they only decided to restore every second frame? If you have thoughts, please email them to me.

2-3 Pulldown Explained

http://www.zerocut.com/tech/pulldown.html

by Alan Stewart

2-3 Pulldown

An NTSC video image consists of 525 horizontal lines of information. The electron gun scans top to bottom, left to right, odd numbered lines first, then the even numbered lines. Each full scan of even numbered lines, or odd numbered lines constitutes a "field". Each field scan takes 1/60th of a second, therefore a whole frame is scanned each 1/30th of a second. (literally 29.97 frames per second)

Film is generally shot and projected at 24 frames per second (fps), so when film frames are converted to NTSC video, the rate must be modified to play at 29.97 fps. During the telecine process, twelve (12) fields are added to each 24 frames of film (12 fields = 6 frames) so the same images that made up 24 frames of film then comprise 30 frames of video. Video plays at a speed of 29.97 fps so the film actually runs at 23.976 fps when transferred to video.
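The 23.976 figure, and what it means for running time, can be worked out as follows. This sketch assumes the exact NTSC rate of 30000/1001 fps, and the two-hour film length is a made-up example.

```python
from fractions import Fraction

NTSC_FPS = Fraction(30000, 1001)   # assumed exact value of "29.97"

# Pulldown shows 4 film frames in the time of 5 video frames.
film_rate_on_video = NTSC_FPS * Fraction(4, 5)
slowdown = 1 - film_rate_on_video / 24

film_frames = 24 * 60 * 60 * 2     # hypothetical two-hour film
seconds_at_24 = Fraction(film_frames, 24)
seconds_on_video = film_frames / film_rate_on_video
drift = seconds_on_video - seconds_at_24

print(round(float(film_rate_on_video), 3))   # 23.976
print(slowdown)                              # 1/1001
print(float(drift))                          # 7.2
```

A 0.1% slowdown sounds negligible, but over two hours it amounts to about 7.2 seconds, which is far more than enough to be noticed if audio and video are not slowed together.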

The Avid Film Composer assumes a 2-3 pull down. That means that the first frame of film is represented by 2 fields of video; the second frame of film is represented by 3 fields of video (1.5 frames); the third frame of film is again represented by two fields and the fourth frame of film is represented by 3 fields, and so on. In the end, what was running at 23.976 fps is running at 29.97 fps.

The first frame of video contains two fields of the 1st (A) frame of film.

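The second frame of video contains two fields of the 2nd (B) frame of film.

The third frame of video contains one field of the 2nd (B) frame and one field of the 3rd (C) frame of film.

The fourth frame of video contains one field of the 3rd (C) frame and one field of the 4th (D) frame of film.

The fifth frame of video contains two fields of the 4th (D) frame of film.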

The graphic above shows how four frames of film become five frames of video; repeat that process six times and 24 frames of film become 30 frames of video. (technically, 23.976 frames of film become 29.97 frames of video, but it is easier to speak in whole numbers)

The Avid digitizes (records) and plays the film at 24 fps, in a Film Project, so the video has to be stripped of the fields that were added in the tape transfer process. Systems that digitize the 29.97 frames of video produce film Cut Lists by a process called matchback where the timecode (from an EDL) is used to locate the nearest real film frame for the negative cutter. Matchback is only accurate to plus or minus one frame. One can choose to work at 30fps on an Avid and matchback for a negative cut, or work at 24 fps and produce a frame accurate negative Cut List. There is a process by which one can import an EDL from a 24 fps film project into a 30 fps project and redigitize the picture at a higher resolution (film projects only capture in single field resolutions).

2-3 Pulldown vs. 3-2

It is commonly referred to as 3-2 pulldown; while modern telecine machines can go either way, the norm is 2-3. Therefore, AA BB BC CD DD. If the telecine is set for 3-2, you'll get BB BC CD DD AA, which would require you to change the default pullin before digitizing the clips, because the clips' head frames would be "B" rather than "A".
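Both cadences can be reproduced with a few lines of Python. The sketch follows the labelling used above, where "B" and "D" are the film frames held for three fields, so a 3-2 transfer is modelled as starting on a "B" frame; the function name is hypothetical.

```python
def pulldown(film_frames, cadence):
    """Spread film frames over fields, then pair fields into video frames."""
    fields = []
    for i, frame in enumerate(film_frames):
        # Each film frame is held for 2 or 3 fields, following the cadence.
        fields.extend(frame * cadence[i % len(cadence)])
    return ["".join(fields[i:i + 2]) for i in range(0, len(fields), 2)]


print(pulldown("ABCD", (2, 3)))   # ['AA', 'BB', 'BC', 'CD', 'DD']
print(pulldown("BCDA", (3, 2)))   # ['BB', 'BC', 'CD', 'DD', 'AA']
```

The same five video frames appear in both cases; only the starting point of the cycle differs, which is why the pullin setting matters.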

MPEG-2 pulldown

http://raph.levien.com/pulldown.html

There are a number of free software codebases for decoding, encoding, and transforming MPEG-2 streams. With these, it's possible to transcode from DVD to SVCD, and other fun tricks.

However, all these codebases I've seen share a common problem: they all ignore or corrupt the RFF and TFF bits set in MPEG-2 picture extension headers, which are used to implement the 3:2 pulldown used to display 24 fps source material at 29.97 fps NTSC frame rates. Typical symptoms include loss of A/V sync, jerky motion, and excess flickering at scene changes and in high-motion sequences.
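One way the loss of sync can arise is sketched below: if the RFF flags are dropped while the 29.97 fps rate is kept, the video simply runs short. This is an illustration under that assumption, not a description of any specific tool; the function name and the 24-frame sample are invented.

```python
from fractions import Fraction

FIELD_RATE = Fraction(60000, 1001)   # assumed exact NTSC field rate


def display_seconds(rff_flags, honor_rff=True):
    """Display time of coded frames, with or without honoring RFF."""
    fields = sum(3 if (rff and honor_rff) else 2 for rff in rff_flags)
    return fields / FIELD_RATE


rff = [1, 0] * 12   # 24 coded frames with alternating RFF
with_flags = display_seconds(rff)
without_flags = display_seconds(rff, honor_rff=False)

print(float(with_flags))             # 1.001
print(float(without_flags))          # 0.8008
print(without_flags / with_flags)    # 4/5
```

With the flags ignored, every second of audio is matched by only about 0.8 seconds of picture, so the drift grows steadily through the film.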

How 3:2 pulldown works in MPEG-2

The fundamental problem is simple. Most movies are shot at 24 fps, while NTSC displays frames at 29.97 fps. Thus, a process called "telecine" is used to adapt the frame rate.

It would be possible for telecine to repeat one frame every four (source) frames, but the results wouldn't be very pleasing. So, instead, telecine makes use of the fact that a single NTSC frame is interlaced into two fields, so each source frame is rendered as either 2 or 3 fields. Commonly, these are interleaved so that even-numbered frames are 3 fields, and odd-numbered are 2, hence the name "3:2 pulldown".

Compressing such a sequence presents definite problems. If you compressed individual fields, then your compression wouldn't be as good in low-motion sequences. Conversely, if you compressed frames at the 29.97 fps NTSC frame rate, you'd find that two out of every five frames would consist of two fields from different source frames. These compress poorly and don't look that great when displayed. (Even so, it's not uncommon to see these frames occasionally on less than lovingly mastered DVDs).

Consequently, the designers of MPEG-2 provided flags so that video sequences can be encoded at 24 fps, and the 3:2 pulldown done in the player at playback time. These flags are called RFF (repeat first field) and TFF (top field first). Each frame with RFF set is displayed for 3 fields, otherwise 2. The TFF bit is actually redundant - it is equal to the previous TFF xor'ed with the previous RFF, and is initially 1.
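The redundancy of TFF can be shown with a short Python sketch that rebuilds the TFF sequence from RFF alone, using the xor rule just stated. The function name is made up, and the sketch assumes an unbroken sequence of frame pictures.

```python
def derive_tff(rff_flags, initial_tff=1):
    """Rebuild TFF values: each one is the previous TFF xor the previous RFF."""
    tff = []
    current = initial_tff
    for rff in rff_flags:
        tff.append(current)
        current ^= rff   # a repeated field flips which field comes first next
    return tff


rff = [1, 0, 1, 0, 1, 0, 1, 0]
print(derive_tff(rff))   # [1, 0, 0, 1, 1, 0, 0, 1]
```

For an alternating RFF pattern this yields the familiar 1 0 0 1 cycle of TFF values.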

For pure 24000/1001 fps source material, existing free software transcoders do a decent job. They ignore the pulldown flags, then add their own with an on:off:on:off pattern of RFF.

See my diary entry on the subject.

This page unmaintained since 2003-03-27.