Design Article

HDMI's Lip Sync and audio-video synchronization for broadcast and home video

Joseph L. Lias
President, Simplay Labs, LLC

8/15/2008 2:00 AM EDT

When entertainment content is decoded and rendered on Consumer Electronic (CE) devices, the timing of rendering the video portion of the signal may deviate from the timing of rendering the audio signal. The resultant timing differential is often referred to as a "lip sync" error, since it is most obviously apparent to a viewer when the content contains a representation of a person speaking. In a digital television, the video processing usually takes more time than the audio processing. Because of this, synchronization of video and audio can become an issue, creating an effect similar to a badly dubbed movie, where the audio and video don't match up and the sound of the spoken words is no longer in "sync" with the speaker's lip movement.

HDMI version 1.3 includes a Lip Sync feature that allows the audio processing times in devices to be adjusted automatically to compensate for errors in audio/video timing. The initial implementations of this functionality will be in A/V receivers, but it is likely to appear in DVD players and many other CE devices in the future. Reports from manufacturers indicate that this function is very popular and will be widely implemented.

The HDMI standard requires manufacturers to disclose specific HDMI features enabled in a product. The idea is to provide consumers with the necessary descriptive information they need to understand enabled features that exploit certain capabilities of HDMI, such as Lip Sync. For each feature, the guidelines specify a minimum level of functionality that must be met by the device in order to use the terminology.

While HDMI LLC Authorized Testing Centers (HDMI-ATCs) test for electrical parametric and protocol compliance against the HDMI specification, there is a need to build upon this basic interface testing with additional performance testing programs designed to simplify consumer purchase decisions and enhance the high definition entertainment experience. There are no HDMI-ATC system level Lip Sync performance compliance specifications, or test tools designed to ensure accurate Lip Sync delivery. There is no "timing conformance" specification that must be demonstrated to any authority in order to build a compliant product.

There is an increasing awareness in both broadcast engineering and the CE industry that audio-video synchronization errors, usually seen as problems with lip sync, are occurring more frequently and often with greater magnitude. With the advent of digital processing in CE devices, the issue has become critical. Some CE manufacturers deny there is a problem, believing the audio/video asynchronies in their units to be imperceptible. Knowing how to measure audio/video delays and compensate for them is becoming increasingly important.

Is it important?
Lip Sync is very important to consumers and the display industry since newer technologies have created a noticeable delay between the processing of video signals and the processing of audio signals. Lip sync correction features take these processing delays into account, so that both signals can be synchronized and presented to the viewer together. This greatly improves the entertainment experience for the viewer.

Lip sync errors detract from the consumer entertainment experience. The lack of lip sync correction is of particular concern in certain types of content, such as product commercials and political candidates' statements. See the report "Effects of Audio Asynchrony on Viewer's Memory, Evaluation of Content and Detection Ability" by Reeves and Voelker for more information (a non-copyrighted PDF is available at Pixel Instruments).

Human studies of sensitivity to audio/video asynchronies have shown that a drift where the audio arrives late is not as annoying as when the audio arrives early. In fact, even a few frames of early audio can quickly be detected by the viewer. The characterization of sensitivity to the alignment of sound and picture includes early work at Bell Laboratories. The extent to which a consumer can tolerate these asynchronies depends on human perceptual limits as well as personal taste. Steinmetz and Engler conducted user studies (R. Steinmetz and C. Engler, "Human Perception of Media Synchronization," Technical Report 43.9310, IBM European Networking Center, Heidelberg), and they report several figures of merit for quantifying tolerable audio/video asynchrony limits.

In 1998, ITU-R published BT.1359, recommending the relative timing of sound and vision for broadcasting. Studies by the ITU and others have suggested that thresholds of timing for viewer detection are about +45ms to -125ms, and the thresholds of acceptability are about +90ms to -185ms. In addition, the ATSC Implementation Subcommittee IS-191 has found that under all operational situations, the sound program should never lead the video program by more than 15ms and should never lag the video program by more than 45ms ±15ms.
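
As a rough illustration of how the detection and acceptability thresholds quoted above could be applied to a measured offset, here is a minimal Python sketch. The function name and the sign convention (positive means audio ahead of video, negative means audio behind) are assumptions made for this example; the asymmetry matches the observation that early audio is tolerated less than late audio.

```python
# Thresholds in milliseconds, as quoted in the text for ITU-R BT.1359.
# Assumed sign convention: positive = audio leads video, negative = audio lags.
DETECT_LEAD_MS, DETECT_LAG_MS = 45, -125
ACCEPT_LEAD_MS, ACCEPT_LAG_MS = 90, -185


def classify_av_offset(offset_ms: float) -> str:
    """Classify a measured audio/video offset (hypothetical helper)."""
    if DETECT_LAG_MS <= offset_ms <= DETECT_LEAD_MS:
        return "undetectable"
    if ACCEPT_LAG_MS <= offset_ms <= ACCEPT_LEAD_MS:
        return "detectable but acceptable"
    return "unacceptable"


if __name__ == "__main__":
    for offset in (0, 60, -150, 100, -200):
        print(offset, classify_av_offset(offset))
```

Note that the same magnitude lands in different classes depending on direction: 100 ms of early audio is outside the acceptability range, while 100 ms of late audio is still below the detection threshold.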

When viewers encounter difficulties such as lip sync errors, blocking or black screens, they turn to another channel. Therefore, it is imperative that television engineers find and fix network, encoding, and transmission problems before their viewers become aware of them.



jesup

8/19/2008 3:25 PM EDT

From the point of view of an engineer working in this space, all these solutions sound like poor bandaids on a problem that shouldn't have happened in the first place.

There's a reason why RTP packets (in internet VOIP and video) have timestamps and packets that link those to a shared timebase so you can synchronize audio and video. It's unimaginable to me that they designed HDMI without at least considering the possible variable delays on the two chains.

There are two options. You can timestamp the data and let the receiving device implement any FIFO buffers needed for synchronization. Or you can require that the sending device transmit audio and video in sync (adjusting if needed to make that happen) and let the receiving device add internal delays as needed to preserve that synchronization.

My assumption is that the HDMI designers did the second (assume sync at the interface), and didn't specify it or test it, so neither the senders NOR the receivers implemented the necessary buffering, delays, or frame dup/skip. Sheesh.

The proposed solutions are silly - why guess at the induced delays? The hardware/software *knows* what the delays are, and knows when they change. Each device can be responsible for resyncing before output. You can decide to implement a trickier scheme whereby you push all the delay handling into one device in the chain, either with reports ("audio is delayed 30ms") or with timestamps (less needed here since the streams aren't going over lossy networks).




NickJoh

8/20/2008 2:28 PM EDT

I applaud this excellent article with comprehensive references on an important subject largely ignored by the industry.

A few important points fundamental to the lip-sync problem may, however, be obscured by this excellent overview of the subject even though they are referenced.

For example, you might get the impression that HDMI 1.3's "Automatic Lip-Sync Correction" feature actually corrects lip-sync error in the arriving signals. It does NOT.

Most consumers have that misconception when in reality all that feature does is "automatically" set a fixed audio delay to offset the display's fixed video delay.

This is something you can already do with a "one time adjustment" in almost all recent a/v receivers and if that were an adequate solution to correcting lip-sync the three companies' products I will mention at the end of my comments would have no market.

In fact, a "fixed delay" of the kind that even HDMI 1.3 receivers add can make lip-sync "worse" in cases where audio arrives already delayed, which happens more often than you would think - especially with DVDs.

HDMI 1.3 allows a display to communicate to a receiver what its video delay will be during the initial EDID handshaking, but that fixed delay has no effect on lip-sync error in the broadcast or DVD signals and, as mentioned, will exacerbate the problem in cases where the display's video delay could be used to advantage in offsetting an arriving audio delay.

Unfortunately the industry's solution to lip-sync error in the arriving signals has historically been an "open loop" control system requiring the measurement of video delay at every possible entry point and the addition of a compensating audio delay.

That can only "maintain" lip-sync that was already correct and in reality much of the lip-sync error originates with the content producer often starting at image capture and continuing through editing and post production. Tiny errors accumulate until they are large.

But even if the original content has correct lip-sync (rare) any error in this open loop scheme is also cumulative since it can not be automatically detected downstream.

There is a white paper on the Pixel Instruments site with a graph showing the time varying lip-sync error present in most broadcasts. It's pretty astounding!

I have observed over 90 ms differences in lip-sync error from broadcast program to program and over 50 ms variation from DVD to DVD. Lip-sync error is clearly "not" simply due to the fixed video delay caused by a display as the HDMI 1.3 "feature" implies.

There is an excellent white paper written for the SMPTE Ad Hoc Committee on lip-sync by the VP of Engineering for Liberty Broadcasting (now Raycomm) on the technical details page of www.LipFix.com in which he describes his stations' diligent efforts to correct lip-sync but he concludes the paper saying all they can hope to accomplish is "to add no more lip-sync error" since the feeds he receives from the major networks are "already out of sync".

Tektronix had an excellent solution to the lip-sync problem - their now discontinued AVDC100 - which would watermark audio and video that was in-sync and keep it in-sync but I'm convinced its market failure was due to exactly that: "There simply is nowhere you can count on the audio and video to be in-sync." If it is not in-sync to start with, watermarking it and maintaining that incorrect sync accomplishes nothing.

There are other products on the market which do exactly that (maintain incorrect lip-sync) but which have "implied" they could automatically correct lip-sync, such as scan converters, MPEG encoders and decoders, etc. They all assume the source material was in sync upon arrival, internally watermark the streams, and restore sync upon output, similar to the AVDC100. But again, that only maintains whatever lip-sync error was already in the arriving streams. If you see "automatic lip-sync correction", look closer: it is not possible, because there is nothing in the video and audio signals to define when they were ever in sync!

What the industry needs is a standard for watermarking the audio and video during content creation (and more effort on the creator's part to produce perfect lip-sync) so that it could be maintained throughout the broadcast chain and DVD encoding processes based upon those watermarks.

Until that happens we can forget "automatic lip-sync correction".

But I feel the first step and most overlooked lip-sync issue is: "How closely synced does it need to be?"

Is the objective to "mask it" or actually correct it? All the industry standards allow lip-sync error "above" the values the Reeves and Voelker research at Stanford proved causes a negative impact on viewer perception even when not consciously noticed. Clearly then, such lax standards can only "mask" the problem concealing it rather than actually eliminating its negative impact on our perception!

Sound cannot occur "before" the action that creates the sound (in the real world) so when we encounter this contradiction of reality in our home theaters our subliminal reaction is to "look away" from the lips so as not to be confronted by this impossibility.

This explains why most people won't notice over 40 ms of lip-sync error and why the ITU (and other standards groups) seem to consider that a reasonable target.

It could also explain Stanford's discovery that viewers felt the characters were more "anxious", "less persuasive", "less successful", etc. when lip-sync error was present, since these are the same feelings we usually have about people who do not look us in the eye when talking.

Ironically, it is "we" who are not looking the characters in the eye!

As long as we can look askance at the characters' faces and still keep our eyes on-screen we may not consciously notice the lip-sync error, but this subconscious avoidance still undermines our impression of the characters and story - which is the essence of cinema itself, isn't it?

If our impression of the characters and story is not the main objective of cinema, what is it?

Is it OK to just "mask" the lip-sync problem and reduce it to about 40 ms where most people won't notice it?

Not if you believe the Stanford research which proved lip-sync error - even in the 40 ms range - undermines our perception.

If you think you have corrected lip-sync error with your receiver (even an HDMI 1.3 receiver and display), run this test:

Force yourself to "look at the lips". It may take some effort because our natural tendency is to look "away" from something our brains cannot reconcile or process.

Even though most lip-sync error appears as "audio ahead of video", the opposite condition is also unnatural, since our brain uses the delay as a spatial cue and is confused when audio arrives too early or too late. That is, when someone "looks" like they are 20 feet away but sounds like they are 5 feet or 50 feet away. That too could cause us to subliminally avoid such an impossibility, because in nature sound will be delayed about a millisecond for every 13 inches it travels from its source, with very small variations due to altitude, temperature, humidity, etc., and "never" by any significant amount.
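
To put numbers on the "millisecond for every 13 inches" rule of thumb, here is a small Python sketch. The speed of sound used (343 m/s, roughly dry air at 20 °C) and the function name are assumptions for illustration only.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 degrees C
METERS_PER_INCH = 0.0254


def acoustic_delay_ms(distance_feet: float) -> float:
    """Time for sound to travel the given distance, in milliseconds."""
    distance_m = distance_feet * 12 * METERS_PER_INCH
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


if __name__ == "__main__":
    # The distances used in the comment above: 5, 20 and 50 feet.
    for feet in (5, 20, 50):
        print(f"{feet} ft: {acoustic_delay_ms(feet):.1f} ms")
```

Under that assumption, 13 inches works out to just under 1 ms, and the natural delay for a speaker 20 feet away is a little under 18 ms.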

If you force yourself to consciously look closely at the lips you will overcome your avoidance mechanism and you will see the lip-sync error still present perhaps all the way down to a few milliseconds!

It may be hard to believe but we have had customers of our first two generation lip-sync correction products, our DD340 and DD540, ask for adjustment "below a millisecond" so in our new DD740 we added a special "fine" mode allowing 1/3 ms steps. Admittedly few will need that fine an adjustment but many of our customers adjust down to a few ms which most in the industry do not believe possible.

Unfortunately this article overlooks the only current solution to perfect lip-sync error correction, which is a subjective adjustment of an audio delay while watching the moving lips.

The Pixel Instruments "Lip Tracking" system mentioned attempts to automate this type of correction and probably has the greatest potential for broadcasters to correct lip-sync but home theater equipment can distort it again downstream so the ultimate correction should be done at the endpoint - the display and the surround sound system in the home.

In addition to our Felston DD740 4 Input Digital Audio Delay for lip-sync correction, two other companies now produce remote controlled digital audio delays which allow fine tuning lip-sync while watching an undisturbed image - the essential feature for true lip-sync correction. They are Alchemy2 and Primare.

All three allow tweaking the audio delay at the touch of a + or - button on a remote control which seems a minor inconvenience considering the alternative of allowing this contradiction of reality to continue being masked and undermining our perception.

Also, note that when any of these audio delays are used with a display's inherent video delay you effectively gain a "negative" delay equal to the display's video delay.

As an example compare the use of a Felston DD740 correcting lip-sync on a display with a 100 ms video delay and an HDMI 1.3 display/receiver combination:

Case 1: Arriving video is delayed 80 ms behind its audio.

The Felston DD740 would be adjusted to 180 ms (100 ms for the display's added delay and 80 ms for the arriving signal's existing video delay) and at that value lip-sync would be perfect.

The HDMI 1.3 display would tell the receiver to add 100 ms delay so the 80 ms error in the arriving signal would still be present in the program being viewed.

Case 2: Audio arrives delayed by 80 ms after the video.

The Felston DD740 would be set to 20 ms which would allow 80 ms of the display's 100 ms video delay to cancel the 80 ms audio delay in the arriving signal and the DD740's 20 ms delay would cancel the balance of the display's delay and lip-sync would be perfect.

The HDMI 1.3 display would tell the receiver to add 100 ms audio delay as before so the 80 ms lip-sync error present in the arriving signal would be preserved and visible in the program being viewed.

This is a case where lip-sync error would have been less if the HDMI 1.3 "feature" had been turned "off". That is, by doing nothing, only 20 ms lip-sync error would have been displayed since 80 ms of the display's 100 ms video delay would have cancelled the arriving lip-sync error leaving only 20 ms contributed by the display.
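
The arithmetic in the two cases above can be checked with a short Python sketch. The function name and sign convention (positive means audio ahead of video at the viewer) are assumptions for this illustration; the 100 ms display delay and 80 ms arriving errors are the figures from the example.

```python
def residual_audio_lead_ms(arriving_audio_lead_ms: float,
                           display_video_delay_ms: float,
                           added_audio_delay_ms: float) -> float:
    """Audio lead seen by the viewer (positive = audio early, negative = late).

    The display's video delay pushes audio further ahead of the picture;
    any audio delay added by a receiver or delay box pulls it back.
    """
    return arriving_audio_lead_ms + display_video_delay_ms - added_audio_delay_ms


DISPLAY_DELAY_MS = 100

if __name__ == "__main__":
    # Case 1: video arrives 80 ms behind audio (audio leads by 80 ms).
    print(residual_audio_lead_ms(80, DISPLAY_DELAY_MS, 180))   # manual delay of 180 ms
    print(residual_audio_lead_ms(80, DISPLAY_DELAY_MS, 100))   # fixed 100 ms delay

    # Case 2: audio arrives 80 ms after video (audio leads by -80 ms).
    print(residual_audio_lead_ms(-80, DISPLAY_DELAY_MS, 20))   # manual delay of 20 ms
    print(residual_audio_lead_ms(-80, DISPLAY_DELAY_MS, 100))  # fixed 100 ms delay
    print(residual_audio_lead_ms(-80, DISPLAY_DELAY_MS, 0))    # no added delay
```

With these inputs the manual settings give a residual of 0 ms in both cases, the fixed 100 ms delay leaves 80 ms of error in each case (early audio in Case 1, late audio in Case 2), and adding no delay in Case 2 leaves only 20 ms.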




jesup

8/27/2008 3:21 PM EDT

All of this ignores the fact that the HDMI design apparently was flawed to begin with. (Oddly, I commented on this here and the comment seems to have disappeared.) NOTE: I haven't read the HDMI spec; I'm working on the info in this article.

Audio and video should all be timestamped with a presentation time that tells you for sure which video should be synced with which audio. This is what's done in all video and audio going over the net using RTP, and pretty much any other networked A/V protocol (since you assume there can be delay or jitter).

My guess is that HDMI was built with an assumption that inputs are (in some way) synced at the time of reception, and as mentioned by the other commenter, the best you can do is not make it worse on your output. There's nothing actively marking or correcting sync. The classic chains in TV stations often work this way, with any sync losses traditionally being fixed delays that can be compensated with a fixed delay. You can see how easy this is to get wrong for digital video from the problems stations have had in switching to HD chains.

Watermarking is NOT a good solution. Multiple watermarks (or even one) degrade the signal, and at any stage processing of the audio or video could by chance remove them. The only reason to use watermarks is lack of any out-of-band way to mark time or to match up presentation times.

HDMI is probably stuck now, or at least will require fancier solutions. They probably *should* have mandated that devices timestamp audio and video (and if a device gets an unstamped stream, assume it's in-sync (ugh) and add timestamps). Devices should have been required to transmit timestamped A/V with no more than an Xms sync mismatch, which would require a small amount of buffer in each device - more if there's variable processing time inside of it.

The biggest problem would be disjoint display/playback devices - HDMI should also have provided a way for devices to synchronize times (à la NTP), so two screens have a chance of displaying data at the same time (or a receiver can play back audio at the same time the LCD monitor displays the video). This would require devices to broadcast their current delays, and for devices to use the longest current delay.
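
A minimal Python sketch of the "broadcast current delays and use the longest" idea might look like the following. The function name, device names and delay figures are hypothetical; nothing here reflects an actual HDMI mechanism.

```python
from typing import Dict


def compensating_delays(reported_ms: Dict[str, float]) -> Dict[str, float]:
    """Extra delay each device must add so all present at the same instant.

    Every device pads its own output up to the longest reported delay.
    """
    target = max(reported_ms.values())
    return {name: target - delay for name, delay in reported_ms.items()}


if __name__ == "__main__":
    # Hypothetical example: a monitor with 100 ms of video processing and
    # a receiver with 20 ms of audio processing.
    reported = {"lcd_monitor": 100.0, "av_receiver": 20.0}
    print(compensating_delays(reported))
```

In this example the receiver would add 80 ms and the monitor nothing. If a device's delay changes, it would re-broadcast and every device would recompute, which is where the variable-delay difficulty discussed next comes in.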

It is inherently problematic to deal with changing delays, which almost always is caused by video. However, it's far easier to adjust video delay with skipped or duplicated frames than to adjust audio delay.

Items like the device the previous poster was pushing, and the fix proposed by the writer of the article are bandaids on a bad initial set of design decisions, based on "traditional" ways of handling analog A/V.







NickJ2008

9/5/2008 9:43 AM EDT

Sorry if I came across as "pushing" a specific product. I did mention that three companies - not just mine - now produce remote controlled digital audio delays to allow subjective lip-sync correction.

A review of all three digital audio delays for lip-sync correction can be found here:

http://www.audaud.com/article.php?ArticleID=3011

These products are currently the "only" solution to really correct lip-sync.

I certainly agree with "jesup" that they might be considered "bandaids" but when you are bleeding a "bandaid" is certainly better than nothing.

I mentioned that the industry needs to incorporate sync information starting with the original content creation - upon which "automatic lip-sync correction" could be based but I doubt we will see that any time soon.

I say that because the Reeves and Voelker study at Stanford (referenced in the original article), which proved the negative impact on viewer perception (even at 41.25 ms which most people don't notice), was made public over 12 years ago and little has changed since.






