Dynamic Temporal Alignment of Speech to Lips in Post-Production

Peleg Shmuel, HUJI, School of Computer Science and Engineering, Computer Science


  • Poor sound quality is very common in movie filming. Many speech segments are re-recorded in a studio during post-production to replace the poor audio captured on location.
  • Current re-recording workflows are tedious and demand considerable time and effort from the actor, director, recording engineer and sound editor.
  • The most challenging part is aligning the newly recorded audio to the actor’s original lip movements, as viewers are very sensitive to audio-lip discrepancies. This alignment is especially difficult when the original on-set speech is unclear.

Our Innovation

A novel audio-to-video alignment method that automates speech-to-lips alignment by stretching and compressing the audio signal to match the lip movements.

  • Accurate audio-to-video alignment, even when the original voice is unclear.
  • Compensates for cases where a constant shift of the sound cannot give a perfect alignment.
  • Dynamic temporal alignment method.
  • Improved performance over existing methods.
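
To illustrate why a single constant shift may not suffice, the following is a minimal sketch, not the authors' implementation. It assumes a handful of hypothetical anchor timestamps (in seconds) where a sound in the re-recorded audio should coincide with a lip event in the video. The function names and the example timestamps are illustrative assumptions only.

```python
def best_constant_shift(audio_times, video_times):
    """Return the mean offset and the worst remaining error after applying it."""
    offsets = [v - a for a, v in zip(audio_times, video_times)]
    shift = sum(offsets) / len(offsets)
    residual = max(abs(o - shift) for o in offsets)
    return shift, residual


def local_stretch_factors(audio_times, video_times):
    """Per-interval factor: > 1 stretches the audio, < 1 compresses it."""
    factors = []
    for k in range(1, len(audio_times)):
        audio_span = audio_times[k] - audio_times[k - 1]
        video_span = video_times[k] - video_times[k - 1]
        factors.append(video_span / audio_span)
    return factors


# Hypothetical anchors: the actor spoke slightly slower, then faster, on set.
audio_anchors = [0.0, 1.0, 2.0, 3.0]
video_anchors = [0.0, 1.2, 2.1, 3.0]

shift, residual = best_constant_shift(audio_anchors, video_anchors)
factors = local_stretch_factors(audio_anchors, video_anchors)
```

In this toy case the best constant shift still leaves a residual error, whereas stretching the first interval and compressing the following two would match every anchor. How the audio is actually resampled while preserving pitch is outside the scope of this sketch.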


  • Temporally aligns the audio and video of a speaking person by using innovative deep audio-visual features to map the lips video and the speech signal to a shared representation.
  • Based on this shared representation, the lip-sync error between every short speech period and every video frame is computed, followed by the determination of the optimal corresponding frame for each short sound period over the entire video clip.
  • Successful alignment was demonstrated both quantitatively, using a human perception-inspired metric, and qualitatively.
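
The two steps above (a pairwise lip-sync error, then an optimal frame per sound period) can be sketched as follows. This is a hedged illustration only: it assumes the shared-representation embeddings are already available as plain lists of floats, uses Euclidean distance as a stand-in for the lip-sync error, and uses classic dynamic time warping as the optimization. The source does not state which error measure or optimizer the method uses, and the function names are hypothetical.

```python
import math


def pairwise_error(audio_feats, video_feats):
    """Stand-in lip-sync error: distance between every audio/video embedding pair."""
    return [[math.dist(a, v) for v in video_feats] for a in audio_feats]


def align(audio_feats, video_feats):
    """Return, for each short audio period, the index of its matched video frame."""
    cost = pairwise_error(audio_feats, video_feats)
    n, m = len(audio_feats), len(video_feats)

    # acc[i][j]: minimal total error aligning the first i periods to the first j frames.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )

    # Backtrack the monotonic minimum-error path from the end to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )

    # A period may touch several frames on the path; keep the lowest-error one.
    best = {}
    for a, v in path:
        if a not in best or cost[a][v] < cost[a][best[a]]:
            best[a] = v
    return [best[a] for a in range(n)]


# Toy one-dimensional "embeddings": the audio lingers at the start and jumps at the end.
video = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
audio = [[0.0], [0.0], [1.0], [2.0], [5.0]]
matches = align(audio, video)
```

The resulting frame indices are non-decreasing, so the audio is only ever locally stretched or compressed, never reordered.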


Fig. 1: Given a speech video and a segment of corresponding, but unaligned, audio, the audio is aligned to match the lip movements.



  • Movie production industry
  • TV industry
  • Video clips in other platforms

Contact for more information:

Anna Pellivert