Lip synchronization is a process that takes place on the receiving end in WebRTC (and other VoIP protocols), in which the audio and video tracks are aligned so that they play out in sync.
When media is captured in WebRTC, the sender timestamps the raw media, which is then encoded and sent over the network. During this process, audio and video are handled separately, as they go through different encoders with different behavior and strategies. The sender's intent is to send the media as soon as possible without harming network performance.
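The receiving end collects the media packets, passes them through a jitter buffer, and delays the video (or the audio) to achieve lip synchronization. It does this by matching the timestamps of the different media tracks. Because each track's RTP timestamps run on their own clock and start from their own offset, the receiver first uses RTCP sender reports to map them onto the sender's common wall clock, and only then compares them.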
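
The timestamp matching described above can be sketched roughly as follows. This is a minimal illustration, not how any particular WebRTC implementation does it. It assumes the receiver holds, for each track, an RTCP sender report that pairs an RTP timestamp with the sender's wall-clock time (as defined in RFC 3550). The names `SenderReport`, `capture_time`, and `extra_delays` are hypothetical and do not belong to any WebRTC API.

```python
from dataclasses import dataclass


@dataclass
class SenderReport:
    """Hypothetical, simplified view of an RTCP sender report for one track."""
    ntp_seconds: float   # sender wall-clock time carried in the report
    rtp_timestamp: int   # RTP timestamp that corresponds to that instant
    clock_rate: int      # RTP clock rate in Hz (e.g. 48000 audio, 90000 video)


def capture_time(rtp_timestamp: int, sr: SenderReport) -> float:
    """Map a packet's RTP timestamp onto the sender's wall clock, in seconds."""
    # RTP timestamps are 32-bit and wrap around, so take a signed 32-bit difference.
    diff = (rtp_timestamp - sr.rtp_timestamp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr.ntp_seconds + diff / sr.clock_rate


def extra_delays(audio_capture: float, audio_ready: float,
                 video_capture: float, video_ready: float) -> tuple[float, float]:
    """Return (audio_delay, video_delay) in seconds needed to align the tracks.

    *_capture is the sender wall-clock capture time of a frame; *_ready is the
    local time at which that frame could be played out after the jitter buffer.
    """
    # Positive skew: video takes longer end to end, so audio must be held back.
    skew = (video_ready - video_capture) - (audio_ready - audio_capture)
    return (max(skew, 0.0), max(-skew, 0.0))


# Example: both frames were captured at the same instant on the sender,
# but the video frame becomes ready 80 ms after the audio frame.
audio_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=48000, clock_rate=48000)
video_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=90000, clock_rate=90000)
a_cap = capture_time(96000, audio_sr)
v_cap = capture_time(180000, video_sr)
audio_delay, video_delay = extra_delays(a_cap, 5.020, v_cap, 5.100)
```

In this example the audio would be held back by roughly 80 ms and the video not at all. A real receiver would also smooth the computed delay over time rather than apply it frame by frame.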