Technology Archives • BlogGeek.me

OpenAI, LLMs, WebRTC, voice bots and Programmable Video

Tsahi Levent-Levi — Mon, 29 Jul 2024 09:30:00 +0000

Learn about WebRTC LLM and its applications. Discover how this technology can improve real-time communication using conversational AI.

Talk about an SEO-rich title… anyways. When Philipp suggests something to write about I usually take note and write about it. So it is time for a teardown of last month’s demo by OpenAI – what place WebRTC takes there, how it affects the programmable video market of Video APIs.

I’ve been dragged into this discussion before. In my monthly recorded conversation with Arin Sime, we talked about LLMs and WebRTC:

Time to break down the OpenAI demo that was shared last month and what role WebRTC and its ecosystem plays in it.

The OpenAI GPT-4o demo
Text be like…
“Traditional” voice bots are like turn based games
Realtime LLMs are like… real-time games
Real life and conversational bots
Working on the WebRTC and LLM infrastructure
What’s next?

The OpenAI GPT-4o demo

Just to be on the same page, watch the demo below – it is short and to the point:

(for the full announcement demos video check out this link. You really should watch it all)

There were several interfaces shown (and not shown) in these demos:

No text prompts. Everything was done in a conversational manner
And by conversation I mean voice. The main interface was a person talking to ChatGPT through his phone app
There were a few demos that included “vision”
- They were good and compelling, but they weren’t video per se
- It felt more like images being uploaded, applying OCR/image recognition on them or some such
- This was clearly evident in the last demo on this, where the person had to tell ChatGPT to use the latest image and not an older one – some polish is still needed here and there

Besides the interface used, there were 3 important aspects mentioned, explained and shown:

This was more than just speech to text or text to speech. It gave the impression that ChatGPT perceived and generated emotions. I dare say, the OpenAI team went above and beyond to show that on stage
Humor. It seems humor and in general humans are now more understandable by ChatGPT
Interruptions. This wasn’t a turn by turn prompting but rather a conversation. One where the person can interrupt in the middle to veer and change the conversation’s direction

Let’s see why this is different from what we’ve seen so far, and what is needed to build such things.

Text be like…

ChatGPT started off as text prompting.

You write something in the prompt, and ChatGPT obligingly answers.

It does so with a nice “animation”, spewing the words out a few at a time. Is that due to how it works, or does it slow down the animation versus how it works? Who knows?

This gives a nice feel of a conversation – as if it is processing and thinking about what to answer, making up the sentences as it goes along (which to some extent it does).

This quaint prompting approach works well for text. A bit less for voice.

And now that ChatGPT added voice, things are getting trickier.

“Traditional” voice bots are like turn based games

Before all the LLM craze and ChatGPT, we had voice bots. The acronyms at the time were NLP and NLU (Natural Language Processing and Natural Language Understanding). The result was like a board game where each side has its turn – the customer and the machine.

The customer asks something. The bot replies. The customer says something more. Oh – now’s the bot’s turn to figure out what was said and respond.

In a way, it felt/feels like navigating the IVR menus via voice commands that are a bit more natural.

The turn by turn nature means there was always enough time.

You could wait until you heard silence from the user (known as endpointing). Then start your speech to text process. Then run the understanding piece to figure out intents. Then decide what to reply and turn it into text and from there to speech, preferably with punctuation, and then ship it back.

The pieces in red can easily be broken down into more logic blocks (and they usually are). For the purpose of discussing the real time nature of it all, I’ve “simplified” it into the basic STT-NLU-TTS
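To make the "wait for silence" step a bit more concrete, here is a minimal sketch of energy-based endpointing. Everything in it is an assumption for illustration: the function name, the threshold and the 25-frame silence window (roughly half a second at 20 ms per frame) are made up, and real voice bots typically use a proper voice activity detector rather than a raw energy threshold.

```python
from typing import Optional, Sequence


def find_endpoint(frame_energies: Sequence[float],
                  silence_threshold: float = 0.01,
                  min_silence_frames: int = 25) -> Optional[int]:
    """Return the index of the frame where the user's turn is declared over.

    Returns None while the user is still talking (or has not spoken yet).
    All parameters are hypothetical values chosen for the example.
    """
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: the turn has started and any silence run is reset.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence after speech: count it towards the end of the turn.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return index
    return None
```

Only once this returns an index does the turn based pipeline kick off STT, then NLU, then TTS. The silence window is pure waiting time, paid on every single turn before any processing starts.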

To build bots, we focused on each task one at a time. Trying to make that task work in the best way possible, and then move the output of that task to the next one in the pipeline.

If that takes a second or two – great!

But it isn’t what we want or need anymore. Turn based conversations are arduous and tiring.

Realtime LLMs are like… real-time games

Here are the 4 things that struck a chord with me when GPT-4o was introduced from the announcement itself:

GPT-4o is faster (you need that one for something that is real-time)
Future of collaboration – somehow, they hinted at working together and not only man to machine, whatever that means at this early stage
Natural, feels like talking to another person and not a bot (which is again about switching from turn based to real-time)
Easier, on the user. A lot due to the fact that it is natural

Then there was the fact that the person in the demo cuts GPT-4o short in mid-sentence and actually gets a response back without waiting until the end.

There’s more flexibility here as well. Less to learn about what needs to be said to “strike” specific intents.

Moving from turn based voice bots to real-time voice bots is no easy feat. It is also what’s in our future if we wish these bots to become commonplace.

Real life and conversational bots

The demo was quite compelling. In a way, jaw dropping.

There were a few things there that were either emphasized or skimmed through quickly that show off capabilities that, if they arrive in the product once it launches, are going to make a huge difference in the industry.

Here are the ones that resonated with me

Wired and not wireless. Why on earth would they do a wired demo from a mobile device? The excuse was network reception. Somehow, it makes more sense to just get an access point in the room, just below the low table and be done with it. Something there didn’t quite work for me – especially not for such an important demo (4.6M views in 2 months on the full session on YouTube)
Background noise. Wired means they want a clean network. Likely for audio quality. Background noise can be just as bad for the health of an LLM (or a real conversation). These tools need to be tested rigorously in real time environments… with noise in them. And packet loss. And latency. Well… you got the hint
Multiple voices. Two or more people sitting around the table, talking to GPT-4o. Each time someone else speaks. Does GPT need to know these are different people? That’s likely, especially if what we aim at is conversations that are natural for humans
Interruptions. People talking over each other locally (the multiple voices scenario). A person interrupting GPT-4o while it runs inference or answers. Why not GPT-4o interrupting a rambling human, trying to focus him?
Tone of voice. Again, this one goes both ways. Understanding the tone of voice of humans. And then there’s the tone of voice GPT-4o needs to play. In the case of the demo, it was friendly and humorous. Is that the only tone we need? Likely not. Should tone be configurable? Predetermined? Dynamic based on context?

There are quite a few topics that still need to be addressed. OpenAI and ChatGPT have made huge strides and this is another big step. But it is far from the last one.

We will know more on how this plays out in real life once we get people using it and writing about their own experiences – outside of a controlled demo at a launch event.

Working on the WebRTC and LLM infrastructure

In our domain of communication platforms and infrastructure, there are a few notable vendors that are actively working on fusing WebRTC with LLMs. This definitely isn’t an exhaustive list. It includes:

Those that made their intentions clear
Had something interesting to say besides “we are looking at LLMs”
And that I noticed (sorry – I can’t see everyone all the time)

They are taking slightly different approaches, which makes it all the more interesting.

Before we start, let’s take the diagram from above of voicebots and rename the NLU piece into LLM, following marketing hype as it is today:

The main difference now is that LLM is like pure black magic: We throw corpuses of text into it, the more the merrier. We then sprinkle a bit of our own knowledge base and domain expertise. And voila! We expect it to work flawlessly.

Why? Because OpenAI makes it seem so easy to do…

Programmable Video and Video APIs doing LLM

In our domain of programmable video, what we see are vendors trying to figure out the connectors that make up the WebRTC-LLM pipeline and doing that at as low latency as possible.

Agora

Agora just published a nice post about the impact of latency on conversational AI.

The post covers two areas:

The mobile device, where they tout their native SDK as being faster and with lower latency than the typical implementation
The network, relying on their SD-RTN infrastructure for providing lower latency than others

In a way, they focus on the WebRTC-realm of the problem, ignoring (or at least not saying anything about) the AI/LLM-realm of the problem.

It should be said that this piece is important and critical in WebRTC no matter if you are using LLMs or just doing a plain meeting between mere humans.

Daily

Daily takes its unique approach to LLMs the same way it does in other areas. They offer a kind of a Prebuilt solution. They bring in partners and integrations and optimize them for low latency.

In a recent post they discuss the creation of the fastest voice bot.

For Daily, WebRTC is the choice to go for since it is already real time in nature. Sprinkle on top of it some of the Daily infrastructure (for low latency). And add the new components that are not part of a typical WebRTC infrastructure. In this case, packing Deepgram’s STT and TTS along with Meta’s Llama 3.

The concept here is to place STT-LLM-TTS blocks together in the same container so that the message passing between them doesn’t happen over a network or an external API. This reduces latencies further.

Go read it. They also have a nice table with the latency consumers along the whole pipeline in a more detailed breakdown than my diagrams here.

LiveKit

In January this year, LiveKit introduced the LiveKit Agents. Components used to build conversational AI applications. They haven’t spoken since about this on their blog, or about latency.

That said, it is known that OpenAI is using LiveKit for their conversational AI. So whatever worries OpenAI has about latencies are likely known to LiveKit…

LiveKit has been lucky to score such a high profile customer in this domain, giving it credibility in this space that is hard to achieve otherwise.

Twilio’s approach to LLMs

Twilio took a different route when it comes to LLM.

Ever since its acquisition of Segment, Twilio has been pivoting or diversifying. From communications and real time into personalization and storage. I’ve written about it somewhat when Twilio announced sunsetting Programmable Video.

This makes the announcement a few months back quite reasonable: Twilio AI Assistant

This solution, in developer preview, focuses on fusing the Segment data on a customer with the communication channel of Twilio’s CPaaS. There’s little here in the form of latency or real time conversations. That seems to be secondary for Twilio at the moment, but is also something they are likely now exploring as well due to OpenAI’s announcement of GPT-4o.

For Twilio? Memory and personalization are what is important about the LLM piece. And this is likely highly important to their customer base. How other vendors without access to something like Segment are going to deal with it is yet to be seen.

Fixie anyone?

When you give Philipp Hancke an article to review, he has good tips. This time it meant I couldn’t make this one complete without talking about fixie.ai. For a company that raised $17M they don’t have much of a website.

Fixie is important because of 3 things:

Justin Uberti, one of the founders of WebRTC, is a Co-founder and CTO there
It relies on WebRTC (like many others)
It does things a wee bit differently, and not just by being open source

Fixie is working on Ultravox, an open source platform that is meant to offer a speech-to-speech model. No more need for STT and TTS components. Or breaking these into smaller pieces yet.

From the website, it seems that their focus at the moment is modeling speech directly into the LLM, avoiding the need to go through speech to text. The reasoning behind this approach is twofold:

You don’t lose latency on going through the translation to text and from there into the LLM
Voice has a lot more to it than just the spoken words. Having that information readily available in the LLM can be quite useful and powerful

The second part of it, of converting the result of the LLM back into speech, is not there yet.

Why is that interesting?

Justin… who is where WebRTC is (well… maybe apart from his stint at Clubhouse)
The idea of compressing multiple steps into one
It was tried for transcoding video and failed, but that was years ago, and was done computationally. Here we’re skipping all this and using generative AI to solve that piece of the puzzle. We still don’t know how well it will work, but it does have merit

What’s next?

There are a lot more topics to cover around WebRTC and LLM. Rob Pickering looks at scaling these solutions for example. Or how do you deal with punctuations, pauses and other phenomena of human conversations.

With every step we make along this route, we find a few more challenges we need to crack and solve. We’re not there yet, but we definitely stumbled upon a route that seems really promising.

The post OpenAI, LLMs, WebRTC, voice bots and Programmable Video appeared first on BlogGeek.me.

Fixing packet loss in WebRTC

Tsahi Levent-Levi — Mon, 01 Jul 2024 09:30:00 +0000

Discover the hidden dangers of packet loss and its impact on your WebRTC application. Find out how to optimize your network performance and minimize packet loss.

If there’s one thing that can give you better media quality in WebRTC it is going to be the reduction (or elimination?) of packet loss. Nothing else will be as effective as this.

What I want to do here is to explain packet loss, why it is inevitable, and the many ways we have at our disposal to increase the resilience and quality of our media in WebRTC in the face of packet losses.

Why do we have packet loss in WebRTC?
What to do to overcome packet losses?
Have less packet losses
- Location of infrastructure elements in WebRTC
- Better bandwidth estimation
Conceal packet losses (PLC)
- Audio and packet loss concealment
- Video and packet loss concealment 👉 frame dropping
Retransmitting lost packets (RTX)
- Video and RTX
- Audio and RTX
Correct packet losses in advance (FEC)
- Audio FEC
- Video FEC
Wrapping it all up
Learn more about WebRTC (and everything about it)

Why do we have packet loss in WebRTC?

There are many reasons for packet losses to occur on modern networks and with WebRTC. To name a few of these:

Wireless and cellular networks may suffer due to the distance between the device and the access point, as well as other obstructions (physical or just aerial interference)
Routers and switches can get congested, causing delays as well as dropped packets
Ethernet cables can be faulty at times
Connections between switches are not always as clean as they could be
Media servers not doing their job correctly or just getting overtaxed with traffic
Entropy. The more we miniaturize and condense things, the more entropy will kick in (I added this one just to sound smart)
Devices might not be faring too well at times either

We think of the internet as a reliable network. You direct a browser to a web page. And magically the page loads. If it doesn’t, then the network or server is down. End of story. That’s because packet losses there are handled by retransmitting what is lost. The cost? You wait a wee bit longer for your page to load.

With WebRTC we are dealing with real time communications. So if something gets lost there is little time to fix that.

👉 Packet losses are a huge headache for WebRTC applications

What to do to overcome packet losses?

Packet loss is an inevitability when it comes to WebRTC and VoIP in general. You can’t really avoid it. The question then becomes: what can we do about this?

There are four different approaches here that can be combined for a better user experience:

Have less packet losses – if we have less of these, then user experience will improve
Conceal packet losses (PLC) – once we have packet losses, we need to try and figure out what to do to conceal that fact from the user
Retransmit lost packets (RTX) – we might want to try and retransmit what was lost, assuming there’s enough time for it
Correct packet losses in advance (FEC) – when we know there’s high probability of packet losses, we might want to send packets more than once or add some error correction mechanism to deal with the potential packet losses

From here on, let’s review each one of these four approaches.

Have less packet losses

This is the most important solution.

Because I don’t want you to miss this, I’ll write this again:

This is the most important solution.

If there is less packet loss, there is going to be less headache to deal with when trying to “fix” this situation. So reducing packet loss should be your primary objective. Since you can’t fully eradicate packet loss, we will still need to use other techniques. But it starts with reducing the amount of packet losses.

Location of infrastructure elements in WebRTC

Where you place your media servers and TURN servers and how you route traffic for your WebRTC service will have a huge impact on packet loss.

Best practice today is having the first server that WebRTC media hits as close to the user as possible. The understanding behind that is that this reduces the number of hops and network infrastructure components that the media packets need to traverse over the open internet. Once on your server, you have a lot more control over how that data gets processed and forwarded between the servers.

Having a single data center in the US cater for all your traffic is great, assuming your users are from that region. Once users start joining from across the pond (say… France. Or India), you will start seeing higher latencies and with them higher levels of packet loss.

A few things here:

Where you place your servers highly depends on your users and their behavior
TURN servers are important to spread globally, but at the end of the day, check how much of your actual traffic gets relayed through TURN servers
Media servers are something I’d try to spread globally more, assuming these are needed in all meetings. I’d also focus on cascaded/distributed architectures where users join the closest media server (versus allocating a specific server for all users in the same meeting)

Where to start?

👉 Know the latency (RTT) of your users. Monitor it. Strive towards improving it

👉 Check if there are locations and users that are routed across regions. Beef up your infrastructure in the relevant regions based on this data

👉 Since we want to reduce packet loss, you should also monitor… packet loss
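Monitoring packet loss usually boils down to comparing two snapshots of cumulative counters. Below is a minimal sketch of that arithmetic. The `packetsLost` and `packetsReceived` keys mirror the counter names exposed by WebRTC's statistics for inbound RTP streams. The function name, the plain dictionaries and the way snapshots are collected are assumptions made for the example.

```python
from typing import Mapping


def interval_loss_percent(prev: Mapping[str, int],
                          curr: Mapping[str, int]) -> float:
    """Packet loss percentage between two snapshots of cumulative counters."""
    lost = curr["packetsLost"] - prev["packetsLost"]
    received = curr["packetsReceived"] - prev["packetsReceived"]
    expected = lost + received
    if expected <= 0:
        # Nothing was expected in this interval, so there is nothing to report.
        return 0.0
    # The lost counter can shrink when late or duplicate packets show up,
    # so the result is clamped at zero.
    return max(0.0, 100.0 * lost / expected)
```

Computing this per interval (say, every second or every few seconds) rather than over the whole session is what lets you spot short bursts of loss and correlate them with specific regions, networks or servers.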

Better bandwidth estimation

I should have called this better bandwidth management, but for SEO reasons, kept it bandwidth estimation 😉

Here’s the thing:

Sending more than the network can handle, more than the sender can send or more than the receiver can receive leads to packet loss and packet drops.

Fixing that boils down to bandwidth management – you don’t want to send too little since media quality will be lower than what you can achieve. And you don’t want to send too much since… well… packet loss.

Your service needs to be able to estimate bandwidth. That needs to happen on both the uplink and the downlink for each user.

The challenge is that available bandwidth is dynamic in nature. At each point in time, we need to estimate it. If we overshoot – packets are going to be delayed or lost. If we undershoot, we are going to reduce media quality below what we can achieve.
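The overshoot/undershoot balancing act can be sketched as a tiny loss-based controller: back off when loss is high, probe upwards when loss is negligible, hold otherwise. The thresholds, step sizes and bitrate limits here are illustrative assumptions, not what any specific browser or media server does. Real implementations also look at delay and not only at loss.

```python
def next_send_bitrate(current_bps: float,
                      loss_fraction: float,
                      min_bps: float = 50_000.0,
                      max_bps: float = 2_500_000.0) -> float:
    """Pick the next target bitrate from the loss seen in the last interval.

    loss_fraction is between 0.0 and 1.0. All constants are assumptions.
    """
    if loss_fraction > 0.10:
        # Overshooting: back off in proportion to the loss we observed.
        target = current_bps * (1.0 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Possibly undershooting: probe for more bandwidth in small steps.
        target = current_bps * 1.05
    else:
        # In between: hold the current bitrate.
        target = current_bps
    return max(min_bps, min(max_bps, target))
```

Calling this once per feedback interval gives a bitrate that slowly climbs while the network is clean and drops quickly when losses appear.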

Web browser implementations of WebRTC have their own bandwidth management algorithms and they are rather good. Media servers have different implementations and their quality varies.

For media servers, we also need to remember that we aren’t dealing only with bandwidth estimation but rather with bandwidth management. Once we approximately know the available bandwidth, we need to decide which of the streams to send over the connection and at which bitrates; doing that while seeing the bigger picture of the session (hence bandwidth management and not estimation).

Conceal packet losses (PLC)

Packet loss concealment is what we do after the fact. We lost packets, but we need to play out something for the user. What should we do to conceal the problem of packet loss?

This may seem like the last thing to deal with, but it is the first we need to tackle. There are two reasons why:

No matter what kind of techniques and resiliency mechanisms you use, at the end of the day, some level of packet loss is bound to occur
Other techniques we have are more sophisticated. Usually we will get to implement them later on. We NEED to have a rock solid concealment strategy before adding more techniques

Audio and video are different, which is why from here on, we will distinguish between the two in the techniques we are going to use.

Audio and packet loss concealment

With audio, a loss of an audio packet almost always translates immediately to a loss of one or more audio frames (and we usually have 50 audio frames per second).

“Skipping” them doesn’t work so well, as it leads to robotic audio when there’s packet loss.

Other naive approaches here include things like playing back the last frame received – either as is or with a reduction in its volume.
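Here is a minimal sketch of that naive approach: replay the last good frame, fading it out a bit more with every consecutive loss. Frames are modeled as plain lists of samples and a lost frame as `None`. The function name and the decay factor are assumptions for illustration only.

```python
from typing import List, Optional, Sequence

Frame = List[float]


def conceal(frames: Sequence[Optional[Frame]],
            frame_len: int,
            decay: float = 0.5) -> List[Frame]:
    """Replace lost frames (None) with a faded copy of the last good frame."""
    output: List[Frame] = []
    last_good: Frame = [0.0] * frame_len  # silence until something arrives
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = list(frame)
            gain = 1.0
            output.append(list(frame))
        else:
            # Each consecutive loss is played back quieter than the one before.
            gain *= decay
            output.append([sample * gain for sample in last_good])
    return output
```

For example, `conceal([[1.0, -1.0], None, None, [0.5, 0.5]], frame_len=2)` fills the two gaps with the first frame at half and then quarter volume.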

More sophisticated approaches try to estimate what should have been received by way of machine learning (or what we love calling it these days – generative AI). Google has such a capability inhouse (though not inside the open source implementation of WebRTC that they have). If you are interested in learning more about this, you can check out Google’s explanation of WaveNetEQ.

A few things to remember here:

👉 For the most part, this isn’t something in your control, unless you own/compile your WebRTC stack on the device side

👉 Knowing how browsers behave here enables you to be slightly smarter with the other techniques you are going to use (by deciding when to use them and how aggressively)

👉 In your own native application? You can improve on things, but you need to know what you’re doing and you need to have a compelling reason to take this route

Video and packet loss concealment 👉 frame dropping

Video is trickier with packet losses:

With video coding, each frame is usually dependent on past frames (to improve upon compression rates)
A video frame is almost always composed of multiple packets

One lost packet translates into a lost frame, which can easily cause loss of the whole video sequence:

Packet loss concealment in video means dropping a frame, and oftentimes freezing the video until the next keyframe arrives.

What can the receiver do in case of such a loss? If it believes it won’t recover quickly (which is most commonly the case), it can send out a FIR or PLI message over RTCP to the sender. These messages indicate to the sender that there’s a loss that needs to be addressed, where the usual solution is to reset the encoder and send a new keyframe.

In the past, systems used to try and overcome packet losses by continuing to decode without the missing packets. The end result was smearing artifacts on the video until a new keyframe arrived. Today, best practice is to freeze the video until a keyframe arrives (which is what all browser implementations do).
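The freeze-until-keyframe behavior can be sketched as a small receiver loop. This is a simplified model under stated assumptions: every non-keyframe depends on the frame before it (no temporal layers), a frame is either complete or not, and a PLI is recorded for every incomplete frame (real receivers rate limit these). The class and function names are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VideoFrame:
    frame_id: int
    is_keyframe: bool
    complete: bool  # False when at least one of the frame's packets was lost


def play_out(frames: Sequence[VideoFrame]) -> Tuple[List[int], List[int]]:
    """Return (ids of rendered frames, ids of frames that triggered a PLI)."""
    rendered: List[int] = []
    pli_requests: List[int] = []
    frozen = True  # nothing can be decoded before the first keyframe
    for frame in frames:
        if not frame.complete:
            # Drop the frame, freeze the video and ask for a new keyframe.
            frozen = True
            pli_requests.append(frame.frame_id)
            continue
        if frozen and not frame.is_keyframe:
            # Still frozen: this frame depends on something we never decoded.
            continue
        frozen = False
        rendered.append(frame.frame_id)
    return rendered, pli_requests
```

Feed it a sequence where one delta frame is incomplete and you will see every frame after it skipped until the next keyframe shows up, which is exactly the freeze users experience.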

A few things to remember here:

👉 You have more control here than in audio. That’s because a lost packet means you will receive FIR or PLI message on the other end. If that’s your media server receiving these messages, you can decide how to respond

👉 Sending a keyframe means investing more on bitrate for that frame. If there’s congestion over the network, then this will just put more burden. Most media servers would avoid sending too many of these in larger group meetings

👉 There are video coding techniques that reduce the dependencies between frames. These include temporal scalability and SVC

Retransmitting lost packets (RTX)

If a packet is missing, then the first solution we can go for is to retransmit it.

The receiver knows what packets it is missing. Once the sender knows about the missing packets (via NACK messages), it can resend them as RTX packets.

Retransmission is the most economic solution in terms of network resources. It is the least wasteful solution. It is also the hardest to make use of. That’s because it ends up looking something like this:

In order to retransmit, we need to:

Know there are missing packets (by receiving a newer packet)
Decide that the older ones won’t be arriving and are lost
Let the sender know they are lost
Have the sender retransmit them

This takes time. A long time.

The question then becomes, is it going to be too late to retransmit them.
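A rough way to answer the "too late?" question is to add up the steps above and compare them to how long the receiver is willing to hold media before playing it. The sketch below assumes the NACK and the retransmission together cost about one round trip, plus whatever time it took to detect the loss. The function name, the parameters and the 5 ms processing allowance are assumptions for illustration.

```python
def rtx_arrives_in_time(rtt_ms: float,
                        detection_delay_ms: float,
                        playout_delay_ms: float,
                        processing_ms: float = 5.0) -> bool:
    """Estimate if a retransmitted packet can still be played out.

    detection_delay_ms: time until the receiver decides the packet is lost.
    playout_delay_ms: how long the receiver buffers media before playback.
    """
    # NACK travels to the sender and the RTX packet travels back: one RTT.
    recovery_ms = detection_delay_ms + rtt_ms + processing_ms
    return recovery_ms <= playout_delay_ms
```

With a 40 ms round trip there is usually room for a retransmission. With a 200 ms round trip and a 100 ms buffer there isn't, and asking for one only adds traffic.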

Video and RTX

Video can make real use of retransmissions (and it does in WebRTC).

With video compression, we have a kind of hierarchy of frames. Some frames are more important than others:

Keyframes (or I-frames) are the most important. They are “standalone” frames that aren’t reliant on any past frames
In SVC and temporal scalability, some frames are a kind of a dead-end, with nothing reliant on them, while others have frames reliant on them

The above illustration, for example, shows how keyframes and temporal scalability build dependency chains. Key denotes the keyframe while L0 has higher usability than L1 frames (L1 frames are dependent on L0 frames and nothing depends on them).

When we have such a dependency tree of frames, we can do some interesting things with resiliency. One of them is deciding if it is worthwhile to ask for a retransmission:

If the missing packets are from a keyframe, then asking for a retransmission is useful even if the keyframe itself won’t be displayed due to the time that passed
Similarly, we can decide to do this for L0 frames (these being quite important)
And we can just skip packets of L1 frames that are lost – we might not have time to playback this frame once the retransmission arrives, and that data will be useless anyway
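The decision rules in this list can be sketched in a few lines. The enum and function names are hypothetical. One small liberty taken here: for L1 frames the sketch still asks for a retransmission when it is expected to arrive before playout. Always returning `False` for L1 gives the simpler rule from the list above.

```python
from enum import Enum


class FrameKind(Enum):
    KEY = "key"  # keyframe, everything after it depends on it
    L0 = "l0"    # base temporal layer, other frames depend on it
    L1 = "l1"    # top temporal layer, nothing depends on it


def should_request_rtx(kind: FrameKind, can_arrive_before_playout: bool) -> bool:
    """Decide if a lost packet of a frame is worth asking for again."""
    if kind in (FrameKind.KEY, FrameKind.L0):
        # Worth having even if it shows up too late to be displayed,
        # since later frames need it in order to be decoded.
        return True
    # Dead-end frame: only useful if it can still be played back.
    return can_arrive_before_playout
```

The point of the exercise is that the retransmission budget gets spent where a loss would hurt the most frames.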

Audio and RTX

Audio compression doesn’t enjoy the same dependency tree that video compression does. Which is why libwebrtc doesn’t have code to deal with audio RTX.

Would having RTX for audio be useful? It can be. Audio packets usually wait for video packets to arrive for lip synchronization purposes. If we can use that wait time to retransmit, then we can improve upon audio quality. Google likely deemed this not important enough.

Correct packet losses in advance (FEC)

We could ask for a retransmission after the fact, but what about making sure there’s no need? This is what FEC (Forward Error Correction) is all about.

Think of it this way – if we had one shot at what we want to send and it was super important – would it make sense to send 100 copies of it, knowing that the chances that one of these copies would reach its destination is high?

FEC is about sending more packets that can be used to reconstruct or replace lost packets.

There are different FEC schemes that can be used, with the main 3 of them being:

Duplication (send the same thing over and over again)
XOR (add packets that XOR the ones we wish to protect)
Reed Solomon (similar to XOR just more complex and more resilient)

WebRTC supports duplication and XOR out of the box.

The biggest hurdle of FEC is its use of bitrate – it is quite network hungry in that regard.
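To show the XOR idea in its simplest form, here is a sketch that protects a group of packets with a single parity packet and rebuilds one missing packet from it. This is a toy model and not the wire format of any WebRTC FEC scheme: the function names are made up, and packets are assumed to be of equal length (real schemes also protect the packet length and headers).

```python
from typing import Optional, Sequence


def xor_parity(packets: Sequence[bytes]) -> bytes:
    """Build one parity packet that is the XOR of all packets in the group."""
    parity = bytearray(max(len(packet) for packet in packets))
    for packet in packets:
        for index, byte in enumerate(packet):
            parity[index] ^= byte
    return bytes(parity)


def recover_missing(received: Sequence[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing packet (marked as None) of a group."""
    missing = [index for index, packet in enumerate(received) if packet is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing packet")
    rebuilt = bytearray(parity)
    for packet in received:
        if packet is not None:
            # XOR-ing out everything that did arrive leaves the lost packet.
            for index, byte in enumerate(packet):
                rebuilt[index] ^= byte
    return bytes(rebuilt)
```

One parity packet for every three media packets means roughly 33% more bitrate, and it still only covers a single loss within that group. That is the network hunger mentioned above in practice.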

Audio FEC

Audio FEC comes in two different manners:

In-codec FEC (such as Opus in-band FEC), where the FEC mechanism is part of the codec implementation itself
RTP-based FEC, where the FEC mechanism is part of the RTP protocol

In-band FEC is implemented as part of the Opus codec library. It is ok’ish at best – nothing to write home about.

Then there’s RED – Redundancy Encoding – where each audio packet holds more than a single audio frame. And the ones it holds are just slightly older frames, so that if a packet is lost, we get it in another packet.

RED is implemented in libwebrtc. Support is limited to 1 level of redundancy for RED (meaning recovering up to one sequential lost packet). You can use WebRTC’s Insertable Streams mechanism to generate RED packets at higher redundancy or dynamic redundancy in the browser though.

In the above, Philipp Hancke explains RED (along with other resiliency features for audio in WebRTC).
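The RED idea with one level of redundancy can be sketched like this: every packet carries the current audio frame plus a copy of the previous one. This is a toy model only. It does not follow the actual RED payload format, and the tuple layout and function names are assumptions for the example.

```python
from typing import List, Optional, Sequence, Tuple

# (sequence number, primary frame, copy of the previous frame if there is one)
RedPacket = Tuple[int, bytes, Optional[bytes]]


def build_red_packets(frames: Sequence[bytes]) -> List[RedPacket]:
    """Pack each frame together with a redundant copy of the one before it."""
    packets: List[RedPacket] = []
    previous: Optional[bytes] = None
    for seq, frame in enumerate(frames):
        packets.append((seq, frame, previous))
        previous = frame
    return packets


def recover_frames(received: Sequence[RedPacket],
                   total: int) -> List[Optional[bytes]]:
    """Rebuild the frame sequence. Frames that can't be recovered stay None."""
    frames: List[Optional[bytes]] = [None] * total
    for seq, primary, redundant in received:
        frames[seq] = primary
        if redundant is not None and seq > 0 and frames[seq - 1] is None:
            # The previous packet never arrived: use the redundant copy.
            frames[seq - 1] = redundant
    return frames
```

Dropping any single packet still yields all frames on the receiving end. Dropping two in a row leaves a hole, which is the limit of one level of redundancy.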

Video FEC

FEC for video is considered wasteful. If we need to increase bitrate by 20% or more to introduce robustness using FEC, then it comes at a cost of video quality that we could increase by using higher video bitrate.

For the most part, WebRTC ignores FEC for video, which is a shame. When using temporal scalability or SVC, the same way that we can decide to retransmit only important packets, we can also decide to add FEC protection only to the more important frames.

Wrapping it all up

Dealing with packet loss in WebRTC isn’t a simple task. It gets more complex over time, as more techniques and optimizations are bolted on to the implementation. What I want to do here is to list the various tools at our disposal to deal with packet losses. When and how we decide to use them would determine the resulting robustness and media quality of the implementation.

Here’s a quick table to sum things up a bit:

	PLC	RTX	FEC
Focus	What to playback to the user	When to ask for missing packets	When to send duplicated packets
Advantages	None. You must have this logic implemented	Low network footprint	Low latency overhead
Challenges	Audio may sound robotic; video will freeze	Increases latency. Might not be usable due to it	High network footprint. Can be quite wasteful
Audio	Duplicate last frames or reduce volume; use Gen AI to estimate what was lost	Not commonly used for audio in WebRTC	FlexFEC used by WebRTC; can use RED if you want to
Video	Skip video frames; ask for a fresh keyframe to reset the video stream	Can be optimized to retransmit packets of important frames only	Not commonly used for video in WebRTC

Oh – and make sure you first put an effort to reduce the amount of packet losses before starting to deal with how to overcome packet losses that occur…

Learn more about WebRTC (and everything about it)

Packet loss is one of the topics you need to deal with when writing WebRTC applications. There are many aspects affecting media quality – packet loss is but one of them. This time, we looked into the tools available in WebRTC for dealing with packet losses.

To learn more about media processing and everything else related to WebRTC, check out these services:

And if what you want is to test, monitor, optimize and improve the performance of your WebRTC application, then I’d suggest checking out testRTC.

The post Fixing packet loss in WebRTC appeared first on BlogGeek.me.

WebRTC & HEVC – how can you get these two to work together

Tsahi Levent-Levi — Mon, 17 Jun 2024 10:00:00 +0000

Getting HEVC and WebRTC to work together is tricky and time consuming. Let’s see what the advantages are and if this is worth your time or not.

Does HEVC & WebRTC make a perfect match, or a match at all???

WebRTC is open source, open standard, royalty free, …

HEVC is royalty bearing, made by committee, expensive

And yet… we do see areas where WebRTC and HEVC mix rather well. Here’s what I want to cover this time:

WebRTC and royalty free codecs
How H.264 wiggled its way into WebRTC
HEVC, patents and big 💰
HEVC hardware
Advantages of HEVC in WebRTC
Limitations of HEVC in WebRTC
Waiting for Godot AV1
Where can you fit HEVC and WebRTC?
- The Apple opportunity of WebRTC and HEVC
- Intel (and other) HEVC hardware
Should you invest in HEVC for WebRTC?
Learn more about WebRTC (and everything about it)

WebRTC and royalty free codecs

Digging here in my blog, you can find articles discussing the WebRTC codec wars dating as early as 2012.

Prior to WebRTC, most useful audio and video codecs were royalty bearing. Companies issued patents related to media compression and then got the techniques covered by their patents integrated into codec standards, usually, under the umbrella of a standardization organization.

The logic was simple: companies and research institutes need to make a profit out of their effort, otherwise, there would be no high quality codecs. That was before the internet as we know it…

Once websites such as YouTube appeared, and UGC (User Generated Content) became a thing, this started to shift:

Browser vendors grumbled a bit about this, since browsers were given away freely. Why should they pay for licensing codec implementations?
Content creators and distributors alike didn’t want to pay either – especially since these were consumers (UGC) and not Hollywood in general

The new business models broke in one way or another the notion of royalty bearing codecs. Or at least tried to break. There were solutions of sorts – smartphones had hardware encoders prepaid for, decoder licenses required no payments, etc.

But that didn’t fit something symmetric like WebRTC.

When WebRTC was introduced, the codec wars began – which codecs should be supported in WebRTC?

The early days leaned towards royalty free codecs – VP8 for video and Opus for voice. At some point, we ended up with H.264 as well…

How H.264 wiggled its way into WebRTC

H.264 is royalty bearing, but it still found its way into WebRTC, in large part due to Cisco: they decided to contribute their encoder implementation of H.264 and pay the royalties on it (they likely already paid up to the cap needed anyways). That opened the way for a weird technical solution to be concocted to make room for H.264 and allow it in WebRTC:

WebRTC spec would add H.264 as a mandatory to implement codec for browsers
Browsers would use the Cisco OpenH264 implementation for the encoder, but won’t have it as part of their browser binary
They would download it from Cisco’s CDN after installing the browser

Why? Because lawyers. Or something.

It worked for browsers. But not on mobile, where the solution was to use the hardware encoder on the device, that doesn’t always exist and doesn’t always work as advertised. And it left a gaping headache for native developers that wanted to use H.264. But who cared? Those who wanted to make a decision for WebRTC and move on – got it.

That made certain that at some point in the future, the H.264 royalty bearing crowd would come back asking for more. They’d be asking for HEVC.

HEVC, patents and big 💰

HEVC is a patent minefield, or at least was – I admit I haven’t been following up on this too closely for a few years now.

Here are two slides I have in my architecture course:

There are a gazillion patents related to HEVC (not that many, but 5 figures). They are owned by a lot of companies and get aggregated by multiple patent pools. Some of them are said to be trickling into VP9 and AV1, though for the time being, most of the market and vendors ignore that.

These patents make including HEVC in applications a pain – you need to figure out where to get the implementation of HEVC and who pays for its patents. With regard to WebRTC:

Is this the browser vendors who need to pay?
Maybe the chipset vendors?
Or device manufacturers?
What about the operating system itself?
How about the application vendor?

Oh, and there’s no “easy” cap to reach as there is/were with H.264 when it was included in WebRTC and paid for by Cisco.

HEVC is expensive, with a lot of vendors waiting to be paid for their efforts.

HEVC hardware

Software codecs and royalty payments are tricky. Why? Because it opens up the can of worms above, about who is paying. Hardware codecs are different in nature – the one paying for them is either the hardware acceleration vendor or the device manufacturer.

This means that hardware acceleration of codecs has two huge benefits – not only one:

Less CPU use on the device
Someone already paid the royalties of the codec

This is likely why Apple decided to go all in with HEVC from iPhone 8 and on – it gave them an edge that Android phones couldn’t easily solve:

iPhone is vertically integrated – chipset, device and operating system
Android devices have the chipset vendor, the device manufacturer and Google. Who pays the bill on HEVC?

This gap for Android devices was a nice barrier for many years that kept Apple devices ahead. Apple could “easily” pay the HEVC royalties while Android vendors try to figure out how to get this done.

Today?

We have Intel and Apple hardware supporting HEVC. Other chipset vendors as well. Some Android devices. Not all of them. And many just do decoding but not encoding.

For the most part, the HEVC hardware support on devices is a swiss cheese with more holes than cheese in it. Which is why many focus on HEVC support in Apple devices only today (if at all).

Advantages of HEVC in WebRTC

When it comes to video codecs, there are different generations of codecs. In the context of WebRTC, this is what it looks like:

There are two axes to look at in the illustration above

From left to right, we move from one codec generation to another. Each one has better compression rates but at higher compute requirements
Then there’s bottom to top, moving from royalty bearing to royalty free

If we move from the VP8 and H.264 to the next generation of VP9 and HEVC, we’re improving on the media quality for the same bitrate. The challenge though is the complexity and performance associated with it.

To deal with the increased compute, a common solution is to use hardware acceleration. This doesn’t exist that much for VP9 but is more prevalent in HEVC. That’s especially true since ALL Apple devices have HEVC support in them – at least when using WebRTC in Safari.

The other reason for using HEVC is media processing outside of WebRTC. Streaming and broadcasting services have traditionally been using royalty bearing video codecs. They are slowly moving now from H.264 to HEVC. This shift means that a lot of media sources are going to have available in them either H.264 or HEVC as the video codec – a lot less common will be VP8 or VP9. This being the case, vendors would rather use HEVC than go for VP9 and deal with transcoding – their other alternative is going to stick to using H.264.

So, why use HEVC?

It is better than VP8 and H.264
Existence of hardware acceleration for HEVC that is more common than VP9
Things we want to connect to might have HEVC and not VP9
Differentiation. Some users, customers, investors or others may assume you’re doing something unique and innovative

Limitations of HEVC in WebRTC

HEVC requires royalty payments in a minefield of organizations and companies.

Apple already committed itself fully to HEVC, but Google and the rest of the WebRTC industry haven’t.

Google will be supporting HEVC in Chrome for WebRTC only as a decoder and only if there’s hardware accelerator available – no software implementation. Google’s “official” stance on the matter can be found in the Chrome issues tracker.

So if you are going to support HEVC, this is where you’ll find it:

Most Apple devices (see here)
Chrome (and maybe Edge?) browsers on devices that have hardware acceleration for HEVC, but only for decoding. But not yet – it is work in progress at the moment
Not on Firefox (though Mozilla haven’t gotten yet to adding AV1 to Firefox either)

Waiting for Godot AV1

Then there is AV1. A video codec years in the making. Royalty free. With a new non-profit industry consortium behind it, with all the who’s who:

The specification is ready. The software implementation already exists inside libwebrtc. Hardware acceleration is on its way. And compression results are better than HEVC. What’s not to like here?

This makes the challenge extra hard these days –

Should you invest and adopt HEVC, or start investing and adopting AV1 instead?

HEVC has more hardware support today
AV1 can run anywhere from a royalties standpoint
HEVC isn’t available on many devices and device categories
AV1 is too new and can’t seriously deal with high bitrates and video resolutions
HEVC won’t be adopted by many devices even in the foreseeable future
AV1 is likely to be supported everywhere in the future, but it is almost nowhere in the present

Adopt VP9? Wait for AV1?

Where can you fit HEVC and WebRTC?

Let’s see where there is room today to use HEVC. From here, you can figure out if it is worth the effort for your use case.

The Apple opportunity of WebRTC and HEVC

Why invest now in HEVC? Probably because HEVC is available on Apple devices. Mainly the iPhone. Likely for very specific and narrow use cases.

For a use case that needs to work there, there might be some reasoning behind using HEVC. It would work best there today with the hardware acceleration that Apple pampered us with for HEVC. It will be really hard or even impossible to achieve similar video quality in any other way on an iPhone today.

Doing this brings with it differentiation and uniqueness to your solution.

Deciding if this is worth it is a totally different story.

Intel (and other) HEVC hardware

Intel has worked on adding HEVC hardware acceleration to its chipsets. And while at it, they are pushing towards having HEVC implemented in WebRTC on Chrome itself. The reason behind this is a big unknown, or at least something that isn’t explained that much.

If I had to take a stab at it here, it would be the desire of Intel to work closely with Apple. Not sure why, it isn’t as if Intel chipsets are interesting for Apple anymore – they have been using their own chips for their devices for a few years now.

This might be due to some grandiose strategy, or just because a fiefdom (or a business unit or a team) within Intel needs to find things to do, and HEVC is both interesting and can be said to be important. And it is important, but is it important for WebRTC on Intel chipsets? That’s an open question.

Should you invest in HEVC for WebRTC?

No. Yes. Maybe. It depends.

When I told Philipp Hancke I am going to write about this topic, he said be sure to write that “it is a bit late to invest in HEVC in 2024”.

I think this is more nuanced than this.

It starts with the question how much energy and resources do you have and can you spend them on both HEVC and AV1. If you can’t then you need to choose only one of them or none of them.

Investing in HEVC means figuring out how the end result will differentiate your service enough or give it an advantage with certain types of users that would make your service irresistible (or usable).

For the most part, a lot of the WebRTC applications are going to ignore and skip HEVC support. This means there might be an opportunity to shine here by supporting it. Or it might be wasted effort. Depending how you look at these things.

Learn more about WebRTC (and everything about it)

Which codecs are available, which ones to use, how is that going to affect other parts of your application, how should you architect your solutions, can you keep up with the changes coming to WebRTC?

These and many other questions are being asked on a daily basis around the world by people who deal with WebRTC. I get these questions in many of my own meetings with people.

If you need assistance with answering them, then you may want to check out these services that I offer:

The post WebRTC & HEVC – how can you get these two to work together appeared first on BlogGeek.me.

Reasons for WebRTC to discard media packets

Tsahi Levent-Levi — Mon, 27 May 2024 09:30:00 +0000

From time to time, WebRTC is going to discard media packets. Monitoring such behavior and understanding the reasons is important to optimize media quality.

WebRTC does things in real time. That means that if something takes its sweet time to occur, it will be too late to process it. This boils down to the fact that from time to time, WebRTC will discard media packets, which isn’t a good thing. Why is that going to happen? There are quite a few reasons for it, which is what this article is all about.

A WebRTC Q&A
Discarded media packets in WebRTC
WebRTC = Real-Time. Timing is everything
WebRTC discarding incoming audio packets
WebRTC discarding outgoing audio packets
WebRTC discarding incoming video frames
WebRTC discarding outgoing video frames
Maintaining media quality in WebRTC

A WebRTC Q&A

I just started a new initiative with Philipp Hancke. We’re publishing an answer to a WebRTC related question once a week (give or take), trying to keep it all below the 2 minutes mark.

We are going to cover topics ranging from media processing, through signaling to NAT traversal. Dealing with client side or server side issues. Or anything else that comes to mind.

👉 Want to be the first to know? Subscribe to the YouTube channel

👉 Got a question you need answered? Let us know

Discarded media packets in WebRTC

Media packets and frames can and are discarded by WebRTC in real life calls. There are even getstats metrics that allow you to track these:

The screenshot above was taken from the RTCInboundRtpStreamStats dictionary of getstats. I marked most of the important metrics we’re interested in for discarding media data.

packetsDiscarded – this field indicates the number of packets that the jitter buffer decided to discard and ignore because they arrived too early or too late. It relates to audio packets.

framesXXX fields are dealing with video only and look at full frames which can span multiple packets. They get discarded because of a multitude of reasons which we will be dealing with later in this article. For the time being – just know where to find this.

The diagram below is a screenshot taken in testRTC of a real session of a client. Here you can see a spike of 200 packetsDiscarded less than a minute into the call. We’ve recently added in testRTC insights that hunt for such cases (as well as for video frame drops), alerting about these scenarios so that the user doesn’t have to drill down and search for them too much – they now appear front and center to the user.
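
If you want to hunt for such spikes yourself, a rough sketch could look like the one below. It assumes you already sample getstats periodically and keep (time, packetsDiscarded) pairs; the function name and the threshold of 50 packets per interval are made up for the example and are not what testRTC uses.

```python
from typing import List, Tuple


def discard_spikes(snapshots: List[Tuple[float, int]],
                   threshold: int = 50) -> List[Tuple[float, int]]:
    """Find intervals where packetsDiscarded jumped by `threshold` or more.

    `snapshots` holds (seconds_into_call, cumulative packetsDiscarded)
    pairs, in the order they were sampled. packetsDiscarded is a
    cumulative counter, so we look at the difference between samples.
    """
    spikes = []
    for (_, before), (when, after) in zip(snapshots, snapshots[1:]):
        delta = after - before
        if delta >= threshold:
            spikes.append((when, delta))
    return spikes


samples = [(0, 0), (10, 2), (20, 3), (30, 203), (40, 204)]
print(discard_spikes(samples))
```

The important bit is to work with deltas between samples and not with the raw counter value, which only ever grows.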

WebRTC = Real-Time. Timing is everything

WebRTC stands for Web Real Time Communication. The Real Time part of it is critical. It means that things need to happen in… real time… and if they don’t, then the opportunity has already passed. This leads to the eventuality that at times, media packets will need to be discarded simply because they aren’t useful anymore – the opportunity to use them has already passed.

For all that logic to happen, WebRTC uses a protocol called RTP. This protocol is in charge of sending and receiving real time media packets over the network. For that to occur, each RTP packet has two critical fields in its header:

The illustration above is taken from our course Low level WebRTC protocols. In it, you can see these two fields:

Sequence number
Timestamp

The sequence number is just a running counter which can easily be used to order the packets on the receiving end based on the value of the counter. This takes care of any reordering, duplication and packet losses that can occur over modern networks.

The timestamp is used to understand when the media packet was originally generated. It is used when we need to playback this packet. Multiple packets can have the same timestamp, for example when the frame we want to send gets split across packets – something that occurs frequently with video frames.

These two, sequence number and timestamp, are used to deal with the various characteristics of the network. Usually, we deal with the following problems (I am not going to explain them here): jitter, latency, packet loss and reordering.
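
Here is a small sketch of what "ordering by sequence number" means in practice. RTP sequence numbers are 16 bits wide, so they wrap around from 65535 back to 0, and the ordering logic needs to take that into account. The function names are mine, and the sketch assumes all packets fall within half the sequence space of the first packet received.

```python
from typing import List, Tuple

SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide


def signed_distance(seq: int, base: int) -> int:
    """How far `seq` is ahead of (+) or behind (-) `base`, with wraparound."""
    return ((seq - base + SEQ_MOD // 2) % SEQ_MOD) - SEQ_MOD // 2


def reorder(received: List[int]) -> Tuple[List[int], List[int]]:
    """Return (ordered unique sequence numbers, missing sequence numbers)."""
    if not received:
        return [], []
    base = received[0]
    # Using a set removes duplicates; sorting by distance fixes reordering
    offsets = sorted({signed_distance(seq, base) for seq in received})
    present = set(offsets)
    ordered = [(base + off) % SEQ_MOD for off in offsets]
    # Any gap between the oldest and newest packet is a (current) loss
    missing = [(base + off) % SEQ_MOD
               for off in range(offsets[0], offsets[-1] + 1)
               if off not in present]
    return ordered, missing


ordered, missing = reorder([65534, 0, 65535, 0, 2])
print(ordered, missing)
```

In this example packet 0 arrived before 65535 and was also duplicated, while packet 1 never showed up. A single counter is enough to detect all three conditions.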

All of this goodness, and more is handled in WebRTC by what is called a jitter buffer. Here’s a short explainer of how a jitter buffer works:

WebRTC discarding incoming audio packets

The above video is our first WebRTC Q&A video. We started off with this because it popped up in discuss-webrtc. The question has since been deleted for some reason, but it was a good one.

Latency

The main reason for discarded audio packets is receiving them too late.

When audio packets are received by WebRTC, it pushes them into its jitter buffer. There, these packets get sorted in their sending order by looking at the sequence number of these packets. When to play them out is then dependent on the timestamp indicated in the packet.

Assuming we already played a newer packet to the user, we will be discarding packets that have a lower (and older) sequence number since their time has already passed.
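
The logic above can be sketched as a toy jitter buffer. This is a heavily simplified illustration and not the NetEQ implementation WebRTC uses: the class and method names are hypothetical, sequence number wraparound is ignored, and play_next() stands in for "the playout time of the next packet has arrived".

```python
from typing import Dict, Optional


class TinyJitterBuffer:
    """Toy jitter buffer that discards packets arriving after their turn."""

    def __init__(self) -> None:
        self.pending: Dict[int, str] = {}
        self.last_played: Optional[int] = None
        self.packets_discarded = 0  # mirrors the idea of packetsDiscarded

    def insert(self, seq: int, payload: str) -> bool:
        """Store a packet, or discard it if a newer one was already played."""
        if self.last_played is not None and seq <= self.last_played:
            self.packets_discarded += 1
            return False
        self.pending[seq] = payload
        return True

    def play_next(self) -> Optional[str]:
        """Play out the oldest pending packet, if there is one."""
        if not self.pending:
            return None
        seq = min(self.pending)
        self.last_played = seq
        return self.pending.pop(seq)


jb = TinyJitterBuffer()
jb.insert(1, "a")
jb.insert(3, "c")
jb.play_next()            # plays packet 1
jb.play_next()            # plays packet 3, packet 2 is treated as lost
print(jb.insert(2, "b"))  # packet 2 finally arrives, too late
print(jb.packets_discarded)
```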

Lipsync

Audio and video packets get played out together. This is due to a lip synchronization mechanism that WebRTC has, where it tries to match timestamps of audio and video streams to make sure there’s lip synchronization.

Here, if the video advanced too much, then you may need to drop some audio packets instead of playing them out in sync with the video (simply because you can’t sync the two anymore).

Bugs

Here’s another reason why audio packets might end up being discarded by the receiver – bugs in the sender’s implementation…

When the sender doesn’t use the correct timestamp in the packets, or does other “bad” things with the header fields of the RTP packets, you can get to a point when packets get discarded.

👉 Our focus here was on the timestamp because for some arcane reasons, figuring out the timestamp values and their progression in audio (and video) is never a simple task. Audio and video use different frequency clocks when calculating timestamps, done with values that make little sense to those who aren’t dealing with the innards and logic of audio and video encoders. This may easily lead to miscalculations and bugs in timestamp setting
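
To give a feel for those values, here is a short worked example. Opus in RTP uses a 48kHz timestamp clock and video codecs in WebRTC use a 90kHz clock. The 20 milliseconds audio frame and the 30 frames per second video are assumptions I picked since they are common; the helper names are mine.

```python
AUDIO_CLOCK_HZ = 48_000  # RTP clock rate used for Opus
VIDEO_CLOCK_HZ = 90_000  # RTP clock rate used for video codecs
TIMESTAMP_MOD = 1 << 32  # RTP timestamps are 32 bits wide


def increment_for_duration(clock_rate_hz: int, frame_duration_ms: int) -> int:
    """Timestamp ticks that pass during one frame of the given duration."""
    return clock_rate_hz * frame_duration_ms // 1000


def increment_for_fps(clock_rate_hz: int, frames_per_second: int) -> int:
    """Timestamp ticks between frames at a constant frame rate."""
    return clock_rate_hz // frames_per_second


def next_timestamp(current: int, increment: int) -> int:
    """Advance a timestamp, wrapping around at 32 bits."""
    return (current + increment) % TIMESTAMP_MOD


print(increment_for_duration(AUDIO_CLOCK_HZ, 20))  # ticks per 20ms audio frame
print(increment_for_fps(VIDEO_CLOCK_HZ, 30))       # ticks per video frame at 30fps
```

A sender that increments the audio timestamp by 20 (milliseconds) instead of 960 (ticks) is exactly the kind of bug that ends up with the receiver discarding packets.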

WebRTC discarding outgoing audio packets

This doesn’t really happen. Or at least WebRTC ignores this option altogether.

How do we know that? Besides looking at the code, we can look at the fields that we have in getstats for this. While we have discarded frames for incoming and outgoing video and discarded incoming audio packets, we don’t have anything of this kind for outgoing audio packets.

These packets are too small and “insignificant” to cause any dropping of them on the sender side. That’s at least the logic…

WebRTC discarding incoming video frames

Before we go into the reasons, let’s understand how video packets are handled in the media processing pipeline of WebRTC. This is partial at best, and specifically focused on what I am trying to convey here:

The above diagram shows the process that video packets go through once they are received, along with the metrics that get updated due to this processing:

It starts with the video packets being Received from the network
They then get Reordered as they get inserted into the jitter buffer. Here, the jitter buffer may discard packets. In the case of video packets though, don’t expect packetsDiscarded to be updated properly
For video, we now construct frames, taking multiple packets and concatenating them into frames in Construct a frame. This also gives us the ability to count the framesReceived metric
Once we have frames, WebRTC will go ahead and Decode them. Here, we end up counting framesDecoded and framesDropped
Now that we have decoded frames, we can Play them back and indicate that in framesRendered

👉 The exact places where these metrics might be updated are a wee bit more nuanced. Consider the above just me flailing my hands in the air as an explanation.

This also hints that with video, there are multiple places where things can get dropped and discarded along the pipeline.

The above is another screenshot from testRTC. This time, indicating framesDropped. You can see how throughout the session, quite a few frames got dropped by WebRTC.

Let’s find the potential reasons for such dropped frames.

Latency, lip sync & bugs

Just like incoming audio packets, we can get dropped packets and video frames because of much the same reasons.

Latency and lip synchronization may cause the jitter buffer to discard video packets.

And bugs on the sender side can easily cause WebRTC to drop incoming packets here as well.

That said, with video, we have to look at a slightly bigger picture – that of a frame instead of that of a singular packet.

Not all packets of a frame are available

Assume you have a packet dropped. And that packet is part of a frame that is sent over a series of 7 packets. We had 1 packet drop that caused a frame drop, which in turn, caused another 6 packets to be useless to us since we can’t really decode them without the missing packet (we can to some extent, but we usually don’t these days).
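
Here is the same scenario as a small sketch. The packet representation is hypothetical: each packet is a (frame id, index within the frame, packets in the frame) tuple. Real RTP relies on the timestamp, the marker bit and codec specific payload headers to figure this out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Packet = Tuple[int, int, int]  # (frame_id, index_in_frame, packets_in_frame)


def assemble(packets: List[Packet]) -> Tuple[List[int], int]:
    """Return (ids of complete frames, packets wasted on incomplete frames)."""
    received: Dict[int, Set[int]] = defaultdict(set)
    expected: Dict[int, int] = {}
    for frame_id, index, total in packets:
        received[frame_id].add(index)
        expected[frame_id] = total
    complete = sorted(f for f, idx in received.items()
                      if len(idx) == expected[f])
    # Packets that did arrive, but belong to a frame we can't construct
    wasted = sum(len(idx) for f, idx in received.items()
                 if len(idx) != expected[f])
    return complete, wasted


# Frame 1 spans 7 packets and packet 3 got lost; frame 2 arrived in full
arrived = [(1, i, 7) for i in range(7) if i != 3] + [(2, 0, 2), (2, 1, 2)]
print(assemble(arrived))
```

A single lost packet turns into one dropped frame and 6 packets that were received for nothing.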

Dependency on older frames

With video, unless we’re decoding a keyframe, the frame we need to decode requires a previous frame to be decoded. There are dependencies here since for the most part, we only encode and compress the differences across frames and not the full frame (that would be a keyframe).

What happens then if a frame we need for decoding a fresh frame we just received isn’t available? Here, all packets were received for this new frame, but the frame (and all its packets) will still get dropped. This will be reported in framesDropped.
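
A rough sketch of this dependency chain is below. It assumes the simplest possible structure, where every non-keyframe depends on the frame right before it (no temporal layers or other fancy reference structures). The tuple layout and function name are mine.

```python
from typing import List, Tuple

Frame = Tuple[int, bool, bool]  # (frame_id, is_keyframe, fully_received)


def decode_chain(frames: List[Frame]) -> Tuple[List[int], List[int]]:
    """Return (decoded frame ids, frames received in full but dropped)."""
    decoded: List[int] = []
    dropped: List[int] = []
    chain_ok = False  # can the next delta frame be decoded?
    for frame_id, is_keyframe, fully_received in frames:
        if not fully_received:
            chain_ok = False  # everything that follows lost its reference
        elif is_keyframe or chain_ok:
            decoded.append(frame_id)
            chain_ok = True
        else:
            dropped.append(frame_id)  # all packets arrived, still unusable
    return decoded, dropped


stream = [(1, True, True), (2, False, True), (3, False, False),
          (4, False, True), (5, False, True), (6, True, True),
          (7, False, True)]
print(decode_chain(stream))
```

Frame 3 went missing, so frames 4 and 5 get dropped even though they arrived in full. Things recover only once the keyframe in frame 6 arrives.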

Not enough CPU

We might not have enough CPU available to decode video. Video is CPU intensive, and if WebRTC understands that it won’t have time to decode the frame, it will simply drop it before decoding it.

But, it might also decode the frame, but then due to CPU issues, miss the time for playout, causing framesRendered not to increment.

WebRTC discarding outgoing video frames

With outgoing media, there is a different dictionary we need to look at in getstats – RTCOutboundRtpStreamStats:

Here, the relevant fields are framesSent and framesEncoded. We should strive to have these two equal to each other.

We know that WebRTC decided to discard frames here if framesEncoded is higher than framesSent. If this happens, then it is bad on a few levels:

Encoding video is a resource intensive process. If we took the effort to encode a frame and didn’t send it in the end, then we’ve wasted resources. To me this means something is awfully wrong with the implementation and it isn’t well balanced
Video frames are usually dependent on one another. Dropping a frame may lead to future frames that the receiver will be unable to decode without the frame that was dropped
Such failures are usually due to network or memory problems. These hint towards a deeper problem that is occurring with the device or with the way your application handles the resources available on the device

On the RTCIceCandidatePairStats dictionary, there’s also packetsDiscardedOnSend metric, which hints to when and why would we lose and discard packets and frames on the sender side:

Total number of packets for this candidate pair that have been discarded due to socket errors, i.e. a socket error occurred when handing the packets to the socket. This might happen due to various reasons, including full buffer or no available memory.

If you’re dropping video frames on the sender side (framesEncoded > framesSent), then in all likelihood the network buffer on the device is full, causing a send failure. Here you should check the resources available on the device – especially memory and CPU – or just understand the network traffic you are dealing with.
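
As a sketch, the check for frames that were encoded but never sent can be written like this. It assumes you already pulled the getstats report and have the RTCOutboundRtpStreamStats and RTCIceCandidatePairStats entries as plain dictionaries. The field names are the ones mentioned above; the function and the keys of the result are hypothetical.

```python
from typing import Dict


def sender_discard_report(outbound_rtp: Dict[str, int],
                          candidate_pair: Dict[str, int]) -> Dict[str, object]:
    """Summarize frames that were encoded but never made it to the network."""
    frames_encoded = outbound_rtp.get("framesEncoded", 0)
    frames_sent = outbound_rtp.get("framesSent", 0)
    # Frames we spent CPU on encoding and then didn't send
    frames_not_sent = max(0, frames_encoded - frames_sent)
    socket_discards = candidate_pair.get("packetsDiscardedOnSend", 0)
    return {
        "framesNotSent": frames_not_sent,
        "packetsDiscardedOnSend": socket_discards,
        # Both counters above zero hints at send failures on the socket
        "suspectSendFailure": frames_not_sent > 0 and socket_discards > 0,
    }


report = sender_discard_report(
    {"framesEncoded": 900, "framesSent": 880},
    {"packetsDiscardedOnSend": 35},
)
print(report)
```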

Maintaining media quality in WebRTC

Media quality in WebRTC is a lot more than just dealing with bitrates or deciding what to do about packet losses. There are many aspects affecting media quality and they all do it dynamically throughout the session and in parallel to each other. Some of these I covered when discussing fixing packet loss in WebRTC.

This time, we looked into why WebRTC discards media packets during calls. We’ve seen that there are many reasons for it.

To learn more about media processing and everything else related to WebRTC, check out these services:

The post Reasons for WebRTC to discard media packets appeared first on BlogGeek.me.

WebRTC simulcast – what is it and how is it used

Tsahi Levent-Levi — Mon, 13 May 2024 09:30:00 +0000

What exactly is simulcast, how is it used in WebRTC and why is it a critical component in any SFU media server.

WebRTC simulcast is one of these things that is commonly used by WebRTC applications that have SFU media servers. If your media server doesn’t use simulcast – make sure to ask why and to understand the answer. And if it does, then you should know what it means exactly. Which is why we’re here now.

In this article, I want to explain what WebRTC simulcast is, when and how it is used AND some new advancements coming to simulcast.

A crash course on video quality and bitrate
SFU media servers and group video sessions
Media quality: LCD or BAB
Client side = Simulcast; Server side = Adaptive bitrate
Advantages and weaknesses of using simulcast in WebRTC
- WebRTC simulcast advantages
- WebRTC simulcast weaknesses
Who decides on bitrates in WebRTC simulcast
Keyframes and switching costs in simulcast
Temporal scalability improves WebRTC simulcast
Decisions of highest layer bitrate in WebRTC simulcast
WebRTC and multi-codec simulcast
A word about SVC… and where to learn more

A crash course on video quality and bitrate

Before we begin, we need to understand the concept of bitrate. In a WebRTC video session, the first thing to look at and understand is the bitrate used. Video encoding requires sending a lot of data over the network, and WebRTC tries to match the bitrate it sends to the available bandwidth of the network.

See how I switched between talking about sending data to bitrate to bandwidth? For me, sending data is what we are trying to do. Bitrate is the actual (or target) amount of data we’re aiming for, and bandwidth is what is available for us on the network (assume that bandwidth should always be the same or preferably even higher than the bitrate).

When it comes to audio, we’re mostly working with bitrates that are static and known in advance. They are also low compared to video bitrates, so we just don’t care as much. Which leaves us with video streams.

For video streams:

The higher the bitrate, the higher the quality (most of the time)
The higher the bitrate, the higher the CPU and memory needed to encode and decode the data

This means that what we want to do is use as little bitrate as possible to get the highest possible quality. We’re trying to reach for the stars first by deciding our desired bitrate, and then we start lowering due to the constraints of the real world. Here are a few reasons for this:

Our CPU is over-burdened, so we need to reduce the bitrate we encode or decode
The resolution of the video that ends up being displayed is going to be rather small, so there’s no point in investing too much in bitrate. Same logic can be applied to the camera
We can’t push through the network the bitrate we want, so we need to reduce it to fit the bandwidth available on the network

👉 If you want to learn more about this topic, then read this article on WebRTC video quality

SFU media servers and group video sessions

For video group sessions in WebRTC, we use SFU media servers. Not always, but most of the time. Why? Because SFUs route media – this ends up costing us less compared to MCUs and in many ways makes things more flexible for us on the viewer’s end.

The challenge though is that SFUs harbor a wee bit more complex logic and smarts than the alternatives and they also delegate a lot of the work to the clients themselves. A good SFU is one that has tight integration and optimization methods with the clients using it. And remember here that the implementation of the browser (Chrome) is optimized for Google Meet’s needs.

Simulcast was “invented” for SFUs. Let’s take a quick example to show what we mean here.

We have 4 people on a call. All connected to an SFU. Each participant is sending his video to the SFU, and the SFU routes that video to the other 3 participants in the call:

If everyone has a decent network, then we’re all happy. But what if D has poor network conditions on his downlink? Here are some assumptions for our scenario:

All participants can send 2Mbps of video data towards the SFU
A, B and C can receive up to 20Mbps in total on the downlink
D can receive only 1Mbps in total on the downlink

If we want everyone to be displayed at the same quality on D’s screen, we need to give each one of them ~330Kbps. That’s instead of 2Mbps. So… do we just reduce the sending bitrate of everyone down to 330Kbps to accommodate for user D? Or do we drop him out of the call altogether?

Notice how we can still send 2Mbps from D to the rest of the participants? That’s just the nature and dynamics of the network and capabilities we have in this example.

Here’s where simulcast comes in…

We’re going to engineer the solution so that each participant is going to create 3 separate bitstreams of their video data: 1150kbps, 600kbps and 250kbps, totalling 2Mbps. The exact numbers are less important than the concept itself, so please go with the flow here.

* Being lazy, I’ve denoted simulcast lines as dotted lines, indicating Simulcast instead of using a better notation like 1150/600/250.

Now that we do that, A, B and C get 1150Kbps video from everyone else and D receives the lower 250Kbps bitstreams (it can’t handle 1150kbps or 600kbps even for only one of the users without dropping one of the other video streams it is receiving altogether). Now each one is getting the most he can handle (or at the very least, closer to that than just lowering everyone down).
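
The numbers in this example can be checked with a small sketch. The allocation below is a naive greedy one that I made up for illustration: start everyone at the lowest layer and upgrade streams one step at a time while the downlink budget allows it. Real SFUs use far smarter logic (active speaker, displayed resolution, bandwidth estimation and so on).

```python
from typing import Dict, List

LAYERS_KBPS = [1150, 600, 250]  # the three simulcast bitstreams in the example


def equal_share_kbps(downlink_kbps: int, streams: int) -> int:
    """Bitrate per stream if a downlink is split evenly (no simulcast)."""
    return downlink_kbps // streams


def allocate(downlink_kbps: int, senders: List[str],
             layers: List[int] = LAYERS_KBPS) -> Dict[str, int]:
    """Pick one simulcast layer per sender within a viewer's downlink."""
    low_to_high = sorted(layers)
    total = low_to_high[0] * len(senders)
    if total > downlink_kbps:
        return {}  # even the lowest layers don't fit; streams must be dropped
    level = {name: 0 for name in senders}
    upgraded = True
    while upgraded:
        upgraded = False
        for name in senders:
            nxt = level[name] + 1
            if nxt < len(low_to_high):
                extra = low_to_high[nxt] - low_to_high[level[name]]
                if total + extra <= downlink_kbps:
                    level[name] = nxt
                    total += extra
                    upgraded = True
    return {name: low_to_high[i] for name, i in level.items()}


print(equal_share_kbps(1000, 3))             # what everyone gets without simulcast
print(allocate(1000, ["A", "B", "C"]))       # what D receives
print(allocate(20000, ["B", "C", "D"]))      # what A receives
```

For D, 3 streams of 250kbps take 750kbps out of the 1Mbps available, and upgrading even one of them to 600kbps would require 1100kbps, which is why D stays on the lowest layer for all senders.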

Media quality: LCD or BAB

I am going to use names that don’t necessarily exist. I am making them up here to explain the nature of simulcast a bit better.

What we’ve seen in the example above is how we move from LCD (Least Common Denominator) to BAB (Best Available Bandwidth).

We started with a naive implementation where the same video bitrate is being sent to everyone. So if there’s a hiccup somewhere along the session, everyone is going to be affected. When D had network issues, everyone had to lower their bitrate from 2Mbps down to 330Kbps… that’s quite a hit to media quality across the board for them all.

That’s our LCD – we’re going to need to accommodate the bitrate to the lowest common denominator of the available bandwidth we have across our meeting participants. And that sucks. Bigtime…

But then we went for BAB – we’re going to try and work with the best available bandwidth that each user is capable of receiving.

How did we do that? By asking the participants (nicely) to generate more than a single bitstream. Each bitstream has a different bitrate here, which gives the SFU the flexibility it needs to decide which bitrate to send to which user.

We use simulcast (or SVC, though not in this article) because there’s no equality in digital communications. Participants have different devices, they connect with different networks and they even see and focus on different things during the same meeting. Simulcast enables us to give different participants a different view of the meeting with varying degrees of quality based on the capabilities of each participant at any given moment AND based on each participants’ preference/desire.

How much flexibility and how high media quality we can reach is determined by the tools and optimizations we end up employing in our implementation. No two implementations of SFU with simulcast are exactly alike.

Client side = Simulcast; Server side = Adaptive bitrate

Simulcast as a concept and solution is about a client generating multiple streams so that a media server can use whichever of the streams it needs to send to other participants.

Video streaming had a similar(?) solution known as ABR – Adaptive Bitrate.

Here, the client sends a single media stream to the server and the server is the one that generates any number of streams in different bitrates as it sees fit. This makes sense when there are many viewers (thousands or more) and it can be useful to invest in server resources (these cost money to the vendor providing the service) for the given scenario.

Some use ABR as a term to simply say that the bitrate is variable in nature and adapts to the network. I use it to refer to server side adaptation, where there are multiple video streams generated (in advance or in realtime) and the server simply chooses the best to use per viewer.

For large scale live streaming broadcasts, you can start seeing solutions that incorporate ABR as a technology to transcode the stream to broadcast on the server and generate multiple bitrates with it. This can be, and sometimes is, done in parallel to using simulcast from the client as well.

The way for me to compartmentalize and remember this? Simulcast is multiple bitrates generated by the client. ABR is multiple bitrates generated by the server.

👉 You can learn more about ABR vs simulcast or just about simulcast

Advantages and weaknesses of using simulcast in WebRTC

Simulcast is great, but it isn’t a catchall solution.

What simulcast does as a concept is to offload some of the work from the media server. Offloading here means that for the client it comes at an increase in CPU use and outgoing bandwidth required.

WebRTC simulcast advantages

Here are some great things that simulcast brings with it:

Reduces the costs of media servers drastically
- By not needing to decode and encode media streams, media servers need way less CPU power
- This means that scaling large deployments becomes easier and more feasible for a lot more use cases
Different layouts for each participant
- Since each user ends up receiving multiple video streams (in different bitrates), the application is free to display a different layout for each participant
- Other media servers that mix media would need to invest even more CPU to support something like “encoder per participant” to achieve this
Display participants’ video and other data in the same space
- Again, since each participant video is separate from the others, it is simpler to place additional visual items in the same area
- Mixing all videos into a single stream makes this harder and clunkier

WebRTC simulcast weaknesses

It isn’t all good though. There are weaknesses to the use of WebRTC simulcast:

Higher bandwidth use on uplink of users
- Networks are asymmetric in bandwidth sometimes (think ADSL), and uplinks are usually lower in bandwidth than downlinks
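- Simulcast has a higher uplink requirement than not using simulcast (1.3125x with three layers where each lower layer uses a quarter of the bitrate of the one above it, since 1 + 1/4 + 1/16 = 1.3125, and more when the lower layers are relatively larger), which means that there are scenarios when using simulcast can actually lower quality if not done properly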
Higher CPU use for user devices
- Clients generate 2-3 media streams in different bitrates with simulcast
- So they “invest” more in the encoding when it comes to CPU use
Higher system complexity
- To really make use of simulcast in WebRTC, there should be a lot more synchronization between client and server code
- That means higher complexity of the overall system
Dependency on client code
- With other solutions, especially media mixing ones (see MCU), the clients might not even know they are in a group call
- But when it comes to simulcast and group calling, clients have a huge role to play in making sure calls are of high quality (due to the complexity mentioned above)
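The uplink cost mentioned above is simple arithmetic, and it depends entirely on the bitrates you pick for the layers. Here is a small Python sketch; `uplink_overhead` is a hypothetical helper, and the second call reuses the 250/600/1150kbps layers from the example earlier in this article.

```python
def uplink_overhead(layers_kbps):
    """Total simulcast uplink relative to sending only the highest layer."""
    return sum(layers_kbps) / max(layers_kbps)


# Three layers, each lower layer at a quarter of the bitrate of the one above it.
quarter_steps = uplink_overhead([1, 1 / 4, 1 / 16])

# The 250/600/1150 kbps layers used in the example earlier in this article.
article_example = uplink_overhead([250, 600, 1150])

print(quarter_steps)
print(round(article_example, 2))
```

The quarter-step case gives the 1.3125x figure. Layers that sit closer together, like the ones in the earlier example, push the uplink cost noticeably higher.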

Who decides on bitrates in WebRTC simulcast

There are usually two to three layers/streams when it comes to WebRTC simulcast. Each with a different bitrate, and from there, also with different resolutions, frame rates and quality. I am focusing on bitrate because for me, that’s the leading factor – everything else gets derived from it.

Which bitrates are we going to support and which ones get sent to whom are the most important questions for any SFU implementation that uses simulcast.

WebRTC by itself can’t make such decisions. It has its own default bitrates for simulcast, but that is all they are – defaults. I wouldn’t recommend that developers use these without understanding their implications (they’re likely not useful for the use case you have at hand).

The decision of which bitrates to support in simulcast to begin with should take into consideration the possible display layouts of the videos on the viewers’ end. By knowing at what resolutions the videos get displayed we can try to better estimate the desired bitrates to use while using simulcast. Factor into it things like number of videos in the layout (so that you take total bitrates and available bandwidth into consideration), importance of videos on the display (lower priority streams can manage with lower frame rates and resolutions), etc.

Here’s the thing though:

The client is the one generating and encoding simulcast media streams. It knows best its own CPU and performance capabilities
The SFU media server knows best the estimated bandwidth in front of all viewers. It also knows what media streams and at what bitrates it has at its disposal when the time comes to send media to viewers
The viewer is the one that knows best how the video gets laid out on the display, along with its own CPU and performance capabilities
Oh, and the viewer may change the layout on the display throughout the call, changing what’s best to send to it

The end result is that the application in charge of it all needs to orchestrate the clients and the media servers in order to optimize the session for higher media quality, taking into consideration all the information. It also means that your application needs to somehow share this out-of-band information with the application session logic so decisions can be made. And this part is proprietary – it isn’t something that we have written as a standard or even a best practice.
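Since there is no standard here, any example is just one heuristic out of many. The Python sketch below shows one possible approach: start every video in the layout at the lowest layer, then spend the viewer's remaining bandwidth on upgrades in priority order. The function name, the priority values and the 2000kbps budget are all hypothetical.

```python
def allocate_layers(priorities, budget_kbps, layers_kbps=(250, 600, 1150)):
    """Greedy allocation: everyone starts at the lowest layer, upgrades go by priority.

    priorities maps a stream name to its importance in the layout (higher wins).
    If even the lowest layers exceed the budget, everyone stays at the lowest
    layer; a real SFU would have to drop or pause some videos instead.
    """
    chosen = {stream: 0 for stream in priorities}  # index into layers_kbps
    total_kbps = layers_kbps[0] * len(priorities)
    for stream in sorted(priorities, key=priorities.get, reverse=True):
        while chosen[stream] + 1 < len(layers_kbps):
            step = layers_kbps[chosen[stream] + 1] - layers_kbps[chosen[stream]]
            if total_kbps + step > budget_kbps:
                break
            chosen[stream] += 1
            total_kbps += step
    return {stream: layers_kbps[index] for stream, index in chosen.items()}


# Hypothetical layout reported by a viewer: one large tile and two thumbnails.
layout = {"active_speaker": 2, "thumbnail_1": 1, "thumbnail_2": 1}
allocation = allocate_layers(layout, budget_kbps=2000)
print(allocation)
```

The inputs are the interesting part: the priorities come from the viewer's layout, the budget from the SFU's bandwidth estimate and the available layers from the sender. That is exactly the out-of-band information the application has to move around.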

Keyframes and switching costs in simulcast

With all this goodness, there’s an Achilles heel. One that stems from the way Google implemented simulcast in Chrome, but also from the realities of such a solution.

Here’s the thing: Whenever a viewer switches from one simulcast layer to another, there’s a change in the video stream that gets decoded. That change can only occur with a fresh keyframe on the layer that is being switched to, so that the video decoder will be able to decode the stream properly.

When there’s a need to generate a keyframe in simulcast, Chrome will automatically generate a keyframe across all simulcast layers. This isn’t a good thing, but it is what it is.

This also means that SFU media servers need to be conscious about this and not have viewers switch between the different layers all the time, limiting switches to the minimum necessary to maintain high video quality.
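One common way to limit switches is hysteresis. The Python sketch below is a hypothetical illustration: downgrades are applied immediately (the viewer can no longer handle the current layer), while upgrades are held back until the current layer has been in use for a minimum time. The class name and the 5 second hold time are assumptions, not values taken from any real SFU.

```python
class LayerSwitcher:
    """Tracks the layer forwarded to one viewer and rate-limits upgrades."""

    def __init__(self, min_hold_seconds=5.0):
        self.min_hold_seconds = min_hold_seconds
        self.current_kbps = None
        self.last_switch_time = None

    def update(self, target_kbps, now):
        """Return True when the viewer should be switched (and a keyframe requested)."""
        if target_kbps == self.current_kbps:
            return False
        first_selection = self.current_kbps is None
        is_downgrade = not first_selection and target_kbps < self.current_kbps
        held_long_enough = (
            not first_selection
            and now - self.last_switch_time >= self.min_hold_seconds
        )
        if first_selection or is_downgrade or held_long_enough:
            self.current_kbps = target_kbps
            self.last_switch_time = now
            return True
        return False


switcher = LayerSwitcher(min_hold_seconds=5.0)
decisions = [
    switcher.update(1150, now=0.0),  # first selection
    switcher.update(600, now=1.0),   # downgrade, applied immediately
    switcher.update(1150, now=2.0),  # upgrade too soon after the last switch, ignored
    switcher.update(1150, now=6.0),  # upgrade after the hold time, applied
]
print(decisions)
```

Every `True` here stands for a keyframe request towards the sender, which in Chrome means a keyframe on all simulcast layers, so fewer switches translate directly into fewer bitrate spikes for everyone.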

Temporal scalability improves WebRTC simulcast

Using temporal scalability alongside simulcast in WebRTC gives us another level of flexibility.

In temporal scalability, the frames of a video stream are encoded in such a way that their dependency chain enables us to decode some of the frames and not others – something that is usually impossible in video compression. Chrome’s WebRTC implementation supports temporal scalability in VP8 with 2 such “layers”, so if you’re sending 30 frames per second, the SFU media server can decide to send either 30 or 15 FPS to participants (the 15 frames per second stream is roughly 60% of the bitrate of the 30 frames per second one).

Think of it like multiplying your simulcast streams without an additional cost:

And yes, like everything else, this depends on the codec you use, the browser and the fact that some layers might not have enough frames per second to begin with (for example, the lower layer might only produce 10 or 15 frames per second and then temporal scalability might be useless).
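A quick way to see the multiplication is to list the operating points an SFU ends up with. This Python sketch reuses the 250/600/1150kbps layers from the earlier example and the rough 60% figure quoted above. The 30 frames per second on every layer is an assumption, and `operating_points` is a hypothetical helper.

```python
SIMULCAST_LAYERS_KBPS = [250, 600, 1150]  # layers from the earlier example
HALF_RATE_BITRATE_FACTOR = 0.6  # 15 fps costs roughly 60% of the 30 fps bitrate
FULL_FPS = 30  # assumed frame rate for every simulcast layer


def operating_points(layers_kbps, full_fps=FULL_FPS, factor=HALF_RATE_BITRATE_FACTOR):
    """List (kbps, fps) choices: each simulcast layer at full and at half frame rate."""
    points = []
    for kbps in layers_kbps:
        points.append((round(kbps * factor), full_fps // 2))
        points.append((kbps, full_fps))
    return sorted(points)


points = operating_points(SIMULCAST_LAYERS_KBPS)
for kbps, fps in points:
    print(f"{kbps} kbps @ {fps} fps")
```

Three encoded layers turn into six bitrates the SFU can pick from, and moving between the two frame rates of the same layer only means dropping frames, with no keyframe involved.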

When using simulcast, the level and variety of tools you use will enable you to increase the media quality you offer your users.

Decisions of highest layer bitrate in WebRTC simulcast

Simulcast in WebRTC gives us another level of flexibility. One that Daily explains nicely in their post where they title their solution as adaptive bitrate.

Let’s assume we’re going for the classic 3 media streams in our WebRTC simulcast solution:

Remember our example from before? Our smallest bitrate (250kbps) and medium sized bitrate (600kbps) are “static” in nature. The video encoder in our browser is going to generate these in such a way each and every time (assuming the CPU allows and bandwidth estimation is higher than the summation of these two).

That highest bitrate there isn’t really static. At least not by default. It will use as much bitrate as it needs, taking into consideration the CPU consumption and bandwidth estimation. Left to its own devices, this highest bitrate layer is going to be greedy in its resource consumption. It can also drop below the medium sized bitrate if there’s not enough CPU or bandwidth available, which defeats the point of this being the highest layer. This all leads us to what we need to do…

Like everything else that WebRTC does in the browser though, it needs to be managed and taken into account by the SFU media server. In this case, deciding what that highest layer bitrate should be at any given point in time.

Here are some questions to ask yourself when making that decision in your SFU:

Do you want the highest layer to have a static bitrate? (hint: no)
The participants who need to get this user’s video at the highest quality – what’s the highest bitrate / resolution that they can cope with based on their device and network conditions?
- Do you need to limit the bitrate of this layer to accommodate for more of these participants?
- Are you willing to move some of these users to the mid bitrate in order to increase the quality for the other participants who have better conditions?
Are you recording this stream?
- If you are, do you need it at the highest possible quality?
- Does it mean you can “sacrifice” some of your participants’ viewing quality to get a better recording out of this session?
- Or is the recording fine with lower bitrates or quality?
I’ll finish off with a question about all the layers – which ones are actually used?
- If some of the layers aren’t being sent to any of the users in the meeting, you can decide to suspend them in the first place, practically “changing” the simulcast configuration dynamically for that specific participant. It will come at a cost when you’ll need to switch from one layer to another if the other layer is non-existent
- And if we decided not to send a specific bitrate, does it mean the other bitrates can change as well to accommodate for the extra headroom we now have of bitrate and CPU available?

These questions don’t have a single simple answer. The answer to these will vary based on the strategy you wish to employ, the use case you have, the video layouts you support, the level of your engineers, the media server you start with, …

At the end of the day, your answers are just a set of heuristics, and being able to compare one to another is going to be a challenging task. Make sure you get this right (or right enough) for your needs.
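As an illustration of one such heuristic, here is a Python sketch for the last question above: figuring out which layers nobody is receiving so the sender can be asked to suspend them. The rid names, the viewer map and the `always_keep` option are hypothetical. In a browser, suspending a layer would typically mean setting `active` to false on the matching encoding through `RTCRtpSender.setParameters()`, with the request itself carried over your own signaling.

```python
def layers_to_keep_active(published_layers, layers_in_use, always_keep=()):
    """Map each published layer to True (keep encoding) or False (can be suspended)."""
    needed = set(layers_in_use) | set(always_keep)
    return {rid: rid in needed for rid in published_layers}


published = ["low", "mid", "high"]  # hypothetical rid names for one sender
forwarded = {"viewer1": "low", "viewer2": "low", "viewer3": "high"}  # layer per viewer

# Keeping the lowest layer alive at all times is an assumption made here so that
# a struggling viewer always has something to fall back to without waiting.
active = layers_to_keep_active(published, forwarded.values(), always_keep=["low"])
print(active)
```

The price is the one mentioned above: when a viewer later needs the suspended layer, the sender has to resume it first, so the switch takes longer.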

WebRTC and multi-codec simulcast

This is something that we’re just starting to see now.

Up until recently, as a developer, you chose a codec, used simulcast on it and that’s about it. The available alternatives were mostly VP8 and H.264. These days? With the introduction of the AV1 video codec a new idea started cropping up:

AV1 is a better codec when it comes to media quality per bitrate compared to the other codecs available
But AV1 also takes up more CPU and there’s almost no hardware acceleration available in the market
At very low bitrates, using AV1 is possible, since it won’t take up much CPU for that
But using it at higher bitrates isn’t possible in most scenarios

So the above diagram was thought out in a way. Instead of using the same video codec in a simulcast session for WebRTC, why not use multiple codecs? Have AV1 used on the lowest bitrate and then another codec, say VP8 or VP9 on the higher bitrates?

This way, the machine’s CPU is capable of encoding the data, and the resulting media quality of the lowest bitrate in there is now higher than it would have been if we used a single codec for simulcast.

At the time of writing, this hasn’t been implemented in a workable fashion just yet.

In a way, this is our future for the coming years, until AV1 becomes popular enough and its use is made possible by commonplace hardware acceleration or better CPUs on the devices.

A word about SVC… and where to learn more

There are alternatives to using WebRTC simulcast:

Deciding NOT to use simulcast but still using an SFU, moving towards an LCD (least common denominator) approach to media quality
Not using SFU or media routing, going for mesh or mixing solutions
Replacing simulcast with SVC

SVC stands for Scalable Video Coding. At its heart, it is quite similar to simulcast, just done on the codec level. The video encoder itself generates a bitstream that can be peeled like an onion into multiple bitrates. This gives a solution that is less wasteful than simulcast in bitrate and CPU. The downside here is an increase in complexity and a lack of available hardware encoders and decoders that know how to handle SVC.

There are video meeting solutions out there that use SVC. They can usually also use WebRTC simulcast – simply because SVC gets added later as an additional tool for further optimization and flexibility.

To learn more about simulcast, SVC and everything related to WebRTC, check out these services:

The post WebRTC simulcast – what is it and how is it used appeared first on BlogGeek.me.

RTC@Scale 2024 – an event summary

Tsahi Levent-Levi — Mon, 08 Apr 2024 09:30:00 +0000

RTC@Scale is Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary for RTC@Scale 2024 so you can pick and choose the relevant ones for you.

WebRTC Insights is a subscription service I have been running with Philipp Hancke for the past three years. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.

We got into a kind of a flow:

Once every two weeks we finalize and publish a newsletter issue
Once a month we record a video summarizing libwebrtc release notes (older ones can be found on this YouTube playlist)

Oh – and we’re covering important events somewhat separately. Last month, a week after Meta’s RTC@Scale event took place, Philipp sat down and wrote a lengthy summary of the key takeaways from all the sessions, which we distributed to our WebRTC Insights subscribers.

As a community service (and a kind of a promotion for WebRTC Insights), we are now opening it up to everyone in this article 😎

Why this issue?
- Our top picks
- General thoughts (TL;DR)
SESSION 1 – RTC@Scale
SESSION 2 – RTC@Scale
SESSION 3 – RTC@Scale
Closing remarks

Why this issue?

Meta ran their RTC@Scale event for the third time. Here’s what we published last year and in 2022. This year was “slightly” different for us:

Philipp was in-between jobs. March 25th was his first day at Meta and this was the reason he got a notebook
Tsahi was a speaker at RTC@Scale

While you can say we’re both biased on this one, we will still be offering an event summary here for you. And we will be doing it as objectively as we can.

Our focus for this summary is what we learned or what it means for folks developing with WebRTC. Once again, the majority of speakers were from Meta. At times they crossed the line of “is this generally useful” to the realm of “Meta specific” but most of the talks provide value.

Writing up these notes takes a considerable amount of time, but is worth it (we know – we’ve done this before). You can find the list of speakers and topics on the conference website, the playlist of the videos can be found here (there’s also a 6+ hours long session there that includes all the Q&As). You can also just scroll down below for our summary.

Our top picks

Our top picks:

“Improving International Calls” since it is quite applicable to WebRTC
“Improving Video Quality for RTC” since you can learn quite a bit about AV1
“Enhanced RTC Network Resiliency with Long-Term-Reference and Reed Solomon code” since you can learn about FEC for video (LTR is not in libWebRTC currently)
“Machine Learning based Bandwidth Estimation and Congestion Control for RTC” since BWE is crucial to quality.

We find these most applicable to how you deal with WebRTC in general, even outside of Meta.

General thoughts (TL;DR)

Meta is taking the route of most large vendors who do millions of minutes a day
It is gutting out WebRTC in the places that are most meaningful to it, replacing them with their own proprietary technology
- Experiences in native applications are being prioritized over browser ones, and the browser implementation of WebRTC is kept as a fallback and interoperability mechanism
- Smaller vendors will not be able to play this game across all fronts and will need to settle for the vanilla quality and experience given by WebRTC
- Sadly, this may lead to WebRTC’s demise within a few years’ time
Meta can take this approach because the majority of their calls take place in mobile native applications, so they are less reliant and dependent on the browser
- Other large vendors are taking a similar route
- Even Google did that with Duo and likely is doing similar server-side things with Meet

SESSION 1 – RTC@Scale

Li-Tal Mashiach, Meta / Host Welcome

(4 minutes)

Watch if you: need a second opinion on what sessions to watch

Key insights:

Pandemic is over and still Meta is seeing growth. That said, no numbers were shared around usage

Nitin Khandelwal, Meta / Keynote: From Codec to Connection

(13 minutes)

Watch if you: are a product person

Key insights:

Great user stories with a very personal motivation
Meta is all about “Connection” and “Presence” and RTC is the technical vehicle for creating “Presence when People are apart”
Large group calling is first mentioned for collaboration and only then for social interactions but we wonder why “joining ongoing group calls at any time” is being specifically mentioned as a feature
Codec avatars and the Metaverse are mentioned here, but aren’t discussed in any of the talks, which would have been nice to have as well
Interoperability and standards are called out as an absolute requirement which ties in with the recent WhatsApp announcement

Sriram Srinivasa + Hoang Do, Meta / Revamping Audio Quality for RTC Part 1: Beryl Echo Cancellation

(20 minutes)

Watch if you: are an engineer working on audio and enjoyed last year’s session

Key Insights:

Meta implemented a new proprietary AEC called Beryl to replace the one that WebRTC uses by default. This session explains the motivation, technical details and performance results of Beryl
The audio pipeline diagram at 1:10 remains great and gives context for this year’s enhancements which are in AEC and a low-bitrate audio codec:

At 2:50 we get a good summary of what “AI” can do in this area. Unsurprisingly this depends a lot on how much computational effort can be spent on the device
Meta’s Beryl is for more general usage and aims to be a replacement for WebRTC’s AEC3 (on desktop) and AECM (on mobile). At 4:00 we get a proper definition of acoustic echo as a block diagram. Hardware AEC is noted as not effective on a large number of devices and does not support advanced features like stereo/spatial audio anyway
At 06:00 the Beryl part gets kicked off with a hat-tip to the WebRTC echo cancellation and at 7:50 another block diagram. One of the key features is that Beryl is one AEC working in two modes, with a “lite” mode for low-powered devices. The increase in quality compared to WebRTC comes at the expense of 7-10% more CPU being used:

At 09:00 we get an intro to the different subcomponents of AEC, delay estimation, linear echo cancellation (AEC) and “leftover” echo suppression (AES)
At 13:30 come the learnings from implementing the algorithms, a demo at 16:30 and an apples-to-apples comparison with libWebRTC’s AEC (which should be relatively fair since the rest of the pipeline is the same) showing a 30% increase in quality for a number of scenarios
This is a nice alternative summary if you still need convincing to watch the video

Jatin Kumar + Bikash Agarwalla, Meta / Revamping Audio Quality for RTC Part 2: MLow Audio Codec

(17 minutes)

Blog post: we hope there will be one!

Watch if you: are an engineer working on audio

Key Insights:

Meta implemented a new proprietary audio codec called MLow to improve upon and replace Opus within its applications
We start (if you skip the somewhat repeated intro) at 2:30 with the already familiar audio pipeline block diagram and a motivation for a new codec including the competitive landscape. Meta aims to provide good quality even on low-end devices
At 4:30 we get a good overview of the requirements. Fast integration by reusing the Opus API is an interesting one. ML/AI would be nice to use but would increase complexity in ways which lead to worse overall quality:

At 5:50 we get an overview of how the new codec works at a high level followed by the approach taken to develop the codec at 8:15 which is interesting because you don’t hear about the compromise between “move fast when trying things” and “be extremely performant” very often
At 9:30 we get some insight into how the evaluation was done using diverse and representative input and the actual crowdsourced listening tests (which are a lot of effort and are therefore expensive) at 11:30. Tools like VISQOL and POLQA are used for regression testing. 1.5 years of development time sounds quite fast!
At 13:00 we get a demo. We wonder which Opus version was used for the comparison, given the recent 1.5 release which promises improvements in the same low-bandwidth area
MLow can offer comparable quality to 25kbps Opus at 18kbps but you might not care if you have more than 16kbps available since both codecs show very similar POLQA scores at that bitrate:

At 15:40 we get production results which show improvements (which are not quantified in this talk). Improvements in video quality are a bit surprising; we would not spend more bits on video in low-bitrate scenarios

Yi Zhang + Saish Gersappa, Meta / Improving International Calls

(19 minutes)

Watch if you: are looking for architecture insights also applicable to WebRTC

Key Insights:

Meta details how they are moving to a more decentralized architecture globally to make their calling experience more robust
20% of WhatsApp calls are international, half a billion a day, and “bad quality” is 20% more likely on those calls due to the more complex technical challenges, which are clearly spelled out on the slide at 2:00 with a good explanation of how network issues are visible to the user
At 3:10 we get a very good introduction to the basics of how VoIP works. What WhatsApp calls a relay is slightly different from a TURN server since their “relay” is also used for multiparty calls. Being more than a TURN server allows the relay to do a bit more, in particular since it can decrypt and handle RTCP feedback
At 4:20 we get a good discussion of what is sometimes known as the “USPS problem” – it is very rude to make the sender retransmit a packet that *you* lost (from a 2016 Twitter conversation)
- A packet/NACK cache is an essential component of SFUs and we consider this the norm, not forwarding the NACK.
- In cases of downstream packet loss it reduces the error correction time by half and makes the retransmission more effective
- Notably this is for audio where Meta is known to leverage libWebRTC audio NACK support in Messenger that is not enabled by default there (Google Meet enables it as well)
At 5:40 the relay is shown to be “smart” about upstream loss as well since it can detect the loss (i.e. a gap in the RTP sequence number) and proactively send a NACK, saving one RTT. This is followed by a summary on other things the relay can do such as duplicating packets (which is an alternative to RED for audio)
At 6:30 we get an idea how these basics apply to international calls which generally have a longer RTT (which makes the NACK handling more important)
At 8:00 we get into the new architecture called “cross relay routing” which is essentially a distributed or cascaded SFU (see e.g. the Jitsi approach from 2018 or the Vidyo talk from 2017)
- This keeps the RTT to the NACK handling low (for downstream packet loss to the level of local calls) which improves quality and also utilizes Meta’s networking backbone which has lower packet loss than the general internet
- They also have higher bandwidth so one can do more redundancy and duplication
- At WhatsApp scale this creates the problem of picking the right relays which is done by looking at latencies. This is a tricky problem: it took Jitsi from 2018 until 2022 to get the desired results

At 11:00 (or 13:00) this gets expanded to group calls by using an architecture that starts with the centralized relay and extends it to a central router that only forwards the media packets combined with RTCP-terminating edge relays
- Some decisions like bandwidth estimation are delegated to the local relay while some decisions, in particular related to selective forwarding (e.g. active speaker determination which influences bandwidth allocation, see last year’s talk) are run on the central relay which has a complete view of the call
- Simulcast and in particular temporal layer dropping is surprising to see only in the central relay; it should be done in the edge relays as well to adapt for short-term bandwidth restrictions
- Our opinion is that, over time, Meta will move most of these decisions from the central relay to the local relays, distributing the logic further and closer to the edge
At 16:40 we get a glimpse into the results. Unsurprisingly things work better with faster feedback! Putting servers closer to the users is an old wisdom but one of the most effective ways to improve the quality. The lesson of using dedicated networks applies not only to Meta’s backbone but also the one used by the big cloud providers. This quality increase is paid for by increased network cost however
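The two relay behaviors described in this talk, answering a receiver's NACK from a local cache and proactively NACKing the sender when a sequence number gap shows up, can be sketched in a few lines of Python. This is a toy model written for this summary, not Meta's implementation: the class and method names are made up, and it ignores 16-bit RTP sequence number wraparound, timers and RTCP parsing.

```python
from collections import OrderedDict


class RelayPacketCache:
    """Toy model of a relay that caches recent packets for a single stream."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.packets = OrderedDict()  # sequence number -> payload, oldest first
        self.highest_seq = None

    def on_packet_from_sender(self, seq, payload):
        """Cache a packet; return the sequence numbers to NACK upstream (gap detected)."""
        self.packets[seq] = payload
        if len(self.packets) > self.capacity:
            self.packets.popitem(last=False)  # evict the oldest packet
        missing = []
        if self.highest_seq is not None and seq > self.highest_seq + 1:
            missing = list(range(self.highest_seq + 1, seq))
        if self.highest_seq is None or seq > self.highest_seq:
            self.highest_seq = seq
        return missing

    def on_nack_from_receiver(self, seqs):
        """Serve a receiver NACK locally; return (packets to resend, seqs not cached)."""
        resend = [(seq, self.packets[seq]) for seq in seqs if seq in self.packets]
        not_cached = [seq for seq in seqs if seq not in self.packets]
        return resend, not_cached


relay = RelayPacketCache()
relay.on_packet_from_sender(1, b"a")
relay.on_packet_from_sender(2, b"b")
upstream_nack = relay.on_packet_from_sender(5, b"e")  # packets 3 and 4 never arrived
resend, not_cached = relay.on_nack_from_receiver([2, 3])
print(upstream_nack, resend, not_cached)
```

Packet 2 is retransmitted straight from the relay, so the receiver waits one short hop instead of a full round trip to the sender. With edge relays close to each user, that hop stays short even on international calls.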

First Q&A with Speakers

https://www.youtube.com/live/dv-iEozS9H4?feature=shared&t=5821 (25 minutes)

Watch if: you found any of the sessions this covers interesting

Key Insights:

Quite a few great questions
One thing that stood out was the question whether NACK for audio helps vs FEC and the answer is “yes”, because retransmissions provide the full quality when the RTT is low. What to use in different situations depends on the conditions, a sentiment that keeps coming up

SESSION 2 – RTC@Scale

Shyam Sadhwani, Meta / Improving Video Quality for RTC

(22 minutes)

Blog post: https://engineering.fb.com/2024/03/20/video-engineering/mobile-rtc-video-av1-hd/

Watch if: you are thinking of adopting AV1 or trying to improve video quality

Key Insights:

Meta’s overview of the work and effort put into improving their video quality, and the route they took, especially with AV1 – the tradeoffs made when adopting it
“Why is the video quality of RTC not as great as Netflix” is a good question to ask, followed by a history of video encoding since DVDs came out in 1997. The answer is somewhat obvious from the constraints RTC operates under (shown at 2:00)
At 3:20 we start with a histogram of the bandwidth estimation distribution seen by Meta. “Poor calls”, which are below 300kbps (for audio and video, including RTP overhead) have about 200kbps for the video target bitrate. Choosing a more efficient video codec like AV1 is one of the most effective knobs here (and we knew Meta was taking this route after last year’s talk). The bandwidth distribution Meta sees is shown below:

While AV1 is largely not there yet in hardware encoders, the slides at 06:00 explain why one actually wants software encoders; they provide better quality at the target bitrates used by RTC which is something we have seen in Chromium’s decision to use software encoding at lower resolutions a while back
At 7:00 we get a demo comparison which of course is affected by re-encoding the demo with another codec but the quality improvement of AV1 is noticeable, in particular for the background. AV1 gives 30% lower bitrate compared to H.264, even more for screen sharing due to screen content coding tools
Quite notably the 600KB binary size increase caused by AV1 is a concern. WebRTC in Chrome was somewhat lucky in that regard since Chrome already had to include AV1 support for web video decoding
Multiple codecs get negotiated through SDP and then the switch between them happens on the fly. From the blog post that is not happening through the more recent APIs available to web browsers though
Originally a video quality score based on encoding bitrate, frame rate and quantization parameter was used (10:30) but the latter is not comparable between AV1 and H.264 so the team came up with a way to generate a peak signal to noise ratio like metric that was used for comparison. This allowed a controlled rollout with measurable improvements
High end networks (with an available bitrate above 800kbps) also benefit from AV1 as we can see starting at 12:30. At least on mobile devices 1080p resolution does not provide perceived advantages over 720p
“Isn’t it just a config change to raise max bitrate” is an excellent question asked at 13:45 and the answer is obviously “no” as this caused issues ranging from robotic voice to congestion. Particularly annoying is constant switching between high-quality and low-quality video, which is perceived negatively (take this into account when switching spatial layers in SFUs). At high bitrates (2.5Mbps and up) it makes a lot of sense to do 2-3x audio duplication (or redundancy) since audio quality matters more
Mobile applications have the advantage of being able to take the battery level into account and conditionally enable AV1, information which is, for privacy reasons, not available in the browser
The talk gets wrapped up with a recap of the benefits of AV1 both in low-end (at 18:00) and high-end (at 19:10)
And we even got a blog post!

Thomas Davies, Visionular / AV1 at the coalface: challenges for delivering a next-generation codec for RTC

(19 minutes)

Watch if: you are interested in a deep dive on AV1 and video encoding in general

Key Insights:

Visionular on what goes into the implementation of an AV1 video encoder
The talk starts off with a very good explanation of the what, why and how of rolling out an additional codec to your system. For WebRTC in the browser you don’t control much beyond the bitrate and resolution but one can still ask many of the questions and use this as a framework:

At 4:30 we go into the part that describes encoder performance (where you can really optimize). The big constraint in RTC is that the encoder needs to produce a frame every 33 milliseconds (for 30fps)
Knowing the type of the content helps the encoder pick the right encoding tools (which is why we have the contentHint in WebRTC turning on screen content coding with good results)
Rate control (10:00) is particularly important for RTC use-cases. Maximum smoothness is an interesting goal to optimize for, in particular since any variance in frame size is going to be magnified by the SFU and will affect its outgoing network traffic
Adaptivity (12:50) for AV1 comes in two forms: SVC for layering and changing resolution without a keyframe
The “sales pitch” for Visionular’s encoder comes quite late at 14:15, is done in less than 90 seconds and is a good pitch. The last part (15:30) is an outlook on where RTC video encoding might go in the future
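
The frame deadline and the “smoothness” goal above can be put into numbers with a small sketch. The function names and the peak-to-mean metric are our own illustration, not something shown in the talk.

```python
from statistics import mean


def frame_budget_ms(fps: float) -> float:
    """Time available to encode one frame at the given frame rate."""
    return 1000.0 / fps


def late_frames(encode_times_ms, fps):
    """Indexes of frames whose encode time exceeded the per-frame budget."""
    budget = frame_budget_ms(fps)
    return [i for i, t in enumerate(encode_times_ms) if t > budget]


def peak_to_mean(frame_sizes_bytes):
    """Crude smoothness indicator: 1.0 means all frames have the same size."""
    return max(frame_sizes_bytes) / mean(frame_sizes_bytes)


print(round(frame_budget_ms(30), 1))        # budget per frame at 30fps
print(late_frames([10, 40, 33, 35], 30))    # frames that missed the deadline
print(peak_to_mean([1000, 1000, 4000]))     # one large frame dominates
```

A high peak-to-mean ratio is the kind of burst that an SFU then multiplies across all of its outgoing streams.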

Gang Shen, Intel / Delivering Immersive 360-degree video over 5G networks

(16 minutes)

Watch if: you are working in the 360-degree video domain

Key points:

Intel, reviewing the challenges of 360-degree immersive video
We’re not quite sure what to do with this one. The use-case of 360 degree video is hugely demanding and solving it means pushing the boundaries in a number of areas
Until around 06:00, the discussion revolves around the unsuitability of HTTPS, and only from here, the discussion starts looking at UDP and WebRTC (an obvious choice for viewers of RTC@Scale)
Latency being a challenge, Intel went with 5G networks

It was hard to understand what Intel wanted to share here exactly
- What is the problem being solved here?
- Is 5G relevant and important here, or just the transport used, focusing on the latest and greatest cellular?
- What challenges does 360-degree video pose that are unique (besides being 8K resolution)?
Demo starts at 09:10, results at 11:00, a summary at 12:30 and an outlook at 14:30
All in all, this session feels a bit like a missed opportunity

Fengdeng Lyu + Fan Zhou, Meta / Enhanced RTC Network Resiliency with Long-Term-Reference and Reed Solomon code

(19 minutes)

Watch if:

you are using H.264 and are interested in features like LTR
you are interested in video forward error correction

Key points:

Secret sauce is promised!
The talk starts by describing the “open source baseline”, RTX, keyframes and XOR-based FEC
- We would describe keyframes as a last resort that you really want to avoid and add temporal scalability (which allows dropping higher temporal layers) to the list of tools here
- Using half the overall traffic for FEC sounds like too much, see this KrankyGeek talk which discusses the FEC-vs-target bitrate split
- In the end this needs to be tuned heavily and we don’t know the details

At 4:20 we get a deep-dive on LTR, long-term reference frames, which is a fairly old H.264 feature
- The encoder and decoder keep those frames around for longer and can then use them as baseline from which a subsequent frame is encoded/decoded instead of a previous frame which was lost (and then no longer needs to be recovered)
- The implicit assumption here is 1:1, for multiparty LTR can not be used which is mentioned in the Q&A

When using LTR (vs NACK and FEC) makes sense is a question that is difficult to answer, we get to know Meta’s answer at 9:50: The largest gains seem to be in bandwidth-limited high-loss networks which makes sense
As a “VP8 pipeline” with only very rudimentary H.264 support libWebRTC does not support H.264 LTR out of the box and we will see whether Meta will open source this (and Google merges it)
At 10:30 we jump back to forward error correction, talking about the problems of the XOR-based approach and explaining the “only works if at most one packet covered by the recovery packet is lost” and the protection scheme
At 13:00 the important property of Reed-Solomon-FEC is explained which is more advanced than the XOR-based approach since the number of packets that can be recovered is proportional to the number of parity packets. This is followed by some practical tips when doing RS-FEC (which you won’t be able to do in the browser which also can not send FlexFEC)
At 16:30 there is a recap of the results. As with all other techniques, we are talking about single-digit improvements which is a great win. Meta promises to upstream their FEC to the open source repository which we are looking forward to (some of this already happened here)
Surprisingly video FEC has remained relatively obscure in WebRTC, neither Google Meet nor any of the well-known open source SFUs use it.
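
The limitation of XOR-based FEC mentioned above is easy to show in a few lines. This is a toy sketch that assumes equal-length payloads and a single recovery packet per group; it is not the FlexFEC wire format nor Meta’s implementation, and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR all packets together (zero-padded to the longest) into one recovery packet."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: Dict[int, bytes], parity: bytes,
                    group_size: int) -> Optional[Dict[int, bytes]]:
    """Rebuild the single missing packet; None if zero or more than one is missing."""
    missing = [i for i in range(group_size) if i not in received]
    if len(missing) != 1:
        return None
    # XOR of the parity with every received packet leaves only the lost one.
    recovered = xor_parity(list(received.values()) + [parity])
    return {missing[0]: recovered}


packets = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(packets)

one_lost = {0: packets[0], 2: packets[2]}
print(recover_missing(one_lost, parity, 3))   # packet 1 comes back

two_lost = {0: packets[0]}
print(recover_missing(two_lost, parity, 3))   # cannot recover
```

Reed-Solomon removes exactly this restriction: with more parity packets you can recover more losses from the same group, at the cost of more complex math.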

Second Q&A with Speakers

https://www.youtube.com/watch?v=dv-iEozS9H4&t=13260s (23 minutes)

Watch if: you found any of the sessions this covers interesting

Key Insights:

Quite a few great questions, including some from the one and only Justin Uberti who apparently cannot stop keeping an eye on what is going on in RTC
A lot of interest in LTR

SESSION 3 – RTC@Scale

Tsahi Levent-Levi, bloggeek.me / The past and future of WebRTC, 2024 edition

(24 minutes)

Watch if: you like to hear Tsahi speaking. He does some juggling too!

Key Insights:

Quite often when trying to explain why some things in WebRTC are a bit weird the answer is “for historical reasons”. Tsahi gives his usual overview of the history of WebRTC, dividing it into the early age of exploration, the growth and the differentiation phases and looks at the usage of WebRTC we have seen in and since the pandemic
Tsahi is undoubtedly the person who spent the most time with developers using WebRTC and thought a lot about how to explain it. What is interesting is that Tsahi has to explain what Google does while the WebRTC team at Google remains silent
Google’s libWebRTC is a cornerstone of the ecosystem and is still tightly integrated into Chromium and its build and release process. Yet despite increased usage we see a slowdown in development, looking at the number of commits, and the project is effectively in maintenance mode. And it remains a Google-owned project (notably Meta is not affected by this since they can and have forked libWebRTC and they can release changes without open sourcing them)
What we see (at 10:10) currently in libWebRTC and Chromium is Google striving for more differentiation through APIs like Insertable Streams and Breakout Box without being forced to open source and make everything available to their competitors for free (e.g. we do not have background blur built into Chromium). Philipp isn’t convinced that WebTransport will replace WebRTC altogether. It makes sense for use-cases for which WebRTC was not the right tool though
Screen sharing is another topic (at 14:15) where we see a lot of improvements in Chromium and this is driven by the product needs of Google Meet. Some of the advances may only make sense for Google Meet but that is fair since Google is the party who pays the development cost
Optimization and housekeeping (at 17:20) are something that is not to be underestimated. Google has paid for the development of libWebRTC for more than a decade which is a huge investment in addition to open sourcing the original intellectual property
We heard a lot about AV1 as the most modern video codec and this continues in this talk. Lyra as an alternative audio codec has some competition (such as the new Meta audio codec) and it has not landed natively in the browser. Does Google use it together with WebRTC in native apps? Maybe…it requires effort to find out. As we have seen at KrankyGeek one can use it via WASM and insertable streams
The outlook is at 22:30 and raises the question how WebRTC will fare in 2024

Mandeep Deol + Ishan Khot, Meta / RTC observability

(20 minutes)

Watch if: you deploy a WebRTC-based system in production

Key points:

WebRTC is great when it works but sometimes it does not and then you need to debug why things do not work the way you expect. And you can not seriously ask your users to send you a chrome://webrtc-internals dump. Hence you need to make your system observable which means getting logs from the clients and servers
Two of the points on the slide at 0:40 are applicable to any system you build: you need to ensure user privacy, in particular for IP addresses and you need to strike a balance between reliability and efficiency

The “call debugging” section starting at 3:10 makes a good point: your system needs to provide both service-level metrics (such as what percentage of calls fail) as well as the ability to drill down to a particular session and understand the specific behavior (as you might have noticed, this is a topic close to the hearts of Philipp and Tsahi who evolved this project into watchRTC). At 4:15 we see Meta’s tool named “call dive”:

From the looks of it, it provides the fairly standard “timeline” view of some statistics (since we are dealing with a mobile application there are battery stats) but note that this is aggregated at the call level with multiple users
At 5:40 we get a deep dive into what it took Meta to develop the system. Some of these challenges are specific to their scale but the problem of how to aggregate the logs from the various clients and servers involved is very common
At 10:50 we get a deep-dive into the RAlligator system where the big challenge is determining when a multiparty call is done, all logs have arrived and can be processed by the following parts of the pipeline (which is made more difficult by not uploading the logs in real-time to avoid competing with the actual call). Keeping the logs in memory until then at the scale of Meta must be quite challenging
The system is designed for debugging, not for customer support where you need to explain to a customer why their call failed and need all logs reliably. Cost-effectiveness is a concern as well, you can’t spend more on the logging than you spend on the actual RTC media
At 16:00 we get a nice overview of what might be next. A lot of the things make sense but real time call debugging is just a fancy showcase and not very useful in practice. We would really like to see GenAI summarize webrtc-internals logs for us!
What is missing from the talk is how such a system is generating platform statistics which together with A/B experimentation must be the basis for the rollout results we see in many of the other talks
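
The “when is a multiparty call done” problem can be illustrated with a tiny sketch. This is purely hypothetical and not how Meta’s system works: it assumes we know who joined the call, that clients upload logs only after the call, and that we give up waiting after a grace period.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallLogBuffer:
    expected: set                      # participants known to have joined (assumed input)
    grace_seconds: float = 300.0       # assumed wait for late uploads
    uploads: dict = field(default_factory=dict)
    ended_at: Optional[float] = None

    def add_upload(self, participant, log_lines):
        """Store the log bundle a client uploaded after leaving the call."""
        self.uploads[participant] = log_lines

    def mark_ended(self, now):
        """Record the time at which the server saw the call end."""
        self.ended_at = now

    def ready(self, now):
        """True once the call ended and all logs arrived or the grace period passed."""
        if self.ended_at is None:
            return False
        all_in = self.expected <= set(self.uploads)
        timed_out = now - self.ended_at >= self.grace_seconds
        return all_in or timed_out


buf = CallLogBuffer(expected={"alice", "bob"}, grace_seconds=60)
buf.add_upload("alice", ["joined", "left"])
buf.mark_ended(100.0)
print(buf.ready(110.0))   # still waiting for bob
print(buf.ready(160.0))   # grace period over, process what we have
```

The timeout is where the “designed for debugging, not customer support” trade-off shows up: you accept incomplete data rather than holding logs in memory forever.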

Sean Dubois, Livekit / Open Source from One to at Scale

(21 minutes)

Watch if you: like open source

Key takeaways:

This talk is about Sean’s experience working in the open source community, and especially Pion
Here, Sean tries to explain the benefits of open source versus proprietary software, coming at it from the angle of the individual developer and his own experiences
- When viewing, remember that most of these experiences are with highly popular open source projects
- Your mileage may vary greatly with other types of open source projects
At 05:50 Sean makes a point of why Product Managers aren’t needed (you can talk to the customers directly and they even pay for it)
- Tsahi as a Product Manager objects 😉
- Talking to customers directly is needed for developers in products, but guidance and decisions ultimately need to be taken by the right function – even for developer-centric products and services
At 07:00 we get into how Amazon maintains their Chromium fork (Silk)
- They have lots of patches made that they keep internally and are able to stay two weeks behind Chromium. But this feat requires 6 full time employees to achieve. Igalia had a great blog post on “downstreaming Chromium” recently (part two should be more interesting)
- When using an open source project, careful decisions should be taken about contributing back versus keeping modifications proprietary. Reducing the cost of maintenance is quite an effective argument that Philipp has been using countless times
Sean touches the topic of money and open source at around 15:00. We believe this viewpoint is naive, as it doesn’t factor in investors, competition and other market constraints. For example we have seen a lot of WebRTC CPaaS vendors engage in direct peeing contests in response to Twilio shutting down which had a bad effect on what was left of a sense of “community” in WebRTC

All in all, quite an interesting session. Juxtapose this with how Meta is making use of open source for its own needs and how much of their effort gets contributed back when it comes to WebRTC for example. Or how Google open-sourced WebRTC and is pretty silent about it these days. Philipp’s approach of working with Google remains quite unique in that area but is not born from enthusiasm for WebRTC – more out of a necessity

Liyan Liu + Santhosh Sunderrajan, Meta / Machine Learning based Bandwidth Estimation and Congestion Control for RTC

(20 minutes)

Blog post: https://engineering.fb.com/2024/03/20/networking-traffic/optimizing-rtc-bandwidth-estimation-machine-learning/

Watch if: you are interested in BWE and machine-learning

Key takeaways:

Meta explaining here the work and results they got from employing machine learning to bandwidth estimation
That machine learning can help with BWE has been known for some time. Emil Ivov did a great presentation on the topic at KrankyGeek in 2017
The talk starts with a recap of what Meta achieved by moving from receive-side bandwidth estimation to send-side BWE (SSBWE) in 2021 and a lot of tuning of BWE-related parameters in 2022
- Networks differ, and delivering the best quality requires understanding the type of network you are on
- This is followed by a high-level overview of the different components in the WebRTC SSBWE implementation. That implementation is quite robust but contains a lot of parameters that work in certain scenarios but can be tuned (which is not possible in the browser). See this block diagram of the components:

The “what is the appropriate strategy in this situation” question is one that indeed needs to be answered holistically and is driving resilience mechanisms and encoding
Applying ML to network characterization requires describing the network behavior in a way that can be understood by machine learning which is the topic of the part of the talk starting at 4:10. Make sure to talk to your favorite machine learning engineer to understand what is going on! The example that starts at 7:05 gets a bit more understandable and shows what input “features” are used
Once random packet loss is detected the question is what to do with that information and we get some answers at 9:05. E.g. one might ignore “random” loss for the purpose of loss-based estimation (which Google’s loss-based BWE does in a more traditional way by using a trendline estimator for the loss)
At 9:30 we go from network characterization to network prediction, i.e. predicting how the network is going to react in the next couple of seconds
- This is taking traditional delay-based BWE which takes an increase in receive-packet delay as input for predicting (and avoiding) congestion
- The decision matrix shown at 12:00 is essentially a refined version of the GoogCC rate control table
- As we learn in the Q&A the ML model for this is around 30kb or ten seconds of Opus-encoded audio but binary size is a concern
At 14:50 we get into the results section which shows a relatively large gain from the improvements. Yep, getting BWE right is crucial to video quality! We are not surprised that a more complex ML-based approach outperforms simplified hand-tuned models either. WebRTC’s AudioNetworkAdapter framework is an early example of this
An interesting point from the outlook that follows is how short the “window” used for the decisions is. 10 seconds is a lot of time in terms of packets but a relatively short window compared to the duration of the usual call
As we learn in the Q&A the browser lacks APIs for doing this kind of BWE tuning. Yet the W3C WebRTC Working Group prefers spending time on topics like “should an API used by 1% be available on the main thread”…
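
For readers who have never looked at the “traditional” side of this, here is a simplified sketch of the idea behind delay-based estimation: fit a line through recent delay samples and treat a rising trend as a sign of congestion. It is an illustration only, not libWebRTC’s trendline estimator; the threshold value and the state labels are assumptions.

```python
def trend_slope(samples):
    """Least-squares slope of (time_ms, delay_ms) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    num = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def classify(samples, threshold=0.01):
    """Map the delay trend to a coarse network state (hypothetical threshold)."""
    slope = trend_slope(samples)
    if slope > threshold:
        return "overuse"      # delay keeps growing: queues are building up
    if slope < -threshold:
        return "underuse"     # delay is shrinking: queues are draining
    return "normal"


growing = [(0, 10), (100, 12), (200, 14), (300, 16)]
flat = [(0, 10), (100, 10), (200, 10)]
print(classify(growing))
print(classify(flat))
```

An ML-based approach replaces the single hand-picked threshold with a model that looks at many such features at once.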

Live Q&A with Speakers

https://www.youtube.com/live/dv-iEozS9H4?feature=shared&t=21000 (24 minutes)

Watch if: you found any of the sessions this covers interesting

Key Insights:

Quite a few great questions again, including how to simulate loss in a realistic way (where the opus 1.5 approach may help)
And we learn how many balls Tsahi can juggle!

Closing remarks

As in previous years, we tried capturing as much as possible, which made this a wee bit long. The purpose though is to make it easier for you to decide on which sessions to focus, and even on which parts of each session. And of course for us so we can look things up and reference it in future blog posts or courses!

The post RTC@Scale 2024 – an event summary appeared first on BlogGeek.me.

WebRTC recording challenges and solutions

Tsahi Levent-Levi — Mon, 26 Feb 2024 10:30:00 +0000

Need WebRTC recording in your application? Check out the various requirements and architectural decisions you’ll have to make when implementing it.

A critical part of many WebRTC applications is the ability to record the session. This might be a requirement for an optional feature or it might be the main focus of your application.

Whatever the reasons, WebRTC recording comes in different shapes and sizes, with quite a few alternatives on how to get it done these days.

What I want to do this time is to review a few of the aspects related to WebRTC recording, making sure that when it is your time to implement, you’ll be able to make better choices in your own detailed requirements and design.

Record-and-upload or upload-and-record
Multi stream or single stream recording
Switching or compositing
Rigid layouts or flexible layouts
Transcoding pipeline or browser engine
Live or “offline”
Plan your WebRTC recording architecture ahead of time

Record-and-upload or upload-and-record

One of the fundamental things you will need to consider is where do you plan the WebRTC recording to take place – on the device or on the server. You can either record the media on the device and then (optionally?) upload it to a server. Or you can upload the media to a server (live in a WebRTC session) and conduct the recording operation itself on the server.

Recording locally uses the MediaRecorder API while uploading uses HTTPS or WebSocket. Recording on the server uses WebRTC peer connection and then whatever media server you use for containerizing the media itself on the server.

Here’s how I’d compare these two alternatives to one another:

	Record-and-upload	Upload-and-record
Technology	MediaRecorder API + HTTPS	WebRTC peer connection
Client-side	Some complexity in implementation, and the fact that browsers differ in the formats they support	No changes to client side
Server-side	Simple file server	Complexity in recording function
Main advantages	No added infrastructure complexity; Better quality on poor networks (assuming you have time to wait for the uploaded recording)	Decoupling of recording requirements from client device characteristics and capabilities; Full control over composited result

🔴 When would I record-and-upload?

I would go for client-side recording using MediaRecorder in the following scenarios:

My sole purpose is to record and I am the only “participant”. Said differently – if I don’t record, there would be no need to send media anywhere
The users are aware of the importance of the recording and are willing to “sacrifice” a bit of their flexibility for higher production quality
The recorded stream is more important to me than whatever live interaction I am having – especially if there’s post production editing needed. This usually means podcasts recording and similar use cases

🔴 When would I upload-and-record?

Here’s when I’d use classic WebRTC architectures of upload-and-record:

I lack any control over the user’s devices and behavior
Recording is a small feature in a larger service. Think web meetings where recording is optional at the discretion of the users and used a small percentage of the time
When sessions are long. In general, if the sessions can be longer than an hour, I’d prefer upload-and-record to record-and-upload. No good reason. Just a gut feeling that guides me here

🔴 How about both?

There’s also the option of doing both at the same time – record-and-upload in parallel to upload-and-record. Confused?

Here’s where you will see this taking place:

An application that focuses on the creation of recorded podcast-like content that gets edited
One that is used for interviews where two or more people in different locations have a conversation, so they have to be connected via a media server for the actual conversation to take place
Since there’s a media server, you can record in the server using the upload-and-record method
Since you’re going to edit it in post production, you may want to have a higher quality media source, so you record-and-upload as well
You then offer these multiple resulting recordings to your user, to pick and choose what works best for him

Multi stream or single stream recording

If you are recording more than a single media source, let’s say a group of people speaking to each other, then you will have this dilemma to solve:

Will you be using WebRTC recording to get a single mixed stream out of the interaction or multiple streams – one per source or participant?

Assuming you are using an SFU as your media server AND going with the upload-and-record method, then what you have in your hands are separate media streams, each per source. Also, what you need is a kind of an MCU if you plan on recording as a single stream…

For each source you could couple their audio and video into a single media file (say .webm or .mp4), but should you instead mix all of the audio and video sources together into a single stream?

Using such a mixer means spending a lot of CPU and other resources for this process. The illustration below (from my Advanced WebRTC Architecture course) shows how that gets done for two users – you can deduce from there for more media sources:

The red blocks are the ones eating up on your CPU budget. Decoding, mixing and encoding are expensive operations, especially when an SFU is designed and implemented to avoid exactly such tasks.

Here’s how these two alternatives compare to each other:

	Multiple streams	Mixed stream
Operation	Save into a media file	Decode, mix and re-encode
Resources	Minimal	High on CPU and memory
Playback	Customized, or each individual stream separately	Simple
Main advantages	No data loss from the session; Can create multiple playback experiences; Easy to diarize transcriptions since nothing is mixed; Simple to implement; Can mix later on if needed	Simple to play back anywhere; Requires less storage space

🔴 When would I use multi stream recording?

Multi stream can be viewed as a step towards mixed stream recording or as a destination of its own. Here’s when I’d pick it:

When I need to be able to play back more than a single view of the session in different playback sessions
If the percentage of times recorded sessions get played back is low – say 10% or lower. Why waste the added resources? (here I’d treat it as a step towards an optional mixed stream “destination”)
When my customer might want to engage in post production editing. In such a case, giving him more streams with more options would be beneficial

🔴 When would I decide on mixed stream recording?

Mixed recording would be my go-to solution almost always. Usually because of these reasons:

In most cases, users don’t want to wait or deal with hassles during the playback part
Even if you choose multi stream for your WebRTC recording, you’ll almost always end up needing to provide also a mixed stream experience
Playing back multi stream content requires writing a dedicated player (haven’t seen a properly functioning one yet)

🔴 What about mixed stream client side recording?

One thing that I’ve seen once or twice is an attempt to use a device browser to mix the streams for recording purposes. This might be doable, but quality is going to be degraded for both the actual user in the live session as well as in the recorded session.

I’d refrain from taking this route…

Switching or compositing

If you are aiming for a single stream recording, then the next dilemma you need to solve is the one between switching and compositing. Switching is the poor man’s choice, while compositing offers a richer “experience”.

What do I mean by that?

Audio is easy. You always need to mix the sources together. There isn’t much of a choice here.
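
To show what “mixing” means at its simplest, here is a sketch that sums decoded 16-bit PCM samples and clips the result. It assumes the sources are already decoded, time-aligned and at the same sample rate, which is where the real work is; `mix_pcm16` is a hypothetical helper.

```python
def mix_pcm16(sources):
    """Mix lists of 16-bit PCM samples by summing and clipping to the int16 range."""
    length = min(len(s) for s in sources)   # only mix the overlapping part
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in sources)
        mixed.append(max(-32768, min(32767, total)))
    return mixed


alice = [1000, -2000, 30000]
bob = [500, -500, 10000]
print(mix_pcm16([alice, bob]))
```

A production mixer would also normalize levels so that loud participants don’t clip, but the decode, sum, re-encode shape stays the same.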

For video though, the question is mostly what kind of a vantage point do you want to give that future viewer of yours. Switching means we’re going to show one person at a time – the one shouting the loudest. Compositing means we’re going to mix the video streams into a composite layout that shows some or all of the participants in the session.

Google Meet, for example, uses the switching method in its recordings, with a simple composite layout when screen sharing takes place (showing the presenter and his screen side by side, likely because it wasn’t too hard on the mixing CPU).

In a way, switching enables us to “get around” the complexity of single stream creation from multiple video sources:

	Switching	Compositing
Audio	Mix all audio sources	Mix all audio sources
Video	Select single video at a time, based on active speaker detection	Pick and combine multiple video streams together
Resources	Moderate	High CPU and memory needs
Main advantages	Cost effective	More flexible in layouts and understanding of participants and what they visually did during the meeting
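
Here is a minimal sketch of what switching logic could look like: pick the loudest participant, but hold the current one for a while to avoid flapping between videos. The audio levels, the 2 second hold time and the class name are assumptions for illustration, not how any specific product does it.

```python
class SpeakerSwitcher:
    def __init__(self, hold_ms=2000):
        self.hold_ms = hold_ms        # minimum time to stay on one video (assumed value)
        self.current = None
        self.switched_at = None

    def update(self, now_ms, levels):
        """levels: {participant_id: audio level}. Returns whose video to record."""
        if not levels:
            return self.current
        loudest = max(levels, key=levels.get)
        if self.current is None or self.current not in levels:
            # Nothing selected yet, or the selected participant left.
            self.current, self.switched_at = loudest, now_ms
        elif loudest != self.current and now_ms - self.switched_at >= self.hold_ms:
            self.current, self.switched_at = loudest, now_ms
        return self.current


switcher = SpeakerSwitcher(hold_ms=2000)
print(switcher.update(0, {"alice": 0.9, "bob": 0.1}))      # alice starts talking
print(switcher.update(500, {"alice": 0.1, "bob": 0.9}))    # too early to switch
print(switcher.update(2500, {"alice": 0.1, "bob": 0.9}))   # now bob takes over
```

The selected video still needs a keyframe at each switch point, which is part of why switching takes more development effort than it first appears.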

🔴 When would I pick switching?

When the focus is the audio and not the video.

Let’s face it – most meetings are boring anyway. We’re more interested in what is being said in them, and even that can be an exaggeration (one of the reasons why AI is used for creation of meeting summaries and action items in some cases).

The only crux of the matter here, is that implementing switching might take slightly longer than compositing. In order to optimize for machine time in the recording process, we need to first invest in more development time. Bear that in mind.

🔴 When would compositing be my choice?

The moment the video experience is important. Webinars. Live events. Video podcasts.

Media that you plan or want to apply post production editing to.

Or simply when the implementation is there and easier to get done.

I must say that in many cases that I’ve been involved with, switching could have been selected. Compositing was picked just because it was thought of as the better/more complete solution. Which begs the question – how can Google Meet get away with switching in 2024? (the answer is simple – it isn’t needed in a lot of use cases).

Rigid layouts or flexible layouts

Assuming you decided on compositing the multiple video streams into a single stream in your WebRTC recording, it is now time to decide on the layout to use.

You can go for a single rigid layout used for all (say tiles or presenter mode). You can go for a few layouts, with the ability to switch from one to the other based on context or some external “intervention”. You can also go for something way more flexible. I guess it all depends on the context of what you’re trying to achieve:

	Single	Rigid	Flexible
Concept	A single layout to rule them all	Have 2, 3 or 7 specific layouts to choose from	Allow virtually any layout your users may wish to use
Main advantages	Simple to implement; Once implemented, it is hands-free	Gives a few choices to your users; Knowing the layouts in advance enables code optimizations for them	Users can control everything, so you can offer the best user experience possible
Main challenges	What if that single layout isn’t enough for your users?	How to choose which layouts to have? When and how to switch between these layouts?	How are layouts defined and created? When and how to switch between the layouts?

Here’s a good example of how this is done in StreamYard:

StreamYard gives 8 predefined different layouts a host can dynamically choose from, along with the ability to edit a layout or add new ones (the buttons at the bottom right corner of the screen).

🔴 When to aim for rigid layouts?

Here’s when I’ll go with rigid layouts:

The recording is mostly an after-effect and not the “main course” of the interaction. For the most part, group meetings don’t need flexible layouts (no one cares enough anyway)
My users aren’t creatives in nature, which brings us to the same point. The WebRTC recording itself is needed, but not for its visual aesthetics – mostly for its content
When users won’t have the time or energy to pick and choose on their own

Here, make sure to figure out which layouts are best to use and how to automatically make the decision for the users (it might be that whatever the host layout is you record, or based on the current state of the meeting – with screen sharing, without, number of participants, etc).
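
As a sketch of such an automatic decision, the snippet below picks between a few rigid layouts based on the meeting state and computes tile positions for a grid. The layout names and the near-square grid rule are assumptions for illustration.

```python
import math


def pick_layout(participants, screen_sharing):
    """Choose one of a few predefined layouts from the meeting state."""
    if screen_sharing:
        return "presenter"
    return "single" if participants <= 1 else "tiles"


def tile_grid(n, width, height):
    """Return (x, y, w, h) rectangles for n tiles in a near-square grid."""
    if n <= 0:
        return []
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    w, h = width // cols, height // rows
    return [((i % cols) * w, (i // cols) * h, w, h) for i in range(n)]


print(pick_layout(4, screen_sharing=False))
print(tile_grid(4, 1280, 720))
```

Knowing the handful of possible outputs in advance is what lets you optimize the compositor for them.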

🔴 When would flexibility be in my menu?

Flexibility will be what I’ll aim for if:

My users care deeply about the end result (assume it has production value, such as uploading it to YouTube)
This is a generic platform (CPaaS), and I am not sure who my users are, so some may likely need the extra flexibility

Transcoding pipeline or browser engine

You decided to go for a composite video stream for your WebRTC recording? Great! Now how do you achieve that exactly?

For the most part, I’ve seen vendors pick up one of two approaches here – either build their own proprietary/custom transcoding pipeline – or use a headless browser as their compositor:

	Transcoding pipeline	Browser engine
Underlying technology	Usually ffmpeg or gstreamer	Chrome (and ffmpeg)
Concept	Stitch the pipeline on your own from scratch	Add a headless browser in the cloud as a user to the meeting and capture the screen of that browser
Resources	High	High, with higher memory requirements (due to Chrome)
Main advantages	Fewer moving parts means the solution is more robust; Cost effective, scales a bit better	Easier to implement; View can easily include any HTML/CSS element you desire

Here I won’t be giving an opinion about which one to use as I am not sure there’s an easy guideline. To make sure I am not leaving you half satisfied here, I am sharing a session Daily did at Kranky Geek in 2022, talking about their native transcoding pipeline:

Since that’s the alternative they took, look at it critically, trying to figure out what their challenges were, so you can create your own comparison table and decide which path to take.

Live or “offline”

Last but not least, decide if the recording process takes place online or post mortem – live or “offline”.

This is relevant when what you are trying to do is to have a composite single media stream out of the session being recorded. With WebRTC recording, you can decide to start off by just saving the media received by your SFU with a bit of metadata around it, and only later handle the actual compositing:

	Live	“offline”
Concept	Handle recording on demand, as it is taking place. Usually, adding 0-5 seconds of delay	Use job queues to handle the recording process itself, making the recorded media file available for playback minutes or hours after the session ended
Main advantages	Can be used to stream the media to live platforms (YouTube Live, Twitch, LinkedIn Live, Facebook Live, etc); Better user experience (available faster)	Better utilization of media processing resources; Can be delayed until a request is made to play back a session

🔴 When to go live?

The simple answer here is when you need it:

If you plan on streaming the composited media to a live streaming platform
When all (or most) sessions end up being played back

🔴 When to use “offline”?

Going “offline” has its set of advantages:

Cost effective – when you’re Uncle Scrooge
1. Commit to compute resources with your cloud vendor and then queue such jobs to get better machine utilization
2. You can use spot instances in the cloud to reduce on costs (you may need to retry when they get taken away)
If the streams aren’t going to be viewed immediately
Assuming streams are seldom viewed at all, it might be best to composite them only on demand, with the assumption that storage costs less than compute (depends on how long you need to store these media files)

🔴 How about both?

Here are some suggestions of combinations of these approaches that might work well:

Mix audio immediately, but wait up with video compositing (it might not be needed at all)
Use offline, but have the option to bump priority and “go live” based on the session characteristics or when users seem to want to play back the file NOW
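
The “offline with a priority bump” idea can be sketched with a small priority queue. This is an in-memory illustration with hypothetical names; a real deployment would sit on top of whatever job queue your cloud setup already provides.

```python
import heapq
import itertools


class RecordingQueue:
    def __init__(self):
        self._heap = []
        self._latest = {}                  # session_id -> id of its valid heap entry
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def submit(self, session_id, priority=10):
        """Queue a session for compositing; lower numbers run first."""
        entry_id = next(self._counter)
        self._latest[session_id] = entry_id
        heapq.heappush(self._heap, (priority, entry_id, session_id))

    def bump(self, session_id):
        """Move a waiting session to the front, e.g. when a user asks for playback."""
        if session_id in self._latest:
            self.submit(session_id, priority=0)

    def next_job(self):
        """Pop the next session to composite, skipping superseded entries."""
        while self._heap:
            _, entry_id, session_id = heapq.heappop(self._heap)
            if self._latest.get(session_id) == entry_id:
                del self._latest[session_id]
                return session_id
        return None


queue = RecordingQueue()
for session in ("s1", "s2", "s3"):
    queue.submit(session)
queue.bump("s3")                 # someone wants to watch s3 right now
print(queue.next_job())
print(queue.next_job())
```

The same structure works for retries on spot instances: a job that lost its machine simply gets submitted again.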

Plan your WebRTC recording architecture ahead of time

This has been long. Sorry about that.

Designing your WebRTC recording architecture isn’t simple once you dive into the details. Take the time to think of these requirements and understand the implications of the architecture decisions you make.

Oh, and did I mention there’s a set of courses for WebRTC developers available? Just go check them out at https://webrtccourse.com 😃

The post WebRTC recording challenges and solutions appeared first on BlogGeek.me.

An FAQ for WebRTC beginners

Tsahi Levent-Levi — Mon, 29 Jan 2024 10:30:00 +0000

Answering some common questions about WebRTC that seem to be top of mind on Google search.

A few days ago, I searched something on Google, and somehow bumped into a page full of questions Google found relevant or common. These weren’t exactly relevant to my search term (not directly), but they were there. And they were beginner questions about WebRTC.

It dawned on me that I’ve probably mentioned some of these things in passing (or a wee bit more) in the past, but placing them all neatly together in one place made sense. So here we are. And here’s the WebRTC FAQ for beginners.

Is WebRTC TCP or UDP?
Is WebRTC still used?
Is WebRTC free or paid?
What is WebRTC used for?
Is WebRTC a security risk?
Does Netflix use WebRTC?
Can WebRTC be hacked?
Does WebRTC expose your IP?
What is better than WebRTC?
Is WebRTC better than Websockets?
Is Google a WebRTC?
Does WebRTC need a server?
Does WebRTC require Internet?
Does WebRTC use SSL?
Where’s the answer to my question?

Is WebRTC TCP or UDP?

WebRTC is neither TCP nor UDP. At the same time WebRTC is both TCP and UDP.

Confused?

Let’s put things in order.

With WebRTC there’s signaling and media.

Signaling is considered to be out of scope and left to the application. Most applications will use HTTPS or a secure WebSocket as transport for signaling. HTTPS runs over TCP… sort of… since HTTP/3 runs over QUIC, which uses UDP. But mostly, you can think of signaling in WebRTC as TCP and the skies won’t fall (👉 what we want for signaling is reliability and message ordering, and TCP based protocols give us that).

Media in WebRTC wants to use UDP. It strives to use UDP as much as possible, but that’s not always available to it, so it then falls back towards using TCP. But you can consider this as a last resort (we don’t want to be in that predicament).

Is WebRTC still used?

Yes. You wouldn’t be reading my blog otherwise 😎

It isn’t that there aren’t any challengers. It is that WebRTC is still the most popular and common solution for real time communications in web browsers.

WebTransport + WebCodecs + WebAssembly might someday replace WebRTC. But we’re not there yet.

Is WebRTC free or paid?

Free. Err. Paid. Free? Paid? Both? None?

Let’s sort things out here.

WebRTC is an open standard with a popular open source implementation maintained by Google and used by all major browser vendors.

Accessing the APIs and using them is free.

But creating most of the meaningful applications is going to require some sort of payment. That can be to a CPaaS vendor to host the WebRTC infrastructure; or to an IaaS vendor (think AWS) to host the servers and the bandwidth use (especially with TURN and media servers).

So yes. WebRTC is free, but expect to pay for it, in particular if you need help. Google will not help you…

What is WebRTC used for?

WebRTC is used for implementing realtime voice and video communications over the internet using web browsers. But it definitely isn’t limited to that.

I’ve seen use cases dealing with recording, live streaming, broadcasting, cloud gaming, remote teleoperation (that’s driving a car… remotely), peer assisted delivery, file transfer, … the list is endless.

Is WebRTC a security risk?

WebRTC enables browsers to have (and give) access to your microphone, camera, display and IP address. This is what every voice or video meeting application you install requires in order to work properly as well.

Is that a security risk? That’s up to you to decide as a user.

Giving such power to the browser reduces the friction for users but also for nefarious third parties who want to exploit these capabilities, so some will see this as an increase in security risk.

For developers it simply means that they need to know and understand what they are doing and how they implement their applications with this technology in order to mitigate any potential risk. It is worth noting that WebRTC and web browsers from their side do the most they can to reduce such security risks and even encourage developers to write secure applications.

Does Netflix use WebRTC?

No.

Netflix might be using WebRTC somewhere, but for its main video streaming service Netflix doesn’t use WebRTC.

Why? Because WebRTC is designed and fine tuned for real time communications. As such, it sacrifices quality for improved latency.

Netflix is the exact opposite. It strives to deliver the best quality and is willing to sacrifice a bit of latency while at it – you wouldn’t mind waiting a few seconds for your movie to start in order to have crisp and pristine video. On the other hand, you’d be pissed if your online video conversation had a latency of 5 seconds and felt as if the other person was sitting on the moon.

Can WebRTC be hacked?

Yes.

Everything can be hacked.

Browsers are trying to do their best to reduce that risk for WebRTC (and other technologies they implement), but it is an arms race…

Does WebRTC expose your IP?

This is a tricky question. The answer is yes and no.

Let’s start by understanding which IP address…

Your device usually has two IP addresses:

A local IP address, used inside its local network – say the home network
A public IP address, which the NAT assigns to it and is used to communicate with “the world”

Each application on your device, including the browser, has access to the local IP address.

Each web server you connect to on the internet sees your public IP address.

When negotiating a WebRTC session, WebRTC uses a mechanism called ICE which discovers your public IP address and shares your local and public IP addresses with the peer it connects with.

A few quick clarifications here:

WebRTC will not expose a local IP address without permissions to access a camera or a microphone
Any voice or video communication application ends up exposing the same addresses in a similar fashion
A WebRTC application can decide to use only TURN relay or media servers so as to not expose these IP addresses to other users
There are browser extensions that limit the ability to expose local IP addresses
If your VPN leaks your public IP with WebRTC, it is the VPN that is not working
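
To see what ICE actually shares, it helps to look at the candidate lines exchanged in signaling. The sketch below parses such lines and keeps only relay candidates, which is roughly what a relay-only policy achieves (in browsers this is what setting iceTransportPolicy to "relay" does). The candidate lines and addresses here are made up for illustration, and the helper names are mine.

```python
def parse_candidate(line):
    """Pull the transport, address and type out of an ICE candidate line."""
    fields = line.split()
    return {
        "transport": fields[2].lower(),
        "address": fields[4],
        "type": fields[fields.index("typ") + 1],  # host, srflx, prflx or relay
    }


def relay_only(lines):
    """Keep only candidates that point at a TURN server."""
    return [line for line in lines if parse_candidate(line)["type"] == "relay"]


# Made-up candidates: a local address, a public address and a TURN relay address
candidates = [
    "candidate:1 1 udp 2122260223 192.168.1.10 54321 typ host",
    "candidate:2 1 udp 1686052607 203.0.113.7 54321 typ srflx raddr 192.168.1.10 rport 54321",
    "candidate:3 1 udp 41885439 198.51.100.20 60000 typ relay raddr 203.0.113.7 rport 54321",
]

for line in relay_only(candidates):
    print(parse_candidate(line)["address"])  # only the TURN server address is left
```

With host and srflx candidates filtered out, the remote peer only ever learns the address of the TURN server, at the price of relaying all media through it.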

What is better than WebRTC?

A cheesecake from Philipp Hancke for my 10-years BlogGeek.me birthday

A cheesecake is definitely better than WebRTC. A chocolate cheesecake is doubly so.

In all seriousness though, I have no clue.

It depends. Which is a cop out answer but the only one here.

The question should be more specific. It should include what it is you are trying to build, what is the target audience and what medium do you want to use for it.

For live streaming, WebRTC might not be the best fit. Especially if you can live with a 2 seconds delay (in that case, LL-HLS and LL-DASH would be better solutions for example).

For video conferencing… well… I’d start by selecting WebRTC by default. And then try to poke holes in my decision and select something else – proprietary – since there is nothing else…

Is WebRTC better than Websockets?

Apples to oranges.

I’d use both. In the same application. Seriously.

WebSocket for signaling and WebRTC for media.

There are two places where you can think of WebRTC and WebSocket as alternatives:

WebRTC’s data channel, which is bidirectional in nature and peer-to-peer. For the most part, I’ll still use WebSocket. Unless I am serious about my low latency requirements or my privacy requirements
When aiming for live streaming. But then I might just go for WebTransport instead of WebSocket – being forward thinking…

Did I already say apples to oranges?

Is Google a WebRTC?

To be frank – Google is Google. Not sure what the question is here 🤣

Google and WebRTC have an interesting relationship.

It all started when Google acquired GIPS, a company that licensed media engines. A bit afterward, WebRTC was announced in the standardization organizations and Google made the GIPS media engine into an open source implementation, integrating it into Chrome and placing APIs on top of it – these APIs were the WebRTC API specifications (or close enough at the time).

That was over 10 years ago. Since then, WebRTC has evolved and so has Google’s implementation of it.

Google uses WebRTC internally for Google Meet and for other products and projects it has.

The actual WebRTC project is open source. Maintained by Google. And most of the contributions to it are Google’s.

Does WebRTC need a server?

Yes. WebRTC needs a server. In fact, it needs multiple servers.

For starters, you need to download the application logic from somewhere, and a way to signal who you want to make a conversation with. This is done with a signaling server.

Then, when connecting the WebRTC session, there are times when you won’t have a direct route for the media. In such cases, you are going to need a TURN server. TURN servers also act as STUN servers but STUN servers are not the same as signaling servers.

And, you may want to go fancy – run a group meeting, record stuff. Such capabilities almost always mean you are adding a media server into the mix.

Does WebRTC require Internet?

Yes.

Everything today requires the Internet. Even you being able to read this FAQ requires the Internet.

WebRTC can run in local networks or private networks without connecting to the public Internet. But it still needs an IP network to work.

Does WebRTC use SSL?

Yes.

Let’s start with definitions first: For me SSL and TLS are one and the same.

HTTPS and WSS (Secure HTTP and Secure WebSocket) both run on top of TLS so they are also → SSL.

Web browsers practically force application developers to use HTTPS for the pages that host these services, which means all signaling used with WebRTC will be done via HTTPS or WSS.

The media uses SRTP, which is Secure RTP, which doesn’t use TLS (because it isn’t running over TCP). Its encryption keys are negotiated using DTLS, the datagram flavor of TLS that runs over UDP. That said, when sessions need to be relayed via TURN servers, they might end up being relayed over TURN/TLS.

Where’s the answer to my question?

Couldn’t find the answer?

I can invite you to follow and read my blog – it has a lot of resources about WebRTC

My suggestion? Start here 👉 What is WebRTC?

If you are looking to skill up with WebRTC, I also have WebRTC courses for you.

The post An FAQ for WebRTC beginners appeared first on BlogGeek.me.

Top WebRTC open source media servers on github for 2024

Tsahi Levent-Levi — Mon, 01 Jan 2024 10:30:00 +0000

What are the WebRTC open source media servers in 2024, and which ones are the best, based on github stars.

This one is one of those sensitive articles which many people later complain about. So I’ll start it with a few disclaimers:

Different tools are suitable for different use cases. This means that a WebRTC media server here that is low on the popularity list might be the best fit for your requirements
It was enjoyable to look it up, so I just had to write this down
I love you all – I truly do. Please don’t be mad at me
That said, I am expecting a sarcastic enough meme by Iñaki. One that I can proudly add to this article – just below this bullet 😉

pic.twitter.com/AG8RuAjPlM
— Iñaki (@ibc__again) January 1, 2024

The WebRTC open source ecosystem
My “top 4” WebRTC open source media servers
Using github for our WebRTC popularity contest
Janus
Jitsi Meet
Mediasoup
Pion
The best WebRTC open source media server

The WebRTC open source ecosystem

WebRTC is free. At least the part of it being an open standard with a commercial grade open source implementation that is available and embedded across all modern browsers.

This has garnered a nice developer ecosystem around it, part of which is open source in its nature. A simple search for “webrtc” on github returns over 32k results.

There are a lot of different avenues to WebRTC projects on github. The main ones that come to the top of my head include:

Media servers
Signaling servers and frameworks
WebRTC implementations in different languages
Samples and experiments
Applications written on top of WebRTC
…

For this specific article, I want to focus on media servers.

My “top 4” WebRTC open source media servers

There are quite a few WebRTC media servers, many of which are open source. That said, most aren’t widely known or got to the point of being interesting enough for me to take notice (I usually take notice when someone tells me he is using it for something that goes to commercial use).

Throughout the years, the list of the popular WebRTC media servers hasn’t changed that much. I’ve been using this diagram for two years now, and it probably still holds true:

Due to this, my “top 4” is simply the WebRTC open source media servers above that are still relevant. And to make sure people don’t bash me on minor issues, I’ll be presenting these in alphabetical order: Janus, Jitsi, mediasoup and Pion

Using github for our WebRTC popularity contest

How do you even begin deciding which WebRTC open source media server project is the most commonly used out there?

One approach is to count the stars. Github stars. Luckily, all the projects I was interested in have github repos. Philipp Hancke directed me to GitHub Star History, which, after a bit of fooling around, got me this nice initial chart:

Based on people who placed a star on these github projects, we can see that mediasoup is chugging along, last in the pack. It is followed by Janus. Then there’s Pion and Jitsi Meet is ahead of the pack.

Each of these projects started at a different point in time. Pion was last to the party, which means the other projects had a headstart on it. Aligning them all on the point in time they were added to github, produces this chart:

Initial immediate thoughts here?

mediasoup is the slowest growing media server
Janus is growing at a steady, albeit slow pace
Jitsi changed its trajectory during the pandemic and has been growing faster ever since
Pion is the fastest growing project here, keeping at Jitsi’s recent pace to stardom

Let’s do a quick deep dive into each one of these.

Janus

Janus is one of the oldest WebRTC media servers. It is written in C, which might be the reason for its limited adoption – most developers these days won’t know how to write a hello world application in C – let alone figure out its memory use concepts (where you have to explicitly free what you allocate).

What Janus has going for it is a company. Meetecho, the maintainer of Janus, offers paid support and development services around Janus. Something other open source WebRTC media servers lack.

The trajectory of Janus is unlikely to change. It is versatile, has a community around it and support services.

Jitsi Meet

Jitsi Meet is likely the oldest of WebRTC media servers. Started by Bluejimp, who were acquired by Atlassian and then 8x8.

While Jitsi doesn’t offer any direct support and development services for Jitsi, it does offer JaaS – a managed Jitsi service for developers.

Jitsi is written in Java and has a React UI implementation.

One reason for its meteoric rise is the pandemic. Jitsi is the only open source solution that came fully built and optimized for group calls. From the get go, their mission was to build an open source Google Hangouts (that’s Google Meet today). And they succeeded.

By narrowing their applicability to a specific use case, they opened up their viability as a solution to a larger target audience – way beyond that of developers building applications.

This unfair advantage places them here as a top dog. This doesn’t mean that they are suitable for everyone – quite the opposite. They are suitable for those building Google Meet-like experiences. For things that are beyond this use case, shop around the other media servers first. But for a Google Meet-like service? Start from Jitsi Meet.

Mediasoup

Mediasoup is an open source WebRTC media server exposed as Node.js and Rust libraries. It is designed for high performance, with the unique concept of having the application built right inside the same Node.js or Rust process.

The challenge with mediasoup is its inability to offer official support and development services. Here, the reason is simple – the main creators and contributors work as developers at Miro today.

This challenge is probably what led to the slow growth of mediasoup in the github popularity contest.

That said, if you go and look at many large scale group calling deployments, they use mediasoup…

Pion

Pion is last to the scene, but fast growing compared to the others. There are 3 reasons why:

Pion is written in Go language. For some reason, Go has its fandom of developers who love the language. This makes Pion their Go-to (pun intended) open source project
Pion is general purpose. It is used to build both clients and servers. There are multiple media server implementations written on top of Pion, but in general, the fact that you can build more with it immediately garners more stars for the project
Sean DuBois. The person who started Pion has a huge and infectious personality that helped push Pion forward. Other open source projects have their own unique personas, but whoever had the chance to speak with Sean directly will understand what I am saying here

As Pion’s popularity grows, so does the number of commercial services cropping up that use Pion.

The best WebRTC open source media server

None.

All.

It depends.

For managers, my suggestion is almost always to let their developers experiment and pick and choose the open source WebRTC media server that they see fit. There are differences across these alternatives, but at the end of the day, if anyone tries to force a developer to use something he doesn’t think is the right solution – said developer will make sure to explain to the one forcing him why the decision made is the wrong one. In other words, you don’t want to go against your developers.

For developers, I find myself suggesting different media servers depending on their use case, requirements and even company DNA.

So in short, there’s no best WebRTC open source media server. There are several alternatives that are great – you just need to pick the one that is best for you 😀

The post Top WebRTC open source media servers on github for 2024 appeared first on BlogGeek.me.

WebRTC conferences – to mix or to route audio

Tsahi Levent-Levi — Mon, 21 Aug 2023 09:30:00 +0000

How do you choose the right architecture for a WebRTC audio conferencing service?

Last month, Lorenzo Miniero published an update post on work he is doing on Janus to improve its AudioBridge plugin. It touched a point that I failed to write about for a long time (if at all), so I wanted to share my thoughts and views on it as well.

I’ll start with a quick explanation – Lorenzo is adding to Janus a lot of layers and flexibility that are needed by developers who are taking the route of mixing audio in WebRTC conferences. What I want to discuss here is when to use audio mixing and when not to use it. As with everything else, there usually isn’t a clear cut decision here.

What’s mixing and what’s routing in WebRTC?
Audio processing tools available for us in WebRTC
Mixing keeps the headaches away from the browser
Routing gets you better flexibility
Where the rubber hits the road – let’s talk use cases
Which will it be? MCU or SFU for your next audio meeting?

What’s mixing and what’s routing in WebRTC?

Group calls in WebRTC can take different shapes and sizes. For the most part, there are 3 dominant architectures for WebRTC multiparty calling: mesh, mixing and routing.

I’ll be focusing on mixing and routing here since they scale well to 100’s or more users.

Let’s start with the basics.

Assume there’s a conversation between 5 people. Each of these people can speak his mind and the others can hear him speaking. If all of these people are remote from each other and we now need to model it in WebRTC, we might think of it as something like this illustration:

This is known as a mesh network. Its biggest disadvantage for us (though there are others) is the messiness of it all – the number of connections between participants grows quadratically with the number of users. The fact that we need to send out the same audio stream to all participants individually is another huge disadvantage. Usually, we assume (and for good reasons) that the network available to us is limited.
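
A quick back-of-the-envelope calculation shows how fast a mesh gets out of hand. With n participants there are n(n-1)/2 connections in total, and each participant has to send n-1 copies of its own audio. The function names below are just mine, for illustration:

```python
def mesh_connections(participants):
    """Total peer connections in a full mesh: every pair is connected once."""
    return participants * (participants - 1) // 2


def mesh_uplink_copies(participants):
    """Copies of its own audio stream that each participant must send out."""
    return participants - 1


for n in (5, 10, 100):
    print(n, mesh_connections(n), mesh_uplink_copies(n))
# 5 participants   -> 10 connections, 4 outgoing copies each
# 10 participants  -> 45 connections, 9 outgoing copies each
# 100 participants -> 4950 connections, 99 outgoing copies each
```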

The immediate obvious solution is to get a central media server to mix all audio inputs, reducing all network traffic and processing from the users:

This media server is usually called an MCU (or a conferencing bridge). Users here “feel” as if they are in a session with only a single entity/user and the MCU is in charge of all the headaches on behalf of the users.

This mixer approach can be a wee bit expensive for the service provider and at times, not the most flexible of approaches. Which is why the SFU routed model was introduced, though mostly for video meetings. Here, we try to enjoy both worlds – we have the SFU route the media around, to try and keep bitrates and network use at reasonable levels while trying to reduce our hosting and media processing costs as service providers:

The SFU has become commonplace and the winning architecture model for video meetings almost everywhere. Voice only meetings though, have been somewhere in-between. Probably due to the existence and use of audio bridges long before WebRTC came to our lives.

This begs the question then, which architecture should we be using for our audio in group calls? Should we mix it in our media servers or just route it around like we do with video?

Before I go ahead to try and answer this question, there’s one more thing I’d like to go through, and that’s the set of media processing tools available to us today for audio in WebRTC.

Audio processing tools available for us in WebRTC

Encoding and decoding audio is the baseline thing. But other than that, there are quite a few media processing and network related algorithms that can assist applications in getting to the desired scale and quality of audio they need.

Before I list them, here are a few thoughts that came to mind when I collected them all:

This list is dynamic. It changes a bit every year or so, as new techniques are introduced
- An example for this is active speaker detection, appearing in a paper a decade ago
- First adopted by Jitsi
- Then added to mediasoup by the team at hopin
You can’t really use them all, all the time, for all use cases. You need to pick and choose the ones that are relevant to your use case, your users and the specific context you’re in
We now have a machine learning based tool as well. We will have more of these in a year or two for sure
It was a lot easier to compile this list now that we’ve finished recording and publishing all the lessons for the Higher-level WebRTC protocols course – we’ve covered most of these tools there in great detail

Audio level

There is an RTP header extension for audio level. This allows a WebRTC client to indicate the volume of the audio inside the encoded packet being sent.

The receiver can then use that information without decoding the packet at all.

What can one do with it?

Decide if you need to decode the packet at all or just discard it if there’s no or little voice activity or if the audio level is too low (no one’s going to hear what’s in there anyway).

You can replace it with DTX (see below) or not forward the packet in a Last-N architecture (see below).

Not mix its content with other audio channels (it doesn’t hold enough information to be useful to anyone).
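
Here is a small sketch of how a server might read that header extension. As defined in RFC 6464, the extension carries a single byte: the top bit is a voice activity flag and the lower 7 bits are the level in -dBov (0 is the loudest, 127 is silence). The threshold value and the function names are my own assumptions, not something taken from a specific media server.

```python
SILENCE_THRESHOLD_DBOV = -60  # hypothetical cut-off, tune it for your service


def parse_audio_level(ext_byte):
    """Split the one-byte extension into (voice activity flag, level in dBov)."""
    voice_activity = bool(ext_byte & 0x80)  # top bit: V flag
    level_dbov = -(ext_byte & 0x7F)         # lower 7 bits: 0 (loud) to 127 (silent)
    return voice_activity, level_dbov


def worth_processing(ext_byte, threshold_dbov=SILENCE_THRESHOLD_DBOV):
    """Decide, without decoding the audio, if the packet is loud enough to matter.

    The V flag is ignored here since not every sender sets it.
    """
    _, level_dbov = parse_audio_level(ext_byte)
    return level_dbov >= threshold_dbov


print(parse_audio_level(0x9E))  # (True, -30): someone is talking at -30 dBov
print(worth_processing(0x9E))   # True
print(worth_processing(0x7F))   # False, this packet is pure silence
```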

DTX

Discontinuous transmission

If there’s nothing really to send – the person isn’t speaking but the microphone is open – then send “silence” but with fewer packets over the network.

That’s what DTX is about, and it is great.

In larger meetings, most people will listen and not speak over one another. So most audio streams will just be “silence” or muted. If they aren’t muted, then sending DTX instead of actual audio reduces the traffic generated. This can be a boon to SFUs who end up processing fewer packets.

An SFU media server can also decide to “replace” actual audio it receives from users (because they have a low audio level in them or because of Last-N decisions it is making) with DTX data when routing media around.

PLC

Packet Loss Concealment

Packets are going to be lost, but there would be content that still needs to be played back to the user.

You can decide to play silence, a repeat of the last heard packet, lower its volume a bit, etc.
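
A naive version of this can be sketched in a few lines. This is a toy model, assuming decoded frames are plain lists of samples and a lost packet shows up as None. The attenuation factor and the number of repeats are arbitrary values I picked, and real PLC algorithms are a lot smarter than this.

```python
def conceal(last_good, losses, frame_size, attenuation=0.5, max_repeats=3):
    """Repeat the last good frame, quieter on every consecutive loss, then go silent."""
    if last_good is None or losses > max_repeats:
        return [0] * frame_size
    gain = attenuation ** losses
    return [int(sample * gain) for sample in last_good]


def play_out(frames, frame_size):
    """frames holds decoded frames in order; a lost packet is represented by None."""
    output, last_good, losses = [], None, 0
    for frame in frames:
        if frame is not None:
            last_good, losses = frame, 0
            output.append(frame)
        else:
            losses += 1
            output.append(conceal(last_good, losses, frame_size))
    return output


print(play_out([[100, -100], None, None, [8, 8]], frame_size=2))
# [[100, -100], [50, -50], [25, -25], [8, 8]]
```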

This can be done both on the server side (especially in the case of an MCU mixer) or on the client side – where such algorithms are implemented in the browser already. SFUs can ignore this one, mostly since they don’t decode and process the actual media anyway.

At times, these can be done using machine learning, like Google’s proprietary WaveNetEq, which tries to estimate and predict what was in the missing packet based on past packets received.

Packet loss concealment isn’t great at all times, but it is a necessary evil.

RTX & NACK

Theoretically, you could use retransmissions for lost packets.

WebRTC does that mostly for video packets, but this can also find a home for audio.

It is/was a rather neglected area because PLC and Opus inband FEC techniques worked nicely.

For the time being, you’re likely to skip this tool, but it is one I’d keep an eye on if I were highly interested in audio quality advancements.

FEC and RED

Forward Error Correction is about sending redundant data that can be used to reconstruct lost packets. Redundancy coding is what we usually do for audio, which is duplicating encoded frames.

Audio bandwidth requirements are low, so duplicating frames doesn’t end up taxing much of our network, especially in a video call.

This approach enables us at a “low cost” to gain higher resiliency to packet losses.

This can be employed by the client sender, or even from the server side, beefing up what it received – both as an SFU or an MCU.
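
The idea behind redundancy coding can be modeled with a toy example. Each packet carries its own frame plus a copy of the frames just before it, so a single lost packet can be rebuilt from the one that follows. This is only a model of the concept: frames are plain strings here, and the actual RED payload format with its headers is left out. The function names and the distance parameter are mine.

```python
def add_redundancy(frames, distance=1):
    """Each packet carries its own frame plus the `distance` frames before it."""
    packets = []
    for seq in range(len(frames)):
        first = max(0, seq - distance)
        packets.append({s: frames[s] for s in range(first, seq + 1)})
    return packets


def receive(packets, lost):
    """Rebuild the frame sequence from the packets that made it through."""
    recovered = {}
    for seq, packet in enumerate(packets):
        if seq not in lost:
            recovered.update(packet)
    return [recovered.get(seq) for seq in range(len(packets))]


frames = ["f0", "f1", "f2", "f3"]
packets = add_redundancy(frames)
print(receive(packets, lost={1}))     # ['f0', 'f1', 'f2', 'f3'], f1 came back inside packet 2
print(receive(packets, lost={1, 2}))  # ['f0', None, 'f2', 'f3'], a burst of 2 beats distance=1
```

A larger distance survives longer bursts of loss, but every extra copy adds to the bitrate.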

Check Philipp Hancke’s talk at Kranky Geek about advances in audio codecs

Then there’s the nuances and headaches of when to duplicate and how much, but that’s for another article.

Last-N

A known technicality in WebRTC’s implementation is that it only mixes the 3 loudest incoming audio channels before playing back the audio.

Why 3? Because 2 wasn’t enough and 4 seemed unnecessary is my guess. Also, the more sources you mix, the higher the noise levels are going to be, especially without good noise suppression (more on that below).

Well… Google just decided to remove that restriction. Based on the announcement, that’s because the audio decoding takes place in any case, so there’s not much of a performance optimization not to mix them all.

So now, you can decide if you want to mix everything (which you just couldn’t before) or if you want to mix or route only the few loudest (or most important) audio streams if that’s what you’re after. Limiting the number of streams reduces CPU and network load (depending on which architecture you are using).

Google Meet, for example, employs a Last-3 technique, only sending up to the 3 loudest audio streams to users in a meeting.

Oh, and if you want to dig deeper into the reasoning, there’s a nice Jitsi paper from 2016 explaining Last N.
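
At its core, a Last-N decision is a sort by audio level. The sketch below assumes the server keeps a recent level per stream in dBov (for example, taken from the audio level header extension). The floor value and the function name are my own assumptions, and a production implementation would also smooth levels over time so that speakers don’t flip in and out on every packet.

```python
def last_n(levels_dbov, n=3, floor_dbov=-60):
    """Pick the n loudest streams, ignoring anything below the silence floor.

    levels_dbov maps a stream id to its recent level in dBov (0 loud, -127 silent).
    """
    audible = [(level, stream) for stream, level in levels_dbov.items()
               if level >= floor_dbov]
    # Loudest first; the stream id breaks ties so the result is stable
    audible.sort(key=lambda item: (-item[0], item[1]))
    return [stream for _, stream in audible[:n]]


levels = {"alice": -20, "bob": -45, "carol": -10, "dave": -90, "erin": -30}
print(last_n(levels))       # ['carol', 'alice', 'erin']
print(last_n(levels, n=5))  # ['carol', 'alice', 'erin', 'bob'], dave is too quiet to count
```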

Noise suppression: RNNoise and other machine learning algorithms

Noise suppression is all the rage these days.

RNNoise is a veteran among the ML-based noise suppression algorithms that is quite popular these days.

Janus, for example, has added optional RNNoise logic to its AudioBridge to handle channel-based noise suppression in its MCU mixer for each incoming stream.

Google added this in their Google Meet cloud – their SFU implementation passes the audio to dedicated servers that handle noise suppression – likely by decoding the audio, suppressing the noise and encoding it back.

Many vendors today are introducing proprietary noise suppression to their solutions on the client side. These include Krisp, Dolby, Daily, Jitsi, Twilio and Agora – some via partnerships and others via self development.

Mixing keeps the headaches away from the browser

Why use an MCU for mixing your audio call? Because it takes all the implementation headaches and details away from the browser.

To understand some of what it entails on the server though, I’d refer you again to read Lorenzo’s post.

The great thing about this is that for the most part, adding more users means throwing more cloud hardware on the problem to solve it. At least up to a degree this can work well without thinking of scaling out, decentralization and other big words.

It is also how this was conducted for many years now.

Here are the tools I’d aim to use in an audio MCU:

Tool	Use?	Reasoning
Audio level	✔️	Decoding fewer streams will get higher performance density for the server. Use this with Last-N logic
DTX	✔️	Both when decoding and while encoding
PLC	✔️	On each incoming audio stream separately
RTX & NACK	❌	Too early to do this today
FEC and RED	✔️	Today, for an MCU, this would be rare to see as a supported feature. Consider it on outgoing audio streams, and enable it for incoming streams from devices
Last-N	✔️	Last-3 is a good default unless you have a specific user experience in mind (see below examples)
Noise suppression	✔️	On incoming channels, those that passed Last-N filtering, to clean them up before mixing the incoming streams together

One thing to note with an audio MCU is that it needs to generate quite a few different outgoing streams. For 10 participants with 4 speakers (at Last-4 configuration), it would be something like this:

We have 5 separate mixers at play here:

1 mixing all 4 active speakers
4 mixing only 3 out of the 4 each time – we don’t want to send the person speaking his own audio mixed in the stream
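
The count of 5 mixers falls out of a simple rule: every participant gets a mix of all active speakers except himself. Grouping participants by the set of sources they need gives the list of distinct mixes. The participant names and the function name below are made up for illustration.

```python
def mcu_mixes(participants, speakers):
    """Group participants by the set of speakers that goes into their mix."""
    mixes = {}
    for participant in participants:
        # Nobody should hear his own voice coming back from the mixer
        sources = frozenset(s for s in speakers if s != participant)
        mixes.setdefault(sources, []).append(participant)
    return mixes


participants = [f"p{i}" for i in range(1, 11)]  # 10 participants
speakers = participants[:4]                     # 4 of them are speaking

mixes = mcu_mixes(participants, speakers)
print(len(mixes))  # 5 distinct mixers
for sources, listeners in mixes.items():
    print(sorted(sources), "->", listeners)
```

The six silent participants all share one mix, which is why the cost of an MCU grows with the number of active speakers and not with the number of listeners.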

Routing gets you better flexibility

Why do we use an SFU for audio conferences? Because we use it for video already… or because we believe this is the modern way of doing things these days.

When it comes to routing audio, the thing to remember is that we have a delicate balance between the SFU and the participants, each playing a part here to get a better experience at the end of the day.

Here are the tools I’d use for an audio SFU:

Tool	Use?	Reasoning
Audio level	✔️	We must have this thing implemented and enabled, especially since we really really really want to be able to conduct Last-N logic and not send each user all audio channels from all other participants
DTX	✔️	We can use this to detect silence as well here (and remove from Last-N logic). On the sending logic, the SFU can decide to DTX the channels in Last-N that are silent or at a low volume to save a bit of extra bandwidth (a minor optimization)
PLC	❌	Not needed. We route the audio packets and let the participants fix any losses that take place
RTX & NACK	❌	Too early to do this today
FEC and RED	✔️	This can be added on the receiver and sender side in the SFU to improve audio quality. Adding logic to dynamically decide when and how much redundancy to add based on network conditions is also an advantage here
Last-N	✔️	Last-3 is a good default. Probably best to keep this at most at Last-5 since the decision here means more CPU use on the participants’ side
Noise suppression	❌	Not needed. This can be done on the participants’ side

In many ways, an audio SFU is simpler to implement than an audio MCU, but tweaking it just right to gain all the benefits and optimizations from the client implementation is the tricky part.

Where the rubber hits the road – let’s talk use cases

As with everything else I deal with, which approach to use depends on the circumstances. One of the main deciding criteria in this case is going to be the use case you are dealing with and the scenario you are solving this for.

Here are a few that came to mind.

Gateway to the old world

The first one is borderline “obvious”.

Before WebRTC, no one really did an audio conference using an SFU architecture. And if they did, it was unique, proprietary and special. The world revolved and still revolves around MCU and mixing audio bridges.

If your service needs to connect to legacy telephony services, existing deployments of VoIP services running over SIP (or god forbid H.323), connect to a large XMPP network – whatever it may be – that “other” world is going to be running as an MCU. Each device is likely capable of handling only one incoming audio stream.

So connecting a few users from your service (no matter if you are using an SFU or an MCU) means mixing their audio before sending it to the legacy service.

Video meetings with mixed audio

There are services that decide to use an SFU to route video streams and an MCU for the audio streams.

Sometimes, it is because the main service started as an audio service (so an audio bridge was/is at the heart of the service already) and video was bolted onto the platform. Sometimes it is because gatewaying to the old world is central to the service and its mindset.

Other times, it is due to an effort to reduce the number of audio streams being sent around, or to reduce the technical requirements of audio only participants.

Whatever the reason, this is something you might bump into.

The big downside of such an approach is the loss of lip synchronization. There is no practical way you can synchronize a single audio stream that represents mixed content of multiple video streams. In fact, no lip synchronization with any of the video streams takes place…

Usually, the excuse I’ll be hearing is that the latency difference isn’t noticeable and no one complained. Which begs the question – why do we bother with lip synchronization mechanisms at all then? (we do because it does matter and is noticeable – especially when the network is slightly bumpier than usual)

Experience the crowd

Think of a soccer game. 50,000 people in a stadium. Roaring when there’s a goal or a miss.

With Last-3 audio streams mixed, you wouldn’t be hearing anything interesting when this takes place “remotely” for the viewers.

The same applies to a virtual online concert.

Part of the experience you are trying to convey is the crowds and the noises and voices they generate.

If we’re all busy reducing noise levels, suppressing it, picking and choosing the 2-3 voices in the crowd to mix, then we just degrade the experience.

Crowds matter in some scenarios. And preserving that experience cannot be done by routing audio streams around. Especially not when we’re starting to talk about hundreds or more active participants.

This case necessitates the use of MCU audio bridging. And likely a distributed approach the moment the numbers of users climb higher.

Metaverse and spatial audio

The metaverse is coming. Or will be. Maybe. Now that Apple Vision Pro is upon us. But even before that, we’ve seen some metaverse use cases.

One thing that comes to mind here is the immersion part of it, which leads to spatial audio. The intent is to hear multiple sounds coming from different directions, based on where each speaker is.

This means several things:

For each user, the angle and distance (=volume level) of each other person speaking is going to be different
That Last-3 strategy doesn’t work anymore. If you can distinguish direction and volume levels individually, then more sources might need to be “mixed” here
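
To show why the result differs per user, here is a small Python sketch that computes a gain and an angle for one source as heard by one listener. It assumes a flat 2D space, a simple inverse-distance attenuation and made-up names (`spatial_params`, `ref_distance`). Real spatial audio engines use richer models, so treat this as a sketch of the idea only.

```python
import math
from typing import Tuple

Position = Tuple[float, float]


def spatial_params(listener: Position, facing_rad: float, source: Position,
                   ref_distance: float = 1.0) -> Tuple[float, float]:
    """Return (gain, azimuth) of a source relative to a listener.

    Assumptions: 2D positions, gain falls off as ref_distance / distance,
    azimuth is in radians, counterclockwise from where the listener faces.
    """
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    distance = math.hypot(dx, dy)
    # Clamp so that sources closer than ref_distance never exceed unity gain.
    gain = ref_distance / max(distance, ref_distance)
    # Angle to the source, relative to the facing direction, wrapped to [-pi, pi].
    relative = math.atan2(dy, dx) - facing_rad
    azimuth = math.atan2(math.sin(relative), math.cos(relative))
    return gain, azimuth


speaker = (0.0, 2.0)

# Two listeners facing the same way hear the same speaker differently.
gain_a, azimuth_a = spatial_params((0.0, 0.0), 0.0, speaker)
gain_b, azimuth_b = spatial_params((0.0, 6.0), 0.0, speaker)

print(gain_a, math.degrees(azimuth_a))  # half volume, 90 degrees to one side
print(gain_b, math.degrees(azimuth_b))  # quarter volume, 90 degrees to the other side
```

Since every listener needs their own gain and angle per source, a server-side mix would have to be computed per listener, while a routed approach pushes that math to each client.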

Do you do that on the client side by way of an SFU implementation, or would it be preferable to do this in an MCU implementation?

And what about trying to run concerts in the metaverse? How do you give the notion of the crowds on the audio side?

These are questions that definitely don’t have a single answer.

In all likelihood, in some metaverse cases, the SFU model will be the best architectural approach while in others an MCU would work better.

Recording it all

Not exactly a use case in its own right, but rather a feature that is needed a lot.

When we need to record a session, how do we go about doing that?

Today, at least 99% of the time, that would be by mixing all audio and video sources and creating a single stream that can be played as a “regular” mp4 file (or similar).

Recording as a single stream means using an MCU-like solution. Sometimes by implementing it in a headless browser (as if this is a silent participant in the session) and other times by way of dedicated media servers. The result is similar – mixing the multiple incoming streams into a single outgoing one that goes directly to storage.

The downside of this, besides needing to spend energy on mixing something that people might never see (which is itself a factor when deciding which architecture to pick), is that you get to view and hear only a single viewpoint of a single user – since the mixed recording is already “opinionated” based on what viewpoint it took.

We can theoretically “record” the streams separately and then play them back separately, but that’s not that simple to achieve, and for the most part, it isn’t commonplace.

A kind of a compromise we see today with professional recording and podcast services is to record both mixed and separate audio streams. This allows post production to use either one based on the mixing needs, though that work is done manually.
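
One way to picture that compromise: keep the separate tracks and derive the mixed one from them. The Python sketch below assumes tracks are already decoded into 16-bit PCM samples at the same sample rate and aligned in time, which is a big simplification. The `mix_tracks` name is hypothetical and the hard clipping is the most naive option available.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_tracks(tracks: Sequence[Sequence[int]]) -> List[int]:
    """Sum aligned 16-bit PCM tracks sample by sample, clipping on overflow."""
    length = max((len(track) for track in tracks), default=0)
    mixed = []
    for i in range(length):
        # Shorter tracks simply stop contributing once they run out of samples.
        total = sum(track[i] for track in tracks if i < len(track))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


# Separate tracks stay available for post production...
host = [1000, 2000, 30000]
guest = [500, -2500, 10000, 7]

# ...and the mixed track is derived from them for simple playback.
mixed = mix_tracks([host, guest])
print(mixed)  # [1500, -500, 32767, 7]
```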

Which will it be? MCU or SFU for your next audio meeting?

We start with this, and we will end with this.

It depends.

You need to understand your requirements and from there see if the solution you need will be based on an MCU, an SFU or both. And if you need help with figuring that out, that’s what my WebRTC courses are for – check them out.

The post WebRTC conferences – to mix or to route audio appeared first on BlogGeek.me.

RTC@Scale 2024 – an event summary

Table of contents

Why this issue?