A WebRTC media server is an optional component in a WebRTC application. That said, in most common use cases, you will need one.
There are different types of WebRTC servers. One of them is the WebRTC media server. When will you need one, and what exactly does it do? Read on.
Oh - and if you’re looking to dig deeper into WebRTC media servers, make sure to check the end of this article for an announcement of our latest WebRTC course 😎
There are quite a few moving parts in a WebRTC application. There’s the client device side, where you’ll have the web browsers with WebRTC support and maybe other types of clients like mobile applications that have WebRTC implementations in them.
And then there are the server-side components - and there are quite a few of them. The illustration above shows the 4 types of WebRTC servers you are likely to need:
The illustration below shows how all of these WebRTC servers connect to the client devices and what types of data flows through them:
What is interesting is that the only WebRTC infrastructure component that can really be seen as optional is the WebRTC media server. That said, in most real-world use cases you will need media servers.
At its conception, WebRTC was meant to be “between” browsers. Only recently did the good people at the W3C see fit to change that definition to something that can also live in servers. We’ve known that to be the case all along 😎
What does a WebRTC media server do exactly? It processes and routes media packets through the backend infrastructure - either in the cloud or on premises.
Let's say you are building a group calling service and you want 10 people to be able to join in and talk to each other. For simplicity’s sake, assume we want to get 1Mbps of encoded video from each participant and show the other 9 participants on the screen of each of the users:
How would we go about building such an application without a WebRTC media server?
To do that, we will need to develop a mesh architecture:
We’d have the clients send out 1Mbps of their own media to all the other participants who wish to display them on their screen. This amounts to 9*1Mbps = 9Mbps of upstream data that each participant will be sending out. Each client receives streams from all 9 other participants, getting us to 9Mbps of downstream data.
This might not seem like much, but it is. Especially when it is sent over UDP in real time, when we need to encode and encrypt each stream separately for each user, and when we need to run bandwidth estimation for each of these connections. Even if we reduce the requirement from 1Mbps to a lower bitrate, this is still a hard problem to deal with and solve.
It becomes devilishly hard (impossible?) when we crank up the number to say 50 or 100 participants. Not to mention the numbers we see today of 1,000 or more participants in sessions (either active participants or passive viewers).
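To get a feel for how quickly a mesh breaks down, here’s a small back-of-the-envelope sketch. The numbers are the illustrative ones used above (1Mbps per stream), not measurements of any real service:

```typescript
// Rough per-client bandwidth math for a full-mesh call.
// Illustrative numbers only - real bitrates vary with codec, resolution and network.
function meshBandwidthPerClientMbps(participants: number, bitratePerStreamMbps: number) {
  const peers = participants - 1;
  return {
    upstream: peers * bitratePerStreamMbps,   // one encoded copy sent to every remote peer
    downstream: peers * bitratePerStreamMbps, // one incoming stream from every remote peer
  };
}

console.log(meshBandwidthPerClientMbps(10, 1)); // { upstream: 9, downstream: 9 }
console.log(meshBandwidthPerClientMbps(50, 1)); // { upstream: 49, downstream: 49 } - not realistic for a client
```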
Enter the WebRTC media server
This is where a WebRTC media server comes in. We will add it here to be able to do the following tasks for us:
Here’s what’s really going on and what we use these media servers for:
👉 WebRTC media servers bridge the gaps in the architecture that we can’t solve with clients alone
Before we continue and dive into the different types of media servers, there’s something that must be said and discussed:
WebRTC media server != TURN server
I’ve seen people try to use the TURN server to do what media servers do. Usually that would be things like recording the data stream.
This doesn’t work.
TURN servers route media through firewalls and NAT devices. They aren’t privy to the data being sent through them. WebRTC privacy is maintained by having data encrypted end to end when passing via TURN servers - the TURN servers don’t know the encryption key so can’t do anything with the media.
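To make the distinction concrete, here is roughly what a TURN server looks like from the client’s point of view - just another ICE server entry in the RTCPeerConnection configuration. The URLs and credentials below are placeholders, not a real deployment:

```typescript
// A TURN server is only configured on the client as an ICE server.
// It relays the already-encrypted SRTP packets as opaque payloads.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.example.com:3478' },
    {
      urls: 'turn:turn.example.com:3478',
      username: 'user',       // placeholder credentials
      credential: 'secret',
    },
  ],
});
// The TURN server never terminates the peer connection and has no encryption keys,
// so it cannot record, mix or otherwise process the media passing through it.
```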
WebRTC media servers are implementations of WebRTC clients in a server component. From an architectural point of view, the “session” terminates in the WebRTC media server:
A WebRTC media server is privy to all data passing through it, and acts as a WebRTC client in front of each of the WebRTC devices it works with. It is also why it isn’t so well defined in WebRTC but at the same time so versatile.
This versatility of WebRTC media servers means that there are different types of such servers. Each one works under different architectural assumptions and concepts. Let’s review them quickly here.
The most common and popular WebRTC media server is the SFU.
An SFU routes media between the devices, doing as little as possible when it comes to the media processing part itself.
The concept of an SFU is that it offloads much of the decision making of layout and display to the clients themselves, giving them more flexibility than any other alternative. At the same time, it takes care of bandwidth management and routing logic to best fit the capabilities of the devices it works with.
To do all that, it uses technologies such as bandwidth estimation, simulcast, SVC and many others (things like DTX, cascading and RED).
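As one sketch of how clients cooperate with an SFU, this is approximately how a browser offers simulcast layers. The rid names and bitrates are illustrative; the SFU decides which layer to forward to each receiver based on its own bandwidth estimation:

```typescript
// Offering three simulcast layers of the same camera track.
// Run inside an async function (or an ES module with top-level await).
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection();

pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // low layer
    { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // medium layer
    { rid: 'f', maxBitrate: 1_500_000 },                          // full resolution
  ],
});
// The client only makes the layers available; the SFU picks which one
// each receiver gets, and can switch layers as network conditions change.
```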
At the beginning, SFUs were introduced and used for group calls. Later on, they started to appear as live streaming and broadcast components.
Probably the oldest media server solution is the MCU.
The MCU was introduced years before WebRTC, when networks were limited. Telephony systems had/have voice conferencing bridges built around the concept of MCUs. Video conferencing systems required the use of media servers simply because video compression required specialized hardware, and later on took up too much CPU for client devices to handle.
👉 In telephony and audio, you’ll see this referred to as mixers or audio bridges and not MCUs. That said, they still are one and the same technically.
What an MCU does is receive and mix the media streams coming in from the various participants, sending a single stream of media towards each client. For the client, an MCU looks like a call between 2 participants - it is the only entity the client really interacts with directly. This means there’s a single audio and a single video stream coming into and going out of the client - regardless of the number of participants and how/when they join and leave the session.
MCUs were less used in WebRTC from the get go. Part of it was simple economics - MCUs are expensive to operate, requiring a lot of CPU power (encoding and decoding media is expensive). It is cheaper to offer the same or similar services using SFUs. There are vendors who still rely on MCUs in WebRTC for group calling, though in most cases, you will find MCUs providing the recording mechanism only - where what they end up doing is taking all inputs and mixing them into a single stream to place in storage.
Another type of media server that is used in WebRTC is a gateway.
In some cases, content - rendered, live or otherwise - needs to be shared in a WebRTC session, or a WebRTC session needs to be shared over another type of protocol/medium. To do so, a gateway can be used to bridge between the protocols.
The two main cases where these happen are probably:
One more example is a kind of a hybrid media server. One that might do routing and processing together. A group calling service that also records the call into a single stream for example. Such solutions are becoming more and more popular and are usually deployed as multiple media servers of different types (unlike the illustration above), each catering for a different part of the service. Splitting them up makes it easier to develop, maintain and scale them based on the workload needed by each media server type.
This might not be a WebRTC media server per se, but for me this falls within the same category.
Sometimes, what we want is to render content in the cloud and share it live with a user on a browser. This is true for things like cloud gaming or cloud application delivery (Photoshop in the cloud for hourly consumption). In such a case, this is more like a peer-to-peer WebRTC session taking place between a user on a browser and a cloud server that renders the content.
I see it as a media server because many of the aspects of development and scaling of the cloud rendering components are more akin to how you’d think about WebRTC media servers than they are about browser or native clients.
Let's look at an example service - Google Meet. Why Google Meet? Well, because it is so versatile today, and because if you want to trace capabilities in WebRTC, the best approach is to keep close tabs on what Google Meet is doing.
What WebRTC media servers does Google Meet use? Based on the functionality it offers, we can glean the types that make up this service:
A classic meeting service in WebRTC may well require more than a single type of a WebRTC media server, likely deployed in hybrid mode across different hardware configurations.
As we’ve seen earlier, the answer to this is simple - when doing things with WebRTC clients only isn’t possible and we need something to bridge this gap.
We may lack:
What I usually do when analyzing the needs of a WebRTC application is to find these gaps and determine if a WebRTC media server is needed (it usually is). I do so by thinking of the solution as a P2P one, without media servers. And then based on the requirements and the gaps found, I’ll be adding certain WebRTC media server elements into the infrastructure needed for my WebRTC application.
We’ve seen a growing interest in recent years in privacy. The internet has shifted to encryption-first connections, and WebRTC offers encrypted-only media. This shift towards privacy started as privacy from malicious actors on the public internet, but has since expanded towards privacy from the service provider itself.
Running a group meetings service through a service provider that cannot access the meeting’s content itself is becoming more commonplace.
This capability is known as E2EE - End to End Encryption.
When introducing WebRTC media servers into the mix, it means that while they are still a part of the session and are terminating WebRTC peer connections (=terminating encrypted SRTP streams) on their own, they shouldn’t have access to the media itself.
This can be achieved only in the SFU type of WebRTC media servers by the use of insertable streams. With it, the application logic can exchange private encryption keys between the users and have a second encryption layer that passes transparently through the SFU - enabling it to do its job of packet routing without the ability to understand the media content itself.
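Here is a minimal sender-side sketch of that idea, using Chrome’s non-standard encoded insertable streams API (the standardized path uses RTCRtpScriptTransform in a worker instead). The encryptFrame function is a placeholder for whatever key exchange and cipher the application implements - the point is that its keys are never shared with the SFU:

```typescript
// App-defined cipher, keyed with secrets exchanged between the users only (placeholder).
declare function encryptFrame(data: ArrayBuffer): ArrayBuffer;

const media = await navigator.mediaDevices.getUserMedia({ video: true });
// encodedInsertableStreams is a Chrome-specific configuration flag.
const pc = new RTCPeerConnection({ encodedInsertableStreams: true } as any);
const sender = pc.addTrack(media.getVideoTracks()[0], media);

// createEncodedStreams() exposes encoded frames before they are packetized and sent.
const { readable, writable } = (sender as any).createEncodedStreams();
readable
  .pipeThrough(new TransformStream({
    transform(frame: any, controller: TransformStreamDefaultController) {
      frame.data = encryptFrame(frame.data); // second, end-to-end encryption layer
      controller.enqueue(frame);             // the SFU still routes these packets, but cannot decode them
    },
  }))
  .pipeTo(writable);
```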
Another important aspect to understand about WebRTC media servers is that most of those using media servers in WebRTC do so using open source frameworks for media servers.
I’ve written at length about WebRTC open source projects - you’ll find details there about the state of the market and the available open source WebRTC media servers.
What is important to note is that projects that don’t use managed services for their WebRTC media servers usually pick an open source WebRTC media server to work with rather than develop their own from scratch. This isn’t always the case, but it is quite common.
WebRTC Video API and CPaaS is another area I cover quite extensively.
Vendors who decide to use a CPaaS vendor for their WebRTC application will mainly do it in one of two situations:
Both cases require media servers…
This leads to the following important conclusion: there’s no such thing as a CPaaS vendor doing WebRTC that isn’t offering a managed WebRTC media server as part of its solution - and if there is, then I’ll question its usefulness for most potential customers.
Last year, I released the Low-level WebRTC protocols course along with Philipp Hancke.
The Low-level WebRTC protocols course has been a huge success, which is why we’re starting to work on our next course in this series: Higher level WebRTC protocols
Before we go about understanding WebRTC media servers, it is important to understand the inner workings of the network protocols that WebRTC employs. Our low-level protocols course covers the first part of the underlying protocols. This second course looks at the higher level protocols - the parts that deal more with network realities, such as the challenges brought to us by packet losses and other network characteristics.
Things we cover here include retransmissions, forward error correction, codec packetization and a myriad of media processing algorithms.