As many as you like. You can cram anywhere from one to a million users into a WebRTC call.
You’ve been asked to create a group video call, and obviously, the technology selected for the project was WebRTC. It is almost the only alternative out there and certainly the one with the best price-performance ratio. Here’s the big question: How many users can we fit into that single group WebRTC call?
Need to understand your WebRTC group calling application backend? Take this free video mini-course on the untold story of WebRTC’s server side.
At least once a week I get approached by someone saying WebRTC is peer-to-peer, asking whether it can be used for larger groups, since the technology might not fit such use cases. Well… WebRTC fits well into larger group calls.
You need to think of WebRTC as a set of technological building blocks that you mix and match as you see fit, and the browser implementation of WebRTC is just one building block.
The most common building block today in WebRTC for supporting group video calls is the SFU (Selective Forwarding Unit): a media router that receives media streams from all participants in a session and decides whom to route that media to.
What I want to do in this article is review a few of the aspects and decisions you’ll need to make when trying to create applications that support large group video sessions using WebRTC.
The first step in our journey today will be to analyze the complexity of our use case.
With WebRTC, and real-time video communications in general, it all boils down to speeds and feeds.
Let’s start with an example.
Assume you want to run a group calling service for the enterprise. It runs globally. People will join work sessions together. You plan on limiting group sessions to 4 people. I know you want more, but I am trying to keep things simple here for us.
The illustration above shows how a 4-participant conference would look.
If the layout you want for this conference is the magic squares one, we’re in the domain of each participant sending out one video stream and receiving the three streams of the other participants.
You want high quality video. That’s what everyone wants. So you plan on having all participants send out 720p video resolution, aiming for WQHD monitors (that’s 2560x1440). Say that eats up 1.5Mbps (I am being stingy here - it can take more).
Summing it up in a simple table, we get:
| Resolution | 720p |
| --- | --- |
| Bitrate | 1.5Mbps |
| User outgoing | 1.5Mbps (1 stream) |
| User incoming | 4.5Mbps (3 streams) |
| SFU outgoing | 18Mbps (12 streams) |
| SFU incoming | 6Mbps (4 streams) |
If you’re not interested in resolution that much, you can aim for VGA resolution and even limit bitrates to 600Kbps:
| Resolution | VGA |
| --- | --- |
| Bitrate | 600Kbps |
| User outgoing | 0.6Mbps (1 stream) |
| User incoming | 1.8Mbps (3 streams) |
| SFU outgoing | 7.2Mbps (12 streams) |
| SFU incoming | 2.4Mbps (4 streams) |
The thing you may want to avoid when going VGA is the need to upscale the resolution on the display - it can look ugly, especially on the larger 4K displays.
With crude back-of-the-napkin calculations, you can potentially cram 3 VGA conferences in for the “price” of one 720p conference.
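To make that arithmetic reproducible, here’s a minimal sketch of the speeds-and-feeds math. The function name and structure are mine; it assumes every participant publishes a single stream at the same bitrate and subscribes to everyone else:

```python
# Back-of-the-napkin speeds-and-feeds for the "magic squares" layout over an SFU.
# Assumption: every participant publishes one stream at the same bitrate and
# subscribes to the streams of all the other participants.

def magic_squares_feeds(participants: int, bitrate_mbps: float) -> dict:
    others = participants - 1  # streams each user receives
    return {
        "user_outgoing_mbps": bitrate_mbps,                         # 1 stream up
        "user_incoming_mbps": bitrate_mbps * others,                # N-1 streams down
        "sfu_incoming_mbps": bitrate_mbps * participants,           # N streams in
        "sfu_outgoing_mbps": bitrate_mbps * participants * others,  # N*(N-1) streams out
    }

print(magic_squares_feeds(4, 1.5))  # the 720p table: SFU 18Mbps out, 6Mbps in
print(magic_squares_feeds(4, 0.6))  # the VGA table: SFU 7.2Mbps out, 2.4Mbps in
```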
But what if our layout is a bit different? A main speaker and smaller viewports for the other participants:
I call it Hangouts style, because Hangouts is well known for this layout and was one of the first to use it exclusively, without offering a larger set of additional layouts.
This time, we will be using simulcast: everyone sends out high quality video, and the SFU decides which incoming stream belongs to the dominant speaker, forwarding that one at the higher resolution while forwarding the rest at a lower resolution.
You will be aiming for 720p, because after a few experiments, you decided that lower resolutions when scaled to the larger displays don’t look that good. You end up with this:
| Resolution | 720p highest (in Simulcast) |
| --- | --- |
| Bitrate | 150Kbps - 1.5Mbps |
| User outgoing | 2.2Mbps (1 stream) |
| User incoming | 1.5Mbps (1 stream) + 0.3Mbps (2 streams) |
| SFU outgoing | 8.4Mbps (12 streams) |
| SFU incoming | 8.8Mbps (4 streams) |
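Here’s the same napkin math with simulcast, under my own assumed layer bitrates: a 1.5Mbps high layer and 0.3Mbps low layers (reading the two low streams in the table as 0.3Mbps each, which is what makes the SFU outgoing line add up to 8.4Mbps), with each publisher sending roughly 2.2Mbps across all of its simulcast layers:

```python
# Speeds-and-feeds for the "Hangouts style" layout with simulcast.
# Assumed numbers (mine): high layer 1.5Mbps, low layers 0.3Mbps each, and
# each publisher sends ~2.2Mbps in total across its simulcast layers.

HIGH_MBPS, LOW_MBPS, PUBLISH_MBPS = 1.5, 0.3, 2.2

def hangouts_style_feeds(participants: int) -> dict:
    small_tiles = participants - 2                 # everyone except me and the speaker
    user_in = HIGH_MBPS + LOW_MBPS * small_tiles   # 1 big tile + the small ones
    return {
        "user_outgoing_mbps": PUBLISH_MBPS,
        "user_incoming_mbps": user_in,                     # 2.1Mbps for 4 users
        "sfu_incoming_mbps": PUBLISH_MBPS * participants,  # 8.8Mbps for 4 users
        "sfu_outgoing_mbps": user_in * participants,       # 8.4Mbps for 4 users
    }

print(hangouts_style_feeds(4))
```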
This is what we have learned:
Different use cases of group video with the same number of users translate into different workloads on the media server.
And if it wasn’t mentioned specifically, simulcast works great and improves the effectiveness and quality of group calls (simulcast is what we used in our Hangouts Style meeting).
Across the 3 scenarios we depicted here for a 4-way video call, we got this variety of activity in the SFU:
| | Magic Squares: 720p | Magic Squares: VGA | Hangouts Style |
| --- | --- | --- | --- |
| SFU outgoing | 18Mbps | 7.2Mbps | 8.4Mbps |
| SFU incoming | 6Mbps | 2.4Mbps | 8.8Mbps |
Here’s your homework - now assume we want to do a 2-way session that gets broadcast to 100 people over WebRTC. Calculate the number of streams and the bandwidth you’ll need on the server side.
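If you want to sanity-check your answer, here’s one way to set the calculation up. The function and its parameter names are my own framing, and it assumes the 100 people are receive-only:

```python
# Homework helper: S active participants in a session that is also broadcast
# to V view-only participants over an SFU. A sketch, not a spec.

def broadcast_feeds(speakers: int, viewers: int, bitrate_mbps: float) -> dict:
    receivers_per_stream = (speakers - 1) + viewers  # the other speakers + all viewers
    streams_out = speakers * receivers_per_stream
    return {
        "streams_into_sfu": speakers,
        "streams_out_of_sfu": streams_out,
        "sfu_incoming_mbps": speakers * bitrate_mbps,
        "sfu_outgoing_mbps": streams_out * bitrate_mbps,
    }

print(broadcast_feeds(2, 100, 1.5))  # the 2-way session broadcast to 100 viewers
```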
So how many users can you fit in a WebRTC call? That’s a tough one.
If you use an MCU, you can get as many users on a call as your MCU can handle.
If you are using an SFU, it depends on 3 different parameters:
We’re going to review them in a sec.
Anything above 8-10 users in a single call becomes complicated. Here’s an example from a publicly available service that I want to share here.
The scenario:
Here, the media server decided how to limit and gauge the traffic.
And here’s another service with an online demo running the exact same scenario:
This time, the incoming bitrate per browser was only 2.7Mbps on average - almost a fourth of the other service’s.
Same scenario. Different implementations.
What about some popular services that do video conferencing in an SFU routed model? What kind of size restrictions do they put on their applications?
Here’s what I found browsing around:
Does this mean you can’t get above 50?
My take on it is that there’s an increasing degree of difficulty as the meeting size increases:
When you look at CPaaS platforms, those supporting video and group calling often have limits to their meeting size. In most cases, they give out an arbitrary number they have tested against or are comfortable with. As we’ve seen, that number is suitable for a very specific scenario, which might not be the one you are thinking about.
In CPaaS, these numbers vary from 10 participants to hundreds of participants in a single session. Usually, if you can go higher, the additional participants will be view-only.
A few things to keep in mind:
Sizing media servers is something I have been doing lately at testRTC. We’ve played a bit with Kurento in the past and are planning to tinker with other media servers. I get this question on every other project I am involved with:
How many sessions / users / streams can we cram into a single media server?
Given what we’ve seen above about speeds and feeds, it is safe to say that it really really really depends on what it is that you are doing.
If what you are looking for is group calling where everyone’s active, you should aim for 100-500 participants in total on a single server. The numbers will vary based on the machine you pick for the media server and the bitrates you are planning per stream on average.
If what you are looking for is a broadcast of a single person to a larger audience, all done over WebRTC to maintain low latency, 200-1,000 is probably a better estimate. Maybe even more.
Another thing you will need to address is which machines you are going to host your media server on. Will that be the biggest, baddest machines available, or will you be comfortable with smaller ones?
Going for big machines means you’ll be able to cram larger audiences and sessions into a single machine, so the complexity of your service will be lower. On the other hand, if something crashes (media servers do crash), more users will be impacted. And when you need to upgrade your media server (and you will), that process can cost you more or become somewhat more complicated as well.
The bigger the machine, the more cores it will have. Which results in media servers that need to run in multithreaded mode. Which means they are more complicated to build, debug and fix. More moving parts.
Going for small machines means you’ll hit scale problems earlier and they will require algorithms and heuristics that are more elaborate. You’ll have more edge cases in the way you load balance your service.
How do you decide that your media server has reached full capacity? How do you decide whether the next session should be placed on the media server you’re currently filling or on a new machine? And if you use the current one, and new participants want to join a session actively running on it, will there be room enough for them?
These aren’t easy questions to answer.
I’ve seen 3 different metrics used to decide when to scale out from a single media server to others. Here are the general alternatives:
Based on CPU - when the CPU hits a certain percentage, it means the machine is “full”. It works best when you use smaller machines, as CPU would be one of the first resources you’ll deplete.
Based on Bandwidth - SFUs eat up lots of networking resources. If you are using bigger machines, you probably won’t hit the CPU limit, but you will end up eating too much bandwidth. So you’ll end up determining the available capacity by way of bandwidth monitoring.
Based on Streams - the challenge sometimes with CPU and Bandwidth is that the number of sessions and streams that can be supported may vary, depending on dynamic conditions. Your scaling strategy might not be able to cope with that and you may want more control over the calculations. Which will lead to you sizing the machine using either CPU or bandwidth, but placing rules in place that are based on the number of streams the server can support.
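As a concrete illustration of that last, stream-based strategy, here’s a minimal placement sketch. The stream budget and the Server shape are assumptions, to be replaced with numbers you’ve measured for your own media server:

```python
# Stream-count-based placement: size the machine offline (by CPU or bandwidth),
# then allocate sessions against a fixed stream budget per server.

from dataclasses import dataclass

MAX_STREAMS = 1000  # assumed per-server budget - measure this for your own SFU

@dataclass
class Server:
    name: str
    streams: int = 0

def place_session(servers: list[Server], new_streams: int) -> Server:
    """Put the session on the first server with headroom; otherwise scale out."""
    for server in servers:
        if server.streams + new_streams <= MAX_STREAMS:
            server.streams += new_streams
            return server
    fresh = Server(name=f"sfu-{len(servers) + 1}", streams=new_streams)
    servers.append(fresh)
    return fresh

servers: list[Server] = []
# A 4-way "magic squares" call is 4 incoming + 12 outgoing = 16 streams:
print(place_session(servers, 16).name)  # -> sfu-1
```

Whatever budget you pick, remember to leave headroom for participants joining sessions already running on a server - that’s the question raised above.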
The challenge here is that whatever scenario you pick, sizing is something you’ll need to do on your own. I see many who come to use testRTC when they need to address this problem.
Cascading is the process of connecting one media server to another. The diagram below shows what I mean:
We have a 4-way group video call that is spread across 3 different media servers. The servers route the media between them as needed to get it connected. Why would you want to do this?
When you run a global service and have SFUs as part of it, the question that is raised immediately is for a new session, which SFU will you allocate for it? In which of the data centers? Since we want to get our media servers as close as possible to the users, we either have pre-knowledge about the session and know where to allocate it, or decide by some reasonable means, like geolocation - we pick the data center closest to the user that created the meeting.
Assume 4 people are on a call. 3 of them join from New York, while the 4th person is from France. What happens if the French guy joins first?
The server will be hosted in France. 3 out of 4 people will be located far from the media server. Not the best approach...
One solution is to conduct the meeting by spreading it across servers closest to each of the participants:
We use more server resources to get this session served, but we have a lot more control over the media routes, so we can optimize them better. This improves the media quality of the session.
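Here’s a naive sketch of that allocation decision, assuming we simply pick the data center with the smallest great-circle distance to each participant; the data center list is hypothetical:

```python
# Pick the closest data center per participant; participants that land on
# different SFUs get cascaded between them.

import math

DATA_CENTERS = {  # assumed locations (lat, lon)
    "us-east": (40.7, -74.0),
    "eu-west": (48.9, 2.4),
}

def closest_dc(lat: float, lon: float) -> str:
    """Smallest great-circle angle wins (spherical law of cosines)."""
    def angle(dc: str) -> float:
        dc_lat, dc_lon = DATA_CENTERS[dc]
        cos_c = (math.sin(math.radians(lat)) * math.sin(math.radians(dc_lat))
                 + math.cos(math.radians(lat)) * math.cos(math.radians(dc_lat))
                 * math.cos(math.radians(lon - dc_lon)))
        return math.acos(max(-1.0, min(1.0, cos_c)))  # clamp float noise
    return min(DATA_CENTERS, key=angle)

# 3 participants in New York, 1 in Paris -> two SFUs, cascaded together:
participants = [(40.7, -74.0)] * 3 + [(48.9, 2.4)]
print({f"user{i}": closest_dc(*p) for i, p in enumerate(participants, 1)})
```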
Assume that we can connect up to 100 participants in a single media server. Furthermore, every meeting can hold up to 10 participants. Ideally, we won’t want to assign more than 10 meetings per media server.
But what if I told you the average meeting size is 2 participants? It can get us to this type of an allocation:
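Putting numbers on that allocation, using the figures from the example above:

```python
# Reserving room for 10-participant meetings on a 100-participant server,
# while the average meeting only holds 2 participants:
capacity = 100          # participants a single media server can handle
max_meeting_size = 10   # what we must reserve per meeting
avg_meeting_size = 2    # what meetings actually look like

meetings_per_server = capacity // max_meeting_size    # 10 meetings
used = meetings_per_server * avg_meeting_size         # 20 participants
wasted_pct = 100 * (capacity - used) // capacity
print(f"{used}/{capacity} slots used - {wasted_pct}% of the server is wasted")
```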
This causes a lot of wasted server resources. How can we solve that?
One of the solutions is cascading: you can reserve some of a media server’s resources for cascading existing sessions to other media servers.
Assuming you want to create larger meetings than a single media server can handle, your only choice is to cascade.
If your media server can hold 100 participants and you want meetings at a size of 5,000 participants, then you’ll need to be able to cascade to support them. This isn’t easy, which explains why there aren’t many such solutions available, but it definitely is possible.
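Some napkin math for that example, assuming (my assumption) each server also sets aside a few of its slots for the cascade links between servers:

```python
# How many 100-participant servers does a 5,000-participant meeting need if
# each server reserves some capacity for cascade links to its peers?

import math

def servers_needed(participants: int, capacity: int, cascade_reserve: int) -> int:
    usable = capacity - cascade_reserve  # slots left for actual participants
    return math.ceil(participants / usable)

print(servers_needed(5_000, 100, 10))  # -> 56 servers, not just 50
```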
Mind you, in such large meetings, the media flow won’t be bidirectional. You’ll have fewer participants sending media and a lot more only receiving media. For the pure broadcasting scenario, I’ve written a guest post on the scaling challenges on Red5 Pro’s blog.
We’ve touched on a lot of areas here. These are the things you’ll need to work through when deciding how many users can fit into your WebRTC calls.
What’s the size of your WebRTC meetings?