Need WebRTC recording in your application? Check out the various requirements and architectural decisions you’ll have to make when implementing it.
A critical part of many WebRTC applications is the ability to record the session. This might be a requirement for an optional feature or it might be the main focus of your application.
Whatever the reasons, WebRTC recording comes in different shapes and sizes, with quite a few alternatives on how to get it done these days.
What I want to do this time is to review a few of the aspects related to WebRTC recording, making sure that when it is your time to implement, you’ll be able to make better choices in your own detailed requirements and design.
One of the fundamental things you will need to consider is where the WebRTC recording will take place - on the device or on the server. You can either record the media on the device and then (optionally?) upload it to a server. Or you can upload the media to a server (live, in a WebRTC session) and conduct the recording operation itself on the server.
Recording locally uses the MediaRecorder API, while uploading uses HTTPS or WebSocket. Recording on the server uses a WebRTC peer connection and then whatever media server you use to containerize the media itself on the server.
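To make the record-and-upload flavor more concrete, here’s a minimal sketch of client-side recording with MediaRecorder and a plain HTTPS upload at the end. The /api/recordings endpoint and the one-second chunk interval are assumptions for illustration - adapt them to your own backend:

```typescript
// Minimal sketch: record the local camera + microphone with MediaRecorder,
// then upload the resulting file over HTTPS when the recording stops.
async function recordAndUpload(durationMs: number): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // Browsers differ in the formats they support - probe before committing to one.
  const mimeType = MediaRecorder.isTypeSupported('video/webm;codecs=vp9,opus')
    ? 'video/webm;codecs=vp9,opus'
    : 'video/webm';

  const recorder = new MediaRecorder(stream, { mimeType });
  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };

  const stopped = new Promise<void>((resolve) => (recorder.onstop = () => resolve()));
  recorder.start(1000); // collect a chunk every second
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;

  // A single POST once done; chunked or resumable uploads are a common refinement.
  const blob = new Blob(chunks, { type: mimeType });
  await fetch('/api/recordings', { method: 'POST', body: blob }); // hypothetical endpoint
}
```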
Here’s how I’d compare these two alternatives to one another:
| | Record-and-upload | Upload-and-record |
|---|---|---|
| Technology | MediaRecorder API + HTTPS | WebRTC peer connection |
| Client-side | Some complexity in implementation, and the fact that browsers differ in the formats they support | No changes to client side |
| Server-side | Simple file server | Complexity in recording function |
| Main advantages | | |
🔴 When would I record-and-upload?
I would go for client-side recording using MediaRecorder in the following scenarios:
🔴 When would I upload-and-record?
Here’s when I’d use classic WebRTC architectures of upload-and-record:
🔴 How about both?
There’s also the option of doing both at the same time - record-and-upload in parallel with upload-and-record. Confused?
Here’s where you will see this taking place:
If you are recording more than a single media source, let's say a group of people speaking to each other, then you will have this dilemma to solve:
Will you be using WebRTC recording to get a single mixed stream out of the interaction or multiple streams - one per source or participant?
Assuming you are using an SFU as your media server AND going with the upload-and-record method, what you have in your hands are separate media streams, one per source. And if you plan on recording as a single stream, what you need is a kind of an MCU...
For each source you could couple their audio and video into a single media file (say .webm or .mp4), but should you instead mix all of the audio and video sources together into a single stream?
Using such a mixer means spending a lot of CPU and other resources for this process. The illustration below (from my Advanced WebRTC Architecture course) shows how that gets done for two users - you can deduce from there for more media sources:
The red blocks are the ones eating up your CPU budget. Decoding, mixing and encoding are expensive operations, especially when an SFU is designed and implemented to avoid exactly such tasks.
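For a sense of what that decode-mix-encode step looks like in practice, here’s a rough sketch that composites two recorded participants side by side with ffmpeg, spawned from Node. The file names, the 720p target and the VP9/Opus output are assumptions for illustration, not a recommendation:

```typescript
import { execFileSync } from 'node:child_process';

// Decode both inputs, scale them to the same height, stack them side by side,
// mix the audio tracks, and re-encode everything into a single file.
// This is exactly the CPU-heavy work an SFU normally avoids.
function mixTwoParticipants(fileA: string, fileB: string, output: string): void {
  execFileSync('ffmpeg', [
    '-i', fileA,
    '-i', fileB,
    '-filter_complex',
    '[0:v]scale=-2:720[v0];[1:v]scale=-2:720[v1];' +
      '[v0][v1]hstack=inputs=2[v];' +
      '[0:a][1:a]amix=inputs=2[a]',
    '-map', '[v]', '-map', '[a]',
    '-c:v', 'libvpx-vp9', '-c:a', 'libopus',
    output,
  ]);
}

// Usage (hypothetical file names):
// mixTwoParticipants('alice.webm', 'bob.webm', 'session.webm');
```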
Here’s how these two alternatives compare to each other:
| | Multiple streams | Mixed stream |
|---|---|---|
| Operation | Save into a media file | Decode, mix and re-encode |
| Resources | Minimal | High on CPU and memory |
| Playback | Customized, or each individual stream separately | Simple |
| Main advantages | | |
🔴 When would I use multi stream recording?
Multi stream can be viewed as a step towards mixed stream recording or as a destination of its own. Here’s when I’d pick it:
🔴 When would I decide on mixed stream recording?
Mixed recording would be my go-to solution almost always. Usually because of these reasons:
🔴 What about mixed stream client side recording?
One thing that I’ve seen once or twice is an attempt to use the device’s browser to mix the streams for recording purposes. This might be doable, but quality is going to be degraded both for the actual user in the live session and in the recorded session.
I’d refrain from taking this route…
If you are aiming for a single stream recording, then the next dilemma you need to solve is the one between switching and compositing. Switching is the poor man’s choice, while compositing offers a richer “experience”.
What do I mean by that?
Audio is easy. You always need to mix the sources together. There isn’t much of a choice here.
For video though, the question is mostly what kind of a vantage point do you want to give that future viewer of yours. Switching means we’re going to show one person at a time - the one shouting the loudest. Compositing means we’re going to mix the video streams into a composite layout that shows some or all of the participants in the session.
Google Meet, for example, uses the switching method in its recordings, with a simple composite layout when screen sharing takes place (showing the presenter and his screen side by side, likely because it wasn’t too hard on the mixing CPU).
In a way, switching enables us to “get around” the complexity of single stream creation from multiple video sources:
| | Switching | Compositing |
|---|---|---|
| Audio | Mix all audio sources | Mix all audio sources |
| Video | Select a single video at a time, based on active speaker detection | Pick and combine multiple video streams together |
| Resources | Moderate | High CPU and memory needs |
| Main advantages | Cost effective | More flexible in layouts and understanding of participants and what they visually did during the meeting |
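Most of the implementation effort in switching goes into the active speaker detection mentioned above. Here’s a small sketch that turns per-participant audio-level samples (for example, levels an SFU can read from the RTP audio-level header extension) into a timeline of who should be “on-air” - the data shapes and the 2-second hold time are assumptions for illustration:

```typescript
interface LevelSample { participantId: string; level: number; timestampMs: number; }
interface SwitchEvent { participantId: string; fromMs: number; }

// Build a "who is on-air" timeline from periodic audio-level samples.
// A hold time avoids flickering between speakers on every short utterance.
function buildSwitchTimeline(samples: LevelSample[], holdMs = 2000): SwitchEvent[] {
  // Group the samples by timestamp so we can compare participants per tick.
  const byTime = new Map<number, LevelSample[]>();
  for (const sample of samples) {
    const bucket = byTime.get(sample.timestampMs) ?? [];
    bucket.push(sample);
    byTime.set(sample.timestampMs, bucket);
  }

  const events: SwitchEvent[] = [];
  let current: string | undefined;
  let lastSwitchMs = -Infinity;

  for (const [ts, bucket] of [...byTime.entries()].sort((a, b) => a[0] - b[0])) {
    const loudest = bucket.reduce((a, b) => (b.level > a.level ? b : a));
    // Only switch when someone else is loudest and the hold time has passed.
    if (loudest.participantId !== current && ts - lastSwitchMs >= holdMs) {
      current = loudest.participantId;
      lastSwitchMs = ts;
      events.push({ participantId: current, fromMs: ts });
    }
  }
  return events;
}
```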
🔴 When would I pick switching?
When the focus is the audio and not the video.
Let’s face it - most meetings are boring anyway. We’re more interested in what is being said in them, and even that can be an exaggeration (one of the reasons why AI is used for creation of meeting summaries and action items in some cases).
The only crux of the matter here is that implementing switching might take slightly longer than compositing. In order to optimize for machine time in the recording process, we need to first invest more development time. Bear that in mind.
🔴 When would compositing be my choice?
The moment the video experience is important. Webinars. Live events. Video podcasts.
Media that you plan or want to apply post-production editing to.
Or simply when the implementation is there and easier to get done.
I must say that in many cases I’ve been involved with, switching could have been selected. Compositing was picked just because it was thought of as the better/more complete solution. Which begs the question - how can Google Meet get away with switching in 2024? (the answer is simple - compositing isn’t needed in a lot of use cases).
Assuming you decided on compositing the multiple video streams into a single stream in your WebRTC recording, it is now time to decide on the layout to use.
You can go for a single rigid layout used for all (say tiles or presenter mode). You can go for a few layouts, with the ability to switch from one to the other based on context or some external “intervention”. You can also go for something way more flexible. I guess it all depends on the context of what you’re trying to achieve:
| | Single | Rigid | Flexible |
|---|---|---|---|
| Concept | A single layout to rule them all | Have 2, 3 or 7 specific layouts to choose from | Allow virtually any layout your users may wish to use |
| Main advantages | | | Users can control everything, so you can offer the best user experience possible |
| Main challenges | What if that single layout isn’t enough for your users? | | |
Here’s a good example of how this is done in StreamYard:
StreamYard gives 8 different predefined layouts a host can dynamically choose from, along with the ability to edit a layout or add new ones (the buttons at the bottom right corner of the screen).
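Whichever route you take, the layout itself usually ends up being a small declarative description that the compositor consumes. Here’s a sketch of what such a definition might look like - all names and the specific regions are hypothetical, just to show how little data even a “flexible” layout really needs:

```typescript
// A layout is a set of regions, each expressed as fractions of the output frame,
// plus a rule for which participant (or stream type) feeds which region.
interface LayoutRegion {
  id: string;
  x: number;      // 0..1, fraction of output width
  y: number;      // 0..1, fraction of output height
  width: number;  // 0..1
  height: number; // 0..1
  source: 'activeSpeaker' | 'screenshare' | { participantId: string };
}

interface Layout {
  name: string;
  regions: LayoutRegion[];
}

// Example: presenter + screen share side by side (a "rigid" preset),
// which a flexible system would let users edit or replace entirely.
const presenterLayout: Layout = {
  name: 'presenter-with-screenshare',
  regions: [
    { id: 'screen', x: 0, y: 0, width: 0.75, height: 1, source: 'screenshare' },
    { id: 'speaker', x: 0.75, y: 0, width: 0.25, height: 0.25, source: 'activeSpeaker' },
  ],
};
```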
🔴 When to aim for rigid layouts?
Here’s when I’ll go with rigid layouts:
Here, make sure to figure out which layouts are best to use and how to automatically make the decision for the users (it might be that you record whatever layout the host is using, or decide based on the current state of the meeting - with screen sharing, without, number of participants, etc.).
🔴 When would flexibility be in my menu?
Flexibility will be what I’ll aim for if:
You decided to go for a composite video stream for your WebRTC recording? Great! Now how do you achieve that exactly?
For the most part, I’ve seen vendors pick one of two approaches here - either build their own proprietary/custom transcoding pipeline, or use a headless browser as their compositor:
| | Transcoding pipeline | Browser engine |
|---|---|---|
| Underlying technology | Usually ffmpeg or gstreamer | Chrome (and ffmpeg) |
| Concept | Stitch the pipeline on your own from scratch | Add a headless browser in the cloud as a user to the meeting and capture the screen of that browser |
| Resources | High | High, with higher memory requirements (due to Chrome) |
| Main advantages | | |
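To give a feel for the browser-engine approach, here’s a rough sketch of the skeleton most implementations share: a Chrome instance joins the meeting on a virtual display, and ffmpeg captures that display plus the system audio into a single file. The display number, resolution, flags and URL handling are assumptions for illustration - real recorders (Jitsi’s Jibri is a good open source reference) wrap a lot more logic around this:

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// Render the meeting in Chrome on a virtual X display and capture that display.
function startBrowserRecorder(meetingUrl: string, outputFile: string): ChildProcess[] {
  // 1. A virtual display for the (otherwise headless) recording machine.
  //    In practice you'd wait for Xvfb to be ready before launching Chrome.
  const xvfb = spawn('Xvfb', [':99', '-screen', '0', '1280x720x24']);

  // 2. Chrome joins the meeting as a hidden participant, rendering the composite UI.
  const chrome = spawn(
    'google-chrome',
    ['--kiosk', '--no-sandbox', '--autoplay-policy=no-user-gesture-required', meetingUrl],
    { env: { ...process.env, DISPLAY: ':99' } }
  );

  // 3. ffmpeg grabs the display and the default audio sink into one composite file.
  const ffmpeg = spawn('ffmpeg', [
    '-f', 'x11grab', '-video_size', '1280x720', '-framerate', '30', '-i', ':99',
    '-f', 'pulse', '-i', 'default',
    '-c:v', 'libx264', '-preset', 'veryfast', '-c:a', 'aac',
    outputFile,
  ]);

  return [xvfb, chrome, ffmpeg];
}
```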
Here I won’t be giving an opinion about which one to use as I am not sure there’s an easy guideline. To make sure I am not leaving you half satisfied here, I am sharing a session Daily did at Kranky Geek in 2022, talking about their native transcoding pipeline:
Since that’s the alternative they took, look at it critically and try to figure out what their challenges were, so you can create your own comparison table and make a decision on which path to take.
Last but not least, decide if the recording process takes place online or post mortem - live or “offline”.
This is relevant when what you are trying to do is to have a single composite media stream out of the session being recorded. With WebRTC recording, you can decide to start off by just saving the media received by your SFU with a bit of metadata around it, and only later handle the actual compositing:
| | Live | “Offline” |
|---|---|---|
| Concept | Handle recording on demand, as it is taking place. Usually, adding 0-5 seconds of delay | Use job queues to handle the recording process itself, making the recorded media file available for playback minutes or hours after the session ended |
| Main advantages | | |
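If you go the “offline” route, the bit of metadata saved next to each stream is what makes the later compositing job possible. Here’s a sketch of what that record might contain - field names and the storage layout are purely illustrative:

```typescript
// Metadata stored alongside the raw streams the SFU saved during the session.
// An offline worker later pulls this from a job queue, fetches the files,
// aligns them by their offsets, and runs the compositing step (e.g. with ffmpeg).
interface RecordedTrack {
  participantId: string;
  kind: 'audio' | 'video' | 'screenshare';
  file: string;          // e.g. an object-storage key
  startedAtMs: number;   // offset from session start, used for alignment
  stoppedAtMs: number;
}

interface RecordedSession {
  sessionId: string;
  startedAt: string;     // ISO-8601 wall-clock time of the session start
  tracks: RecordedTrack[];
}

// A (made-up) example of what gets queued for the offline compositing job:
const job: RecordedSession = {
  sessionId: 'room-42',
  startedAt: '2024-06-01T10:00:00Z',
  tracks: [
    { participantId: 'alice', kind: 'audio', file: 'rec/room-42/alice-audio.webm', startedAtMs: 0, stoppedAtMs: 1800000 },
    { participantId: 'alice', kind: 'video', file: 'rec/room-42/alice-video.webm', startedAtMs: 0, stoppedAtMs: 1800000 },
    { participantId: 'bob', kind: 'video', file: 'rec/room-42/bob-video.webm', startedAtMs: 12000, stoppedAtMs: 1800000 },
  ],
};
```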
🔴 When to go live?
The simple answer here is when you need it:
🔴 When to use “offline”?
Going “offline” has its set of advantages:
🔴 How about both?
Here are some suggestions of combinations of these approaches that might work well:
This has been long. Sorry about that.
Designing your WebRTC recording architecture isn’t simple once you dive into the details. Take the time to think of these requirements and understand the implications of the architecture decisions you make.
Oh, and did I mention there’s a set of courses for WebRTC developers available? Just go check them out at https://webrtccourse.com 😃