WHIP & WHEP are specifications to get WebRTC into live streaming. But is this really what is needed moving forward?
WebRTC is great for real time. Anything else - not so much. Recently, two new protocols came into being - WHIP and WHEP. They act as signaling for WebRTC to better support live streaming use cases.
In recent months there has been growing adoption of these protocols in implementations (actual usage isn’t something I am privy to, so I can’t attest to it either way). This progress is positive, but I can’t shake the feeling that it is only a temporary solution.
WHIP stands for WebRTC-HTTP Ingestion Protocol. WHEP stands for WebRTC-HTTP Egress Protocol. They are both relatively new IETF drafts that define a signaling protocol for WebRTC.
👉 WebRTC explicitly decided NOT to have any signaling protocol, so that developers would be able to pick and choose any existing signaling protocol of their choice - be it SIP, XMPP or any other alternative. For the media streaming industry, this wasn’t a good thing - it needed a well-known protocol with ready-made implementations. Which led to WHIP and WHEP.
To understand how they fit into a solution, consider the typical live streaming architecture:
In a live streaming use case, we have one or more broadcasters who “ingest” their media to a media server. That’s where WHIP comes in. The viewers, on the other side, get their media streams from the egress side of the media server infrastructure. That’s where WHEP comes in.
For a technical overview of WHIP & WHEP, check out this Kranky Geek session by Sergio Garcia Murillo from Dolby:
In video conferencing, WebRTC transformed the market and how it thought about meetings and interoperability. It practically killed the notion of interoperability across vendors at the protocol level, shifting it to the application level - users install a vendor’s own app on their devices or just load a web page on demand.
The streaming industry is different - it relies on 3 components, which can easily come from 3 different vendors:
- the ingestion side, where the broadcaster’s media is acquired and sent to the media server
- the media server, which processes and distributes the media
- the media player, where viewers consume the content
When broadcasters implement their application, they pick and choose the media servers and the media players. Sometimes they will also pick the ingestion part, but not always. And no vendor in any of these 3 categories can really force the others to use its own components.
This posed a real issue for WebRTC. It has no signaling protocol - that is left to the implementers - so how do you build a solution that works across vendors without an agreed-upon signaling protocol?
The answer to that was WHIP and WHEP.
These are really simple protocols, built around the notion of a single HTTP request - an attempt to get the streaming industry to adopt them without shying away from the complexities hidden inside WebRTC.
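To make this concrete, here is roughly what WHIP ingest looks like from a browser. This is a minimal sketch in TypeScript - the endpoint URL is made up, and trickle ICE (handled via HTTP PATCH in the drafts) plus error handling are omitted - but the core signaling exchange really is a single HTTP POST carrying an SDP offer:

```typescript
// Minimal WHIP publish sketch. Assumptions: the endpoint URL is hypothetical;
// trickle ICE and error handling are left out for brevity.
async function whipPublish(endpoint: string, stream: MediaStream) {
  const pc = new RTCPeerConnection();
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // The entire signaling exchange is one HTTP POST carrying the SDP offer...
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });

  // ...with the SDP answer coming back in the response body
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });

  // The Location header points at the session resource; DELETE tears it down
  const resource = new URL(res.headers.get("Location")!, endpoint).toString();
  return async () => {
    await fetch(resource, { method: "DELETE" });
    pc.close();
  };
}
```

Compare this with rolling your own WebSocket-based signaling: any WHIP-capable media server can accept this ingest, regardless of who built the client.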
Here’s what’s working well for WHIP and WHEP:
There’s the challenging side of things as well:
This last weakness - WebRTC itself - leads me to the next issue at hand.
Streaming comes in different shapes and sizes.
A scenario might have a different broadcaster-to-viewer ratio - 1:1, 1:many, few:1, few:many - and each has its own requirements and nuances as to what I’d prefer using on the sending side, on the receiving end and on the media server itself.
What really changes everything here is latency. How much latency are we willing to accept?
The lower the latency we want, the more challenging the implementation. The closer to live/real time we wish to get, the more sacrifices we need to make in terms of quality. I’ve written before about the need to choose between quality and latency. Roughly speaking, WebRTC lives in the sub-second range, traditional HLS and DASH run tens of seconds behind, and their low-latency variants sit somewhere in between, at a few seconds.
WebRTC is laser-focused on real time and live. So much so that it can’t really handle anything that has latency in it. It can - but it will sacrifice too much for that, at a high complexity cost - something you don’t really want or need.
What does that mean exactly?
This is when a few tough questions need to be asked - what exactly does your streaming service need?
If everything needs to happen at sub-second latency, then WebRTC is probably the way to go. But if your use case includes other latencies as well, think twice before choosing WebRTC as your go-to solution.
An important aspect to mention here is that in many cases, WebRTC is used in a hybrid model for media streaming.
Oftentimes, we want to ingest media using WebRTC and have it viewed elsewhere using other protocols - usually because we don’t care as much about latency on the viewing side, or because we already have the viewing component solved and deployed - in which case WebRTC ingest is added to an existing service.
Adding the WHIP protocol here, and ingesting WebRTC media into the streaming service, means we can acquire the media from a web browser without installing anything. Real time is nice, but not always needed. Browser ingest, though, is mostly about reducing friction and enabling web applications. A sketch of how little that takes follows below.
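Assuming a whipPublish() helper like the sketch shown earlier, browser ingest really is just a few lines - no downloads, no plugins (the URL here is hypothetical):

```typescript
// Hypothetical usage: zero-install ingest straight from a web page
const media = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
const stopBroadcast = await whipPublish("https://streaming.example.com/whip/my-stream", media);
// ...later, when the broadcast ends:
await stopBroadcast();
```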
That last suggestion would have looked different just two years ago, when WebRTC was the only game in town for real time in browsers. Today, that is no longer the case.
In 2020 I pointed to the unbundling of WebRTC - the trend in which WebRTC is being split into its core components, so that developers can use each one independently and, in a way, build their own solution that is similar to WebRTC but isn’t WebRTC. These components are:
- WebTransport - for sending and receiving media (and any other data) between browser and server
- WebCodecs - for encoding and decoding audio and video
- WebAssembly - for running compute-intensive processing efficiently in the browser
Theoretically, using these 3 components one can build a real time communication solution, which is exactly what Zoom is trying to do inside web browsers.
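As a rough illustration of what such an unbundled pipeline looks like, here is a TypeScript sketch that captures a camera track, encodes it with WebCodecs and pushes the encoded chunks over WebTransport. The server URL and the wire format are my own assumptions - a real implementation needs proper framing (timestamps, keyframe flags, length prefixes), backpressure and error handling:

```typescript
// Sketch of the unbundled pipeline: capture → WebCodecs encode → WebTransport send.
// The URL is hypothetical; MediaStreamTrackProcessor is currently Chromium-only.
const transport = new WebTransport("https://media.example.com/publish");
await transport.ready;
const writer = (await transport.createUnidirectionalStream()).getWriter();

const encoder = new VideoEncoder({
  output: (chunk) => {
    const payload = new Uint8Array(chunk.byteLength);
    chunk.copyTo(payload);
    writer.write(payload); // naive: drops chunk.timestamp and key/delta type
  },
  error: (e) => console.error("encoder error", e),
});
encoder.configure({ codec: "vp8", width: 1280, height: 720, bitrate: 1_000_000 });

const cam = await navigator.mediaDevices.getUserMedia({ video: true });
const frames = new MediaStreamTrackProcessor({ track: cam.getVideoTracks()[0] }).readable;
const reader = frames.getReader();
for (;;) {
  const { value: frame, done } = await reader.read();
  if (done) break;
  encoder.encode(frame); // WebAssembly-based processing would slot in before this step
  frame.close();
}
```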
In the past several months I’ve seen more and more companies adopting these interfaces. It started with vendors using WebAssembly for background blurring and replacement, moved on to companies toying with WebTransport and/or WebCodecs for streaming, and recently a lot of vendors have been doing noise suppression with WebAssembly.
Here’s what Intel showcased during Kranky Geek 2021:
This trend is only going to grow.
How does this relate to streaming?
Good that you asked!
These 3 enable us to implement our own live streaming solution that isn’t based on WebRTC and can still achieve sub-second latency in web browsers. The approach is also flexible enough to let us add mechanisms and tools that handle higher latencies when needed - and at higher latencies, improve the quality of the media. The toy sketch below shows the idea.
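That flexibility can be made concrete with something WebRTC won’t easily give you: a playout buffer whose target latency is just a knob. A toy sketch (the names are mine, not from any library, and drawFrame is a hypothetical renderer, e.g. painting to a canvas):

```typescript
// Toy playout buffer: hold incoming frames for a configurable target latency.
// A bigger target leaves room for retransmissions and smoothing (quality);
// a smaller one gets us closer to real time.
class PlayoutBuffer<T> {
  private queue: { arrivedAt: number; item: T }[] = [];

  constructor(
    private targetLatencyMs: number,
    private render: (item: T) => void,
  ) {
    setInterval(() => this.drain(), 10); // check the queue every 10ms
  }

  push(item: T) {
    this.queue.push({ arrivedAt: performance.now(), item });
  }

  private drain() {
    const now = performance.now();
    while (this.queue.length > 0 && now - this.queue[0].arrivedAt >= this.targetLatencyMs) {
      this.render(this.queue.shift()!.item);
    }
  }
}

// Same pipeline, different latency/quality trade-offs:
const interactive = new PlayoutBuffer<VideoFrame>(150, drawFrame);    // near real time
const qualityFirst = new PlayoutBuffer<VideoFrame>(3_000, drawFrame); // room to recover losses
```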
Here’s what I like about this approach:
It isn’t all shiny though:
So where does this leave us? I don’t know.
WHIP and WHEP are here. They are gaining traction and have vendors behind them pushing them.
On the other hand, they don’t solve the whole problem - only the live aspect of streaming.
The reason WebRTC is used at the moment is that it has been the only game in town. That will soon change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly, which will offer an alternative to WebRTC for live streaming in browsers.
Can this replace WebRTC? For media streaming - yes.
Is this the way the industry will go? This is yet to be seen, but definitely something to track.