There are different ways to use WebRTC. Zoom is using WebRTC, just not in the most common way possible today.
Zoom seems to be an interesting topic when it comes to WebRTC. I’ve written about them twice recently (plus a bonus one from webrtcHacks):
- When Jitsi played with Zoom vs Jitsi in bandwidth limiting
- That in turn led to webrtcHacks looking at the browser implementation of Zoom
- Just two months back we had the security vulnerability in Zoom
That in itself raised the question of where WebRTC starts and where it ends, since Zoom uses getUserMedia to access the media to begin with.
What was found lately is even more interesting:
Nils (Mozilla) noticed that Zoom is using WebRTC’s data channel, which led webrtcHacks to update that Zoom article.
Interesting times 🙂
Want to effectively connect WebRTC sessions with the success rate that Zoom is capable of? Check out my free mini course
What does “use WebRTC” mean?
If you go by the specification components in W3C, then the split looks something like this:
From the W3C specifications standpoint, WebRTC is support for Peer Connection and the Data Channel. This encompasses other components such as getStats, SDP negotiation, ICE negotiation, etc.
But at its core, WebRTC is about sending data in real time in peer-to-peer fashion across browsers. Be it voice, video or arbitrary data.
getUserMedia and getDisplayMedia have their own specification – Media Capture and Streams. This is what Zoom has been using out of WebRTC. It allows browsers access to cameras, microphones and the screen itself. These are also used for things that have nothing to do with communications: MailChimp and WhatsApp have been using them to take snapshots for a long time now, and others are doing the same.
Then there’s the MediaRecorder component, which is defined in MediaStream Recording. Its use? To record media locally. Dubb and Loom use it for example.
Is MediaStream Recording WebRTC? Is Media Capture and Streams WebRTC?
I like taking an encompassing view here and consider them part of what WebRTC is in its essence when used in a browser.
Zoom’s route to WebRTC
Back to Zoom.
Zoom started by using only getUserMedia for capture, combined with other browser technologies such as WebAssembly. They got their real time media processing somewhere else.
The next step is what Nils just bumped into – Zoom decided that streaming the media over a WebSocket is nice but not that efficient. Since it ends up over TCP, the performance and media quality are subpar once packet losses kick in. That’s because TCP starts retransmitting the media when it is already too late for a real time task like video calling to use it, ending up with even more congestion and more packet losses.
What is a company to do with such a problem? Find an unreliable connection to send their data on. There are two alternatives today to do that in web browsers:
- WebRTC’s data channel (which uses SCTP today)
- QUIC (HTTP/3), which is still a bit too new
Zoom decided on WebRTC’s data channel in its current SCTP implementation. They haven’t gone for the Google Chrome experiment of a QUIC data channel (which should be rather “safe” considering Google Stadia is said to be using it). And they haven’t decided to use HTTP/3, which I find a bit odd.
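To make "unreliable" concrete, here is a minimal sketch of how a data channel can be opened without ordering and without retransmissions. It is not Zoom's code. `ChannelInit` and `PeerConnectionLike` are local stand-ins I declared so the snippet type-checks outside a browser; they mirror the `ordered`, `maxRetransmits` and `maxPacketLifeTime` options and the `createDataChannel` method of the browser API. The `media` label and the helper names are made up for illustration.

```typescript
// Local stand-ins (assumptions for this sketch) for the browser's
// RTCDataChannelInit dictionary and the one RTCPeerConnection method we need.
interface ChannelInit {
  ordered?: boolean;
  maxRetransmits?: number;
  maxPacketLifeTime?: number;
}

interface PeerConnectionLike<C> {
  createDataChannel(label: string, init?: ChannelInit): C;
}

// Options for a "send it once and move on" channel:
// messages may arrive out of order and are never retransmitted.
function unreliableChannelInit(): ChannelInit {
  return { ordered: false, maxRetransmits: 0 };
}

// Opens the channel on whatever peer connection object is passed in.
// The default label "media" is a hypothetical name, not something Zoom uses.
function openMediaChannel<C>(pc: PeerConnectionLike<C>, label = "media"): C {
  return pc.createDataChannel(label, unreliableChannelInit());
}
```

In a browser you would pass a real `RTCPeerConnection` to `openMediaChannel`. With `maxRetransmits` set to 0, a lost message simply stays lost, which is the behavior you want when a late video packet is worthless anyway.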
The end result? Zoom is using WebRTC. Somewhat. With a data channel. To handle live video streams, with their previous WebSocket architecture as fallback. And not the peer connection’s own media stack. It is really cool, but… don’t try this at home.
Is this the end of the road for WebRTC in Zoom?
I don’t think so.
They still have the installation friction and now all them pesky security experts breathing down their necks looking for vulnerabilities. It won’t hurt their valuation or their revenue, but it will eat into management’s attention.
And frankly? Zoom on a data channel will still be subpar, since doing everything in WebAssembly isn’t optimized enough. At some point, Zoom will need to throw in the towel and join the WebRTC game.
Why?
Because of either VP9 or AV1. Whichever ends up being the breaking point for Zoom.
What will be the next step for Zoom’s adoption of WebRTC?
Zoom has two main things working for it today, as far as I can see:
- It just works
- Quality is great
Both are user/market perception more than they are an objective reality (if there even is such a thing).
1. It just works
“It just works” is about simplicity. It is the reason Zoom started with using getUserMedia and later the data channel. Without it, guest access to Zoom would mandate installing their app. At a time when all of their competitors require no installation, that’s a problem. The problem is that the small friction that is left means that “it just works” is no longer a Zoom advantage. It becomes their hindrance.
2. Quality is great
Zoom uses H.264, at least from the analysis done by webrtcHacks (based on packet header inspection).
Since WebRTC has H.264 support, my assumption is that Zoom’s H.264 implementation is proprietary or at the very least, not compliant with the WebRTC one. They might have their own H.264 implementation which they like, value and can’t live without – or at least can’t replace in a single day.
At some point, that implementation is going to lose its luster and its advantages, and that day is rather close now.
H.264 is computationally simpler than VP9 and AV1 – a good thing. But at the same time, VP9 and AV1 offer better quality than H.264 at the same bitrate.
When Zoom’s competitors migrate to using VP9 or AV1, what is Zoom to do?
It can probably adopt VP9 or go for HEVC. It might even decide to use AV1 when the time comes.
But what if it does that without supporting WebRTC? Would running an implementation of a video codec twice or three times as complex as H.264 in WebAssembly make sense? Will it be able to compete against hardware implementations or optimized software implementations that will be found at that point in web browsers?
Without relying on WebRTC, Zoom will be impacted severely in its web browser implementation, and will need to stick to installing an app. At some point, this will no longer be acceptable.
If I were Zoom, I’d start working on a migration plan towards WebRTC. One lasting at least 2-3 years. It is going to be long, complicated, painful and necessary.
Microsoft has taken that route with Skype. Cisco did the same with WebEx.
Both Microsoft and Cisco are probably most of the way there, but not quite there yet.
Zoom should start that route.
The end of proprietary communications
In a way, this marks the end of proprietary communications. At least for the coming 5-10 years.
It is funny how things flip.
The market used to look like this:
Companies standardized on signaling, put acceptable standardized codecs in place, and then pushed proprietary non-standard improvements to their codecs.
And now it looks like this:
Companies standardize on codecs, using whatever WebRTC has available (and complaining about it), placing their own proprietary signaling and infrastructure to make things work well.
In that same challenge, you’ll find additional vendors:
Agora.io, who has their own proprietary codecs, claiming superior error resiliency. They just joined AOMedia, becoming part of the companies behind the AV1 video codec.
Dolby, who has their own proprietary voice codec, offering a 3D spatial experience. It works great, but is limited when it comes to the browser environment.
As WebRTC democratized communications it also killed a lot of what proprietary optimizations in the codec level can do to assist in gaining a competitive advantage.
It isn’t that better codecs don’t exist. It is that using them comes with an impossibly high cost: they can’t be used inside browsers, and that’s where everyone is these days.
There is an alternative: they wait for Chrome to implement the webrtc-nv bits which allow plugging in custom encoders.
Given that Google has intentionally killed pan-tilt-zoom functionality for them (https://bugs.chromium.org/p/chromium/issues/detail?id=952821; the linked issue 891460 is a very interesting read) would they really want to move a large amount of their usage to Chrome?
What would happen if there was another “accident” with the H264 decoder? See https://bugs.chromium.org/p/chromium/issues/detail?id=999807#c29 for an example of the kind of bug that still slips through to the stable release.
Interesting as always.
I am not sure what will happen first: adoption of VP9 (or AV1), or improvements around WebRTC-NV, something that hasn’t even started.
May I know how you got the information that Zoom is going to use WebRTC?
Sure Albert.
The information that Zoom is using the data channel is accurate (and that’s already WebRTC).
The reasoning that they will continue with that path and support more of WebRTC is just that – reasoning. It is what makes sense to do in the longer run.
Just wanna know how accurate the information is.
I have seen multiple developers end up going with DataChannel for media recently.
One issue is the poor performance of SRTP encryption. It isn’t able to be hardware accelerated, while DTLS is. When doing SFUs at scale they are saving lots of money by using data channels. On the client side it also makes a big difference: I have been told that some clients are unusable without AES-NI/HW accelerated video coding. (https://bugs.chromium.org/p/chromium/issues/detail?id=713701#c22)
The other one is the lack of latency control. I have seen a few users complain that they can’t control it more. They would rather have packet loss but lower latency, so they packetize and do things in WASM. No idea if that is a real world project, but it seems to be working for that company!
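For readers wondering what "packetize" could look like in practice, here is a rough sketch under my own assumptions, not something taken from Zoom or from the projects mentioned above. The 8-byte header layout, `packetize` and `FrameAssembler` are all hypothetical. The idea is to split each encoded frame into small chunks, and on the receiving side to give up on an incomplete frame as soon as a newer one shows up, trading loss for latency.

```typescript
// Hypothetical wire format (an assumption of this sketch), big-endian:
//   bytes 0-3: frame id, bytes 4-5: chunk index, bytes 6-7: chunk count
const HEADER_BYTES = 8;

// Splits one encoded frame into packets of at most maxPayload payload bytes.
function packetize(frameId: number, frame: Uint8Array, maxPayload: number): Uint8Array[] {
  if (maxPayload <= 0) throw new RangeError("maxPayload must be positive");
  const count = Math.max(1, Math.ceil(frame.length / maxPayload));
  if (count > 0xffff) throw new RangeError("frame needs too many chunks");
  const packets: Uint8Array[] = [];
  for (let index = 0; index < count; index++) {
    const payload = frame.subarray(index * maxPayload, (index + 1) * maxPayload);
    const packet = new Uint8Array(HEADER_BYTES + payload.length);
    const view = new DataView(packet.buffer);
    view.setUint32(0, frameId);
    view.setUint16(4, index);
    view.setUint16(6, count);
    packet.set(payload, HEADER_BYTES);
    packets.push(packet);
  }
  return packets;
}

// Rebuilds frames from packets that may arrive out of order or not at all.
// Only the newest frame is tracked; frame id wraparound is ignored here.
class FrameAssembler {
  private currentId = -1;
  private chunks: (Uint8Array | undefined)[] = [];
  private received = 0;

  // Returns the full frame once its last missing chunk arrives, otherwise null.
  push(packet: Uint8Array): Uint8Array | null {
    if (packet.byteLength < HEADER_BYTES) return null; // malformed packet
    const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
    const frameId = view.getUint32(0);
    const index = view.getUint16(4);
    const count = view.getUint16(6);
    if (frameId < this.currentId) return null; // late chunk of an abandoned frame
    if (frameId > this.currentId) {
      // A newer frame started: drop whatever was still incomplete.
      this.currentId = frameId;
      this.chunks = new Array<Uint8Array | undefined>(count).fill(undefined);
      this.received = 0;
    }
    if (index >= this.chunks.length || this.chunks[index] !== undefined) return null;
    this.chunks[index] = packet.subarray(HEADER_BYTES);
    this.received++;
    if (this.received < this.chunks.length) return null;
    let total = 0;
    for (const chunk of this.chunks) total += chunk ? chunk.length : 0;
    const frame = new Uint8Array(total);
    let offset = 0;
    for (const chunk of this.chunks) {
      if (chunk) {
        frame.set(chunk, offset);
        offset += chunk.length;
      }
    }
    return frame;
  }
}
```

Each packet would then be handed to the data channel’s `send()`, with `maxPayload` kept small enough to fit in a single network packet. A real implementation would likely add timestamps and some form of loss recovery or keyframe request, which this sketch leaves out.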
Thanks for sharing Sean. Not sure that Zoom fits into such categories, but good to know.
May I ask what tool the pictures in the article were made with?
Sure Bill.
In this case it was pens and paper. I went all analog. Took a scan using Scanbot mobile app and then just cropped it to fit.
My problem with Zoom is its horrible audio quality. Video is not bad but their audio is horrendous. And their techs have no solution.
Douglas, I guess different people have different experiences with media quality of the various services out there. Most people I’ve talked to are happy with Zoom.
I like the analysis in “The end of proprietary communications”, as we are moving indeed from proprietary codecs to proprietary signaling.
But continuing to use old protocols such as RTCP and co. could prevent real innovation from taking place: what do you think of https://snr.stanford.edu/salsify/ for instance?
Frederic,
RTCP isn’t signaling, at least not the way I see it. My thinking is that shifting from RTP and RTCP to QUIC (HTTP/3) is the right long term approach and I’ve written about it here in the past.
As for Salsify, it has nothing to do with protocols. As far as I can understand it is just an implementation detail on how to get codecs and transport to work better together, and frankly? Everyone’s been doing that (at least in intent) since the early days of video conferencing – I remember we did similar things at RADVISION 15+ years ago. Can WebRTC be improved? Definitely. But a lot of that improvement relies on the code that ends up implemented inside web browsers.
Here is how I see it. WebRTC is a full suite of “capabilities” that you can use in multiple ways. You’re not really “forced” to use them in a traditional manner, but as long as you do, things are fairly clean and simple.
The interesting things happen when you roam outside of the “normal” usage. For example, we use WebRTC in our mobile SDKs, mainly for its media processing capabilities. We don’t really use any of the “browser” stack, simply because – there is no browser involved. The result is very “unique” as we enjoy the various “internal” smarts of WebRTC and codec handling, while we add our own smarts into how to use these.
The switch from “proprietary codecs/standard signalling” to “standard codecs/proprietary signalling” was exactly where WebRTC was going. Allowing each vendor to create their own “usage” pattern of the underlying technology. So, you have agora.io, zoom, jitsi and others – here’s my challenge to you, make them talk to each other – will not happen.
“As WebRTC democratized communications it also killed a lot of what proprietary optimizations in the codec level can do to assist in gaining a competitive advantage.” – yes, that is correct. But at the same time it opened a new realm of innovation and ideas, which is how to monetise the signalling environment. I believe that while Zoom’s decision may be “questionable” or “weird” to some, it is nothing more than a prelude to something new.
I found this article because of my experience of how badly Zoom performs on chromebooks. Given the new “mainstreaming” of Zoom during the pandemic, we chromebook users are extremely frustrated. So I eagerly await Zoom’s full conversion to webRTC – like, say, Microsoft Teams versus Skype-for-Business.
I don’t envision this coming any time soon I am afraid.
Jitsi meet is a great alternative that is already fully webrtc and can handle quite a lot of people, depending on your server specifications. Another one is running Nextcloud talk. I am not sure how that compares in performance with the number of people, but because of its use of webrtc it will run in browsers without any need for installing software, on desktop and mobile.
Thanks for this post, it was informative.
I did want to dispute the “it just works” positive bullet point for zoom.
It hardly works at all on chromebooks.
Google Meet is 100% fine, but zoom will choke a chromebook to death.
Thanks Brad. Didn’t know this tidbit about Zoom and Chromebooks.