RTC@Scale was Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary so you can pick and choose the relevant ones for you.
WebRTC Insights is a subscription service I have been running with Philipp Hancke for over a year now. Its purpose is to make it easier for developers to get a grip on WebRTC and all of the changes happening in the code and browsers - to keep you up to date so you can focus on what you do best - building awesome applications.
We got into a kind of a flow:
Once every two weeks we finalize and publish a newsletter issue
Once a month we record a video summarizing libwebrtc release notes
It is fun to do and the feedback we’re getting is positive.
That said, being us means that we can’t really sit still… or in this case - Philipp can’t…
We published this on Monday the week after the event took place to our WebRTC Insights clients, and now, we’re opening it up for everyone as well.
Philipp decided it would make sense to summarize the recent RTC@Scale “recruiting event” that Facebook did - the RSVP was explicitly asking for consent to be contacted. The technical depth of the talks was amazing so we’ve added an “out of order” issue for you, just for this 😎
The intent is for you to *not* spend 5 hours but rather to focus on the select sessions that are relevant for you.
The event setup was simple:
All sessions were pre-recorded and simply played back at the time of the event
If you want to watch the full 5 hours, you can use this link
KEYNOTE, PANEL AND WRAP-UP
Real-time Communication for Today and Future Experiences / Maher Saba @ Meta
Product-focused, make your product managers watch
Now this is a good recruiting pitch with all the fancy things you could work on!
One wonders if you will get interviewed on a VR whiteboard when applying…
Panel: RTC in the Metaverse / Sriram Srinivasan, Mike Arcuri, Paul Boustead, and Cullen Jennings
Product-oriented, a lot of talking. Watch with a glass of wine
40 minutes felt too long
The question everyone avoids is “what is Fortnite doing?”
SESSION 1: FUTURE RTC EXPERIENCES
These sessions focus on roadmap and far future views. We’d rather have a bit more on the here and now and the immediate future requirements than what would happen in 3, 5 or 10 years time, but hey - they are recruiting ;-)
Holographic Video Calling / Nitin Garg @ Meta
What will the technology stack for holographic video calling look like?
This is 5+ years into the future?
Encoding a single frame takes 30s currently (on i7 laptop)
It needs to be ~3ms to be really interesting
Comments on BWE, delay, rate control and FEC are relevant today
“Typical” behavior of BWE @ 2930s looks far too unstable
Holographic video calling is a nice topic, but niche at the moment. There are a lot more pressing aspects of scale that need to be dealt with first
Spatial Communications at Scale in Virtual Environments / Paul Boustead @ Dolby
Spatial audio in virtual worlds
Experience of rotating your head is important
Rendering the loudest 3 streams is what WebRTC does by default
P2P vs forward vs mixing
Server side mixing with HRTF (Head Related Transfer Function) vs multichannel spatial codec
The bigger the group, the more sense it would make to switch to spatial mixing of audio (assuming you’re into spatial audio)
Audio chain considerations
Watch this part for generally useful considerations
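The “loudest 3 streams” selection mentioned above can be sketched in a few lines. This is only an illustration of the idea, not how libwebrtc or Dolby implement it: the AudioStream type, the participant names and the levels are made up, and the level is assumed to be expressed in dBov (0 is the loudest, -127 is silence).

```python
from dataclasses import dataclass


@dataclass
class AudioStream:
    participant: str    # hypothetical participant identifier
    level_dbov: float   # assumed audio level in dBov: 0 is loudest, -127 is silence


def loudest_streams(streams, n=3):
    """Return the n loudest streams, loudest first."""
    return sorted(streams, key=lambda s: s.level_dbov, reverse=True)[:n]


# Made-up levels for a five-person call.
streams = [
    AudioStream("alice", -20.0),
    AudioStream("bob", -55.0),
    AudioStream("carol", -12.0),
    AudioStream("dave", -90.0),
    AudioStream("erin", -30.0),
]

selected = [s.participant for s in loudest_streams(streams)]
print(selected)
```

In a spatial setup the selection would likely also need to weigh distance and direction, which is where the mixing vs forwarding trade-off from the talk comes in.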
RTC3 / Justin Uberti @ Clubhouse
Great separation into phases, make product manager(s) watch
Interesting that he classifies 2010-2019 as mobile-driven and 2020+ as meeting-driven. “meetings usage eclipses call usage”
Reliability may be the expectation but who is working on that?
There is a lot to be desired on audio, where WebRTC has been (is?) neglected
But we taught people to mute when not speaking for a decade now…
Group communication and SFUs
Building a good SFU is still hard, value in e2e stack. Who owns that stack? For the client side that is still Google
Justin mentions Agora and Twilio in PaaS and large group calls. Twilio is limited at 50 users; there are others with better group calling solutions (Look at Vonage and Daily for example)
The WebRTC WATCHLISTS file is a really dumb metric to gauge vendors
Unifying RTC and HTTP/QUIC worlds
How the RTC congestion controller gets along with the QUIC one is unsolved
Unrelated to the content itself - smart cameras with auto zoom can be super annoying
Most of this session was focused on the history of WebRTC and the requirements of Clubhouse (audio-only). While we believe audio is important, video can’t be neglected either
Live QA
Watch if you found the sessions worthwhile
Justin Uberti does not wear the same clothes as in the recorded talk, breaking immersion!
SESSION 2: AUDIO ML
Audio ML is quite interesting. Large vendors are at it, and when (if?) the results will trickle into vanilla WebRTC is yet to be seen. Key takeaway: ML-based noise suppression is more important than echo cancellation these days.
Developing Machine Learning Based Speech Enhancement Models for Teams and Skype / Ross Cutler @ Microsoft
Watch if you care about audio quality, but note that it is very technical (and scientific)
Specific “what could have been better” questions can turn the common (and somewhat useless) “five star rating” into something that is actually actionable
Audio capture pipeline enhancements for noise suppression
Lots of almost-scientific evaluation
CPU perf evaluation followed by A/B testing in the field
Audio capture pipeline enhancements for combined AEC/NS
No A/B testing results sadly
Packet loss concealment
Can AI Disrupt Speech Compression? / Jan Skoglund @ Google
Watch if you want to learn more about audio codecs
Use-case is 2G/3G connections and limited data plans
WaveNet sounds drunk with background noise or music
Lyra and SoundStream
Realtime performance on a smartphone CPU
Lots of listening comparisons
Combine denoiser and codec
Guess what kind of music he plays 🎸
Live QA
Watch if you found the sessions worthwhile
SESSION 3: VIDEO
AV1 is coming. It will take time to be here. To get a grip over it and see what companies are doing, we got Google and Visionular.
Google is what goes inside WebRTC. Visionular is what you can buy commercially on the market for server or proprietary implementations.
Your focus should probably be on low bitrates and slide sharing scenarios.
AV1 Encoder for RTC / Marco Paniconi @ Google
Watch many times if you are a video expert. Otherwise just read this summary
RTC requirements differ from “encode a video”. Encoding screen share? We got you!
There is a “webrtc team” they are working with?
Ah, the one that maintains apprtc… which is down. Yes, there is a deployment guide but… can you click the link? No… (like many, we’re still frustrated that appr.tc was taken down so unexpectedly and with no public explanation)
“AV1X” is gone as of M96. See PSA. Missing from the release notes of course!
Unsurprisingly Duo and Meet are the use-cases driving this
Make sure to review the BW reqs on that slide
AV1 is being tested in Meet for screen share? We will monitor!
AV1 has a special mode for screen sharing
SVC is there but the WebRTC-SVC API to enable it is not making much progress
AV1 for RTC: Current and Future / Zoe Liu @ Visionular
Easier to follow than Marco due to being a more sales-y deck
Watch if you are considering licensing what Visionular does
A bit long for a sales deck
Lots of numbers, great if you understand those
apprtcmobile is … well, the state of that is unclear
Live QA
Watch if you found the sessions worthwhile
AV1 in Duo was low-bitrates, low resolutions. Tsahi predicted this would be the roll-out pattern
No, SVC is not there yet (as an API). Unless it is enabled by SDP munging too…?
SESSION 4: RESILIENCE AND ENCRYPTION
We found this part to be most applicable to current problems. This is where you should be spending your time and focus right now
Making Meta RTC Audio More Resilient / Andy Yang @ Meta
Highly applicable to WebRTC today. A primer on audio resilience, watch!
The presentation style is a very welcome change, giving a roadmap!
As developers, explaining the impact of your work is important
Excellent overview of common audio problems resulting from packet loss and jitter
Great latency analysis of the stack with breakdown of the budget
A rare NetEQ and jitter buffer explanation. NetEQ remains relevant a decade after the GIPS acquisition
Note that there is no RTX for audio, so a retransmitted packet is a plain resend and may be treated as “just” late. This is a major issue for video, where RTX is used most of the time to avoid this problem. Do we need RTX for audio? Maybe…
NACK and retransmissions will increase the jitter buffer delay otherwise?
WebRTC in the browser does offer a very limited control surface for this kind of experimentation… but it is clearly necessary
Technical metrics vs actual user perception
Measuring technical metrics (see e.g. RED post on hacks) is easy
Actual perception is hard
A very open problem indeed!
Summary - rewind, watch!
We want to know your story, tell our recruiter. Great pitch!
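To make the point about NACK, retransmissions and jitter buffer delay a bit more concrete, here is a back-of-the-envelope sketch. It is a deliberately simplified model and not how NetEQ works: it assumes a loss is noticed when the next packet arrives (one packet interval later) and that each NACK plus resend costs one full round trip. The function names and the numbers are ours.

```python
def min_buffer_for_retransmission_ms(rtt_ms, packet_interval_ms=20.0, nack_attempts=1):
    """Smallest playout delay that still lets a NACKed packet arrive in time.

    Assumptions (simplified model):
      - the gap is detected when the following packet arrives,
        i.e. one packet interval after the lost packet was due
      - every NACK/resend attempt costs one round trip
    """
    detection_ms = packet_interval_ms
    return detection_ms + nack_attempts * rtt_ms


def is_recoverable(buffer_delay_ms, rtt_ms, packet_interval_ms=20.0):
    """True if a single retransmission can arrive before the playout deadline."""
    return buffer_delay_ms >= min_buffer_for_retransmission_ms(rtt_ms, packet_interval_ms)


# 20 ms audio packets on a link with an 80 ms round trip.
needed = min_buffer_for_retransmission_ms(rtt_ms=80.0)
print(needed)
print(is_recoverable(buffer_delay_ms=60.0, rtt_ms=80.0))
print(is_recoverable(buffer_delay_ms=120.0, rtt_ms=80.0))
```

Under these assumptions a 60 ms buffer cannot benefit from retransmissions on an 80 ms link, so either the buffer grows (adding latency) or the resent packet is simply late.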
Private Calling at WhatsApp / Xi Deng @ Meta
Again, giving a roadmap and mission statement is great!
15 billion minutes talking on whatsapp each day…
Remember the 2018 “3 billion monthly” for Chrome?
One wonders how they compare to the largest telcos in the world
Great definition of “privacy” when it comes to calling. Metadata? Such a pun!
Interesting threat scenario
“no trust to faceless corporations” (how meta can Meta be?)
Multi-device messaging and calling is a hard problem
Conflict for using data to improve service
What metrics are sensitive and which ones can you use to improve?
Private 1:1 calls
Pass-through servers seem like a relic of Whatsapp starting with XMPP as a protocol back in the day
Multi-device diverges from modern XMPP though
See also later slide on challenges of client-centric multi device
Decoupled relay server
The Whatsapp stack still seems different from the Messenger one and does not even use “standard” terminology
Electing a common relay server seems wrong. ICE does not require that
Whatsapp seems to use a relay-first approach with opportunistic P2P
Disabling P2P for “strangers” is a very good practice
E2EE for media content
SRTP RFC 3711 does not provide E2EE. “master secret” is a very specific SRTP term. This is equivalent to SDES (boo) but is protected by E2EE (using the Signal protocol) which makes it ok-ish…
Having to generate different master secrets for different devices seems bad compared to DTLS-SRTP
It is concerning that Whatsapp continues to use SDES effectively and does not consider DTLS-SRTP (with its small setup latency) to be a solution
Identity is already a problem for chat messages. One wonders what percentage of sessions have a verified identity
Audio-video switch
A classic example of signaling glare
Unclear why a distributed consent algorithm is needed
The use-case for “oh my phone is an actual phone and can not do video” is shrinking
Multiparty
In XMPP terms the “group call storage” would be a MUC room
Selecting the best SFU makes more sense here than for relay servers
Warp protocol might be a frame header in the RTP payload before the actual codec payload
Unclear why the “master secret”, which is an SRTP term (and hence on the leg between client and SFU), needs to change when participants join or leave
Recruiting pitch at the end too!
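On the signaling glare in the audio-video switch: when both sides ask to upgrade at the same moment, a deterministic tie-break is usually enough for the two ends to agree without extra round trips. The sketch below is our own illustration and not the algorithm WhatsApp described; the device identifiers and the returned action names are hypothetical, and it assumes the two identifiers are never equal.

```python
def on_remote_upgrade_request(local_id, remote_id, local_request_pending):
    """Decide how to react when the peer asks to upgrade the call to video.

    Tie-break assumption: the side with the lexicographically smaller id wins,
    so both ends reach the same conclusion independently.
    """
    if not local_request_pending:
        return "answer_remote"                 # no glare, just answer
    if local_id < remote_id:
        return "keep_local"                    # we win, the peer will answer ours
    return "drop_local_and_answer_remote"      # we lose, withdraw our own request


# Both devices sent an upgrade request at the same time.
side_a = on_remote_upgrade_request("device-a", "device-b", local_request_pending=True)
side_b = on_remote_upgrade_request("device-b", "device-a", local_request_pending=True)
print(side_a, side_b)
```

Exactly one of the two requests survives, which is similar in spirit to the polite/impolite roles used in WebRTC’s perfect negotiation pattern.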
Group Call End-to-End Encryption and the Challenges of Encrypting Large Calls / Abo-Talib Mahfoodh @ Meta
Highly relevant if you are looking at E2EE for WebRTC
And another session with a mission and roadmap!
Recap of the SFU architecture and what it means for encryption
Where does frame encryption happen in the client pipeline
libwebrtc provides the FrameEncryptorInterface and FrameDecryptorInterface since 2018 but no implementations. Insertable Streams could not reuse those sync interfaces
Key negotiation
Sender key vs session key approaches
Session key is weaker than E2EE and only protects from the SFU which is still relevant in some use-cases
Note that the sender key is symmetric: all receivers must know it to decrypt, which means they could also encrypt with it. This is not a problem since the receivers cannot send media with the SSRCs of the sender, so impersonation is not possible
Joining the call requires a ratchet operation (which is cheap)
Someone leaving the call requires a rekey, which is O(n^2) and therefore expensive
Scaling group call E2EE
How large do you need to scale? A meeting with 100 participants is not “private” so session keys might be more appropriate
Prioritizing key exchange based on whether you are planning to send becomes important
Rekey is expensive and larger calls have a higher participant churn making this a hard problem. A small time window to batch this operation helps
Failure to deliver rekey messages is odd, signaling has to be reliable or something is wrong with your overall system
No recruiting pitch?!
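The cost difference between join (ratchet) and leave (rekey), and the effect of batching, can be illustrated with a small sketch. Everything here is an assumption for illustration only: the talk did not specify the ratchet function (we use HMAC-SHA256 with a fixed label), the message counts assume every sender key is delivered pairwise to every other participant, and the function names are ours.

```python
import hashlib
import hmac
import os


def ratchet(sender_key):
    """One-way step: the new key cannot be used to recover the old one.

    Assumption: HMAC-SHA256 with a fixed label stands in for the real ratchet.
    """
    return hmac.new(sender_key, b"ratchet", hashlib.sha256).digest()


def leave_rekey_messages(n_remaining):
    """Each remaining sender creates a fresh key and sends it to the n-1 others."""
    return n_remaining * (n_remaining - 1)


def rekey_rounds(leave_times_s, window_s):
    """Count rekey rounds when all leaves inside one time window share a rekey."""
    rounds = 0
    window_end = None
    for t in sorted(leave_times_s):
        if window_end is None or t > window_end:
            rounds += 1
            window_end = t + window_s
    return rounds


# Join: existing participants ratchet locally, which is a single hash operation.
old_key = os.urandom(32)
new_key = ratchet(old_key)

# Leave: the number of key messages grows quadratically with the call size.
small_call = leave_rekey_messages(10)
large_call = leave_rekey_messages(100)

# Five people leave in two bursts; a 2 second window batches them.
leaves = [0.0, 0.5, 1.2, 5.0, 5.3]
unbatched = rekey_rounds(leaves, window_s=0.0)
batched = rekey_rounds(leaves, window_s=2.0)
print(small_call, large_call, unbatched, batched)
```

The trade-off of the batching window is that a participant who already left can still decrypt media until the batched rekey completes.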
Live QA
Watch if you found the sessions worthwhile
Out-of-band FEC does not work for audio due to the latency increase. It works for video where frames are split into multiple packets
What you are seeing here isn’t a run-of-the-mill issue of the WebRTC Insights newsletter. It wasn’t even intended. But it does show the effort and focus we put on everything WebRTC for our clients. Watching a five hour event twice and producing actionable notes is not an easy task. It changed our weekend plans but we ended up being very satisfied with the results if only for our own notes.
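Why out-of-band FEC hurts audio latency but not video can be seen with a toy XOR parity scheme. This is a minimal sketch and not the FEC format used by any particular stack: it assumes equally sized packets, one parity packet per group sent right behind the last media packet, and at most one loss per group. The helper names and payloads are made up.

```python
def xor_parity(packets):
    """Build one parity packet over a group of equally sized packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single missing packet from the survivors plus the parity."""
    return xor_parity(list(received) + [parity])


def fec_wait_ms(group_size, packet_spacing_ms):
    """Worst-case wait for the parity when the first packet of a group is lost.

    Assumption: the parity packet is sent right behind the last media packet.
    """
    return (group_size - 1) * packet_spacing_ms


group = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(group)

# The second packet is lost; the receiver only has the other two and the parity.
recovered = recover_missing([group[0], group[2]], parity)

# Audio: one packet per 20 ms frame, so the group is spread over time.
audio_wait = fec_wait_ms(group_size=4, packet_spacing_ms=20.0)
# Video: the packets of one frame are sent back to back.
video_wait = fec_wait_ms(group_size=4, packet_spacing_ms=0.0)
print(recovered, audio_wait, video_wait)
```

In this model the audio receiver has to wait for later frames before it can repair an earlier one, while the packets protecting a video frame all arrive together.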