SFU Cascading
Introduction to SFU Cascading
SFU cascading is a technique in video conferencing for managing and distributing media streams across multiple servers. It builds on the single-SFU (Selective Forwarding Unit) architecture covered in the previous lesson by connecting multiple SFUs in a mesh-like architecture. Each SFU in the mesh serves a specific subset of users, most often users from a single geographic region.
Cascading addresses several problems associated with a single SFU, including limited scalability, high latency, heavy server load, and capped participant capacity.
Stream’s Video API supports SFU cascading out of the box. In this lesson, we will look at how SFU cascading works and discuss some of the challenges developers can encounter when trying to implement cascading.
Disadvantages of Having a Single SFU
Relying on a single SFU can cause several issues, particularly in large-scale applications or those with geographically dispersed users.
Single point of failure: When working with only one SFU, failure of the SFU leads to a failure of the entire video call. Any technical issues affect all users associated with the SFU.
High latency: When users are geographically dispersed, some of them have to connect to an SFU that is far away, which increases latency. In use cases such as live streaming, latency is an important part of the user experience, and higher latencies can degrade it.
Low scalability: Since a single SFU handles all of the processing for the call, a large number of participants can overburden the server and reduce call quality. Latency, buffering, and dropped frames increase as the participant count grows and can tarnish the call experience for users.
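To make the scalability problem concrete, the sketch below counts the streams a single SFU has to send out. It assumes the simplest case, where the SFU forwards every publisher's stream to every other participant with no simulcast or subscription filtering. The function names and the 1.5 Mbps bitrate in the usage lines are illustrative assumptions, not measurements.

```python
from typing import Optional


def forwarded_streams(participants: int, publishers: Optional[int] = None) -> int:
    """Count outgoing streams on one SFU.

    Assumption: each publisher's stream is forwarded to every
    other participant in the call (participants - 1 receivers).
    """
    if publishers is None:
        publishers = participants  # everyone has their camera on
    if publishers > participants:
        raise ValueError("publishers cannot exceed participants")
    return publishers * (participants - 1)


def egress_mbps(participants: int, bitrate_mbps: float) -> float:
    """Total outgoing bandwidth if every stream uses the same bitrate."""
    return forwarded_streams(participants) * bitrate_mbps


if __name__ == "__main__":
    # 1.5 Mbps per stream is a hypothetical figure for illustration only.
    for size in (4, 10, 50):
        print(size, forwarded_streams(size), egress_mbps(size, 1.5))
```

With everyone publishing, the number of outgoing streams grows roughly with the square of the call size: 4 participants need 12 outgoing streams, while 50 participants need 2,450.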
How SFU Cascading Works
SFU cascading creates a mesh of interconnected SFUs, each capable of selectively forwarding media streams to the other SFUs in the network. The result is a distributed, scalable solution for resilient, low-latency video calling.
The SFUs in the mesh are usually placed in different geographical locations, allowing users around the world to connect to the closest node. This reduces latency and distributes the load of the video call across multiple nodes rather than placing it on a single one. The node that a user is connected to can forward that user's media streams to all other connected SFUs, which then deliver them to their own participants.
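The flow described above can be sketched in a few lines. This is a minimal in-memory model, not a real media server: the `SFU` class, `connect_mesh`, `closest_sfu`, and the region names are all hypothetical, and "closest" is approximated by the lowest measured round-trip time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(eq=False)
class SFU:
    region: str
    participants: Set[str] = field(default_factory=set)
    peers: List["SFU"] = field(default_factory=list)
    # Log of (receiver, sender, packet) tuples, standing in for real delivery.
    delivered: List[Tuple[str, str, str]] = field(default_factory=list)

    def publish(self, sender: str, packet: str) -> None:
        """Handle media from a participant connected to this node."""
        self._deliver_locally(sender, packet)
        # Forward once to each peer SFU; the peer fans out to its own users.
        for peer in self.peers:
            peer.receive_from_peer(sender, packet)

    def receive_from_peer(self, sender: str, packet: str) -> None:
        # No re-forwarding: in a full mesh the origin already reached every peer.
        self._deliver_locally(sender, packet)

    def _deliver_locally(self, sender: str, packet: str) -> None:
        for participant in sorted(self.participants):
            if participant != sender:
                self.delivered.append((participant, sender, packet))


def connect_mesh(sfus: List[SFU]) -> None:
    """Link every SFU to every other SFU."""
    for sfu in sfus:
        sfu.peers = [other for other in sfus if other is not sfu]


def closest_sfu(rtt_ms: Dict[str, float], sfus: Dict[str, SFU]) -> SFU:
    """Pick the SFU with the lowest round-trip time for this user."""
    return sfus[min(rtt_ms, key=rtt_ms.get)]


if __name__ == "__main__":
    sfus = {name: SFU(name) for name in ("us-east", "eu-west", "ap-south")}
    connect_mesh(list(sfus.values()))
    alice = closest_sfu({"us-east": 20, "eu-west": 95, "ap-south": 210}, sfus)
    bob = closest_sfu({"us-east": 110, "eu-west": 15, "ap-south": 140}, sfus)
    alice.participants.add("alice")
    bob.participants.add("bob")
    alice.publish("alice", "frame-1")
    print(bob.delivered)
```

Note that the publishing node sends each packet to a peer SFU once, no matter how many participants are connected to that peer. This is what keeps the long-distance links from carrying one copy per viewer.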
Advantages of SFU Cascading
Lower latency: Since SFUs can be placed in several geographic locations, media streams can be routed closer to participants, minimizing latency and increasing responsiveness.
Better reliability: By distributing the load and eliminating a single point of failure, SFU cascading enhances the overall reliability of the communication system. If one SFU experiences issues, the others can seamlessly take over, preventing disruptions to the ongoing sessions.
Better scalability: By splitting participants across many SFUs, calls can handle a larger number of participants. This allows better scalability when building for users worldwide and/or for larger use cases like live streaming.
Load balancing: Since several SFUs are in the hierarchy, if one node faces overload, users can be transferred to another to balance out the load over the system. This makes sure no SFU gets overburdened and reduces overall rates of failure.
Added optimizations: Several optimizations can be carried out with SFU cascading. For example, media streams can be routed along an optimal path through the mesh to reduce travel time and ensure an optimal experience.
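The reliability and load-balancing points above can be illustrated with a small placement routine. This is a sketch under simple assumptions: each SFU has a fixed participant capacity, and a user should land on the lowest-latency node that is healthy and not full. The names `pick_sfu` and `reassign_after_failure` are hypothetical, and a production system would also have to migrate the media connections themselves.

```python
from typing import Dict, Iterable, Optional


def pick_sfu(
    rtt_ms: Dict[str, float],
    load: Dict[str, int],
    capacity: Dict[str, int],
    unhealthy: Iterable[str] = (),
) -> Optional[str]:
    """Return the lowest-RTT SFU that is healthy and has spare capacity."""
    down = set(unhealthy)
    candidates = [
        name
        for name in rtt_ms
        if name not in down and load.get(name, 0) < capacity[name]
    ]
    if not candidates:
        return None  # nowhere to place this user
    return min(candidates, key=lambda name: rtt_ms[name])


def reassign_after_failure(
    failed: str,
    assignments: Dict[str, str],
    user_rtts: Dict[str, Dict[str, float]],
    load: Dict[str, int],
    capacity: Dict[str, int],
) -> None:
    """Move every user on the failed SFU to their next-best node (in place)."""
    for user, node in list(assignments.items()):
        if node != failed:
            continue
        load[failed] -= 1
        target = pick_sfu(user_rtts[user], load, capacity, unhealthy=[failed])
        if target is None:
            del assignments[user]  # no capacity left anywhere
        else:
            assignments[user] = target
            load[target] = load.get(target, 0) + 1
```

A full node is skipped even when it is the closest one, which trades a little latency for one user against call quality for everyone already on that node.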
Disadvantages of SFU Cascading
Complexity: Setting up an SFU hierarchy is significantly harder than setting up a single SFU. Simply connecting the SFUs is only the first part of the puzzle. Getting all streams and participant connections working perfectly is harder to do with cascading SFUs than with a single SFU. Measures to add redundancy to this hierarchy can take time, and several tries are needed to get it right.
Added latency over P2P: While overall latency is reduced in a call architecture involving SFU cascading, more connections are involved in getting a video stream from a sender to a receiver. This can increase latency between two participants compared to other architectures, such as Peer-to-Peer (P2P).
How Stream uses SFU Cascading
A video-calling experience based on WebRTC is notoriously difficult to get right, and building it in-house is a massive task for any development team. There are several companies that have spent a lot of time and resources trying to build a solid calling experience and still didn’t get it quite right. There are also other variations of products based on WebRTC, such as live streaming and audio calling, that contain their own nuances.
For us, it was almost obvious that we needed to make creating video experiences easier by building a robust video backend, SDKs for all popular platforms, and components that cover all aspects of a video calling experience. We also decided to cover all popular use cases, such as meet-style video calling, end-to-end calling experiences that integrate with OS call systems, live streaming (both WebRTC and HLS-based), and audio rooms.
Building all of these features while ensuring scalability required a lot of thought about the infrastructure behind our systems. We considered several architectures but eventually settled on SFUs with cascading as our primary choice. This was important because users around the world connecting to a single SFU would be problematic and would increase latency for everyone on the call. The SFUs in our systems work like a mesh: each SFU relays information about the participants connected to it to all other SFUs. This ensures we can do adequate load balancing and add redundancy while reducing latency for all users.
There were several technical challenges along the way. For one, making the SFUs relay information to all other nodes is not easy. We used Redis streams to publish updates to the call state and made every node check for new updates every 100ms to maintain a valid call state. The DTLS (Datagram Transport Layer Security) streams used to transfer video and audio between SFUs are difficult to work with. Distinguishing all tracks and layers sent across SFUs was critical and needed some work to achieve. Debugging SFU issues is difficult on its own, and adding cascading adds another layer of complexity. Then there was the issue of bandwidth: video is egress-heavy since every incoming video stream needs to be sent to all other participants, increasing the outgoing bandwidth requirements. There were also additional challenges, such as adding a congestion control algorithm to the cascading implementation.
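The call-state polling described above can be sketched as follows. This is not our production code: to keep the example self-contained, `InMemoryStream` is a stand-in for a Redis stream (its `add` and `read_after` methods play the roles of the Redis `XADD` and `XREAD` commands), and the event fields `type`, `user`, and `node` are a hypothetical schema chosen for illustration.

```python
import itertools
import time
from typing import Dict, List, Tuple

POLL_INTERVAL_S = 0.1  # every node checks for updates every 100 ms


class InMemoryStream:
    """Stand-in for a Redis stream: an append-only log with increasing IDs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, dict]] = []
        self._ids = itertools.count(1)

    def add(self, fields: dict) -> int:
        entry_id = next(self._ids)
        self._entries.append((entry_id, dict(fields)))
        return entry_id

    def read_after(self, last_id: int) -> List[Tuple[int, dict]]:
        return [(i, f) for i, f in self._entries if i > last_id]


class CallStateReplica:
    """One SFU node's local view of who is in the call and where."""

    def __init__(self, stream: InMemoryStream) -> None:
        self.stream = stream
        self.last_id = 0  # ID of the last update this node has applied
        self.participants: Dict[str, str] = {}  # user -> SFU node

    def poll_once(self) -> int:
        """Apply all updates published since the last poll; return the count."""
        updates = self.stream.read_after(self.last_id)
        for entry_id, update in updates:
            if update["type"] == "joined":
                self.participants[update["user"]] = update["node"]
            elif update["type"] == "left":
                self.participants.pop(update["user"], None)
            self.last_id = entry_id
        return len(updates)


def run(replica: CallStateReplica, rounds: int,
        interval_s: float = POLL_INTERVAL_S) -> None:
    """Poll a fixed number of times, sleeping between polls."""
    for _ in range(rounds):
        replica.poll_once()
        time.sleep(interval_s)
```

Because each node only remembers the ID of the last update it applied, a node that starts late or falls behind catches up by replaying the log, and every node converges on the same call state.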
There were other technical challenges that are too numerous to detail here, but the gist is that building video is a complex challenge. We pushed hard to make sure any development team can build the video experiences they want without facing the complex issues associated with WebRTC and perfecting cascading.