Opus Discontinuous Transmission (DTX) - What is it and how does it work?

Overview of the Opus Audio Codec

The Opus Audio Codec is a highly versatile open-source audio codec used for real-time communication as well as offline storage of audio files. Several of its features are appealing when building a video product.

These include:

High audio quality: Offers high-quality audio at a wide range of bitrates (6 kbps to 510 kbps)
Low latency: Supports frame sizes ranging from 2.5 ms to 60 ms, so it can be tuned for various low-latency use cases
Open-source: Free to use for both commercial and non-commercial use cases
Supports both constant and variable bitrates: Can use variable bitrates for changing network conditions or constant bitrates for predictable bandwidth use
Multi-channel support: Supports up to 255 channels for multi-stream applications

What is Opus DTX?

Discontinuous Transmission, or DTX, is a feature of the Opus encoder that reduces the bitrate used during periods of silence or low background noise. It is a valuable tool for applications where efficient use of bandwidth is critical, because it can significantly reduce audio bitrate during silence without sacrificing call quality or introducing noticeable artifacts.
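
To get a feel for the potential saving, here is a minimal sketch that averages the bitrate over a call in which some fraction of the time is silent. The function name and the example figures (32 kbps while talking, 1 kbps during silence, 60% silence) are assumptions chosen for illustration, not measurements of Opus.

```python
def average_bitrate_kbps(active_kbps, dtx_kbps, silence_fraction):
    """Average bitrate of a stream that is silent for `silence_fraction` of the call.

    active_kbps: bitrate while the participant is speaking
    dtx_kbps: bitrate of the sparse updates sent during silence (assumed figure)
    """
    if not 0.0 <= silence_fraction <= 1.0:
        raise ValueError("silence_fraction must be between 0 and 1")
    return active_kbps * (1.0 - silence_fraction) + dtx_kbps * silence_fraction


# Hypothetical call: 32 kbps speech, 1 kbps during silence, silent 60% of the time.
without_dtx = average_bitrate_kbps(32.0, 32.0, 0.6)
with_dtx = average_bitrate_kbps(32.0, 1.0, 0.6)
print(f"without DTX: {without_dtx:.1f} kbps, with DTX: {with_dtx:.1f} kbps")
```

The more often a participant is muted or listening, the closer the average gets to the silence bitrate, which is why DTX matters most in calls with many listeners.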

Working With DTX

Under the hood, the Opus audio codec combines two different codecs for different purposes. The SILK codec developed by Skype is used for speech compression, while the CELT (Constrained Energy Lapped Transform) codec developed by Xiph.Org is used for music and other general audio.

Let’s go through the stages of adding DTX to an audio stream.

Silence Detection

Both SILK and CELT have built-in mechanisms to detect periods of silence within the audio stream.

SILK analyses the overall energy level of the incoming audio. If the energy falls below a predefined threshold, the frame is treated as a potentially silent segment. During these low-energy periods, SILK also evaluates the spectral characteristics of the audio: looking at how the sound is distributed across different frequencies helps to differentiate true silence from faint background noise with minimal tonal components. A further indicator is the zero-crossing rate, the rate at which the audio signal crosses the zero amplitude level. It tends to be high for noise-like background sound and low for voiced speech, so it helps to separate speech from background noise.
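
The energy and zero-crossing measurements described above can be sketched in a few lines. This is an illustration of the general technique, not the code SILK actually runs; the function names, the -50 dB threshold, and the 16 kHz test signals are all assumptions made for the example.

```python
import math
import random


def frame_energy_db(samples):
    """Mean-square energy of a frame in dB, for samples scaled to [-1.0, 1.0]."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # The small offset avoids log10(0) for a frame of pure digital silence.
    return 10.0 * math.log10(mean_square + 1e-12)


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


def looks_silent(samples, energy_threshold_db=-50.0):
    """Flag a frame as potentially silent when its energy is below the threshold."""
    return frame_energy_db(samples) < energy_threshold_db


# Two 20 ms test frames at 16 kHz: a loud 200 Hz tone and very quiet hiss.
rng = random.Random(0)
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]
hiss = [rng.uniform(-0.001, 0.001) for _ in range(320)]

print(looks_silent(tone), looks_silent(hiss))
print(zero_crossing_rate(tone), zero_crossing_rate(hiss))
```

In this sketch the hiss is flagged as silent because of its low energy, and its zero-crossing rate is much higher than that of the tone, which shows how the two measurements complement each other.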

CELT, by contrast, employs statistical models of audio signals to estimate the probability of silence at any given point in the stream. It analyses the variance of the audio signal, which measures the spread of values around the mean. Lower variance typically indicates a more uniform signal, suggesting potential silence. CELT also calculates the spectral entropy of the audio, which measures the overall randomness of the frequency distribution. Low entropy generally signifies a less complex sound, which, combined with low variance, can hint at silence.
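
To make the two statistics concrete, here is a small sketch that computes the variance of a frame and the spectral entropy of its power spectrum. It uses a plain (slow) discrete Fourier transform from the standard library so that it stays self-contained. The function names and test signals are assumptions for illustration, and this is not how CELT itself is implemented.

```python
import cmath
import math
import random


def variance(samples):
    """Spread of the sample values around their mean."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def spectral_entropy(samples):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        # Naive DFT bin k; fine for short frames, use an FFT for real workloads.
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # digital silence carries no spectral information
    probabilities = [p / total for p in power if p > 0.0]
    return -sum(p * math.log2(p) for p in probabilities)


rng = random.Random(0)
tone = [math.sin(2 * math.pi * 8 * t / 160) for t in range(160)]
noise = [rng.gauss(0.0, 1.0) for _ in range(160)]

print(variance(tone), spectral_entropy(tone))
print(variance(noise), spectral_entropy(noise))
```

Note that a pure tone produces a very low entropy and broadband noise a high one, so entropy on its own does not identify silence. It is only useful alongside a level measurement such as variance or energy.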

Adding Comfort Noise

If Opus stopped transmitting the audio stream entirely, participants on the other end might no longer feel the presence of the participant sending the stream. Instead, when silence is detected, Opus switches from sending regular audio frames to transmitting smaller "comfort noise" frames. Opus doesn't send these frames continuously during silence. It transmits them periodically, typically every 400 milliseconds, for as long as the silence persists. This balances saving bandwidth and avoiding abrupt transitions back to speech when the silence ends.
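
The sending pattern described above can be sketched as a small scheduler that decides, frame by frame, whether to send regular audio, a comfort noise update, or nothing. The function name, the 20 ms frame size, and the decision labels are assumptions for this example. A real encoder also waits for a short run of inactive frames before entering DTX, which this sketch leaves out for brevity.

```python
FRAME_MS = 20      # assumed frame duration
REFRESH_MS = 400   # comfort noise refresh interval mentioned in the text


def frames_to_send(silence_flags, frame_ms=FRAME_MS, refresh_ms=REFRESH_MS):
    """Return "audio", "comfort", or None (nothing sent) for each input frame."""
    decisions = []
    in_dtx = False
    ms_since_last_packet = 0
    for silent in silence_flags:
        if not silent:
            # Speech: always send a regular audio frame and leave DTX.
            decisions.append("audio")
            in_dtx = False
            ms_since_last_packet = 0
        elif not in_dtx:
            # First silent frame: send one comfort noise update, then go quiet.
            decisions.append("comfort")
            in_dtx = True
            ms_since_last_packet = 0
        else:
            ms_since_last_packet += frame_ms
            if ms_since_last_packet >= refresh_ms:
                decisions.append("comfort")
                ms_since_last_packet = 0
            else:
                decisions.append(None)
    return decisions


# 100 ms of speech followed by one second of silence.
decisions = frames_to_send([False] * 5 + [True] * 50)
sent = sum(1 for d in decisions if d is not None)
print(f"{sent} packets sent for {len(decisions)} frames")
```

With 20 ms frames, a stream without DTX sends 50 packets per second. In this sketch a silent second costs only a handful of small comfort noise packets.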

The specific approach for generating comfort frames differs for SILK and CELT:

SILK: Generates comfort noise based on the previous speech samples. It models the background noise surrounding the speaker during the last spoken segment, providing a smooth transition back to speech when it resumes.
CELT: Employs statistical methods to generate a neutral, low-energy signal approximating background noise characteristics. This approach works well with diverse background environments and avoids introducing artifacts from previous speech patterns.
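
As a rough illustration of comfort noise generation, the sketch below produces random noise whose loudness (RMS level) matches a short reference recording of the background. This only matches the level. A real decoder would also shape the spectrum of the noise, and the function names here are assumptions for the example rather than anything from the Opus source.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def comfort_noise(reference_noise, n_samples, rng=None):
    """Generate white noise with the same RMS level as `reference_noise`."""
    rng = rng or random.Random()
    target = rms(reference_noise)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    raw_rms = rms(raw)
    if raw_rms == 0.0:
        return [0.0] * n_samples
    scale = target / raw_rms
    return [s * scale for s in raw]


# Hypothetical background samples captured just before the silence started.
background = [0.01, -0.02, 0.015, -0.005]
frame = comfort_noise(background, 480, random.Random(1))
print(rms(background), rms(frame))
```

Because the generated frame sits at the same level as the real background, the listener hears a steady ambience instead of the line going completely dead.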

Handling Transitions

The hybrid mode of Opus utilizes the strengths of both SILK and CELT to handle transitions efficiently.

If the first few speech frames indicate a continuation of the previous speaker's voice, SILK might be chosen, since it handles speech signals with similar characteristics efficiently. When silence ends and speech begins, SILK applies a crossfade window to both the comfort noise and the first speech frame. This gradually blends the two, smoothing the transition and avoiding abrupt jumps.

If the voice is different or the sound is more complex, CELT might take over with its wider range of capabilities for handling diverse audio textures. CELT can dynamically adjust its bitrate based on the complexity of the audio. This allows for a smoother transition back to speech, as the codec can allocate more bits to capture the initial transient sounds accurately.
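
The crossfade mentioned above can be sketched as a simple linear blend between the end of the comfort noise and the start of the speech. The linear window and the function name are assumptions for illustration. A production codec would typically use a smoother window shape.

```python
def crossfade(comfort_tail, speech_head):
    """Linearly blend from comfort noise into speech over equally long frames."""
    if len(comfort_tail) != len(speech_head):
        raise ValueError("both frames must have the same length")
    n = len(comfort_tail)
    blended = []
    for i in range(n):
        # Weight rises from 0.0 (all comfort noise) to 1.0 (all speech).
        weight = i / (n - 1) if n > 1 else 1.0
        blended.append((1.0 - weight) * comfort_tail[i] + weight * speech_head[i])
    return blended


# Constant test frames make the ramp easy to see.
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.0]
```

The first output sample is pure comfort noise and the last is pure speech, so there is no sudden jump in level at either end of the blend.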

Advantages of DTX

Reduced bandwidth consumption: This is the main benefit of DTX, especially for applications like video conferencing, where silence is common.
Improved call quality on unreliable networks: By sending fewer packets, DTX helps manage the impact of packet loss, leading to less choppy audio.
Lower latency: With fewer packets to send and receive, queues on congested links are less likely to build up, so the audio delay can be reduced.

Disadvantages of DTX

Increased compute power: Both the encoder and decoder need to handle DTX logic, which can require slightly more processing power than regular Opus encoding and decoding.
Potential artifacts during transitions: There may be audio glitches when switching between silence and speech, especially with low-quality microphones.

How Stream Video uses DTX

DTX allows us to keep the audio quality high while dramatically reducing the data that participant streams consume. This leads to crystal-clear calls on slower connections and fewer dropped frames. DTX also allows us to scale other types of calls, such as audio rooms, to many participants.

Stream Video, as well as all our Video SDKs, uses Opus DTX by default.
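
If you work with a plain WebRTC stack rather than an SDK that enables DTX for you, the feature is requested through the "usedtx=1" parameter on the Opus fmtp line of the SDP, as defined in the RTP payload format for Opus (RFC 7587). The sketch below shows one way to add it with simple string processing. The function name and the sample SDP are assumptions for illustration, and this is not a description of how Stream's SDKs do it internally.

```python
import re


def enable_opus_dtx(sdp):
    """Append usedtx=1 to the fmtp line of every Opus payload type in an SDP."""
    lines = sdp.split("\r\n")
    # Collect the payload type numbers that are mapped to Opus.
    opus_payload_types = set()
    for line in lines:
        match = re.match(r"a=rtpmap:(\d+) opus/48000", line, re.IGNORECASE)
        if match:
            opus_payload_types.add(match.group(1))
    result = []
    for line in lines:
        match = re.match(r"a=fmtp:(\d+) (.*)", line)
        if match and match.group(1) in opus_payload_types and "usedtx=" not in match.group(2):
            line = f"{line};usedtx=1"
        result.append(line)
    return "\r\n".join(result)


# Hypothetical audio section of an SDP offer.
offer = "\r\n".join([
    "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
    "a=rtpmap:111 opus/48000/2",
    "a=fmtp:111 minptime=10;useinbandfec=1",
    "a=rtpmap:0 PCMU/8000",
])
print(enable_opus_dtx(offer))
```

This sketch only modifies an fmtp line that already exists. If the Opus payload type has no fmtp line, one would need to be added, which is left out here to keep the example short.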

Conclusion

In this lesson, you have learned about the DTX feature of the Opus audio codec, which helps save bandwidth by reducing the bitrate in periods of silence. While Opus is a highly efficient audio codec, there are also other audio codecs that may better suit your calling needs. Later in this module, we will also learn about adding the RED protocol to the Opus codec, which helps transmit audio more reliably in bad network conditions.