Session Initiation Protocol (SIP)

Every day, we join real-time calls with family, friends, and colleagues, whether one-on-one or in groups, from home or the workplace. Have you ever wondered how these calls are set up so that you and the people you care about can share information? The technology that establishes a session for audio and video, manages it, and tears it down when the communication ends is called Session Initiation Protocol (SIP). SIP is used in Voice over Internet Protocol (VoIP), video conferencing, and other real-time services such as instant messaging.

What is SIP? 

SIP is a text-based signaling protocol for initiating, managing, and ending real-time communication sessions between participants. The Internet Engineering Task Force (IETF) standardized its current version in 2002 (RFC 3261) as an IP-based protocol. It works alongside other application layer protocols to manage multimedia communication sessions online. In other words, it allows two or more devices to communicate. A communication or call between participants is called a session.

The image above illustrates a basic call flow between two participants. The caller initiates a call with a Stream Video (SIP app). The system sends the caller's request to a SIP proxy server, which is responsible for forwarding the request to the receiver. When the request reaches the server, the server notifies the sender that the request was received with a status OK response.

To understand how SIP works under the hood, it helps to know the core functions that keep communication efficient and reliable. SIP uses the following mechanisms to establish and terminate real-time communication.

  • Device Location: Determines where participants' devices can be reached.
  • Device Availability: Determines whether participants' devices are available to take a call.
  • Endpoint Capabilities: Determines the capabilities of users' devices, such as codecs and bandwidth, to ensure the best possible communication experience.
  • Session Handling: Manages the SIP request-response sequence and handles messages such as Trying, OK, ACK, and BYE.
  • Session Configuration: Sets up the session parameters, maintains the session, and terminates it.

SIP Architecture Overview

SIP has a layered architecture made up of largely independent stages. Each layer has its own components and processes. Here is a breakdown of the main layers of a typical SIP structure.

  • Surface or Topmost Layer: This layer consists of the transaction user, which houses the application logic.
  • Transaction Layer: The second layer is responsible for transaction tasks such as request retransmission, timeouts, and matching responses to requests. A transaction consists of a client's request to a server and all responses the client receives from the server.
  • Transport Layer: The third layer is responsible for transmitting requests and responses over the network. It also handles the responses the server receives and passes them on to the recipient.
  • Lowest Layer: The lowest layer of SIP is its syntax and encoding, which is specified with an augmented Backus-Naur Form (ABNF) grammar.
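
To make the lowest layer more concrete, here is a minimal Python sketch of how a text-based SIP message can be split into a start line, headers, and body. The `SipMessage` class, the `parse_sip_message` function, and the sample addresses are hypothetical names chosen for illustration. The sketch ignores repeated headers, header folding, and compact header names, which a real parser would need to handle.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SipMessage:
    """Hypothetical container for one parsed SIP message."""
    start_line: str
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""

    @property
    def is_request(self) -> bool:
        # Responses start with the protocol version; requests start with a method.
        return not self.start_line.startswith("SIP/2.0")


def parse_sip_message(raw: str) -> SipMessage:
    """Split raw SIP text into start line, headers, and body."""
    # A blank line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return SipMessage(lines[0], headers, body)


# Illustrative request; all addresses and identifiers are made up.
RAW_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP client.example.org;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:alice@example.org>;tag=1928301774\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@client.example.org\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

message = parse_sip_message(RAW_INVITE)
print(message.start_line)
print(message.headers["cseq"])
```

Because the format is plain text, the same message can be read directly in a network capture or a log file.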

SIP Messages and Responses

Several requests and responses are exchanged during the initiation, control, and end phases of a SIP session. The table below summarizes the typical SIP messages exchanged between two participants for real-time voice and video calls.

| Message         | Description                                                                                   |
|-----------------|-----------------------------------------------------------------------------------------------|
| 100 Trying      | The server has received the invitation request and is trying to reach the recipient (callee). |
| 180 Ringing     | The call is ringing at the callee's end.                                                      |
| 200 OK (first)  | The call has been connected and answered successfully.                                        |
| ACK             | A request from the caller acknowledging the 200 OK, indicating the connection is established. |
| BYE             | A request sent by either participant to indicate the call is being terminated.                |
| 200 OK (second) | The system acknowledges that the call termination request has been received and processed.    |
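
The numeric responses in the table belong to broader classes identified by their first digit. The following Python sketch shows one way to classify a status code. The function names `response_class` and `is_final` are hypothetical, and the class labels are short paraphrases rather than official wording.

```python
# First digit of the status code -> short label for the response class.
RESPONSE_CLASSES = {
    1: "provisional",     # e.g. 100 Trying, 180 Ringing
    2: "success",         # e.g. 200 OK
    3: "redirection",
    4: "client error",
    5: "server error",
    6: "global failure",
}


def response_class(status_code: int) -> str:
    """Return the class label for a SIP status code."""
    if not 100 <= status_code <= 699:
        raise ValueError(f"not a SIP status code: {status_code}")
    return RESPONSE_CLASSES[status_code // 100]


def is_final(status_code: int) -> bool:
    """Provisional (1xx) responses report progress; anything else ends the transaction."""
    return response_class(status_code) != "provisional"


for code in (100, 180, 200):
    print(code, response_class(code), is_final(code))
```

In the call flow, 100 Trying and 180 Ringing only report progress, while 200 OK is the final answer to the request.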

Refer to the image below in the next section and the detailed explanations to understand how these responses occur sequentially in the SIP process.

SIP Request-Response Sequence 

This example illustrates a typical SIP request-response sequence in a video conferencing app. When a user taps the Start New Call button, the system sends an INVITE request to the SIP server, which then forwards it to the designated receiver. The client (request sender) gets a 100 Trying response to indicate that the connection attempt is in progress. When the request reaches the destination device and it starts ringing, a 180 Ringing response is sent from the callee to the caller. When the call is accepted, a 200 OK response is sent from the callee to the caller. The caller then sends an ACK to confirm that the session is established. At this point, the participants can begin conversing, with the media carried over the Real-time Transport Protocol (RTP). When one of the participants ends the call, that participant's device sends a BYE, and the other side answers with a 200 OK.
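
The sequence above can be sketched as a small state machine in Python. The state names in `CallState` and the `run_call` helper are simplifications invented for this illustration, not the states defined by the SIP specification, and the sketch covers only the successful path described in the paragraph.

```python
from enum import Enum


class CallState(Enum):
    IDLE = "idle"
    INVITED = "invited"          # caller has sent INVITE
    TRYING = "trying"            # server answered 100 Trying
    RINGING = "ringing"          # callee answered 180 Ringing
    ANSWERED = "answered"        # callee answered 200 OK
    ESTABLISHED = "established"  # caller confirmed with ACK; RTP media can flow
    ENDING = "ending"            # one side has sent BYE
    TERMINATED = "terminated"    # the other side answered 200 OK


# (current state, message seen) -> next state
TRANSITIONS = {
    (CallState.IDLE, "INVITE"): CallState.INVITED,
    (CallState.INVITED, "100 Trying"): CallState.TRYING,
    (CallState.TRYING, "180 Ringing"): CallState.RINGING,
    (CallState.RINGING, "200 OK"): CallState.ANSWERED,
    (CallState.ANSWERED, "ACK"): CallState.ESTABLISHED,
    (CallState.ESTABLISHED, "BYE"): CallState.ENDING,
    (CallState.ENDING, "200 OK"): CallState.TERMINATED,
}


def run_call(messages):
    """Walk through the messages in order and return the final call state."""
    state = CallState.IDLE
    for message in messages:
        try:
            state = TRANSITIONS[(state, message)]
        except KeyError:
            raise ValueError(f"unexpected {message!r} in state {state.name}") from None
    return state


HAPPY_PATH = ["INVITE", "100 Trying", "180 Ringing", "200 OK", "ACK", "BYE", "200 OK"]
print(run_call(HAPPY_PATH).name)  # TERMINATED
```

Note that the two 200 OK messages mean different things depending on the state: the first answers the INVITE, and the second answers the BYE.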

Refer to the SIP Messages and Responses section for a description of each message.

SIP Network Components

SIP relies on several network elements to operate. A SIP network consists of the following components.

  • User Agent (laptop, smartphone): A user agent is an endpoint that can act in two roles: as a client that sends requests and as a server that receives and answers them. In a typical SIP communication, an endpoint can begin, change, and end a session.
  • Proxy Server: A proxy server takes a request from one of the endpoints and forwards it to the other. There are two main categories of proxy servers. A stateful proxy server keeps track of requests and responses for later use. A stateless proxy server keeps no such state. When it receives a message from one of the endpoints, it forwards it to the other immediately without storing it.
  • Registrar: A registrar server accepts registration requests from users' devices and authenticates them. In the example below, the caller sends a registration request to the registrar. The registrar server confirms the registration by sending a 200 OK response back to the client.
  • Endpoint or Location Server: This server stores information about the user agents' locations and provides it to the proxy server when requested.
  • Redirect Server: As the name implies, the redirect server looks up the intended user agent's information in a database and returns it to the requester, which can then send the request to that address.
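
As a rough illustration of how the registrar, location server, and proxy cooperate, consider the Python sketch below. The classes `LocationService`, `Registrar`, and `StatelessProxy`, along with the sample addresses, are hypothetical. Authentication, binding expiry, and network transport are deliberately left out.

```python
from typing import Dict, Optional


class LocationService:
    """Stores where each user can currently be reached."""

    def __init__(self) -> None:
        self._bindings: Dict[str, str] = {}

    def bind(self, user_address: str, contact: str) -> None:
        self._bindings[user_address] = contact

    def lookup(self, user_address: str) -> Optional[str]:
        return self._bindings.get(user_address)


class Registrar:
    """Accepts registrations and records them in the location service."""

    def __init__(self, location: LocationService) -> None:
        self.location = location

    def register(self, user_address: str, contact: str) -> str:
        # A real registrar would authenticate the request before storing it.
        self.location.bind(user_address, contact)
        return "200 OK"


class StatelessProxy:
    """Forwards a request to the registered contact without keeping any state."""

    def __init__(self, location: LocationService) -> None:
        self.location = location

    def route(self, user_address: str) -> str:
        contact = self.location.lookup(user_address)
        if contact is None:
            return "404 Not Found"
        return f"forward to {contact}"


location = LocationService()
registrar = Registrar(location)
proxy = StatelessProxy(location)

print(registrar.register("sip:bob@example.com", "sip:bob@192.0.2.4:5060"))
print(proxy.route("sip:bob@example.com"))
print(proxy.route("sip:carol@example.com"))
```

A stateful proxy would additionally remember each forwarded request so it could match later responses or retransmit, which this sketch does not attempt.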

Advantages of SIP

SIP provides several benefits for real-time audio, video, and messaging communications. The following are key advantages of using SIP.

  • Communication: It helps to maintain seamless communication synchronization for audio and video calls. 
  • Setup: It has a simple setup process with straightforward commands, making troubleshooting easy. 
  • Format: SIP messages are textual, like those of the Hypertext Transfer Protocol (HTTP), which makes them easy to read and understand.
  • Maintenance: Since the protocol is textual, it is easy to maintain.
  • Scalability: The protocol can be scaled and deployed to millions of users, which makes it an excellent choice for enterprise settings.

Disadvantages of SIP

Although SIP provides several benefits for real-time communication and multimedia applications, it has the following potential issues.

The protocol relies on the bandwidth available to user agents, so it needs a mechanism that can automatically share bandwidth appropriately among endpoints. For example, long video calls and heavy instant messaging put pressure on participants' bandwidth.

SIP is also prone to security threats and attacks. Data breaches and unauthorized access to VoIP calls can occur in SIP applications that lack robust encryption, such as end-to-end encryption.

SIP Use Cases

SIP is generally used on IP networks. A wide range of real-time communication applications and digital services use SIP to enable seamless communication and interaction across multiple user agents and networks. SIP can be used in the following application areas.

  • File Transfer: In web and mobile apps, SIP can provide efficient, quick, and reliable file transfer between endpoints.
  • Online Gaming: SIP ensures smooth real-time audio/voice and video communication among multiple online game players.  
  • Livestreaming: In live video and voice content streaming, SIP enables the hosts to initiate a broadcast to a wide range of viewers and handles the streaming termination after some time.
  • Video Conferencing: SIP is used in virtual video conferencing apps like Zoom to help participants initiate and terminate calls.
  • Instant Messaging: SIP can support real-time messaging and attachment sharing in chat apps similar to WhatsApp, Telegram, and Messenger.

Frequently Asked Questions

What is a SIP Invite?

An Invite in SIP is a user agent's initial request to establish a communication session, like an audio/video call, by negotiating session parameters between endpoints.
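
For illustration, the Python sketch below assembles a minimal INVITE request whose body carries a session description offering one audio stream. The `build_invite` function, the addresses, and the port and codec values are assumptions made for this example. A production client would use a SIP library and add authentication, routing, and transport handling.

```python
import uuid


def build_invite(caller: str, callee: str, local_host: str, sdp: str) -> str:
    """Assemble the text of a minimal INVITE request (illustrative only)."""
    user = caller.split("@")[0]
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host};branch=z9hG4bK{uuid.uuid4().hex[:16]}",
        "Max-Forwards: 70",
        f"From: <sip:{caller}>;tag={uuid.uuid4().hex[:10]}",
        f"To: <sip:{callee}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{user}@{local_host}>",
        "Content-Type: application/sdp",
        # Content-Length counts the bytes of the body, not its characters.
        f"Content-Length: {len(sdp.encode('utf-8'))}",
    ]
    # Headers end with a blank line; the session description follows as the body.
    return "\r\n".join(lines) + "\r\n\r\n" + sdp


# Session description offering one audio stream; all values are made up.
SDP_OFFER = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 client.example.org",
    "s=-",
    "c=IN IP4 192.0.2.101",
    "t=0 0",
    "m=audio 49172 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
]) + "\r\n"

invite = build_invite("alice@example.org", "bob@example.com", "client.example.org", SDP_OFFER)
print(invite)
```

The session parameters being negotiated, such as the media type, port, and codec, live in the body, while the headers identify the caller, the callee, and the call itself.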

What is a “200 OK” response in SIP?

The "200 OK" message in SIP confirms that a client's SIP request was successfully received and accepted.

What is an “ACK” in SIP?

ACK stands for acknowledgment. It is a request the caller sends to confirm receipt of the final response to an INVITE, such as a 200 OK message.

How does a SIP session work?

A SIP session begins with an invitation, followed by session negotiation with response statuses such as 200 OK and acknowledgment ACK, and terminates with a BYE message.

Where can SIP be used?

SIP is used in VoIP calls, video conferencing, instant messaging, file transfers, online gaming, streaming, and more.

What is the difference between a SIP Proxy Server and Registrar Server?

  • Proxy Server: Routes SIP messages between endpoints and takes part in establishing a session.
  • Registrar Server: Manages the registration of user agents (endpoints) and keeps their contact information up to date.