
Reliably Upgrading Cloud-Native Network Functions

Patrik Rokyta | CTO | July 30, 2024


Exploring Basic Methods and Their Pros, Cons, and Pitfalls

Ensuring the reliability and performance of cloud-native network functions is crucial. Equally important, however, is mastering the processes surrounding these functions. With the 5th generation of mobile networks introducing cloud-native applications, Kubernetes has emerged as the de facto standard for container orchestration, offering a robust platform for deploying, scaling, and managing containerized applications. Ensuring seamless and non-disruptive upgrades using methods implemented in Kubernetes is essential to maintaining service continuity and customer satisfaction.

Upgrading network functions and applications is not merely a routine maintenance task; it is a vital process that ensures the integration of new features, security patches, and performance improvements. However, given the critical nature of telecommunication services, combined with the complexity of mobile networks and the dynamic nature of Kubernetes environments, there are stringent requirements. These include preventing service downtime, minimizing service degradation, and ensuring that any issues during the upgrade process can be quickly detected and mitigated.

In this blog, we will explore three primary methods for upgrading network functions and applications running on Kubernetes:

  • Standard rolling updates
  • Canary rollouts
  • Blue-green deployments

Each method has its advantages and challenges. We will explore the nuances of each approach, offering insights and best practices for telecommunication networks. The example scenario used in this blog represents an instance of a cloud-native network function, the Domain Name System (DNS) server, processing 100,000 queries per second with no errors. The DNS server instance is deployed as a constellation of Kubernetes pods and scales dynamically between 10 and 25 pod replicas. Each pod replica can process approximately 5,000 queries per second.
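The example scenario can be expressed as a Kubernetes autoscaler definition. The sketch below is hypothetical (the name `dns-server`, the CPU target, and the metric choice are assumptions, not taken from the product); it only illustrates how the 10-to-25 replica range from the scenario would typically be configured:

```yaml
# Hypothetical HorizontalPodAutoscaler for the example DNS server:
# the deployment scales between 10 and 25 pod replicas, each handling
# roughly 5,000 queries per second.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dns-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dns-server
  minReplicas: 10
  maxReplicas: 25
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # assumed scaling trigger
```

At 100,000 queries per second, the autoscaler would settle at around 20 replicas in this scenario.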

Picture 1


Standard Rolling Updates


The rolling update is a commonly used strategy in Kubernetes, designed to update pod replicas incrementally. This method updates one pod or a few pods at a time, gradually replacing the old version with the new one. Each updated pod undergoes a health check before the next pod is updated.
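In Kubernetes, this behavior is driven by the Deployment's update strategy. A minimal sketch, with hypothetical names and image tags, might look as follows; `maxUnavailable` and `maxSurge` control how many pods are replaced at a time, and the readiness probe acts as the health check gating each replacement:

```yaml
# Hypothetical Deployment fragment showing a rolling update strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-server
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # take down at most one pod at a time
      maxSurge: 1         # create at most one extra pod during the update
  selector:
    matchLabels:
      app: dns-server
  template:
    metadata:
      labels:
        app: dns-server
    spec:
      containers:
        - name: dns
          image: registry.example.com/dns-server:2.0.0  # assumed image
          readinessProbe:       # health check gating the next replacement
            tcpSocket:
              port: 53
            periodSeconds: 5
```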

Picture 2
The key advantage of rolling updates is that they minimize service downtime: the update parameters ensure that the majority of pods keep serving traffic while the remaining pods are being updated. The traffic throughput of 100,000 queries per second can thus be maintained for the entire duration of the update process.

Despite this advantage, rolling updates pose a significant risk: the potential for an unstoppable propagation of a software bug across all pods of the application. If the new version of the application contains a critical bug, this bug can quickly spread to all pods, leading to widespread service degradation or even a complete outage of the network function. In the example scenario used in this blog, the upgraded DNS server application sends SERVFAIL responses to every received query, indicating a server-internal failure to deliver the expected service.

Picture 3
To mitigate the risks associated with rolling updates, it is essential to conduct thorough pre-deployment testing in staging environments that closely mirror production. It is equally crucial to monitor the production system closely for any signs of issues and to implement automated rollback mechanisms that revert to the previous stable software version when issues are detected. Another viable option is using canary rollouts.

Canary Rollouts

Canary deployments offer a more controlled approach to rolling out updates by initially exposing the new software version to a small portion of the network traffic. By limiting the exposure of the new software version, any issues are contained and affect only a small portion of traffic or users. Once the new version is validated, the number of canary pods can be gradually increased. Finally, a rolling update is triggered for the regular pods.

To deploy a canary version, a small percentage of pods is updated to the new software version. This can be a fixed number of pods or a percentage of the total pod count. In the example scenario used in this blog, the DNS server deployment adds one canary pod running the new software version.
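One common way to realize this in Kubernetes is a second, small Deployment whose pods carry the same service label as the regular pods. The fragment below is a hypothetical sketch (the names, labels, and image are assumptions); because the canary pod matches the existing Service selector on `app: dns-server`, it receives its share of the load-balanced traffic:

```yaml
# Hypothetical canary Deployment: one pod running the new version,
# sharing the "app" label with the regular pods so the existing
# Service load-balances traffic across both versions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-server-canary
spec:
  replicas: 1                    # a single canary pod
  selector:
    matchLabels:
      app: dns-server
      track: canary
  template:
    metadata:
      labels:
        app: dns-server          # matched by the Service selector
        track: canary            # distinguishes canary from regular pods
    spec:
      containers:
        - name: dns
          image: registry.example.com/dns-server:2.0.0  # assumed new version
```

The extra `track` label keeps the canary pods addressable on their own, for example for per-track metrics or for removing the canary from service.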

Picture 4

Canary rollouts have one notable pitfall: the deployed service may show no negative impact simply because no traffic is routed to the canary pod. One common cause is long-lived connections that keep all network traffic pinned to the regular pods. To maximize the benefits of canary deployments, it is therefore essential to implement automated monitoring and alerting that confirms network traffic is actually routed to the canary pod (as seen in the diagram above, where the additional pod receives traffic), and that quickly detects and responds to issues, for example by removing the canary pod from the service.

Blue-Green Rollouts

Blue-green deployments offer another approach to software updates by maintaining two identical environments: one (blue) serving production traffic and the other (green) being updated. In the example scenario used in this blog, the production traffic is processed by the 20 pod replicas of the blue deployment, while the 10 replicas of the green deployment remain idle, receiving no production traffic. Isolating the new software version in the green environment allows thorough testing, including performance testing, before it goes live.
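A common way to steer traffic between the two environments is a version label in the Service selector. The fragment below is a hypothetical sketch (names, labels, and ports are assumptions); only pods labeled `version: blue` receive production traffic, while the green pods stay isolated for testing:

```yaml
# Hypothetical Service routing production traffic to the blue environment.
# The blue and green Deployments carry the labels version: blue and
# version: green, respectively.
apiVersion: v1
kind: Service
metadata:
  name: dns-server
spec:
  selector:
    app: dns-server
    version: blue      # only the blue pods receive production traffic
  ports:
    - name: dns
      port: 53
      protocol: UDP
```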

Picture 5


Once the new software version in the green environment is validated, the production traffic is switched from blue to green, making the green environment the new production environment. Notably, during the scale-down stabilization window that may last for a few minutes, both environments will run with the same number of pods, doubling the amount of resources required to run the service.
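With the label-based routing sketched above, the cut-over amounts to repointing the Service selector from blue to green. A hypothetical strategic merge patch (assumed names and labels) could look like this:

```yaml
# Hypothetical patch switching production traffic from blue to green,
# applied for example with:
#   kubectl patch service dns-server --patch-file switch-to-green.yaml
spec:
  selector:
    app: dns-server
    version: green     # green becomes the new production environment
```

Rolling back is the mirror operation: reapplying the selector with `version: blue` returns traffic to the previous environment, which is the easy-rollback property blue-green deployments are valued for.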

Picture 6


One significant challenge with blue-green deployments is the auto-scaling lag when traffic is switched over. If the blue environment is handling a high volume of traffic, it is scaled out to accommodate this load. The green environment, being in standby, is scaled down. In our example, the blue environment comprises 20 pods that are required to process the production traffic, while the green environment comprises only 10 pods, which is the minimum number of pods configured in this example. The auto-scaling lag is 10 pods. When the production traffic is switched from the blue to the green environment, the green environment must rapidly scale up to handle the same volume of traffic. If the auto-scaling mechanisms or other Kubernetes services such as service discovery are not responsive enough, the traffic switch can result in performance bottlenecks and service disruptions.
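The autoscaler behavior fields introduced in `autoscaling/v2` offer one way to soften this lag. The sketch below is a hypothetical tuning (the names, policy values, and CPU target are assumptions): scale-up reacts immediately and may add many pods per period, while scale-down keeps the few-minute stabilization window:

```yaml
# Hypothetical HPA for the green environment, tuned to reduce the
# auto-scaling lag around the traffic switch.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dns-server-green
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dns-server-green
  minReplicas: 10
  maxReplicas: 25
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately after the switch
      policies:
        - type: Pods
          value: 10                     # allow adding up to 10 pods per period
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300   # a few-minute scale-down window
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70        # assumed scaling trigger
```

An alternative mitigation is to pre-scale the green environment (for example by temporarily raising `minReplicas` to match the blue pod count) shortly before switching traffic.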

Another risk with blue-green rollouts is the same potential for an unstoppable propagation of a software bug across all pods of the application, as discussed for standard rolling updates. In the example below, production traffic is switched from the green deployment back to the blue deployment, which now runs a software version containing a critical bug. The effect and impact are the same as observed during the standard rolling update, yet amplified by the auto-scaling lag.

Picture 8


Summary

Rolling updates offer a seamless, zero-downtime standard approach but risk widespread propagation of bugs. Blue-green deployments provide isolation and easy rollback mechanisms but do not solve the bug propagation problem and can additionally suffer from auto-scaling lag when switching traffic. Canary deployments strike a balance, allowing early detection of issues with minimal resource usage and minimal impact on users, and thus emerge as a particularly effective strategy that is far more resource-efficient than blue-green deployments. Moreover, canary deployments do not suffer from auto-scaling lag.

Titan.ium is deeply committed to ensuring seamless and reliable upgrades of its cloud-native network functions deployed within critical telecommunication networks. With a clear understanding of the diverse upgrade methods available, Titan.ium's cloud-native product portfolio supports standard rolling updates and is adding support for canary rollouts and blue-green deployments. Each method is fully supported by Helm charts integrated into the product delivery, empowering telecommunications network operators to efficiently manage updates while minimizing downtime and service disruptions. By leveraging the strengths of canary deployments, network operators can achieve reliable, efficient, and safe upgrades of cloud-native network functions deployed on Kubernetes. This recommended approach not only ensures high service availability and reliability but also demonstrates thought leadership in deploying cutting-edge technologies and methodologies in the telecommunications industry.


