Patrik Rokyta, CTO, Titan.ium Platform
Cloud-native Network Functions (CNFs) are at the core of modern telecom networks. They are designed as collections of microservices packaged into lightweight containers, with their lifecycle largely orchestrated by platforms such as Kubernetes. For both communication service providers and CNF vendors, mastering the interplay between CNFs and Kubernetes is not just a technical necessity; it is essential for ensuring reliable, scalable, and resilient services in today’s high-demand, cloud-driven environments. This blog explores the key dynamics of CNFs, focusing on how they shape critical aspects of microservice operation, including liveness, readiness, scaling, and graceful shutdown, while supporting high-throughput, low-latency signaling.
Modern signaling systems increasingly rely on a microservices architecture, where each component focuses on a specific function and communicates with others through standardized APIs and signaling protocols. As 5G stand-alone networks continue to expand, certain protocols, such as HTTP/2 and DNS, have emerged as the primary methods for handling signaling within mobile core networks. In this blog, we will focus on how these protocols interact with each other, as illustrated in the diagram below showing an exemplary telecommunication service. The use case could involve a DNS authoritative server, an ENUM server, or a Number Portability server, while the architecture is primarily intended to provide a clear view of the various underlying communication flows. These flows utilize DNS over UDP, DNS over TCP, DNS over HTTP/2 (DoH), JSON over HTTP/2, and gRPC over HTTP/2 protocols between clients, proxies, interworking functions (IWF), and application servers.
When designing high-performance CNFs, several important factors require attention, including:
- Managing capacity and scaling
- Distributing traffic load
- Handling protocol tunneling and interworking
- Minimizing service latency
- Ensuring overall service reliability
Depending on the transport protocol (UDP or TCP), some factors may be more or less relevant.
Capacity and Scaling
A fundamental challenge in the design of high-performance CNFs lies in sustaining scalability under fluctuating traffic volumes while maintaining service responsiveness and reliability. Addressing this challenge requires coordinated consideration of signaling protocols, computational resources, and traffic management mechanisms.
DNS signaling provides a representative case study, as it frequently employs the lightweight UDP transport protocol to achieve low latency and minimal overhead. Despite the lack of delivery guarantees, UDP remains prevalent in scenarios such as service discovery, the mapping of telephone numbers to SIP addresses (ENUM), and the retrieval of routing information for ported numbers (Number Portability). In our example, DNS over UDP illustrates the practical requirements for achieving scalability between the DNS clients and the DNS proxy (acting as server).
When defining the size of a scalable unit that represents a DNS client or server, it is important to balance performance with resource allocation. Each unit must meet SLAs such as message throughput and service latency while operating within dedicated CPU and memory limits. If the unit is too small (for example, processing fewer than 100 messages per second), many instances are required to handle overall traffic, which increases deployment and management overhead. If the unit is too large (for example, processing more than 100,000 messages per second), resource contention may arise. This risk is amplified by factors such as task starvation in concurrent scheduling, which may result in latency spikes and service degradation.
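To make the sizing trade-off concrete, here is a minimal Go sketch that derives the required instance count from an assumed per-instance rating. The 20,000 messages-per-second rating and the 50-instance management threshold are illustrative assumptions; the 100 and 100,000 messages-per-second extremes mirror the examples above.

```go
package main

import (
	"fmt"
	"math"
)

// instancesNeeded returns how many identically rated units are required to
// carry totalRate messages per second, given the per-instance rating.
func instancesNeeded(totalRate, perInstanceRate float64) int {
	return int(math.Ceil(totalRate / perInstanceRate))
}

func main() {
	const total = 100_000.0 // aggregate load in messages per second

	// Candidate unit sizes: the middle value is an assumed rating, the outer
	// two correspond to the "too small" and "too large" extremes in the text.
	for _, unit := range []float64{100, 20_000, 150_000} {
		n := instancesNeeded(total, unit)
		switch {
		case n > 50: // assumed threshold for "too many instances to manage"
			fmt.Printf("unit=%7.0f msg/s -> %4d instances (too small: management overhead)\n", unit, n)
		case unit > 100_000:
			fmt.Printf("unit=%7.0f msg/s -> %4d instances (too large: contention and latency risk)\n", unit, n)
		default:
			fmt.Printf("unit=%7.0f msg/s -> %4d instances (practical middle ground)\n", unit, n)
		}
	}
}
```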
The diagram below illustrates an example in which 100,000 messages per second are sent from 15 DNS client instances (on the left) to a single DNS server (on the right). Each message uses a unique source port, generating a large number of in-flight objects that must be stored and tracked by the server. This configuration is typical of legacy active-standby designs, in which a single active server handles all traffic while the standby instance remains idle, making the system particularly sensitive to unexpected latency increases, which can ultimately lead to memory exhaustion.
By contrast, the same SLA can be achieved with a DNS server composed of multiple instances. The diagram below illustrates a setup in which 100,000 messages per second are sent to five DNS server instances operating behind a network load balancer (NLB). This configuration exemplifies an active-active N+M redundancy model, in which all server instances handle traffic concurrently and additional capacity is available to accommodate infrastructure node failures or planned maintenance.
As seen in the diagram, traffic from external clients (on the left) with fixed or only minimally varied source ports may still be unevenly distributed across server instances (on the right) due to the hashing algorithm used by the network load balancer. To mitigate this effect, DNS clients require sufficient source port variation, which allows the NLB to spread traffic more evenly across both existing and newly added server instances.
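The effect of source-port variation on hash-based load balancing can be illustrated with a short simulation. The sketch below uses FNV-1a over the source IP and port as a stand-in for the NLB’s hashing algorithm, with five backend instances; both choices are assumptions made for illustration, not the actual load balancer behavior.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const backends = 5 // number of DNS server instances behind the NLB

// pick maps a (source IP, source port) pair to a backend index, mimicking a
// hash-based load balancer. FNV-1a stands in for the NLB's real algorithm.
func pick(srcIP string, srcPort int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%d", srcIP, srcPort)
	return int(h.Sum32() % backends)
}

// distribute sends 100,000 simulated messages and counts hits per backend.
func distribute(randomPorts bool) [backends]int {
	var hits [backends]int
	for i := 0; i < 100_000; i++ {
		port := 53000 // fixed source port: every message hashes identically
		if randomPorts {
			port = 1024 + rand.Intn(64000) // randomized source ports
		}
		hits[pick("192.0.2.10", port)]++
	}
	return hits
}

func main() {
	fmt.Println("fixed source port:     ", distribute(false))
	fmt.Println("randomized source port:", distribute(true))
}
```

With a single fixed source port, every message lands on the same backend; with randomized ports, the counters come out roughly even, which is exactly the property that lets newly added server instances pick up their share of traffic.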
Traffic Distribution
After ensuring that a signaling service can scale to meet demand, it is important to consider how traffic is distributed between the CNF components. In practice, the difference between UDP and TCP traffic is less about persistence and more about how messages or connections affect load distribution and resource usage. With UDP, messages are independent at the transport layer, but those from the same client IP and port often reach the same server listener, creating an apparent persistence. Source port randomization can introduce additional dynamics, helping to distribute traffic more evenly across server listeners. With TCP, each client connection establishes a session that a dedicated server instance tracks for its duration, so all messages from that connection are handled by the same server listener.
TCP is preferred in scenarios where reliable, ordered delivery is required, or when responses exceed UDP size limits or require encryption. These characteristics make it a strong choice for scaling signaling services across CNF components. In containerized environments such as Kubernetes, traffic distribution can be optimized using a headless service. In this setup, the client maintains TCP connections to all server instances and can select which instance to target for each message. This allows per-message load balancing across the server instances, ensuring even resource utilization and preventing individual server instances from becoming overloaded during scaling events or rolling updates. While similar outcomes can also be achieved with service meshes, operators often avoid them due to the additional computational overhead involved.
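A minimal sketch of this client-side pattern is shown below, assuming a hypothetical headless service named enum-frontend.core.svc.cluster.local that exposes DNS over TCP on port 53. Resolving the service name returns one address per ready pod, and the client round-robins messages across persistent connections; re-resolution on scaling events, DNS message framing, and error recovery are omitted for brevity.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// Hypothetical headless service name; resolving it returns one A record per
// ready ENUM frontend pod, because the service has no cluster IP.
const service = "enum-frontend.core.svc.cluster.local"

type pool struct {
	conns []net.Conn
	next  uint64
}

// dialAll opens one persistent TCP connection per resolved pod IP.
func dialAll() (*pool, error) {
	ips, err := net.LookupIP(service)
	if err != nil {
		return nil, err
	}
	p := &pool{}
	for _, ip := range ips {
		c, err := net.Dial("tcp", net.JoinHostPort(ip.String(), "53"))
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, c)
	}
	return p, nil
}

// send picks the next connection round-robin, giving per-message balancing.
func (p *pool) send(msg []byte) error {
	c := p.conns[atomic.AddUint64(&p.next, 1)%uint64(len(p.conns))]
	_, err := c.Write(msg)
	return err
}

func main() {
	p, err := dialAll()
	if err != nil {
		fmt.Println("resolve/dial failed:", err)
		return
	}
	fmt.Printf("balancing across %d ENUM frontend instances\n", len(p.conns))
}
```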
In the diagram below, DNS signaling over TCP is performed between a DNS proxy (on the left) acting as a client and an ENUM frontend (on the right) acting as a server. By exposing the IP addresses of the ENUM server instances through a Kubernetes headless service and applying client-side load balancing on the DNS proxy, traffic at 100,000 messages per second received by the ENUM frontend is balanced evenly across the available server instances. In this deployment scenario, the number of ENUM server instances auto-scales between 2 and 4, ensuring optimal resource utilization and maintaining high responsiveness to dynamic traffic.
Note that in autoscaling scenarios, existing ENUM server instances must temporarily handle higher traffic until the Kubernetes Horizontal Pod Autoscaler (HPA) detects the increased load and launches new instances. The capability to handle higher-than-nominal load is further required in service recovery scenarios, where the CNF can be immediately exposed to the full production load as soon as it becomes available.
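As a rough rule of thumb, the burst that existing instances can absorb follows directly from the HPA utilization target, assuming load scales roughly linearly with the scaled metric. The sketch below computes that headroom factor for a few assumed target values.

```go
package main

import "fmt"

// absorbableBurst returns the factor by which offered load can grow before
// existing instances saturate, assuming they sit at the HPA utilization
// target in steady state and load scales linearly with that metric.
func absorbableBurst(targetUtilization float64) float64 {
	return 1.0 / targetUtilization
}

func main() {
	// Assumed HPA target utilization values, not recommended settings.
	for _, target := range []float64{0.5, 0.6, 0.8} {
		fmt.Printf("HPA target %.0f%% -> existing instances absorb up to %.2fx nominal load\n",
			target*100, absorbableBurst(target))
	}
}
```

The lower the target, the more headroom is reserved for scale-out and recovery scenarios, at the cost of running more instances in steady state.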
Protocol Tunneling and Interworking
The use of TCP also provides additional benefits. In addition to its key advantage of enabling secure communication via Transport Layer Security (DNS over TLS, or DoT), TCP supports improved handling of large DNS messages, enhances resilience to packet loss, and lays the groundwork for sending DNS messages over the HTTP protocol. This is standardized in RFC 8484, which defines DNS wire format tunneling over HTTPS (DoH), and RFC 8427, which defines a JSON representation of DNS messages and thereby enables protocol interworking functions that translate between the wire format and structured application-layer representations, such as JSON in this case. While 5G standards already adopt JSON over HTTP/2 and therefore benefit from connection reuse and multiplexing, implementations may consider gRPC for more efficient resource and bandwidth utilization.
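A minimal DoH client sketch is shown below. It uses the golang.org/x/net/dns/dnsmessage package to pack a query into the DNS wire format and POSTs it with the application/dns-message media type defined in RFC 8484; the endpoint URL is a placeholder, and error handling is reduced to the bare minimum.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/net/dns/dnsmessage"
)

func main() {
	// Build a wire-format DNS query (A record for example.com).
	msg := dnsmessage.Message{
		Header: dnsmessage.Header{ID: 1, RecursionDesired: true},
		Questions: []dnsmessage.Question{{
			Name:  dnsmessage.MustNewName("example.com."),
			Type:  dnsmessage.TypeA,
			Class: dnsmessage.ClassINET,
		}},
	}
	wire, err := msg.Pack()
	if err != nil {
		panic(err)
	}

	// Tunnel the unchanged wire format over HTTPS (HTTP/2 where supported),
	// as in DoH. The URL is a placeholder for the DNS proxy's DoH endpoint.
	resp, err := http.Post("https://dns-proxy.example.net/dns-query",
		"application/dns-message", bytes.NewReader(wire))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	var answer dnsmessage.Message
	if err := answer.Unpack(body); err != nil {
		panic(err)
	}
	fmt.Printf("received %d answer record(s)\n", len(answer.Answers))
}
```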
The table below illustrates the impact of HTTP/2 payload variations on service latency under a traffic load of 40,000 messages per second, with traffic received at the DNS proxy and sent directly to the application server. All service instances were co-located on the same node to avoid node-to-node routing. Service latency is reported as the 95th and 99th percentiles.
The results indicate that interworking functions (JSON and gRPC) introduce additional latency compared to tunneling the DNS wire format directly, as in DoH. In this deployment scenario, the latency difference between JSON and gRPC is negligible, even though benchmarks often show gRPC to be faster. Service latency here includes the interworking function within the DNS proxy. Most of the overhead comes from translating between the protocol wire format and application-layer objects, which narrows the performance gap between JSON and gRPC and makes DoH faster in practice. JSON typically incurs higher latency for encoding/decoding and produces larger payloads than gRPC. Conversely, gRPC incurs overhead due to its message framing header and HTTP/2 trailing headers. Ultimately, minimizing interworking functions or translation steps has a greater effect on latency than the choice between JSON and gRPC. However, in CPU- or bandwidth-constrained environments, or when handling large payloads, gRPC’s compact binary encoding typically outperforms JSON.
The next section offers a different perspective on service latency by examining deployment-specific aspects.
Service Latency
By nature, low service latency and highly distributed microservice architectures are somewhat contradictory goals. To ensure reliability, multiple instances of each CNF component are typically deployed. These instances often run across different hosts, which introduces additional latency due to node-to-node routing. Since this aspect cannot be eliminated entirely, the primary objective is to avoid adding unnecessary latency on top of it.
Service latency in microservices-based deployments is influenced by multiple factors. Setting aside obvious contributors such as poor CNF architecture, frequent referral queries over the network, CPU throttling, and inefficient coding practices, one critical consideration is the number of CNF components (microservices) involved in the signaling chain, often referred to as the critical path. Each additional component increases the deployment footprint and introduces network- and protocol-specific overhead, which can be further amplified by node-to-node routing. A common best practice is to keep the CNF architecture lean, with three main components: a network-facing component for protocol tunneling or interworking, a stateless application component for business logic, and a storage component for data and state persistence. This setup limits the maximum node-to-node hops to two: from the network component to the application component, and from the application component to storage. At the same time, the CNF should minimize the number of sequential storage queries within a single signaling transaction.
The table below shows measurements taken at a traffic load of 40,000 messages per second between the DNS proxy and the application server, both with and without the ENUM frontend in the signaling path. All service instances were deployed on the same node to avoid node-to-node routing. Communication between the DNS proxy and the ENUM frontend used DNS over TCP. Service latency is reported as the 95th and 99th percentiles.
As the results indicate, introducing an additional CNF component in the signaling path increases service latency. While this does not discourage deploying one or two necessary CNFs, the latency effect can be further amplified by node-to-node routing.
The final table presents the complete architecture introduced at the beginning of this blog, including the Titan.ium datastore, and showcases low-latency service across a distributed, microservices-based architecture. The deployment runs on a six-node cluster with anti-affinity rules applied to instances of the same CNF component. To better approximate a production environment, the total load was increased to 200,000 messages per second, and the number of service instances was scaled out accordingly. Service latency is reported as the 95th and 99th percentiles.
The graph below shows the traffic distribution and service latencies at the DNS proxy (on the left), the ENUM frontend (in the middle), and the application server (on the right). Service latency is shown at the 50th (orange), 95th (blue), and 99th (yellow) percentiles. In the middle of the 5-minute time window, traffic to the application servers was gradually rerouted through the ENUM frontend, showing the expected increase in service latency, amplified by node-to-node routing. Communication with the application server and the Titan.ium datastore used gRPC over HTTP/2.
Service Reliability
In cloud-native environments, ensuring service reliability is about more than just deploying proxy or server components across local and geographically distributed instances. For telecommunication services, reliability has an even sharper edge: low latency is critical to keep operations like call setup times fast and seamless. A key factor in achieving this reliability lies in the design of the CNF itself, specifically whether it follows a “retry” or a “fail-fast” approach.
With the retry strategy, the CNF tries to handle service degradations internally by attempting alternative service providers before surfacing an error.
- Pros: If successful, this is often faster than leaving recovery to the remote client, since retries remain within the CNF rather than relying on external network routing.
- Cons: If retries take too long or a transaction gets lost, the remote client is left waiting for a timeout. During this waiting period, in-flight transactions may accumulate on both the client and the CNF, leading to excessive memory usage and potential instability.
In the fail-fast model, the CNF quickly responds with an error when a service degradation occurs, leaving it up to the remote client to retry or redirect requests.
- Pros: This avoids the risk of timeouts caused by prolonged retries inside the CNF, ensuring the system does not build up resource pressure.
- Cons: It shifts the responsibility to the client, which must be capable of handling retries and routing logic efficiently.
Both approaches come with trade-offs. A retry within the CNF can improve service quality but may introduce resource strain. A fail-fast design reduces CNF complexity and timeout risks but relies on robust client-side handling. Regardless of the approach, the golden rule remains the same: timeouts at the remote client must be avoided.
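The two strategies can be contrasted in a short sketch, shown below, in which every attempt is bounded by the caller’s deadline so that the golden rule holds. The backend names and timeout values are illustrative assumptions rather than recommended settings.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// query stands in for one attempt against a backend service instance.
func query(ctx context.Context, backend string) (string, error) {
	// ... real signaling request would go here ...
	return "", errors.New(backend + ": degraded")
}

// handleRetry tries an alternative backend before surfacing an error,
// but every attempt stays within the caller's deadline.
func handleRetry(ctx context.Context, backends []string) (string, error) {
	var lastErr error
	for _, b := range backends {
		attempt, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
		resp, err := query(attempt, b)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if ctx.Err() != nil { // caller's deadline reached: stop retrying
			break
		}
	}
	return "", lastErr
}

// handleFailFast returns an error immediately and leaves it to the remote
// client to retry or redirect the request.
func handleFailFast(ctx context.Context, backend string) (string, error) {
	return query(ctx, backend)
}

func main() {
	// Overall budget chosen to stay safely below the remote client's timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	if _, err := handleRetry(ctx, []string{"app-server-1", "app-server-2"}); err != nil {
		fmt.Println("retry strategy surfaced:", err)
	}
	if _, err := handleFailFast(ctx, "app-server-1"); err != nil {
		fmt.Println("fail-fast strategy surfaced:", err)
	}
}
```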
This brings us to the final topic of the blog: managing in-flight transactions during In-Service Software Upgrade (ISSU).
One advantage of the Kubernetes container orchestration framework is its ability to notify a service instance before termination. This gives the instance a chance to run a procedure that delays shutdown and enables graceful termination. Without such a procedure, the instance stops abruptly, leaving in-flight transactions orphaned. The result is timeouts (lost transactions) and errors (failed routing attempts) observed at the upstream CNF component, which may then cascade through the CNF back to the remote client.
The diagram below illustrates this scenario with three components: the DNS proxy (on the left), the ENUM frontend (in the middle), and the application server (on the right). The application server undergoes an In-Service Software Upgrade (ISSU) using a rolling update across all 14 running instances. In the absence of a service termination procedure, timeouts and errors occur at the ENUM frontend. Without a retry mechanism at the ENUM frontend, these errors propagate through the DNS proxy back to the remote clients, which must then take corrective action. The end result is degraded service quality and increased latency.
By correctly handling the service termination phase in the application server, the ISSU procedure remains transparent to other components (DNS proxy and ENUM frontend), with zero errors and timeouts observed across the deployed CNF. As shown in the diagram below, when the application server instance receives a pre-stop event, it can finalize all in-flight transactions before shutting down. The first phase of the service termination procedure must be long enough for the upstream component to recognize the instance is no longer available and stop sending new requests. After this, the service termination procedure continues by sending error responses for any late-arriving transactions (not the case in the diagram below), ensuring that timeouts at the remote clients are avoided.
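A minimal sketch of such a termination sequence for an HTTP-based application server instance is given below. It assumes the pre-stop event ultimately arrives as a SIGTERM once any preStop hook completes, and that the configured grace period exceeds the drain time; the 10-second drain and 5-second completion window are assumed values, not recommendations.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

var terminating atomic.Bool

func main() {
	mux := http.NewServeMux()

	// Readiness probe: starts failing as soon as termination begins, so the
	// upstream component (e.g. the ENUM frontend) stops routing new requests.
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Business endpoint standing in for the application server's signaling API.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	// Kubernetes signals termination (preStop hook followed by SIGTERM).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Phase 1: advertise unavailability and wait long enough for the upstream
	// component to notice and stop sending new requests (assumed drain time).
	terminating.Store(true)
	time.Sleep(10 * time.Second)

	// Phase 2: let in-flight requests complete; after this deadline the real
	// CNF would answer late-arriving transactions with an error rather than
	// leaving the remote client to time out.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```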
Note that during ISSU, traffic remains evenly distributed across the application server instances because client-side load balancing is applied at the ENUM frontend CNF component.
Lessons Learned
Successfully adopting cloud-native principles requires more than just containerizing functions. It demands a holistic approach that ensures proper capacity management and scaling, efficient traffic load distribution, robust handling of protocol tunneling and interworking, minimal service latency, and reliable mechanisms for service continuity. Telco-grade CNFs can deliver the high performance and resilience expected in modern cloud-native deployments only when all of these aspects are properly addressed.
About Titan.ium
Titan.ium Platform is a leader in signaling, routing, subscriber data management, and security software and services. Our solutions are deployed in more than 80 countries by over 180 companies, including eight of the world’s top ten communications service providers.
Titan.ium began its cloud-native journey in 2019 with the introduction of the Titan.ium cloud-native platform. As of mid-2025, Titan.ium’s cloud-native portfolio includes several 5G network functions and selected legacy network functions that have transitioned to cloud-native to address immediate market demands. At the same time, we continue supporting the Titan virtualized platform, which can also be deployed on physical servers. This gradual shift enables communication service providers to harmonize their infrastructure while ensuring continuity.