# Kubernetes v1.17 new feature preview: topology-aware service routing


Hello everyone, I'm roc, from the Tencent Kubernetes Engine (TKE) team. Today I will introduce a new Kubernetes v1.17 feature that I have been involved in: topology-aware service routing.

## Glossary

• Topological domain: Indicates a certain type of “place” in the cluster, such as node, rack, zone or region etc.
• endpoint: An ip:port of a kubernetes service, usually the ip:port of a pod.
• service: A Kubernetes Service resource, associated with a set of endpoints. Traffic sent to the service is forwarded to its associated endpoints.

## Background

Topology-aware service routing is also called Service Topology for short. This feature was originally proposed and designed by Jun Du (@m1093782566). Why design this feature? Imagine that the nodes of a Kubernetes cluster are distributed across different places, and the endpoints of a service are distributed across different nodes. The traditional forwarding strategy load balances across all endpoints, usually with equal probability, so traffic to the service may be scattered across all of these places. Although the forwarding is load-balanced, a far-away endpoint means high network latency, which hurts network performance and in some cases even incurs additional traffic costs. If the service could forward to nearby endpoints, would that reduce network latency and improve network performance? Yes! That is exactly the purpose of this feature.

## Kubernetes Affinity

Forwarding a service's traffic to the nearest endpoints is really a kind of network affinity: traffic tends to go to the endpoints closest to the client. Before this feature, Kubernetes already had several affinity designs and implementations for scheduling and storage:

• Node Affinity: Allows Pods to be scheduled to Nodes that meet certain expectations, such as limiting scheduling to a certain Availability Zone, or requiring nodes to support GPUs. This is considered scheduling affinity, and the scheduling results depend on node attributes.
• Pod affinity and anti-affinity: Allows Pods to be scheduled depending on other Pods, e.g. letting a group of Pods be scheduled onto nodes in the same topology domain, or spread across nodes in different topology domains. This can also be considered scheduling affinity, and the scheduling result depends on other Pods.
• Volume Topology-aware Scheduling: Allows Pods to be scheduled only to nodes that match the topology domain of the storage to which they are bound. This is considered affinity between scheduling and storage. The scheduling result depends on the topology domain of the storage.
• Local Persistent Volume: Lets a Pod use a local data volume, such as a high-performance SSD, which is useful in scenarios that require high IOPS and low latency. It also guarantees that the Pod is always scheduled to the same node, so the data will not be lost. This can also be considered affinity between scheduling and storage. The scheduling result depends on the node where the storage is located.
• Topology-Aware Volume Dynamic Provisioning: The Pod is scheduled first, and then the storage is created according to the topology domain of the node that the Pod was scheduled onto. This can be considered affinity between storage and scheduling, and the creation of storage depends on the scheduling result.

However, Kubernetes currently has no affinity capability on the network side. Topology-aware service routing fills this gap: it lets a service forward traffic to nearby endpoints instead of forwarding to all endpoints with equal probability.

## How to implement

We know that service forwarding is mainly implemented by the kube-proxy process on each node, which watches the apiserver to obtain the endpoints of each service and then converts these relationships into iptables or ipvs rules. For headless services, it is mainly implemented by kube-dns or coredns, which dynamically resolve the service name to different endpoint IPs. The key to forwarding a service's traffic nearby is forwarding it to an endpoint in the same topology domain as the current node. In other words, the endpoints are filtered, and only those that match the current node's topology domain are selected for forwarding.
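To make the equal-probability forwarding concrete: in iptables mode, one common way to spread traffic evenly over n endpoints is a chain of n rules tried in order, where each rule matches with a probability chosen so that every endpoint ends up with a 1/n share overall. The Python sketch below only works through that arithmetic. The function names are hypothetical, and this is not kube-proxy's actual code.

```python
from fractions import Fraction


def cascade_probabilities(n):
    """Match probability for each of n rules that are tried in order."""
    # Rule i (0-based) is reached only if all earlier rules did not match,
    # so it has to match with probability 1 / (n - i).
    return [Fraction(1, n - i) for i in range(n)]


def overall_probabilities(rule_probs):
    """Probability that each rule is the one that finally matches."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probs:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


rule_probs = cascade_probabilities(3)
shares = overall_probabilities(rule_probs)
print([str(p) for p in rule_probs])
print([str(s) for s in shares])
```

With three endpoints the per-rule probabilities are 1/3, 1/2 and 1, and each endpoint receives one third of the traffic. Topology-aware routing changes which endpoints enter this calculation on a given node, not the calculation itself.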

So how do we determine whether an endpoint is in the same topology domain as the current node? As long as the endpoint's topology information can be obtained, we can compare it with the current node's topology. And how do we get the endpoint's topology information? Through the labels of the node where the endpoint is located: node labels can be used to describe the topology domain.

Usually, when a node is initialized, it is given many labels, such as kubernetes.io/hostname, which holds the node's hostname and distinguishes it from other nodes. In addition, in Kubernetes clusters provided by cloud vendors, or clusters deployed with cloud-controller-manager, nodes are usually also labeled with failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region to identify the availability zone and the region where the node is located. Since v1.17, these labels are renamed to topology.kubernetes.io/zone and topology.kubernetes.io/region; see PR 81431.
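Because both the old and the new zone/region label names may appear on nodes during the transition, code that reads a node's topology may want to check the new key first and fall back to the old one. The following Python sketch illustrates that idea. The helper names and the sample label values are hypothetical, and the fallback order is an assumption rather than documented Kubernetes behavior.

```python
# Label keys named in the text: the new keys first, then the older beta keys.
HOSTNAME_KEY = "kubernetes.io/hostname"
ZONE_KEYS = ("topology.kubernetes.io/zone",
             "failure-domain.beta.kubernetes.io/zone")
REGION_KEYS = ("topology.kubernetes.io/region",
               "failure-domain.beta.kubernetes.io/region")


def first_label(labels, keys):
    """Return the value of the first key that is present and non-empty."""
    for key in keys:
        value = labels.get(key)
        if value:
            return value
    return None


def node_topology(labels):
    """Summarize a node's topology domains from its labels."""
    return {
        "hostname": labels.get(HOSTNAME_KEY),
        "zone": first_label(labels, ZONE_KEYS),
        "region": first_label(labels, REGION_KEYS),
    }


# Hypothetical node that still carries only the older beta labels.
old_style = {
    "kubernetes.io/hostname": "node-1",
    "failure-domain.beta.kubernetes.io/zone": "zone-a",
    "failure-domain.beta.kubernetes.io/region": "region-1",
}
print(node_topology(old_style))
```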

So, given an endpoint, how do we find the labels of the node where it is located? The answer is Endpoint Slice. This feature was released as alpha in v1.16 and will graduate to beta in v1.17. It is effectively an enhanced version of the Endpoint API: it solves performance problems with large numbers of endpoints by sharding them into slices, and it can also carry more information, including the topology information of the node where each endpoint is located. Topology-aware service routing obtains this topology information from the Endpoint Slice to filter endpoints (keeping only the endpoints in the same topology domain), and then converts the result into iptables or ipvs rules to implement nearby forwarding.
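The two ideas in this paragraph, sharding endpoints into slices and attaching each endpoint's node topology, can be sketched in a few lines of Python. This is a simplified model, not the real EndpointSlice API: the dictionary layout, the `build_slices` name and the `max_per_slice` parameter are all assumptions made for illustration.

```python
# Node label keys that are copied onto each endpoint as topology information.
TOPOLOGY_KEYS = (
    "kubernetes.io/hostname",
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
)


def build_slices(service, pod_endpoints, node_labels, max_per_slice=100):
    """Shard (ip, node_name) pairs into slices that carry node topology."""
    endpoints = []
    for ip, node_name in pod_endpoints:
        labels = node_labels.get(node_name, {})
        # Copy only the topology-related labels of the endpoint's node.
        topology = {k: labels[k] for k in TOPOLOGY_KEYS if k in labels}
        endpoints.append({"addresses": [ip], "topology": topology})
    # Cut the flat endpoint list into chunks of at most max_per_slice.
    return [
        {"service": service, "endpoints": endpoints[i:i + max_per_slice]}
        for i in range(0, len(endpoints), max_per_slice)
    ]


# Hypothetical cluster: two nodes in different zones, five pods.
node_labels = {
    "node-a": {"kubernetes.io/hostname": "node-a",
               "topology.kubernetes.io/zone": "zone-a"},
    "node-b": {"kubernetes.io/hostname": "node-b",
               "topology.kubernetes.io/zone": "zone-b"},
}
pods = [("10.0.0.%d" % i, "node-a" if i % 2 else "node-b") for i in range(1, 6)]
slices = build_slices("nginx", pods, node_labels, max_per_slice=2)
print(len(slices), [len(s["endpoints"]) for s in slices])
```

A consumer such as kube-proxy then only needs to watch the slices: each endpoint already says which hostname and zone it belongs to, so no separate node lookup is required.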

You may have noticed that the iptables/ipvs rules for forwarding services are usually basically the same on every node. After topology-aware service routing is enabled, however, the forwarding rules on each node may differ, because different nodes have different topology information and therefore end up with different filtered endpoints. This is why service forwarding no longer uses the same probability for every endpoint, and why flexible nearby forwarding becomes possible.
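A small Python sketch shows how the per-node view diverges. It assumes, for illustration only, that endpoints are filtered by availability zone alone and that traffic is split evenly among the endpoints that remain. The node names, addresses and the `per_node_backends` helper are hypothetical.

```python
from fractions import Fraction


def per_node_backends(node_zone, endpoint_node):
    """For each node, map every same-zone endpoint to its traffic share."""
    table = {}
    for node, zone in node_zone.items():
        # Keep only the endpoints whose node is in the same zone as `node`.
        local = sorted(ep for ep, ep_node in endpoint_node.items()
                       if node_zone[ep_node] == zone)
        table[node] = {ep: Fraction(1, len(local)) for ep in local}
    return table


node_zone = {"node-a1": "zone-a", "node-a2": "zone-a", "node-b1": "zone-b"}
endpoint_node = {
    "10.0.0.1:80": "node-a1",
    "10.0.0.2:80": "node-a2",
    "10.0.1.1:80": "node-b1",
}
for node, backends in per_node_backends(node_zone, endpoint_node).items():
    print(node, {ep: str(share) for ep, share in backends.items()})
```

Without filtering, all three nodes would hold the same rule set with a 1/3 share per endpoint. With zone filtering, the two nodes in zone-a split traffic between the two zone-a endpoints, while node-b1 sends everything to the single zone-b endpoint.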

Topology-aware service routing for headless services is not currently supported, and is planned to be supported in the beta phase. Because the headless service does not generate forwarding rules through kube-proxy, but instead implements dynamic DNS resolution, kube-dns/coredns needs to be modified to support this feature.

## Prerequisites

To enable the current alpha implementation of topology-aware service routing, the following prerequisites must be met:

• Kubernetes 1.17 or later
• Kube-proxy running in iptables mode or IPVS mode (only these two modes are implemented in alpha)
• Endpoint Slices enabled (although this feature will graduate to beta in v1.17, it will not be enabled by default)

## How to Enable This Feature

Enable the ServiceTopology and EndpointSlice feature gates for all Kubernetes components:

    --feature-gates="ServiceTopology=true,EndpointSlice=true"


## How to Use

Add the topologyKeys field to the Service spec. It lists the topology domains the service prefers, in priority order, and each entry corresponds to a node label key. When the service is accessed, the first topology key is checked: if some endpoint has topology information for that key and its value is the same as the current node's, that key is selected as the current topology domain, and all endpoints in that domain are selected for forwarding. If no endpoint matches the first topology key, the second key is tried, and so on. If no topology key has matched after all of them have been tried, forwarding is refused, just as if the service had no backends.

There is a special topology key, "*", which matches all endpoints. If topologyKeys contains "*", it must be the last entry in the list. It is usually used as a fallback: when no endpoint matches any of the other topology keys, traffic is forwarded to any endpoint, which avoids packet loss.
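The matching procedure described above can be sketched in Python as follows. This is a minimal sketch, not the kube-proxy implementation: the `filter_endpoints` name and the endpoint dictionary layout are hypothetical, and returning all endpoints when topologyKeys is empty is an assumption about the default behavior.

```python
def filter_endpoints(topology_keys, node_labels, endpoints):
    """Pick the endpoints a node may forward to, given the topologyKeys list.

    endpoints: list of dicts with "address" and "topology" (label key -> value).
    """
    if not topology_keys:
        # Assumption: without topologyKeys, every endpoint stays eligible.
        return list(endpoints)
    for key in topology_keys:
        if key == "*":
            # Wildcard: ignore topology and accept every endpoint.
            return list(endpoints)
        node_value = node_labels.get(key)
        if node_value is None:
            continue  # this node has no such label, try the next key
        matched = [ep for ep in endpoints
                   if ep["topology"].get(key) == node_value]
        if matched:
            return matched  # the first key with a match wins
    return []  # nothing matched: behave as if there were no backends


HOST = "kubernetes.io/hostname"
ZONE = "topology.kubernetes.io/zone"
endpoints = [
    {"address": "10.0.0.1:80", "topology": {HOST: "node-a1", ZONE: "zone-a"}},
    {"address": "10.0.1.1:80", "topology": {HOST: "node-b1", ZONE: "zone-b"}},
]
keys = [HOST, ZONE, "*"]

# node-a2 has no local endpoint, so the zone key matches the zone-a endpoint.
node_a2 = {HOST: "node-a2", ZONE: "zone-a"}
print([ep["address"] for ep in filter_endpoints(keys, node_a2, endpoints)])

# node-c1 matches neither hostname nor zone, so "*" falls back to everything.
node_c1 = {HOST: "node-c1", ZONE: "zone-c"}
print([ep["address"] for ep in filter_endpoints(keys, node_c1, endpoints)])
```

Dropping "*" from the list turns the fallback off: for node-c1 the function would then return an empty list, which corresponds to the service refusing to forward.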

Topology keys currently support the following values (more will be added in the future):

• kubernetes.io/hostname: Represents the hostname of the node. It is usually placed first in the list, which means that if the local machine has an endpoint, traffic is forwarded directly to that local endpoint.
• topology.kubernetes.io/zone: Represents the availability zone where the node is located. It is usually placed after kubernetes.io/hostname, which means that if the current node has no endpoint, traffic is forwarded to an endpoint on another node in the current availability zone.
• topology.kubernetes.io/region: Represents the region where the node is located, which means that traffic is forwarded to an endpoint in the current region. This key should be needed less often, because usually all nodes are in the same region. If nodes span regions, the latency of communication between them will be very high.
• *: Ignores the topology domain and matches all endpoints. It is usually used as a fallback strategy to avoid packet loss and can only be placed at the end of the list.

In addition, there are the following constraints:

• topologyKeys is not compatible with externalTrafficPolicy=Local; the two are mutually exclusive. If externalTrafficPolicy is Local, topologyKeys cannot be defined, and vice versa.
• Each topology key must be in a valid label format, and at most 16 keys are allowed.
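The constraints listed above, together with the rule that "*" must come last, can be expressed as a small validator. The Python sketch below is illustrative only: the function name and error messages are hypothetical, and the regular expression is a simplified stand-in for Kubernetes' real label-key validation, which is stricter.

```python
import re

MAX_TOPOLOGY_KEYS = 16

# Simplified label-key pattern: an optional "prefix/" followed by a name.
# Assumption for illustration; the real Kubernetes rules are stricter.
LABEL_KEY = re.compile(
    r"([a-z0-9]([-a-z0-9.]*[a-z0-9])?/)?"
    r"[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?"
)


def validate_topology_keys(topology_keys, external_traffic_policy=None):
    """Return a list of human-readable problems (empty if the keys are valid)."""
    errors = []
    if topology_keys and external_traffic_policy == "Local":
        errors.append("topologyKeys cannot be used with externalTrafficPolicy=Local")
    if len(topology_keys) > MAX_TOPOLOGY_KEYS:
        errors.append("at most %d topology keys are allowed" % MAX_TOPOLOGY_KEYS)
    last = len(topology_keys) - 1
    for index, key in enumerate(topology_keys):
        if key == "*":
            if index != last:
                errors.append('"*" must be the last topology key')
        elif not LABEL_KEY.fullmatch(key):
            errors.append("invalid label key: %r" % key)
    return errors


print(validate_topology_keys(
    ["kubernetes.io/hostname", "topology.kubernetes.io/zone", "*"]))
print(validate_topology_keys(["*", "kubernetes.io/hostname"]))
```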

Here is a simple Service example:

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
    spec:
      type: ClusterIP
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: nginx
      # Tried in order: same node first, then same zone, then any endpoint.
      topologyKeys:
      - "kubernetes.io/hostname"
      - "topology.kubernetes.io/zone"
      - "*"


Explanation: when the nginx service is accessed, first check whether the current node has an endpoint of this service. If it does, pick one and route to it. If not, check whether there is an endpoint in the availability zone where the current node is located; if there is, pick one and route to it. If there is still none, pick an endpoint at random and route to it.

The picture above shows one such situation: when a Pod accesses the nginx service and the current node has no endpoint, the search moves on to the current availability zone, finds an endpoint there, and forwards to it. (Forwarding to endpoints in another availability zone is not considered.)

## The Story Behind

There are some differences between the original KEP design and the current implementation. The implementation has changed twice, but it also promotes the development of other features. Let me tell the story.

The initial design for alpha was to let kube-proxy watch nodes directly to get the endpoints' topology information. Each node would have a global node cache: find the corresponding node in the cache through the endpoint's nodeName field, then check the node's labels to get the endpoint's topology domain. With a large number of cluster nodes, kube-proxy would consume a lot of resources, but the advantage is that the implementation is very simple, which makes it acceptable for the alpha phase. In beta, we could switch from watching nodes to watching a newly designed PodLocator API, which acts as an intermediary for storing topology information and avoids watching huge node objects.

In fact, at the beginning I followed the watch-node approach, and finally implemented the feature after a lot of effort. Later, metadata-only watch became supported in Kubernetes v1.15; see PR 71548. With it, we can watch only the metadata of a node instead of the entire node, which greatly reduces the amount of data transmitted and cached. So I changed the implementation from watching nodes to watching node metadata. Even so, metadata is updated frequently, mainly because resourceVersion changes frequently (the kubelet reports node status). So although watching node metadata is better than watching nodes, it may still cause a lot of unnecessary network traffic, but for an alpha implementation that is acceptable.

Unfortunately, this implementation was not merged before the v1.16 code freeze, because a few small details had not yet been fully discussed and accepted. In fact, while the watch-node approach was being implemented, the Endpoint Slice feature was proposed. While that feature was being discussed, we realized it could carry topology information, so that topology-aware service routing could get topology information directly from the Endpoint Slice, and the PodLocator API could be replaced as well. But since Endpoint Slice was still at a very early stage and its code had not been merged yet, the alpha implementation did not consider watching Endpoint Slices. Later, the Endpoint Slice feature was released as alpha in v1.16.

Since topology-aware service routing did not make it into v1.16, there was more time to discuss the small details after the v1.17 cycle started. By then the Endpoint Slice code had been merged, so I changed the implementation again, from watching node metadata to watching Endpoint Slices. This did at the alpha stage what I had planned to do in the beta phase. Finally, the implementation of this feature was merged into master.

## At the End

Topology-aware service routing lets services forward traffic to nearby endpoints, which reduces network latency and improves network performance for Kubernetes. This feature will be released as alpha in Kubernetes v1.17 in early December. Let's look forward to it!