Iptables performance is limited mainly by two factors:
- Latency on the first packet of a connection, caused by the linear evaluation of the rules until a match is found
- Latency programming the rules, caused by the need to dump and restore the entire ruleset to the kernel on each transaction (a quick way to observe both is sketched below)
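For reference, one way to observe both effects on a node running kube-proxy in iptables mode; the node container name below is just an example from a kind cluster:

```bash
# Size of the NAT ruleset kube-proxy keeps in sync; new connections walk the
# service/endpoint chains linearly until they match.
docker exec kind-worker iptables-save -t nat | wc -l

# Rough cost of dumping the full ruleset; kube-proxy pays a similar (larger) cost
# on every sync when it pushes the whole thing back with iptables-restore.
docker exec kind-worker bash -c 'time iptables-save -t nat > /dev/null'
```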
The kernel community moved to nftables as the replacement for iptables, with the goal of removing the existing performance bottlenecks. Kubernetes has decided to implement a new nftables proxy for this and other reasons, explained in more detail in the corresponding KEP and during the Kubernetes Contributor Summit in Chicago 2023 in the session "Iptables, end of an era".
To get an understanding of the improvements of nftables vs iptables, we can run a scale test to evaluate the difference, consisting of:
- Create a Service with 100k endpoints (no need to create Pods) and measure the time to program the dataplane
- Create a second Service with a real HTTP server backend, ensure its rules are evaluated after the first Service's, and send requests from a client to the port exposed by the Service to measure the latency
The test environment:
- A large GCE VM (n2d-standard-48): 48 vCPU, 192 GB RAM
- Kind version v0.22.0
- Kubernetes version v1.29.2
- Create the KIND cluster (use the corresponding configuration files in this gist)
kind create cluster --config kind-iptables.yaml
OR
kind create cluster --config kind-nftables.yaml
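The exact configuration files are in this gist. As a rough sketch of what the nftables variant needs (the kubeadmConfigPatches approach shown here is an assumption on my side, check the actual files), kube-proxy has to be switched to the nftables mode and, since the mode is alpha in v1.29, the NFTablesProxyMode feature gate has to be enabled:

```bash
# Sketch of a kind config that switches kube-proxy to the alpha nftables mode
cat <<'EOF' > kind-nftables.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
kubeadmConfigPatches:
- |
  kind: KubeProxyConfiguration
  mode: "nftables"
  featureGates:
    NFTablesProxyMode: true
EOF
```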
- Modify kube-proxy to expose its metrics endpoint on all addresses
# Get the current config
original_kube_proxy=$(kubectl get -oyaml -n=kube-system configmap/kube-proxy)
echo "Original kube proxy config:"
echo "${original_kube_proxy}"
# Patch it
fixed_kube_proxy=$(
printf '%s' "${original_kube_proxy}" | sed \
's/\(.*metricsBindAddress:\)\( .*\)/\1 "0.0.0.0:10249"/' \
)
echo "Patched kube-proxy config:"
echo "${fixed_kube_proxy}"
printf '%s' "${fixed_kube_proxy}" | kubectl apply -f -
# restart kube-proxy
kubectl -n kube-system rollout restart ds kube-proxy
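After the rollout we can confirm the patched value is in place:

```bash
# The metrics endpoint (port 10249) should now bind on all addresses
kubectl -n kube-system get configmap kube-proxy -o yaml | grep metricsBindAddress
```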
- Install prometheus (it will expose the prometheus endpoint with a NodePort Service in the monitoring namespace)
kubectl apply -f monitoring.yaml
kubectl get service -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-service NodePort 10.96.22.134 <none> 8080:30846/TCP 4m26s
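Besides the NodePort, a convenient way to query prometheus from the workstation is a port-forward to the Service shown above; the query below is just a sanity check that the kube-proxy metrics are being scraped (jq is optional):

```bash
# Forward the prometheus HTTP API locally and run an instant query against it
kubectl -n monitoring port-forward svc/prometheus-service 9090:8080 &
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kubeproxy_sync_proxy_rules_duration_seconds_count' | jq .
```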
- Create a Service with 100k endpoints
go run bigservice.go --endpoints 100000
Created slice svc-test-wxyz0 with 1000 endpoints
Created slice svc-test-wxyz1 with 1000 endpoints
Created slice svc-test-wxyz2 with 1000 endpoints
Created slice svc-test-wxyz3 with 1000 endpoints
Created slice svc-test-wxyz4 with 1000 endpoints
[... 94 more slices: svc-test-wxyz5 through svc-test-wxyz98 ...]
Created slice svc-test-wxyz99 with 1000 endpoints
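bigservice.go is included in this gist; conceptually it creates the svc-test Service and then writes the endpoints as EndpointSlices of 1000 endpoints each (the API allows at most 1000 endpoints per slice), all labeled with kubernetes.io/service-name so kube-proxy consumes them. A hand-made miniature equivalent (the slice name and address below are made up) would look like this:

```bash
# A one-endpoint EndpointSlice attached to the svc-test Service; bigservice.go
# does the same programmatically, 1000 endpoints per slice, 100 slices.
kubectl apply -f - <<'EOF'
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: svc-test-manual0
  labels:
    kubernetes.io/service-name: svc-test
addressType: IPv4
ports:
- port: 80
  protocol: TCP
endpoints:
- addresses:
  - "10.244.1.10"
  conditions:
    ready: true
EOF
```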
Using the kind cluster we created with kube-proxy in iptables mode:
- If we get the logs of one kube-proxy Pod, we can see it is blocked on iptables-restore:
I0413 17:45:41.121569 1 trace.go:236] Trace[89257398]: "iptables restore" (13-Apr-2024 17:45:36.642) (total time: 4478ms):
Trace[89257398]: [4.478738268s] [4.478738268s] END
I0413 17:47:09.820138 1 trace.go:236] Trace[1299421304]: "iptables restore" (13-Apr-2024 17:45:41.269) (total time: 88550ms):
Trace[1299421304]: [1m28.550942623s] [1m28.550942623s] END
It also consumes the whole CPU while doing so (one way to observe this from the host is shown below).
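```bash
# CPU usage of the kind "nodes" (docker containers); kube-proxy and the
# iptables-restore processes it forks run inside them.
docker stats --no-stream
```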
- If we check the prometheus metrics, the p50 is very high: it sits at the maximum bucket of the histogram, and there are also gaps in the graph, probably because kube-proxy is stuck in the iptables operations (the query used is shown below)
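The quantiles come from the kubeproxy_sync_proxy_rules_duration_seconds histogram; with the port-forward to prometheus from before, the p95 can be queried with something like:

```bash
# p95 of the time kube-proxy spends programming the rules on each sync
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le))'
```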
- If we install an additional Service
$ kubectl apply -f svc-webapp.yaml
deployment.apps/server-deployment created
service/test-service created
$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 13m
test-service ClusterIP 10.96.254.227 <none> 80/TCP 4s
svc-test ClusterIP 10.96.253.88 <none> 80/TCP 8m51s
and try to query it using ab
$ kubectl run -it test --image httpd:2 bash
$ kubectl exec test -- ab -n 1000 -c 100 http://10.96.254.227/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.96.254.227 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
It times out :(
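The timeout is consistent with the programming-latency problem: kube-proxy is still stuck restoring the huge ruleset, so the rules for the new test-service ClusterIP take minutes to appear on the nodes, and traffic to it goes nowhere in the meantime. One way to check whether the rules exist yet (the node container name is just an example):

```bash
# Count the NAT rules referencing the test-service ClusterIP; while kube-proxy
# is blocked on iptables-restore this stays at 0 and connections time out.
docker exec kind-worker iptables-save -t nat | grep -c 10.96.254.227
```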
Using the kind cluster we created with kube-proxy in nftables mode:
- If we run kube-proxy with log verbosity 2, we can find the network programming latency in the logs; the maximum sync time is about 11s. kube-proxy currently runs and logs both the IPv4 and IPv6 proxiers, which will be fixed with kubernetes/kubernetes#122979. I still have to get more details on the CPU consumption, but no additional CPU load is observed when using top.
I0413 17:28:50.729678 1 proxier.go:950] "Syncing nftables rules"
I0413 17:28:52.853880 1 proxier.go:1551] "Reloading service nftables data" numServices=7 numEndpoints=100010
I0413 17:29:01.912845 1 proxier.go:944] "SyncProxyRules complete" elapsed="11.465378712s"
I0413 17:29:01.912886 1 proxier.go:950] "Syncing nftables rules"
I0413 17:29:02.953876 1 proxier.go:1551] "Reloading service nftables data" numServices=0 numEndpoints=0
I0413 17:29:03.323651 1 proxier.go:944] "SyncProxyRules complete" elapsed="1.410751875s"
I0413 17:29:20.352321 1 proxier.go:950] "Syncing nftables rules"
I0413 17:29:20.352402 1 proxier.go:950] "Syncing nftables rules"
I0413 17:29:21.421475 1 proxier.go:1551] "Reloading service nftables data" numServices=0 numEndpoints=0
I0413 17:29:21.856391 1 proxier.go:944] "SyncProxyRules complete" elapsed="1.50406568s"
I0413 17:29:21.856439 1 bounded_frequency_runner.go:296] sync-runner: ran, next possible in 1s, periodic in 30s
I0413 17:29:22.868422 1 proxier.go:1551] "Reloading service nftables data" numServices=7 numEndpoints=100010
I0413 17:29:31.322133 1 proxier.go:944] "SyncProxyRules complete" elapsed="10.96973879s"
I0413 17:29:31.322179 1 bounded_frequency_runner.go:296] sync-runner: ran, next possible in 1s, periodic in 30s
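Part of the reason the nftables proxier scales better is that it keeps services in verdict maps and endpoints in per-service chains, and it can update them incrementally instead of rewriting the whole ruleset on every sync. If the node image ships the nft CLI (an assumption on my side), the table kube-proxy creates can be inspected with:

```bash
# Dump the beginning of the IPv4 table managed by kube-proxy in nftables mode
docker exec kind-worker nft list table ip kube-proxy | head -40
```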
- Connecting to the port exposed by prometheus to get the kube-proxy metrics, we can observe the p50 and p95 values
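The network programming latency mentioned earlier has its own histogram and can be queried the same way (again assuming a port-forward to prometheus on this cluster):

```bash
# p95 of the end-to-end latency between an endpoint change and the node dataplane reflecting it
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(kubeproxy_network_programming_duration_seconds_bucket[5m])) by (le))'
```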
- Install a second service with a web application
kubectl apply -f svc-webapp.yaml
Get the Service ClusterIP and run several requests against it to measure the latency
kubectl exec test -- ab -n 1000 -c 100 http://10.96.246.227/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.96.246.227 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 10.96.246.227
Server Port: 80
Document Path: /
Document Length: 60 bytes
Concurrency Level: 100
Time taken for tests: 0.158 seconds
Complete requests: 1000
Failed requests: 29
(Connect: 0, Receive: 0, Length: 29, Exceptions: 0)
Total transferred: 176965 bytes
HTML transferred: 59965 bytes
Requests per second: 6333.92 [#/sec] (mean)
Time per request: 15.788 [ms] (mean)
Time per request: 0.158 [ms] (mean, across all concurrent requests)
Transfer rate: 1094.61 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 6 2.2 6 10
Processing: 0 9 6.9 7 37
Waiting: 0 7 6.9 4 37
Total: 0 15 6.6 14 41
Percentage of the requests served within a certain time (ms)
50% 14
66% 14
75% 15
80% 15
90% 25
95% 35
98% 36
99% 37
100% 41 (longest request)
The latencies do not seem to be affected by the large Service.
Not much to say: kube-proxy nftables seems to solve the iptables scalability and performance problems. Kudos to the netfilter people and to @danwinship for their great work.
Since kube-proxy nftables is still in alpha and the performance and scale work is planned for beta, the current state will most likely improve further.