kube-scheduler pod cidr bugfix

While working on a feature that required container network planning, I noticed that Kubernetes does not consider IP addresses when scheduling. That is, when the IPs in the subnet allocated to a node are exhausted, but CPU, memory, and the other scheduling criteria are still satisfied, pods will still be placed on that node, and kubelet will report errors like:

NetworkPlugin kubenet failed to set up pod "frontend-jh0kf_default" network: Error adding container to network: no IP addresses available in network: kubenet

There are several possible fixes, all revolving around what the scheduler takes into account. The elegant approach is to make IP addresses a first-class schedulable resource in kube-scheduler, but that is considerably more work than the alternatives. A pragmatic shortcut is to piggyback on kube-scheduler's AllowedPodNumber (the per-node pod capacity), which is small and simple to implement.

diff --git a/pkg/scheduler/cache/node_info.go b/pkg/scheduler/cache/node_info.go
index 31be774578e..6c9f5713e94 100644
--- a/pkg/scheduler/cache/node_info.go
+++ b/pkg/scheduler/cache/node_info.go
@@ -29,6 +29,8 @@ import (
 	v1helper "k8s.io/kubernetes/pkg/apis/core/v1/helper"
 	priorityutil "k8s.io/kubernetes/pkg/scheduler/algorithm/priorities/util"
 	"k8s.io/kubernetes/pkg/scheduler/util"
+	"net"
+	"math"
 )
 
 var (
@@ -315,7 +317,16 @@ func (n *NodeInfo) AllowedPodNumber() int {
 	if n == nil || n.allocatableResource == nil {
 		return 0
 	}
-	return n.allocatableResource.AllowedPodNumber
+	ip, cidr, err := net.ParseCIDR(n.node.Spec.PodCIDR)
+	if err != nil || ip.To4() == nil {
+		return n.allocatableResource.AllowedPodNumber
+	}
+	size, _ := cidr.Mask.Size()
+	if size >= 31 {
+		return 0
+	}
+	// -3 (network address, broadcast address, gateway address)
+	return int(math.Min(math.Pow(2, float64(32-size))-3, float64(n.allocatableResource.AllowedPodNumber)))
 }
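The CIDR math in the patch can be tried standalone. Below is a minimal sketch; `podCapacity` is a name of my own invention, not from the patch, but the body mirrors the patched `AllowedPodNumber` logic:

```go
package main

import (
	"fmt"
	"math"
	"net"
)

// podCapacity returns how many pods can get an IP from an IPv4
// PodCIDR, reserving 3 addresses (network, broadcast, gateway),
// capped at allowedPods. For a missing/invalid/IPv6 CIDR it falls
// back to allowedPods, matching the patch's behavior.
func podCapacity(podCIDR string, allowedPods int) int {
	ip, cidr, err := net.ParseCIDR(podCIDR)
	if err != nil || ip.To4() == nil {
		return allowedPods
	}
	size, _ := cidr.Mask.Size()
	if size >= 31 {
		return 0
	}
	return int(math.Min(math.Pow(2, float64(32-size))-3, float64(allowedPods)))
}

func main() {
	fmt.Println(podCapacity("10.255.0.0/30", 110)) // 2^2 - 3 = 1
	fmt.Println(podCapacity("10.255.0.0/29", 110)) // 2^3 - 3 = 5
	fmt.Println(podCapacity("10.255.0.0/24", 110)) // 253, capped at 110
	fmt.Println(podCapacity("", 110))              // no PodCIDR assigned yet: 110
}
```

The fallback for an unparseable CIDR matters: a node that has not yet been assigned a PodCIDR would otherwise report zero capacity and become unschedulable.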

One remaining caveat: when a pod runs with hostNetwork: true, the patch above does not behave as expected, since such a pod is counted against the capacity even though it does not consume an address from the PodCIDR.
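One possible direction for handling that caveat would be to count only the pods that actually consume a PodCIDR address. This is purely a sketch with a simplified stand-in type, not scheduler code; the real `*v1.Pod` exposes this flag as `Spec.HostNetwork`:

```go
package main

import "fmt"

// simplifiedPod is a hypothetical stand-in for *v1.Pod, modeling
// only the field relevant here.
type simplifiedPod struct {
	Name        string
	HostNetwork bool
}

// ipConsumingPods counts pods that take an address from the node's
// PodCIDR; hostNetwork pods share the node IP and are skipped.
func ipConsumingPods(pods []simplifiedPod) int {
	n := 0
	for _, p := range pods {
		if !p.HostNetwork {
			n++
		}
	}
	return n
}

func main() {
	pods := []simplifiedPod{
		{Name: "kube-proxy", HostNetwork: true},
		{Name: "frontend", HostNetwork: false},
	}
	fmt.Println(ipConsumingPods(pods)) // 1
}
```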

Testing

Case: --node-cidr-mask-size=30

Expected: only one pod gets an IP and runs. The kube-controller-manager configuration:

[root@VM_128_11_centos ~]# systemctl status kube-controller-manager.service -l
● kube-controller-manager.service - kube-controller-manager
Loaded: loaded (/usr/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-08-07 13:37:56 CST; 2min 23s ago
Main PID: 20759 (kube-controller)
CGroup: /system.slice/kube-controller-manager.service
└─20759 /usr/bin/kube-controller-manager --node-cidr-mask-size=30 --cluster-cidr=10.255.0.0/19 --allocate-node-cidrs=true --master=http://127.0.0.1:60001 --cloud-config=/etc/kubernetes/qcloud.conf --service-account-private-key-file=/etc/kubernetes/server.key --service-cluster-ip-range=10.255.31.0/24 --allow-untagged-cloud=true --cloud-provider=qcloud --cluster-name=cls-n1jte9ty --root-ca-file=/etc/kubernetes/cluster-ca.crt --use-service-account-credentials=true --horizontal-pod-autoscaler-use-rest-clients=true
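As a side note on the flags above: with --cluster-cidr=10.255.0.0/19 and --node-cidr-mask-size=30, the controller-manager can carve out 2^(30-19) = 2048 per-node subnets. A quick sketch of that arithmetic (function name is my own):

```go
package main

import "fmt"

// nodeSubnets returns how many per-node PodCIDRs of the given mask
// size fit inside a cluster CIDR: 2^(nodeMaskSize - clusterPrefixLen).
func nodeSubnets(clusterPrefixLen, nodeMaskSize int) int {
	return 1 << uint(nodeMaskSize-clusterPrefixLen)
}

func main() {
	fmt.Println(nodeSubnets(19, 30)) // 2048 subnets of size /30
	fmt.Println(nodeSubnets(19, 29)) // 1024 subnets of size /29
}
```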

The kubelet logs show the CNI plugin parameters:

Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: I0807 13:38:24.454373   23809 kubenet_linux.go:308] CNI network config set to {
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "cniVersion": "0.1.0",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "name": "kubenet",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "type": "bridge",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "bridge": "cbr0",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "mtu": 1500,
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "addIf": "eth0",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "isGateway": true,
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "ipMasq": false,
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "hairpinMode": false,
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "ipam": {
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "type": "host-local",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "subnet": "10.255.0.0/30",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "gateway": "10.255.0.1",
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: "routes": [
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: { "dst": "0.0.0.0/0" }
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: ]
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: }
Aug 07 13:38:24 VM-0-43-ubuntu kubelet[23809]: }

Confirm the number of running pods and the pods on the node:

[root@VM_128_11_centos ~]# kubectl get pod --all-namespaces | grep Running  | wc -l
1
[root@VM_128_11_centos ~]# kubectl describe node 172.30.0.43
...
Non-terminated Pods: (1 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default guohao-555fb5456d-kdx8n 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:

Case: --node-cidr-mask-size=29

Expected: 2^(32-29) - 3 = 5 pods get IPs and run. The kubelet CNI config:

Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: I0807 13:44:48.669847   25163 docker_service.go:307] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:10.255.0.0/29,},}
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: I0807 13:44:48.669902 25163 kubenet_linux.go:308] CNI network config set to {
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "cniVersion": "0.1.0",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "name": "kubenet",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "type": "bridge",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "bridge": "cbr0",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "mtu": 1500,
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "addIf": "eth0",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "isGateway": true,
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "ipMasq": false,
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "hairpinMode": false,
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "ipam": {
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "type": "host-local",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "subnet": "10.255.0.0/29",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "gateway": "10.255.0.1",
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: "routes": [
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: { "dst": "0.0.0.0/0" }
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: ]
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: }
Aug 07 13:44:48 VM-0-43-ubuntu kubelet[25163]: }
[root@VM_128_11_centos ~]# kubectl get pod --all-namespaces  |grep Running | wc -l
5
[root@VM_128_11_centos ~]# kubectl describe node 172.30.0.43
...
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default guohao-555fb5456d-kjzrk 0 (0%) 0 (0%) 0 (0%) 0 (0%)
default guohao-555fb5456d-lxrmn 0 (0%) 0 (0%) 0 (0%) 0 (0%)
default guohao-555fb5456d-t4fq4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
default guohao-555fb5456d-t9k2b 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system l7-lb-controller-95dcf7bd7-v9wx7 0 (0%) 0 (0%) 0 (0%) 0 (0%)

Conclusion

With mask sizes of /30 and /29, the behavior matched expectations. The catch is that this patch only works when kubenet is in use; with other CNI implementations it is of little value, because it is kubenet that passes PodCIDR down to the host-local IPAM plugin, and other CNI plugins do not necessarily use PodCIDR at all.