I’m building a 6-node k8s cluster (3 master and 3 worker nodes) that uses the following components:
- CentOS 7
- Docker CE 19.03.xx
- Kubernetes 1.18.x (it’s built by a script, so it always grabs the latest release)
- Kube-VIP 1.x (for high availability of the API services)
- Calico Networking layer
- MetalLB (for in-cluster load balancing of apps)
- Project Contour (for Ingress services)
Before I initialize the cluster, I have to create a YAML file that kube-vip uses for its configuration and bootstrapping. Each master node has a modified version of this file so that each node knows what the VIP address is and which nodes are its peers.
I initialize Kubernetes on the first node with the following command:
kubeadm init --control-plane-endpoint VIPHOSTNAME:6443 --pod-network-cidr=172.18.0.0/16 --service-cidr=172.18.0.0/16 --apiserver-bind-port 6444 --upload-certs
The VIPHOSTNAME value is replaced with the actual DNS name that resolves to the VIP address managed by Kube-VIP. I also specify the pod CIDR and service CIDR to ensure I have no conflicting assignments with my external (to the cluster) network(s). The Kube-VIP pod is spun up at this point and exposes port 6443 on the VIP, which gets aliased onto the local NIC (ens192) and is used for all subsequent API calls when building the cluster.
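For reference, the "no conflicting assignments" goal can be sanity-checked with a few lines of POSIX shell. This is only a rough sketch: the pod and service ranges are copied from the kubeadm init command above, while 192.168.1.0/24 is a made-up stand-in for the external network and would need to be replaced with the real one. The function names are illustrative and not part of any tool.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The ( ... ) function body runs in a subshell, so the IFS change stays local.
ip_to_int() (
  IFS=.
  set -- $1
  echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# Print "first last" (as integers) for a CIDR block such as 172.18.0.0/16.
cidr_bounds() (
  ip=${1%/*}
  prefix=${1#*/}
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - $prefix) ))
  first=$(( $base - ($base % $size) ))
  echo "$first $(( $first + $size - 1 ))"
)

# Succeed (exit status 0) if the two CIDR blocks share at least one address.
cidrs_overlap() (
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

check_pair() {
  if cidrs_overlap "$1" "$2"; then
    echo "OVERLAP: $1 and $2"
  else
    echo "ok: $1 and $2"
  fi
}

pod_cidr=172.18.0.0/16        # --pod-network-cidr from the command above
service_cidr=172.18.0.0/16    # --service-cidr from the command above
external_cidr=192.168.1.0/24  # ASSUMPTION: placeholder external network

check_pair "$pod_cidr" "$external_cidr"
check_pair "$service_cidr" "$external_cidr"
check_pair "$pod_cidr" "$service_cidr"
```

As quoted, the command passes the same range to both flags, so the last check reports an overlap for that pair. Whether that plays any role in the error described below is not something this sketch can establish.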
I can add the 2 remaining master nodes and the 3 worker nodes without issue. Each of the two additional master nodes spins up its own Kube-VIP pod to form the HA entry point for the API services, and the VIP will sometimes move to another of the 3 master nodes at this point due to Kube-VIP’s internal election process. All kube-system pods also start up except for the two coredns pods, which is expected because there is no underlying networking layer yet.
However, at random points when I initialize the remaining elements of the cluster I get this error:
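A minimal sketch of how the "everything except coredns is up" state could be confirmed from a master node. It assumes kubectl is configured on that node and that the output has the default columns (NAME READY STATUS RESTARTS AGE); the helper name is illustrative.

```shell
#!/bin/sh
# Print the name and status of every pod that is not in the Running state.
# Expects `kubectl get pods --no-headers` output on stdin.
pods_not_running() {
  awk '$3 != "Running" { print $1, $3 }'
}

# Only query the cluster when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n kube-system --no-headers --request-timeout=10s \
    | pods_not_running
fi
```

At the stage described above, the expectation would be that only the two coredns pods are listed.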
Warning FailedCreatePodSandBox 10m kubelet, test-node-06 Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9cd78e4551829f270cd0644827a48b3078cd3b93c5a1c704ac32a607dc151d01" network for pod "coredns-66bff467f8-q4m6t": networkPlugin cni failed to set up pod "coredns-66bff467f8-q4m6t_kube-system" network: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/ippools: dial tcp 172.18.0.1:443: connect: connection refused, failed to clean up sandbox container "9cd78e4551829f270cd0644827a48b3078cd3b93c5a1c704ac32a607dc151d01" network for pod "coredns-66bff467f8-q4m6t": networkPlugin cni failed to teardown pod "coredns-66bff467f8-q4m6t_kube-system" network: error getting ClusterInformation: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 172.18.0.1:443: connect: connection refused]
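For context, 172.18.0.1 is the first address of the --service-cidr range, which Kubernetes assigns as the ClusterIP of the built-in kubernetes Service in the default namespace, so the error means the CNI plugin could not reach the API server through that Service address. The following is a hedged sketch for confirming which address the failures point at and which API server endpoints sit behind it. The helper name is illustrative, and it assumes the event messages appear in full in the kubectl get events output.

```shell
#!/bin/sh
# Extract the unique "ip:port" targets from "dial tcp ..." fragments on stdin.
dial_targets() {
  grep -oE 'dial tcp [0-9.]+:[0-9]+' | awk '{ print $3 }' | sort -u
}

if command -v kubectl >/dev/null 2>&1; then
  # Which addresses are the sandbox failures trying to reach?
  kubectl get events -n kube-system --request-timeout=10s \
    --field-selector reason=FailedCreatePodSandBox | dial_targets

  # Which Service owns that address, and which API servers back it?
  kubectl get service kubernetes -n default --request-timeout=10s
  kubectl get endpoints kubernetes -n default --request-timeout=10s
fi
```

Given --apiserver-bind-port 6444 in the init command, the endpoints would presumably be listed on port 6444 of each master node, which gives a concrete set of addresses to test from each node.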
I am starting here with Project Calico because the earliest I have seen this issue is when I have been attempting to initialize the calico networking stack. The guide I have used for Calico is the one at this URL:
https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-50-nodes-or-less
I have had times where this error did not occur during setup, and the entire cluster was up and running in a matter of minutes. The error message still occurs, but at a much later point, when I am attempting to add app pods to the cluster; sometimes it prevents the pods from initializing at all, other times it simply delays how long those pods take to initialize. Other times, I see the problem right at the point where I am initializing the Calico networking stack, and the problem just daisy-chains further on down.
I have tried letting Calico detect the pod CIDR on its own, as well as modifying the CALICO_IPV4POOL_CIDR value in the calico.yaml file, which is the process I am currently using. I’ve built a script that initializes the entire Kubernetes stack, with strategic pauses in place to allow a given state of the cluster to settle before the script moves on to the next setup phase, so it is easy for me to rebuild the cluster if I need to.
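One way to make the "strategic pauses" less timing-dependent would be to poll for a condition instead of sleeping for a fixed time. The sketch below is not the script described above, only an illustration of the idea. The retry_until name and its arguments are hypothetical, and the commented example assumes the calico.yaml manifest creates a DaemonSet named calico-node in kube-system.

```shell
#!/bin/sh
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
# The ( ... ) body runs in a subshell so the helper variables stay local.
retry_until() (
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      exit 1            # give up: the command never succeeded
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  exit 0
)

# Example (requires a live cluster, so it is left commented out here):
#   kubectl wait --for=condition=Ready nodes --all --timeout=300s
#   retry_until 30 10 kubectl rollout status daemonset/calico-node \
#     -n kube-system --timeout=10s
```

With this shape, each setup phase moves on as soon as the previous one has actually settled, and a phase that never settles fails loudly instead of letting the problem carry over into later phases.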
I have also curl’d the address via:
curl -vk https://172.18.0.1
When I curl the above address, it always works from the system where the active Kube-VIP pod is located, but it’s 50/50 whether any of the other nodes can get to it. I’ve had cases where it worked without issue and, 5 minutes later, no longer did from the same node, and I’m at a loss as to what is happening.
I am hoping someone can help me figure this out, as I’ve spent the last few days checking every Google result for the above error and so far nothing has resolved the issue.
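To pin down the intermittent behaviour, the same curl check could be repeated on a timer on every node and logged with a timestamp, so the failures can be lined up against VIP moves. This is a rough sketch and the function names are illustrative. It relies on curl printing 000 for %{http_code} when no HTTP response arrives at all.

```shell
#!/bin/sh
# Map curl's %{http_code} to a verdict. curl reports 000 when no HTTP
# response arrived (for example "connection refused" or a timeout); any real
# status code, even 401 or 403, means the connection itself worked.
classify_code() {
  if [ "$1" = "000" ]; then
    echo unreachable
  else
    echo reachable
  fi
}

# Probe a URL once and print: time, host, URL, HTTP code, verdict.
probe_once() (
  code=$(curl -sk -o /dev/null --max-time 5 -w '%{http_code}' "$1") || true
  echo "$(date '+%H:%M:%S') $(hostname) $1 $code $(classify_code "$code")"
)

# Probe a URL <count> times, sleeping <interval> seconds between attempts.
# Usage: probe_loop <url> <count> <interval_seconds>
probe_loop() (
  i=0
  while [ "$i" -lt "$2" ]; do
    probe_once "$1"
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then
      sleep "$3"
    fi
  done
)
```

Running something like probe_loop https://172.18.0.1 20 30 on each node at the same time would show whether the failures line up with a particular node or with the moments when the VIP moves.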