Calico periodically has issues initializing due to "dial tcp 172.18.0.1:443: connect: connection refused" errors

I’m in the process of building up a 6-node k8s cluster, with 3 master and 3 worker nodes, using the following elements:

- CentOS 7
- Docker CE 19.03.xx
- Kubernetes 1.18.x (it’s a script building this, so it always grabs the latest release)
- Kube-VIP 1.x (for high availability of the API services)
- Calico networking layer
- MetalLB (for in-cluster load balancing of apps)
- Project Contour (for Ingress services)

Before I initialize the cluster, I have to create a YAML file that kube-vip will use for its configuration and bootstrapping. Each master node has a modified version of this file to ensure each node knows what the VIP address is and which nodes are its specific peers.

When I initialize Kubernetes on the first node, I use the following command:

kubeadm init --control-plane-endpoint VIPHOSTNAME:6443 --pod-network-cidr=172.18.0.0/16 --service-cidr=172.18.0.0/16 --apiserver-bind-port 6444 --upload-certs

The VIPHOSTNAME value is replaced with the actual DNS name for the IP address of the VIP managed by Kube-VIP. I am also specifying the pod CIDR and service CIDR to ensure I have no conflicting assignments with my networks external to the cluster. The Kube-VIP pod is spun up at this point; it exposes port 6443 on the VIP, which gets aliased to the local NIC (ens192), and that endpoint is used for all subsequent API calls while building the cluster.
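A quick sanity check at this point (these are the standard kubeadm/Kubernetes objects, nothing specific to my setup) is to confirm what kubeadm recorded for the subnets and what ClusterIP the in-cluster API service was given:

# subnets kubeadm recorded in its ClusterConfiguration
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -i subnet
# ClusterIP assigned to the in-cluster API service (the first IP of the service CIDR)
kubectl get svc kubernetes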

I can add in the 2 remaining master nodes and the 3 worker nodes without issue. The two additional master nodes each spin up another Kube-VIP pod to form the HA’d API services entry point, and the VIP will sometimes move to a different one of the 3 master nodes at this point due to kube-vip’s internal election process. All kube-system pods also start up, except for the two coredns pods (which is expected, as there is no underlying networking layer yet).

However, at random points while I initialize the remaining elements of the cluster, I get this error:

Warning FailedCreatePodSandBox 10m kubelet, test-node-06 Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9cd78e4551829f270cd0644827a48b3078cd3b93c5a1c704ac32a607dc151d01" network for pod "coredns-66bff467f8-q4m6t": networkPlugin cni failed to set up pod "coredns-66bff467f8-q4m6t_kube-system" network: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/ippools: dial tcp 172.18.0.1:443: connect: connection refused, failed to clean up sandbox container "9cd78e4551829f270cd0644827a48b3078cd3b93c5a1c704ac32a607dc151d01" network for pod "coredns-66bff467f8-q4m6t": networkPlugin cni failed to teardown pod "coredns-66bff467f8-q4m6t_kube-system" network: error getting ClusterInformation: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 172.18.0.1:443: connect: connection refused]

I am starting here with Project Calico because the earliest point at which I have seen this issue is while attempting to initialize the Calico networking stack. The guide I have used for Calico is the one at this URL:

https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-50-nodes-or-less

I have had times where this error has not occurred and the entire cluster has been up and running in a matter of minutes. The error message will still occur, but at a much later point in time, when I am attempting to add app pods to the cluster - sometimes preventing the pods from initializing at all, other times simply delaying how long it takes for those pods to initialize. Other times, I see the problem right at the point where I am initializing the Calico networking stack, and the problem just daisy-chains further on down.

I have tried letting Calico detect the pod CIDR on its own, as well as modifying the CALICO_IPV4POOL_CIDR value in the calico.yaml file, which is the process I am currently using. I’ve built a script that initializes the entire Kubernetes stack, with strategic pauses in place to allow a given state of the cluster to settle before the script moves on to the next setup phase, so it is easy for me to rebuild the cluster if I need to.
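For reference, a rough sketch of that kind of in-place edit (the exact commented-out lines differ between Calico versions, so treat the sed patterns as illustrative rather than lifted from my script):

# download the manifest from the guide above, then uncomment CALICO_IPV4POOL_CIDR
# and point it at the pod range passed to kubeadm
curl -O https://docs.projectcalico.org/manifests/calico.yaml
sed -i -e 's|# - name: CALICO_IPV4POOL_CIDR|- name: CALICO_IPV4POOL_CIDR|' \
       -e 's|#   value: "192.168.0.0/16"|  value: "172.18.0.0/16"|' calico.yaml
kubectl apply -f calico.yaml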

I have also curl’d the address via:

curl -vk https://172.18.0.1

When I curl the above address, it always works from the node where the active Kube-VIP pod is located, but it’s 50/50 whether any of the other nodes can reach it. I’ve had cases where it worked without issue and 5 minutes later it wouldn’t, from the same node, and I’m at a loss as to what is happening.
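Something like the following should also help narrow down which hop is failing, by hitting each API server directly on its bind port and then going through the VIP, bypassing the service ClusterIP entirely (the node IPs are my three masters; 6444 is the apiserver bind port from the kubeadm command):

# each API server on its real bind port, bypassing kube-proxy and kube-vip
for ip in 192.168.2.237 192.168.2.238 192.168.2.239; do
  curl -vk "https://$ip:6444/version"
done
# the same request through the kube-vip VIP
curl -vk https://192.168.2.236:6443/version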

I am hoping someone can help me figure this out, as I’ve spent the last few days checking every Googled result for the above error and so far nothing has resolved the issue.

kube-VIP uses ARP to claim IPs; sounds like it’s normal for connectivity to drop when the VIP moves. Have you tried enabling GARP?

@fasaxc thanks for replying. I do have GARP enabled in the kube-vip configuration, which looks like this:

localPeer:
  id: test-node-01
  address: 192.168.2.237
  port: 10000
remotePeers:
- id: test-node-02
  address: 192.168.2.238
  port: 10000
- id: test-node-03
  address: 192.168.2.239
  port: 10000
vip: 192.168.2.236
gratuitousARP: true
singleNode: false
startAsLeader: true
interface: ens192
loadBalancers:
- name: Kubernetes Control Plane
  type: tcp
  port: 6443
  bindToVip: false
  backends:
  - port: 6444
    address: 192.168.2.237
  - port: 6444
    address: 192.168.2.238
  - port: 6444
    address: 192.168.2.239

[…]

Is there anywhere else it would need to be enabled?

I also need to make a correction: this morning, when I attempted to continue debugging the issue, the host that is currently running the VIP has now also stopped being able to curl the internal API endpoint:

[screenshot of the failed curl output]

Nothing has changed overnight, and no one has touched this cluster, as I’ve not yet released it for use. In fact, no host in the cluster is able to curl that URL and get the expected 403 error; they all simply get connection refused responses.

Your config has VIP = 192.168.2.236; where does 172.18.0.1 fit in?

@fasaxc The 172.18.0.1 address is the internal ClusterIP created when Kubernetes is initialized:

[screenshots of the kubernetes service and its ClusterIP]

It’s the address that is selected when a pod is being initialized, and it’s what all the logs report as the failure point when I examine them in detail, so pods must auto-discover that IP while they are being stood up.

The VIP (192.168.2.236) will be used by developers to connect to the Kubernetes cluster directly from their workstations, which is why the API is exposed outside the cluster. All other tasks executed from within the cluster are supposed to use the 172.18.0.1 ClusterIP for their API calls, as I have read that internal services will only use the internal ClusterIP for that purpose, and that’s fine.

In both cases, they terminate at the same endpoints, if you compare the configuration I posted against the more recent screenshot describing the service ClusterIP.
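Put another way, the comparison I’m drawing is roughly between these two views (outputs omitted here):

# the in-cluster path: ClusterIP 172.18.0.1:443 and the real endpoints behind it
kubectl get svc kubernetes -o wide
kubectl get endpoints kubernetes -o wide
# the external path: the kube-vip load balancer shown earlier (VIP :6443 -> masters :6444)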

Is 172.18.0.0/24 your service VIP range? If so, that’d make sense.

kube-proxy is responsible for DNATting the service VIP to the “real” IP of your API server. Is kube-proxy running and healthy? Is kube-proxy correctly configured to talk directly to the kube-VIP VIP?
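One quick way to check the DNAT side is to look at the nat table on a node that is failing, roughly like this (the KUBE-SVC chain name to list is whatever the grep turns up):

# find the rule matching the kubernetes service ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | grep 172.18.0.1
# then list that KUBE-SVC-... chain to see the KUBE-SEP-... endpoints it balances to
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n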

@fasaxc, actually, I’ve set up 172.18.0.0/16 as the service range. It might be a bit on the high side, but I like to plan for the chance that I might add more nodes to the cluster and don’t want to have to adjust the service ranges later :>

All 6 kube-proxy pods do appear to be running, and none are reporting any errors when I monitor the logs.

As for the last bit, I have not made any changes to kube-proxy to point it at the kube-vip VIP. I’ll look into that, but I might need some guidance there, as the examples I’m finding are not clear on how to adjust it.
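From what I can tell so far, the API server address kube-proxy talks to lives in the kubeconfig.conf key of the kube-proxy ConfigMap, so I’m assuming the check/adjustment would look something like this (please correct me if that’s the wrong place):

# see which API server address kubeadm gave kube-proxy
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'server:'
# if it should point at the kube-vip VIP instead, edit it and restart the daemonset
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system rollout restart daemonset kube-proxy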

What does “kubectl get endpoints kubernetes” show?

[screenshot of the kubectl get endpoints kubernetes output]

I’ve done some additional research and discovered that I had not allowed port 10256 through the firewalls on the systems. I’ve added this port to the list of enabled ports, but it has not resolved the issue; I’m still hitting this error:

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "326e291afcbfa7cf4c167e41020371bb322e71ddbc632ca8db64e807b406ebe1" network for pod "calico-kube-controllers-578894d4cd-sm7nz": networkPlugin cni failed to set up pod "calico-kube-controllers-578894d4cd-sm7nz_kube-system" network: error getting ClusterInformation: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 172.18.0.1:443: connect: connection refused, failed to clean up sandbox container "326e291afcbfa7cf4c167e41020371bb322e71ddbc632ca8db64e807b406ebe1" network for pod "calico-kube-controllers-578894d4cd-sm7nz": networkPlugin cni failed to teardown pod "calico-kube-controllers-578894d4cd-sm7nz_kube-system" network: Get https://[172.18.0.1]:443/apis/crd.projectcalico.org/v1/ipamhandles/k8s-pod-network.326e291afcbfa7cf4c167e41020371bb322e71ddbc632ca8db64e807b406ebe1: dial tcp 172.18.0.1:443: connect: connection refused]

This error happened while trying to create the calico-kube-controllers container. I deleted the pod a few times once it hit the state where it gave up retrying, to try to kick it in the pants, but that does not appear to have worked. Here’s the current state of the calico stack.
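For reference, I’m pulling that state with commands roughly like these (the pod name is the one from the error above; the k8s-app label is the one the Calico manifest applies):

kubectl -n kube-system get pods -o wide | grep calico
kubectl -n kube-system describe pod calico-kube-controllers-578894d4cd-sm7nz | tail -n 20
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=10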

Adding some more information, now that I’m digging through the iptables rules to see if I can figure out where things are breaking down:

The following screenshot has a hostname intentionally masked under a white bar to hide some confidential information:

[screenshot of the KUBE-SEP iptables rules]

The KUBE-SEP-RAKFZFEZUCMOZY2Y and KUBE-SEP-Z4MFG73UPSRU5KD3 rules also point to their respective VIP service endpoints of 2.238:6444 and 2.239:6444.

I am at a loss as to where the issue is…

I think you’ve got 6444 in your service spec, but you’re running your API server on 6443.
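A quick way to confirm would be to compare the port recorded in the endpoints object against what the API server is actually listening on on each master, e.g.:

# port the endpoints object advertises for the kubernetes service
kubectl get endpoints kubernetes -o yaml | grep -A3 'ports:'
# port the kube-apiserver process is actually bound to on a master node
sudo ss -tlnp | grep kube-apiserver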