Calico-managed container communication across hosts fails

Dear Calico community,

Based on the following environment:

  • K8s: 1.19.3
  • Calico: 3.16.4

deployed via:

  • Kubespray: 2.11

we use the following sample K8s YAML, which provides:

  • 2x K8s PODs
  • each comprising a single container based on the “network-multitool” image (which offers several network tools like “nc”, “ping”, “traceroute”, etc.)
  • running on DIFFERENT physical nodes
  • with a service attached to “mt-server” so that “mt-client” can reach it:
apiVersion: v1
kind: Namespace
metadata:
  name: clusterdbg

---
# Pod 1 - Role: server
apiVersion: v1
kind: Pod
metadata:
  name: mt-server
  namespace: clusterdbg
  labels:
    app: mt-server
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n1
  nodeSelector:
    dbgnode: n1
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-server
    ports:
    # TCP-Port 9090
    - containerPort: 9090
    # UDP-Port 9191
    - containerPort: 9191
---
# Pod 2 - Role: client
apiVersion: v1
kind: Pod
metadata:
  name: mt-client
  namespace: clusterdbg
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n2
  nodeSelector:
    dbgnode: n2
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-client
---

# Cluster IP service that is exposing the mt-server
apiVersion: v1
kind: Service
metadata:
  name: mt-server-clusterip
  namespace: clusterdbg
spec:
  selector:
    app: mt-server
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
      name: tcp-service
    - protocol: UDP
      port: 9191
      targetPort: 9191
      name: udp-service
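
To reproduce this setup, the two worker nodes need to carry the “dbgnode” labels referenced by the nodeSelectors above. A minimal sketch of the deployment steps (the node names and the manifest file name are placeholders, not taken from our setup):

# label the two physical worker nodes
kubectl label node <node-1> dbgnode=n1
kubectl label node <node-2> dbgnode=n2

# apply the manifest and verify that the pods land on different nodes
kubectl apply -f clusterdbg.yaml
kubectl -n clusterdbg get pods -o wide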

Pinging from “mt-client” to “mt-server” works. But as soon as we try to reach the server via TCP, e.g. via “netcat”, it fails (no “Hello” reaches the netcat process running on “mt-server” and listening on port 9090):

# starting the NetCat TCP listener on "mt-server" on port 9090
nc -l -p 9090

# writing to the NetCat listener from "mt-client" on port 9090
echo "Hello" | nc <target_ip> 9090
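
Besides the direct pod IP, the ClusterIP service created above can also be used as a target; a sketch of the corresponding checks (the DNS name follows the standard <service>.<namespace>.svc scheme with the default cluster.local domain assumed, and the UDP test assumes a matching UDP listener started with “nc -u -l -p 9191” on “mt-server”):

# TCP test against the ClusterIP service instead of the pod IP
echo "Hello" | nc mt-server-clusterip.clusterdbg.svc.cluster.local 9090

# UDP test against the service port 9191
echo "Hello" | nc -u mt-server-clusterip.clusterdbg.svc.cluster.local 9191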

After analyzing TCP SYN flags on the different network devices (container device, tunnel device, host device), we get the picture shown in the attached image.
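
The SYN flags can be traced with tcpdump-style captures on each of those devices; a rough sketch, assuming Calico’s default IPIP mode and placeholder interface names (the pod’s caliXXXX veth, tunl0, and the host NIC eth0):

# pod veth and IPIP tunnel device on the target node: watch for SYNs to port 9090
tcpdump -ni cali<hash> 'tcp port 9090 and tcp[tcpflags] & tcp-syn != 0'
tcpdump -ni tunl0 'tcp port 9090 and tcp[tcpflags] & tcp-syn != 0'

# host NIC: with IPIP encapsulation the inner TCP header is not matched directly,
# so capture the IPIP (IP protocol 4) traffic between the two hosts instead
tcpdump -ni eth0 'ip proto 4'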

Hints:

  • Inter-host communication used to work; it suddenly started failing and we do not know why
  • host-local communication works (e.g. when the PODs are running on the same physical node)
  • we followed the Calico troubleshooting guide under “Troubleshoot and diagnostics” without detecting any obvious errors (the basic checks are sketched after this list)
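
For reference, the basic health checks from that guide roughly boil down to the following commands (assuming calicoctl is available on the node and Calico runs in its default IPIP mode):

# BGP peering status on each node (all peers should show "Established")
calicoctl node status

# routes to the remote node's pod CIDR should go via tunl0 in IPIP mode
ip route | grep tunl0

# all calico-node pods should be Running and Ready
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide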

Does the Calico community have any ideas on how we can get rid of this issue? :blush:

Important hints:

  • the issue happens when updating K8s worker nodes from CentOS 7.6 to CentOS 7.8, and also on completely freshly installed (not updated) CentOS 7.8 worker nodes
  • the issue also appears when booting an updated CentOS 7.8 machine with the kernel version shipped with CentOS 7.6
  • when deploying the above K8s PODs on a CentOS 7.6 worker node again, it works
  • Docker Engine Community Edition 19.03.13 with API version 1.40 is in use (the relevant versions can be confirmed as sketched after this list)
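
For completeness, the OS release, running kernel and Docker versions on a node can be confirmed with standard commands; a minimal sketch:

# OS release and running kernel on the worker node
cat /etc/centos-release
uname -r

# Docker Engine and API versions
docker version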

Finally, we strongly suspect an incompatibility between:

  • CentOS 7.8
  • Calico
  • and probably the Docker Community container runtime - see image above.

Do you know of any such incompatibilities? If yes, where are they documented?

Some thoughts:

  • Maybe the OS update brought in a component that conflicts with pod networking, such as firewalld or NetworkManager. I tend to use

    watch "iptables-save -c | grep DROP | grep -v '\[0:0\]'"

    to see if iptables is dropping traffic (the pipeline needs to be quoted so the greps run inside watch rather than on watch's own output).

  • Maybe the domain name we’re detecting has changed and Calico doesn’t think the pod belongs on this host. Calico’s node name needs to agree with Kubernetes’ node name. I think we use the Kubernetes downward API to get the node name these days, so that shouldn’t happen (but if you’re building your own manifests you may get caught out).

  • Worth checking for errors/warnings in the calico-node log on the host with the target pod.

  • Do you have any network policy in play? Perhaps the source address of the packet is being incorrectly SNATted and then it doesn’t match the policy (see the sketch after this list).
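
A sketch of how the node-name, log and policy checks could look in practice (label selectors are the usual defaults and the calico-node pod name is a placeholder):

# compare Calico's node names with the Kubernetes node names
calicoctl get nodes
kubectl get nodes

# find the calico-node pod on the host with the target pod, then check its log
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
kubectl -n kube-system logs <calico-node-pod-on-that-host> | grep -iE 'error|warn'

# any Kubernetes or Calico network policies that might be in play
kubectl get networkpolicy --all-namespaces
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy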

Since the problem is gone with CentOS 7.9, we did not look at the “iptables” rules in detail to check whether the root cause is located there (e.g. a “drop” rule). A comparison between a working (CentOS 7.6/7.9) and a broken (CentOS 7.8) cluster environment might show the differences.
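
For anyone who wants to make that comparison, a minimal sketch (the host names in the file names are placeholders; the sed expression strips the packet/byte counters so they do not show up as noise in the diff):

# on one working node (CentOS 7.6/7.9) and one broken node (CentOS 7.8):
iptables-save | sed 's/\[[0-9]*:[0-9]*\]//' > /tmp/iptables-$(hostname).rules

# copy both dumps to one machine and compare them
diff /tmp/iptables-<working-node>.rules /tmp/iptables-<broken-node>.rules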

What is clear is that Kubespray uses the following directive:

  • calico_iptables_backend = “Legacy”

for Felix (see: https://docs.projectcalico.org/reference/felix/configuration) when deploying Calico on CentOS 7.x. This means the legacy “iptables” backend, not NFT (nftables), is still used on CentOS 7.x.
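
To double-check which backend Felix actually runs with, the rendered setting can be inspected; a sketch (where it ends up depends on how Kubespray renders the manifests, e.g. as an environment variable on the calico-node DaemonSet or in the FelixConfiguration resource):

# setting as an environment variable on the calico-node DaemonSet (if rendered that way)
kubectl -n kube-system get ds calico-node -o yaml | grep -i -A1 iptablesbackend

# or as part of the Felix configuration resource
calicoctl get felixconfiguration default -o yaml | grep -i iptablesbackend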

What was indeed strange is that, although the configuration file for NetworkManager below:

  • /etc/NetworkManager/conf.d/calico.conf

had been created by Kubespray with the expected content (according to: https://docs.projectcalico.org/archive/v3.16/maintenance/troubleshoot/troubleshooting), the problem still occurred on CentOS 7.8. So we assume it is not NetworkManager that causes the issue.
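
For reference, that drop-in file tells NetworkManager to leave Calico's interfaces unmanaged; a quick way to verify it (the expected content shown in the comments is roughly what the linked troubleshooting guide describes):

# verify the NetworkManager drop-in that keeps Calico's interfaces unmanaged
cat /etc/NetworkManager/conf.d/calico.conf

# expected content (roughly):
# [keyfile]
# unmanaged-devices=interface-name:cali*;interface-name:tunl0;interface-name:vxlan.calico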

With that, we mark this issue as solved.