Troubleshooting Common Issues

Overview

When things go wrong in your K3s cluster, having a systematic troubleshooting approach helps you resolve issues quickly. This guide covers common problems and their solutions.

General Troubleshooting Approach

Gather Information: Collect logs, events, and status information
Identify the Scope: Determine if it's a node, pod, service, or cluster-wide issue
Check Recent Changes: Review what changed recently (updates, deployments, config changes)
Isolate the Problem: Narrow down to specific components
Apply Fixes: Start with least invasive solutions
Verify Resolution: Confirm the issue is resolved and monitor

Node Issues

Node Not Ready

Symptoms:

Node shows NotReady status in kubectl get nodes
Pods cannot be scheduled on the node

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o yaml | grep -A 10 conditions

Common Causes and Solutions:

K3s Service Not Running:

# On the affected node
sudo systemctl status k3s
sudo systemctl start k3s
sudo journalctl -u k3s -n 100

Network Connectivity Issues:

# Test connectivity from other nodes
ping <node-ip>
# Check DNS resolution
nslookup <node-name>

Disk Space Issues:

# On the affected node
df -h
# Check K3s data directory
du -sh /var/lib/rancher/k3s/*

Certificate Issues:

# Check certificate expiration
sudo k3s certificate rotate-ca

Node Resource Exhaustion

Symptoms:

Pods stuck in Pending state
Node shows MemoryPressure or DiskPressure

Diagnosis:

# Check node resources
kubectl describe node <node-name>
kubectl top node <node-name>

# Check resource usage by pod
kubectl top pods -A --sort-by=memory

Solutions:

Free Up Resources:

# Identify resource-heavy pods
kubectl top pods -A
# Delete unnecessary pods or scale down deployments

Add Resource Limits:

# In your pod/deployment spec
resources:
  requests:
    memory: '64Mi'
    cpu: '250m'
  limits:
    memory: '128Mi'
    cpu: '500m'

Add More Nodes:
- Scale your cluster by adding worker nodes

Pod Issues

Pod Stuck in Pending

Symptoms:

Pod shows Pending status
Pod never starts

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check for resource constraints
kubectl get nodes
kubectl top nodes

Common Causes:

Insufficient Resources:
- No nodes have available CPU/memory
- Solution: Free up resources or add nodes

Node Selector/Affinity Issues:

# Check pod spec for node selectors
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 nodeSelector
# Verify nodes match selector
kubectl get nodes --show-labels

PVC Not Bound:

# Check PVC status
kubectl get pvc -n <namespace>
# Check storage class
kubectl get storageclass

Pod CrashLoopBackOff

Symptoms:

Pod repeatedly crashes and restarts
Pod shows CrashLoopBackOff status

Diagnosis:

# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check container exit codes
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'

Common Causes:

Application Errors:
- Check application logs for errors
- Verify configuration files
- Check environment variables

Resource Limits:

# Check if OOM killed
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
# Increase memory limits if needed

Missing Dependencies:
- Verify required services are available
- Check service endpoints: kubectl get endpoints -n <namespace>
Configuration Issues:
- Verify ConfigMaps and Secrets are correct
- Check volume mounts

Pod Image Pull Errors

Symptoms:

Pod shows ImagePullBackOff or ErrImagePull
Container cannot start

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Verify image exists and is accessible
docker pull <image-name>

Solutions:

Private Registry Authentication:

# Create image pull secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Add to pod spec
imagePullSecrets:
  - name: <secret-name>

Network Issues:
- Check cluster can reach registry
- Verify DNS resolution for registry

Network Issues

Service Not Accessible

Symptoms:

Cannot access service from within or outside cluster
Service endpoints are empty

Diagnosis:

# Check service
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check pods
kubectl get pods -n <namespace> -l <selector>

Solutions:

No Endpoints:
- Verify pod labels match service selector
- Check pods are running and ready
Port Mismatch:
- Verify service port matches pod container port
- Check targetPort in service spec

Network Policies:

# Check for network policies blocking traffic
kubectl get networkpolicies -A

DNS Resolution Issues

Symptoms:

Cannot resolve service names
DNS queries fail

Diagnosis:

# Check CoreDNS pods
kubectl get pods -n kube-system | grep coredns

# Test DNS from pod
kubectl run -it --rm --restart=Never test-dns --image=busybox -- nslookup kubernetes.default

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

Solutions:

CoreDNS Not Running:

# Restart CoreDNS
kubectl delete pod -n kube-system -l k8s-app=kube-dns

DNS Configuration:
- Check CoreDNS ConfigMap: kubectl get configmap coredns -n kube-system -o yaml
- Verify upstream DNS servers

Storage Issues

PVC Not Binding

Symptoms:

PVC shows Pending status
Pods cannot start due to missing volumes

Diagnosis:

# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

Solutions:

Storage Class Issues:
- Verify storage class exists and is default
- Check provisioner is running (e.g., Longhorn)
Insufficient Storage:
- Check available storage in storage system
- For Longhorn: kubectl get volumes -n longhorn-system
Access Mode Mismatch:
- Verify PVC access mode matches storage class capabilities

Volume Mount Errors

Symptoms:

Pod cannot mount volume
Permission denied errors

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Verify volume exists
kubectl get pv
kubectl get pvc -n <namespace>

Solutions:

Volume Not Found:
- Verify PVC exists and is bound
- Check volume name in pod spec
Permission Issues:
- Check security context in pod spec
- Verify volume supports required access mode

Certificate Issues

Certificate Expiration

Symptoms:

Authentication failures
TLS handshake errors

Diagnosis:

# Check certificate expiration (on node)
sudo k3s certificate rotate-ca --check

# Check API server certificate
openssl x509 -in /var/lib/rancher/k3s/server/tls/server.crt -noout -dates

Solutions:

Rotate Certificates:

# On each node
sudo k3s certificate rotate-ca

Manual Certificate Renewal:
- Follow K3s certificate renewal documentation
- May require cluster restart in some cases

etcd Issues (HA Clusters)

etcd Pod Not Running

Symptoms:

etcd pod in Error or CrashLoopBackOff
Cluster connectivity issues

Diagnosis:

# Check etcd pods
kubectl get pods -n kube-system | grep etcd

# Check etcd logs
kubectl logs -n kube-system etcd-<node-name>

Solutions:

etcd Data Corruption:
- Restore from etcd snapshot
- See Backup and Disaster Recovery
Quorum Loss:
- Ensure majority of etcd nodes are running
- In 3-node cluster, need at least 2 nodes

Log Analysis

Viewing Logs

# Pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> --tail=100 -f

# Component logs
kubectl logs -n kube-system <component-pod>

# K3s service logs (on node)
sudo journalctl -u k3s -n 100
sudo journalctl -u k3s -f

Common Log Patterns

OOM Killed: Out of memory or OOMKilled
Image Pull: Failed to pull image or ImagePullBackOff
Crash: container exited with code or CrashLoopBackOff
Network: connection refused or timeout

Getting Help

If you cannot resolve an issue:

Collect Information:

# Cluster info
kubectl cluster-info dump > cluster-info.txt

# Node info
kubectl get nodes -o yaml > nodes.yaml

# Recent events
kubectl get events -A > events.txt

Check Documentation:
- K3s Documentation
- Kubernetes Troubleshooting
Community Resources:
- K3s GitHub Issues
- Kubernetes Slack/Discord

K3s Maintenance Overview - Maintenance overview
Health Checks - Proactive health monitoring
Updating K3s - Update-related issues
Backup and Disaster Recovery - Recovery procedures

Overview​

General Troubleshooting Approach​

Node Issues​

Node Not Ready​

Node Resource Exhaustion​

Pod Issues​

Pod Stuck in Pending​

Pod CrashLoopBackOff​

Pod Image Pull Errors​

Network Issues​

Service Not Accessible​

DNS Resolution Issues​

Storage Issues​

PVC Not Binding​

Volume Mount Errors​

Certificate Issues​

Certificate Expiration​

etcd Issues (HA Clusters)​

etcd Pod Not Running​

Log Analysis​

Viewing Logs​

Common Log Patterns​

Getting Help​

Related Documentation​

Overview

General Troubleshooting Approach

Node Issues

Node Not Ready

Node Resource Exhaustion

Pod Issues

Pod Stuck in Pending

Pod CrashLoopBackOff

Pod Image Pull Errors

Network Issues

Service Not Accessible

DNS Resolution Issues

Storage Issues

PVC Not Binding

Volume Mount Errors

Certificate Issues

Certificate Expiration

etcd Issues (HA Clusters)

etcd Pod Not Running

Log Analysis

Viewing Logs

Common Log Patterns

Getting Help

Related Documentation