Troubleshooting Common Issues
Overview
When things go wrong in your K3s cluster, having a systematic troubleshooting approach helps you resolve issues quickly. This guide covers common problems and their solutions.
General Troubleshooting Approach
- Gather Information: Collect logs, events, and status information
- Identify the Scope: Determine if it's a node, pod, service, or cluster-wide issue
- Check Recent Changes: Review what changed recently (updates, deployments, config changes)
- Isolate the Problem: Narrow down to specific components
- Apply Fixes: Start with least invasive solutions
- Verify Resolution: Confirm the issue is resolved and monitor
Node Issues
Node Not Ready
Symptoms:
- Node shows NotReady status in kubectl get nodes
- Pods cannot be scheduled on the node
Diagnosis:
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check node conditions
kubectl get node <node-name> -o yaml | grep -A 10 conditions
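Grepping the YAML can truncate or miss conditions. As a minimal sketch, the conditions can instead be printed one per line with a JSONPath template and filtered for anything abnormal. The helper names node_conditions and unhealthy_conditions are hypothetical, and the filter assumes the standard condition types, where Ready should be True and every other condition should be False.

```shell
# Print each node condition as TYPE=STATUS, one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Read TYPE=STATUS lines on stdin and print only the abnormal ones:
# Ready must be True; every other condition must be False.
unhealthy_conditions() {
  awk -F= '
    NF < 2                         { next }
    $1 == "Ready" && $2 != "True"  { print; next }
    $1 != "Ready" && $2 != "False" { print }
  '
}

# Usage: node_conditions <node-name> | unhealthy_conditions
```

An empty result means every reported condition is in its expected state, so the cause is more likely outside the kubelet's own health reporting.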
Common Causes and Solutions:
- K3s Service Not Running:
# On the affected node (the service is k3s on server nodes and k3s-agent on agent nodes)
sudo systemctl status k3s
sudo systemctl start k3s
sudo journalctl -u k3s -n 100
- Network Connectivity Issues:
# Test connectivity from other nodes
ping <node-ip>
# Check DNS resolution
nslookup <node-name>
- Disk Space Issues:
# On the affected node
df -h
# Check K3s data directory
sudo du -sh /var/lib/rancher/k3s/*
- Certificate Issues:
# Check certificate expiration (needs a K3s release that ships the check subcommand;
# rotate-ca replaces the cluster CA and does not report expiration)
sudo k3s certificate check
Node Resource Exhaustion
Symptoms:
- Pods stuck in Pending state
- Node shows MemoryPressure or DiskPressure
Diagnosis:
# Check node resources
kubectl describe node <node-name>
kubectl top node <node-name>
# Check resource usage by pod
kubectl top pods -A --sort-by=memory
Solutions:
- Free Up Resources:
# Identify resource-heavy pods
kubectl top pods -A
# Delete unnecessary pods or scale down deployments
- Add Resource Limits:
# In your pod/deployment spec
resources:
  requests:
    memory: '64Mi'
    cpu: '250m'
  limits:
    memory: '128Mi'
    cpu: '500m'
- Add More Nodes: Scale your cluster by adding worker nodes
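For a Deployment that is already running, the same requests and limits can be applied without editing the manifest. The sketch below wraps kubectl set resources; the function name set_example_resources is hypothetical, and the values are simply the ones from the spec fragment above, not a recommendation for any particular workload.

```shell
# Apply the example requests and limits to every container in a Deployment.
# Usage: set_example_resources <deployment-name> <namespace>
set_example_resources() {
  kubectl set resources deployment "$1" -n "$2" \
    --requests=cpu=250m,memory=64Mi \
    --limits=cpu=500m,memory=128Mi
}
```

Changing resources updates the pod template, so the Deployment rolls out new pods. If the change should survive the next deploy, update the manifest in source control as well.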
Pod Issues
Pod Stuck in Pending
Symptoms:
- Pod shows Pending status
- Pod never starts
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Check for resource constraints
kubectl get nodes
kubectl top nodes
Common Causes:
- Insufficient Resources:
  - No nodes have available CPU/memory
  - Solution: Free up resources or add nodes
- Node Selector/Affinity Issues:
# Check pod spec for node selectors
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 nodeSelector
# Verify nodes match selector
kubectl get nodes --show-labels
- PVC Not Bound:
# Check PVC status
kubectl get pvc -n <namespace>
# Check storage class
kubectl get storageclass
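To tell these causes apart quickly, it can help to pull only the scheduler's events for the pod rather than reading the full describe output. This is a sketch using event field selectors; the function name scheduling_failures is hypothetical.

```shell
# Show why the scheduler could not place a pod.
# Usage: scheduling_failures <pod-name> <namespace>
scheduling_failures() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    --sort-by=.metadata.creationTimestamp
}
```

The event message usually states which of the causes above applies, for example a resource shortage, a selector or affinity mismatch, or an unbound PVC.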
Pod CrashLoopBackOff
Symptoms:
- Pod repeatedly crashes and restarts
- Pod shows CrashLoopBackOff status
Diagnosis:
# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Check container exit codes
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
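The exit code alone often narrows the cause. The helper below is a rough sketch that maps a code to a likely explanation using common shell and signal conventions (codes above 128 mean the process was killed by signal code minus 128). The function name explain_exit_code is hypothetical, and applications are free to use these codes differently.

```shell
# Map a container exit code to a likely cause (common conventions only).
# Usage: explain_exit_code <code>
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the container is meant to keep running" ;;
    1)   echo "general application error; check the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint or command" ;;
    137) echo "killed by SIGKILL (128+9); often the OOM killer" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application-specific exit code"
      fi ;;
  esac
}
```

For example, feeding it the value returned by the jsonpath query above turns a bare 137 into a hint to check memory limits first.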
Common Causes:
- Application Errors:
  - Check application logs for errors
  - Verify configuration files
  - Check environment variables
- Resource Limits:
# Check if OOM killed
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
# Increase memory limits if needed
- Missing Dependencies:
  - Verify required services are available
  - Check service endpoints:
kubectl get endpoints -n <namespace>
- Configuration Issues:
  - Verify ConfigMaps and Secrets are correct
  - Check volume mounts
Pod Image Pull Errors
Symptoms:
- Pod shows ImagePullBackOff or ErrImagePull
- Container cannot start
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Verify image exists and is accessible
# (K3s uses containerd rather than Docker, so pull with the bundled crictl on a node)
sudo k3s crictl pull <image-name>
Solutions:
- Private Registry Authentication:
# Create image pull secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
# Add to pod spec
imagePullSecrets:
  - name: <secret-name>
- Network Issues:
  - Check cluster can reach registry
  - Verify DNS resolution for registry
Network Issues
Service Not Accessible
Symptoms:
- Cannot access service from within or outside cluster
- Service endpoints are empty
Diagnosis:
# Check service
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints <service-name> -n <namespace>
# Check pods
kubectl get pods -n <namespace> -l <selector>
Solutions:
- No Endpoints:
  - Verify pod labels match service selector
  - Check pods are running and ready
- Port Mismatch:
  - Verify service port matches pod container port
  - Check targetPort in service spec
- Network Policies:
# Check for network policies blocking traffic
kubectl get networkpolicies -A
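The "No Endpoints" check can be scripted by reading the selector straight from the Service and listing the pods it matches, which avoids typos when copying labels by hand. This is a sketch; the function names service_selector and pods_for_service are hypothetical.

```shell
# Print a Service's selector as a label query, e.g. app=web,tier=frontend.
# Usage: service_selector <service-name> <namespace>
service_selector() {
  kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# List the pods that the Service would select.
# Usage: pods_for_service <service-name> <namespace>
pods_for_service() {
  selector=$(service_selector "$1" "$2")
  if [ -z "$selector" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$selector" -o wide
}
```

If the second command lists no pods, the labels do not match the selector. If it lists pods that are not Ready, the endpoints stay empty until their readiness probes pass.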
DNS Resolution Issues
Symptoms:
- Cannot resolve service names
- DNS queries fail
Diagnosis:
# Check CoreDNS pods
kubectl get pods -n kube-system | grep coredns
# Test DNS from pod (busybox is pinned to 1.28 because nslookup in some later busybox images is unreliable)
kubectl run -it --rm --restart=Never test-dns --image=busybox:1.28 -- nslookup kubernetes.default
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Solutions:
- CoreDNS Not Running:
# Restart CoreDNS
kubectl delete pod -n kube-system -l k8s-app=kube-dns
- DNS Configuration:
  - Check CoreDNS ConfigMap:
kubectl get configmap coredns -n kube-system -o yaml
  - Verify upstream DNS servers
Storage Issues
PVC Not Binding
Symptoms:
- PVC shows Pending status
- Pods cannot start due to missing volumes
Diagnosis:
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
Solutions:
- Storage Class Issues:
  - Verify the storage class exists and is either the default or named in the PVC
  - Check provisioner is running (e.g., Longhorn)
- Insufficient Storage:
  - Check available storage in storage system
  - For Longhorn:
kubectl get volumes.longhorn.io -n longhorn-system
- Access Mode Mismatch:
  - Verify PVC access mode matches storage class capabilities
Volume Mount Errors
Symptoms:
- Pod cannot mount volume
- Permission denied errors
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Verify volume exists
kubectl get pv
kubectl get pvc -n <namespace>
Solutions:
- Volume Not Found:
  - Verify PVC exists and is bound
  - Check volume name in pod spec
- Permission Issues:
  - Check security context in pod spec
  - Verify volume supports required access mode
Certificate Issues
Certificate Expiration
Symptoms:
- Authentication failures
- TLS handshake errors
Diagnosis:
# Check certificate expiration (on node; needs a K3s release that ships the check subcommand)
sudo k3s certificate check
# Check API server certificate
sudo openssl x509 -in /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt -noout -dates
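Reading dates by eye does not scale to many certificate files. As a sketch, openssl's -checkend option can turn the check into a yes/no answer suitable for a loop or a cron job. The function name cert_expires_within and the 30-day threshold in the usage comment are assumptions for illustration.

```shell
# Succeed (exit 0) if the certificate expires within the given number of days.
# Usage: cert_expires_within <cert-file> <days>
cert_expires_within() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  # -checkend exits 0 when the certificate is still valid after N seconds,
  # so the result is inverted here.
  ! openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Usage (run as root on a server node):
#   for crt in /var/lib/rancher/k3s/server/tls/*.crt; do
#     cert_expires_within "$crt" 30 && echo "expiring soon: $crt"
#   done
```

The separate exit code 2 for unreadable files keeps a permissions problem from being reported as an expiring certificate.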
Solutions:
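- Rotate Certificates:
# On each node (the service is k3s-agent instead of k3s on agent nodes).
# rotate renews client and server certificates; rotate-ca is only for replacing the cluster CA.
sudo systemctl stop k3s
sudo k3s certificate rotate
sudo systemctl start k3s
- Manual Certificate Renewal:
  - Follow K3s certificate renewal documentation
  - May require cluster restart in some cases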
etcd Issues (HA Clusters)
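etcd Not Healthy
Symptoms:
- K3s server logs report etcd errors, or the k3s service keeps restarting on server nodes
- Cluster connectivity issues
Diagnosis:
# K3s runs etcd embedded in the k3s server process rather than as a pod,
# so there is no etcd pod to list. Check which nodes are etcd members:
kubectl get nodes -l node-role.kubernetes.io/etcd=true
# Check etcd messages in the K3s service logs (on a server node)
sudo journalctl -u k3s | grep -i etcd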
Solutions:
- etcd Data Corruption:
  - Restore from etcd snapshot
  - See Backup and Disaster Recovery
- Quorum Loss:
  - Ensure a majority of etcd nodes are running
  - In a 3-node cluster, at least 2 nodes must be running
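The "2 of 3" rule generalizes: an etcd cluster of N members needs floor(N/2) + 1 of them running. The two helpers below are a small sketch of that arithmetic; the function names are hypothetical.

```shell
# Members needed for quorum in an etcd cluster of N members: floor(N/2) + 1.
# Usage: etcd_quorum <member-count>
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the cluster still keeps quorum.
# Usage: etcd_fault_tolerance <member-count>
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

A 4-member cluster needs 3 members and tolerates only 1 failure, the same as a 3-member cluster, which is why odd member counts are the usual choice.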
Log Analysis
Viewing Logs
# Pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> --tail=100 -f
# Component logs
kubectl logs -n kube-system <component-pod>
# K3s service logs (on node)
sudo journalctl -u k3s -n 100
sudo journalctl -u k3s -f
Common Log Patterns
- OOM Killed: Out of memory or OOMKilled
- Image Pull: Failed to pull image or ImagePullBackOff
- Crash: container exited with code or CrashLoopBackOff
- Network: connection refused or timeout
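These patterns can be applied mechanically to a log stream. The filter below is a sketch that tags each matching line with one of four category labels; the function name classify_log_lines and the label names are hypothetical, and matching is case-insensitive substring matching on exactly the strings listed above.

```shell
# Tag each log line on stdin with the first matching category; drop other lines.
classify_log_lines() {
  while IFS= read -r line || [ -n "$line" ]; do
    lower=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
      *"out of memory"*|*oomkilled*)                     tag="OOM" ;;
      *"failed to pull image"*|*imagepullbackoff*)       tag="IMAGE_PULL" ;;
      *"container exited with code"*|*crashloopbackoff*) tag="CRASH" ;;
      *"connection refused"*|*timeout*)                  tag="NETWORK" ;;
      *) continue ;;
    esac
    printf '%s\t%s\n' "$tag" "$line"
  done
}

# Usage: sudo journalctl -u k3s -n 500 | classify_log_lines
```

Substring matching on a word like timeout will also catch harmless lines such as configuration values, so treat the output as a starting point for reading the logs, not a diagnosis.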
Getting Help
If you cannot resolve an issue:
- Collect Information:
# Cluster info
kubectl cluster-info dump > cluster-info.txt
# Node info
kubectl get nodes -o yaml > nodes.yaml
# Recent events
kubectl get events -A > events.txt
- Check Documentation:
- Community Resources:
  - K3s GitHub Issues
  - Kubernetes Slack/Discord
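When asking for help, it is convenient to gather the files from the Collect Information step into a single directory. The function below is a sketch that runs the same three commands; the name collect_diagnostics, the default directory name, and the errors.log file are assumptions for illustration.

```shell
# Collect the diagnostic outputs into one directory and print its path.
# Usage: collect_diagnostics [output-dir]
collect_diagnostics() {
  dir="${1:-k3s-diagnostics-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$dir" || return 1
  # Errors from each command are kept in errors.log so a partial bundle is still usable.
  kubectl cluster-info dump > "$dir/cluster-info.txt" 2> "$dir/errors.log"
  kubectl get nodes -o yaml > "$dir/nodes.yaml" 2>> "$dir/errors.log"
  kubectl get events -A > "$dir/events.txt" 2>> "$dir/errors.log"
  echo "$dir"
}
```

The cluster dump includes resource definitions and pod logs, so review the files for sensitive values before attaching them to a public issue.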
Related Documentation
- K3s Maintenance Overview - Maintenance overview
- Health Checks - Proactive health monitoring
- Updating K3s - Update-related issues
- Backup and Disaster Recovery - Recovery procedures