Production Backup Strategy Overview

Overview

This production K3s cluster uses a comprehensive four-layer backup strategy to ensure data protection at different levels:

  1. K3s etcd Snapshots - Control plane database backups
  2. Longhorn Volume Backups - Persistent volume backups
  3. Velero Cluster Backups - Application-aware cluster backups
  4. CloudNative PG Backups - PostgreSQL database-consistent backups with point-in-time recovery

All backups are stored in Cloudflare R2, providing off-site redundancy and disaster recovery capabilities.

Backup Schedule Summary

Layer | Schedule | Retention | Destination
K3s etcd | Daily at 1:00 AM | 5 days | Cloudflare R2
Longhorn Volumes | Daily at 2:00 AM | 7 days | Cloudflare R2
Velero Cluster | Daily at 3:00 AM | 14 days | Cloudflare R2
CloudNative PG | Every 6 hours (4x/day) | 30 days | Cloudflare R2
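
As a rough illustration, the schedules above could be written as cron expressions along these lines. The helper name layer_cron and the layer keys are hypothetical, and the expressions are assumptions derived from the table rather than the cluster's actual configuration. The six-hour slots (00:00, 06:00, 12:00, 18:00) are also a guess, since the table only says "every 6 hours". CloudNative PG scheduled backups use a six-field expression with a leading seconds field, while the other layers use standard five-field cron.

```shell
# Hypothetical helper: print a cron expression for each backup layer.
# The expressions are assumptions based on the schedule table.
layer_cron() {
  case "$1" in
    etcd)     echo '0 1 * * *' ;;      # daily at 1:00 AM
    longhorn) echo '0 2 * * *' ;;      # daily at 2:00 AM
    velero)   echo '0 3 * * *' ;;      # daily at 3:00 AM
    cnpg)     echo '0 0 */6 * * *' ;;  # every 6 hours (seconds field first)
    *)        echo "unknown layer: $1" >&2; return 1 ;;
  esac
}
```

Cron times are usually evaluated in the time zone of the component that runs them, often UTC, so the wall-clock times in the table may need adjusting.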

Prerequisites: Cloudflare R2 Setup

Before configuring backups, you need a Cloudflare R2 bucket and API credentials:

  1. Create an R2 Bucket:

    • In your Cloudflare dashboard, go to R2 and click Create bucket
    • Give it a unique name (e.g., k3s-backup-repository)
    • Note your S3 Endpoint URL from the bucket's main page: https://<ACCOUNT_ID>.r2.cloudflarestorage.com
  2. Create R2 API Credentials:

    • On the main R2 page, click Manage R2 API Tokens
    • Click Create API Token
    • Give it a name (e.g., k3s-backup-token) and grant it Object Read & Write permissions
    • Securely copy the Access Key ID and Secret Access Key

You'll need these credentials for all four backup layers.
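
Before wiring the credentials into each backup tool, it may be worth confirming that they can reach the bucket. The sketch below assumes the AWS CLI is installed and that the R2 Access Key ID and Secret Access Key are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The helper names r2_endpoint and check_r2_access are hypothetical.

```shell
# Hypothetical helper: build the R2 S3 endpoint URL from a Cloudflare account ID.
r2_endpoint() {
  printf 'https://%s.r2.cloudflarestorage.com\n' "$1"
}

# Hypothetical helper: list the bucket to confirm the token can read it.
# Assumes the AWS CLI is installed and the R2 keys are exported as
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. R2 expects the region "auto".
check_r2_access() {
  account_id="$1"
  bucket="$2"
  aws s3 ls "s3://${bucket}" \
    --endpoint-url "$(r2_endpoint "$account_id")" \
    --region auto
}

# Usage (not run automatically):
#   check_r2_access "<ACCOUNT_ID>" k3s-backup-repository
```

An empty listing with exit status 0 would suggest the credentials and endpoint are valid. An access error usually points to a wrong key, token scope, or account ID.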

Why Four Layers?

Each backup layer serves a specific purpose:

  • etcd Snapshots: Protect the Kubernetes control plane state (API objects, cluster configuration)
  • Longhorn Backups: Protect persistent volume data independently of cluster state
  • Velero Backups: Provide application-aware backups that capture both resources and volumes together
  • CloudNative PG Backups: Provide PostgreSQL-consistent backups with point-in-time recovery capabilities

This multi-layer approach ensures you can recover from different types of failures:

  • Control plane corruption → Restore from etcd snapshot
  • Volume data loss → Restore from Longhorn backup
  • Complete cluster failure → Restore from Velero backup
  • Database corruption/PITR → Restore from CloudNative PG backup
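
The failure-to-layer mapping above can be sketched as a small lookup that prints a starting point for each restore. The function name restore_hint and the failure-type keys are hypothetical, and the placeholders in angle brackets must be filled in for a real restore. Restoring an etcd snapshot stored in R2 would likely also need the matching S3 flags, which are omitted here.

```shell
# Hypothetical helper: print a starting point for each failure type.
# Placeholders such as <snapshot> and <backup-name> are left unfilled.
restore_hint() {
  case "$1" in
    control-plane)
      echo 'k3s server --cluster-reset --cluster-reset-restore-path=<snapshot>' ;;
    volume)
      echo 'Longhorn: restore the volume from its backup' ;;
    cluster)
      echo 'velero restore create --from-backup <backup-name>' ;;
    database)
      echo 'CloudNative PG: create a new Cluster with bootstrap.recovery' ;;
    *)
      echo "unknown failure type: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   restore_hint cluster
```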

Monitoring and Maintenance

Check Backup Status

etcd Snapshots:

sudo k3s etcd-snapshot list

Longhorn Backups:

kubectl get recurringjobs -n longhorn-system
kubectl get jobs -n longhorn-system

Velero Backups:

kubectl get schedules -n velero
velero backup get

CloudNative PG Backups:

kubectl get backups -n <postgres-namespace>
kubectl get scheduledbackups -n <postgres-namespace>

Backup Health Checks

Regularly verify that backups are completing successfully:

  1. Check R2 bucket for recent backup files
  2. Review Velero backup logs:
    kubectl logs -n velero deployment/velero
  3. Check Longhorn backup jobs:
    kubectl get jobs -n longhorn-system -l app=longhorn-manager
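
One way to make "recent" concrete is to compare the timestamp of the newest backup against the layer's schedule interval. The sketch below is a minimal freshness check under that assumption. The function name backup_is_fresh is hypothetical, and how the newest backup's timestamp is obtained (from the R2 listing, Velero, or elsewhere) is left to the caller.

```shell
# Hypothetical helper: succeed if the newest backup is within the allowed age.
#   $1 = timestamp of the newest backup (seconds since the epoch)
#   $2 = current time (seconds since the epoch)
#   $3 = maximum acceptable age in hours
backup_is_fresh() {
  last_epoch="$1"
  now_epoch="$2"
  max_age_hours="$3"
  age_seconds=$(( now_epoch - last_epoch ))
  [ "$age_seconds" -le $(( max_age_hours * 3600 )) ]
}

# Usage (the timestamp value is illustrative only):
#   backup_is_fresh 1700000000 "$(date +%s)" 24 || echo "daily backup is stale"
```

For the daily layers, a threshold a little above 24 hours leaves room for a run that finishes late. For the six-hour CloudNative PG schedule, a threshold a little above 6 hours would be the equivalent.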

References