Disaster Recovery Procedures

Overview

This guide covers disaster recovery procedures for the four-layer backup strategy. Each recovery scenario addresses different types of failures and data loss situations.

Restore Strategy Overview

Understanding which backup to use and in what order is critical for successful recovery. This section explains the restore decision process and priority order.

Backup Layer Purposes

Each backup layer protects different aspects of your cluster:

etcd Snapshots (Layer 1): Control plane state
- Kubernetes API objects
- Cluster configuration
- Resource definitions
- Use when: Cluster control plane is corrupted or lost
Longhorn Backups (Layer 2): Persistent volume data
- Raw volume snapshots
- Independent of cluster state
- Use when: Volume data is corrupted or lost, but cluster is intact
Velero Backups (Layer 3): Application-aware backups
- Kubernetes resources + volumes together
- Application configurations
- Use when: Applications need to be restored, or complete namespace recovery
CloudNative PG Backups (Layer 4): Database-consistent backups
- PostgreSQL base backups + WAL archiving
- Point-in-time recovery (PITR)
- Use when: Database corruption, PITR needed, or database-specific recovery

Restore Order and Priority

When recovering from a complete cluster failure, follow this order:

1. Restore Control Plane (etcd) - Must be first if cluster is gone
2. Restore Infrastructure - Reinstall Longhorn and Velero, then restore other components (ArgoCD, etc.) via Velero
3. Restore Applications (Velero) - Your workloads
4. Verify/Restore Databases (CloudNative PG) - If Velero didn't capture correctly or PITR needed

Important: Do NOT restore all layers simultaneously. Restore in order to avoid conflicts and ensure dependencies are met.
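The ordering rule can be enforced in a script by running each layer's restore as a separate step and stopping at the first failure. The following is a minimal sketch; `run_in_order` and the step names in the usage note are hypothetical helpers, not part of K3s, Velero, or CloudNative PG.

```shell
# Run each restore step in sequence and stop at the first failure, so a
# later layer is never restored on top of a broken earlier layer.
# Each argument is the name of a command or shell function to run.
run_in_order() {
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "Step failed: $step (stopping)" >&2
      return 1
    fi
  done
}
```

A call such as `run_in_order restore_etcd restore_infra restore_apps verify_databases` would then run four functions that you define yourself, each wrapping the commands from the matching step of this guide.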

Restore Decision Matrix

Failure Type	Primary Backup	Secondary Backup	Restore Order
Complete cluster loss	etcd	Velero	etcd → Velero (infra) → Velero (apps) → Verify DBs
Control plane corruption	etcd	-	etcd snapshot restore
Application failure	Velero	-	Velero restore
Volume data loss	Longhorn	Velero	Longhorn restore OR Velero
Database corruption	CloudNative PG	Longhorn	CloudNative PG PITR → Base backup → Longhorn
Database point-in-time recovery	CloudNative PG	-	CloudNative PG PITR
Single namespace loss	Velero	-	Velero namespace restore
Infrastructure component failure	Velero	-	Velero selective restore

When NOT to Restore All Layers

Avoid restoring multiple layers simultaneously:

Don't restore etcd + Velero together: Restore etcd first, then Velero
Don't restore Longhorn + Velero volumes together: Choose one method
Don't restore CloudNative PG + Velero databases together: Use CloudNative PG for database recovery, Velero for application resources

Quick Decision Guide

Ask yourself:

1. Is the cluster completely gone?
- Yes → Start with etcd snapshot restore
- No → Skip to step 2
2. Is the infrastructure (Longhorn, Velero, ArgoCD) broken?
- Yes → Restore infrastructure via Velero
- No → Skip to step 3
3. Are applications broken?
- Yes → Restore applications via Velero
- No → Skip to step 4
4. Is the database corrupted or do you need PITR?
- Yes → Use CloudNative PG backup restore
- No → Verify database health
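The decision guide can also be written as a small function that turns the four answers into a list of actions. This is only an illustration of the logic above; `restore_plan` is a hypothetical name and it runs no restore itself.

```shell
# Print the restore actions suggested by the quick decision guide.
# Arguments are "yes" or "no" answers to the four questions, in order:
#   $1 cluster gone?          $2 infrastructure broken?
#   $3 applications broken?   $4 database corrupted or PITR needed?
restore_plan() {
  if [ "$1" = "yes" ]; then echo "etcd snapshot restore"; fi
  if [ "$2" = "yes" ]; then echo "Velero restore (infrastructure)"; fi
  if [ "$3" = "yes" ]; then echo "Velero restore (applications)"; fi
  if [ "$4" = "yes" ]; then
    echo "CloudNative PG backup restore"
  else
    echo "Verify database health"
  fi
}
```

For example, `restore_plan no no yes no` prints the application restore followed by the database health check.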

Recovery Scenarios

Scenario 1: Complete Cluster Failure

This is the worst-case scenario where the entire cluster is lost and needs to be rebuilt from scratch.

Prerequisites

Access to Cloudflare R2 bucket with backups
R2 credentials for all four backup layers
Fresh server(s) for cluster rebuild

Recovery Steps (In Order)

Step 1: Restore Control Plane (etcd)

Reinstall K3s:

curl -sfL https://get.k3s.io | sh -
sudo k3s kubectl get nodes

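Restore etcd (if needed). The install script starts K3s automatically, so stop the service before resetting the cluster:

sudo systemctl stop k3s
sudo k3s server \
  --cluster-reset \
  --etcd-s3 \
  --etcd-s3-bucket k3s-backup-repository \
  --etcd-s3-folder k3s-etcd-snapshots \
  --etcd-s3-endpoint "<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --etcd-s3-access-key "<ACCESS_KEY>" \
  --etcd-s3-secret-key "<SECRET_KEY>" \
  --cluster-reset-restore-path <snapshot-name>

This restores the control plane state from the etcd snapshot. The command exits once the reset completes; start K3s again with sudo systemctl start k3s. When restoring onto new hosts, also pass the original server token with --token, since K3s uses it to decrypt the bootstrap data stored in the snapshot.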
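To find a value for `<snapshot-name>`, the snapshots stored in the bucket can be listed first. The sketch below wraps `k3s etcd-snapshot ls` with the same S3 settings as the restore command; the `R2_ACCOUNT_ID`, `R2_ACCESS_KEY` and `R2_SECRET_KEY` environment variables and the function name are assumptions made for this example.

```shell
# List etcd snapshots stored in R2 so that one of the names can be passed
# to --cluster-reset-restore-path. The R2_* variables are hypothetical
# names; export them with your own credentials before calling this.
list_r2_snapshots() {
  sudo k3s etcd-snapshot ls \
    --etcd-s3 \
    --etcd-s3-bucket k3s-backup-repository \
    --etcd-s3-folder k3s-etcd-snapshots \
    --etcd-s3-endpoint "${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" \
    --etcd-s3-access-key "${R2_ACCESS_KEY}" \
    --etcd-s3-secret-key "${R2_SECRET_KEY}"
}
```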

Step 2: Restore Infrastructure Components

Reinstall Longhorn:

helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --create-namespace \
  --set persistence.defaultClass=true

Wait for Longhorn pods to be running:

kubectl get pods -n longhorn-system --watch
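In a script, `--watch` never returns, so a blocking readiness check is usually more convenient. A minimal sketch using `kubectl wait` follows; the function name and the 600-second default timeout are arbitrary choices for this example.

```shell
# Block until every pod in longhorn-system reports Ready, or fail when
# the timeout expires. $1 optionally overrides the timeout.
wait_for_longhorn() {
  kubectl wait pods --all \
    --namespace longhorn-system \
    --for=condition=Ready \
    --timeout="${1:-600s}"
}
```

Pods that run to completion never become Ready, so if such pods exist in the namespace the wait may time out even though Longhorn itself is healthy.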

Reinstall Velero:

kubectl create namespace velero
kubectl apply -f velero-r2-secret.yaml
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero --namespace velero -f velero-values.yaml

Make sure you have the velero-r2-secret.yaml and velero-values.yaml files from your original setup.

Verify Velero can see backups:
```
# Wait a minute for Velero to sync
velero backup get
```
You should see your previous backups listed.
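Because the sync from the bucket takes a moment, an automated rebuild can poll rather than sleep for a fixed time. The sketch below is one way to do that, assuming Velero runs in the `velero` namespace; `retry` and `velero_has_backups` are hypothetical helper names, and the check reads the Backup custom resources through kubectl.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeed only when at least one Velero Backup object has been synced.
velero_has_backups() {
  kubectl get backups.velero.io --namespace velero --output name | grep -q .
}
```

For example, `retry 12 10 velero_has_backups` checks every 10 seconds for up to about two minutes.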

Step 3: Restore Applications

Restore from Velero backup:

velero restore create --from-backup <backup-name> --wait

Step 4: Verify and Restore Databases

Verify restored applications:

kubectl get pods --all-namespaces
kubectl get pvc --all-namespaces

Check database cluster status:

kubectl get clusters.postgresql.cnpg.io -A
kubectl get pods -A -l cnpg.io/cluster

If databases need restoration:
- If Velero restored databases correctly, verify they're healthy
- If databases are corrupted or missing, restore from CloudNative PG backups (see PostgreSQL Database Recovery below)
- If point-in-time recovery is needed, use CloudNative PG PITR
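To decide which of these cases applies, it can help to compare the number of ready instances with the number requested. The sketch below assumes the CloudNative PG Cluster resource reports `status.readyInstances`; the function name is hypothetical.

```shell
# Report whether a CloudNative PG cluster has all requested instances ready.
# $1 = namespace, $2 = cluster name
cnpg_cluster_ready() {
  status=$(kubectl get clusters.postgresql.cnpg.io "$2" --namespace "$1" \
    --output jsonpath='{.status.readyInstances}/{.spec.instances}')
  ready=${status%/*}
  wanted=${status#*/}
  echo "ready instances: ${ready:-0} of ${wanted:-unknown}"
  [ -n "$ready" ] && [ "$ready" = "$wanted" ]
}
```

A non-zero exit status only means the cluster is not fully ready yet; it does not by itself say whether the data is intact.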

Scenario 2: Control Plane Corruption

When the etcd database is corrupted but the cluster is still running.

Recovery Steps

Stop k3s on all nodes:
```
sudo systemctl stop k3s
```

Restore from etcd snapshot (run on one server node only):

sudo k3s server \
  --cluster-reset \
  --etcd-s3 \
  --etcd-s3-bucket k3s-backup-repository \
  --etcd-s3-folder k3s-etcd-snapshots \
  --etcd-s3-endpoint "<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --etcd-s3-access-key "<ACCESS_KEY>" \
  --etcd-s3-secret-key "<SECRET_KEY>" \
  --cluster-reset-restore-path <snapshot-name>

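Start k3s on the restored node first, then on the other nodes. On other server nodes, delete the old /var/lib/rancher/k3s/server/db directory before starting so they rejoin the restored cluster; agent nodes run the k3s-agent service instead: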
```
sudo systemctl start k3s
```

Scenario 3: Volume Data Loss

When persistent volume data is lost but the cluster is intact.

Recovery Steps

Identify the affected volumes:
```
kubectl get pvc --all-namespaces
```

Access Longhorn UI:

kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Open http://localhost:8080

Restore from Longhorn backup:
- Navigate to Backups
- Find the backup for the affected volume
- Create a new volume from the backup
- Create a PV/PVC for the restored volume (a bound PVC cannot be repointed, so delete and recreate it if the name must stay the same)
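One way to attach a restored Longhorn volume to a workload is a statically provisioned PV/PVC pair that names the volume as its CSI handle. The sketch below prints such a pair; the function name, the `longhorn` storage class, the `ext4` filesystem, the `ReadWriteOnce` access mode and the `Retain` policy are assumptions to adjust to your setup.

```shell
# Print a PV and PVC that bind to an existing Longhorn volume.
# $1 = Longhorn volume name, $2 = size (for example 10Gi),
# $3 = PVC name, $4 = namespace
render_restored_pv() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $1
spec:
  capacity:
    storage: $2
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: $1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $3
  namespace: $4
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  volumeName: $1
  resources:
    requests:
      storage: $2
EOF
}
```

The output can be reviewed and then applied, for example with `render_restored_pv restored-vol 10Gi data-pvc my-app | kubectl apply -f -`. The Longhorn UI offers a comparable "Create PV/PVC" action on the volume.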

Or restore via Velero:

velero restore create --from-backup <backup-name> \
  --include-namespaces <affected-namespace> \
  --wait

Scenario 4: Single Application Failure

When a single application needs to be restored.

Recovery Steps

Option 1: Restore from Velero (Recommended)

velero restore create --from-backup <backup-name> \
  --include-namespaces <application-namespace> \
  --wait

Option 2: Restore from Longhorn

Restore the application's volumes from Longhorn backups
Recreate the application manifests
Recreate the PVCs so they bind to the restored volumes

Scenario 5: Partial Namespace Recovery

When specific resources in a namespace need to be restored.

Recovery Steps

velero restore create --from-backup <backup-name> \
  --include-namespaces <namespace> \
  --include-resources deployments,services,configmaps \
  --wait

Scenario 6: PostgreSQL Database Recovery

When PostgreSQL databases managed by CloudNative PG need to be restored. This scenario covers database corruption, point-in-time recovery, and complete database cluster restoration.

When to Use CloudNative PG Backups vs Other Backups

Use CloudNative PG backups when:

Database corruption is detected
Point-in-time recovery (PITR) is needed
Database-specific recovery is required
Velero backup didn't capture database correctly
Cross-cluster database migration

Use Velero/Longhorn backups when:

Complete application restore (including database)
Volume-level recovery is sufficient
Database is part of larger application recovery

Recovery Options

Option 1: Point-in-Time Recovery (PITR) - Recommended for Corruption

If you need to recover to a specific time before corruption occurred:

Identify the backup and target time:

kubectl get backups.postgresql.cnpg.io -n <postgres-namespace>
# Note the backup name and determine the recovery target time
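A recovery target written without a time zone is easy to misread, so it can help to derive the value from a Unix timestamp in UTC. The sketch below prints an RFC 3339 timestamp suitable for `targetTime`; it assumes GNU date (the `-d @<epoch>` form), and the function name is hypothetical.

```shell
# Format a Unix timestamp as an RFC 3339 UTC string for targetTime.
# Requires GNU date; BSD and macOS date use a different syntax.
format_target_time() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}
```

For example, `format_target_time 1705329000` prints `2024-01-15T14:30:00Z`.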

Create a new cluster with PITR:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>-restored
  namespace: <postgres-namespace>
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:15

  bootstrap:
    recovery:
      backup:
        name: <backup-name>
      recoveryTarget:
        targetTime: '2024-01-15 14:30:00' # Time before corruption

  backup:
    # Same backup configuration as original cluster
    barmanObjectStore:
      destinationPath: s3://postgres-backups/<cluster-name>
      s3Credentials:
        accessKeyId:
          name: postgres-backup-credentials
          key: AWS_ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-credentials
          key: AWS_SECRET_ACCESS_KEY
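      # barmanObjectStore has no top-level region field; if a region is needed,
      # reference it from the secret under s3Credentials.region
      endpointURL: https://<ACCOUNT_ID>.r2.cloudflarestorage.com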

Apply the restored cluster:
```
kubectl apply -f restored-cluster.yaml
```

Monitor restoration:

kubectl get cluster <cluster-name>-restored -n <postgres-namespace> -w
kubectl get pods -n <postgres-namespace> -l cnpg.io/cluster=<cluster-name>-restored

Option 2: Restore from Base Backup

If PITR is not needed, restore from a base backup:

List available backups:

kubectl get backups.postgresql.cnpg.io -n <postgres-namespace>

Create cluster from backup:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>-restored
  namespace: <postgres-namespace>
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:15

  bootstrap:
    recovery:
      backup:
        name: <backup-name>

  backup:
    # Same backup configuration as original
    barmanObjectStore:
      # ... same configuration

Apply and verify:

kubectl apply -f restored-cluster.yaml
kubectl get cluster <cluster-name>-restored -n <postgres-namespace>

Option 3: Restore to Different Namespace

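If you need to restore to a different namespace, the Backup resource cannot be referenced directly, because bootstrap.recovery.backup only resolves names in the new cluster's own namespace. Recover from the object store instead by declaring the original cluster under externalClusters:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>-restored
  namespace: <new-namespace> # Different namespace
spec:
  # ... same configuration
  bootstrap:
    recovery:
      source: <cluster-name> # Must match the externalClusters entry below
  externalClusters:
    - name: <cluster-name> # Original cluster name
      barmanObjectStore:
        destinationPath: s3://postgres-backups/<cluster-name>
        endpointURL: https://<ACCOUNT_ID>.r2.cloudflarestorage.com
        s3Credentials:
          accessKeyId:
            name: postgres-backup-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-credentials
            key: AWS_SECRET_ACCESS_KEY

The postgres-backup-credentials secret must also exist in <new-namespace>.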

Recovery Steps for Database Corruption

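Delete the corrupted cluster (if needed). Deleting the Cluster resource removes its pods and, by default, its persistent volume claims, so confirm that a usable backup exists first: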

kubectl delete cluster <cluster-name> -n <namespace>

Choose recovery method:
- If you know the exact time before corruption → Use PITR (Option 1)
- If you need the latest backup → Use base backup (Option 2)
Create restored cluster using one of the options above
Update application connections:
- Update service endpoints if cluster name changed
- Update connection strings in application configs
- Verify applications can connect to restored database

Verify data integrity:

# Connect to restored database
kubectl exec -it <cluster-name>-restored-1 -n <namespace> -- \
  psql -U postgres -c "SELECT COUNT(*) FROM <table-name>;"

# Compare with expected data
# Run application-specific data validation

Recovery Decision Flow

Database Issue Detected
    │
    ├─ Need PITR? ──Yes──> Use CloudNative PG PITR (Option 1)
    │
    └─No──> Latest backup sufficient? ──Yes──> Use CloudNative PG base backup (Option 2)
            │
            └─No──> Try Longhorn volume restore
                    │
                    └─No──> Use Velero backup (last resort)

Recovery Verification

After any recovery operation, verify the following:

Cluster Health

# Check nodes
kubectl get nodes

# Check pods
kubectl get pods --all-namespaces

# Check services
kubectl get svc --all-namespaces
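On a busy cluster the full listings are long, so it can be quicker to print only what looks wrong. The sketch below filters the same kubectl output; both function names are hypothetical.

```shell
# Print the names of nodes whose STATUS column is anything other than
# exactly "Ready" (cordoned nodes, shown as "Ready,SchedulingDisabled",
# are therefore listed as well).
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}

# List pods that are neither running nor completed.
list_unhealthy_pods() {
  kubectl get pods --all-namespaces \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}
```

A pod in the Running phase can still be crash-looping, so an empty result narrows the search rather than proving the cluster healthy.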

Application Functionality

Test application endpoints:

kubectl get ingress --all-namespaces
curl <application-url>

Verify data integrity:
- Check application logs
- Verify database connections
- Test critical functionality

Check persistent volumes:

kubectl get pvc --all-namespaces
kubectl get volumes -n longhorn-system

Backup System Health

# Check etcd snapshots
sudo k3s etcd-snapshot list

# Check Longhorn backups
kubectl get recurringjobs -n longhorn-system
kubectl get backups.longhorn.io -n longhorn-system

# Check Velero backups
velero backup get
kubectl get schedules -n velero

# Check CloudNative PG backups
kubectl get backups.postgresql.cnpg.io -A
kubectl get cronjobs -A | grep postgres-backup

Recovery Testing

Regular recovery testing is crucial to ensure your backup strategy works.

Test Schedule

Monthly: Test restoring a single application
Quarterly: Test restoring a namespace
Annually: Test complete cluster recovery

Test Procedure

Create a test namespace:
```
kubectl create namespace backup-test
```

Deploy a test application:

kubectl apply -f test-app.yaml -n backup-test

Create a backup:

velero backup create test-backup --include-namespaces backup-test --wait

Delete the test application:
```
kubectl delete namespace backup-test
```

Restore from backup:

velero restore create --from-backup test-backup --wait

Verify restoration:
```
kubectl get all -n backup-test
```

Clean up:

kubectl delete namespace backup-test
velero backup delete test-backup
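When this procedure is scripted, checking the reported phase is a safer pass/fail signal than relying on the exit status of `--wait` alone. The sketch below reads the phase from the Velero custom resources, assuming Velero runs in the `velero` namespace; `velero_completed` and the restore name `test-restore` in the usage note are hypothetical.

```shell
# Succeed only if the named Velero object reached phase "Completed".
# $1 = resource kind (backups or restores), $2 = object name
velero_completed() {
  phase=$(kubectl get "$1.velero.io" "$2" --namespace velero \
    --output jsonpath='{.status.phase}')
  echo "$1/$2 phase: ${phase:-unknown}"
  [ "$phase" = "Completed" ]
}
```

For example, run `velero_completed backups test-backup` after the backup step. If the restore is created with an explicit name, as in `velero restore create test-restore --from-backup test-backup --wait`, it can be checked with `velero_completed restores test-restore`.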

Recovery Best Practices

Document Recovery Procedures: Keep detailed documentation of recovery steps
Regular Testing: Test recovery procedures regularly
Backup Verification: Verify backups before you need them
Recovery Runbooks: Create runbooks for common recovery scenarios
Communication Plan: Have a plan for communicating during disasters
Recovery Time Objectives: Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
Backup Monitoring: Set up alerts for backup failures
Documentation: Keep recovery documentation up to date

Troubleshooting Recovery Issues

Velero Restore Fails

Check restore status:
```
velero restore describe <restore-name>
```
Review restore logs:
```
velero restore logs <restore-name>
```
Check resource conflicts:
- Some resources may already exist
- Use --exclude-resources to skip conflicting resources, or --existing-resource-policy=update to update them in place

Longhorn Volume Restore Fails

Check volume status:

kubectl get volumes -n longhorn-system
kubectl describe volume <volume-name> -n longhorn-system

Verify backup exists:
- Check Longhorn UI → Backups
- Verify backup is accessible

Check storage space:

kubectl get nodes
# Check available disk space on nodes
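`kubectl get nodes` does not show disk usage, so the space check has to run on each node. The sketch below assumes Longhorn's default data path, `/var/lib/longhorn`; the function name is hypothetical.

```shell
# Print the percentage of space used on the filesystem that holds a path.
# $1 optionally overrides the path; the default is Longhorn's default
# data directory.
disk_used_percent() {
  df -P "${1:-/var/lib/longhorn}" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}
```

Run it on the node itself, for example over SSH. A restore needs at least as much free space as the size of the volume being restored.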

etcd Restore Fails

Verify snapshot exists:
- Check R2 bucket for snapshot files
- Verify snapshot name is correct
Check k3s logs:
```
sudo journalctl -u k3s -f
```
Verify R2 credentials:
- Check R2 access key and secret
- Verify bucket permissions
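The snapshot and credential checks can be combined by listing the snapshot folder with the same keys. The sketch below assumes the AWS CLI is installed; the `R2_ACCOUNT_ID`, `R2_ACCESS_KEY` and `R2_SECRET_KEY` environment variables and the function name are hypothetical.

```shell
# List the objects in the etcd snapshot folder using the R2 credentials.
# A listing that succeeds confirms both the keys and read access to the
# bucket; an authorization error points at the credentials or permissions.
list_snapshot_objects() {
  AWS_ACCESS_KEY_ID="$R2_ACCESS_KEY" \
  AWS_SECRET_ACCESS_KEY="$R2_SECRET_KEY" \
  aws s3 ls "s3://k3s-backup-repository/k3s-etcd-snapshots/" \
    --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" \
    --region auto
}
```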

References

etcd Snapshots - etcd backup and restore
Longhorn Backups - Volume backup and restore
Velero Backups - Cluster backup and restore
CloudNative PG Backups - PostgreSQL backup and restore
