Let's be honest for a second. If you are a CTO or a DevOps engineer, there is a tiny, dark corner in your mind that you try to ignore. It's the corner that asks: "If our primary database vanishes into the void right now, would the backup actually work?"

At THEKROLL LTD, we build and operate SaaS platforms like docs101.com and heavylogix.com.cy. We don't just write code; we manage the dirty reality of TB-scale databases for e-commerce and high-traffic platforms. We've seen servers catch fire (metaphorically and literally), and we've seen "perfect" backup strategies fail because of a missing encryption key or a corrupted WAL file.

In this article, I'm not going to bore you with the 100th generic "How to Postgres" tutorial. I'm going to show you the battle-tested, "sleep-at-night" strategy we use to secure our Keycloak and SaaS databases using CloudNativePG (CNPG) on Kubernetes.

And yes, there will be YAML.

The Strategy: "Trust No One, Verify Everything"

We use a GitOps approach (FluxCD with SOPS/Age encryption), but frankly, I don't care if you deploy your clusters with Terraform, Ansible, or by carrying USB sticks via sneaker-net. The principles remain the same.

Our goal for a standard SaaS setup (like a Keycloak Identity Provider) is simple:

  1. RPO (Recovery Point Objective): Near zero. We cannot lose user registrations.
  2. RTO (Recovery Time Objective): Fast. Restoring from a week of logs is not an option.
  3. Cost: Minimal. We run a business, not a charity for cloud providers.

The Sizing Myth: "I need a massive database"

Let's talk about Keycloak. We often see people provisioning huge volumes for it. In reality, a standard SaaS Keycloak instance with tens of thousands of users fits comfortably into a 10Gi PVC. Unless you hoard user sessions and event logs indefinitely, 10GB will last you for years. Don't over-provision; it just makes backups slower.
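
Don't guess; measure. If you already have a cluster running, ask Postgres directly how big the database really is (the pod and database names below match the cluster we define later in this article):

// bash
# Ask Postgres for the actual on-disk size of the keycloak database
kubectl exec -n auth auth-db-1 -c postgres -- \
  psql -tAc "SELECT pg_size_pretty(pg_database_size('keycloak'));"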

The "Safety" Illusion: Encryption & S3

We back up to S3-compatible object storage (like Hetzner or MinIO). Here is a hard truth about Server-Side Encryption (SSE):

If you set encryption: AES256 in your backup config, the S3 provider encrypts your data at rest on their disks. Great. But they also hold the key. If an attacker steals your S3 credentials (access_key and secret_key), the provider happily decrypts the data for them upon download.

The Reality Check:

  • Standard SaaS: HTTPS in transit + secure credential management is sufficient. If hackers have your credentials, you have bigger problems than SSE.
  • Health/Finance Data: You need SSE-C (Customer-Provided Keys) or client-side encryption (see the sketch below). But be warned: if you lose that key, your data is digital garbage. Complex setups increase the risk of self-inflicted data loss. We prefer keeping it simple and secure.
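
For illustration, here is roughly what SSE-C looks like with the plain AWS CLI; bucket and file names are placeholders. Notice that the same key must be presented again on download, which is exactly the self-inflicted-data-loss risk we mean:

// bash
# Generate a 256-bit customer key (store it somewhere SAFER than the bucket)
openssl rand -out key.bin 32

# Upload with SSE-C: the provider encrypts, but never stores your key
aws s3 cp backup.tar.gz s3://your-bucket-name/backup.tar.gz \
  --sse-c AES256 --sse-c-key fileb://key.bin

# Download only works if you present the exact same key again
aws s3 cp s3://your-bucket-name/backup.tar.gz restored.tar.gz \
  --sse-c AES256 --sse-c-key fileb://key.bin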

The Configuration: Less is More

Here is the stripped-down, production-ready configuration we use. Note that we removed the encryption: AES256 line for compatibility with generic S3 providers (and because of the reasons mentioned above).

1. The Credentials

We use SOPS to encrypt this file in Git. It's safe. Do not commit secrets in plain text. Ever.

// yaml
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-backup-creds
  namespace: auth
type: Opaque
stringData:
  ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
  SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"
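
For reference, encrypting this manifest with SOPS and Age looks roughly like this (the file name is ours; AGE_PUBLIC_KEY is your own Age recipient):

// bash
# Encrypt only the sensitive fields before committing to Git
# AGE_PUBLIC_KEY is your Age recipient key (starts with "age1...")
sops --encrypt --age "$AGE_PUBLIC_KEY" \
  --encrypted-regex '^(stringData|data)$' \
  --in-place keycloak-backup-creds.yaml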

2. The Cluster Definition (CNPG)

This defines a high-availability cluster (3 instances) and configures the S3 destination for backups.

// yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: auth-db
  namespace: auth
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.6

  # Bootstrap: Define the owner clearly.
  bootstrap:
    initdb:
      database: keycloak
      owner: keycloak

  storage:
    size: 10Gi
    storageClass: longhorn  # Or gp3, local-path, etc.

  backup:
    barmanObjectStore:
      destinationPath: s3://your-bucket-name/auth-db-backups/
      endpointURL: https://fsn1.your-objectstorage.com
      s3Credentials:
        accessKeyId:
          name: keycloak-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: keycloak-backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
      data:
        compression: gzip
    # 30 Days Retention. Compliant with most EU standard ops.
    retentionPolicy: "30d"
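
After the cluster reconciles, verify that WAL archiving actually works before you trust it. The kubectl-cnpg plugin (if installed) gives the quickest overview; the condition check is a plain-kubectl alternative:

// bash
# Requires the kubectl-cnpg plugin: one-page status incl. archiving
kubectl cnpg status auth-db -n auth

# Plain kubectl: the ContinuousArchiving condition should read "True"
kubectl get cluster auth-db -n auth \
  -o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")].status}'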

3. The Automation: Daily Routines

Defining the cluster is only half the battle. If you don't tell it when to back up, it will only stream WAL files. Relying solely on WALs for a restore is like trying to rebuild a house from a receipt for every brick you bought: it takes forever.

We need a "Base Backup" (a full snapshot) every night. This ensures that in a worst-case scenario, the database only needs to replay logs from the last 24 hours max.

// yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: auth-db-daily
  namespace: auth
spec:
  # Cron for: Every day at 03:00:00 AM.
  # We use the extended cron format (Sec Min Hour ...) in quotes.
  schedule: "0 0 3 * * *"
  cluster:
    name: auth-db  # Must match your Cluster metadata.name
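
Every run of this schedule creates a Backup object. A quick sanity check, assuming the names above:

// bash
# List backups created by the schedule; the newest should be "completed"
kubectl get backups -n auth

# When did the schedule last fire?
kubectl get scheduledbackup auth-db-daily -n auth \
  -o jsonpath='{.status.lastScheduleTime}'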

The "Danger Zone": Day 2 Operations

So you applied the YAML. Great. Now you enter the Danger Zone.

The Rolling Restart: Applying a backup config often requires a restart of the Postgres pods to enable archive_mode. CNPG handles this with a rolling update. Your app might blink for a second. Deal with it.
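
You can watch the blink happen in real time; CNPG typically updates the replicas first and the primary last:

// bash
# Watch the pods cycle one by one during the rolling update
kubectl get pods -n auth -l cnpg.io/cluster=auth-db -w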

The Gap: Once the cluster is running, it starts streaming WAL files (transaction logs) to S3. BUT, and this is huge: WAL files are useless without a Base Backup.

If your server dies 4 hours after deployment, and you haven't done a full backup yet, you have nothing. Zero. Nada.

The Fix: Always trigger a manual backup immediately after deployment:

// yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: initial-smoke-test
  namespace: auth
spec:
  cluster:
    name: auth-db

// bash
kubectl apply -f manual-backup.yaml
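
And then verify it, because applying is not succeeding. On current CNPG versions the phase should eventually read "completed":

// bash
# Poll until the phase flips from "running" to "completed"
kubectl get backup initial-smoke-test -n auth \
  -o jsonpath='{.status.phase}'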

The Math of Recovery (RTO)

Why do we schedule daily full backups instead of weekly?

Imagine you have a weekly full backup. Your database crashes on day 6. To restore, Postgres has to:

  1. Download the full backup from 6 days ago (Fast).
  2. Replay 6 days' worth of transaction logs (WALs) sequentially (Slow).

Replaying a week of WALs can take hours. During those hours, your SaaS is down, and your customers are writing angry tweets.

Daily Full Backups mean you replay a maximum of 24 hours of logs.

The Cost:

Storing 30 daily backups of a 10GB Keycloak database (compressed to ~2GB) plus logs on S3 costs peanuts. Even for a 1TB production DB, the storage cost is negligible compared to the cost of 4 hours of downtime.
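
For intuition, a back-of-envelope calculation with illustrative numbers (the WAL volume and per-GB price are assumptions; plug in your own):

// bash
# Illustrative numbers only: adjust to your workload and provider pricing
BASE_GB=$((30 * 2))                # 30 daily base backups, ~2GB each gzipped
WAL_GB=$((30 * 1))                 # assume ~1GB/day of compressed WAL
TOTAL_GB=$((BASE_GB + WAL_GB))     # 90 GB total
# at ~$0.005 per GB/month that is about $0.45/month
echo "scale=2; $TOTAL_GB * 0.005" | bc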

The Secret Sauce: Automated Verification

Here is how THEKROLL LTD differs from the amateurs.

Most people check kubectl get backups and see status Completed. They smile and go to sleep.

This is a lie. A backup is only valid if you have successfully restored it.

We implement an automated pipeline (you can use a CronJob) that does the following:

  1. Spin up a "Cold-Standby" Cluster: A fresh CNPG cluster in a test namespace.
  2. Bootstrap from Backup: It pulls the latest production data from S3.
  3. Run a Smoke Test: Count the rows in the user table.
  4. Destroy the Cluster: Clean up resources.

This is the recovery config for the test cluster:

// yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: auth-db-verify  # A NEW name
  namespace: auth-test
spec:
  instances: 1
  bootstrap:
    recovery:
      source: auth-db  # The name of your PROD cluster
  externalClusters:
    - name: auth-db
      barmanObjectStore:
        # Same S3 config as production
        destinationPath: s3://your-bucket-name/auth-db-backups/
        endpointURL: https://fsn1.your-objectstorage.com
        # ... creds ...

# CRITICAL WARNING:
# Do NOT include a 'backup' section here.
# If you do, the test cluster might overwrite your production backups.
# We want read-only access to the S3 bucket for the restore.
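
And a minimal sketch of the smoke test itself, assuming the verify cluster above and Keycloak's standard user_entity table; wrap it in a CronJob or CI job as you prefer:

// bash
# Wait for the restored instance (CNPG names the first pod <cluster>-1)
kubectl wait --for=condition=Ready pod/auth-db-verify-1 -n auth-test --timeout=15m

# Smoke test: count users in the restored Keycloak database
USERS=$(kubectl exec -n auth-test auth-db-verify-1 -c postgres -- \
  psql -U postgres -d keycloak -tAc "SELECT count(*) FROM user_entity;")
echo "Restored user count: $USERS"
[ "$USERS" -gt 0 ] || { echo "RESTORE VERIFICATION FAILED"; exit 1; }

# Tear down the test cluster
kubectl delete cluster auth-db-verify -n auth-test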

Conclusion

Backups are boring. Disaster recovery is stressful. But setting up a self-verifying system allows us to focus on building features for docs101.com and heavylogix.com.cy instead of praying to the database gods.

If you need help designing a high-availability database strategy or migrating your legacy mess to Kubernetes without losing your mind, get in touch with us. We've seen it all, and we've fixed most of it.