Royal Army College

Lesson Overview

A digital state is its data. The registers that make Kaharagia a state, the accounts of its nationals, the records and the keys, all of it lives as information on disks somewhere. Lose that information and you have not lost a building you can rebuild; you have lost the thing the buildings were for. The previous lesson taught how to keep services patched, monitored, and documented so they run. This lesson takes up the harder question that sits behind running: what happens when something goes wrong anyway, when a disk fails, a database is corrupted, a server is encrypted by ransomware, or a person deletes the wrong thing, and how the data and the services come back.

The answer is backups, taken on a schedule, stored in more than one place, and, above all, tested. A backup that has never been restored is not a backup; it is a hope. This lesson explains where the Principality's data actually lives, the simple discipline of the 3-2-1 rule applied at the system level, the two plain numbers that describe how much you might lose and how long you might be down, the unglamorous habit of storage hygiene that keeps backups affordable and disks healthy, and how all of this rolls up into the continuity of a state that has no territory to fall back on.

This is the knowledge layer. Configuring a backup job, running a real restore drill, and rehearsing a recovery are done and signed off in person, on systems you are appointed to, where supervision allows. Reading about a restore is not the same as having done one, and the whole point of this lesson is that the doing must actually happen.

By the end you will be able to explain where a small force's data lives and why that matters for backup, apply the 3-2-1 rule at the system level and state the cardinal rule that an untested backup is not a backup, explain recovery point and recovery time in plain terms and reason about them for a given service, describe basic storage hygiene, and connect backups to the continuity of a non-territorial Principality.

Key Terms

Backup: a separate copy of data, made on a schedule, kept so that the original can be restored if it is lost, corrupted, or encrypted.
Restore: the act of putting backed-up data back, returning a service or a dataset to a working state from a backup copy.
Database: the structured store that holds a service's records, for example the identity service's accounts or a register's entries; the heart of most services and the first thing to back up.
Data volume: the storage that holds a service's files, for example uploaded documents, images, or a service's working files, kept alongside or instead of a database.
3-2-1 rule: keep 3 copies of important data, on 2 different kinds of media, with 1 of them off-site or offline.
Off-site copy: a backup held somewhere other than where the live system runs, so that a single fire, theft, or compromise cannot take both.
Offline copy: a backup not connected to the live system, so that ransomware or an attacker on the live system cannot reach and destroy it.
Recovery point (RPO): how much recent data you could lose, measured as the gap back to the last good backup; a backup taken once every 24 hours means up to a day of data is at risk.
Recovery time (RTO): how long it takes to get a service working again after a loss, from the moment of failure to the moment it is back.
Retention: how long backups are kept before being deleted, and how many old copies are held.
Snapshot: a point-in-time image of a disk or volume, quick to take but usually held on the same system, so it is a convenience and not a substitute for a real off-site backup.
Continuity: the ability of the state and its services to keep functioning, or to return quickly, through and after a disruption.

Where the data actually lives

You cannot back up what you cannot find, so the first task is to know where a service keeps its data. Most self-hosted services store their information in two kinds of place, and a good backup covers both.

The first is the database. The identity service holds its accounts and groups in a database; a register holds its entries in one; the records system, the chat server, the newsletter tool, each keeps the records that matter in structured tables. A database is usually the single most precious part of a service, because it holds the meaning, the accounts and the entries and the relationships between them, in a form that cannot be reconstructed if it is gone. Databases are also delicate: you cannot simply copy the files while the database is running and trust them, because a write may be half-finished. This is why databases are backed up with their own proper tool, which asks the database to produce a clean, consistent copy, a dump, that can be reloaded.
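As a minimal sketch of what "asking the database for a clean copy" looks like, the example below uses SQLite from Python's standard library, chosen only because it runs anywhere. The Principality's services may well use a different database with its own dump tool, and the table and entries here are hypothetical.

```python
import sqlite3


def dump_database(conn: sqlite3.Connection) -> str:
    """Ask the database itself for a consistent SQL dump (not a raw file copy)."""
    return "\n".join(conn.iterdump())


def load_dump(dump_sql: str) -> sqlite3.Connection:
    """Reload a dump into a fresh, empty in-memory database."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(dump_sql)
    return restored


# A hypothetical register with two entries, standing in for a live service.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
live.executemany("INSERT INTO entries (name) VALUES (?)", [("first",), ("second",)])
live.commit()

dump_sql = dump_database(live)      # the backup: plain SQL text
restored = load_dump(dump_sql)      # the reload: a fresh database built from it
restored_count = restored.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
print(restored_count)  # 2
```

The point of the design is that the dump is produced by the database, which knows which writes are complete, and not by copying its files from underneath it.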

The second is the data volume, the ordinary files a service keeps outside its database: uploaded documents, generated images, attachments, a service's configuration. On a containerised estate these live in named volumes or mounted directories on the server's disk, deliberately kept outside the container so they survive the container being rebuilt. They are backed up by copying the files, which is simpler than a database but no less important, because a register entry that points at a missing scanned document is only half a record.

A third thing must be guarded with special care: the keys and secrets. These include the identity service's signing keys, the per-user certificates such as the TAK .p12, and the credentials services use to reach one another. They may live in a database, in files, or in a secrets store, and they are small but disproportionately important, because some of them cannot simply be regenerated without invalidating things that depend on them. Knowing which secrets must be preserved, and protecting those backups even more tightly than the rest, is part of the job.

   WHERE A SERVICE'S DATA LIVES  (and what backs it up)

   SERVICE
     |
     +---- DATABASE  ......... accounts, entries, records, relationships
     |        backed up by: a proper dump (clean, consistent copy)
     |
     +---- DATA VOLUME  ...... uploaded files, documents, images, config
     |        backed up by: copying the files
     |
     +---- KEYS & SECRETS  ... signing keys, certificates, credentials
              backed up by: capturing them, guarded most tightly of all

   Miss any one row and the "restore" is incomplete.

The practical lesson is that a backup is only as complete as your map of where the data lives. A new service has not been properly taken on until someone has written down what it stores, where, and how each part is captured, one more reason the documentation discipline from the previous lesson matters.
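The written map can be kept in a form a machine can check. The sketch below is one hedged way to do it: the part names, the `missing_parts` helper, and the sample entry are all hypothetical, and a service that genuinely has no database or no volume would record that explicitly (for example "none") and not leave the field blank.

```python
# The three places a service keeps data, as named in this lesson.
REQUIRED_PARTS = ("database", "data_volume", "keys_and_secrets")


def missing_parts(inventory):
    """Return the parts of a service's data map that nobody has written down."""
    return [part for part in REQUIRED_PARTS if not inventory.get(part, "").strip()]


# Hypothetical entry for a records service; every value is illustrative.
records_service = {
    "database": "nightly dump of the records database",
    "data_volume": "nightly copy of the scanned-documents volume",
    "keys_and_secrets": "",  # not yet written down
}

gaps = missing_parts(records_service)
print(gaps)  # ['keys_and_secrets']
```

A service with any gap in this list has not yet been properly taken on, whatever its backup job reports.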

The backup strategy: 3-2-1 at the system level

The same 3-2-1 rule that CIS 201 teaches a national for their personal files governs the state's systems, only here it is run deliberately, on a schedule, by whoever is appointed to it. The rule is three numbers, each closing off a different way of losing everything.

Three copies. Keep at least three copies of the important data: the live one the service uses, and two backups. One copy is no copy, because the moment it fails you have nothing. Two copies are better, but if both live in the same place they can be lost together. Three copies, sensibly spread, mean that no single event leaves you empty-handed.

Two kinds of media. Hold the copies on two different kinds of storage, so that a fault peculiar to one kind cannot take all your copies at once. In a small self-hosted estate this is less about tape versus disk and more about not having every copy on the same disk, the same server, or the same storage account. If the live database and its only backup sit on the same failing disk, you have one copy with extra steps.

One off-site or offline. At least one copy must be somewhere the live system cannot reach. Off-site means physically elsewhere, so a fire, theft, or a single provider's failure cannot take both the live system and the backup. Offline means not connected to the live system, so that ransomware or an attacker who reaches the live server cannot also reach the backup and encrypt or delete it. For a small force this off-site or offline copy is the one that saves you in the worst case, the targeted attack or the total loss of a server, and it is the one most often skimped on, because it is the least convenient. Do not skimp on it.

   3-2-1 AT THE SYSTEM LEVEL

   [ LIVE ]            the running service's own data         (copy 1)
      |  scheduled backup
      v
   [ BACKUP A ]        on the same estate, different media     (copy 2)
      |  copied away
      v
   [ BACKUP B ]        OFF-SITE or OFFLINE                     (copy 3)
                       a fire, theft, or ransomware that hits
                       the live system cannot reach this copy

   3 copies   .  2 kinds of media   .  1 off-site / offline
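The three numbers can be checked mechanically against a list of where the copies are held. This is a sketch under assumptions: the `BackupCopy` fields, the media labels, and the copy names are invented for illustration and do not describe any real estate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # illustrative label, e.g. "server-disk" or "object-store"
    offsite: bool   # physically somewhere other than the live system
    offline: bool   # not reachable from the live system


def check_321(copies):
    """Return a list of 3-2-1 violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 kinds of media")
    if not any(c.offsite or c.offline for c in copies):
        problems.append("no off-site or offline copy")
    return problems


# Hypothetical layout: live data, a local backup, and an off-site copy.
good = [
    BackupCopy("live", "server-disk", offsite=False, offline=False),
    BackupCopy("backup-a", "separate-disk", offsite=False, offline=False),
    BackupCopy("backup-b", "object-store", offsite=True, offline=False),
]
# The "one copy with extra steps" case: live data and its only backup together.
bad = [
    BackupCopy("live", "server-disk", offsite=False, offline=False),
    BackupCopy("backup-a", "server-disk", offsite=False, offline=False),
]

print(check_321(good))  # []
print(check_321(bad))
```

A check like this only confirms that the layout is declared correctly; it cannot confirm that the copies exist or can be restored.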

Around these three numbers sit a few decisions. How often to back up sets your recovery point, covered in the next section: a database that changes constantly may be dumped several times a day; a rarely-changing volume, nightly. Retention, how many old backups you keep and for how long, matters because the damage you are recovering from is not always today's. Ransomware and quiet corruption can sit undetected for days, so if you keep only the most recent backup you may find it already holds the damage. Keeping a run of older copies, daily for a week and then weekly for a while, lets you reach back to a known-good point before the trouble began. And backups must be automated, because a backup that depends on a person remembering to run it will eventually not be run, and you will discover the gap at the worst moment.

The cardinal rule: test the restore

Here is the rule that sits above all the others, and the one most often broken: an untested backup is not a backup. A backup job that runs every night and reports success proves only that a file was written. It does not prove that the file is complete, that it is not corrupt, that it holds what you think it holds, or that you actually know how to put it back. Plenty of organisations have discovered, in the hour of a real disaster, that their faithful nightly backups could not be restored: the dump was truncated, a volume was never included, the restore procedure existed only in someone's head and that someone had left, or nobody had ever actually tried.

The only thing that turns a hope into a backup is a restore drill: deliberately taking a backup and restoring it, ideally onto a separate test system, and confirming that the service comes up and the data is there and correct. A restore drill is the single most valuable thing a small force can do for its continuity, and it costs only discipline. It proves the backup is good, it proves the restore procedure is written down and works, and it trains the people who would have to do it for real, so that on the bad day they are following a rehearsed drill and not improvising under pressure.

   AN UNTESTED BACKUP IS A HOPE, NOT A BACKUP

   Backup job runs ............... "success" (a file was written)
        |
        |   <-- most people stop here. This proves almost nothing.
        v
   RESTORE DRILL
        restore the backup onto a test system
        bring the service up from it
        check the data is complete and correct
        time how long it took (your real recovery time)
        |
        v
   Now it is a backup.  And you know your restore works,
   your procedure is written, and your people have done it.

Restore drills should be scheduled, not left to goodwill, and the date and result recorded, just like a patch or a change. Test on a sensible cadence, test after any significant change to a service, and treat a failed drill as the gift it is: a problem found in calm, not in crisis.
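The drill described above can be wrapped in a small harness that times the restore, runs the checks, and produces the record. This is a sketch, not a prescribed tool: `run_restore_drill`, `DrillRecord`, and the stand-in restore function are hypothetical, and a real drill would restore onto a test system instead of returning a dictionary.

```python
import time
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict


@dataclass
class DrillRecord:
    service: str
    drill_date: date
    passed: bool
    minutes_taken: float
    notes: str


def run_restore_drill(service: str,
                      restore: Callable[[], Any],
                      checks: Dict[str, Callable[[Any], bool]]) -> DrillRecord:
    """Time a restore, run each named check on the result, and record the outcome."""
    started = time.monotonic()
    try:
        restored = restore()
        failed = [name for name, check in checks.items() if not check(restored)]
        notes = "all checks passed" if not failed else "failed checks: " + ", ".join(failed)
    except Exception as exc:
        failed = ["restore"]
        notes = f"restore raised {type(exc).__name__}: {exc}"
    minutes = (time.monotonic() - started) / 60
    return DrillRecord(service, date.today(), not failed, minutes, notes)


# Stand-in for a real restore: the document volume has silently gone missing.
def fake_restore():
    return {"records": ["r1", "r2"], "documents": []}


checks = {
    "records present": lambda r: len(r["records"]) > 0,
    "documents present": lambda r: len(r["documents"]) > 0,
}

record = run_restore_drill("records-service", fake_restore, checks)
print(record.passed, "-", record.notes)  # False - failed checks: documents present
```

The failed result here is the useful one: the nightly job would have gone on reporting success, and only the check on restored data shows that a volume is missing.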

Recovery point and recovery time, in plain terms

Two plain numbers describe how bad a loss would be, and every backup decision is really a trade against them. They sound technical but they are simple, and a member who can reason about them in ordinary language is already useful.

Recovery point answers: how much recent data could we lose? It is the gap between now and the last good backup. If a register is backed up once a day at midnight and it fails at four in the afternoon, everything entered since midnight, sixteen hours of work, is at risk, because the last copy you can restore is from midnight. Back it up every hour and the most you can lose is an hour. The recovery point is set by how often you back up: more frequent backups mean a smaller window of possible loss. The right number depends on the data. For accounts and registers, where every entry is a real action by a real national, you want a small window. For something easily reconstructed, a larger window is fine.

Recovery time answers: how long until the service is working again? It runs from the moment of failure to the moment the service is back and usable. It is set not by how often you back up but by how fast and how rehearsed your restore is: how quickly you can get a server, fetch the off-site copy, reload the database, restore the volumes, bring the service up, and check it. A backup you can restore in twenty minutes and one you can restore in two days protect the same data but give very different recovery times. The way to shorten recovery time is to rehearse the restore, write the procedure down, and keep the off-site copy reachable rather than buried.

   RECOVERY POINT vs RECOVERY TIME  (a timeline)

   ... last good backup            FAILURE            service back ...
        |                             |                     |
        |<------ RECOVERY POINT ----->|<--- RECOVERY TIME -->|
        |   data created in this      |   time spent         |
        |   window may be lost        |   restoring          |
        |   (set by HOW OFTEN         |   (set by HOW FAST /  |
        |    you back up)             |    HOW REHEARSED      |
        |                             |    the restore is)    |
        v                             v                     v
   ---[backup]----------------------[X]------------------[ up ]--->  time

   Smaller recovery point  <-  back up more often.
   Smaller recovery time   <-  rehearse and document the restore.

The two numbers are not free. A smaller recovery point costs more frequent backups and more storage; a smaller recovery time costs faster infrastructure and rehearsal. So you set them per service according to what it holds. The identity service and the core registers deserve small numbers on both, because the state cannot function while they are down or after they have lost data. A low-traffic informational site can tolerate larger numbers. Deciding these targets in advance, rather than discovering them in a crisis, is what turns backups from a chore into a plan. These targets are part of the continuity planning taken up in HCR 220 and exercised in CIS 310.
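Both numbers are simple subtractions on a timeline, which the sketch below works through for the midnight-backup example in this section. The calendar date and the forty-minute restore are assumptions made only so the arithmetic can run.

```python
from datetime import datetime, timedelta


def recovery_point(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data created in this window may be lost."""
    return failure - last_good_backup


def recovery_time(failure: datetime, service_back: datetime) -> timedelta:
    """Time during which the service was down."""
    return service_back - failure


# Register backed up at midnight, fails at four in the afternoon.
last_backup = datetime(2024, 1, 1, 0, 0)     # date is arbitrary
failure = datetime(2024, 1, 1, 16, 0)
service_back = datetime(2024, 1, 1, 16, 40)  # assumed: a rehearsed 40-minute restore

lost = recovery_point(last_backup, failure)
down = recovery_time(failure, service_back)
print(lost)  # 16:00:00
print(down)  # 0:40:00
```

Backing up hourly would cap the first figure at one hour; only a faster, better-rehearsed restore changes the second.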

Storage hygiene

Backups live on storage, and storage that is neglected will quietly betray you, so a small amount of routine care keeps the whole scheme honest and affordable.

The first habit is watching the space. A disk that fills up does not fail politely: a backup job that runs out of room may write a truncated, useless file and still report success, and a live service whose disk fills can stop accepting writes or fall over entirely. Monitoring disk space, with an alert before it gets tight, is part of the monitoring discipline from the previous lesson and is especially important on whatever holds the backups, because backups only grow.
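A space check of this kind needs nothing beyond Python's standard library. The sketch below uses `shutil.disk_usage`; the 80 per cent threshold and the path checked are assumptions, and a real estate would send the alert through whatever monitoring it already runs.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Share of the disk in use, as a percentage."""
    return 100.0 * used_bytes / total_bytes


def space_alert(path: str, warn_at_percent: float = 80.0):
    """Return (alert, percent_used) for the disk that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return pct >= warn_at_percent, pct


# "." stands in for wherever the backups are actually stored.
alert, pct = space_alert(".")
print(f"{pct:.1f}% used, alert={alert}")
```

Warning well before the disk is full matters because the failure it prevents, a truncated backup, is silent.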

The second is managing retention so storage stays sane. Keeping every backup forever is neither necessary nor affordable, and a pile of undated, unmanaged backup files is its own hazard, hard to search and easy to misjudge. A sensible retention scheme, keeping recent copies densely and older ones sparsely and deleting the rest automatically, keeps storage bounded while still letting you reach back past slow-burning corruption.
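A dense-then-sparse scheme can be stated as a small rule. The sketch below is one possible reading of it, assuming one backup per day identified by its date: every backup from the last seven days is kept, then the earliest backup in each calendar week for eight weeks, and anything else is a candidate for deletion. The function name and both defaults are hypothetical.

```python
from datetime import date, timedelta


def select_backups_to_keep(backup_dates, today, daily_days=7, weekly_weeks=8):
    """Return the set of backup dates to keep; everything else may be deleted."""
    keep = set()
    weekly_pick = {}
    for d in sorted(backup_dates):
        age = (today - d).days
        if age < 0:
            keep.add(d)  # dated in the future: never delete on a clock error
        elif age < daily_days:
            keep.add(d)  # recent copies kept densely
        elif age < weekly_weeks * 7:
            week = tuple(d.isocalendar()[:2])  # (ISO year, ISO week number)
            weekly_pick.setdefault(week, d)    # earliest backup in that week
    keep.update(weekly_pick.values())
    return keep


# Thirty nightly backups ending on an arbitrary date.
today = date(2024, 3, 31)
backups = [today - timedelta(days=n) for n in range(30)]

kept = select_backups_to_keep(backups, today)
to_delete = sorted(set(backups) - kept)
print(len(kept), "kept,", len(to_delete), "to delete")  # 11 kept, 19 to delete
```

Whatever rule is chosen, the deletion should run automatically and be covered by the restore drill, so that the older copies relied on for reaching back are known to exist.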

The third is keeping the storage healthy and the copies separate. Disks wear out and fail; that is expected, not exceptional, which is the whole reason for more than one copy on more than one kind of media. Where redundancy exists it must actually be checked, not assumed, because a redundant set quietly running on one surviving disk is a single failure away from loss. And the live data and its backups must not share a fate: if a database and its only backup sit on the same disk, you do not have a backup, you have two ways to read the same file until it dies.

The fourth is protecting the backups themselves. Backups hold the same nationals' data as the live systems, often all of it in one convenient bundle, so they are a target. They should be access-controlled like any sensitive data, and where they leave the estate they should be encrypted, so that an off-site copy that falls into the wrong hands is unreadable. The off-site or offline copy that protects you against ransomware only does its job if the attacker cannot reach it: a backup sitting on the same network with the same credentials as the live system can be encrypted right alongside it. Separation is what makes the third copy worth having.

The link to continuity

For an ordinary organisation, backups are insurance. For a non-territorial Principality, they are closer to survival, because Kaharagia has no land, no buildings, no physical archive to fall back on. The state exists as services and data, an identity service, registers, records, the keys that bind them, so the continuity of the state is, to a real degree, the continuity of those systems and the recoverability of that data. If the registers were lost without a backup, the state would not have suffered an outage; it would have lost part of what makes it a state.

This is why backups are not an IT housekeeping task but a matter of statehood, and why this lesson sits inside the speciality whose whole reason for existing is that the Principality is digital. A force that patches and monitors well but cannot prove it could restore its registers has secured the front door and left the foundations untested. Continuity planning is the work of deciding, in advance and on paper, what must survive, how quickly each service must return, who does what when something fails, and where the off-site copies are and how to reach them, then rehearsing it so the plan is real. That planning belongs to HCR 220 · Emergency Preparedness and Civil Resilience, which treats resilience for the whole Principality. The response to and recovery from an actual cyber incident, the bad day when the backups are needed in earnest, belongs to CIS 310 · Cyber Incident Response and Continuity. This lesson gives both their foundation: there is no recovery without a tested backup, and there is no continuity for a digital state without recovery.

A last point of conduct. Backups and restores touch all of a service's data at once, in one place, which makes the person who holds them one of the most trusted people in the force. That access, like every access, follows appointment and not qualification: knowing how to run a restore does not entitle anyone to the backups of a system they are not appointed to. The keys to the state's continuity are held in trust, used only for their purpose, and their use recorded. That ethic is the subject of the lesson that follows.

In Practice: A nightly backup proves its worth

Corporal Anan is the systems assistant appointed to support a small records service for an Organ of State. When she took it on, she did what this lesson teaches before anything else: she wrote down where the service keeps its data. It had a database holding the records and a data volume holding scanned documents, so her backup had to cover both. She set an automated job to dump the database every night and copy the document volume alongside it, kept the copies on storage separate from the live disk, and arranged for one copy to be sent each night to an off-site store the live server could not reach. Three copies, two kinds of media, one off-site. She set retention to keep nightly copies for a fortnight and a weekly copy for two months, so a slow corruption could be reached back past.

Then she did the part most people skip. Once a month she restored the previous night's backup onto a spare test instance, brought the service up from it, opened a few records and checked a scanned document came back whole, and recorded the date, the result, and how long it took: about forty minutes, her real recovery time. On the third drill the restore failed: the document volume had quietly dropped out of the job after a configuration change. She found it in calm, fixed the job, and drilled again until it passed. Two months later a botched update corrupted the live database in the afternoon. There was no panic and no improvising. The most recent good backup was from the night before, so the recovery point was clear, less than a day, and the records officer was told exactly what little might need re-entering. Anan followed the restore she had rehearsed, and the service was back within the hour, in line with the forty minutes she had measured. The records were intact, and she recorded the incident for the team to learn from. The discipline that had felt like overcaution for months paid for itself in a single afternoon.

Check Your Understanding

State the 3-2-1 rule in your own words, and explain what the off-site or offline copy specifically protects against that a second copy on the same server does not.
A register is backed up once every night at midnight. It is lost at 3 p.m. and is restored and working again by 5 p.m. What is the recovery point and what is the recovery time, and which backup decision would you change to make each one smaller?
Why is a backup job that runs every night and reports success still not a real backup until something else is done, and what is that something?

Reflection (write a short paragraph): Kaharagia has no territory, only its systems and its data. Reflect on what it means that, for a digital state, a tested backup of the registers is closer to survival than to insurance, and how that changes the weight you would give to a restore drill that nobody is watching and that seems, on a quiet day, like a waste of an hour.

Summary

A digital state is its data; protecting and recovering that data is a matter of statehood, not housekeeping.
A service keeps its data in a database (accounts, entries, records), in data volumes (files and documents), and in keys and secrets; a complete backup must cover all three, so you must first know where the data lives.
Apply 3-2-1 at the system level: 3 copies, on 2 kinds of media, with 1 off-site or offline; automate it, and keep a run of older copies so you can reach back past slow corruption or ransomware.
The cardinal rule: an untested backup is not a backup. Schedule restore drills, restore onto a test system, confirm the data is complete, and record the result; this proves the backup, proves the procedure, and trains the people.
Recovery point is how much recent data you could lose (set by how often you back up); recovery time is how long until the service is back (set by how fast and how rehearsed the restore is). Set targets per service in advance.
Practise storage hygiene: watch disk space, manage retention, keep storage healthy and copies separate, and protect and encrypt the backups themselves.
Builds on Lesson 04 (Keeping Systems Running) and on the documentation and monitoring it teaches; assumes the personal 3-2-1 habit from CIS 201.
Feeds HCR 220 · Emergency Preparedness and Civil Resilience (continuity planning for the whole Principality) and CIS 310 · Cyber Incident Response and Continuity (recovery on the bad day); access to backups, like all access, follows appointment, the subject of Lesson 10.

Backups, Storage, and Continuity