Royal Army College

Lesson Overview

The earlier lessons of this course followed a single incident from first sign to clean restore: prepare; detect and analyse; contain, eradicate, and recover; and then the playbooks for the troubles a small force is most likely to meet. This lesson widens the view. An incident is one cause of disruption, but a service can also fall over because a host died, a provider went dark, the power failed, or a busy weekend simply overwhelmed it. Whatever the cause, a digital state has the same hard duty: the essential work must continue, and the systems must come back in a sensible order. That duty is the subject of this lesson.

Two disciplines answer it, and they are not the same. Business continuity keeps the essential functions of the Principality running while the systems are down or degraded, often by hand. Disaster recovery restores the systems and their data to working order. Continuity is about the function; recovery is about the technology. You plan both before the bad day, you measure them with two simple objectives, you decide in advance which services come back first, and, because Kaharagia is a digital state with no territory to fall back on, you keep a manual and off-grid floor underneath everything so the essential work never stops dead. This lesson teaches each of those in plain terms and shows how they fit together.

This is the knowledge layer. The hands-on work (running a restore from an offline backup and checking it, standing up a service from a clean rebuild, rehearsing the manual fallback in a tabletop exercise, timing a real recovery against its objective) is done and signed off in person where supervision allows. Learn here what the objectives mean, how to rank and restore services, and what the manual floor looks like, so that when you take part in a real recovery you already understand the plan you are following.

By the end you will be able to explain the difference between business continuity and disaster recovery, define recovery time objective and recovery point objective in plain terms and read them off a timeline, identify a force's critical services and put them in a sensible restoration order, explain why a digital state needs a manual and off-grid floor and describe what it contains, follow a continuity and recovery plan during a disruption, and say how this work ties into HCR 220 Emergency Preparedness and SIG 410 resilient communications.

Key Terms

Business continuity (BCP): keeping the essential functions of the organisation running through a disruption, by alternate or manual means if necessary, so the mission continues even while normal systems are unavailable.
Disaster recovery (DR): the work of restoring the affected systems, services, and data to normal working order after a disruption.
Disruption: any event that stops or degrades a service, whether a cyber incident, a hardware failure, a provider outage, a power or network loss, or simple overload.
Recovery time objective (RTO): the target for how quickly a service must be back in working order after it goes down, measured from the moment of failure.
Recovery point objective (RPO): the target for how much recent data the organisation can afford to lose, measured backwards from the moment of failure to the last good backup.
Critical service: a service the Principality cannot do its essential work without, ranked above services that can wait.
Restoration order: the planned sequence in which services are brought back, so that what others depend on is restored first.
Dependency: a service or component that another service needs in order to work; if identity is down, everything that signs in through it is down too.
Manual or offline floor: the paper forms, written procedures, and off-grid communications that let essential work continue by hand when the systems are unavailable.
Known-good backup: a clean, tested copy of data, ideally offline, from before any compromise, the only safe source to restore from.
Tabletop exercise: a rehearsal in which the team talks and walks through a disruption against the plan, to find the gaps before a real event does.

Continuity and recovery are two different jobs

The first thing to hold straight is that keeping the function running and restoring the system are separate tasks, often done at the same time by different people. Suppose the identity service that signs members into every online system goes down on a Saturday. Disaster recovery is the technical work: find the cause, rebuild or restore the service from a known-good backup, validate it, and bring it back. Business continuity is the parallel question of how the essential work carries on in the hours while that is happening. Can a duty officer be reached by a phone tree instead of the internal messenger? Can an urgent record be taken on a paper form and entered later? Can a member on a task be authorised by a known voice on a radio rather than a login? Continuity buys the time that recovery needs.

A small force gets this wrong in one of two ways. It pours everyone into the technical fix and lets the essential function simply stop, so the mission halts even though the people are busy. Or it improvises the function by hand with no plan, so the manual work is slow, inconsistent, and never reconciled back into the systems afterwards. The cure for both is to plan the two jobs together, before the bad day, and to name who runs which. This connects straight to Lesson 01: the same incident response plan that names who leads, who decides, who communicates, and who keeps the record should also name who holds continuity and who runs recovery, so that on the day nobody is inventing roles.

 DISRUPTION (incident, failure, outage, overload)
                    |
        +-----------+------------+
        |                        |
  BUSINESS CONTINUITY      DISASTER RECOVERY
  keep the FUNCTION         restore the SYSTEM
  running, by hand          to working order
  if needed                       |
        |                         |
  paper forms,              clean rebuild or
  phone tree,               restore from a
  off-grid comms,           known-good backup,
  manual procedures         validate, monitor
        |                         |
        +-----------+------------+
                    |
        NORMAL SERVICE RESUMED
        manual records reconciled in

RTO and RPO: the two objectives, in plain terms

You cannot plan a recovery without knowing two numbers for each service: how fast it must be back, and how much recent data you can afford to lose. These are the recovery time objective and the recovery point objective, and once you picture them on a timeline they are simple.

The recovery time objective (RTO) is a target measured forwards from the moment a service fails. It answers, "how long can we be without this before the harm is unacceptable?" A four hour RTO means the service should be back within four hours of going down. It drives the technical plan: a service with a short RTO needs a faster path back, perhaps a standby copy ready to take over, while a service with a long RTO can be rebuilt at a steadier pace.

The recovery point objective (RPO) is a target measured backwards from the moment of failure to the last good backup you can restore. It answers, "how much recent work can we afford to lose?" A one hour RPO means backups must be recent enough that, at worst, you lose one hour of changes; it drives how often you back up. If a register is backed up once a day but its RPO is one hour, the plan does not match the duty, and that gap must be closed.
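The mismatch described above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are invented for this lesson, and it assumes the worst case is a failure just before the next backup is due, so that up to one full backup interval of changes is lost. It ignores the time a backup takes to run.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: a failure just before the next backup loses
    up to one whole interval of changes."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule can honour the recovery point objective."""
    return worst_case_data_loss(backup_interval) <= rpo


# The register from the text: backed up once a day, RPO of one hour.
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))     # False
# A hypothetical half-hourly schedule would close the gap.
print(meets_rpo(timedelta(minutes=30), timedelta(hours=1)))   # True
```

If the check fails, either the backups become more frequent or the people who depend on the service agree a longer RPO; the plan and the duty must be made to match.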

The two are independent. A service can have a short RTO and a long RPO, or the reverse. The national identity service might need to be back fast (short RTO) and lose almost no recent account changes (short RPO). An archive of old gazette issues might tolerate being down for a day (long RTO) yet still must lose nothing (short RPO: every record matters, and because the data rarely changes, infrequent backups are enough to meet it). Set both, per service, and be honest: shorter objectives cost more in standby systems and frequent backups, so reserve the tightest numbers for the truly critical.

        <----- RPO ----->|<----- RTO ----->
                         |
  last good      ........FAILURE........    service
  backup taken            (now)             back up
     |                     |                   |
  ---*---------------------X-------------------*--->  time
     |                     |                   |
   data you           service goes        target to
   may lose:          down here           be restored
   everything                             by here:
   changed                                 the RTO
   since the
   last backup:
   the RPO

  RPO = how much recent DATA you can afford to lose
        (look BACKWARD to the last good backup)
  RTO = how quickly the SERVICE must be BACK
        (look FORWARD to the target restore)

A useful habit is to read these off the timeline whenever a new service is brought into use. Ask the people who depend on it two questions: if this stopped, how long before it really hurts (that sets the RTO), and if we had to roll back to the last backup, how much lost work could we tolerate (that sets the RPO). Write the answers down. They become the test you measure every real recovery against, and the standard a tabletop exercise checks.
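Measuring a real or rehearsed recovery against its written objectives is the same timeline turned into subtraction. The sketch below is a minimal illustration, not a prescribed tool: the class and function names are invented, and the times in the example are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # how quickly the service must be back
    rpo: timedelta  # how much recent data may be lost


@dataclass
class RecoveryResult:
    downtime: timedelta          # failure -> service restored (look forward)
    data_loss_window: timedelta  # last good backup -> failure (look backward)
    met_rto: bool
    met_rpo: bool


def assess_recovery(last_good_backup: datetime, failure: datetime,
                    restored: datetime, objectives: Objectives) -> RecoveryResult:
    """Compare one recovery against the service's RTO and RPO."""
    if not (last_good_backup <= failure <= restored):
        raise ValueError("expected last_good_backup <= failure <= restored")
    downtime = restored - failure
    data_loss_window = failure - last_good_backup
    return RecoveryResult(
        downtime=downtime,
        data_loss_window=data_loss_window,
        met_rto=downtime <= objectives.rto,
        met_rpo=data_loss_window <= objectives.rpo,
    )


# Illustrative times only: a 2 hour RTO and a 15 minute RPO.
result = assess_recovery(
    last_good_backup=datetime(2024, 1, 6, 13, 50),
    failure=datetime(2024, 1, 6, 14, 0),
    restored=datetime(2024, 1, 6, 15, 55),
    objectives=Objectives(rto=timedelta(hours=2), rpo=timedelta(minutes=15)),
)
print(result.downtime, result.met_rto)          # 1:55:00 True
print(result.data_loss_window, result.met_rpo)  # 0:10:00 True
```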

Critical services and the order to restore them

When several services are down at once, you cannot bring them all back together, and trying to do so scatters effort. You restore in an order, and the order is decided in advance, not argued about during the event. Two ideas set it: importance and dependency.

Importance asks which functions the Principality genuinely cannot do its essential work without. For a small digital force these usually include identity and access (because almost everything signs in through it), communications (so the team can coordinate and reach members), and the core records and registers that the essential duties run on. Below those sit services that matter but can wait hours or a day: a public website, a newsletter, an archive, a reporting dashboard.

Dependency then reorders the important list by what relies on what. If every other system signs in through the single identity service, then identity is restored first, not because it is the most visible but because nothing else fully works until it is up. Communications come early too, because the team needs to coordinate the rest of the recovery and reach the people affected. Records and registers follow, then the lower-priority services in turn. Drawing the dependencies once, in calm time, turns a panicked scramble into a checklist.
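Drawing the dependencies once can be as simple as a small table of "what needs what", from which the order falls out mechanically. The sketch below is one way to do it, under stated assumptions: the service names, the dependency map, and the importance ranks are all hypothetical, and the ordering rule (always restore the most important service whose dependencies are already up) is one reasonable reading of the text, not the only one. It uses Python's standard `graphlib` module.

```python
import heapq
from graphlib import TopologicalSorter


def restoration_order(depends_on: dict[str, set[str]],
                      importance: dict[str, int]) -> list[str]:
    """Order services so dependencies come first; among services that are
    ready at the same time, the lower importance rank (1 = most important) wins."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    ready: list[tuple[int, str]] = []

    def collect_ready() -> None:
        for service in sorter.get_ready():
            heapq.heappush(ready, (importance[service], service))

    collect_ready()
    order: list[str] = []
    while ready:
        _, service = heapq.heappop(ready)
        order.append(service)
        sorter.done(service)  # may make dependent services ready
        collect_ready()
    return order


# Hypothetical dependency map: each service lists what it needs first.
depends_on = {
    "identity": set(),
    "comms": {"identity"},
    "records": {"identity"},
    "keys": {"identity"},
    "website": set(),
    "newsletter": {"website"},
}
# Hypothetical importance ranks, 1 = most important.
importance = {"identity": 1, "comms": 2, "records": 3,
              "keys": 4, "website": 5, "newsletter": 6}

print(restoration_order(depends_on, importance))
# ['identity', 'comms', 'records', 'keys', 'website', 'newsletter']
```

Note that dependency overrides importance: if communications were ranked above identity but still signed in through it, identity would still be restored first.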

The table below shows the shape of a restoration plan for a generic small digital force. The exact services and numbers are illustrative; each force sets its own from the two questions in the section above. Note the last column: every critical service has a manual fallback that holds the function while the system is being restored, which is the subject of the next section.

 RESTORATION ORDER (illustrative; each force sets its own)

 #  Service            RTO    RPO    Why this order        Manual fallback while down
 -- ------------------ ------ ------ --------------------- ----------------------------
 1  Identity / sign-on 2 hr   15 min everything signs in   known-voice auth on a radio,
                                     through it: restore   pre-issued duty roster on paper
                                     first
 2  Team comms         2 hr   1 hr   needed to run the     phone tree, off-grid radio
                                     recovery and reach    net, agreed call schedule
                                     people
 3  Core records /     4 hr   1 hr   essential duties run  paper intake forms, entered
    registers                        on them               and reconciled later
 4  Certificates /     4 hr   1 hr   revoke/re-issue keys  manual revocation list held
    key service                      for access            by the duty NCO
 5  Public website     24 hr  24 hr  important but not     a holding notice; non-urgent
                                     time-critical
 6  Newsletter /       48 hr  24 hr  can wait               none needed
    archive

Read the table top to bottom and you have the plan: restore identity, then comms, then records, then the key service, and only then the services that can wait, holding each critical function by its manual fallback in the meantime. Rehearse it as a tabletop exercise and the gaps (a missing dependency, an RPO the backup schedule does not meet, a fallback nobody has the paper for) show themselves before a real disruption finds them for you.

The manual and off-grid floor

Here is what makes continuity different for a digital state. A force with territory and stores can fall back on physical things when the network dies. A non-territorial, digitally organised Principality cannot, so its continuity rests on a deliberate manual and off-grid floor: the paper forms, written procedures, and off-grid communications that let essential work continue by hand when the systems are unavailable. This floor is not a relic. It is engineered, kept current, and rehearsed, and it is the single most important continuity measure a digital state has.

Three parts make it up.

Paper fallback. For each essential function that normally runs through a system, keep a simple paper procedure and form that captures the same information by hand. A record that is normally entered online has a printed intake form held by the duty member. An authorisation that is normally a login has a written procedure for confirming identity by a known channel. The forms are printed and stored in advance, because a printer may depend on the very systems that are down. The rule that makes paper fallback work is reconciliation: every manual entry is dated and signed, and when the systems return it is entered and checked against the system, so nothing is lost and nothing is double-counted. Note the order on the timeline: do the work on paper while down, then reconcile into the system once it is back.
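The reconciliation rule can be pictured as a short routine. The sketch below is illustrative only and assumes something the lesson does not specify: that each printed form carries a unique serial number (`form_number`, a hypothetical field), which is what lets the check catch a form entered twice. All names are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PaperEntry:
    form_number: str            # hypothetical pre-printed serial on each form
    entered_on: Optional[date]  # the date written on the form
    signed_by: str              # who took the entry
    details: str


def reconcile(paper_entries: list[PaperEntry],
              system_records: dict[str, PaperEntry]):
    """Enter paper forms into the system once it is back.

    Returns (added, already_present, rejected):
      added           - entered now, so nothing is lost
      already_present - skipped, so nothing is double-counted
      rejected        - undated or unsigned, returned for correction
    """
    added, already_present, rejected = [], [], []
    for entry in paper_entries:
        if entry.entered_on is None or not entry.signed_by:
            rejected.append(entry)
        elif entry.form_number in system_records:
            already_present.append(entry)
        else:
            system_records[entry.form_number] = entry
            added.append(entry)
    return added, already_present, rejected


# Illustrative forms from one outage.
forms = [
    PaperEntry("F-001", date(2024, 1, 6), "Duty NCO", "urgent record"),
    PaperEntry("F-001", date(2024, 1, 6), "Duty NCO", "urgent record"),  # same form twice
    PaperEntry("F-002", None, "", "unsigned note"),
]
system: dict[str, PaperEntry] = {}
added, duplicates, rejected = reconcile(forms, system)
print(len(added), len(duplicates), len(rejected))  # 1 1 1
```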

Off-grid communications. When the internal messenger and the single sign-on are down, the team still has to coordinate and reach members. That means communications that do not depend on the systems being recovered: a phone tree that does not need the internal directory, an agreed radio net and call schedule, and a known meeting point or check-in time. This is exactly the ground SIG 410 covers, resilient and off-grid communications for a small force, and continuity planning should draw the off-grid comms plan straight from it rather than invent a separate one. The plan is useless if it lives only in the system it is meant to survive, so it is printed and carried.

Manual decision and authority. With identity down, normal access control is down, and the temptation is either to freeze entirely or to wave everyone through. Neither is safe. The continuity plan names a manual authority: who can authorise essential action by a known voice or in person, what they may authorise, and how it is recorded so it can be reviewed afterwards. This keeps the "access follows appointment" rule alive even off the systems: the manual authority rests with the appointment, not with whoever happens to be holding a radio.

A digital state that has these three, kept current and rehearsed, can lose its systems for hours and still do its essential work, slowly and by hand, without losing data or control. A digital state that has not built the floor discovers on the bad day that when the screens go dark, the whole organisation goes dark with them. Building and rehearsing the floor is continuity's real product.

How this fits the wider resilience picture

Continuity and disaster recovery do not sit only inside the cyber speciality. They are one face of the Principality's broader resilience, and they connect outward to two courses in particular. HCR 220 Emergency Preparedness and Civil Resilience owns continuity for the force as a whole, how essential functions carry on through any emergency, of which a systems disruption is one kind; CIS 310 supplies the digital part of that picture and should plan inside the HCR 220 framework, not beside it. SIG 410 Communications Planning for Small Forces owns the resilient and off-grid communications that the manual floor depends on; the continuity plan borrows its off-grid comms plan rather than writing a separate one. The thread running through all three is the same: decide in advance, keep a manual floor, and rehearse, so that when a disruption comes the response is a plan being followed, not a plan being invented.

In Practice: a Saturday outage and the manual floor

On a Saturday afternoon the single sign-on service that signs every member into the Principality's online systems stops responding. Within minutes nobody can log in to the internal messenger, the records system, or the certificate service. There is no sign of attack; a host has failed. Corporal Vane is the duty NCO and opens the continuity and recovery plan, which the team rehearsed in a tabletop exercise the month before.

She runs the two jobs in parallel, as the plan names them. For recovery, a systems assistant begins rebuilding the identity service from the most recent known-good backup, working to its two hour RTO; the backup is fifteen minutes old, just within the fifteen minute RPO, so at most fifteen minutes of account changes will be lost. For continuity, Vane stands up the manual floor while that work proceeds. She reaches the on-call members not through the internal messenger, which is down, but through the printed phone tree and the agreed radio net from the SIG 410 off-grid plan. An urgent record that would normally be entered online is taken on the printed intake form held in the duty folder, dated and signed for reconciliation later. A member on a task who needs authorisation, which would normally be a login, is confirmed by Vane's known voice on the radio under the plan's manual authority, and she writes the authorisation in the duty log so it can be reviewed.

The identity service is restored and validated in just under two hours, inside its RTO. Vane does not declare the event over there. The plan's last step is reconciliation: the paper intake form is entered into the records system and checked, the radio authorisation is recorded against the member's task, and the manual decision log is filed for the after-action review. Because the floor was built, printed, and rehearsed before the bad day, the essential work never stopped, no data was lost, and authority stayed with the appointment throughout. Vane notes one gap for the review: the off-grid call schedule had one stale number, which she corrects on the spot.

Check Your Understanding

In one or two sentences each, explain the difference between business continuity and disaster recovery, and give an example of each from a service outage.
A records service is backed up once every twenty-four hours, but the people who depend on it say they can afford to lose at most one hour of recent entries. State which objective this is about (RTO or RPO), say whether the current plan meets it, and explain what must change.
Several services are down at once: a public website, the single sign-on identity service, and team communications. Put them in a sensible restoration order and justify the order using importance and dependency.

Reflection (write a short paragraph): Think about the essential work you or your section would still need to do if the Principality's online systems were unavailable for a full day. What would have to happen by hand, what paper forms or off-grid communications would you need ready in advance, and where would the manual floor be weakest today?

Summary

A disruption can come from an incident, a hardware failure, a provider outage, a power or network loss, or overload. Whatever the cause, the duty is the same: keep the essential work running and bring the systems back in order.
Business continuity keeps the function running, by hand if needed; disaster recovery restores the system. They are separate jobs, often run in parallel, and the plan should name who holds each.
The recovery time objective (RTO) is how quickly a service must be back, measured forward from failure; the recovery point objective (RPO) is how much recent data you can afford to lose, measured back to the last good backup. Set both, per service, and be honest about cost.
Restore in a planned order set by importance and dependency: usually identity first because everything signs in through it, then communications, then core records and the key service, then the services that can wait.
A digital state needs a manual and off-grid floor: printed paper fallback with reconciliation, off-grid communications, and a named manual authority that keeps "access follows appointment" alive when the systems are down. Build it, print it, and rehearse it.
This builds on Lesson 01 (the plan and named roles), Lesson 03 (restoring from known-good backups), and Lesson 04 (the playbooks), and ties outward to HCR 220 Emergency Preparedness and Civil Resilience for force-wide continuity and to SIG 410 Communications Planning for Small Forces for the resilient and off-grid communications the manual floor depends on. It draws too on SIG 220 and PME 210 for disciplined communications and records.

Continuity and Disaster Recovery