Royal Army College

Lesson Overview

This is the lesson where the response does its real work. In Lesson 02 the team detected and analysed an incident: it recognised the indicators, triaged and scoped what was affected, built a timeline, and preserved the evidence. Now it must act. The action divides cleanly into three phases that always run in this order: contain the incident so it spreads no further, eradicate the threat so the attacker has no way back in, and recover the service so the Principality's work goes on. This is the heart of NIST SP 800-61's "Containment, Eradication, and Recovery" phase, and it is where calm method makes the difference between a small bad day and a large one.

The skill this lesson asks for is judgement under pressure. The instinct, when a host is compromised, is to pull every plug at once; sometimes that is right, and sometimes pulling a plug destroys the very evidence that tells you how the attacker got in, so you patch the wrong hole and they walk back through the same door. The lesson teaches the three phases in turn, the order they must run in, and the standing rule of "rebuild rather than clean when in doubt." It teaches the one trade-off that recurs throughout: when to contain fast and when to preserve evidence first. And it holds firmly to the speciality's posture, which is defensive and lawful at every step: the aim is to limit harm, remove the threat, restore service, protect nationals' data, and confirm the hole is closed. It is never to pursue, attack, or "hack back" at whoever caused the incident.

This is the knowledge layer. It gives you the model, the order, and the judgement so that when an incident is called you understand what each step is for and why it comes when it does. The hands-on practice (isolating a host, disabling an account, rotating a credential or a per-user certificate, running a restore from a clean backup and validating it) is done and signed off in person, on systems you are appointed to, under supervision. Reading this lesson does not grant you access to act; access follows appointment, not qualification.

By the end you will be able to:

Define containment, eradication, and recovery and explain why they run in that order.
Describe the actions that make up each phase.
Weigh the standing trade-off between containing fast and preserving evidence.
Apply the rule of rebuilding rather than cleaning when in doubt.
Explain how known-good backups, credential and key rotation, and patching close an incident properly.
Describe how a recovered service is validated and monitored before and after it returns.
Keep the whole response strictly defensive and lawful.

Key Terms

Containment: the phase that stops an incident spreading and limits the harm it can still do, while the threat is still present, without yet trying to remove it.
Eradication: the phase that removes the threat and the attacker's access from the affected systems, so the same incident cannot simply resume.
Recovery: the phase that restores affected systems and data to safe, normal operation and returns the service to use, under close watch.
Isolation: cutting an affected host off from the network (or putting it on a quarantine segment) so it cannot reach, or be reached by, other systems.
Indicator of compromise (IOC): an observable sign tied to an incident, such as a bad file hash, a malicious domain or address, or a suspect account, which can be blocked or watched for elsewhere.
Credential and key rotation: changing the passwords, tokens, session keys, and certificates that an incident may have exposed, so that what the attacker captured no longer works.
Known-good backup: a backup taken before the compromise and held safely (offline or off-site), so that restoring from it does not restore the threat along with the data. A clean backup.
Rebuild over clean: the rule that, where there is real doubt that a compromised system is fully cleaned, you rebuild it from a trusted source rather than try to remove the threat in place.
Integrity validation: checking, after a restore or rebuild, that the data and the system are correct, complete, and free of the threat before returning them to service.
Patching the hole: fixing the specific weakness the attacker used, by applying an update, closing a misconfiguration, or removing the exposure, so the same route is shut.
Enhanced monitoring: watching a recovered system more closely than normal for a period after it returns to service, to catch any sign that the threat survived or returns.

Where this phase sits

Hold the whole lifecycle in mind first, because containment, eradication, and recovery only make sense as the middle of a longer story. The recognised incident lifecycle from NIST SP 800-61 runs in four phases: Preparation, then Detection and Analysis, then Containment, Eradication, and Recovery, then Post-Incident Activity. Lesson 01 built the preparation: the plan, the roles, the clean offline backups, the rehearsals. Lesson 02 did detection and analysis. This lesson is the third phase, the action. Lesson 10 will close the loop with the learning.

   THE INCIDENT LIFECYCLE (NIST SP 800-61)

   [ PREPARATION ] -> [ DETECTION & ANALYSIS ] -> [ CONTAIN, ERADICATE, RECOVER ] -> [ POST-INCIDENT ]
      Lesson 01            Lesson 02                  THIS LESSON (03)                   Lesson 10
   plan, roles,        recognise, triage,        +-------------------------------+    blameless review,
   clean backups,      scope, timeline,          |  CONTAIN -> ERADICATE ->      |    update controls,
   rehearsals          preserve evidence         |             RECOVER           |    feed back to
                                                 +-------------------------------+    Govern/Identify/Protect

One thing to carry from that picture: the three phases in this lesson are not three separate decisions made in isolation. They flow from the analysis done before them and they feed the learning done after them. You contain based on what the analysis told you was affected. You eradicate based on what it told you the attacker did and how they got in. You recover based on the clean backups that preparation made ready. And every action you take, you record with its time, because the post-incident review and any lawful report depend on that record.
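
The habit of recording every action with its time can be sketched in a few lines. This is a minimal illustration only, not the College's or the Principality's actual record system: the names IncidentRecord, ActionEntry, and the incident identifier are hypothetical, and a real record would live in the appointed incident log, not in a script.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

PHASES = ("contain", "eradicate", "recover")


@dataclass(frozen=True)
class ActionEntry:
    """One action taken during the response: when, who, what, and why."""
    at: datetime
    actor: str
    phase: str
    action: str
    reasoning: str


@dataclass
class IncidentRecord:
    """Hypothetical append-only record; entries are added, never edited."""
    incident_id: str
    entries: List[ActionEntry] = field(default_factory=list)

    def record(self, actor: str, phase: str, action: str,
               reasoning: str, at: Optional[datetime] = None) -> ActionEntry:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        # Default to the current time in UTC so that entries written on
        # different machines can be put in order without ambiguity.
        entry = ActionEntry(at or datetime.now(timezone.utc),
                            actor, phase, action, reasoning)
        self.entries.append(entry)
        return entry

    def timeline(self) -> List[ActionEntry]:
        """Entries in time order, as the post-incident review needs them."""
        return sorted(self.entries, key=lambda e: e.at)


if __name__ == "__main__":
    rec = IncidentRecord("INC-EXAMPLE")  # placeholder identifier
    rec.record("duty assistant", "contain",
               "disabled operator account, revoked sessions and tokens",
               "attacker active; nationals' records at risk")
    for e in rec.timeline():
        print(e.at.isoformat(), e.phase, e.action)
```

The design choice worth noting is that reasoning is a required field: the lesson asks for the decision and its reasoning to be recorded together, not the action alone.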

Containment: stop the spread first

The first job, once an incident is confirmed, is to stop it spreading. Containment does not try to remove the threat yet. It draws a line around the harm and holds it there, buying the team time to understand and eradicate without the problem growing under their feet. An uncontained incident is a fire still spreading; you do not begin sifting the ashes for the cause while the next room is catching.

Containment for the incidents a small force is likely to meet comes down to a few concrete moves:

Isolate the affected host from the network. Take the compromised machine off the network, or move it to a quarantine segment, so malware cannot spread to other systems and an attacker on it cannot reach further in. Isolation is usually preferable to powering the machine off, because powering off can destroy useful volatile evidence in memory; more on that trade-off below.
Disable compromised accounts. If an account is taken over, disable it, and revoke its active sessions and tokens, so the attacker is locked out even if they still hold the password. Disabling, rather than deleting, keeps the account available for analysis.
Block known-bad indicators. Take the indicators of compromise the analysis produced, the malicious domains, addresses, and file hashes, and block them across the estate, so the same threat cannot reach other hosts or call home from them.
Protect the still-healthy. Containment is also about what is not yet hit. Where one service is compromised, watch and, if needed, fence off the others it connects to, especially the identity service and anything holding nationals' records.

Containment buys time, but it is not free, and the cost is the heart of this lesson's judgement, set out in the next section. Whatever you do, do it with a light enough hand that you do not erase the trail: isolate rather than wipe, disable rather than delete, block rather than destroy. The aim is to stop the spread while avoiding destroying evidence prematurely, because that evidence is what lets you eradicate properly and, where the law requires, report honestly.
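
As an illustration of the light hand described above, the sketch below looks for files matching known-bad hashes and only reports them; it never deletes or alters anything. It assumes the analysis produced SHA-256 file hashes as indicators; the function names are hypothetical, and on a real host this kind of check would normally be run by the estate's own tooling, under appointment.

```python
import hashlib
from pathlib import Path
from typing import List, Set


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_known_bad(root: Path, bad_hashes: Set[str]) -> List[Path]:
    """Return files under root whose SHA-256 is a known-bad indicator.

    Read-only by design: matches are reported for the incident record,
    not removed, so the evidence stays intact for analysis.
    """
    wanted = {h.lower() for h in bad_hashes}
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and sha256_of(p) in wanted
    )
```

Reading files does update access times on some file systems, which is one more reason to image the host before examining it in place.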

The judgement: contain fast, or preserve evidence?

Here is the trade-off that recurs in almost every incident, and the single most important piece of judgement in this lesson. Containment and evidence preservation can pull against each other. The fastest way to contain a compromised host is to cut it off and shut it down at once; but a sudden shutdown loses the volatile evidence, the running processes, the memory, the live network connections, that often explains how the attacker got in and what they did. Preserve too eagerly and the incident spreads while you photograph it; contain too brutally and you save the building but lose the cause, and so cannot be sure you have shut the door.

There is no single right answer, only a way of deciding. Weigh the two costs against each other for this incident:

   THE CONTAIN-FAST vs PRESERVE-EVIDENCE BALANCE

   CONTAIN FAST (isolate / shut down now) when:
     - harm is spreading quickly (e.g. ransomware encrypting live)
     - nationals' data or the identity service is at active risk
     - the threat to people or service outweighs the lost detail
        => ACT. Stopping harm comes first.

   PRESERVE EVIDENCE FIRST (isolate gently, image before changing) when:
     - the spread is slow or already boxed in
     - you do not yet understand how they got in
     - losing the evidence risks patching the WRONG hole
        => CAPTURE, THEN contain. Understand the door before you shut it.

   THE STANDING RULE THAT RESOLVES MOST CASES:
     ISOLATE from the network (stops the spread)
     WITHOUT POWERING OFF (keeps the volatile evidence)
     then IMAGE memory and disk before you change the host.
   This buys most of the containment AND most of the evidence at once.

Notice how the standing rule dissolves most of the conflict. Pulling the network cable, or moving the host to an isolated segment, stops it talking to anything and so achieves most of the containment, while leaving the machine running preserves the memory and live state for the disk and memory image the analysis phase calls for. You contain by isolation and preserve by not powering off, and you take the image before you start changing the host. The cases where you must still choose are the extreme ones: when harm is spreading so fast, ransomware actively encrypting, an attacker actively reaching nationals' records, that even the seconds spent imaging are too costly. Then the rule is plain and humanitarian: stopping harm to people and data comes first, and lost forensic detail is an acceptable price. Never let the wish to gather a perfect record delay the protection of nationals. But never destroy the record needlessly either, because without it you may patch the wrong hole and the incident returns. This is a judgement call, made by the appointed incident lead, recorded with its reasoning and its time.

Eradication: remove the threat and shut the door

With the incident contained, eradication removes the threat itself and, just as importantly, the attacker's way back in. Containment is a tourniquet; eradication is treating the wound so it does not reopen. It has three parts, and skipping any one of them leaves the incident only half-resolved.

First, remove the threat. Take out the malware, the attacker's tools, the malicious accounts or scheduled tasks, the back doors they left to return. This is where the analysis pays off: you can only remove what you found, which is why the timeline and scope from Lesson 02 matter. If you are not confident you have found everything, that is the signal to rebuild rather than clean, which the next section addresses.

Second, reset and rotate affected credentials and keys. Assume that anything the attacker could reach, they captured. Reset the passwords of affected accounts; revoke and reissue the tokens and session keys; and, in the Principality's setting, reissue the per-user certificates and keys that the compromise may have exposed, the kind of per-user .p12 that gives access to a service. A threat removed from a host but left able to log back in with a stolen credential, or to authenticate with a captured key, is not eradicated at all. Rotation is what makes the attacker's loot worthless.

Third, patch the hole that was used. Find the specific weakness the attacker exploited (an unpatched piece of software, a misconfiguration, a weak or reused credential, an exposed service) and fix it. Many incidents come through a known weakness that was never fixed, so applying the missing update, correcting the configuration, or removing the exposure is what stops the same route being used again. Eradication that removes the threat but leaves the hole open is a reprieve, not a cure: the attacker, or the next one, returns the same way.

These three move together. Remove the threat, rotate the credentials and keys it may have taken, and patch the hole it came through. Do all three, or the incident is not eradicated, only paused.
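
The rotation step lends itself to a simple completeness check. The sketch below is hypothetical throughout (the Credential type, its fields, and the sample names are invented for illustration) and rests on one stated assumption: a rotation only counts if it happened after the attacker was locked out, since anything rotated while they still had access may have been captured again.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Credential:
    name: str
    kind: str                          # e.g. "password", "token", "certificate"
    reachable_by_attacker: bool        # taken from the scope in the analysis
    last_rotated: Optional[datetime]   # None if never rotated


def still_exposed(credentials: List[Credential],
                  locked_out_at: datetime) -> List[str]:
    """Names of reachable credentials not yet rotated since lock-out.

    Assumption (ours, not a rule from the lesson): only a rotation made
    after the attacker lost access makes the captured value worthless.
    """
    return [
        c.name for c in credentials
        if c.reachable_by_attacker
        and (c.last_rotated is None or c.last_rotated <= locked_out_at)
    ]


if __name__ == "__main__":
    lock_out = datetime(2024, 5, 1, 12, 0)
    creds = [
        Credential("operator password", "password", True, datetime(2024, 5, 1, 13, 0)),
        Credential("operator certificate", "certificate", True, datetime(2024, 4, 1)),
    ]
    print(still_exposed(creds, lock_out))
```

An empty result is necessary for eradication but not sufficient: the threat must still be removed and the hole patched.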

Recovery: restore, validate, and watch

Recovery returns the affected systems and data to safe, normal use. It is tempting to treat it as the easy part, the relief after the hard work, but recovery has its own discipline, and rushing it is how a contained-and-eradicated incident comes back to life.

Restore from known-good (clean) backups. This is the whole reason the backups in Lesson 01 exist. Restore the system and its data from a backup taken before the compromise and held safely offline or off-site, so you do not restore the threat along with the data. A backup taken after the attacker was already inside is not clean; restoring it reinstates the compromise. Choosing the right restore point, the most recent backup you are confident predates the compromise, is why the timeline from the analysis phase matters so much.
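
Choosing the restore point is a small calculation once the timeline gives the earliest sign of compromise. A minimal sketch follows; the function name and the optional safety margin are assumptions added for illustration, the margin reflecting that the true start of a compromise may be earlier than the first sign found.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def choose_restore_point(backup_times: Iterable[datetime],
                         earliest_compromise: datetime,
                         safety_margin: timedelta = timedelta(0)) -> Optional[datetime]:
    """Most recent backup taken strictly before the compromise.

    Returns None when no backup predates the compromise, in which case
    a simple restore is not possible and the host must be rebuilt.
    """
    cutoff = earliest_compromise - safety_margin
    candidates = [t for t in backup_times if t < cutoff]
    return max(candidates) if candidates else None
```

The result is only as good as the timeline behind it: if the earliest sign of compromise is uncertain, widen the margin or treat the backup as doubtful.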

When in doubt, rebuild rather than clean. This is the standing rule of recovery, and it deserves its own emphasis. If there is real doubt that a compromised host has been fully cleaned, do not try to scrub the threat out of the running system and hope; rebuild the host from a trusted source, a known-good image or a fresh install, and restore only the data from clean backup. Cleaning in place is faster but leaves the nagging risk that something was missed, a hidden back door, a subtle change, and a half-cleaned host is a future incident waiting to happen. Rebuilding is more work and far more certain. For a small force protecting nationals' data, certainty is worth the extra hours.

   CLEAN-RESTORE vs REBUILD: A DECISION AID

   Q1. Do you have a backup you are CONFIDENT predates the compromise?
         NO  -> you cannot simply restore. REBUILD from a trusted
                source; rebuild the data by other trusted means.
         YES -> go to Q2.

   Q2. Are you CONFIDENT the threat is fully removed and understood
       (clear scope, clear timeline, nothing unexplained)?
         NO  -> REBUILD the system from a trusted image / fresh install,
                then restore only DATA from the clean backup.
         YES -> go to Q3.

   Q3. Is this a high-value host (identity service, records,
       certificate / key material)?
         YES -> lean to REBUILD anyway. The cost of being wrong is high.
         NO  -> RESTORE from the clean backup is acceptable.

   GOLDEN RULE: when in doubt, REBUILD. A half-cleaned host
   is a future incident. Certainty is worth the extra hours.
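
The three questions of the decision aid can be written out as a short function, which makes the order of the questions explicit. This is a sketch of the aid as printed above, with hypothetical names; the three answers remain judgements for the appointed incident lead, and the code only shows how they combine.

```python
from enum import Enum


class Decision(Enum):
    REBUILD_ALL = "rebuild from a trusted source; rebuild the data by other trusted means"
    REBUILD_THEN_RESTORE_DATA = "rebuild from a trusted image or fresh install; restore only data from the clean backup"
    LEAN_TO_REBUILD = "lean to rebuild anyway; the cost of being wrong is high"
    RESTORE = "restore from the clean backup is acceptable"


def restore_or_rebuild(confident_backup_predates_compromise: bool,
                       confident_threat_removed_and_understood: bool,
                       high_value_host: bool) -> Decision:
    # Q1: without a backup known to predate the compromise, there is
    # nothing safe to restore from.
    if not confident_backup_predates_compromise:
        return Decision.REBUILD_ALL
    # Q2: any doubt about scope or timeline means the system itself is
    # rebuilt and only data comes back from backup.
    if not confident_threat_removed_and_understood:
        return Decision.REBUILD_THEN_RESTORE_DATA
    # Q3: identity, records, and key material get the cautious answer.
    if high_value_host:
        return Decision.LEAN_TO_REBUILD
    return Decision.RESTORE
```

Only one path out of four ends in a plain restore, which is the golden rule expressed in code: every doubt leads to a rebuild.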

Validate integrity before returning to service. Whether restored or rebuilt, check the system and data before you trust them: confirm the data is correct and complete, confirm the threat is genuinely gone, confirm the patch is in place and the hole closed. Do not return a service to nationals on the assumption that the restore worked; verify that it did.
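
One part of this validation, checking that data is correct and complete, can be illustrated by comparing two manifests that map each file path to its hash: one taken from the known-good backup, one from the restored or rebuilt system. The sketch assumes such manifests already exist; the function names are hypothetical, and it covers the data check only, not confirmation that the threat is gone or the patch applied.

```python
from typing import Dict, List


def compare_manifests(known_good: Dict[str, str],
                      current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare path-to-hash manifests and report every difference."""
    good_paths, current_paths = set(known_good), set(current)
    return {
        # In the known-good manifest but absent now: data is incomplete.
        "missing": sorted(good_paths - current_paths),
        # Present in both but with a different hash: data was altered.
        "changed": sorted(p for p in good_paths & current_paths
                          if known_good[p] != current[p]),
        # Present now but not in the known-good manifest: needs explaining.
        "unexpected": sorted(current_paths - good_paths),
    }


def passes_validation(report: Dict[str, List[str]]) -> bool:
    """True only when there is nothing missing, changed, or unexpected."""
    return not any(report.values())
```

A difference is not proof of a threat, since legitimate work changes files too, but each one should be explained and recorded before the service is returned.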

Confirm the hole is closed. Before the service goes back, confirm that the weakness eradication addressed is genuinely shut, the patch applied, the configuration corrected, the credentials rotated. Returning a service with the original hole still open invites the same incident straight back.

Monitor closely after returning to service. Recovery does not end when the service is live again. For a defined period afterwards, watch the recovered system more closely than normal, the logs, the alerts, the indicators from this very incident, so that if the threat survived or the attacker tries to return, you catch it at once. The recovered system has been a target; treat it as one until time and quiet have shown the recovery held.

In Practice: A Compromised Operator Account

A duty systems assistant, holding the appointment to act on the relevant systems, is alerted by the identity service to a successful sign-in to an operator account from an unfamiliar place at an odd hour, followed by an unusual run of access to records. Lesson 02's analysis confirms it: the account is compromised, the attacker is active now, but they have not yet reached the bulk of nationals' records. The assistant opens the incident record, notes the time, and works the three phases under the appointed incident lead.

Contain. The spread is active and nationals' data is the target, so this leans towards containing fast. The assistant disables the compromised account and, crucially, revokes its active sessions and tokens, so the attacker is cut off even though they hold the password. The known-bad address the sign-in came from is blocked. The affected operator's host is isolated from the network but left running, so the volatile evidence survives for imaging; isolation stops the spread without destroying the trail. None of this is deletion: the account is disabled, not removed, so it can be analysed.

Eradicate. With the attacker locked out, the team works out how they got in: a reused password, exposed in an unrelated breach, that was never rotated and was not protected by phishing-resistant MFA. They reset the account's credential, reissue its tokens, and, because the account could reach a service guarded by a per-user certificate, reissue that certificate and revoke the old one, on the assumption it may have been captured. Then they patch the hole: the account is moved to phishing-resistant MFA and screened against breached passwords. Finally, a check is run for any quiet changes the attacker made to keep a way back in, such as added recovery methods or mail-forwarding rules, and any that are found are removed.

Recover. The records the attacker touched are checked against a known-good backup to confirm nothing was altered; the integrity holds, so no data restore is needed, only verification. The account is returned to its rightful operator with fresh credentials and proper MFA. For the following days the team watches that account and that host more closely than normal, alert for any sign of return. The incident lead confirms the hole, the reused unprotected password, is closed, and the timeline, with every action and its time, is handed to the post-incident review so the same gap is fixed across every account, not just this one. At no point does anyone attempt to trace, contact, or retaliate against the attacker; that is for the proper authorities, not the team. The team's job was to contain, eradicate, recover, and learn, and it did.

Check Your Understanding

Name the three phases of this part of the incident lifecycle, in order, and explain in one or two sentences each what the phase is for. Why must they run in that order, and what goes wrong if a team tries to recover before it has properly eradicated, or eradicate before it has contained?
Explain the standing trade-off between containing fast and preserving evidence. What is the standing rule (isolate without powering off, then image) that resolves most cases, and why does it work? Give one situation in which you would override it and contain fast regardless, and say why that is the right call.
State the rule "rebuild rather than clean when in doubt" and explain why a half-cleaned host is a future incident. Walk through the clean-restore versus rebuild decision aid: when can you safely restore from a backup, and when must you rebuild instead? Why does eradication insist on rotating credentials and keys and patching the hole, not just removing the malware?

Reflection (write a short paragraph): This lesson argues that the difference between a small incident and a large one is mostly made by calm method and sound judgement: containing without destroying the evidence, eradicating the access and the hole rather than just the malware, and recovering with enough certainty to trust the result. Think about the standing trade-off between speed and evidence, and about the rule to rebuild rather than clean when in doubt. If you were the appointed assistant on the compromised-account incident above, which step do you think you would be most tempted to rush or skip under pressure, and why; what would the cost of that shortcut be; and what would help you hold to the disciplined order on a real bad day?

Summary

Containment, eradication, and recovery is the third phase of the NIST SP 800-61 lifecycle, after preparation and after detection and analysis, and before the post-incident learning. The three steps run strictly in order: stop the spread, remove the threat and its access, then restore the service.
Contain stops the spread without yet removing the threat: isolate the affected host from the network, disable compromised accounts and revoke their sessions and tokens, and block known-bad indicators, all with a light enough hand that you do not destroy the evidence prematurely.
The recurring judgement is contain fast versus preserve evidence. The standing rule resolves most cases: isolate from the network without powering off, then image before you change the host, which buys most of the containment and most of the evidence at once. Override it and contain fast only when harm to people or nationals' data is spreading faster than you can afford to wait.
Eradicate does three things together: remove the threat and the attacker's tools and back doors, reset and rotate affected credentials and keys (including per-user certificates), and patch the specific hole that was used. Skip any one and the incident is only paused.
Recover restores from known-good clean backups; rebuilds rather than cleans when there is real doubt; validates the integrity of data and system; confirms the hole is closed; and monitors closely after the service returns. A half-cleaned host is a future incident, so certainty is worth the extra hours.
The whole phase stays defensive and lawful: limit harm, remove the threat, restore service, protect nationals' data, learn, and never trace, contact, or retaliate against an attacker, which is a matter for the proper authorities. Access to act follows appointment, not this qualification.
This lesson follows Lesson 02, Detection and Analysis, which produces the scope, timeline, and evidence it depends on, and feeds Lesson 10, After the Incident, with the record and lessons learned. It connects to HCR 220 for continuity, draws its backup discipline from CIS 201, its access discipline from CIS 220, and its reporting and records discipline from PME 210, in the same defensive spirit as SIG 220.

Containment, Eradication, and Recovery