Royal Army College

Lesson Overview

A system is not finished when it is switched on. The earlier lessons explained what the Principality runs on, how servers and services carry a request across the network, and how one identity service signs people in across many systems. This lesson is about what happens next, day after day, to keep those systems healthy, safe, and there when needed. It is the quiet, unglamorous work behind a reliable digital estate, and it is exactly the work a small force can do well if it is disciplined about it.

Keeping systems running is mostly the Protect and Detect work of the wider security framework, carried out as routine. You patch and update so that known holes are closed before anyone can use them. You monitor and keep logs so that you notice trouble early and can see afterwards what happened. You document what you run so the force does not depend on one person's memory. You practise change discipline so that no surprise change can break a live system unrecorded. And you administer with least privilege so that even a trusted operator works with no more power than the task in front of them needs. None of this is clever. All of it is the difference between a system that stays up and one that fails quietly until the day it is needed.

By the end you will be able to explain why systems must be patched and updated regularly, describe how monitoring and logging let you detect and understand problems, document a system so others can run it, apply a plan, record, test, and undo discipline to changes on a live system, and apply least-privilege administration through separate admin accounts and minimal access.

This is the knowledge layer. The hands-on work (applying an update on a test system under supervision, reading a real monitoring dashboard, walking a change through its workflow) is done and signed off in person, on systems you are appointed to, where supervision allows. Understanding a task is not the same as being granted access to perform it: access follows appointment, never merely a qualification.

Key Terms

Patch: a correction issued for software, very often to close a security hole; applying it is patching.
Update: a newer version of a service or its underlying software, which may bring fixes, security corrections, and changes in behaviour.
Vulnerability: a weakness in software that an attacker can use; most successful attacks exploit known, already-fixed vulnerabilities on systems that were never patched.
Monitoring: watching systems automatically so that a problem (a service down, a disk filling, an error rate climbing) is noticed quickly rather than discovered by a complaint.
Uptime check: a simple, repeated test that asks "is this service answering?" and raises an alert when it stops.
Alert: an automatic message to the responsible person when monitoring detects a problem worth acting on.
Log: a time-stamped record a system writes of what it did, who connected, and what went wrong; logs are how you reconstruct what happened.
Runbook: written, step-by-step instructions for running and recovering a particular system, so that someone other than the original author can do it.
Change discipline: the habit of planning, recording, testing, and being able to undo any change to a live system, rather than making undocumented or untested changes.
Rollback: returning a system to its previous working state after a change goes wrong; a change you cannot undo is a change you should not make.
Least privilege: granting an account only the access its task needs, and no more.
Administrative (admin) account: a separate, more powerful account used only for system administration, kept apart from the everyday account a person uses for ordinary work.

Patching and updating: closing known holes

Software is never perfect when it ships. Over its life, weaknesses are found in it: in the service itself, in the libraries it is built on, and in the operating system beneath. When a weakness is a security hole, the people who maintain the software issue a patch or a new update that closes it. The uncomfortable truth, repeated in every serious study of how organisations are breached, is that most successful attacks do not use some clever, unknown trick. They exploit a known vulnerability, one that was found, announced, and fixed, on a system that was simply never updated. The fix existed. Nobody applied it.

This is why patching is not optional housekeeping but a frontline defence, and why it appears in the CIS Critical Security Controls baseline that a small force should hold itself to: keep software patched and updated. A self-hosted estate makes this the force's own responsibility. When you rent a managed service from a provider, much of this is done for you. When you self-host for control and sovereignty, as the Principality does, you accept that you must maintain, update, and secure those systems yourselves. That is the price of running your own.

The work has a rhythm. Know what you run, because you cannot patch what you have not written down (this is why inventory and documentation come together). Watch for security updates to those services and to the servers under them. Apply them on a sensible schedule, and apply security-critical fixes quickly rather than waiting for the next convenient moment, because the window between a fix being announced and attackers using the hole is short. And, crucially, do not patch blind on a live system. An update can change behaviour or break something that depended on the old version. Where the system matters, test the update first, on a test copy or at a quiet time, and be ready to undo it. That is change discipline, which has its own section below, because patching is simply the most common change you will ever make.
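
The rhythm above (know what you run, watch for fixes, apply the urgent ones first) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real tool: the Advisory record, the service names, and the version numbers are all hypothetical, and versions are assumed to be plain dotted numbers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    """A published fix. Hypothetical shape, not a real advisory feed format."""
    service: str
    fixed_in: tuple[int, ...]   # first version that contains the fix
    published: date
    security_critical: bool


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.4.10' into (2, 4, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def outstanding_patches(inventory, advisories, today):
    """Return (service, days_exposed, security_critical) for every unapplied fix.

    Security-critical fixes sort first, then the longest-exposed.
    """
    pending = []
    for adv in advisories:
        installed = inventory.get(adv.service)
        if installed is None:
            continue  # not in the inventory, so this check cannot see it
        if parse_version(installed) < adv.fixed_in:
            days = (today - adv.published).days
            pending.append((adv.service, days, adv.security_critical))
    return sorted(pending, key=lambda item: (not item[2], -item[1]))


# Hypothetical inventory: service name -> installed version.
inventory = {"wiki": "2.4.9", "mail": "3.1.0"}
advisories = [
    Advisory("wiki", parse_version("2.4.10"), date(2024, 3, 1), True),
    Advisory("mail", parse_version("3.0.5"), date(2024, 2, 1), True),
    Advisory("chat", parse_version("1.2.0"), date(2024, 2, 20), False),
]
print(outstanding_patches(inventory, advisories, date(2024, 3, 4)))
# [('wiki', 3, True)]
```

Note what the sketch cannot do: the chat service is missing from the inventory, so its advisory is silently skipped. That is the point about inventory made in code.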

One caution that the system level shares with personal cyber hygiene: updates are also a favourite disguise for attackers, who send fake "you must update now" messages to trick people into installing malware. Real updates come from the software's genuine source (the project's own release, the Linux distribution's package system, the container image from its proper registry), never from a link in an unexpected message. Patch from trusted sources only.
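
Where a project publishes a checksum alongside its release, one modest way to confirm that a download is the file the project released is to compare digests. The sketch below uses Python's standard hashlib and hmac modules; the payload is a stand-in, and the check is only as trustworthy as the place the published digest came from. Package systems and registries usually perform stronger, signature-based checks on your behalf.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """True only if the download hashes to the digest the genuine source published."""
    return hmac.compare_digest(sha256_of(data), published_hex.strip().lower())


# Stand-in values for illustration only.
download = b"example update payload"
published = sha256_of(download)  # in real use, copied from the project's own release page

print(matches_published_digest(download, published))                # True
print(matches_published_digest(download + b"tampered", published))  # False
```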

   THE PATCH WINDOW: why "soon" matters

   t0  A vulnerability is found in a service you run.
   t1  The maintainers publish a fix (a patch / new version).
       |                                                  |
       |  <----------- THE WINDOW OF RISK -------------> |
       |   the hole is now public knowledge AND unfixed   |
       |   on every system that has not yet updated       |
       |                                                  |
   t2  Attackers begin scanning for unpatched systems.
   t3a  You patched early ......... you are out of the window.  SAFE
   t3b  You "got to it later" ..... you were exposed the whole time.

   The fix existed before the attack. Patching is closing the window.

Monitoring and logging: detecting and understanding

A system can fail silently. A service can stop answering, a disk can fill, an error rate can climb, a certificate can quietly expire, and nothing announces it. If the first you hear of a problem is a national unable to use a service, you have already failed to detect it. Monitoring is the answer: software that watches your systems automatically and tells you when something is wrong, so that detection does not depend on someone happening to notice.

The simplest and most valuable form is the uptime check. A monitor repeatedly asks each service a plain question, "are you answering?", and when a service stops answering, or answers too slowly, or returns an error, it raises an alert to the responsible person. Beyond plain up-or-down, good monitoring watches the warning signs that come before a failure: disk space running low, memory or processor under strain, certificates approaching expiry, error rates rising. The point of all of it is the same: to turn a problem you would have discovered late into one you are told about early, while there is still time to act calmly.
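
An uptime check of the kind just described might look like the sketch below, written with Python's standard library only. The two-second slowness threshold and the five-second timeout are assumptions chosen for illustration, not values the lesson prescribes, and a real monitoring tool does considerably more.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int], elapsed: float, slow_after: float = 2.0) -> str:
    """Map one probe result to 'down', 'error', 'slow', or 'ok'."""
    if status is None:
        return "down"    # no answer at all
    if status >= 400:
        return "error"   # answered, but with an error status
    if elapsed > slow_after:
        return "slow"    # answered correctly, but too slowly
    return "ok"


def probe(url: str, timeout: float = 5.0) -> tuple[Optional[int], float]:
    """Ask the service once; return (HTTP status or None, seconds taken)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, with an error status
    except (urllib.error.URLError, OSError):
        status = None       # refused, timed out, or unreachable
    return status, time.monotonic() - started


print(classify(200, 0.2))   # ok
print(classify(200, 3.5))   # slow
print(classify(503, 0.1))   # error
print(classify(None, 5.0))  # down
```

In use, a scheduler would call probe on each service address every minute or so and pass the result to classify; anything other than "ok" is a candidate for an alert.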

Alerts only help if they reach a real person who can respond, and if they are trustworthy. An alert that nobody receives is useless; an alert that fires constantly for nothing trains people to ignore it, so that the one alert that matters is missed in the noise. A small force should monitor the things that matter, route alerts to whoever is on duty, and keep them meaningful.
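
One common way to keep alerts meaningful is to alert only after several consecutive failed checks, so a single blip does not wake anyone, and to send one notice when the service recovers. The class below is a sketch of that idea; the name AlertGate and the threshold of three are assumptions for illustration. The trade-off is real: a higher threshold is quieter but slower to report a genuine outage.

```python
from typing import Optional


class AlertGate:
    """Raise one alert after `threshold` consecutive failures, one notice on recovery."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def observe(self, healthy: bool) -> Optional[str]:
        """Feed in one check result; return 'alert', 'recovered', or None."""
        if healthy:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"   # fires once, not on every later failure
        return None


gate = AlertGate(threshold=3)
checks = [True, False, True, False, False, False, False, True]
results = [gate.observe(ok) for ok in checks]
print(results)
# [None, None, None, None, None, 'alert', None, 'recovered']
```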

Where monitoring tells you that something is wrong now, logs tell you what happened. A log is a time-stamped record a system writes as it runs: who connected and when, what was requested, what succeeded, and what failed and why. Logs are how you reconstruct events after the fact, whether the ordinary failure you are diagnosing or the security event you are investigating. They are also evidence. If an account is misused or a system is attacked, the logs may be the only record of what occurred, which is one reason they must be kept, kept safe from tampering, and kept long enough to be useful. Monitoring is detection in the moment; logging is the memory that lets you understand and learn. Together they are the Detect work that makes the rest of security possible, because you cannot respond to, or recover from, what you never saw.
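
Reconstructing what happened usually means laying entries from several logs side by side in time order. The sketch below assumes, purely for illustration, that each log line begins with an ISO timestamp followed by a space, and that all systems record time in the same zone. Real log formats vary, and the sample entries are invented.

```python
from datetime import datetime


def parse_line(source: str, line: str) -> tuple[datetime, str, str]:
    """Split 'ISO-timestamp message' into (time, source, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), source, message


def timeline(logs: dict[str, list[str]], start: datetime, end: datetime):
    """Merge several logs into one time-ordered list limited to [start, end]."""
    events = [parse_line(source, line) for source, lines in logs.items() for line in lines]
    return sorted(event for event in events if start <= event[0] <= end)


# Invented sample entries from two hypothetical sources.
logs = {
    "web": [
        "2024-03-05T09:14:02 GET /forms 200",
        "2024-03-05T09:20:41 GET /forms 500",
    ],
    "system": [
        "2024-03-05T07:00:00 nightly job finished",
        "2024-03-05T09:18:30 disk /var 97% full",
    ],
}
window = timeline(logs, datetime(2024, 3, 5, 9, 0), datetime(2024, 3, 5, 10, 0))
for when, source, message in window:
    print(when.isoformat(), source, message)
# 2024-03-05T09:14:02 web GET /forms 200
# 2024-03-05T09:18:30 system disk /var 97% full
# 2024-03-05T09:20:41 web GET /forms 500
```

Read in order, the merged view suggests a story that neither log tells alone: the disk filled, and the errors followed.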

Documenting what you run

The most dangerous single point of failure in a small force's systems is not a server. It is one person's memory. If only one member knows how a service is configured, where its data lives, how to restart it, and what to do when it misbehaves, then the force's continuity rests on that one person being reachable, willing, and well. People go on operations, fall ill, take leave, and move on. A system that only one person can run is a system the Principality does not really own.

The defence is to write it down. Documentation turns private knowledge into something the force holds in common. At a minimum, for each system the force depends on, record what it is and what it does, where it runs and what it depends on, how to start, stop, and restart it, where its data and backups are, how to recover it when it fails, and who is responsible for it. The recovery instructions, often called a runbook, are the most valuable of all, because they are needed exactly when the usual person is unavailable and the pressure is highest. A good runbook can be followed by a competent member who has never touched that system before.
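
The minimum record listed above can be treated as a form with no optional boxes. The sketch below is one hypothetical way to hold it and to report what is still blank; the field names are this example's own, and the sample values are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class SystemRecord:
    """The minimum a force should write down about each system it depends on."""
    name: str
    purpose: str             # what it is and what it does
    runs_on: str             # where it runs
    depends_on: str          # what it depends on
    start_stop_restart: str  # how to start, stop, and restart it
    data_and_backups: str    # where its data and backups are
    recovery_runbook: str    # how to recover it when it fails
    owner: str               # who is responsible for it


def missing_fields(record: SystemRecord) -> list[str]:
    """Names of fields left empty or filled with whitespace only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Invented example with two gaps.
wiki = SystemRecord(
    name="wiki",
    purpose="internal knowledge base",
    runs_on="server-02",
    depends_on="identity service, database",
    start_stop_restart="see section 3 of the wiki operations note",
    data_and_backups="database volume; nightly backup",
    recovery_runbook="",
    owner="  ",
)
print(missing_fields(wiki))
# ['recovery_runbook', 'owner']
```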

Documentation only helps if it is true. Out-of-date instructions are worse than none, because they are trusted and then mislead. This is one reason documentation and change discipline belong together: when you change a system, you update its documentation as part of the same change, not "later". Keep the documentation where the responsible people can find it, protected like any other sensitive record, and treat it as part of the system rather than an afterthought. The test is simple and worth asking honestly of every system you run: if the one person who knows this were unreachable today, could someone else keep it running from what is written down?
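
A simple check can catch documentation that has fallen behind: compare the date a system last changed with the date its documentation was last updated. The sketch below assumes both dates are already recorded somewhere, for example in the change log; the system names and dates are invented.

```python
from datetime import date


def stale_documents(last_changed: dict[str, date], last_documented: dict[str, date]) -> list[str]:
    """Systems changed more recently than their documentation, or never documented."""
    return sorted(
        name
        for name, changed in last_changed.items()
        if name not in last_documented or last_documented[name] < changed
    )


# Invented dates for three hypothetical systems.
last_changed = {"wiki": date(2024, 3, 5), "mail": date(2024, 1, 10), "chat": date(2024, 2, 2)}
last_documented = {"wiki": date(2024, 2, 1), "mail": date(2024, 1, 10)}

print(stale_documents(last_changed, last_documented))
# ['chat', 'wiki']
```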

Change discipline: plan, record, test, undo

Most outages a small force suffers are not caused by attackers. They are caused by a change. Someone adjusted a setting, applied an update, or "just quickly fixed" something on a live system, and it broke, and now nobody is quite sure what was changed or how to put it back. The cure is not to stop changing things (systems must be patched and improved) but to change them with discipline. The rule is plain: never make undocumented or untested changes on a live system.

Disciplined change has four parts, in order. First, plan the change before you touch anything: know what you intend to change, why, what you expect to happen, and what could go wrong. Second, record it, what you are about to do and when, so there is a written trail and others know a change is happening. Third, test it where you can, on a test copy or at a quiet, low-impact time rather than blind on the live system at its busiest, and check afterwards that the system is actually healthy, not merely that the change applied. Fourth, and underpinning all of it, be able to undo it: know your rollback before you start. A change you cannot reverse is a change you should not make on a system that matters, because the moment it goes wrong is the moment you most need a way back to the last known-good state.

This is the same discipline as patching, generalised, because a patch is just the commonest change. It is also the same disciplined mindset the Signals speciality teaches for communications, and the same care that PME 210 teaches for records and orders: a change to a live system is an order issued to a machine, and it deserves the same forethought, the same written record, and the same ability to recover. Make change boring on purpose. Boring changes are the ones that do not take a service down.

   THE SAFE-CHANGE WORKFLOW

   +---------+     +---------+     +---------+     +---------+
   |  PLAN   | --> | RECORD  | --> |  TEST   | --> | APPLY   |
   | what &  |     | write   |     | on a    |     | to live |
   | why;    |     | it down |     | copy /  |     | + verify|
   | know    |     | before  |     | quiet   |     | system  |
   | rollback|     | you act |     | time    |     | healthy |
   +---------+     +---------+     +----+----+     +----+----+
        ^                               |               |
        |                               | broke?        | broke?
        |          +--------------------+---------------+
        |          v
        |     +---------+
        +-----| ROLLBACK|   undo to the last known-good state,
              | (undo)  |   then update the record, and only
              +---------+   then try again.

   No step may be skipped. A change you cannot undo
   is a change you do not make on a live system.
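
The workflow in the diagram can be sketched as a small guard around any change. This is an illustration under assumptions, not a change-management system: the Change record and the run_change function are hypothetical names, and the apply, verify, and rollback steps are stand-ins that only edit a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Change:
    summary: str                              # PLAN: what and why
    apply: Callable[[], None]                 # the change itself
    verify: Callable[[], bool]                # is the system actually healthy?
    rollback: Optional[Callable[[], None]]    # the way back, known before starting
    log: list[str] = field(default_factory=list)


def run_change(change: Change, tested_on_copy: bool) -> bool:
    """Apply a change to the live system only if every earlier step was done."""
    if change.rollback is None:
        raise ValueError("no rollback planned: do not make this change")
    if not tested_on_copy:
        raise ValueError("untested change: test on a copy or at a quiet time first")
    change.log.append(f"RECORD: about to apply: {change.summary}")
    try:
        change.apply()
        healthy = change.verify()
    except Exception as exc:
        change.log.append(f"apply or verify failed: {exc}")
        healthy = False
    if healthy:
        change.log.append("APPLIED: system verified healthy")
        return True
    change.rollback()
    change.log.append("ROLLBACK: restored last known-good state")
    return False


# Stand-in "system": a dictionary holding a version number.
state = {"version": "1.0"}
update = Change(
    summary="update service from 1.0 to 1.1 to close a security hole",
    apply=lambda: state.update(version="1.1"),
    verify=lambda: False,                      # simulate an unhealthy result
    rollback=lambda: state.update(version="1.0"),
)
succeeded = run_change(update, tested_on_copy=True)
print(succeeded, state["version"])
# False 1.0
```

The two refusals at the top are the important part of the design: a change with no rollback, or with no test, is stopped before it touches anything.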

Least-privilege administration: holding only the keys you need

The final piece is about how an administrator holds power. The principle is least privilege: an account is granted only the access its task needs, and no more. It applies to everyone, but it bites hardest for those who run systems, because they hold the powerful accounts, and a powerful account misused, or stolen, does the most damage.

In practice this means a few concrete habits. Keep separate admin and everyday accounts. The account you read mail and browse with is not the account you administer servers with; the everyday account cannot do administrative damage, and the admin account is used deliberately, only for administration, and ideally only when needed. Grant each task only the access it requires rather than handing out broad, all-powerful access for convenience. The same applies to the service accounts that let machines talk to each other: they are powerful and often forgotten, and must be guarded as carefully as any human account. Protect admin accounts most strongly of all, with long unique passphrases and phishing-resistant multi-factor authentication, because they are the most valuable target on the estate. And revoke access promptly when a role ends or an appointment changes, so that no account keeps power it no longer has any reason to hold.
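
These habits reduce to two rules that are easy to state in code: deny by default, and remove everything when an appointment ends. The sketch below is a toy model, not how any particular identity service works; the account kinds, the permission names, and the owner are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    owner: str
    kind: str                                      # "everyday", "admin", or "service"
    grants: set[str] = field(default_factory=set)  # explicit permissions only
    active: bool = True


def is_allowed(account: Account, permission: str) -> bool:
    """Deny by default: allow only an active account holding the exact grant."""
    return account.active and permission in account.grants


def revoke_all(accounts: list[Account], owner: str) -> int:
    """When an appointment ends, deactivate and strip every account of that owner."""
    count = 0
    for account in accounts:
        if account.owner == owner and account.active:
            account.active = False
            account.grants.clear()
            count += 1
    return count


# Invented accounts and permission names.
everyday = Account("aro", "everyday", {"mail:read"})
admin = Account("aro", "admin", {"wiki:restart"})
backup = Account("backup-job", "service", {"wiki:read-data"})
accounts = [everyday, admin, backup]

print(is_allowed(everyday, "wiki:restart"))  # False: everyday account cannot administer
print(is_allowed(admin, "wiki:restart"))     # True: granted for this task
print(is_allowed(admin, "mail:restart"))     # False: never granted, so denied
print(revoke_all(accounts, "aro"))           # 2
print(is_allowed(admin, "wiki:restart"))     # False: appointment ended
```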

Beneath the mechanics sits a matter of conduct, which CIS 220 develops in full. Access follows appointment, not qualification. Completing this course does not grant you access to anything; it prepares you to be trusted with access if and when you are appointed to a system. Whoever holds that access holds only what their appointment needs, uses it only for its proper purpose, documents what they do, never makes undocumented changes, and escalates when unsure. Someone trusted with the Principality's systems is trusted with the Principality's continuity and with its nationals' data. Least privilege is how a careful person makes sure that even their own mistake, or their own stolen account, can do as little harm as possible.

   SYSTEM-HEALTH AND ADMIN CHECKLIST

   PROTECT
   [ ] Services and servers patched; security fixes applied promptly
   [ ] Updates taken only from trusted, genuine sources
   [ ] Admin accounts separate from everyday accounts
   [ ] Each account / service account holds least privilege only
   [ ] Admin accounts protected with passphrase + strong MFA
   [ ] Access revoked promptly when an appointment ends

   DETECT
   [ ] Uptime check on every service that matters
   [ ] Warning signs watched: disk, load, certificate expiry, errors
   [ ] Alerts reach a real on-duty person and are kept meaningful
   [ ] Logs kept, protected from tampering, retained long enough

   DOCUMENT & DISCIPLINE
   [ ] Each system documented: what, where, depends-on, owner
   [ ] A runbook exists to start, stop, and recover it
   [ ] Every change planned, recorded, tested, and reversible
   [ ] Documentation updated as part of the change, not later

In Practice: A Tuesday on the Duty Roster

Corporal Aro holds the systems-assistant appointment this week, with access to a small set of services she is responsible for and nothing beyond them. On Tuesday morning her phone carries an alert: the monitoring has noticed that one service is answering slowly. She does not panic and she does not start changing things. She opens the logs first, the system's memory, and sees the disk on that server is nearly full, which is slowing everything down. The runbook for that system, written months ago by someone now on leave, tells her exactly where old log files pile up and how to clear them safely. Because it was written down, she does not need the original author, and she is back to a healthy system within the hour.

That afternoon brings a security update for one of her services. She does not apply it straight to the live system. She plans the change, records in the change log what she is about to do and why, applies and checks the update on the test copy first, and confirms the service is genuinely healthy afterwards, not merely that the update installed. Only then does she apply it to the live system, with the rollback to the previous version already in hand in case it misbehaves. She does all of this from her admin account, the separate one used only for this work, and signs out of it the moment she is done. When a message arrives later urging her to "update now" through a link, she ignores it: real updates come from the genuine source, never from an unexpected link. She has done nothing clever all day. She has kept the systems running, and that was the whole job.

Check Your Understanding

Most successful attacks exploit known, already-fixed vulnerabilities. What does this tell you about why patching matters, and why applying a security update promptly is part of the force's defence rather than mere housekeeping?
Describe the four parts of disciplined change in order, and explain why being able to undo a change must be settled before the change is made, not after it goes wrong.
Why does a small force keep admin accounts separate from everyday accounts, and what does the rule "access follows appointment, not qualification" mean for a member who has just completed this course?

Reflection (write a short paragraph): Think of a system, a service, or even a routine in your own life or work that only one person truly understands. What would happen if that person were unreachable tomorrow, and what is the smallest piece of documentation that would most reduce the risk?

Summary

Keeping systems running is the day-to-day Protect and Detect work behind a reliable digital estate; none of it is clever, all of it is the difference between a system that stays up and one that fails quietly.
Patch and update regularly and apply security fixes promptly, because most attacks exploit known, already-fixed holes on systems that were never updated; self-hosting makes this the force's own responsibility, and updates are taken only from genuine sources.
Monitor systems with uptime checks and meaningful alerts so problems are detected early, and keep logs so you can understand and learn from what happened; you cannot respond to what you never saw.
Document what you run, above all a recovery runbook, so the force does not depend on one person's memory, and keep that documentation true by updating it as part of every change.
Practise change discipline: plan, record, test, and be able to undo; never make undocumented or untested changes on a live system.
Administer with least privilege: separate admin and everyday accounts, only the access a task needs, strongly protected admin accounts, and access revoked promptly when an appointment ends.
Cross-references: this lesson builds on Lesson 03 (accounts, identity, and single sign-on) and leads into Lesson 05 (backups, storage, and continuity) and Lesson 10 (working safely with elevated access). It shares the disciplined mindset of SIG 220 · Communications Security and Digital Discipline, the records-and-orders care of PME 210, the continuity concern of HCR 220, and is deepened by CIS 220 · Identity, Access, and Records Security and CIS 310 · Cyber Incident Response and Continuity.

Keeping Systems Running