The Inadequacy of Backup and Recovery for OT Cyber Resilience (Post 2/4)

By Ben Simon
June 19, 2024

This post is the second in a four-part series on OT cyber resilience, written by the founders of Fortress Labs, Ben Simon and Leor Fishman. The first blog post offered a general introduction to the topic of OT cyber resilience with a focus on recovery, remediation, and mitigation (the “right side of the bowtie”). This second post will explore Backup and Recovery technology—how it works in general, how it is used for OT cyber resilience, what its strengths and limitations are, and ultimately why it is insufficient as a resilience approach for critical industrial systems.

When it comes to cyber recovery, the conventional starting point is what’s known as Backup and Recovery, which simply refers to the practice of storing separate copies of operational data that can be used to restore systems in the event of a disaster or cyber attack.  Our core thesis is that Backup and Recovery is ultimately inadequate for OT cyber resilience, but before discussing the issues with a resilience strategy centered on it, let’s briefly review the basics.

A (Very) Brief Introduction to Backup and Recovery

At a basic level, the logic behind Backup and Recovery for cyber resilience is as follows: since the relevant layers of the stack are software-defined, what matters is the state of the system.  If that state can be captured (backed up) and reverted (recovered) in the event of a failure or an attack, then the system can resume normal operations easily.

Generally speaking, when it comes to what can be “backed up,” there are two options: a backup can capture either everything on the machine’s disk (a full image) or individual files and folders.  Both approaches are common and often used in tandem.
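To make the file-level approach concrete, here is a minimal sketch of capturing a handful of configuration files into a timestamped archive with checksums.  The paths, destination, and manifest format are hypothetical, and real backup products handle scheduling, encryption, retention, and integrity verification far more rigorously:

```python
import hashlib
import json
import tarfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical control-system configuration files worth capturing
CONFIG_PATHS = [Path("/opt/scada/project.cfg"), Path("/opt/scada/tags.db")]
BACKUP_DIR = Path("/mnt/backups")  # assumed to be an offline or immutable mount

def file_level_backup(paths, backup_dir):
    """Archive selected files and record their SHA-256 hashes for later verification."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = backup_dir / f"config-backup-{timestamp}.tar.gz"
    manifest = {}
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(path, arcname=path.name)
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    (backup_dir / f"manifest-{timestamp}.json").write_text(json.dumps(manifest, indent=2))
    return archive_path

if __name__ == "__main__":
    print(f"Wrote {file_level_backup(CONFIG_PATHS, BACKUP_DIR)}")
```

A full-disk image, by contrast, captures the operating system, installed applications, and configuration together—typically via an imaging tool or agent rather than a script like the one above.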

When it comes to where the captured data is stored, there is quite a bit more variance.  Storage technologies run the gamut from “data storage on offsite tape backups” to “hardened custom immutable drives,” and most Backup and Recovery vendors tout their particular choice of storage and recovery medium as one of the major selling points of their process.

Regardless of the particular choice of storage medium, however, one thing always remains true: captured data must be recovered onto a new or cleaned machine—or used as a boot disk for that machine—in order to be usable.  Backed up data without a compute substrate to process that data is functionally useless.  This simple but salient fact has important implications for recovery times (more on this below).

Backup and Recovery in IT vs. OT: It’s All About the Compute Substrate

The core theme of our previous blog post was how the fundamental differences between IT and OT necessitate subtly different approaches to cyber resilience.  When it comes to Backup and Recovery, the story is the same: because IT and OT have different technology stacks, constraints, and priorities, Backup and Recovery should be thought of differently for each domain.

In IT, data backups generally serve as “vaults” or “stores of last resort.” IT teams generally capture and store customer data, emails, or proprietary information in an uncorrelated, secure, immutable location.  Apart from storing isolated copies of sensitive data, Backup and Recovery systems do not really do much to provide resilience for IT workflows and applications themselves; instead, IT applications are made resilient by running multiple copies of key virtual machines/containers in hyper-scaled environments, accompanied by distributed repositories of key code.

In OT, by contrast, the primary function of backups is to store control system configurations.  The processes themselves rely a) on machines that cannot easily be replicated between environments, and b) on configuration data that must live with the machine (for regulatory and practical reasons).

Put starkly, whereas in IT, backed-up data can be pushed onto any machine with a metaphorical pulse in order to be extracted and used, in OT, a backup is only useful for maintaining operations if it a) lives in the right location within a complex, legacy network architecture, and b) can be easily pushed onto a machine with precisely the right internal specifications.  OT backups without attendant procedures for loading recovered data onto such a machine therefore cannot serve the key function of maintaining operations through a crisis event.  In short, it all comes down to what we call the compute substrate: OT systems usually have very specific compute and machine requirements compared to IT, adding a significant component to the task of recovering from an attack or disaster.

In OT, where uptime is everything, recovering from a backup is far too slow.  In fact, when it comes to recovery speed, the major constraint in the OT world is not the sheer amount of data (as it is when recovering IT systems), but rather hardware acquisition and placement.  Especially in the case of a cyber attack, the only accessible hardware after an event will likely be either miles away or already infected.  The process of locking down and then cleaning up old hardware, or shipping out new hardware, can itself take crucial minutes (or hours, or days) that many industrial lines cannot afford to lose, especially if an environment is considered Critical Infrastructure.

“Fly-blind time”: A Practical and Aspirational Measure of Recovery Time Objectives in OT

Having provided a cursory explanation of how Backup and Recovery works in IT and OT, the final question—and indeed the most important one—is the following: How rapidly do Backup and Recovery systems need to function in order to be useful? Here, as well, there are different answers for IT and OT.

With respect to the question of required recovery speed, the term that is commonly thrown around is Recovery Time Objective (RTO)—i.e., the target time frame within which an organization aims to regain operational functionality (whether for the entire organization, an individual location or environment, or even just a single application).  How do RTOs differ in IT and OT?

In IT, the cost of system downtime is approximately linear: if a system is down for one hour, customers cannot use the product for that one hour, and similarly for one minute or one day.  Accordingly, RTOs in the IT space are generally set according to the organization’s loss tolerance: if the organization can afford the loss of customer confidence and revenue from one hour of downtime, then it will set a recovery time objective of less than one hour.  Furthermore, there are plenty of sectors where IT data backups are critical for data retention reasons but recovery times are not an important factor (for example, a law firm backing up its case data, or a bank backing up its audit statements).

In OT, by contrast, downtime costs are highly non-linear: because OT systems control physical processes with outsized damage risks, each system has a maximum length of time it can operate without its software before a full shutdown is required, and that shutdown can take days to recover from.  The difference between one minute of downtime and one hour of downtime is therefore much more significant than the difference between one hour of downtime and three or four hours of downtime.  Why? Because within a minute or two of losing visibility and control in an industrial environment, a site operator is effectively “flying blind”—and “flying blind” is not a safe way to operate in, for example, a power generation facility, an oil refinery, or a specialty chemicals plant handling potentially toxic materials.

In effect, the true recovery time objective for OT system operators is equivalent to what we might call the “fly-blind time,” i.e., the maximum amount of time an operator can go without control of and visibility into the lower-level physical systems.  In practice, an OT environment’s fly-blind time is usually on the order of minutes at the higher end, and seconds or even milliseconds in more extreme scenarios.  If a system failure persists for longer than this fly-blind time, the operator must initiate an unplanned shutdown of industrial processes—which is itself a costly (and sometimes dangerous) procedure.  Indeed, the cost of shutting down a complex industrial environment without warning, and then getting it started back up again, is oftentimes comparable to the revenue lost during the downtime itself!
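The contrast can be sketched with a toy cost model.  All of the numbers below are invented purely for illustration; the point is the shape of the curves, not the dollar figures:

```python
# Illustrative only: invented numbers to contrast a roughly linear IT downtime cost
# with the step-change OT cost incurred once the "fly-blind time" is exceeded.

IT_COST_PER_HOUR = 50_000          # hypothetical revenue loss per hour of IT downtime
FLY_BLIND_TIME_HOURS = 2 / 60      # e.g., ~2 minutes of tolerable loss of visibility/control
SHUTDOWN_RESTART_COST = 2_000_000  # hypothetical cost of an unplanned shutdown and restart
OT_COST_PER_HOUR = 150_000         # hypothetical lost production per hour while offline

def it_downtime_cost(hours: float) -> float:
    """IT: cost scales roughly linearly with the length of the outage."""
    return IT_COST_PER_HOUR * hours

def ot_downtime_cost(hours: float) -> float:
    """OT: once the fly-blind time is exceeded, a large fixed shutdown/restart cost kicks in."""
    cost = OT_COST_PER_HOUR * hours
    if hours > FLY_BLIND_TIME_HOURS:
        cost += SHUTDOWN_RESTART_COST
    return cost

for h in (1 / 60, 1, 4):
    print(f"{h:5.2f} h  IT: ${it_downtime_cost(h):>12,.0f}   OT: ${ot_downtime_cost(h):>12,.0f}")
```

Under these made-up numbers, the jump from one minute to one hour of OT downtime is dominated by the shutdown/restart step, while going from one hour to four adds comparatively little—which is exactly the non-linearity described above.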

In practice, then, discussion of RTOs in the OT world can’t be reduced to a simple question of revenue loss that scales linearly with the amount of downtime.  This is an IT-centric mode of thinking, and a misguided approach to OT resilience.  If safe and seamless operational continuity cannot be assured in the OT environment in the event of a cyber attack or disaster, then the cost and complexity of recovery is going to be extremely high, regardless of whether it takes 6 hours, 12 hours, or three days.

Moving Beyond the Backup and Recovery Paradigm for Critical OT Systems

OT systems are between a rock and a hard place when it comes to Backup and Recovery.  Without near-instantaneous recovery of critical control and visibility, industrial operators cannot safely continue operating and must shut down.  And yet, the common network architectures of OT systems and the compute requirements for OT system recovery both introduce severe delays in recovery time, far exceeding what these critical systems can tolerate.

We would go so far as to say that, beyond its potential utility as a very basic last-resort option, Backup and Recovery in its current form does not offer robust cyber resilience for critical industrial systems.  It may enable eventual recovery, but only after severe risks to process safety and painful unplanned shutdown and startup delays, not to mention the financial, reputational, and sometimes even legal and regulatory damage from the extended operational downtime itself.

At this point, readers might reasonably be wondering the following: if the core issue with Backup and Recovery for OT that we have identified is the lack of readily available secondary computing infrastructure, why not simply have a second copy of the relevant systems running in the background, in case of any issues with the primary copy? This idea underlies High Availability, the topic of the next post in this series.

(Here’s a little teaser: if OT Backup and Recovery leaves much to be desired in terms of recovery times and avoiding costly shutdowns, High Availability has its own problems when it comes to cyber risk and vulnerabilities.)

Thanks to Michael Miller for his valuable feedback.