Does High Availability Provide Cyber Resilience for Industrial Applications? (Post 3/4)

By Ben Simon
July 10, 2024

This post is the third in a four-part series on OT cyber resilience written by the founders of Fortress Labs, Ben Simon and Leor Fishman. You can find the first two posts in the series here and here.

What is the ideal resilience solution for Operational Technology, and for Industrial Control Systems in particular? The previous post examined Backup and Recovery and laid out the case for why it fails to provide the kind of resilience that critical industrial applications require, largely because of its necessarily lengthy recovery times. But what, then, is the alternative?

There is really only one other strategic approach that industrial asset owners can employ when it comes to resilience: an approach known as High Availability. Unlike Backup and Recovery, High Availability involves running redundant compute infrastructure (not just data storage) that can take over in the event that the primary system fails or needs to be taken down.

Our core thesis on High Availability is similar to our argument vis-à-vis Backup and Recovery, but inverted. Whereas Backup and Recovery (when done correctly) provides robust isolation and immutability of backed-up data but fails to provide sufficiently quick recovery, High Availability provides near-instantaneous recovery and availability but suffers from an architectural design that renders it useless in the face of cyber attacks and other software-related disasters.

A Primer on High Availability

First things first: what actually is High Availability? As mentioned, High Availability involves running multiple servers and software instances, each of which is ready to take over in the event that any other fails or needs to be taken down. The salient architectural fact is that these systems are constantly communicating with one another, passing back and forth information about their respective health as well as any updates to their shared data.

This tight coupling is precisely what enables the secondary system(s) in the High Availability environment to seamlessly take over operations if the primary system goes down. High Availability systems can even be run in “hyper-converged” architectures, where storage as well as system networking configurations run on the same set of connected systems, allowing recovery from failures deeper in the stack.
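The heartbeat-and-failover mechanism described above can be sketched in a few lines of Python. This is a highly simplified illustration, not a real HA implementation (production stacks handle split-brain, quorum, and much more); the class names, the timeout value, and the shape of the replicated state are all hypothetical.

```python
import time

# Seconds without a heartbeat before the secondary assumes the primary
# is down. Illustrative value only; real HA systems tune this carefully.
HEARTBEAT_TIMEOUT = 3.0

class Node:
    """One member of a (hypothetical) two-node HA pair."""

    def __init__(self, name, role):
        self.name = name
        self.role = role                      # "primary" or "secondary"
        self.last_heartbeat = time.monotonic()
        self.state = {}                       # replicated application data

    def receive_heartbeat(self, peer_state):
        # Each heartbeat carries both liveness information and state
        # updates, keeping the secondary in lockstep with the primary.
        self.last_heartbeat = time.monotonic()
        self.state.update(peer_state)

def check_failover(secondary, now=None):
    """Promote the secondary if the primary's heartbeats have stopped."""
    now = now if now is not None else time.monotonic()
    if now - secondary.last_heartbeat > HEARTBEAT_TIMEOUT:
        secondary.role = "primary"            # take over operations
    return secondary.role
```

Note that the same channel that keeps `state` synchronized is what makes the takeover seamless: the secondary already holds an up-to-date copy of everything it needs when it is promoted.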

Where High Availability Shines…

High Availability is the de facto approach for industrial organizations that are looking to provide redundancy for their industrial control and automation systems. This redundancy enables industrial asset owners to maintain continuity and system availability in the event of certain types of failures—generally, failures that can be meaningfully isolated to the primary system.

These failures cover the majority of events that an industrial operator might encounter during day-to-day operations: hardware failure, small-scale software issues, and the like. In these cases, the collapse of a single server can be seamlessly circumvented by failing over to a secondary server, and data loss can be avoided thanks to the constant communication between the primary and secondary environments. For these kinds of small, localized failures, High Availability is built for the job.

…And Where It Comes Up Short

But what of larger fail states, such as cyber attacks? In these scenarios, High Availability’s strengths become its weaknesses. Consider a threat scenario such as Volt Typhoon, where a nation-state actor penetrates an industrial system and dwells within it for long periods, moving laterally from machine to machine. In such an attack, the constant, direct communication that makes High Availability so seamless in the shallow case renders the cluster transparent to the attacker and makes compromising the secondary computing instances trivial. This is all the more true in hyper-converged architectures, where even data storage sits in the same visible cluster.

If you were to ask a seasoned industrial operator whether his or her High Availability systems were reliable in the case of such a network-wide threat, whether insider or outsider, the operator would in all likelihood respond in the negative. The infrastructure that provides continuity and availability for isolated failure scenarios as well as more “traditional” physical disasters is toothless against cyber threats.

In fact, even beyond the cyber scenario, certain types of larger-scale software issues can render entire High Availability clusters unusable. Recall that High Availability systems are by nature tightly coupled, with a constant “heartbeat” of data flowing between them. Thus, if an operator mistakenly pushes a faulty configuration change to one machine in a High Availability cluster, the nature of High Availability means that the change will immediately propagate to the remainder of the cluster, necessitating downtime and a full rollback to fix. These data corruption and operator-error challenges are less “scary” than an advanced cyber attack, but they can cause a great deal of damage and are unfortunately frequent in their own right.
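This failure mode can be made concrete with a toy replication loop. The sketch below is purely illustrative (the class, the config key, and the health check are all hypothetical): the point is that the same replication that protects against single-node failure turns one bad change into a cluster-wide outage.

```python
class ClusterNode:
    """One node in a (hypothetical) tightly coupled HA cluster."""

    def __init__(self, name):
        self.name = name
        self.config = {"scan_rate_ms": 100}   # example control-loop setting
        self.healthy = True

    def apply_config(self, config):
        self.config = dict(config)
        # A zero or negative scan rate would stall the control loop,
        # so treat it as an unhealthy configuration.
        self.healthy = self.config.get("scan_rate_ms", 0) > 0

def push_config(cluster, config):
    """Tight coupling: a change applied anywhere replicates to every peer."""
    for node in cluster:
        node.apply_config(config)
    return [node.healthy for node in cluster]
```

A valid change leaves every node healthy; a broken one takes down all three nodes at once, which is exactly why the cluster offers no refuge from its own replication.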

Moving Beyond the Tradeoff

These challenges suggest a more general tradeoff in the industrial resilience space: organizations can have immediately available compute with up-to-date, synchronized systems, but lose the isolation necessary to defend against cyber attacks and software failures that propagate across the network; or they can have isolated backups, but lose the ability to seamlessly weather disasters and recover immediately. (Organizations can, of course, employ both solutions, but they still lose availability and continuity after a cyber attack or a network-propagating software failure.)

We believe that this choice is unacceptable—and unnecessary. How do we move beyond the tradeoff? Our final post in this four-part series will detail the revolutionary approach we are taking at Fortress Labs.

Thanks to Michael Miller for his valuable feedback.