This post is the first in a four-part series on OT cyber resilience, written by the founders of Fortress Labs, Ben Simon and Leor Fishman. This first post provides a general introduction to OT cyber resilience, with a focus on one core component: recovery. The second and third posts will evaluate the two primary technologies currently employed to provide OT cyber recovery: data backup/recovery (post two) and High Availability (HA) systems (post three). The fourth and final post will present the Fortress platform and lay out its distinct advantages over existing approaches to OT cyber resilience.
What is OT cyber resilience, and why is it important? Much has already been written on the subject, and we will not rehash it here. However, before we introduce our specific thesis on the topic—namely, the centrality of recovery (“the right side of the bowtie”) to OT cyber resilience—let us first define some terms and provide some context. (If you are already familiar with OT, feel free to skip ahead to the section titled “The Particularity of OT Cyber Resilience and the ‘Right Side of the Bowtie’.”)
What are OT systems? OT stands for Operational Technology, in contrast to IT (Information Technology). OT is a general term that refers to the hardware and software systems that control and monitor physical infrastructure. From power generation to food processing, petroleum refining to semiconductor manufacturing—all of these industrial processes (and more) utilize OT. In fact, OT systems are used beyond the industrial world: in airports (baggage handling and runway lighting systems); on cruise ships (ship positioning and propulsion systems); and even in residential buildings (“Building Management Systems”). In short, any digital technology that controls physical processes can be considered OT.
Within the world of OT, there are various “layers” of the stack, from sensors and single-logic controllers all the way up to higher-level control systems that monitor and control the lower-level devices (these are known as Industrial Control Systems, or ICS). This vertically integrated framework has historically been represented by a reference architecture known as the “Purdue Model” (see below). Levels 3 and below are generally considered OT, while Levels 4 and above belong to the domain of IT.
The Purdue Model above is important as a point of reference and orientation, since our discussion will focus primarily on Levels 2 and 3 of the model: the higher-level Industrial Control Systems such as SCADA (Supervisory Control and Data Acquisition), DCS (Distributed Control Systems), and other more specialized control systems.
There is a simple reason why these systems are important, and thus why OT cyber resilience is critical: OT failures directly translate into real-world physical infrastructure failures. This is especially true in the modern world, where industrial automation reigns supreme. It is no longer the case that physical processes can be operated manually for extended periods of time: these industrial processes, by and large, either require software to control them or cannot be operated safely without higher-level visibility and monitoring. (Imagine a plant operator at a major oil refinery: that operator will not want to keep the plant running after losing visibility and software-based control.)
It should come as no surprise, then, that the world of OT resilience is a vast one: fortifying these Operational Technology systems against failures of all kinds is a priority for companies and governments alike. In general, OT failures mean loss of production capability and operational downtime. But for critical infrastructure organizations specifically, these failures can have more dire consequences: regulatory fallout, financial ruin, and even human harm.
The Particularity of OT Cyber Resilience and the “Right Side of the Bowtie”
When it comes to resilience in cyber systems, there are generally two prongs, or areas of focus. On one side, there is detection and prevention: technologies like XDR, IDS, and SIEM, and human factors like anti-phishing training. Prevention focuses on stopping attacks before they begin and keeping attackers out of sensitive or critical systems. On the other side, there is mitigation: technologies like backup and recovery, and human factors like Business Continuity and Disaster Recovery (BCDR) plans, role-based access control, and the principle of least privilege. Mitigation focuses on limiting the damage any single attack can do and ensuring that recovery after an attack is as quick as possible. In the 1990s, Shell introduced the metaphorical image of a bowtie to describe this two-pronged process. On the left side of the bowtie is prevention, or defense; on the right is mitigation and impact reduction.
Generally, in both IT and OT organizations, the question of how to allocate resources across the two sides of the bowtie (prevention vs. mitigation) is a challenging one. On the one hand, simple logic would dictate that it is preferable to have no attacks at all, which would suggest putting more resources into prevention. On the other hand, prevention must stop most or all classes of attack to be maximally useful; in other words, if you focus solely on prevention, you need to be “right” 100% of the time (or very close to it). Mitigation, by contrast, is useful even in smaller doses. Preventing 50% or even 90% of attacks still leaves you fully exposed to the attack that does get through, whereas cutting recovery time for each attack by 50% (or even by a factor of five) has obvious upsides. This dichotomy—between prevention, visibility, and hardening on the one hand, and redundancy, recoverability, and mitigation on the other—is at the very heart of cyber resilience.
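To make this trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. The attack frequency, recovery time, and improvement factors below are purely hypothetical numbers chosen for illustration (they are not drawn from this post or from real incident data); the arithmetic simply shows why partial prevention leaves the worst-case incident untouched, while faster recovery shrinks both the expected and the worst-case downtime.

```python
# Back-of-the-envelope comparison of prevention vs. mitigation.
# All figures are hypothetical and chosen only for illustration.

BASELINE_ATTACKS_PER_YEAR = 2.0   # assumed successful intrusions per year
BASELINE_RECOVERY_HOURS = 72.0    # assumed downtime per incident

def expected_downtime(attacks_per_year: float, recovery_hours: float) -> float:
    """Expected hours of unplanned downtime per year."""
    return attacks_per_year * recovery_hours

scenarios = {
    "baseline":               (BASELINE_ATTACKS_PER_YEAR,       BASELINE_RECOVERY_HOURS),
    "prevent 90% of attacks": (BASELINE_ATTACKS_PER_YEAR * 0.1, BASELINE_RECOVERY_HOURS),
    "recover 5x faster":      (BASELINE_ATTACKS_PER_YEAR,       BASELINE_RECOVERY_HOURS / 5),
}

for name, (rate, hours) in scenarios.items():
    print(f"{name:24s} expected downtime: {expected_downtime(rate, hours):6.1f} h/yr | "
          f"worst single incident: {hours:5.1f} h")
```

Under these assumed numbers, stronger prevention lowers the expected annual downtime but leaves the worst single incident at 72 hours, while faster recovery lowers both figures at once, which is the intuition behind investing in the right side of the bowtie.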
There is arguably an even greater need for strong mitigation measures in OT than in IT. In IT, what is valuable is often the data itself. Imagine a hedge fund losing its proprietary trading algorithms, a hospital leaking sensitive patient information, or a political leader having their email hacked. In these cases, the damage is most acutely felt in the compromise of sensitive data: confidentiality is the chief concern. In the OT world, by contrast, what matters most is the integrity and availability of the physical processes themselves, not the data or its confidentiality. (This triad—confidentiality, integrity, availability—is often referred to as the “CIA triad.”)
Because process integrity and availability are of paramount importance, an attack is only truly calamitous if it results in an inability to safely continue operations. As such, if an organization has adequate redundancy mechanisms in place for its critical OT systems, and/or a rock-solid recovery strategy for its OT assets, the damaging force of OT cyber attacks can be significantly blunted. In light of the emphasis on process over data in the OT sphere, OT asset owners ought to prioritize investment in redundancy, availability, and recovery measures, and not get sucked into a reductive (or even hubristic) mindset of focusing solely on system hardening and attack prevention.
Assessing the Current State of Affairs in OT Cyber Resilience
What is the state of affairs when it comes to the proverbial right side of the bowtie (recovery, mitigation, etc.) within OT? Let’s return for a moment to the Purdue Model noted above. At Levels 0 and 1, we have, as we said, field devices and basic controllers. For devices at these levels, resilience mostly equates to redundancy: it’s impossible to “back up” (or virtualize) a thermistor, for example, so resilience for devices like these really just amounts to having spares that can be manually swapped in.
However, when it comes to the higher-level Industrial Control Systems at Purdue Levels 2 and 3, resilience—especially cyber resilience—is more complicated, but no less critical. Unlike their IT counterparts, the software-defined applications in the OT environment are often legacy systems that operate within a unique set of networking constraints, because they control and provide visibility into intricate physical processes spanning hundreds, if not thousands, of individual field devices. And, as mentioned above, a loss of control of and visibility into these lower-level devices and processes means that industrial operations must grind to a halt, since they cannot safely continue.
Unfortunately, despite their criticality, Industrial Control Systems are woefully unprepared for cyber resilience compared to both traditional IT systems and the lower-level OT devices. The genesis of this unpreparedness on the “right side of the bowtie” in OT, its knock-on effects, and what is needed to fix it will be the subject of the remaining posts in this series.
Thanks to Michael Miller, CISSP for his valuable feedback.