The March 2025 substation fire that grounded Heathrow Airport for more than 16 hours disrupted over 270,000 passenger journeys and exposed systemic weaknesses across the UK’s critical infrastructure. StrategicRISK speaks to seven senior experts to explore what went wrong – and how to prepare for what’s next.
When a fire tore through the North Hyde substation on the evening of 20 March, it triggered a cascading crisis. Heathrow Airport – Europe’s busiest – was plunged into darkness and forced to close for around 16 hours the following day. By the time limited repatriation flights resumed around 6pm on 21 March, more than 1,300 flights had been cancelled and over 270,000 passenger journeys were disrupted.
The outage also affected more than 71,000 domestic and commercial customers, according to the National Energy System Operator (NESO), including Hillingdon Hospital, major road and rail routes, and three data centres. The financial fallout was significant, and the disruption prompted renewed scrutiny of the resilience of the UK’s infrastructure.
But beyond the headlines, Heathrow’s shutdown offers a complex case study in crisis response, supplier governance, investment decisions, and system-wide fragility. Seven experts unpack what happened – and what risk managers must take away.
A quiet protocol, quietly successful
Not everything failed. Behind the scenes, Heathrow’s mass diversion plan worked exactly as designed. Tania Roca, who leads WTW’s Global Aviation & Space Risk & Resilience advisory and helped develop the UK’s mass diversion protocol after 2018, says Heathrow’s operational response shows how effective contingency planning can be when stress-tested in real life.
“In the case of Heathrow’s total shutdown – what we call a shock effect – there was behind it all a ‘quiet resiliency protocol’ that was immediately activated and did what it is meant to do, which is to safely divert all inbound flights to partner airports in the UK and (depending on the airline) other hub locations in Europe,” she explains.
“Having a resilience protocol in place that is quiet, maybe even taken for granted, provides certainty of what to do next when the worst happens”
“This is known in the industry as a mass diversion protocol and is intended to protect passengers and aircraft. It sounds simple, but it requires careful coordination of the contingency-allocated slots, appropriate RFFS CAT levels, emergency response readiness, the right ground handling services and equipment, and perfect coordination of all other aviation stakeholders’ operations.”
Operational teams at alternative airports, she says, “were outstanding in coping with additional flights and continuing business as usual.” The lesson? “Having a resilience protocol in place that is quiet, maybe even taken for granted, provides certainty of what to do next when the worst happens… airspace and on ground safety and security are the priorities and by that measure the response was a success.”
A predictable failure
Where the system broke down was in the infrastructure itself – and the long chain of oversight failures behind it. NESO’s final investigation confirmed that the fire was caused by a catastrophic failure in a transformer bushing, most likely due to moisture ingress. Critically, elevated moisture had been flagged in an oil sample back in 2018 – but no appropriate mitigation was taken.
For Jeff Le, managing principal at 100 Mile Strategies and a former senior advisor to the Governor of California, the incident should be viewed less as an anomaly and more as the inevitable outcome of deferred maintenance and lack of strategic planning.
“The Government should conduct broader assessments across its other key areas of critical infrastructure”
“National Grid should have been proactive and made a direct commitment from their own review to provide clear investments for what appears to be systemic deferred maintenance,” he says.
He also warns that resilience is not a one-off fix. “The Government should conduct broader assessments across its other key areas of critical infrastructure to ensure that other priority entities are carefully reviewed and recommendations can be made for proactive strengthening. Playing defence and being reactive is both a policy failure and a political liability.”
Design, dependencies and decision traps
The North Hyde fire revealed how a single infrastructure failure could spiral across sectors. It also exposed an internal vulnerability at Heathrow that few outside its technical team understood: the loss of one of its three grid supply points could disable some of its most critical operations. A fix was possible – but reconfiguring the internal electrical network would take up to 12 hours.
André Schneider, former CEO of Geneva Airport, says this is a stark reminder that redundancy must be engineered into both internal and external systems.
“Given the critical dependence of airport operations on electricity, it is essential to recognise the loss of electrical power – whether due to accidents such as fire, infrastructure failures, or wider blackouts – as a major operational risk and to define appropriate mitigation measures,” he says.
“As a CRO, you want to be absolutely sure you have strong transparency and insight into the condition of your critical partners/inputs – that’s pretty standard”
Schneider recommends three priorities: “redundant network access”, duplication of internal energy components, and “on-site energy production… to address more severe scenarios, including partial or total blackouts.”
William Jennings, principal at The New Bootroom, says the Heathrow outage highlights a deeper issue in supplier governance. He believes the problem wasn’t just the technical failure at the substation, but a lack of visibility and alignment between critical partners.
“As a CRO, you want to be absolutely sure you have strong transparency and insight into the condition of your critical partners/inputs – that’s pretty standard,” he says.
In this case, Jennings suggests, Heathrow may have underestimated the risk because the severity of the issue wasn’t made visible – or wasn’t pursued with sufficient scrutiny. “Here the detail on its condition and maintenance was either not disclosed or not asked for – fault may lie on both sides.”
And if the overdue maintenance had been fully understood, Jennings argues it might have changed the airport’s risk calculus: “It may have prompted them to either force maintenance under supply agreements or require the supplier to provide financial support to offset the risk… You need alignment on [the contract’s] importance and decisions affecting it need to be communicated early and clearly.”
When resilience is mistaken for reliability
At the heart of the incident lies a deeper governance challenge: balancing long-term resilience with immediate cost and operational pressures. Patrick Aubrey, former projects risk management lead at Heathrow, says the airport’s ability to manage disruption so effectively can actually make it harder to justify investments in resilience – especially under tight regulatory constraints.
“Heathrow does an amazing job at keeping order in what is a chaotic aviation environment. What would devastate any normal operation, like a plane being on fire for example, is almost BAU for Heathrow operations,” he says.
But investment choices are not made in a vacuum. “Heathrow has limited CAPEX, and spending is regulated by the partner airlines… Heathrow would probably consider additional funding for baggage facilities, which manage a 10% risk, over a substation upgrade to manage a 0.001% risk.”
And because Heathrow often takes the blame for others’ failures – whether it’s Border Control queues or power outages – it also shoulders reputational risk beyond its control. “When the issue occurred, it was a ‘Heathrow problem’ and now it’s shifted to being a National Grid problem,” says Aubrey. “Heathrow take a lot of flak and very rarely pass the blame.”
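Aubrey’s comparison is worth making concrete. The sketch below is purely illustrative, written in Python: the probabilities mirror the percentages he quotes, while the impact costs are hypothetical assumptions rather than Heathrow, NESO or National Grid figures.

```python
# Illustrative only: the probabilities echo the percentages quoted above;
# the worst-case costs are hypothetical assumptions, not real figures.

scenarios = [
    # (scenario, annual probability of failure, assumed worst-case cost in GBP)
    ("baggage facility failure", 0.10, 5_000_000),
    ("substation / supply point loss", 0.00001, 500_000_000),
]

for name, probability, worst_case in scenarios:
    expected_annual_loss = probability * worst_case
    print(f"{name}: p = {probability:.3%}, "
          f"expected annual loss = £{expected_annual_loss:,.0f}, "
          f"worst case = £{worst_case:,.0f}")

# Expected loss alone favours the baggage investment (£500,000 vs £5,000),
# which is why rare, catastrophic scenarios struggle for CAPEX until the
# worst-case column is weighed as part of the decision.
```

The point is not that either number is right, but that a ranking based only on probability or expected loss and a ranking that weighs worst-case impact can point regulated CAPEX in different directions.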
“The best method to manage such situations is to set risk appetite levels and tolerances based on objectives and weighted priorities.”
Peter Smith, former head of governance, risk and insurance at Dubai Airports, adds that managing these types of infrastructure risks requires both technical and cultural alignment.
“For me, the best method to manage such situations is to set risk appetite levels and tolerances based on objectives and weighted priorities, inclusive of stakeholder considerations.
“Secondly, understanding the risk appetite and attitude of your decision makers. The ‘pffft, that’ll never happen’ response when presented with the cascading effect of a minor risk, right up to ‘270,000 passenger journeys disrupted and international headlines damaging the reputation of National Grid, Heathrow and the UK government’, is so easily imaginable because we’ve all been in businesses with cultures, or even just powerful decision-making individuals, like this.”
He also questions why there isn’t a more formal government mechanism to identify and track systemic infrastructure risk. “I’d hope there is a government-level risk process which should be identifying these critical infrastructure risks and cascading them down to get assurances on their proactive mitigation.”
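Smith’s framing of appetite levels and tolerances set against weighted objectives can also be sketched in outline. The example below is a simplified, hypothetical illustration: the objectives, weights, tolerances and scores are invented for the example and are not drawn from Dubai Airports, Heathrow or the NESO report. It shows how a cascading, low-probability scenario can sit outside appetite on continuity and reputation even when its expected cost looks negligible.

```python
# Hypothetical risk appetite check: tolerances per objective, weighted by priority.
# Objectives, weights, tolerances and scores are invented for illustration.
from dataclasses import dataclass


@dataclass
class Objective:
    name: str
    weight: float      # relative priority; weights sum to 1.0
    tolerance: float   # maximum acceptable impact on a 0-10 scale


OBJECTIVES = [
    Objective("safety", 0.40, 2.0),
    Objective("operational continuity", 0.30, 4.0),
    Objective("reputation", 0.20, 5.0),
    Objective("cost", 0.10, 7.0),
]


def assess(scenario: str, impact: dict[str, float]) -> None:
    """Report the weighted impact and flag any objective whose tolerance is breached."""
    breaches = [o for o in OBJECTIVES if impact.get(o.name, 0.0) > o.tolerance]
    weighted = sum(o.weight * impact.get(o.name, 0.0) for o in OBJECTIVES)
    verdict = "OUTSIDE appetite" if breaches else "within appetite"
    print(f"{scenario}: weighted impact {weighted:.1f}/10, {verdict}")
    for o in breaches:
        print(f"  breach: {o.name} impact {impact[o.name]} exceeds tolerance {o.tolerance}")


# A rare but cascading scenario: loss of a supply point leading to a full-day closure.
assess("substation loss cascading to closure",
       {"safety": 1.0, "operational continuity": 9.0, "reputation": 9.0, "cost": 8.0})
```

In practice such thresholds would be agreed with stakeholders and reviewed against exactly the kind of government-level assurance process Smith describes.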
What risk leaders should do now
For risk managers in other sectors, the Heathrow shutdown offers powerful – and uncomfortable – lessons. Resilience is not about avoiding risk entirely, but about understanding cascading impacts and making sure protocols, suppliers and stakeholders are aligned long before a crisis begins.
As Roca notes, “This type of coordinated planning can be adapted to other industries, and there are bodies such as the National Preparedness Commission who provide frameworks that can be adopted across sectors.”
The NESO report makes 12 specific recommendations, ranging from asset-level risk assessments and emergency access planning to cross-sector coordination and clearer regulatory oversight of critical infrastructure.
These are not theoretical improvements. They are essential steps in protecting national infrastructure from preventable collapse.
NESO’s 12 recommendations for improving energy resilience
Fintan Slye, chief executive of the National Energy System Operator (NESO), said: “The power outage and closure of Heathrow Airport were hugely disruptive and our report seeks to improve the way parties plan for and respond to these incidents, building on the underlying resilience of our energy system.”
Accordingly, the report sets out 12 core recommendations following the North Hyde substation fire:
1. Energy asset management processes and systems should include robust controls to ensure that identified issues are appropriately categorised and followed up on, including regular reviews of outstanding items. Particular attention should be paid to any area reliant on manual controls.

2. Asset owners should review the suite of mitigating actions deployed in the case of overdue maintenance to ensure they are comprehensive and capable of identifying issues likely to lead to asset failure with potential impacts on security of supply. Consideration should also be given to the use of the most up-to-date technology (e.g., continuous monitoring) to monitor the condition of critical assets.

3. Fire and asset risk assessments should:
- Be in place for all energy facilities (including electricity and gas distribution, transmission, storage etc.) and explicitly cover all assets and site-level risk;
- For site-level assessments, cover the situation where there are multiple assets on a site, potentially controlled by different parties;
- Explicitly include consideration of new or updated standards, even if there is no requirement to be retrospectively compliant;
- Be updated (or explicitly cater for) when relevant equipment is unavailable or out of service; and
- Incorporate input from the fire service, including on firefighting protocols.

4. Network asset owners and relevant emergency services should take the lessons learnt from the North Hyde incident to identify any required improvements to site- and substation-specific emergency management plans. This should include any learnings related to access requirements and where there is a single site with multiple assets under the control of different parties. These should be shared and incorporated into fire and asset risk assessments (as set out in recommendation 3) as appropriate.

5. Overall risk assessments at an asset and site level should be undertaken (which would incorporate the fire and asset risk assessments as set out in recommendation 3). These should incorporate and mitigate for any cumulative or compounding risks (e.g., catastrophic failure of one asset having a wider impact on the continued operation of the site), and should also consider changes or lessons learnt relating to equipment, design or construction standards which might reasonably be expected to change the outcome of the risk assessment. These site and asset risk assessments should explicitly factor in the implications for the wider energy network, security of supply and continuity of service for customers. Where there are potentially significant system, network or customer impacts, the risk assessment should be coordinated with NESO as appropriate.

6. Government and regulators should refresh guidance available on the Electricity Safety, Quality and Continuity Regulations (ESQCR), including clarifying roles and responsibilities.

7. For every CNI site, incident management protocols should explicitly include plans around loss or impairment of energy supplies. These protocols should include a mechanism to convene all relevant parties (including network companies, system operators, regulators, government etc. as appropriate).

8. Where a CNI or essential service site has multiple supply points connected to the energy system, explicit consideration should be given by the site operator to the level of resilience and operational continuity required, and how this can be achieved. This could include:
i) Diversifying the configuration such that the loss of one supply point does not impact the entire CNI site.
ii) The ability to switch load quickly between the supply points. For example, on the electricity system (e.g., via internal interconnectors, automatic/tele switching) and, where appropriate, the use of short duration uninterruptable power supply while switching takes place such that operations can be maintained.

9. CNI operators should be able to have transparent conversations, then work together with energy networks (transmission and/or distribution as appropriate) and system operators to review and establish a mutual understanding of the resilience and security of the energy supply arrangements to the CNI site.

10. CNI operators should develop a communication and operational protocol for addressing any planned and unplanned changes to resilience.

11. NESO and the government should work together to develop an appropriate holistic view of the CNI reliance on the energy system.

12. CNI operators, government and the relevant regulatory bodies should establish a more structured approach to energy resilience, for example via cross-sector partnerships and standards, including standards around continuity of operations under various scenarios (e.g., loss of an energy asset).
Together, these aim to reduce the likelihood and impact of similar incidents, while improving how infrastructure operators plan for and respond to disruption.