The Myth of the Root Cause (Singular)
Root Cause Analysis (RCA) has become an article of faith in industrial operations. It satisfies organizational needs for closure, accountability, corrective action, and regulatory compliance. When executed by skilled practitioners using methodologies like TapRooT or Apollo, RCA can effectively identify multiple contributing factors and guide systematic improvements. However, even well-executed RCA faces fundamental limitations in how it represents complex system failures, to say nothing of the dearth of skilled practitioners relative to the sizable (latent) demand for RCA.
Modern industrial operations are not linear chains but lattices of interdependence. Failures rarely emerge from a single point of breakdown; rather, they arise from the probabilistic alignment of multiple, often mundane conditions across technical, human, and organizational domains—often involving perfectly functioning components operating within specification.
The challenge is not that RCA is inherently flawed, but that organizational incentives and cognitive constraints often reduce it to a search for singular explanations. Under time pressure, regulatory deadlines, and the human need for closure, even sophisticated RCA methodologies can devolve into blame assignment exercises. This perhaps contributes to what reliability professionals recognize[^1] as a persistent gap between RCA theory and practice. As Richard Cook observed in How Complex Systems Fail, RCA is better understood as social ritual than technical discovery, one that often ignores the hidden adaptations that kept the system functioning right up until the final moment.
More fundamentally, the “root cause” paradigm embeds an assumption of linear decomposability—that complex failures can be understood by tracing backward through causal chains (an assumption systems thinkers have long railed against). But complex systems exhibit emergent properties that cannot be reduced to their constituent parts. When we focus exclusively on the realized failure path, we miss the network effects, interaction patterns, and near-miss trajectories that reveal systemic vulnerabilities. When we insist on doing so, we commit what philosophers call the fallacy of misplaced concreteness: treating an analytical abstraction (the singular root) as if it were a physical reality.
Practical Insight: The most dangerous failures do not reside in obviously broken components; they hide in the interactions between subsystems, across organizational boundaries, and within the adaptive responses that maintain functionality right up until catastrophic failure occurs. Were an industrial operation a game of chess, traditional RCA would capture the king’s checkmate and the few moves immediately preceding it; lattice thinking seeks to understand the game board.
This doesn’t mean abandoning RCA—it means complementing it with approaches that preserve the full dimensionality of complex system behavior and adopting practical, scalable tools to implement them on the operational floor.
RCA as Necessary But Insufficient Compression
Traditional RCA functions as a cognitive necessity—human operators and managers require simplified narratives to drive corrective action within resource and time constraints. This compression serves legitimate purposes: enabling rapid response, satisfying regulatory requirements, and providing actionable guidance for immediate risk reduction.
The issue is not compression itself, but unmanaged information loss. In this sense, RCA resembles JPEG image compression: like any lossy compression algorithm, it should preserve the most critical information while discarding only what is truly redundant. Currently, most RCA processes discard information that is crucial for systemic understanding and prevention. And unlike image compression, where lost pixels merely reduce visual quality, RCA’s information loss directly undermines our ability to prevent future failures.
What Traditional RCA Should Preserve:
- Primary failure sequences requiring immediate correction
- Regulatory compliance documentation
- Clear accountability for corrective actions
- Proximate causes that can be quickly addressed
What Current Practice Often Discards:
- Interaction effects: How degraded performance in one subsystem amplifies stress in seemingly unrelated areas
- Temporal correlation patterns: How minor anomalies cluster in ways that predict major failures
- Adaptive response networks: The informal workarounds and compensations that maintain output despite degraded conditions
- Near-miss trajectory analysis: System states that almost led to failure but were prevented by chance, skill, or redundant barriers
- Precursor event sequences: Early warning signals that occurred days or weeks before the incident
The information discarded by RCA constitutes the predictive signal needed to anticipate future failures. By compressing away interaction patterns, we lose the ability to recognize when similar patterns are forming. RCA in this sense functions as a kind of organizational placebo, convincing teams they’ve cured the disease while leaving the underlying condition untouched.
A Complementary Approach: Rather than replacing RCA, modern facilities can implement dual-track analysis to start:
- Traditional RCA Track: Focused, time-bound analysis for immediate corrective action and regulatory compliance
- Lattice Analysis Track: Comprehensive, ongoing analysis that preserves full operational context and builds systematic understanding over time
Implementation Example: A chemical plant experiencing a reactor upset would conduct immediate RCA to identify proximate causes and implement safeguards, while simultaneously feeding all operational data, operator interviews, and contextual information into a lattice analysis system that tracks interaction patterns, identifies early warning signals, and builds predictive models for future incidents.
This dual approach recognizes that different analytical needs require different methodologies. The RCA provides immediate actionable insight; the lattice analysis builds long-term predictive capability. Eventually of course, a “traditional RCA” module can be consolidated into the lattice analysis system. (If it helps, the lattice analysis system can be thought of as an artificially intelligent reliability expert capable of traversing your operational data when requested.)
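The dual-track split can be sketched as a small routing layer. This is illustrative only; the function names, deliverables, and incident schema below are assumptions, not a real system's API:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    incident_id: str
    summary: str

def run_rca_track(incident: Incident) -> dict:
    """Time-bound track: proximate causes, corrective actions, compliance docs."""
    return {"incident": incident.incident_id, "track": "rca",
            "deliverables": ["proximate_causes", "corrective_actions",
                             "compliance_report"]}

def run_lattice_track(incident: Incident) -> dict:
    """Open-ended track: full operational context feeds the ongoing lattice model."""
    return {"incident": incident.incident_id, "track": "lattice",
            "inputs": ["historian_data", "operator_interviews",
                       "near_miss_logs"]}

# A single incident feeds both tracks in parallel (hypothetical ID).
upset = Incident("INC-0042", "reactor upset")
reports = [run_rca_track(upset), run_lattice_track(upset)]
print([r["track"] for r in reports])  # ['rca', 'lattice']
```

The point of the sketch: the two tracks consume the same incident but produce different artifacts, so neither constrains the other's scope or timeline.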
The Information-Theoretic View: Constrained Complexity
Let’s examine what information theory reveals about complex operational systems.
In information theory, system complexity is measured by the number of possible configurations the system can occupy, i.e. its state space. A system with $n$ binary components has $2^n$ possible states. As $n$ grows, the state space explodes exponentially.
Industrial systems exist in this exponential space. A chemical process unit with 1,000 monitored variables, each capable of ten discrete states (normal, high-normal, high, high-high, etc.) would have a theoretical total state space of $10^{1000}$ possible configurations—larger than the number of particles in the observable universe. That said, real industrial systems are heavily constrained by physical laws, control constraints (feedback loops, safety systems), and economic constraints (e.g. regulations, cost optimization), and so the feasible state space is much smaller than the theoretical maximum.
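A quick back-of-the-envelope check of that figure, using the article's hypothetical variable counts:

```python
# The variable counts below are the article's illustrative numbers,
# not a measurement of any particular plant.

n_variables = 1_000          # monitored process variables
states_per_variable = 10     # e.g. normal, high-normal, high, high-high, ...

theoretical_states = states_per_variable ** n_variables  # 10^1000

# Python integers are arbitrary-precision, so we can count digits directly:
# 10^1000 has 1001 digits, dwarfing the ~10^80 particles in the universe.
# The feasible space is far smaller, but still combinatorially vast.
print(len(str(theoretical_states)))  # 1001
```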
However, this feasible space is still enormous compared to human cognitive capacity, and its boundaries are often ill-defined. Additionally, of these remaining feasible states, many represent system failure. The fundamental challenge of industrial control is not preventing specific failures, but navigating safely through a large space of feasible system states.
Key Insight: Traditional RCA examines one trajectory through this constrained space—the path that led to observed failure. But there are other possible trajectories that could lead to similar failures, and countless trajectories that nearly led to failure but were corrected by operator skill, equipment redundancy, or statistical luck. Understanding only the realized path tells us little about the structure of the space itself.[^2]
It’s worth repeating that understanding only the realized trajectory provides minimal information about:
- The structure of the safe operating region
- The proximity of current operations to failure boundaries
- The network of pathways that lead from safe states to dangerous ones
- The effectiveness of current control strategies in maintaining safe navigation
Actionable Application: Modern process historians already collect the data needed for lattice analysis. By applying machine learning (e.g. autoencoders, world models) at scale to this historical or real-time data, facilities can:
- Map safe operating envelopes more precisely than design specifications suggest
- Identify approaching boundaries before safety systems activate
- Recognize trajectory patterns that historically precede incidents
- Quantify interaction effects between seemingly independent process variables
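As a minimal sketch of the envelope-mapping idea, the snippet below uses a linear stand-in for an autoencoder (PCA via SVD, which a linear autoencoder is equivalent to) on synthetic data. All data, component counts, and alarm limits are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for historian data: 5 correlated process variables
# sampled during known-good operation.
normal = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))

# Fit the "safe envelope" as the span of the top-3 principal components.
mean = normal.mean(axis=0)
X = normal - mean
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:3].T  # columns: top-3 principal directions

def envelope_score(sample):
    """Reconstruction error: distance from the learned operating envelope."""
    centered = sample - mean
    reconstructed = centered @ W @ W.T
    return float(np.linalg.norm(centered - reconstructed))

# Alarm limit from history: 99th percentile of in-envelope scores.
threshold = np.quantile([envelope_score(row) for row in normal], 0.99)

# A sample displaced along the least-dominant direction violates the
# learned correlation structure and scores far above the limit.
anomaly = mean + 25.0 * Vt[-1]
print(envelope_score(anomaly) > threshold)  # True
```

In practice a nonlinear autoencoder plays the role of `W`, and the score is computed on live historian tags; the mechanism (flag high reconstruction error before safety systems trip) is the same.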
The Statistical Inevitability of Failure: Engineering Around Mathematics
Industrial systems do not fail because of negligence or poor design; they fail because mathematics and the entropy of complex interactions make failure inevitable.
Consider a system with $n$ critical components, each with reliability $r$ over a given time period. Naively assuming independence, the probability that all components function properly is $r^n$. For large $n$ and $r < 1$, this probability collapses toward zero as time extends. Perfect reliability is mathematically impossible.
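The collapse is easy to verify numerically. The figures below are illustrative, and independence is assumed as above:

```python
def series_reliability(r: float, n: int) -> float:
    """Probability that all n independent components function: r**n."""
    return r ** n

# Even "five nines" components collapse at scale: with 100,000 of them,
# the chance that all work in a given period is only about 37%.
print(series_reliability(0.99999, 100_000))  # ≈ 0.368
```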
But this simple model misses the crucial insight: real systems fail not when components fail independently, but when they fail in correlated patterns. The corollary to this is that while exhaustive enumeration of all possible failure combinations is infeasible, certain classes of failure patterns occur with disproportionately higher probability. This is reflected somewhat in standard design-phase risk assessment tools like Fault Tree Analysis (FTA) that map static failure logic. However, FTAs are brittle; they fail to capture the dynamic, state-dependent nature of risk in a live operation where dependencies shift constantly.
The mathematics of percolation theory[^3] provides a better model. Imagine an industrial system as a network where nodes (components, procedures, people) are connected by edges (dependencies, communications, timing relationships). The system functions when there exists a connected path from inputs to outputs through functioning nodes. System failure occurs when enough nodes fail simultaneously to break all such paths—what network theorists call a cut set. Importantly, there are typically many possible cut sets, most of which involve combinations of component states rather than single component failures.
Insight from Percolation Theory: If your system has redundancy factor $k$ (meaning $k$ independent paths must be severed to cause failure), and each path has $m$ nodes with individual failure probability $p$, then the probability of system failure is not simply a function of $p$, $m$, and $k$, but depends on the complex combinatorics of how paths can be severed simultaneously.
Real operations are more complex still: these networks are not static but time-varying multi-agent systems. At each of $T$ time steps, the network takes a different configuration, with distinct cut sets that may emerge, evolve, or disappear as operational conditions shift. It is in these dynamics—not the isolated reliability of individual parts—that the true drivers of systemic fragility reside.
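For the simplest possible topology ($k$ independent parallel paths of $m$ nodes each), the cut-set probability can be estimated by Monte Carlo and checked against the closed form. This is a toy sketch, not a model of any real network, and real cut-set structure is far messier:

```python
import random

def system_failure_prob(k: int, m: int, p: float,
                        trials: int = 100_000, seed: int = 42) -> float:
    """Estimate P(all k paths severed) when each node fails w.p. p."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # A path is severed if any of its m nodes fails;
        # the system fails only when every path is severed.
        all_severed = all(
            any(rng.random() < p for _ in range(m)) for _ in range(k))
        failures += all_severed
    return failures / trials

# Exact answer for this independent-paths topology: (1 - (1-p)**m) ** k
p, k, m = 0.05, 3, 10
print(system_failure_prob(k, m, p), (1 - (1 - p) ** m) ** k)
```

Once paths share nodes, correlate, or reconfigure over time, the closed form disappears and simulation (or data-driven modeling) is all that remains, which is the article's point.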
The Thermodynamics of Operational Drift
As we’ve previously alluded to, every operational procedure (e.g. SOPs, playbooks), equipment specification, and organizational structure represents a constraint that presumably limits the system’s degrees of freedom. (I italicize presumably to draw attention to the occasional gap between the ideal and what happens in practice.) Maintaining these constraints requires continuous energy investment in the form of training, maintenance, monitoring, and enforcement.
But systems naturally tend toward maximum entropy: the state of maximum operational freedom consistent with maintaining functional output. When constraint maintenance energy decreases (due to budget pressure, attention scarcity, or competing priorities), the system explores new regions of its operational space through accumulated small deviations. These new regions aren’t necessarily “bad”—they’re just different from the ones the system was designed for, and so should be assumed to fail eventually, and with higher probability.
Drift is not a failure of human discipline—it is a fundamental physical process. Systems will drift unless energy is continuously invested to maintain constraints. The question is not whether drift will occur, but whether it will drift toward regions of greater or lesser resilience. To belabor the point: Drift is inevitable, but its direction is manageable. Some drift improves system performance through learning and optimization. Control your drift!
Practical Implication: The most important metric in industrial operations is not uptime or efficiency, but adherence effort: how much effort is required to keep the system adhering to its designed envelope. It includes training effort, maintenance effort, monitoring effort, and correction effort.
Example: A semiconductor fab notices increasing procedure deviations during shift handovers. Rather than simply enforcing stricter compliance, they analyze the deviation patterns and discover that standard procedures don’t account for seasonal humidity variations affecting clean room conditions. By updating procedures to include humidity-dependent process parameters, they reduce both deviations and defect rates—channeling inevitable drift in a positive direction.
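The analysis step in that example can be sketched as a simple correlation check. The humidity-deviation link below is fabricated for illustration; a real analysis would control for confounders rather than rely on one correlation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic daily data: relative humidity (%) and procedure deviations
# per shift, with a hypothetical linear dependence plus noise.
humidity = rng.uniform(30, 70, size=365)
deviations = 0.2 * humidity + rng.normal(0, 2, size=365)

# A strong positive correlation suggests a procedural gap (procedures
# that ignore humidity), not operator laxity.
r = np.corrcoef(humidity, deviations)[0, 1]
print(f"correlation: {r:.2f}")
```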
The Paradox of Functional Resilience: The Hidden Adaptation Economy
Every critical industrial system operates through an invisible adaptation economy—a network of formal and informal modifications that maintain functional output despite equipment degradation, procedural inadequacies, and resource constraints. Understanding this economy of ad hoc, unrecorded, and eventually forgotten modifications is essential for distinguishing robust resilience from fragile workaround dependency.
Examples of adaptations include:
- Maintenance workarounds: Field modifications that keep equipment running
- Procedural accommodations: Informal practices that handle edge cases
- Sequence modifications: Reordering standard procedures to accommodate equipment limitations or timing constraints
- Supply chain redundancies: Multiple suppliers, safety stocks, expedited options
- Production quality containment: Extra inspections, rework stations, or bypassing non-critical sensors
- Resource substitutions: Alternative materials, tools, or methods when designed resources are unavailable
- Direct operator intervention: Hand adjustment of feed rates, process temperature, and other process parameters to compensate for idiosyncratic equipment/plant behavior.
We can formalize this intuition with a simple relation:

$$\text{Applied Effort} + \text{Adaptations} \geq \text{Adherence Effort}$$

Where:
- Applied Effort: Planned, budgeted labor and resources reflected in official metrics. This also includes technical control systems and their corrective efforts
- Adaptations: Unrecorded workarounds, informal procedures, and compensatory behaviors
- Adherence Effort: The total effort required to operate strictly according to design specifications
When the sum of applied effort and adaptations falls below adherence effort, the system becomes vulnerable. Crucially, because many adaptations are volatile and context-dependent, traditional metrics may overestimate robustness, masking hidden fragility. Once a technician leaves, firmware changes, or a temporary workaround disappears, apparent resilience can vanish overnight. Traditional operational metrics measure only applied effort, systematically underestimating total system requirements and creating dangerous blind spots about actual operational fragility.
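A minimal sketch of this relation in code, with hypothetical effort figures in arbitrary units (e.g. person-hours per week):

```python
def resilience_margin(applied_effort: float, adaptations: float,
                      adherence_effort: float) -> float:
    """Positive margin: constraints are maintained. Negative: vulnerable."""
    return (applied_effort + adaptations) - adherence_effort

# What traditional metrics see (applied effort only) vs. reality
# (applied effort plus unrecorded adaptations). Numbers are illustrative.
print(resilience_margin(80, 0, 100))   # -20: looks fragile on paper
print(resilience_margin(80, 25, 100))  #  +5: actually held up by workarounds
```

The gap between the two calls is the blind spot the article describes: if the 25 units of adaptation walk out the door with one technician, the true margin was negative all along.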
Managing the Adaptation Economy[^4]: Modern operational systems can capture and manage adaptations through:
- Adaptation Detection: Real-time identification of deviations from standard procedures
- Effectiveness Assessment: Analysis of which adaptations improve vs. degrade system performance
- Knowledge Capture: Documentation of effective workarounds and their contexts, ideally not as rigid SOPs that are consigned to PDF purgatory
- Boundary Management: Clear limits on acceptable adaptation to prevent dangerous drift
- Continuous Update: Regular revision of standards to incorporate proven adaptations
Bottom Line: The goal is neither to eliminate adaptation (impossible) nor to allow it to run uncontrolled (dangerous), but to channel it deliberately. Equally important is surfacing useful adaptations to operators at the organizational or team level, i.e., making this invisible economy visible.
Implementation Example: A power generation facility implemented an “adaptation capture system” within lattice that allowed operators to quickly document and share effective workarounds. Over six months, they identified several informal adaptations, validated a number of them as beneficial, incorporated some into official procedures, and eliminated those that created safety risks. This transformed invisible expertise into managed organizational knowledge.
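One possible shape for such a capture record, sketched in Python. The field names, lifecycle states, and example content are assumptions, not the facility's actual schema:

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Status(Enum):
    CAPTURED = "captured"
    VALIDATED = "validated"        # assessed as beneficial
    INCORPORATED = "incorporated"  # folded into official procedure
    ELIMINATED = "eliminated"      # judged a safety risk

@dataclass
class Adaptation:
    description: str
    context: str        # equipment and conditions where it applies
    reported_by: str
    reported_on: date
    status: Status = Status.CAPTURED
    notes: list[str] = field(default_factory=list)

# Hypothetical record moving through the lifecycle.
a = Adaptation("Pre-warm valve V-101 before startup",
               "cold starts below 5 °C", "op-shift-3", date(2024, 1, 15))
a.status = Status.VALIDATED
print(a.status.value)  # validated
```

The lifecycle enum is what turns an ad hoc workaround into managed knowledge: every adaptation ends up either incorporated or eliminated, never silently forgotten.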
Why Lattice Thinking Complements Traditional Approaches
The shift from exclusive root cause thinking to integrated lattice analysis reflects modern understanding of complex adaptive systems, while acknowledging the practical value of traditional approaches.
Reductionism vs. Systems Thinking:
- RCA embodies reductionist thinking: complex phenomena can be understood by breaking them down into simple components
- Lattice thinking embodies systems thinking: complex phenomena have emergent properties that cannot be understood from components alone
Linear vs. Non-linear Causation:
- RCA assumes linear causation: effects are generally proportional to causes
- Lattice thinking recognizes non-linear causation: small causes can have large effects, and large causes can have small effects, depending on system state
Static vs. Dynamic Analysis:
- RCA provides a static snapshot of the failure moment, often thrown into a folder and later forgotten or inaccessible
- Lattice thinking analyzes dynamic trajectories through system state space over time; importantly, it can be queried and re-constructed as needed to ensure currency
Deterministic vs. Probabilistic Reasoning:
- RCA seeks deterministic explanations: what happened and why it was inevitable
- Lattice thinking uses probabilistic reasoning: what were the likelihood patterns and how do they suggest future vulnerabilities
Implementation Roadmap: From Theory to Practice
Phase 1: Foundation
- Implement dual-track incident analysis (traditional RCA + lattice documentation)
- Begin systematic capture of adaptations and near-misses
- Establish baseline metrics for adherence effort and constraint maintenance
Phase 2: Pattern Recognition
- Deploy machine learning tools for interaction pattern analysis
- Develop predictive models based on historical data
- Create real-time monitoring for approaching operational boundaries
Phase 3: Integration
- Integrate lattice insights into operational decision-making
- Update procedures and training based on adaptation analysis; encourage real-time use of these tools at shift starts and handovers
- Establish closed-loop improvement based on both RCA and lattice learning
Phase 4: Optimization
- Continuously refine predictive models based on new data
- Develop organization-specific lattice analysis capabilities
- Share insights across industry to build collective understanding
Conclusion: The Mathematics of Practical Resilience
Root cause analysis will persist because it serves essential functions: enabling rapid response, satisfying regulatory requirements, and providing cognitive closure in uncertain situations. The goal is not to eliminate RCA but to complement it with lattice thinking that preserves the full dimensionality of complex system behavior.
This complementary approach recognizes that:
- Immediate action requires simple explanations—RCA provides this
- Long-term prevention requires complex understanding—lattice analysis provides this
- Effective reliability programs need both perspectives working in concert, fast and at a scale sufficient to meet the demands of a dynamic operational environment
Organizations that successfully integrate traditional and lattice approaches can:
- Respond rapidly to immediate threats while building systemic understanding that operators absorb each time they interact with the lattice analysis system
- Predict vulnerabilities before they manifest as failures
- Design adaptive resilience as a fundamental system property
- Learn systematically from both failures and near-misses
- Channel adaptation deliberately rather than allowing uncontrolled drift
The mathematics of complex systems guarantees that failures will occur. But the same mathematics, properly applied, enables us to anticipate, prepare for, and learn from these inevitable events.
Root cause, meet Lattice.
Footnotes

[^1]: Throughout my research, I’ve noticed a pervasive sentiment among engineers that RCAs simply find a single person or component to blame so the case can be closed and everyone can get back to work. RCAs then resemble narrative convenience to “keep the boss happy” more than anything else.

[^2]: This is an important point to emphasize. A reasonable rebuttal is that any incident can be made intractable if we attempt to consider everything; therefore, the best we can do is mostly manage things via dimensionality reduction. Given recent technology, we can certainly widen how much we can plausibly manage by adopting artificial intelligence and other tools that can reason through contributing factors at a scale and speed far exceeding human operators alone.

[^3]: See Wikipedia for a nice overview of percolation theory: “Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom?”

[^4]: The Adaptation Paradox: (1) Systems that cannot adapt fail quickly and obviously. (2) Systems that adapt too well fail slowly and invisibly. (3) Optimal systems adapt just enough to handle normal variation while maintaining structural integrity.