Document Status — Working Paper · Series: Human Intelligence Debt, Paper 5
This document moves the series from concept to measurement. It extends Human Intelligence Debt (Paper 1), The Harvester Multiplication Problem (Paper 2), The Human Intelligence Debt Dilemma (Paper 3) and Architectural Entropy (Paper 4). Where the prior papers defined the phenomenon, its mechanism, its incentive structure and its irreversibility, this one proposes how to verify it. It is a methods proposal and preregistered experimental programme: it specifies constructs, instruments, metrics and study designs, and it states in advance the conditions under which the framework would be wrong. It does not report an empirical Human Intelligence Debt (HID) value; no such value should be published before the instrument-validation gate (Study 0) is passed. Public figures cited below are drawn from third-party sources as of early 2026, are flagged as vendor benchmarks where applicable, and are intended to establish environment and direction, not to prove the debt.
A Three-Layer Measurement Architecture and a Five-Study Programme for Verifying the Debt, the Recovery Coefficient ρ, and the Oversight Threshold
This paper builds on four field notes in the series. Where they defined the phenomenon, this one proposes how to verify it; readers new to the framework are encouraged to read them first.
- Paper 1 — Human Intelligence Debt: A Socio-Technical Metric for Measuring the Human Cost of Imperfect Data Flows — defines what the debt is (HICR, HICT, HID).
- Paper 2 — The Harvester Multiplication Problem — how the debt accumulates (capability fragmentation).
- Paper 3 — The Human Intelligence Debt Dilemma — why it persists (the incentive structure).
- Paper 4 — Architectural Entropy — why it is only partly reversible (the recoverable and spent components, formalised below as ρ).
Prefatory Note
The hardest objection to the Human Intelligence Debt framework has never been conceptual. It is that the central quantities — genuine versus mechanisable work, the feasible architecture, the share of human intelligence wasted — appear to resist consistent measurement. Two assessors looking at the same role will, the objection runs, disagree about what counts as genuine contribution; and an organisation can always contest what a perfectly architected alternative would have achieved. If that objection holds, the first four papers are an elegant description of something nobody can count.
This paper answers the objection rather than restating it. It does so on three commitments that run through every instrument and every study below, and that are worth stating plainly before the detail begins.
- Objective traces are the anchor. Wherever the debt can be read from digital exhaust — the same value re-keyed across systems, the export-and-re-import loop, the reconciliation cycle — that machine evidence is treated as the conservative floor, and self-report is checked against it rather than the reverse. Telemetry has its own coverage and interpretation limits, so it anchors and triangulates; it does not, on its own, adjudicate.
- The counterfactual is graded, and observed where possible. The debt is anchored to a real organisation that has already done the integration, or to a working prototype, rather than to a hypothetical perfect architecture. A matched benchmark gives an empirical comparator; only a randomised or robust quasi-experimental design supports a causal counterfactual, and the programme says which is which at every step.
- The framework must be able to lose. Every hypothesis is preregistered. A result of «no debt», or of «ρ ≈ 1 everywhere» — capability fully recovers the moment the architecture is fixed — is a finding, not a failure. A programme that cannot return a negative result is not measuring anything.
What follows has three measurement layers and one experimental programme. The layers move from the publicly visible to the organisation-internal: external indicators that can be read from auditable data (Layer 1), a task-level instrument stack run inside one organisation (Layer 2), and the field execution that turns the stack into causal evidence (the experimental ladder). The discipline that holds them apart is the single most important thing in the paper: the public indicators establish a present-day deficit and a direction of travel; they do not, by themselves, prove that the debt is growing, that capability has decayed, or that any of it is irreversible. Those are the claims the experiments are built to earn.
Part 1 — Why the Debt Is Architectural, and Therefore Measurable
Before proposing instruments, it is worth being exact about why Human Intelligence Debt is a measurable quantity at all, and why the language of entropy and exergy that Paper 4 introduced is more than decoration. The relationship between entropy and architecture is neither an allegory (too weak — it explains nothing and predicts nothing) nor an identity (too strong — organisations are not closed physical systems, and no competent reader will accept that the Second Law literally governs an IT estate). It is a structural correspondence, and the correspondence is what makes the debt countable.
Entropy is not fundamentally a thermodynamic quantity. It is a general measure of disorder over a state space (in its simplest form, the logarithm of the number of microscopic configurations consistent with a coarse description), and it takes the same mathematical form, up to a constant factor, in statistical mechanics and in information theory. Low entropy means high order, which in turn means the state can be compressed into a faithful description shorter than itself. Enterprise architecture is, definitionally, the practice of producing exactly such a description: a coarse, faithful, usable model of an organisation’s systems, data and processes. That is a macrostate-over-microstates relation. Architectural order is therefore a genuine instance of the same abstract structure that entropy names in physics and information theory, not a metaphor borrowed from them.
This licenses a single measurable variable underneath the whole programme, which we will call architecturability and denote A. It rises with coverage and fidelity and falls as retrieval cost rises.
Its three components are each a facet of the same compressibility idea, and each is already partially instrumented in practice. Coverage is the fraction of the real estate represented in governed models — the inverse of shadow IT, dark data and undocumented dependencies. Fidelity is how often the model is right when checked against reality — CMDB accuracy, lineage and dependency correctness. Retrieval cost is the effort required to produce a true answer about the estate — low when the model does the compression for you, high when you must re-enumerate reality by hand each time, which practitioners recognise as archaeology.
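As a minimal sketch of how the three components described above might be tallied from an audit sample, the Python below computes each one separately. The class name, the field names and the use of the median for retrieval cost are assumptions made for illustration; the text does not prescribe them, and the sketch deliberately does not combine the readings into a single value of A.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class EstateAudit:
    """Hypothetical audit sample; every field name here is illustrative."""
    discovered_assets: int        # assets found in reality (scans, invoices, interviews)
    governed_assets: int          # of those, assets represented in governed models
    records_checked: int          # model records spot-checked against reality
    records_correct: int          # of those, records found to be right
    hours_to_answer: List[float]  # effort per true answer about the estate


def components(audit: EstateAudit) -> Dict[str, float]:
    """Return coverage, fidelity and retrieval cost as three separate readings."""
    if audit.discovered_assets <= 0 or audit.records_checked <= 0:
        raise ValueError("audit must contain discovered assets and checked records")
    if not audit.hours_to_answer:
        raise ValueError("audit must contain at least one timed question")
    return {
        # Fraction of the real estate that the governed models represent.
        "coverage": audit.governed_assets / audit.discovered_assets,
        # How often the model is right when checked against reality.
        "fidelity": audit.records_correct / audit.records_checked,
        # Typical effort to produce a true answer (median is an assumption).
        "retrieval_cost_hours": median(audit.hours_to_answer),
    }
```

Keeping the three readings apart lets an audit report which facet is weak (for example, good coverage with poor fidelity) before any aggregation rule is chosen.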
On this reading, «exergy» is the captured, usable fraction of organisational reality — architecturability itself — and «architectural entropy» is simply its absence: low A. The only proposition imported from thermodynamics is the form of a falsifiable claim, not its authority: order is the maintained exception, because restoring it costs more than degrading it, so absent sustained ordering work, A decays. That sentence is measurable and refutable; it forbids any numerical transfer from physics (no Boltzmann constant, no temperatures, no Carnot efficiencies) and, crucially, it forbids asserting decay rather than measuring it. The correspondence makes A measurable; it does not pre-decide which way A moves. That is the work of Part 6 and the experimental ladder.
Human Intelligence Debt lives in this same space. It is the human cognitive effort consumed acting as middleware between systems where, in the period under study, architecture or automation could in principle absorb that mediation. When architecturability is low, reality has escaped the model, and people become the compensating layer that holds the estate together by hand. The debt is the human cost of low A — which is precisely why it can be measured the way A can be measured: through coverage, fidelity, retrieval cost, and the cognitive hours spent compensating for their shortfall.
Two things must be said plainly here, because the whole programme depends on them and because the language invites misreading. First, we do not claim that organisational order is thermodynamic entropy, or that an estate obeys any physical law. We claim an alignment: organisational order and physical entropy are two instances of one abstract idea — order over a state space — and because the idea is the same, the same kind of measurement applies. We use «exergy» to mean architecturability and «entropy» to mean its absence as a deliberate, disciplined choice of vocabulary, with the operational definition of A always sitting behind the word; we do not import a single number from physics — no constants, no temperatures, no efficiencies — only the form of the idea. Second, and just as important, this is not a hard science and does not pretend to be. What we propose is an organisational measurement of what we call entropy: defensible, reproducible within stated bounds, and falsifiable, but not a physics and not a mathematics. The quantities below are estimates with error bars and graded evidence, not theorems. Holding that line — precise about the alignment, honest that it is an alignment and not an equation — is what keeps the framework from being dismissed as either physics envy on one side or loose metaphor on the other.
Part 2 — Why We Measure What We Measure
The instruments proposed in this paper were not chosen from a methods catalogue. They were chosen because two decades of enterprise-architecture and rationalisation practice keep surfacing the same handful of questions that estates can no longer answer — and because the pattern of which questions fail is itself the most reliable signature of the debt we have observed in the field. This section states that motivation plainly, because it explains every measurement decision that follows.
The governing observation is that Human Intelligence Debt is a latent variable, visible only under perturbation. At rest, a debt-laden process still produces output; the dashboards stay green; the monthly numbers look acceptable. The debt becomes visible only when something pushes on the system — when someone asks who owns an application, tries to retire it, asks what it actually costs in total, or watches what happens when it fails. Each useful instrument is, in effect, a standing perturbation: it asks the estate a question it has quietly lost the ability to answer. Conventional productivity and efficiency metrics miss the debt entirely because they read the output, which is fine, and never perturb the system in the way that would reveal the rot.
Three perturbations recur in practice, and they are the reason the external indicators of Part 3 take the form they do.
«What does this application actually cost, in total?» This is the question we have watched estates fail most reliably. There is no structural reason that total cost of ownership should be harder to produce now than a decade ago — there are more tools, more frameworks, more dedicated platforms for tracking cost than ever before. Yet in practice the answer scatters across budgets that no single party can consolidate; no one can say what an application costs in total, who pays for it, or who holds the authority to decommission it. Cost is the keystone question precisely because organisations explicitly staff functions to track it, so its failure cannot be waved away as the natural opacity of a complex system. Every application, however sophisticated, has exactly one total cost and exactly one party with the authority to retire it, by definition. Failure to produce those is not complexity; it is governance loss.
«How many applications are really serving this one capability, and could one good application subsume them?» In rationalisation work, capability-level counts in large organisations routinely reach ten or more applications serving what is nominally a single business capability. The additional systems mostly do not require their own operators; instead they generate an entirely new category of coordination work (reconciling outputs, chasing which version is true, maintaining the governance surface of each) that produces no value and would disappear if the architecture were coherent. This is the harvester multiplication of Paper 2, met directly in the field.
«Why can we not simply remove the duplicates?» Because the cure for the debt requires precisely the input the debt has destroyed. To retire an application safely, an organisation must know what it costs, who owns it, what depends on it. When those answers have decayed, the organisation assumes the application might be essential — because it cannot find anyone able to establish that it is not — and keeps it. The rationalisation initiative that set out to remove two dozen redundant systems removes only the few already being decommissioned (the corpses), because the live ones carry dependencies no one can document. The disease is also the reason the medicine cannot be administered.
That last observation is why the rationalisation exercise itself is the most honest natural experiment available. Organisations do not want proliferation; they pay consultants specifically to reduce it, because licences and headcount are real costs they dislike. And they consistently fail — the count rises against the organisation’s own paid, declared effort to bring it down. A pattern that persists against active, funded, stated effort to eliminate it is the one kind of inefficiency that cannot be reinterpreted as a hidden optimum. That property — running against declared intent — is what makes the indicators below hard to dismiss, and it is the property they were chosen to expose.
One further field observation shapes the instruments, because it inverts a common assumption. A high volume of governance activity — committees, reconciliation forums, data-quality dashboards, stewardship meetings — is routinely read as a sign of governance maturity. In a fragmented estate it is very often the opposite: a measure of the human coordination the architecture has made necessary, and therefore a symptom of high debt rather than a defence against it. This is why the programme never treats governance effort as a credit. It treats it as one more form of compensatory work to be measured, and counts the question an estate can no longer answer as worth more than the activity it performs while failing to answer it.
Part 3 — Layer 1: External Indicators of Architectural Entropy
The first layer is the publicly visible signature: indicators measurable largely from auditable or third-party data, useful for establishing that the environment and direction are real before any single organisation is instrumented. Their suitability is unequal, and honesty about that is what keeps the layer defensible. Vendor dominance is pervasive in this data — the firms that publish it largely sell solutions to the problem they measure, on cloud-forward, sprawl-prone customer bases — so every figure below is a benchmark with selection bias, not a neutral statistic, and series that count different populations with different methods must not be merged into a single number.
A note on what this layer is, and is not. The three indicators are summarised here only far enough to show how they fit the measurement architecture. The full treatment — the figures, their sources, the selection-bias flags, and the corrected framing — is being assembled as a separate preliminary-evidence attachment to accompany this programme at the next stage. That attachment is not a further paper in the conceptual series; it is a record of the public signatures of architectural disorder that we have already found, presented as preliminary supporting evidence rather than as a result. It is kept separate on purpose, so that the present-day deficit it documents is never mistaken for the cumulative decay, or the causal and structural claims, that only the experimental programme below can earn.
Indicator 1 — Market supply per capability (context only)
As a market for a capability matures, it produces more solutions able to serve that capability end to end, so the option to consolidate onto a capable platform becomes more available over time. The clearest public series is the marketing-technology landscape, counted annually on a stable method: roughly 150 solutions in 2011 against more than fifteen thousand by the mid-2020s — a hundredfold growth in supply [vendor/industry benchmark]. This establishes that the feasible target of one coherent solution per capability is not a nostalgic ideal but an increasingly attainable one.
But a market-supply count measures products that exist, not whether one product covers a whole capability inside one firm, and not what any organisation deploys. It measures optionality and fragmentation, not end-to-end feasibility. We therefore demote this indicator to context. The principled replacement — a minimum coherent solution set (the smallest set of solutions covering a threshold of weighted capability requirements) — is a research contribution to be built, not existing data, and it must not become a precondition for publishing anything else.
Indicator 2 — Capability multiplicity and functional redundancy (the strong core)
This is the rename and correction of «applications deployed per capability». At the estate level, public panels show averages crossing one hundred applications per company, several hundred in larger portfolios, with per-capability redundancy directly observed — on the order of ten or more applications serving each of several common functions [vendor benchmarks; populations differ]. The duplication is paid for more than once: organisations license best-of-breed tools that duplicate capability already owned inside a bundle they already pay for.
What this indicator must not claim is that counts rise monotonically. They do not: at least one managed-application panel shows two consecutive years of decline (roughly 130 to 112 to 106) [vendor benchmark], a clean refutation of any monotonic-growth story. The defensible claim — and the stronger one, because the data actually support it — is persistent multiplicity, persistent functional redundancy, and persistent churn surviving despite explicit consolidation effort. The right measures are therefore not raw counts but: an Effective Application Count (weighting applications by usage or workload share, so a niche tool and a system carrying ninety per cent of transactions are not treated as equivalent — an inverse-Herfindahl of usage share); a Functional Redundancy Index; and a Justified Multiplicity Ratio that allows for the multiplicity a domain genuinely needs (resilience, regulation, regional autonomy, specialised function). The debt signal is high effective count and high redundancy and low justified multiplicity, together — not a rising number on its own.
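The Effective Application Count described above is an inverse Herfindahl index over usage shares, which can be sketched in a few lines. The function name and the input layout (a mapping from application name to a usage or workload measure) are assumptions for illustration; the Functional Redundancy Index and Justified Multiplicity Ratio are not sketched because the text does not define their formulas.

```python
from typing import Mapping


def effective_application_count(usage: Mapping[str, float]) -> float:
    """Inverse Herfindahl index of usage share.

    `usage` maps a (hypothetical) application name to any non-negative
    usage measure, such as transactions or active hours.
    """
    if any(v < 0 for v in usage.values()):
        raise ValueError("usage values must be non-negative")
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    shares = [v / total for v in usage.values()]
    # Sum of squared shares is the Herfindahl index; its inverse is the
    # number of equally used applications that would give the same concentration.
    return 1.0 / sum(s * s for s in shares)
```

Four equally used applications give an effective count of 4, while one system carrying ninety per cent of usage alongside two niche tools gives an effective count close to 1.2 even though the raw count is 3.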
Indicator 3 — Portfolio answerability and retrieval cost (the strongest, and the most exposed)
This is the rename of «TCO answerability», and it is the decisive indicator because it explains why the gap cannot be closed: the governance knowledge required to close it has itself decayed. It has three properties that make it unusually strong. It runs against declared intent (no one funds a control programme in order to lose control). It is self-validating (the inability to answer is revealed by the organisation’s own behaviour when asked the implementer’s standard discovery questions, not asserted by the researcher). And it isolates governance loss from genuine growth, provided the questions are chosen to be complexity-invariant.
Two corrections make it defensible. First, on the construct: total cost of ownership is not complexity-invariant in the naive sense, and the indicator should not be framed as «can they produce a TCO number?» but as «can they produce a reproducible cost under a standardised boundary and allocation method?» Ownership, likewise, is multidimensional — business, technical, budget, and, the one that matters most for rationalisation, decommission authority — so the question is not whether a single generic «owner» name exists but whether explicit decommission accountability can be named. The instrument is a fixed discovery checklist applied to a sample of applications: can the organisation produce, from governed sources, a reproducible cost, a named owner with decommission authority, a deployment date, a licence cost? Record the fraction answerable, the fraction reconstructable only by archaeology, and the fraction unobtainable, and score a Portfolio Answerability Index across completeness, provenance, freshness, consistency, and a retrieval-effort penalty — alongside a simple Time-to-Answer.
Second, and this is the central epistemic discipline of the entire programme: deficit is not decay. The data above establish an answerability deficit — A is low now, demonstrably and despite three decades of tooling. They do not establish answerability decay — that A has fallen year over year. The keystone thesis of the series is decay; the public evidence is deficit; the two must never be conflated. Asserting decay requires longitudinal, archival or flow evidence, which is exactly what Part 6 and the experimental ladder are built to supply. Reverse causality compounds the danger here: sicker estates launch more modernisation programmes, so a cross-sectional association between «number of past programmes» and «poor answerability» cannot be read causally. Only before/after, difference-in-differences, or within-estate time-series can test it.
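The discovery checklist described above reduces to a simple tally once each question has been scored. The sketch below computes only the three fractions (answerable from governed sources, reconstructable by archaeology, unobtainable). The question keys, the status labels and the rule that a missing answer counts as unobtainable are all assumptions for illustration, and the weighted Portfolio Answerability Index is not attempted because its weights are not specified here.

```python
from collections import Counter
from typing import Dict, Mapping

# Hypothetical keys for the four checklist questions named in the text.
CHECKLIST = ("reproducible_cost", "decommission_owner", "deployment_date", "licence_cost")
STATUSES = ("governed", "archaeology", "unobtainable")


def answerability_fractions(sample: Mapping[str, Mapping[str, str]]) -> Dict[str, float]:
    """Tally checklist outcomes across a sample of applications.

    `sample` maps application name -> {question: status}.
    Assumption: a question with no recorded answer counts as unobtainable.
    """
    counts: Counter = Counter()
    for answers in sample.values():
        for question in CHECKLIST:
            status = answers.get(question, "unobtainable")
            if status not in STATUSES:
                raise ValueError(f"unknown status: {status!r}")
            counts[status] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {status: counts[status] / total for status in STATUSES}
```

A single snapshot of these fractions is a reading of deficit only; a decay claim would need the same tally repeated over time on a comparable sample.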
Part 4 — Layer 2: The Internal Measurement Stack
The external indicators establish environment and direction. To measure the debt itself, inside one organisation, the programme uses five task-level instruments, each correcting the others’ biases. The unit throughout is cognitive hours, not headcount: a single role mixes genuine and mechanisable work, so people are never the unit of debt — task-time is.
4.1 The task ledger
A task episode is one continuous block of work: one person, one operational purpose, one or more systems. Episodes are logged via time-stamped work sampling — random prompts, several per day, over two to four weeks — and reconciled against system logs. The ledger’s unit is person-minutes per episode, attributed to a capability drawn from a capability taxonomy that is fixed in advance (an external process-classification framework), so that «the same capability» means the same thing across organisations and over time. Without a fixed taxonomy the cross-company and longitudinal claims are not comparable.
4.2 Two-axis classification
Earlier drafts of this programme used a single six-way sort of activities. That fails inter-rater reliability, because it fuses two independent judgments into one. The instrument decomposes them into two binary questions, which are far more reproducible:
| | Structurally required (law, accountability, novelty, physical) | Compensating for a fixable gap |
| --- | --- | --- |
| Creates new information from reality | GIC: genuine information contribution | (rare; reclassify) |
| Moves / reconciles existing information | NEO: necessary execution & oversight | ACW: avoidable compensatory work (the debt) |
The debt is the single cell where both answers point the wrong way: transformation work that is not structurally required (ACW). Two modifiers attach orthogonally as tags, not as separate categories: TTC (transitional compensatory work, carrying a named exit date) and WFR (waiting, failure or rework). Review of AI output is split by the same test — necessary oversight of a model’s output is NEO; avoidable reconciliation of fragmented AI outputs is ACW, tagged AOV. These three terms (GIC, NEO, ACW) are introduced here for the first time and supersede the role-level vocabulary of Paper 1, which they operationalise at task granularity. Two binary questions are the single biggest lever on inter-rater reliability the programme has.
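The two binary questions map onto the four cells mechanically, which is part of why they are easier to reproduce than a six-way sort. A minimal sketch follows; the function and argument names are assumptions for illustration, and the `"RECLASSIFY"` return value stands in for the rare cell the table marks for reclassification.

```python
def classify_episode(creates_new_information: bool, structurally_required: bool) -> str:
    """Map the two binary judgments onto the four cells of the table.

    Rows: does the episode create new information from reality, or only
    move / reconcile existing information?
    Columns: is it structurally required, or compensating for a fixable gap?
    """
    if creates_new_information:
        # Creating new information while compensating for a gap is rare;
        # the table flags it for a second look rather than naming a category.
        return "GIC" if structurally_required else "RECLASSIFY"
    return "NEO" if structurally_required else "ACW"
```

Modifier tags such as TTC, WFR and AOV would be stored alongside the returned label rather than changing it, matching their description as orthogonal tags.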
4.3 Objective tracers (the floor on ACW)
Several components of compensatory work can be measured from digital exhaust rather than judgment: re-entry detection (the same data value appearing in two or more systems with a human keystroke or clipboard event between them — the literal human-as-API); context-switch count (application or window focus changes per episode); reconciliation loops (process-mining patterns: A→B→A, rework cycles, manual touches between automated steps); and export / re-import events (download, manipulate, re-upload). These set an objective floor on ACW, and the judgment layer then classifies only the ambiguous residual. As stated in the Prefatory Note, the traces are auditable anchors and a conservative floor — they are triangulated with task coding rather than treated as ground truth in isolation, because telemetry has its own coverage gaps and interpretation limits.
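Two of the tracers above, re-entry detection and context-switch count, can be sketched over a simple event log. The event schema, the action labels (`"display"`, `"keystroke"`, `"paste"`) and the matching of values by exact string equality are assumptions for illustration; real telemetry would likely hash values and would need fuzzier matching, so the result is best read as a lower bound.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Event(NamedTuple):
    """Hypothetical telemetry record; the schema is illustrative."""
    t: float       # timestamp in seconds
    system: str    # application in focus
    action: str    # "display", "keystroke" or "paste" (assumed labels)
    value: str     # the data value involved


HUMAN_ENTRY = {"keystroke", "paste"}


def count_reentries(events: Iterable[Event]) -> int:
    """Count human entries of a value already seen in a different system."""
    seen: Dict[str, Set[str]] = {}
    reentries = 0
    for e in sorted(events, key=lambda ev: ev.t):
        systems = seen.setdefault(e.value, set())
        if e.action in HUMAN_ENTRY and any(s != e.system for s in systems):
            reentries += 1
        systems.add(e.system)
    return reentries


def count_context_switches(events: Iterable[Event]) -> int:
    """Count changes of system between consecutive events in one episode."""
    ordered = sorted(events, key=lambda ev: ev.t)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a.system != b.system)
```

Both counts are deliberately conservative: they register only what the log shows, leaving ambiguous cases to the judgment layer.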
4.4 Replaceability audit (the graded counterfactual)
For a sample of candidate-debt tasks the question is: could a competent engineer automate this with available period technology, at acceptable quality and risk? The evidence is graded rather than imagined. Tier 1 — already automated in a demonstrable benchmark organisation or shipping product. Tier 2 — a working prototype built during the audit. Tier 3 — expert attestation naming the technology, with no artefact. Tier 1 and 2 support classification as ACW with high confidence; Tier 3 is flagged and weighted down. This replaces «imagine the ideal architecture» with graded evidence, and it is why the counterfactual is observed wherever possible rather than asserted.
4.5 Capability probes (the exergy instruments)
To estimate recovery — to ask whether freed capacity can become genuine contribution — the programme must measure capability, not hours. Three probes do this. The degraded-mode drill runs the process with the mediating system deliberately off (planned, like a fire drill) and measures throughput, error rate and recovery time; the gap between mediated and unmediated performance is a direct reading of dependency and of spent exergy. The novel-exception rate is the fraction of genuinely novel cases — in no playbook — resolved well without escalation, which is where genuine contribution shows up. Cross-context transfer asks whether the operator can apply the skill in a shifted context, since proceduralised skill is brittle to context change while genuine capability transfers. Where degraded-mode drills carry real operational risk, they are scoped to low-stakes processes or replaced by mining existing outage and incident logs, which already contain the unmediated-performance data.
4.6 Load, redefined
Perceived cognitive load is not debt — designing a novel rule is the highest-load, purest-GIC task there is — so load is never used as a debt classifier. It is used two other ways. Load-weighted debt recognises that an hour of high-load reconciliation costs more (error risk, burnout, displaced genuine work) than an hour of low-load formatting; both raw-hour and load-weighted HID are reported. And rising load on the same once-fluent GIC task, tracked over time within experienced staff, is an atrophy fingerprint — exergy draining — which is one of the few early signals of the spent component the programme can read directly.
Part 5 — The Metric Family
The instruments above feed a small family of derived metrics. They are stated carefully to avoid the double-counting that an earlier formulation risked, in which avoidable-review and rework were treated both as categories and as separate addends. Here, observed debt is a single quantity, and the modifiers are tags on it.
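$$
\begin{aligned}
\mathrm{HID}_{\text{observed}} &= \frac{H_{\text{ACW}}}{H_{\text{total}}} \qquad \text{(ACW tagged: ACW–AOV, ACW–WFR, ACW–TTC)} \\
\mathrm{HICT} &= \mathrm{HICR} + \frac{H_{\text{releasable}}}{H_{\text{total}}} \\
\text{F-HICT} &= \mathrm{HICR} + \rho_{\infty} \cdot \frac{H_{\text{releasable}}}{H_{\text{total}}} \\
\mathrm{HID}_{\text{spent}} &= (1 - \rho_{\infty}) \cdot \frac{H_{\text{releasable}}}{H_{\text{total}}} \\
\mathrm{NOI} &= H_{\text{added}} - H_{\text{removed}}
\end{aligned}
$$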
These yield four clean, non-overlapping quantities. Observed mediation is the compensatory work being performed now (HIDobserved, measured, no ideal architecture required). Architecturally releasable capacity is the work a coherent architecture could remove (Hreleasable). Recoverable contribution is the fraction of that released capacity which actually becomes genuine work. And spent capability is the unrecovered remainder — the irreversible part.
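A short sketch shows how the metric family fits together and makes the non-overlap visible: F-HICT and the spent component always sum back to HICT. The function and argument names are assumptions for illustration, as is treating HICR as a fraction on the same 0 to 1 scale as the hour ratios; the numbers in the usage note are invented inputs, not measurements.

```python
from typing import Dict


def metric_family(
    h_total: float,
    h_acw: float,
    h_releasable: float,
    hicr: float,
    rho_inf: float,
    h_added: float = 0.0,
    h_removed: float = 0.0,
) -> Dict[str, float]:
    """Compute the derived metrics from cognitive-hour totals.

    Assumptions: hours are in one consistent unit, hicr is a fraction,
    and rho_inf lies between 0 and 1.
    """
    if h_total <= 0:
        raise ValueError("h_total must be positive")
    if not 0.0 <= rho_inf <= 1.0:
        raise ValueError("rho_inf must lie in [0, 1]")
    releasable_share = h_releasable / h_total
    return {
        "HID_observed": h_acw / h_total,
        "HICT": hicr + releasable_share,
        "F_HICT": hicr + rho_inf * releasable_share,
        "HID_spent": (1.0 - rho_inf) * releasable_share,
        "NOI": h_added - h_removed,
    }
```

With 1,000 total hours, 300 ACW hours, 250 releasable hours, an HICR of 0.40 and a recovery coefficient of 0.6, the sketch gives an observed debt of 0.30, an HICT of 0.65, an F-HICT of 0.55 and a spent component of 0.10. Setting the coefficient to 1 collapses F-HICT onto HICT and the spent component to zero, which is the Paper 1 boundary case.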
The parameter that separates the last two is the recovery coefficient ρ, defined per capability and per cohort (reallocating a worker to a different task is redeployment, not recovery, and must not be allowed to inflate ρ). ρ is the programme’s distinctive contribution, and it is what reconciles an apparent contradiction between the earlier papers. Paper 1 assumed freed hours convert directly into genuine contribution; Paper 4 argued that capability atrophies and is partly unrecoverable. These are not in conflict: Paper 1 is the boundary case ρ = 1, and Paper 4 is the assertion that ρ < 1 and falls the longer a capability has gone unexercised. ρ is the single parameter that tells you which world a given capability is in. Its value, and its dynamics, are the central empirical question of the whole series — and they are unproven. Everything below is built to estimate ρ honestly, including the possibility that it comes back equal to one.
Part 6 — The Velocity-of-Decay Instrument
The hardest question in the programme — «has architecturability decayed over a decade?» — seems to demand years of data the field does not have. It is partly dissolved by measuring flows, not only stocks. You do not need a long time series to know the direction of travel if, at one instant, you measure the rate at which order is being created against the rate at which it is being destroyed. This reads velocity, not just position, and it is the cheapest defensible decay signal available now. Three methods, increasing in power:
- M1 — repeated static snapshots. Re-measure A across a sample each wave; a downward shift in the cross-sectional mean hints at a population tendency. Trap: composition change — the decline may be the sample growing or shifting, not any single estate decaying.
- M2 — vintage gradient. From one snapshot, plot A against a time-under-regime variable: estate age, years since last rationalisation, modernisation waves survived, tenure under mediation. A downward gradient hints at a temporal tendency. Trap: age-period-cohort confounding and survivorship — though survivorship biases toward health (old survivors are robust), so a decline despite it is a strong hint.
- M3 — flow / derivative read (the strongest). Measure the rate of order destruction against order creation at a single instant.
order_creation_rate = (assets onboarded to the model + records verified + assets decommissioned + handovers documented) per unit time
entropy_production ∝ order_destruction_rate − order_creation_rate ( > 0 ⇒ A declining now )
fidelity_half_life ≈ ln(2) / verification_decay_rate (from the «last-verified» staleness distribution)
If the destruction rate exceeds the creation rate at time t, then A is declining at t, with no history required. M3 sidesteps both the composition change that traps M1 and the age confounding that traps M2, by measuring the mechanism directly. It is the honest cash-out of the entropy-production idea used as correspondence rather than allegory: in physics one computes the entropy production rate at an instant from the fluxes, never needing to watch the system age. Honest caveat: M3 hints at the instantaneous direction, not a proven multi-year trajectory, because flows can change. But it converts an almost unanswerable question into one obtainable from a single good audit, and there is a sharper point hidden in it: the inability to reconstruct the past is itself a reading of decay. An orderly estate keeps its records, so an estate that cannot tell you its own history is already telling you something.
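The M3 quantities above can be read off a single audit window, as the following sketch suggests. The function names and the choice of days as the time unit are assumptions for illustration. The destruction rate is taken as a supplied input because its components are not itemised in the formulas above, and the half-life formula presumes a constant (exponential) decay rate, which is itself a modelling assumption.

```python
import math


def order_creation_rate(
    onboarded: int, verified: int, decommissioned: int, handovers: int, period_days: float
) -> float:
    """Ordering events per day over the audit window."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return (onboarded + verified + decommissioned + handovers) / period_days


def net_entropy_production(destruction_rate: float, creation_rate: float) -> float:
    """Positive result: order is being destroyed faster than it is created,
    so A is declining at the moment of measurement."""
    return destruction_rate - creation_rate


def fidelity_half_life(verification_decay_rate: float) -> float:
    """Time for the verified share to halve, assuming exponential decay.

    The result is in the inverse of the rate's time unit.
    """
    if verification_decay_rate <= 0:
        raise ValueError("decay rate must be positive")
    return math.log(2) / verification_decay_rate
```

For example, 70 ordering events in a 30-day window give a creation rate of about 2.3 per day; against a measured destruction rate of 3.0 per day the net figure is positive, which reads as A declining now. A decay rate of 0.1 per month corresponds to a half-life of roughly 6.9 months.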
Part 7 — The Experimental Ladder
Five studies, each buying a distinct inferential claim, from cheap and correlational to expensive and causal. The programme lives or dies on causal identification, because bad architecture is never randomly assigned.
Study 0 — Instrument validation (the gate)
Nothing downstream is interpretable until this passes. At least three independent assessors classify the same set of roughly three hundred task episodes with the two-axis protocol; the programme reports Krippendorff’s α and requires α ≥ 0.80 before any substantive measurement. Self-report labels are concordance-checked against the objective tracers — does «reconciliation» coincide with detected re-entry and context-switch events? If agreement is low, the decision rules are tightened, anchor examples added, and the test re-run. This gate is what separates a measurement from an opinion, and no HID value may be published before it is passed.
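As an illustration of the gate, the sketch below computes Krippendorff's α for nominal labels from a coincidence matrix and checks it against the 0.80 threshold. The labels `G` and `M` (genuine, mechanisable) and the function names are hypothetical stand-ins; the real two-axis protocol may call for a different distance metric, in which case this nominal version would not apply as written.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per task episode, holding the label each assessor gave
    (None where an assessor did not rate the episode).
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an episode with fewer than two ratings is unpairable
        # Every ordered pair of ratings within the episode, weighted 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


def passes_gate(alpha, threshold=0.80):
    """Study 0 gate: substantive measurement proceeds only at or above threshold."""
    return alpha >= threshold


# Made-up ratings from two assessors on four episodes, for illustration only.
ratings = [["G", "G"], ["G", "M"], ["M", "M"], ["M", "M"]]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3), passes_gate(alpha))
```

In this toy case a single disagreement in four episodes already leaves α well below 0.80, which is the point of running the gate on roughly three hundred episodes rather than a handful.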
Study 1 — Baseline and benchmark pair
Measure HID_observed across several teams and capabilities with the full stack, and pair each fragmented site with a coherent-architecture site doing the same capability (a Tier 1 counterfactual). The cognitive-hours gap estimates achievable HID with nothing imagined. Known limit: this is correlational; it cannot yet attribute the gap to architecture versus better staff. That is what Studies 2 and 3 isolate.
Study 2 — The intervention experiment (the keystone; ρ as primary endpoint)
This is the heart of the programme. Take one bounded, ACW-visible capability — master-data reconciliation, recurring reporting, invoice exceptions, or AI-output review. Measure a four-week baseline with the full stack including capability probes and load-on-GIC. Then randomise teams or sites to two arms:
- Arm A — architecture only: implement the integration fix that removes the ACW; free the hours; do nothing about capability.
- Arm B — architecture plus exergy injection: the same fix, plus a deliberate-practice and mentoring programme targeting the genuine-contribution capability.
The primary endpoint is ρ, the share of released hours that converts into genuine contribution, measured directly and tracked weekly for at least eight to twelve weeks (longer is better; recovery is slow). The preregistered predictions: both arms see H_ACW fall and comparable hours released (the fix works); Arm A shows ρ staying low, as freed time idles, is re-absorbed by new mediation, or drifts to low-value work; Arm B shows ρ rising slowly toward a ceiling ρ∞ < 1, with degraded-mode performance improving in B and not A. Randomisation kills selection; the fix removing the ACW shows architecture caused the debt; the A-versus-B contrast shows whether recovery requires injected exergy; and Arm B’s programme cost is the recovery cost. Critically, Study 2 is a randomised before/after design, so the difference-in-differences identification that the contested population-level claims require is native to it: the execution layer, run correctly, manufactures exactly the evidence the epistemics demand.
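A minimal sketch of the endpoint arithmetic follows, assuming each arm reports, per week, the hours released by the fix and the hours of genuine contribution gained relative to baseline. The function names, the four-week averaging window and the numbers are illustrative assumptions, not part of the preregistration.

```python
from statistics import mean


def recovery_coefficient(released_hours, genuine_hours_gained):
    """rho for one week: share of released hours converted into genuine contribution."""
    if released_hours <= 0:
        raise ValueError("rho is undefined when no hours were released")
    return genuine_hours_gained / released_hours


def weekly_rho(weeks):
    """weeks: list of (released_hours, genuine_hours_gained) tuples, one per week."""
    return [recovery_coefficient(released, gained) for released, gained in weeks]


def arm_contrast(arm_a_weeks, arm_b_weeks, last_n=4):
    """Mean rho over the final last_n weeks, Arm B minus Arm A."""
    rho_a = weekly_rho(arm_a_weeks)[-last_n:]
    rho_b = weekly_rho(arm_b_weeks)[-last_n:]
    return mean(rho_b) - mean(rho_a)


# Made-up weekly figures, for illustration only (not results).
arm_a = [(40, 4), (40, 4), (40, 4), (40, 4)]
arm_b = [(40, 8), (40, 12), (40, 16), (40, 20)]
print(weekly_rho(arm_b))           # rising week on week in this toy series
print(arm_contrast(arm_a, arm_b))  # positive if B converts more released time than A
```

A positive contrast with comparable released hours in both arms is the pattern the preregistered prediction describes; a contrast near zero would correspond to the case where both arms recover equally or neither does.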
Study 3 — The hysteresis study (the irreversibility claim)
Hysteresis means the recovery path differs from the decay path. Three routes, ordered by feasibility:
- 3a: history-controlled cross-section (cheapest, do first). Compare units with similar current architecture but different tenure under mediation; lower capability in the long-mediated units, at fixed current architecture, is the hysteresis fingerprint, since history rather than current state then determines capability.
- 3b: natural experiment from outages. For the same team, compare unmediated performance early versus late in the mediation era using existing incident and SLA logs, estimating the decay rate.
- 3c: longitudinal gold standard. Track a cohort through a mediation rollout and then through Arm B recovery, showing the recovery rate is much slower than the decay rate and the ceiling sits below the original.
The output is the rate asymmetry and the ceiling: «partial irreversibility» expressed as numbers rather than asserted.
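The two outputs can be sketched as follows, assuming capability-probe scores are available as time series for the decay phase and the recovery phase. Fitting a straight-line slope to each phase is a simplifying assumption made here for illustration; the programme does not specify the functional form, and the names below are hypothetical.

```python
def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    t_bar = sum(times) / n
    v_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    return sxy / sxx


def hysteresis_summary(decay_t, decay_scores, recovery_t, recovery_scores,
                       original_score):
    """Rate asymmetry and ceiling from two probe series (linear-trend assumption)."""
    decay_rate = -slope(decay_t, decay_scores)        # capability lost per period
    recovery_rate = slope(recovery_t, recovery_scores)  # capability regained per period
    asymmetry = decay_rate / recovery_rate if recovery_rate > 0 else float("inf")
    return {
        "decay_rate": decay_rate,
        "recovery_rate": recovery_rate,
        "rate_asymmetry": asymmetry,  # > 1: recovery slower than decay
        "ceiling_ratio": max(recovery_scores) / original_score,  # < 1: ceiling below original
    }


# Made-up probe scores, for illustration only.
summary = hysteresis_summary([0, 1, 2], [100, 80, 60],
                             [0, 1, 2], [60, 65, 70],
                             original_score=100)
print(summary)
```

Under this reading, hysteresis would show up as a rate asymmetry well above one together with a ceiling ratio below one; a short recovery series will understate the ceiling, so the ratio is only meaningful once recovery has plateaued.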
Study 4 — The oversight-threshold study (the bridge to the AIOIA series)
For AI-enabled processes, measure Net Oversight Impact (NOI = H_added − H_removed) across varying oversight complexity (guardrail count and coupling). The endogeneity trap is that riskier processes attract more guardrails, so a plain regression of debt on guardrail count is confounded; the fixes are within-process (add or remove one guardrail on the same process and measure the change in NOI) and stepped-wedge (introduce oversight components in randomised order across processes over time, so staggered timing identifies the marginal effect). The preregistered shape is a U-curve: early guardrails release capacity (NOI < 0), and past a coupling threshold they create debt (NOI > 0). This matters because it is the same NOI metric and the same threshold logic whether the envelope exists for integration (this series) or for safety and guardrails (the AI Operational Integrity Architecture series). This single study unifies the two series empirically.
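The NOI arithmetic and the two identification fixes can be sketched as below. The function names are hypothetical, oversight complexity is reduced to a single ordered level for simplicity, and the stepped-wedge helper interprets "randomised order" as an independent shuffle of the component list per process, which is one plausible reading rather than the specified design.

```python
import random


def net_oversight_impact(hours_added, hours_removed):
    """NOI = hours added by oversight minus hours removed by it."""
    return hours_added - hours_removed


def within_process_effect(noi_before, noi_after):
    """Change in NOI on the same process after adding or removing one guardrail."""
    return noi_after - noi_before


def debt_threshold(noi_by_level):
    """Lowest complexity level at which NOI turns positive, or None if it never does."""
    for level in sorted(noi_by_level):
        if noi_by_level[level] > 0:
            return level
    return None


def stepped_wedge_order(processes, components, seed=0):
    """Randomised introduction order of oversight components for each process."""
    rng = random.Random(seed)  # fixed seed so the schedule can be preregistered
    schedule = {}
    for process in processes:
        order = list(components)
        rng.shuffle(order)
        schedule[process] = order
    return schedule


# Made-up NOI readings by complexity level, for illustration only.
readings = {1: -5.0, 2: -8.0, 3: -2.0, 4: 3.0}
print(debt_threshold(readings))  # level at which oversight starts to create debt
print(stepped_wedge_order(["invoicing", "reporting"],
                          ["filter", "review", "audit"]))
```

In the toy readings NOI falls, then rises through zero, which is the U-shape the study preregisters; a series that never turns positive returns `None` and would count against the threshold claim for that process.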
Study 5 — Sector replication
Repeat the two-arm design of Study 2 across two or three sectors — finance, health, public administration — to show that ρ and the A-versus-B effect are not artefacts of one setting.
Part 8 — Identification, in One Place
Because bad architecture is never randomly assigned, the programme states its identification strategy explicitly. Randomisation (Study 2) is the cleanest and is used wherever an intervention is genuinely controllable. Stepped-wedge and within-unit designs (Study 4) are used when full randomisation is impossible but timing can be staggered. Historical exposure — tenure under mediation (Study 3a) — is used when only observational data exist, because past architectural decisions predate current staff and are more plausibly exogenous to current capability than the current architecture is; note that this is historical exposure, and is not claimed as a formal instrumental variable unless the relevance and exclusion assumptions can be demonstrated. Benchmark pairs (Study 1) supply magnitude when no intervention is available. The one design the programme never relies on is a cross-sectional regression of debt on guardrail or system count alone; that is the design guaranteed to be confounded.
Part 9 — What Each Result Would Mean
The programme is built so that every outcome is a publishable finding — including the outcomes that would weaken the thesis. That is the property it most needs.
| Result | Interpretation |
| --- | --- |
| Arm B recovers genuine contribution, Arm A does not | Core thesis supported: freed time ≠ recovered capability; exergy must be injected |
| Both arms recover equally | ρ ≈ 1; capability was intact; Paper 4’s irreversibility does not bind here — rearchitecting alone suffices |
| Neither arm recovers | Either the capability is gone (strong irreversibility) or the practice programme was wrong — the capability probes disambiguate |
| Recovery rate much slower than decay rate (Study 3) | Hysteresis confirmed; the debt is path-dependent, not a static misallocation |
| Oversight U-shape confirmed (Study 4) | The two series share one mechanism; NOI is the unifying metric |
| HID_observed large, benchmark gap small | The fragmentation is real but the achievable improvement is modest, which matters for managing expectations |
The framework makes one sharp, uncomfortable, testable prediction at the population level, in two parts, and it is the one that most clearly separates this account from ordinary observations about software sprawl: across organisations, the number of completed modernisation and rationalisation programmes should be negatively associated with current answerability, and longitudinally, applications per capability should persist or rise despite repeated rationalisation effort. Both parts run against the stated intent of every programme involved. If the data bear them out, the result is hard to explain away, precisely because it cannot be reinterpreted as anyone’s rational design. If they are refuted (if more control programmes predict more answerability, or if rationalisation reliably reduces applications per capability), the framework is wrong in its strongest claim, and that too is a finding worth having.
Part 10 — Limitations, Ethics, and the Publication Gate
Intellectual honesty about the boundary of the evidence is what makes the result defensible, so the programme states its limits as plainly as its claims.
- Deficit is not decay. The public data establish a present-day deficit and a structural pattern. They do not prove cumulative decay, and no decay claim is made without archival, repeated-measurement or flow evidence. This is the central vulnerability of the whole programme and is treated as such.
- ρ and hysteresis are unproven. They are the distinctive contribution and the least evidenced; they are presented as hypotheses with a clean test, never as findings.
- Vendor data carry selection bias. Every third-party figure is a benchmark drawn from cloud-forward, sprawl-prone samples; figures are triangulated against more neutral sources, incompatible series are not merged, and analyst forecasts are marked as forecasts rather than measurements.
- The counterfactual is bounded. Public sources measure proxies, not the precise construct; the capability-level magnitude and the recovery dynamics are field contributions, established against a fixed protocol, not delivered by public data.
- Operational risk. Degraded-mode drills are scoped to low-stakes processes or replaced by mining real outage logs; the programme does not introduce failure into critical operations to take a reading.
- The taxonomy is fixed in advance. An external capability classification is adopted before measurement, so that comparisons across organisations and over time are meaningful.
- Study 0 is a hard gate. No published HID number before inter-rater α ≥ 0.80. Without it, the classification is opinion.
The discipline that makes the entire programme hold can be stated in one line: measure one real estate’s flows once, and its before-and-after once. A single clean velocity read plus a single clean intervention is worth more than any volume of public-data correlation, because it is the one thing that turns every contested population-level claim into something defensible. This paper specifies how to take those two readings. The numbers themselves are the work that follows.
Sources and status
The figures behind Part 3 are drawn from a public-data scan as of early 2026 and are collected, with their sources and selection-bias flags, in the separate preliminary-evidence attachment that accompanies this programme; they are to be refreshed and primary-sourced before any external publication. In outline, the sources fall into two groups. The first is vendor and industry benchmarks, which carry selection bias: the marketing-technology landscape series (annual, 2011 onward); enterprise SaaS-portfolio panels reporting applications per company and per-capability redundancy; managed-application panels reporting the consolidation that refutes monotonic growth; and ITAM and cloud-spend surveys reporting visibility and waste. The second is more neutral longitudinal sources, used to triangulate and to position the contribution: industry-consortium data and peer-reviewed academic work on the productivity paradox, automation economics, the human factors of automation, articulation and invisible work, and software-architecture decay. The answerability-decay claim and the recovery coefficient ρ are not in that attachment, because they are not yet evidenced: they await field measurement (Study 0, then the intervention experiment and the velocity audit) and are stated here as hypotheses, not results. The attachment documents a present-day deficit; this programme is what would turn that deficit into a measured, causal account.
This work is produced by the AI Integrity Management working group at The Integral Management Society, a Swiss non-profit association bringing together senior specialists from adaptive systems, complex systems, artificial intelligence, mission-critical operations and governance. The working group invites empirical collaboration to run the instrument-validation gate and the intervention experiment described above.
