Project Risk Lab
Interactive tools and research demos for evidence-based project decisions. Explore calibration, Monte Carlo simulation, reference class forecasting, and more.
Are You Well-Calibrated?
For each question, provide a 90% confidence interval: a range you are 90% sure contains the correct answer. If you are perfectly calibrated, 9 out of 10 answers should fall inside your ranges. Most people only capture 4–5 out of 10.
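For a concrete sense of the scoring, here is a minimal Python sketch; the intervals and questions below are purely illustrative, not items from the quiz.

```python
def calibration_score(responses):
    """responses: (low, high, truth) triples; count truths inside the interval."""
    hits = sum(low <= truth <= high for low, high, truth in responses)
    return hits, len(responses)

# Illustrative intervals a respondent might give, with the true values:
responses = [
    (5_000, 8_000, 6_371),  # Earth's mean radius in km -> hit
    (1850, 1875, 1889),     # year the Eiffel Tower was completed -> miss
    (50, 120, 100),         # boiling point of water at sea level, deg C -> hit
]
hits, n = calibration_score(responses)
print(f"{hits}/{n} inside your 90% intervals (well-calibrated ~ 9 in 10)")
```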
Monte Carlo Project Simulator
Define a base project cost and a set of risk events. Each risk has a probability of occurring and a cost impact range. The simulator runs multiple iterations to show the distribution of possible total project costs.
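The core loop is simple to sketch. A minimal Python version, assuming each risk is a probability plus a uniform cost-impact range (the tool itself may model impacts differently), with illustrative numbers:

```python
import random

def simulate(base_cost, risks, iterations=10_000):
    """risks: list of (probability, min_impact, max_impact) tuples."""
    totals = []
    for _ in range(iterations):
        total = base_cost
        for prob, lo, hi in risks:
            if random.random() < prob:           # does this risk occur?
                total += random.uniform(lo, hi)  # draw its cost impact
        totals.append(total)
    return sorted(totals)

risks = [(0.30, 50_000, 200_000),   # e.g. ground conditions
         (0.10, 100_000, 500_000)]  # e.g. key supplier failure
totals = simulate(1_000_000, risks)
print(f"P50: {totals[len(totals) // 2]:,.0f}")
print(f"P90: {totals[int(0.9 * len(totals))]:,.0f}")
```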
Reference Class Forecasting
Nobel laureate Daniel Kahneman advocates using the outside view: instead of relying on optimistic inside estimates, look at how similar projects actually performed. Select a project type to see historical cost overrun distributions from large-scale studies.
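As a sketch of how an outside-view uplift could be computed from such a distribution (the overrun ratios below are illustrative, not the tool's dataset):

```python
def rcf_uplift(ratios, acceptable_chance_of_overrun=0.2):
    """Uplift such that only the given share of reference-class
    projects would have exceeded the uplifted estimate."""
    ratios = sorted(ratios)
    idx = min(len(ratios) - 1,
              int((1 - acceptable_chance_of_overrun) * len(ratios)))
    return ratios[idx] - 1.0

# Illustrative actual/estimated cost ratios for comparable projects:
reference_class = [0.95, 1.05, 1.10, 1.20, 1.25, 1.40, 1.45, 1.60, 1.80, 2.10]
uplift = rcf_uplift(reference_class)
print(f"Apply a {uplift:.0%} uplift to the inside-view estimate")  # 80%
```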
The Iron Law of Megaprojects
Based on the Oxford Global Projects database of thousands of projects worldwide:
Expected Value of Information
One of the most powerful (and most overlooked) tools in decision making. Before measuring anything, ask: what is the measurement worth? This tool calculates the Expected Value of Perfect Information (EVPI), the maximum you should spend to eliminate uncertainty before making a decision. It often reveals the measurement inversion: what we measure most has the least value, and what we ignore matters most.
How does this work?
You're deciding whether to proceed with a project. The decision hinges on an uncertain variable, such as expected revenue, NPV, or cost savings. If the true value falls below your threshold, proceeding would be a mistake. EVPI quantifies the maximum you should pay to resolve that uncertainty before committing.
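A minimal Monte Carlo sketch of that calculation, assuming (illustratively) that the uncertain quantity is the project's net value relative to doing nothing:

```python
import random, statistics

def evpi(samples):
    """Value of deciding with perfect information minus the value of the
    best single decision made under uncertainty."""
    with_info = statistics.mean(max(v, 0) for v in samples)  # act on each true value
    without = max(statistics.mean(samples), 0)               # one decision for all
    return with_info - without

# Illustrative uncertainty: net value ~ Normal(mean 2M, sd 5M)
samples = [random.gauss(2_000_000, 5_000_000) for _ in range(100_000)]
print(f"EVPI ~ {evpi(samples):,.0f}")  # the most perfect information is worth
```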
Noise Audit
Even perfectly unbiased cost estimates produce systematic cost overruns. When organisations pick projects based on estimated value for money, they inadvertently favour projects whose costs were underestimated by luck. This "winner's curse" has nothing to do with bias or gaming. This tool quantifies how much of the observed overrun for a given project type could be explained by estimation noise alone, without any bias, optimism, or strategic misrepresentation.
How does the budget constraint create overruns?
The model’s key mechanism is the budget constraint acting as a selection filter. Organisations cannot fund every project, so they rank proposals by estimated benefit-cost ratio (BCR) and fund the top fraction (the budget share α). Among funded projects, costs systematically exceed estimates: a project is more likely to be selected precisely when noise made its costs look lower than they really were. This is the winner’s curse applied to project portfolios.
Two parameters govern the effect: σ (how noisy the estimates are) and α (how selective the budget is). Tighter budgets (lower α) force selection from the extreme tail of the estimated BCR distribution, amplifying the overrun. When α = 1 (everything funded), there is no selection and noise averages out.
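The mechanism is easy to reproduce by simulation. A sketch under illustrative assumptions (lognormal costs and benefits, mean-one multiplicative estimation noise), which is not necessarily the tool's exact model:

```python
import random

def funded_overrun(sigma, alpha, n=200_000):
    """Portfolio overrun (total actual / total estimated cost) among
    the top-alpha share of projects ranked by estimated BCR."""
    projects = []
    for _ in range(n):
        true_cost = random.lognormvariate(0, 0.3)
        benefit = random.lognormvariate(0.3, 0.4)
        # mean-one multiplicative noise: estimates unbiased on average
        est_cost = true_cost * random.lognormvariate(-sigma**2 / 2, sigma)
        projects.append((benefit / est_cost, true_cost, est_cost))
    projects.sort(key=lambda p: p[0], reverse=True)  # rank by estimated BCR
    funded = projects[: max(1, int(alpha * n))]
    return sum(p[1] for p in funded) / sum(p[2] for p in funded)

print(funded_overrun(sigma=0.3, alpha=0.2))  # > 1: overrun from noise alone
print(funded_overrun(sigma=0.3, alpha=1.0))  # ~ 1: fund everything, no overrun
```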
Selection-Adjusted RCF
Standard Reference Class Forecasting applies the full observed overrun as an uplift (a percentage added to the estimate to account for historical patterns of cost growth). But part of that overrun is a selection effect (winner’s curse): projects were funded partly because noise made them look cheaper. Since the selection effect will recur when the new project passes through the same funding filter, building it into the uplift means overcorrecting. This tool strips out the selection component, producing an uplift based on genuine bias only.
Why does naive RCF overcorrect?
When organisations rank projects by estimated benefit-cost ratio (BCR) and fund the top fraction α, projects with accidentally underestimated costs are more likely to be selected. The observed overrun in the reference class therefore reflects both genuine estimation bias and this statistical selection effect (a winner’s curse). But the new project being appraised will go through the same selection filter, so the selection-driven overrun will happen again regardless of any uplift. Applying the full observed overrun as an RCF adjustment therefore double-counts the selection component.
This tool decomposes the observed overrun into a selection component (teal) and a residual bias component (gold) using a closed-form expression for the expected cost ratio conditional on BCR-based selection, then constructs a corrected forecast curve that adjusts only for the bias portion.
- The correction only matters when the RCF uplift is applied before the selection decision (i.e., during appraisal). If applied after funding, naive RCF is appropriate.
- The decomposition into selection and bias components is exact for the mean. The percentile-level correction is a first-order approximation that applies the mean-level bias fraction to the overrun (cost) or shortfall (benefit) portion of each percentile, leaving under-budget or above-target outcomes unchanged.
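The same simulation idea illustrates the decomposition (the tool itself uses the closed form; everything below is an illustrative sketch). Extending the Noise Audit model above with a genuine bias term, and comparing the observed overrun against the overrun a zero-bias world would produce:

```python
import random

def funded_overrun(sigma, alpha, bias=0.0, n=200_000):
    """As in the Noise Audit sketch, plus genuine bias: estimates
    systematically understate actual costs by a factor (1 + bias)."""
    projects = []
    for _ in range(n):
        actual = random.lognormvariate(0, 0.3)
        benefit = random.lognormvariate(0.3, 0.4)
        est = actual / (1 + bias) * random.lognormvariate(-sigma**2 / 2, sigma)
        projects.append((benefit / est, actual, est))
    projects.sort(key=lambda p: p[0], reverse=True)  # rank by estimated BCR
    funded = projects[: max(1, int(alpha * n))]
    return sum(p[1] for p in funded) / sum(p[2] for p in funded) - 1

observed = funded_overrun(sigma=0.3, alpha=0.2, bias=0.10)
selection = funded_overrun(sigma=0.3, alpha=0.2, bias=0.0)
print(f"observed {observed:.0%} = selection {selection:.0%} "
      f"+ bias-only uplift {observed - selection:.0%}")
```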
Conditional Reference Class Forecasting
Standard reference class forecasting applies the outside view at the start of a project. But what happens as the project progresses and you observe actual performance? Conditional RCF updates the forecast by combining the reference class distribution with real progress data, narrowing uncertainty as evidence accumulates.
What makes it conditional?
All reference classes are conditional. When you pick "IT cost overruns" as your reference class, you've already conditioned on project type. The term Conditional Reference Class Forecasting adds a further condition: how far along the project is and what actual performance looks like so far. At 0% progress, you have only the unconditional distribution. At 50% progress with an observed cost ratio of 1.15, you combine that evidence with the reference class to produce a tighter, more informed forecast.
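One way such an update could work, sketched under strong illustrative assumptions: treat the cost ratio observed so far as a noisy signal of the final ratio, with signal noise shrinking as the project nears completion, and reweight the reference-class draws by likelihood. This is a sketch of the idea, not the tool's exact model.

```python
import math, random, statistics

# Illustrative reference class: draws of final cost ratios (actual/estimate)
reference_class = [random.lognormvariate(0.15, 0.35) for _ in range(10_000)]

def conditional_forecast(samples, progress, observed_ratio=None):
    if not progress or observed_ratio is None:       # no evidence yet: prior
        return statistics.mean(samples), statistics.stdev(samples)
    sd = max(0.3 * (1 - progress) / progress, 1e-9)  # assumed signal noise
    weights = [math.exp(-0.5 * ((observed_ratio - r) / sd) ** 2)
               for r in samples]
    total = sum(weights)
    mean = sum(w * r for w, r in zip(weights, samples)) / total
    var = sum(w * (r - mean) ** 2 for w, r in zip(weights, samples)) / total
    return mean, math.sqrt(var)

print(conditional_forecast(reference_class, 0.0))        # unconditional prior
print(conditional_forecast(reference_class, 0.5, 1.15))  # tighter, re-centred
```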
Beta Distribution
How uncertain are you about a proportion, like the probability of project cancellation, the share of requirements that change, or the fraction of stakeholders who approve? The beta distribution is the Bayesian tool for exactly this question. Start with a prior belief, add observed data, and watch the uncertainty narrow.
How do α and β work?
The parameters α (alpha) and β (beta) shape the distribution. Think of them as pseudo-counts: α = prior "hits" + 1 and β = prior "misses" + 1. Start with α=1, β=1 for a flat (uninformative) prior. Larger values indicate stronger prior belief, meaning a sharper peak and less spread. When you observe new data (hits and misses), the posterior is simply Beta(α + hits, β + misses).
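In code, the update really is that simple. A minimal sketch using SciPy; the prior and the data are illustrative:

```python
from scipy.stats import beta

a, b = 1, 1              # flat prior: Beta(1, 1)
hits, misses = 7, 3      # observed outcomes
posterior = beta(a + hits, b + misses)

print(f"Posterior mean: {posterior.mean():.3f}")  # 8/12 ~ 0.667
print(f"90% credible interval: "
      f"{posterior.ppf(0.05):.3f}-{posterior.ppf(0.95):.3f}")
```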
Technology Regret Analysis
Technology costs tend to decline while benefits grow over time. Adopt too early and you overpay; too late and you forfeit years of benefits. Technology regret is the NPV lost by investing at a non-optimal time. This tool computes the net present value of investing in each possible year and identifies the optimal adoption window.
What drives the optimal timing?
Two forces compete: cost decline rewards waiting (technology gets cheaper), while the discount rate and finite benefit window penalise delay (future benefits are worth less, and the planning horizon shrinks). The model discounts all cash flows to year 0, so you can compare adopt-now vs. adopt-later on an equal footing. The optimal year balances the savings from cheaper technology against the benefits forgone by waiting.
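A sketch of that calculation; the initial cost, annual benefit, cost-decline rate, discount rate, and planning horizon below are all illustrative assumptions, not the tool's defaults:

```python
def npv_of_adopting(year, cost0=1_000_000, benefit=250_000,
                    decline=0.20, rate=0.08, horizon=15):
    cost = cost0 * (1 - decline) ** year      # technology gets cheaper
    npv = -cost / (1 + rate) ** year          # discount the outlay to year 0
    for t in range(year + 1, horizon + 1):    # benefits run to the horizon
        npv += benefit / (1 + rate) ** t
    return npv

npvs = {y: npv_of_adopting(y) for y in range(11)}
best = max(npvs, key=npvs.get)
print(f"Optimal adoption year: {best}")
print(f"Regret of adopting now: {npvs[best] - npvs[0]:,.0f}")
```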
Risk Matrix Roulette
Think risk matrices give you a clear picture? Play three rounds that expose fundamental flaws in the most widely used risk assessment tool. Based on research by Cox (2008), Mandel (2021), and Hubbard, Budzier & Leed (2025).
Why challenge risk matrices?
Risk matrices are ubiquitous in project management, but research shows they suffer from at least three fatal flaws: (1) verbal probability labels mean vastly different things to different people, (2) colour-coded cells hide enormous differences in expected loss, and (3) arbitrary category boundaries can reverse the ranking of risks. This interactive challenge lets you experience all three.
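Flaw (2) takes a few lines to demonstrate. Assuming, illustratively, a matrix whose "medium probability" band spans 10–30% and whose "major impact" band spans $1M–$10M:

```python
# Both risks land in the same cell (medium probability, major impact)
# and receive the same colour, yet their expected losses differ ~23x.
risks = {"A": (0.11, 1_100_000), "B": (0.29, 9_500_000)}
for name, (p, impact) in risks.items():
    print(f"Risk {name}: expected loss = {p * impact:,.0f}")
# Risk A: 121,000   Risk B: 2,755,000
```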
Parade of Trades
Five trades. Thirty-five rooms. One die. Can you beat the plan? In construction, trades work sequentially through rooms, but variability in weekly output creates cascading delays. Roll the dice and watch the parade unfold.
How does this work?
Five trades (Civil, Mechanical, E&I, Fit-out, Commissioning) must each process 35 rooms in sequence. Each trade rolls a die each week to determine capacity, but can only process rooms the previous trade has already completed. The deterministic plan, using the average roll of 3.5 rooms/week, expects completion in 14 weeks. You will play three rounds with dice of identical averages but different variability to see how uncertainty amplifies through the system.
This demonstrates a general principle: when tasks must happen in sequence and each step has variable output, the expected project duration is longer than the plan computed from average rates, because downstream trades lose output in slow weeks upstream but can never bank the fast ones. Planning with averages virtually guarantees you will be late.
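A sketch of the parade mechanics in Python, assuming trades hand rooms over at week boundaries (which reproduces the 14-week deterministic plan); the die faces are illustrative:

```python
import random

def parade(n_trades=5, rooms=35, die=(1, 2, 3, 4, 5, 6)):
    done = [0] * n_trades                  # rooms completed per trade
    week = 0
    while done[-1] < rooms:
        week += 1
        handover = done[:]                 # trades see last week's completions
        for i in range(n_trades):
            capacity = random.choice(die)  # this week's roll
            backlog = (rooms if i == 0 else handover[i - 1]) - done[i]
            done[i] += min(capacity, backlog)
    return week

high_var = [parade() for _ in range(2_000)]
low_var = [parade(die=(3, 3, 3, 4, 4, 4)) for _ in range(2_000)]  # same mean 3.5
print("1-6 die:", sum(high_var) / len(high_var), "weeks")  # well over 14
print("3-4 die:", sum(low_var) / len(low_var), "weeks")    # much closer to plan
```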
Probability Word Explorer
What does “likely” mean? What about “probable”? In a survey of 123 respondents, people assigned wildly different numeric probabilities to common probability words. Explore all 17 words and see how your own interpretations compare.
About the data
This data comes from a survey conducted by Wade Fagen-Ulmschneider (Teaching Professor, Computer Science, University of Illinois at Urbana-Champaign). Each of the 123 internet survey respondents was asked to assign a numeric probability (0–100%) to 17 common probability words. The box-and-whisker chart below shows the distribution of responses for each word: the box spans the interquartile range (Q1–Q3, the middle 50% of responses), the vertical line marks the median, and the whiskers extend to the minimum and maximum values observed. The study builds on a long tradition of research into verbal probability, notably Sherman Kent’s 1964 “Words of Estimative Probability” for the CIA.
Go Deeper
Several of these tools are based on methods from How to Measure Anything in Project Management. The book provides the full framework for evidence-based project decisions.