Welcome to E2P Simulator! This guide will help you understand what it does, why it is needed, and how to use it.
What is E2P Simulator?
E2P Simulator (Effect-to-Prediction Simulator) allows researchers to interactively and quantitatively explore the relationship between effect sizes (e.g., Cohen's d, Odds Ratio, Pearson's r), the corresponding predictive performance (e.g., AUC, Sensitivity, Specificity, Accuracy, etc.), and real-world predictive utility (e.g., PPV, NPV, PR-AUC, MCC, Net Benefit, etc.) by accounting for measurement reliability and outcome base rates.
In other words, E2P Simulator is a tool for performing predictive utility analysis - estimating how research findings will translate into real-world prediction, or what effect sizes and predictive performance are needed to achieve a desired level of predictive and clinical utility. Much like how power analysis tools (such as G*Power) help researchers plan for statistical significance, E2P Simulator helps them plan for practical significance.
E2P Simulator has several key applications:
Interpretation of findings: It helps researchers move beyond arbitrary "small/medium/large" effect size labels and flawed prediction performance metrics (such as Accuracy or AUC) by grounding their interpretation in estimated real-world predictive utility.
Research planning: Being able to easily derive what effect sizes and predictive performance are needed to achieve a desired real-world predictive performance allows researchers to plan their studies more effectively and allocate resources more efficiently.
Education: The simulator's interactive design makes it a valuable teaching tool, helping researchers develop a more intuitive understanding of how different abstract statistical metrics relate to one another and to real-world utility.
Why is E2P Simulator needed?
Many research areas, such as the biomedical, behavioral, educational, and sports sciences, are increasingly studying individual differences in order to build predictive models that personalize treatments, learning, and training. Identifying reliable biomarkers and other predictors is central to these efforts. Yet several entrenched research practices continue to undermine the search for predictors:
Overemphasis on statistical significance: Most research continues to optimize for statistical significance (p-values) without optimizing for practical significance (effect sizes).
Difficulties interpreting effect sizes: The interpretation of effect sizes, which are critical for gauging real-world utility, is often reduced to arbitrary cutoffs (small/medium/large) without providing any clear sense of their practical utility.
Overlooked measurement reliability: Measurement noise attenuates observed effect sizes and weakens predictive power, yet it is rarely accounted for in study design or interpretation of findings.
Neglected outcome base rates: Low-prevalence events drastically limit predictive performance in real-world settings, but are often not considered when evaluating the translational potential of prediction models.
Together, these issues undermine the quality and impact of academic research, because routinely reported metrics do not reflect real-world utility. Whether researchers focus on achieving statistical significance of individual predictors or optimizing model performance metrics like accuracy and ROC-AUC, both approaches often lead to unrealistic expectations about practical impact. In turn, this results in inefficient study planning, resource misallocation, and considerable waste of time and funding.
E2P Simulator is designed to address these fundamental challenges by placing measurement reliability and outcome base rates at the center of study planning and interpretation. It helps researchers understand how these factors jointly shape predictive utility, and guides them in making more informed research decisions.
How to use E2P Simulator
E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting effect sizes, measurement reliability, base rates, and decision threshold, and immediately see how these changes impact predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.
The image above provides an overview of all of E2P Simulator's interactive components.
Binary vs. Continuous Outcomes
E2P Simulator provides two analysis modes that cover the two most common research scenarios:
Binary Mode: Considers dichotomous outcomes such as diagnostic categories (e.g., cases vs. controls) or discrete states (e.g., success vs. failure). All metric calculations and conversions in this mode are completely analytical and follow the formulas provided on the page.
Continuous Mode: Considers continuous measurements such as symptom severity or performance scores that may need to be categorized (e.g., responders vs. non-responders or performers vs. non-performers) for practical decisions. This mode is based on actual data simulations rather than analytical solutions, so it may be slower to respond (the sketch below illustrates the general idea).
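For intuition, the general idea behind the simulation-based approach can be sketched as follows (a simplified illustration only, ignoring measurement reliability, and not the simulator's actual implementation): simulate a bivariate-normal predictor and outcome with a given correlation, dichotomize the outcome at the chosen base rate, and compute classification metrics on the resulting groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_auc(r, base_rate, n=100_000):
    """Simulate a predictor and a continuous outcome correlated at Pearson's r,
    dichotomize the outcome at the given base rate (top fraction = 'responders'),
    and estimate ROC-AUC for the resulting binary classification."""
    x, y = rng.multivariate_normal([0, 0], [[1, r], [r, 1]], size=n).T
    label = (y >= np.quantile(y, 1 - base_rate)).astype(int)
    # Mann-Whitney U statistic divided by (n_pos * n_neg) equals the empirical ROC-AUC
    u = stats.mannwhitneyu(x[label == 1], x[label == 0]).statistic
    return u / (label.sum() * (n - label.sum()))

print(simulated_auc(r=0.45, base_rate=0.15))
```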
Measurement Reliability and True vs. Observed Effects
Measurement reliability attenuates observed effect sizes, and in turn attenuates predictive performance. The simulator allows you to specify reliability using the Intraclass Correlation Coefficient (ICC) for continuous measurements and Cohen's kappa (κ) for binary classifications. These typically correspond to test-retest reliability and inter-rater reliability, respectively. You can toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability).
Note that the simulator does not account for sample size limitations, which can introduce additional uncertainty around the true effect size through sampling error.
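For a rough sense of how attenuation works, the classical formulas for continuous measures are sketched below (an illustration of the general principle with example values; the exact formulas used by the simulator are shown on the page itself):

```python
import numpy as np

def attenuated_r(true_r, rel_x, rel_y):
    """Spearman's classical attenuation: the observed correlation shrinks with the
    square root of the product of the two measures' reliabilities."""
    return true_r * np.sqrt(rel_x * rel_y)

def attenuated_d(true_d, rel_x):
    """Cohen's d with an unreliable continuous predictor (group membership assumed
    to be measured without error): d shrinks with sqrt(reliability)."""
    return true_d * np.sqrt(rel_x)

print(attenuated_r(0.5, rel_x=0.6, rel_y=0.94))  # observed r ≈ 0.38
print(attenuated_d(1.0, rel_x=0.6))              # observed d ≈ 0.77
```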
Base Rates
The base rate refers to how common an outcome is within a specific population. For instance, if an effect size measures the difference between a group with a disorder and a healthy control group, the base rate would be the disorder's prevalence in the general population. However, if you're focusing on a pre-identified high-risk group, the relevant base rate becomes the disorder's prevalence within this specific high-risk cohort. In Bayesian statistics, this base rate is analogous to the prior probability of the outcome, established before considering any particular predictor.
The base rate directly affects PPV and NPV and, indirectly, the metrics that depend on them: PR-AUC, MCC, and the F1 score (see Understanding Predictive Metrics).
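The dependence of PPV and NPV on the base rate follows directly from Bayes' theorem; here is a minimal sketch, with sensitivity, specificity, and base rate values chosen purely for illustration:

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value via Bayes' theorem."""
    tp = sensitivity * base_rate
    fp = (1 - specificity) * (1 - base_rate)
    return tp / (tp + fp)

def npv(sensitivity, specificity, base_rate):
    """Negative predictive value via Bayes' theorem."""
    tn = specificity * (1 - base_rate)
    fn = (1 - sensitivity) * base_rate
    return tn / (tn + fn)

# The same test looks very different at different base rates:
for br in (0.5, 0.08, 0.003):
    print(br, round(ppv(0.8, 0.8, br), 3), round(npv(0.8, 0.8, br), 3))
```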
Multivariate Effects Calculators
Both binary and continuous outcomes analysis modes include calculators that help explore how multiple predictors can be combined to achieve stronger effects:
Mahalanobis D Calculator (Binary mode)
Multivariate R² Calculator (Continuous mode)
The calculators can approximate the expected performance of multivariate models without requiring the full models to be trained, which supports research planning and model development. More specifically, they illustrate (see the sketch after this list):
How the number of predictors affects combined effect size
The diminishing returns of adding more predictors
How collinearity (correlation among predictors) reduces their combined effectiveness
The trade-offs between using fewer strong predictors versus more moderate ones
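Under these simplifying assumptions (equal per-predictor effects and a common pairwise correlation among predictors, discussed in the next subsection), the combined effects can be approximated with standard closed-form expressions. The sketch below illustrates these approximations; the exact formulas used by the simulator are shown on the page itself:

```python
import numpy as np

def mahalanobis_D(d, k, rho):
    """Combined Mahalanobis D for k predictors, each with Cohen's d = d and a
    common pairwise correlation rho among the predictors."""
    return d * np.sqrt(k / (1 + (k - 1) * rho))

def multivariate_R2(r, k, rho):
    """Combined R^2 for k predictors, each correlating r with the outcome, with a
    common pairwise correlation rho among the predictors."""
    return k * r**2 / (1 + (k - 1) * rho)

# Diminishing returns as predictors are added, worsened by collinearity:
for k in (1, 5, 10, 20):
    print(k,
          round(mahalanobis_D(0.3, k, rho=0.3), 2),
          round(multivariate_R2(0.2, k, rho=0.3), 2))
```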
Calculator Assumptions and Limitations
Both calculators correspond to fundamental predictive models in statistics: the Mahalanobis D Calculator approximates Linear Discriminant Analysis and logistic regression, while the Multivariate R² Calculator aligns with multiple linear regression.
Like the statistical methods they approximate, the calculators operate under several simplifying assumptions:
Average effects and correlations: The calculators use single values to represent the average effect size across predictors and average correlation among them, which can provide useful approximations even when individual predictors vary in strength
Linear effects: The formulas assume predictors contribute additively without interactions (where one predictor's effect depends on another). However, research shows that in clinical prediction, complex non-linear models generally do not outperform simple linear logistic regression (Christodoulou et al., 2019)
Normality: The underlying variables are assumed to be normally distributed. This is consistent with the assumptions of input metrics like Cohen's d and Pearson's r, although in practice these are often computed despite normality violations.
Despite these limitations, these calculators serve as valuable tools for building intuition about how multiple predictors combine to achieve stronger effects. Even though real-world predictors will vary in their individual strengths and collinearity, the overall patterns demonstrated (such as diminishing returns and the impact of shared variance) remain informative for understanding multivariate relationships.
Understanding Predictive Metrics
Classification Outcomes and Metrics
When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.
The image above illustrates how these four outcomes relate to various classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you'll find the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., Sensitivity/Recall/TPR, Precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.
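To make these relationships concrete, here is a minimal sketch of how the key metrics follow from the four outcome counts (the counts themselves are purely illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Core metrics derived from the four confusion matrix counts."""
    sensitivity = tp / (tp + fn)          # a.k.a. Recall / TPR
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                  # a.k.a. Precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    return dict(sensitivity=sensitivity, specificity=specificity, ppv=ppv,
                npv=npv, accuracy=accuracy, f1=f1, mcc=mcc)

print(classification_metrics(tp=30, fp=70, fn=10, tn=890))
```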
Key Metrics for Different Contexts
Depending on your research context, certain metrics may be more relevant than others:
When false negatives are costly (e.g., missing a disease diagnosis): Focus on Sensitivity/Recall. In clinical settings where missing a diagnosis could be life-threatening, maximizing Sensitivity ensures fewer cases are missed, even if it means more false alarms.
When false positives are costly (e.g., unnecessary treatments): Focus on Specificity. When treatments have significant side effects or costs, high Specificity ensures fewer healthy individuals receive unnecessary interventions.
When dealing with low base rates: Focus on Precision (PPV), NPV, F1 score, MCC, and PR-AUC. These metrics are sensitive to the base rate and thus provide a more accurate assessment of how a predictor would perform in the real world.
Threshold-Independent Metrics
Some metrics evaluate performance across all possible thresholds and can serve as a better summary of the overall model performance. These include:
ROC-AUC (Area Under the Receiver Operating Characteristic Curve): Captures how well a model balances true positives (Sensitivity) and false positives (1 - Specificity) across all possible thresholds, reflecting the trade-off between the two. An AUC of 0.5 means the model is no better than flipping a coin, while 1.0 means perfect separation between groups.
PR-AUC (Area Under the Precision-Recall Curve): Shows how well a model can maintain both Precision (PPV) and Recall (Sensitivity) together, which is particularly informative for rare outcomes, since the base rate affects Precision. A larger PR-AUC means the model can achieve high Precision without sacrificing Recall (or vice versa), indicating a smaller trade-off between finding all positive cases and avoiding false alarms.
Both ROC-AUC and PR-AUC represent areas under their respective curves and are mathematically expressed as integrals:
\[ROC\text{-}AUC = \int_0^1 TPR(FPR) \, d(FPR)\]
\[PR\text{-}AUC = \int_0^1 P(R) \, dR\]
Where P is Precision (PPV) and R is Recall (Sensitivity). These integrals are computed using trapezoidal numerical integration.
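As an illustration of how these integrals can be computed, the sketch below sweeps a decision threshold over two equal-variance normal distributions (an assumption made here purely for the example), derives the TPR/FPR and Precision/Recall curves, and integrates them with the trapezoid rule:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Example: negatives ~ N(0, 1), positives ~ N(d, 1), at an 8% base rate
d, base_rate = 0.8, 0.08
thresholds = np.linspace(-5, 6, 2001)

tpr = 1 - stats.norm.cdf(thresholds, loc=d)     # Sensitivity / Recall
fpr = 1 - stats.norm.cdf(thresholds)            # 1 - Specificity
precision = tpr * base_rate / (tpr * base_rate + fpr * (1 - base_rate))

# Reverse the arrays so the x-values are increasing, then apply the trapezoid rule
roc_auc = trapezoid(tpr[::-1], fpr[::-1])
pr_auc = trapezoid(precision[::-1], tpr[::-1])
print(round(roc_auc, 3), round(pr_auc, 3))
```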
Decision Curve Analysis (DCA)
Decision Curve Analysis (Vickers & Elkin, 2006) evaluates the clinical utility of a predictive model or a single predictor by plotting net benefit across different threshold probabilities. Unlike ROC and PR curves that focus on discrimination, DCA incorporates the relative costs of false positives and false negatives, helping determine whether using a prediction model to guide treatment decisions provides more benefit than simple strategies.
A DCA plot typically includes three key curves:
Predictor Curve: Shows the net benefit of using the predictor at different threshold probabilities. The curve's height indicates how much benefit the predictor provides compared to treating no one.
Treat All: Represents the strategy of treating everyone regardless of their predicted risk. This strategy maximizes sensitivity (no false negatives) but results in many unnecessary treatments (false positives).
Treat None: Represents the strategy of treating no one, which always yields zero net benefit but avoids all treatment-related harms.
The net benefit formula accounts for both the benefits of true positives and the costs of false positives:
\[NB = \frac{TP}{n} - \frac{FP}{n} \times \frac{P_t}{1 - P_t}\]
Where n is the total number of individuals and Pt is the threshold probability - the minimum predicted risk at which a patient would be willing to accept treatment. In practical terms, Pt reflects how many unnecessary interventions one is willing to accept to prevent one adverse outcome. The ratio Pt/(1-Pt) in the net benefit formula captures the relative weighting of false positives compared to true positives. Pt itself can be calculated as:
\[P_t = \frac{C_{FP}}{C_{FP} + C_{FN}}\]
Where CFP is the cost of a false positive (unnecessary treatment) and CFN is the cost of a false negative (missed case). Pt can also be estimated through clinician surveys, patient preferences, or clinical guidelines. The simulator allows you to explore a range of Pt values to identify when your predictor provides meaningful clinical benefit.
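A minimal sketch of the net benefit calculation and of deriving Pt from the two error costs (the counts and costs below are purely illustrative):

```python
def net_benefit(tp, fp, n, p_t):
    """Net benefit at threshold probability p_t: true positives per person, minus
    false positives per person weighted by the odds p_t / (1 - p_t)."""
    return tp / n - (fp / n) * (p_t / (1 - p_t))

def threshold_probability(cost_fp, cost_fn):
    """Threshold probability implied by the relative costs of the two error types."""
    return cost_fp / (cost_fp + cost_fn)

# Accepting up to 9 unnecessary treatments to prevent 1 missed case implies p_t = 0.1
p_t = threshold_probability(cost_fp=1, cost_fn=9)
print(p_t, net_benefit(tp=30, fp=120, n=1000, p_t=p_t))
```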
A-NBC (Area Under the Net Benefit Curve) summarizes the overall clinical utility of a model across a range of threshold probabilities (Talluri & Shete, 2016). It is calculated using trapezoidal integration:
\[A\text{-}NBC = \int_{P_{t_1}}^{P_{t_2}} NB(P_t) \, dP_t\]
Where the integration is performed over a clinically relevant range of threshold probabilities (adjustable using the movable gray bars in the simulator). A larger A-NBC indicates greater overall clinical utility within that range.
ΔA-NBC (Delta A-NBC) represents the additional clinical utility that a predictive model provides compared to the best available simple strategy (either treating all patients or treating none). Rather than comparing only against treating no one, ΔA-NBC is calculated as:
\[\Delta A\text{-}NBC = \int_{P_{t_1}}^{P_{t_2}} \left[NB_{\text{model}}(P_t) - \max\left(NB_{\text{treat all}}(P_t),\, NB_{\text{treat none}}(P_t)\right)\right] \, dP_t\]
At each threshold probability, the model's net benefit is compared against whichever simple strategy performs better at that threshold. This provides a more conservative and clinically meaningful assessment of the model's added value. A positive ΔA-NBC indicates that the predictive model offers genuine improvement over the best simple strategy, while values near zero suggest that simple strategies may be equally effective.
Note: For computational simplicity, the ΔA-NBC calculation in this simulator assumes a uniform distribution of threshold probabilities across the selected range. In practice, the distribution of clinically relevant threshold probabilities may be non-uniform, which could affect the relative weighting of different threshold ranges in the overall utility assessment (Talluri & Shete, 2016).
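A minimal sketch of the ΔA-NBC calculation under the uniform weighting noted above. For the model curve, it assumes a calibrated predictor built on two equal-variance normal distributions (an assumption made here for illustration; the simulator's own computation may differ in its details):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def nb_model_binormal(p_t, d, base_rate):
    """Net benefit of a calibrated binormal risk model (negatives ~ N(0,1),
    positives ~ N(d,1)): treat whenever the posterior risk exceeds p_t."""
    odds = p_t / (1 - p_t)
    # Predictor value at which the posterior risk equals p_t
    x_star = (np.log(odds * (1 - base_rate) / base_rate) + d**2 / 2) / d
    tp = base_rate * (1 - stats.norm.cdf(x_star - d))    # true positives per person
    fp = (1 - base_rate) * (1 - stats.norm.cdf(x_star))  # false positives per person
    return tp - fp * odds

def delta_anbc(p_t, nb_model, base_rate):
    """ΔA-NBC: the model's net benefit minus that of the better simple strategy
    (treat all vs. treat none, whose net benefit is 0) at each threshold
    probability, integrated with the trapezoid rule over the selected range."""
    nb_treat_all = base_rate - (1 - base_rate) * p_t / (1 - p_t)
    nb_best_simple = np.maximum(nb_treat_all, 0.0)
    return trapezoid(nb_model - nb_best_simple, p_t)

p_t = np.linspace(0.05, 0.30, 101)
print(delta_anbc(p_t, nb_model_binormal(p_t, d=0.8, base_rate=0.08), base_rate=0.08))
```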
DCA is particularly valuable because it:
Incorporates clinical context through threshold probabilities that reflect real decision-making scenarios
Accounts for the relative costs of different types of errors (false positives vs. false negatives)
Provides actionable insights about when a model should or should not be used clinically
Facilitates comparison between different models or strategies across various clinical contexts
To further clarify how the tool can be used and to demonstrate its utility, we provide some specific examples below.
Example 1: Diagnostic Prediction
Can we predict depression diagnosis using a cognitive biomarker?
In the Binary outcome mode:
Set base rate to 8% (the prevalence of depression in the population; Shorey et al., 2022)
Set the grouping reliability to 0.28 (depression diagnosis reliability based on DSM-5 field trials; Regier et al., 2013)
Set the predictor reliability for both groups to 0.6 (an average reliability for cognitive measures; Karvelis et al., 2023)
Set the observed effect size to d = 0.8 (a large effect size that is optimistic and rarely seen in practice)
This will yield ROC-AUC = 0.71 and PR-AUC = 0.19. Even with the optimistic effect size of 0.8, the predictive utility remains modest, particularly in balancing recall and precision (as shown by the low PR-AUC). The DCA plot also suggests limited clinical utility, with predictor-based diagnosis showing minimal improvement over diagnosing no one / everyone within the 5% to 30% threshold probability range (ΔA-NBC = 0.001).
Note that with the low reliability values, this observed effect corresponds to a much larger true effect, d = 1.58, and much better predictive utility, ROC-AUC = 0.87, PR-AUC = 0.46, and ΔA-NBC = 0.006. This demonstrates the importance of measurement reliability in predictive modeling.
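For two equal-variance normal distributions, ROC-AUC relates to Cohen's d as AUC = Φ(d/√2), which makes numbers like these easy to check (a quick sketch under that assumption):

```python
from scipy import stats

def d_to_auc(d):
    """ROC-AUC for two equal-variance normal distributions separated by Cohen's d."""
    return stats.norm.cdf(d / 2**0.5)

print(round(d_to_auc(0.8), 2))   # ≈ 0.71 (observed effect)
print(round(d_to_auc(1.58), 2))  # ≈ 0.87 (true effect)
```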
Now let's say we are serious about precision psychiatry and we want to achieve a PR-AUC of 0.8. Using the tool, we can find that it would require d = 2.55. It would be rather unrealistic to expect a single biomarker to achieve this effect size. Using the Mahalanobis D calculator, you can explore how many predictors with smaller d values would be required to achieve D = 2.55.
Example 2: Treatment Response Prediction
Can we predict who will respond to antidepressant treatment using task-based brain activity measures?
In the Continuous outcome mode:
Set base rate to 15% (the rate of response to antidepressant treatment beyond placebo; Stone et al., 2022)
Set predictor reliability to 0.4 (an average reliability for task-fMRI measures; Elliott et al., 2020)
Set outcome reliability to 0.94 (Hamilton Depression Rating Scale (HAMD) reliability; Trajković et al., 2011)
Adjust effect size such that R² = 0.2 (average multivariate R² from recent research; Karvelis et al., 2022)
This will yield AUC = 0.73 and PR-AUC = 0.32, indicating rather modest predictive performance, particularly in terms of the low PR-AUC. The clinical utility would also be rather limited, with ΔA-NBC = 0.004 within the 5%-30% threshold probability range, as indicated by the DCA plot. Improving measurement reliability could improve the performance up to AUC = 0.87 and PR-AUC = 0.57, and the clinical utility up to ΔA-NBC = 0.014 - a substantial improvement.
If we once again want to be serious about precision psychiatry and aim for a PR-AUC of 0.8, we will find that it requires r = 0.9 (R² = 0.81) - an extremely ambitious value, requiring the model to explain 81% of the variance in symptom improvement. This helps demonstrate the inherent limitations of dichotomizing continuous outcomes for evaluating treatment response prediction - doing so discards valuable information. On the other hand, it does reflect the binary nature of decision-making in psychiatry (to prescribe the treatment or not).
Example 3: Risk Prediction
Can we predict who will attempt suicide using a combination of risk factors?
Here we will rely mostly on Borges et al., 2011, who analyzed data from 108,705 adults across 21 countries and, using a logistic regression model, found that combining a range of risk factors (socio-demographics, parent psychopathology, childhood adversities, DSM-IV disorders, and history of suicidal behavior) yielded a discrimination performance of AUC = 0.74-0.80 between those who did and did not attempt suicide.
In the Binary outcome mode:
Set base rate to 0.3% (the 12-month prevalence of suicide attempts; Borges et al., 2011)
Set the predictor reliability for both groups to 0.8 (a rough estimate for the average reliability across the risk factors)
Set the observed effect size to d = 1.18, which corresponds to AUC = 0.8
This will yield PR-AUC = 0.02, indicating abysmal predictive performance: depending on where we place the threshold, we would either get a lot of false positives (low precision, high recall) or a lot of false negatives (high precision, low recall). When the threshold is set to maximize the F1 score (0.06), we get precision = 0.06 and recall = 0.06, meaning that we would miss 94% of actual attempters and that 94% of predicted attempters would not attempt suicide. Not surprisingly, this translates into extremely low clinical utility (ΔA-NBC = 0.000).
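The precision-recall trade-off described here can be illustrated under the same equal-variance normal assumption used in the sketches above (a rough illustration; the simulator's exact numbers may differ slightly):

```python
import numpy as np
from scipy import stats

d, base_rate = 1.18, 0.003

def precision_recall(threshold):
    """Precision and Recall for equal-variance normal groups (negatives ~ N(0,1),
    positives ~ N(d,1)) at a given decision threshold."""
    recall = 1 - stats.norm.cdf(threshold, loc=d)
    fpr = 1 - stats.norm.cdf(threshold)
    precision = recall * base_rate / (recall * base_rate + fpr * (1 - base_rate))
    return precision, recall

# Moving the threshold only trades one failure mode for the other:
for t in (1.0, 2.0, 3.0):
    p, r = precision_recall(t)
    print(t, round(p, 3), round(r, 3))
```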
If we are serious about precision psychiatry and aim for a PR-AUC of 0.8, we will find that it requires an observed d = 3.76 - in other words, we have a very long way to go. Note that because the reliability is already quite high, improving it further would not make much of a difference; what we need are better predictors. Alternatively, it may be more effective to simply focus on universal suicide prevention strategies rather than trying to predict individual cases (e.g., Large, 2018).
Feedback and Contributions
E2P Simulator is an open-source project - feedback, bug reports, and suggestions for improvement are welcome. The easiest way to do so is through the GitHub Issues page.
You can view the source code, track development, and contribute directly at the project's GitHub repository.
For other inquiries, you can find my contact information here.