Welcome to the E2P Simulator! This guide will help you understand how to use this tool to explore
the relationship between effect sizes and their predictive utility.
What is the E2P Simulator?
The E2P Simulator (Effect-to-Prediction Simulator) allows researchers to interactively and quantitatively explore the relationship between effect sizes (e.g., Cohen's d, Pearson's r) and their predictive utility (e.g., AUC, PR-AUC, MCC) while accounting for real-world factors like measurement reliability and outcome base rates.
In other words, the E2P Simulator is a tool for performing predictive utility analysis - estimating how effect sizes will translate into real-world predictive performance, or what effect sizes are needed to achieve a desired level of predictive performance. Much like power analysis tools (such as G*Power) help researchers plan for statistical significance, the E2P Simulator helps them plan for practical significance.
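To make the effect-to-prediction idea concrete, here is a minimal sketch (not the simulator's own code) of the best-known conversion in the binary case: under an equal-variance binormal model, a Cohen's d corresponds to an expected AUC of Φ(d/√2).

```python
# Minimal sketch: converting Cohen's d to the AUC expected under the
# standard equal-variance binormal model (two normal distributions shifted by d).
import numpy as np
from scipy.stats import norm

def d_to_auc(d):
    """AUC = Phi(d / sqrt(2)) for two equal-variance normal distributions."""
    return norm.cdf(d / np.sqrt(2))

for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f}  ->  AUC = {d_to_auc(d):.2f}")
# d = 0.2 -> AUC ~ 0.56; d = 0.5 -> AUC ~ 0.64; d = 0.8 -> AUC ~ 0.71
```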
The E2P Simulator has several key applications:
Interpretation of findings: It helps researchers move beyond arbitrary "small/medium/large" effect size labels and ground the interpretation of their findings in terms of predictive value in specific contexts.
Research planning: Being able to easily derive what effect sizes are needed to achieve the desired predictive performance allows researchers to plan their studies more effectively and allocate resources more efficiently.
Education: The simulator's interactive design makes it a valuable teaching tool, helping students and researchers develop a more intuitive understanding of how different abstract statistical metrics relate to one another and to real-world utility.
Why is the E2P Simulator needed?
Many research areas, such as the biomedical, behavioral, educational, and sports sciences, increasingly study individual differences in order to build predictive models that personalize treatments, learning, and training. Yet several entrenched research practices continue to undermine these efforts:
Overemphasis on statistical significance: Most research continues to optimize for statistical significance (p-values) without optimizing for practical significance (effect sizes).
Difficulties interpreting effect sizes: Effect sizes, which are critical for gauging real-world utility, are often reduced to arbitrary cutoffs (small/medium/large) without providing any clear sense of their practical utility.
Neglected measurement reliability: Measurement noise attenuates observed effect sizes and weakens predictive power, yet it is rarely incorporated into study design or interpretation.
Overlooked outcome base rates: Low-prevalence events drastically limit predictive performance in real-world settings, but are often disregarded when interpreting findings.
Collectively, these issues undermine the quality and impact of academic research, resulting in an abundance of statistically significant but practically negligible findings. Focusing on the statistical significance of their work, researchers develop unrealistic expectations about its potential impact, leading to inefficient study planning, resource misallocation, and considerable waste of time and funding (e.g., investing in large datasets or complex predictive models that are guaranteed to underperform due to the small effect sizes of the predictors). Ultimately, this creates a disconnect between academic inquiry and real-world practice, as years of scientific effort fail to yield meaningful practical benefits.
The E2P Simulator helps address these fundamental challenges by placing effect sizes, measurement reliability, and outcome base rates at the center of study planning and interpretation. The E2P Simulator uses interactive simulations to highlight how these factors jointly shape predictive utility - ultimately guiding researchers toward more meaningful and impactful results in both academic and real-world contexts.
How to use the E2P Simulator
The E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting parameters like effect sizes, measurement reliability, base rates, and decision threshold, and immediately see how these changes impact predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.
Binary vs. Continuous Outcomes
The E2P Simulator provides two analysis modes that cover the two most common research scenarios:
Binary Mode: Use this mode for analyzing dichotomous outcomes like diagnostic categories (e.g., cases vs. controls) or discrete states (e.g., success vs. failure). All metric calculations and conversions in this mode are completely analytical and follow the formulas provided on the page.
Continuous Mode: Use this mode for analyzing continuous measurements like symptom severity or performance scores that may need to be categorized (e.g., responders vs. non-responders or performers vs. non-performers) for practical decisions. This mode is based on actual data simulations, and the effects on all metrics are calculated from the simulated data itself.
Measurement Reliability and True vs. Observed Effects
Imperfect measurement reliability attenuates observed effect sizes and, in turn, predictive performance. The simulator allows you to specify reliability using appropriate metrics - the Intraclass Correlation Coefficient (ICC) for continuous measurements and Cohen's kappa (κ) for binary classifications. These typically correspond to test-retest reliability and inter-rater reliability, respectively. You can toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability).
Note that the simulator does not account for sample size limitations, which can introduce additional uncertainty around the true effect size through sampling error.
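For continuous measures, this attenuation follows classical test theory: the observed effect equals the true effect scaled by the square root of the reliability (and, for correlations, by the square root of the product of both variables' reliabilities). The sketch below illustrates these standard formulas only; the simulator's handling of binary-label reliability (κ) is more involved and is described on the tool's page.

```python
# Illustrative sketch of standard attenuation formulas from classical test theory.
import numpy as np

def attenuate_d(d_true, icc_predictor):
    # Observed d = true d * sqrt(reliability of the predictor)
    return d_true * np.sqrt(icc_predictor)

def attenuate_r(r_true, rel_x, rel_y):
    # Spearman's attenuation formula for correlations
    return r_true * np.sqrt(rel_x * rel_y)

print(attenuate_d(1.0, 0.6))        # true d = 1.0 is observed as ~0.77 with ICC = 0.6
print(attenuate_r(0.5, 0.6, 0.94))  # true r = 0.5 is observed as ~0.38
```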
Base Rates
The base rate refers to the prevalence of the outcome in the studied population (rather than the sample used to calculate the effect size). It is important to make sure that the effect size in question and the base rate correspond to the same population. For example, if the effect size characterizes the differences between some disease/disorder group and a healthy control group, the relevant base rate is the prevalence of the disease/disorder in the general population. In a different scenario, if we have identified a high-risk group, and want to use the disease/disorder prevalence within this group, then we need to know the effect size comparing disease/disorder vs. the rest of the high-risk group, as it may not be the same as when comparing with healthy controls.
The base rate primarily affects the tradeoff between precision and recall, which is reflected in PR-AUC, MCC, and the F1 score (see Understanding Predictive Metrics).
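To see why, note that precision depends directly on the base rate. A short illustrative sketch (with made-up sensitivity and specificity values, not taken from the simulator):

```python
def precision(sensitivity, specificity, base_rate):
    # PPV = TP / (TP + FP), expressed via sensitivity, specificity, and prevalence
    tp = sensitivity * base_rate
    fp = (1 - specificity) * (1 - base_rate)
    return tp / (tp + fp)

# The same classifier (80% sensitivity, 80% specificity) at different base rates:
for br in (0.50, 0.10, 0.01):
    print(f"base rate {br:.0%}: precision = {precision(0.8, 0.8, br):.2f}")
# 50% -> 0.80, 10% -> 0.31, 1% -> 0.04
```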
The Calculators for Multivariate Effects
Both binary and continuous outcomes analysis modes include calculators that help explore how multiple predictors can be combined to achieve stronger effects:
Mahalanobis D Calculator (Binary mode)
Multivariate R² Calculator (Continuous mode)
The calculators can approximate the expected performance of multivariate models without having to train the full models, which helps with research planning and model development (see the sketch after this list). More specifically, they illustrate:
How the number of predictors affects combined effect size
The diminishing returns of adding more predictors
How collinearity (correlation among predictors) reduces their combined effectiveness
The trade-offs between using fewer strong predictors versus more moderate ones
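Under the simplifying assumptions listed in the next subsection (a single average effect size and a single average inter-predictor correlation), the combined effects have simple closed forms. The sketch below uses these standard equicorrelation formulas; it is meant to reproduce the qualitative patterns the calculators illustrate, not their exact implementation.

```python
# Equicorrelation approximations for combining k predictors (illustrative sketch).
import numpy as np

def mahalanobis_D(d, k, rho):
    # Combined Mahalanobis D for k predictors, each with Cohen's d,
    # assuming a common pairwise correlation rho among predictors.
    return d * np.sqrt(k / (1 + (k - 1) * rho))

def multivariate_R2(r, k, rho):
    # Combined R^2 for k predictors, each correlating r with the outcome,
    # assuming a common pairwise correlation rho among predictors.
    return k * r**2 / (1 + (k - 1) * rho)

# Diminishing returns and the cost of collinearity (d = 0.5, r = 0.2, rho = 0.3):
for k in (1, 2, 5, 10):
    print(k, round(mahalanobis_D(0.5, k, 0.3), 2), round(multivariate_R2(0.2, k, 0.3), 2))
```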
Calculator Assumptions and Limitations
Both calculators correspond to fundamental predictive models in statistics: the Mahalanobis D Calculator approximates Linear Discriminant Analysis and logistic regression, while the Multivariate R² Calculator aligns with multiple linear regression.
Like the statistical methods they approximate, the calculators operate under several simplifying assumptions:
Average effects and correlations: The calculators use single values to represent the average effect size across predictors and average correlation among them, which can provide useful approximations even when individual predictors vary in strength
Linear effects: The formulas assume predictors contribute additively without interactions (where one predictor's effect depends on another). However, research shows that in clinical prediction, complex non-linear models generally do not outperform simple linear logistic regression (Christodoulou et al., 2019)
Normality: Variables are assumed to be normally distributed
Despite these limitations, these calculators serve as valuable tools for building intuition about how multiple predictors combine to achieve stronger effects. Even though real-world predictors will vary in their individual strengths and collinearity, the overall patterns demonstrated (such as diminishing returns and the impact of shared variance) remain informative for understanding multivariate relationships.
Quick Start Examples
To further clarify how the tool can be used and to demonstrate its utility, we provide some specific examples.
Example 1: Predicting Depression Diagnosis
Can we predict depression diagnosis using a cognitive biomarker?
In the Binary outcome mode:
Set base rate to 8% (the prevalence of depression in the population; Shorey et al., 2021)
Set the grouping reliability to 0.28 (depression diagnosis reliability based on DSM-5 field trials; Regier et al., 2013)
Set the predictor reliability for both groups to 0.6 (an average reliability for cognitive measures; Karvelis et al., 2023)
Set the observed effect size to d = 0.8 (a large effect size that is optimistic and rarely seen in practice)
This will yield AUC = 0.71 and PR-AUC = 0.19. So, even with the optimistic effect size of 0.8, the predictive utility remains very modest, especially when it comes to the tradeoff between recall and precision (as shown by the low PR-AUC).
Note that, given the low reliability values, this observed effect corresponds to a much larger true effect, d = 1.58, which, while not very realistic in practice, highlights how much information is lost due to imperfect measurement reliability.
Now let's say we are serious about precision psychiatry and we want to achieve a PR-AUC of 0.8. Using the tool, we can find that it would require d = 2.55. It would be totally unrealistic to expect a single biomarker to achieve this effect size. Using the Mahalanobis D calculator, you can explore how many predictors with smaller d values would be required to achieve D = 2.55.
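If you want to check these numbers outside the tool, a rough numerical sketch under the same equal-variance binormal assumptions (observed d = 0.8, base rate 8%) reproduces them; the back-calculation of the true effect from the ICC and κ values is left to the simulator itself.

```python
# Illustrative re-derivation of Example 1's observed-effect metrics (binormal model).
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

d, base_rate = 0.8, 0.08

# ROC AUC for two equal-variance normal distributions separated by d
auc = norm.cdf(d / np.sqrt(2))

# PR-AUC: sweep the decision threshold, compute recall and precision, integrate
thresholds = np.linspace(-6, 8, 5000)
recall = norm.sf(thresholds, loc=d)      # sensitivity (true positive rate)
fpr = norm.sf(thresholds, loc=0)         # false positive rate
precision = recall * base_rate / (recall * base_rate + fpr * (1 - base_rate))
pr_auc = -trapezoid(precision, recall)   # recall decreases as the threshold rises

print(round(auc, 2), round(pr_auc, 2))   # close to the AUC = 0.71 and PR-AUC = 0.19 above
```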
Example 2: Predicting Antidepressant Response
Can we predict who will respond to antidepressant treatment using a cognitive biomarker?
Select Continuous outcome mode:
Set base rate to 15% (the rate of response to antidepressant treatment beyond placebo; Stone et al., 2022)
Set predictor reliability to 0.6 (an average reliability for cognitive measures; Karvelis et al., 2023)
Set outcome reliability to 0.94 (Hamilton Depression Rating Scale (HAMD) reliability; Trajković et al., 2011)
Adjust effect size such that R² = 0.2 (average multivariate R² from recent research; Karvelis et al., 2022)
This will yield AUC = 0.73 and PR-AUC = 0.32, indicating rather modest predictive performance, as shown by the low PR-AUC. Note that the limiting factor in this scenario is not so much the reliability but the effect size.
If we once again want to be serious about precision psychiatry and aim for a PR-AUC of 0.8, we will find that it requires r = 0.9 (R² = 0.81). These are extremely ambitious values (requiring the model to explain 81% of the variance in symptom improvement). Among other things, this helps demonstrate the inherent limitations of dichotomizing continuous outcomes for assessing treatment response prediction - doing so discards valuable information and misrepresents the actual predictive power of the model.
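As a rough cross-check, the continuous scenario can be simulated directly. The sketch below (an illustration, not the simulator's code) works from the observed R² only, ignoring the reliability settings: it draws a predictor and outcome correlated at r = √0.2 ≈ 0.45, labels the top 15% of outcomes as responders, and scores the predictor with scikit-learn. Small differences from the simulator's values are expected.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, r, base_rate = 1_000_000, np.sqrt(0.2), 0.15

# Bivariate normal predictor (x) and continuous outcome (y) with correlation r
x = rng.standard_normal(n)
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

# Dichotomize the continuous outcome: top 15% are labeled "responders"
responder = (y >= np.quantile(y, 1 - base_rate)).astype(int)

print(round(roc_auc_score(responder, x), 2))            # should land near AUC ~ 0.73
print(round(average_precision_score(responder, x), 2))  # PR-AUC approximation, near ~0.3
```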
Understanding Predictive Metrics
Classification Outcomes and Metrics
When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.
The image above illustrates how these four outcomes relate to various classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you'll find the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., sensitivity/recall/TPR, precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.
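These definitions can also be written compactly in code. A small illustrative sketch of the key threshold-dependent metrics computed from the four counts:

```python
# Standard confusion-matrix metrics (names as used across fields); illustrative only.
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)          # recall / TPR
    specificity = tn / (tn + fp)          # TNR
    precision   = tp / (tp + fp)          # PPV
    npv         = tn / (tn + fn)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    mcc         = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, npv=npv, f1=f1, mcc=mcc)

print(classification_metrics(tp=40, fp=60, fn=10, tn=890))  # made-up counts
```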
Key Metrics for Different Contexts
Depending on your research context, certain metrics may be more relevant than others:
When false negatives are costly (e.g., missing a disease diagnosis): Focus on sensitivity/recall. In clinical settings where missing a diagnosis could be life-threatening, maximizing sensitivity ensures fewer cases are missed, even if it means more false alarms.
When false positives are costly (e.g., unnecessary treatments): Focus on specificity. When treatments have significant side effects or costs, high specificity ensures fewer healthy individuals receive unnecessary interventions.
When dealing with a low base rate: Focus on precision (PPV), NPV, F1 score, MCC, and PR-AUC. These metrics are sensitive to the base rate and thus provide a more accurate assessment of the model's performance in the real world.
Note that all of these metrics are threshold-dependent, meaning that their values depend on the specific threshold used to classify cases.
Threshold-Independent Metrics
Some metrics evaluate performance across all possible thresholds and, as such, can serve as a better summary of the overall model performance. These include:
AUC (Area Under the ROC Curve): Summarizes how well a model balances true positives (sensitivity) and false positives (1 - specificity) across all possible thresholds, capturing the trade-off between the two. An AUC of 0.5 means the model is no better than flipping a coin, while 1.0 means perfect separation between groups.
PR-AUC (Area Under the Precision-Recall Curve): Shows how well a model can maintain both precision (PPV) and recall (sensitivity) together, which is particularly informative in the context of rare outcomes, as the base rate affects precision. A larger PR-AUC means the model can achieve high precision without sacrificing recall (or vice versa), indicating a smaller trade-off between finding all positive cases and avoiding false alarms.
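Both curves are obtained by sweeping the decision threshold across the full range of predictor scores. A brief illustrative sketch (again not the simulator's own code) that traces both curves for simulated scores, using made-up numbers (d = 0.8, 10% base rate):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

rng = np.random.default_rng(1)
# Simulated scores: positives shifted by d = 0.8, base rate 10% (illustrative numbers)
n_pos, n_neg = 10_000, 90_000
scores = np.concatenate([rng.normal(0.8, 1, n_pos), rng.normal(0.0, 1, n_neg)])
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

fpr, tpr, _ = roc_curve(labels, scores)                 # sweeps all thresholds
precision, recall, _ = precision_recall_curve(labels, scores)

print(round(auc(fpr, tpr), 2))           # ROC AUC ~ 0.71, unaffected by the base rate
print(round(auc(recall, precision), 2))  # PR-AUC: much lower, since only 10% are positives
```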