Welcome to E2P Simulator! This guide will help you understand what it does,
why it is needed, and how to use it.
E2P Simulator (Effect-to-Prediction Simulator) allows researchers to interactively and quantitatively explore the relationship between effect sizes (e.g., Cohen's d, Odds Ratio, Pearson's r), the corresponding predictive performance (e.g., ROC-AUC, Sensitivity, Specificity, Accuracy, etc.), and real-world predictive utility (e.g., PPV, NPV, PR-AUC, MCC, Net Benefit, etc.) by accounting for measurement reliability and outcome base rates.
In other words, E2P Simulator is a tool for performing predictive utility analysis: estimating how research findings will translate into real-world prediction, or what effect sizes and predictive performance are needed to achieve a desired level of predictive and clinical utility. Much like how power analysis tools (such as G*Power) help researchers plan for statistical significance, E2P Simulator helps plan for practical significance.
E2P Simulator has several key applications:
Many research areas, such as the biomedical, behavioral, educational, and sports sciences, are increasingly studying individual differences to build predictive models that personalize treatments, learning, and training. Identifying reliable biomarkers and other predictors is central to these efforts. Yet several entrenched research practices continue to undermine the search for predictors:
Together, these issues undermine the quality and impact of academic research, because routinely reported metrics do not reflect real-world utility. Whether researchers focus on achieving statistical significance of individual predictors or optimizing model performance metrics like accuracy and ROC-AUC, both approaches often lead to unrealistic expectations about practical impact. In turn, this results in inefficient study planning, resource misallocation, and considerable waste of time and funding.
E2P Simulator is designed to address these fundamental challenges by placing measurement reliability and outcome base rates at the center of study planning and interpretation. It helps researchers understand how these factors jointly shape real-world predictive utility, and guides them in making more informed research decisions.
E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting effect sizes, measurement reliability, base rates, and the decision threshold, and immediately see how these changes affect predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.
The image above provides an overview of all of E2P Simulator's interactive components.
E2P Simulator provides two analysis modes that cover the two most common research scenarios:
Measurement reliability attenuates observed effect sizes, which in turn reduces predictive performance. The simulator allows you to specify reliability using the Intraclass Correlation Coefficient (ICC) for continuous measurements and Cohen's kappa (κ) for binary classifications. These typically correspond to test-retest reliability and inter-rater reliability, respectively. You can toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability); see Karvelis & Diaconescu, 2025 for more details on how reliability attenuates individual and group differences.
Note that the simulator does not account for sample size limitations, which can introduce additional uncertainty around the true effect size through sampling error.
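The attenuation described above follows the standard psychometric formulas (observed d shrinks by √ICC; observed r shrinks by the square root of each variable's reliability). Below is a minimal sketch of that relationship, offered for intuition rather than as the simulator's actual code:

```python
import numpy as np

def attenuated_d(d_true, icc):
    """Observed Cohen's d for a continuous predictor with test-retest reliability ICC.
    Measurement error inflates the observed SD by 1/sqrt(ICC), so d shrinks by sqrt(ICC)."""
    return d_true * np.sqrt(icc)

def attenuated_r(r_true, rel_x, rel_y=1.0):
    """Observed Pearson's r given the reliabilities of both variables
    (Spearman's attenuation formula)."""
    return r_true * np.sqrt(rel_x * rel_y)

print(attenuated_d(0.8, 0.6))       # a "true" d of 0.8 becomes ~0.62 at ICC = 0.6
print(attenuated_r(0.5, 0.7, 0.8))  # a "true" r of 0.5 becomes ~0.37
```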
Base rate (or prevalence) refers to the proportion of individuals in the population who have the outcome of interest before considering any predictors or test results (in Bayesian terms, this is the prior probability of the outcome). To estimate real-world predictive utility, the base rate should be set to reflect the population where your predictor or model will actually be used, not the composition of your study sample. This distinction is crucial because research studies often use case-control designs with balanced sampling (e.g., 50% cases, 50% controls) that do not reflect real-world prevalence. This is one of the most commonly overlooked problems in evaluating prediction models (Brabec et al., 2020), as the base rate directly affects multiple metrics used for model evaluation (see Understanding Predictive Metrics).
For instance, if you are developing a model for a rare disorder that affects 2% of the general population, the base rate should be set to 2%, even if your training dataset contains equal numbers of cases and controls. However, if your model will be used in a pre-screened high-risk population where the disorder prevalence is 20%, then 20% becomes the relevant base rate (however, in this scenario, the effect size should also reflect the difference between cases and high-risk controls rather than general population controls).
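To build intuition for why this matters, the effect of the base rate on PPV follows directly from Bayes' rule. Here is a standalone sketch (the sensitivity and specificity values are made up for illustration):

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value (precision) via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# The same test (80% sensitivity, 80% specificity) applied at different base rates:
for p in (0.50, 0.20, 0.02):
    print(f"base rate {p:.0%}: PPV = {ppv(0.80, 0.80, p):.2f}")
# 50% -> 0.80, 20% -> 0.50, 2% -> 0.08
```

The identical test that looks excellent in a balanced case-control sample yields mostly false positives when deployed at a 2% prevalence.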
Both the binary and continuous outcome analysis modes include multivariable simulators that help estimate how many predictors need to be combined to achieve a desired level of real-world predictive utility. The main metric for this is PR-AUC, as it accounts for the base rate and is threshold-independent. For binary classification, the simulator also displays Mahalanobis D, a multivariate generalization of Cohen's d, and for continuous outcomes, it displays the total variance explained, R².
The multivariable simulators can approximate the expected performance of multivariate models without having to train the full models, which aids research planning and model development. They also help build intuition about how multicollinearity undermines predictive performance and leads to diminishing returns as more predictors are added.
The multivariable simulators are based on several simplifying assumptions:
Even though real-world predictors will often not be normally distributed and will vary in their individual strengths and collinearity, the general trends (such as diminishing returns and the impact of shared variance among the predictors) remain informative for understanding multivariate relationships and estimating expected model performance.
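Under these simplifying assumptions, the combined multivariate effect can be written in closed form: for k standardized predictors that each separate the groups by Cohen's d and share a common pairwise correlation ρ, the Mahalanobis D equals d·√(k / (1 + (k − 1)ρ)). The short sketch below illustrates the resulting diminishing returns; it is an illustration of that math, not the simulator's own code:

```python
import numpy as np

def mahalanobis_D(d_per_predictor, n_predictors, rho):
    """Combined effect size (Mahalanobis D) for n equally strong,
    equally intercorrelated, standardized predictors."""
    k = n_predictors
    return d_per_predictor * np.sqrt(k / (1 + (k - 1) * rho))

# Adding predictors with d = 0.5 each: independent (rho = 0) vs. correlated (rho = 0.3)
for k in (1, 2, 5, 10, 20):
    print(k, round(mahalanobis_D(0.5, k, 0.0), 2), round(mahalanobis_D(0.5, k, 0.3), 2))
```

With ρ = 0.3, going from 10 to 20 predictors barely moves D (about 0.82 to 0.86), whereas independent predictors would keep adding value; this is the diminishing-returns pattern the simulators visualize.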
When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.
The image above illustrates how these four outcomes relate to various classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you'll find the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., Sensitivity/Recall/TPR, Precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.
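For reference, all of the threshold-dependent metrics in the figure can be computed directly from the four counts. A minimal sketch using the standard definitions:

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Threshold-dependent metrics derived from a confusion matrix."""
    sensitivity = tp / (tp + fn)              # Recall / TPR
    specificity = tn / (tn + fp)              # TNR
    ppv = tp / (tp + fp)                      # Precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    balanced_acc = (sensitivity + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(sensitivity=sensitivity, specificity=specificity, ppv=ppv, npv=npv,
                accuracy=accuracy, balanced_accuracy=balanced_acc, mcc=mcc)

# Example: 10,000 people, 2% base rate, 80% sensitivity, 80% specificity
print(classification_metrics(tp=160, fp=1960, fn=40, tn=7840))
```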
Some metrics evaluate performance across all possible thresholds and can serve as a better summary of the overall model performance. These include:
Both ROC-AUC and PR-AUC represent areas under their respective curves and are mathematically expressed as integrals:

$$\text{ROC-AUC} = \int_0^1 \text{TPR}\; d(\text{FPR}) \qquad\qquad \text{PR-AUC} = \int_0^1 \text{Precision}\; d(\text{Recall})$$
These integrals are computed using trapezoidal numerical integration.
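Conceptually, the computation looks like the sketch below: sweep all thresholds, trace out the two curves, and sum the trapezoid areas. This is a simplified illustration; the simulator's actual implementation may differ in details such as tie handling.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: sum the areas of trapezoids between consecutive points."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

def roc_pr_auc(scores, labels):
    """ROC-AUC and PR-AUC from predicted scores and binary labels (1 = positive)."""
    order = np.argsort(-np.asarray(scores))        # sort by decreasing score
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                        # true positives at each cut-off
    fps = np.cumsum(1 - labels)                    # false positives at each cut-off
    n_pos, n_neg = tps[-1], fps[-1]

    tpr = np.concatenate(([0.0], tps / n_pos))     # sensitivity / recall
    fpr = np.concatenate(([0.0], fps / n_neg))
    precision = np.concatenate(([1.0], tps / (tps + fps)))

    return trapezoid(tpr, fpr), trapezoid(precision, tpr)

# Simulated example: d = 1 separation, 20% base rate
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.20).astype(int)
x = rng.normal(loc=y.astype(float), scale=1.0)     # cases shifted up by d = 1
print(roc_pr_auc(x, y))                            # ROC-AUC ~0.76; PR-AUC is lower
```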
Decision Curve Analysis (Vickers & Elkin, 2006) evaluates the clinical utility of a predictive model or a single predictor by explicitly balancing the costs of false positives against the benefits of true positives.
A DCA plot typically includes three key curves:
The net benefit formula accounts for both the benefits of true positives and the costs of false positives:

$$\text{Net Benefit} = \frac{TP}{N_{total}} - \frac{FP}{N_{total}} \times \frac{p_t}{1 - p_t}$$
Where N_total is the total sample size, and p_t is the threshold probability: the minimum predicted probability at which you would decide to intervene (e.g., diagnose, treat, or refer for further assessment). For example, if p_t = 0.10, you would intervene for anyone with a predicted risk ≥ 10%. In practical terms, p_t represents the trade-off between benefits and harms: it implies you are willing to accept up to (1 − p_t)/p_t (that is, 1/p_t − 1) unnecessary interventions to prevent one adverse outcome. The ratio p_t/(1 − p_t) in the net benefit formula captures this relative weighting of false positives compared to true positives. p_t itself can be calculated as:

$$p_t = \frac{C_{FP}}{C_{FP} + C_{FN}}$$
Where C_FP is the cost of a false positive (unnecessary intervention) and C_FN is the cost of a false negative (missed positive case). p_t can also be estimated through expert surveys, stakeholder preferences, or established guidelines.
For population screening, p_t is typically set low because missing true cases is costlier than unnecessary follow-ups, so more false positives are acceptable. For diagnostic confirmation (e.g., before initiating high-risk treatment), p_t is set higher to avoid false positives, reflecting a preference for specificity. As a rule of thumb, screening scenarios may use p_t in the 1–10% range, whereas diagnostic decisions often warrant much higher p_t (for example, 30–70% or more), depending on harms and preferences.
What we often want to know is not the absolute NB, but the added value. ΔNB (Delta Net Benefit) measures this additional utility by comparing the model against the better of the two simple strategies (either All or None) at a specific threshold probability:

$$\Delta NB = NB_{model} - \max(NB_{all},\, NB_{none})$$

where NB_none = 0 by definition (no interventions means no true positives and no false positives).
At each threshold probability, the model's net benefit is compared against whichever simple strategy performs better at that threshold. This provides a more conservative and meaningful assessment of the model's added value. A positive ΔNB indicates that the predictive model offers genuine improvement over the best simple strategy, while values near zero suggest that simple strategies may be equally effective.
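These quantities are straightforward to compute from a confusion matrix at a given threshold probability. Here is a compact sketch using the formulas above (not the simulator's own code):

```python
def net_benefit(tp, fp, n_total, pt):
    """Net benefit of intervening on model-flagged individuals at threshold probability pt."""
    return tp / n_total - (fp / n_total) * pt / (1 - pt)

def net_benefit_treat_all(base_rate, pt):
    """Net benefit of intervening on everyone (all cases are TP, all non-cases are FP)."""
    return base_rate - (1 - base_rate) * pt / (1 - pt)

def delta_net_benefit(tp, fp, n_total, base_rate, pt):
    """Model's added value over the better of treat-all and treat-none (which has NB = 0)."""
    nb_model = net_benefit(tp, fp, n_total, pt)
    nb_best_simple = max(net_benefit_treat_all(base_rate, pt), 0.0)
    return nb_model - nb_best_simple

# Example: 10,000 people, 2% base rate, 80% sensitivity / 80% specificity,
# evaluated at a screening-type threshold probability of 5%
print(delta_net_benefit(tp=160, fp=1960, n_total=10_000, base_rate=0.02, pt=0.05))
# ~0.0057: positive, so the model adds value over treat-all/treat-none at this threshold
```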
DCA is particularly valuable because it:
For more information about DCA, visit https://mskcc-epi-bio.github.io/decisioncurveanalysis.
Can we predict depression diagnosis using a cognitive biomarker?
Can we predict who will respond to antidepressant treatment using task-based brain activity measures?
Can we predict who will attempt suicide using a combination of risk factors?
Most suicide prediction models (which include a variety of risk factors such as psychopathology, history of suicidal behavior, socio-demographics, etc.) achieve ROC-AUC > 0.7, with a handful exceeding ROC-AUC = 0.9 for predicting suicide attempts (Pigoni et al., 2024). Let's ignore potential overfitting concerns and consider how ROC-AUC = 0.9 would translate to real-world predictive utility.
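To get a feel for this translation, one can back out the implied group separation and then apply a realistic base rate. Under an equal-variance binormal model, ROC-AUC = Φ(d/√2), so an AUC of 0.9 implies d ≈ 1.81. The sketch below then computes PPV at an illustrative decision threshold; the 0.5% base rate is an assumed figure used purely for illustration, not a value taken from the simulator or the cited study:

```python
from statistics import NormalDist

z = NormalDist()                      # standard normal distribution

auc = 0.90
d = 2**0.5 * z.inv_cdf(auc)           # implied Cohen's d under the binormal model, ~1.81

base_rate = 0.005                     # ASSUMED 0.5% base rate, purely illustrative
threshold = d / 2                     # cut-off midway between the two group means
sens = 1 - z.cdf(threshold - d)       # P(case scores above threshold), ~0.82
spec = z.cdf(threshold)               # P(non-case scores below threshold), ~0.82
ppv = sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))
print(round(d, 2), round(ppv, 3))     # ~1.81, ~0.022
```

Under these illustrative assumptions, only about 2% of flagged individuals would actually go on to attempt suicide, despite the seemingly excellent ROC-AUC.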
E2P Simulator includes sample size calculators that implement evidence-based criteria (Riley et al., 2020) to ensure your multivariable prediction models have sufficient data to avoid overfitting and achieve low prediction error. These go beyond simple rules of thumb like "10 events per predictor."
Specify your number of predictors, a realistically expected R² (based on prior research or pilot data), and the outcome prevalence (for binary outcomes only). Use the Cox-Snell R² (R²CS) for binary outcomes and the standard R² for continuous outcomes. Note that for binary outcomes with a single continuous predictor, R²CS equals eta-squared (η²), which is already displayed in the main E2P Simulator dashboard. The final recommendation uses the maximum across all criteria to ensure all performance targets are met.
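As a rough illustration of the kind of criterion involved, here is a sketch of the expected-shrinkage criterion for binary outcomes as it is commonly stated following Riley et al. (2020); treat it as an approximation of the calculators' logic rather than their exact implementation (the full calculators combine several criteria and take the maximum, as noted above):

```python
import math

def n_for_shrinkage(n_predictors, r2_cs, shrinkage=0.9):
    """Minimum sample size so that the expected uniform shrinkage factor is at least
    `shrinkage` (0.9 = at most ~10% overfitting), given the number of candidate
    predictor parameters and the anticipated Cox-Snell R^2 (r2_cs)."""
    p = n_predictors
    return math.ceil(p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))

# Example: 10 candidate predictors, anticipated Cox-Snell R^2 of 0.15
print(n_for_shrinkage(10, 0.15))   # ~549 participants for this criterion alone
```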
The sample size calculators complement the main E2P simulators in study planning: the E2P simulators explore relationships between effect sizes and predictive utility (both what you need for desired performance and what utility to expect from realistic effects), while the sample size calculators determine adequate sample size based on realistic R² estimates from prior research. For sample size planning, always use conservative, realistic R² estimates based on prior research, not idealized target values.
When developing a model with just one predictor (p = 1), the sample size calculations relate directly to traditional power analysis for detecting that predictor's effect. However, the criteria still provide value beyond simple power calculations:
For single predictors, the shrinkage and optimism criteria often require larger sample sizes than traditional power analysis, reflecting the higher standards needed for reliable prediction versus mere statistical significance.
E2P Simulator is an open-source project - feedback, bug reports, and suggestions for improvement are welcome. The easiest way to do so is through the GitHub Issues page.
You can view the source code, track development, and contribute directly at the project's GitHub repository.
For other inquiries, you can find my contact information here.