Welcome to E2P Simulator! This guide will help you understand what it does,
why it is needed, and how to use it.
E2P Simulator (Effect-to-Prediction Simulator) allows researchers to interactively and quantitatively explore the relationship between effect sizes (e.g., Cohen's d, Odds Ratio, Pearson's r), the corresponding predictive performance (e.g., ROC-AUC, Sensitivity, Specificity, Accuracy, etc.), and real-world predictive utility (e.g., PPV, NPV, PR-AUC, MCC, Net Benefit, etc.) by accounting for measurement reliability and outcome base rates.
In other words, E2P Simulator is a tool for performing predictive utility analysis - estimating how research findings will translate into real-world prediction, or, conversely, what effect sizes and predictive performance are needed to achieve a desired level of predictive and clinical utility. Much like how power analysis tools (such as G*Power) help researchers plan for statistical significance, E2P Simulator helps them plan for practical significance.
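To make this concrete, here is a minimal sketch (illustrative Python, not the simulator's own code) of the kind of translation E2P Simulator performs interactively: from a Cohen's d to classification metrics, assuming two equal-variance normal distributions (the standard binormal model). The effect size, threshold, and base rate below are arbitrary.

```python
# Illustrative sketch of translating an effect size into predictive metrics,
# assuming the binormal equal-variance model; all numbers are arbitrary.
from scipy.stats import norm

d = 0.8              # assumed observed effect size (Cohen's d)
base_rate = 0.10     # assumed outcome prevalence in the deployment population
threshold = d / 2    # one possible cut-off: midpoint between the two group means

roc_auc = norm.cdf(d / 2**0.5)            # AUC for two unit-variance normals separated by d
sensitivity = norm.sf(threshold, loc=d)   # cases ~ N(d, 1): P(score > threshold)
specificity = norm.cdf(threshold)         # controls ~ N(0, 1): P(score <= threshold)

# Base-rate-dependent utility: positive predictive value via Bayes' rule
ppv = sensitivity * base_rate / (sensitivity * base_rate + (1 - specificity) * (1 - base_rate))

print(f"ROC-AUC = {roc_auc:.2f}, sensitivity = {sensitivity:.2f}, "
      f"specificity = {specificity:.2f}, PPV = {ppv:.2f}")
# A "large" effect (d = 0.8) gives ROC-AUC ~0.71 but PPV only ~0.17 at a 10% base rate.
```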
E2P Simulator has several key applications:
Many research areas, such as the biomedical, behavioral, educational, and sports sciences, are increasingly studying individual differences to build predictive models that personalize treatments, learning, and training. Identifying reliable biomarkers and other predictors is central to these efforts. Yet, several entrenched research practices continue to undermine the search for predictors:
Together, these issues undermine the quality and impact of academic research, because routinely reported metrics do not reflect real-world utility. Whether researchers focus on achieving statistical significance of individual predictors or optimizing model performance metrics like accuracy and ROC-AUC, both approaches often lead to unrealistic expectations about practical impact. In turn, this results in inefficient study planning, resource misallocation, and considerable waste of time and funding.
E2P Simulator is designed to address these fundamental challenges by placing measurement reliability and outcome base rates at the center of study planning and interpretation. It helps researchers understand how these factors jointly shape real-world predictive utility, and guides them in making more informed research decisions.
E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting effect sizes, measurement reliability, base rates, and the decision threshold, and immediately see how these changes affect predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.
The image above provides an overview of all of E2P Simulator's interactive components.
E2P Simulator provides two analysis modes that cover the two most common research scenarios:
Measurement reliability attenuates observed effect sizes, which in turn reduces predictive performance. The simulator allows you to specify reliability using the Intraclass Correlation Coefficient (ICC) for continuous measurements and Cohen's kappa (κ) for binary classifications. These typically correspond to test-retest reliability and inter-rater reliability, respectively. You can toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability); see Karvelis & Diaconescu, 2025 for more details on how reliability attenuates individual and group differences.
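As a rough guide, the attenuation follows the classical formulas sketched below (a minimal Python illustration of the standard relationships; the simulator's exact implementation may differ in detail).

```python
# A minimal sketch of the standard attenuation relationships (Spearman's attenuation
# formula); the simulator's exact implementation may differ in detail.
def attenuated_r(r_true, reliability_x, reliability_y):
    """Observed correlation given the reliabilities of the two measures."""
    return r_true * (reliability_x * reliability_y) ** 0.5

def attenuated_d(d_true, reliability):
    """Observed Cohen's d when measurement error inflates within-group variance."""
    return d_true * reliability ** 0.5

print(attenuated_r(0.5, reliability_x=0.7, reliability_y=0.8))  # ~0.37 instead of 0.50
print(attenuated_d(0.8, reliability=0.6))                       # ~0.62 instead of 0.80
```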
Note that the simulator does not account for sample size limitations: with small samples, the observed effect size can differ substantially from the true effect size due to sampling error, adding further uncertainty on top of measurement unreliability.
Base rate (or prevalence) refers to the proportion of individuals in the population who have the outcome of interest before considering any predictors or test results (in Bayesian terms, this is the prior probability of the outcome). To estimate real-world predictive utility, the base rate should be set to reflect the population where your predictor or model will actually be used — not the composition of your study sample. This distinction is crucial because research studies often use case-control designs with balanced sampling (e.g., 50% cases, 50% controls) that do not reflect real-world prevalence. This is one of the most commonly overlooked problems in evaluating prediction models (Brabec et al., 2020), as the base rate directly affects multiple metrics used for model evaluation (see Understanding Predictive Metrics).
For instance, if you are developing a model for a rare disorder that affects 2% of the general population, the base rate should be set to 2%, even if your training dataset contains equal numbers of cases and controls. However, if your model will be used in a pre-screened high-risk population where the disorder prevalence is 20%, then 20% becomes the relevant base rate (however, in this scenario, the effect size should also reflect the difference between cases and high-risk controls rather than general population controls).
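The short sketch below illustrates the point with hypothetical numbers (a sensitivity and specificity of 0.80 each, chosen arbitrarily): the same test yields very different PPV at the two base rates discussed above.

```python
# Illustration with hypothetical numbers: identical test performance,
# very different real-world PPV at different base rates.
def ppv(sensitivity, specificity, base_rate):
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

for base_rate in (0.02, 0.20):
    print(f"base rate {base_rate:.0%}: PPV = {ppv(0.80, 0.80, base_rate):.2f}")
# base rate 2%:  PPV ≈ 0.08 (most positive predictions are false alarms)
# base rate 20%: PPV ≈ 0.50
```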
Both the binary and continuous outcome analysis modes include simulators that help estimate how many predictors need to be combined to achieve a desired level of real-world predictive utility. The main metric for this is PR-AUC - it accounts for the base rate and is threshold-independent. For binary classification, the simulator also displays Mahalanobis D, a multivariate generalization of Cohen's d, and for continuous outcomes, it displays the total variance explained (R²).
The multivariable simulators can help approximate the expected performance of multivariate models without having to train the full models, and thus help with research planning and model development. They also help build intuition about how multicollinearity undermines predictive performance and leads to diminishing returns as more predictors are added.
The multivariable simulators are based on several simplifying assumptions:
Even though real-world predictors will often not be normally distributed and will vary in their individual strengths and collinearity, the general trends (such as diminishing returns and the impact of shared variance among the predictors) remain informative for understanding multivariate relationships and estimating expected model performance.
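For intuition about these trends, the textbook special case of equally strong, equally intercorrelated predictors has a closed form for the combined Mahalanobis D, and it makes the diminishing returns explicit. The Python sketch below uses arbitrary values (d = 0.3 per predictor, pairwise correlation 0.3) and is an approximation under exactly those simplifying assumptions, not the simulator's code.

```python
# Textbook special case: k predictors, each with the same standardized effect d
# and the same pairwise correlation rho. All numbers are illustrative.
from scipy.stats import norm

def mahalanobis_D(k, d, rho):
    """Combined effect of k equicorrelated predictors with equal effect sizes."""
    return (k * d**2 / (1 + (k - 1) * rho)) ** 0.5

for k in (1, 2, 4, 8, 16):
    D = mahalanobis_D(k, d=0.3, rho=0.3)
    auc = norm.cdf(D / 2**0.5)  # ROC-AUC implied by the combined effect
    print(f"{k:2d} predictors: D = {D:.2f}, ROC-AUC = {auc:.2f}")
# Diminishing returns: as k grows, D plateaus near d / sqrt(rho) ≈ 0.55,
# because each new predictor mostly repeats variance already shared with the others.
```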
When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.
The image above illustrates how these four outcomes relate to various classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you'll find the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., Sensitivity/Recall/TPR, Precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.
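As a quick reference, here is how the main metrics follow from the four counts (the counts in this Python snippet are made up purely for illustration):

```python
# The core classification metrics computed from the four confusion-matrix counts.
# Counts below are hypothetical, chosen only to illustrate the formulas.
TP, FP, FN, TN = 80, 120, 20, 780

sensitivity = TP / (TP + FN)        # also called recall or true positive rate (TPR)
specificity = TN / (TN + FP)        # true negative rate
ppv = TP / (TP + FP)                # precision / positive predictive value
npv = TN / (TN + FN)                # negative predictive value
accuracy = (TP + TN) / (TP + FP + FN + TN)
balanced_accuracy = (sensitivity + specificity) / 2

print(sensitivity, specificity, ppv, npv, accuracy, balanced_accuracy)
# 0.80, ~0.87, 0.40, ~0.97, 0.86, ~0.83
```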
Depending on your research context, certain metrics may be more relevant than others:
Some metrics evaluate performance across all possible thresholds and can serve as a better summary of the overall model performance. These include:
Both ROC-AUC and PR-AUC represent areas under their respective curves and are mathematically expressed as integrals:

$$\text{ROC-AUC} = \int_0^1 TPR \; d(FPR)$$

$$\text{PR-AUC} = \int_0^1 P \; dR$$

Where TPR is the True Positive Rate (Sensitivity), FPR is the False Positive Rate (1 - Specificity), P is Precision (PPV), and R is Recall (Sensitivity). These integrals are computed using trapezoidal numerical integration.
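A minimal sketch of such a computation is shown below, assuming the binormal model (two unit-variance normal score distributions separated by Cohen's d) and an illustrative 10% base rate; it conveys the procedure rather than reproducing the simulator's code.

```python
# Sketch: compute ROC-AUC and PR-AUC by sweeping thresholds and applying the
# trapezoidal rule, under the binormal model with an illustrative 10% base rate.
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

d, base_rate = 0.8, 0.10
thresholds = np.linspace(-6, 6, 2001)

tpr = norm.sf(thresholds, loc=d)   # sensitivity / recall at each threshold
fpr = norm.sf(thresholds, loc=0)   # 1 - specificity at each threshold
precision = tpr * base_rate / (tpr * base_rate + fpr * (1 - base_rate))

# Reverse the arrays so the x-axis is increasing, then apply the trapezoidal rule
roc_auc = trapezoid(tpr[::-1], fpr[::-1])
pr_auc = trapezoid(precision[::-1], tpr[::-1])

print(f"ROC-AUC = {roc_auc:.3f}, PR-AUC = {pr_auc:.3f}")
# ROC-AUC ≈ 0.71, while PR-AUC sits well below it at a 10% base rate.
```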
Decision Curve Analysis (DCA; Vickers & Elkin, 2006) evaluates the utility of a predictive model or a single predictor by plotting net benefit across different threshold probabilities. Unlike ROC and PR curves, which focus on discrimination, DCA incorporates the relative costs of false positives and false negatives, helping determine whether using a prediction model to guide decisions provides more benefit than simple strategies.
A DCA plot typically includes three key curves:
The net benefit formula accounts for both the benefits of true positives and the costs of false positives:

$$\text{Net Benefit} = \frac{TP}{N_{total}} - \frac{FP}{N_{total}} \times \frac{p_t}{1 - p_t}$$

Where N_total is the total sample size, and p_t is the threshold probability - the minimum predicted risk at which one would opt for the intervention. In practical terms, p_t reflects how many unnecessary interventions one is willing to accept to prevent one adverse outcome. The ratio p_t/(1 - p_t) in the net benefit formula captures the relative weighting of false positives compared to true positives. p_t itself can be calculated from the misclassification costs:

$$p_t = \frac{C_{FP}}{C_{FP} + C_{FN}}$$

Where C_FP is the cost of a false positive (unnecessary intervention) and C_FN is the cost of a false negative (missed positive case). For example, if missing a case is considered ten times as costly as an unnecessary intervention, then p_t = 1/11 ≈ 9%, meaning one would accept up to ten unnecessary interventions to prevent one missed case. p_t can also be estimated through expert surveys, stakeholder preferences, or established guidelines.
What we often want to know is not the absolute net benefit, but the added value. ΔNB (Delta Net Benefit) measures this additional utility by comparing the model against the better of the two simple strategies (either All or None) at a specific threshold probability:

$$\Delta NB = NB_{model} - \max(NB_{all}, NB_{none})$$
At each threshold probability, the model's net benefit is compared against whichever simple strategy performs better at that threshold. This provides a more conservative and meaningful assessment of the model's added value. A positive ΔNB indicates that the predictive model offers genuine improvement over the best simple strategy, while values near zero suggest that simple strategies may be equally effective.
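Below is a minimal sketch of these calculations, using a hypothetical confusion matrix and threshold probability (all numbers are made up for illustration).

```python
# Net benefit and delta net benefit from a hypothetical confusion matrix;
# numbers are illustrative only.
def net_benefit(tp, fp, n_total, p_t):
    """Net benefit of acting on the positive predictions at threshold probability p_t."""
    return tp / n_total - (fp / n_total) * p_t / (1 - p_t)

def delta_net_benefit(tp, fp, n_total, base_rate, p_t):
    """Model's net benefit minus the better of the two simple strategies (All vs. None)."""
    nb_model = net_benefit(tp, fp, n_total, p_t)
    nb_all = base_rate - (1 - base_rate) * p_t / (1 - p_t)   # intervene on everyone
    nb_none = 0.0                                            # intervene on no one
    return nb_model - max(nb_all, nb_none)

# Hypothetical example: 1,000 people, 10% base rate, p_t = 0.10
print(net_benefit(tp=70, fp=180, n_total=1000, p_t=0.10))                        # 0.05
print(delta_net_benefit(tp=70, fp=180, n_total=1000, base_rate=0.10, p_t=0.10))  # 0.05
```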
DCA is particularly valuable because it:
For more information about DCA, visit https://mskcc-epi-bio.github.io/decisioncurveanalysis.
Can we predict depression diagnosis using a cognitive biomarker?
Can we predict who will respond to antidepressant treatment using task-based brain activity measures?
Can we predict who will attempt suicide using a combination of risk factors?
Most suicide prediction models (which include a variety of risk factors such as psychopathology, history of suicidal behavior, socio-demographics, etc.) achieve ROC-AUC > 0.7, with a handful exceeding ROC-AUC = 0.9 for predicting suicide attempts (Pigoni et al., 2024). Let's ignore potential overfitting concerns and consider how ROC-AUC = 0.9 would translate to real-world predictive utility.
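As a back-of-the-envelope illustration (not an analysis of any specific model from Pigoni et al., 2024), the sketch below assumes the binormal model and a purely hypothetical base rate of 1% in the screened population, with the decision threshold set to capture 80% of cases.

```python
# Back-of-the-envelope: what ROC-AUC = 0.9 implies under the binormal model at a
# purely hypothetical 1% base rate; all settings here are illustrative assumptions.
from scipy.stats import norm

auc, base_rate, sensitivity = 0.90, 0.01, 0.80

d = 2**0.5 * norm.ppf(auc)               # effect size implied by the AUC (~1.81)
threshold = d - norm.ppf(sensitivity)    # cut-off that yields 80% sensitivity
specificity = norm.cdf(threshold)
ppv = sensitivity * base_rate / (sensitivity * base_rate + (1 - specificity) * (1 - base_rate))

print(f"d ≈ {d:.2f}, specificity ≈ {specificity:.2f}, PPV ≈ {ppv:.2f}")
# Even with ROC-AUC = 0.9, PPV comes out around 0.05 at a 1% base rate:
# roughly 19 of every 20 flagged individuals would be false positives.
```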
E2P Simulator is an open-source project - feedback, bug reports, and suggestions for improvement are welcome. The easiest way to share them is through the GitHub Issues page.
You can view the source code, track development, and contribute directly at the project's GitHub repository.
For other inquiries, you can find my contact information here.