Getting Started with E2P Simulator

Welcome to E2P Simulator! This guide will help you understand what it does,
why it is needed, and how to use it.

What is E2P Simulator?

E2P Simulator (Effect-to-Prediction Simulator) allows researchers to interactively and quantitatively explore the relationship between effect sizes (e.g., Cohen's d, Odds Ratio, Pearson's r), the corresponding predictive performance (e.g., ROC-AUC, Sensitivity, Specificity, Accuracy), and real-world predictive value and clinical utility (e.g., PPV, NPV, PR-AUC, Net Benefit), while explicitly accounting for measurement reliability and outcome base rates.

In other words, E2P Simulator is a tool for performing predictive utility analysis - estimating how research findings will translate into real-world prediction, or determining what effect sizes and predictive performance are needed to achieve a desired level of predictive and clinical utility. Much like how power analysis tools (such as G*Power) help researchers plan for statistical significance, E2P Simulator helps them plan for practical significance.

E2P Simulator has several key applications:

  • Interpreting findings: translating reported effect sizes into expected real-world predictive performance and clinical utility.
  • Research planning: determining what effect sizes or predictive performance are needed to reach a desired level of predictive and clinical utility.
  • Study design: understanding how measurement reliability and outcome base rates shape real-world utility, and planning measurements and target populations accordingly.

Why is E2P Simulator needed?

Many research areas, such as the biomedical, behavioral, educational, and sports sciences, are increasingly studying individual differences to build predictive models that personalize treatments, learning, and training. Identifying reliable biomarkers and other predictors is central to these efforts. Yet several entrenched research practices continue to undermine the search for predictors:

  • A focus on the statistical significance of individual predictors, with little regard for their actual predictive utility.
  • Evaluation of models using metrics such as accuracy and ROC-AUC on balanced samples that do not reflect real-world outcome base rates.
  • Neglect of measurement reliability, which attenuates observed effects and limits achievable predictive performance.

Together, these issues undermine the quality and impact of academic research, because routinely reported metrics do not reflect real-world utility. Whether researchers focus on achieving statistical significance of individual predictors or optimizing model performance metrics like accuracy and ROC-AUC, both approaches often lead to unrealistic expectations about practical impact. In turn, this results in inefficient study planning, resource misallocation, and considerable waste of time and funding.

E2P Simulator is designed to address these fundamental challenges by placing measurement reliability and outcome base rate at the center of study planning and interpretation. It helps researchers understand how these factors jointly shape real-world predictive utility, and guides them in making more informed research decisions.

How to use E2P Simulator

E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting effect sizes, measurement reliability, base rate, and decision threshold, and immediately see how these changes impact predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.

E2P Simulator overview showing all key inputs and interactive elements at a high level

The image above provides an overview of E2P Simulator's interactive components.

Binary vs. Continuous Outcomes

E2P Simulator provides two analysis modes, covering the two most common research scenarios:

  • Binary outcome mode: for outcomes that are inherently categorical (e.g., diagnosis, relapse), with effect sizes expressed as Cohen's d or Odds Ratio.
  • Continuous outcome mode: for outcomes measured on a continuous scale (e.g., symptom change), with effect sizes expressed as Pearson's r or R²; to compute classification metrics, the outcome is dichotomized according to the specified base rate.

Measurement Reliability and True vs. Observed Effects

Measurement reliability attenuates observed effect sizes, which in turn reduces predictive performance. The simulator allows you to toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability). The reliability of continuous variables (predictors or outcomes) is specified using the Intraclass Correlation Coefficient (ICC), which typically corresponds to test-retest reliability. For binary variables, reliability is specified using Cohen's kappa (κ), which usually represents inter-rater reliability.

For continuous outcomes, where both the predictor and outcome are continuous, the relationship between true and observed Pearson's r is given by:

\[r_{\text{observed}} = r_{\text{true}} \times \sqrt{ICC_{\text{predictor}} \times ICC_{\text{outcome}}}\]

For binary outcomes, where a continuous predictor is used to classify a binary outcome, the relationship between true and observed Cohen's d is:

\[d_{\text{observed}} = d_{\text{true}} \times \sqrt{\frac{2 \times ICC_1 \times ICC_2}{ICC_1 + ICC_2} \times \sin(\frac{\pi}{2} \kappa)}\]

Here, \(ICC_1\) and \(ICC_2\) denote the reliability of the continuous predictor in each of the two outcome groups, and \(\kappa\) is the reliability of the binary outcome classification (e.g., interrater reliability of a diagnosis).
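
As a rough illustration, here is a minimal Python sketch of the two attenuation formulas above (the function names and example values are ours, not part of the simulator):

```python
import math

def attenuate_r(r_true, icc_predictor, icc_outcome):
    """Observed Pearson's r after attenuation by predictor and outcome reliability."""
    return r_true * math.sqrt(icc_predictor * icc_outcome)

def attenuate_d(d_true, icc_group1, icc_group2, kappa):
    """Observed Cohen's d after attenuation by the predictor's reliability in each
    group (ICC_1, ICC_2) and the binary outcome's reliability (Cohen's kappa)."""
    icc_term = 2 * icc_group1 * icc_group2 / (icc_group1 + icc_group2)
    kappa_term = math.sin(math.pi / 2 * kappa)
    return d_true * math.sqrt(icc_term * kappa_term)

print(round(attenuate_r(0.5, 0.8, 0.9), 2))         # 0.42
print(round(attenuate_d(1.58, 0.6, 0.6, 0.28), 2))  # ~0.8 (cf. Example 1 below)
```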

See Karvelis & Diaconescu (2025) for more details on how reliability attenuates individual and group differences.

Note that the simulator does not account for sample size limitations, which can introduce additional uncertainty around the true effect size through sampling error.

Base Rate

Base rate (also referred to as prevalence) refers to the proportion of individuals in the population who have the outcome of interest before considering any predictors or test results (in Bayesian terms, this is the prior probability of the outcome). To estimate real-world predictive utility, the base rate should be set to reflect the population where your predictor or model will actually be used — not the composition of your study sample. This distinction is crucial because research studies often use case-control designs with balanced sampling (e.g., 50% cases, 50% controls) that do not reflect real-world base rates. This is one of the most commonly overlooked problems in evaluating prediction models (Brabec et al., 2020), as the base rate directly affects multiple metrics used for model evaluation (see Understanding Predictive Metrics).

For instance, if you are developing a model for a rare disorder that affects 2% of the general population, the base rate should be set to 2%, even if your training dataset contains equal numbers of cases and controls. However, if your model will be used in a pre-screened high-risk population where the disorder base rate is 20%, then 20% becomes the relevant base rate (however, in this scenario, the effect size should also reflect the difference between cases and high-risk controls rather than general population controls).
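
To see why this matters, here is a minimal sketch (not part of the simulator) applying Bayes' rule to a hypothetical test with 80% sensitivity and 80% specificity:

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# The same test applied at different base rates:
for base_rate in (0.50, 0.20, 0.02):
    print(f"base rate {base_rate:.0%}: PPV = {ppv(0.8, 0.8, base_rate):.2f}")
# 50% -> 0.80, 20% -> 0.50, 2% -> 0.08
```

The same discrimination that looks excellent in a balanced sample yields mostly false positives at a 2% base rate.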

Multivariable Simulators

Both the binary and continuous outcome analysis modes include multivariable simulators that help estimate how many predictors need to be combined to achieve a desired level of real-world predictive utility. The main metric for this is PR-AUC, as it accounts for the base rate and is threshold-independent. For binary classification, the simulator also displays Mahalanobis D, a multivariate generalization of Cohen's d; for continuous outcomes, it displays the total variance explained, R².

The multivariable simulators can help approximate the expected performance of multivariate models without having to train the full models and thus help with research planning and model development. They also help gain intuition about how multicollinearity undermines predictive performance and leads to diminishing returns when adding more predictors.

Assumptions and Limitations

The multivariable simulators are based on several simplifying assumptions:

  • Predictors are normally distributed.
  • All predictors have the same individual effect size.
  • All pairs of predictors share the same degree of collinearity (a single common inter-predictor correlation).

Even though real-world predictors will often not be normally distributed and will vary in their individual strengths and collinearity, the general trends (such as diminishing returns and the impact of shared variance among the predictors) remain informative for understanding multivariate relationships and estimating expected model performance.
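
For intuition, here is a minimal sketch of how Mahalanobis D behaves under exactly these simplifying assumptions (equal per-predictor effects and a single shared inter-predictor correlation); the closed-form expression below follows from those assumptions and is not necessarily the simulator's own implementation:

```python
import math

def mahalanobis_D(d, k, rho):
    """Mahalanobis D for k predictors, each with Cohen's d and a common
    pairwise correlation rho (equal-effect, equicorrelated case)."""
    return d * math.sqrt(k / (1 + (k - 1) * rho))

# Diminishing returns: each predictor has d = 0.5, pairwise correlation 0.3
for k in (1, 2, 5, 10, 20):
    print(k, round(mahalanobis_D(0.5, k, 0.3), 2))
# 1 -> 0.5, 2 -> 0.62, 5 -> 0.75, 10 -> 0.82, 20 -> 0.86
```

As k grows, D plateaus near d / √rho (about 0.91 in this example), which is the diminishing-returns pattern described above.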

Understanding Predictive Metrics

Classification Outcomes and Metrics

When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.

Prediction metrics diagram showing confusion matrix and all derived metrics

The image above illustrates how these four outcomes are used to derive classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you can see the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., Sensitivity/Recall/TPR, Precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.

A short summary of each metric:

  • Sensitivity (Recall, TPR): the proportion of actual positives correctly identified, TP / (TP + FN).
  • Specificity (TNR): the proportion of actual negatives correctly identified, TN / (TN + FP).
  • PPV (Precision): the proportion of positive predictions that are correct, TP / (TP + FP); strongly dependent on the base rate.
  • NPV: the proportion of negative predictions that are correct, TN / (TN + FN); also dependent on the base rate.
  • Accuracy: the proportion of all predictions that are correct, (TP + TN) / (TP + FP + FN + TN).

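For reference, a minimal sketch (with made-up counts) computing these metrics directly from the four outcomes:

```python
def classification_metrics(tp, fp, fn, tn):
    """Core metrics derived from the four classification outcomes."""
    return {
        "sensitivity (recall, TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "PPV (precision)": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# A hypothetical threshold applied to 1000 people with a 10% base rate
print(classification_metrics(tp=70, fp=180, fn=30, tn=720))
```
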
Threshold-Independent Metrics

ROC and Precision-Recall curves showing AUC calculation

Some metrics evaluate performance across all possible thresholds and can serve as a better summary of the overall model performance. These include:

  • ROC-AUC: the area under the ROC curve, which plots Sensitivity (TPR) against 1 − Specificity (FPR) across all thresholds; it does not depend on the base rate.
  • PR-AUC: the area under the Precision-Recall curve, which plots PPV (Precision) against Sensitivity (Recall) across all thresholds; unlike ROC-AUC, it reflects the base rate.

Both ROC-AUC and PR-AUC represent areas under their respective curves and are mathematically expressed as integrals:

\[ROC\text{-}AUC = \int_0^1 TPR(FPR) \, d(FPR)\]
\[PR\text{-}AUC = \int_0^1 PPV(TPR) \, d(TPR)\]

These integrals are computed using trapezoidal numerical integration.
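
As an illustration of how these areas can be computed, here is a minimal sketch (not the simulator's code) that sweeps a threshold over two equal-variance normal distributions separated by Cohen's d and integrates the resulting curves with the trapezoidal rule:

```python
import numpy as np
from scipy.stats import norm

def trapezoid(y, x):
    """Trapezoidal rule: mean of adjacent heights times interval widths."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

d, base_rate = 0.8, 0.08                 # illustrative values (cf. Example 1 below)
thresholds = np.linspace(-6, 6, 2001)

tpr = 1 - norm.cdf(thresholds - d)       # sensitivity: positives ~ N(d, 1)
fpr = 1 - norm.cdf(thresholds)           # 1 - specificity: negatives ~ N(0, 1)
ppv = base_rate * tpr / (base_rate * tpr + (1 - base_rate) * fpr)

roc_auc = trapezoid(tpr[::-1], fpr[::-1])   # TPR integrated over FPR
pr_auc = trapezoid(ppv[::-1], tpr[::-1])    # PPV integrated over TPR (recall)
print(round(roc_auc, 2), round(pr_auc, 2))  # roughly 0.71 and 0.19-0.20
```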

Decision Curve Analysis (DCA)

Decision Curve Analysis (Vickers & Elkin, 2006) evaluates the clinical utility of a predictive model or a single predictor by explicitly balancing the costs of false positives against the benefits of true positives.

Decision Curve Analysis example showing net benefit curves and shaded areas representing different strategies

A DCA plot typically includes three key curves:

  • Treat All: assume everyone will have the outcome and intervene for everyone.
  • Treat None: intervene for no one (net benefit of zero by definition).
  • Model: intervene only for those whose predicted risk exceeds the threshold probability.

The net benefit formula accounts for both the benefits of true positives and the costs of false positives:

\[NB = \frac{TP}{N_{total}} - \frac{FP}{N_{total}} \times \frac{p_t}{1-p_t}\]

Where \(N_{total}\) is the total sample size and \(p_t\) is the threshold probability. It represents the minimum predicted probability of an outcome at which you would decide to intervene (e.g., diagnose or treat). For instance, if \(p_t = 0.10\), you would intervene for anyone with a predicted risk ≥ 10%. The choice of \(p_t\) determines a specific balance between sensitivity (finding true cases) and specificity (avoiding false alarms). In practical terms, \(p_t\) encodes the trade-off between benefits and harms: it implies that you are willing to accept \((1-p_t)/p_t\) (i.e., \(1/p_t - 1\)) unnecessary interventions to prevent one adverse outcome - at \(p_t = 0.10\), that is 9 unnecessary interventions per outcome prevented. The ratio \(p_t/(1-p_t)\) in the net benefit formula captures this relative weighting of false positives compared to true positives. The optimal \(p_t\) can be estimated as:

\[p_t = \frac{C_{FP}}{C_{FP} + C_{FN}}\]

Where \(C_{FP}\) is the cost of a false positive (unnecessary intervention) and \(C_{FN}\) is the cost of a false negative (missed positive case). \(p_t\) can also be estimated through expert surveys, stakeholder preferences, or established guidelines.

For population screening, \(p_t\) is typically set low because missing true cases is costlier than unnecessary follow-ups, so more false positives are acceptable. For diagnostic confirmation (e.g., before initiating high-risk treatment), \(p_t\) is set higher to avoid false positives, reflecting a preference for specificity. As a rule of thumb, screening scenarios may use \(p_t\) in the 1–10% range, whereas diagnostic decisions often warrant much higher \(p_t\) (for example 30–70% or more), depending on harms and preferences.

What we often want to know is not the absolute NB, but added value. ΔNB (Delta Net Benefit) measures this additional utility by comparing the model against the better of the two simple strategies (either All or None) at a specific threshold probability:

\[\Delta NB = NB_{\text{model}} - \max(NB_{\text{All}}, NB_{\text{None}})\]

At each threshold probability, the model's net benefit is compared against whichever simple strategy performs better at that threshold. This provides a more conservative and meaningful assessment of the model's added value. A positive ΔNB indicates that the predictive model offers genuine improvement over the best simple strategy, while values near zero suggest that simple strategies may be equally effective.
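
Putting the two formulas together, here is a minimal sketch (with hypothetical counts roughly matching Example 1 below) of how NB and ΔNB are computed:

```python
def net_benefit(tp, fp, n_total, p_t):
    """Net benefit of a strategy at threshold probability p_t."""
    return tp / n_total - (fp / n_total) * (p_t / (1 - p_t))

def delta_net_benefit(tp, fp, n_total, base_rate, p_t):
    """Model's net benefit minus the better of Treat All and Treat None."""
    nb_model = net_benefit(tp, fp, n_total, p_t)
    nb_all = net_benefit(base_rate * n_total, (1 - base_rate) * n_total, n_total, p_t)
    nb_none = 0.0  # no interventions: no true positives, no false positives
    return nb_model - max(nb_all, nb_none)

# 1000 people, 8% base rate, p_t = 0.10; the model flags 265 people, 42 of them true positives
print(round(delta_net_benefit(tp=42, fp=223, n_total=1000, base_rate=0.08, p_t=0.10), 3))
# ~0.017, i.e. about 1.7 net true positives per 100 people assessed
```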

DCA is particularly valuable because it:

  • Expresses utility on an interpretable scale - net true positives per person assessed.
  • Explicitly incorporates the relative costs of false positives and false negatives through the threshold probability.
  • Enables a direct comparison of the model against the simple Treat All and Treat None strategies across a range of plausible thresholds.

For more information about DCA, visit https://mskcc-epi-bio.github.io/decisioncurveanalysis.

Quick Start Examples

To further clarify how the tool can be used and to demonstrate its utility, we provide some specific examples.

Example 1: Diagnostic Prediction

Can we predict depression diagnosis using a cognitive biomarker?

  1. In the Binary outcome mode:
    • Set base rate to 8% (the base rate of depression in adolescents; Shorey et al., 2022)
    • Set the grouping reliability to 0.28 (depression diagnosis reliability based on DSM-5 field trials; Regier et al., 2013)
    • Set the predictor reliability for both groups to 0.6 (an average reliability for cognitive measures; Karvelis et al., 2023)
    • Set the observed effect size to d = 0.8 (a large effect size that is optimistic and rarely seen in practice)
  2. With these parameters, observed Cohen's d = 0.8 will yield ROC-AUC = 0.71 and PR-AUC = 0.19 (the d-to-ROC-AUC conversion can be sanity-checked with the sketch after this list). This means that even with a "large" effect size of 0.8, the predictive utility remains rather modest, especially when it comes to the tradeoff between PPV and Sensitivity (as shown by the low PR-AUC). Using the DCA plot to set the classification threshold to correspond to 10% risk, pt = 0.1, we obtain PPV = 0.16, Sensitivity = 0.53, and ΔNB = 0.018, which means that at this threshold 84% of diagnoses would be false positives while still missing 47% of actual cases, and we would gain about 1.8 net true positives per 100 individuals assessed.
  3. Note that with the low reliability values, this observed effect corresponds to a much larger true effect, d = 1.58, and in turn much better predictive performance, ROC-AUC = 0.87 and PR-AUC = 0.46, and PPV = 0.27, Sensitivity = 0.74, and ΔNB = 0.046 at pt = 0.1, highlighting how much improvement in diagnostic prediction could be achieved simply by improving measurement reliability.
  4. Now let's say we are serious about precision psychiatry and we want to achieve a PR-AUC of 0.8. Using the tool, we can find that it would require ROC-AUC = 0.96. It would be rather unrealistic to expect a single biomarker to achieve this level of performance. Using the multivariable simulator, you can explore how many predictors with smaller d values would be required to achieve the desired prediction performance.
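
As mentioned in step 2, the d-to-ROC-AUC conversion can be sanity-checked with a minimal sketch assuming equal-variance normal score distributions in both groups (the same binormal setup as the sketch in the Threshold-Independent Metrics section):

```python
from math import sqrt
from scipy.stats import norm

def d_to_roc_auc(d):
    """ROC-AUC implied by Cohen's d under the equal-variance binormal model."""
    return norm.cdf(d / sqrt(2))

print(round(d_to_roc_auc(0.80), 2))  # ~0.71, the observed effect in step 2
print(round(d_to_roc_auc(1.58), 2))  # ~0.87, the true effect in step 3
```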

Example 2: Treatment Response Prediction

Can we predict who will respond to antidepressant treatment using task-based brain activity measures?

  1. Select Continuous outcome mode:
    • Set base rate to 15% (the rate of response to antidepressant treatment beyond placebo; Stone et al., 2022)
    • Set predictor reliability to 0.4 (an average reliability for task-fMRI measures; Elliott et al., 2020)
    • Set outcome reliability to 0.94 (Hamilton Depression Rating Scale (HAMD) reliability; Trajković et al., 2011)
    • Adjust effect size such that R² = 0.2 (average multivariate R² from recent research; Karvelis et al., 2022)
  2. This will yield ROC-AUC = 0.73 and PR-AUC = 0.33, indicating rather modest predictive performance, as shown by the low PR-AUC. At pt = 0.2, which reflects the relative harms of antidepressant side effects, this would result in Sensitivity = 0.54, PPV = 0.29, and ΔNB = 0.031, which means that 46% of those who would benefit from treatment would not receive it, 71% of those given treatment would not benefit from it, and we would gain about 3.1 net true responders per 100 people assessed. Improving measurement reliability alone could improve performance quite substantially, up to ROC-AUC = 0.87 and PR-AUC = 0.57, which at pt = 0.2 would give Sensitivity = 0.74, PPV = 0.42, and ΔNB = 0.073.
  3. If we once again want to be serious about precision psychiatry and aim for a PR-AUC of 0.8, we will find that it requires R² = 0.8. This is an extremely ambitious value (it would require explaining 80% of the variance in symptom improvement). This helps demonstrate the inherent limitation of dichotomizing continuous outcomes for evaluating treatment response prediction - doing so leads to a loss of valuable information. On the other hand, it does reflect the binary nature of decision-making in psychiatry (to prescribe the treatment or not).

Example 3: Risk Prediction

Can we predict who will attempt suicide using electronic health records?

One of the largest prospective suicide prediction studies (Edgcomb et al., 2021) followed women (N = 67,000) with serious mental illness for 12 months after a general medical hospitalization and trained models on pre-discharge electronic health records to predict readmission for suicide attempt or self-harm, achieving ROC-AUC of 0.73 (derivation sample) and 0.71 (external sample). A companion study (Thiruvalluru et al., 2023) in men (N = 1.4 million) reported a 12-month base rate of 3.9% and similar discrimination.

  1. In the Binary outcome mode:
    • Set base rate to 3.9% (Thiruvalluru et al., 2023)
    • Set the outcome reliability to 1.0 (records of hospital admissions for suicide attempts or self-harm can be assumed to be nearly perfectly reliable)
    • Set the predictor reliability for both groups to 0.8 (structured electronic health records including healthcare utilization, prior attempts, psychiatric diagnoses, etc., can be assumed to have rather high reliability)
    • Set the observed effect size to d = 0.77, which corresponds to ROC-AUC = 0.71
  2. This yields PR-AUC = 0.10, indicating poor predictive performance in the real world. At pt = 0.03 (a reasonable threshold for intervention), we get Sensitivity = 0.77, PPV = 0.06, and ΔNB = 0.006. This means that while the model would capture about three-quarters of true cases, only 6% of those flagged would actually attempt suicide, and the added benefit would translate to 6 additional true cases per 1,000 individuals.
  3. To achieve PR-AUC = 0.8 in this population would require ROC-AUC = 0.98, which is extremely unrealistic. At pt = 0.03, this would result in Sensitivity = 0.94, PPV = 0.30, and ΔNB = 0.025.
  4. Note that because the reliability is already quite high, improving it further would not make much of a difference. What we need instead are better predictors. Alternatively, it may be more effective to focus on universal suicide prevention strategies rather than trying to predict individual cases (e.g., Large, 2018).

Sample Size Calculations for Prediction Models

E2P Simulator includes a sample size calculator to help you determine how much data is needed for your multivariable prediction models. This calculator is designed to help you avoid overfitting and keep prediction error low (following the recommendations from Riley et al., 2020), providing more robust sample size estimates than simple rules of thumb like "10 events per predictor".

How to Use

Specify your number of predictors, a realistically expected R² (based on prior research or pilot data), and the outcome base rate (for binary outcomes only). Use R²CS (Cox-Snell R²) for binary outcomes and standard R² for continuous outcomes. Note that for binary outcomes with a single continuous predictor, R²CS equals eta-squared (η²), which is already displayed in the main E2P Simulator dashboard. The final recommendation uses the maximum across all criteria to ensure all performance targets are met.
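
For orientation, here is a rough sketch of just one of those criteria - the expected-shrinkage criterion from Riley et al. (2020) - as we understand it; the calculator combines several criteria and reports the maximum, so treat this as illustrative rather than a substitute:

```python
import math

def n_for_shrinkage(n_predictors, r2, shrinkage=0.9):
    """Approximate sample size so that the expected shrinkage of predictor effects
    is no worse than `shrinkage` (0.9 = at most 10% shrinkage). For binary outcomes
    r2 is the anticipated Cox-Snell R²; for continuous outcomes, the adjusted R²."""
    return math.ceil(n_predictors / ((shrinkage - 1) * math.log(1 - r2 / shrinkage)))

print(n_for_shrinkage(n_predictors=10, r2=0.2))  # ~398 participants for this criterion alone
```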

The sample size calculators complement the main E2P simulators in study planning: the E2P simulators explore relationships between effect sizes and predictive utility (both what you need for desired performance and what utility to expect from realistic effects), while the sample size calculators determine adequate sample size based on realistic R² estimates from prior research. For sample size planning, always use conservative, realistic R² estimates based on prior research, not idealized target values.

Prediction Models vs. Hypothesis Testing Sample Sizes

You may wonder how these prediction-focused sample size calculations compare to traditional power analysis used in hypothesis testing. The key difference is that power analysis focuses on detecting whether an effect exists, while prediction-focused calculations prioritize model reliability and performance on new data. This fundamental difference in goals typically leads to larger sample size requirements for prediction models.

Another way to think about this difference is in terms of precision requirements. Power analysis only needs enough precision to distinguish an effect from zero (statistical significance). In contrast, prediction models require much tighter confidence intervals around parameter estimates, so that predictive performance and effect sizes can be estimated precisely.

Feedback and Contributions

E2P Simulator is an open-source project - feedback, bug reports, and suggestions for improvement are welcome. The easiest way to do so is through the GitHub Issues page.

You can view the source code, track development, and contribute directly at the project's GitHub repository.

For other inquiries, you can find my contact information here.

References

  1. Brabec, J., Komárek, T., Franc, V., & Machlica, L. (2020). On model evaluation under non-constant class imbalance. International Conference on Computational Science, vol. 12140 (pp. 74-87). Springer, Cham. https://doi.org/10.1007/978-3-030-50423-6_6
  2. Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22. https://doi.org/10.1016/j.jclinepi.2019.02.004
  3. Edgcomb, J. B., Thiruvalluru, R., Pathak, J., Brooks, J. O., & Zima, B. (2021). Machine learning to differentiate risk of suicide attempt and self-harm after general medical hospitalization of women with mental illness. Medical Care, 59, S58-S64. https://doi.org/10.1097/MLR.0000000000001445
  4. Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., Sison, M. L., Moffitt, T. E., Caspi, A., & Hariri, A. R. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science, 31(7), 792-806. https://doi.org/10.1177/0956797620916786
  5. Karvelis, P., & Diaconescu, A. O. (2025). Clarifying the reliability paradox: poor measurement reliability attenuates group differences. Frontiers in Psychology, 16, 1592658. https://doi.org/10.3389/fpsyg.2025.1592658
  6. Karvelis, P., Paulus, M. P., & Diaconescu, A. O. (2023). Individual differences in computational psychiatry: A review of current challenges. Neuroscience & Biobehavioral Reviews, 148, 105137. https://doi.org/10.1016/j.neubiorev.2023.105137
  7. Karvelis, P., Charlton, C. E., Allohverdi, S. G., Bedford, P., Hauke, D. J., & Diaconescu, A. O. (2022). Computational approaches to treatment response prediction in major depression using brain activity and behavioral data: A systematic review. Network Neuroscience, 6(4), 1066-1103. https://doi.org/10.1162/netn_a_00233
  8. Large, M. M. (2018). The role of prediction in suicide prevention. Dialogues in Clinical Neuroscience, 20(3), 197-205. https://doi.org/10.31887/DCNS.2018.20.3/mlarge
  9. Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1), 59-70. https://doi.org/10.1176/appi.ajp.2012.12070999
  10. Riley, R. D., Ensor, J., Snell, K. I. E., Harrell Jr, F. E., Martin, G. P., Reitsma, J. B., Moons, K. G. M., Collins, G., & van Smeden, M. (2020). Calculating the sample size required for developing a clinical prediction model. BMJ, 368, m441. https://doi.org/10.1136/bmj.m441
  11. Shorey, S., Ng, E. D., & Wong, C. H. J. (2022). Global prevalence of depression and elevated depressive symptoms among adolescents: A systematic review and meta-analysis. British Journal of Clinical Psychology, 61(2), 287-305. https://doi.org/10.1111/bjc.12333
  12. Stone, M. B., Yaseen, Z. S., Miller, B. J., Richardville, K., Kalaria, S. N., & Kirsch, I. (2022). Response to acute monotherapy for major depressive disorder in randomized, placebo controlled trials submitted to the US Food and Drug Administration: Individual participant data analysis. BMJ, 378, e067606. https://doi.org/10.1136/bmj-2021-067606
  13. Thiruvalluru, R. K., Edgcomb, J. B., Brooks, J. O., & Pathak, J. (2023). Risk of suicide attempts and self-harm after 1.4 million general medical hospitalizations of men with mental illness. Journal of Psychiatric Research, 157, 50-56. https://doi.org/10.1016/j.jpsychires.2022.10.035
  14. Trajković, G., Starčević, V., Latas, M., Leštarević, M., Ille, T., Bukumirić, Z., & Marinković, J. (2011). Reliability of the Hamilton Rating Scale for Depression: A meta-analysis over a period of 49 years. Psychiatry Research, 189(1), 1-9. https://doi.org/10.1016/j.psychres.2010.12.007
  15. Vickers, A. J., & Elkin, E. B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making, 26(6), 565-574. https://doi.org/10.1177/0272989X06295361