E2P Simulator | Get Started

Why is E2P Simulator needed?

Many research areas such as biomedical, behavioral, education, and sports sciences are increasingly studying individual differences to build predictive models to personalize treatments, learning, and training. Identifying reliable biomarkers and other predictors is central to these efforts. Yet, several entrenched research practices continue to undermine the search for predictors:

Overemphasis on statistical significance: Most research continues to optimize for statistical significance (p-values) without optimizing for practical significance (effect sizes).
Difficulty interpreting effect sizes: The interpretation of effect sizes, which are critical for gauging real-world utility, is often reduced to arbitrary cutoffs (small/medium/large) without conveying their practical utility.
Overlooked measurement reliability: Measurement noise attenuates both effect sizes and predictive performance, yet it is rarely accounted for in study design or interpretation of findings.
Neglected outcome base rate: Low base rate of outcomes can drastically limit predictive performance in real-world settings, but is rarely accounted for when evaluating the translational potential of prediction models.

Together, these issues undermine the quality and impact of academic research, because routinely reported metrics do not reflect real-world utility. Whether researchers focus on achieving statistical significance of individual predictors or optimizing model performance metrics like accuracy and ROC-AUC, both approaches often lead to unrealistic expectations about practical impact. In turn, this results in inefficient study planning, resource misallocation, and considerable waste of time and funding.

E2P Simulator is designed to address these fundamental challenges by placing measurement reliability and outcome base rate at the center of study planning and interpretation. It helps researchers understand how these factors jointly shape real-world predictive utility, and guides them in making more informed research decisions.

How to use E2P Simulator

E2P Simulator is designed to be intuitive and interactive. You can explore different scenarios by adjusting effect sizes, measurement reliability, base rate, and decision threshold, and immediately see how these changes impact predictive performance through various visualizations and metrics. Still, in this section we will highlight and clarify some of the key features and assumptions of the simulator.

E2P Simulator overview showing all key inputs and interactive elements at a high level

The image above provides an overview of all E2P Simulator's interactive components.

Binary vs. Continuous Outcomes

E2P Simulator provides two analysis modes that cover the two most common research scenarios:

Binary Mode: Considers dichotomous outcomes such as diagnostic categories (e.g., cases vs. controls) or discrete states (e.g., success vs. failure). All metric calculations and conversions in this mode are completely analytical and follow the formulas provided on the page.
Continuous Mode: Considers continuous measurements such as symptom severity or performance scores that may need to be categorized (e.g., responders vs. non-responders or performers vs. non-performers) for practical decisions. This mode is based on actual data simulations rather than analytical solutions, hence it may be slower to respond to inputs.

Measurement Reliability and True vs. Observed Effects

Measurement reliability attenuates observed effect sizes, which in turn reduces predictive performance. The simulator allows you to toggle between "true" effect sizes (what would be observed with perfect measurement) and "observed" effect sizes (what we actually see given imperfect reliability). The reliability of continuous variables (predictors or outcomes) is specified using the Intraclass Correlation Coefficient (ICC), which typically corresponds to test-retest reliability. For binary variables, reliability is specified using Cohen's kappa (κ), which usually represents inter-rater reliability.

For continuous outcomes, where both the predictor and outcome are continuous, the relationship between true and observed Pearson's r is given by:

\[r_{\text{observed}} = r_{\text{true}} \times \sqrt{ICC_{\text{predictor}} \times ICC_{\text{outcome}}}\]

For binary outcomes, where a continuous predictor is used to classify a binary outcome, the relationship between true and observed Cohen's d is:

\[d_{\text{observed}} = d_{\text{true}} \times \sqrt{\frac{2 \times ICC_1 \times ICC_2}{ICC_1 + ICC_2} \times \sin(\frac{\pi}{2} \kappa)}\]

Here, \(ICC_1\) and \(ICC_2\) denote the reliability of the continuous predictor in each of the two outcome groups, and \(\kappa\) is the reliability of the binary outcome classification (e.g., interrater reliability of a diagnosis).

See Karvelis & Diaconescu (2025) for more details on how reliability attenuates individual and group differences.

Note that the simulator does not account for sample size limitations, which can introduce additional uncertainty around the true effect size through sampling error.

Base Rate

Base rate (also referred to as prevalence) refers to the proportion of individuals in the population who have the outcome of interest before considering any predictors or test results (in Bayesian terms, this is the prior probability of the outcome). To estimate real-world predictive utility, the base rate should be set to reflect the population where your predictor or model will actually be used — not the composition of your study sample. This distinction is crucial because research studies often use case-control designs with balanced sampling (e.g., 50% cases, 50% controls) that do not reflect real-world base rate. This is one of the most commonly overlooked problems in evaluating prediction models (Brabec et al., 2020), as the base rate directly affects multiple metrics used for model evaluation (see Understanding Predictive Metrics).

For instance, if you are developing a model for a rare disorder that affects 2% of the general population, the base rate should be set to 2%, even if your training dataset contains equal numbers of cases and controls. However, if your model will be used in a pre-screened high-risk population where the disorder base rate is 20%, then 20% becomes the relevant base rate (however, in this scenario, the effect size should also reflect the difference between cases and high-risk controls rather than general population controls).

Multivariable Simulators

Both binary and continuous outcomes analysis modes include multivariable simulators that help estimate how many predictors with a given effect size need to be combined to achieve a desired level of predictive performance. These simulators can approximate the expected performance of multivariate models without having to train the full models, making them useful for research planning and model development. The simulators illustrate how increasing the number of predictors improves performance while also showing how collinearity leads to diminishing returns,demonstrating that small effects do not add up as quickly as many researchers might intuitively expect.

For binary outcomes, the simulator displays ROC-AUC and PR-AUC. ROC-AUC provides a threshold-independent measure of discriminative ability while PR-AUC accounts for the base rate to reflect real-world performance with imbalanced outcomes. The conversion from multiple predictors to ROC-AUC is done via Mahalanobis D, a multivariate generalization of Cohen's d, which allows you to start from familiar effect size metrics.

For continuous outcomes, the simulator displays total variance explained (R²) and PR-AUC. R² provides the standard measure of predictive performance in regression models, while PR-AUC accounts for the base rate to reflect real-world classification performance. The conversion from R² to PR-AUC is done by dichotomizing the continuous outcome at the base rate threshold, which allows you to assess how well the model would perform if used to make binary decisions (e.g., identifying the top 20% of patients most likely to benefit from treatment).

Assumptions and Limitations

The multivariable simulators are based on several simplifying assumptions:

Average effects and correlations: The simulators use single values to represent the average effect size across predictors and average correlation among them, which provides useful approximations for understanding general principles of multivariate relationships and expected model performance.
Linear effects: The formulas assume predictors contribute additively without interactions (where one predictor's effect depends on another). This assumption is supported by research showing that in clinical prediction, complex non-linear models generally do not outperform simple linear logistic regression (Christodoulou et al., 2019). In general, more complex machine learning models excel at capturing non-linear relationships that we expect to see in the real world, but they also require more data and are more prone to overfitting.
Normality: The underlying variables are assumed to be normally distributed. This is consistent with the assumptions of input metrics like Cohen's d and Pearson's r, although in practice these are often computed despite normality violations.

Even though real-world predictors will often not be normally distributed and will vary in their individual strengths and collinearity, the general trends (such as diminishing returns and the impact of shared variance among the predictors) remain informative for understanding multivariate relationships and estimating expected model performance.

Understanding Predictive Metrics

Classification Outcomes and Metrics

When using a predictor to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These form the basis for all predictive metrics.

Prediction metrics diagram showing confusion matrix and all derived metrics

The image above illustrates how these four outcomes are used to derive classification metrics. On the left, you can see how negative (e.g., controls) and positive (e.g., cases) distributions overlap and how a classification threshold (red line) creates these four outcomes. On the right, you can see the confusion matrix and the formulas for key metrics derived from it. Note how some metrics have multiple names (e.g., Sensitivity/Recall/TPR, Precision/PPV) - this reflects how the same concepts are referred to differently across fields like medicine, cognitive science, and machine learning.

A short summary of each metric:

Sensitivity (Recall, True Positive Rate): Measures the proportion of actual positives correctly identified. Useful when missing a positive case is costly (e.g., disease screening), but ignores false positives.
Specificity (True Negative Rate): Measures the proportion of actual negatives correctly identified. Important when false alarms are costly (e.g., confirming a diagnosis before a risky treatment), but ignores false negatives.
Accuracy: The proportion of all predictions that are correct. Intuitive but can be misleading for imbalanced datasets, as it is heavily influenced by the majority class.
Balanced Accuracy: The average of Sensitivity and Specificity. A better measure than accuracy for imbalanced datasets, but it gives equal weight to both types of errors and does not account for the base rate, which is critical for assessing real-world utility.
Positive Predictive Value (PPV, Precision): The proportion of positive predictions that are actually correct. It informatively accounts for base rate, and corresponds to the posterior probability of a condition given a positive test result ("How likely is a positive prediction to be true?"). As such, it is a crucial metric for clinical decision-making.
Negative Predictive Value (NPV): The proportion of negative predictions that are actually correct. It also informatively accounts for the base rate and corresponds to the posterior probability of not having a condition given a negative test result ("How likely is a negative prediction to be true?"). This makes it a crucial metric for ruling out conditions.
F1 Score: The harmonic mean of Precision and Recall. A useful summary when you need to balance finding all positives and not making too many false alarms, though it can be hard to interpret directly.
Matthews Correlation Coefficient (MCC): A correlation between observed and predicted classifications. A balanced measure suitable for imbalanced datasets, but it is less intuitive to interpret and does not reveal the types of errors being made.

Threshold-Independent Metrics

ROC and Precision-Recall curves showing AUC calculation

Some metrics evaluate performance across all possible thresholds and can serve as a better summary of the overall model performance. These include:

ROC-AUC (Area Under the Receiver Operating Characteristic Curve): Summarizes how well a model balances true positives (Sensitivity) and false positives (1-Specificity) across all possible thresholds. ROC-AUC measures inherent discriminative capacity independent of base rates, which means it remains constant across different outcome base rates. This makes it a valuable metric for comparing models across different populations.
PR-AUC (Area Under the Precision-Recall Curve): Summarizes how well a model balances Precision (PPV) and Recall (Sensitivity) across all thresholds. Because Precision is dependent on the base rate, PR-AUC can often be a more informative summary metric for real-world contexts with low base rates. The caveat is that for it to be truly informative, the base rate used to obtain PR-AUC should reflect expected real-world conditions (instead of being based on artificially balanced datasets).

Both ROC-AUC and PR-AUC represent areas under their respective curves and are mathematically expressed as integrals:

\[ROC\text{-}AUC = \int_0^1 TPR(FPR) \, d(FPR)\]

\[PR\text{-}AUC = \int_0^1 PPV(TPR) \, d(TPR)\]

These integrals are computed using trapezoidal numerical integration.

Decision Curve Analysis (DCA)

Decision Curve Analysis (Vickers & Elkin, 2006) evaluates the clinical utility of a predictive model or a single predictor by explicitly balancing the costs of false positives against the benefits of true positives.

Decision Curve Analysis example showing net benefit curves and shaded areas representing different strategies

A DCA plot typically includes three key curves:

Model: Shows the net benefit of using the model (or a single predictor) at different threshold probabilities.
All: Represents the strategy of intervening for everyone regardless of their predicted risk. This strategy maximizes sensitivity (no false negatives) but results in many unnecessary interventions (false positives).
None: Represents the strategy of intervening for no one, which always yields zero net benefit but avoids all intervention-related harms.

The net benefit formula accounts for both the benefits of true positives and the costs of false positives:

\[NB = \frac{TP}{N_{total}} - \frac{FP}{N_{total}} \times \frac{p_t}{1-p_t}\]

Where N_total is the total sample size, and p_t is the threshold probability. It represents the minimum predicted probability of an outcome at which you would decide to intervene (e.g., diagnose or treat). For instance, if p_t = 0.10, you would intervene for anyone with a predicted risk ≥ 10%. The choice of p_t determines a specific balance between sensitivity (finding true cases) and specificity (avoiding false alarms). In practical terms, p_t represents the trade-off between benefits and harms: for every 1/(1-p_t) - 1 unnecessary interventions you are willing to accept to prevent one adverse outcome. The ratio p_t/(1-p_t) in the net benefit formula captures this relative weighting of false positives compared to true positives. The optimal p_t can be estimated as:

\[p_t = \frac{C_{FP}}{C_{FP} + C_{FN}}\]

Where C_FP is the cost of a false positive (unnecessary intervention) and C_FN is the cost of a false negative (missed positive case). p_t can also be estimated through expert surveys, stakeholder preferences, or established guidelines.

For population screening, p_t is typically set low because missing true cases is costlier than unnecessary follow-ups, so more false positives are acceptable. For diagnostic confirmation (e.g., before initiating high-risk treatment), p_t is set higher to avoid false positives, reflecting a preference for specificity. As a rule of thumb, screening scenarios may use p_t in the 1–10% range, whereas diagnostic decisions often warrant much higher p_t (for example 30–70% or more), depending on harms and preferences.

What we often want to know is not the absolute NB, but added value. ΔNB (Delta Net Benefit) measures this additional utility by comparing the model against the better of the two simple strategies (either All or None) at a specific threshold probability:

\[\Delta NB = NB_{\text{model}} - \max(NB_{\text{All}}, NB_{\text{None}})\]

At each threshold probability, the model's net benefit is compared against whichever simple strategy performs better at that threshold. This provides a more conservative and meaningful assessment of the model's added value. A positive ΔNB indicates that the predictive model offers genuine improvement over the best simple strategy, while values near zero suggest that simple strategies may be equally effective.

DCA is particularly valuable because it:

Incorporates the decision-making context through threshold probabilities that reflect real-world scenarios
Accounts for the relative costs of different types of errors (false positives vs. false negatives)
Provides actionable insights about when a model should or should not be used
Facilitates comparison between different models or strategies across various contexts

For more information about DCA, visit https://mskcc-epi-bio.github.io/decisioncurveanalysis.

Quick Start Examples

To further clarify how the tool can be used and to demonstrate its utility we provide some specific examples.

Example 1: Diagnostic Prediction

Can we predict depression diagnosis using a cognitive biomarker?

In the Binary outcome mode:
- Set base rate to 8% (the base rate of depression in adolescents; Shorey et al., 2022)
- Set the grouping reliability to 0.28 (depression diagnosis reliability based on DSM-5 field trials; Regier et al., 2013)
- Set the predictor reliability for both groups to 0.6 (an average reliability for cognitive measures; Karvelis et al., 2023)
- Set the observed effect size to d = 0.8 (a large effect size that is optimistic and rarely seen in practice)
With these parameters, observed Cohen's d = 0.8 will yield ROC-AUC = 0.71 and PR-AUC = 0.19. This means that even with a "large" effect size of 0.8, the predictive utility remains rather modest, especially when it comes to the tradeoff between PPV and Sensitivity (as shown by the low PR-AUC). Using the DCA plot to set the classification threshold to correspond to 10% risk, p_t = 0.1, we obtain PPV = 0.16, Sensitivity = 0.53, and ΔNB = 0.018, which means that at this threshold 84% of diagnoses would be false positives while still missing 47% of actual cases, and we would get 1.8 additional true positives per 100 diagnoses.
Note that with the low reliability values, this observed effect corresponds to a much larger true effect, d = 1.58, and in turn much better predictive performance, ROC-AUC = 0.87 and PR-AUC = 0.46, and PPV = 0.27, Sensitivity = 0.74, and ΔNB = 0.046 at p_t = 0.1, highlighting how much improvement in diagnostic prediction could be achieved simply by improving measurement reliability.
Now let's say we are serious about precision psychiatry and we want to achieve a PR-AUC of 0.8. Using the tool, we can find that it would require ROC-AUC = 0.96. It would be rather unrealistic to expect a single biomarker to achieve this effect size. Using the multivariable simulator, you can explore how many predictors with smaller d values would be required to achieve a desired prediction performance.

Example 2: Treatment Response Prediction

Can we predict who will respond to antidepressant treatment using task-based brain activity measures?

Select Continuous outcome mode:

Set base rate to 15% (the rate of response to antidepressant treatment beyond placebo; Stone et al., 2022)
Set predictor reliability to 0.4 (an average reliability for task-fMRI measures; Elliott et al., 2020)
Set outcome reliability to 0.94 (Hamilton Depression Rating Scale (HAMD) reliability; Trajković et al., 2011)
Adjust effect size such that R² = 0.2 (average multivariate R² from recent research; Karvelis et al., 2022)

This will yield ROC-AUC = 0.73 and PR-AUC = 0.33, indicating rather modest predictive performance, as shown by the low PR-AUC. At p_t = 0.2, which reflects the relative harms of antidepressant side effects, this would result in Sensitivity = 0.54, PPV = 0.29, and ΔNB = 0.031, which means that 46% of those who would benefit from treatment would not receive treatment, 71% of those given treatment would not benefit from it, and we would get additional 3.1 true responders per 100 people who receive antidepressants. Improving measurement reliability alone could improve performance quite substantially, up to ROC-AUC = 0.87 and PR-AUC = 0.57, which at p_t = 0.2 would give Sensitivity = 0.74, PPV = 0.42, and ΔNB = 0.073.
If we want to once again be serious about precision psychiatry and aim for PR-AUC of 0.8, we will find it requires R² = 0.8. These are rather extremely ambitious values (requiring to explain 80% of variance in symptom improvement). This helps demonstrate the inherent limitations of dichotomizing continuous outcomes for evaluating treatment response prediction - doing so leads to a loss of valuable information. On the other hand, it does reflect the binary nature of decision-making in psychiatry (to prescribe the treatment or not).

Example 3: Risk Prediction

Can we predict who will attempt suicide using electronic health records?

One of the largest prospective suicide prediction studies (Edgcomb et al., 2021) followed women (N = 67,000) with serious mental illness for 12 months after a general medical hospitalization and trained models on pre-discharge electronic health records to predict readmission for suicide attempt or self-harm, achieving ROC-AUC of 0.73 (derivation sample) and 0.71 (external sample). A companion study (Thiruvalluru et al., 2023) in men (N = 1.4 million) reported a 12-month base rate of 3.9% and similar discrimination.

In the Binary outcome mode:
- Set base rate to 3.9% (Thiruvalluru et al., 2023)
- Set the outcome reliability to 1.0 (hospital admissions for attempts or self-harm can be assumed to be near perfect)
- Set the predictor reliability for both groups to 0.8 (structured electronic health records including healthcare utilization, prior attempts, psychiatric diagnoses, etc., can be assumed to have rather high reliability)
- Set the observed effect size to d = 0.77, which corresponds to ROC-AUC = 0.71
This yields PR-AUC = 0.10, indicating poor predictive performance in the real world. At p_t = 0.03 (a reasonable threshold for intervention), we get Sensitivity = 0.77, PPV = 0.06, and ΔNB = 0.006. This means that while the model would capture about three-quarters of true cases, only 6% of those flagged would actually attempt suicide, and the added benefit would translate to 6 additional true cases per 1,000 individuals.
To achieve PR-AUC = 0.8 in this population would require ROC-AUC = 0.98, which is extremely unrealistic. At p_t = 0.03, this would result in Sensitivity = 0.94, PPV = 0.30, and ΔNB = 0.025.
Note that because the reliability is already quite high, improving it further would not make much of a difference. What we need to do is to find better predictors. Alternatively, it may be more effective to simply focus on universal suicide prevention strategies rather than trying to predict individual cases (e.g.,Large 2018).

Sample Size Calculations for Prediction Models

E2P Simulator includes a sample size calculator to help you determine how much data is needed for your multivariable prediction models. This calculator is designed to help you avoid overfitting and keep prediction error low (following the recommendations from Riley et al., 2020), providing more robust sample size estimates than simple rules of thumb like "10 events per predictor".

How to Use

Specify your number of predictors, realistically expected R² (based on prior research or pilot data), and outcome base rate (for binary outcomes only). Use R²_CS for binary outcomes and standard R² for continuous outcomes. Note, for binary outcomes with a single continuous predictor, R²_CS equals eta-squared (η²), which is already displayed in the main E2P simulator dashboard. The final recommendation uses the maximum across all criteria to ensure all performance targets are met.

The sample size calculators complement the main E2P simulators in study planning: the E2P simulators explore relationships between effect sizes and predictive utility (both what you need for desired performance and what utility to expect from realistic effects), while the sample size calculators determine adequate sample size based on realistic R² estimates from prior research. For sample size planning, always use conservative, realistic R² estimates based on prior research, not idealized target values.

Prediction Models vs. Hypothesis Testing Sample Sizes

You may wonder how these prediction-focused sample size calculations compare to traditional power analysis used in hypothesis testing. The key difference is that power analysis focuses on detecting whether an effect exists, while prediction-focused calculations prioritize model reliability and performance on new data. This fundamental difference in goals typically leads to larger sample size requirements for prediction models.

Another way to think about this difference is in terms of precision requirements. Power analysis only needs sufficient precision to distinguish an effect from zero (statistical significance). In contrast, prediction models require much tighter confidence intervals around parameter estimates to ensure much more precise estimation of predictive performance / effect sizes.

Feedback and Contributions

E2P Simulator is an open-source project - feedback, bug reports, and suggestions for improvement are welcome. The easiest way to do so is through the GitHub Issues page.

You can view the source code, track development, and contribute directly at the project's GitHub repository.

For other inquiries, you can find my contact information here.

References

Brabec, J., Komárek, T., Franc, V., & Machlica, L. (2020). On model evaluation under non-constant class imbalance. International Conference on Computational Science, vol. 12140 (pp. 74-87). Springer, Cham. https://doi.org/10.1007/978-3-030-50423-6_6
Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22. https://doi.org/10.1016/j.jclinepi.2019.02.004
Edgcomb, J. B., Thiruvalluru, R., Pathak, J., Brooks, J. O., & Zima, B. (2021). Machine learning to differentiate risk of suicide attempt and self-harm after general medical hospitalization of women with mental illness. Medical Care, 59, S58-S64. https://doi.org/10.1097/MLR.0000000000001445
Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., Sison, M. L., Moffitt, T. E., Caspi, A., & Hariri, A. R. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science, 31(7), 792-806. https://doi.org/10.1177/0956797620916786
Karvelis, P., & Diaconescu, A. O. (2025). Clarifying the reliability paradox: poor measurement reliability attenuates group differences. Frontiers in Psychology, 16, 1592658. https://doi.org/10.3389/fpsyg.2025.1592658
Karvelis, P., Paulus, M. P., & Diaconescu, A. O. (2023). Individual differences in computational psychiatry: A review of current challenges. Neuroscience & Biobehavioral Reviews, 148, 105137. https://doi.org/10.1016/j.neubiorev.2023.105137
Karvelis, P., Charlton, C. E., Allohverdi, S. G., Bedford, P., Hauke, D. J., & Diaconescu, A. O. (2022). Computational approaches to treatment response prediction in major depression using brain activity and behavioral data: A systematic review. Network Neuroscience, 6(4), 1066-1103. https://doi.org/10.1162/netn_a_00233
Large, M. M. (2018). The role of prediction in suicide prevention. Dialogues in Clinical Neuroscience, 20(3), 197-205. https://doi.org/10.31887/DCNS.2018.20.3/mlarge
Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1), 59-70. https://doi.org/10.1176/appi.ajp.2012.12070999
Riley, R. D., Ensor, J., Snell, K. I. E., Harrell Jr, F. E., Martin, G. P., Reitsma, J. B., Moons, K. G. M., Collins, G., & van Smeden, M. (2020). Calculating the sample size required for developing a clinical prediction model. BMJ, 368, m441. https://doi.org/10.1136/bmj.m441
Shorey, S., Ng, E. D., & Wong, C. H. J. (2022). Global prevalence of depression and elevated depressive symptoms among adolescents: A systematic review and meta-analysis. British Journal of Clinical Psychology, 61(2), 287-305. https://doi.org/10.1111/bjc.12333
Stone, M. B., Yaseen, Z. S., Miller, B. J., Richardville, K., Kalaria, S. N., & Kirsch, I. (2022). Response to acute monotherapy for major depressive disorder in randomized, placebo controlled trials submitted to the US Food and Drug Administration: Individual participant data analysis. BMJ, 378, e067606. https://doi.org/10.1136/bmj-2021-067606
Thiruvalluru, R. K., Edgcomb, J. B., Brooks, J. O., & Pathak, J. (2023). Risk of suicide attempts and self-harm after 1.4 million general medical hospitalizations of men with mental illness. Journal of Psychiatric Research, 157, 50-56. https://doi.org/10.1016/j.jpsychires.2022.10.035
Trajković, G., Starčević, V., Latas, M., Leštarević, M., Ille, T., Bukumirić, Z., & Marinković, J. (2011). Reliability of the Hamilton Rating Scale for Depression: A meta-analysis over a period of 49 years. Psychiatry Research, 189(1), 1-9. https://doi.org/10.1016/j.psychres.2010.12.007
Vickers, A. J., & Elkin, E. B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making, 26(6), 565-574. https://doi.org/10.1177/0272989X06295361

Getting Started with E2P Simulator

What is E2P Simulator?