bebi103.stan.sbc
- bebi103.stan.sbc(prior_predictive_model=None, posterior_model=None, prior_predictive_model_data=None, posterior_model_data=None, measured_data=None, var_names=None, measured_data_dtypes=None, posterior_predictive_var_names=None, log_likelihood_var_name=None, sampling_kwargs=None, diagnostic_check_kwargs=None, cores=1, N=400, n_prior_draws_for_sd=1000, samples_dir='sbc_samples', remove_sample_files=True, df_package='polars', progress_bar=False)
Perform simulation-based calibration (SBC) on a Stan model.
- Parameters
prior_predictive_model (cmdstanpy.model.CmdStanModel) – A Stan model for generating prior predictive data sets.
posterior_model (cmdstanpy.model.CmdStanModel) – A Stan model of the posterior that allows sampling.
prior_predictive_model_data (dict) – Dictionary with entries specified by the data block of the prior predictive Stan model.
posterior_model_data (dict) – Dictionary with entries specified by the data block of the posterior Stan model. Measured data in this dictionary will be replaced in each simulation by what was generated by the prior predictive model.
measured_data (list) – A list of strings containing the variable names of measured data. Each entry in measured_data must be a key in posterior_model_data.
var_names (list of strings) – A list of strings containing parameter names to be considered in the SBC analysis. Not all parameters of the model need be considered; only those in var_names have rank statistics calculated. Note that for multidimensional variables, var_names only has the root name. E.g., var_names=['x', 'y'], not something like var_names=['x[0]', 'x[1]', 'y[0,0]'].
posterior_predictive_var_names (list of strings, default None) – List of variables that are posterior predictive. These are ignored in the SBC analysis. Note that for multidimensional variables, posterior_predictive_var_names only has the root name. E.g., posterior_predictive_var_names=['x_ppc', 'y_ppc'], not something like posterior_predictive_var_names=['x_ppc[0]', 'x_ppc[1]', 'y_ppc[0,0]'].
log_likelihood_var_name (string, default None) – Name of variable in the Stan model that stores the log likelihood. This is ignored in the SBC analysis.
measured_data_dtypes (dict, default None) – Dictionary in which each key is a string giving the name of a measured data variable and the corresponding value is its dtype, almost always either int or float.
sampling_kwargs (dict, default None) – kwargs to be passed to the sample() method of posterior_model.
diagnostic_check_kwargs (dict, default None) – kwargs to pass to check_all_diagnostics(). If quiet and/or return_diagnostics are given, they are ignored. max_treedepth is inferred from sampling_kwargs.
cores (int, default 1) – Number of cores to use in the SBC calculation.
N (int, default 400) – Number of simulations to run.
n_prior_draws_for_sd (int, default 1000) – Number of draws out of the prior distribution used to compute the prior standard deviation of each parameter. This standard deviation is used in the shrinkage calculation.
samples_dir (str, default "sbc_samples") – Path to directory to store .csv and .txt files generated by CmdStan. The directory specified here will NOT be destroyed after a calculation, though the files within it may be, depending on the remove_sample_files kwarg.
remove_sample_files (bool, default True) – If True, remove .csv and .txt files generated by CmdStan.
df_package (str, either 'polars' (default) or 'pandas') – Which package to use for the output data frame.
progress_bar (bool, default False) – If True, display a progress bar for the calculation using tqdm.
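The constraints on the inputs described above can be checked before launching a long calculation. The following is a minimal sketch, not part of bebi103: check_sbc_inputs is a hypothetical helper, and the data names N and y and the parameter names mu and sigma are made up for illustration. It only verifies that every entry of measured_data is a key in posterior_model_data and that var_names contains root names.

```python
import re


def check_sbc_inputs(posterior_model_data, measured_data, var_names):
    """Hypothetical pre-flight check; this is NOT a bebi103 function."""
    # Each entry in measured_data must be a key in posterior_model_data.
    missing = [name for name in measured_data if name not in posterior_model_data]
    if missing:
        raise KeyError(f"not in posterior_model_data: {missing}")

    # var_names holds root names only, e.g. 'x', not 'x[0]'.
    indexed = [name for name in var_names if re.search(r"\[.*\]$", name)]
    if indexed:
        raise ValueError(f"use root names instead of: {indexed}")

    return True


# Placeholder inputs for an imagined model (names are assumptions).
posterior_model_data = {"N": 3, "y": [0.0, 0.0, 0.0]}
sbc_kwargs = dict(
    prior_predictive_model_data={"N": 3},
    posterior_model_data=posterior_model_data,
    measured_data=["y"],
    var_names=["mu", "sigma"],
    measured_data_dtypes={"y": float},
    N=400,
)

check_sbc_inputs(
    posterior_model_data, sbc_kwargs["measured_data"], sbc_kwargs["var_names"]
)

# With two compiled CmdStanModel objects (here given the hypothetical names
# prior_pred_model and post_model), the call would then take the form:
# df = bebi103.stan.sbc(
#     prior_predictive_model=prior_pred_model,
#     posterior_model=post_model,
#     **sbc_kwargs,
# )
```

The values stored under the measured data keys in posterior_model_data are placeholders, since they are replaced in each simulation by data generated from the prior predictive model.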
- Returns
output – Data frame with the output of the SBC analysis. It has the following columns.
trial : Unique trial number for the simulation.
warning_code : Warning code based on the diagnostic checks performed by check_all_diagnostics().
parameter : The name of the scalar parameter.
prior : Value of the parameter used in the simulation. This value was drawn out of the prior distribution.
mean : Mean parameter value based on sampling out of the posterior in the simulation.
sd : Standard deviation of the parameter value based on sampling out of the posterior in the simulation.
L : The number of bins used in computing the rank statistic. The rank statistic should be uniform on the integers [0, L].
rank_statistic : Value of the rank statistic for the parameter for the trial.
shrinkage : The shrinkage for the parameter for the given trial. This is computed as 1 - sd / sd_prior, where sd_prior is the standard deviation of the parameter as determined from n_prior_draws_for_sd draws out of the prior.
z_score : The z-score for the parameter for the given trial. This is computed as abs(mean - prior) / sd.
- Return type
Polars or Pandas DataFrame
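As an illustration of the shrinkage and z_score columns, the two formulas stated above can be written out directly. This is a sketch of the documented expressions only, not the library's internal code; the function names and the numbers for the single trial are made up.

```python
def shrinkage(sd_posterior, sd_prior):
    """Shrinkage as documented above: 1 - sd / sd_prior."""
    return 1.0 - sd_posterior / sd_prior


def z_score(mean_posterior, prior_draw, sd_posterior):
    """z-score as documented above: abs(mean - prior) / sd."""
    return abs(mean_posterior - prior_draw) / sd_posterior


# Made-up values for one parameter in one trial.
row = {"prior": 1.5, "mean": 1.0, "sd": 0.5}
sd_prior = 2.0  # would come from draws out of the prior

print(shrinkage(row["sd"], sd_prior))                 # 0.75
print(z_score(row["mean"], row["prior"], row["sd"]))  # 1.0
```

A shrinkage near one means the posterior is much narrower than the prior, and a small z-score means the posterior mean lies close to the parameter value that generated the data.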
Notes
Each simulation is done by sampling a parameter set out of the prior distribution, using those parameters to generate data from the likelihood, and then performing posterior sampling based on the generated data. A rank statistic for each simulation is computed. This rank statistic should be uniformly distributed over its L possible values. See https://arxiv.org/abs/1804.06788, by Talts et al., for details.
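The procedure in these notes can be sketched without Stan for a toy model in which the posterior is known in closed form. The example below is an assumption-laden illustration, not what bebi103 does internally: it uses a Normal(0, 1) prior on mu with unit-variance Normal measurements, and it draws directly from the exact conjugate posterior in place of MCMC sampling. The function name sbc_toy and all of its arguments are hypothetical.

```python
import random
import statistics


def sbc_toy(n_trials=400, n_data=5, L=99, seed=0):
    """Toy SBC for mu ~ Normal(0, 1), y_i ~ Normal(mu, 1).

    Exact posterior draws stand in for MCMC samples here.
    """
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_trials):
        # 1. Draw a parameter value out of the prior.
        mu = rng.gauss(0.0, 1.0)

        # 2. Generate a data set from the likelihood.
        y = [rng.gauss(mu, 1.0) for _ in range(n_data)]

        # 3. Draw from the posterior, which for this conjugate model is
        #    Normal(sum(y) / (n + 1), sqrt(1 / (n + 1))).
        post_mean = sum(y) / (n_data + 1)
        post_sd = (1.0 / (n_data + 1)) ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(L)]

        # 4. Rank statistic: number of posterior draws below the prior draw.
        ranks.append(sum(d < mu for d in draws))
    return ranks


ranks = sbc_toy()
print(min(ranks), max(ranks), statistics.mean(ranks))
```

Because the posterior draws here are exact, the ranks should look uniform; a histogram of ranks that is sloped, U-shaped, or peaked in the middle would indicate a problem with the sampler or the model.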