bebi103.stan.sbc

bebi103.stan.sbc(prior_predictive_model=None, posterior_model=None, prior_predictive_model_data=None, posterior_model_data=None, measured_data=None, var_names=None, measured_data_dtypes=None, posterior_predictive_var_names=None, log_likelihood_var_name=None, sampling_kwargs=None, diagnostic_check_kwargs=None, cores=1, N=400, n_prior_draws_for_sd=1000, samples_dir='sbc_samples', remove_sample_files=True, progress_bar=False)

Perform simulation-based calibration (SBC) on a Stan model.

Parameters
  • prior_predictive_model (pystan.model.StanModel) – A Stan model for generating prior predictive data sets.

  • posterior_model (pystan.model.StanModel) – A Stan model of the posterior that allows sampling.

  • prior_predictive_model_data (dict) – Dictionary with entries specified by the data block of the prior predictive Stan model.

  • posterior_model_data (dict) – Dictionary with entries specified by the data block of the posterior Stan model. Measured data in this dictionary will be replaced in each simulation by the data generated by the prior predictive model.

  • measured_data (list) – A list of strings containing the variable names of measured data. Each entry in measured_data must be a key in posterior_model_data.

  • var_names (list of strings) – A list of strings containing parameter names to be considered in the SBC analysis. Not all parameters of the model need be considered; only those in var_names have rank statistics calculated. Note that for multidimensional variables, var_names only has the root name. E.g., var_names=['x', 'y'], not something like var_names=['x[0]', 'x[1]', 'y[0,0]'].

  • posterior_predictive_var_names (list of strings, default None) – List of variables that are posterior predictive. These are ignored in the SBC analysis. Note that for multidimensional variables, posterior_predictive_var_names only has the root name. E.g., posterior_predictive_var_names=['x_ppc', 'y_ppc'], not something like posterior_predictive_var_names=['x_ppc[0]', 'x_ppc[1]', 'y_ppc[0,0]'].

  • log_likelihood_var_name (string, default None) – Name of variable in the Stan model that stores the log likelihood. This is ignored in the SBC analysis.

  • measured_data_dtypes (dict, default None) – Dictionary of dtypes for the measured data. Each key is a string giving the name of a measured data variable, and the corresponding value is its dtype, almost always either int or float.

  • sampling_kwargs (dict, default None) – kwargs to be passed to sm.sample() for a CmdStanPy model sm or to sm.sampling() for a PyStan model sm. If using CmdStanPy, the 'output_dir' kwarg is not allowed because unambiguous naming is not possible if new sampling is done more than once per minute, given CmdStanPy’s naming convention.

  • diagnostic_check_kwargs (dict, default None) – kwargs to pass to check_all_diagnostics(). If quiet and/or return_diagnostics are given, they are ignored. max_treedepth is inferred from sampling_kwargs.

  • cores (int, default 1) – Number of cores to use in the SBC calculation.

  • N (int, default 400) – Number of simulations to run.

  • n_prior_draws_for_sd (int, default 1000) – Number of draws from the prior distribution used to compute the prior standard deviation of each parameter. This standard deviation is used in the shrinkage calculation.

  • samples_dir (str, default "sbc_samples") – Path to directory to store .csv and .txt files generated by CmdStan. If the Stan models are made using PyStan, this is ignored; it is only active if using CmdStanPy. The directory specified here will NOT be destroyed after a calculation, though the files within it may be, depending on the remove_sample_files kwarg.

  • remove_sample_files (bool, default True) – If True, remove .csv and .txt files generated by CmdStan. If the Stan models are made using PyStan, this is ignored; it is only active if using CmdStanPy.

  • progress_bar (bool, default False) – If True, display a progress bar for the calculation using tqdm.
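
To illustrate how the data-related arguments fit together, here is a minimal sketch for a hypothetical model with a single measured array y and parameters mu and sigma. The variable names (N, y, mu, sigma, y_ppc) and the helper check_sbc_inputs are assumptions made for illustration and are not part of bebi103. The helper only checks two constraints stated in the parameter descriptions above: every entry of measured_data must be a key of posterior_model_data, and var_names must contain root names only.

```python
def check_sbc_inputs(posterior_model_data, measured_data, var_names):
    """Hypothetical helper (not part of bebi103): check two documented constraints."""
    missing = [name for name in measured_data if name not in posterior_model_data]
    if missing:
        raise KeyError(f"measured_data entries not in posterior_model_data: {missing}")
    indexed = [name for name in var_names if "[" in name]
    if indexed:
        raise ValueError(f"var_names must contain root names only, got: {indexed}")


# Hypothetical inputs for a model with data `y` and parameters `mu` and `sigma`.
n_obs = 10
prior_predictive_model_data = {"N": n_obs}
# Placeholder values for `y`; measured data are replaced in each simulation.
posterior_model_data = {"N": n_obs, "y": [0.0] * n_obs}

sbc_kwargs = dict(
    prior_predictive_model_data=prior_predictive_model_data,
    posterior_model_data=posterior_model_data,
    measured_data=["y"],
    var_names=["mu", "sigma"],
    measured_data_dtypes={"y": float},
    posterior_predictive_var_names=["y_ppc"],
    N=400,
)

check_sbc_inputs(
    posterior_model_data, sbc_kwargs["measured_data"], sbc_kwargs["var_names"]
)
```

Given two compiled Stan models, a dictionary like sbc_kwargs could then be unpacked into the call together with the prior_predictive_model and posterior_model arguments.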

Returns

output – A Pandas DataFrame with the output of the SBC analysis. It has the following columns.

  • trial : Unique trial number for the simulation.

  • warning_code : Warning code based on the diagnostic checks performed by check_all_diagnostics().

  • parameter : The name of the scalar parameter.

  • prior : Value of the parameter used in the simulation. This value was drawn out of the prior distribution.

  • mean : Mean parameter value based on sampling out of the posterior in the simulation.

  • sd : Standard deviation of the parameter value based on sampling out of the posterior in the simulation.

  • L : The number of bins used in computing the rank statistic. The rank statistic should be uniform on the integers [0, L].

  • rank_statistic : Value of the rank statistic for the parameter for the trial.

  • shrinkage : The shrinkage for the parameter for the given trial. This is computed as 1 - sd / sd_prior, where sd_prior is the standard deviation of the parameter as determined from n_prior_draws_for_sd draws out of the prior.

  • z_score : The z-score for the parameter for the given trial. This is computed as abs(mean - prior) / sd.
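
The per-trial quantities in the columns above can be sketched for a single scalar parameter as follows. This is an illustration using only the Python standard library, not the implementation used by bebi103. It assumes the sample standard deviation (n - 1 denominator) and assumes the rank statistic is the number of posterior draws smaller than the prior draw, as in Talts et al.; any thinning of the posterior samples is omitted. The function name sbc_summary is hypothetical.

```python
import statistics


def sbc_summary(prior_value, posterior_draws, prior_draws):
    """Hypothetical sketch of the per-trial SBC quantities for one parameter."""
    mean = statistics.mean(posterior_draws)
    # Assumption: sample standard deviations with an n - 1 denominator.
    sd = statistics.stdev(posterior_draws)
    sd_prior = statistics.stdev(prior_draws)
    return {
        "prior": prior_value,
        "mean": mean,
        "sd": sd,
        # Assumption: rank = number of posterior draws below the prior draw.
        "rank_statistic": sum(draw < prior_value for draw in posterior_draws),
        "shrinkage": 1 - sd / sd_prior,
        "z_score": abs(mean - prior_value) / sd,
    }


# Tiny made-up example: three posterior draws and three prior draws.
summary = sbc_summary(
    prior_value=2.5,
    posterior_draws=[1.0, 2.0, 3.0],
    prior_draws=[0.0, 4.0, 8.0],
)
# Here mean = 2.0, sd = 1.0, and sd_prior = 4.0, so
# rank_statistic is 2, shrinkage is 0.75, and z_score is 0.5.
```

A shrinkage near one indicates that the posterior is much narrower than the prior, while a small z-score indicates that the posterior mean lies close to the parameter value that generated the data.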

Return type

Pandas DataFrame

Notes

Each simulation is done by sampling a parameter set out of the prior distribution, using those parameters to generate data from the likelihood, and then performing posterior sampling based on the generated data. A rank statistic for each parameter in each simulation is computed. This rank statistic should be uniformly distributed on the integers [0, L]. See Talts et al., https://arxiv.org/abs/1804.06788, for details.
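
The procedure described above can be sketched without Stan for a toy model in which the posterior is known in closed form. The sketch assumes mu ~ Normal(0, 1) and y_i ~ Normal(mu, 1), so the posterior for mu is Normal(sum(y) / (n + 1), 1 / sqrt(n + 1)). Exact posterior draws stand in for MCMC sampling here, which is a substitution made for illustration only. The function name sbc_ranks and all of its arguments are hypothetical, and L is taken to be the number of posterior draws per trial.

```python
import random


def sbc_ranks(n_trials=400, n_data=5, L=19, seed=0):
    """Hypothetical sketch: SBC rank statistics for a conjugate normal model."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_trials):
        # 1. Draw a parameter out of the prior, mu ~ Normal(0, 1).
        mu = rng.gauss(0.0, 1.0)
        # 2. Generate data from the likelihood, y_i ~ Normal(mu, 1).
        y = [rng.gauss(mu, 1.0) for _ in range(n_data)]
        # 3. Sample the posterior, which is available in closed form here.
        post_mean = sum(y) / (n_data + 1)
        post_sd = (1.0 / (n_data + 1)) ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(L)]
        # 4. Rank statistic: number of posterior draws below the prior draw.
        ranks.append(sum(draw < mu for draw in draws))
    return ranks


ranks = sbc_ranks()
# Histogram of the rank statistic over the integers 0 through L (here L = 19).
counts = [ranks.count(rank) for rank in range(20)]
```

Because the posterior draws are exact in this sketch, the counts should be roughly equal across the possible ranks. Systematic departures from uniformity in a real analysis point to problems with the model or the sampler.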