Stan (MCMC) utilities


[1]:
import re

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import bebi103

import holoviews as hv
hv.extension('bokeh')

bebi103.hv.set_defaults()

import bokeh.io
bokeh.io.output_notebook()

The bebi103 package contains a set of convenience functions for use with Stan. Much of the functionality is included in ArviZ, and indeed the bebi103.stan module depends on ArviZ and extensively uses its InferenceData data type. Furthermore, the bebi103 package offers visualization functions for MCMC results, which we will introduce as we go through a Bayesian workflow example below. Again, much of this functionality is present in ArviZ, but with modifications, since, at least in my experience, ArviZ's Bokeh backend has some problems.

Model and sample data

To demonstrate the usage of the Stan utilities of bebi103, we will consider the following hierarchical generative model.

\begin{align}
&\theta_i \sim \text{Norm}(5, 5),\;\;i\in\{1, 2\},\\[1em]
&\tau_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\sigma_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\rho \sim \text{Uniform}(-1, 1),\\[1em]
&\mathsf{T} = \begin{pmatrix}\tau_1^2 & 0 \\ 0 & \tau_2^2\end{pmatrix}, \\[1em]
&\mathsf{\Sigma} = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}, \\[1em]
&\theta_{1, i} \sim \text{Norm}(\theta, \mathsf{T}) \;\forall\,i\in\{1, 2, 3\},\\[1em]
&\begin{pmatrix} x_{i, j} \\ y_{i,j} \end{pmatrix} \sim \text{Norm}(\theta_{1,i}, \mathsf{\Sigma})\;\forall\,i \in \{1, 2, 3\}, \; j \in \{1, \ldots, n_i \}.
\end{align}

This is a hierarchical model in which the three vector-valued parameters \(\theta_{1,1}\), \(\theta_{1,2}\), and \(\theta_{1,3}\) are conditioned on the vector-valued hyperparameter \(\theta\). We will use a fabricated data set generated from this model with parameters described in the Using the user guide section. It is useful to understand the data set, so let’s look at the data frame.

[2]:
df = pd.read_csv('sample_data.csv')
df.head()
[2]:
x y trial
0 1.762886 11.955616 1
1 4.364957 11.136633 1
2 3.457626 12.301267 1
3 -0.839319 10.401899 1
4 4.694602 11.925334 1

Each trial (there are three, numbered 1, 2, and 3) has an x measurement and a y measurement.
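For orientation, here is a minimal NumPy sketch of how a data set like this could be fabricated from the generative model above. The parameter values below are hypothetical stand-ins, not the values actually used to make sample_data.csv (those are given in the Using the user guide section), though the trial sizes (20, 25, and 18 points) match the sample data set.

# Hypothetical parameter values, for illustration only
rng = np.random.default_rng(3252)
theta = np.array([3.0, 8.0])
tau = np.array([1.0, 2.0])
sigma = np.array([2.0, 2.5])
rho = 0.6

# Covariance matrix for the likelihood
Sigma = np.array(
    [
        [sigma[0] ** 2, rho * sigma[0] * sigma[1]],
        [rho * sigma[0] * sigma[1], sigma[1] ** 2],
    ]
)

# T is diagonal, so each component of theta_1 is drawn independently
dfs = []
for trial, n in enumerate([20, 25, 18], start=1):
    theta_1 = rng.normal(theta, tau)
    xy = rng.multivariate_normal(theta_1, Sigma, size=n)
    dfs.append(pd.DataFrame({"x": xy[:, 0], "y": xy[:, 1], "trial": trial}))

df_sim = pd.concat(dfs, ignore_index=True)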

To begin our analysis, we will use the following Stan model, which directly follows from the above generative model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  // Index array to keep track of hierarchical structure
  int index_1[N];

  // The measurements
  real x[N];
  real y[N];
}


transformed data {
  // Data are two-dimensional, so store in a vector
  vector[2] xy[N];

  for (i in 1:N) {
    xy[i, 1] = x[i];
    xy[i, 2] = y[i];
  }
}


parameters {
  // Hyperparameters level 0
  vector[2] theta;

  // How hyperparameters vary
  vector<lower=0>[2] tau;

  // Parameters
  vector[2] theta_1[J_1];
  vector<lower=0>[2] sigma;
  real<lower=-1, upper=1> rho;
}


transformed parameters {
  // Covariance matrix for hyperparameters
  matrix[2, 2] Tau = [
    [tau[1]^2,  0       ],
    [0,         tau[2]^2]
  ];

  // Covariance matrix for likelihood
  matrix[2, 2] Sigma = [
    [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
    [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
  ];
}


model {
  // Hyperpriors
  theta ~ normal(5, 5);
  tau ~ normal(0, 10);

  // Priors
  theta_1 ~ multi_normal(theta, Tau);
  sigma ~ normal(0, 10);
  rho ~ uniform(-1, 1);

  // Likelihood
  for (i in 1:N) {
    xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
  }
}


generated quantities {
  real x_ppc[N];
  real y_ppc[N];
  real log_lik[N];

  {
    vector[2] xy_ppc;

    for (i in 1:N) {
      xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
      log_lik[i] = multi_normal_lpdf(xy[i] | theta_1[index_1[i]], Sigma);
      x_ppc[i] = xy_ppc[1];
      y_ppc[i] = xy_ppc[2];
    }
  }
}

A few important notes about the Stan file.

  1. The array index_1 keeps track of the dependence of the data on each parameter, which are themselves conditioned on hyperparameters.

  2. Posterior predictive checks and the log-likelihood function are calculated in the generated quantities block.

A first pass through a simple workflow

We will now proceed through an MCMC workflow, doing the following steps:

  1. Compile the Stan model.

  2. Convert the data frame into input data for the Stan model.

  3. Draw samples.

  4. Convert the samples into an ArviZ InferenceData instance.

  5. Check the sampling diagnostics.

  6. Create visualizations of the samples.

Compiling the Stan model

Starting with the first step, compiling a Stan model that we can use for sampling, we will use CmdStanPy. Although the bebi103 package supports PyStan, CmdStanPy is preferred because of the simplicity of its interface and its shorter model compilation time.

If we wish to suppress messages to the screen, we can use bebi103.stan.disable_logging() with context management.

[3]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="sample_model.stan")

Preparing hierarchical data for Stan

Now that we have a compiled Stan model, we need to supply it data. Referring to the data block in the Stan code above, for this hierarchical model we need to specify the total number of data points, the values of x and y, the number of parameters at the first level of the hierarchy (J_1), and a set of indices that give which of the three \(\theta_1\) values each data point is conditioned on. This information is already contained in the tidy data frame, df. The 'x' and 'y' columns contain the data, and the 'trial' column determines the level in the hierarchy. The value of J_1 is given by the number of unique entries in the 'trial' column, and the number of data points is given by the total number of rows in the data frame.

The bebi103.stan.df_to_datadict_hier() function converts a tidy Pandas data frame into a dictionary of data that can be passed to a Stan program for a hierarchical model. Furthermore, since the 'trial' column could contain strings or other non-integer entries, these need to be converted to integers for use in Stan. A column with the suffix _stan is added to the data frame to give the integer values of that column that are used in the Stan program. This is for reference when analyzing the results.

[4]:
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols="trial", data_cols=["x", "y"]
)

# Take a look at the updated data frame
df.head()
[4]:
x y trial trial_stan
0 1.762886 11.955616 1 1
19 4.245374 12.896902 1 1
18 2.495358 6.278253 1 1
17 4.847436 8.901546 1 1
16 5.585134 11.772688 1 1

The data frame has been appropriately updated (here the trials were already numbered 1, 2, and 3, so the 'trial' and 'trial_stan' columns match). Let’s look at the data dictionary.

[5]:
data
[5]:
{'N': 63,
 'J_1': 3,
 'index_1': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
 'x': array([ 1.76288583,  4.24537447,  2.49535766,  4.84743608,  5.58513412,
         2.46538959,  2.53372498,  3.03556003,  4.1428636 ,  3.37291658,
         1.80757816,  3.86468666,  2.0406895 ,  6.10018145,  3.70259613,
         4.6946018 , -0.83931887,  3.45762554,  4.36495728,  9.02739122,
        -0.90483667,  0.63091768,  2.7511249 ,  0.84468249, -1.40832682,
         0.97071434,  1.65312163,  4.88902665,  4.74467292,  2.00083667,
         1.5374102 ,  5.36558851,  6.03757711,  1.71903739,  1.56783325,
         0.25157748,  6.76167578, -0.67226419,  3.36289739,  2.17013057,
         4.03836253,  2.72486749,  3.65771395,  4.59028926,  1.40245352,
         4.34252071,  4.1146886 ,  0.79938752,  0.72094663,  3.19538556,
         0.58176677, -0.01108818, -0.10906546,  2.37188306,  4.88494459,
        -0.35811241, -0.85531719,  2.62235591,  0.71487182,  1.38188117,
         1.82837862,  0.16000891,  4.81184826]),
 'y': array([11.95561574, 12.89690196,  6.27825302,  8.90154585, 11.77268772,
         8.9978501 ,  6.16475102, 10.64184364, 10.74399897,  8.71908946,
         9.63231343, 13.00236894,  9.47857277, 10.97263645, 11.46520631,
        11.92533427, 10.40189925, 12.30126673, 11.13663316, 16.78665286,
         4.11289192,  5.32907863,  7.82030839,  3.75238271,  3.89596763,
         1.37792381,  6.55489775,  9.7609403 , 10.03251758,  6.26665374,
         5.62577914,  5.84795532,  8.14624991,  9.12145582,  5.94521903,
         4.34204882, 12.30037908, -0.52873034,  4.3659075 ,  9.05879288,
         8.84628724,  9.95807796,  5.07513802, 10.37096513,  3.73211647,
         5.8037337 ,  5.02628689,  3.61882617,  6.98712067, 11.21360732,
         4.31478889,  4.40895773,  4.73065621,  8.95402277,  8.31905359,
         7.80920144,  5.92093593,  4.84087256,  3.43491574,  5.78404006,
         8.04955064,  5.29564007,  8.64743679])}

The dictionary has all of the entries necessary to pass into the Stan program while sampling.

Sampling and conversion to ArviZ InferenceData

We can now go about our sampling and get the results. We will again suppress logging. Note that suppressing logging is in general not a good idea; I do it here only to manage space in the documentation.

[6]:
with bebi103.stan.disable_logging():
    samples = sm.sample(
        data=data, chains=4, iter_sampling=1000, adapt_delta=0.95, seed=3252
    )

The bebi103 package works with the excellent ArviZ InferenceData data type. ArviZ can be used to convert samples from Stan, PyMC3, emcee, Pyro, and many others into a well-defined universal data format.

[7]:
samples = az.from_cmdstanpy(
    samples, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)

Converting InferenceData to a Pandas DataFrame

Now that the samples are stored as InferenceData, we can go about using them. We may first want to convert the posterior samples into a Pandas data frame to do our own analysis.

[8]:
df_samples = bebi103.stan.arviz_to_dataframe(samples)

# Take a look
df_samples.head()
[8]:
theta[0] theta[1] tau[0] tau[1] theta_1[0,0] theta_1[0,1] theta_1[1,0] theta_1[1,1] theta_1[2,0] theta_1[2,1] ... Tau[0,1] Tau[1,0] Tau[1,1] Sigma[0,0] Sigma[0,1] Sigma[1,0] Sigma[1,1] chain__ draw__ diverging__
0 2.37476 7.46285 0.923329 1.45192 3.54939 10.05800 2.82881 5.95545 1.88787 7.08742 ... 0.0 0.0 2.10808 4.43857 3.29286 3.29286 6.81653 0 0 False
1 2.49195 7.91184 0.950101 1.31911 3.02827 9.84365 2.50844 5.71521 1.82827 7.40473 ... 0.0 0.0 1.74006 5.15723 3.68042 3.68042 7.61829 0 1 False
2 2.63931 6.99427 1.222190 3.59020 3.73759 10.63330 2.68387 6.89852 1.75886 5.94173 ... 0.0 0.0 12.88950 3.58365 3.20140 3.20140 8.82983 0 2 False
3 3.25652 7.09316 0.787648 3.52398 3.61098 10.10210 2.60675 6.98456 1.97803 6.58115 ... 0.0 0.0 12.41840 3.25437 3.20632 3.20632 8.44636 0 3 False
4 2.80134 7.91179 0.739557 1.10262 2.39642 9.66203 2.63468 6.75522 2.13054 6.64004 ... 0.0 0.0 1.21576 5.09825 3.26933 3.26933 5.90262 0 4 False

5 rows × 24 columns

Variable/parameter naming convention

Note how multidimensional variables are named in the above data frame. They are zero-indexed, and use brackets with the different axes of the index separated by a comma with no spaces. Note also that the last three columns are chain__, draw__, and diverging__, which specify which chain the sample came from, which of the successive draws it was, and whether or not the draw had a divergence, respectively. This nomenclature is used throughout the bebi103 MCMC parsing functions.

Some functions in the bebi103.stan and bebi103.viz modules use parameters as an argument. In those cases, the entries of multidimensional parameters are expected to be entered one by one. Other functions use var_names as an argument. For multidimensional parameters, only the base name (with no indexing) is expected. This var_names format is congruent with that of ArviZ. As an example, consider a scalar parameter a, a vector parameter b with three entries, and a 2×2 matrix parameter c. The following shows how the respective keyword arguments should be used.

  • parameters=['a', 'b[0]', 'b[1]', 'b[2]', 'c[0,0]', 'c[0,1]', 'c[1,0]', 'c[1,1]']

  • var_names=['a', 'b', 'c']

Using parameters allows omission of some elements of multidimensional parameters.
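As a quick illustration of the naming convention (a sketch using the df_samples data frame we made above), we can index columns by their bracketed names and filter on the diagnostic columns.

# Keep only non-divergent draws
inds = ~df_samples["diverging__"]
df_theta = df_samples.loc[inds, ["theta[0]", "theta[1]", "chain__", "draw__"]]

# Posterior mean of each component of theta
df_theta[["theta[0]", "theta[1]"]].mean()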

Checking diagnostics

We may check the following diagnostic features:

  1. The effective sample size (ESS).

  2. The rank-normalized Gelman-Rubin statistic, Rhat (\(\hat{R}\)).

  3. Divergences.

  4. Tree depth.

  5. The energy Bayesian fraction of missing information (E-BFMI).

The functions to check these diagnostics are based on some of the functionality in ArviZ and from the work of Michael Betancourt.

We can start by checking the effective sample size. The bebi103.stan.check_ess() function checks for both the effective sample size and the “tail” effective sample size, as defined in Vehtari, et al., 2019.

[9]:
bebi103.stan.check_ess(samples)
Effective sample size looks reasonable for all parameters.
[9]:
True

The function returned True, indicating that all variables have an effective sample size and tail effective sample size greater than the rule of thumb of 100 per chain. We can adjust the rule of thumb if we like. If we use the total_ess_rule_of_thumb=400 kwarg, then any parameter with fewer than 1600 effective samples (since we had four chains) is flagged as failing the ESS check.

[10]:
bebi103.stan.check_ess(samples, total_ess_rule_of_thumb=400)
tail-ESS for parameter theta[0] is 1465.951659584542.
tail-ESS for parameter theta[1] is 1426.717666604798.
ESS for parameter tau[0] is 1235.1932445709374.
ESS for parameter Tau[0,0] is 1235.1909291436552.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.
[10]:
False

The function now returns False, and also prints diagnostic information about the parameters for which the test failed.

Now let’s turn to Rhat. The Rhat value is a rank-normalized, folded Rhat, as described in Vehtari, et al., 2019.

[11]:
bebi103.stan.check_rhat(samples)
Rhat for parameter Tau[0,1] is NaN.
Rhat for parameter Tau[1,0] is NaN.
[11]:
False

The function returned False, meaning that the Rhat check failed. It informed us that two of the Rhat values were NaN, which causes a failure of the check. Deterministic quantities, such as those computed in the transformed parameters or generated quantities blocks, often have NaN Rhat because they are not in any way manipulated by the sampler. In this case, the off-diagonal terms of the matrix Tau are always zero. We can omit these using the omit kwarg.

[12]:
bebi103.stan.check_rhat(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Rhat looks reasonable for all parameters.
[12]:
True

Next, we can check for divergences.

[13]:
bebi103.stan.check_divergences(samples)
53 of 4000 (1.325%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
[13]:
False

We failed the divergence test because we had 53 total divergences. We will deal with these momentarily.
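If we want this count programmatically, one way (a sketch) is to read it off of the sample statistics stored in the InferenceData object, or, equivalently, to sum the diverging__ column of the data frame we made earlier.

# Total number of divergent iterations across all chains
int(samples.sample_stats["diverging"].sum())

# Equivalently, from the data frame of samples
df_samples["diverging__"].sum()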

We can also check whether the sampler was hitting the maximum tree depth. By default, the maximum tree depth used in a sampling call by CmdStanPy is 10, which is also the default max_treedepth kwarg value for bebi103.stan.check_treedepth(). You need to set this kwarg to match whatever maximum tree depth was used in the sampling. In our case, we can use the default.

[14]:
bebi103.stan.check_treedepth(samples)
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
[14]:
True

We have passed this test.

Finally, we can check the E-BFMI.

[15]:
bebi103.stan.check_energy(samples)
E-BFMI indicated no pathological behavior.
[15]:
True

We have also passed this test.

For convenience, we can perform all tests at once using the bebi103.stan.check_all_diagnostics() function.

[16]:
bebi103.stan.check_all_diagnostics(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

53 of 4000 (1.325%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.
[16]:
4

The return value of this function is a number whose binary representation encodes the test results: each binary digit stands for one test, with a digit of zero indicating that the test passed. The ordering of the tests, from the least significant digit to the most significant, is:

  • ess

  • r_hat

  • divergences

  • tree depth

  • E-BFMI

For example, a warning code of 12 has a binary representation of 01100, which means that the divergences and tree depth tests failed. Our warning code of 4 has a binary representation of 00100, which means that only the divergence test failed.
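To make the encoding concrete, here is a small sketch that decodes a warning code by hand using the bit ordering listed above. The bebi103.stan.parse_warning_code() function, demonstrated later in this guide, does this for you.

def decode_warning_code(warning_code):
    """Sketch: names of the tests that failed for a given warning code."""
    tests = ["ess", "r_hat", "divergences", "treedepth", "e_bfmi"]
    return [test for i, test in enumerate(tests) if warning_code & (1 << i)]

decode_warning_code(12)  # ['divergences', 'treedepth']
decode_warning_code(4)   # ['divergences']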

Visualizing results

The bebi103 package allows for three visualizations of MCMC results.

  • Trace plots: bebi103.viz.trace()

  • Parallel coordinate plots: bebi103.viz.parcoord()

  • Corner plot: bebi103.viz.corner()

Each of these takes as its first six arguments:

  • samples: ArviZ InferenceData instance containing the samples. (Note that corner() also accepts other data types for samples, such as in the display of bootstrap replicates.)

  • parameters: List of the names of variables to include in the plot.

  • palette: List of colors to use.

  • omit: List of variable names or regular expression patterns for variables to omit in the plot.

  • include_ppc: If True, include variables ending in _ppc, which denote posterior predictive checks, in the plot. Default is False.

  • include_log_lik: If True, include variables starting with log_lik or loglik. These denote log-likelihood contributions. Default is False.

samples is the only required argument.

Trace plots

We will begin with a trace plot of the hyperparameters θ and τ.

[17]:
bokeh.io.show(
    bebi103.viz.trace(samples, parameters=["theta[0]", "theta[1]", "tau[0]", "tau[1]"])
)

The traces are color-coded by chain.

Parallel coordinate plots

For a parallel coordinate plot, we want to omit all four entries of the matrices Tau and Sigma because the entries are either zero or determined uniquely by other parameters. We can specify regular expressions in the omit list to conveniently omit all of the entries in these matrix parameters.

[18]:
bokeh.io.show(
    bebi103.viz.parcoord(samples, omit=[re.compile("Tau.*"), re.compile("Sigma.*")])
)

In the plot, samples that did not result in a divergence are semi-transparent gray by default. Samples that had a divergence are colored in orange and shown with thicker lines. Because the scales of the respective parameters may differ, it is often easier to rescale each parameter for the plot. We can do this with the transformation kwarg; setting it to 'minmax' rescales each parameter so that its minimum value is zero and its maximum is one.

[19]:
bokeh.io.show(
    bebi103.viz.parcoord(
        samples,
        omit=[re.compile("Tau.*"), re.compile("Sigma.*")],
        transformation="minmax",
    )
)

This plot is more revealing. We see (especially if we zoom in) that most of the divergent samples pass through small values of tau[0]. This is indicative of the “funnel” behavior often encountered in hierarchical models.
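We can corroborate this numerically with a quick sketch using the df_samples data frame from earlier, comparing the distribution of tau[0] between divergent and non-divergent draws.

# Summary statistics of tau[0], split on whether the draw was divergent
df_samples.groupby("diverging__")["tau[0]"].describe()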

Corner plots

Corner plots display scatter plots of samples from all pairs of variables, as well as histograms or ECDFs for individual variables. We can make a corner plot for the hyperparameters and also for ρ.

[20]:
bokeh.io.show(
    bebi103.viz.corner(
        samples, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
    )
)

These are very useful plots for interpreting a posterior, and also for diagnosing problems with sampling. The divergent samples appear in orange. The clustering of divergences for small tau[0] is clear.

The diagonal can instead be plotted with ECDFs, and contours may be added to the off-diagonal plots.

[21]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"],
        plot_ecdf=True,
        show_contours=True,
    )
)

Simulation-based calibration (SBC)

In a principled Bayesian workflow, simulation-based calibration (SBC) is a useful tool to identify pathologies in a generative model and in the sampler’s ability to draw samples from it. The procedure is laid out in Talts, et al., 2018, and is described in the Stan manual.

Briefly, the procedure is:

  1. Draw a parameter set θ out of the prior.

  2. Use θ to draw a data set y out of the likelihood.

  3. Perform MCMC sampling of the posterior using y as if it were the actual measured data set. Draw L MCMC samples of the parameters.

  4. Do steps 1-3 N times (hundreds to a thousand is usually a good number).

For each of these calculations, you should run diagnostics to make sure the effective sample size, Rhat, etc., are within acceptable ranges. You can also compute z-scores (a measure of how well the MCMC samples reveal the ground truth), shrinkage (how informative the data can be), and rank statistics (used to check sampler performance). Please see the Talts, et al. paper for details on these quantities and their interpretation. Our focus here is how to use the bebi103.stan.sbc() function to perform the analysis and compute these quantities.
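For reference, here is a sketch of how these quantities are typically computed for a single parameter in one SBC trial, following the definitions in Talts, et al. and Betancourt’s workflow writings. The bebi103.stan.sbc() function computes them for you, and its exact conventions (e.g., thinning the draws before computing the rank statistic) may differ.

def sbc_summaries(draws, ground_truth, prior_sd):
    """Sketch of SBC summaries for one parameter in one trial.

    draws: 1D array of posterior draws; ground_truth: the value drawn
    from the prior in step 1; prior_sd: std. dev. of the prior.
    """
    post_mean = np.mean(draws)
    post_sd = np.std(draws)

    # z-score: how far the posterior mean lies from the ground truth,
    # in units of posterior standard deviations
    z_score = (post_mean - ground_truth) / post_sd

    # shrinkage: 1 - posterior variance / prior variance; close to one
    # means the data are informative relative to the prior
    shrinkage = 1.0 - post_sd ** 2 / prior_sd ** 2

    # rank statistic: number of posterior draws below the ground truth
    rank_statistic = np.sum(draws < ground_truth)

    return z_score, shrinkage, rank_statistic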

Example model

We have already seen in the example so far that we get divergences and the sampler does not seem to be able to sample small values of tau[0]. We will therefore write a new Stan program with a noncentered parametrization. The Stan model is:

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  // Index array to keep track of hierarchical structure
  int index_1[N];

  // The measurements
  real x[N];
  real y[N];
}


transformed data {
  // Data are two-dimensional, so store in a vector
  vector[2] xy[N];
  for (i in 1:N) {
    xy[i, 1] = x[i];
    xy[i, 2] = y[i];
  }
}


parameters {
  // Hyperparameters level 0
  vector[2] theta;

  // How hyperparameters vary
  vector<lower=0>[2] tau;

  // Parameters
  vector<lower=0>[2] sigma;
  real<lower=-1, upper=1> rho;

  // Noncentered parameters
  vector[2] theta_1_noncentered[J_1];
}


transformed parameters {
  // Covariance matrix for likelihood
  matrix[2, 2] Sigma = [
    [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
    [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
  ];

  // Center parameters
  vector[2] theta_1[J_1];
  for (i in 1:J_1) {
    theta_1[i] = theta + tau .* theta_1_noncentered[i];
  }
}


model {
  // Hyperpriors
  theta ~ normal(5, 5);
  tau ~ normal(0, 10);

  // Priors
  theta_1_noncentered ~ multi_normal([0, 0], [[1, 0], [0, 1]]);
  sigma ~ normal(0, 10);
  rho ~ uniform(-1, 1);

  // Likelihood
  for (i in 1:N) {
    xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
  }
}


generated quantities {
  real x_ppc[N];
  real y_ppc[N];
  real log_lik[N];

  {
    vector[2] xy_ppc;

    for (i in 1:N) {
      xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
      log_lik[i] = multi_normal_lpdf(xy[i] | theta_1[index_1[i]], Sigma);
      x_ppc[i] = xy_ppc[1];
      y_ppc[i] = xy_ppc[2];
    }
  }
}

Let’s go ahead and compile the model and draw some samples to see if the noncentering alleviates the problems with divergences. We will use extra warmup to make sure the chains achieve stationarity.

[22]:
with bebi103.stan.disable_logging():
    sm_nc = cmdstanpy.CmdStanModel(stan_file="sample_model_noncentered.stan")

    samples_nc = sm_nc.sample(
        data=data,
        chains=4,
        iter_warmup=2000,
        iter_sampling=1000,
        adapt_delta=0.95,
        seed=3252,
    )

samples_nc = az.from_cmdstanpy(
    samples_nc, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)

We can make a corner plot to check for divergences.

[23]:
bokeh.io.show(
    bebi103.viz.corner(
        samples_nc, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
    )
)

We still got some divergences, but only nine, which could be false positives. We will go forward with SBC using this model.

Building an SBC

To perform simulation-based calibration, we need to write a Stan program that generates data sets using parameters drawn out of the prior, that is, a prior predictive model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  // Index array to keep track of hierarchical structure
  int index_1[N];
}


generated quantities {
  vector[2] theta;
  vector[2] tau;
  vector[2] sigma;

  for (i in 1:2) {
    theta[i] = normal_rng(5, 5);
    tau[i] = fabs(normal_rng(0, 10));
    sigma[i] = fabs(normal_rng(0, 10));
  }

  real rho = uniform_rng(-1, 1);

  vector[2] theta_1[J_1];
  for (i in 1:J_1) {
    for (j in 1:2) {
      theta_1[i, j] = normal_rng(theta[j], tau[j]);
    }
  }

  real x[N];
  real y[N];

  {
    matrix[2, 2] Sigma = [
      [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
      [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
    ];

    vector[2] xy;

    for (i in 1:N) {
      xy = multi_normal_rng(theta_1[index_1[i]], Sigma);
      x[i] = xy[1];
      y[i] = xy[2];
    }
  }
}

To use it in SBC, we need to compile it.

[24]:
with bebi103.stan.disable_logging():
    sm_prior_pred = cmdstanpy.CmdStanModel(
        stan_file="sample_model_prior_predictive.stan"
    )
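Before running the full SBC, we could sanity-check the prior predictive program by generating a single data set with CmdStanPy’s fixed-parameter sampler. This is a sketch; the program only needs the N, J_1, and index_1 entries of the data dictionary.

with bebi103.stan.disable_logging():
    prior_pred = sm_prior_pred.sample(
        data={key: data[key] for key in ["N", "J_1", "index_1"]},
        fixed_param=True,  # no parameters block, so no MCMC needed
        chains=1,
        iter_sampling=1,
        seed=3252,
    )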

Running an SBC

Finally, we are ready to perform SBC using the bebi103.stan.sbc() function. We need to tell the function

  1. What Stan model to use for the prior predictive model.

  2. Which Stan model to use for posterior sampling.

  3. Which data dictionary to feed into the prior predictive model.

  4. Which data dictionary to feed into the posterior sampling model.

  5. What the names of the measured data are. For multidimensional data, do not include indexing. In the present case, the measured data are ['x', 'y'].

  6. The names of the posterior predictive variables. These are not used in the shrinkage, z-score, and rank statistic calculations.

  7. The name of the log likelihood variable. It is not used in the shrinkage, z-score, and rank statistic calculations.

  8. The data types of the measured data.

  9. Any kwargs to use in posterior sampling.

  10. Any kwargs to pass to bebi103.stan.check_all_diagnostics().

  11. How many cores you want to use for the calculation.

  12. How many SBC calculations you want to do. You should do at least 400; 1000 is a good number.

SBC calculations are lengthy, so to keep the runtime of this documentation manageable, we will load a pre-computed SBC result if it is available. For reference, this SBC calculation took about six hours on 16 cores on an AWS c5 instance.

[25]:
try:
    df_sbc = pd.read_csv('sbc_results.csv')
except FileNotFoundError:
    df_sbc = bebi103.stan.sbc(
        prior_predictive_model=sm_prior_pred,
        posterior_model=sm_nc,
        prior_predictive_model_data=data,
        posterior_model_data=data,
        measured_data=['x', 'y'],
        posterior_predictive_var_names=['x_ppc', 'y_ppc'],
        log_likelihood_var_name='log_lik',
        measured_data_dtypes={'x': float, 'y': float},
        sampling_kwargs={'iter_warmup': 2000, 'adapt_delta': 0.95},
        diagnostic_check_kwargs={'omit': ['Tau.*', 'Sigma.*']},
        cores=16,
        N=1000,
        progress_bar=True,
    )

    df_sbc.to_csv('sbc_results.csv', index=False)

The output of the SBC calculation is a tidy data frame with the results for each of the 1000 trials.

[26]:
df_sbc
[26]:
ground_truth rank_statistic mean sd shrinkage z_score Rhat ESS ESS_per_iter tail_ESS tail_ESS_per_iter n_divergences n_bad_ebfmi n_max_treedepth warning_code L trial parameter
0 4.281950 3712 0.403373 2.407448 0.770475 -1.611074 1.002368 1341.460058 0.335365 1303.921933 0.325980 15 0 60 12 4000 0 theta[0]
1 4.974810 313 11.482235 4.101083 0.333940 1.586758 1.001557 1387.925463 0.346981 2224.981728 0.556245 2 0 234 12 4000 1 theta[0]
2 -0.356069 235 6.575907 4.445932 0.217216 1.559173 1.002113 1879.793973 0.469948 2239.299188 0.559825 0 0 0 0 4000 2 theta[0]
3 5.071360 1929 5.121302 1.737149 0.880494 0.028750 1.002705 1730.319116 0.432580 1575.431994 0.393858 1 0 0 4 4000 3 theta[0]
4 4.264670 1218 5.950835 3.416150 0.537842 0.493586 1.000094 1684.749750 0.421187 2169.509041 0.542377 0 0 0 0 4000 4 theta[0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12995 3.222330 1162 4.178241 1.764406 0.974475 0.541775 1.000539 3132.297533 0.783074 3482.163558 0.870541 0 0 0 0 4000 995 theta_1[2,1]
12996 -7.107200 631 -5.619239 1.471871 0.982237 1.010932 1.000755 4177.643205 1.044411 3496.129529 0.874032 0 0 0 0 4000 996 theta_1[2,1]
12997 9.484280 763 10.892740 1.644157 0.977836 0.856645 1.000262 3765.095120 0.941274 2540.140071 0.635035 0 0 0 0 4000 997 theta_1[2,1]
12998 -6.195420 1290 -4.456004 4.444606 0.838030 0.391354 1.001032 3340.526088 0.835132 3040.492446 0.760123 0 0 0 0 4000 998 theta_1[2,1]
12999 -1.098650 3921 -6.015190 2.318488 0.955926 -2.120581 1.000614 4789.118749 1.197280 3234.499689 0.808625 0 0 0 0 4000 999 theta_1[2,1]

13000 rows × 18 columns

For each parameter for each trial, the ground truth is given (as drawn from the prior), as well as a rank statistic, mean, standard deviation, shrinkage, and z-score. MCMC diagnostics are also reported.

Interpreting results

We will not go into much detail about interpreting results here, directing you instead to the Talts, et al. paper. We will instead do a quick run through some of the results and suggested visualization procedures, highlighting some of the useful functions in bebi103 along the way.

We can first check what diagnostic warnings we encountered. There are thirteen parameters, so we can count the warning codes and divide by 13 to see how many MCMC calculations gave each warning code.

[27]:
df_sbc.groupby("warning_code").size() // 13
[27]:
warning_code
0     530
2       3
4     277
5      12
6       4
7      13
8      62
10      1
11     18
12     66
13      1
14      3
15     10
dtype: int64

Most of the SBC trials either had no diagnostic warnings (warning code 0) or only divergences (warning code 4). To help interpret the rest of the warning codes, we can use the bebi103.stan.parse_warning_code() function. For example, the next most common warning code was 12.

[28]:
bebi103.stan.parse_warning_code(12)
divergence warning
treedepth warning

This means that we had tree depth and divergence warnings. Altogether, the diagnostics tell us there are some problems with our sampling as we have it set up. We should use more iterations, increase adapt_delta, and also increase max_treedepth in doing the calculations, as sketched below.
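In a rerun of the SBC, such adjustments would be passed through to the sampler via the sampling_kwargs argument of bebi103.stan.sbc(). A hypothetical stricter configuration might look like this; the specific values here are illustrative, not prescriptive.

sampling_kwargs = {
    "iter_warmup": 4000,
    "iter_sampling": 2000,
    "adapt_delta": 0.99,
    "max_treedepth": 12,
}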

We should also check the z-score and shrinkage. We will plot the z-score vs. shrinkage for each parameter, coloring the glyphs by the warning code. This coloring enables us to see if any of the MCMC diagnostic issues might be affecting shrinkage and z-scores.

[29]:
points = hv.Points(
    data=df_sbc,
    kdims=["shrinkage", ("z_score", "z-score")],
    vdims=[("parameter", "param"), "warning_code"],
).groupby(
    "parameter"
).opts(
    color="warning_code",
    frame_height=125,
    frame_width=125,
    size=2,
    tools=["hover"],
).layout(
)

bokeh.io.show(hv.render(points))

The z-scores are good for all parameters (between –5 and 5), and the shrinkage is close to one for most parameters, but the shrinkage is poor for the hyperparameters. This does not appear to depend on MCMC diagnostic issues. Rather, it tells us that with the size of data sets we are considering, we cannot learn much about the hyperparameters.

Finally, as suggested by Talts, et al., we can make an SBC rank ECDF plot using the bebi103.viz.sbc_rank_ecdf() function.

[30]:
p = bebi103.viz.sbc_rank_ecdf(
    df_sbc, show_legend=True, marker_kwargs={"size": 2},
)
p.legend.spacing = 0
p.legend.label_text_font_size = '8pt'

bokeh.io.show(p)

An ECDF difference of the rank statistic that falls within the theoretical 99% envelope is indicative of good calibration of the sampler and a lack of bias. The rank statistic of the parameter sigma[1] is biased low, suggesting that samples of sigma[1] will tend to be biased toward larger values.

Cleaning up

CmdStanPy leaves behind .hpp, .o, and .d files, executables, and sampling output files. To conveniently remove them, use bebi103.stan.clean_cmdstan().

[31]:
bebi103.stan.clean_cmdstan()