Stan (MCMC) utilities


[1]:
import re

import numpy as np
import polars as pl

import cmdstanpy
import arviz as az

import bebi103

import holoviews as hv
hv.extension('bokeh')

bebi103.hv.set_defaults()

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

The bebi103 package contains a set of convenience functions for use with Stan. Much of the functionality is included in ArviZ, and indeed the bebi103.stan module depends on ArviZ and extensively uses its InferenceData data type. Furthermore, the bebi103 package offers visualization functions for MCMC results which we will introduce as we go through a Bayesian workflow example below. Again, much of this functionality is present in ArviZ, but with modifications. Specifically, at least in my experience, there are some problems with the Bokeh backend with ArviZ.

Additionally, the bebi103 package has functions that may be included in functions block for Stan programs.

Model and sample data

To demonstrate the usage of the Stan utilities of bebi103, we will consider the following hierarchical generative model.

\begin{align} &\theta_i \sim \text{Norm}(5, 5),\;\;i\in\{1, 2\},\\[1em] &\tau_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em] &\sigma_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em] &\rho \sim \text{Uniform}(-1, 1),\\[1em] &\mathsf{T} = \begin{pmatrix}\tau_1^2 & 0 \\ 0 & \tau_2^2\end{pmatrix}, \\[1em] &\mathsf{\Sigma} = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}, \\[1em] &\theta_{1, i} \sim \text{Norm}(\theta, \mathsf{T}) \;\forall\,i\in\{1, 2, 3\},\\[1em] & \begin{pmatrix} x_{i, j} \\ y_{i,j} \end{pmatrix} \sim \text{Norm}(\theta_{1,i}, \mathsf{\Sigma})\;\forall\,i \in \{1, 2, 3\}, \; j \in \{1, \ldots, n_i \}. \end{align}

This is a hierarchical model in which the three vector-valued parameters \(\theta_{1,1}\), \(\theta_{1,2}\), and \(\theta_{1,3}\) are conditioned on the vector-valued hyperparameter \(\theta\). We will use a fabricated data set generated from this model with parameters described in the Using the user guide section. It is useful to understand the data set, so let’s look at the data frame.

[2]:
df = pl.read_csv('sample_data.csv')
df.head()
[2]:
shape: (5, 3)
xytrial
f64f64i64
1.76288611.9556161
4.36495711.1366331
3.45762612.3012671
-0.83931910.4018991
4.69460211.9253341

Each trial (there are three, numbered 1, 2, and 3) has an x measurment and a y measurement.

To begin our analysis, we will use the following Stan model, which directly follows from the above generative model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  //Index array to keep track of hierarchical structure
  array[N] int index_1;

  // The measurements
  array[N] real x;
  array[N] real y;
}


transformed data {
  // Data are two-dimensional, so store in a vector
  array[N] vector[2] xy;

  for (i in 1:N) {
    xy[i, 1] = x[i];
    xy[i, 2] = y[i];
  }
}


parameters {
  // Hyperparameters level 0
  vector[2] theta;

  // How hyperparameters vary
  vector<lower=0>[2] tau;

  // Parameters
  array[J_1] vector[2] theta_1;
  vector<lower=0>[2] sigma;
  real<lower=-1, upper=1> rho;
}


transformed parameters {
  // Covariance matrix for hyperparameters
  matrix[2, 2] Tau = [
    [tau[1]^2,  0       ],
    [0,         tau[2]^2]
  ];

  // Covariance matrix for likelihood
  matrix[2, 2] Sigma = [
    [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
    [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
  ];
}


model {
  // Hyperpriors
  theta ~ normal(5, 5);
  tau ~ normal(0, 10);

  // Priors
  theta_1 ~ multi_normal(theta, Tau);
  sigma ~ normal(0, 10);
  rho ~ uniform(-1, 1);

  // Likelihood
  for (i in 1:N) {
    xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
  }
}


generated quantities {
  array[N] real x_ppc;
  array[N] real y_ppc;
  array[N] real log_lik;

  {
    vector[2] xy_ppc;

    for (i in 1:N) {
      xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
      log_lik[i] = multi_normal_lpdf(xy_ppc | theta_1[index_1[i]], Sigma);
      x_ppc[i] = xy_ppc[1];
      y_ppc[i] = xy_ppc[2];
    }
  }
}

A few important notes about the Stan file.

  1. The array index_1 keeps track of the dependence of the data on each parameter, which are themselves conditioned on hyperparameters.

  2. Posterior predictive checks and the log-likelihood function are calculated in the generated quantities block.

A first pass through a simple workflow

We will now proceed through a MCMC workflow, doing the following steps:

  1. Compile the Stan model.

  2. Convert the data frame into input data for the Stan model.

  3. Convert the samples into an ArviZ InferenceData instance.

  4. Check the sampling diagnostics.

  5. Create visualizations of the samples.

Compiling the Stan model

Starting with the first step making a compiled Stan model that we can use for sampling, we will use CmdStanPy. Although the bebi103 package supports PyStan, CmdStanPy is preferred because of the simplicity of its interface and shorter model compilation time.

If we wish to suppress messages to the screen, we can use bebi103.stan.disable_logging() with context management.

[3]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="sample_model.stan")

Preparing hierarchical data for Stan

Now that we have a compiled Stan model, we need to supply it data. Referring to the data block in the Stan code above, for this hierarchical model, we need to specify how many data points, what the values of x and y are, the number of different parameters (J_1), and a set of indices the give which of the three \(\theta_1\) values upon which each data point is conditioned. This information is already contained in the tidy data frame, df. The 'x' and 'y' columns contain the data, and the 'trial' column determines the level in the hierarchy. The value of J_1 is given by the number of unique entries in the 'trial' column, and the number of data points is given by the total number of rows in the data frame.

The bebi103.stan.df_to_datadict_hier() function converts a tidy data frame into a dictionary of data that can be passed to a Stan program for a hierarchical model. Furthermore, since the 'trial' column could have strings or other entries in them, these need to be converted to integers for use in Stan. A column is added to Stan with a suffix _stan to give the values of that column that are used in the Stan program. This is for reference when analyzing the results.

[4]:
# df may be either Polars or Pandas data frame
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols="trial", data_cols=["x", "y"]
)

# Take a look at the updated data frame
df.head()
[4]:
shape: (5, 4)
xytrialtrial_stan
f64f64i64i64
1.76288611.95561611
4.24537412.89690211
2.4953586.27825311
4.8474368.90154611
5.58513411.77268811

The data frame has been appropriately updated (the trials were numbers, so there is no difference). Let’s look at the data dictionary.

[5]:
data
[5]:
{'N': 63,
 'J_1': 3,
 'index_1': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
 'x': array([ 1.76288583,  4.24537447,  2.49535766,  4.84743608,  5.58513412,
         2.46538959,  2.53372498,  3.03556003,  4.1428636 ,  3.37291658,
         1.80757816,  3.86468666,  2.0406895 ,  6.10018145,  3.70259613,
         4.6946018 , -0.83931887,  3.45762554,  4.36495728,  9.02739122,
        -0.90483667,  0.63091768,  2.7511249 ,  0.84468249, -1.40832682,
         0.97071434,  1.65312163,  4.88902665,  4.74467292,  2.00083667,
         1.5374102 ,  5.36558851,  6.03757711,  1.71903739,  1.56783325,
         0.25157748,  6.76167578, -0.67226419,  3.36289739,  2.17013057,
         4.03836253,  2.72486749,  3.65771395,  4.59028926,  1.40245352,
         4.34252071,  4.1146886 ,  0.79938752,  0.72094663,  3.19538556,
         0.58176677, -0.01108818, -0.10906546,  2.37188306,  4.88494459,
        -0.35811241, -0.85531719,  2.62235591,  0.71487182,  1.38188117,
         1.82837862,  0.16000891,  4.81184826]),
 'y': array([11.95561574, 12.89690196,  6.27825302,  8.90154585, 11.77268772,
         8.9978501 ,  6.16475102, 10.64184364, 10.74399897,  8.71908946,
         9.63231343, 13.00236894,  9.47857277, 10.97263645, 11.46520631,
        11.92533427, 10.40189925, 12.30126673, 11.13663316, 16.78665286,
         4.11289192,  5.32907863,  7.82030839,  3.75238271,  3.89596763,
         1.37792381,  6.55489775,  9.7609403 , 10.03251758,  6.26665374,
         5.62577914,  5.84795532,  8.14624991,  9.12145582,  5.94521903,
         4.34204882, 12.30037908, -0.52873034,  4.3659075 ,  9.05879288,
         8.84628724,  9.95807796,  5.07513802, 10.37096513,  3.73211647,
         5.8037337 ,  5.02628689,  3.61882617,  6.98712067, 11.21360732,
         4.31478889,  4.40895773,  4.73065621,  8.95402277,  8.31905359,
         7.80920144,  5.92093593,  4.84087256,  3.43491574,  5.78404006,
         8.04955064,  5.29564007,  8.64743679])}

The dictionary has all of the entries necessary to pass into the Stan program while sampling.

Sampling and conversion to ArviZ InferenceData

We can now go about our sampling and get the results. We will again suppress logging. Note that this is in general not a good idea. I do it here to manage space in the documentation.

[6]:
with bebi103.stan.disable_logging():
    samples = sm.sample(
        data=data, chains=4, iter_sampling=1000, adapt_delta=0.95, seed=3252
    )

The bebi103 package works with the excellent ArviZ InferenceData data type. ArviZ can be used to convert samples from Stan, PyMC3, emcee, PyTorch, and many others into a well-defined universal data format.

[7]:
samples = az.from_cmdstanpy(
    samples, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)

Converting InferenceData to a DataFrame

Now that the samples are stored as InferenceData, we can go about using them. Before we do that, we may want to convert the posterior samples into a data frame to do our own analysis.

[8]:
# Default is polars; for pandas use df_package='pandas' kwarg
df_samples = bebi103.stan.arviz_to_dataframe(samples)

# Take a look
df_samples.head()
[8]:
shape: (5, 24)
theta[0]theta[1]tau[0]tau[1]theta_1[0,0]theta_1[0,1]theta_1[1,0]theta_1[1,1]theta_1[2,0]theta_1[2,1]sigma[0]sigma[1]rhoTau[0,0]Tau[0,1]Tau[1,0]Tau[1,1]Sigma[0,0]Sigma[0,1]Sigma[1,0]Sigma[1,1]chain__draw__diverging__
f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64i64i64bool
3.340099.380060.7995494.856023.6098310.7023.021436.948971.765245.894972.051772.577420.6570980.6392790.00.023.58094.209773.474923.474926.6430800false
2.817916.502041.374778.000512.759729.183621.884415.983571.654145.751712.200512.658380.6696811.889990.00.064.00814.842243.917493.917497.0669701false
2.810328.155540.5346810.9407673.4422610.90693.028847.191832.501897.687381.894212.438110.4989890.2858830.00.00.8850433.588042.304482.304485.9443702false
2.584567.960210.75088710.79753.4237110.40272.416486.039161.393616.216462.2452.750910.7317620.5638310.00.0116.5875.040014.519214.519217.5675103false
2.532932.106781.985514.333543.0175410.02752.452556.858911.930535.936231.861822.578620.4055753.942260.00.018.77963.466371.947131.947136.6492704false

Variables/parameters naming convention

Note how multdimensional variables are named in the above data frame. They are zero-indexed, and use brackets with different axes of the index separated by a comma with no spaces. Note also that the last three columns are chain__, draw__, and diverging__, which specify which chain the sample comes from, which of the successive draws it was, and whether or not it had a divergence, respectively. This nomenclature is used throughout the bebi103 MCMC parsing functions.

Some functions in the bebi103.stan and bebi103.viz modules use parameters as an argument. In those cases, the entries of multidimensional properties are expected to be entered one-by-one. Other functions use var_names as an argument. For multidimensional parameters, only the base name (with no indexing) is expected. This var_names format is congruent with that of ArviZ. As an example, consider a scalar parameter a, a vector parameter b with three entries, and a 2×2 matrix parameter c. The following shows how the respective keyword arguments should be used.

  • parameters=['a', 'b[0]', 'b[1]', 'b[2]', 'c[0,0]', 'c[0,1]', 'c[1,0]', 'c[1,1]']

  • var_names=['a', 'b', 'c']

Using parameters allows omission of some elements of multidimensional parameters.

Checking diagnostics

We may check the following diagnostic features:

  1. The effective sample size (ESS).

  2. The rank-normalized Gelman-Rubin statistic, Rhat (\(\hat{R}\)).

  3. Divergences.

  4. Tree depth.

  5. Energy-Bayes fraction of missing information (E-BFMI)

The functions to check these diagnostics are based on some of the functionality in ArviZ and from the work of Michael Betancourt.

We can start by checking the effective sample size. The bebi103.stan.check_ess() function checks for both the effective sample size and the “tail” effective sample size, as defined in Vehtari, et al., 2019.

[9]:
bebi103.stan.check_ess(samples)
Effective sample size looks reasonable for all parameters.
[9]:
True

The function returned True, indicating that all variables have an effective samples size and tail effective sample size greater than the rule of thumb of 100 per chain. We can adjust the rule of thumb, if we like. If use the total_ess_rule_of_thumb=400 kwarg, then any parameter with fewer than 1600 effective samples (since we had four chains) is flagged a failure of the ESS check.

[10]:
bebi103.stan.check_ess(samples, total_ess_rule_of_thumb=400)
tail-ESS for parameter theta[0] is 1247.7325645487529.
ESS for parameter tau[0] is 770.5256373544728.
tail-ESS for parameter tau[0] is 570.1635775815961.
ESS for parameter theta_1[0,0] is 1330.7936577374564.
tail-ESS for parameter theta_1[0,0] is 1275.3828600137535.
ESS for parameter Tau[0,0] is 770.5275684975434.
tail-ESS for parameter Tau[0,0] is 570.1635775815961.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.
[10]:
False

The function now returns False, and also prints diagnostic information about the parameters for which the test failed.

Now let’s turn to Rhat. The Rhat value is a rank-normalized, folded Rhat, as described in Vehtari, et al., 2019.

[11]:
bebi103.stan.check_rhat(samples)
Rhat for parameter Tau[0,1] is NaN.
Rhat for parameter Tau[1,0] is NaN.
[11]:
False

The function returned False, meaning the that Rhat check failed. It informed us that two of the Rhat values were NaN, which causes a failure of the check. It is often the case that variables in the generated parameters block have NaN Rhat because they are not in any way manipulated by the sampler. In this case, the off-diagonal terms in the matrix Tau are both always zero. We can omit these using the omit kwarg.

[12]:
bebi103.stan.check_rhat(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Rhat looks reasonable for all parameters.
[12]:
True

Next, we can check for divergences.

[13]:
bebi103.stan.check_divergences(samples)
96 of 4000 (2.4%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
[13]:
False

We failed the divergence test because we had many divergences. We will deal with these momentarily.

We can also check if the sampler was hitting the maximum tree depth. By default, the tree depth used in a sampling call by CmdStanPy is 10, which is the default max_treedepth kwarg value for bebi103.stan.check_treedepth(). You need to set this kwarg to match whatever tree depth used in the sampling. In our case, we can use the default.

[14]:
bebi103.stan.check_treedepth(samples)
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
[14]:
True

We have passed this test.

Finally, we can check the E-BFMI.

[15]:
bebi103.stan.check_energy(samples)
E-BFMI indicated no pathological behavior.
[15]:
True

We have also passed this test.

For convenience, we can perform all tests at once using the bebi103.stan.check_all_diagnostics() function.

[16]:
bebi103.stan.check_all_diagnostics(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

96 of 4000 (2.4%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.
[16]:
4

The return value of this function is a number that, when converted to binary, each digit in the code stands for whether or not a test passed. A digit of zero indicates the test passed. The ordering of the tests goes:

  • ess

  • r_hat

  • divergences

  • tree depth

  • E-BFMI

For example, a warning code of 12 has a binary representation of 01100, which means that R-hat and divergences tests failed. Our warning code of 4 has a binary representation of 00100, which means that the divergence test failed.

Visualizing results

The bebi103 package allows for three visualizations of MCMC results.

  • Trace plots: bebi103.viz.trace()

  • Parallel coordinate plots: bebi103.viz.parcoord()

  • Corner plot: bebi103.viz.corner()

Each of these takes as their first six arguments:

  • samples: ArviZ InferenceData instance of containing the samples. (Note that corner() also accepts other data type for samples, such as in display of bootstrap replicates.)

  • parameters: List of the names of variables to include in the plot.

  • palette: List of colors to use.

  • omit: List of variable names or regular expression patterns for variables to omit in the plot.

  • include_ppc: If True, include variables ending in _ppc, which denotes posterior predictive checks, in the plot. Default is False.

  • include_log_lik: If True, include variables starting with log_lik or loglik. These denote log-likelihood contributions. Default is False.

samples is the only required argument.

Trace plots

We will begin with a trace plot of the hyperparameters θ and τ.

[17]:
bokeh.io.show(
    bebi103.viz.trace(samples, parameters=["theta[0]", "theta[1]", "tau[0]", "tau[1]"])
)

The traces are color-coded by chain.

Parallel coordinate plots

For a parallel coordinate plot, we want to omit all four entries of the matrices Tau and Sigma because the entries are either zero or determined uniquely by other parameters. We can specify regular expressions in the omit list to conveniently omit all of the entries in these matrix parameters.

[18]:
bokeh.io.show(
    bebi103.viz.parcoord(samples, omit=[re.compile("Tau.*"), re.compile("Sigma.*")])
)

In the plot, samples that did not result in a divergence are semi-transparent gray by default. Samples that had a divergence are colored in orange and shown with thicker lines. Because the scale of respective parameters may be different, it is often easer to rescale each parameter for the plot. We can do this with the transformation kwarg, and setting it to 'minmax' sets the minimum of each parameter value to zero and maximum to one.

[19]:
bokeh.io.show(
    bebi103.viz.parcoord(
        samples,
        omit=[re.compile("Tau.*"), re.compile("Sigma.*")],
        transformation="minmax",
    )
)

This plot is more revealing. We see (especially if we zoom in) that most of the divergences are going through small tau[0]. This is indicative of the “funnel” behavior often encountered in hierarchical models.

Corner plots

Corner plots display scatter plots of samples from all pairs of variables, as well as histograms or ECDFs for individual variables. We can make a corner plot for the hyperparameters and also for ρ.

[20]:
bokeh.io.show(
    bebi103.viz.corner(
        samples, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
    )
)

These are very useful plots for interpreting a posterior, and also for diagnosing problems with sampling. The divergent samples appear in orange. The clustering of divergences for small tau[0] is clear.

The diagonal can also be plotted with ECDFs and contours may also be added to the off-diagonal plots.

[21]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"],
        plot_ecdf=True,
        show_contours=True,
    )
)

Simulation-based calibration (SBC)

In a principled Bayesian workflow, simulation-based calibration (SBC) is a useful tool to identify pathologies in a generative model and in the sampler’s ability to draw samples from it. The procedure is laid out in Talts, et al., 2018, and is described in the Stan manual.

Briefly, the procedure is:

  1. Draw a parameter set θ out of the prior.

  2. Use θ to draw a data set y out of the likelihood.

  3. Perform MCMC sampling of the posterior using y as if it were the actual measured data set. Draw L MCMC samples of the parameters.

  4. Do steps 1-3 N times (hundreds to a thousand is usually a good number).

For each of these calculations, you should run diagnostics to make sure the effective sample size, Rhat, etc., are within acceptable ranges. You can also compute z-scores (a measure of how well the MCMC samples reveal the ground truth), shrinkage (how informative the data can be), and rank statistics (used to check sampler performance). Please see the Talts, et al. paper for details on these quantities and their interpretation. Our focus here is how to use the bebi103.stan.sbc() function to perform the analysis and compute these quantities.

Example model

We have already seen in the example so far that we have divergences and the sampler does not seem to be able to sample small values of tau[0]. We will therefore write a new Stan program where we have noncentered parameters. The Stan model is:

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  //Index array to keep track of hierarchical structure
  array[N] int index_1;

  // The measurements
  array[N] real x;
  array[N] real y;
}


transformed data {
  // Data are two-dimensional, so store in a vector
  array[N] vector[2] xy;
  for (i in 1:N) {
    xy[i, 1] = x[i];
    xy[i, 2] = y[i];
  }
}


parameters {
  // Hyperparameters level 0
  vector[2] theta;

  // How hyperparameters vary
  vector<lower=0>[2] tau;

  // Parameters
  vector<lower=0>[2] sigma;
  real<lower=-1, upper=1> rho;

  // Noncentered parameters
  array[J_1] vector[2] theta_1_noncentered;
}


transformed parameters {
  // Covariance matrix for likelihood
  matrix[2, 2] Sigma = [
    [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
    [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
  ];

  // Center parameters
  array[J_1] vector[2] theta_1;
  for (i in 1:J_1) {
    theta_1[i] = theta + tau .* theta_1_noncentered[i];
  }
}


model {
  // Hyperpriors
  theta ~ normal(5, 5);
  tau ~ normal(0, 10);

  // Priors
  theta_1_noncentered ~ multi_normal([0, 0], [[1, 0], [0, 1]]);
  sigma ~ normal(0, 10);
  rho ~ uniform(-1, 1);

  // Likelihood
  for (i in 1:N) {
    xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
  }
}


generated quantities {
  array[N] real x_ppc;
  array[N] real y_ppc;
  array[N] real log_lik;

  {
    vector[2] xy_ppc;

    for (i in 1:N) {
      xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
      log_lik[i] = multi_normal_lpdf(xy_ppc | theta_1[index_1[i]], Sigma);
      x_ppc[i] = xy_ppc[1];
      y_ppc[i] = xy_ppc[2];
    }
  }
}

Let’s go ahead and compile the model and draw some samples just to see if it alleviated the problems with divergences. We will use extra warmup to make sure the chains are achieving stationarity.

[22]:
with bebi103.stan.disable_logging():
    sm_nc = cmdstanpy.CmdStanModel(stan_file="sample_model_noncentered.stan")

    samples_nc = sm_nc.sample(
        data=data,
        chains=4,
        iter_warmup=2000,
        iter_sampling=1000,
        adapt_delta=0.95,
        seed=3252,
    )

samples_nc = az.from_cmdstanpy(
    samples_nc, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)

And we can check a corner plot to check for divergences.

[23]:
bokeh.io.show(
    bebi103.viz.corner(
        samples_nc, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
    )
)

We still got some divergences, but only a few, which could be false positives for divergences. We will go forward with SBC using this model.

Building an SBC

To build a simulation-based calibration, we need to write a Stan program to generate data sets based on parameters drawn out of the prior, as prior predictive model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  //Index array to keep track of hierarchical structure
  array[N] int index_1;
}


generated quantities {
  vector[2] theta;
  vector[2] tau;
  vector[2] sigma;

  for (i in 1:2) {
    theta[i] = normal_rng(5, 5);
    tau[i] = abs(normal_rng(0, 10));
    sigma[i] = abs(normal_rng(0, 10));
  }

  real rho = uniform_rng(-1, 1);

  array[J_1] vector[2] theta_1;
  for (i in 1:J_1) {
    for (j in 1:2) {
      theta_1[i, j] = normal_rng(theta[j], tau[j]);
    }
  }

  array[N] real x;
  array[N] real y;

  {
    matrix[2, 2] Sigma = [
      [sigma[1]^2,                 rho * sigma[1] * sigma[2]],
      [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
    ];

    vector[2] xy;

    for (i in 1:N) {
      xy = multi_normal_rng(theta_1[index_1[i]], Sigma);
      x[i] = xy[1];
      y[i] = xy[2];
    }
  }
}

To use it in SBC, we need to compile it.

[24]:
with bebi103.stan.disable_logging():
    sm_prior_pred = cmdstanpy.CmdStanModel(
        stan_file="sample_model_prior_predictive.stan"
    )

Running an SBC

Finally, we are ready to perform SBC using the bebi103.stan.sbc() function. We need to tell the function

  1. What Stan model to use for the prior predictive model.

  2. Which Stan model to use for posterior sampling.

  3. Which data dictionary to feed into the prior predictive model.

  4. Which data dictionary to feed into the posterior sampling mode.

  5. What the names of the measured data are. For multidimensional data, do not include indexing. In the present case, the measured data are ['x', 'y'].

  6. The names of the posterior predictive variables. There are not used in shrinkage, z-score, and rank statistic calculations.

  7. The name of the log likelihood variable. There are not used in shrinkage, z-score, and rank statistic calculations.

  8. The data types of the measured data.

  9. Any kwargs to use in posterior sampling.

  10. Any kwargs to pass to bebi103.stan.check_all_diagnostics().

  11. How many cores you want to use for the calculation.

  12. How many SBC calculations you want to do. You should do at least 400; 1000 is a good number.

SBC calculations are lengthy, so to keep the runtime of this documentation manageable, we will load a pre-computed SBC result if it is available. For reference, this SBC calculation took about six hours on 16 cores on an AWS c5 instance.

[25]:
try:
    df_sbc = pl.read_csv('sbc_results.csv')
except:
    df_sbc = bebi103.stan.sbc(
        prior_predictive_model=sm_prior_pred,
        posterior_model=sm_nc,
        prior_predictive_model_data=data,
        posterior_model_data=data,
        measured_data=['x', 'y'],
        posterior_predictive_var_names=['x_ppc', 'y_ppc'],
        log_likelihood_var_name='log_lik',
        measured_data_dtypes={'x': float, 'y': float},
        sampling_kwargs={'iter_warmup': 2000, 'adapt_delta': 0.95},
        diagnostic_check_kwargs={'omit': ['Tau.*', 'Sigma.*']},
        cores=16,
        N=1000,
        progress_bar=True,
    )

    df_sbc.to_csv('sbc_results.csv', index=False)

The output of the SBC calculation is a tidy data frame with the results for each of the 1000 trials.

[26]:
df_sbc
[26]:
shape: (13_000, 18)
ground_truthrank_statisticmeansdshrinkagez_scoreRhatESSESS_per_itertail_ESStail_ESS_per_itern_divergencesn_bad_ebfmin_max_treedepthwarning_codeLtrialparameter
f64i64f64f64f64f64f64f64f64f64f64i64i64i64i64i64i64str
4.2819537120.4033732.4074480.770475-1.6110741.0023681341.4600580.3353651303.9219330.32598150601240000"theta[0]"
4.9748131311.4822354.1010830.333941.5867581.0015571387.9254630.3469812224.9817280.556245202341240001"theta[0]"
-0.3560692356.5759074.4459320.2172161.5591731.0021131879.7939730.4699482239.2991880.559825000040002"theta[0]"
5.0713619295.1213021.7371490.8804940.028751.0027051730.3191160.432581575.4319940.393858100440003"theta[0]"
4.2646712185.9508353.416150.5378420.4935861.0000941684.749750.4211872169.5090410.542377000040004"theta[0]"
3.2223311624.1782411.7644060.9744750.5417751.0005393132.2975330.7830743482.1635580.87054100004000995"theta_1[2,1]"
-7.1072631-5.6192391.4718710.9822371.0109321.0007554177.6432051.0444113496.1295290.87403200004000996"theta_1[2,1]"
9.4842876310.892741.6441570.9778360.8566451.0002623765.095120.9412742540.1400710.63503500004000997"theta_1[2,1]"
-6.195421290-4.4560044.4446060.838030.3913541.0010323340.5260880.8351323040.4924460.76012300004000998"theta_1[2,1]"
-1.098653921-6.015192.3184880.955926-2.1205811.0006144789.1187491.197283234.4996890.80862500004000999"theta_1[2,1]"

For each parameter for each trial, the ground truth is given (as drawn from the prior), as well as a rank statistic, mean, standard deviation, shrinkage, and z-score. MCMC diagnostics are also reported.

Interpreting results

We will not go into much detail about interpreting results here, directing you instead to the Talts, et al. paper. We will instead do a quick run through some of the results and suggested visualization procedures, highlighting some of the useful functions in bebi103 along the way.

We can first check what diagnostic warnings we encountered. There are thirteen parameters, so we can count the warning codes and divide by 13 to see how many MCMC calculations had problems.

[27]:
df_sbc.group_by("warning_code").agg(pl.len() // 13).sort(by='warning_code')
[27]:
shape: (13, 2)
warning_codelen
i64u32
0530
23
4277
512
64
1118
1266
131
143
1510

Most of the samples either had no diagnostic warnings (warning code 0) or only divergences (warning code 4). To help interpret the rest of the warning codes, we can use the bebi103.stan.parse_warning_code() function. For example, the next most common warning code was 12.

[28]:
bebi103.stan.parse_warning_code(12)
divergence warning
treedepth warning

This means that we had tree depth and divergence warnings. All together, the diagnostics tell us there are some problems with our sampling as we have it set up. We should take more iterations, increase adapt_delta, and also increase max_treedepth in doing the calculations.

We should also check the z-score and shrinkage. We will plot the z-score vs. shrinkage for each parameter, coloring the glyphs by the warning code. This coloring enables us to see if any of the MCMC diagnostic issues might be affecting shrinkage and z-scores.

[29]:
points = hv.Points(
    data=df_sbc.to_dict(),
    kdims=["shrinkage", ("z_score", "z-score")],
    vdims=[("parameter", "param"), "warning_code"],
).groupby(
    "parameter"
).opts(
    color="warning_code",
    frame_height=125,
    frame_width=125,
    size=2,
    tools=["hover"],
).layout(
)

# Show plot disabling logging to avoid Bokeh warnings
with bebi103.stan.disable_logging():
    bokeh.io.show(hv.render(points))

The shrinkage and z-scores are good for all parameters (close to one and between –5 and 5, respectively), but the shrinkage is poor for the hyperparameters. This does not appear to be dependent on MCMC diagnostic issues. Rather, this tells us that with the size of data sets we are considering, we cannot learn much about the hyperparameters.

Finally, as suggested by Talts, et al., we can make an SBC rank ECDF plot using the bebi103.viz.sbc_rank_ecdf().

[30]:
p = bebi103.viz.sbc_rank_ecdf(
    df_sbc, show_legend=True, marker_kwargs={"size": 2},
)
p.legend.spacing = 0
p.legend.label_text_font_size = '8pt'

bokeh.io.show(p)

The ECDF difference of the rank statistic falling within the theoretical 99-percentile envelope is indicative of good calibration of the sampler and lack of bias. The parameter sigma[1] has a low-biased rank statistic, suggesting that the samples of sigma[1] will tend to be biased toward larger values.

Using stanfunctions

The bebi103 package includes functions that can be included in the functions block of a Stan program. As an example, let us analyze data that are inverse Gaussian distributed, a distribution that is not natively included in Stan. The bebi103 package has functionality for sampling out of this distribution and also for using it in models.

The generative model we will use is as follows

\begin{align} \log_{10}\mu \sim \text{Norm}(-2, 2),\\[1em] \log_{10}\lambda \sim \text{Norm}(-2, 2),\\[1em] y_i \sim \text{InvGaussian}(\mu, \lambda)\;\forall i. \end{align}

First, we’ll generate the dat for some contrived parameters.

[31]:
rng = np.random.default_rng(seed=3252)

# Parameters
mu = 6.5
lam = 11.25

# Generated data
N = 300
y = rng.wald(mu, lam, size=N)

For our Stan model, we include the inv_gaussian.stanfunctions, which will be available when we build the model using cmdstanpy with the appropriate keyword argument for the include path for the Stan compiler, as we will do in a moment. The Stan code is below.

functions {
    #include inv_gaussian.stanfunctions
}


data {
    int<lower=0> N;
    array[N] real<lower=0> y;
}


parameters {
    real log10_mu;
    real log10_lambda;
}


transformed parameters {
    real mu = 10^log10_mu;
    real lambda = 10^log10_lambda;
}

model {
    log10_mu ~ normal(-2, 2);
    log10_lambda ~ normal(-2, 2);
    y ~ inv_gaussian(mu, lambda);
}


generated quantities {
    array[N] real y_pred;
    for (i in 1:N) {
        y_pred[i] = inv_gaussian_rng(mu, lambda);
    }
}

Now, to compile the model, we need to tell the Stan compiler where to find the inverse Gaussian Stan functions. We do that with the stanc_options keyword argument below.

[32]:
sm = cmdstanpy.CmdStanModel(
    stan_file="inv_gauss.stan",
    stanc_options={"include-paths": bebi103.stan.include_path()}
)
17:04:55 - cmdstanpy - INFO - compiling stan file /Users/bois/Dropbox/git/bebi103/doc/user_guide/inv_gauss.stan to exe file /Users/bois/Dropbox/git/bebi103/doc/user_guide/inv_gauss
17:04:58 - cmdstanpy - INFO - compiled model executable: /Users/bois/Dropbox/git/bebi103/doc/user_guide/inv_gauss

Let’s take some samples and see how we captured the parameters we used to generate the data.

[33]:
# Perform sampling
with bebi103.stan.disable_logging():
    samples = sm.sample(data=dict(y=y, N=N))

# Convert to ArviZ object
samples = az.from_cmdstanpy(posterior=samples, posterior_predictive="y_pred")

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples, parameters=["mu", "lambda"]))

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

The parameter \(\mu\) to generate the data was 6.5 and \(\lambda\) was 11.25, which are well within the credible region. We can also check to see how the model predicts the measurements using the predictive_ecdf() function of the bebi103.viz module.

[34]:
# Put data in right structure for predictive_ecdf()
y_pred = samples.posterior_predictive.y_pred.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "y_pred_dim_0")

bokeh.io.show(bebi103.viz.predictive_ecdf(y_pred, data=y, y_axis_label="y"))

In addition to functions enabling use of an inverse Gaussian distribution, the bebi103 package has Stan functions useful for working with Gaussian processes when the x-variables are one-dimensional. The contents of the gp_one_dimensional.stanfunctions file are below.

/*
  These functions are for GPs with univariate x. The functions
  gp_posterior_mstar and gp_posterior_sigmastar_cholesky are general.
  The remaining functions are useful for computing derivatives of
  GPs that use exponentiated quadratic (a.k.a. squared exponential or
  radial basis function), Matern 3/2, and Matern 5/2 kernels.
 */

/**
  * Return posterior m* for a model with a Normal likelihood and GP
  * prior for a given Cholesky decomposition, Ly, of the matrix Ky,
  * and K*.
  *
  * @param y y-values of data
  * @param Ly Cholesky decomposition of the matrix Ly
  * @param Kstar Matrix K*
  * @return Posterior m*
  */
vector gp_posterior_mstar(vector y, matrix Ly, matrix Kstar) {
  // Get sizes
  int N = size(y);
  int Nstar = cols(Kstar);

  // Compute xi = inv(Ky) . y, which is solution xi to Ky . xi = y.
  vector[N] z = mdivide_left_tri_low(Ly, y);
  vector[N] xi = mdivide_right_tri_low(z', Ly)';

  // Compute mean vector mstar
  vector[Nstar] mstar = Kstar' * xi;

  return mstar;
}


/**
  * Return Cholesky decomposition of posterior Σ* for a model with a
  * Normal likelihood and GP prior.
  *
  * @param y y-values of data
  * @param Ly Cholesky decomposition of the matrix Ly
  * @param Kstar Matrix K*
  * @param Kstarstar Matrix K**
  * @param delta Small value to add to the diagonal of Σ* to ensure
  *              numerical positive definiteness
  * @return Cholesky decomposition of posterior Σ*
  */
matrix gp_posterior_sigmastar_cholesky(
    vector y,
    matrix Ly,
    matrix Kstar,
    matrix Kstarstar,
    real delta) {

  // Get sizes
  int N = size(y);
  int Nstar = cols(Kstar);

  // Compute Xi = inv(Ky) . Kstar, which is the solution Xi to Ky . Xi = Kstar.
  matrix[N, Nstar] Z = mdivide_left_tri_low(Ly, Kstar);
  matrix[N, Nstar] Xi = mdivide_right_tri_low(Z', Ly)';

  // Compute Sigma_star (plus a small number of the diagonal to ensure pos. def.)
  matrix[Nstar, Nstar] Sigmastar = Kstarstar - Kstar' * Xi
                                  + diag_matrix(rep_vector(delta, Nstar));

  // Compute and return Cholesky decomposition
  matrix[Nstar, Nstar] Lstar = cholesky_decompose(Sigmastar);

  return Lstar;
}


  /**
   * Return exponentiated quadratic (a.k.a. squared exponential, or SE) kernel
   * differentiated by the first variable.
   *
   * @param x1 Value of first variable
   * @param x2 Value of second variable
   * @param alpha α hyperparameter of SE kernel
   * @param rho ρ hyperparameter of SE kernel
   * @return Partial first derivative of SE kernel with respect to the
   *         first variable
   */
  real d1_exp_quad_kernel(real x1, real x2, real alpha, real rho) {
     real x_diff = x1 - x2;

     return -(alpha / rho)^2 * x_diff * exp(-x_diff^2 / 2.0 / rho^2);
  }


/**
  * Return mixed second derivative of exponentiated quadratic
  * (a.k.a. squared exponential, or SE) kernel
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of SE kernel
  * @param rho ρ hyperparameter of SE kernel
  * @return Mixed second derivative of squared exponential (SE) kernel
  */
real d1_d2_exp_quad_kernel(real x1, real x2, real alpha, real rho) {
    real rho2 = rho^2;
    real term1 = x1 - x2 + rho;
    real term2 = x2 - x1 + rho;

    return (alpha / rho2)^2 * term1 * term2 * exp(-(x1 - x2)^2 / 2.0 / rho2);
}

  /**
   * Return Matern 3/2 kernel differentiated by the first
   * variable.
   *
   * @param x1 Value of first variable
   * @param x2 Value of second variable
   * @param alpha α hyperparameter of Matern kernel
   * @param rho ρ hyperparameter of Matern kernel
   * @return Partial first derivative of Matern 3/2 kernel with respect
   * to the first variable
   */
   real d1_matern32_kernel(real x1, real x2, real alpha, real rho) {
    real x_diff = x1 - x2;

    return -3.0 * (alpha / rho)^2 * x_diff * exp(-sqrt(3.0 * x_diff^2 / rho^2));
 }


 /**
  * Return mixed second derivative of Matern 3/2 kernel
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern kernel
  * @param rho ρ hyperparameter of Matern kernel
  * @return Mixed second derivative of Matern 3/2 kernel
  */
real d1_d2_matern32_kernel(real x1, real x2, real alpha, real rho) {
  real exp_arg = sqrt(3.0 * ((x1 - x2) / rho)^2);

  return 3.0 * (alpha / rho)^2 * (1.0 - exp_arg) * exp(-exp_arg);
}


/**
   * Return Matern 5/2 kernel differentiated by the first
   * variable.
   *
   * @param x1 Value of first variable
   * @param x2 Value of second variable
   * @param alpha α hyperparameter of Matern kernel
   * @param rho ρ hyperparameter of Matern kernel
   * @return Partial first derivative of Matern 5/2 kernel with respect
   * to the first variable
   */
   real d1_matern52_kernel(real x1, real x2, real alpha, real rho) {
    real x_diff = x1 - x2;
    real exp_arg = sqrt(5 * (x_diff / rho)^2);

    return -5.0 * (alpha / rho)^2 / 3.0 * x_diff * (1.0 + exp_arg) * exp(-exp_arg);
 }


 /**
  * Return mixed second derivative of Matern 5/2 kernel
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern kernel
  * @param rho ρ hyperparameter of Matern kernel
  * @return Mixed second derivative of Matern 5/2 kernel
  */
real d1_d2_matern52_kernel(real x1, real x2, real alpha, real rho) {
  real x_diff = x1 - x2;
  real x_diff2 = x_diff^2;
  real rho2 = rho^2;
  real exp_arg = sqrt(5 * x_diff^2 / rho2);

  return 5.0 * (alpha / rho2)^2 / 3.0 * ((1.0 + exp_arg) * rho2 - 5.0 * x_diff2) * exp(-exp_arg);
}


/**
  * Return first derivative of the covariance matrix of the exponentiated quadratic
  * (a.k.a. squared exponential, or SE) kernel with respect to the first variable.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of SE kernel
  * @param rho ρ hyperparameter of SE kernel
  * @return First derivative of the covariance matrix of the squared
  * exponential kernel with respect to the first variable.
  */
matrix d1_cov_exp_quad(array[] real x1, array[] real x2, real alpha, real rho) {
  int m = size(x1);
  int n = size(x2);
  matrix[m, n] d1_K;

  for (i in 1:m) {
    for (j in 1:n) {
      d1_K[i, j] = d1_exp_quad_kernel(x1[i], x2[j], alpha, rho);
    }
  }

  return d1_K;
}


/**
  * Return mixed second derivative of the covariance matrix of the
  * exponentiated quadratic (a.k.a. squared exponential, or SE) kernel.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of SE kernel
  * @param rho ρ hyperparameter of SE kernel
  * @return Mixed second derivative of the covariance matrix of the
  * squared exponential kernel.
  */
matrix d1_d2_cov_exp_quad(array[] real x1, array[] real x2, real alpha, real rho) {
  int m = size(x1);
  int n = size(x2);
  matrix[m, n] d1_d2_K;

  for (i in 1:m) {
    for (j in 1:n) {
      d1_d2_K[i, j] = d1_d2_exp_quad_kernel(x1[i], x2[j], alpha, rho);
    }
  }

  return d1_d2_K;
}


/**
  * Return first derivative of the covariance matrix of the Matern
  * kernel with nu = 3/2 with respect to the first variable.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern 3/2 kernel
  * @param rho ρ hyperparameter of Matern 3/2 kernel
  * @return First derivative of the covariance matrix of the Matern 3/2
  * kernel with respect to the first variable.
  */
  matrix d1_cov_matern32(array[] real x1, array[] real x2, real alpha, real rho) {
    int m = size(x1);
    int n = size(x2);
    matrix[m, n] d1_K;

    for (i in 1:m) {
      for (j in 1:n) {
        d1_K[i, j] = d1_matern32_kernel(x1[i], x2[j], alpha, rho);
      }
    }

    return d1_K;
  }


/**
  * Return mixed second derivative of the covariance matrix of the Matern
  * kernel with nu = 3/2 with respect to the first variable.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern 3/2 kernel
  * @param rho ρ hyperparameter of Matern 3/2 kernel
  * @return Mixed second derivative of the covariance matrix of the
  * Matern 3/2 kernel.
  */
  matrix d1_d2_cov_matern32(array[] real x1, array[] real x2, real alpha, real rho) {
    int m = size(x1);
    int n = size(x2);
    matrix[m, n] d1_d2_K;

    for (i in 1:m) {
      for (j in 1:n) {
        d1_d2_K[i, j] = d1_d2_matern32_kernel(x1[i], x2[j], alpha, rho);
      }
    }

    return d1_d2_K;
  }


/**
  * Return first derivative of the covariance matrix of the Matern
  * kernel with nu = 5/2 with respect to the first variable.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern 5/2 kernel
  * @param rho ρ hyperparameter of Matern 5/2 kernel
  * @return First derivative of the covariance matrix of the Matern 3/2
  * kernel with respect to the first variable.
  */
  matrix d1_cov_matern52(array[] real x1, array[] real x2, real alpha, real rho) {
    int m = size(x1);
    int n = size(x2);
    matrix[m, n] d1_K;

    for (i in 1:m) {
      for (j in 1:n) {
        d1_K[i, j] = d1_matern52_kernel(x1[i], x2[j], alpha, rho);
      }
    }

    return d1_K;
  }


/**
  * Return mixed second derivative of the covariance matrix of the Matern
  * kernel with nu = 5/2 with respect to the first variable.
  *
  * @param x1 Value of first variable
  * @param x2 Value of second variable
  * @param alpha α hyperparameter of Matern 5/2 kernel
  * @param rho ρ hyperparameter of Matern 5/2 kernel
  * @return Mixed second derivative of the covariance matrix of the
  * Matern 5/2 kernel.
  */
  matrix d1_d2_cov_matern52(array[] real x1, array[] real x2, real alpha, real rho) {
    int m = size(x1);
    int n = size(x2);
    matrix[m, n] d1_d2_K;

    for (i in 1:m) {
      for (j in 1:n) {
        d1_d2_K[i, j] = d1_d2_matern52_kernel(x1[i], x2[j], alpha, rho);
      }
    }

    return d1_d2_K;
  }

Cleaning up

CmdStanPy leaves .hpp, ,o, .d, executable files, and sampling output. To conveniently remove them, use bebi103.stan.clean_cmdstanpy().

[35]:
bebi103.stan.clean_cmdstan()