Stan (MCMC) utilities
[1]:
import re
import numpy as np
import pandas as pd
import cmdstanpy
import arviz as az
import bebi103
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
import bokeh.io
bokeh.io.output_notebook()
The bebi103 package contains a set of convenience functions for use with Stan. Much of the functionality is included in ArviZ, and indeed the bebi103.stan module depends on ArviZ and extensively uses its InferenceData data type. The bebi103 package also offers visualization functions for MCMC results, which we will introduce as we work through a Bayesian workflow example below. Again, much of this functionality is present in ArviZ, but with modifications; in particular, at least in my experience, ArviZ's Bokeh backend has some problems.
Model and sample data
To demonstrate the usage of the Stan utilities of bebi103, we will consider the following hierarchical generative model.
\begin{align}
&\theta_i \sim \text{Norm}(5, 5),\;\;i\in\{1, 2\},\\[1em]
&\tau_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\sigma_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\rho \sim \text{Uniform}(-1, 1),\\[1em]
&\mathsf{T} = \begin{pmatrix}\tau_1^2 & 0 \\ 0 & \tau_2^2\end{pmatrix}, \\[1em]
&\mathsf{\Sigma} = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}, \\[1em]
&\theta_{1, i} \sim \text{Norm}(\theta, \mathsf{T}) \;\forall\,i\in\{1, 2, 3\},\\[1em]
&\begin{pmatrix} x_{i, j} \\ y_{i,j} \end{pmatrix} \sim \text{Norm}(\theta_{1,i}, \mathsf{\Sigma})\;\forall\,i \in \{1, 2, 3\}, \; j \in \{1, \ldots, n_i \}.
\end{align}
This is a hierarchical model in which the three vector-valued parameters \(\theta_{1,1}\), \(\theta_{1,2}\), and \(\theta_{1,3}\) are conditioned on the vector-valued hyperparameter \(\theta\). We will use a fabricated data set generated from this model with parameters described in the Using the user guide section. It is useful to understand the data set, so let’s look at the data frame.
[2]:
df = pd.read_csv('sample_data.csv')
df.head()
[2]:
|   | x | y | trial |
|---|---|---|---|
| 0 | 1.762886 | 11.955616 | 1 |
| 1 | 4.364957 | 11.136633 | 1 |
| 2 | 3.457626 | 12.301267 | 1 |
| 3 | -0.839319 | 10.401899 | 1 |
| 4 | 4.694602 | 11.925334 | 1 |
Each trial (there are three, numbered 1, 2, and 3) has an x measurement and a y measurement.
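To build intuition for the generative model, here is a minimal NumPy sketch of how a data set like this could be simulated. The seed is arbitrary and the per-trial sample sizes n = [20, 25, 18] are assumptions chosen to match the data set above; the parameters actually used to fabricate sample_data.csv are described in the Using the user guide section.

rng = np.random.default_rng(seed=8675309)

# Draws out of the priors (np.abs() gives half-Normal draws)
theta = rng.normal(5, 5, size=2)
tau = np.abs(rng.normal(0, 10, size=2))
sigma = np.abs(rng.normal(0, 10, size=2))
rho = rng.uniform(-1, 1)

# Covariance matrices T (diagonal) and Sigma
T = np.diag(tau**2)
Sigma = np.array(
    [[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
     [rho * sigma[0] * sigma[1], sigma[1] ** 2]]
)

# For each trial, draw theta_1, then draw the (x, y) measurements
n = [20, 25, 18]
for i in range(3):
    theta_1 = rng.multivariate_normal(theta, T)
    xy = rng.multivariate_normal(theta_1, Sigma, size=n[i])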
To begin our analysis, we will use the following Stan model, which directly follows from the above generative model.
data {
// Total number of data points
int N;
// Number of entries in each level of the hierarchy
int J_1;
//Index array to keep track of hierarchical structure
int index_1[N];
// The measurements
real x[N];
real y[N];
}
transformed data {
// Data are two-dimensional, so store in a vector
vector[2] xy[N];
for (i in 1:N) {
xy[i, 1] = x[i];
xy[i, 2] = y[i];
}
}
parameters {
// Hyperparameters level 0
vector[2] theta;
// How hyperparameters vary
vector<lower=0>[2] tau;
// Parameters
vector[2] theta_1[J_1];
vector<lower=0>[2] sigma;
real<lower=-1, upper=1> rho;
}
transformed parameters {
// Covariance matrix for hyperparameters
matrix[2, 2] Tau = [
[tau[1]^2, 0 ],
[0, tau[2]^2]
];
// Covariance matrix for likelihood
matrix[2, 2] Sigma = [
[sigma[1]^2, rho * sigma[1] * sigma[2]],
[rho * sigma[1] * sigma[2], sigma[2]^2 ]
];
}
model {
// Hyperpriors
theta ~ normal(5, 5);
tau ~ normal(0, 10);
// Priors
theta_1 ~ multi_normal(theta, Tau);
sigma ~ normal(0, 10);
rho ~ uniform(-1, 1);
// Likelihood
for (i in 1:N) {
xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
}
}
generated quantities {
real x_ppc[N];
real y_ppc[N];
real log_lik[N];
{
vector[2] xy_ppc;
for (i in 1:N) {
xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
log_lik[i] = multi_normal_lpdf(xy[i] | theta_1[index_1[i]], Sigma);
x_ppc[i] = xy_ppc[1];
y_ppc[i] = xy_ppc[2];
}
}
}
A few important notes about the Stan file:

- The array index_1 keeps track of the dependence of the data on each parameter, which are themselves conditioned on hyperparameters.
- Posterior predictive checks and the log-likelihood are calculated in the generated quantities block.
A first pass through a simple workflow
We will now proceed through an MCMC workflow, performing the following steps:

1. Compile the Stan model.
2. Convert the data frame into input data for the Stan model.
3. Draw samples out of the posterior.
4. Convert the samples into an ArviZ InferenceData instance.
5. Check the sampling diagnostics.
6. Create visualizations of the samples.
Compiling the Stan model
Starting with the first step, making a compiled Stan model that we can use for sampling, we will use CmdStanPy. Although the bebi103 package supports PyStan, CmdStanPy is preferred because of the simplicity of its interface and its shorter model compilation time.
If we wish to suppress messages to the screen, we can use bebi103.stan.disable_logging()
with context management.
[3]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="sample_model.stan")
Preparing hierarchical data for Stan
Now that we have a compiled Stan model, we need to supply it with data. Referring to the data block in the Stan code above, for this hierarchical model we need to specify the total number of data points, the values of x and y, the number of sets of parameters (J_1), and a set of indices that give which of the three \(\theta_1\) values each data point is conditioned on. This information is already contained in the tidy data frame, df. The 'x' and 'y' columns contain the data, and the 'trial' column determines the level in the hierarchy. The value of J_1 is given by the number of unique entries in the 'trial' column, and the number of data points is given by the total number of rows in the data frame.
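To make the required structure concrete, here is roughly how the dictionary could be assembled by hand for this particular data set (a sketch that works only because the trial labels are already the consecutive integers 1, 2, and 3):

data = dict(
    N=len(df),
    J_1=df["trial"].nunique(),
    index_1=df["trial"].values,
    x=df["x"].values,
    y=df["y"].values,
)

In general, though, the level column may contain strings or nonconsecutive labels, which is what the helper function described next handles.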
The bebi103.stan.df_to_datadict_hier() function converts a tidy Pandas data frame into a dictionary of data that can be passed to a Stan program for a hierarchical model. Furthermore, since the 'trial' column could contain strings or other non-integer entries, these need to be converted to integers for use in Stan. A column with the suffix _stan is added to the data frame to give the values of that column as used in the Stan program. This is for reference when analyzing the results.
[4]:
data, df = bebi103.stan.df_to_datadict_hier(
df, level_cols="trial", data_cols=["x", "y"]
)
# Take a look at the updated data frame
df.head()
[4]:
|    | x | y | trial | trial_stan |
|---|---|---|---|---|
| 0 | 1.762886 | 11.955616 | 1 | 1 |
| 19 | 4.245374 | 12.896902 | 1 | 1 |
| 18 | 2.495358 | 6.278253 | 1 | 1 |
| 17 | 4.847436 | 8.901546 | 1 | 1 |
| 16 | 5.585134 | 11.772688 | 1 | 1 |
The data frame has been appropriately updated (the trials were already numbered, so there is no difference aside from the new column). Let's look at the data dictionary.
[5]:
data
[5]:
{'N': 63,
'J_1': 3,
'index_1': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
'x': array([ 1.76288583, 4.24537447, 2.49535766, 4.84743608, 5.58513412,
2.46538959, 2.53372498, 3.03556003, 4.1428636 , 3.37291658,
1.80757816, 3.86468666, 2.0406895 , 6.10018145, 3.70259613,
4.6946018 , -0.83931887, 3.45762554, 4.36495728, 9.02739122,
-0.90483667, 0.63091768, 2.7511249 , 0.84468249, -1.40832682,
0.97071434, 1.65312163, 4.88902665, 4.74467292, 2.00083667,
1.5374102 , 5.36558851, 6.03757711, 1.71903739, 1.56783325,
0.25157748, 6.76167578, -0.67226419, 3.36289739, 2.17013057,
4.03836253, 2.72486749, 3.65771395, 4.59028926, 1.40245352,
4.34252071, 4.1146886 , 0.79938752, 0.72094663, 3.19538556,
0.58176677, -0.01108818, -0.10906546, 2.37188306, 4.88494459,
-0.35811241, -0.85531719, 2.62235591, 0.71487182, 1.38188117,
1.82837862, 0.16000891, 4.81184826]),
'y': array([11.95561574, 12.89690196, 6.27825302, 8.90154585, 11.77268772,
8.9978501 , 6.16475102, 10.64184364, 10.74399897, 8.71908946,
9.63231343, 13.00236894, 9.47857277, 10.97263645, 11.46520631,
11.92533427, 10.40189925, 12.30126673, 11.13663316, 16.78665286,
4.11289192, 5.32907863, 7.82030839, 3.75238271, 3.89596763,
1.37792381, 6.55489775, 9.7609403 , 10.03251758, 6.26665374,
5.62577914, 5.84795532, 8.14624991, 9.12145582, 5.94521903,
4.34204882, 12.30037908, -0.52873034, 4.3659075 , 9.05879288,
8.84628724, 9.95807796, 5.07513802, 10.37096513, 3.73211647,
5.8037337 , 5.02628689, 3.61882617, 6.98712067, 11.21360732,
4.31478889, 4.40895773, 4.73065621, 8.95402277, 8.31905359,
7.80920144, 5.92093593, 4.84087256, 3.43491574, 5.78404006,
8.04955064, 5.29564007, 8.64743679])}
The dictionary has all of the entries necessary to pass into the Stan program while sampling.
Sampling and conversion to ArviZ InferenceData
We can now go about our sampling and get the results. We will again suppress logging. Note that this is in general not a good idea. I do it here to manage space in the documentation.
[6]:
with bebi103.stan.disable_logging():
    samples = sm.sample(
        data=data, chains=4, iter_sampling=1000, adapt_delta=0.95, seed=3252
    )
The bebi103 package works with the excellent ArviZ InferenceData
data type. ArviZ can be used to convert samples from Stan, PyMC3, emcee, PyTorch, and many others into a well-defined universal data format.
[7]:
samples = az.from_cmdstanpy(
samples, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)
Converting InferenceData to a Pandas DataFrame
Now that the samples are stored as an InferenceData instance, we can go about using them. Before turning to the built-in bebi103 visualizations, we may want to convert the posterior samples into a Pandas data frame to do our own analysis.
[8]:
df_samples = bebi103.stan.arviz_to_dataframe(samples)
# Take a look
df_samples.head()
[8]:
|   | theta[0] | theta[1] | tau[0] | tau[1] | theta_1[0,0] | theta_1[0,1] | theta_1[1,0] | theta_1[1,1] | theta_1[2,0] | theta_1[2,1] | ... | Tau[0,1] | Tau[1,0] | Tau[1,1] | Sigma[0,0] | Sigma[0,1] | Sigma[1,0] | Sigma[1,1] | chain__ | draw__ | diverging__ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.37476 | 7.46285 | 0.923329 | 1.45192 | 3.54939 | 10.05800 | 2.82881 | 5.95545 | 1.88787 | 7.08742 | ... | 0.0 | 0.0 | 2.10808 | 4.43857 | 3.29286 | 3.29286 | 6.81653 | 0 | 0 | False |
| 1 | 2.49195 | 7.91184 | 0.950101 | 1.31911 | 3.02827 | 9.84365 | 2.50844 | 5.71521 | 1.82827 | 7.40473 | ... | 0.0 | 0.0 | 1.74006 | 5.15723 | 3.68042 | 3.68042 | 7.61829 | 0 | 1 | False |
| 2 | 2.63931 | 6.99427 | 1.222190 | 3.59020 | 3.73759 | 10.63330 | 2.68387 | 6.89852 | 1.75886 | 5.94173 | ... | 0.0 | 0.0 | 12.88950 | 3.58365 | 3.20140 | 3.20140 | 8.82983 | 0 | 2 | False |
| 3 | 3.25652 | 7.09316 | 0.787648 | 3.52398 | 3.61098 | 10.10210 | 2.60675 | 6.98456 | 1.97803 | 6.58115 | ... | 0.0 | 0.0 | 12.41840 | 3.25437 | 3.20632 | 3.20632 | 8.44636 | 0 | 3 | False |
| 4 | 2.80134 | 7.91179 | 0.739557 | 1.10262 | 2.39642 | 9.66203 | 2.63468 | 6.75522 | 2.13054 | 6.64004 | ... | 0.0 | 0.0 | 1.21576 | 5.09825 | 3.26933 | 3.26933 | 5.90262 | 0 | 4 | False |
5 rows × 24 columns
Variables/parameters naming convention
Note how multidimensional variables are named in the above data frame. They are zero-indexed and use brackets, with the different axes of the index separated by a comma with no spaces. Note also that the last three columns are chain__, draw__, and diverging__, which specify which chain the sample comes from, which of the successive draws it was, and whether or not it resulted in a divergence, respectively. This nomenclature is used throughout the bebi103 MCMC parsing functions.
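As a quick illustration of these columns, here is how we might use them in our own analysis, e.g., computing per-chain means of theta[0] and dropping divergent samples (a sketch using the data frame above):

# Posterior mean of theta[0] for each chain
df_samples.groupby("chain__")["theta[0]"].mean()

# Keep only samples that did not end in a divergence
df_nodiv = df_samples.loc[~df_samples["diverging__"], :]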
Some functions in the bebi103.stan
and bebi103.viz
modules use parameters
as an argument. In those cases, the entries of multidimensional parameters are expected to be entered one-by-one. Other functions use var_names
as an argument. For multidimensional parameters, only the base name (with no indexing) is expected. This var_names
format is congruent with that of ArviZ. As an example, consider a scalar parameter a
, a vector parameter b
with three entries, and a 2×2
matrix parameter c
. The following shows how the respective keyword arguments should be used.
parameters=['a', 'b[0]', 'b[1]', 'b[2]', 'c[0,0]', 'c[0,1]', 'c[1,0]', 'c[1,1]']
var_names=['a', 'b', 'c']
Using parameters
allows omission of some elements of multidimensional parameters.
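For example (a hypothetical call), to include only a and the second entry of b in a plot:

parameters=['a', 'b[1]']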
Checking diagnostics
We may check the following diagnostic features:
The effective sample size (ESS).
The rank-normalized Gelman-Rubin statistic, Rhat (\(\hat{R}\)).
Divergences.
Tree depth.
Energy-Bayes fraction of missing information (E-BFMI)
The functions to check these diagnostics are based on some of the functionality in ArviZ and from the work of Michael Betancourt.
We can start by checking the effective sample size. The bebi103.stan.check_ess() function checks both the effective sample size and the "tail" effective sample size, as defined in Vehtari, et al., 2019.
[9]:
bebi103.stan.check_ess(samples)
Effective sample size looks reasonable for all parameters.
[9]:
True
The function returned True
, indicating that all variables have an effective sample size and tail effective sample size greater than the rule of thumb of 100 per chain. We can adjust the rule of thumb if we like. If we use the total_ess_rule_of_thumb=400 kwarg, then any parameter with fewer than 1600 effective samples (since we had four chains) is flagged as a failure of the ESS check.
[10]:
bebi103.stan.check_ess(samples, total_ess_rule_of_thumb=400)
tail-ESS for parameter theta[0] is 1465.951659584542.
tail-ESS for parameter theta[1] is 1426.717666604798.
ESS for parameter tau[0] is 1235.1932445709374.
ESS for parameter Tau[0,0] is 1235.1909291436552.
ESS or tail-ESS below 100 per chain indicates that expectation values
computed from samples are unlikely to be good approximations of the
true expectation values.
[10]:
False
The function now returns False
, and also prints diagnostic information about the parameters for which the test failed.
Now let’s turn to Rhat. The Rhat value is a rank-normalized, folded Rhat, as described in Vehtari, et al., 2019.
[11]:
bebi103.stan.check_rhat(samples)
Rhat for parameter Tau[0,1] is NaN.
Rhat for parameter Tau[1,0] is NaN.
[11]:
False
The function returned False
, meaning that the Rhat check failed. It informed us that two of the Rhat values were NaN, which causes a failure of the check. It is often the case that variables in the generated quantities block have NaN Rhat because they are not in any way manipulated by the sampler. In this case, the off-diagonal terms in the matrix Tau are both always zero. We can omit these using the omit kwarg.
[12]:
bebi103.stan.check_rhat(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Rhat looks reasonable for all parameters.
[12]:
True
Next, we can check for divergences.
[13]:
bebi103.stan.check_divergences(samples)
53 of 4000 (1.325%) iterations ended with a divergence.
Try running with larger adapt_delta to remove divergences.
[13]:
False
We failed the divergence test because we had 53 total divergences. We will deal with these momentarily.
We can also check if the sampler was hitting the maximum tree depth. By default, the tree depth used in a sampling call by CmdStanPy is 10, which is also the default value of the max_treedepth kwarg of bebi103.stan.check_treedepth(). You need to set this kwarg to match whatever tree depth was used in the sampling.
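For example, if the call to sm.sample() had used max_treedepth=12, the check would need to match (a hypothetical call):

bebi103.stan.check_treedepth(samples, max_treedepth=12)

In our case, we sampled with the default, so we can use the default here as well.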
[14]:
bebi103.stan.check_treedepth(samples)
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
[14]:
True
We have passed this test.
Finally, we can check the E-BFMI.
[15]:
bebi103.stan.check_energy(samples)
E-BFMI indicated no pathological behavior.
[15]:
True
We have also passed this test.
For convenience, we can perform all tests at once using the bebi103.stan.check_all_diagnostics()
function.
[16]:
bebi103.stan.check_all_diagnostics(samples, omit=["Tau[0,1]", "Tau[1,0]"])
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
53 of 4000 (1.325%) iterations ended with a divergence.
Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[16]:
4
The return value of this function is a number that, when converted to binary, has one digit per test; a digit of zero indicates the test passed, and a digit of one indicates it failed. Starting from the least significant (rightmost) binary digit, the ordering of the tests goes:

1. ess
2. r_hat
3. divergences
4. tree depth
5. E-BFMI

For example, a warning code of 12 has a binary representation of 01100, which means that the divergence and tree depth tests failed. Our warning code of 4 has a binary representation of 00100, which means that only the divergence test failed.
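The decoding is simple bit masking, as in this sketch (the bebi103.stan.parse_warning_code() function, used later in this guide, does this for you):

# Decode a warning code by hand; bit 0 is the first test in the list above
tests = ["ess", "r_hat", "divergence", "treedepth", "e_bfmi"]
warning_code = 12
for bit, name in enumerate(tests):
    if warning_code & (1 << bit):
        print(name, "warning")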
Visualizing results
The bebi103 package allows for three visualizations of MCMC results:

- Trace plots: bebi103.viz.trace()
- Parallel coordinate plots: bebi103.viz.parcoord()
- Corner plots: bebi103.viz.corner()
Each of these takes as its first six arguments:

- samples: ArviZ InferenceData instance containing the samples. (Note that corner() also accepts other data types for samples, such as in display of bootstrap replicates.)
- parameters: List of the names of variables to include in the plot.
- palette: List of colors to use.
- omit: List of variable names or regular expression patterns for variables to omit from the plot.
- include_ppc: If True, include variables ending in _ppc, which denote posterior predictive checks, in the plot. Default is False.
- include_log_lik: If True, include variables starting with log_lik or loglik. These denote log-likelihood contributions. Default is False.

samples is the only required argument.
Trace plots
We will begin with a trace plot of the hyperparameters θ and τ.
[17]:
bokeh.io.show(
bebi103.viz.trace(samples, parameters=["theta[0]", "theta[1]", "tau[0]", "tau[1]"])
)
The traces are color-coded by chain.
Parallel coordinate plots
For a parallel coordinate plot, we want to omit the entries of the matrices Tau and Sigma because the entries are either zero or determined uniquely by other parameters. We can specify regular expressions in the omit list to conveniently omit all of the entries of these matrix parameters.
[18]:
bokeh.io.show(
bebi103.viz.parcoord(samples, omit=[re.compile("Tau.*"), re.compile("Sigma.*")])
)
In the plot, samples that did not result in a divergence are semi-transparent gray by default. Samples that had a divergence are colored in orange and shown with thicker lines. Because the scales of the respective parameters may differ, it is often easier to rescale each parameter for the plot. We can do this with the transformation kwarg; setting it to 'minmax' scales each parameter so that its minimum is zero and its maximum is one.
[19]:
bokeh.io.show(
bebi103.viz.parcoord(
samples,
omit=[re.compile("Tau.*"), re.compile("Sigma.*")],
transformation="minmax",
)
)
This plot is more revealing. We see (especially if we zoom in) that most of the divergent samples pass through small values of tau[0]. This is indicative of the "funnel" behavior often encountered in hierarchical models.
Corner plots
Corner plots display scatter plots of samples from all pairs of variables, as well as histograms or ECDFs for individual variables. We can make a corner plot for the hyperparameters and also for ρ.
[20]:
bokeh.io.show(
bebi103.viz.corner(
samples, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
)
)
These are very useful plots for interpreting a posterior, and also for diagnosing problems with sampling. The divergent samples appear in orange. The clustering of divergences for small tau[0]
is clear.
The diagonal can also be plotted with ECDFs, and contours may be added to the off-diagonal plots.
[21]:
bokeh.io.show(
bebi103.viz.corner(
samples,
parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"],
plot_ecdf=True,
show_contours=True,
)
)
Simulation-based calibration (SBC)
In a principled Bayesian workflow, simulation-based calibration (SBC) is a useful tool to identify pathologies in a generative model and in the sampler’s ability to draw samples from it. The procedure is laid out in Talts, et al., 2018, and is described in the Stan manual.
Briefly, the procedure is:

1. Draw a parameter set θ out of the prior.
2. Use θ to draw a data set y out of the likelihood.
3. Perform MCMC sampling of the posterior using y as if it were the actual measured data set, drawing L MCMC samples of the parameters.
4. Do steps 1–3 N times (hundreds to a thousand is usually a good number).
For each of these calculations, you should run diagnostics to make sure the effective sample size, Rhat, etc., are within acceptable ranges. You can also compute z-scores (a measure of how well the MCMC samples reveal the ground truth), shrinkage (how informative the data can be), and rank statistics (used to check sampler performance). Please see the Talts, et al. paper for details on these quantities and their interpretation. Our focus here is how to use
the bebi103.stan.sbc()
function to perform the analysis and compute these quantities.
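For reference, for a single parameter with ground truth \(\tilde{\theta}\), posterior mean \(\mu\), posterior standard deviation \(\sigma\), and prior standard deviation \(\sigma_\mathrm{prior}\), the z-score and shrinkage are typically defined (conventions vary slightly; some authors take the absolute value of the z-score) as

\begin{align}
&z = \frac{\mu - \tilde{\theta}}{\sigma},\\[1em]
&s = 1 - \frac{\sigma^2}{\sigma_\mathrm{prior}^2}.
\end{align}

A z-score of small magnitude means the posterior concentrates near the ground truth, and shrinkage near one means the data are much more informative than the prior.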
Example model
We have already seen in the example so far that we have divergences and the sampler does not seem to be able to sample small values of tau[0]
. We will therefore write a new Stan program that uses a noncentered parametrization. The Stan model is:
data {
// Total number of data points
int N;
// Number of entries in each level of the hierarchy
int J_1;
//Index array to keep track of hierarchical structure
int index_1[N];
// The measurements
real x[N];
real y[N];
}
transformed data {
// Data are two-dimensional, so store in a vector
vector[2] xy[N];
for (i in 1:N) {
xy[i, 1] = x[i];
xy[i, 2] = y[i];
}
}
parameters {
// Hyperparameters level 0
vector[2] theta;
// How hyperparameters vary
vector<lower=0>[2] tau;
// Parameters
vector<lower=0>[2] sigma;
real<lower=-1, upper=1> rho;
// Noncentered parameters
vector[2] theta_1_noncentered[J_1];
}
transformed parameters {
// Covariance matrix for likelihood
matrix[2, 2] Sigma = [
[sigma[1]^2, rho * sigma[1] * sigma[2]],
[rho * sigma[1] * sigma[2], sigma[2]^2 ]
];
// Center parameters
vector[2] theta_1[J_1];
for (i in 1:J_1) {
theta_1[i] = theta + tau .* theta_1_noncentered[i];
}
}
model {
// Hyperpriors
theta ~ normal(5, 5);
tau ~ normal(0, 10);
// Priors
theta_1_noncentered ~ multi_normal([0, 0], [[1, 0], [0, 1]]);
sigma ~ normal(0, 10);
rho ~ uniform(-1, 1);
// Likelihood
for (i in 1:N) {
xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
}
}
generated quantities {
real x_ppc[N];
real y_ppc[N];
real log_lik[N];
{
vector[2] xy_ppc;
for (i in 1:N) {
xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
log_lik[i] = multi_normal_lpdf(xy[i] | theta_1[index_1[i]], Sigma);
x_ppc[i] = xy_ppc[1];
y_ppc[i] = xy_ppc[2];
}
}
}
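The key identity behind the noncentering: if \(\tilde{\theta}_{1,i} \sim \text{Norm}(0, \mathsf{I})\), then

\begin{align}
\theta_{1,i} = \theta + \tau \circ \tilde{\theta}_{1,i}
\end{align}

has a \(\text{Norm}(\theta, \mathsf{T})\) distribution because \(\mathsf{T}\) is diagonal (here \(\circ\) denotes elementwise multiplication, Stan's .* operator). The sampler then explores independent standard Normal variables rather than the funnel geometry directly.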
Let’s go ahead and compile the model and draw some samples just to see if the noncentering alleviates the problems with divergences. We will use extra warmup to make sure the chains reach stationarity.
[22]:
with bebi103.stan.disable_logging():
    sm_nc = cmdstanpy.CmdStanModel(stan_file="sample_model_noncentered.stan")
    samples_nc = sm_nc.sample(
        data=data,
        chains=4,
        iter_warmup=2000,
        iter_sampling=1000,
        adapt_delta=0.95,
        seed=3252,
    )

samples_nc = az.from_cmdstanpy(
    samples_nc, posterior_predictive=["x_ppc", "y_ppc"], log_likelihood="log_lik"
)
We can then make a corner plot to check for divergences.
[23]:
bokeh.io.show(
bebi103.viz.corner(
samples_nc, parameters=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"]
)
)
We still got some divergences, but only nine, and these could be false positives. We will go forward with SBC using this model.
Building an SBC
To build a simulation-based calibration, we need to write a Stan program that generates data sets based on parameters drawn out of the prior, i.e., a prior predictive model.
data {
// Total number of data points
int N;
// Number of entries in each level of the hierarchy
int J_1;
//Index array to keep track of hierarchical structure
int index_1[N];
}
generated quantities {
vector[2] theta;
vector[2] tau;
vector[2] sigma;
for (i in 1:2) {
theta[i] = normal_rng(5, 5);
tau[i] = fabs(normal_rng(0, 10));
sigma[i] = fabs(normal_rng(0, 10));
}
real rho = uniform_rng(-1, 1);
vector[2] theta_1[J_1];
for (i in 1:J_1) {
for (j in 1:2) {
theta_1[i, j] = normal_rng(theta[j], tau[j]);
}
}
real x[N];
real y[N];
{
matrix[2, 2] Sigma = [
[sigma[1]^2, rho * sigma[1] * sigma[2]],
[rho * sigma[1] * sigma[2], sigma[2]^2 ]
];
vector[2] xy;
for (i in 1:N) {
xy = multi_normal_rng(theta_1[index_1[i]], Sigma);
x[i] = xy[1];
y[i] = xy[2];
}
}
}
To use it in SBC, we need to compile it.
[24]:
with bebi103.stan.disable_logging():
    sm_prior_pred = cmdstanpy.CmdStanModel(
        stan_file="sample_model_prior_predictive.stan"
    )
Running an SBC
Finally, we are ready to perform SBC using the bebi103.stan.sbc() function. We need to tell the function:

- Which Stan model to use for the prior predictive model.
- Which Stan model to use for posterior sampling.
- Which data dictionary to feed into the prior predictive model.
- Which data dictionary to feed into the posterior sampling model.
- The names of the measured data. For multidimensional data, do not include indexing. In the present case, the measured data are ['x', 'y'].
- The names of the posterior predictive variables. These are not used in shrinkage, z-score, and rank statistic calculations.
- The name of the log-likelihood variable. This is not used in shrinkage, z-score, and rank statistic calculations.
- The data types of the measured data.
- Any kwargs to use in posterior sampling.
- Any kwargs to pass to bebi103.stan.check_all_diagnostics().
- How many cores you want to use for the calculation.
- How many SBC calculations you want to do. You should do at least 400; 1000 is a good number.
SBC calculations are lengthy, so to keep the runtime of this documentation manageable, we will load a pre-computed SBC result if it is available. For reference, this SBC calculation took about six hours on 16 cores on an AWS c5 instance.
[25]:
try:
    df_sbc = pd.read_csv('sbc_results.csv')
except FileNotFoundError:
    df_sbc = bebi103.stan.sbc(
        prior_predictive_model=sm_prior_pred,
        posterior_model=sm_nc,
        prior_predictive_model_data=data,
        posterior_model_data=data,
        measured_data=['x', 'y'],
        posterior_predictive_var_names=['x_ppc', 'y_ppc'],
        log_likelihood_var_name='log_lik',
        measured_data_dtypes={'x': float, 'y': float},
        sampling_kwargs={'iter_warmup': 2000, 'adapt_delta': 0.95},
        diagnostic_check_kwargs={'omit': ['Tau.*', 'Sigma.*']},
        cores=16,
        N=1000,
        progress_bar=True,
    )

    df_sbc.to_csv('sbc_results.csv', index=False)
The output of the SBC calculation is a tidy data frame with the results for each of the 1000 trials.
[26]:
df_sbc
[26]:
|   | ground_truth | rank_statistic | mean | sd | shrinkage | z_score | Rhat | ESS | ESS_per_iter | tail_ESS | tail_ESS_per_iter | n_divergences | n_bad_ebfmi | n_max_treedepth | warning_code | L | trial | parameter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.281950 | 3712 | 0.403373 | 2.407448 | 0.770475 | -1.611074 | 1.002368 | 1341.460058 | 0.335365 | 1303.921933 | 0.325980 | 15 | 0 | 60 | 12 | 4000 | 0 | theta[0] |
| 1 | 4.974810 | 313 | 11.482235 | 4.101083 | 0.333940 | 1.586758 | 1.001557 | 1387.925463 | 0.346981 | 2224.981728 | 0.556245 | 2 | 0 | 234 | 12 | 4000 | 1 | theta[0] |
| 2 | -0.356069 | 235 | 6.575907 | 4.445932 | 0.217216 | 1.559173 | 1.002113 | 1879.793973 | 0.469948 | 2239.299188 | 0.559825 | 0 | 0 | 0 | 0 | 4000 | 2 | theta[0] |
| 3 | 5.071360 | 1929 | 5.121302 | 1.737149 | 0.880494 | 0.028750 | 1.002705 | 1730.319116 | 0.432580 | 1575.431994 | 0.393858 | 1 | 0 | 0 | 4 | 4000 | 3 | theta[0] |
| 4 | 4.264670 | 1218 | 5.950835 | 3.416150 | 0.537842 | 0.493586 | 1.000094 | 1684.749750 | 0.421187 | 2169.509041 | 0.542377 | 0 | 0 | 0 | 0 | 4000 | 4 | theta[0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12995 | 3.222330 | 1162 | 4.178241 | 1.764406 | 0.974475 | 0.541775 | 1.000539 | 3132.297533 | 0.783074 | 3482.163558 | 0.870541 | 0 | 0 | 0 | 0 | 4000 | 995 | theta_1[2,1] |
| 12996 | -7.107200 | 631 | -5.619239 | 1.471871 | 0.982237 | 1.010932 | 1.000755 | 4177.643205 | 1.044411 | 3496.129529 | 0.874032 | 0 | 0 | 0 | 0 | 4000 | 996 | theta_1[2,1] |
| 12997 | 9.484280 | 763 | 10.892740 | 1.644157 | 0.977836 | 0.856645 | 1.000262 | 3765.095120 | 0.941274 | 2540.140071 | 0.635035 | 0 | 0 | 0 | 0 | 4000 | 997 | theta_1[2,1] |
| 12998 | -6.195420 | 1290 | -4.456004 | 4.444606 | 0.838030 | 0.391354 | 1.001032 | 3340.526088 | 0.835132 | 3040.492446 | 0.760123 | 0 | 0 | 0 | 0 | 4000 | 998 | theta_1[2,1] |
| 12999 | -1.098650 | 3921 | -6.015190 | 2.318488 | 0.955926 | -2.120581 | 1.000614 | 4789.118749 | 1.197280 | 3234.499689 | 0.808625 | 0 | 0 | 0 | 0 | 4000 | 999 | theta_1[2,1] |
13000 rows × 18 columns
For each parameter for each trial, the ground truth is given (as drawn from the prior), as well as a rank statistic, mean, standard deviation, shrinkage, and z-score. MCMC diagnostics are also reported.
Interpreting results
We will not go into much detail about interpreting results here, directing you instead to the Talts, et al. paper. We will instead do a quick run through some of the results and suggested visualization procedures, highlighting some of the useful functions in bebi103 along the way.
We can first check what diagnostic warnings we encountered. There are thirteen parameters, so we can count the warning codes and divide by 13 to see how many MCMC calculations had problems.
[27]:
df_sbc.groupby("warning_code").size() // 13
[27]:
warning_code
0 530
2 3
4 277
5 12
6 4
7 13
8 62
10 1
11 18
12 66
13 1
14 3
15 10
dtype: int64
Most of the samples either had no diagnostic warnings (warning code 0
) or only divergences (warning code 4
). To help interpret the rest of the warning codes, we can use the bebi103.stan.parse_warning_code()
function. For example, the next most common warning code was 12
.
[28]:
bebi103.stan.parse_warning_code(12)
divergence warning
treedepth warning
This means that we had tree depth and divergence warnings. Altogether, the diagnostics tell us there are some problems with our sampling as we have it set up. We should use more iterations, increase adapt_delta, and also increase max_treedepth in the calculations.
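As a sketch, a rerun of the SBC might pass sampling kwargs like these (illustrative values, not prescriptive):

sampling_kwargs={'iter_warmup': 4000, 'iter_sampling': 2000, 'adapt_delta': 0.99, 'max_treedepth': 12}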
We should also check the z-score and shrinkage. We will plot the z-score vs. shrinkage for each parameter, coloring the glyphs by the warning code. This coloring enables us to see if any of the MCMC diagnostic issues might be affecting shrinkage and z-scores.
[29]:
points = hv.Points(
data=df_sbc,
kdims=["shrinkage", ("z_score", "z-score")],
vdims=[("parameter", "param"), "warning_code"],
).groupby(
"parameter"
).opts(
color="warning_code",
frame_height=125,
frame_width=125,
size=2,
tools=["hover"],
).layout(
)
bokeh.io.show(hv.render(points))
Shrinkage close to one and z-scores of small magnitude (between −5 and 5) are desirable. The z-scores look fine for all parameters, but the shrinkage is poor for the hyperparameters. This does not appear to depend on MCMC diagnostic issues. Rather, it tells us that with the size of data sets we are considering, we cannot learn much about the hyperparameters.
Finally, as suggested by Talts, et al., we can make an SBC rank ECDF plot using the bebi103.viz.sbc_rank_ecdf() function.
[30]:
p = bebi103.viz.sbc_rank_ecdf(
df_sbc, show_legend=True, marker_kwargs={"size": 2},
)
p.legend.spacing = 0
p.legend.label_text_font_size = '8pt'
bokeh.io.show(p)
The ECDF difference of the rank statistic falling within the theoretical 99-percentile envelope is indicative of good calibration of the sampler and lack of bias. The parameter sigma[1]
has a low-biased rank statistic, suggesting that the samples of sigma[1]
will tend to be biased toward larger values.
Cleaning up
CmdStanPy leaves .hpp, .o, and .d files, executables, and sampling output behind. To conveniently remove them, use bebi103.stan.clean_cmdstan().
[31]:
bebi103.stan.clean_cmdstan()