Bootstrap utilities


[30]:
import numpy as np
import pandas as pd
df = pd.read_csv('sample_data.csv')

import bebi103

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

The bebi103 package contains utilities for nonparametric bootstrap calculations. These fall into three categories.

  1. Bootstrap replicates of a single scalar statistic from a one-dimensional data set.

  2. Pairs bootstrap replicates of a single scalar statistic.

  3. Permutation replicates drawn from two one-dimensional data sets.

Additionally, it contains draw_bs_reps_mle(), which is used for drawing bootstrap replicates of a maximum likelihood estimator.

Random number generation

Bootstrap method relies on random number generation. Numpy’s documentation suggests instantiating a Generator instance and using that to draw random numbers, for example as

import numpy as np

# Instantiate Generator
rg = np.random.default_rng()

# Draw Normally distributed random numbers
rg.normal(loc=2.5, scale=1.0, size=100)

By default, np.random.default_rng() gives a PCG-64 random number generator.

In legacy uses of Numpy, random numbers were drawn using the Mersenne Twister algorithm. This functionality is still available, as

import numpy as np

# Draw Normally distributed random numbers
np.random.normal(loc=2.5, scale=1.0, size=100)

Numba currently (v. 0.50) has support for the legacy Numpy implementation using the Mersenne Twister algorithm. Because much of the functions using random number generation in the bebi103 package use Numba to get speed boosts by just-in-time compilation, all functions in the bebi103.bootstrap module except bebi103.bootstrap.draw_bs_reps_mle() all use the legacy Numpy random number generation. The MLE bootstrap function does not because it allows for parallel computation, and seeding the RNG would result in all threads giving the same values.

Seeding the RNG

For testing, debugging, and pedagogical reproducibility purposes, it is sometimes valuable to seed the random number generator. To do so, you need to seed the random number generator both in vanilla Numpy and with a Numba’d function. This is accomplished using the bebi103.bootstrap.seed_rng() function.

[31]:
bebi103.bootstrap.seed_rng(3252)

The results of bebi103.bootstrap.draw_bs_reps_mle() are unaffected by this seeding.

Nonparametric resampling methods

We begin with the available nonparametric bootstrap and permutation methods.

Boostrap replicates from one-dimensional data

A bootstrap replicate is obtained by resampling a data set with replacement and then computing the prescribed summary statistic. The summary statistic need not be scalar, scalar summary statistics are a common use case. Drawing the bootstrap replicates from a set of scalar measurements is achieved using the bebi103.bootstrap.draw_bs_reps() function. It takes two positional arguments.

  1. The data set to resample.

  2. The function used to compute the summary statistic.

Additionally, it takes two keyword arguments.

  1. size, the number of bootstrap replicates to generate.

  2. args, a tuple containing any other parameters that need to be passed into the function used to compute the summary statistic.

Below, I draw 10,000 replicates of the median.

[32]:
bs_reps = bebi103.bootstrap.draw_bs_reps(df['x'], np.median, size=10000)

To compute confidence intervals from the replicates, we take percentiles. For example, to compute a 95% confidence interval, we compute the 2.5th and 97.5th percentiles.

[33]:
conf_int = np.percentile(bs_reps, [2.5, 97.5])

We just computed the confidence interval for the x values of the data set ignoring the trials. We can automate this process to compute confidence intervals for each of the three trials, storing them in a list of dictionaries along with the trial numbers and point estimates of the median.

[34]:
summaries = []

for label, group in df.groupby("trial"):
    label = str(label)
    estimate = np.median(group["x"])

    bs_reps = bebi103.bootstrap.draw_bs_reps(group["x"], np.median, size=10000)
    conf_int = np.percentile(bs_reps, [2.5, 97.5])

    summaries.append(dict(label=label, estimate=estimate, conf_int=conf_int))

To make sure we understand the entries in summaries, a list of dictionaries, let’s take a look.

[35]:
summaries
[35]:
[{'label': '1',
  'estimate': 3.5801108355411304,
  'conf_int': array([2.51454132, 4.30516588])},
 {'label': '2',
  'estimate': 2.0008366741479433,
  'conf_int': array([1.40245352, 3.65771395])},
 {'label': '3',
  'estimate': 1.0906343409350114,
  'conf_int': array([0.43744037, 2.90887073])}]

When the summary statistics are stored in this format, we can use the bebi103.viz.confints() function to make a plot.

[36]:
bokeh.io.show(bebi103.viz.confints(summaries, x_axis_label='x', y_axis_label='trial'))

Pairs bootstrap

We can also perform pairs bootstrap calculations in which pairs of data from two one-dimensional arrays of the same length are drawn. For example, we may with to draw pairs bootstrap replicates of the Pearson correlation coefficient for the \(x\) and \(y\) data in the data set. Conveniently, the bebi103.bootstrap.pearson_r() function computes the correlation coefficient. We use the bebi103.bootstrap.draw_bs_reps_pairs() to do the calculation. The API is similar to the univariate case, except now we supply three positional arguments, the x-data, the y-data, and the function to compute the summary statistic from the two.

[37]:
summaries = []

for label, group in df.groupby("trial"):
    label = str(label)
    estimate = bebi103.bootstrap.pearson_r(group["x"], group["y"])

    bs_reps = bebi103.bootstrap.draw_bs_reps_pairs(
        group["x"], group["y"], bebi103.bootstrap.pearson_r, size=10000
    )
    conf_int = np.percentile(bs_reps, [2.5, 97.5])

    summaries.append(dict(label=label, estimate=estimate, conf_int=conf_int))


bokeh.io.show(bebi103.viz.confints(summaries, x_axis_label="ρ", y_axis_label="trial"))

Permutation replicates

When testing the hypothesis that two one-dimensional data sets come from identical distribution, we may use a permutation test. If data set x has m measurements and data set y has n measurements, a permutation sample is obtained by concatenating the two data sets, randomly scrambling the order of the concatenated data set, and then assigning the first m entries in the concatenated/scrambled data set to be “x” and the remaining n entries to be “y.” A summary statistic is then computed from the permutation sample to get a permutation replicate. This is done over and over again in order to get many permutation replicates to ultimately compute a p-value.

The bebi103.bootstrap.draw_perm_reps() function generates permutation replicates. The API is similar to generating pairs bootstrap with three positional arguments, the x-data, the y-data, and the function to compute the summary statistic from the two. Here, we demonstrate a permutation test on the values of measurement x for trials 1 and 2. We will use the Studentized difference of means as the test statistic, which is conveniently available in the bebi103.bootstrap.studentized_diff_of_means() function.

[38]:
# Extract data sets
x1 = df.loc[df["trial"] == 1, "x"]
x2 = df.loc[df["trial"] == 2, "x"]

# Compute test statistic from original data
diff_means_studentized = bebi103.bootstrap.studentized_diff_of_means(x1, x2)

# Draw permutation reps
perm_reps = bebi103.bootstrap.draw_perm_reps(
    x1, x2, bebi103.bootstrap.studentized_diff_of_means, size=10000
)

# Compute p-value
np.sum(perm_reps >= diff_means_studentized) / 10000
[38]:
0.0287

In this case, we got p = 0.03.

Parametric methods

In parametric modeling, we have an explicit generative model in mind that generated the data. The generative model has a set of parameters, and a central task is to estimate those parameters. The maximum likelihood estimate (MLE) of the parameters consists of the parameter values that maximize the likelihood function.

The bebi103 package allows for drawing bootstrap replicates of maximum likelihood estimates (MLEs). This is accomplished using the bebi103.bootstrap.draw_bs_reps_mle() function. The function does the following.

  1. Compute the MLE of the parameters of the generative model from the measured data.

  2. Use the generative model parametrized by the MLE to generate a new data set. This new data set is referred to as a bootstrap sample.

  3. Compute the MLE from the bootstrap sample and store it. This value of the MLE from a bootstrap sample is referred to as a bootstrap replicate.

  4. Repeat steps 2 and 3 until the desired number of bootstrap replicates is acquired.

This function is quite general, requiring the user to specify a function to compute the MLE, a function to draw a bootstrap sample, and the data themselves.

MLE for a univariate data set

To begin, we will consider a model in which all of the \(x\)-data from from a Normal distribution. That is, our generative model is

\begin{align} x \sim \text{Norm}(\mu, \sigma), \end{align}

with the two parameters we seek to estimate from the data being the location parameter \(\mu\) and the scale parameter \(\sigma\). Conveniently, the MLEs for the location and scale parameters for a Normal model are the plug-in estimates for the mean and standard deviation.

First, we define the function to compute the MLE. The function must the data set as its first argument, and may have other arguments if necessary.

[39]:
def mle_fun(x):
    """Given a data set, compute MLE (plug-in) estimates for the
    mean and variance."""
    return np.array([np.mean(x), np.std(x)])

Next, we need a function that generates a new data set from the MLE. This function must have the parameters as a Numpy array (in this case an array of the mean and standard deviation) as its first argument. Additional arguments, if any, follow. The last two arguments must be size and rg, which are respectively the number of new data to generate and the random number generator to use.

[40]:
def gen_fun(params, size, rg):
    """Generate a new data set"""
    return rg.normal(*params, size=size)

We can now draw bootstrap replicates of the MLE.

[41]:
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(mle_fun, gen_fun, df['x'], size=10000)

The output is a Numpy array where each row is a parametric bootstrap replicate of the MLE. Each column corresponds to on of the variables, the first being \(\mu\) and the second being \(\sigma\). We can compute 95% confidence intervals for the two.

[42]:
np.percentile(bs_reps, [2.5, 97.5], axis=0)
[42]:
array([[2.08490171, 1.74424982],
       [3.12677947, 2.48335369]])

The first column is the confidence interval for \(\mu\) and the second is the confidence interval for \(\sigma\).

We should, of course, also compute the MLE itself (not just its confidence interval!).

[43]:
params_mle = mle_fun(df['x'])

print("μ MLE: {0:.2f}\nσ MLE: {1:.2f}".format(*params_mle))
μ MLE: 2.61
σ MLE: 2.13

Graphical display of a confidence region

While the above calculation of confidence intervals are useful for each parameter, we may wish to understand how the MLE of \(\mu\) and that of \(\sigma\) might depend on each other. E.g., we might wonder if we have a higher estimate for \(\mu\), do we also have a higher estimate for \(\sigma\)?

To address this question, we can plot all of the bootstrap replicates. From those, we can draw a contour around a 95% confidence region, which contains 95% of the bootstrap samples. This is achieved with a corner plot, which also includes histograms of the bootstrap replicates for each parameter individually. This functionality is given in the bebi103.viz.corner() function.

[44]:
bokeh.io.show(
    bebi103.viz.corner(
        bs_reps,
        labels=["µ", "σ"],
        show_contours=True,
        levels=[0.95],
        max_plotted=1000,
    )
)

Note that we have used the max_plotted=1000 kwarg to only plot 1,000 of the 10,000 samples. This is to keep the file size down in the rendering of this notebook in the hosted documentation.

Graphical model comparison of univariate data sets

The generative model emits data. If we parametrize the generative model with the MLE, we can use the generative model to draw new data sets. We can compare these predicted data set the experimental data set. For this purpose, graphical comparisons are useful.

In order to construct the graphical model comparisons, we need to generate predicted data sets as samples out of the generative model parametrized by the MLE. Each data set has the same number of “observations” as the observed data set. We will generate 10,000 such data sets.

[45]:
rg = np.random.default_rng(seed=3252)
samples = np.array([gen_fun(params_mle, len(df), rg) for _ in range(10000)])

Q-Q plots

Q-Q plots are an effective means of model comparison. They are constructed as follows. Say we have n data points. We draw n samples out of the generative model and sort them. We then plot the sorted generated data against the sorted measured data points. Plotted points falling on a straight line of slope one and passing through the origin is indicative of agreement between the model and the true process that generated the data.

We can repeat this over and over again, can can compute percentiles of the Q-Q plots to give a confidence interval of the plots of sorted generated data versus sorted measured data. If the confidence interval captures the line of unit slope and zero intercept, the data are commensurate with the model.

We have the generated data sets, stored as samples. Along with the measured data (in this case, fabricated “measured” data for illustrative purposes), we can generated a Q-Q plot using bebi103.viz.qqplot().

[46]:
bokeh.io.show(bebi103.viz.qqplot(data=df["x"], samples=samples))

By default, the middle 95% of samples are displayed as the shaded region.

Predictive ECDFs

To compare data sets that would be generated from the model and the observed data set, we plot percentiles of the ECDFs of the generated data sets with the ECDF of the data overlaid using the bebi103.viz.predictive_ecdf() function.

[47]:
p = bebi103.viz.predictive_ecdf(data=df['x'], samples=samples)
bokeh.io.show(p)

The orange curve is the ECDF of the data. The dark blue curve in the center is the median ECDF of the samples. By default, the coloring shows the middle 68th and middle 95th percentiles of the ECDFs from samples out of the generative model.

Predictive ECDFs are often easier to look at as differences from the median. We can consider two different differences. First, we can take each value of the ECDF and look at how different the x-values are from the median. Second, we can consider each value of x and look at how different the value of the ECDF is. In the first case, we are looking at an inverse ECDF (IECDF). We can look at either different plot using the diff kwarg.

[48]:
p_iecdf = bebi103.viz.predictive_ecdf(
    data=df["x"], samples=samples, diff="iecdf", frame_height=175
)

p_ecdf = bebi103.viz.predictive_ecdf(
    data=df["x"], samples=samples, diff="ecdf", frame_height=175
)


bokeh.io.show(
    bokeh.layouts.column([p_iecdf, bokeh.layouts.Spacer(height=30), p_ecdf])
)

MLE for a multivariate data set

The MLE analysis generalizes to multivariate data. As a simple example, we will consider a bivariate example where \(y\) is a linear function of \(x\).

\begin{align} y_i \sim \text{Norm}(a_0 + a_1 x_i, \sigma) \;\;\forall i. \end{align}

The MLE for \(a_0\), and \(a_1\) are given by the slope and intercept from a least squares calculation, and the MLE for \(\sigma\) is given by

\begin{align} \sigma_\mathrm{MLE} = \sqrt{\text{RSS}/n}, \end{align}

where RSS is the residual sum of squares. We can write mle_fun() to return the MLEs for the three parameters.

[49]:
def mle_fun(y, x):
    """Given x, y data and a linear model with homoscedastic error,
    compute MLE for slope, intercept, and standard deviation."""
    # Obtain MLEs
    a1, a0 = np.polyfit(x, y, deg=1)
    sigma = np.sqrt(np.sum((a0 + a1 * x - y)**2) / len(x))

    return a0, a1, sigma

Note that we have identified the dependent variable y as the “data,” and the independent variable x as a “parameter.”

We can use this function to get the MLEs.

[50]:
x = df["x"].values
y = df["y"].values

mle_params = mle_fun(y, x)

To obtain bootstrap replicates of the MLE using bebi103.bootstrap.draw_bs_reps_mle(), we need to specify a function to generate data from the model. As with the MLE function, we need to pass an additional parameter here, x, the values of the dependent variable.

[51]:
def gen_fun(params, x, size, rg):
    """Generate a new data set"""
    a0, a1, sigma = params

    return rg.normal(a0 + a1 * x, sigma)

We can now draw our bootstrap replicates and compute confidence intervals for the MLEs. We use the mle_args and gen_args keyword arguments to specify the x must be passed as an argument to mle_fun() and gen_fun(). In this case, we will use the n_jobs kwargs to enable parallel acquisition of the bootstrap samples.

[52]:
# Get reps
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_fun, gen_fun, y, mle_args=(x,), gen_args=(x,), size=10000, n_jobs=2
)

# Confidence intervals
np.percentile(bs_reps, [2.5, 97.5], axis=0)
[52]:
array([[4.15616345, 0.71864661, 1.9791932 ],
       [6.09647939, 1.29010591, 2.8266433 ]])

The first column is the confidence interval for \(a_0\), the second for \(a_1\), and the third for \(\sigma\).

Graphical display of confidence region

We can again use the bebi103.viz.corner() function to display the confidence regions. This time, we will plot the samples for the individual parameters with an ECDF.

[53]:
bokeh.io.show(
    bebi103.viz.corner(
        bs_reps,
        labels=["a₀", "a₁", "σ"],
        show_contours=True,
        levels=[0.95],
        plot_ecdf=True,
        max_plotted=1000,
    )
)

The corner plot reveals in this case that the MLEs for \(a_0\) and \(a_1\) are correlated, as we might expect.

Predictive regression plots

It is traditional to display a plot of the curve parametrized by the MLEs. Such a plot might look like this:

[54]:
a0, a1, sigma = mle_params

p = bokeh.plotting.figure(frame_height=200, frame_width=200)
p.circle(x, y)
p.line(x, a0 + a1 * x, color='tomato')

bokeh.io.show(p)

However, this kind of plot does not display what might be predicted from the generative model, which has variability in measured beyond the (in this case linear) theoretical curve. It is more instructive to make a predictive regression plot. This is generated by drawing samples out of the generative model, and then plotting percentiles of the samples. This is accomplished with the bebi103.viz.predictive_regression() function.

It expects the data to come as a two-column Numpy array, so we need to create one.

[55]:
xy = df[['x', 'y']].values

We also need to specify for which x-values we want the samples. We can choose arbitrary x-values, so we will choose 200 point spanning the range of the data set in order to get smooth curves.

[56]:
samples_x = np.linspace(-3, 10, 200)

Finally, we can draw sample (we’ll do 10,000 of them), and make the plot.

[57]:
samples = np.array(
    [gen_fun(mle_params, samples_x, size=1, rg=rg) for _ in range(5000)]
)

bokeh.io.show(
    bebi103.viz.predictive_regression(
        samples, samples_x, data=xy, x_range=[-3, 10],
    )
)

By default, the middle 68% and 95% of samples are shown in the shaded blue regions, and the dark blue line is the median of the samples.

If we would like to compare the difference between compared to the median, we can use the diff=True kwarg. However, for this kind of plot, the x-values for which the samples were generated have to match those of the data. This is a necessary because the x-values can in general be floats and it is not entirely ambiguous which samples correspond to which x-values otherwise.

[58]:
samples = np.array(
    [gen_fun(mle_params, x, size=1, rg=rg) for _ in range(5000)]
)

bokeh.io.show(
    bebi103.viz.predictive_regression(
        samples, x, data=xy, x_range=[x.min(), x.max()], diff=True
    )
)