A Short Primer on Bayesian Statistics

Overview

This document provides a brief introduction into Bayesian analysis and thinking. Some prerequisite knowledge in standard Frequentist statistics and modeling is helpful to understand the material. Nested within this tutorial are R code blocks that implement the new concepts.

Outline

Bayes’ Rule and Bayesian thinking
Implementing Bayesian regression models in practice
Workshop: A demonstration of these ideas via a re-analysis of the ART trial
References and additional resources for learning Bayesian statistics

Preliminary Setup

For the workshop component, we will use R locally. To be ready for this, you should have the brms package installed (and also the insight package). Feel free to try to install during the break. brms is a somewhat troublesome package at install - if you have issues you can troubleshoot here or just feel free to just watch my shared screen at that time.

Bayesian Thinking vs. Frequentist Thinking

Before describing Bayesian thinking and philosophy, it is helpful to review the core tenets of Frequentist statistics. Frequentist probability is defined as the limit of an event’s relative frequency in a large number of repetitions. For example, if the probability of a power outage during this presentation is 0.01%, then if we were to repeat this presentation an infinite number of times, we would expect a power outage about 1 in every 10,000 presentations on average.

Statistical inference begins with the description of some unknown parameter of interest, which we’ll denote here as \(\theta\). In Frequentist inference, often one will perform an analysis involving

Point estimation of \(\theta\)
Confidence interval calculation
Hypothesis testing

Point estimation entails calculating a “best guess” for \(\theta\) according to some chosen definition of best guess. Then a confidence interval gives a range of estimates for \(\theta\) where the confidence level (say 95%) can be interpreted as the long-run probability of a calculated interval containing the true parameter value.

Finally, one might specify a set of null hypothesis test and then calculate whether there us enough evidence to reject such a hypothesis using a p-value. For example, one might specify a null hypothesis that the mean number of cars owned by Americans is 0.5. Then one specifies a test statistic which is a function of a relevant data sample, taking into account sample mean, size, and variability. Then the calculated p-value represents the probability that, if the null is true, one would sample a test statistic as extreme as that calculated for the data. Even with this somewhat hand-wavy wording, it takes a lot of build-up to describe a p-value!

Now let’s contrast these ideas with the Bayesian paradigm. The Bayesian definition of a probability is that it is a quantification of belief that something will occur. Bayesian inference is undergone by combining sample data with any known prior information to learn about \(\theta\). Point estimation is also common in Bayesian inference. Instead of confidence intervals, one would typically calculate a \(C\)% credible interval, for which the probability that \(\theta\) lies within the interval is \(C\)%. One does not typically perform hypothesis testing in the Bayesian framework, and instead uses the posterior output (calculating using the observed data and prior information) to answer questions of interest about \(\theta\) directly. So in the example above, one might just directly calculate the probability that the mean number of cars owned by Americans is greater than 0.5 given observed data and any prior known information.

In general, Frequentist methods will try to learn about the probability of the observed data (denoted throughout at \(y\)) given some parameter value, while Bayesian methods will try to learn about the probability of a parameter value given the observed data.

Many of these ideas are somewhat vague until we do concrete examples. Let’s define some basic tools for Bayesian inference and then provide some examples.

Bayes’ Rule

Bayes’ Rule provides a means to calculate a posterior distribution for the parameter of interest - a distribution that updates our prior beliefs using the observed data.

Before discussing Bayes’ rule, it is helpful to review some conditional probability rules. Let \(A\) and \(B\) be events. Let \(P(A)\) be the probability that event \(A\) occurs and let \(P(A|B)\) be the probability that event \(A\) occurs, given that event \(B\) has occurred. Then \[\begin{equation*} P(A | B) = \frac{P(A, B)}{P(B)} \hspace{2cm} P(B | A) = \frac{P(A, B)}{P(A)} \end{equation*}\]

If the joint probability \(P(A, B)\) is not known, you can use one of these quantities to calculate the other. For example, \(P(B|A)P(A) = P(A, B)\), and thus, \[\begin{equation*} P(A | B) = \frac{P(B | A)P(A)}{P(B)} \end{equation*}\]

The above equation is also known as Bayes’ Rule or Bayes Theorem. Going back to the setting of learning about population parameters using data, we can state the rule as \[\begin{equation*} p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)} \end{equation*}\] where \(p()\) is shorthand for either probability or density functions for simplicity.

The term \(p(\theta)\) reflects prior knowledge about the parameter of interest. The term \(p(y|\theta)\) can be modeled using standard approaches, and the term \(p(y)\) is often not needed, for reasons that will be described later. Let’s ground this idea with some examples.

Example 1

John just took a Covid rapid test and it came back negative. The test instructions say that the probability of a negative test given that you truly don’t have Covid is 99.8%. They also say that the probability of a positive test given that you do have Covid is 45%. But John wants to know the probability that he doesn’t have Covid given this new result, which they didn’t provide. Let’s calculate this quantity using a simple application of Bayes’ Theorem.

\[\begin{align*} P(\text{Not Covid | Negative}) &= \frac{P(\text{Negative | Not Covid})P(\text{Not Covid})}{P(\text{Negative})} \\ &= \frac{P(\text{Negative | Not Covid})P(\text{Not Covid})}{P(\text{Negative, Not Covid}) + P(\text{Negative, Covid})} \\ &= \frac{P(\text{Negative | Not Covid})P(\text{Not Covid})}{P(\text{Negative | Not Covid})P(\text{Not Covid}) + P(\text{Negative | Covid})P(\text{Covid})} \\ &= \frac{0.998P(\text{Not Covid})}{0.998P(\text{Not Covid}) + 0.55P(\text{Covid})} \end{align*}\]

Thus the solution is a function of a prior probability that you had Covid. In disease contexts, this prior is often replaced by the prelavence in a particular area, for example the current prevalence of Covid in Philadelphia. Let’s plot this function in R:

prev_list <- seq(0, 1, 0.01)
val_list <- (0.998*(1-prev_list)) / ( (0.998*(1-prev_list)) + 0.55*prev_list )
plot(prev_list, val_list, type = "l", xlab = "Prevalence", ylab = "Probability of no Covid given negative rapid test")
abline(v = 0.005)

John heard that the current prevalence in Philly is about 0.5%, so he’s pretty confident that he does not have Covid.

Example 2

Suppose a new Covid wave is starting and Mary is interested in the prevalence of infection in the city of Philadelphia, denoted \(\theta\), and she plans to work remotely if the prevalence is greater than 4%. She collects a sample of PCR tests from 50 random individuals from the city and 4 of them are positive.

There are a couple of ways to do a Frequentist analysis for the parameter of interest. An unbiased estimate of the prevalence is given by the sample mean \(\hat{\theta} = 0.08\) and a standard approach for calculating a confidence interval (without small sample correction) yields an interval of (0.005, 0.155). If Mary wants to use one-sided hypothesis testing with \(\alpha = 0.05\) to make a decision of whether the prevalence is greater than 4%, then she gets \(p=0.14\) and fails to reject the null hypothesis (here using an exact test because of the small sample).

binom.test(4, 50, p = 0.04, alternative = "greater")

## 
##  Exact binomial test
## 
## data:  4 and 50
## number of successes = 4, number of trials = 50, p-value = 0.1391
## alternative hypothesis: true probability of success is greater than 0.04
## 95 percent confidence interval:
##  0.02778767 1.00000000
## sample estimates:
## probability of success 
##                   0.08

Bayesian thinking provides two critical benefits in this type of analysis:

While hypothesis testing is useful in many scenarios, it can often be overly stringent (or arbitrary with respect to the choice of \(\alpha\)) and you may want to make decisions with a more holistic process.
Bayesian analysis gives us the opportunity to leverage prior information.

Let’s start with just the second point. Suppose that before Mary runs her analysis, a colleague tells her that the estimated prevalence in New York City, Baltimore, and Pittsburgh are 0.12, 0.06, and 0.03 respectively. The first step to Bayesian analysis is to encode this information into a prior distribution, \(p(\theta)\).

We want to use a distribution family that takes values from 0 to 1, the support of our parameter of interest. A logical choice is the Beta distribution. Using the information on nearby cities, it seems probable but not guaranteed that the prevalence in Philly is between 0.03 and 0.12. So we should pick a Beta prior with parameters that put most, but not all, of the distribution within that range.

Let’s find such a distribution using R. One way to do this is by changing the shape1 and shape2 parameters until you find a distribution that has the desired properties.

samp_beta <- rbeta(10000, shape1 = 1, shape2 = 1)
hist(samp_beta)

samp_beta <- rbeta(10000, shape1 = 2, shape2 = 5)
hist(samp_beta)

samp_beta <- rbeta(10000, shape1 = 5, shape2 = 2)
hist(samp_beta)

samp_beta <- rbeta(10000, shape1 = 1, shape2 = 10)
hist(samp_beta)

Rather than play with parameters in an ad hoc way, it’s usually better to use known properties of the distribution. So for example, the average of prior known prevalences is 0.07, so it would be nice if the mean of our prior distribution was approximately 0.07. The mean of a Beta variable is the first parameter divided by the sum of the parameters.

One possible prior choice is a \(\theta \sim \text{Beta}(\alpha = 2.5, \beta = 30)\) distribution. Let’s look at some properties of this choice:

samp_beta <- rbeta(10000, shape1 = 2.5, shape2 = 30)

print("Prior mean")

## [1] "Prior mean"

mean(samp_beta)

## [1] 0.0778114

print("Prior median")

## [1] "Prior median"

median(samp_beta)

## [1] 0.06932789

print("Probability between 0.03 and 0.12")

## [1] "Probability between 0.03 and 0.12"

mean(0.03 < samp_beta & samp_beta < 0.12)

## [1] 0.6966

print("Probability less than 0.03")

## [1] "Probability less than 0.03"

mean(samp_beta < 0.03)

## [1] 0.1324

print("Probability greater than 0.12")

## [1] "Probability greater than 0.12"

mean(samp_beta > 0.12)

## [1] 0.171

hist(samp_beta)

So this seems to be a reasonable choice. Step 1 is complete!

For step 2, we need to assume a distribution for our observed data, \(p(y | \theta)\). Given that our data are binary test results, a reasonable choice is a \(\text{Binomial}(n=50, \theta)\) distribution.

Now for Step 3, we need to find the posterior distribution that combines our observed data with the prior information by using Bayes’ rule. First let’s write out the mass/density functions for our data and prior distributions. \[\begin{equation*} p(y | \theta) = \binom{n}{y}\theta^{y}(1 - \theta)^{n-y} \hspace{2cm} p(\theta) = c(\alpha, \beta) \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \end{equation*}\]

Throughout this document we will use a generic function \(c()\) to denote terms which are constant with respect to any variables or parameters inside the parentheses. Then we plug into Bayes’ Theorem:

\[\begin{align*} p(\theta | y) &= \frac{p(y | \theta)p(\theta)}{p(y)} \\ &= c(y, n, \alpha, \beta) \theta^{y + \alpha - 1} (1 - \theta)^{n - y + \beta - 1} \\ &\sim \text{Beta}(y + \alpha, n - y + \beta) \end{align*}\]

Let’s break this result down. First, we combined the constant terms together to focus on just the \(\theta\) term. Then the remaining \(\theta\) terms had the underlying form of a \(\text{Beta}\) distribution. The posterior must be a valid pdf that integrates to 1, so we don’t need to worry about the constant term, which must be equal to the corresponding constant for the derived \(\text{Beta}\) distribution. As a side note, when the posterior is from the same distribution class as the prior, we say that the Beta prior is conjugate for the Binomial sampling model.

We finally have our posterior distribution! Now we can perform inference by sampling from this distribution in a similar way to how we looked at the prior.

shape1post <- 2.5 + 4
shape2post <- 30 + (50 - 4)
samp_beta <- rbeta(10000, shape1 = shape1post, shape2 = shape2post)

print("Posterior mean")

## [1] "Posterior mean"

mean(samp_beta)

## [1] 0.07836576

print("Posterior median")

## [1] "Posterior median"

median(samp_beta)

## [1] 0.0747128

print("P(theta > 0.04)")

## [1] "P(theta > 0.04)"

mean(samp_beta > 0.04)

## [1] 0.9286

print("95% Credible Interval")

## [1] "95% Credible Interval"

quantile(samp_beta, c(0.025, 0.975))

##       2.5%      97.5% 
## 0.03112176 0.14473397

hist(samp_beta)
abline(v = 0.04)

When the posterior has a nice form like the Beta distribution here, you can also just use the properties of that distribution. For example, since the mean of a Beta distribution is \(\alpha / (\alpha + \beta)\), the posterior mean should be \(6.5 / 82.5 = 0.079\).

As sample size increases…

As sample size increases, the observed data will start to dominate the prior. Suppose that the next day, we get access to more test data in Philly - we now have 500 samples with 40 positive. Then we can look at the new posterior

shape1post <- 2.5 + 40
shape2post <- 30 + (500 - 40)
samp_beta <- rbeta(10000, shape1 = shape1post, shape2 = shape2post)

print("Posterior mean")

## [1] "Posterior mean"

mean(samp_beta)

## [1] 0.07985561

print("Posterior median")

## [1] "Posterior median"

median(samp_beta)

## [1] 0.0793015

print("P(theta > 0.04)")

## [1] "P(theta > 0.04)"

mean(samp_beta > 0.04)

## [1] 1

print("95% Credible Interval")

## [1] "95% Credible Interval"

quantile(samp_beta, c(0.025, 0.975))

##       2.5%      97.5% 
## 0.05853447 0.10418477

hist(samp_beta)
abline(v = 0.04)

Example 3

John and Mary are on a Zoom call and John tells her about his negative test and calculations. Mary then tells John about her analysis, and John is worried that he assumed prevalence in Philly was too low.

Thankfully, Mary’s analysis can be used directly to update John’s analysis! In particular, for a posterior sample for \(\theta\) notated \(\theta^1 , ..., \theta^S\) and some function \(g(\theta)\), the distribution of \(g(\theta^1) ,..., g(\theta^S)\) will converge to the distribution of \(g(\theta)^1 ,..., g(\theta)^S\) as \(S \rightarrow \infty\). So here for example, John’s final output from Bayes’ Rule was a function of the same \(\theta\) Mary was interested in in Example 2: \[\begin{equation*} \frac{0.998(1 - \theta)}{0.998(1 - \theta) + 0.55 \cdot \theta} \end{equation*}\]

Thus, we can draw John’s parameter of interest (his probability of not having Covid given his negative test result) using the posterior samples we drew earlier:

(oldnocovid <- 0.998*0.995 / (0.998*0.995 + 0.55*.005))

## [1] 0.9972383

nocovid <- 0.998*(1-samp_beta) / ( 0.998*(1-samp_beta) + 0.55*samp_beta )
quantile(nocovid, c(0.025, 0.975))

##      2.5%     97.5% 
## 0.9397665 0.9668710

hist(nocovid)

Under his previous assumption of a 0.5% prevalence, John had a 99.7% chance of not having Covid. But the new results are less certain. John decides to get a PCR test just to be sure (side note: rapid tests are very useful - this is just an example!). The same principle of considering functions of parameters can be used to change the target of analysis, for example from a probability to a log odds measure.

Interlude to the more applied tutorial

In most applied Bayesian analyses you won’t be directly deriving a posterior distribution, but it is important to understand the theory underlying this approach. In the next section, we will transition to fitting Bayesian regression models in R, which is more practical for most problems.

Let’s take a short break before the next section. But first, any questions?

Bayesian Modeling in R

Most applied Bayesian analyses fall under the regression framework. In this section we will describe such models and present tools to fit them in R. The course readings for this week fall under two basic frameworks, (i) using data-driven priors and (ii) using many priors of varying beliefs. We will use each of those here.

This tutorial will use the brms package in R (Bayesian regression modeling using Stan). Stan is a programming language used on the backend to fit the models.

# Load brms library
library(brms)

## Loading required package: Rcpp

## Loading 'brms' package (version 2.18.0). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').

## 
## Attaching package: 'brms'

## The following object is masked from 'package:stats':
## 
##     ar

Let’s simulate some data to use to help illustrate these concepts.

set.seed(41822)
n <- 300
A <- rbinom(n, 1, 0.5)
Yprob <- exp(-0.6 + 0.4*A) / (1 + exp(-0.6 + 0.4*A))
Y <- rbinom(n, 1, Yprob)
faketrial <- data.frame(A, Y)

A Frequentist analysis could just be a logistic regression model for the outcome, or direct calculation of an estimate of interest like a risk difference or odds ratio. For example a logistic regression can be written as \[\begin{equation*} \text{logit}\{ P(Y = 1) \} = \beta_{0} + \beta_{1}A \end{equation*}\]

freqmod <- glm(Y ~ A, family = binomial(link = "logit"))
summary(freqmod)

## 
## Call:
## glm(formula = Y ~ A, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0989  -1.0989  -0.9657   1.2580   1.4050  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.5208     0.1630  -3.195   0.0014 **
## A             0.3332     0.2358   1.413   0.1577   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 406.12  on 299  degrees of freedom
## Residual deviance: 404.11  on 298  degrees of freedom
## AIC: 408.11
## 
## Number of Fisher Scoring iterations: 4

Bayesian GLMs work similarly, but specify a prior for parameters of interest. For example, suppose there were a few previous clinical trials reporting treatment effect estimates. It may be useful to include this prior information in analysis. Unlike the previous page of this tutorial, we will not be deriving the posterior distribution directly, and will instead let brms do the work for us.

Let’s fit two Bayesian models, one assuming a flat prior (one that essentially gives no information) and one assuming a prior for the treatment effect that is Normally distributed with mean 0.5 and standard deviation of 0.3 (such that about 95% of the density lies between -0.1 and 1.1). First the flat prior:

betaprior <- prior(normal(0, 10), class = "b")

# Fit a model with a flat prior
bmod <- brm(Y ~ A, family = bernoulli, data = faketrial,
            prior = betaprior, silent = TRUE, refresh = 0)

Now let’s output a summary of bmod and compare the confidence interval for the treatment effect to that from the Frequentist model.

summary(bmod)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: Y ~ A 
##    Data: faketrial (Number of observations: 300) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.52      0.16    -0.84    -0.21 1.00     2977     2584
## A             0.32      0.24    -0.15     0.80 1.00     3253     2267
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

confint(freqmod)

## Waiting for profiling to be done...

##                  2.5 %     97.5 %
## (Intercept) -0.8452884 -0.2049619
## A           -0.1283716  0.7970535

Now let’s do the more “optimistic” prior:

beta1prior <- prior(normal(0.5, 0.3), class = "b", coef = A)

# Fit a model with an optimistic prior
bmod2 <- brm(Y ~ A, family = bernoulli, data = faketrial,
            prior = beta1prior, silent = TRUE, refresh = 0)
summary(bmod2)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: Y ~ A 
##    Data: faketrial (Number of observations: 300) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.55      0.15    -0.85    -0.26 1.00     3359     2430
## A             0.40      0.19     0.03     0.76 1.00     3043     2689
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Note that the confidence interval for the treatment parameter has been shifted toward the specified prior.

As on the last page, the advantage of doing a Bayesian analysis is not only incorporating prior information, but also performing inference using the posterior draws rather than just p-values and confidence intervals. Let’s learn how to get those draws from the insight R package.

library(insight)
posteriors <- insight::get_parameters(bmod2, iterations = 10^4)
head(posteriors)

##   b_Intercept       b_A
## 1  -0.4367512 0.3141399
## 2  -0.5731723 0.4796035
## 3  -0.5344722 0.3763882
## 4  -0.3310785 0.1092287
## 5  -0.4182325 0.3805925
## 6  -0.5489927 0.3641012

Then we can consider posterior summaries like in the previous examples:

print("Posterior mean")

## [1] "Posterior mean"

mean(posteriors$b_A)

## [1] 0.3964879

print("Probability of harm")

## [1] "Probability of harm"

sum(posteriors$b_A < 0) / nrow(posteriors)

## [1] 0.016

print("Posterior distribution")

## [1] "Posterior distribution"

hist(posteriors$b_A)

Presenting a holistic analysis with multiple priors

In many analyses, the prior information available is limited or even contradictory. Often different experts will have different opinions as to how likely a treatment is to be effective. Presenting Bayesian models with a series of priors of varying beliefs can not only accommodate prior beliefs of many different experts, but also help provide a sense of how strong the observed trial results were. For example, a trial result that shows high probability of benefit, even when using a very pessimistic prior, can be very convincing.

Define a neutral prior as one whose distribution is centered at a null value. Define an optimistic prior as one whose distribution is centered at a beneficial value, and likewise a pessimistic prior at a harmful value.

For example, consider the following set of prior densities on a log odds ratio scale used for the Zampieri paper application.

par(mfrow = c(3, 3))
plot(density(rnorm(10000, 0, 0.205)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Strong neutral")
plot(density(rnorm(10000, 0, 0.355)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Moderate neutral")
plot(density(rnorm(10000, 0, 5)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Weak neutral")
plot(density(rnorm(10000, log(0.66), 0.25)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Strong optimistic")
plot(density(rnorm(10000, log(0.66), 0.40)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Moderate optimistic")
plot(density(rnorm(10000, log(0.66), 0.77)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Weak optimistic")
plot(density(rnorm(10000, -log(0.66), 0.25)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Strong pessimistic")
plot(density(rnorm(10000, -log(0.66), 0.40)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Moderate pessimistic")
plot(density(rnorm(10000, -log(0.66), 0.77)), xlim = c(-2, 2), ylim = c(0, 2),
     main = "Weak pessimistic")

Now let’s generate a new fake trial to illustrate this approach.

n <- 800
A <- rbinom(n, 1, 0.5)
Yprob <- exp(-0.6 - 0.4*A) / (1 + exp(-0.6 - 0.4*A))
Y <- rbinom(n, 1, Yprob)
faketrial2 <- data.frame(A, Y)

And let’s define a series of 3 priors, one neutral, one optimistic, and one pessimistic, and compare the resulting posterior distributions. We’ll use the prior means shown above and the standard deviations corresponding to moderate strength. First neutral:

neutprior <- prior(normal(0, 0.355), class = "b", coef = A)

# Fit a model with an optimistic prior
bmod_neut <- brm(Y ~ A, family = bernoulli, data = faketrial2,
                 prior = neutprior, silent = TRUE, refresh = 0)

## Compiling Stan program...

## Start sampling

summary(bmod_neut)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: Y ~ A 
##    Data: faketrial2 (Number of observations: 800) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.79      0.10    -0.99    -0.60 1.00     4111     2589
## A            -0.24      0.14    -0.52     0.03 1.00     4183     3303
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

posteriors_neut <- insight::get_parameters(bmod_neut, iterations = 10^4)

print("Probability of harm")

## [1] "Probability of harm"

sum(posteriors_neut$b_A > 0) / nrow(posteriors_neut)

## [1] 0.0425

hist(posteriors_neut$b_A)

Then optimistic:

optprior <- prior(normal(log(0.66), 0.4), class = "b", coef = A)

# Fit a model with an optimistic prior
bmod_opt <- brm(Y ~ A, family = bernoulli, data = faketrial2,
                prior = optprior, silent = TRUE, refresh = 0)

## Compiling Stan program...

## Start sampling

summary(bmod_opt)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: Y ~ A 
##    Data: faketrial2 (Number of observations: 800) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.76      0.10    -0.96    -0.56 1.00     3226     2621
## A            -0.31      0.15    -0.61    -0.01 1.00     2910     2481
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

posteriors_opt <- insight::get_parameters(bmod_opt, iterations = 10^4)

print("Probability of harm")

## [1] "Probability of harm"

sum(posteriors_opt$b_A > 0) / nrow(posteriors_opt)

## [1] 0.02025

hist(posteriors_opt$b_A)

Then pessimistic:

pessprior <- prior(normal(-log(0.66), 0.4), class = "b", coef = A)

# Fit a model with an optimistic prior
bmod_pess <- brm(Y ~ A, family = bernoulli, data = faketrial2,
                 prior = pessprior, silent = TRUE, refresh = 0)

## Compiling Stan program...

## Start sampling

summary(bmod_pess)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: Y ~ A 
##    Data: faketrial2 (Number of observations: 800) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.81      0.11    -1.02    -0.61 1.00     3058     2515
## A            -0.20      0.15    -0.48     0.10 1.00     2985     2563
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

posteriors_pess <- insight::get_parameters(bmod_pess, iterations = 10^4)

print("Probability of harm")

## [1] "Probability of harm"

sum(posteriors_pess$b_A > 0) / nrow(posteriors_pess)

## [1] 0.092

hist(posteriors_pess$b_A)

We can see that there is evidence of benefit in this fake trial regardless of the prior choice, even in the face of pessimistic prior information.

Package options

There are many important things happening under the hood of the brm function that we don’t have time to go over today. For a brief discussion of this, consider the output of the following model run:

bmod <- brm(Y ~ A, family = bernoulli, data = faketrial,
            prior = betaprior, chains = 4, iter = 2000, warmup = 1000)

## Compiling Stan program...

## recompiling to avoid crashing R session

## Start sampling

## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.024687 seconds (Warm-up)
## Chain 1:                0.023403 seconds (Sampling)
## Chain 1:                0.04809 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.025043 seconds (Warm-up)
## Chain 2:                0.024564 seconds (Sampling)
## Chain 2:                0.049607 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 8e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.024557 seconds (Warm-up)
## Chain 3:                0.022898 seconds (Sampling)
## Chain 3:                0.047455 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 8e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.024943 seconds (Warm-up)
## Chain 4:                0.023079 seconds (Sampling)
## Chain 4:                0.048022 seconds (Total)
## Chain 4:

The model used here doesn’t result in a nice clean posterior distribution to be sampled from like the early examples we considered. Instead, the posterior samples are obtained with an algorithm known as a Markov Chain Monte Carlo method. This is a more advanced topic, but it’s related to the package options used above.

The algorithm might take a while to converge to the posterior distribution, so the first chunk of samples from the algorithm are thrown out. This is known as a “burn-in” period (see warmup option). Likewise, the algorithm might be sensitive to it’s starting values and first few iterations. So often the algorithm is run multiple times, sharing samples from each of the chains (as indicated by the chains option). Then each chain should be long enough to acquire a decent amount of samples after convergence (see iter option)

To illustrate this, let’s plot the posterior samples.

plot(bmod)

And then compare to that from a model with much less burn-in:

bmod <- brm(Y ~ A, family = bernoulli, data = faketrial,
            prior = betaprior, chains = 4, iter = 1000, warmup = 5)

## Compiling Stan program...

## recompiling to avoid crashing R session

## Start sampling

## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: WARNING: No variance estimation is
## Chain 1:          performed for num_warmup < 20
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration:   6 / 1000 [  0%]  (Sampling)
## Chain 1: Iteration: 105 / 1000 [ 10%]  (Sampling)
## Chain 1: Iteration: 205 / 1000 [ 20%]  (Sampling)
## Chain 1: Iteration: 305 / 1000 [ 30%]  (Sampling)
## Chain 1: Iteration: 405 / 1000 [ 40%]  (Sampling)
## Chain 1: Iteration: 505 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 605 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 705 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 805 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 905 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.000116 seconds (Warm-up)
## Chain 1:                0.016618 seconds (Sampling)
## Chain 1:                0.016734 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 9e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: WARNING: No variance estimation is
## Chain 2:          performed for num_warmup < 20
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration:   6 / 1000 [  0%]  (Sampling)
## Chain 2: Iteration: 105 / 1000 [ 10%]  (Sampling)
## Chain 2: Iteration: 205 / 1000 [ 20%]  (Sampling)
## Chain 2: Iteration: 305 / 1000 [ 30%]  (Sampling)
## Chain 2: Iteration: 405 / 1000 [ 40%]  (Sampling)
## Chain 2: Iteration: 505 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 605 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 705 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 805 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 905 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 8.7e-05 seconds (Warm-up)
## Chain 2:                0.017145 seconds (Sampling)
## Chain 2:                0.017232 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 8e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: WARNING: No variance estimation is
## Chain 3:          performed for num_warmup < 20
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration:   6 / 1000 [  0%]  (Sampling)
## Chain 3: Iteration: 105 / 1000 [ 10%]  (Sampling)
## Chain 3: Iteration: 205 / 1000 [ 20%]  (Sampling)
## Chain 3: Iteration: 305 / 1000 [ 30%]  (Sampling)
## Chain 3: Iteration: 405 / 1000 [ 40%]  (Sampling)
## Chain 3: Iteration: 505 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 605 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 705 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 805 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 905 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.000114 seconds (Warm-up)
## Chain 3:                0.020134 seconds (Sampling)
## Chain 3:                0.020248 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'fbf71df2382223324cb6fe6a2964884c' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 7e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: WARNING: No variance estimation is
## Chain 4:          performed for num_warmup < 20
## Chain 4: 
## Chain 4: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 4: Iteration:   6 / 1000 [  0%]  (Sampling)
## Chain 4: Iteration: 105 / 1000 [ 10%]  (Sampling)
## Chain 4: Iteration: 205 / 1000 [ 20%]  (Sampling)
## Chain 4: Iteration: 305 / 1000 [ 30%]  (Sampling)
## Chain 4: Iteration: 405 / 1000 [ 40%]  (Sampling)
## Chain 4: Iteration: 505 / 1000 [ 50%]  (Sampling)
## Chain 4: Iteration: 605 / 1000 [ 60%]  (Sampling)
## Chain 4: Iteration: 705 / 1000 [ 70%]  (Sampling)
## Chain 4: Iteration: 805 / 1000 [ 80%]  (Sampling)
## Chain 4: Iteration: 905 / 1000 [ 90%]  (Sampling)
## Chain 4: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 9.6e-05 seconds (Warm-up)
## Chain 4:                0.01525 seconds (Sampling)
## Chain 4:                0.015346 seconds (Total)
## Chain 4:

## Warning: There were 562 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: The largest R-hat is 1.51, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

plot(bmod)

Demonstration

Part 1

In the first demonstration, we will use the techniques we learned and practiced on fake data in the previous section to perform a Bayesian re-analysis of the ART trial (Alveolar Recruitment for Acute Respiratory Distress Syndrome Trial) [Zampieri et al. 2021]. This trial randomized about 1000 participants in mechanical ventilation and diagnosed with ARDS to two treatment regimens: a conventional strategy (ARDSNet) vs. the experimental ART strategy (maximum alveolar recruitment plus PEEP titration). One of the readings for this week presented a Bayesian re-analysis of this trial, showing that although the initial trial findings were not statistically significant, there is more evidence for harm than benefit of ART.

Open R or Rstudio and load in the ART trial data. Then we’ll walk through the following steps:

Explore the data
Replicate the flat prior analysis from the Zampieri paper
Conduct an analysis under a new prior distribution

Feel free to look at the example code in this document as we work through this exercise.

Part 2

Then we’ll go over some code for part of the analysis of another recent paper [Harhay et al. 2022], which reanalyzed the THAPCA trial. In another tab of Rstudio, load in the trial data and we’ll walk through the code file ebp-neuro-posterior-density-figure.R.

Conclusion and Additional Resources

Bayesian modeling is a useful technique to add to your analysis toolbox. It allows for incorporation of prior known information, as well as allowing for direct calculation of the probability of a hypothesis. However, Bayesian methodology is not without limitation. Many areas of Bayesian methods have not developed the theoretical guarantees (such as good statistical properties as sample size increases) that Frequentist methods have developed. Furthermore, Bayesian analyses can be computationally burdensome in practice. This greatly hampered the use of Bayesian models before the advent of computers, but has become less of an obstacle over time. In general, Frequentist and Bayesian analyses for the same or similar questions will typically reach the same conclusion.

The examples in the first section of this document were partly inspired by the book A First Course in Bayesian Statistics by Hoff. This is a great resource if you want to learn more!

References:

Hoff, P.D., 2009. A first course in Bayesian statistical methods (Vol. 580). New York: Springer.
Zampieri, F.G., Casey, J.D., Shankar-Hari, M., Harrell Jr, F.E. and Harhay, M.O., 2021. Using Bayesian methods to augment the interpretation of critical care trials. An overview of theory and example reanalysis of the alveolar recruitment for acute respiratory distress syndrome trial. American Journal of Respiratory and Critical Care Medicine, 203(5), pp.543-552.
Harhay, M.O., Blette, B.S., Granholm, A., Moler, F.W., Zampieri, F.G., Goligher, E.C., Gardner, M.M., Topjian, A.A. and Yehya, N., 2022. A Bayesian interpretation of a pediatric cardiac arrest trial (THAPCA-OH). NEJM Evidence, 2(1), p.EVIDoa2200196.

Additional resources:

Goligher, E.C., Tomlinson, G., Hajage, D., Wijeysundera, D.N., Fan, E., Jüni, P., Brodie, D., Slutsky, A.S. and Combes, A., 2018. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome and posterior probability of mortality benefit in a post hoc Bayesian analysis of a randomized clinical trial. Jama, 320(21), pp.2251-2259.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B., 1995. Bayesian data analysis. Chapman and Hall/CRC.
McElreath, R., 2020. Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC.