MCMC for the probit model

Suppose our dataset is made of binary observations \(Y_1, \ldots, Y_n\). For instance, \(Y_i\) is \(1\) if student \(i\) has passed the exam and \(0\) otherwise. Assume we also know \(p\) covariates about each student, such as the time spent studying, the number of classes attended, the ability to cheat without getting caught, etc. We call these covariates “explanatory variables” and store them in a matrix \(X\) of size \(n\times p\). The probit model says that for each \(i=1, \ldots, n\) \[Y_i = \begin{cases} &1 \text{ with probability } \Phi(X_i^T \beta) \\ & 0 \text{ with probability } 1 - \Phi(X_i^T \beta) \end{cases}\] where \(X_i\) is the \(i\)-th row of \(X\), \(\Phi\) is the distribution function of a standard Normal distribution, and \(\beta \in \mathbb{R}^p\) is the parameter to infer. Inferring \(\beta\) allows us to learn and quantify the effect of each covariate on the observation. We first generate a synthetic dataset from the model, given a fixed parameter value.

beta <- matrix(c(2, -1), ncol = 1)  # we store the arbitrary parameter in a 2x1 matrix
n <- 50  # sample size
X <- cbind(rnorm(n, mean = 1), rnorm(n, mean = 1.5))  # matrix of explanatory variables
Y <- as.numeric(rnorm(n, mean = X %*% beta, sd = 1) >= 0)  # observations
table(Y)
## Y
##  0  1 
## 19 31
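
The code above simulates from the latent-variable representation of the probit model: it draws \(Z_i \sim \mathcal{N}(X_i^T \beta, 1)\) and sets \(Y_i = 1\) if \(Z_i \geq 0\), which happens with probability \(\Phi(X_i^T \beta)\). As a quick sanity check (a sketch; the exact numbers depend on the simulated data), we can compare the empirical proportion of ones with the average success probability under the model.

mean(Y)  # empirical proportion of ones
mean(pnorm(X %*% beta))  # average of Phi(X_i^T beta) over the sample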

Let us put a normal prior on \(\beta\), with mean \(0\) and variance \(B\), a \(p\times p\) matrix: \[\pi(\beta) = \mathcal{N}\left( 0, B \right)\]

B <- matrix(c(3, 0, 0, 3), ncol = 2)  # prior variance
library(mvtnorm)  # mvtnorm to deal with multivariate normal densities
log_prior <- function(beta) {
    return(dmvnorm(x = t(beta), sigma = B, log = TRUE))  # note the transposition using t()
}

We can compute the likelihood pointwise, for any vector \(\beta\). Indeed \[\mathbb{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid \beta) = \prod_{i=1}^n \Phi(X_i^T \beta)^{y_i} (1-\Phi(X_i^T \beta))^{1-y_i}\] thus, multiplying the likelihood by the prior, we can evaluate the posterior density pointwise, up to a normalizing constant.

full_log_posterior <- function(beta) {
    # unnormalized log posterior density: log likelihood + log prior
    return(sum(pnorm(q = X[Y == 1, ] %*% beta, mean = 0, sd = 1, log.p = TRUE)) + 
        sum(pnorm(q = -X[Y == 0, ] %*% beta, mean = 0, sd = 1, log.p = TRUE)) + 
        log_prior(beta))
}
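
As a quick check (a sketch; the exact values depend on the simulated data), we can evaluate this unnormalized log posterior at the data-generating parameter and at the prior mean; the former should typically be larger.

full_log_posterior(matrix(c(2, -1), ncol = 1))  # at the data-generating value
full_log_posterior(matrix(c(0, 0), ncol = 1))  # at the prior mean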

Here the posterior distribution is only \(2\)-dimensional, so we can represent the posterior density directly, for instance as follows. Light blue areas correspond to higher density values, dark blue areas to lower density values.

library(ggplot2)  # ggplot2 for plotting
grid.df <- expand.grid(seq(from = 0.5, to = 3.5, length.out = 100), seq(from = -2, 
    to = 0, length.out = 100))
names(grid.df) <- c("beta1", "beta2")
grid.df$density <- mapply(FUN = function(x, y) exp(full_log_posterior(matrix(c(x, 
    y), ncol = 1))), grid.df$beta1, grid.df$beta2)
g2d <- ggplot(grid.df) + geom_raster(aes(x = beta1, y = beta2, fill = density))
g2d <- g2d + xlab(expression(beta[1])) + ylab(expression(beta[2]))
g2d <- g2d + theme(legend.position = "none")
print(g2d)

Note that here the value of \(\beta\) used to generate the data was \((2,-1)\). We can see on the above plot that it lies in a region of high posterior density. To perform inference, we first run a Metropolis-Hastings algorithm with a Gaussian random walk proposal.

niterations <- 10000  # number of Metropolis-Hastings steps
init_beta <- rmvnorm(n = 1, mean = c(0, 0), sigma = B)  # start randomly using the prior
chain <- matrix(0, ncol = 2, nrow = niterations)  # store all the chain
chain[1, ] <- init_beta

# Pre-generate the uniform variables used during the acceptance step.
uniforms <- runif(niterations)

# Keep track of the number of accepted moves for monitoring.
n_accepts <- 0

# Keep track of the current log posterior value to avoid recomputing it.
current_log_posterior <- full_log_posterior(chain[1, ])

for (t in 2:niterations) {
    proposal <- chain[t - 1, ] + rnorm(2, mean = 0, sd = 1)  # random walk proposal distribution
    proposal_log_posterior <- full_log_posterior(proposal)  # log target density of proposal
    ratios <- proposal_log_posterior - current_log_posterior  # log acceptance ratio
    accept <- (log(uniforms[t]) < ratios)  # accept or not?
    n_accepts <- n_accepts + accept
    if (accept) {
        chain[t, ] <- proposal
        current_log_posterior <- proposal_log_posterior  # and update the current log target pdf value
    } else {
        chain[t, ] <- chain[t - 1, ]
    }
}
cat("acceptance rate of", round(100 * n_accepts/niterations, 2), "%\n")
## acceptance rate of 10.95 %
MHchain.df <- data.frame(chain)
MHchain.df$step <- 1:niterations
gMH <- ggplot(subset(MHchain.df, step < 1000), aes(x = X1, y = X2)) + geom_path() + 
    geom_point()
gMH <- gMH + xlab(expression(beta[1])) + ylab(expression(beta[2]))
print(gMH)
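
Since successive states of the chain are correlated, the 10,000 iterations carry less information than 10,000 independent draws. One way to quantify this (a sketch, assuming the coda package is installed) is to compute the effective sample size of each component.

library(coda)  # provides MCMC diagnostics
effectiveSize(mcmc(chain))  # effective number of independent draws, per component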

We can see on the above plot that the chain quickly moves towards the region of high posterior density. The chain can also be used to construct a kernel density estimate of the posterior distribution, which gives the following plot.

gMH2d <- ggplot(MHchain.df, aes(x = X1, y = X2)) + geom_density2d()
gMH2d <- gMH2d + xlab(expression(beta[1])) + ylab(expression(beta[2]))
print(gMH2d)
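
Finally, the chain can be used to approximate posterior summaries of \(\beta\). A minimal sketch, where discarding the first 1,000 iterations as burn-in is an arbitrary choice:

burnin <- 1000  # arbitrary number of initial iterations to discard
post_sample <- chain[(burnin + 1):niterations, ]
colMeans(post_sample)  # posterior means, to compare with the data-generating value (2, -1)
apply(post_sample, 2, quantile, probs = c(0.025, 0.975))  # 95% posterior credible intervals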