(Note: As of this writing, one of my Zipfian co-fellows, Dan Saber, has published an excellent review of the topic here implemented in python. I will try not to re-hash most of what he said, just what is necessary to describe the process mathematically from my understanding.)
Let's say we're given some data points, \(y_i\) and \(x_{ij}\). Here, we are using Einstein tensor notation, which basically means a single index denotes a vector and a double index denotes a matrix. Specifically here, we will use \(i\) to denote the data point index, where \(i \in \lbrace 1,2,3,...,N\rbrace\), and \(j\) to denote the dimension index, where \(j \in \lbrace 0,1,2,3,...,D\rbrace\). Note that \(j\) is zero-indexed to allow for an intercept term. Our linear regression model assumes the form:
$$y_i = x_{ij} \beta_j + \epsilon_i$$
Here, \(\beta_j\) is the vector of weighting coefficients, and \(\epsilon_i\) denotes the error between the model and the data. (Note: in Einstein notation, a repeated index implies a summation over all possible values of the index. In this example, after the multiplication between \(x_{ij}\) and \(\beta_j\), sum over all \(j\)'s; this is matrix multiplication. Here, the \(j\)'s are "summed away" and every term is a vector with index \(i\).)
We assume that \(\epsilon_i\) is normally distributed with mean \(0\) and variance \(\sigma_i^2\), and that each data point's error is independent of the rest. For simplicity, we'll refer to the "precision" \(a_i = \frac{1}{\sigma_i^2}\) rather than the variance. If the error is normally distributed about \(0\), then the observed data is normally distributed around the model as:
$$p(y_i|x_{ij},\beta_j,a_i) = \sqrt{\frac{a_i}{2 \pi}}\exp \left[ \frac{a_i}{2} (y_i - x_{ij}\beta_j)^2 \right]$$
We want to find the parameters \(\beta_j\) and \(a_i\). To do this, we will explore the parameter space and find the most probable parameters given the data. In other words, \(p(\beta_j, a_i | y_i, x_i)\). According to Bayes' Theorem, this is:
$$p(\beta_j, a_i | y_i, x_{ij}) \propto p(y_i|x_{ij},\beta_j, a_i)p(\beta_j)p(a_i)$$
The left hand side is the posterior probability. On the right, \(p(\beta_j)\) and \(a_i\) are the priors, and \(p(y_i|x_{ij},\beta_j,a_i)\) is the likelihood. The prior can be any informed or uninformed distribution of \(\beta_j\) and \(a_i\)
With that framework set up, here comes the Markov Chain Monte Carlo (MCMC) part. In MCMC, we propose a stochastic jump, evaluate the proposal, and determine to accept or reject the proposal. (Note: let's introduce another notation here; parenthetical superscripts denote iteration number.) The way this works in molecular simulation is:
- Given a box of atoms with known pairwise interactions, evaluate the energy of the system by summing up those interactions: \(E^{(t)}\)
- Pick an atom and propose a random jump to a different location in the box.
- Evaluate the new energy of the system: \(E^{(t+1)}\)
- If \(E^{(t+1)} \lt E^{(t)}\), accept the proposal. We are now in state \(t+1\) with energy \(E^{(t+1)}\).
- Otherwise, accept the proposal with some fractional probability, depending on the ratio of energies. If we reject, we stay in state \(t\) with energy \(E^{(t)}\).
- Repeat until convergence to find the minimum energy of the system
We have to set up a multivariate distribution to sample parameters from. According to the Metropolis-Hastings algorithm of MCMC, every \(\beta_j^{(t+1)}\) is chosen from a proposal distribution given the previous parameter values: \(Q(\beta_j^{(t+1)} | \beta_j^{(t)})\); similarly with \(a_i\). To keep things relatively simple, we sample from a normal distribution -- this is the Metropolis Monte Carlo algorithm. In this way, stochastic jumps are limited to a region close to the previous parameter values.
To decide whether we accept or reject, we look at the ratio of probabilities:
$$\alpha = \frac{p(y_i^{(t+1)}|x_{ij}^{(t+1)},\beta_j^{(t+1)},a_i^{(t+1)})p(\beta_j^{(t+1)})p(a_i^{(t+1)})}{p(y_i^{(t)}|x_{ij}^{(t)},\beta_j^{(t)},a_i^{(t)})p(\beta_j^{(t)})p(a_i^{(t)})} \frac{Q(\beta_j^{(t)}|\beta_j^{(t+1)})}{Q(\beta_j^{(t+1)}|\beta_j^{(t)})} \frac{Q(a_i^{(t)}|a_i^{(t+1)})}{Q(a_i^{(t+1)}|a_i^{(t)})}$$
The ratio of proposal densities drops out nicely due to our choice of a symmetric distribution. As a result, for each proposed jump, we only need to evaluate: \(p(y_i|x_{ij},\beta_j)p(\beta_j)p(a_i)\) and compare it to the value from the previous iteration.
This process of proposals and accept/reject decisions continues as the probability distributions for all parameters are explored on and around the probability maximum.
Please let me know if anything needs clarification and/or correcting, and I will try to update this post accordingly. Thanks!