Wednesday, May 28, 2014

Bayes, Markov, and (Monte) Carlo try to walk in a straight line... (what happens next is unbelievable)

On Monday, we played around with linear regression. Boring, right? Well, then we tackled the "Bayesian" method of linear regression. There are apparently a couple ways of doing this, but since we last talked about Markov Chains, and my background is in molecular simulation, let's talk about the Markov Chain Monte Carlo method of linear regression.

(Note: As of this writing, one of my Zipfian co-fellows, Dan Saber, has published an excellent review of the topic here, implemented in Python. I will try not to re-hash what he said, covering just what is necessary to describe the process mathematically, from my understanding.)

Let's say we're given some data points, \(y_i\) and \(x_{ij}\). Here, we are using Einstein tensor notation, which basically means a single index denotes a vector and a double index denotes a matrix. Specifically here, we will use \(i\) to denote the data point index, where \(i \in \lbrace 1,2,3,...,N\rbrace\), and \(j\) to denote the dimension index, where \(j \in \lbrace 0,1,2,3,...,D\rbrace\). Note that \(j\) is zero-indexed to allow for an intercept term. Our linear regression model assumes the form:

$$y_i = x_{ij} \beta_j + \epsilon_i$$

Here, \(\beta_j\) is the vector of weighting coefficients, and \(\epsilon_i\) denotes the error between the model and the data. (Note: in Einstein notation, a repeated index implies a summation over all possible values of the index. In this example, after the multiplication between \(x_{ij}\) and \(\beta_j\), sum over all \(j\)'s; this is matrix multiplication. Here, the \(j\)'s are "summed away" and every term is a vector with index \(i\).)

We assume that \(\epsilon_i\) is normally distributed with mean \(0\) and variance \(\sigma_i^2\), and that each data point's error is independent of the rest. For simplicity, we'll refer to the "precision" \(a_i = \frac{1}{\sigma_i^2}\) rather than the variance. If the error is normally distributed about \(0\), then the observed data is normally distributed around the model as:

$$p(y_i|x_{ij},\beta_j,a_i) = \sqrt{\frac{a_i}{2 \pi}}\exp \left[ -\frac{a_i}{2} (y_i - x_{ij}\beta_j)^2 \right]$$
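
To make this concrete, here is a minimal numpy sketch of the corresponding log-likelihood. For simplicity I assume a single shared precision `a` rather than a per-point \(a_i\); `X`, `y`, and `beta` are the data matrix, targets, and coefficient vector.

```python
import numpy as np

def log_likelihood(beta, a, X, y):
    """Log of p(y | X, beta, a) for i.i.d. Gaussian errors with precision a.
    Working in log space avoids underflow when multiplying N densities."""
    resid = y - X @ beta  # y_i - x_ij * beta_j
    n = len(y)
    return 0.5 * n * np.log(a / (2 * np.pi)) - 0.5 * a * np.sum(resid**2)
```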

We want to find the parameters \(\beta_j\) and \(a_i\). To do this, we will explore the parameter space and find the most probable parameters given the data. In other words, we want \(p(\beta_j, a_i | y_i, x_{ij})\). According to Bayes' Theorem, this is:

$$p(\beta_j, a_i | y_i, x_{ij}) \propto p(y_i|x_{ij},\beta_j, a_i)p(\beta_j)p(a_i)$$

The left hand side is the posterior probability. On the right, \(p(\beta_j)\) and \(p(a_i)\) are the priors, and \(p(y_i|x_{ij},\beta_j,a_i)\) is the likelihood. The priors can be any informed or uninformed distributions over \(\beta_j\) and \(a_i\).
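
Continuing the sketch, the unnormalized log-posterior is just the log-likelihood plus the log-priors. The flat prior over \(\beta_j\) and over \(a > 0\) below is my own uninformed choice; any other prior would slot in here.

```python
def log_prior(beta, a):
    """Uninformed (flat) priors: uniform over beta, uniform over a > 0.
    For a flat prior, only the support matters, up to an additive constant."""
    return 0.0 if a > 0 else -np.inf

def log_posterior(beta, a, X, y):
    """Unnormalized log-posterior: log-likelihood plus log-priors."""
    lp = log_prior(beta, a)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(beta, a, X, y)
```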

With that framework set up, here comes the Markov Chain Monte Carlo (MCMC) part. In MCMC, we propose a stochastic jump, evaluate the proposal, and decide whether to accept or reject it. (Note: let's introduce another notation here; parenthetical superscripts denote iteration number.) The way this works in molecular simulation is:


  1. Given a box of atoms with known pairwise interactions, evaluate the energy of the system by summing up those interactions: \(E^{(t)}\)
  2. Pick an atom and propose a random jump to a different location in the box.
  3. Evaluate the new energy of the system: \(E^{(t+1)}\)
    • If \(E^{(t+1)} \lt E^{(t)}\), accept the proposal. We are now in state \(t+1\) with energy \(E^{(t+1)}\).
    • Otherwise, accept the proposal with some fractional probability, depending on the ratio of energies. If we reject, we stay in state \(t\) with energy \(E^{(t)}\).
  4. Repeat until convergence to find the minimum energy of the system.

With the linear regression example, however, we are trying to maximize a probability rather than minimize an energy. Otherwise, it is largely the same.
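
In molecular simulation, the fractional acceptance probability in step 3 is the Boltzmann factor \(\exp[-(E^{(t+1)} - E^{(t)})/k_B T]\). Here is a minimal sketch of that accept/reject rule; the temperature `kT` and the energies are placeholders.

```python
def metropolis_accept(E_old, E_new, kT=1.0):
    """Standard Metropolis criterion: downhill moves are always accepted;
    uphill moves are accepted with probability exp(-(E_new - E_old) / kT)."""
    if E_new < E_old:
        return True
    return np.random.rand() < np.exp(-(E_new - E_old) / kT)
```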

We have to set up a multivariate distribution to sample parameters from. According to the Metropolis-Hastings algorithm of MCMC, every \(\beta_j^{(t+1)}\) is chosen from a proposal distribution given the previous parameter values: \(Q(\beta_j^{(t+1)} | \beta_j^{(t)})\); similarly with \(a_i\). To keep things relatively simple, we sample from a normal distribution centered at the previous value -- with this symmetric choice, Metropolis-Hastings reduces to the Metropolis Monte Carlo algorithm. In this way, stochastic jumps are limited to a region close to the previous parameter values.

To decide whether we accept or reject, we look at the ratio of probabilities:

$$\alpha = \frac{p(y_i|x_{ij},\beta_j^{(t+1)},a_i^{(t+1)})\,p(\beta_j^{(t+1)})\,p(a_i^{(t+1)})}{p(y_i|x_{ij},\beta_j^{(t)},a_i^{(t)})\,p(\beta_j^{(t)})\,p(a_i^{(t)})} \frac{Q(\beta_j^{(t)}|\beta_j^{(t+1)})}{Q(\beta_j^{(t+1)}|\beta_j^{(t)})} \frac{Q(a_i^{(t)}|a_i^{(t+1)})}{Q(a_i^{(t+1)}|a_i^{(t)})}$$

The ratio of proposal densities drops out nicely due to our choice of a symmetric distribution. As a result, for each proposed jump, we only need to evaluate \(p(y_i|x_{ij},\beta_j,a_i)\,p(\beta_j)\,p(a_i)\) and compare it to the value from the previous iteration. If \(\alpha \ge 1\), we accept the proposal; otherwise, we accept it with probability \(\alpha\).

This process of proposals and accept/reject decisions continues until the chain has thoroughly explored the probability distributions of all parameters on and around the posterior maximum; the accepted samples then trace out those distributions.
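
Putting the pieces together, here is a minimal random-walk Metropolis sampler under the same simplifying assumptions as before (a single shared precision \(a\), flat priors, and hand-picked Gaussian proposal widths); it reuses the `log_posterior` sketched earlier.

```python
def metropolis_sampler(X, y, n_steps=50_000, beta_step=0.1, a_step=0.1, seed=0):
    """Random-walk Metropolis sampling of (beta, a) from the posterior."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])  # starting guess for the coefficients
    a = 1.0                      # starting guess for the precision
    logp = log_posterior(beta, a, X, y)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposals centered at the current state,
        # so the Q ratios cancel in the acceptance ratio.
        beta_new = beta + beta_step * rng.standard_normal(beta.shape)
        a_new = a + a_step * rng.standard_normal()
        logp_new = log_posterior(beta_new, a_new, X, y)
        # Accept with probability min(1, alpha), computed in log space.
        if rng.random() < np.exp(min(0.0, logp_new - logp)):
            beta, a, logp = beta_new, a_new, logp_new
        samples.append((beta.copy(), a))
    return samples
```

In practice, one discards an initial "burn-in" stretch of samples and tunes the proposal widths so that a healthy fraction of jumps is accepted.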

Please let me know if anything needs clarification and/or correcting, and I will try to update this post accordingly. Thanks!

Wednesday, May 21, 2014

Markov! Where are my umbrellas?

Today I learned about Markov Chains. The toy problem that we tackled was so challenging, and figuring out the solution so satisfying, that I felt the urge to share immediately (which expedited my inaugural blog post).

Problem: I have four umbrellas, some at home, some in the office. I keep moving between home and office. I take an umbrella with me only if it rains. If it does not rain, I leave the umbrella behind. If there are no umbrellas where I am, and it's raining, I get wet. The probability that it rains on any given day is \(0.6\). What is the probability that I get wet?

What?! First of all, if it rains \(60\%\) of the time, I'm moving.

Ok so the (first) trick here is knowing that home/office doesn't matter. It only matters how many umbrellas I have with me in my current location. With that in mind, the random variable, say \(U\), can take the values \(\lbrace 0, 1, 2, 3, 4 \rbrace\).

For \(U = 0\), I will by default transition into a state with \(U = 4\) with \(100\%\) probability. This means I have zero umbrellas with me, and I will leave empty-handed. At my destination are four umbrellas.

For \(U = 1\), there is a \(60\%\) probability that I bring the umbrella with me, and transition to \(U = 4\). There is also a \(40\%\) probability that I leave the umbrella behind, and transition to \(U = 3\).

For \(U = 2\), there is a \(60\%\) probability that I transition to \(U = 3\), and \(40\%\) probability that I stay at \(U = 2\). Similar arguments can be made for \(U = 3\) and \(U = 4\).

With those transitions in mind, we can draw a Markov Chain as such:

[Figure: Markov chain diagram of the five umbrella states]

The arrows and probabilities denote transitions and corresponding probabilities, respectively. How does this help? From here, we can populate a transition matrix, say \(M\), for Markov.

$$ M = \begin{bmatrix} 0.0 & 0.0 & 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 0.0 & 0.4 & 0.6 \\ 0.0 & 0.0 & 0.4 & 0.6 & 0.0 \\ 0.0 & 0.4 & 0.6 & 0.0 & 0.0 \\ 0.4 & 0.6 & 0.0 & 0.0 & 0.0 \end{bmatrix} $$

To understand this matrix, rows denote starting state and columns denote ending state. For example, row 2 column 3 (with zero indexing) means there is a \(60\%\) probability of transitioning from Node 2 to Node 3 (or \(U = 2\) to \(U = 3\)). Now, let us assign a state vector as our initial condition:

$$ s_0 = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \\ 0.0 \end{bmatrix} $$

The values in this vector can be arbitrary (provided they sum to \(100\%\)), since we are seeking the long-time steady state. This vector tells us the probability of being in each state. As we have it filled in, we gave \(U = 3\) a \(100\%\) probability (zero indexing still), i.e., we have three umbrellas with us currently. Now, to transition into the next state (i.e., march forward one time step), we just let the matrix do all the heavy lifting. By hitting the vector on the left with the transpose of \(M\), we get the new state condition:


$$ s_1 = M^T s_0 $$
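
As a quick numpy sketch, we can march the state vector forward until it stops changing:

```python
import numpy as np

# Transition matrix M: rows are starting states, columns are ending states.
M = np.array([
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.4, 0.6],
    [0.0, 0.0, 0.4, 0.6, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0, 0.0],
])
s = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # initial state vector

# Repeatedly apply M^T until the state vector stops changing.
for _ in range(1000):
    s_next = M.T @ s
    if np.allclose(s_next, s):
        break
    s = s_next

print(s)  # -> approximately [0.09091, 0.22727, 0.22727, 0.22727, 0.22727]
```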

We can repeat this process infinitely many times to reach the steady state condition. Note that in general, a matrix raised to an infinite power might not converge, particularly if it has eigenvalues with magnitude greater than \(1\). Fortunately for us, it converges. We can diagonalize \(M\) as such:


$$ M = PDP^{-1} $$

Here, \(P\) is the matrix whose columns are the right eigenvectors of \(M\), and \(D\) is the diagonal matrix with the corresponding eigenvalues along its diagonal. This allows us to multiply as many times as we like, e.g.:

$$ \begin{aligned} \lim_{n\rightarrow\infty}M^n & = \lim_{n\rightarrow\infty} \overbrace{PDP^{-1} \times PDP^{-1} \times ... \times PDP^{-1}}^{n~\text{times}} \\ & = \lim_{n\rightarrow\infty} \overbrace{PD(P^{-1}P)D(P^{-1}P)...DP^{-1}}^{n~\text{times}} \\ & = \lim_{n\rightarrow\infty} PD^nP^{-1} \end{aligned} $$

Note that the internal \(P\) and \(P^{-1}\) matrices cancel out. Ultimately, with some plugging and chugging, we get:


$$ \begin{aligned} s_{\infty} & = \lim_{n\rightarrow\infty} (M^T)^{n}s_0 \\ & = \lim_{n\rightarrow\infty}(P^T)^{-1}D^nP^Ts_0 \\ & = \begin{bmatrix} 0.09091 \\ 0.22727 \\ 0.22727 \\ 0.22727 \\ 0.22727 \end{bmatrix} \end{aligned} $$
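
We can verify this with numpy's eigendecomposition (reusing `M` from the sketch above): the steady state is the eigenvector of \(M^T\) with eigenvalue \(1\), normalized to sum to one.

```python
# The steady state is the eigenvector of M^T with eigenvalue 1,
# rescaled so its entries sum to one.
eigvals, eigvecs = np.linalg.eig(M.T)
k = np.argmin(np.abs(eigvals - 1.0))  # index of the eigenvalue closest to 1
s_inf = np.real(eigvecs[:, k])
s_inf /= s_inf.sum()                  # normalize to a probability vector
print(s_inf)  # -> [0.09091, 0.22727, 0.22727, 0.22727, 0.22727]
```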

The probability that I get wet will be the probability that it rains and \(U = 0\). In other words, the probability of \(U = 0\) times \(0.6\):


$$ \begin{aligned} p(\text{wet}) & = 0.09091 \times 0.6 \\ & = 0.05455 \end{aligned} $$

Update:
Let me be a little more explicit, as this is quite useful in interpreting the math. The \(P\) and \(D\) matrices for this case are:


$$ P = \begin{bmatrix} 0.597 & -0.583 & -0.776 & -0.447 & 0.594 \\ 0.480 & 0.148 & 0.511 & -0.447 & 0.183 \\ 0.128 & 0.414 & -0.320 & -0.447 & -0.545 \\ -0.286 & -0.619 & 0.175 & -0.447 & -0.331 \\ -0.561 & 0.290 & -0.056 & -0.447 & 0.455 \end{bmatrix} $$

and

$$ D = \begin{bmatrix} -0.939 & 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & -0.497 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.072 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 1.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.765 \end{bmatrix} $$

Raising \(D\) to the \(n^{\text{th}}\) power is the same as raising each eigenvalue to the \(n^{\text{th}}\) power. Since four of the five eigenvalues have magnitude less than one, they quickly go to zero. The eigenvalue equal to unity remains the same. Then, as \(M^T\) is taken to the \(n^{\text{th}}\) power, we obtain


$$ \lim_{n\rightarrow\infty}(M^T)^n = \begin{bmatrix} 0.091 & 0.091 & 0.091 & 0.091 & 0.091 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \end{bmatrix} $$

Notice that this is precisely the steady state vector repeated as columns. This ensures that we will recover the same steady state vector every time, as long as the initial vector sums to 100%.
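
A quick sanity check in numpy (again reusing `M`): a large finite power already reproduces this limiting matrix to three decimals.

```python
# Each column of a high power of M^T approaches the steady state vector.
print(np.linalg.matrix_power(M.T, 200).round(3))
```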

However, it has also been pointed out to me that the end result can be obtained analytically. So let's take a couple steps back. Rather than multiplying an initial state vector \(n\) times, all we have to do is multiply the final state vector once. Since it is the final state, it remains unchanged under the transformation. Specifically:


$$ \begin{aligned} s_{\infty+1} = s_{\infty} & = M^T s_{\infty} \\ 0 & = M^T s_{\infty} - s_{\infty} \\ 0 & = (M^T - I) s_{\infty} \end{aligned} $$

It turns out \((M^T-I)\) only has rank \(4\). Another equation can be added to one of the rows, imposing the condition that the elements of the state vector sum to \(1\), i.e.,


$$ 1 = \pi_0 + \pi_1 + \pi_2 + \pi_3 + \pi_4 $$

Here, the \(\pi\)'s denote the steady state probabilities of each node. Explicitly, we have:


$$ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1.0 & 0.0 & 0.0 & 0.0 & 0.4 \\ 0.0 & -1.0 & 0.0 & 0.4 & 0.6 \\ 0.0 & 0.0 & -0.6 & 0.6 & 0.0 \\ 0.0 & 0.4 & 0.6 & -1.0 & 0.0 \\ 2.0 & 1.6 & 1.0 & 1.0 & 0.0 \end{bmatrix} \begin{bmatrix} \pi_0 \\ \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix} $$

Finally, the modified matrix equation can be solved with a linear solve (or matrix inversion) to get:


$$ s_{\infty} = \begin{bmatrix} \pi_0 \\ \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix} = \begin{bmatrix} 0.09091 \\ 0.22727 \\ 0.22727 \\ 0.22727 \\ 0.22727 \end{bmatrix} $$
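
In numpy, this entire approach amounts to a single call to a linear solver (again reusing `M`):

```python
# Build (M^T - I), fold the "probabilities sum to 1" condition into the
# last row (matching the modified matrix above), and solve.
A = M.T - np.eye(5)
A[-1, :] += 1.0
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
s_inf = np.linalg.solve(A, b)
print(s_inf)  # -> [0.09091, 0.22727, 0.22727, 0.22727, 0.22727]
```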

Tuesday, May 20, 2014

Zipf! (like thwip! or snikt!)

Hello World! My name is Jonny, and welcome to my blog on data science escapades!

I was a computational researcher in mechanical engineering and materials science, where much of my research focused on applying molecular simulation to understanding material behavior. Following my postdoctoral appointment, I found that my quantitative skillset could be complemented with a data scientist's toolbox to tackle a much wider range of problems. And so began my forays into data science.

I've spent a lot of time in recent months learning (or re-learning in some cases) python and various libraries (mechanize, BeautifulSoup, pandas, matplotlib, etc.), SQL, machine learning algorithms, Hadoop and MapReduce, etc. on my own. Less than two weeks ago, I was accepted into the Zipfian Academy, and I quickly began my formal training as a budding data scientist. I am now halfway through my second week, and already I've learned a ton, met some really interesting people with eclectic backgrounds, and begun scraping the web for some interesting datasets (more on this soon!).

Whether you are an established data scientist, a curious academic, an engineer, a scientist, a student, a friend or relative, or none of the above, I welcome you to join me in my edification experience and hopefully beyond.

To find more information about me, you can also explore my website at https://sites.google.com/site/jonathanwlee5.