Sunday, June 22, 2014

d3 visualization experiences Schrodinger's cat paradox (i.e., win and fail)

A few weekends ago, I took a d3 visualization workshop class hosted at Zipfian. Eager to show off my newfound skillz (with a "z"), I tried dumping my previous post's data into d3. Encountering countless hurdles along the way (asynchronous calls, adding jQuery source to blogspot, cross-site hosting, MIME-type errors, jsonCallback, getting pandas to output json in a desirable format, etc.), it is now two weeks later. I have good news and bad news.

The good news is that after hours of futzing around on a perfectly sunny Saturday afternoon, I managed to get a prototype working! It reads in jsons hosted on my github and plots nice, simple bars, dots, and a SF street map overlay (or underlay?). See above, if you're lucky.

The bad news, I discovered this morning, is that it actually still doesn't work (for you unlucky folks). As far as I can tell, it doesn't work with Chrome on Windows or Chrome on Chrome OS. It also doesn't work on mobile (but even my MathJax equations from the Markov Chains and Bayes-MCMC posts didn't work on mobile). I am working on an Ubuntu machine with Chrome, and it works here for some reason.

The problem apparently is that github hosts raw files as text/plain to prevent hotlinking. Initially only Internet Explorer respected the Content-type header, but Chrome and Firefox got on board about a year ago. My guess is the Ubuntu version of Chrome hasn't been updated yet, which is why it works for me.

So... back to the drawing board.

Monday, June 2, 2014

Gone in 134 seconds... Actually I still see you.

For my Zipfian Academy Capstone Project, I want to work on something related to traffic. A few weekends ago, I found some interesting data sets on the interwebs, including Anonymized Uber GPS Traces from several years ago. The data only contains trip IDs, timestamp, latitude, and longitude. The timestamps have preserved time of day and day of week, however the actual dates have been modified to protect the privacy of drivers and passengers. Moreover, the start and end of each ride is truncated, also for privacy reasons.

I created the above image right off the bat. It is just an overlay of all the rides together (after a little housekeeping). In case you don't recognize the geography, it is the San Francisco Bay Area. The long tail reaching southward are mostly trips to SFO Airport, and the eastward tail are mostly trips to OAK Airport.

Just last week, I went to a meetup where I met Kevin Novak, the lead data scientist at Uber. He mentioned one particular driver who has been with Uber since nearly the beginning. This driver, who shall not be named here, is well regarded but known particularly for driving under the speed limit.

So I decided to try and find this driver.

First, I needed to create new features based on the limited data I had. Using the central finite differences method, I took time derivatives to compute the speed and acceleration associated with each datapoint. Seemingly trivial, this was actually somewhat challenging to do within the python pandas framework.

(I will add some details here later when it's not 2am.)

I converted speed and acceleration (both scalars for now) into units of miles per hour and meters per second-squared, respectively. It turns out there are several "trips" that have long latency times. Specifically, this means the car is stopped, potentially for hours, and this can drive the mean and median speeds way down. Of course, there are ways around this, but for a quick result, I decided to look instead at maximum acceleration of each driver. Presumably this unnamed slow driver not only drives slow, but accelerates slowly as well.

Behold, my findings:

In the upper plot, I have just shown the path taken by "Driver 19393." I believe it is 101-N to I-80E. In the middle plot, I have shown the speed as a function of time. Driver 19393 has a top speed of approximately 17 miles per hour.  In the bottom plot, I have shown the acceleration as a function of time. Driver 19393 has a maximum acceleration of approximately 0.2 meters per second-squared. This is equivalent to 0 to 60 in a whopping 134 seconds.

Now, one might be inclined to ask if this driver was driving in traffic. I can not say absolutely definitively, but I know this trip happened between 12:00am and 12:30am. I've actually driven near that highway junction close to midnight quite often, and have yet to encounter any traffic.

So maybe Driver 19393 is our mystery driver? We'll see if deeper exploration reveals anything different...

Wednesday, May 28, 2014

Bayes, Markov, and (Monte) Carlo try to walk in a straight line... (what happens next is unbelievable)

On Monday, we played around with linear regression. Boring, right? Well, then we tackled the "Bayesian" method of linear regression. There are apparently a couple ways of doing this, but since we last talked about Markov Chains, and my background is in molecular simulation, let's talk about the Markov Chain Monte Carlo method of linear regression.

(Note: As of this writing, one of my Zipfian co-fellows, Dan Saber, has published an excellent review of the topic here implemented in python. I will try not to re-hash most of what he said, just what is necessary to describe the process mathematically from my understanding.)

Let's say we're given some data points, \(y_i\) and \(x_{ij}\). Here, we are using Einstein tensor notation, which basically means a single index denotes a vector and a double index denotes a matrix. Specifically here, we will use \(i\) to denote the data point index, where \(i \in \lbrace 1,2,3,...,N\rbrace\), and \(j\) to denote the dimension index, where \(j \in \lbrace 0,1,2,3,...,D\rbrace\). Note that \(j\) is zero-indexed to allow for an intercept term. Our linear regression model assumes the form:

$$y_i = x_{ij} \beta_j + \epsilon_i$$

Here, \(\beta_j\) is the vector of weighting coefficients, and \(\epsilon_i\) denotes the error between the model and the data. (Note: in Einstein notation, a repeated index implies a summation over all possible values of the index. In this example, after the multiplication between \(x_{ij}\) and \(\beta_j\), sum over all \(j\)'s; this is matrix multiplication. Here, the \(j\)'s are "summed away" and every term is a vector with index \(i\).)

We assume that \(\epsilon_i\) is normally distributed with mean \(0\) and variance \(\sigma_i^2\), and that each data point's error is independent of the rest. For simplicity, we'll refer to the "precision" \(a_i = \frac{1}{\sigma_i^2}\) rather than the variance. If the error is normally distributed about \(0\), then the observed data is normally distributed around the model as:

$$p(y_i|x_{ij},\beta_j,a_i) = \sqrt{\frac{a_i}{2 \pi}}\exp \left[ \frac{a_i}{2} (y_i - x_{ij}\beta_j)^2 \right]$$

We want to find the parameters \(\beta_j\) and \(a_i\). To do this, we will explore the parameter space and find the most probable parameters given the data. In other words, \(p(\beta_j, a_i | y_i, x_i)\). According to Bayes' Theorem, this is:

$$p(\beta_j, a_i | y_i, x_{ij}) \propto p(y_i|x_{ij},\beta_j, a_i)p(\beta_j)p(a_i)$$

The left hand side is the posterior probability. On the right, \(p(\beta_j)\) and \(a_i\) are the priors, and \(p(y_i|x_{ij},\beta_j,a_i)\) is the likelihood. The prior can be any informed or uninformed distribution of \(\beta_j\) and \(a_i\)

With that framework set up, here comes the Markov Chain Monte Carlo (MCMC) part. In MCMC, we propose a stochastic jump, evaluate the proposal, and determine to accept or reject the proposal. (Note: let's introduce another notation here; parenthetical superscripts denote iteration number.) The way this works in molecular simulation is:

  1. Given a box of atoms with known pairwise interactions, evaluate the energy of the system by summing up those interactions: \(E^{(t)}\)
  2. Pick an atom and propose a random jump to a different location in the box.
  3. Evaluate the new energy of the system: \(E^{(t+1)}\)
    • If \(E^{(t+1)} \lt E^{(t)}\), accept the proposal. We are now in state \(t+1\) with energy \(E^{(t+1)}\).
    • Otherwise, accept the proposal with some fractional probability, depending on the ratio of energies. If we reject, we stay in state \(t\) with energy \(E^{(t)}\).
  4. Repeat until convergence to find the minimum energy of the system
With the linear regression example however, we are trying to maximize a probability, rather than minimize an energy. Otherwise, it is largely the same.

We have to set up a multivariate distribution to sample parameters from. According to the Metropolis-Hastings algorithm of MCMC, every \(\beta_j^{(t+1)}\) is chosen from a proposal distribution given the previous parameter values: \(Q(\beta_j^{(t+1)} | \beta_j^{(t)})\); similarly with \(a_i\). To keep things relatively simple, we sample from a normal distribution -- this is the Metropolis Monte Carlo algorithm. In this way, stochastic jumps are limited to a region close to the previous parameter values.

To decide whether we accept or reject, we look at the ratio of probabilities:

$$\alpha = \frac{p(y_i^{(t+1)}|x_{ij}^{(t+1)},\beta_j^{(t+1)},a_i^{(t+1)})p(\beta_j^{(t+1)})p(a_i^{(t+1)})}{p(y_i^{(t)}|x_{ij}^{(t)},\beta_j^{(t)},a_i^{(t)})p(\beta_j^{(t)})p(a_i^{(t)})} \frac{Q(\beta_j^{(t)}|\beta_j^{(t+1)})}{Q(\beta_j^{(t+1)}|\beta_j^{(t)})} \frac{Q(a_i^{(t)}|a_i^{(t+1)})}{Q(a_i^{(t+1)}|a_i^{(t)})}$$

The ratio of proposal densities drops out nicely due to our choice of a symmetric distribution. As a result, for each proposed jump, we only need to evaluate: \(p(y_i|x_{ij},\beta_j)p(\beta_j)p(a_i)\) and compare it to the value from the previous iteration.

This process of proposals and accept/reject decisions continues as the probability distributions for all parameters are explored on and around the probability maximum.

Please let me know if anything needs clarification and/or correcting, and I will try to update this post accordingly. Thanks!

Wednesday, May 21, 2014

Markov! Where are my umbrellas?

Today I learned about Markov Chains. The toy problem that we tackled was so challenging and figuring out the solution was so satisfying, I felt the urge to share immediately (which expedited my inaugural blog post).

Problem: I have four umbrellas, some at home, some in the office. I keep moving between home and office. I take an umbrella with me only if it rains. If it does not rain, I leave the umbrella behind. If there are no umbrellas where I am, and it's raining, I get wet. The probability that it rains on any given day is \(0.6\). What is the probability that I get wet?

What?! First of all, if it rains \(60\%\) of the time, I'm moving.

Ok so the (first) trick here is knowing that home/office doesn't matter. It only matters how many umbrellas I have with me in my current location. With that in mind, the random variable, say \(U\), can take the values \(\lbrace 0, 1, 2, 3, 4 \rbrace\).

For \(U = 0\), I will by default transition into a state with \(U = 4\) with \(100\%\) probability. This means I have zero umbrellas with me, and I will leave empty-handed. At my destination are four umbrellas.

For \(U = 1\), there is a \(60\%\) probability that I bring the umbrella with me, and transition to \(U = 4\). There is also a \(40\%\) probability that I leave the umbrella behind, and transition to \(U = 3\).

For \(U = 2\), there is a \(60\%\) probability that I transition to \(U = 3\), and \(40\%\) probability that I stay at \(U = 2\). Similar arguments can be made for \(U = 3\) and \(U = 4\).

With those transitions in mind, we can draw a Markov Chain as such:

The arrows and probabilities denote transitions and corresponding probabilities, respectively. How does this help? From here, we can populate a transition matrix, say \(M\), for Markov.

$$ M = \begin{bmatrix} 0.0 & 0.0 & 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 0.0 & 0.4 & 0.6 \\ 0.0 & 0.0 & 0.4 & 0.6 & 0.0 \\ 0.0 & 0.4 & 0.6 & 0.0 & 0.0 \\ 0.4 & 0.6 & 0.0 & 0.0 & 0.0 \end{bmatrix} $$

To understand this matrix, rows denote starting state and columns denote ending state. For example, row 2 column 3 (with zero indexing) means there is a \(60\%\) probability of transitioning from Node 2 to Node 3 (or \(U = 2\) to \(U = 3\)). Now, let us assign a state vector as our initial condition:

$$ s_0 = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \\ 0.0 \end{bmatrix} $$

The values in this vector can be arbitrary since we are seeking the long time steady state (provided they sum to \(100\%\)). This vector tells us the probability of being in each state. As we have it filled in, we gave \(U = 3\) a \(100\%\) probability (zero indexing still), i.e., we have three umbrellas with us currently. Now, to transition into the next state (i.e., march forward one time step), we just let the matrix do all the heavy lifting. By hitting the vector on the left with the transpose of \(M\), we get the new state condition:

$$ s_1 = M^T s_0 $$

We can repeat this process infinite times to reach the steady state condition. Note that in general, a matrix raised to the infinite power might not converge, particularly if its values are greater than \(1\). Fortunately for us, it converges. We can diagonalize \(M\) as such:

$$ M = PDP^{-1} $$

Here, \(P\) is a block matrix comprised of the right eigenvectors of \(M\). \(D\) is the diagonal matrix of \(M\), comprised of the eigenvalues along the diagonal. This allows us to multiply as many times as we like, e.g.:

$$ \begin{aligned} \lim_{n\rightarrow\infty}M^n & = \lim_{n\rightarrow\infty} \overbrace{PDP^{-1} \times PDP^{-1} \times ... \times PDP^{-1}}^{n~\text{times}} \\ & = \lim_{n\rightarrow\infty} \overbrace{PD(P^{-1}P)D(P^{-1}P)...DP^{-1}}^{n~\text{times}} \\ & = \lim_{n\rightarrow\infty} PD^nP^{-1} \end{aligned} $$

Note that the internal \(P\) and \(P^{-1}\) matrices cancel out. Ultimately, with some plugging and chugging, we get:

$$ \begin{aligned} s_{\infty} & = \lim_{n\rightarrow\infty} (M^T)^{n}s_o \\ & = \lim_{n\rightarrow\infty}(P^T)^{-1}D^nP^Ts_o \\ & = \begin{bmatrix} 0.09091 \\ 0.22727 \\ 0.22727 \\ 0.22727 \\ 0.22727 \end{bmatrix} \end{aligned} $$

The probability that I get wet will be the probability that it rains and \(U = 0\). In other words, the probability of \(U = 0\) times \(0.6\):

$$ \begin{aligned} p(\text{wet}) & = 0.09091 \times 0.6 \\ & = 0.05455 \end{aligned} $$

Let me be a little more explicit, as this is quite useful in interpreting the math. The \(P\) and \(D\) matrices for this case are:

$$ P = \begin{bmatrix} 0.597 & -0.583 & -0.776 & -0.447 & 0.594 \\ 0.480 & 0.148 & 0.511 & -0.447 & 0.183 \\ 0.128 & 0.414 & -0.320 & -0.447 & -0.545 \\ -0.286 & -0.619 & 0.175 & -0.447 & -0.331 \\ -0.561 & 0.290 & -0.056 & -0.447 & 0.455 \end{bmatrix} $$


$$ D = \begin{bmatrix} -0.939 & 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & -0.497 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.072 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 1.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.765 \end{bmatrix} $$

Raising \(D\) to the \(n^{\text{th}}\) power is the same as raising each eigenvalue to the \(n^{\text{th}}\) power. Since four of the five values are fractions, they quickly become zero. The eigenvalue with value unity remains the same. Then, as \(M\) is taken to the \(n^{\text{th}}\) power, we obtain

$$ \lim_{n\rightarrow\infty}(M^T)^n = \begin{bmatrix} 0.091 & 0.091 & 0.091 & 0.091 & 0.091 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \\ 0.227 & 0.227 & 0.227 & 0.227 & 0.227 \end{bmatrix} $$

Notice that this is precisely the steady state vector repeated as columns. This ensures that we will recover the same steady state vector every time, as long as the initial vector sums to 100%.

However, it has also been pointed out to me that the end can be solved more analytically. So let's take a couple steps back. Rather than multiplying an initial state vector \(n\)-times, all we have to do is multiply the final state vector once. Since it is the final state, it will remain unchanged under the transformation. Specifically:

$$ \begin{aligned} s_{\infty+1} = s_{\infty} & = M^T s_{\infty} \\ 0 & = M^T s_{\infty} - s_{\infty} \\ 0 & = (M^T - I) s_{\infty} \end{aligned} $$

It turns out \((M^T-I)\) only has rank \(4\). Another equation can be added to one of the rows, conditioning that the elements of the state vector sum to \(1\), i.e.,

$$ 1 = \pi_0 + \pi_1 + \pi_2 + \pi_3 + \pi_4 $$

Here, the \(\pi\)'s denote the steady state probabilities of each node. Explicitly, we have:

$$ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1.0 & 0.0 & 0.0 & 0.0 & 0.4 \\ 0.0 & -1.0 & 0.0 & 0.4 & 0.6 \\ 0.0 & 0.0 & -0.6 & 0.6 & 0.0 \\ 0.0 & 0.4 & 0.6 & -1.0 & 0.0 \\ 2.0 & 1.6 & 1.0 & 1.0 & 0.0 \end{bmatrix} \begin{bmatrix} \pi_0 \\ \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix} $$

Finally, the modified matrix equation can be solved with matrix multiplication to get:

$$ s_{\infty} = \begin{bmatrix} \pi_0 \\ \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix} = \begin{bmatrix} 0.09091 \\ 0.22727 \\ 0.22727 \\ 0.22727 \\ 0.22727 \end{bmatrix} $$.

Tuesday, May 20, 2014

Zipf! (like thwip! or snikt!)

Hello World! My name is Jonny, and welcome to my blog on data science escapades!

I was a computational researcher in mechanical engineering and materials science where much of my research focused on applying molecular simulation to understanding material behavior. Following my postdoctoral appointment, I found that my quantitative skillset could be complemented with a data scientist's toolbox to tackle a much wider range of problems. And so began my forays into data science.

I've spent a lot of time in recent months learning (or re-learning in some cases) python and various libraries (mechanize, BeautifulSoup, pandas, matplotlib, etc.), SQL, machine learning algorithms, Hadoop and MapReduce, etc. on my own. Less than two weeks ago, I was accepted into the Zipfian Academy, and quickly I began my formal training as a budding data scientist. I am now halfway through my second week, and already I've learned a ton of things, met some really interesting people with eclectic backgrounds, and began scraping the web for some interesting datasets (more on this soon!).

Whether you are an established data scientist, a curious academic, an engineer, a scientist, a student, a friend or relative, or none of the above, I welcome you to join me in my edification experience and hopefully beyond.

To find more information about me, you can also explore my website at