Learning and Probabilistic Reasoning
Andrew M. Adare
January 26, 2016
When the facts change, I change my opinion.
What do you do, sir?
John Maynard Keynes
knowl·edge
This talk is about the transition from belief to knowledge through evidence and probabilistic reasoning.
Fascinating history:
Developed for apologetics; applied in war; suppressed and discredited for centuries; now rocking the mic
Essential features: probability quantifies degree of belief; unknown parameters are treated as random variables; beliefs are updated with evidence via Bayes' rule
Contrast this with frequentist inference:
Any experiment is viewed as one of an infinite sequence of similar, independent experiments
Key technique: Method of Maximum Likelihood
Unknown parameters typically treated as fixed values; data points get error bars
Unnatural/problematic for rare events like earthquakes, stock market crashes
Consider this 2D histogram:
$N$ is the sum over all cells; $c_i$ is the sum of the cells in column $i$ (the counts with $x = x_i$)
Joint $$ p(x_i, y_j) = \frac{n_{ij}}{N} $$
Marginal $$ p(x_i) = \frac{c_i}{N} $$
Conditional $$ p(y_j\,|\,x_i) = \frac{n_{ij}}{c_i} $$
Combine:
$$ p(x_i, y_j) = \frac{n_{ij}}{c_i}\,\frac{c_i}{N} = p(y_j\,|\,x_i)\,p(x_i) $$
$$ p(x, y) = p(y\,|\,x)\,p(x) = p(x\,|\,y)\,p(y) $$
$$ p(y\,|\,x) = \frac{p(x\,|\,y)\,p(y)}{p(x)} $$
$$ \mathrm{posterior} = \frac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}} $$
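The sum and product rules above are easy to check numerically. A minimal Python sketch (the counts and array shapes here are invented for illustration):

```python
# Verify the product rule and Bayes' rule on a random 2D histogram.
# All numbers are made up for illustration, not data from the talk.
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(1, 10, size=(5, 4)).astype(float)  # n[i, j]: counts for (x_i, y_j)
N = n.sum()                                         # total number of entries

joint = n / N                                       # p(x_i, y_j) = n_ij / N
p_x = joint.sum(axis=1)                             # marginal p(x_i) = c_i / N
p_y = joint.sum(axis=0)                             # marginal p(y_j)
p_y_given_x = n / n.sum(axis=1, keepdims=True)      # p(y_j | x_i) = n_ij / c_i
p_x_given_y = n / n.sum(axis=0, keepdims=True)      # p(x_i | y_j)

# Product rule: p(x, y) = p(y | x) p(x)
assert np.allclose(joint, p_y_given_x * p_x[:, None])

# Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)
assert np.allclose(p_y_given_x, p_x_given_y * p_y[None, :] / p_x[:, None])
```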
Two goals, given a dataset $ \{ \mathbf{x}_{i} \}_{i=1}^I $: learn the model parameters $\mathbf{\theta}$, and predict unseen data $\mathbf{x}^*$
Three general approaches: maximum likelihood (ML), maximum a posteriori (MAP), and fully Bayesian estimation
Find the set of parameters under which the data are most likely: $$ \hat{\mathbf{\theta}} = \underset{\mathbf{\theta}}{\mathrm{argmax}} \left[ p(\mathbf{x}_{1\ldots I} \,|\, \mathbf{\theta}) \right] = \underset{\mathbf{\theta}}{\mathrm{argmax}} \left[ \prod_{i=1}^I p(\mathbf{x}_i \,|\, \mathbf{\theta}) \right] $$
Ordinary Least Squares is the ML estimator for a Gaussian likelihood (it minimizes $-\!\ln p(\mathbf{x} \,|\, \mathbf{\theta})$)
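A quick numerical check of that equivalence, as a hedged sketch with simulated data (the model, noise level, and use of `scipy.optimize` are my assumptions, not the talk's code):

```python
# For y = Xw + Gaussian noise, -ln p(y | w) = 0.5 * beta * ||y - Xw||^2 + const,
# so minimizing the negative log-likelihood reproduces the least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # design matrix
w_true = np.array([0.5, -2.0])                           # invented ground truth
y = X @ w_true + rng.normal(scale=0.3, size=50)

def neg_log_lik(w, beta=1.0 / 0.3**2):
    r = y - X @ w
    return 0.5 * beta * r @ r            # terms constant in w are dropped

w_ml = minimize(neg_log_lik, x0=np.zeros(2)).x           # ML estimate
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]             # OLS solution
assert np.allclose(w_ml, w_ols, atol=1e-4)
```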
MAP estimation uses Bayes' rule (without the denominator) $$ \hat{\mathbf{\theta}} = \underset{\mathbf{\theta}}{\mathrm{argmax}} \left[ p(\mathbf{\theta} \,|\, \mathbf{x}_{1\ldots I}) \right] = \underset{\mathbf{\theta}}{\mathrm{argmax}} \left[ \prod_{i=1}^I p(\mathbf{x}_i \,|\, \mathbf{\theta}) p(\mathbf{\theta}) \right] $$ This generalizes ML estimation by introducing prior knowledge/belief via $ p(\mathbf{\theta}) $.
Many choices of $\mathbf{\theta}$ may be consistent with the data.
Point estimates cannot capture that uncertainty.
Fully Bayesian methods calculate the full joint posterior probability over the model parameters: $$ p(\mathbf{\theta} \,|\, \mathbf{x}_{1\ldots I}) = \frac{\prod_{i=1}^I p(\mathbf{x}_i \,|\, \mathbf{\theta}) p(\mathbf{\theta})} {p(\mathbf{x}_{1\ldots I})} $$
This is a distribution over possible models.
$p(\mathbf{x}^* \,|\, \mathbf{\theta})$ gives us a prediction for the unseen data $\mathbf{x}^*$ for a given $\mathbf{\theta}$.
Since there are many possible $\mathbf{\theta}$ values, we must integrate over them all, weighting by their probability: $$ p(\mathbf{x}^* \,|\, \mathbf{x}_{1\ldots I}) = \int p(\mathbf{x}^* \,|\, \mathbf{\theta}) \, p(\mathbf{\theta} \,|\, \mathbf{x}_{1\ldots I}) d\mathbf{\theta} $$
A unified picture: the "posterior" from ML and MAP estimation is like a $\delta$-function at $\mathbf{\hat\theta}$.
So predicting $\mathbf{x}^*$ simply amounts to evaluating $p(\mathbf{x}^* \,|\, \mathbf{\hat\theta })$.
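To make the contrast concrete, here is a sketch that evaluates the predictive integral on a grid for the coin model introduced on the next slides (the grid resolution and prior settings are assumptions for illustration):

```python
# Fully Bayesian prediction vs. the ML plug-in, for a binomial likelihood with
# a Beta prior. Settings (two heads observed, uniform prior) match the coin
# example later in the talk; the grid integration itself is my own sketch.
import numpy as np

theta = np.linspace(0.0, 1.0, 2001)      # grid over the parameter
dtheta = theta[1] - theta[0]
Nh, Nt, a, b = 2, 0, 1, 1                # two heads, uniform Beta(1, 1) prior

post = theta**(Nh + a - 1) * (1 - theta)**(Nt + b - 1)  # unnormalized posterior
post /= post.sum() * dtheta                             # normalize on the grid

# p(x* = heads | data) = ∫ p(heads | θ) p(θ | data) dθ = ∫ θ p(θ | data) dθ
p_heads_bayes = (theta * post).sum() * dtheta           # ≈ 0.75

# The ML "posterior" is a δ-function at θ̂ = Nh/N, so prediction is plug-in:
p_heads_ml = Nh / (Nh + Nt)                             # = 1.0, overconfident
```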
How do we test whether a coin is fair?
Use the binomial distribution to model the likelihood: $$ p(k\,|\, N, \theta) = \binom{N}{k} \theta^k (1-\theta)^{N-k} $$ Here $\theta$ is the probability of heads on a single flip, and the model gives the probability of obtaining $k$ heads in $N$ flips.
Use a Beta distribution for the prior $p(\theta)$: $$ \mathrm{Beta}(\theta \,|\, a, b) \propto \theta^{a - 1}(1 - \theta)^{b - 1} $$
Conjugate prior to binomial likelihood
Uninformative if $a = b = 1$ (uniform distribution)
Posterior: $\mathrm{Beta}(\theta \,|\, N_h + a, N_t + b)$, where $N_h$ and $N_t$ count the observed heads and tails
Max. Likelihood vs Bayesian shootout
We will compare $$\hat{\theta}_{MLE} = \frac{N_{h}}{N}$$
vs. the posterior mean $$E[\theta \,|\, \mathrm{outcomes}]$$
Both methods agree on the truth as $N \to \infty$
After two flips (both heads), the ML estimate was $\frac{N_{h}}{N} = 1$ with a variance of zero!
Overfitting: that prediction wouldn't generalize.
But the Bayesian estimate is immediately closer to the correct answer, with a saner uncertainty
Posterior mean $ = \frac{N_{h} + a}{N + a + b} = \frac{3}{4} $ with a variance of $0.19^2$.
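These numbers are straightforward to reproduce with the conjugate update (a minimal sketch; the use of `scipy.stats` is an assumption, not the talk's code):

```python
# After two heads with a uniform Beta(1, 1) prior, the posterior is Beta(3, 1).
from scipy.stats import beta

a, b, Nh, Nt = 1, 1, 2, 0
posterior = beta(Nh + a, Nt + b)

print(posterior.mean())  # 0.75: the posterior mean (Nh + a) / (N + a + b)
print(posterior.std())   # ≈ 0.19: a sane uncertainty, unlike the ML variance of 0
```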
Simultaneous Localization and Mapping
Start at unknown location in an unknown environment
Incrementally build a consistent map
Simultaneously determine location within map
SLAM can be solved using an extended Kalman filter (EKF)
The EKF is a specialization of Bayesian filtering for linearized Gaussian models
Simulation: autonomous navigation using predefined waypoints, landmarks observed with a rangefinder, and Bayesian filtering
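For reference, one EKF predict/update cycle has this generic shape (a sketch, not the demo's actual code; the motion model `f`, measurement model `h`, their Jacobians `F` and `H`, and the noise covariances `Q` and `R` are assumed inputs):

```python
# Generic extended Kalman filter step: linearize the models at the current
# estimate, propagate the Gaussian belief, then correct with the measurement.
import numpy as np

def ekf_step(x, P, u, z, f, F, h, H, Q, R):
    # Predict: push the state and covariance through the motion model.
    x_pred = f(x, u)
    F_k = F(x, u)                            # Jacobian of f at (x, u)
    P_pred = F_k @ P @ F_k.T + Q

    # Update: correct the prediction using the measurement z.
    H_k = H(x_pred)                          # Jacobian of h at x_pred
    S = H_k @ P_pred @ H_k.T + R             # innovation covariance
    K = P_pred @ H_k.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))     # innovation-weighted correction
    P_new = (np.eye(len(x)) - K @ H_k) @ P_pred
    return x_new, P_new
```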
The Gaussian is the maximum-entropy distribution for a given mean and variance
Outline of demo, skipping many details:
The posterior is thus also Gaussian: $$ p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N \mathbf{\Phi}^T \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\,\mathbf{\Phi}^T \mathbf{\Phi} $$
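A minimal numpy sketch of those two formulas (the data, polynomial features, and the values of $\alpha$ and $\beta$ are invented for illustration):

```python
# Posterior over linear-regression weights: p(w | t) = N(w | m_N, S_N) with
# m_N = beta * S_N @ Phi.T @ t and S_N^{-1} = alpha * I + beta * Phi.T @ Phi.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy targets
Phi = np.column_stack([x**p for p in range(4)])  # cubic polynomial design matrix

alpha, beta = 2.0, 1 / 0.2**2                    # prior and noise precisions
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t                     # posterior mean over w

# The predictive mean at a new input x* is phi(x*) @ m_N.
```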