$$ \newcommand{\problemdivider}{\begin{center}\large \bf\ldots\ldots\ldots\ldots\ldots\ldots\end{center}} \newcommand{\subproblemdivider}{\begin{center}\large \bf\ldots\ldots\end{center}} \newcommand{\pdiv}{\problemdivider} \newcommand{\spdiv}{\subproblemdivider} \newcommand{\ba}{\begin{align*}} \newcommand{\ea}{\end{align*}} \newcommand{\rt}{\right} \newcommand{\lt}{\left} \newcommand{\bp}{\begin{problem}} \newcommand{\ep}{\end{problem}} \newcommand{\bsp}{\begin{subproblem}} \newcommand{\esp}{\end{subproblem}} \newcommand{\bssp}{\begin{subsubproblem}} \newcommand{\essp}{\end{subsubproblem}} \newcommand{\atag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{section}.\alph{subsection}.\alph{equation}}} \newcommand{\btag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{section}.\alph{equation}}} \newcommand{\ctag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{equation}}} \newcommand{\dtag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\Alph{chapter}.\arabic{section}.\arabic{equation}}} \newcommand{\unts}[1]{\ \text{#1}} \newcommand{\textop}[1]{\operatorname{#1}} \newcommand{\textopl}[1]{\operatornamewithlimits{#1}} \newcommand{\prt}{\partial} \newcommand{\pderi}[3]{\frac{\prt^{#3}#1}{\prt #2^{#3}}} \newcommand{\deri}[3]{\frac{d^{#3}#1}{d #2^{#3}}} \newcommand{\del}{\vec\nabla} \newcommand{\exval}[1]{\langle #1\rangle} \newcommand{\bra}[1]{\langle #1|} \newcommand{\ket}[1]{|#1\rangle} \newcommand{\ham}{\mathcal{H}} \newcommand{\arr}{\mathfrak{r}} \newcommand{\conv}{\mathop{\scalebox{2}{\raisebox{-0.2ex}{$\ast$}}}} \newcommand{\bsm}{\lt(\begin{smallmatrix}} \newcommand{\esm}{\end{smallmatrix}\rt)} \newcommand{\bpm}{\begin{pmatrix}} \newcommand{\epm}{\end{pmatrix}} \newcommand{\bdet}{\lt|\begin{smallmatrix}} \newcommand{\edet}{\end{smallmatrix}\rt|} \newcommand{\bs}[1]{\boldsymbol{#1}} \newcommand{\uvec}[1]{\bs{\hat{#1}}} \newcommand{\qed}{\hfill$\Box$} $$
Tags:
  • statistics
  • Given an HMM with possible hidden states $\mathcal{H} = \{h_1,\dots,h_{n_H}\}$, possible emissions $\mathcal{X} = \{x_1,\dots,x_{n_E}\}$, an actual hidden state sequence $H_{1:T}=\{h_1,\dots,h_T\}$ and emission sequence $X_{1:T}=\{x_1,\dots,x_T\}$, and transition/emission probabilities $\mathbf\Theta = \{\mathbf \Pi\in\mathbb{R}^{n_H\times n_H}, \mathbf E\in\mathbb{R}^{n_E\times n_H}\}$, with entries $\pi_{i,j}\in\mathbf\Pi$ and $e_{i,j}\in\mathbf E$.

    Various quantities of interest

    1. $p(h_t|X_{1:t},\mathbf\Theta)$, hidden state given previous emissions
    2. $p(h_t|X_{1:T},\mathbf\Theta)$, hidden state given all emissions
    3. $\arg\max_{H_{1:T}} p(H_{1:T}|X_{1:T},\mathbf\Theta)$, the best sequence of hidden states
    4. $\arg\max_{\mathbf\Theta}p(X_{1:T}|\mathbf\Theta)$, the best transition/emission parameters

    The overall likelihood of a sequence of hidden states $H_{1:t}$ and observed states $X_{1:t}$ is

    \[\begin{align*} p(X_{1:t},H_{1:t}|\mathbf\Theta) &= \prod_{\tau = 1}^t p(x_\tau|h_\tau) p(h_\tau|h_{\tau - 1})\\ &= \prod_{\tau = 1}^t e_{x_\tau,h_\tau} \pi_{h_\tau,h_{\tau-1}}, \end{align*}\]

    where the $\tau = 1$ factor $p(h_1|h_0)$ is understood to be the initial hidden state distribution $p(h_1)$.
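    As a concrete sketch of this likelihood, here is a small Python function evaluated on a hypothetical toy HMM (two hidden states, three emission symbols; all parameter values are made up for illustration). Rows of $\mathbf E$ index emissions and columns index hidden states, matching $e_{x_\tau,h_\tau}$; $\mathbf\Pi$ is indexed destination-state-first, matching $\pi_{h_\tau,h_{\tau-1}}$.

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
# Pi[i, j] = p(h_t = i | h_{t-1} = j); E[i, j] = p(x_t = i | h_t = j).
Pi = np.array([[0.7, 0.4],
               [0.3, 0.6]])
E = np.array([[0.5, 0.1],
              [0.4, 0.2],
              [0.1, 0.7]])
p0 = np.array([0.6, 0.4])  # initial hidden-state distribution p(h_1)

def joint_likelihood(xs, hs, Pi, E, p0):
    """p(X_{1:t}, H_{1:t} | Theta) for one fully specified path."""
    p = p0[hs[0]] * E[xs[0], hs[0]]          # p(h_1) p(x_1 | h_1)
    for tau in range(1, len(xs)):
        p *= Pi[hs[tau], hs[tau - 1]] * E[xs[tau], hs[tau]]
    return p
```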

    Hidden state given previous emissions

    By definition of conditional probability,

    \[p(h_t|X_{1:t},\mathbf\Theta) = \frac{p(X_{1:t},h_t|\mathbf\Theta)}{p(X_{1:t}|\mathbf\Theta)}.\]

    The emission likelihood in the denominator could be computed naively by summing the joint likelihood over all possible sequences of hidden states,

    \[\begin{align*} p(X_{1:t}|\mathbf\Theta) &= \sum_{h_1,\dots,h_t} p(X_{1:t},H_{1:t}|\mathbf\Theta)\\ &= \sum_{h_1\in\mathcal{H},\dots,h_t\in\mathcal{H}} \prod_{\tau = 1}^t p(x_\tau|h_\tau) p(h_\tau|h_{\tau - 1}), \end{align*}\]

    which is $O(n_H^t)$ and thus intractable. Likewise, the numerator could be computed as

    \[p(X_{1:t},h_t|\mathbf\Theta) = \sum_{h_1,\dots,h_{t-1}} p(X_{1:t},H_{1:t}|\mathbf\Theta),\]

    which also runs in exponential time and is intractable.
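    To make the exponential cost concrete, the naive marginalization can be sketched as a literal sum over all $n_H^t$ hidden paths (hypothetical toy parameters, made up for illustration; this is only viable for tiny $t$):

```python
import itertools

import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def likelihood_bruteforce(xs, Pi, E, p0):
    """p(X_{1:t} | Theta) by enumerating all n_H^t hidden paths -- O(n_H^t)."""
    n_H = Pi.shape[0]
    total = 0.0
    for hs in itertools.product(range(n_H), repeat=len(xs)):
        p = p0[hs[0]] * E[xs[0], hs[0]]
        for tau in range(1, len(xs)):
            p *= Pi[hs[tau], hs[tau - 1]] * E[xs[tau], hs[tau]]
        total += p
    return total
```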

    We can instead sum over the hidden states inductively; rather than summing over all $n_H^t$ possible hidden paths in the full likelihood, we can instead sum over just the $n_H$ hidden states at each timestep. Given initial hidden state probability $p(h_1)$, the joint probability of $x_1,h_1$ is

    \[p(x_1,h_1|\mathbf\Theta) = p(x_1|h_1)p(h_1).\]

    The joint probability of $x_1,x_2,h_2,h_1$ is

    \[p(x_1,x_2,h_1,h_2|\mathbf\Theta) = p(x_1,h_1|\mathbf\Theta)p(x_2|h_2)p(h_2|h_1);\]

    marginalizing over $h_1$, we have

    \[p(x_1,x_2,h_2|\mathbf\Theta) = \sum_{h_1}p(h_1,x_1|\mathbf\Theta)p(x_2|h_2)p(h_2|h_1).\]

    Likewise, for $x_1,x_2,x_3,h_3$, we have

    \[p(x_1,x_2,x_3,h_3|\mathbf\Theta) = \sum_{h_2}p(x_1,x_2,h_2|\mathbf\Theta)p(x_3|h_3)p(h_3|h_2).\]

    This is the forward algorithm; given the base step

    \[p(x_1,h_1|\mathbf\Theta) = p(x_1|h_1)p(h_1),\]

    the $\tau$th inductive step is

    \[p(x_1,\dots,x_\tau,h_\tau|\mathbf\Theta) = \sum_{h_{\tau-1}}p(x_1,\dots,x_{\tau-1},h_{\tau-1}|\mathbf\Theta)p(x_\tau|h_\tau)p(h_\tau|h_{\tau-1}).\]

    Each inductive step sums over $n_H$ previous hidden states for each of the $n_H$ current hidden states, so the full recursion runs in $O(t\, n_H^2)$ time rather than $O(n_H^t)$.

    From this it is straightforward to compute

    \[p(h_t|X_{1:t},\mathbf\Theta)=\frac{p(x_1,\dots,x_t,h_t|\mathbf\Theta)}{\sum_{h_t}p(x_1,\dots,x_t,h_t|\mathbf\Theta)}.\]
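    The recursion above can be sketched in a few lines of Python (hypothetical toy parameters, made up for illustration; `alpha[tau]` holds the vector $p(x_1,\dots,x_{\tau+1},h_{\tau+1}|\mathbf\Theta)$ with 0-based indexing):

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def forward(xs, Pi, E, p0):
    """alpha[t - 1, h] = p(x_1, ..., x_t, h_t = h | Theta); O(T n_H^2)."""
    T, n_H = len(xs), Pi.shape[0]
    alpha = np.zeros((T, n_H))
    alpha[0] = E[xs[0]] * p0                             # base step
    for tau in range(1, T):
        alpha[tau] = E[xs[tau]] * (Pi @ alpha[tau - 1])  # inductive step
    return alpha

def filtered_posterior(xs, Pi, E, p0):
    """p(h_t | X_{1:t}, Theta): normalize the final forward vector."""
    alpha = forward(xs, Pi, E, p0)
    return alpha[-1] / alpha[-1].sum()
```

    Summing the final row of `alpha` over $h_t$ gives the emission likelihood $p(X_{1:t}|\mathbf\Theta)$, which is the normalizer in the last equation above.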

    Forward-backward algorithm

    We may also be interested in computing the probability of the hidden state at step $t$, given all observed states, i.e. including those that come after $t$. We do this by starting with the full joint probability of all observed states and the hidden state at $t$, which we factor as

    \[\begin{align*} p(X_{1:t}, X_{(t+1):T},h_t)&=p(X_{(t+1):T}|\underbrace{X_{1:t}}_{\emptyset}, h_t)p(X_{1:t}, h_t)\\ &=p(x_{t+1},\dots,x_T|h_t)p(x_1,\dots,x_t,h_t) \end{align*}\]

    The right factor comes from the forward algorithm above. For the left factor $p(x_{t+1},\dots,x_T|h_t)$, we first note that the conditioning on $X_{1:t}$ disappears because, by the HMM graphical model, the emissions after $t$ are conditionally independent of the emissions up to $t$ given the hidden state $h_t$. We then take the joint probability of $x_T,h_T|h_{T-1}$, and note that this can be marginalized to $p(x_T|h_{T-1})$,

    \[p(x_T|h_{T-1}) = \sum_{h_T} p(x_T|h_T)p(h_T|h_{T-1}).\]

    Likewise, the joint probability of $x_T,x_{T-1},h_{T-1}|h_{T-2}$ is

    \[p(x_T,x_{T-1},h_{T-1}|h_{T-2}) = p(x_T|h_{T-1})p(h_{T-1}|h_{T-2})p(x_{T-1}|h_{T-1});\]

    marginalizing out $h_{T-1}$, we get

    \[p(x_T,x_{T-1}|h_{T-2}) = \sum_{h_{T-1}}p(x_T|h_{T-1})p(h_{T-1}|h_{T-2})p(x_{T-1}|h_{T-1}).\]

    Explicitly writing out one more recursion for clarity:

    \[p(x_T,x_{T-1},x_{T-2}|h_{T-3}) = \sum_{h_{T-2}}p(x_T,x_{T-1}|h_{T-2})p(h_{T-2}|h_{T-3})p(x_{T-2}|h_{T-2})\]

    This is the backward algorithm; given the base step

    \[p(x_T|h_{T-1}) = \sum_{h_T} p(x_T|h_T)p(h_T|h_{T-1}),\]

    the $b$th inductive step yields

    \[p(x_T,x_{T-1},\dots, x_{T-b}|h_{T-b-1})=\sum_{h_{T - b}} p(x_T,x_{T-1},\dots, x_{T-b+1}|h_{T-b})p(h_{T-b}|h_{T-b-1})p(x_{T-b}|h_{T-b}).\]

    We can thus compute $p(h_t|X_{1:t},X_{(t+1):T})$ from the full joint distribution by normalizing,

    \[\begin{align*} p(h_t|X_{1:t}, X_{(t+1):T})&=\frac{p(X_{1:t}, X_{(t+1):T},h_t)}{p(X_{1:t}, X_{(t+1):T})}\\ &=\frac{p(x_{t+1},\dots,x_T|h_t)p(x_1,\dots,x_t,h_t)}{\sum_{h_T}p(x_1,\dots,x_T,h_T)} \end{align*}\]
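    Putting the two passes together, here is a minimal forward-backward sketch (hypothetical toy parameters, made up for illustration; `beta[tau]` holds $p(x_{\tau+2},\dots,x_T|h_{\tau+1})$ with 0-based indexing, so the last row is all ones):

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def forward(xs, Pi, E, p0):
    """alpha[t - 1, h] = p(x_1, ..., x_t, h_t = h | Theta)."""
    alpha = np.zeros((len(xs), Pi.shape[0]))
    alpha[0] = E[xs[0]] * p0
    for tau in range(1, len(xs)):
        alpha[tau] = E[xs[tau]] * (Pi @ alpha[tau - 1])
    return alpha

def backward(xs, Pi, E):
    """beta[t - 1, h] = p(x_{t+1}, ..., x_T | h_t = h); last row is 1."""
    beta = np.ones((len(xs), Pi.shape[0]))
    for tau in range(len(xs) - 2, -1, -1):
        # sum over the next hidden state h': p(x_{tau+2}|h') beta(h') p(h'|h)
        beta[tau] = Pi.T @ (E[xs[tau + 1]] * beta[tau + 1])
    return beta

def smoothed_posterior(xs, Pi, E, p0):
    """p(h_t | X_{1:T}, Theta) for every t, via forward-backward."""
    gamma = forward(xs, Pi, E, p0) * backward(xs, Pi, E)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

    A useful sanity check: $\sum_{h_t}\alpha_t(h_t)\beta_t(h_t) = p(X_{1:T}|\mathbf\Theta)$ for every $t$, so the normalizer is the same at each row of `gamma`.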

    Baum-Welch algorithm

    Viterbi algorithm

    Misc notes

    • Silent states do not advance the $t$ counter in the aforementioned algorithms.
    • Is there a name for the graph generated in these algorithms? What if the HMM is unrolled into that graph? (The unrolled state-time graph is commonly called a trellis.)