$$ \newcommand{\problemdivider}{\begin{center}\large \bf\ldots\ldots\ldots\ldots\ldots\ldots\end{center}} \newcommand{\subproblemdivider}{\begin{center}\large \bf\ldots\ldots\end{center}} \newcommand{\pdiv}{\problemdivider} \newcommand{\spdiv}{\subproblemdivider} \newcommand{\ba}{\begin{align*}} \newcommand{\ea}{\end{align*}} \newcommand{\rt}{\right} \newcommand{\lt}{\left} \newcommand{\bp}{\begin{problem}} \newcommand{\ep}{\end{problem}} \newcommand{\bsp}{\begin{subproblem}} \newcommand{\esp}{\end{subproblem}} \newcommand{\bssp}{\begin{subsubproblem}} \newcommand{\essp}{\end{subsubproblem}} \newcommand{\atag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{section}.\alph{subsection}.\alph{equation}}} \newcommand{\btag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{section}.\alph{equation}}} \newcommand{\ctag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\arabic{equation}}} \newcommand{\dtag}[1]{\addtocounter{equation}{1}\label{#1}\tag{\Alph{chapter}.\arabic{section}.\arabic{equation}}} \newcommand{\unts}[1]{\ \text{#1}} \newcommand{\textop}[1]{\operatorname{#1}} \newcommand{\textopl}[1]{\operatornamewithlimits{#1}} \newcommand{\prt}{\partial} \newcommand{\pderi}[3]{\frac{\prt^{#3}#1}{\prt #2^{#3}}} \newcommand{\deri}[3]{\frac{d^{#3}#1}{d #2^{#3}}} \newcommand{\del}{\vec\nabla} \newcommand{\exval}[1]{\langle #1\rangle} \newcommand{\bra}[1]{\langle #1|} \newcommand{\ket}[1]{|#1\rangle} \newcommand{\ham}{\mathcal{H}} \newcommand{\arr}{\mathfrak{r}} \newcommand{\conv}{\mathop{\scalebox{2}{\raisebox{-0.2ex}{$\ast$}}}} \newcommand{\bsm}{\lt(\begin{smallmatrix}} \newcommand{\esm}{\end{smallmatrix}\rt)} \newcommand{\bpm}{\begin{pmatrix}} \newcommand{\epm}{\end{pmatrix}} \newcommand{\bdet}{\lt|\begin{smallmatrix}} \newcommand{\edet}{\end{smallmatrix}\rt|} \newcommand{\bs}[1]{\boldsymbol{#1}} \newcommand{\uvec}[1]{\bs{\hat{#1}}} \newcommand{\qed}{\hfill$\Box$} $$
Tags:
  • statistics
  • Given an HMM with possible hidden states $\mathcal{H} = \{h_1,\dots,h_{n_H}\}$, possible emissions $\mathcal{X} = \{x_1,\dots,x_{n_E}\}$, an actual hidden state sequence $H_{1:T}=\{h_1,\dots,h_T\}$ and emission sequence $X_{1:T}=\{x_1,\dots,x_T\}$, and transition/emission probabilities $\mathbf\Theta = \{\mathbf \Pi\in\mathbb{R}^{n_H\times n_H}, \mathbf E\in\mathbb{R}^{n_E\times n_H}\}$, with entries $\pi_{i,j}\in\mathbf\Pi$ and $e_{i,j}\in\mathbf E$.

    Various quantities of interest

    1. $p(h_t|X_{1:t},\mathbf\Theta)$, hidden state given previous emissions
    2. $p(h_t|X_{1:T},\mathbf\Theta)$, hidden state given all emissions
    3. $\arg\max_{H_{1:T}} p(H_{1:T}|X_{1:T},\mathbf\Theta)$, the best sequence of hidden states
    4. $\arg\max_{\mathbf\Theta}p(X_{1:T}|\mathbf\Theta)$, the best transition/emission parameters

    The overall likelihood of a sequence of hidden states $H_{1:t}$ and observed states $X_{1:t}$ is

    \[\begin{align*} p(X_{1:t},H_{1:t}|\mathbf\Theta) &= \prod_{\tau = 1}^t p(x_\tau|h_\tau) p(h_\tau|h_{\tau - 1})\\ &= \prod_{\tau = 1}^t e_{x_\tau,h_\tau} \pi_{h_\tau,h_{\tau-1}}, \end{align*}\]

    where the $\tau = 1$ factor $p(h_1|h_0)$ is understood to be the initial hidden state distribution $p(h_1)$.
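    As a concrete sketch of this likelihood, here is a small Python function evaluated on a hypothetical toy HMM (two hidden states, three emission symbols; all parameter values are made up for illustration). Rows of $\mathbf E$ index emissions and columns index hidden states, matching $e_{x_\tau,h_\tau}$; $\mathbf\Pi$ is indexed destination-state-first, matching $\pi_{h_\tau,h_{\tau-1}}$.

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
# Pi[i, j] = p(h_t = i | h_{t-1} = j); E[i, j] = p(x_t = i | h_t = j).
Pi = np.array([[0.7, 0.4],
               [0.3, 0.6]])
E = np.array([[0.5, 0.1],
              [0.4, 0.2],
              [0.1, 0.7]])
p0 = np.array([0.6, 0.4])  # initial hidden-state distribution p(h_1)

def joint_likelihood(xs, hs, Pi, E, p0):
    """p(X_{1:t}, H_{1:t} | Theta) for one fully specified path."""
    p = p0[hs[0]] * E[xs[0], hs[0]]          # p(h_1) p(x_1 | h_1)
    for tau in range(1, len(xs)):
        p *= Pi[hs[tau], hs[tau - 1]] * E[xs[tau], hs[tau]]
    return p
```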

    Hidden state given previous emissions

    By definition of conditional probability,

    \[p(h_t|X_{1:t},\mathbf\Theta) = \frac{p(X_{1:t},h_t|\mathbf\Theta)}{p(X_{1:t}|\mathbf\Theta)}.\]

    The emission likelihood in the denominator could be computed naively by summing the joint likelihood over all possible sequences of hidden states,

    \[\begin{align*} p(X_{1:t}|\mathbf\Theta) &= \sum_{h_1,\dots,h_t} p(X_{1:t},H_{1:t}|\mathbf\Theta)\\ &= \sum_{h_1\in\mathcal{H},\dots,h_t\in\mathcal{H}} \prod_{\tau = 1}^t p(x_\tau|h_\tau) p(h_\tau|h_{\tau - 1}), \end{align*}\]

    which is $O(n_H^t)$ and thus intractable. Likewise, the numerator could be computed as

    \[p(X_{1:t},h_t|\mathbf\Theta) = \sum_{h_1,\dots,h_{t-1}} p(X_{1:t},H_{1:t}|\mathbf\Theta),\]

    which also runs in exponential time and is intractable.
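    To make the exponential cost concrete, the naive marginalization can be sketched as a literal sum over all $n_H^t$ hidden paths (hypothetical toy parameters, made up for illustration; this is only viable for tiny $t$):

```python
import itertools

import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def likelihood_bruteforce(xs, Pi, E, p0):
    """p(X_{1:t} | Theta) by enumerating all n_H^t hidden paths -- O(n_H^t)."""
    n_H = Pi.shape[0]
    total = 0.0
    for hs in itertools.product(range(n_H), repeat=len(xs)):
        p = p0[hs[0]] * E[xs[0], hs[0]]
        for tau in range(1, len(xs)):
            p *= Pi[hs[tau], hs[tau - 1]] * E[xs[tau], hs[tau]]
        total += p
    return total
```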

    We can instead sum over the hidden states inductively; rather than summing over all $n_H^t$ possible hidden paths in the full likelihood, we can instead sum over just the $n_H$ hidden states at each timestep. Given initial hidden state probability $p(h_1)$, the joint probability of $x_1,h_1$ is

    \[p(x_1,h_1|\mathbf\Theta) = p(x_1|h_1)p(h_1).\]

    The joint probability of $x_1,x_2,h_2,h_1$ is

    \[p(x_1,x_2,h_1,h_2|\mathbf\Theta) = p(x_1,h_1|\mathbf\Theta)p(x_2|h_2)p(h_2|h_1);\]

    marginalizing over $h_1$, we have

    \[p(x_1,x_2,h_2|\mathbf\Theta) = \sum_{h_1}p(h_1,x_1|\mathbf\Theta)p(x_2|h_2)p(h_2|h_1).\]

    Likewise, for $x_1,x_2,x_3,h_3$, we have

    \[p(x_1,x_2,x_3,h_3|\mathbf\Theta) = \sum_{h_2}p(x_1,x_2,h_2|\mathbf\Theta)p(x_3|h_3)p(h_3|h_2).\]

    This is the forward algorithm; given the base step

    \[p(x_1,h_1|\mathbf\Theta) = p(x_1|h_1)p(h_1),\]

    the $\tau$th inductive step is

    \[p(x_1,\dots,x_\tau,h_\tau|\mathbf\Theta) = \sum_{h_{\tau-1}}p(x_1,\dots,x_{\tau-1},h_{\tau-1}|\mathbf\Theta)p(x_\tau|h_\tau)p(h_\tau|h_{\tau-1}).\]

    Each inductive step sums over $n_H$ previous hidden states for each of the $n_H$ current hidden states, so the full recursion runs in $O(t\, n_H^2)$ time rather than $O(n_H^t)$.

    From this it is straightforward to compute

    \[p(h_t|X_{1:t},\mathbf\Theta)=\frac{p(x_1,\dots,x_t,h_t|\mathbf\Theta)}{\sum_{h_t}p(x_1,\dots,x_t,h_t|\mathbf\Theta)}.\]
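    The recursion above can be sketched in a few lines of Python (hypothetical toy parameters, made up for illustration; `alpha[tau]` holds the vector $p(x_1,\dots,x_{\tau+1},h_{\tau+1}|\mathbf\Theta)$ with 0-based indexing):

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def forward(xs, Pi, E, p0):
    """alpha[t - 1, h] = p(x_1, ..., x_t, h_t = h | Theta); O(T n_H^2)."""
    T, n_H = len(xs), Pi.shape[0]
    alpha = np.zeros((T, n_H))
    alpha[0] = E[xs[0]] * p0                             # base step
    for tau in range(1, T):
        alpha[tau] = E[xs[tau]] * (Pi @ alpha[tau - 1])  # inductive step
    return alpha

def filtered_posterior(xs, Pi, E, p0):
    """p(h_t | X_{1:t}, Theta): normalize the final forward vector."""
    alpha = forward(xs, Pi, E, p0)
    return alpha[-1] / alpha[-1].sum()
```

    Summing the final row of `alpha` over $h_t$ gives the emission likelihood $p(X_{1:t}|\mathbf\Theta)$, which is the normalizer in the last equation above.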

    Forward-backward algorithm

    We may also be interested in computing the probability of the hidden state at step $t$, given all observed states, i.e. including those that come after $t$. We do this by starting with the full joint probability of all observed states and the hidden state at $t$, which we factor as

    \[\begin{align*} p(X_{1:t}, X_{(t+1):T},h_t)&=p(X_{(t+1):T}|\underbrace{X_{1:t}}_{\emptyset}, h_t)p(X_{1:t}, h_t)\\ &=p(x_{t+1},\dots,x_T|h_t)p(x_1,\dots,x_t,h_t) \end{align*}\]

    The right factor comes from the forward algorithm above. For the left factor $p(x_{t+1},\dots,x_T|h_t)$, we first note that the conditioning on $X_{1:t}$ disappears because, by the HMM graphical model, the emissions after $t$ are conditionally independent of the emissions up to $t$ given the hidden state $h_t$. We then take the joint probability of $x_T,h_T|h_{T-1}$, and note that this can be marginalized to $p(x_T|h_{T-1})$,

    \[p(x_T|h_{T-1}) = \sum_{h_T} p(x_T|h_T)p(h_T|h_{T-1}).\]

    Likewise, the joint probability of $x_T,x_{T-1},h_{T-1}|h_{T-2}$ is

    \[p(x_T,x_{T-1},h_{T-1}|h_{T-2}) = p(x_T|h_{T-1})p(h_{T-1}|h_{T-2})p(x_{T-1}|h_{T-1});\]

    marginalizing out $h_{T-1}$, we get

    \[p(x_T,x_{T-1}|h_{T-2}) = \sum_{h_{T-1}}p(x_T|h_{T-1})p(h_{T-1}|h_{T-2})p(x_{T-1}|h_{T-1}).\]

    Explicitly writing out one more recursion for clarity:

    \[p(x_T,x_{T-1},x_{T-2}|h_{T-3}) = \sum_{h_{T-2}}p(x_T,x_{T-1}|h_{T-2})p(h_{T-2}|h_{T-3})p(x_{T-2}|h_{T-2})\]

    This is the backward algorithm; given the base step

    \[p(x_T|h_{T-1}) = \sum_{h_T} p(x_T|h_T)p(h_T|h_{T-1}),\]

    the $b$th inductive step yields

    \[p(x_T,x_{T-1},\dots, x_{T-b}|h_{T-b-1})=\sum_{h_{T - b}} p(x_T,x_{T-1},\dots, x_{T-b+1}|h_{T-b})p(h_{T-b}|h_{T-b-1})p(x_{T-b}|h_{T-b}).\]

    We can thus compute $p(h_t|X_{1:t},X_{(t+1):T})$ from the full joint distribution by normalizing,

    \[\begin{align*} p(h_t|X_{1:t}, X_{(t+1):T})&=\frac{p(X_{1:t}, X_{(t+1):T},h_t)}{p(X_{1:t}, X_{(t+1):T})}\\ &=\frac{p(x_{t+1},\dots,x_T|h_t)p(x_1,\dots,x_t,h_t)}{\sum_{h_T}p(x_1,\dots,x_T,h_T)} \end{align*}\]
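    Putting the two passes together, here is a minimal forward-backward sketch (hypothetical toy parameters, made up for illustration; `beta[tau]` holds $p(x_{\tau+2},\dots,x_T|h_{\tau+1})$ with 0-based indexing, so the last row is all ones):

```python
import numpy as np

# Hypothetical toy parameters (made up for illustration).
Pi = np.array([[0.7, 0.4], [0.3, 0.6]])             # Pi[i, j] = p(h_t=i | h_{t-1}=j)
E = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])  # E[i, j] = p(x=i | h=j)
p0 = np.array([0.6, 0.4])                           # p(h_1)

def forward(xs, Pi, E, p0):
    """alpha[t - 1, h] = p(x_1, ..., x_t, h_t = h | Theta)."""
    alpha = np.zeros((len(xs), Pi.shape[0]))
    alpha[0] = E[xs[0]] * p0
    for tau in range(1, len(xs)):
        alpha[tau] = E[xs[tau]] * (Pi @ alpha[tau - 1])
    return alpha

def backward(xs, Pi, E):
    """beta[t - 1, h] = p(x_{t+1}, ..., x_T | h_t = h); last row is 1."""
    beta = np.ones((len(xs), Pi.shape[0]))
    for tau in range(len(xs) - 2, -1, -1):
        # sum over the next hidden state h': p(x_{tau+2}|h') beta(h') p(h'|h)
        beta[tau] = Pi.T @ (E[xs[tau + 1]] * beta[tau + 1])
    return beta

def smoothed_posterior(xs, Pi, E, p0):
    """p(h_t | X_{1:T}, Theta) for every t, via forward-backward."""
    gamma = forward(xs, Pi, E, p0) * backward(xs, Pi, E)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

    A useful sanity check: $\sum_{h_t}\alpha_t(h_t)\beta_t(h_t) = p(X_{1:T}|\mathbf\Theta)$ for every $t$, so the normalizer is the same at each row of `gamma`.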

    Baum-Welch algorithm

    Viterbi algorithm

    Misc notes

    • Silent states do not advance the $t$ counter in the aforementioned algorithms.
    • Is there a name for the graph generated in these algorithms? What if the HMM is unrolled into that graph? (The unrolled state-time graph is commonly called a trellis.)