Suppose we have some data $\mathbf x = \{x_1,\dots,x_n\}$, which we assume to be drawn independently and identically (IID) from a discrete random variable $X\sim p(\theta)$. The probability mass function of $X$ is $p(x|\theta)$, which yields the discrete probability $\Pr[X=x|\theta]$, i.e. the probability that a random draw from $X$ equals $x$, given the model parameter $\theta$.
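As a concrete illustration, take a Bernoulli model with parameter $\theta\in[0,1]$, whose probability mass function is
\[p(x|\theta)=\theta^{x}(1-\theta)^{1-x},\qquad x\in\{0,1\},\]
so that $\Pr[X=1|\theta]=\theta$ and $\Pr[X=0|\theta]=1-\theta$.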
Most statistics texts define the likelihood $\mathcal L(\theta|\mathbf x)$ as the joint probability of the data being generated by $p(\theta)$. Because the data are assumed to be IID, this is simply the product of the individual probabilities,
\[\mathcal{L}(\theta|\mathbf{x}) = p(x_1,\dots,x_n|\theta) = \Pr[X=x_1\cap\dots\cap X=x_n|\theta] = \prod_{i=1}^n p(x_i|\theta).\]Since this is itself a probability mass function, summing over all possible values $\mathcal{X}$ for each of $x_1,\dots,x_n$ yields 1:
\[\sum_{x_1\in \mathcal{X}}\dots\sum_{x_n\in\mathcal{X}}\prod_{i=1}^n p(x_i|\theta)=1\]To fit our model to the data, we maximize the likelihood with respect to model parameters. This makes intuitive sense: a higher $\mathcal L$ means that our generative model $p(\theta)$ has a higher probability of generating $\mathbf x$, and is thus a better fit. Finding the optimal value $\hat\theta = \mathop{\arg\max}_\theta\mathcal{L}(\theta|\mathbf{x})$ yields the maximum joint probability of the data.
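Continuing the Bernoulli illustration above as one worked example, the likelihood of $\mathbf x$ is
\[\mathcal L(\theta|\mathbf x)=\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}=\theta^{k}(1-\theta)^{n-k},\qquad k=\sum_{i=1}^n x_i.\]
Setting the derivative of $\log\mathcal L$ (which has the same maximizer as $\mathcal L$) to zero,
\[\frac{k}{\theta}-\frac{n-k}{1-\theta}=0,\]
gives $\hat\theta = k/n$, the sample mean of the data.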