Suppose we have some data $\mathbf x = \{x_1,\dots,x_n\}$, which we assume to be drawn independently and identically (IID) from a discrete random variable $X\sim p(\theta)$. The probability mass function of $X$ is $p(x|\theta)$, which yields the discrete probability $\Pr[X=x|\theta]$, i.e. the probability that a random draw from $X$ equals $x$, given the model parameter $\theta$.
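As a concrete illustration, take a Bernoulli model with parameter $\theta\in[0,1]$, whose probability mass function is
\[p(x|\theta)=\theta^{x}(1-\theta)^{1-x},\qquad x\in\{0,1\},\]
so that $\Pr[X=1|\theta]=\theta$ and $\Pr[X=0|\theta]=1-\theta$.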
Most statistics texts define the likelihood $\mathcal L(\theta|\mathbf x)$ as the joint probability of the data being generated by $p(\theta)$. Because the data are assumed to be IID, this is simply the product of the individual probabilities,
\[\mathcal{L}(\theta|\mathbf{x}) = p(x_1,\dots,x_n|\theta) = \Pr[X=x_1\cap\dots\cap X=x_n|\theta] = \prod_{i=1}^n p(x_i|\theta).\]Since this is itself a probability mass function, summing over all possible values $\mathcal{X}$ for each of $x_1,\dots,x_n$ yields 1:
\[\sum_{x_1\in \mathcal{X}}\dots\sum_{x_n\in\mathcal{X}}\prod_{i=1}^n p(x_i|\theta)=1\]To fit our model to the data, we maximize the likelihood with respect to model parameters. This makes intuitive sense: a higher $\mathcal L$ means that our generative model $p(\theta)$ has a higher probability of generating $\mathbf x$, and is thus a better fit. Finding the optimal value $\hat\theta = \mathop{\arg\max}_\theta\mathcal{L}(\theta|\mathbf{x})$ yields the maximum joint probability of the data.
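Continuing the Bernoulli illustration above as one worked example, the likelihood of $\mathbf x$ is
\[\mathcal L(\theta|\mathbf x)=\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}=\theta^{k}(1-\theta)^{n-k},\qquad k=\sum_{i=1}^n x_i.\]
Setting the derivative of $\log\mathcal L$ (which has the same maximizer as $\mathcal L$) to zero,
\[\frac{k}{\theta}-\frac{n-k}{1-\theta}=0,\]
gives $\hat\theta = k/n$, the sample mean of the data.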