1.4 - Likelihood & LogLikelihood


One of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is λ. In the binomial, the parameter of interest is p (since n is typically fixed and known).

Likelihood is a tool for summarizing the data’s evidence about unknown parameters. Let us denote the unknown parameter(s) of a distribution generically by θ. Since the probability distribution depends on θ, we can make this dependence explicit by writing f(x) as f(x ; θ). For example, in the Bernoulli distribution the parameter is θ = π, and the distribution is

 \(f(x;\pi)=\pi^x(1-\pi)^{1-x}\qquad x=0,1\)    (2)

Once a value of X has been observed, we can plug this observed value x into f(x ; π) and obtain a function of π only. For example, if we observe X = 1, then plugging x = 1 into (2) gives the function π. If we observe X = 0, the function becomes 1 − π.

Whatever function of the parameter we get when we plug the observed data x into f(x ; θ), we call that function the likelihood function.
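To make the idea concrete, here is a minimal Python sketch of the Bernoulli case. The function names `bernoulli_pmf` and `bernoulli_likelihood` are hypothetical, chosen only for this illustration. The second function fixes the observed x and returns a function of π alone, which is exactly what the likelihood is.

```python
def bernoulli_pmf(x, pi):
    """f(x; pi) = pi**x * (1 - pi)**(1 - x), for x in {0, 1}."""
    return pi**x * (1 - pi)**(1 - x)


def bernoulli_likelihood(x):
    """Fix the observed x and return L(pi; x) as a function of pi only."""
    return lambda pi: bernoulli_pmf(x, pi)


L1 = bernoulli_likelihood(1)  # observed X = 1, so L(pi; 1) = pi
L0 = bernoulli_likelihood(0)  # observed X = 0, so L(pi; 0) = 1 - pi

# Evaluate both likelihood functions at a few parameter values.
for pi in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(pi, L1(pi), L0(pi))
```

The same formula is used throughout; only the argument that is held fixed changes.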

We write the likelihood function as \(L(\theta;x)=f(x;\theta)\) for a single observation x, as \(L(\theta;x)=\prod^n_{i=1}f(x_i;\theta)\) for an independent sample \(x_1,\ldots,x_n\), or sometimes just L(θ). Algebraically, the likelihood L(θ ; x) is just the same as the distribution f(x ; θ), but its meaning is quite different because it is regarded as a function of θ rather than a function of x. Consequently, a graph of the likelihood usually looks very different from a graph of the probability distribution.

For example, suppose that X has a Bernoulli distribution with unknown parameter π. We can graph the probability distribution for any fixed value of π. For example, if π = 0.5 the graph consists of two spikes of equal height 0.5, one at x = 0 and one at x = 1.


Now suppose that we observe a value of X, say X = 1. Plugging x = 1 into the distribution \(\pi^x(1-\pi)^{1-x}\) gives the likelihood function L(π ; x) = π, whose graph is a straight line rising from 0 at π = 0 to 1 at π = 1.


For discrete random variables, a graph of the probability distribution f(x ; θ) has spikes at specific values of x, whereas a graph of the likelihood L(θ ; x) is a continuous curve (e.g. a line) over the parameter space, the domain of possible values for θ.

L(θ ; x) summarizes the evidence about θ contained in the event X = x. L(θ ; x) is high for values of θ that make X = x more likely, and small for values of θ that make X = x unlikely. In the Bernoulli example, observing X = 1 gives some (albeit weak) evidence that π is nearer to 1 than to 0, so the likelihood for x = 1 rises as π moves from 0 to 1.

For example, if we observe $x$ from $Bin(n, \pi)$, the likelihood function is
\(L(\pi|x)=\frac{n!}{x!\,(n-x)!}\,\pi^x(1-\pi)^{n-x}.\)
Any multiplicative constant that does not depend on $\theta$ is irrelevant and may be discarded; thus,
\(L(\pi|x)\propto \pi^x(1-\pi)^{n-x}.\)
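One way to see why the constant is irrelevant is to compare the likelihood at two parameter values with and without it. The sketch below is an illustration only: the function names and the choice n = 5, x = 2, π = 0.4 versus π = 0.2 are assumptions made for the example.

```python
from math import comb


def binom_likelihood(pi, x, n):
    """Full binomial likelihood, including the constant n! / (x! (n - x)!)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)


def binom_kernel(pi, x, n):
    """Likelihood with the multiplicative constant discarded."""
    return pi**x * (1 - pi)**(n - x)


n, x = 5, 2
constant = comb(n, x)  # does not depend on pi

# Relative support for pi = 0.4 versus pi = 0.2, computed both ways.
ratio_full = binom_likelihood(0.4, x, n) / binom_likelihood(0.2, x, n)
ratio_kernel = binom_kernel(0.4, x, n) / binom_kernel(0.2, x, n)
print(constant, ratio_full, ratio_kernel)
```

The constant cancels in the ratio, so any comparison between parameter values is unchanged when it is dropped.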


In most cases, for various reasons but most often for computational convenience, we work with the loglikelihood
\(l(\theta|x)=\log L(\theta|x)\)

which is defined up to an arbitrary additive constant.

For example, the binomial loglikelihood is
\(l(\pi|x) = x \log \pi + (n- x) \log(1 - \pi).\)

In many problems of interest, we will derive our loglikelihood from a sample rather than from a single observation. If we observe an independent sample $x_1, x_2, ..., x_n$  from a distribution $f(x|\theta)$, then the overall likelihood is the product of the individual likelihoods:
\(L(\theta|x) = \prod_{i=1}^{n} f(x_i|\theta) = \prod_{i=1}^{n} L(\theta|x_i)\)

and the loglikelihood is:
\(l(\theta|x) = \log\prod_{i=1}^{n} f(x_i|\theta) = \sum_{i=1}^{n} \log f(x_i|\theta) = \sum_{i=1}^{n} l(\theta|x_i).\)
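The identity between the log of the product and the sum of the logs can be checked numerically. The sketch below uses a small made-up Bernoulli sample and an arbitrary value π = 0.6, both of which are assumptions for illustration. It also hints at the computational convenience of the loglikelihood: for a long sample the raw product underflows in floating point, while the sum of logs stays finite.

```python
import math


def bernoulli_pmf(x, pi):
    """f(x | pi) for a single Bernoulli observation."""
    return pi**x * (1 - pi)**(1 - x)


sample = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical data, not from the lesson
pi = 0.6                           # arbitrary parameter value to evaluate at

# Likelihood as a product, loglikelihood as a sum of individual terms.
likelihood = math.prod(bernoulli_pmf(x, pi) for x in sample)
loglik_from_product = math.log(likelihood)
loglik_from_sum = sum(math.log(bernoulli_pmf(x, pi)) for x in sample)
print(loglik_from_product, loglik_from_sum)

# With 4000 observations the product is too small for a float,
# but the sum of logs is still an ordinary finite number.
big_sample = sample * 500
big_product = math.prod(bernoulli_pmf(x, pi) for x in big_sample)
big_sum = sum(math.log(bernoulli_pmf(x, pi)) for x in big_sample)
print(big_product, big_sum)
```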

Binomial loglikelihood examples: 
Plots of the binomial loglikelihood function for n = 5 when we observe x = 0, x = 1, and x = 2 (see the lec1fig.R code in Canvas for how to produce these figures):
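As a rough stand-in for these curves, the following Python sketch evaluates the binomial loglikelihood on a grid of π values for each of the three observations. It is not the lec1fig.R code; the function name, the grid, and the `peaks` dictionary are assumptions made for this illustration.

```python
import math


def binom_loglik(pi, x, n):
    """l(pi | x) = x log(pi) + (n - x) log(1 - pi), additive constant dropped."""
    return x * math.log(pi) + (n - x) * math.log(1 - pi)


n = 5
grid = [i / 100 for i in range(1, 100)]  # pi = 0.01, ..., 0.99 (open interval)

peaks = {}
for x in (0, 1, 2):
    values = [binom_loglik(p, x, n) for p in grid]
    # Grid point at which the loglikelihood is highest for this x.
    peaks[x] = grid[values.index(max(values))]
    print(f"x = {x}: highest loglikelihood on the grid at pi = {peaks[x]}")
```

For x = 0 the loglikelihood reduces to 5 log(1 − π), which decreases steadily in π, so its highest grid value is at the left edge. For x = 1 and x = 2 the curve rises to an interior peak and then falls.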

In regular problems, as the total sample size $n$ grows, the loglikelihood function does two things:

  • it  becomes more sharply peaked around its maximum,  and
  • its shape becomes nearly quadratic (i.e. a  parabola, if there is a single parameter).
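Both effects can be seen numerically in the binomial case. The sketch below holds the observed proportion at x/n = 0.4 and lets n grow. The sample sizes, the fixed comparison point π = 0.5, and the step of two units of \(\sqrt{0.4(1-0.4)/n}\) used to test the quadratic shape are all choices made for this illustration. The peak of these curves is at π = x/n, and the curvature used for the quadratic is minus the second derivative of the loglikelihood at that peak, n / (0.4 × 0.6).

```python
import math


def binom_loglik(pi, x, n):
    """Binomial loglikelihood with the additive constant dropped."""
    return x * math.log(pi) + (n - x) * math.log(1 - pi)


pi_hat = 0.4  # observed proportion x / n, held fixed as n grows
results = []
for n in (5, 50, 500):
    x = round(pi_hat * n)
    peak = binom_loglik(pi_hat, x, n)

    # Sharpness: how far the curve falls between the peak and pi = 0.5.
    drop_fixed = peak - binom_loglik(0.5, x, n)

    # Shape: compare the exact drop with the quadratic approximation
    # at a point two "widths" away from the peak.
    width = math.sqrt(pi_hat * (1 - pi_hat) / n)
    pi_test = pi_hat + 2 * width
    exact_drop = peak - binom_loglik(pi_test, x, n)
    curvature = n / (pi_hat * (1 - pi_hat))
    quad_drop = 0.5 * curvature * (pi_test - pi_hat) ** 2

    results.append((n, drop_fixed, exact_drop, quad_drop))
    print(n, round(drop_fixed, 3), round(exact_drop, 3), round(quad_drop, 3))
```

The drop at the fixed point grows in proportion to n, which is the sharpening. The exact drop at the rescaled point moves closer to the quadratic value as n increases, which is the approach to a parabola.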

This is important because tests such as the Wald test, which is based on $z=\frac{\mbox{statistic}}{\mbox{SE of statistic}}$, work well only if the loglikelihood is well approximated by a quadratic function. For example, the loglikelihood for a normal-mean problem is exactly quadratic. As the sample size grows, the inference comes to resemble the normal-mean problem. This is true even for discrete data. The extent to which normal-theory approximations work for discrete data does not depend on how closely the distribution of responses resembles a normal curve, but on how closely the loglikelihood resembles a quadratic function.
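To see why the normal-mean loglikelihood is exactly quadratic, here is a short worked derivation. It assumes an independent sample $x_1, \ldots, x_n$ from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$; the symbols $\mu$, $\sigma^2$ and $\bar{x}$ (the sample mean) are introduced only for this illustration. Dropping terms that do not involve $\mu$,

\(l(\mu|x) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \mbox{const}.\)

Since \(\sum_{i=1}^{n}(x_i-\mu)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\) and the first term on the right does not involve $\mu$, this becomes

\(l(\mu|x) = -\frac{n}{2\sigma^2}(\mu-\bar{x})^2 + \mbox{const},\)

which is a parabola in $\mu$ that peaks at $\bar{x}$ and becomes narrower as $n$ grows.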

Transformations may help us to improve the shape of the loglikelihood. More on this in Section 1.6 on Alternative Parametrizations. Next we will see how we use the likelihood, or equivalently the corresponding loglikelihood, to estimate the most likely value of the unknown parameter of interest.