Maximum likelihood

Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data. Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model to provide an optimum fit.

The method was pioneered by the geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in many fields.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance (see examples below).

Loosely speaking, for a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. In the case of the normal distribution this gives a unique solution, although in more complex problems this may not be the case.

Prerequisites

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

Principles

Consider a family Dθ of probability distributions parameterized by an unknown parameter θ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as fθ. We draw a sample x1,x2,...,xn of n values from this distribution, and then using fθ we compute the (multivariate) probability density associated with our observed data, f_\theta(x_1,\dots,x_n \mid \theta).\,\!

As a function of θ with x1, ..., xn fixed, this is the likelihood function

\mathcal{L}(\theta) = f_{\theta}(x_1,\dots,x_n \mid \theta).\,\!

The method of maximum likelihood estimates θ by finding the value of θ that maximizes L(θ). This is the maximum likelihood estimator (MLE) of θ:

\widehat{\theta} = \underset{\theta}{\operatorname{arg\,max}}\ \mathcal{L}(\theta).


Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

\mathcal{L}(\theta) = \prod_{i=1}^n f_{\theta}(x_i \mid \theta)

and since the location of a maximum is unaffected by a strictly increasing transformation such as the logarithm, one can take the logarithm of this expression to turn it into a sum:

\mathcal{L}^*(\theta) = \sum_{i=1}^n \log f_{\theta}(x_i \mid \theta).

The maximum of this expression can then be found numerically using various optimization algorithms.
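
As a minimal sketch of this numerical approach (assuming NumPy and SciPy are available; the exponential model and the simulated data below are illustrative choices, not part of the original discussion), one can maximize a log-likelihood by minimizing its negative:

# Sketch of numerical maximum likelihood for an exponential model, assuming
# NumPy and SciPy; the simulated data and true rate 0.5 are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)   # simulated sample, true rate = 0.5

def neg_log_likelihood(rate):
    # log f(x_i | rate) = log(rate) - rate * x_i for an exponential model
    return -(np.log(rate) * len(x) - rate * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(result.x)          # numerical MLE of the rate
print(1.0 / x.mean())    # closed-form MLE, 1 / sample mean, for comparison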

This contrasts with seeking an unbiased estimator of θ, which may not coincide with the MLE but which, on average, neither over-estimates nor under-estimates the true value of θ.

Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.

Properties

Functional invariance

The maximum likelihood estimator (MLE) of a parameter θ can be used to calculate the MLE of a function of the parameter. Specifically, if \widehat{\theta} is the MLE for θ, and if g is a one-to-one function, then the MLE for α = g(θ) is

\widehat{\alpha} = g(\widehat{\theta}).\,\!

If g is not one-to-one, then g(\widehat{\theta}) is the MLE of α = g(θ) only if the likelihood function is modified to be

\bar{L}(\alpha) = \sup_{\theta:\, \alpha = g(\theta)} L(\theta).
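
For example (a standard textbook illustration, not taken from this article), if \widehat{\sigma}^2 is the MLE of a variance, then since g(u) = \sqrt{u} is one-to-one on the non-negative reals, the MLE of the standard deviation follows directly from invariance:

\widehat{\sigma} = g\!\left(\widehat{\sigma}^2\right) = \sqrt{\widehat{\sigma}^2}.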

Bias

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the number on the drawn ticket, even though the expectation of that estimator is only (n+1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
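
A quick simulation (a sketch assuming NumPy; the choice n = 100 and the number of repetitions are arbitrary) makes the gap visible: averaging the single-draw MLE over many repetitions gives a value near (n+1)/2 rather than n.

# Sketch of the ticket-drawing example, assuming NumPy; n = 100 is arbitrary.
import numpy as np

n = 100                                        # true (unknown) highest ticket number
rng = np.random.default_rng(0)
draws = rng.integers(1, n + 1, size=100_000)   # one ticket drawn per experiment

# The MLE of n from a single draw is the drawn number itself.
print(draws.mean())    # close to (n + 1) / 2 = 50.5, far below n = 100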

Asymptotics

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

  • The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
  • The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
  • The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.

The asymptotic unbiasedness and efficiency follow from the asymptotic normality: the limiting Gaussian distribution has mean θ and covariance equal to the Cramér-Rao bound.

The regularity conditions required to ensure this behavior are:

  1. The first and second derivatives of the log-likelihood function must be defined.
  2. The Fisher information matrix must be nonsingular (nonzero in the scalar case).

While these asymptotic properties only become strictly true in the limit of infinite sample size, in practice they are often assumed to be approximately true, especially when the sample size is not that small. In particular, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE.
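
The following sketch (assuming NumPy; the exponential model with true rate 0.5 and the sample sizes are arbitrary illustrative choices) hints at this behaviour: as n grows, the simulated MLEs of the rate become nearly unbiased and their spread approaches the square root of the inverse Fisher information, rate/sqrt(n).

# Sketch of the asymptotic behaviour of an MLE, assuming NumPy; the exponential
# model (true rate 0.5) and the sample sizes are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5
for n in (10, 100, 1000):
    samples = rng.exponential(scale=1.0 / rate, size=(20_000, n))
    mle = 1.0 / samples.mean(axis=1)   # MLE of the exponential rate is 1 / sample mean
    crlb_sd = rate / np.sqrt(n)        # square root of the inverse Fisher information
    print(n, mle.mean() - rate, mle.std(), crlb_sd)
    # the bias shrinks and the spread approaches the Cramér-Rao value as n grows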

Examples

Discrete distribution, finite parameter space

Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2, and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we do not know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. Depending on which coin we have, the likelihood takes one of three values:

\begin{matrix} \Pr(\mathrm{H} = 49 \mid p=1/3) & = & \binom{80}{49}(1/3)^{49}(1-1/3)^{31} \approx 0.000 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=1/2) & = & \binom{80}{49}(1/2)^{49}(1-1/2)^{31} \approx 0.012 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=2/3) & = & \binom{80}{49}(2/3)^{49}(1-2/3)^{31} \approx 0.054 \\ \end{matrix}

We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.
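
These three values can be reproduced directly, for example with the short sketch below (assuming SciPy is available; 49 heads out of 80 tosses and the three candidate values of p come from the example above).

# Reproducing the three candidate likelihoods, assuming SciPy is available.
from scipy.stats import binom

for p in (1/3, 1/2, 2/3):
    print(p, binom.pmf(49, 80, p))   # Pr(H = 49 | p), largest at p = 2/3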

Discrete distribution, continuous parameter space

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:

L(p) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49} p^{49}(1-p)^{31}

over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting to zero:

\begin{align} 0 &= \frac{\partial}{\partial p} \left( \binom{80}{49} p^{49}(1-p)^{31} \right) \\ &\propto 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} \\ &= p^{48}(1-p)^{30}\left[ 49(1-p) - 31p \right] \\ &= p^{48}(1-p)^{30}\left[ 49 - 80p \right] \end{align}
[Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the maximum likelihood estimate occurs at the mode (peak) of the curve.]

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
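
A numerical check of this result (a sketch assuming SciPy; t = 49 and n = 80 are taken from the example above) maximizes the binomial log-likelihood over p and recovers t/n:

# Numerical check that the binomial MLE equals t / n, assuming SciPy.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

t, n = 49, 80
res = minimize_scalar(lambda p: -binom.logpmf(t, n, p),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, t / n)   # both approximately 0.6125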

Continuous distribution, continuous parameter space

For the normal distribution \mathcal{N}(\mu, \sigma^2) which has probability density function

f(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \prod_{i=1}^{n} f( x_{i}\mid  \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{ \sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right),

or more conveniently:

f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right),

where \bar{x} is the sample mean.
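
The second form follows from the first by writing x_i - μ = (x_i - \bar{x}) + (\bar{x} - μ) and noting that the cross terms sum to zero:

\sum_{i=1}^{n}(x_i-\mu)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2.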

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood \mathcal{L} (\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma) over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.]

0 = \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right)
= \frac{\partial}{\partial \mu} \left( \log\left( \frac{1}{2\pi\sigma^2} \right)^{n/2} - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right)
= 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2}

which is solved by

\hat\mu = \bar{x} = \sum^{n}_{i=1}x_i/n.

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

E \left[ \widehat\mu \right] = \mu,

which means that the maximum-likelihood estimator \widehat\mu is unbiased.

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

0 = \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right)
= \frac{\partial}{\partial \sigma} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right)
= -\frac{n}{\sigma} + \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{\sigma^3}

which is solved by

\widehat\sigma^2 = \sum_{i=1}^n(x_i-\widehat{\mu})^2/n.

Inserting \widehat\mu we obtain

\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2                           -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n x_i x_j.

When we take the expectation, note that E[x_i^2] = \sigma^2 + \mu^2, while independence gives E[x_i x_j] = \mu^2 for i ≠ j; substituting these into the expression above, the \mu^2 terms cancel and we obtain

E \left[ \widehat{\sigma^2}  \right]= \frac{n-1}{n}\sigma^2.

This means that the estimator \widehat{\sigma}^2 is biased (it is, however, consistent).
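
A short simulation (a sketch assuming NumPy; μ = 0, σ = 2 and n = 5 are arbitrary choices) makes the factor (n-1)/n visible:

# Sketch illustrating E[sigma_hat^2] = ((n - 1) / n) * sigma^2, assuming NumPy;
# mu = 0, sigma = 2 and n = 5 are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 5
x = rng.normal(mu, sigma, size=(200_000, n))
sigma2_mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(sigma2_mle.mean())            # close to 3.2
print((n - 1) / n * sigma ** 2)     # (n - 1) / n * sigma^2 = 3.2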

Formally we say that the maximum likelihood estimator for θ = (μ, σ²) is:

\widehat{\theta} = \left(\widehat{\mu},\widehat{\sigma}^2\right).

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
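
As a closing sketch (assuming NumPy and SciPy; the simulated data are purely illustrative), the closed-form estimators derived above agree with a joint numerical maximization of the normal log-likelihood:

# Sketch comparing the closed-form normal MLEs with a joint numerical fit,
# assuming NumPy and SciPy; the simulated data are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=1000)

def neg_log_likelihood(params):
    mu, sigma = params
    # negative log-likelihood of N(mu, sigma^2), dropping the constant term
    return len(x) * np.log(sigma) + 0.5 * np.sum((x - mu) ** 2) / sigma ** 2

res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
               method="L-BFGS-B", bounds=[(None, None), (1e-6, None)])
print(res.x[0], res.x[1] ** 2)                 # numerical MLEs of mu and sigma^2
print(x.mean(), ((x - x.mean()) ** 2).mean())  # closed-form MLEs for comparison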

See also

  • likelihood function, a description of what likelihood functions are.
  • Delta method, a method for finding the distribution of functions of a maximum likelihood estimator.
  • mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
  • The Rao–Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.
  • sufficient statistic, a function of the data through which the MLE (if it exists and is unique) depends on the data.
