Maximum likelihood
From Wikipedia, the free encyclopedia
Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data. Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model so that the observed data are as probable as possible under the model. The method was pioneered by the geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields.
The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance (see examples below). Loosely speaking, for a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them: if a uniform prior distribution is assumed over the parameters, these coincide with the most probable values thereof. Maximum likelihood estimation gives a unique solution in the case of the normal distribution, although in more complex problems this may not be the case.
Prerequisites
The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.
Principles
Consider a family $D_\theta$ of probability distributions parameterized by an unknown parameter $\theta$ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as $f_\theta$. We draw a sample $x_1,x_2,\dots,x_n$ of n values from this distribution, and then using $f_\theta$ we compute the (multivariate) probability density associated with our observed data,
$$f_\theta(x_1,\dots,x_n \mid \theta).$$
As a function of $\theta$ with $x_1,\dots,x_n$ fixed, this is the likelihood function $\mathcal{L}(\theta) = f_\theta(x_1,\dots,x_n \mid \theta)$. The method of maximum likelihood estimates $\theta$ by the value $\widehat{\theta}$ that maximizes $\mathcal{L}(\theta)$, called the maximum likelihood estimator (MLE) of $\theta$.
This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that, on average, neither over-estimates nor under-estimates the true value of θ. Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.
Properties
Functional invariance
The maximum likelihood estimator (MLE) of a parameter θ can be used to calculate the MLE of a function of the parameter. Specifically, if $\widehat{\theta}$ is the MLE for θ, and if g is a one-to-one function, then the MLE for α = g(θ) is
$$\widehat{\alpha} = g(\widehat{\theta}).$$
If g is not one-to-one, then $\widehat{\alpha} = g(\widehat{\theta})$ is the MLE of α = g(θ) only if the likelihood function is modified to be
$$\bar{\mathcal{L}}(\alpha) = \sup_{\theta \,:\, g(\theta) = \alpha} \mathcal{L}(\theta).$$
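Functional invariance can be checked numerically. The following is a minimal sketch, assuming binomial data with 49 successes in 80 trials (the coin example used elsewhere in this article) and taking g(p) = p / (1 - p), the odds, as an illustrative one-to-one function. The grid, its step size and the function names are hypothetical choices made for this sketch.

```python
from math import log

heads, tosses = 49, 80  # data assumed for illustration

def log_likelihood_p(p):
    """Binomial log-likelihood in p, up to an additive constant."""
    return heads * log(p) + (tosses - heads) * log(1 - p)

def log_likelihood_odds(alpha):
    """Same log-likelihood, reparameterized by the odds alpha = p / (1 - p)."""
    # p = alpha / (1 + alpha) is the inverse of g(p) = p / (1 - p).
    return log_likelihood_p(alpha / (1 + alpha))

p_hat = heads / tosses                 # closed-form MLE of p
g_of_p_hat = p_hat / (1 - p_hat)       # g applied to the MLE of p

# Maximize directly over the odds on a hypothetical grid 0.001, 0.002, ..., 5.000.
odds_grid = [k / 1000 for k in range(1, 5001)]
alpha_hat = max(odds_grid, key=log_likelihood_odds)

print(f"g(p_hat)              = {g_of_p_hat:.4f}")
print(f"direct maximizer odds = {alpha_hat:.4f}")
```

Up to the grid resolution, maximizing over the odds directly gives the same value as transforming the MLE of p, which is what invariance predicts for a one-to-one g.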
Bias
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the value on the drawn ticket, even though the expected value of the drawn ticket is only (n+1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
Asymptotics
In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour. Under certain (fairly weak) regularity conditions, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include asymptotic unbiasedness, asymptotic efficiency, and asymptotic normality: as the number of measurements increases, the distribution of the MLE tends to a Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix. The asymptotic unbiasedness and efficiency follow from this limiting Gaussian distribution. This behavior is guaranteed only when the regularity conditions hold.
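The link between the spread of the MLE and the Fisher information can be illustrated with a small simulation. This is a minimal sketch, assuming a Bernoulli model with a hypothetical true parameter p = 0.3. The seed, sample size and number of replications are arbitrary choices for illustration. For n Bernoulli trials the Fisher information is n / (p(1 - p)), so the variance of the MLE (the sample proportion) should be close to p(1 - p) / n.

```python
import random
import statistics

def simulate_mle(p_true, n, replications, seed=0):
    """Return (mean, variance) of the Bernoulli MLE over repeated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        successes = sum(1 for _ in range(n) if rng.random() < p_true)
        estimates.append(successes / n)  # MLE of p is the sample proportion
    return statistics.fmean(estimates), statistics.pvariance(estimates)

p_true = 0.3   # hypothetical true parameter
n = 400        # measurements per sample
mean_hat, var_hat = simulate_mle(p_true, n, replications=2000)
inverse_fisher = p_true * (1 - p_true) / n

print(f"mean of MLE over replications = {mean_hat:.4f} (true p = {p_true})")
print(f"variance of MLE               = {var_hat:.6f}")
print(f"inverse Fisher information    = {inverse_fisher:.6f}")
```

A histogram of the simulated estimates would also be expected to look approximately Gaussian around p for large n, although the sketch does not plot it.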
While these asymptotic properties only become strictly true in the limit of infinite sample size, in practice they are often assumed to be approximately true, especially when the sample size is large. In particular, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE.
Examples
Discrete distribution, finite parameter space
Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function takes one of three values:
$$\Pr(H=49 \mid p=1/3) = \binom{80}{49}(1/3)^{49}(1-1/3)^{31} \approx 0.000$$
$$\Pr(H=49 \mid p=1/2) = \binom{80}{49}(1/2)^{49}(1-1/2)^{31} \approx 0.012$$
$$\Pr(H=49 \mid p=2/3) = \binom{80}{49}(2/3)^{49}(1-2/3)^{31} \approx 0.054$$
The likelihood is largest for p=2/3, so this is the maximum likelihood estimate of p.
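The three candidate likelihoods can be evaluated directly. The sketch below assumes the binomial model for 49 HEADS in 80 tosses described above. The function and variable names are illustrative only.

```python
from math import comb

def binomial_likelihood(p, heads=49, tosses=80):
    """Probability of observing `heads` HEADS in `tosses` tosses for a given p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# The three unlabeled coins from the example.
candidates = {"1/3": 1 / 3, "1/2": 1 / 2, "2/3": 2 / 3}
likelihoods = {label: binomial_likelihood(p) for label, p in candidates.items()}

for label, value in likelihoods.items():
    print(f"p = {label}: likelihood = {value:.3f}")

# The maximum likelihood choice is the coin with the largest likelihood.
best = max(likelihoods, key=likelihoods.get)
print("maximum likelihood coin: p =", best)
```

Because the parameter space is finite, maximization reduces to comparing three numbers. No calculus is needed.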
Discrete distribution, continuous parameter space
Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:
$$\mathcal{L}(p) = \binom{80}{49} p^{49} (1-p)^{31}$$
over all possible values 0 ≤ p ≤ 1.
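As a numerical sanity check, the maximizer can be located by brute force on a grid. This is a minimal sketch, not an efficient optimizer: the grid resolution is an arbitrary assumption, and the log-likelihood is used (without the constant binomial coefficient) because it has the same maximizer and avoids very small numbers.

```python
from math import log

def log_likelihood(p, heads=49, tosses=80):
    """Binomial log-likelihood in p, omitting the constant binomial coefficient."""
    return heads * log(p) + (tosses - heads) * log(1 - p)

# Hypothetical grid of interior points; p = 0 and p = 1 give zero likelihood
# and are excluded because log(0) is undefined.
grid = [k / 10000 for k in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)

print(f"grid maximizer = {p_hat:.4f}")
print(f"49/80          = {49 / 80:.4f}")
```

The grid search only approximates the maximizer to within the grid spacing. An exact answer requires an analytic argument.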
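Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the ML estimator occurs at the mode, the peak (maximum) of the curve.
One way to maximize this function is by differentiating with respect to p and setting to zero:
$$\frac{d}{dp}\left[\binom{80}{49} p^{49}(1-p)^{31}\right] = \binom{80}{49}\left[49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30}\right] = \binom{80}{49}\,p^{48}(1-p)^{30}\,(49 - 80p) = 0$$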
which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80. This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
Continuous distribution, continuous parameter space
Consider the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
The corresponding probability density function for a sample of n independent, identically distributed normal random variables (the likelihood) is
$$f(x_1,\ldots,x_n \mid \mu, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right),$$
or, more conveniently,
$$f(x_1,\ldots,x_n \mid \mu, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right),$$
where $\bar{x}$ is the sample mean. This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood $\mathcal{L}(\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma)$ over both parameters simultaneously, or if possible, individually. Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.]
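Differentiating the log-likelihood with respect to μ and equating to zero gives
$$0 = \frac{\partial}{\partial\mu}\left(-\frac{n}{2}\log(2\pi\sigma^2) - \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right) = \frac{n(\bar{x}-\mu)}{\sigma^2},$$
which is solved by
$$\widehat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,
$$E[\widehat\mu] = \mu,$$
which means that the maximum likelihood estimator $\widehat\mu$ is unbiased. Similarly, we differentiate the log-likelihood with respect to σ and equate to zero:
$$0 = \frac{\partial}{\partial\sigma}\left(-\frac{n}{2}\log(2\pi\sigma^2) - \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{\sigma^3},$$
which is solved by
$$\widehat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\widehat\mu)^2.$$
Inserting $\widehat\mu$ we obtain
$$\widehat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}x_i x_j.$$
When we calculate the expectation value, the terms of the double sum with i=j contribute $E[x_i^2] = \mu^2 + \sigma^2$, while the terms with i≠j contribute only $E[x_i x_j] = \mu^2$, by independence. We obtain
$$E[\widehat\sigma^2] = (\mu^2 + \sigma^2) - \left(\mu^2 + \frac{\sigma^2}{n}\right) = \frac{n-1}{n}\sigma^2.$$
This means that the estimator $\widehat\sigma^2$ is biased (however, $\widehat\sigma^2$ is consistent). Formally we say that the maximum likelihood estimator for $\theta=(\mu,\sigma^2)$ is:
$$\widehat\theta = (\widehat\mu, \widehat\sigma^2) = \left(\bar{x},\ \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right).$$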
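The closed-form estimators for the normal model are easy to compute and to compare with the usual unbiased variance estimator. The following is a minimal sketch on simulated data. The "true" mean of 170 and standard deviation of 10, the sample size and the seed are hypothetical values chosen only for illustration.

```python
import random
import statistics

def normal_mle(sample):
    """Return (mu_hat, sigma2_hat): the sample mean and the 1/n variance."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

rng = random.Random(42)
# Hypothetical data: 25 draws from a normal distribution (mean 170, sd 10).
sample = [rng.gauss(170.0, 10.0) for _ in range(25)]
n = len(sample)

mu_hat, sigma2_hat = normal_mle(sample)
unbiased_var = statistics.variance(sample)  # divides by n - 1 instead of n

print(f"mu_hat                  = {mu_hat:.3f}")
print(f"sigma2_hat (MLE, 1/n)   = {sigma2_hat:.3f}")
print(f"unbiased variance       = {unbiased_var:.3f}")
print(f"ratio MLE / unbiased    = {sigma2_hat / unbiased_var:.4f}")
print(f"(n - 1) / n             = {(n - 1) / n:.4f}")
```

On any sample the two variance estimates differ by exactly the factor (n - 1) / n, which mirrors the bias factor derived above and vanishes as n grows.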


