Maximum likelihood

From Wikipedia, the free encyclopedia

Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data. It tunes the free parameters of the model so that the observed data are as probable as possible under the model.

The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance (see examples below).

Loosely speaking, for a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them: if a uniform prior distribution is assumed over the parameters, these coincide with the most probable values thereof. Maximum likelihood estimation gives a unique solution in the case of the normal distribution, although in more complex problems this may not be the case.

Prerequisites

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

Principles

Consider a family $D_\theta$ of probability distributions parameterized by an unknown parameter $\theta$ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as $f_\theta$. We draw a sample $x_1, x_2, \dots, x_n$ of n values from this distribution, and then using $f_\theta$ we compute the (multivariate) probability density associated with our observed data,

$$f_\theta(x_1,\dots,x_n \mid \theta).$$


As a function of θ with $x_1, \dots, x_n$ fixed, this is the likelihood function

$$\mathcal{L}(\theta) = f_{\theta}(x_1,\dots,x_n \mid \theta).$$

The method of maximum likelihood estimates θ by finding the value of θ that maximizes $\mathcal{L}(\theta)$. This is the maximum likelihood estimator (MLE) of θ:

$$\widehat{\theta} = \underset{\theta}{\operatorname{arg\,max}}\ \mathcal{L}(\theta).$$


Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

$$\mathcal{L}(\theta) = \prod_{i=1}^n f_{\theta}(x_i \mid \theta)$$


and since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum:

$$\mathcal{L}^*(\theta) = \sum_{i=1}^n \log f_{\theta}(x_i \mid \theta).$$


The maximum of this expression can then be found numerically using various optimization algorithms.
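As a rough illustration of the numerical route, the sketch below maximizes a log-likelihood written as a sum of log densities. The exponential model, the simulated sample, and the helper names `log_likelihood` and `maximize_unimodal` are assumptions made for this example only; they do not come from the article. Ternary search is used because this particular log-likelihood is strictly concave, so it has a single peak.

```python
import math
import random


def log_likelihood(rate, xs):
    """Sum of log densities log f(x_i | rate) for an assumed exponential model."""
    return sum(math.log(rate) - rate * x for x in xs)


def maximize_unimodal(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a single-peaked function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1  # the peak lies to the right of m1
        else:
            hi = m2  # the peak lies to the left of m2
    return (lo + hi) / 2


# Simulated data: 1000 iid draws from an exponential distribution with rate 2.
rng = random.Random(0)
sample = [rng.expovariate(2.0) for _ in range(1000)]

rate_hat = maximize_unimodal(lambda r: log_likelihood(r, sample), 1e-6, 50.0)

# For this model the maximizer is also known in closed form: n / sum(x_i).
closed_form = len(sample) / sum(sample)
print(rate_hat, closed_form)
```

The numerical and closed-form answers should agree closely; for models without a closed form, only the numerical route is available.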

This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that (on average) will neither tend to over-estimate nor under-estimate the true value of θ.

Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.

Properties

Functional invariance

The maximum likelihood estimator (MLE) of a parameter θ can be used to calculate the MLE of a function of the parameter. Specifically, if $\widehat{\theta}$ is the MLE for θ, and if g is a one-to-one function, then the MLE for α = g(θ) is

$$\widehat{\alpha} = g(\widehat{\theta}).$$

If g is not one-to-one, then $g(\widehat{\theta})$ is the MLE of α = g(θ) only if the likelihood function is modified to be

$$\bar{\mathcal{L}}(\alpha) = \sup_{\theta:\, \alpha = g(\theta)} \mathcal{L}(\theta).$$
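Functional invariance can be checked numerically on a small case. The sketch below borrows the counts from the coin example in the Examples section (49 heads in 80 tosses, with MLE t/n for the heads probability) and takes the odds g(p) = p/(1-p) as the one-to-one function. The choice of g, the grid, and the function names are assumptions made for illustration.

```python
import math

t, n = 49, 80  # heads and tosses, as in the article's coin example


def log_lik_p(p):
    """Bernoulli log-likelihood in p (the binomial coefficient is constant and omitted)."""
    return t * math.log(p) + (n - t) * math.log(1 - p)


def odds(p):
    """One-to-one map g(p) = p / (1 - p) on 0 < p < 1."""
    return p / (1 - p)


def log_lik_odds(alpha):
    """The same log-likelihood, re-expressed in terms of the odds alpha."""
    p = alpha / (1 + alpha)
    return log_lik_p(p)


p_hat = t / n

# Maximize directly over the odds on a grid 0.001, 0.002, ..., 10.0.
grid = [k / 1000 for k in range(1, 10001)]
alpha_hat_numeric = max(grid, key=log_lik_odds)

print(odds(p_hat), alpha_hat_numeric)
```

Up to the grid resolution, maximizing over the odds directly gives the same value as transforming the MLE of p.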


Bias

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the number on the drawn ticket, even though the expected value of that number is only (n+1)/2, roughly half the true n. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
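The size of this bias can be seen in a short simulation. The sketch below assumes a box with n = 100 tickets (an arbitrary choice for illustration), computes the exact expectation of the drawn number, and compares it with the average of many simulated single-ticket draws.

```python
import random
from fractions import Fraction


def expected_ticket(n):
    """Exact expectation of a ticket drawn uniformly from 1..n."""
    return Fraction(sum(range(1, n + 1)), n)


def simulate_mean_estimate(n, trials, seed=0):
    """Average of the single-ticket MLE over many independent draws."""
    rng = random.Random(seed)
    return sum(rng.randint(1, n) for _ in range(trials)) / trials


n_true = 100  # hypothetical number of tickets
exact = expected_ticket(n_true)
simulated = simulate_mean_estimate(n_true, 100_000)

print(exact, simulated)  # both sit near (n + 1) / 2, far below n_true
```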

Asymptotics

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

  • The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
  • The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
  • The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean $\theta$ and covariance matrix equal to the inverse of the Fisher information matrix.

Asymptotic unbiasedness and asymptotic efficiency can be shown to follow from this limiting Gaussian distribution.

The regularity conditions required to ensure this behavior are:

  1. The first and second derivatives of the log-likelihood function must be defined.
  2. The Fisher information matrix must not be zero.

While these asymptotic properties only become strictly true in the limit of infinite sample size, in practice they are often assumed to be approximately true, especially when the sample size is reasonably large. In particular, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE.
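A small simulation can make the asymptotic claims concrete. The sketch below assumes a Bernoulli model with p = 0.3 and n = 400 trials per experiment (values chosen only for illustration), repeats the experiment many times, and compares the spread of the MLE with the inverse Fisher information, which for n Bernoulli trials is p(1-p)/n.

```python
import random
import statistics


def bernoulli_mle(p, n, rng):
    """MLE of p from n simulated Bernoulli trials: the fraction of successes."""
    return sum(rng.random() < p for _ in range(n)) / n


rng = random.Random(1)
p_true, n, reps = 0.3, 400, 2000  # hypothetical settings

estimates = [bernoulli_mle(p_true, n, rng) for _ in range(reps)]

empirical_mean = statistics.fmean(estimates)
empirical_var = statistics.pvariance(estimates)

# Inverse Fisher information for n Bernoulli trials.
asymptotic_var = p_true * (1 - p_true) / n

print(empirical_mean, empirical_var, asymptotic_var)
```

The empirical mean should sit close to the true p and the empirical variance close to p(1-p)/n, which is what the asymptotic Gaussian approximation predicts.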

Examples

Discrete distribution, finite parameter space

Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined above) takes one of three values:

$$\begin{aligned}
\Pr(\mathrm{H} = 49 \mid p=1/3) &= \binom{80}{49}(1/3)^{49}(1-1/3)^{31} \approx 0.000 \\
\Pr(\mathrm{H} = 49 \mid p=1/2) &= \binom{80}{49}(1/2)^{49}(1-1/2)^{31} \approx 0.012 \\
\Pr(\mathrm{H} = 49 \mid p=2/3) &= \binom{80}{49}(2/3)^{49}(1-2/3)^{31} \approx 0.054
\end{aligned}$$


We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.
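The three likelihood values can be reproduced with a few lines of code. This is a minimal sketch; the function name `binomial_likelihood` and the dictionary labels are chosen here for illustration.

```python
from math import comb


def binomial_likelihood(p, heads=49, tosses=80):
    """Probability of observing `heads` heads in `tosses` tosses when P(heads) = p."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)


# The three unlabeled coins from the box.
candidates = {"1/3": 1 / 3, "1/2": 1 / 2, "2/3": 2 / 3}

likelihoods = {label: binomial_likelihood(p) for label, p in candidates.items()}
best = max(likelihoods, key=likelihoods.get)

for label, value in likelihoods.items():
    print(f"p = {label}: likelihood = {value:.3f}")
print("maximum likelihood choice: p =", best)
```

With a finite parameter space, maximum likelihood reduces to evaluating the likelihood at each candidate and picking the largest.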

Discrete distribution, continuous parameter space

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:

$$\mathcal{L}(p) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49} p^{49}(1-p)^{31}$$


over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting to zero:

$$\begin{aligned}
0 &= \frac{\partial}{\partial p} \left( \binom{80}{49} p^{49}(1-p)^{31} \right) \\
  &\propto 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} \\
  &= p^{48}(1-p)^{30}\left[ 49(1-p) - 31p \right] \\
  &= p^{48}(1-p)^{30}\left[ 49 - 80p \right]
\end{aligned}$$

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

[Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the ML estimator occurs at the mode, the peak (maximum) of the curve.]

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
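The general result t / n can be checked against a brute-force search. The sketch below evaluates the Bernoulli log-likelihood on a grid of p values and compares the grid maximizer with t / n for a few (t, n) pairs; the grid resolution, the pairs other than (49, 80), and the function names are assumptions made for illustration. The binomial coefficient is left out because it does not depend on p.

```python
import math


def bernoulli_log_likelihood(p, t, n):
    """Log-likelihood of t successes in n Bernoulli trials, up to an additive constant."""
    return t * math.log(p) + (n - t) * math.log(1 - p)


def grid_mle(t, n, steps=10_000):
    """Maximize the log-likelihood over p = 1/steps, 2/steps, ..., (steps-1)/steps."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: bernoulli_log_likelihood(p, t, n))


results = {(t, n): grid_mle(t, n) for t, n in [(49, 80), (3, 10), (7, 20)]}

for (t, n), p_grid in results.items():
    print(f"t={t}, n={n}: grid maximizer {p_grid:.4f}, closed form {t / n:.4f}")
```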

Continuous distribution, continuous parameter space

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function

$$f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

$$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right),$$

or more conveniently:

$$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right),$$

where $\bar{x}$ is the sample mean.

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood $\mathcal{L}(\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma)$ over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.]

$$\begin{aligned}
0 &= \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right) \\
  &= \frac{\partial}{\partial \mu} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\
  &= 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2}
\end{aligned}$$

which is solved by

$$\widehat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

$$E\left[ \widehat\mu \right] = \mu,$$

which means that the maximum-likelihood estimator $\widehat\mu$ is unbiased.
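Both claims about the estimator of μ can be verified in one line each. The following is a short worked check, using only the log-likelihood and the iid assumption stated above:

```latex
% Second derivative in mu: differentiate the first-order expression n(\bar{x}-\mu)/\sigma^2 once more.
\frac{\partial^2}{\partial \mu^2} \log \mathcal{L}(\mu,\sigma)
  = \frac{\partial}{\partial \mu} \left( \frac{n(\bar{x}-\mu)}{\sigma^2} \right)
  = -\frac{n}{\sigma^2} < 0

% Unbiasedness: linearity of expectation with E[x_i] = \mu for every i.
E\left[ \widehat\mu \right]
  = \frac{1}{n} \sum_{i=1}^{n} E\left[ x_i \right]
  = \frac{1}{n} \cdot n\mu
  = \mu
```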

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

$$\begin{aligned}
0 &= \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right) \\
  &= \frac{\partial}{\partial \sigma} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\
  &= -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{\sigma^3}
\end{aligned}$$

which is solved by

$$\widehat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\widehat{\mu})^2.$$

Inserting $\widehat\mu = \bar{x}$ we obtain

$$\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j.$$

When we calculate the expectation value, we may take μ = 0 without loss of generality, since shifting every $x_i$ by the same constant leaves $\widehat\sigma^2$ unchanged. The double sum then gives a nonzero contribution only if i=j. We obtain

$$E\left[ \widehat\sigma^2 \right] = \frac{n-1}{n}\sigma^2.$$
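The expectation can also be worked out for a general mean μ. The steps below are a sketch that uses only independence and the moments $E[x_i^2] = \mu^2 + \sigma^2$ and $E[x_i x_j] = \mu^2$ for $i \neq j$:

```latex
\begin{aligned}
E\left[ \widehat\sigma^2 \right]
  &= \frac{1}{n}\sum_{i=1}^{n} E\left[ x_i^2 \right]
     - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E\left[ x_i x_j \right] \\
  &= \frac{1}{n} \cdot n\left( \mu^2 + \sigma^2 \right)
     - \frac{1}{n^2}\left( n^2\mu^2 + n\sigma^2 \right) \\
  &= \sigma^2 - \frac{\sigma^2}{n} \\
  &= \frac{n-1}{n}\,\sigma^2
\end{aligned}
```

The $\mu^2$ terms cancel, which is why the value of μ does not affect the result.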

This means that the estimator $\widehat\sigma^2$ is biased (however, $\widehat\sigma^2$ is consistent).

Formally we say that the maximum likelihood estimator for $\theta=(\mu,\sigma^2)$ is:

$$\widehat{\theta} = \left(\widehat{\mu},\widehat{\sigma}^2\right).$$


In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
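The closed-form estimators derived above are simple to compute. The sketch below applies them to simulated data (a hypothetical sample of heights with mean 170 and standard deviation 10, chosen only for illustration) and compares the results with Python's `statistics.pvariance`, which divides by n, and `statistics.variance`, which divides by n - 1.

```python
import random
import statistics


def normal_mle(xs):
    """Closed-form MLEs for a normal sample: (sample mean, variance with divisor n)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat


# Hypothetical data: 500 simulated heights.
rng = random.Random(42)
heights = [rng.gauss(170.0, 10.0) for _ in range(500)]

mu_hat, var_hat = normal_mle(heights)

# Rescaling by n / (n - 1) removes the bias factor (n - 1) / n derived above.
n = len(heights)
unbiased_var = var_hat * n / (n - 1)

print(mu_hat, var_hat, statistics.pvariance(heights))
print(unbiased_var, statistics.variance(heights))
```

The MLE of the variance matches the population-style variance, while the rescaled value matches the usual unbiased sample variance.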
