Notes on Regression - Maximum Likelihood
Part 4 in the series of notes on regression analysis derives the OLS formula through the maximum likelihood approach. Maximum likelihood involves assuming a particular distributional form for the data and then finding the parameter values that maximise the probability of the observed data.

Bernoulli example

Take for example a dataset consisting of results from a series of coin flips. The coin may be biased and we want to find an estimator for the probability of the coin landing heads. A fair assumption is that the observations are drawn from $n$ independent coin flips that come from a $\text{Bernoulli}(p)$ distribution. This means that the probability mass function of a single observation $x_{i}$ is given by:

$$f(x_{i};p) = p^{x_{i}}(1-p)^{1-x_{i}}$$

Note that $x_{i}$ is a single observation and takes the value of 0 or 1. The likelihood function is simply the joint distribution expressed as a function of its parameters:

$$\begin{aligned} L(p) &= P(X_{1}=x_{1}, X_{2}=x_{2},...,X_{n}=x_{n}; p) \\ &= \prod_{i=1}^{n} f(x_{i};p)~~(\text{by independence})\\ &= p^{\sum x_{i}} (1-p)^{n - \sum x_{i}} \end{aligned}$$

Now we want to find the value $p$ that maximises the likelihood, $L(p)$. A simpler alternative is to maximise the log likelihood.^[Since the $\ln$ function is monotonic, the parameter value that maximises the log likelihood will also maximise the likelihood.] The maximum likelihood estimate $\hat{p}$ is then found by setting the derivative of the log likelihood to zero and solving for $p$:

$$\begin{aligned} \ln L(p) &= \Big(\sum x_{i}\Big) \ln p + \Big(n - \sum x_{i}\Big) \ln (1-p) \\ \frac{\partial \ln L(p)}{\partial p} &= \frac{\sum x_{i}}{p} - \frac{n - \sum x_{i}}{1-p} \\ 0 &= (1-p)\sum x_{i} - p\Big(n - \sum x_{i}\Big) \\ \hat{p} &= \frac{\sum x_{i}}{n} \end{aligned}$$

Not surprisingly, the maximum likelihood estimate of the probability that the biased coin lands heads is simply the proportion of heads across all observations.
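As a quick check, here is a minimal Python sketch (assuming NumPy and SciPy are available; the simulated coin flips and the true probability of 0.7 are purely illustrative) that maximises the Bernoulli log likelihood numerically and confirms that the result matches the sample proportion of heads:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated coin flips: 1 = heads, 0 = tails (true p = 0.7 is illustrative)
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)
n = len(x)

def neg_log_likelihood(p):
    # Negative Bernoulli log likelihood: -[(sum x_i) ln p + (n - sum x_i) ln(1 - p)]
    return -(x.sum() * np.log(p) + (n - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)     # numerical maximiser of the likelihood
print(x.mean())  # closed-form MLE: the sample proportion of heads
```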

Linear regression

Similarly, one can derive the formula for the OLS estimator through the maximum likelihood approach. Recall that linearity implies the following specification for the regression model: $y_{i} = \mathbf{x}_{i}'\beta + u_{i}$. In the maximum likelihood approach, we need to assume that the error terms conditional on $\mathbf{x}_{i}$ are normally distributed with unknown variance, i.e. $u_{i}\vert\mathbf{x}_{i} \sim N(0,\sigma^{2})$. The PDF of a single observation is given by:

$$f(y_{i},\mathbf{x}_{i};\beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(\frac{-(y_{i} -\mathbf{x}'_{i}\beta)^2}{2\sigma^2}\Big)$$

The likelihood or the joint PDF is:

$$L(\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(\frac{-(y_{i} -\mathbf{x}'_{i}\beta)^2}{2\sigma^2}\Big)$$

The log likelihood can be written as:

$$\begin{aligned} \ln L(\beta, \sigma^2) &= \ln \prod_{i=1}^{n} f(y_{i},\mathbf{x}_{i};\beta, \sigma^2) \\ &= \sum_{i=1}^{n} \ln f(y_{i},\mathbf{x}_{i};\beta, \sigma^2) \\ &= \sum_{i=1}^{n} \ln \Big( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(\frac{-(y_{i} -\mathbf{x}'_{i}\beta)^2}{2\sigma^2}\Big) \Big) \\ &= -\frac{n}{2} \ln 2\pi -\frac{n}{2} \ln \sigma^2 -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_{i} -\mathbf{x}'_{i}\beta)^2 \end{aligned}$$

Take the derivatives with respect to $\beta$ and $\sigma^2$ and set them to zero to obtain the first-order conditions:
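$$\begin{aligned} \frac{\partial \ln L}{\partial \beta} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i} -\mathbf{x}_{i}'\beta) = 0 \\ \frac{\partial \ln L}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_{i} -\mathbf{x}_{i}'\beta)^2 = 0 \end{aligned}$$

Solving these two equations yields the maximum likelihood estimators: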

$$\begin{aligned} \hat{\beta}_{ML} &= \Big(\sum_{i=1}^{n}\mathbf{x}_{i}\mathbf{x}_{i}'\Big)^{-1}\Big(\sum_{i=1}^{n}\mathbf{x}_{i}y_{i}\Big) \\ \hat{\sigma}^{2}_{ML} &= \frac{\sum_{i=1}^{n}(y_{i} -\mathbf{x}'_{i}\hat{\beta}_{ML})^2}{n} \end{aligned}$$
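Here is a minimal NumPy sketch of these closed-form estimators on simulated data (the design matrix, coefficients, and error variance below are illustrative assumptions, not part of the derivation):

```python
import numpy as np

# Simulated data: y = X beta + u with normal errors (values are illustrative)
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# beta_hat_ML = (sum x_i x_i')^{-1} (sum x_i y_i), i.e. (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# sigma2_hat_ML = mean of squared residuals (divides by n, not n - k)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n

print(beta_hat, sigma2_hat)
```

Note that $\hat{\beta}_{ML}$ coincides with the OLS estimator, while $\hat{\sigma}^{2}_{ML}$ divides by $n$ rather than the usual degrees-of-freedom correction.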

Additional Comments

While the maximum likelihood estimator can only be derived under very strong assumptions about the distributional form of the error term, it is nonetheless a very popular method in statistics and has widespread applications. For example, binary choice models such as probit and logit assume that the dependent variable takes the value of 0 or 1 and can be modelled using the following functional form:

$$P(y_{i}=1 \vert \mathbf{x}_{i}) = F(\mathbf{x}_{i}'\beta)$$

where $F$ is the standard normal CDF in the case of the probit model, or the logistic CDF in the case of the logit model.[^1]

Unlike the linear regression case presented above, most other problems have no closed-form solution to the maximisation problem, and the estimates have to be obtained through numerical optimisation.
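As an illustration of the numerical approach, the following Python sketch (assuming NumPy and SciPy; the simulated data and true coefficients are illustrative) maximises the logit log likelihood with a general-purpose optimiser:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated binary-choice data (true coefficients are illustrative)
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([-0.5, 1.5]))))

def neg_log_likelihood(beta):
    # Logit log likelihood: sum_i [y_i ln F(x_i'b) + (1 - y_i) ln(1 - F(x_i'b))],
    # with F the logistic CDF; this simplifies to y_i * eta_i - ln(1 + exp(eta_i))
    eta = X @ beta
    return -(y * eta - np.log1p(np.exp(eta))).sum()

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), method="BFGS")
print(res.x)  # numerically maximised estimates, close to the true coefficients
```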

Footnotes

[^1]: This corresponds to a latent variable model where the error terms are drawn iid from a normal or logistic distribution.