Notes on Regression - Maximum Likelihood
Part 4 in the series of notes on regression analysis derives the OLS formula through the maximum likelihood approach. Maximum likelihood involves finding the value of the parameters that maximises the probability of the observed data, after assuming a particular functional form for the distribution of the data.
Bernoulli example
Take for example a dataset consisting of results from a series of coin flips. The coin may be biased and we want to find an estimator for the probability of the coin landing heads. A fair assumption is that observations are drawn from independent coin flips that come from a Bernoulli distribution with parameter $p$. This means that the probability mass function of a single observation is given by:

$$f(y_i; p) = p^{y_i}(1-p)^{1-y_i}$$
Note that $y_i$ is a single observation and takes the value of 0 or 1. The likelihood function is simply the joint distribution expressed as a function of its parameters:

$$L(p; y_1, \ldots, y_n) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i}$$
Now we want to find the value that maximises the likelihood, $\hat{p} = \arg\max_p L(p; y_1, \ldots, y_n)$. A simpler alternative is to maximise the log likelihood.^[Since the log function is monotonic, the parameter value that maximises the log likelihood will also maximise the likelihood.] The maximum likelihood estimate can then be calculated by finding the value that maximises the log likelihood:

$$\ln L(p) = \sum_{i=1}^{n} y_i \ln p + \left(n - \sum_{i=1}^{n} y_i\right) \ln(1-p)$$

Setting the derivative with respect to $p$ to zero:

$$\frac{\partial \ln L}{\partial p} = \frac{\sum_{i=1}^{n} y_i}{p} - \frac{n - \sum_{i=1}^{n} y_i}{1-p} = 0 \implies \hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
Not surprisingly, the estimated probability that the biased coin will land heads is simply the proportion of heads across all observations.
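As a quick check, here is a minimal Python sketch (the use of numpy/scipy and the simulated data are assumptions for illustration) that maximises the Bernoulli log likelihood numerically and confirms that the result coincides with the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=0)
y = rng.binomial(n=1, p=0.7, size=1_000)  # simulated flips of a biased coin

def neg_log_likelihood(p):
    # Bernoulli log likelihood: sum(y) ln(p) + (n - sum(y)) ln(1 - p)
    return -(y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())  # the two estimates should agree (both close to 0.7)
```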
Linear regression
Similarly, one can derive the formula for the OLS estimator through the maximum likelihood approach. Recall that linearity implies the following specification for the regression model: $y_i = x_i'\beta + \epsilon_i$. In the maximum likelihood approach, we need to assume that the error terms conditional on $X$ are normally distributed with unknown variance, i.e. $\epsilon \mid X \sim N(0, \sigma^2 I)$. The PDF of a single observation is given by:

$$f(y_i \mid x_i; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right)$$
The likelihood or the joint PDF is:

$$L(\beta, \sigma^2; y, X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right)$$
The log likelihood can be written as:

$$\ln L(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}$$
Take the derivative with respect to $\beta$ and $\sigma^2$ to derive the maximum likelihood estimators:

$$\frac{\partial \ln L}{\partial \beta} = \frac{X'(y - X\beta)}{\sigma^2} = 0 \implies \hat{\beta} = (X'X)^{-1}X'y$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{(y - X\beta)'(y - X\beta)}{2\sigma^4} = 0 \implies \hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n}$$

Note that $\hat{\beta}$ coincides with the OLS estimator, since maximising the log likelihood is equivalent to minimising the sum of squared residuals.
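The closed-form result is easy to verify numerically. A minimal sketch (numpy and the simulated data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# ML (= OLS) estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ML variance estimator divides by n (the unbiased OLS estimator divides by n - k)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n

print(beta_hat, sigma2_hat)  # close to beta_true and 0.25
```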
Additional Comments
While the maximum likelihood estimator can only be derived under fairly strong assumptions on the distribution of the error term, it is nonetheless a very popular method in statistics and has widespread applications. For example, binary choice models such as probit and logit assume that the dependent variable takes the value of 0 or 1 and could be modelled using the following functional form:

$$P(y_i = 1 \mid x_i) = F(x_i'\beta)$$
where $F$ is the CDF of the standard normal distribution in the case of the probit model, or the logistic CDF in the case of the logit model.[^1]
Unlike the linear regression case presented above, most other problems have no explicit solution to the maximisation problem, and the solution has to be found using numerical optimisation.
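As an illustration, here is a minimal sketch of fitting a probit model by numerically maximising its log likelihood (numpy/scipy, the simulated data, and the choice of BFGS are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(seed=0)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)  # latent variable model

def neg_log_likelihood(beta):
    # Probit log likelihood: sum of y ln(F(x'b)) + (1 - y) ln(1 - F(x'b))
    F = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)  # clip to avoid log(0)
    return -np.sum(y * np.log(F) + (1 - y) * np.log(1 - F))

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(res.x)  # should be close to beta_true
```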
Footnotes
[^1]: This corresponds to a latent variable model where the error terms are drawn iid from a normal or logistic distribution.