Regression analysis
Mirror of English Wikipedia, the free encyclopedia
In statistics, regression analysis is used to model relationships between random variables, to determine the magnitude of those relationships, and to make predictions based on the resulting models.
Introduction
Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named $Y$, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named $X_1, \ldots, X_p$. If there is more than one response variable, we speak of multivariate regression.
Types of regression
Simple and multiple linear regression
Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Simple linear regression refers to a regression on two variables (one response and one predictor), while multiple regression refers to a regression on more than two variables. Linear regression assumes that the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors).
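As a sketch of what "linear in the parameters" means, the three models below are all linear regressions. The symbols $\beta_j$, $x_i$ and $\varepsilon_i$ are notation assumed here for illustration; they are not taken from the text above.

```latex
\begin{align*}
\text{simple:}\quad & y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \\
\text{multiple:}\quad & y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i \\
\text{curved in the predictor:}\quad & y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
\end{align*}
```

The third model is curved in $x_i$ but still linear in $\beta_0$, $\beta_1$ and $\beta_2$, so it can be fitted with the same linear methods.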
Nonlinear regression models
If the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
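As one possible illustration, the sketch below fits a curve of the form a * exp(b * x), which is not linear in b, using Gauss-Newton iterations. The model, the function name `fit_exponential` and the data are all hypothetical choices made for this example, and Gauss-Newton is only one of many nonlinear regression techniques.

```python
import numpy as np

def fit_exponential(x, y, n_iter=50, tol=1e-10):
    """Fit y ~ a * exp(b * x) by Gauss-Newton iterations (illustrative sketch)."""
    # Starting values from a straight-line fit to log(y); this requires y > 0.
    slope, intercept = np.polyfit(x, np.log(y), 1)
    a, b = np.exp(intercept), slope
    for _ in range(n_iter):
        curve = np.exp(b * x)
        residual = y - a * curve
        # Jacobian of the model with respect to the parameters (a, b).
        jacobian = np.column_stack([curve, a * x * curve])
        # Each step solves a linear least-squares problem for the update.
        step, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        a, b = a + step[0], b + step[1]
        if np.max(np.abs(step)) < tol:
            break
    return a, b

# Hypothetical data: the curve 2 * exp(0.5 * x) plus small alternating errors.
x = np.linspace(0.0, 4.0, 9)
y = 2.0 * np.exp(0.5 * x) + 0.05 * (-1.0) ** np.arange(9)

a_hat, b_hat = fit_exponential(x, y)
print("a:", a_hat, "b:", b_hat)
```

Each iteration replaces the nonlinear problem by a linear least-squares problem in the parameter update, which is why the quality of the starting values matters.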
Other models
Although the types above are the most common, other approaches exist, including Poisson regression, supervised learning, and unit-weighted regression.
Linear models
Predictor variables may be quantitative or qualitative (categorical). Categorical predictors are sometimes called factors. Although the method of estimating the model is the same in each case, different situations are sometimes known by different names for historical reasons:
- If the predictors are all quantitative, we speak of multiple regression.
- If the predictors are all qualitative, one performs analysis of variance.
- If some predictors are quantitative and some qualitative, one performs an analysis of covariance.
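The claim that the estimation method is the same in each case can be illustrated with a small sketch: a qualitative predictor is encoded as indicator (dummy) columns and then fitted by ordinary least squares. The data and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical data: a response measured under three levels of one factor.
levels = np.array(["a", "a", "a", "b", "b", "b", "c", "c", "c"])
y = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1.0, 2.0, 3.0])

# One indicator column per level; no separate intercept column is used here.
names = np.unique(levels)  # sorted level names
X = (levels[:, None] == names[None, :]).astype(float)

# Ordinary least squares on the indicator columns.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, value in zip(names, coef):
    print(name, value)
```

With this coding each fitted coefficient is the mean response of its group, which is the one-way analysis of variance model obtained from the same least-squares machinery used for quantitative predictors.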
The linear model usually assumes that the data are continuous. When least squares estimation is used and the data are assumed to be normally distributed, the model is fully parametric; without the normality assumption, the model is semi-parametric. If the data are not normally distributed, there are often better approaches to fitting than least squares. In particular, if the data contain outliers, robust regression might be preferred.
If two or more independent variables are strongly correlated, we say that the variables are multicollinear. Multicollinearity leaves the parameter estimates unbiased and consistent, but inflates their variance, so the affected coefficients are estimated imprecisely.
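The loss of precision can be made concrete with the standard expression for the variance of a least-squares coefficient in a model with an intercept. The notation is assumed here for illustration: $\sigma^2$ is the error variance, $x_{ij}$ the value of predictor $j$ for observation $i$, and $R_j^2$ the coefficient of determination obtained by regressing predictor $j$ on the other predictors.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2}
```

The factor $1 / (1 - R_j^2)$ is known as the variance inflation factor: it equals 1 when predictor $j$ is uncorrelated with the others and grows without bound as $R_j^2$ approaches 1.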
If the regression error is not normally distributed but is assumed to come from an exponential family, generalized linear models should be used. For example, if the response variable can take only binary values (for example, a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. probability of getting "yes") varies with the predictors.
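A minimal sketch of logistic regression with a single predictor follows, fitted by plain gradient ascent on the log-likelihood. The data, the function names and the step-size settings are hypothetical choices for illustration; statistical packages typically use faster fitting algorithms.

```python
import numpy as np

def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Estimate (intercept, slope) of a logistic model by gradient ascent."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # current predicted probabilities
        # Gradient of the average log-likelihood with respect to beta.
        beta += learning_rate * (X.T @ (y - p)) / len(y)
    return beta

def predict_probability(beta, x):
    """Probability of a 'yes' outcome at predictor value x."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))

# Hypothetical data: a binary outcome that becomes more likely as x grows.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

beta = fit_logistic(x, y)
print("intercept and slope:", beta)
print("P(yes | x = 1):", predict_probability(beta, 1.0))
print("P(yes | x = 4):", predict_probability(beta, 4.0))
```

The fitted function maps any predictor value to a probability between 0 and 1, which is the outcome described in the paragraph above.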
Regression and Bayesian statistics
Maximum likelihood is one method of estimating the parameters of a regression model, and it behaves well for large samples. For small amounts of data, however, the estimates can have high variance or bias. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, which incorporates everything known about the parameters. (For example, if one parameter is known to be non-negative, a non-negative distribution can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantage that they use all the information that is available. They are exact, not asymptotic, and thus work well for small data sets if some contextual information is available to be used in the prior. Some practitioners use maximum a posteriori (MAP) methods, which are simpler than a full Bayesian analysis: the parameters are chosen to maximize the posterior. MAP methods are related to Occam's Razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
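As an illustration of a MAP estimate, the sketch below assumes a linear model with Gaussian noise and a zero-mean Gaussian prior on the coefficients. These assumptions are made here for the example and are not stated in the text above; under them the posterior mode has a closed form that coincides with ridge regression. The function name `map_estimate`, the parameter `lam` and the data are hypothetical.

```python
import numpy as np

def map_estimate(X, y, lam):
    """Posterior mode for a linear model with Gaussian noise and a zero-mean
    Gaussian prior on the coefficients; lam is the ratio of noise variance to
    prior variance. With lam = 0 this is the maximum-likelihood
    (least-squares) estimate."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

# Hypothetical small data set with a quadratic design matrix.
x = np.array([-1.0, -0.6, -0.2, 0.2, 0.6, 1.0])
y = np.array([1.9, 0.7, 0.1, 0.3, 1.2, 2.6])
X = np.column_stack([np.ones_like(x), x, x**2])

theta_ml = map_estimate(X, y, lam=0.0)   # no prior information
theta_map = map_estimate(X, y, lam=1.0)  # prior pulls coefficients toward zero

print("maximum likelihood:", theta_ml)
print("MAP:", theta_map)
```

The prior shrinks the coefficient vector toward zero, which is one concrete form of the preference for simpler curves mentioned above.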
Examples
To illustrate the various goals of regression, we will give three examples.
Prediction of future observations
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
| Height (in) | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
| Weight (lbs) | 115 | 117 | 120 | 123 | 126 | 129 | 132 | 135 | 139 | 142 | 146 | 150 | 154 | 159 | 164 |
We would like to describe how the weight $Y$ of these women depends on their height $X$. Intuitively, we can guess that if the women's proportions and density are constant, then their weight must depend on the cube of their height. A plot of the data set is consistent with this supposition.
$X$ will denote the vector containing all the measured heights ($X = (58, 59, 60, \ldots, 72)$) and $Y = (115, 117, 120, \ldots, 164)$ is the vector containing all measured weights. We can suppose that the errors in the weights are uncorrelated with each other and have constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients $\theta_0$, $\theta_1$ and $\theta_2$ satisfying as well as possible (in the least-squares sense) the equation:

$$Y = \theta_0 + \theta_1 X + \theta_2 X^3 + \varepsilon$$
Geometrically, what we will be doing is an orthogonal projection of $Y$ onto the subspace generated by the variables $1$, $X$ and $X^3$. The design matrix $\mathbf{X}$ is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the $X$ in the model) and a third column with these values cubed ($X^3$). The realization of this matrix (i.e. for the data at hand) can be written:
| 1 | x | x^3 |
| 1 | 58 | 195112 |
| 1 | 59 | 205379 |
| 1 | 60 | 216000 |
| 1 | 61 | 226981 |
| 1 | 62 | 238328 |
| 1 | 63 | 250047 |
| 1 | 64 | 262144 |
| 1 | 65 | 274625 |
| 1 | 66 | 287496 |
| 1 | 67 | 300763 |
| 1 | 68 | 314432 |
| 1 | 69 | 328509 |
| 1 | 70 | 343000 |
| 1 | 71 | 357911 |
| 1 | 72 | 373248 |
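Writing $\mathbf{X}$ for the matrix of columns $1$, $X$ and $X^3$ given above, the least-squares estimate is obtained from the matrix $(\mathbf{X}^\top \mathbf{X})^{-1}$ (sometimes called "information matrix" or "dispersion matrix"):

$$\hat{\theta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top Y$$

hence the fitted function is

$$\hat{Y} = \hat{\theta}_0 + \hat{\theta}_1 X + \hat{\theta}_2 X^3$$

A plot of this function shows that it lies quite close to the data set.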
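The fit can be reproduced numerically. The sketch below uses NumPy, which is a choice made here for illustration, builds the matrix with columns 1, x and x^3 from the height and weight table, and solves the least-squares problem; the variable names are hypothetical.

```python
import numpy as np

# Heights (in) and average weights (lb) from the table above.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, x and x^3.
X = np.column_stack([np.ones_like(height), height, height**3])

# Least-squares coefficients (theta_0, theta_1, theta_2). lstsq is used
# instead of explicitly inverting X^T X, which is badly scaled for these data.
theta, *_ = np.linalg.lstsq(X, weight, rcond=None)

fitted = X @ theta
residuals = weight - fitted
print("theta:", theta)
print("largest absolute residual (lb):", np.max(np.abs(residuals)))
```

Plotting `fitted` against `height` together with the observed weights gives the visual comparison mentioned above.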
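The confidence intervals are computed using:

$$\hat{\theta}_j \pm t_{n-p;\,1-\alpha/2}\,\hat{\sigma}\sqrt{s_j}$$

with $n = 15$ observations, $p = 3$ coefficients, $\hat{\sigma}^2 = \lVert Y - \mathbf{X}\hat{\theta} \rVert^2 / (n - p)$ where $\mathbf{X}$ is the matrix of columns $1$, $X$ and $X^3$, $s_j$ the $j$-th diagonal element of $(\mathbf{X}^\top \mathbf{X})^{-1}$, and $t_{n-p;\,1-\alpha/2}$ the quantile of Student's t distribution with $n - p$ degrees of freedom.

Therefore, taking $\alpha = 0.05$, we can say that with a probability of 0.95 each coefficient $\theta_j$ lies in the interval computed in this way.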
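The interval computation can be sketched as follows, assuming the usual form "estimate plus or minus t quantile times standard error" for least-squares coefficients. The critical value 2.179 is the 0.975 quantile of Student's t distribution with 12 degrees of freedom, taken from standard tables; the variable names are hypothetical.

```python
import numpy as np

# Heights (in) and average weights (lb) from the table above.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)
X = np.column_stack([np.ones_like(height), height, height**3])
n, p = X.shape

# Least-squares fit and residual standard deviation with n - p degrees of freedom.
theta, *_ = np.linalg.lstsq(X, weight, rcond=None)
residuals = weight - X @ theta
sigma_hat = np.sqrt(residuals @ residuals / (n - p))

# Diagonal of (X^T X)^{-1}, obtained from the pseudoinverse of X to avoid
# inverting the badly scaled matrix X^T X directly.
X_pinv = np.linalg.pinv(X)
s = np.sum(X_pinv**2, axis=1)

T_CRIT = 2.179  # 0.975 quantile of Student's t, 12 degrees of freedom (tables)
half_width = T_CRIT * sigma_hat * np.sqrt(s)
lower, upper = theta - half_width, theta + half_width

for j in range(p):
    print(f"theta_{j}: [{lower[j]:.6g}, {upper[j]:.6g}]")
```

The hard-coded critical value applies only to this data set (15 observations, 3 coefficients, 95% level); other cases need the matching t quantile.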
See also
- Confidence interval
- Extrapolation
- Kriging
- Prediction
- Prediction interval
- Statistics
- Trend estimation
- Robust regression
- Multivariate normal distribution
- Important publications in regression analysis