In statistics, linear regression is a regression method that models the relationship between a dependent variable Y, independent variables Xi, i = 1, ..., p, and a random term ε. The model can be written as

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

where β0 is the intercept ("constant" term), the βi are the respective parameters of the independent variables, and p is the number of independent variables, so that p + 1 parameters (including the intercept) are to be estimated in the linear regression. Linear regression can be contrasted with nonlinear regression.

Figure: Example of linear regression with one dependent and one independent variable.
This method is called "linear" because the relation of the response (the dependent variable Y) to the independent variables is assumed to be a linear function of the parameters. It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of Y = β0 + βx is a straight line or that Y is a linear function of the X variables. But if the model is (for example)

$$Y = \alpha + \beta x + \gamma x^2 + \varepsilon$$

the problem is still one of linear regression, that is, linear in the parameters, with x and x² treated as separate regressors, even though the graph against x by itself is not a straight line.

Historical remarks

The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805,[1] and by Gauss in 1809.[2] The term "least squares" is from Legendre's term, moindres carrés. However, Gauss claimed that he had known the method since 1795.
Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success.[citation needed] Gauss published a further development of the theory of least squares in 1821,[3] including a version of the Gauss–Markov theorem.
Notation and naming convention

In the notation below:
- a vector of variables is denoted using a bolded arrow over the vector, such as $\vec{X}$
- matrices are denoted using a bolded font, such as $\mathbf{X}$
- a vector of parameters ("constants") is a bolded β without subscript, $\boldsymbol{\beta}$
The product of the matrix X and the vector β is written as Xβ. The dependent variable, Y, in regression is conventionally called the "response variable." The independent variables (in vector form) are called the explanatory variables or regressors. Other terms include "exogenous variables," "input variables," and "predictor variables".
A hat, $\hat{\ }$, over a variable denotes that the variable or parameter has been estimated; for example, $\hat{\boldsymbol{\beta}}$ denotes the estimated values of the parameter vector β.
The linear regression model

The linear regression model can be written in vector-matrix notation as

$$\vec{y} = \mathbf{X}\boldsymbol{\beta} + \vec{\varepsilon}$$

The term ε represents the unpredicted or unexplained variation in the response variable; it is conventionally called the "error" whether it is really a measurement error or not, and is assumed to be independent of $\vec{X}$. For simple linear regression, where there is only a single explanatory variable and two parameters, the above equation reduces to:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

An equivalent formulation that explicitly shows the linear regression as a model of conditional expectation can be given as

$$E(y \mid x) = \beta_0 + \beta_1 x$$

where the conditional distribution of y given x is the distribution of the error term shifted by this conditional mean.
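The conditional-expectation reading of the simple model can be illustrated with a small simulation. This is only a sketch: the parameter values, the error standard deviation, and the names `conditional_mean` and `mean_deviation` are assumptions made for illustration, and normally distributed errors are assumed for convenience.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" parameters and error spread (assumptions, not from the text).
beta0, beta1 = 1.0, 2.0
sigma = 0.5


def conditional_mean(x):
    """E(y | x) = beta0 + beta1 * x for the simple linear model."""
    return beta0 + beta1 * x


# Draw explanatory values, then add independent zero-mean errors.
xs = [random.uniform(0.0, 10.0) for _ in range(10_000)]
errors = [random.gauss(0.0, sigma) for _ in xs]
ys = [conditional_mean(x) + e for x, e in zip(xs, errors)]

# The deviations y - E(y | x) are the simulated errors, so their
# average should be close to zero in a large sample.
deviations = [y - conditional_mean(x) for x, y in zip(xs, ys)]
mean_deviation = sum(deviations) / len(deviations)
print(f"mean of y - E(y|x): {mean_deviation:.4f}")
```

The point of the sketch is that, once the conditional mean is subtracted, what remains of y behaves like the error term alone.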
Types of linear regression

There are many different approaches to solving the regression problem, that is, determining suitable estimates for the parameters.
Least-squares analysis

Least-squares analysis was first published by Legendre in 1805 and by Gauss in 1809, and was developed further by Gauss in the 1820s. This method uses the following Gauss-Markov assumptions:
- The random errors εi have expected value 0.
- The random errors εi are uncorrelated (this is weaker than an assumption of probabilistic independence).
- The random errors εi are homoscedastic, i.e., they all have the same variance.
(See also Gauss-Markov theorem.) These assumptions imply that the least-squares estimates of the parameters are optimal in a specific sense: among all linear unbiased estimators, they have the smallest variance.
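A linear regression with p parameters (including the regression intercept β1) and n data points (sample size), with n > p, allows construction of the following vectors and matrix with associated standard errors:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{12} & \cdots & x_{1p} \\ 1 & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

or, from vector-matrix notation above,

$$\vec{y} = \mathbf{X}\boldsymbol{\beta} + \vec{\varepsilon}$$

Each data point can be given as $(\vec{x}_i, y_i)$, $i = 1, 2, \ldots, n$. For n = p, the parameters can be estimated but the standard errors of the parameter estimates cannot be calculated, because no degrees of freedom remain. For n less than p, the parameters cannot be uniquely determined. The estimated values of the parameters can be given as

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{y}$$

Using the assumptions provided by the Gauss-Markov theorem, it is possible to analyse the results and determine whether or not the model determined using least-squares is valid. The number of degrees of freedom is given by n − p.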
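The parameter estimate can be sketched in plain Python as follows. The data are made up for illustration, and the helper names `solve` and `least_squares` are hypothetical. Rather than forming the inverse of X'X explicitly, the sketch solves the equivalent normal equations (X'X)β = X'y, which gives the same estimate; in practice a linear-algebra library would normally be used.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(X, y):
    """Return beta_hat solving the normal equations (X'X) beta = X'y."""
    columns = list(zip(*X))  # the columns of X are the rows of X'
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
           for ci in columns]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(XtX, Xty)


# Made-up data: one explanatory variable plus a column of ones for the intercept.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
X = [[1.0, x] for x in x_values]

beta_hat = least_squares(X, y)
n, p = len(X), len(X[0])
print("beta_hat:", [round(b, 6) for b in beta_hat])
print("degrees of freedom:", n - p)
```

For these five points the estimates come out as an intercept of about 2.2 and a slope of about 0.6, with 5 − 2 = 3 degrees of freedom.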
The residuals, representing 'observed' minus 'calculated' quantities, are useful to analyse the regression. They are determined from

$$\hat{\vec{\varepsilon}} = \vec{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$

The standard deviation, $\hat{\sigma}$, for the model is determined from

$$\hat{\sigma} = \sqrt{\frac{\hat{\vec{\varepsilon}}^T \hat{\vec{\varepsilon}}}{n - p}}$$

If the errors are additionally assumed to be normally distributed, the variance in the errors can be described using the chi-square distribution:

$$\hat{\sigma}^2 \sim \frac{\chi^2_{n-p}\,\sigma^2}{n - p}$$
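As a minimal sketch of the residuals and the model standard deviation, the following uses made-up data and the closed-form least-squares solution for a single explanatory variable (so p = 2). The helper name `fit_line` is hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(xs), 2  # two parameters: intercept and slope

b0, b1 = fit_line(xs, ys)

# Residuals: observed minus calculated values.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Model standard deviation: residual sum of squares over n - p.
sigma_hat = math.sqrt(sum(e * e for e in residuals) / (n - p))

print("residuals:", [round(e, 3) for e in residuals])
print("sigma_hat:", round(sigma_hat, 4))
```

Dividing by n − p rather than n accounts for the degrees of freedom used up by the estimated parameters.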
The 100(1 − α)% confidence interval for the parameter, βi, is computed as follows:

$$\hat{\beta}_i \pm t_{\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{(\mathbf{X}^T\mathbf{X})^{-1}_{ii}}$$

where t follows the Student's t-distribution with n − p degrees of freedom and $(\mathbf{X}^T\mathbf{X})^{-1}_{ii}$ denotes the value located in the ith row and column of the matrix $(\mathbf{X}^T\mathbf{X})^{-1}$.
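A sketch of the parameter confidence intervals for a one-variable model follows. The data are made up, the diagonal of the inverse of X'X is written out in closed form for a design matrix with columns (1, x), and the t critical value is hard-coded from a standard table for α = 0.05 and 3 degrees of freedom; all names are illustrative assumptions.

```python
import math

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(xs), 2

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sigma_hat = math.sqrt(sum(e * e for e in residuals) / (n - p))

# Diagonal of (X'X)^{-1} for X = [1, x], written out in closed form:
# the intercept entry is sum(x^2) / (n * Sxx), the slope entry is 1 / Sxx.
inv_diag = [sum(x * x for x in xs) / (n * sxx), 1.0 / sxx]

# Tabulated two-sided t value for alpha = 0.05 and n - p = 3 degrees
# of freedom (an assumption of this example, taken from a t-table).
T_CRIT = 3.182

intervals = []
for estimate, d in zip([intercept, slope], inv_diag):
    half_width = T_CRIT * sigma_hat * math.sqrt(d)
    intervals.append((estimate - half_width, estimate + half_width))

for name, (lo, hi) in zip(["intercept", "slope"], intervals):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

With so few points the intervals are wide; here the interval for the slope spans zero even though the point estimate is positive.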
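The 100(1 − α)% mean response confidence interval for a prediction (interpolation or extrapolation) at a value $\vec{x} = \vec{x}_0$ is given by:

$$\vec{x}_0 \hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{\vec{x}_0 (\mathbf{X}^T\mathbf{X})^{-1} \vec{x}_0^T}$$

where $\vec{x}_0 = \langle 1, x_2, x_3, \ldots, x_p \rangle$. The 100(1 − α)% predicted response confidence intervals for the data are given by:

$$\vec{x}_0 \hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{1 + \vec{x}_0 (\mathbf{X}^T\mathbf{X})^{-1} \vec{x}_0^T}$$

The regression sum of squares SSR is given by:

$$\mathit{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \vec{y} - \frac{1}{n}\left(\vec{y}^T \vec{u}\,\vec{u}^T \vec{y}\right)$$

where $\bar{y} = \frac{1}{n}\sum_i y_i$ and $\vec{u}$ is an n by 1 vector of ones. The error sum of squares ESS is given by:

$$\mathit{ESS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \vec{y}^T \vec{y} - \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \vec{y}$$

The total sum of squares TSS is given by

$$\mathit{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \vec{y}^T \vec{y} - \frac{1}{n}\left(\vec{y}^T \vec{u}\,\vec{u}^T \vec{y}\right) = \mathit{SSR} + \mathit{ESS}$$

The coefficient of determination, R² (sometimes called Pearson's coefficient of regression), is then given as

$$R^2 = \frac{\mathit{SSR}}{\mathit{TSS}} = 1 - \frac{\mathit{ESS}}{\mathit{TSS}}$$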
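The sums of squares and R² can be checked numerically on a small example. This sketch uses made-up data and a one-variable model fitted in closed form; the variable names are illustrative.

```python
# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# Closed-form least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
tss = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares

r_squared = ssr / tss

print(f"SSR = {ssr:.3f}, ESS = {ess:.3f}, TSS = {tss:.3f}")
print(f"R^2 = {r_squared:.3f}")
```

The decomposition TSS = SSR + ESS holds here because the model includes an intercept; without one, the identity does not hold in general.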
Assessing the least-squares model

Once the above values have been calculated, the model should be checked for two different things:
- Whether the assumptions of least-squares are fulfilled and
- Whether the model is valid
Checking model assumptions

The model assumptions are checked by calculating the residuals and plotting them. The residuals are calculated as follows:

$$\hat{\vec{\varepsilon}} = \vec{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$

The following plots can be constructed to test the validity of the assumptions:
- Plotting a normal probability plot of the residuals to test normality. The points should lie along a straight line.
- Plotting a time series plot of the residuals, that is, plotting the residuals as a function of time.
- Plotting the residuals as a function of the explanatory variables, $\mathbf{X}$.
- Plotting the residuals against the fitted values, $\hat{\vec{y}}$.
- Plotting the residuals against the previous residual.

In all but the first case, there should not be any noticeable pattern to the data.
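As a numeric companion to two of the plots listed above (not a replacement for looking at them), the following sketch computes the correlation between residuals and fitted values, and between each residual and the previous one. The data are made up and the helper name `pearson` is hypothetical; a correlation near zero is only a rough indication that no linear pattern is present.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up illustrative data, assumed to be listed in time order.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# Closed-form least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Residuals against fitted values: zero by construction for a
# least-squares fit with an intercept, so curvature or changing
# spread has to be judged from the plot itself.
corr_fitted = pearson(fitted, residuals)

# Residuals against the previous residual (lag 1).
previous, current = residuals[:-1], residuals[1:]
corr_lag = pearson(previous, current)

print(f"corr(fitted, residual)       = {corr_fitted:.2e}")
print(f"corr(previous, current res.) = {corr_lag:.3f}")
```

With only five points the lag-1 correlation is very noisy; it is shown here only to make the construction of the "previous residual" pairs concrete.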
Checking model validity

The validity of the model can be checked using any of the following methods:
- Using the confidence interval for each of the parameters, βi. If the confidence interval includes 0, then the parameter can be removed from the model. Ideally, a new regression analysis excluding that parameter would need to be performed and continued until there are no more parameters to remove.
- Calculating the coefficient of determination, R² (Pearson's coefficient of regression). The closer the value is to 1, the better the regression is. This coefficient gives the fraction of the observed variation in the response that can be explained by the given variables.
- Examining the observational and prediction confidence intervals. The smaller they are, the better.
- Computing the F-statistic.
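The last check in the list can be sketched as follows for the overall F-test of a model with an intercept, where the statistic is the regression mean square divided by the error mean square. The data are made up, and the critical value is hard-coded from a standard F-table for α = 0.05 with (1, 3) degrees of freedom; both are assumptions of this example.

```python
# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(xs), 2  # p counts the intercept and the slope

# Closed-form least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares

# Overall F-statistic: regression mean square over error mean square.
f_stat = (ssr / (p - 1)) / (ess / (n - p))

# Tabulated F value for alpha = 0.05 with (p - 1, n - p) = (1, 3)
# degrees of freedom (an assumption of this example, from an F-table).
F_CRIT = 10.13
significant = f_stat > F_CRIT

print(f"F = {f_stat:.3f}, significant at 5%: {significant}")
```

For a single explanatory variable this test is equivalent to the t-based check on the slope, since the F-statistic is the square of the slope's t-statistic.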
Modifications of least-squares analysis

There are various ways in which least-squares analysis can be modified, including
- weighted least squares, which is a generalisation of the least squares method
- polynomial fitting, which involves fitting a polynomial to the given data.
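Both modifications in the list above still reduce to solving a linear system, which the following sketch tries to make concrete. It minimises a weighted sum of squared residuals for a polynomial of a chosen degree by solving the weighted normal equations (X'WX)β = X'Wy, where W is a diagonal matrix of weights. The function names `solve` and `weighted_polyfit`, the data, and the weights are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def weighted_polyfit(xs, ys, weights, degree):
    """Minimise sum_i w_i * (y_i - sum_k beta_k * x_i**k)**2."""
    m = degree + 1
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    X = [[x ** k for k in range(m)] for x in xs]
    # Weighted normal equations (X' W X) beta = X' W y.
    XtWX = [[sum(w * row[j] * row[k] for w, row in zip(weights, X))
             for k in range(m)] for j in range(m)]
    XtWy = [sum(w * row[j] * y for w, row, y in zip(weights, X, ys))
            for j in range(m)]
    return solve(XtWX, XtWy)


# Weighted straight-line fit: the last point gets ten times the weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ols = weighted_polyfit(xs, ys, [1.0] * 5, degree=1)
wls = weighted_polyfit(xs, ys, [1.0, 1.0, 1.0, 1.0, 10.0], degree=1)

# Unweighted quadratic fit to points lying exactly on y = 1 + 2x + 3x^2.
quad_x = [0.0, 1.0, 2.0, 3.0, 4.0]
quad_y = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in quad_x]
quad = weighted_polyfit(quad_x, quad_y, [1.0] * 5, degree=2)

print("equal weights:", [round(b, 4) for b in ols])
print("last point up-weighted:", [round(b, 4) for b in wls])
print("quadratic coefficients:", [round(b, 4) for b in quad])
```

With equal weights the function reproduces ordinary least squares. Up-weighting the last point pulls the fitted line towards it, and the quadratic example shows that a curve in x is still fitted by a method that is linear in the coefficients.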
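Polynomial fitting

A polynomial fit is a specific type of multiple regression. The simple regression model (a first-order polynomial) can be trivially extended to higher orders. The regression model

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_m x_i^m + \varepsilon_i$$

is a system of polynomial equations of order m with polynomial coefficients $\{\beta_0, \ldots, \beta_m\}$. As before, we can express the model using data matrix $\mathbf{X}$, target vector $\vec{y}$ and parameter vector $\boldsymbol{\beta}$. The ith row of $\mathbf{X}$ and $\vec{y}$ will contain the x and y value for the ith data sample. Then the model can be written as a system of linear equations:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

which when using pure matrix notation remains, as before,

$$\vec{y} = \mathbf{X}\boldsymbol{\beta} + \vec{\varepsilon}$$

and the vector of polynomial coefficients is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{y}$$

Robust regression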
A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error used in ordinary least squares. Robust regression is typically more computationally intensive than least squares and is somewhat more difficult to implement as well. While least-squares estimates are not very sensitive to departures from the assumption of normally distributed errors, this no longer holds when the variance or mean of the error distribution is not bounded, or when no analyst is available to identify outliers.
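The effect of minimising absolute rather than squared residuals can be sketched on a tiny made-up data set with one gross outlier. The brute-force search below relies on the known property that, for a straight-line model, some line minimising the sum of absolute residuals passes through at least two data points; it is meant as an illustration only, since it scales poorly and practical implementations use linear programming or iterative methods. The function names are hypothetical.

```python
from itertools import combinations


def sum_abs_residuals(b0, b1, xs, ys):
    """Sum of absolute residuals for the line y = b0 + b1 * x."""
    return sum(abs(y - (b0 + b1 * x)) for x, y in zip(xs, ys))


def lad_line(xs, ys):
    """Brute-force least-absolute-deviations line over all point pairs."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = b0 + b1 * x
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        cost = sum_abs_residuals(b0, b1, xs, ys)
        if best is None or cost < best[0]:
            best = (cost, b0, b1)
    return best[1], best[2]


def ols_line(xs, ys):
    """Ordinary least-squares intercept and slope, for comparison."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: five points on y = x and one gross outlier at x = 5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0]

lad = lad_line(xs, ys)
ols = ols_line(xs, ys)
print("least absolute deviations (intercept, slope):", lad)
print("least squares (intercept, slope):", tuple(round(v, 3) for v in ols))
```

On this data the absolute-error fit stays on the line through the five well-behaved points, while the least-squares slope is pulled strongly towards the outlier.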
Among Stata users, "robust regression" often means linear regression with Huber-White standard error estimates. This relaxes the assumption of homoscedasticity for the variance estimates only; the coefficient estimates are still the ordinary least squares (OLS) estimates.
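The idea that only the variance estimate changes can be sketched for the slope of a one-variable model. The code below computes the basic heteroscedasticity-consistent ("HC0") form of the Huber-White variance, in which each squared residual replaces the common error variance. The data are made up, and statistical packages typically apply an additional finite-sample scaling that is not shown here, so the numbers should not be expected to match any particular program's output.

```python
import math

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# The coefficient estimates are the usual least-squares ones.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Classical slope variance: one common error variance for all points.
classical_var = (sum(e * e for e in residuals) / (n - 2)) / sxx

# HC0 slope variance: each point contributes its own squared residual.
hc0_var = sum((x - x_bar) ** 2 * e * e for x, e in zip(xs, residuals)) / sxx ** 2

print(f"slope estimate: {slope:.3f}")
print(f"classical standard error: {math.sqrt(classical_var):.4f}")
print(f"HC0 standard error:       {math.sqrt(hc0_var):.4f}")
```

The slope itself is identical under both approaches; only the reported uncertainty differs.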
Applications of linear regression

The trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis).

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.
Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
Examples

Linear regression is widely used in biological, behavioral and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

Medicine

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are considered to be more trustworthy than a regression analysis.
Finance

Linear regression underlies the capital asset pricing model, and the concept of using Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.
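As a sketch of how Beta falls out of a regression, the slope of a least-squares line relating an investment's returns to market returns equals the covariance of the two series divided by the variance of the market returns. The return figures below are entirely hypothetical, constructed from an assumed beta of 1.2 plus small made-up deviations, and the function name `estimate_beta` is illustrative.

```python
def estimate_beta(asset_returns, market_returns):
    """Least-squares slope of asset returns on market returns."""
    n = len(market_returns)
    m_bar = sum(market_returns) / n
    a_bar = sum(asset_returns) / n
    cov = sum((m - m_bar) * (a - a_bar)
              for m, a in zip(market_returns, asset_returns))
    var = sum((m - m_bar) ** 2 for m in market_returns)
    return cov / var


# Hypothetical periodic market returns (not real data).
market = [0.010, -0.020, 0.030, 0.015, -0.005]

# Hypothetical asset returns built from an assumed alpha and beta,
# plus small made-up deviations standing in for asset-specific risk.
assumed_alpha, assumed_beta = 0.001, 1.2
noise = [0.002, -0.001, 0.000, -0.002, 0.001]
asset = [assumed_alpha + assumed_beta * m + e for m, e in zip(market, noise)]

beta = estimate_beta(asset, market)
print(f"estimated beta: {beta:.3f}")
```

The estimate lands close to, but not exactly on, the assumed value because the made-up deviations are slightly correlated with the market series in such a short sample.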
References

- [1] A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes (1805). "Sur la Méthode des moindres quarrés" appears as an appendix.
- [2] C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. (1809)
- [3] C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. (1821/1823)
Additional sources

- Cohen, J., Cohen P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum Associates
- Charles Darwin. The Variation of Animals and Plants under Domestication. (1869) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
- Draper, N.R. and Smith, H. Applied Regression Analysis. Wiley Series in Probability and Statistics (1998)
- Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246-263 (1886).
- Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, incl. appendices on Σ operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).
- http://homepage.mac.com/nshoffner/nsh/CalcBookAll/Chapter%201/1functions.html
See also

- Regression analysis
- Segmented linear regression
- Econometrics
- Robust regression
- Tikhonov regularization
- Least squares
- Instrumental variable
- Hierarchical linear modeling
- Empirical Bayes methods
- General linear model
- Nonlinear regression