Microsoft KB Archive/828533

From BetaArchive Wiki

Article ID: 828533

Article Last Modified on 9/13/2007



APPLIES TO

  • Microsoft Office Excel 2007
  • Microsoft Office Excel 2003
  • Microsoft Excel 2004 for Mac



SUMMARY

The purpose of this article is to describe the LINEST function in Microsoft Office Excel 2003 and in later versions of Excel, to illustrate how the LINEST function is used, and to compare the results of the LINEST function in Excel 2003 and in later versions of Excel with the results of the LINEST function in earlier versions of Excel.

Microsoft has made extensive changes to the LINEST function to correct incorrect formulas that are used when the regression line must go through the origin. The changes also pay more attention to issues that involve collinear predictor variables. Because of these extensive improvements, this article focuses more on the improvements and less on instructing users about how to use LINEST.

Microsoft Excel 2004 for Mac information

The statistical functions in Excel 2004 for Mac were updated by using the same algorithms that were used to update the statistical functions in Microsoft Office Excel 2003 and in later versions of Excel. Any information in this article that describes how a function works or how a function was modified for Excel 2003 and for later versions of Excel also applies to Excel 2004 for Mac.

MORE INFORMATION

The LINEST(known_y's, known_x's, intercept, statistics) function is used to perform linear regression. A least squares criterion is used, and LINEST tries to find the best fit under that criterion. Known_y's represents data on the dependent variable, and known_x's represents data on one or more independent variables. The second argument is optional. If the second argument is omitted, it is assumed to be an array of the same size as known_y's that contains the values {1, 2, 3, and so on}.

The last argument is set to TRUE if you want additional statistics (various sums of squares, r-squared, f-statistic, or standard errors of the regression coefficients, for example). In this case, LINEST must be entered as an array formula. The last argument is optional; if it is omitted, it is interpreted as FALSE. The array's dimensions are five rows by a number of columns that is equal to the number of independent variables plus one if the third argument is set to TRUE (if the third argument is not set to TRUE, the number of columns is equal to the number of independent variables). Setting the third argument to FALSE in Microsoft Excel 2002 and in earlier versions of Excel requires a workaround. This workaround is discussed later in this article.

In the most common uses of LINEST, the argument intercept is set to TRUE. This setting means that you want the linear regression model to include the possibility of a non-zero intercept coefficient in its model. If known_x's is represented in data columns, setting intercept to TRUE tells LINEST to add a data column that is filled with 1s as data on an additional independent variable. The intercept argument should be set to FALSE only if you want to force the regression line to go through the origin. For Excel 2002 and for earlier versions of Excel, setting this argument to FALSE always returns results that are not correct, at least in the detailed statistics that are available from LINEST. This article discusses this issue and provides a workaround. This problem has been corrected in Excel 2003 and in later versions of Excel. The third argument is optional; if it is omitted, it is interpreted as TRUE.

For ease of exposition in the remainder of this article, assume that the data is arranged in columns, so that known_y's is a column of y data and known_x's is one or more columns of x data. The dimensions (or lengths) of each of these columns must be equal. All the following observations are equally true if the data is not arranged in columns, but it is easier to discuss this single (most frequently used) case.

Another reason for setting the intercept argument to FALSE is if you have already explicitly modeled the intercept in the data by including a column of 1s. In Excel 2002 and in earlier versions of Excel, the best solution is to ignore the column of 1s and to call LINEST with this column missing from known_x's and with the intercept argument set to TRUE. Excel 2002 and earlier versions of Excel always return results that are not correct when the intercept argument is set to FALSE. For Excel 2003 and for later versions of Excel, this approach is also preferred, although the formulas have been corrected for Excel 2003 and for later versions of Excel.

The performance of LINEST in earlier versions of Excel (or more precisely, the performance of the Analysis ToolPak's linear regression tool that calls LINEST) has been justifiably criticized (see the "References" section in this article for more information). The main concern about Excel's linear regression tools is a lack of attention to issues of collinear (or nearly collinear) predictor variables. Using datasets that were provided by the National Institute for Standards and Technology (NIST, formerly the National Bureau of Standards) that were designed to test the effectiveness of statistical software, numeric inaccuracies were found in the areas of linear regression, analysis of variance, and non-linear regression. In Excel 2003 and in later versions of Excel, these problems have been addressed, except for non-linear regression, caused by an issue with the Solver add-in instead of with the statistical functions or the Analysis ToolPak. The RAND function in Excel was also put through standard tests of randomness and reported subpar results. The RAND function has also been revised in Excel 2003 and in later versions of Excel.

LINEST was using the "Normal Equations" for finding regression coefficients. This method is less stable numerically than Singular Value Decomposition or QR Decomposition. Excel 2003 and later versions of Excel have implemented QR Decomposition. While this is a standard technique that is described in many texts, a small example is discussed in this article. QR Decomposition effectively analyzes collinearity issues and excludes any data column from the final model if that column can be expressed as a sum of multiples of the included columns. Near collinearity is treated in the same way; a set of columns is nearly collinear if, when you try to express one data column as a sum of multiples of others, the resulting fit is extremely close. For example, the sum of squared differences between the data column and the fitted values is less than 10^(-12).

The LINEST Help file has been updated in Excel 2003 and in later versions of Excel.

In summary, the main changes are as follows:

  • The computational formulas for additional statistics (such as r-squared and various sums of squares) that are used when intercept is set to FALSE have been corrected.
  • QR Decomposition has been implemented for solving all cases, regardless of the settings of the third and fourth arguments.

Syntax

LINEST(known_y's, known_x's, intercept, statistics)

The arguments, known_y's and known_x's, must be arrays or cell ranges that have related dimensions. If known_y's is one column by m rows, known_x's should be c columns by m rows where c is greater than or equal to one; c is the number of predictor variables; and m is the number of data points. (Similar relationships must hold if known_y's is laid out in a single row; known_x's should be in r rows where r is greater than or equal to one row, and known_y's and known_x's should have the same number of columns.) The intercept and statistics arguments must be set to TRUE or FALSE (or 0 or 1, which Excel interprets as FALSE or TRUE, respectively). The last three arguments to LINEST are all optional. If you omit the second argument, LINEST assumes a single predictor that contains the entries {1, 2, 3, ...}. If the third argument is omitted, it is interpreted as TRUE. If the fourth argument is omitted, it is interpreted as FALSE.

The most common usage of LINEST includes two ranges of cells that contain the data, such as LINEST(A1:A100, B1:F100, TRUE, TRUE). Because there is typically more than one predictor variable, the second argument in this example contains multiple columns. In this example, there are one hundred subjects, one dependent variable value (known_y's) for each subject, and five independent variable values (known_x's) for each subject.

Example of usage

Separate Excel worksheet examples are provided to illustrate different key concepts.

To illustrate a negative sum of squares in Excel with the third argument set to FALSE, follow these steps:

  1. Create a blank Excel worksheet, and then copy the following table.
  2. In Excel 2003 and in earlier versions of Excel, click cell A1 in your blank Excel worksheet, and then click Paste on the Edit menu so that the entries in the table fill cells A1:H19 in your worksheet.


In Excel 2007, click cell A1 in your blank Excel worksheet, and then click Paste in the Clipboard group on the Home tab so that the entries in the table fill cells A1:H19 in your worksheet.

  1. After you paste the table into your new Excel worksheet, click Paste Options, and then click Match Destination Formatting.
  2. In Excel 2003 and in earlier versions of Excel, while the pasted range is still selected, point to Column on the Format menu, and then click AutoFit Selection.


In Excel 2007, while the pasted range is still selected, click Format in the Cells group on the Home tab, and then click AutoFit Column Width.

X Y
1 11
2 12
3 13
Excel 2002 and earlier versions of Excel Excel 2003 and later versions of Excel
LINEST OUTPUT: LINEST OUTPUT: LINEST OUTPUT:
5.285714286 0 5.285714286 0 5.285714286 0
1.237179148 #N/A 1.237179148 #N/A 1.237179148 #N/A
0.901250823 4.629100499 -20.42857143 4.629100499 0.901250823 4.629100499
18.25333333 2 -1.906666667 2 18.25333333 2
391.1428571 42.85714286 -40.85714286 42.85714286 391.1428571 42.85714286
2 <--LINEST's total sum of squares
42.85714286 <--LINEST'S correct residual sum of squares
-40.85714286 <-- difference, LINEST's regression sum of squares
434 <--Correct total sum of squares
42.85714286 <--LINEST's correct residual sum of squares
391.1428571 <-- difference, correct regression sum of squares

Entries in cells A7:B11 correspond to output in Excel 2003 and in later versions of Excel. To generate output that is appropriate for your version of Excel, click cell A7, select the cell range A7:B11, and then enter the following array formula:

= LINEST(B2:B4, A2:A4, FALSE, TRUE)

This example focuses on a LINEST model that has the third argument set to FALSE. LINEST in Excel 2002 and in earlier versions of Excel use a formula for total sum of squares that is not correct in this case. This formula underestimates the real total sum of squares and always leads to values of the regression sum of squares that are not correct. This formula sometimes yields negative regression sum of squares and negative r-squared values.

Cells D6:E11 show the LINEST output in Excel 2002 and in earlier versions of Excel. In these versions of Excel, LINEST computes the total sum of squares for the model that has the third argument set to FALSE as the sum of squared deviations of y-values about the y column mean. This value is shown in cell A13 and is an appropriate computation when the third argument is set to TRUE. However, when the third argument is set to FALSE, the correct total sum of squares is the sum of squares of the y-values and is shown in cell A17. Use of the wrong formula for total sum of squares leads to the negative regression sum of squares in cell A15. The correct output in Excel 2003 is shown in cells G6:H11.

If you use an earlier version of Excel and if you want to force the best fit linear regression through the origin, you must compute some entries in the last three rows of the output array again. To do this, use the following workaround.

Note You can refer to the previous worksheet.

  1. Call Excel with the fourth argument set to TRUE to generate the detailed output array. Because you use Excel 2002 or earlier versions of Excel, assume that this output is in cells D7:E11.


Note that only the following entries require modification: r squared, f statistic, and regression sum of squares. These entries appear in cells D9, D10, and D11.

  1. Compute the total sum of squares again as SUMSQ(known_y's). In this example, SUMSQ(B2:B4).


The regression sum of squares (the value to replace the entry in cell D11) is SUMSQ(B2:B4) – E11. This value is total sum of squares minus the residual sum of squares (as computed correctly by LINEST).

  1. R squared (the value to replace the entry in cell D9) is then the regression sum of squares divided by total sum of squares.
  2. F statistic is f statistic for LINEST (in cell D10) multiplied by the correct regression sum of squares, and then divided by the LINEST regression sum of squares (in cell D11).

This procedure corrects the formulas in Excel 2002 and in earlier versions Excel, but does not address collinearity. Therefore, the procedure works well only without collinearity (the typical case in practice). Numeric problems can be magnified when collinearity or near-collinearity exists, similar to what occurs in the NIST datasets. Even simple cases can create problems, as illustrated in the next example.

Predictor columns (known_x's) are collinear if at least one column, c, can be expressed as a sum of multiples of others (c1, c2, and perhaps additional columns). Column c is frequently called redundant because the information that it contains can be constructed from the columns c1, c2, and other columns. The fundamental principle in the presence of collinearity is that results should not be affected by whether a redundant column is included in the original data or removed from the original data. Because LINEST in Excel 2002 and in earlier versions of Excel did not look for collinearity, this principle was easily violated. Predictor columns are nearly collinear if at least one column, c, can be expressed as almost equal to a sum of multiples of others (c1, c2, and others). In this case, "almost equal" means a very small sum of squared deviations of entries in c from corresponding entries in the weighted sum of c1, c2, and other columns; "very small" might be less than 10^(-12), for example.

To illustrate collinearity, follow these steps:

  1. Create a blank Excel worksheet, and then copy the following table.
  2. In Excel 2003 and in earlier versions of Excel, click cell A1 in your blank Excel worksheet, and then click Paste on the Edit menu, so that the entries in the table fill cells A1:N27 in your worksheet.


In Excel 2007, click cell A1 in your blank Excel worksheet, and then click Paste in the Clipboard group on the Home tab so that the entries in the table fill cells A1:N27 in your worksheet.

  1. After you paste the table into your new Excel worksheet, click Paste Options, and then click Match Destination Formatting.
  2. In Excel 2003 and in earlier versions of Excel, while the pasted range is still selected, point to Column on the Format menu, and then click AutoFit Selection.


In Excel 2007, while the pasted range is still selected, click Format in the Cells group on the Home tab, and then click AutoFit Column Width.

y's: x's:
1 1 2 1
2 3 4 1
3 4 5 1
4 6 7 1
5 7 8 1
LINEST using columns B,C: Values in Excel 2002 and in earlier versions of Excel: Values in Excel 2003 and in later versions of Excel:
#NUM! #NUM! #NUM! 0 0.657895 0.236842
#NUM! #NUM! #NUM! 0 0.04386 0.206653
#NUM! #NUM! #NUM! 0.986842 0.209427 #N/A
#NUM! #NUM! #NUM! 225 3 #N/A
#NUM! #NUM! #NUM! 9.868421 0.131579 #N/A
LINEST using columns B, C, D with FALSE 3rd arg:
0.403646 -0.1668 0.824698 0 0 0.236842 0.421053 0
2484491 2484491 2484491 #N/A 0 0.206653 0.246552 #N/A
0.986842 0.256495 #N/A #N/A 0.997608 0.209427 #N/A #N/A
50 2 #N/A #N/A 625.5 3 #N/A #N/A
9.868421 0.131579 #N/A #N/A 54.86842 0.131579 #N/A #N/A
LINEST using column B only
0.657895 0.236842 0.657895 0.236842
0.04386 0.206653 0.04386 0.206653
0.986842 0.209427 0.986842 0.209427
225 3 225 3
9.868421 0.131579 9.868421 0.131579

Data is included in cells A1:D6. Results of three different calls to LINEST are shown for Excel 2002 and for earlier versions of Excel in cells F8:I27, and the results for Excel 2003 and for later versions of Excel are in cells K8:N27.

To verify that the results in your version coincide with the results in cells F8:I27 or in cells K8:N27, you can enter the following three array formulas:

  • Select cell A9 and the cell range A9:C13, and then enter the following formula as an array formula:

    =LINEST(A2:A6,B2:C6,TRUE,TRUE)
  • Select cell A16 and the cell range A16:D20, and then enter the following formula as an array formula:

    =LINEST(A2:A6,B2:D6,FALSE,TRUE)
  • Select cell A23 and the cell range A23:B27, and then enter the following formula as an array formula:

    =LINEST(A2:A6,B2:B6,TRUE,TRUE)

The first model, in rows 8 to 13, uses columns B and C as predictors. By omitting the third argument, the first model requests Excel to model the intercept. Excel then effectively inserts an additional predictor column that looks just like cells D2:D6. Entries in column C in rows 2 to 6 are exactly equal to the sum of the corresponding entries in columns B and D. Therefore, collinearity is present because column C is a sum of multiples of column B and an additional column of 1s inserted by LINEST because the third argument to LINEST was omitted or set to TRUE (the "normal" case). Collinearity causes numeric problems, Excel 2002 and earlier versions of Excel cannot compute results, and the LINEST output table is filled with #NUM!.

The second model, in rows 15 to 20, uses columns B, C, and D as predictors but sets the third argument of LINEST to FALSE. Because the intercept was explicitly modeled through column D, you do not want Excel to separately model the intercept by building a second column of 1s. Again, collinearity is present because entries in column C in rows 2 to 6 are exactly equal to the sum of corresponding entries in columns B and D. Analyzing the presence of collinearity is not affected by the fact that column D is explicitly used in this model and a similar column of 1s is created internally by Excel in the first model. In this case, values are computed for the LINEST output table, but some of the values are not appropriate.

Any version of Excel can handle the third model (in rows 22 to 27). There is no collinearity, and Excel models the intercept, thereby avoiding the model with the third argument set to FALSE (that uses the incorrect formulas to compute some statistics in versions of Excel earlier than Excel 2003). This example is included in this article for the following reasons:

  • This example is perhaps most typical of practical cases: no collinearity is present and the third argument to LINEST is either TRUE or omitted. All versions of Excel can handle these cases. If you use Excel 2002 or an earlier version of Excel, numeric problems are not likely to occur in these cases.
  • This example is used to compare behavior of Excel 2003 and of later versions of Excel in the three models. Most major statistical packages analyze collinearity, remove a column that is a sum of multiples of others from the model, and alert you with a message like "Column C is linearly dependent on other predictor columns and has been removed from the analysis."

In Excel 2003 and in later versions of Excel, the message is conveyed in the LINEST output table instead of in a text string. A regression coefficient that is zero and whose standard error is also zero corresponds to a coefficient for a column that has been removed from the model. The entries in cells K9:K10 show this. In this case, LINEST removed column C (coefficients in cells K9, L9, M9 correspond to columns C, B, and Excel's intercept column, respectively). When collinearity is present, any one of the columns that are involved can be removed.

The second model in rows 15 to 20 sets the third argument of LINEST to FALSE. The entries in cells N16:N17 are Excel's standard way of conveying this information. Entries in cells K16:K17 show that LINEST removed one column (column D) from the model. Coefficients in columns L and M are for data columns C and B, respectively.

In the third model, in rows 22 to 27, no collinearity is present and no columns are removed. The predicted y values are the same in all three models because explicitly modeling an intercept (like in the second model) provides exactly the same modeling capability as implicitly modeling it in Excel internally (like in the first model and the third model). Also, removing a redundant column that is a sum of multiples of others (like in the first model and the second model) does not reduce the goodness of fit of the resulting model. Such columns are removed precisely because they represent no value added in trying to find the best least squares fit.

The following example is a final example of collinearity. The data in this example is also used in the QR Decomposition example in this article. To illustrate the final example of collinearity, follow these steps:

  1. Create a blank Excel worksheet, and then copy the following table.
  2. In Excel 2003 and in earlier versions of Excel, click cell A1 in your blank Excel worksheet, and then on the Edit menu, click Paste so that the entries in the table fill cells A1:D25 in your worksheet.

    In Excel 2007, click cell A1 in your blank Excel worksheet, and then click Paste in the Clipboard group on the Home tab so that the entries in the table fill cells A1:D25 in the worksheet.
  3. After you paste the table into your new Excel worksheet, click Paste Options, and then click Match Destination Formatting.
  4. In Excel 2003 and in earlier versions of Excel, while the pasted range is still selected, on the Format menu, point to Column, and then click AutoFit Selection.

    In Excel 2007, while the pasted range is still selected, click Format in the Cells group on the Home tab, and then click AutoFit Column Width.
  5. Select cell A7 and the cell range A7:C11. The formula editing bar should display the following information:

    =LINEST(A2:A5,C2:D5,,TRUE)

  6. Enter the information from the formula editing bar as an array formula by pressing CTRL+SHIFT+ENTER.

    Cells A7:C11 show LINEST results that match the values in cells A13:C18 or cells A20:C25, depending on the version of Excel that you use.
Y X0 X1
10 1 11
20 4 20
30 8 32
40 7 29
=LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE)
=LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE)
=LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE)
=LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE)
=LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE) =LINEST(A2:A5,C2:D5,,TRUE)
Excel 2002 values:
-3.5 14.1666666666667 34.6666666666666
0 0 0
0.806666666666667 9.83192080250175 #N/A
2.08620689655172 1 #N/A
403.333333333333 96.6666666666666 #N/A
Excel 2003 values:
1.22222222222222 0 -3.11111111111111
0.423098505881328 0 10.3334826751454
0.806666666666667 6.95221787153807 #N/A
8.3448275862069 2 #N/A
403.333333333333 96.6666666666667 #N/A

This model illustrates that the presence of collinearity might not be that easy to determine. Examining cells C2:D5 requires that you are aware that LINEST is modeling the intercept by providing a built-in column of 1s. If you call this column, X2, you might notice that X1 can be represented as 3*X0 + 8*X2.

All versions of Excel provide the same goodness of fit as measured by cell B18 and cell B25. However, Excel 2002 provides all zeros as the values for the standard errors of the regression coefficients.

The entries for df in cell B17 and cell B24 differ. The f-statistics in cell A17 and cell A24 also differ. The df for Excel 2003 is correct for a model with two predictor columns, exactly what the model uses (Excel's built-in intercept column and X1). The df for Excel 2002 is appropriate for three predictor columns. However, because of collinearity, there are only two predictor columns. There are only two predictor columns because after you have used any two of the three columns, expanding the model to use the third column has no value added. Therefore, because of collinearity, the entry in cell B17 is not correct and the entry in cell B24 is correct. The incorrect value of df affects statistics that depend on df: the f ratios in cell A17 and cell A24 and the standard error of y in cell B16 and cell B23. Entries in cell A17 and cell B16 are not correct; the entries in cell A24 and cell B23 are correct.

The following example illustrates the QR Decomposition algorithm. It has two primary advantages over the algorithm that uses the "Normal Equations." First, results are more stable numerically. When collinearity is not an issue, results are typically accurate to more decimal places with QR Decomposition. Second, QR Decomposition appropriately handles collinearity. It can be thought of as "processing" columns one at a time, and it does not process columns that are linearly dependent on previously processed columns. The previous algorithm does not correctly handle collinearity. If collinearity is present, the results from the previous algorithm are frequently distorted, sometimes to the point of returning #NUM!.

Y X0 X1
10 1 11
20 4 20
30 8 32
40 7 29
col means:
=AVERAGE(A2:A5) =AVERAGE(C2:C5) =AVERAGE(D2:D5)
centered data with added col of 1's, X2:
Y X0 X1 X2
=A2-A$7 =C2-C$7 =D2-D$7 1
=A3-A$7 =C3-C$7 =D3-D$7 1
=A4-A$7 =C4-C$7 =D4-D$7 1
=A5-A$7 =C5-C$7 =D5-D$7 1
TotalSS:
=SUMSQ(A11:A14)
X col squared lengths:
=SUMSQ(C11:C14) =SUMSQ(D11:D14) =SUMSQ(E11:E14)
after swapping cols:
Y X1 X0 X2
=A11 =D11 =C11 1
=A12 =D12 =C12 1
=A13 =D13 =C13 1
=A14 =D14 =C14 1
compute V: V and VTV: and V times V transpose:
=C23 =SQRT(D19) 1 =A29+B$29*C29 =SUMSQ(E29:E32) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32))
=C24 0 =A30+B$29*C30 =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32))
=C25 0 =A31+B$29*C31 =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32))
=C26 0 =A32+B$29*C32 =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32)) =MMULT(E29:E32,TRANSPOSE(E29:E32))
compute P = I - (2/VTV)*VVtranspose premultiply X by P: and Y by P:
X1 X0 X2 Y
=-(2/$G$29)*I29+1 =-(2/$G$29)*J29 =-(2/$G$29)*K29 =-(2/$G$29)*L29 =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,A23:A26)
=-(2/$G$29)*I30 =-(2/$G$29)*J30+1 =-(2/$G$29)*K30 =-(2/$G$29)*L30 =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,A23:A26)
=-(2/$G$29)*I31 =-(2/$G$29)*J31 =-(2/$G$29)*K31+1 =-(2/$G$29)*L31 =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,A23:A26)
=-(2/$G$29)*I32 =-(2/$G$29)*J32 =-(2/$G$29)*K32 =-(2/$G$29)*L32+1 =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,C23:E26) =MMULT(A35:D38,A23:A26)
Squared lengths of X, Y cols are unchanged after you premultiply by P: =SUMSQ(G35:G38) =SUMSQ(H35:H38) =SUMSQ(I35:I38) =SUMSQ(L35:L38)
Algorithm continues with only bold portions of the revised X matrix and Y column
squared lengths of last 3 rows of X vectors: =SUMSQ(H36:H38) =SUMSQ(I36:I38)
after swapping cols:
Y X1 X2 X0
=L35 =G35 =I35 =H35
=L36 =G36 =I36 =H36
=L37 =G37 =I37 =H37
=L38 =G38 =I38 =H38
compute V: V and VTV: and V times V transpose:
=D47 =SQRT(I42) 1 =A52+B$52*C52 =SUMSQ(E52:E54) =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54))
=D48 0 =A53+B$52*C53 =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54))
=D49 0 =A54+B$52*C54 =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54)) =MMULT(E52:E54,TRANSPOSE(E52:E54))
compute P = I - (2/VTV)*VVtranspose premultiply X by P: and Y by P:
X1 X2 X0 Y
=-(2/$G$52)*I52+1 =-(2/$G$52)*J52 =-(2/$G$52)*K52 =C46 =D46 =E46 =L35
=-(2/$G$52)*I53 =-(2/$G$52)*J53+1 =-(2/$G$52)*K53 =G36 =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,A47:A49)
=-(2/$G$52)*I54 =-(2/$G$52)*J54 =-(2/$G$52)*K54+1 =G37 =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,A47:A49)
=G38 =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,D47:E49) =MMULT(A57:C59,A47:A49)
Rewrite: effectively 0 --> 0: X1 X2 X0 Y
=G35 0 =I57 =L35
0 =H58 0 0
0 0 0 =L59
0 0 0 =L60
QR Decomposition main loop terminates because longest remaining sub-vector has length 0
regression coeffts by backsubstitution: =(L64- H64*H71)/G64 =L65/H65 0
residual SS from last 2 rows of Y: =SUMSQ(L66:L67)
Excel 2003 LINEST:
SSRegression = SSTotal - SSResidual: =A17-G72 1.22222222222222 0 -3.11111111111111
R-squared = SSRegression / SSTotal =E74/A17 0.423098505881328 0 10.3334826751454
DF = 2 (see article) 0.806666666666667 6.95221787153807 #N/A
stdErrorY = sqrt(SSResidual/DF) =SQRT(G72/2) 8.3448275862069 2 #N/A
FStatistic = (SSRegression / (DF Regression)) / (SSResidual/DF): =(E74/(0+1))/(G72/2) 403.333333333333 96.6666666666667 #N/A
Intercept: =A7 - I71*C7 - G71*D7

This worksheet uses the same data as the previous worksheet. QR Decomposition performs a sequence of orthogonal linear transformations. The original data is in cells A1:D5. The first step is to "center" the data for the original columns, and then explicitly add a column of 1s (it is assumed that the third argument to LINEST is TRUE or omitted). The revised data is shown in cells A10:D14. Column X2 is the added column of 1s. To center the data, find the mean in each column (shown in cells A7:D7) and then subtract it from each observation in the respective column. Therefore, for the Y, X0, and X1 columns, the original data has been replaced by deviations about the column means. Centering is useful in minimizing round off errors. The value in cell A17 is the sum of squares of the centered Y values, the Total Sum of Squares for the regression analysis. Values in cells C19:E19 are sums of squares of the centered X0 and X1 columns and the (non-centered) column of 1s that is named X2. You can interchange columns X0, X1, and X2 so that the one with the largest sum of squares comes first. These results are in cells A21:D26. When you interchange columns, you must keep track of the location of each original column.

After these preliminary changes, you can use the main loop of the QR Decomposition algorithm. You want to find a 4x4 matrix (because there are 4 rows of data) that you can use to premultiply each column. This transformation does not change the squared lengths of each column. You first find the column vector V by taking the first column and adding the square root of the column's sum of squares (computed in cell B29) to its first entry. Other entries in the first column are not changed. This action yields the vector in cells E29:E32. The sum of squares in V (as VTV) is in cell G29. (Note The T must be a superscript.) The 4x4 matrix VVT is in cells I29:L32. Use this information to compute the 4x4 transformation matrix, P, by using the following formula.

P = I – (2/ VTV)* VVT

Note All Ts must be superscript.

The resulting matrix P is displayed in cells A35:D38. If you premultiply the revised X columns in cells C23:E26 by P, you receive the results in cells G35:I38. Similarly, if you premultiply the revised Y column in cells A23:A26 by P, you receive the results in cells L35:L38. The X1 column has been transformed so that it still has the same sum of squares as before, but all entries except the top entry in the column are 0. More precisely, entries in cells G36:G38 are "effectively 0" because they are zero to fifteen decimal places. In row 40, sums of squares for all columns are computed and are not changed by the transformation.

The algorithm continues for a second iteration of the main loop and uses only the X0 and X2 data in cells H36:I38 and the Y data in cells L36:38. Because you are concerned with only three rows, you can calculate the sums of squares for only the last three rows of the X0 and X2 columns. These values are displayed in cells H42:H43. The sum of squares of X0 is essentially 0. The X0 and X2 columns are swapped because X2 has the larger relevant sum of squares. After the columns are swapped, revised columns are displayed in cells A45:E49. V is computed exactly as in the first iteration except that now V has only three rows. Computations of VTV, VVT, and P continue exactly as before and are shown in rows 51-54 and cells A57:C59. You can then premultiply only the last three rows of the X2, X0, and Y columns by P to yield the revised columns in cells G56:L60. To make this more readable, these columns are rewritten in cells G63:L67 by setting values that are effectively zero to exactly zero.

The next iteration only involves the X0 column and its last two rows. Because the sum of squares of entries in these rows is zero, the main loop of the algorithm terminates.

The residual sum of squares is the sum of squares of revised Y vector entries below the second row. All the rows that were not processed at the time the main loop of the QR Decomposition algorithm terminated are included here. In this case, processing stopped because the last two rows in the X0 column contained only zeros. The residual sum of squares is calculated in cell G74. You can see from the entries in cells G63:L67 that any values for the coefficients of the Xs leave a fitted value of zero for each of these last two rows. The values of coefficients for X1 and X2 that have been found yield an exact fit to Y values in the first two rows. Therefore, Y has been transformed so that its total sum of squares is not changed, the residual sum of squares is the sum of squares in the last two rows, and the regression sum of squares is the sum of squares in the first two rows.

The algorithm spotted collinearity when it noticed that the remaining entries in the X0 column were zero. At this point, no columns remain whose coefficients may improve the fit. The X0 column does not contain any useful additional information because X1 and X2 are already included in the model. Although X2 has a coefficient of zero, this does not make it a redundant column that is eliminated as a result of collinearity.

At this point, you can extract most of the summary statistics that LINEST provides. However, this article does not discuss how to determine standard errors of the regression coefficients. Values from LINEST output in Excel 2003 are shown in cells I74:K78 for comparison. The regression sum of squares is calculated in cell E74 and R-squared is calculated in cell E75; these values are displayed in the LINEST output in cell I78 and cell I76, respectively. The residual sum of squares (or error sum of squares) is calculated in cell G72 and displayed in the LINEST output in cell J78.

Other entries in the LINEST output depend on the degrees of freedom (DF). Many statistical packages report Regression DF, Error DF, and Total DF. Excel reports only Error DF (in cell J77). Earlier versions of Excel compute Error DF correctly in all cases except when there is collinearity that should have eliminated one or more predictor columns. The value of Error DF depends on the number of predictor columns that are actually used. With collinearity, Excel 2003 handles this computation correctly, while earlier versions count all predictor columns even though one or more should have been eliminated by collinearity.

Degrees of freedom is examined here in more detail. Assume that collinearity is not an issue. When the intercept is fitted, in other words, the third argument to LINEST is missing or true:

  • Total DF equals the number of rows (or datapoints) minus one.
  • Regression DF equals the number of predictor columns (not including the column for intercept).
  • Error DF equals Total DF minus Regression DF.

When the intercept is not fitted, in other words, the third argument to LINEST is false:

  • Total DF equals the number of rows (or datapoints).
  • Regression DF equals the number of predictor columns.
  • Error DF equals Total DF minus Regression DF.

The only difference between these two cases is the "minus one" in the formula for Total DF in the more common case where the intercept is fitted.

Earlier versions of Excel use these formulas to correctly compute DF, except that Excel 2002 does not look for collinearity. Looking for collinearity is one of the reasons for using QR Decomposition for these computations.

The predictor columns form a matrix. If the intercept is fitted, there is effectively an additional column of 1s that does not appear on your spreadsheet. QR Decomposition determines the rank of this matrix. The previous formulas for Regression DF should be changed to the following formulas:

  • For the "fitted" case: Regression DF equals the rank of the matrix of predictor columns (including a column of 1s for intercept) minus one (for the column for intercept)
  • For the "not fitted" case: Regression DF equals the rank of the matrix of predictor columns

Also, because Excel uses finite arithmetic, "rank" is really "approximate rank", so a column is linearly dependent on a set of other columns if there is a weighted sum of columns in the set, a vector, whose Euclidean distance from the column is very close to zero.

In the example on the worksheet, the intercept was fitted. Total DF is 4 – 1 = 3; Regression DF is 2 – 1 = 1; Error DF is Total DF – Regression DF = 3 – 1 = 2. For this example, Excel 2002 and earlier versions of Excel calculated Regression DF as 3 – 1 = 2 and Error DF as 3 – 2 = 1. The difference comes from the failure to look for collinearity. Earlier versions of Excel noted that there were three predictor columns; Excel 2003 examined these three columns and found that there were really only two.

Standard error of Y is calculated in cell E77 and is shown in the LINEST output in cell J76. The f statistic is calculated in cell H78 and in the LINEST output in cell I77. The formula for the f statistic is:

(SSRegression / DF Regression) / (SSError / DF Error)

In this example, the f statistic is:

(403.333 / 1) / (96.667 / 2) = 8.345

The LINEST output in cells I74:K74 shows the regression coefficients for X1, X0, and the fitted intercept. The coefficient for the intercept, -3.1111, differs from the coefficient for the column X2. This difference occurs because the data was centered to find the best fit linear regression model for this data. Optimal regression coefficient values for X1 and X0 are not affected by centering this data. Centering the data causes the best fit to pass through the origin. Centering this data is the reason that an optimal coefficient of zero for X2 (the column that was added to represent the intercept) was found. Fortunately, you can recover the corresponding intercept coefficient for the original model with little additional effort. The intercept coefficient can be found using the following formula:

Y col mean minus the sum over all X columns (except the intercept column) of X col regression coefficient times X col mean


This value is calculated in C80 and agrees with the LINEST output in cell K74.

Summary of results in earlier versions of Excel

LINEST used a formula that is not correct to find the total sum of squares when the third argument to LINEST was set to FALSE. This formula caused values that are not correct in the regression sum of squares and values that are not correct for the other output that depends on the regression sum of squares: r squared and the f statistic.

Regardless of the value of the third argument, LINEST was calculated by using an approach that paid no attention to collinearity issues. The presence of collinearity caused round off errors, standard errors of regression coefficients that are not appropriate, and degrees of freedom that are not appropriate. Sometimes, round off problems were sufficiently severe that LINEST filled its output table with #NUM!. LINEST generally provides acceptable results if the following conditions are true:

  • There are no collinear (or nearly collinear) predictor columns.
  • The third argument to LINEST is TRUE or is omitted.

However, solving for regression coefficients using the "Normal Equations" is more prone to round off errors than using the QR Decomposition approach that is used in Excel 2003 and in later versions of Excel. Even though these coefficients are more prone to round off errors, they are not likely to be problematic for most practical cases.

Summary of results in Excel 2003

Improvements include correcting the formula for total sum of squares in the case where the third argument to LINEST was set to FALSE and switching to the QR Decomposition method of determining the regression coefficients. QR Decomposition has the following two advantages:

  • Better numeric stability (generally smaller round off errors)
  • Analysis of collinearity issues

Collinearity is the main concern, particularly after tests were performed on NIST datasets.

Conclusions

LINEST has been greatly improved for Excel 2003 and for later versions of Excel. If you use an earlier version of Excel, verify that predictor columns are not collinear before you use LINEST. Be careful to use the workaround in this article if the third argument in LINEST is set to FALSE. Note that collinearity is only a problem in a small percentage of cases, and calls to LINEST with the third argument set to FALSE are also relatively rare in practice. Earlier versions of Excel give acceptable LINEST results when there is no collinearity and when the third argument in LINEST is TRUE or omitted. Improvements in LINEST affect the Analysis ToolPak's linear regression tool that calls LINEST and the following related functions:

  • TREND
  • LOGEST
  • GROWTH


REFERENCES

McCullough, B.D. and B. Wilson. "On the Accuracy of Statistical Procedures in Microsoft Excel 97." Computational Statistics and Data Analysis, 1999, 31, 27-37.

Keywords: kbinfo kbfunctions kbfuncstat kbexpertisebeginner KB828533