Microsoft KB Archive/828129

= Excel statistical functions: PEARSON =

Article ID: 828129

Article Last Modified on 1/17/2007

-

APPLIES TO


 * Microsoft Office Excel 2007
 * Microsoft Office Excel 2003

-



SUMMARY
This article describes the PEARSON function in Microsoft Excel. This article discusses how to use the function. Additionally, this article compares results of PEARSON in Microsoft Office Excel 2003 and in later versions of Excel with the results of PEARSON in earlier versions of Excel.



MORE INFORMATION
The PEARSON(array1, array2) function returns the Pearson product-moment correlation coefficient between two arrays of data.

Syntax
PEARSON(array1, array2)

The arguments, array1 and array2, must be either numbers, names, array constants, or references that contain numbers.

The most common use of PEARSON includes two ranges of cells that contain the data, such as PEARSON(A1:A100, B1:B100).

Example of usage
To illustrate the PEARSON function, follow these steps:  Create a blank Excel worksheet, and then copy the following table.  Select cell A1 in your blank Excel worksheet, and then paste the entries so that the table fills cells A1:D13 in your worksheet. Click the Paste Options button, and then click Match Destination Formatting. With the pasted range still selected, use one of the following procedures, as appropriate for the version of Excel that you are running:  In Microsoft Office Excel 2007, click the Home tab, click Format in the Cells group, and then click AutoFit Column Width. In Microsoft Office Excel 2003, point to Column on the Format menu, and then click AutoFit Selection.

Note You may want to format cells B1:B6 as Number with 0 decimal places.

Cells A1:A6 and B1:B6 contain the two data arrays that are used in this example to call PEARSON and the CORREL function in cells A8 and A9. PEARSON and CORREL both compute the Pearson product-moment correlation coefficient and their results should agree.

In versions of Excel that are earlier than Excel 2003, PEARSON may exhibit round-off errors. The behavior of PEARSON has been improved in Excel 2003 and in later versions of Excel. CORREL has always been implemented with the improved procedure that is now used in Excel 2003 and in later versions of Excel. Therefore, if you are using PEARSON for a version of Excel that is earlier than Excel 2003, Microsoft recommends that you use CORREL instead.

In versions of Excel that are earlier than Excel 2003, you can use the worksheet in this article to run an experiment and discover when round-off errors occur. If you add a constant to each of the observations in B1:B6, the value of PEARSON or CORREL should not be affected. If you increase the value in D2, a larger constant is added to B1:B6. If D2 is less than 7, there are no round-off errors that appear in the first six decimal places of PEARSON. Now change the value of D2 to 7.25, 7.5, 7.75, and then 8. Cells D6:D13 of the worksheet show values of PEARSON and CORREL when D2 = 7.5 and when D2 = 8, respectively.

CORREL is still accurate, but round-off errors in PEARSON have become so severe that division by 0 occurs when D2 = 8.

Earlier versions of Excel exhibit incorrect answers in these cases because the effects of round-off errors are more profound with the computational formula used by these versions. Still, the cases in this experiment can be viewed as extreme.

In Excel 2003 and in later versions of Excel, you should see no changes in values of PEARSON if you try the experiment. However, cells D6:D13 show the same round-off errors that you obtain in earlier versions of Excel.

Results in earlier versions of Excel
If you name the two data arrays X's and Y's, earlier versions of Excel use a single pass through the data to compute the sum of squares of X's, the sum of squares of Y's, the sum of X's, the sum of Y's, the sum of XY's, and the count of the number of observations in each array. These quantities are then combined in the computational formula in the Help file in earlier versions of Excel.

Results in Excel 2003 and in later versions of Excel
The procedure that is used in Excel 2003 and in later versions of Excel uses a two-pass process through the data. First, the sums of X's and Y's and the count of the number of observations in each array are computed. From these, the means (averages) of X and Y observations can be computed. Then, on the second pass, the squared difference between each X and the X mean is found; these squared differences are summed. The squared difference between each Y and the Y mean is found; these squared differences are summed. Additionally, the products (X – X mean) * (Y – Y mean) are found for each pair of data points and are summed. These three sums are combined in the formula for PEARSON. None of these three sums are affected by adding a constant to each value in the Y array (or the X array), because that same value is added to the Y mean (or the X mean). In the numeric examples, even with a high power of 10 in cell D12, these three sums are not affected and the results of the second pass are independent of the entry in cell D2. Therefore, the results in Excel 2003 and in later versions of Excel are more stable numerically.

Conclusions
A two-pass approach guarantees better numeric performance of PEARSON in Excel 2003 and in later versions of Excel than the one-pass approach that is used in earlier versions of Excel. The results that you obtain in Excel 2003 and in later versions of Excel will never be less accurate than results that you obtained in earlier versions of Excel.

CORREL has the same functionality and has always been implemented with the approach that is used for PEARSON in Excel 2003 and for later versions of Excel. Therefore, CORREL is a better choice for earlier versions of Excel.

In most practical examples, however, you are not likely to notice a difference between the results in Excel 2003 and in later versions of Excel and the results in earlier versions of Excel. Typical data is unlikely to exhibit the kind of unusual behavior that this experiment illustrates. Numeric instability is most likely to appear in earlier versions of Excel when data contains both a high number of significant digits and relatively little variation between data values.

The procedure that finds the sum of squared deviations about a sample mean by finding the sample mean, by computing each squared deviation, and then by summing the squared deviations is more accurate than the alternative procedure. (The alternative procedure is frequently referred to as the &quot;calculator formula,&quot; because it is suitable for use by a calculator on a small number of data points.) The alternative procedure consists of the following steps:
 * 1) Find the sum of squares of all observations, the sample size, and the sum of all observations.
 * 2) Compute the sum of squares of all observations minus ((sum of all observations)^2)/sample size).

There are many other functions that have been improved in Excel 2003 and in later versions of Excel. These functions are improved because later versions of Excel replace the one-pass procedure with the two-pass procedure that finds the sample mean on the first pass and then computes the sum of squared deviations about the sample mean on the second pass.

A short list of such functions includes the following functions:
 * VAR
 * VARP
 * STDEV
 * STDEVP
 * DVAR
 * DVARP
 * DSTDEV
 * DSTDEVP
 * FORECAST
 * SLOPE
 * INTERCEPT
 * PEARSON
 * RSQ
 * STEYX

Similar improvements were made in each of the three Analysis of Variance tools in the Analysis ToolPak.

Keywords: kbformula kbexpertisebeginner kbinfo KB828129

-

[mailto:TECHNET@MICROSOFT.COM Send feedback to Microsoft]

© Microsoft Corporation. All rights reserved.