Knowledge Base

Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel

Article ID: 829215

Article Last Modified on 1/10/2007

APPLIES TO

Microsoft Office Excel 2007
Microsoft Office Excel 2003

SUMMARY

This article describes numeric improvements in each of the three Analysis ToolPak ANOVA tools. This article also illustrates inaccurate results in Microsoft Excel 2002 and in earlier versions of Excel in extreme situations.

MORE INFORMATION

Many functions require the calculation of the sum of squared deviations about a mean. To do this accurately, Microsoft Office Excel 2003 and later versions of Excel use a two-pass procedure that finds the mean on the first pass, and then calculates squared deviations about the mean on the second pass.

In precise arithmetic, the same result occurs in earlier versions of Excel that use the "calculator formula." This formula is so named because it was in widespread use when statisticians used calculators instead of computers. With the calculator formula, earlier versions of Excel sum the squares of the observations, and then subtract from this total the following quantity:

((sum of observations)^2) / number of observations

This calculation occurs in a single pass through the data.

In finite precision arithmetic, the calculator formula is subject to roundoff errors in extreme cases. Excel 2002 and earlier versions of Excel use the calculator formula for most functions that require a sum of squared deviations about a mean (such as VAR, STDEV, SLOPE, and PEARSON). However, these versions of Excel also use the more numerically robust two-pass procedure for the CORREL, COVAR, and DEVSQ functions.

Experts in statistical computing recommend that you do not use the calculator formula. The calculator formula is presented as "how not to do it" in texts about statistical computing. Unfortunately, all three of the Analysis ToolPak (ATP) ANOVA tools make widespread use of the calculator formula or an equivalent single-pass approach in Excel 2002 and in earlier versions of Excel.

Excel 2003 and later versions of Excel use the two-pass procedure for all three ATP ANOVA models. This article discusses the following computational improvements in ATP's three ANOVA models:

Single Factor
Two-Factor with Replication
Two-Factor without Replication

This article discusses these models later.

Because Excel has always used the two-pass procedure with DEVSQ, this article makes frequent use of it to describe the improved procedures. These revised procedures either effectively call DEVSQ or use code whose functionality is exactly the same as DEVSQ's functionality.

For each ANOVA tool, ATP output contains a Summary table with values of Count, Sum, Average, and Variance, and an ANOVA table that has various sums of squares and values of SS, df, MS, F and P-value. Results in the summary table are calculated by calling Excel functions COUNT, SUM, AVERAGE, and VAR. Of these four functions, only VAR is subject to roundoff errors.

Excel 2002 and earlier versions of Excel implement VAR by using the calculator formula. The following article about VAR describes the improvements that occurred in Excel 2003 and in later versions of Excel. This article also lets you experiment with numeric data to see when roundoff errors are likely to occur in earlier versions of Excel.

For more information about VAR, click the following article number to view the article in the Microsoft Knowledge Base:

826112 Excel statistical functions: VAR

As this article discusses the three ANOVA models, it focuses on the ANOVA output tables. In each case, the Summary tables are well-behaved in Excel 2003 and in later versions of Excel. In Excel 2002 and in earlier versions of Excel, problems occur in the Variance column when data have extreme values.

However, this article includes the Summary tables in the model sections because these tables are useful for comparison when you review the modified examples in the Appendix.

Model 1: Single Factor

A simple example with data is as follows.

ANOVA 1 BASIC MODEL:
1	2	3
2	4	4
3	6	5
4	8	6
5		7
6		8
Anova: Single Factor

SUMMARY
Groups	Count	Sum	Average	Variance
Column 1	6	21	3.5	3.5
Column 2	4	20	5	6.666667
Column 3	6	33	5.5	3.5


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Between Groups	12.75	2	6.375	1.506818	0.257897	3.805567
Within Groups	55	13	4.230769

Total	67.75	15

Excel 2002 and earlier versions of Excel use the following pseudocode to calculate the sums of squares:

GrandSum = 0;
GrandSumOfSqs = 0; 
GrandSampleMeanSqrd = 0; 
GrandMeanSqrd = 0; 
GrandSampleSize = 0;

For s = 1 to Number_of_Samples do
   GrandSum = GrandSum + sum of observations in s-th sample;
   GrandSumOfSqs = GrandSumOfSqs + sum of squared observations in s-th sample;
   GrandSampleMeanSqrd = GrandSampleMeanSqrd  +
      (sum of observations in s-th sample^2)/size of s-th sample;
   GrandSampleSize = GrandSampleSize + size of s-th sample
Endfor;

GrandMeanSqrd = (GrandSum^2) / GrandSampleSize;

TotalSS = GrandSumOfSqs – GrandMeanSqrd;
BetweenGroupsSS = GrandSampleMeanSqrd – GrandMeanSqrd;
WithinGroupsSS = GrandSumOfSqs – GrandSampleMeanSqrd;

This approach is essentially the calculator formula. This approach computes the sums of squares of observations, and then subtracts a quantity from them, just as VAR computes the sum of squares of the observations, and then subtracts sum of observations^2/sample size. Similar pseudocode for the model 2 and model 3 has been omitted.

Again, for model 2 and model 3, sums of squares are calculated and a quantity is subtracted from the sum of squares as in the calculator formula. Unfortunately, basic statistics texts frequently suggest approaches for ANOVA such as the one that is shown earlier in this article.

Excel 2003 and later versions of Excel use a different approach to calculate the various entries in the SS column of the ANOVA table. For illustration, this article assumes that the numeric data in the earlier example appear in cells A2:C7 with missing data in cells B6 and B7.

Total SS is just DEVSQ applied to all the data, such as DEVSQ(A2:C7). DEVSQ works correctly even though data is missing.
Between Groups SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(A2:A7) + DEVSQ(B2:B7) + DEVSQ(C2:C7).
Within Groups SS is Total SS minus Between Groups SS.

If entries in the SS column of the ANOVA table are calculated correctly, the accuracy of the other entries in the table follow.

Model 2: Two-Factor with Replication

A simple example with data is as follows.

ANOVA 2 BASIC MODEL	group 1	group 2	group 3
trial 1	1	2	3
	2	4	4
	3	6	5
trial 2	4	8	6
	5	10	7
	6	12	8
Anova: Two-Factor With Replication

SUMMARY	group 1	group 2	group 3	Total
trial 1
Count	3	3	3	9
Sum	6	12	12	30
Average	2	4	4	3.333333
Variance	1	4	1	2.5

trial 2
Count	3	3	3	9
Sum	15	30	21	66
Average	5	10	7	7.333333
Variance	1	4	1	6.25

Total
Count	6	6	6
Sum	21	42	33
Average	3.5	7	5.5
Variance	3.5	14	3.5


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Sample	72	1	72	36	6.22E-05	4.747221
Columns	37	2	18.5	9.25	0.003709	3.88529
Interaction	9	2	4.5	2.25	0.147973	3.88529
Within	24	12	2

Total	142	17

Again, if entries in the SS column are calculated correctly, the accuracy of all the other entries in the ANOVA part of the output follows.

Here is the computational procedure for Excel 2003 and for later versions of Excel. This procedure uses DEVSQ to calculate the various entries in the SS column of the ANOVA table. For illustration, this example assumes that the numeric data appear in cells B2:D7.

Total SS is just DEVSQ applied to all the data, such as DEVSQ(B2:D7).
Sample SS is Total SS minus the sum of DEVSQ applied to each sample, such as DEVSQ(B2:D4) + DEVSQ(B5:D7).
Columns SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(B2:B7) + DEVSQ(C2:C7) + DEVSQ(D2:D7).
Within SS is the sum of DEVSQ applied to each trial or group pair, such as DEVSQ(B2:B4) + DEVSQ(C2:C4) + DEVSQ(D2:D4) + DEVSQ(B5:B7) + DEVSQ(C5:C7) + DEVSQ(D5:D7).
Interaction SS equals Total SS minus Sample SS minus Columns SS minus Within SS.

Model 3: Two-Factor without Replication

A simple example with data is as follows.

ANOVA 3 BASIC MODEL:	LOW	MED	HI
POOR	1	2	3
	2	4	4
	3	6	5
MID CLASS	4	8	6
	5	10	7
	6	12	8
RICH	7	14	10
	8	12	6
	9	10	2

Anova: Two-Factor Without Replication

SUMMARY	Count	Sum	Average	Variance
POOR	3	6	2	1
	3	10	3.333333	1.333333
	3	14	4.666667	2.333333
MID CLASS	3	18	6	4
	3	22	7.333333	6.333333
	3	26	8.666667	9.333333
RICH	3	31	10.33333	12.33333
	3	26	8.666667	9.333333
	3	21	7	19

LOW	9	45	5	7.5
MED	9	78	8.666667	16
HI	9	51	5.666667	6.25


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Rows	176.6667	8	22.08333	5.76087	0.001476	2.591094
Columns	68.66667	2	34.33333	8.956522	0.002455	3.633716
Error	61.33333	16	3.833333

Total	306.6667	26

If the values in the SS column are calculated correctly, the accuracy of all the other values in the ANOVA table follows.

Excel 2003 and later versions of Excel use the following computational procedure. The procedure uses DEVSQ to calculate the values in the SS column of the ANOVA table. For illustration, this example assumes that the range of cells shown in the earlier example is cells A1:D10. Therefore, the numeric data appear in cells B2:D10.

Total SS is just DEVSQ applied to all the data, such as DEVSQ(B2:D10).
Rows SS is Total SS minus the sum of DEVSQ applied to each row, such as DEVSQ(B2:D2) + DEVSQ(B3:D3) + DEVSQ(B4:D4) + DEVSQ(B5:D5) + DEVSQ(B6:D6) + DEVSQ(B7:D7) + DEVSQ(B8:D8) + DEVSQ(B9:D9) + DEVSQ(B10:D10).
Columns SS is Total SS minus the sum of DEVSQ applied to each column, such as DEVSQ(B2:B10) + DEVSQ(C2:C10) + DEVSQ(D2:D10).
Error SS is Total SS minus Rows SS minus Columns SS.

Results in Excel 2002 and in earlier versions of Excel

In extreme cases where there are many significant digits in the data but also a small variance, the calculator formula leads to inaccurate results. The Appendix that appears later in this article gives examples of roundoff problems in such extreme situations.

Results in Excel 2003 and in later versions of Excel

Excel 2003 and later versions of Excel use a procedure that makes two passes through the data. On the first pass, Excel 2003 and later versions of Excel calculate the sum and count of the data values. From these, Excel can calculate the sample mean (average).

On the second pass, Excel calculates the squared difference between each data point and the sample mean, and then sums these squared differences. As a result, the results in Excel 2003 and in later versions of Excel are more stable numerically.

Conclusions

A two-pass approach improves the numeric performance in all three ATP ANOVA tools in Excel 2003 and in later versions of Excel as compared to earlier versions of Excel. The results that you obtain by using Excel 2003 and later versions of Excel are never less accurate than the results that you obtain by using earlier versions of Excel.

In most practical cases, however, there is no difference between these results. This is because data do not typically exhibit the kind of unusual behavior that the following Appendix illustrates. Numeric instability is most likely to occur in earlier versions of Excel when data contains a high number of significant digits with relatively little variation between data values.

If you use an earlier version of Excel, and if you want to see whether Excel 2003 or a later version of Excel gives you different ANOVA results, compare the results that you obtain when you use the ANOVA tools in your earlier version of Excel with the results that you obtain when you use the procedures that use DEVSQ.

Note The procedures that use DEVSQ were described earlier in this article for the ANOVA table that is associated with each of the tools.

To verify that Variances are correct in the Summary table for each range, use DEVSQ(range)/(COUNT(range) – 1).

Appendix: Numeric examples of the performance of Excel 2002 and earlier versions of Excel

For each basic example from models 1, 2, and 3, this article previously presented the ATP tool's output. This included the Summary and ANOVA tables. Data was modified in each example to create a "stressed" example. This is done by adding 10^8 to each data value. Adding a constant such as 10^8 to each data value does not affect Variance in the Summary table (but will affect Average and Sum in obvious ways). It should also not affect any entry in the ANOVA table.

If you compare Variances in the Summary tables and SS in the ANOVA tables, you will notice that all of these are incorrectly calculated in all three of the following stressed models except for one entry in model 3 that is pointed to with "<---".

In all the stressed cases, the ANOVA results that you obtain by using Excel 2003 and later versions of Excel agree with the earlier results in the basic cases (as they should).

ANOVA 1 stressed model with large data values

100000001	100000002	100000003
100000002	100000004	100000004
100000003	100000006	100000005
100000004	100000008	100000006
100000005		100000007
100000006		100000008

Anova: Single Factor

SUMMARY
Groups	Count	Sum	Average	Variance
Column 1	6	600000021	1E+08	4.8
Column 2	4	400000020	1E+08	8
Column 3	6	600000033	1E+08	1.6


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Between Groups	0	2	0	0	1	3.805567
Within Groups	64	13	4.923077

Total	64	15

ANOVA 2 stressed model with large data values

	group 1	group 2	group 3
trial 1	100000001	100000002	100000003
	100000002	100000004	100000004
	100000003	100000006	100000005
trial 2	100000004	100000008	100000006
	100000005	100000010	100000007
	100000006	100000012	100000008
Anova: Two-Factor With Replication

SUMMARY	group 1	group 2	group 3	Total
trial 1
Count	3	3	3	9
Sum	300000006	300000012	300000012	9E+08
Average	100000002	100000004	100000004	1E+08
Variance	0	4	0	4

trial 2
Count	3	3	3	9
Sum	300000015	300000030	300000021	9E+08
Average	100000005	100000010	100000007	1E+08
Variance	0	4	0	6

Total
Count	6	6	6
Sum	600000021	600000042	600000033
Average	100000004	100000007	100000005.5
Variance	4.8	14.4	1.6


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Sample	64	1	64	24	0.000367	4.747221
Columns	32	2	16	6	0.015625	3.88529
Interaction	32	2	16	6	0.015625	3.88529
Within	32	12	2.666666667

Total	128	17

ANOVA 3 stressed model with large data values

	LOW	MED	HI
POOR	100000001	100000002	100000003
	100000002	100000004	100000004
	100000003	100000006	100000005
MID CLASS	100000004	100000008	100000006
	100000005	100000010	100000007
	100000006	100000012	100000008
RICH	100000007	100000014	100000010
	100000008	100000012	100000006
	100000009	100000010	100000002

Anova: Two-Factor Without Replication

SUMMARY	Count	Sum	Average	Variance
Row 1	3	300000006	100000002	0
Row 2	3	300000010	100000003	2
Row 3	3	300000014	100000005	2
Row 4	3	300000018	100000006	4	<---
Row 5	3	300000022	100000007	6
Row 6	3	300000026	100000009	10
Row 7	3	300000031	100000010	12
Row 8	3	300000026	100000009	10
Row 9	3	300000021	100000007	18

Column 1	9	900000045	100000005	8
Column 2	9	900000078	100000009	14
Column 3	9	900000051	100000006	4


ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Rows	128	8	16	2	0.113281	2.591094
Columns	32	2	16	2	0.167772	3.633716
Error	128	16	8

Total	288	26

Keywords: kbexpertisebeginner kbfunctions kbprogramming kbfuncstat kbinfo KB829215

Microsoft KB Archive/829215

Contents

Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel

SUMMARY

MORE INFORMATION

Model 1: Single Factor

Model 2: Two-Factor with Replication

Model 3: Two-Factor without Replication

Results in Excel 2002 and in earlier versions of Excel

Results in Excel 2003 and in later versions of Excel

Conclusions

Appendix: Numeric examples of the performance of Excel 2002 and earlier versions of Excel

ANOVA 1 stressed model with large data values

ANOVA 2 stressed model with large data values

ANOVA 3 stressed model with large data values