Proof of ANOVA’s Partitioning of Sum of Squares Formula

The analysis-of-variance (ANOVA) approach, whose main purpose is to assess the quality of the estimated regression, is based on the so-called partitioning of sums of squares, given by the following formula [Walpole et al., p. 415]:

    \[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2   \hspace{1cm} (1.1) \]

In compact notation, the above formula is written as

SST = SSR + SSE

where SST denotes the total sum of squares, SSR the regression sum of squares, and SSE the error (residual) sum of squares.

The purpose of this short article is to prove the above formula, the partitioning of sums of squares, for the case of regression involving a single independent variable x.

We start by expanding the left-hand side of formula (1.1), adding and subtracting the fitted value inside the square:

    \[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 = \hspace{1cm} (1.2) \]

    \[ = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = \]

    \[ =  \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})\hat{\epsilon}_{i} \]

where the residuals (the differences between observed and fitted values) are indicated by the hatted epsilon character. Recall that the formula for the fitted regression line (involving a single independent variable x) is

    \[ \hat{y}_i = a + bx_i \hspace{1cm} (1.3) \]

Parameters a and b are estimated by the method of least squares, i.e. by minimizing the error sum of squares SSE; at the minimum, the partial derivatives of SSE with respect to a and b are both equal to 0:

    \[ SSE =   \sum_{i=1}^{n}\hat{\epsilon}_{i}^2  = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 =  \sum_{i=1}^{n}(y_i - a - bx_i)^2  \hspace{1cm} (1.4) \]

    \[ \frac{\partial (SSE)}{\partial a} = -2 \sum_{i=1}^{n}(y_i - a - bx_i) = 0  \implies \bar{y} = a + b\bar{x} \hspace{1cm} (1.5) \]

    \[ \frac{\partial (SSE)}{\partial b} = -2 \sum_{i=1}^{n}(y_i - \hat{y}_i)x_i  = -2 \sum_{i=1}^{n} \hat{\epsilon}_i x_i = 0  \implies \sum_{i=1}^{n} \hat{\epsilon}_i x_i = 0 \hspace{1cm} (1.6) \]
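
Both conditions are easy to check numerically. The Python sketch below is only an illustration, not part of the derivation: it assumes NumPy and an arbitrary synthetic data set, estimates a and b with the standard closed-form least-squares formulas, and verifies that the sums appearing in (1.5) and (1.6) vanish up to floating-point error.

    import numpy as np

    # Arbitrary synthetic data (illustration only, not from the article)
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

    # Standard closed-form least-squares estimates for y_hat = a + b*x
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()        # equivalent to equation (1.5)

    residuals = y - (a + b * x)        # the hatted epsilons

    print(np.sum(residuals))           # ~0, the sum appearing in (1.5)
    print(np.sum(residuals * x))       # ~0, equation (1.6)

Both printed values are zero up to rounding error, exactly as the normal equations require.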

Using (1.3), equation (1.2) can be rewritten as:

    \[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})\hat{\epsilon}_{i} =  \hspace{1cm} (1.7) \]

    \[ =  \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 + 2 \sum_{i=1}^{n} (a + bx_i - \bar{y})\hat{\epsilon}_{i} =  \]

    \[ =  \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}\hat{\epsilon}_{i}^2 + 2 (a - \bar{y}) \sum_{i=1}^{n} \hat{\epsilon}_{i} + 2b\sum_{i=1}^{n} \hat{\epsilon}_{i} x_i \]

The last two terms of (1.7) vanish. Recalling equation (1.6),

    \[ \sum_{i=1}^{n} \hat{\epsilon}_{i} x_i = 0 \hspace{1cm} (1.8) \]

and noting that the first normal equation (1.5) requires the sum of the terms y_i - a - bx_i, i.e. the sum of the residuals, to be exactly zero,

    \[ \sum_{i=1}^{n} \hat{\epsilon}_{i} = 0 \hspace{1cm} (1.9) \]

Substituting these two results into (1.7), we obtain:

    \[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2   \hspace{1cm} (1.10) \]

which is exactly what we wanted to prove.
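
As a numerical complement to the proof, identity (1.10) can also be checked directly. The following sketch again uses hypothetical example data: it fits the line with NumPy's polyfit, computes SST, SSR and SSE from their definitions, and confirms that SST equals SSR + SSE up to floating-point error.

    import numpy as np

    # Hypothetical example data (illustration only, not from the article)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

    # Least-squares fit of y_hat = a + b*x (np.polyfit returns [b, a])
    b, a = np.polyfit(x, y, deg=1)
    y_hat = a + b * x

    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares

    print(sst, ssr + sse)                  # the two values coincide, per (1.10)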

References

  • Walpole R.E., Myers R.H., Myers S.L., Ye K. Probability & Statistics for Engineers and Scientists, Eighth Edition. Pearson Prentice Hall, 2007. ISBN 0-13-187711-9.