


Least Squares Regression Method

Vitalnet uses the "least squares" regression method to determine the time trend line. To help users better understand time trend analysis, this page explains least squares.

What is least squares?
Intuitively, least squares finds the line deemed to best fit the data. Mathematically, least squares finds the line that minimizes the sum of the squared residuals. Note that "line" here means "straight line". In Figure 1, the red line is the least squares line, the line that is considered to best fit the data.
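As a compact restatement of "minimizes the sum of the squared residuals", the quantity being minimized can be written as follows. The symbols are assumptions introduced here for illustration, not notation from Vitalnet: a is the Y-intercept, b is the slope, and (x_i, y_i) are the n data points.

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2
```

The least squares line is the pair (a, b) for which S(a, b) is smallest.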
What is a "residual", as in "squared residual"?
A residual is the difference between 1) the y value of a data point and 2) the y value on the least squares line at the same x. For example, in Figure 2, the dots are the data points, the red line is the least squares line, and the residuals are the lengths of the vertical lines. Residuals can be negative or positive.

What is "sum of squared residuals"?
Referring to Figure 2, assume that the first data point is (0, 17.5), and that the least squares line includes the point (0, 14). The residual for the first data point is 3.5 (17.5 - 14). The squared residual is 12.25 (3.5 * 3.5). We do the same for each of the 10 data points, and sum the squared residuals. Squared residuals are never negative.

What are "slope" and "Y-intercept"?
The least squares line is completely described by a slope and a Y-intercept. The slope is simply (change in y) / (change in x). The Y-intercept is the y value at X = 0. In Figure 2, the slope is (30 - 14) / (8 - 0) = 16 / 8 = 2, and the Y-intercept is 14.

Can I calculate a confidence interval for the slope?
Yes. Since least squares is used to calculate a time trend, the analyst wants to know if the trend is significant. The confidence interval (CI) for the slope, at some confidence level (e.g., 95%), helps determine significance. If the CI includes 0, there is no significant trend. Otherwise, there is an upward (positive slope) or downward (negative slope) trend. However, as explained elsewhere on this web site, the confidence interval calculation is incorrect when the data points are based on few observations. Thus, a better method for determining the confidence interval of the slope is sought.

How is the least squares line calculated?
The line could be determined iteratively: draw a line, calculate the sum of squared residuals, draw another line, calculate the sum of squared residuals again, and repeat the process until the sum of squared residuals is minimized.
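The residual arithmetic from the Figure 2 example, and the score that each trial line would receive in the iterative approach, can be sketched in a few lines of Python. The function names `residual` and `sum_squared_residuals` are hypothetical, chosen for illustration; they are not part of Vitalnet. Only the first data point (0, 17.5) is used, because the other Figure 2 points are not listed on this page.

```python
def residual(x, y, slope, y_inter):
    """Observed y minus the y value on the line at the same x."""
    return y - (y_inter + slope * x)


def sum_squared_residuals(points, slope, y_inter):
    """Score a trial line against the data: a smaller sum is a closer fit."""
    return sum(residual(x, y, slope, y_inter) ** 2 for x, y in points)


# Figure 2 line: slope from two points on the line, (0, 14) and (8, 30).
slope = (30 - 14) / (8 - 0)
y_inter = 14

# First data point from the Figure 2 example.
first_residual = residual(0, 17.5, slope, y_inter)
first_squared = sum_squared_residuals([(0, 17.5)], slope, y_inter)
print(first_residual, first_squared)
```

Passing all ten data points to `sum_squared_residuals` would give the total that the iterative search tries to make as small as possible.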
Luckily, there is an exact method to calculate the line, without having to iterate through trial lines. Below, the exact method is shown in detail. We use a concrete example, with Texas mortality data, analyzing diabetes (E10-E14). The data cover four year ranges, coded X = 0, 1, 2, 3, with Y values 25.533, 25.941, 25.603, and 24.126.
The standard method for determining the Y-intercept, the slope, and the CI of the slope is described in "Statistics", 2nd edition, by Murray Spiegel, pages 296 and 319. The method, as adapted for use within Vitalnet, is as follows:

First, calculate some preliminary values, to be used later:
· yrRngs (number of year ranges) = 4
· sumX (sum of X values) = 0 + 1 + 2 + 3 = 6
· sumXX (sum of squares [SS] of X values) = 0 + 1 + 4 + 9 = 14
· avgX (average X value) = 6 / 4 = 1.5
· avgXX (average X squared value) = 14 / 4 = 3.5
· df (degrees of freedom) = yrRngs - 2 = 2
· sumY (sum of Y values) = 25.533 + 25.941 + 25.603 + 24.126 = 101.203
· sumYY (SS for Y) = 651.934 + 672.936 + 655.514 + 582.064 = 2562.447
· sumXY (sum of XY values) = 0 + 25.941 + 51.206 + 72.378 = 149.525
· tVal (t Value) = 4.302656 (for 2 df and 95% confidence level)
· varX (variance of X values) = avgXX - (avgX * avgX) = 3.5 - 2.25 = 1.250
· stdDevX (standard deviation of X values) = sqrt (varX) = 1.118

Next, calculate the Y-intercept:
· yInter = ((sumY * sumXX) - (sumX * sumXY)) / ((yrRngs * sumXX) - (sumX * sumX))
· numer = (sumY * sumXX) - (sumX * sumXY)
· numer = (101.203 * 14) - (6 * 149.525) = 519.692
· denom = (yrRngs * sumXX) - (sumX * sumX) = (4 * 14) - (6 * 6) = 20
· yInter = numer / denom = 519.692 / 20 = 25.985

Next, calculate the slope:
· slope = ((yrRngs * sumXY) - (sumX * sumY)) / ((yrRngs * sumXX) - (sumX * sumX))
· numer = (yrRngs * sumXY) - (sumX * sumY)
· numer = (4 * 149.525) - (6 * 101.203) = -9.118
· denom = (yrRngs * sumXX) - (sumX * sumX) = (4 * 14) - (6 * 6) = 20
· slope = numer / denom = -9.118 / 20 = -0.456

Next, calculate the confidence interval (CI) of the slope:
· numer = sumYY - (yInter * sumY) - (slope * sumXY)
· numer = 2562.447 - (25.985 * 101.203) - (-0.456 * 149.525) = 0.870
· varOfEstimateOfYOnX = numer / yrRngs = 0.870 / 4 = 0.218
· stdErrOfEstimateOfYOnX = sqrt (varOfEstimateOfYOnX) = 0.466
· numer = tVal * stdErrOfEstimateOfYOnX = 4.303 * 0.466 = 2.005
· denom = sqrt (df) * stdDevX = 1.414 * 1.118 = 1.581
· halfInterval = numer / denom = 2.005 / 1.581 = 1.268
· slopeLoLimit = slope - halfInterval = -0.456 - 1.268 = -1.724
· slopeHiLimit = slope + halfInterval = -0.456 + 1.268 = +0.812
· CI of the slope = -1.724 to +0.812

Finally, calculate the coefficient of determination (r * r):
· numer = (yrRngs * sumXY) - (sumX * sumY)
· numer = (4 * 149.525) - (6 * 101.203) = -9.118
· xTerm = (yrRngs * sumXX) - (sumX * sumX) = (4 * 14) - (6 * 6) = 20
· yTerm = (yrRngs * sumYY) - (sumY * sumY)
· yTerm = (4 * 2562.447) - (101.203 * 101.203) = 7.741
· denom = xTerm * yTerm = 20 * 7.741 = 154.820
· coefficient_of_determination = (numer * numer) / denom
· coefficient_of_determination = (-9.118 * -9.118) / 154.820 = 0.537

The coefficient of determination (COD) is a measure of goodness of fit. If the data points fall on a straight line, the least squares line will be the same line, and COD = 1. If the data points are randomly distributed, COD is close to 0.

The Vitalnet results for the above example may differ slightly in some of the numbers, due to lower precision in the hand calculations.

In summary, least squares finds the line considered to best fit the data, and the CI of the slope helps determine whether there is a significant upward or downward trend. Here the CI of the slope includes 0, so the example shows no significant trend. However, as discussed elsewhere on this web site, the slope CI is incorrect when the data points are based on few observations. Thus, a better method for determining the slope CI is sought.
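For readers who want to check the hand calculation, the same steps can be sketched in Python. This is a minimal sketch that follows the formulas listed above; it is not Vitalnet's code. The function name `least_squares_trend` and the returned dictionary keys are hypothetical. The t value is passed in as a constant, taken from the preliminary values above, rather than computed.

```python
import math


def least_squares_trend(ys, t_val):
    """Fit y = y_inter + slope * x, where x = 0, 1, 2, ... indexes the year ranges."""
    n = len(ys)
    xs = list(range(n))

    # Preliminary sums, as in the hand calculation.
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Y-intercept and slope share the same denominator.
    denom = n * sum_xx - sum_x * sum_x
    y_inter = (sum_y * sum_xx - sum_x * sum_xy) / denom
    slope = (n * sum_xy - sum_x * sum_y) / denom

    # Standard error of estimate of Y on X (divides by n, as above).
    sse = sum_yy - y_inter * sum_y - slope * sum_xy
    std_err = math.sqrt(sse / n)

    # Half-width of the confidence interval of the slope.
    std_dev_x = math.sqrt(sum_xx / n - (sum_x / n) ** 2)
    half = t_val * std_err / (math.sqrt(n - 2) * std_dev_x)

    # Coefficient of determination (r * r).
    y_term = n * sum_yy - sum_y * sum_y
    cod = (n * sum_xy - sum_x * sum_y) ** 2 / (denom * y_term)

    return {
        "y_inter": y_inter,
        "slope": slope,
        "half_interval": half,
        "slope_lo": slope - half,
        "slope_hi": slope + half,
        "cod": cod,
    }


# Y values from the diabetes example; t value for 2 df at the 95% level.
result = least_squares_trend([25.533, 25.941, 25.603, 24.126], t_val=4.302656)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because this sketch carries full precision through every step, its half-interval comes out at about 1.29 rather than the 1.268 of the hand calculation, which rounds intermediate values to three decimals. The conclusion is the same either way: the CI of the slope includes 0.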