What is the Sum of Squares?

Robinhood Learn
Democratize finance for all. Our writers’ work has appeared in The Wall Street Journal, Forbes, the Chicago Tribune, Quartz, the San Francisco Chronicle, and more.
Definition:

The sum of squares is a statistical measure that describes how far the data points in a set are spread out from their average.

🤔 Understanding sum of squares

Sum of squares is a way to capture how dispersed (spread out) the numbers in a dataset are. It gets its name from the way you calculate it: summing up the squared differences between each observation and a target value. For the total sum of squares, the target value is the mean (average) of the data. Some observations sit above the mean and some below, so some deviations are positive and some are negative. The simple sum of those deviations from the mean is always zero, because the positive and negative values cancel out exactly. To describe how spread out the numbers are, each deviation is first squared (which makes every value zero or positive) and then the squares are added together. The result is the sum of squares. In regression analysis (estimating one variable using another), the sum of squares is a measure of how well an estimate fits the data.

Example

Assume you wanted to compare the stock prices of two companies — NVIDIA and Apple. They trade at nearly equal values, but they are not the same. Consider the trading week of May 4th through 8th, 2020.

Date        NVIDIA     Apple
5/8/2020    $312.50    $310.13
5/7/2020    $304.87    $302.92
5/6/2020    $297.79    $299.82
5/5/2020    $293.74    $296.76
5/4/2020    $291.29    $292.37
Average     $300.04    $300.40

Although the average closing price for the week is nearly identical, NVIDIA moved around more than Apple did. That’s information that doesn’t show up when only looking at average values. The sum of squares gives you some information about how much movement there is in the data. In this example, NVIDIA has a sum of squares value of 300, while Apple’s is 179. The lower number indicates less dispersion.
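If you want to check those two figures yourself, here is a minimal Python sketch. The prices are copied from the table above; the function name `sum_of_squares` is just a label chosen for this illustration.

```python
# Closing prices for May 4-8, 2020, copied from the table above.
nvidia = [291.29, 293.74, 297.79, 304.87, 312.50]
apple = [292.37, 296.76, 299.82, 302.92, 310.13]


def sum_of_squares(prices):
    """Return the sum of squared deviations from the mean of `prices`."""
    mean = sum(prices) / len(prices)
    return sum((price - mean) ** 2 for price in prices)


nvidia_ss = sum_of_squares(nvidia)
apple_ss = sum_of_squares(apple)

# Rounded to whole numbers, these should match the figures quoted above.
print(f"NVIDIA sum of squares: {nvidia_ss:.0f}")
print(f"Apple sum of squares:  {apple_ss:.0f}")
```

Rounded to whole numbers, the results come out to roughly 300 for NVIDIA and 179 for Apple, in line with the values quoted in the example.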

Takeaway

Sum of squares is like an appetizer…

When you sit down for a meal, you might not want to wait for the main course. So, perhaps you order a salad or appetizer to tide you over while you wait for the entrée. That appetizer (sum of squares) is just the first course (step) in completing the dining experience. Like an appetizer, the sum of squares might be all you want. But, more often, you’ll use it to get to a different, more meaningful statistic.


Tell me more…

What is a sum of squares?

The sum of squares is a statistical measure of dispersion. It quantifies how far numbers are from the average of a dataset, or from the values predicted by a regression model (an estimate of one variable using another). Those differences are squared and then summed to give the sum of squares. The greater the sum of squares, the more spread out a dataset is or, for a regression, the worse the model is at predicting an outcome. Because deviations are squared, large differences increase the sum of squares by much more than small ones do.

As a measure of spread, the total sum of squares is an indicator of volatility: a larger value signals more movement in the data and, therefore, more risk. However, analysts usually use the sum of squares to calculate other measures of volatility rather than using it directly. In linear regression models, the total sum of squares is divided into the explained sum of squares (the variation explained by the regression model) and the residual (unexplained) sum of squares. A statistical model with a high residual sum of squares has less explanatory power than one with a lower value.

How do you calculate the sum of squares?

The sum of squares formula might look a little intimidating at first, but it’s actually quite simple. It just says to take the difference between each data point and the average, square it, then add all the values together. Here is what it looks like:

$$\text{Total Sum of Squares} = \sum (X_i - X_{\text{avg}})^2$$

$X_i$ = data point $i$

$X_{\text{avg}}$ = average of all data points in the set

$\sum$ = instruction to sum the values together

In words, the formula says to do this:

Step 1: Calculate the average value of the dataset. Do this by adding all of the numbers together and dividing that sum by the number of data points.

Step 2: Determine the difference between each data point and the average value.

Step 3: Square (multiply by itself) each of the differences you developed in Step 2. This action turns all of the numbers into positive values.

Step 4: Add up all of the squared deviations from Step 3.
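The four steps translate almost line for line into code. The sketch below uses a small made-up dataset (it is not from the article), and `total_sum_of_squares` is an illustrative name. It also returns the deviations from Step 2 so you can see that they add up to zero before squaring.

```python
def total_sum_of_squares(data):
    """Follow the four steps; return (deviations, total sum of squares)."""
    # Step 1: calculate the average value of the dataset.
    mean = sum(data) / len(data)
    # Step 2: find the difference between each data point and the average.
    deviations = [x - mean for x in data]
    # Step 3: square each difference, which removes the negative signs.
    squared = [d ** 2 for d in deviations]
    # Step 4: add up all of the squared deviations.
    return deviations, sum(squared)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with a mean of 5
deviations, tss = total_sum_of_squares(sample)

print(deviations)       # [-3.0, -1.0, -1.0, -1.0, 0.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
print(tss)              # 32.0
```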

Calculating the total sum of squares in Excel

Say you wanted to understand the variability of Delta Air Lines’ closing stock price in April 2020. You would put the closing prices into an Excel spreadsheet. Here is the data, along with three calculated columns:

Date        Closing Price   Average   Difference   Squared Difference
1-Apr-20    $23.87          $23.54    $0.33        $0.11
2-Apr-20    $22.68          $23.54    -$0.86       $0.73
3-Apr-20    $22.48          $23.54    -$1.06       $1.11
6-Apr-20    $22.32          $23.54    -$1.22       $1.48
7-Apr-20    $22.25          $23.54    -$1.29       $1.65
8-Apr-20    $23.23          $23.54    -$0.31       $0.09
9-Apr-20    $24.39          $23.54    $0.85        $0.73
13-Apr-20   $23.25          $23.54    -$0.29       $0.08
14-Apr-20   $24.54          $23.54    $1.00        $1.01
15-Apr-20   $24.35          $23.54    $0.81        $0.66
16-Apr-20   $22.78          $23.54    -$0.76       $0.57
17-Apr-20   $24.27          $23.54    $0.73        $0.54
20-Apr-20   $23.64          $23.54    $0.10        $0.01
21-Apr-20   $23.10          $23.54    -$0.44       $0.19
22-Apr-20   $22.47          $23.54    -$1.07       $1.13
23-Apr-20   $22.48          $23.54    -$1.06       $1.11
24-Apr-20   $22.41          $23.54    -$1.13       $1.27
27-Apr-20   $22.16          $23.54    -$1.38       $1.89
28-Apr-20   $24.34          $23.54    $0.80        $0.65
29-Apr-20   $27.32          $23.54    $3.78        $14.32
30-Apr-20   $25.91          $23.54    $2.37        $5.64
Sum of Squares                                     $34.99

The third column is the average value for the entire dataset. In the fourth column, you’ll find the difference between the data point and the average. The last column takes the square of the deviations from the average. At the bottom, the squared differences are added together to determine the sum of squares.

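A word of caution: The Excel function =SUMSQ() squares and then sums the values you give it. So, you can’t apply it to the closing prices in the second column and be done. You’ll need to determine the differences first, then apply the function to the Difference column (the fourth column). Alternatively, Excel’s =DEVSQ() function does both steps at once: given the closing prices, it returns the sum of squared deviations from their average.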
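To see how much that caution matters, here is a short Python sketch that repeats the spreadsheet calculation on the Delta closing prices from the table. The variable names are illustrative, and the two sums only imitate what =SUMSQ() would produce for each column; this is not Excel code.

```python
# April 2020 closing prices, copied from the table above.
closing_prices = [
    23.87, 22.68, 22.48, 22.32, 22.25, 23.23, 24.39,
    23.25, 24.54, 24.35, 22.78, 24.27, 23.64, 23.10,
    22.47, 22.48, 22.41, 22.16, 24.34, 27.32, 25.91,
]

mean = sum(closing_prices) / len(closing_prices)
differences = [price - mean for price in closing_prices]

# Squaring and summing the raw prices (the mistake the caution describes).
sumsq_of_prices = sum(price ** 2 for price in closing_prices)

# Squaring and summing the differences (the total sum of squares).
sumsq_of_differences = sum(d ** 2 for d in differences)

print(f"Average closing price:  {mean:.2f}")
print(f"Squares of raw prices:  {sumsq_of_prices:.2f}")
print(f"Squares of differences: {sumsq_of_differences:.2f}")
```

The squares of the differences come to about 34.99, matching the bottom of the table, while the squares of the raw prices add up to a number in the thousands that says nothing about spread.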

Alternative (shortcut) formula

There is a mathematically equivalent way to get the same answer, which might be a little easier in some cases. It involves doing the steps in a different order. The alternative formula looks like this:

$$\text{Total Sum of Squares} = \sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2$$

$X_i$ = data point $i$

$n$ = number of data points

$\sum$ = instruction to sum the values together

The formula doesn’t look like much of a shortcut, but the steps might illustrate the advantage.

Step 1: Multiply each of the data points by itself and add up all of the resulting values.

Step 2: Add up all of the numbers in the dataset, and square the result.

Step 3: Divide the result from Step 2 by the number of data points.

Step 4: Subtract the result in Step 3 from the answer in Step 1.
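A quick way to convince yourself that the two formulas agree is to run both on the same numbers. The sketch below uses made-up data, and both function names are illustrative.

```python
import math


def tss_definition(data):
    """Sum of squared deviations from the mean (the original formula)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


def tss_shortcut(data):
    """The alternative formula, following the four shortcut steps."""
    n = len(data)
    sum_of_squared_values = sum(x * x for x in data)  # Step 1
    square_of_sum = sum(data) ** 2                     # Step 2
    return sum_of_squared_values - square_of_sum / n   # Steps 3 and 4


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

print(tss_definition(sample))  # 32.0
print(tss_shortcut(sample))    # 32.0
print(math.isclose(tss_definition(sample), tss_shortcut(sample)))  # True
```

One design note: on a computer, the shortcut subtracts two large numbers that can be nearly equal, so it may lose precision when the values are large compared with their spread. For that reason, the deviation-based version is usually the safer choice in code, even though the shortcut is handy for hand calculation.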

What does the sum of squares tell you?

The total sum of squares (TSS or SST) tells you how far the data points in a dataset are from the center. It’s a descriptive statistic called a measure of spread or dispersion. Dividing the TSS by the number of observations in the dataset gives you the average squared deviation, which is called the variance. (When the data are a sample drawn from a larger population, the usual convention is to divide by n − 1 instead.) Taking the square root of the variance gives the standard deviation, another measure of dispersion.
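Here is a minimal sketch of the path from TSS to variance to standard deviation, using made-up data. It cross-checks the results against Python's built-in `statistics` module, where `pvariance` and `pstdev` divide by the number of observations and `variance` divides by one less than that (the common convention for sample data).

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = sum(sample) / len(sample)
tss = sum((x - mean) ** 2 for x in sample)

# Divide the TSS by the number of observations to get the variance.
variance = tss / len(sample)
# Take the square root of the variance to get the standard deviation.
std_dev = math.sqrt(variance)

# Dividing by n - 1 instead gives the sample variance.
sample_variance = tss / (len(sample) - 1)

print(variance)  # 4.0
print(std_dev)   # 2.0

# Cross-check against the standard library.
print(math.isclose(variance, statistics.pvariance(sample)))        # True
print(math.isclose(std_dev, statistics.pstdev(sample)))            # True
print(math.isclose(sample_variance, statistics.variance(sample)))  # True
```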

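In a linear regression analysis, the total sum of squares is partitioned into two pieces: the explained sum of squares (ESS or SSE) and the residual sum of squares (RSS or SSR). Textbooks are not consistent about these abbreviations (some use SSE for the sum of squared errors, meaning the residual piece), so check the definitions in whatever source you’re reading. These values are related, in that: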

TSS = ESS + RSS

That formula just says that the total variability equals what is explained by a regression model plus what is left unexplained (the residual). The best-fit regression line is the one that minimizes the RSS, and the method for finding it is called ordinary least squares (OLS).

Analysts look at the percentage of the variability that is explained by the best fit line to determine how valuable a regression model is at predicting one variable based on observations of others. That value is called the R-squared. It’s calculated using the sum of squares as follows:

$$R^2 = \frac{\text{ESS}}{\text{TSS}}$$

An R-squared close to one means the model explains almost all of the variability in the data. A value close to zero means the model has little explanatory power.
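To make the pieces concrete, here is a small sketch that fits a least-squares line with one predictor and an intercept, then computes TSS, ESS, RSS, and R-squared. The data points are made up, and `fit_line` is an illustrative helper written for this example rather than a function from any library.

```python
import math


def fit_line(x, y):
    """Ordinary least squares with one predictor: (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = (
        sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        / sum((xi - x_mean) ** 2 for xi in x)
    )
    return slope, y_mean - slope * x_mean


x = [1, 2, 3, 4, 5]  # made-up predictor values
y = [2, 4, 5, 4, 5]  # made-up observations

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
y_mean = sum(y) / len(y)

tss = sum((yi - y_mean) ** 2 for yi in y)                # total
ess = sum((p - y_mean) ** 2 for p in predicted)          # explained
rss = sum((yi - p) ** 2 for yi, p in zip(y, predicted))  # residual
r_squared = ess / tss

print(f"TSS = {tss:.2f}, ESS = {ess:.2f}, RSS = {rss:.2f}")
print(f"R-squared = {r_squared:.2f}")
print(math.isclose(tss, ess + rss))  # True
```

For this made-up data, the line explains about 60% of the variability (TSS of 6.0 splits into an ESS of about 3.6 and an RSS of about 2.4).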

What are the practical applications and limitations of the sum of squares?

In practice, an analyst could use the total sum of squares (TSS) as a direct measure of variability. However, the TSS is unscaled: it depends on the size of the things being measured and on how many observations there are. As a result, comparing the TSS of one dataset to another usually doesn’t tell you much. The variability of the size of pumpkin seeds will almost certainly be a smaller number than the variability of pumpkins. But that doesn’t mean the seeds are more like each other than pumpkins are. Maybe a pumpkin seed is more likely to be twice the size of the average seed than a full-grown pumpkin is to be twice the size of the average full-grown pumpkin. Comparing TSS won’t tell you that.

That is why it’s important to scale the measure of variability to the data in question. For that reason, analysts don’t use TSS very often. Instead, they use TSS to determine variance, standard deviation, and R-squared values. Each of those statistics is scaled, which allows analysts to compare them across datasets. A stock with a larger standard deviation than another stock is more volatile. A regression equation with a bigger R-squared has more explanatory power than another model with a smaller R-squared.
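One way to see what "unscaled" means is to watch what happens when a dataset simply gets longer. In the sketch below, the prices are made up, and the second list repeats the first one, so the spread is identical. The TSS doubles anyway, while the variance and standard deviation stay put.

```python
import statistics


def total_sum_of_squares(data):
    """Sum of squared deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


one_week = [22.0, 23.0, 24.0, 25.0, 26.0]  # made-up closing prices
two_weeks = one_week * 2                   # same prices, repeated once

print(total_sum_of_squares(one_week))   # 10.0
print(total_sum_of_squares(two_weeks))  # 20.0

# The scaled statistics are unchanged by the extra observations.
print(statistics.pvariance(one_week), statistics.pvariance(two_weeks))
print(statistics.pstdev(one_week), statistics.pstdev(two_weeks))
```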

The practical application of the sum of squares is to develop more meaningful measures of volatility. Those measures help analysts determine the associated level of risk in a security, and the required potential reward necessary to compensate for that risk.


© 2021 Robinhood. All rights reserved.

This information is educational, and is not an offer to sell or a solicitation of an offer to buy any security. This information is not a recommendation to buy, hold, or sell an investment or financial product, or take any action. This information is neither individualized nor a research report, and must not serve as the basis for any investment decision. All investments involve risk, including the possible loss of capital. Past performance does not guarantee future results or returns. Before making decisions with legal, tax, or accounting effects, you should consult appropriate professionals. Information is from sources deemed reliable on the date of publication, but Robinhood does not guarantee its accuracy.

Robinhood Financial LLC (member SIPC), is a registered broker dealer. Robinhood Securities, LLC (member SIPC), provides brokerage clearing services. Robinhood Crypto, LLC provides crypto currency trading. All are subsidiaries of Robinhood Markets, Inc. (‘Robinhood’).
