Linear Regression and Its Assumptions - Predictive Power vs Interpretation

Authors
  • Aleksandar Jovanovic

Linear regression is probably the most famous predictive model out there. If you are reading this, you are likely to know something about it.

You are also likely to know about the assumptions that stand behind linear modeling. But why exactly are those assumptions there, and when do they matter? This article will try to break this down, focusing on the case of classical linear regression, i.e. ordinary least squares (OLS).

A Bit on Classical Linear Regression

There are many definitions of linear regression that attempt to describe the gist of it.

I have found two classes of definitions.

The first focuses on linear regression being a model that makes predictions based on a linear combination of predictors. These definitions try to be more general than classical ordinary least squares, yet in many articles the assumptions of OLS still end up being listed as mandatory. This has always confused me a bit.

The second class includes more than that, making explicit that certain assumptions stand behind the model; when OLS is the focus, this might be the better approach.

One particular example of the second one I really like comes from the book Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill:

Linear regression is a method that summarizes how the average values of a numerical outcome variable vary over subpopulations defined by linear functions of predictors.

The reason I think this definition is the best is that it is very clear about what it is talking about (i.e. OLS linear regression). It is also very clear that the goal of modeling with linear regression is to explain, or summarize, something about the data. It captures the crucial parts needed for such an interpretation of linear regression:

  1. It is a linear function of predictor variables (regardless of whether the predictors are discrete or continuous).
  2. The predictors define subpopulations for which the outcome variable is computed.
  3. The output of a linear regression model is the expected value of a distribution, for every subpopulation defined by predictors.

I will refer to this definition as my favorite definition in the rest of the article.

The Assumptions

I've always had a problem with simply listing out the assumptions of linear models. It feels dry, and it bypasses the understanding of why these assumptions actually must hold. They read like a checklist you need to tick off, obscuring the way to intimate familiarity with linear models. In practice, a checklist can be useful, but not if real understanding is the goal, and especially not if you are unclear on why and when the assumptions apply.

This is why I will refrain from giving a list in this article, and instead try to discover the assumptions naturally from the definitions of the model along the way.

Tearing Down the Model

I like to use a very simple framework to tear a machine learning model down into meaningful components. If you'll allow me to be quite liberal and playful in naming these components, let's call them:

  • the Hardware
  • the Software
  • the Interpretation

So let's see what all of these are in the case of OLS.

The Hardware

Hardware, in my mind, is a mechanical contraption set up to execute in a predictable way and within well-defined physical constraints. Thus, what I call the hardware of linear regression is the mathematical equation at the core of the model:

\hat{y} = x^T\beta + \beta_0 = \sum_{i=1}^N \beta_i x_i + \beta_0 \tag{1}

This equation tells us that the model outputs predictions \hat{y} which are a linear combination of:

  • predictor values x_i weighted by their corresponding coefficients \beta_i
  • an intercept term \beta_0

We can see here that, by the definition of the model, we might be assuming that the outcome is a linear combination of the predictors. Thus, if we want to claim that the model reflects reality, the first assumption we bump into is the Linearity assumption. But note that we don't have to claim that reality is linear - we can simply make do without that claim. The Hardware on its own does not imply that we want to say anything about reality.
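To make the Hardware tangible, here is a minimal sketch in Python with NumPy (my own illustration, not code from the original article) of equation (1) as a pure prediction function; the parameter and predictor values are arbitrary placeholders:

```python
import numpy as np

def linear_hardware(x, beta, beta_0):
    """Equation (1): a linear combination of predictors plus an intercept."""
    return x @ beta + beta_0

# Arbitrary (hypothetical) parameters and predictors, just to show the machinery.
beta = np.array([2.0, -1.0])
beta_0 = 0.5
x = np.array([3.0, 4.0])

print(linear_hardware(x, beta, beta_0))  # 2*3 - 1*4 + 0.5 = 2.5
```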

The Software

The Software is what powers up the model - in my mind, this is the optimization algorithm that tells the Hardware which of the many possible models to use. It is only when the model definition (Hardware) is combined with data and instructions on how to work with it (Software) that we get something that can be used.

The goal of the optimization algorithm is to find the parameters \beta, \beta_0 that best fit the data according to the defined Hardware. Let's play a little and specify our Hardware as a function f(x; \beta, \beta_0). This is a function of three arguments, and note that none of its inputs are concrete - they are still just placeholders. However, this function describes all the possible linear models that we can have, depending on the parameters \beta and \beta_0, and for some concrete \beta', \beta_0' we can say that:

f(x, \beta = \beta', \beta_0 = \beta_0') = f_{\beta',\beta_0'}(x) = x^T\beta' + \beta_0' \tag{2}

So we can say that f_{\beta',\beta_0'}(x) is just one instance out of all the possible models that we can have given the Hardware we designed.

In classical linear regression, the optimization algorithm is represented by the following equation:

f_{\beta^*, \beta_0^*}(x) = \arg\min_{f_{\beta',\beta_0'}} \sum_{i=1}^{N} \left(y_i - f_{\beta', \beta_0'}(x = x_i)\right)^2 \tag{3}

What this equation tells us is that the output of the algorithm is a linear model f_{\beta^*, \beta_0^*}(x) with concrete parameters \beta^*, \beta_0^* that minimize the function on the right-hand side. Let's break that function down:

  • y_i is the observed outcome for observation i
  • f_{\beta', \beta_0'}(x = x_i) is the predicted outcome for a concrete set of predictors x_i and concrete parameters \beta', \beta_0'.
  • (y_i - f_{\beta', \beta_0'}(x = x_i))^2 is the squared error between the observed and predicted outcome for observation i; summed over all observations, this is the squared Euclidean (L2) distance between the vector of observations and the vector of predictions.

So, our optimization method basically finds the linear model with the optimal parameters \beta^*, \beta_0^* that minimize the total squared error - equivalently, the squared Euclidean distance between our predictions and the observed outcomes over all data points. That is all it does in practice!

Note here that this method by itself doesn't need any of the assumptions that we have mentioned so far. It can work without any of them, and it will work for any Hardware that we want to plug in (for example, a non-linear function of the predictors x). It will still do what it's built to do - find a set of parameters that minimize the squared error for the given equation.
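As a sketch of the Software at work (again my own illustration, using NumPy on made-up data), the least squares problem in equation (3) can be solved directly, and nothing in the procedure relies on linearity of the true process or on any distributional assumption about the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the true process is non-linear with non-normal noise on purpose.
X = rng.uniform(-3, 3, size=(100, 2))
y = np.exp(X[:, 0]) + rng.standard_t(df=3, size=100)

# Append a column of ones so the intercept beta_0 is estimated alongside beta.
X1 = np.column_stack([X, np.ones(len(X))])

# np.linalg.lstsq minimizes the sum of squared residuals from equation (3).
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta_hat, beta_0_hat = coef[:-1], coef[-1]
print(beta_hat, beta_0_hat)
```

The fit may well be poor here, but the Software happily returns the parameters that minimize the squared error anyway.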

The Interpretation

Now this is where the fun begins.

Interpretation practically means how we make sense of what is going on in the model. Considering that interpretability is a big topic in machine learning nowadays, Interpretation takes a special place among the three components.

OLS linear regression has a very well-defined Interpretation. It's important to note that until we add this Interpretation on top of the Hardware and Software, we do not have OLS linear regression. So let's see what it's all about.

We start by saying that our observations are perfectly modeled by the following equation:

y = x^T\beta + \beta_0 + \epsilon = \sum_{i=1}^N \beta_i x_i + \beta_0 + \epsilon \tag{4}

Note that we are talking about the observed data y here, not the predictions as we did in equation (1). Another way to write this, for the sake of clarity, is:

y = \hat{y} + \epsilon

The last term \epsilon is especially important: it represents the residuals, i.e. the remaining error that we cannot explain using the predictors. In theoretical setups of linear regression, \epsilon is a random variable with its own probability distribution, and in OLS it is set to:

\epsilon \sim \mathcal{N}(0,\sigma^2) \tag{5}

This necessarily means that y is a random variable as well, and in fact, since our Hardware is a linear equation, we know its distribution exactly:

y \sim \mathcal{N}(x^T\beta + \beta_0, \sigma^2) \tag{6}

So, in the classical linear regression setup, we are explicitly assuming that the distribution of our errors is normal. Since we are also saying that all of our data are modeled by this equation, it means we assume that the same distribution applies across the whole range of the outcome variable. This brings us to another two assumptions:

  • the Normality assumption - the errors are normally distributed
  • the (in)famous Homoscedasticity - the errors have the same variance for the whole range of the outcome variable.
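As an illustrative sketch (my own, with NumPy and hypothetical parameter values), we can simulate data exactly as equations (4)-(6) prescribe, fit OLS, and check that the residuals behave as the two assumptions above say they should:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" parameters for the simulation.
beta_true, beta_0_true, sigma = np.array([1.5, -0.7]), 2.0, 0.8

X = rng.normal(size=(500, 2))
eps = rng.normal(0.0, sigma, size=500)      # equation (5): eps ~ N(0, sigma^2)
y = X @ beta_true + beta_0_true + eps       # equation (4)

# Fit by least squares and inspect the residuals.
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ coef

print(residuals.mean(), residuals.std())    # roughly 0 and roughly sigma
```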

Adding the assumption that the residuals are independent and identically distributed (call this the I.I.D. assumption) into the mix makes for some very interesting properties: if Normality and I.I.D. of the residuals are satisfied, the least squares optimization becomes equal to the maximum likelihood estimate1 of the parameters \beta, \beta_0. And then this means the following, very convenient equality is satisfied:

\hat{y} = \mathbf{E}[y \mid x] \tag{7}

meaning our prediction is actually interpreted as the expected value of the outcome given the predictors x. Remember point 3 from my favorite definition of linear regression from the beginning:

3. The output of a linear regression model is the expected value of a distribution, for every subpopulation defined by predictors.
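For the curious, here is a brief sketch of why least squares coincides with the maximum likelihood estimate under these assumptions (the full derivation is left for the future article mentioned in footnote 1). With independent observations distributed as in (6), the log-likelihood of the data is:

\log L(\beta, \beta_0) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - x_i^T\beta - \beta_0\right)^2

The first term does not depend on \beta or \beta_0, so maximizing this likelihood over the parameters is exactly the same problem as minimizing the sum of squares in equation (3).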

Predictive Power vs Interpretation

Here is a good moment to draw a sharp distinction between two things - predictive power and interpretation.

Let's assume predictive power is what we are after. We have a business and we want to make predictions to improve some crucial metrics that earn us more money. We also have some performance and resource constraints, so a computationally non-intensive model like linear regression fits the role well. Other models don't yield big performance improvements, and are more expensive to use. If this is the case, then is it really important if, for example, Homoscedasticity is not satisfied? Why would we care if our errors are not normally distributed, and Normality assumption is not satisfied? And also, even if in reality the outcome is not a linear function of predictors, why does it matter? Seems like most of the assumptions are not really necessary for predictive power alone.

We don't always talk about the assumptions we make with more powerful models like neural networks. Think about training a neural network to solve a regression problem with the least squares loss function: it's easy to sneak in the interpretation that the output is some "expected value", or that we are finding the "maximum likelihood estimate" of the parameters. Whenever we say this, the hidden Normality assumption is automatically made. It's so easy to forget to mention that, but continuing on with such an interpretation can lead us to conclusions which are simply false.

Linear regression models are rarely the ones with high predictive power - their Hardware is fairly simple. But this simplicity is probably what opens up the possibility of convenient interpretations. Once we want to make useful interpretations of the model - to explain the data or our predictions, or to find useful insights that can help us make important decisions - then we require assumptions.

So the conclusion I'm aiming for is that the Hardware and Software imply no assumptions at all. The linear-equation Hardware will always make predictions the only way it can, and doesn't care about reality. The least squares Software will find the model whose predictions are closest to the observed outcomes in terms of squared error. Only when we include a specific Interpretation in the picture do assumptions start to pop up. And if you look closely, it is not a coincidence that we bumped into all of the assumptions in the Interpretation section2.

So When Are Assumptions Necessary?

When we want to make predictions only, we don't really need any. If a linear model is good enough for us, and we show that it improves our business, just go for it.

However, we must be careful when going further than that. Here are a couple of examples:

  • Talking about the output being the expected value. This is not true unless Normality, I.I.D. and Homoscedasticity are satisfied3.
  • Deriving standard errors and p-values for the linear coefficients - we again need both Normality and Homoscedasticity, because the standard error depends on \sigma, i.e. the standard deviation of the residuals4. Since we use all of the data to estimate the parameters, we need to assume a fixed \sigma for the whole range (see the sketch after this list).
  • Related to the previous point, some feature selection algorithms use the p-values of the coefficients to select or remove features. This seems to make sense only when the assumptions from above are satisfied.
  • Using linear regression for causal inference5 - we need to be able to interpret the output as an expected value6. Thus this interpretation needs to be valid.
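To make the second point concrete, here is a minimal sketch (my own, with NumPy and SciPy on simulated data) of how standard errors and p-values are computed from the covariance matrix in footnote 4; every step leans on a single \sigma shared by all observations and on normal residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data that satisfies the OLS assumptions (hypothetical true values).
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 0.0]) + 3.0 + rng.normal(0.0, 1.0, size=200)

X1 = np.column_stack([np.ones(len(X)), X])     # intercept column first
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

n, p = X1.shape
residuals = y - X1 @ coef
sigma2_hat = residuals @ residuals / (n - p)   # single estimated residual variance

# Footnote 4: covariance of the coefficients is (X^T X)^{-1} * sigma_hat^2.
cov = np.linalg.inv(X1.T @ X1) * sigma2_hat
se = np.sqrt(np.diag(cov))
t_stats = coef / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

print(np.round(coef, 3), np.round(se, 3), np.round(p_values, 3))
```

The coefficient of the second predictor has a true value of zero, so its p-value should come out large, while the others should be close to zero.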

Conclusion

I hope this framework of tearing the Interpretation away from the Hardware and Software is a good conceptual tool for understanding when assumptions are needed and when they are not. I believe it can be used for other models as well. In the world of machine learning and data science, where we build data products that need to fulfill a purpose, it's important to understand the laws behind the models deeply, so that we know when certain assumptions need to hold.

Conversely, if we don't assume certain things, we need to know where to stop in our interpretation, especially if we simplify things when we communicate findings or interpret results to stakeholders.

Thank you for reading!

Footnotes

  1. The proof is beyond the scope of this article, but it can be derived analytically starting from the expression for the likelihood described in (6). Maybe some future article will tackle this.

  2. One might say we bumped into Linearity already in Hardware, but the truth is, Hardware is all about the machinery that makes predictions, not assumptions on what the reality is.

  3. One interesting point: if I.I.D. assumption is satisfied, i.e. our residuals are independent, identically distributed, then Homoscedasticity is implied - the residuals need to have the same variance if they are identically distributed.

  4. This is beyond the scope of the article, but it can be seen if we look at how the standard errors of the coefficients are defined mathematically. They are equal to the square roots of the diagonal elements of the covariance matrix (\mathbf{X}^T\mathbf{X})^{-1}\hat{\sigma}^2, where \mathbf{X} is the matrix of observations by predictors and \hat{\sigma}^2 is the estimated variance of the residuals. This is where it becomes obvious that \sigma must be the same for all of our data.

  5. For causal inference, we might also have to assume that there is no strong correlation between the predictors, as this can affect the analysis. So this is another very specific assumption that will not be covered here, but that depends on the use case.

  6. In a more general setting, as far as the Normality assumption is concerned, we can assume some residual distribution other than the normal. This is maybe a topic for future articles, but it's also important to note that we are only replacing one assumption with another - we are not reducing the number of assumptions.