
The Foundations of Linear Regression
Linear regression is often the first topic introduced in machine learning courses and textbooks. It's typically presented as finding the equation of a straight line that best fits a scatter of points. The standard explanation involves minimizing the average squared vertical distance from each point to the line. But have you ever wondered why we use vertical distances? Or why we square the errors instead of using absolute values or other powers?
In this article, we'll explore an alternative perspective on linear regression - a probabilistic approach that provides deeper insights into these fundamental questions. This viewpoint not only answers these queries but also connects this seemingly simple problem to more complex topics in machine learning.
Reframing Regression Through Probability
At its core, regression is about uncovering a relationship between inputs (X) and outputs (Y), then using that relationship to predict Y for new values of X. Unlike classification, which deals with discrete categories, regression handles continuous values.
Let's consider a classic example: predicting house prices based on various features. X1 might represent the number of bedrooms, X2 the distance to the subway, and so on. Our task is to reconstruct the price from these features.
Instead of immediately jumping to minimizing squared errors, let's approach this from a probabilistic perspective. We'll treat our data as if it emerges from a linear model plus some noise.
The Ideal Linear Model
Imagine that somewhere in the universe's underlying source code, the ideal price of each house is given precisely by a linear combination of features, weighted by some coefficients. We can express this as the dot product between weights and features stacked into vectors:
Ideal Price = W · X
Where W is the vector of weights and X is the vector of features.
Introducing Noise
However, real-world data is messy. It never perfectly follows linear patterns. Actual house prices are influenced not just by the features we've collected, but also by hidden variables we can't access, market fluctuations, and human behavior.
In other words, our observed values Y are corrupted versions of those ideal underlying values, with a noise term ε being added:
Observed Price = Ideal Price + ε
Here, ε represents everything we can't explain with our features - all the unknown contributors to the price beyond our control.
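To make this generative story concrete, here is a minimal Python sketch of how such data could arise, assuming Gaussian noise. The features, weights, and noise level are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "true" weights: price per bedroom, penalty per km to the subway, and a base price.
w_true = np.array([50_000.0, -20_000.0, 150_000.0])

n = 200
X = np.column_stack([
    rng.integers(1, 6, size=n),         # X1: number of bedrooms
    rng.uniform(0.1, 10.0, size=n),     # X2: distance to the subway, in km
    np.ones(n),                         # constant feature, so the base price acts as an intercept
])

ideal_price = X @ w_true                # Ideal Price = W · X
noise = rng.normal(0.0, 20_000.0, n)    # ε: everything our features cannot explain
price = ideal_price + noise             # Observed Price = Ideal Price + ε
```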
The Nature of Noise
The critical insight is that our resulting regression equation depends on our assumptions about this noise. The noise term ε is shaped by countless tiny influences: measurement errors, untracked variables, random market fluctuations, all added together.
When many small, independent effects accumulate additively, something remarkable happens. According to the Central Limit Theorem, when you sum many independent random variables, regardless of their individual distributions, their sum approaches a normal or Gaussian distribution - the familiar bell curve.
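A quick numerical sketch of this effect: summing many small, independent, decidedly non-Gaussian contributions produces totals whose skewness and excess kurtosis are close to zero, as they would be for a Gaussian. The sample sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each total is the sum of many small, independent uniform "nudges".
n_effects, n_samples = 200, 50_000
totals = rng.uniform(-1.0, 1.0, size=(n_samples, n_effects)).sum(axis=1)

# For a Gaussian, skewness and excess kurtosis are both zero; the totals come close.
z = (totals - totals.mean()) / totals.std()
print("skewness ≈", np.mean(z**3))
print("excess kurtosis ≈", np.mean(z**4) - 3)
```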
Probabilistic Interpretation
Suppose we have a candidate set of weights W - our hypothesis about the underlying linear model. When we examine a random house with features X and price Y, we can calculate exactly how much noise must have been added on top of our hypothesis by taking the difference between the observed price Y and its ideal, noise-free value W · X.
Knowing that the noise follows a Gaussian distribution, we can calculate the probability of getting precisely the right amount of noise ε that would push the underlying price to be registered as Y. This equals the probability of observing that particular data point.
Shifting Perspective
Let's reiterate this shift of perspective. With a working set of weights W, we take the features X and compute their weighted sum, giving us where the ideal, noiseless version of Y should lie. Around this point exists a cloud of uncertainty - a bell curve centered at W · X.
When we observe the actual value of Y and calculate how much noise must have been added, we can compute the likelihood of that happening by plugging the noise amplitude into the Gaussian equation. In other words, the probability of observing a single data point is the probability of sampling exactly that amount of noise from the Gaussian distribution.
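As a sketch, the per-point likelihood described above could be computed like this; the helper name and the fixed noise level sigma are just illustrative assumptions.

```python
import numpy as np

def point_likelihood(w, x, y, sigma):
    """Density of observing price y for features x under weights w,
    assuming the noise eps = y - w·x is Gaussian with standard deviation sigma."""
    eps = y - w @ x                 # the amount of noise that must have been added
    return np.exp(-eps**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
```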
Probability of the Entire Dataset
If we assume all data points in our scatter plot were sampled independently, then the probability of obtaining our entire dataset with a fixed W can be found by multiplying the probabilities of these independent events.
What we have done is express the probability of observing our data given a particular model (the coefficients W). If presented with two alternative models, intuitively, the better model would be the one with a higher probability of generating the observed data.
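Under the independence assumption, the probability of the whole dataset is just the product of the per-point densities. A sketch, with sigma again assumed known:

```python
import numpy as np

def dataset_likelihood(w, X, Y, sigma):
    """P(Data | w): product of per-point densities, assuming independently sampled points."""
    eps = Y - X @ w
    densities = np.exp(-eps**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    # The product of many numbers below 1 underflows quickly for large datasets;
    # this is one more practical reason to switch to logarithms, as the next section does.
    return np.prod(densities)
```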
Optimizing the Model
So when solving the regression problem and choosing the optimal W, we just need to select the configuration that maximizes this probability. Given this optimization objective, we can expand the formula for the probability of data.
Let's take the logarithm of the right-hand side. Because logarithm is a monotonic function, whichever weight configuration maximizes the total probability also maximizes its logarithm. The two objectives are equivalent, but the logarithm transforms our product of probabilities into a sum, which is much easier to work with.
Notice that σ, the amplitude of the underlying noise, is a fixed value determined by factors like market volatility. From the perspective of the optimization objective, it is a constant factor that doesn't affect which set of weights is optimal. This allows us to simplify the formula.
Finally, the logarithm cancels the exponential in the Gaussian formula, leaving us with this:
Optimal W = argmax(-Σ(Y - W · X)²)
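For completeness, here is that cancellation written out. Taking the logarithm of the product of Gaussian densities gives

```latex
\log P(\text{Data} \mid W)
  = \sum_{i=1}^{N} \log\!\left[ \frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\left( -\frac{(Y_i - W \cdot X_i)^2}{2\sigma^2} \right) \right]
  = -N \log\!\left( \sigma\sqrt{2\pi} \right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (Y_i - W \cdot X_i)^2
```

Neither the constant term nor the factor 1/(2σ²) depends on W, so dropping them leaves exactly the objective above.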
The Emergence of Least Squares
This is exactly the well-known least squares objective: the optimal coefficients are those that maximize the negative of (or, equivalently, minimize) the sum of squared errors between the linear fit and the observed points.
Importantly, though, we arrived at this from the perspective of finding the linear model that maximizes the probability of observing our data. The square in the resulting formula is a direct consequence of assuming Gaussian noise.
This problem can then be solved either through gradient descent, by making small iterative adjustments to the weights, or by jumping directly to the solution using a closed-form expression found in any textbook.
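As an illustrative sketch on small synthetic data (the learning rate and iteration count are hand-picked for this example, not a general recipe), the two routes might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.normal(size=200), np.ones(200)])
y = X @ np.array([3.0, -2.0, 5.0]) + rng.normal(0.0, 0.5, 200)

# Route 1: jump straight to the closed-form least-squares solution.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 2: gradient descent, making small iterative adjustments along the
# negative gradient of the mean squared error.
w = np.zeros(X.shape[1])
lr = 1e-3
for _ in range(5_000):
    grad = -2 * X.T @ (y - X @ w) / len(y)
    w -= lr * grad

print("closed form:     ", w_closed)
print("gradient descent:", w)
```

Both routes should recover roughly the same coefficients; which one is preferable mostly depends on the size of the problem.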
Beyond Simple Linear Regression
Now that we understand the probabilistic foundations of linear regression, let's explore how this perspective can be extended to more complex scenarios.
Model Selection and Prior Beliefs
Previously, when choosing between different sets of weights W, we always picked the model with the higher probability of generating the observed data (equivalently, the lower mean squared error). But what if two models have identical values for that probability but differ in their exact values of W? Would we have a reason to prefer one over the other?
If we know nothing about the nature of our features, we have no basis for comparing two equally performing models. But often, we have prior expectations about how features might contribute to the predictions, and thus we have reasonable boundaries for their values.
An Illustrative Example: Coin Tossing
Let's illustrate this with an example. Suppose we have a coin with an unknown bias, where the probability of heads is θ (between 0 and 1). We want to estimate this value of θ by tossing the coin and tracking the results.
Let's say we observe four heads out of five tosses. If we ignore any assumptions about θ and find the value that maximizes the probability of data, we will conclude that θ must be 0.8. Indeed, this type of biased coin maximizes the probability of observing four heads out of five tosses.
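To see this numerically: the probability of this particular sequence of tosses is proportional to θ⁴(1 - θ), and a quick grid search (a sketch, with an arbitrary grid resolution) confirms that it peaks at θ = 0.8.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)
likelihood = theta**4 * (1 - theta)   # probability of four heads and one tail, up to a constant factor
print(theta[np.argmax(likelihood)])   # prints ≈ 0.8
```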
But something doesn't seem right. We know from experience that most coins land roughly 50/50, maybe with a slight bias due to asymmetry, but certainly not 80 to 20. The problem is that our solution only cared about maximizing the probability of the observed data and completely ignored any prior belief about θ, which is likely centered around 0.5 and decreases towards the edges.
Incorporating Prior Beliefs
But is there a systematic way to incorporate these prior assumptions into our regression objective? Instead of maximizing the probability of data, we can search for a set of weights W that maximizes the joint probability of data and the weights. In other words, we look for weights that both explain the data well and align with our prior beliefs about what W should look like.
Following the conditional probability rule, we can decompose the total probability into the following product:
P(W, Data) = P(Data|W) * P(W)
The first factor is the likelihood - exactly what we had before: how likely a particular W is to have generated our observed data, as given by the Gaussian formula for the noise.
The second factor is the prior, where we incorporate assumptions about how likely different values of W are. The key idea is that different assumptions on prior distributions of weights will lead to different criteria for choosing between alternative solutions.
This shows up in the overall objective as so-called regularization terms.
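A minimal sketch of this joint objective, with the prior left as a pluggable function; the Gaussian prior shown here anticipates the L2 case discussed next, and the variance values are placeholders.

```python
import numpy as np

def log_joint(w, X, Y, sigma, log_prior):
    """log P(Data, w) = log P(Data | w) + log P(w).
    The first term is the Gaussian log-likelihood of the residuals;
    the second is whatever prior we choose to place on the weights."""
    eps = Y - X @ w
    log_likelihood = -np.sum(eps**2) / (2 * sigma**2) - len(Y) * np.log(sigma * np.sqrt(2 * np.pi))
    return log_likelihood + log_prior(w)

def gaussian_log_prior(w, tau=1.0):
    """Zero-centered Gaussian prior on each weight (this choice leads to the L2 penalty below)."""
    return -np.sum(w**2) / (2 * tau**2) - len(w) * np.log(tau * np.sqrt(2 * np.pi))
```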
L2 Regularization: The Gaussian Prior
One of the most popular choices is to assume that weights W themselves follow a zero-centered Gaussian distribution. Why is that reasonable?
Well, in regression, each component of W is a coefficient describing how a particular feature in the X vector (like the size of the house) contributes to the prediction Y. If we gather features somewhat arbitrarily, intuitively most of them will be irrelevant, with coefficients near zero, while only a small subset will carry significant weight.
Additionally, since each feature's coefficient in real data is shaped by many underlying, unobserved causes, the Central Limit Theorem applies to the coefficients as well.
Formally, we can write that the prior probability of observing a particular set of weights is given by the product of probabilities of individual components, each of which is given by the Gaussian formula with some variance τ.
Going back to our optimization objective, we want to maximize the joint probability. Let's take the logarithm as before and substitute our formulas for the likelihood and the prior. Flipping the signs and grouping the two constants together, we get the following:
Optimal W = argmin(Σ(Y - W · X)² + λΣW²)
This is what's known as Ridge regression, or L2-regularized linear regression, named for the square of the weight amplitudes. The idea is that we are searching for a model that both explains the data well and is not overly complex, where complexity is measured as the sum of squares of the weights.
This regularization term penalizes large weight values, pushing them toward zero - exactly what we would expect from our Gaussian prior assumption.
Notice how beautifully this emerges. The parameter λ, which controls how strong the regularization is, turns out to be the ratio between the variance of the data noise and the variance of the prior distribution of the coefficients. When we are very certain about our prior (a small prior variance), λ becomes larger, giving more weight to the regularization term. Conversely, when the data is very reliable, with a small noise variance, λ decreases, placing more emphasis on fitting the data.
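As a sketch of how this looks in code, using the standard closed-form expression for the L2-regularized solution; the noise and prior variances are assumed to be known or estimated separately.

```python
import numpy as np

def ridge_fit(X, Y, noise_var, prior_var):
    """Closed-form L2-regularized (Ridge) fit.
    lam is the ratio of the noise variance to the prior variance of the weights,
    as in the derivation above."""
    lam = noise_var / prior_var
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```

A tighter prior (smaller prior_var) inflates lam and shrinks the weights harder, exactly as described above.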
L1 Regularization: The Laplace Prior
But what if our intuition about the weights is different? Instead of just assuming they are generally small, what if we believe most should be exactly zero, with only a few significant ones? This would correspond to a model where only a handful of features truly matter, favoring sparse solutions.
This assumption is particularly relevant for biological systems. In genomics, for example, out of thousands of genes, only a small subset typically influences a particular trait. Similarly, in neuroscience, only a small, sparse subset of neurons is responsible for encoding any particular feature.
In this case, a Gaussian prior is not ideal because it pushes weights towards zero too gently. Instead, we might prefer a distribution with a sharp peak at zero, which looks something like this:
P(W) ∝ exp(-|W|/b)
This is known as the Laplace distribution, parametrized by a scale that controls its symmetric, exponentially decaying tails. The prior probability of the whole configuration of weights can again be found by multiplying the probabilities of the individual components.
Following the same derivation as before and taking the logarithm to cancel the exponential, our optimization objective becomes the following:
Optimal W = argmin(Σ(Y - W · X)² + λΣ|W|)
Here, λ is again a combination of constants: the noise variance and the falloff rate of the weight prior. This is known as L1 regularization because the complexity penalty is the sum of the absolute values of W rather than their squares.
L1 regularization typically leads to sparse solutions where many weights are exactly zero, which is preferable in many domains of science.
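There is no closed-form solution for the L1 objective in general, but a simple proximal-gradient (soft-thresholding) sketch illustrates the sparsity effect on made-up data where only two features truly matter. The step size and penalty strength here are hand-tuned for this example.

```python
import numpy as np

def lasso_ista(X, Y, lam, lr=1e-3, n_iters=10_000):
    """Proximal gradient (ISTA) sketch for: sum of squared errors + lam * sum(|w|)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w + lr * 2 * X.T @ (Y - X @ w)                        # gradient step on the squared-error part
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)    # soft-thresholding handles the |w| part
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[[0, 5]] = [4.0, -3.0]                    # only two of twenty features actually matter
Y = X @ w_true + rng.normal(0.0, 0.5, 100)

print(np.round(lasso_ista(X, Y, lam=50.0), 2))  # most coefficients come out exactly zero
```

In practice one would reach for an off-the-shelf solver, but the soft-thresholding step makes it visible why exact zeros appear.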
Tying It All Together
In this article, we've explored the probabilistic view of linear regression and seen how the familiar least squares equation naturally emerges when we find the linear fit that maximizes the probability of observing our data.
Importantly, the squared error term wasn't just an arbitrary choice - it is a direct consequence of assuming Gaussian noise in our model. While this assumption is reasonable in most cases, it may not be appropriate in specific settings where noise is correlated or multiplicative in nature.
We've also seen how incorporating prior beliefs about the coefficients leads to different regularization schemes, providing a principled approach to balancing model accuracy with complexity. The Gaussian prior gave us L2 regularization, gently pushing all weights toward zero, while the Laplace prior yielded L1 regularization, favoring sparse solutions where most weights become exactly zero.
This probabilistic perspective extends far beyond linear regression to nearly all machine learning models. Whether examining deep neural networks, decision trees, or clustering algorithms, viewing them through the lens of probability provides a much deeper understanding of their underlying assumptions and design choices.
By understanding these fundamental concepts, we can make more informed decisions when applying machine learning techniques to real-world problems, and potentially develop new, more effective algorithms based on different probabilistic assumptions.
Conclusion
The probabilistic approach to linear regression offers a powerful framework for understanding and extending this fundamental machine learning technique. By reframing the problem in terms of probability, we gain insights into why certain methods work and how they can be improved or adapted to different scenarios.
As we continue to push the boundaries of machine learning and artificial intelligence, this probabilistic perspective will undoubtedly play a crucial role in developing more sophisticated and effective algorithms. Whether you're a student just starting to explore machine learning or an experienced practitioner looking to deepen your understanding, embracing this probabilistic viewpoint can open up new avenues for innovation and insight in the field of data science and beyond.
Article created from: https://www.youtube.com/watch?v=q7seckj1hwM&list=PLgtmMKe4spCPsxyMpg-sxf3EcbsFYlzPK&index=7&pp=iAQB