
Probability Distributions: The Key to Understanding AI and Machine Learning



The Probabilistic Nature of Our World

Imagine walking home on a foggy night. As you move forward, you notice a vague shape in the distance. Is it a stray dog? A cat? Or perhaps just a garbage bag blown by the wind? In this moment of uncertainty, your brain doesn't settle on a single answer. Instead, it considers multiple possibilities, each with its own likelihood.

This scenario illustrates a fundamental aspect of how our brains process information. We don't perceive the world in binary terms; rather, we interpret probabilities. When faced with unclear or incomplete information, we naturally think in terms of likelihoods and uncertainties.

This probabilistic view of sensory data isn't just a quirk of human cognition. It's a principle that underlies many advancements in modern machine learning. In fact, a surprising number of computational problems solved by both biological and artificial systems can be viewed through the lens of learning probability distributions, performing inference, and synthesizing new samples from these distributions.

The Language of Probability

The language of probability and uncertainty is built on a few crucial concepts, such as entropy and KL divergence. These ideas are so fundamental that they appear across diverse fields, from the latest diffusion models in AI to theories of perception in biological brains, like the free energy principle.

In this article, we'll dive deep into the core concepts behind probability, starting from the ground up. We'll demystify many of the formulas that might appear intimidating at first but which have very intuitive explanations.

Probability Distributions: The Foundation

The most fundamental concept in our discussion is the notion of a probability distribution. In general, we deal with an abstract set of possible states, and to each of those states, we assign a positive number between zero and one. This number reflects our degree of belief in how likely that particular state is to occur.

You're probably familiar, from high school, with thinking about probability as a proportion of outcomes. For example, a six-sided die has six possible states. To find the probability of rolling a three, you would roll the die many times, count how often it lands on three, and divide that count by the total number of rolls. As you increase the number of rolls, this ratio converges to 1/6, which is the probability of any side of a fair die.

This approach, which relies on repeated trials and tracking outcomes, is called the frequentist interpretation of probability. While this interpretation works well for dice or coin flips, it has limitations.
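To make the frequentist picture concrete, here is a minimal simulation sketch (using Python and NumPy; this code is my addition, not part of the original article) that estimates the probability of rolling a three from repeated trials:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate many rolls of a fair six-sided die.
n_rolls = 100_000
rolls = rng.integers(low=1, high=7, size=n_rolls)

# Frequentist estimate: fraction of rolls that landed on three.
estimate = np.mean(rolls == 3)
print(estimate)  # close to 1/6 ≈ 0.1667
```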

Beyond Frequentist Probability

Consider a weather forecast stating the probability of rain tomorrow is 70%. We intuitively understand this, but it doesn't fit neatly into the frequentist definition. We can't repeat tomorrow's weather multiple times to find the frequency of rain outcomes. Tomorrow will happen only once, and it will either rain or not.

This is why we'll adopt the Bayesian view, which treats probability more generally as the degree of belief. When we say "rain with probability 0.7," it means we've assigned a 0.7 degree of certainty to it.

While this definition might seem very broad, it still obeys certain laws. The most crucial one is that all the degrees of belief you assign to possible states must sum to one. For instance, if there are only two outcomes - rain and sun - then saying the probability of rain is 0.7 implies that the probability of sun is 0.3.

Understanding Probability Distributions

A probability distribution is essentially a function that maps each possible state to its assigned probability value. Because probabilities must sum to one, we can't manipulate this function for one state independently of others. It characterizes the system as a whole, completely describing how uncertainty is distributed among individual states.

Here are a few examples of probability distributions:

  1. For a fair die, the distribution is uniform. Each side has a probability of 1/6.
  2. The distribution of heights of adults in a population follows the famous bell-shaped Gaussian distribution.

There's a technical detail here: since height is a continuous variable, the probability distribution must integrate to one, meaning that the area under the entire curve equals one. But integration is just a continuous version of summation: you can think of continuous distributions as being made up of a large number of tiny discrete bins. This won't change the core concepts we are discussing.
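As a quick illustration of the "tiny discrete bins" view (a sketch I'm adding here; the mean and standard deviation are illustrative, not from the article), we can discretize a Gaussian density and check that the bin probabilities sum to approximately one:

```python
import numpy as np

# Gaussian density for adult heights (illustrative mean/std, in cm).
mean, std = 170.0, 10.0
xs = np.linspace(mean - 5 * std, mean + 5 * std, 10_000)  # bin centers
dx = xs[1] - xs[0]                                        # bin width

density = np.exp(-0.5 * ((xs - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Each tiny bin gets probability density * width; together they sum to ~1.
bin_probs = density * dx
print(bin_probs.sum())  # ≈ 1.0
```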

Multi-dimensional Probability Distributions

So far, we've only looked at probability distributions over states that can be neatly arranged on a one-dimensional line. But what about more complex scenarios, like states that include both height and weight?

We could try to put them on a one-dimensional line, but that's not very enlightening. Instead, it's more natural to think of states as occupying a two-dimensional plane, with one coordinate for height and one for weight. Then, each point on this plane corresponds to a particular state, and we can represent its probability with color or as a third coordinate.

With more coordinates, we can no longer visualize the space directly, but the idea remains the same. In machine learning, we often deal with much higher-dimensional probability distributions. For instance, all possible 100x100 pixel images occupy a 10,000-dimensional space, with one axis for each pixel's value. We can then reason about the probability distribution of images by assigning a probability to each point in this space.

Sampling: From Distribution to Data

How do we actually use or interact with these complex distributions? To understand this, let's go back to our simpler example and think about the reverse process of generating outcomes from a probability distribution. This is called sampling, and it's a crucial concept that is at the heart of many generative AI models.

Rolling a die generates a new sample from the underlying distribution: it picks one state at random according to its probability. Similarly, if we have a way to draw samples from a 10,000-dimensional distribution over images, we can generate new images. Because this distribution captures the structure present in natural images, sampling from it is entirely different from randomly selecting a value for each pixel independently, which would always result in noise.
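Here is a minimal sampling sketch (my own illustration, not from the original article): drawing samples from a discrete distribution with NumPy, and contrasting that with independent per-pixel noise:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sampling from the die's distribution: each state is drawn with its probability.
states = np.arange(1, 7)
probs = np.full(6, 1 / 6)
samples = rng.choice(states, size=10, p=probs)
print(samples)

# Independently sampling each pixel of a 100x100 image from a uniform
# distribution ignores all structure between pixels and just produces noise.
noise_image = rng.integers(0, 256, size=(100, 100))
print(noise_image.shape)
# A generative model instead samples from a learned joint distribution over
# all 10,000 pixel values at once, which is what makes the output look natural.
```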

The Math Behind Probability Distributions

Now that we have established a common ground for the language of probability, let's explore the powerful math behind what it takes to describe and model a distribution.

The Concept of Surprise

Let's start with the notion of surprise. Imagine that before you toss a coin, someone comes up to you and correctly predicts the outcome. Will you be surprised to see this prediction hold? Well, maybe, but not too much. After all, the chance of guessing was 50/50.

Now suppose you do the same for rolling a die, and someone correctly predicts that it will land on one. You'd likely be more surprised by this than by the coin prediction, right?

Let's try to formalize this notion of surprise. Essentially, we're looking for a way to assign a positive number - a surprisal measure - to each state. This measure should be a function of probability that gets larger for smaller values and shrinks to zero when the probability approaches one. After all, events that are guaranteed to happen should have no surprise at all.

A crucial but less obvious aspect is the additive nature of surprise. If someone correctly predicts the outcomes of three dice rolled simultaneously, you'd be about three times as surprised as for one die. But the probability of such an event would be the product of individual die probabilities.

In other words, when probabilities multiply (as for independent events), the surprisal function should add. The function that behaves this way is the logarithm of one over the probability: surprisal(x) = log(1/P(x)) = -log(P(x)). This makes intuitive sense: surprise is high for rare events and low for common outcomes.
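Here is a tiny sketch (added for illustration) checking both properties at once: surprisal grows as the probability shrinks, and it adds up for independent events whose probabilities multiply:

```python
import numpy as np

def surprisal(p):
    # Surprise of an outcome with probability p: log(1/p) = -log(p).
    return -np.log(p)

print(surprisal(1.0))    # 0.0   -> certain events carry no surprise
print(surprisal(0.5))    # ~0.69 -> a correct coin-flip guess
print(surprisal(1 / 6))  # ~1.79 -> a correct die-roll guess

# Three independent dice: probabilities multiply, surprisals add.
p_three_dice = (1 / 6) ** 3
print(surprisal(p_three_dice))  # ~5.38
print(3 * surprisal(1 / 6))     # ~5.38, the same value
```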

Entropy: Average Surprise

Now that each state has its own surprise, we can characterize the entire distribution by finding the average surprise packed into it. This average surprise is what we call the entropy of that distribution. Mathematically, it is expressed in the following way:

H(P) = -∑ P(x) * log(P(x))

Where we sum over all possible states, multiplying the probability of each state by its surprisal value.

For example, consider a thick coin that lands on heads and tails with probability 49% each, and on its side 2% of the time. If you randomly flip it many times, record your surprisal caused by the outcome of each flip, and compute the average surprisal, what value are you going to get?

Well, every time you get either heads or tails (which together make up 98% of all flips), your surprise is about 0.7 (using the natural logarithm). But in the remaining 2% of flips, your surprise will be around 4. So the weighted average given by this formula is approximately 0.8.

Compare this to the entropy of a fair coin that always lands on heads or tails, which is about 0.69. That value is smaller, which means you will, on average, be more surprised by the outcomes of the coin that can land on its side. In other words, the thick coin has more inherent uncertainty packed into it.
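To check these numbers, here is a small sketch (my addition) that computes the entropy of the thick coin and the fair coin with the natural logarithm, as used in the averages above:

```python
import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) * log(P(x)), using the natural logarithm.
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

thick_coin = [0.49, 0.49, 0.02]  # heads, tails, lands on its side
fair_coin = [0.5, 0.5]

print(entropy(thick_coin))  # ≈ 0.78
print(entropy(fair_coin))   # ≈ 0.69
```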

The Gap Between Model and Reality

Now, I'd like to highlight an important point. Computing entropy relies on knowing the probability distribution, but in the real world, when we observe a random variable unfold, we don't have access to the true underlying probability distribution. Instead, we usually have an internal model - what we believe the probability distribution should be - and this is what we use to assign the probabilities.

The true underlying distribution is hidden and typically far too complex, so all we have are approximate models of it. For instance, consider a fair die. The true probability distribution of its sides slightly deviates from uniform due to inherent microscopic imperfections in the die, air resistance, etc. It may even include other outcomes, such as landing exactly on a corner or, with vanishingly small probability, falling apart in the air. However, we approximate this with a simple uniform distribution: 1/6 for each side. This model captures everything we need for practical purposes.

But what if the deviation between our internal model and the actual distribution is more drastic? Let's consider an example. Suppose someone gives you a coin to flip. Based on your prior experience with coins, you believe the probabilities for heads and tails to be around 50% each. What you don't know, though, is that this particular coin is rigged, with its center of mass shifted so that it lands on heads around 99% of the time.

You go ahead and flip it 10 times, getting 10 heads in a row. Under your current model of a fair coin, the probability of this happening is extremely small (about 1 in 1,024), so you will be very surprised. However, the actual true probability of this outcome is quite high (around 90%) because the coin is rigged.
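A quick numeric check (my own sketch) of how differently the two models rate this outcome:

```python
# Probability of ten heads in a row under each model.
p_fair = 0.5 ** 10     # ≈ 0.00098, about 1 in 1,024 -> very surprising
p_rigged = 0.99 ** 10  # ≈ 0.904, about 90% -> hardly surprising at all
print(p_fair, p_rigged)
```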

Notice that the surprisal you experienced is caused by your belief about the probability distribution being wrong, not because something truly remarkable had happened.

Cross-Entropy: Measuring Model Mismatch

This brings us to the idea of cross-entropy. Essentially, it is a function of two probability distributions that quantifies the average surprise you might expect when observing a random process generated by one probability distribution while believing it comes from another.

Mathematically, the cross-entropy between distributions P (the true one) and Q (your model) is defined similarly to entropy, but we use P for the outcome probabilities and Q for the surprisal terms:

H(P,Q) = -∑ P(x) * log(Q(x))

The surprise given by cross-entropy can come from two sources: the discrepancy between your model and the true distribution, as well as the inherent uncertainty of the underlying distribution itself. If P and Q are the same distribution (meaning that you have a perfect model), the cross-entropy is simply equal to the entropy of P.

A key mathematical property (which we're not going to prove rigorously here) is that the cross-entropy of distribution P as the ground truth and Q as its model is always greater than or equal to the entropy of P itself. In other words, believing in the wrong model of a random variable can only increase the surprise you'll get by observing it, never decrease it.

Another important feature of cross-entropy is its asymmetry. The roles of P and Q matter, and swapping them can lead to very different results. Let's explore this with two examples:

  1. Believing a coin is fair when it is actually rigged: Suppose the true distribution P is a rigged coin (99% heads, 1% tails), but your model Q is a fair coin. For a single flip, the cross-entropy can be calculated as follows:
H(P,Q) = -0.99 * log(0.5) - 0.01 * log(0.5) ≈ 0.7
  2. Believing a coin is rigged when it is actually fair: Now let's reverse the scenario, so P is the fair coin and Q is the rigged one. Computing the cross-entropy gives us:
H(P,Q) = -0.5 * log(0.99) - 0.5 * log(0.01) ≈ 2.3

This value is much larger than in the first case. That's because half the time, when you see tails, you will be extremely surprised, since your model predicted this was very unlikely. The other half of the time, when you see heads, your surprise will be very low, but the extreme surprise from tails dominates, leading to a high average surprise as measured by the cross-entropy.
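Here is a short sketch (added for illustration, using natural logarithms) that reproduces both numbers and shows the asymmetry directly:

```python
import numpy as np

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) * log(Q(x)): outcomes weighted by the true
    # distribution P, surprisal measured under the model Q.
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

rigged = [0.99, 0.01]  # heads, tails
fair = [0.5, 0.5]

print(cross_entropy(rigged, fair))  # ≈ 0.69: rigged coin, fair-coin model
print(cross_entropy(fair, rigged))  # ≈ 2.31: fair coin, rigged-coin model
```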

KL Divergence: Isolating Model Error

Our examples of cross-entropy highlight an important question: how can we quantify the difference between our model's beliefs and reality? Can we somehow isolate the part of the surprise that comes purely from our model's inaccuracy, rather than from the inherent uncertainty in the data?

This is where the concept of Kullback-Leibler Divergence, or KL Divergence for short, comes into play. It allows us to peel away layers of surprise.

Recall our definition of cross-entropy:

H(P,Q) = -∑ P(x) * log(Q(x))

This measures the total surprise when using Q to predict P. Now, what if we subtracted the entropy of P from this?

H(P) = -∑ P(x) * log(P(x))

This formula represents the inherent uncertainty in P itself. By subtracting it from the cross-entropy and using the properties of logarithms, we're left with:

KL(P||Q) = ∑ P(x) * log(P(x)/Q(x))

This is the KL Divergence. It measures the extra surprise we get from using Q as our model instead of P, beyond the surprise inherent in P itself.

In our coin analogy, it's like isolating the surprise caused by believing in the wrong model, separate from the surprise caused by the coin flip itself.
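Continuing the same coin example (again, a sketch of my own), we can compute the KL Divergence directly and confirm that it equals the cross-entropy minus the entropy of the true distribution:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

rigged = [0.99, 0.01]
fair = [0.5, 0.5]

# Extra surprise from believing the coin is fair when it is actually rigged.
print(kl_divergence(rigged, fair))                    # ≈ 0.64
print(cross_entropy(rigged, fair) - entropy(rigged))  # same value
```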

Applications in Machine Learning

But why might we need to measure this discrepancy in the first place? Well, the ultimate goal of building and training many machine learning systems is to construct a good approximation - a model - of the underlying probability distribution of some data, to later use it for prediction or sampling.

For example, suppose you're building an AI system that can create new, lifelike photos of cats. Ideally, you'd want to sample directly from the distribution of all possible cat images in the real world. However, that is an incredibly complex space that we can't directly access or describe, so we'll have to resort to building a model to approximate it.

Without diving too deep into the specifics of generative models in this article, I will just say that we usually approximate the shape of that distribution with neural networks. These can be viewed as mathematical functions with enough expressive power to capture complex patterns. Once we have parameterized the distribution with a neural net of some kind, we can optimize its parameters with techniques like gradient descent.

But how do we know whether a particular approximation is a good or a bad fit to the true target distribution? After all, we need some kind of error function that we'll be trying to minimize.

You might have guessed that the KL Divergence between our model and the true distribution is exactly the target objective we would like to make as low as possible. Ideally, we'd aim for zero, which would mean our model perfectly aligns with the data distribution.

However, in practice, if you look at the code for training models to capture a probability distribution, you typically won't see the KL Divergence as a loss function. Instead, you're likely to see the optimization objective framed as minimizing the cross-entropy.

So what's the catch here? Let's look at the definition of KL Divergence once again:

KL(P||Q) = H(P,Q) - H(P)

We're trying to find the model Q that would minimize this quantity. However, notice that the term H(P) does not depend on our model at all. It's a constant determined by the true distribution P, which we can't change because it stems from the inherent uncertainty in the training data. That is fixed - no matter how we tweak the parameters of a neural network, it has absolutely no effect on how diverse cats in the real world are.

This is why whichever approximation Q minimizes the KL Divergence also, by definition, minimizes the cross-entropy. In other words, the two objectives are equivalent: the entropy-of-data term does not affect which Q is optimal; it just shifts the value of the KL Divergence by a constant amount.

Computing the exact value of the KL Divergence would also require estimating the entropy of the training data from a finite number of samples. Although this can be done, we usually don't care about the actual value of the divergence, as long as it's the lowest one possible. Since the entropy term doesn't affect the resulting model, we simply skip estimating it and save some computation.
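As a final sketch (my addition, not from the article), here is a tiny grid search over candidate Bernoulli models showing that minimizing the cross-entropy and minimizing the KL Divergence select exactly the same model, because they differ only by the constant H(P):

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

true_p = np.array([0.99, 0.01])           # true (rigged) coin
candidates = np.linspace(0.01, 0.99, 99)  # candidate models q for P(heads)

ce = [cross_entropy(true_p, np.array([q, 1 - q])) for q in candidates]
kl = [kl_divergence(true_p, np.array([q, 1 - q])) for q in candidates]

# Both objectives are minimized by the same candidate: q = 0.99.
print(candidates[np.argmin(ce)], candidates[np.argmin(kl)])
```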

Conclusion

This forms the basis for many machine learning algorithms, especially in the realm of generative models. By framing our objective as cross-entropy minimization, we can train our models to approximate complex real-world distributions, even when we don't have direct access to those distributions.

While minimizing the cross-entropy is at the very core of modern generative models, training complex systems like image generation networks in practice often requires additional techniques that go beyond the scope of this article. These may include things like variational methods, which we might explore in the future.

My main goal today was to lay a solid foundation for understanding what it means to learn the probability distribution of data. We have explored fundamental concepts like entropy, cross-entropy, and KL Divergence, showing how they are rooted in intuitive ideas of surprise and uncertainty.

These principles are not just abstract mathematical constructs. They form the backbone of how we quantify, model, and learn from data in the real world. So hopefully, next time you encounter terms like cross-entropy in a minimization problem, you will recognize it not as a mysterious formula, but as an intuitive idea of measuring and minimizing the gap between our models and reality.

As we continue to push the boundaries of AI and machine learning, these fundamental concepts of probability distributions will remain crucial. They provide the theoretical underpinnings for our most advanced algorithms and help us bridge the gap between the complexity of the real world and our attempts to model and understand it.

Whether you're a student, a researcher, or simply someone fascinated by the workings of AI, grasping these concepts will give you a deeper appreciation of the probabilistic nature of our world and how we can harness it to create intelligent systems. As we move forward, keep exploring, keep questioning, and keep marveling at the beautiful mathematics that underlies our understanding of uncertainty and probability.

Article created from: https://youtu.be/KHVR587oW8I?si=nbfknY_oecdTg-2E
