Tutorial 5: Probability and Distributions¶
Course: Mathematics for Machine Learning
Instructor: Mohammed Alnemari
Learning Objectives¶
By the end of this tutorial, you will understand:
- Probability spaces, sample spaces, events, and the Kolmogorov axioms
- Conditional probability and independence
- Bayes' theorem and how to apply it
- Discrete random variables and their distributions (Bernoulli, Binomial, Geometric)
- Continuous random variables and their distributions (Uniform, Exponential, Gaussian)
- Expected value and variance, including computation rules
- The Gaussian (Normal) distribution and its properties
- Joint and marginal distributions
- Covariance and correlation
- The sum rule and product rule of probability
Part 1: Probability Space¶
1.1 Core Definitions¶
A probability space is a triple \((\Omega, \mathcal{F}, P)\) consisting of three components:
| Component | Name | Description |
|---|---|---|
| \(\Omega\) | Sample space | The set of all possible outcomes of an experiment |
| \(\mathcal{F}\) | Event space | A collection of subsets of \(\Omega\) (the events we can assign probabilities to) |
| \(P\) | Probability function | A function \(P: \mathcal{F} \to [0, 1]\) that assigns probabilities to events |
Example (Coin Flip):
- Sample space: \(\Omega = \{H, T\}\)
- Event space: \(\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}\)
- Probability: \(P(\{H\}) = 0.5\), \(P(\{T\}) = 0.5\)
Example (Rolling a Die):
- Sample space: \(\Omega = \{1, 2, 3, 4, 5, 6\}\)
- Event "rolling an even number": \(A = \{2, 4, 6\}\)
- \(P(A) = \frac{3}{6} = \frac{1}{2}\)
1.2 Kolmogorov Axioms of Probability¶
All of probability theory rests on three axioms, formalized by Andrey Kolmogorov:
| Axiom | Statement | Meaning |
|---|---|---|
| Axiom 1 (Non-negativity) | \(P(A) \geq 0\) for every event \(A\) | Probabilities are never negative |
| Axiom 2 (Normalization) | \(P(\Omega) = 1\) | Something must happen |
| Axiom 3 (Additivity) | If \(A \cap B = \emptyset\), then \(P(A \cup B) = P(A) + P(B)\) | For mutually exclusive events, probabilities add |
Key consequences of the axioms:
- \(P(\emptyset) = 0\) (the impossible event has probability zero)
- \(P(A^c) = 1 - P(A)\) (complement rule)
- \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) (inclusion-exclusion)
- If \(A \subseteq B\), then \(P(A) \leq P(B)\) (monotonicity)
Worked Example: Suppose \(P(A) = 0.6\) and \(P(B) = 0.4\) with \(P(A \cap B) = 0.2\). Find \(P(A \cup B)\).

By inclusion-exclusion: \(P(A \cup B) = 0.6 + 0.4 - 0.2 = 0.8\).
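Inclusion-exclusion can also be verified by brute-force enumeration. A minimal Python sketch (the die events `A` and `B` here are illustrative, not from the text):

```python
from fractions import Fraction

# Equally likely outcomes of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "even"
B = {4, 5, 6}   # "greater than 3"

def prob(event):
    """P(E) = |E| / |Omega| under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
assert lhs == rhs
print(lhs)  # 2/3
```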
Part 2: Conditional Probability and Independence¶
2.1 Conditional Probability¶
The conditional probability of event \(A\) given that event \(B\) has occurred is:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0\]

This reads: "the probability of \(A\) given \(B\)."
Intuition: Once we know \(B\) happened, the sample space effectively shrinks to \(B\), and we ask how much of \(B\) is also in \(A\).
Worked Example: A standard deck has 52 cards. What is the probability a card is a King given it is a face card?
- \(B\) = face card: there are 12 face cards (J, Q, K of each suit), so \(P(B) = \frac{12}{52}\)
- \(A \cap B\) = "King and face card" = King: there are 4 Kings, so \(P(A \cap B) = \frac{4}{52}\)

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{4/52}{12/52} = \frac{1}{3}\]
2.2 Independence¶
Two events \(A\) and \(B\) are independent if knowing one gives no information about the other. Formally:

\[P(A \cap B) = P(A)\,P(B)\]

Equivalently, if \(A\) and \(B\) are independent:

\[P(A \mid B) = P(A) \qquad \text{and} \qquad P(B \mid A) = P(B)\]
Example: Rolling two fair dice. Let \(A\) = "first die shows 3" and \(B\) = "second die shows 5."

\[P(A) = \frac{1}{6}, \qquad P(B) = \frac{1}{6}, \qquad P(A \cap B) = \frac{1}{36} = \frac{1}{6} \cdot \frac{1}{6}\]

Since \(P(A \cap B) = P(A)P(B)\), the events are independent.
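This product-rule check can be carried out by enumerating all 36 outcomes. A small Python sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] == 3}   # first die shows 3
B = {o for o in outcomes if o[1] == 5}   # second die shows 5

def prob(event):
    return Fraction(len(event), len(outcomes))

# Independence: P(A ∩ B) = P(A) P(B)
assert prob(A & B) == prob(A) * prob(B) == Fraction(1, 36)
print(prob(A), prob(B), prob(A & B))  # 1/6 1/6 1/36
```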
Warning: Independence is not the same as mutual exclusivity. If \(A\) and \(B\) are mutually exclusive and both have positive probability, they are not independent (knowing one happened tells you the other did not).
Part 3: Bayes' Theorem¶
3.1 The Formula¶
Bayes' theorem lets us "reverse" conditional probabilities:

\[P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}\]
| Term | Name | Interpretation |
|---|---|---|
| \(P(A \mid B)\) | Posterior | Updated belief about \(A\) after observing \(B\) |
| \(P(A)\) | Prior | Initial belief about \(A\) before seeing evidence |
| \(P(B \mid A)\) | Likelihood | How probable the evidence \(B\) is if \(A\) is true |
| \(P(B)\) | Evidence (marginal likelihood) | Total probability of observing \(B\) |
3.2 The Law of Total Probability¶
The denominator \(P(B)\) is often computed using the law of total probability. If \(A_1, A_2, \ldots, A_n\) partition \(\Omega\):

\[P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)\]

For two complementary events \(A\) and \(A^c\):

\[P(B) = P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)\]
3.3 Worked Example: Medical Testing¶
A disease affects 1% of the population. A test has:
- Sensitivity (true positive rate): \(P(\text{Positive} \mid \text{Disease}) = 0.95\)
- Specificity (true negative rate): \(P(\text{Negative} \mid \text{No Disease}) = 0.90\)
If a person tests positive, what is \(P(\text{Disease} \mid \text{Positive})\)?
Step 1: Define events and assign values.
- \(P(D) = 0.01\), \(P(D^c) = 0.99\)
- \(P(+ \mid D) = 0.95\), \(P(+ \mid D^c) = 1 - 0.90 = 0.10\)
Step 2: Compute \(P(+)\) using the law of total probability.

\[P(+) = P(+ \mid D)P(D) + P(+ \mid D^c)P(D^c) = (0.95)(0.01) + (0.10)(0.99) = 0.0095 + 0.099 = 0.1085\]

Step 3: Apply Bayes' theorem.

\[P(D \mid +) = \frac{P(+ \mid D)P(D)}{P(+)} = \frac{0.0095}{0.1085} \approx 0.0876\]
Interpretation: Even with a positive test, there is only about an 8.8% chance the person actually has the disease. This counterintuitive result arises because the disease is rare (low prior), so false positives outnumber true positives.
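The three steps above translate directly into a few lines of Python (a sketch using the numbers from the example):

```python
# Medical-testing example: prior, sensitivity, and false-positive rate.
p_disease = 0.01
sensitivity = 0.95          # P(+ | D)
false_positive = 1 - 0.90   # P(+ | D^c) = 1 - specificity

# Law of total probability: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)
p_positive = sensitivity * p_disease + false_positive * (1 - p_disease)

# Bayes' theorem: P(D | +) = P(+|D)P(D) / P(+)
posterior = sensitivity * p_disease / p_positive
print(round(posterior, 4))  # 0.0876
```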
Part 4: Discrete Random Variables¶
4.1 Definitions¶
A random variable \(X\) is a function that maps outcomes in the sample space to real numbers:

\[X : \Omega \to \mathbb{R}\]
A random variable is discrete if it takes values from a countable set (e.g., \(\{0, 1, 2, \ldots\}\)).
The probability mass function (PMF) of a discrete random variable \(X\) is:

\[p(x) = P(X = x)\]
Properties of a valid PMF:
- \(p(x) \geq 0\) for all \(x\)
- \(\displaystyle\sum_{\text{all } x} p(x) = 1\)
4.2 Bernoulli Distribution¶
A single trial with two outcomes: success (\(X = 1\)) with probability \(p\), or failure (\(X = 0\)) with probability \(1 - p\):

\[P(X = k) = p^k (1-p)^{1-k}, \qquad k \in \{0, 1\}\]
| Property | Value |
|---|---|
| Mean | \(E[X] = p\) |
| Variance | \(\text{Var}(X) = p(1-p)\) |
Example: A coin flip with \(P(\text{Heads}) = 0.6\). Then \(X \sim \text{Bernoulli}(0.6)\).
4.3 Binomial Distribution¶
The number of successes in \(n\) independent Bernoulli trials, each with success probability \(p\):

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n\]

where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the binomial coefficient.
| Property | Value |
|---|---|
| Mean | \(E[X] = np\) |
| Variance | \(\text{Var}(X) = np(1-p)\) |
Worked Example: A fair coin is flipped 10 times. What is the probability of getting exactly 4 heads?

\[P(X = 4) = \binom{10}{4} (0.5)^4 (0.5)^6 = \frac{210}{1024} \approx 0.2051\]
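The binomial PMF is easy to evaluate with the standard library's `math.comb`. A short sketch (the helper name `binomial_pmf` is our own):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 10 flips of a fair coin, exactly 4 heads:
p4 = binomial_pmf(4, 10, 0.5)
print(round(p4, 4))  # 0.2051
```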
4.4 Geometric Distribution¶
The number of trials until the first success in a sequence of independent Bernoulli trials:

\[P(X = k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots\]
| Property | Value |
|---|---|
| Mean | \(E[X] = \frac{1}{p}\) |
| Variance | \(\text{Var}(X) = \frac{1-p}{p^2}\) |
Worked Example: You roll a fair die until you get a 6. What is the probability it takes exactly 3 rolls?

\[P(X = 3) = \left(\frac{5}{6}\right)^{2} \cdot \frac{1}{6} = \frac{25}{216} \approx 0.116\]

The expected number of rolls is \(E[X] = \frac{1}{1/6} = 6\).
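The same calculation in Python, using exact fractions (the helper `geometric_pmf` is illustrative):

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(X = k): first success occurs on trial k, for X ~ Geometric(p)."""
    return (1 - p)**(k - 1) * p

p = Fraction(1, 6)         # probability of rolling a 6
p3 = geometric_pmf(3, p)   # first 6 on exactly the third roll
mean = 1 / p               # E[X] = 1/p

print(p3, mean)  # 25/216 6
```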
Part 5: Continuous Random Variables¶
5.1 Definitions¶
A random variable is continuous if it can take any value in an interval (or union of intervals).
The probability density function (PDF) \(f(x)\) satisfies:
- \(f(x) \geq 0\) for all \(x\)
- \(\displaystyle\int_{-\infty}^{\infty} f(x) \, dx = 1\)
- \(\displaystyle P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)
Important: For a continuous random variable, \(P(X = x) = 0\) for any specific value \(x\). Only intervals have nonzero probability.
The cumulative distribution function (CDF) is:

\[F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt\]
Properties of the CDF:
- \(F(-\infty) = 0\) and \(F(\infty) = 1\)
- \(F\) is non-decreasing
- \(f(x) = \frac{d}{dx} F(x)\) (the PDF is the derivative of the CDF)
5.2 Uniform Distribution¶
A random variable is equally likely to take any value in the interval \([a, b]\):

\[f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}\]
| Property | Value |
|---|---|
| Mean | \(E[X] = \frac{a + b}{2}\) |
| Variance | \(\text{Var}(X) = \frac{(b - a)^2}{12}\) |
Example: If \(X \sim \text{Uniform}(0, 10)\), then \(E[X] = 5\) and \(P(2 \leq X \leq 5) = \frac{5-2}{10-0} = 0.3\).
5.3 Exponential Distribution¶
Models the time between events in a Poisson process. The parameter \(\lambda > 0\) is the rate:

\[f(x) = \lambda e^{-\lambda x}, \qquad x \geq 0\]
| Property | Value |
|---|---|
| Mean | \(E[X] = \frac{1}{\lambda}\) |
| Variance | \(\text{Var}(X) = \frac{1}{\lambda^2}\) |
Key property (Memoryless):

\[P(X > s + t \mid X > s) = P(X > t)\]

Worked Example: Light bulbs fail at a rate of \(\lambda = 0.01\) per hour. What is the probability a bulb lasts more than 200 hours?

\[P(X > 200) = e^{-\lambda \cdot 200} = e^{-2} \approx 0.135\]
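A short Python sketch of the survival-function calculation, which also double-checks the memoryless property numerically:

```python
from math import exp

lam = 0.01  # failure rate, per hour

def survival(t, lam):
    """Exponential survival function: P(X > t) = e^{-lambda * t}."""
    return exp(-lam * t)

print(round(survival(200, lam), 4))  # 0.1353

# Memorylessness: P(X > s + t | X > s) = P(X > s + t) / P(X > s) = P(X > t)
assert abs(survival(300, lam) / survival(100, lam) - survival(200, lam)) < 1e-12
```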
5.4 Gaussian (Normal) Distribution¶
The most important distribution in statistics and machine learning. If \(X \sim \mathcal{N}(\mu, \sigma^2)\), its PDF is:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
| Property | Value |
|---|---|
| Mean | \(E[X] = \mu\) |
| Variance | \(\text{Var}(X) = \sigma^2\) |
Full details on the Gaussian are in Part 7 below.
Part 6: Expected Value and Variance¶
6.1 Expected Value (Mean)¶
The expected value is the long-run average of a random variable.
Discrete case:

\[E[X] = \sum_{x} x \, p(x)\]

Continuous case:

\[E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx\]

Expected value of a function \(g(X)\):

\[E[g(X)] = \sum_{x} g(x)\,p(x) \qquad \text{or} \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx\]
6.2 Properties of Expected Value¶
| Property | Formula |
|---|---|
| Linearity | \(E[aX + b] = aE[X] + b\) |
| Sum | \(E[X + Y] = E[X] + E[Y]\) (always, even if dependent) |
| Product (independent) | \(E[XY] = E[X] \cdot E[Y]\) (only if \(X, Y\) are independent) |
| Constant | \(E[c] = c\) |
Worked Example: Let \(X\) be a fair die roll. Compute \(E[X]\).

\[E[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5\]
6.3 Variance¶
Variance measures how spread out a distribution is around its mean:

\[\text{Var}(X) = E\big[(X - E[X])^2\big]\]

Shortcut formula (very useful for computation):

\[\text{Var}(X) = E[X^2] - (E[X])^2\]
The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\).
6.4 Properties of Variance¶
| Property | Formula |
|---|---|
| Scaling | \(\text{Var}(aX) = a^2 \text{Var}(X)\) |
| Shift | \(\text{Var}(X + b) = \text{Var}(X)\) |
| Affine | \(\text{Var}(aX + b) = a^2 \text{Var}(X)\) |
| Sum (independent) | \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) (only if independent) |
| Constant | \(\text{Var}(c) = 0\) |
Worked Example: Let \(X\) be a fair die roll. Compute \(\text{Var}(X)\).

First, compute \(E[X^2]\):

\[E[X^2] = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} = \frac{91}{6} \approx 15.17\]

We already know \(E[X] = 3.5\), so \((E[X])^2 = 12.25\). Therefore:

\[\text{Var}(X) = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92\]
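The die-roll mean and variance can be verified exactly with Python's `fractions` module:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die: each face has probability 1/6

mean = sum(x * p for x in faces)               # E[X]
second_moment = sum(x**2 * p for x in faces)   # E[X^2]
variance = second_moment - mean**2             # shortcut: E[X^2] - (E[X])^2

print(mean, second_moment, variance)  # 7/2 91/6 35/12
```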
Part 7: Gaussian (Normal) Distribution in Depth¶
7.1 Definition¶
The Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) has PDF:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
We write \(X \sim \mathcal{N}(\mu, \sigma^2)\).
7.2 The Standard Normal Distribution¶
When \(\mu = 0\) and \(\sigma^2 = 1\), we get the standard normal \(Z \sim \mathcal{N}(0, 1)\):

\[\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\]

Standardization: Any normal random variable can be converted to a standard normal:

\[Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)\]
7.3 Key Properties¶
| Property | Description |
|---|---|
| Symmetry | The PDF is symmetric about \(\mu\) |
| 68-95-99.7 Rule | ~68% of values fall within \(\mu \pm \sigma\), ~95% within \(\mu \pm 2\sigma\), ~99.7% within \(\mu \pm 3\sigma\) |
| Linear closure | If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(aX + b \sim \mathcal{N}(a\mu + b, \, a^2\sigma^2)\) |
| Sum of normals | If \(X \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(Y \sim \mathcal{N}(\mu_2, \sigma_2^2)\) are independent, then \(X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \, \sigma_1^2 + \sigma_2^2)\) |
Worked Example: Exam scores are distributed as \(X \sim \mathcal{N}(75, 100)\) (mean 75, standard deviation 10). What fraction of students score above 90?
Standardize:

\[z = \frac{90 - 75}{10} = 1.5, \qquad P(X > 90) = P(Z > 1.5) \approx 0.0668\]
About 6.7% of students score above 90.
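Python's standard library includes `statistics.NormalDist`, which makes this computation a one-liner (a sketch using the example's numbers):

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

# P(X > 90) = 1 - F(90); equivalently P(Z > 1.5) after standardizing.
p_above_90 = 1 - scores.cdf(90)
z = (90 - 75) / 10

print(z, round(p_above_90, 4))  # 1.5 0.0668
```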
7.4 Why the Gaussian Matters in Machine Learning¶
- The Central Limit Theorem states that the sum of many independent random variables tends toward a Gaussian, regardless of their individual distributions.
- Many ML algorithms assume Gaussian noise (linear regression, Gaussian processes).
- The multivariate Gaussian is fundamental to dimensionality reduction (PCA) and generative models.
Part 8: Joint and Marginal Distributions¶
8.1 Joint Distribution¶
The joint distribution describes the probability behavior of two (or more) random variables simultaneously.
Discrete (Joint PMF):

\[p(x, y) = P(X = x, \, Y = y)\]

Properties:
- \(p(x, y) \geq 0\) for all \(x, y\)
- \(\displaystyle\sum_{x}\sum_{y} p(x, y) = 1\)

Continuous (Joint PDF):

\[P\big((X, Y) \in A\big) = \iint_{A} f(x, y) \, dx \, dy, \qquad f(x, y) \geq 0, \qquad \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1\]
8.2 Marginal Distribution¶
The marginal distribution of one variable is obtained by summing (or integrating) over the other variable.
Discrete:

\[p_X(x) = \sum_{y} p(x, y)\]

Continuous:

\[f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy\]
Worked Example (Discrete): Consider two discrete random variables \(X\) and \(Y\) with the following joint PMF table:
| \(Y = 0\) | \(Y = 1\) | \(p_X(x)\) | |
|---|---|---|---|
| \(X = 0\) | 0.1 | 0.2 | 0.3 |
| \(X = 1\) | 0.3 | 0.4 | 0.7 |
| \(p_Y(y)\) | 0.4 | 0.6 | 1.0 |
Marginals are computed by summing each row or column:
- \(p_X(0) = 0.1 + 0.2 = 0.3\)
- \(p_X(1) = 0.3 + 0.4 = 0.7\)
- \(p_Y(0) = 0.1 + 0.3 = 0.4\)
- \(p_Y(1) = 0.2 + 0.4 = 0.6\)
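Marginalization is a one-line sum in code. A sketch of the table above, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction as F

# Joint PMF from the table, stored as {(x, y): probability}.
joint = {(0, 0): F(1, 10), (0, 1): F(2, 10),
         (1, 0): F(3, 10), (1, 1): F(4, 10)}

# Sum rule: marginalize out the other variable.
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

assert p_x == {0: F(3, 10), 1: F(7, 10)}
assert p_y == {0: F(4, 10), 1: F(6, 10)}

# Independence would require p(0, 0) == p_X(0) * p_Y(0).
print(joint[(0, 0)] == p_x[0] * p_y[0])  # False -> not independent
```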
8.3 Independence of Random Variables¶
\(X\) and \(Y\) are independent if and only if:

\[p(x, y) = p_X(x)\,p_Y(y) \quad \text{for all } x, y\]
Check the example above: \(p(0, 0) = 0.1\) but \(p_X(0) \cdot p_Y(0) = 0.3 \times 0.4 = 0.12 \neq 0.1\). So \(X\) and \(Y\) are not independent.
Part 9: Covariance and Correlation¶
9.1 Covariance¶
Covariance measures how two random variables vary together:

\[\text{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]\]

Shortcut formula:

\[\text{Cov}(X, Y) = E[XY] - E[X]\,E[Y]\]
| Value | Interpretation |
|---|---|
| \(\text{Cov}(X,Y) > 0\) | \(X\) and \(Y\) tend to increase together |
| \(\text{Cov}(X,Y) < 0\) | When \(X\) increases, \(Y\) tends to decrease |
| \(\text{Cov}(X,Y) = 0\) | No linear relationship (uncorrelated) |
Properties of Covariance:
- \(\text{Cov}(X, X) = \text{Var}(X)\)
- \(\text{Cov}(X, Y) = \text{Cov}(Y, X)\) (symmetric)
- \(\text{Cov}(aX + b, \, cY + d) = ac \, \text{Cov}(X, Y)\)
- If \(X\) and \(Y\) are independent, then \(\text{Cov}(X, Y) = 0\) (the converse is not always true)
9.2 Correlation Coefficient¶
The Pearson correlation coefficient normalizes covariance to the range \([-1, 1]\):

\[\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\]
| Value | Interpretation |
|---|---|
| \(\rho = 1\) | Perfect positive linear relationship |
| \(\rho = -1\) | Perfect negative linear relationship |
| \(\rho = 0\) | No linear relationship (uncorrelated) |
| \(0 < \rho < 1\) | Positive linear tendency |
| \(-1 < \rho < 0\) | Negative linear tendency |
9.3 Variance of a Sum (General Case)¶
In general:

\[\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)\]

If \(X\) and \(Y\) are independent (so \(\text{Cov}(X,Y) = 0\)), this reduces to \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\).
Worked Example: Using the joint PMF from Part 8, compute \(\text{Cov}(X, Y)\).
From the table: \(E[X] = 0(0.3) + 1(0.7) = 0.7\) and \(E[Y] = 0(0.4) + 1(0.6) = 0.6\). Only the \((x, y) = (1, 1)\) cell contributes to \(E[XY]\):

\[E[XY] = (1)(1)(0.4) = 0.4\]

\[\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0.4 - (0.7)(0.6) = -0.02\]

The small negative covariance indicates a very slight negative association.
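The covariance calculation from the table, done exactly in Python (a sketch):

```python
from fractions import Fraction as F

# Joint PMF from Part 8, stored as {(x, y): probability}.
joint = {(0, 0): F(1, 10), (0, 1): F(2, 10),
         (1, 0): F(3, 10), (1, 1): F(4, 10)}

e_x = sum(x * p for (x, _), p in joint.items())       # E[X]
e_y = sum(y * p for (_, y), p in joint.items())       # E[Y]
e_xy = sum(x * y * p for (x, y), p in joint.items())  # E[XY]

cov = e_xy - e_x * e_y  # shortcut: Cov(X,Y) = E[XY] - E[X]E[Y]
print(e_x, e_y, e_xy, cov)  # 7/10 3/5 2/5 -1/50
```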
Part 10: Sum Rule and Product Rule¶
10.1 The Two Fundamental Rules¶
These two rules form the foundation of all probabilistic reasoning.
Product Rule (Chain Rule):

\[P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)\]

This generalizes to multiple variables:

\[P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})\]

Sum Rule (Marginalization):

\[P(A) = \sum_{B} P(A, B)\]
10.2 Discrete Case¶
For discrete random variables \(X\) and \(Y\):
Product rule:

\[p(x, y) = p(x \mid y)\,p(y)\]

Sum rule (marginalization):

\[p(x) = \sum_{y} p(x, y)\]
10.3 Continuous Case¶
For continuous random variables \(X\) and \(Y\):
Product rule:

\[f(x, y) = f(x \mid y)\,f(y)\]

Sum rule (marginalization):

\[f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy\]
10.4 Connection to Bayes' Theorem¶
Bayes' theorem is a direct consequence of applying the product rule in both directions and then dividing:

\[f(y \mid x) = \frac{f(x \mid y)\,f(y)}{f(x)}\]
The denominator uses the sum rule to compute the marginal \(f(x)\).
Worked Example: Suppose \(Y \in \{0, 1\}\) with \(P(Y=1) = 0.3\) and \(P(Y=0) = 0.7\). Also:
- \(P(X = 1 \mid Y = 1) = 0.9\)
- \(P(X = 1 \mid Y = 0) = 0.2\)
Find \(P(Y = 1 \mid X = 1)\) using the sum and product rules.
Step 1 (Sum rule): Compute \(P(X = 1)\).

\[P(X = 1) = P(X = 1 \mid Y = 1)P(Y = 1) + P(X = 1 \mid Y = 0)P(Y = 0) = (0.9)(0.3) + (0.2)(0.7) = 0.41\]

Step 2 (Bayes via product rule):

\[P(Y = 1 \mid X = 1) = \frac{P(X = 1 \mid Y = 1)\,P(Y = 1)}{P(X = 1)} = \frac{0.27}{0.41} \approx 0.659\]
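The two steps map directly onto two lines of Python (a sketch with the example's numbers):

```python
p_y1 = 0.3
p_x1_given_y1 = 0.9
p_x1_given_y0 = 0.2

# Sum rule over the product rule: P(X=1) = sum_y P(X=1 | y) P(y)
p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * (1 - p_y1)

# Product rule rearranged (Bayes): P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / P(X=1)
posterior = p_x1_given_y1 * p_y1 / p_x1

print(round(p_x1, 2), round(posterior, 4))  # 0.41 0.6585
```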
Reference: Table of Common Distributions¶
Discrete Distributions¶
| Distribution | PMF \(P(X = k)\) | Mean \(E[X]\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| \(\text{Bernoulli}(p)\) | \(p^k(1-p)^{1-k}\), \(k \in \{0,1\}\) | \(p\) | \(p(1-p)\) |
| \(\text{Binomial}(n, p)\) | \(\binom{n}{k}p^k(1-p)^{n-k}\), \(k = 0,\ldots,n\) | \(np\) | \(np(1-p)\) |
| \(\text{Geometric}(p)\) | \((1-p)^{k-1}p\), \(k = 1, 2, \ldots\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) |
| \(\text{Poisson}(\lambda)\) | \(\frac{\lambda^k e^{-\lambda}}{k!}\), \(k = 0, 1, 2, \ldots\) | \(\lambda\) | \(\lambda\) |
Continuous Distributions¶
| Distribution | PDF \(f(x)\) | Mean \(E[X]\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| \(\text{Uniform}(a,b)\) | \(\frac{1}{b-a}\) for \(x \in [a,b]\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| \(\text{Exponential}(\lambda)\) | \(\lambda e^{-\lambda x}\) for \(x \geq 0\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
| \(\mathcal{N}(\mu, \sigma^2)\) | \(\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\) | \(\mu\) | \(\sigma^2\) |
| \(\text{Beta}(\alpha, \beta)\) | \(\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\) for \(x \in [0,1]\) | \(\frac{\alpha}{\alpha+\beta}\) | \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) |
Summary: Key Takeaways¶
Probability Foundations¶
- A probability space is \((\Omega, \mathcal{F}, P)\) satisfying the Kolmogorov axioms
- Conditional probability: \(P(A \mid B) = P(A \cap B) / P(B)\)
- Bayes' theorem: \(P(A \mid B) = P(B \mid A) P(A) / P(B)\)
Random Variables¶
- Discrete: described by PMFs; Continuous: described by PDFs
- CDF: \(F(x) = P(X \leq x)\) works for both types
Key Formulas¶
- Expected value: \(E[X] = \sum x \, p(x)\) or \(\int x \, f(x) \, dx\)
- Variance shortcut: \(\text{Var}(X) = E[X^2] - (E[X])^2\)
- Covariance: \(\text{Cov}(X,Y) = E[XY] - E[X]E[Y]\)
- Correlation: \(\rho_{XY} = \text{Cov}(X,Y) / (\sigma_X \sigma_Y)\)
Fundamental Rules¶
- Product rule: \(P(A, B) = P(A \mid B) P(B)\)
- Sum rule: \(P(A) = \sum_B P(A, B)\)
Practice Problems¶
Problem 1¶
A bag contains 5 red balls and 3 blue balls. Two balls are drawn without replacement. What is the probability that both are red?
Problem 2¶
A factory has two machines. Machine A produces 60% of items and Machine B produces 40%. The defect rate is 3% for Machine A and 5% for Machine B. If an item is found to be defective, what is the probability it came from Machine A?
Problem 3¶
Let \(X \sim \text{Binomial}(8, 0.3)\). Compute \(P(X = 2)\), \(E[X]\), and \(\text{Var}(X)\).
Problem 4¶
Let \(X \sim \mathcal{N}(50, 25)\) (mean 50, variance 25, so \(\sigma = 5\)). Find:
- (a) \(P(X > 60)\)
- (b) \(P(40 < X < 55)\)
Problem 5¶
Random variables \(X\) and \(Y\) have \(E[X] = 3\), \(E[Y] = 5\), \(E[X^2] = 13\), \(E[Y^2] = 30\), and \(E[XY] = 16\). Compute \(\text{Cov}(X,Y)\), \(\text{Var}(X)\), \(\text{Var}(Y)\), and the correlation \(\rho_{XY}\).
Problem 6¶
Consider the continuous random variable \(X\) with PDF:

\[f(x) = \begin{cases} c\,x^2 & 0 \leq x \leq 2 \\ 0 & \text{otherwise} \end{cases}\]
- (a) Find the constant \(c\) so that \(f\) is a valid PDF.
- (b) Compute \(E[X]\).
- (c) Compute \(\text{Var}(X)\).
Solutions¶
Solution 1:
Use the product rule (chain rule) for drawing without replacement:

\[P(\text{both red}) = P(R_1)\,P(R_2 \mid R_1) = \frac{5}{8} \cdot \frac{4}{7} = \frac{20}{56} = \frac{5}{14} \approx 0.357\]
Solution 2:
Apply Bayes' theorem. Let \(A\) = "from Machine A," \(B\) = "from Machine B," and \(D\) = "defective."
- \(P(A) = 0.6\), \(P(B) = 0.4\)
- \(P(D \mid A) = 0.03\), \(P(D \mid B) = 0.05\)
First, compute \(P(D)\) using the law of total probability:

\[P(D) = P(D \mid A)P(A) + P(D \mid B)P(B) = (0.03)(0.6) + (0.05)(0.4) = 0.018 + 0.020 = 0.038\]

Then:

\[P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)} = \frac{0.018}{0.038} \approx 0.474\]
There is about a 47.4% chance the defective item came from Machine A.
Solution 3:
\(X \sim \text{Binomial}(8, 0.3)\).

\[P(X = 2) = \binom{8}{2}(0.3)^2(0.7)^6 = 28 \times 0.09 \times 0.117649 \approx 0.296\]

\[E[X] = np = 8 \times 0.3 = 2.4, \qquad \text{Var}(X) = np(1-p) = 8 \times 0.3 \times 0.7 = 1.68\]
Solution 4:
\(X \sim \mathcal{N}(50, 25)\), so \(\mu = 50\) and \(\sigma = 5\).
(a) Standardize:

\[z = \frac{60 - 50}{5} = 2, \qquad P(X > 60) = P(Z > 2) \approx 0.0228\]
About 2.3% of the distribution lies above 60.
(b) Standardize both bounds:

\[z_1 = \frac{40 - 50}{5} = -2, \qquad z_2 = \frac{55 - 50}{5} = 1\]

\[P(40 < X < 55) = \Phi(1) - \Phi(-2) \approx 0.8413 - 0.0228 = 0.8185\]
About 81.9% of the distribution falls between 40 and 55.
Solution 5:
Covariance:

\[\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = 16 - (3)(5) = 1\]

Variance of \(X\):

\[\text{Var}(X) = E[X^2] - (E[X])^2 = 13 - 9 = 4\]

Variance of \(Y\):

\[\text{Var}(Y) = E[Y^2] - (E[Y])^2 = 30 - 25 = 5\]

Correlation:

\[\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{1}{\sqrt{4}\sqrt{5}} = \frac{1}{2\sqrt{5}} \approx 0.224\]
A weak positive linear association.
Solution 6:
(a) For \(f\) to be a valid PDF, the total area must equal 1:

\[\int_{0}^{2} c\,x^2 \, dx = c \cdot \frac{x^3}{3}\Big|_0^2 = \frac{8c}{3} = 1 \quad \Rightarrow \quad c = \frac{3}{8}\]

(b) Expected value:

\[E[X] = \int_{0}^{2} x \cdot \frac{3}{8}x^2 \, dx = \frac{3}{8} \cdot \frac{x^4}{4}\Big|_0^2 = \frac{3}{8} \cdot 4 = \frac{3}{2}\]

(c) First compute \(E[X^2]\):

\[E[X^2] = \int_{0}^{2} x^2 \cdot \frac{3}{8}x^2 \, dx = \frac{3}{8} \cdot \frac{x^5}{5}\Big|_0^2 = \frac{3}{8} \cdot \frac{32}{5} = \frac{12}{5}\]

Then apply the variance shortcut:

\[\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{12}{5} - \frac{9}{4} = \frac{48 - 45}{20} = \frac{3}{20} = 0.15\]
Previous: Tutorial 4 - Matrix Decompositions Next: Tutorial 6 - Optimization and Gradient Descent