Tutorial 4: Vector Calculus¶
Course: Mathematics for Machine Learning
Instructor: Mohammed Alnemari
📚 Learning Objectives¶
By the end of this tutorial, you will understand:
- Differentiation of univariate functions and basic derivative rules
- Taylor series and polynomial approximation
- Partial derivatives and gradients
- Jacobians for vector-valued functions
- Matrix calculus rules and gradient identities
- The chain rule in single-variable and multivariate settings
- Backpropagation and computation graphs
- Higher-order derivatives and the Hessian matrix
- Useful gradient identities for machine learning
Part 1: Differentiation of Univariate Functions¶
1.1 Definition of the Derivative¶
The derivative of a function \(f(x)\) measures the instantaneous rate of change of \(f\) with respect to \(x\):
\(f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}\)
Geometric interpretation: The derivative at a point gives the slope of the tangent line to the curve at that point.
1.2 Basic Derivative Rules¶
| Rule | Function \(f(x)\) | Derivative \(f'(x)\) | Example |
|---|---|---|---|
| Constant | \(c\) | \(0\) | \(\frac{d}{dx}(5) = 0\) |
| Power Rule | \(x^n\) | \(nx^{n-1}\) | \(\frac{d}{dx}(x^3) = 3x^2\) |
| Exponential | \(e^x\) | \(e^x\) | \(\frac{d}{dx}(e^x) = e^x\) |
| Logarithm | \(\ln(x)\) | \(\frac{1}{x}\) | \(\frac{d}{dx}(\ln x) = \frac{1}{x}\) |
| Sine | \(\sin(x)\) | \(\cos(x)\) | \(\frac{d}{dx}(\sin x) = \cos x\) |
| Cosine | \(\cos(x)\) | \(-\sin(x)\) | \(\frac{d}{dx}(\cos x) = -\sin x\) |
1.3 Combination Rules¶
Sum Rule: \(\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)\)
Product Rule: \(\frac{d}{dx}\left[f(x) \cdot g(x)\right] = f'(x) \cdot g(x) + f(x) \cdot g'(x)\)
Quotient Rule: \(\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x) \cdot g(x) - f(x) \cdot g'(x)}{\left[g(x)\right]^2}\)
Chain Rule (single variable): \(\frac{d}{dx}\left[f(g(x))\right] = f'(g(x)) \cdot g'(x)\)
1.4 Worked Examples¶
Example 1 (Product Rule): Find \(\frac{d}{dx}\left[x^2 \cdot e^x\right]\).
With \(f(x) = x^2\) and \(g(x) = e^x\):
\(\frac{d}{dx}\left[x^2 e^x\right] = 2x \cdot e^x + x^2 \cdot e^x = (2x + x^2)e^x\)
Example 2 (Chain Rule): Find \(\frac{d}{dx}\left[e^{-x^2}\right]\).
Let \(u = -x^2\), so \(f(u) = e^u\). Then:
\(\frac{d}{dx}\left[e^{-x^2}\right] = e^{-x^2} \cdot (-2x) = -2x\,e^{-x^2}\)
Example 3 (Quotient Rule): Find the derivative of the sigmoid function \(\sigma(x) = \frac{1}{1 + e^{-x}}\).
Applying the quotient rule (with constant numerator):
\(\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}\)
This simplifies to the elegant identity:
\(\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)\)
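The identity \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) is easy to verify numerically. A minimal sketch using a central-difference derivative (helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Analytic derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7
print(sigmoid_prime(x))                # analytic value
print(numeric_derivative(sigmoid, x))  # numerical value, agrees closely
```

At \(x = 0\) the identity gives \(\sigma'(0) = 0.5 \cdot 0.5 = 0.25\), the sigmoid's maximum slope.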
Part 2: Taylor Series¶
2.1 Taylor Series Definition¶
A Taylor series expands a smooth function \(f(x)\) around a point \(x_0\) as an infinite polynomial:
\(f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n\)
2.2 Taylor Polynomial Approximations¶
In machine learning, we often use truncated Taylor polynomials for local approximation.
First-order (linear) approximation: \(f(x) \approx f(x_0) + f'(x_0)(x - x_0)\)
Second-order (quadratic) approximation: \(f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2}(x - x_0)^2\)
2.3 Why Taylor Series Matter in ML¶
| Approximation Order | Use in Machine Learning |
|---|---|
| First-order | Gradient descent (linear approximation of loss function) |
| Second-order | Newton's method (quadratic approximation of loss function) |
2.4 Worked Example¶
Example: Approximate \(e^x\) around \(x_0 = 0\) to second order.
We need \(f(0)\), \(f'(0)\), and \(f''(0)\). Since \(f(x) = e^x\), all derivatives are \(e^x\), so \(f(0) = f'(0) = f''(0) = 1\). Therefore:
\(e^x \approx 1 + x + \frac{x^2}{2}\)
Checking: at \(x = 0.1\), the approximation gives \(1 + 0.1 + 0.005 = 1.105\), while the true value is \(e^{0.1} = 1.10517...\)
The approximation is excellent near \(x_0\).
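The quality of the quadratic approximation \(e^x \approx 1 + x + x^2/2\) can be checked in a few lines (the function name is illustrative):

```python
import math

def taylor2_exp(x):
    # Second-order Taylor polynomial of e^x around x0 = 0: 1 + x + x^2/2
    return 1.0 + x + x**2 / 2.0

x = 0.1
approx = taylor2_exp(x)      # 1.105
true = math.exp(x)           # 1.10517...
print(approx, true, abs(approx - true))  # error is on the order of 1e-4
```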
Part 3: Partial Derivatives¶
3.1 Definition¶
For a function of multiple variables \(f(x_1, x_2, \ldots, x_n)\), the partial derivative with respect to \(x_i\) measures how \(f\) changes when only \(x_i\) varies, with all other variables held constant.
3.2 Notation¶
Partial derivatives have several equivalent notations:
| Notation | Meaning |
|---|---|
| \(\frac{\partial f}{\partial x}\) | Partial derivative of \(f\) with respect to \(x\) |
| \(f_x\) | Shorthand for \(\frac{\partial f}{\partial x}\) |
| \(\partial_x f\) | Another shorthand |
| \(D_x f\) | Differential operator notation |
3.3 How to Compute Partial Derivatives¶
Rule: To find \(\frac{\partial f}{\partial x_i}\), treat every variable except \(x_i\) as a constant, then differentiate with respect to \(x_i\) using the standard rules.
Example 1: Let \(f(x, y) = x^2 y + 3xy^2 - 2y\).
Treating \(y\) as a constant: \(\frac{\partial f}{\partial x} = 2xy + 3y^2\)
Treating \(x\) as a constant: \(\frac{\partial f}{\partial y} = x^2 + 6xy - 2\)
Example 2: Let \(f(x, y) = e^{xy} + \sin(x)\).
\(\frac{\partial f}{\partial x} = y\,e^{xy} + \cos(x), \qquad \frac{\partial f}{\partial y} = x\,e^{xy}\)
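Partial derivatives can be sanity-checked by central differences: perturb one variable while holding the other fixed. A sketch for Example 1 (helper names are illustrative):

```python
def f(x, y):
    return x**2 * y + 3 * x * y**2 - 2 * y

def partial_x(f, x, y, h=1e-6):
    # Central difference in x with y held constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # Central difference in y with x held constant
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Analytic: df/dx = 2xy + 3y^2, df/dy = x^2 + 6xy - 2
x, y = 2.0, 1.0
print(partial_x(f, x, y))  # close to 2*2*1 + 3*1 = 7
print(partial_y(f, x, y))  # close to 4 + 12 - 2 = 14
```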
Part 4: Gradients¶
4.1 Definition¶
The gradient of a scalar-valued function \(f: \mathbb{R}^n \to \mathbb{R}\) is a vector of all its partial derivatives:
\(\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n\)
The gradient "lives" in the same space as the input \(\mathbf{x}\).
4.2 Gradient as Direction of Steepest Ascent¶
The gradient has a fundamental geometric meaning:
- \(\nabla f(\mathbf{x})\) points in the direction of steepest ascent of \(f\) at \(\mathbf{x}\)
- \(-\nabla f(\mathbf{x})\) points in the direction of steepest descent
- \(\|\nabla f(\mathbf{x})\|\) gives the rate of steepest ascent
This is why gradient descent updates parameters as:
\(\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)\)
where \(\eta > 0\) is the learning rate.
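The update rule can be sketched on an assumed toy objective, \(f(x_1, x_2) = x_1^2 + 10x_2^2\) (chosen here for illustration; any smooth function with a known gradient works):

```python
import numpy as np

def grad_f(x):
    # Gradient of the assumed quadratic bowl f(x) = x1^2 + 10*x2^2
    return np.array([2 * x[0], 20 * x[1]])

eta = 0.05                     # learning rate
x = np.array([3.0, 2.0])       # starting point
for _ in range(200):
    x = x - eta * grad_f(x)    # x <- x - eta * grad f(x)
print(x)  # converges toward the minimizer [0, 0]
```

Note that the step size matters: with this objective, a learning rate above 0.1 would make the \(x_2\) coordinate diverge rather than converge.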
4.3 Worked Example¶
Example: Find the gradient of \(f(x_1, x_2, x_3) = x_1^2 + 2x_1 x_2 + x_3^3\).
\(\nabla f = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 \\ 3x_3^2 \end{bmatrix}\)
At the point \(\mathbf{x} = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix}\):
\(\nabla f = \begin{bmatrix} 2 + 4 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \\ 3 \end{bmatrix}\)
The direction of steepest descent at this point is \(-\nabla f = \begin{bmatrix} -6 \\ -2 \\ -3 \end{bmatrix}\).
Part 5: Jacobians¶
5.1 Definition¶
For a vector-valued function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) with components \(f_1, f_2, \ldots, f_m\), the Jacobian is the \(m \times n\) matrix of all first-order partial derivatives:
\(\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\)
Key observation: Each row of the Jacobian is the gradient (transposed) of one output component \(f_i\).
5.2 Relationship to Gradients¶
| Object | Input | Output | Derivative |
|---|---|---|---|
| Gradient \(\nabla f\) | \(\mathbb{R}^n\) | \(\mathbb{R}\) (scalar) | Vector in \(\mathbb{R}^n\) |
| Jacobian \(\mathbf{J}\) | \(\mathbb{R}^n\) | \(\mathbb{R}^m\) (vector) | Matrix in \(\mathbb{R}^{m \times n}\) |
5.3 Worked Example¶
Example: Let \(\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^3\) be defined by:
\(\mathbf{f}(x, y) = \begin{bmatrix} x^2 \\ xy \\ y^3 \end{bmatrix}\)
The Jacobian is:
\(\mathbf{J} = \begin{bmatrix} 2x & 0 \\ y & x \\ 0 & 3y^2 \end{bmatrix}\)
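A finite-difference Jacobian makes a good generic check: perturb each input coordinate and record how every output moves. A sketch on a sample map \(\mathbf{f}(x, y) = (x^2,\, xy,\, y^3)\) (chosen here for illustration):

```python
import numpy as np

def f(v):
    # Sample map R^2 -> R^3 (illustrative): f(x, y) = (x^2, x*y, y^3)
    x, y = v
    return np.array([x**2, x * y, y**3])

def jacobian_fd(f, v, h=1e-6):
    # Finite-difference Jacobian: column j holds the partials w.r.t. v_j
    v = np.asarray(v, dtype=float)
    m = len(f(v))
    J = np.zeros((m, len(v)))
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        J[:, j] = (f(v + e) - f(v - e)) / (2 * h)
    return J

def jacobian_exact(v):
    # Analytic Jacobian of the sample map
    x, y = v
    return np.array([[2 * x, 0.0],
                     [y, x],
                     [0.0, 3 * y**2]])

v = np.array([1.5, -0.5])
print(jacobian_fd(f, v))
print(jacobian_exact(v))  # the two matrices agree closely
```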
Part 6: Gradients of Matrices¶
6.1 Matrix Calculus Rules¶
When working with vectors and matrices, we need special differentiation rules.
Gradient of a linear function: For \(f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}\) where \(\mathbf{a}, \mathbf{x} \in \mathbb{R}^n\):
\(\nabla_{\mathbf{x}} f = \mathbf{a}\)
Gradient of a quadratic form: For \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}\) where \(\mathbf{A} \in \mathbb{R}^{n \times n}\):
\(\nabla_{\mathbf{x}} f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)), this simplifies to:
\(\nabla_{\mathbf{x}} f = 2\mathbf{A}\mathbf{x}\)
6.2 Useful Matrix Calculus Identities¶
| Function \(f(\mathbf{x})\) | Gradient \(\nabla_{\mathbf{x}} f\) |
|---|---|
| \(\mathbf{a}^T \mathbf{x}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{a}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{x}\) | \(2\mathbf{x}\) |
| \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\) |
| \(\|\mathbf{x} - \mathbf{b}\|^2\) | \(2(\mathbf{x} - \mathbf{b})\) |
| \(\mathbf{b}^T \mathbf{A} \mathbf{x}\) | \(\mathbf{A}^T \mathbf{b}\) |
6.3 Worked Example¶
Example: Find the gradient of the least-squares loss.
The loss function is:
\(L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})\)
Expanding:
\(L(\mathbf{w}) = \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\)
Taking the gradient with respect to \(\mathbf{w}\) (note \(\mathbf{X}^T\mathbf{X}\) is symmetric):
\(\nabla_{\mathbf{w}} L = 2\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{X}^T\mathbf{y}\)
Setting \(\nabla_{\mathbf{w}} L = \mathbf{0}\) gives the normal equation:
\(\mathbf{X}^T\mathbf{X}\mathbf{w}^* = \mathbf{X}^T\mathbf{y} \quad\Rightarrow\quad \mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
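The normal equation can be exercised on synthetic data: solve it directly, confirm the gradient vanishes at the solution, and cross-check against NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # synthetic design matrix
y = rng.normal(size=50)        # synthetic targets

# Normal equation: solve X^T X w = X^T y (prefer solve over explicit inverse)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer, the gradient 2 X^T (X w - y) should vanish
grad = 2 * X.T @ (X @ w_star - y)
print(grad)  # numerically zero

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_star, w_lstsq))
```

Solving the linear system rather than forming \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly is both cheaper and numerically safer.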
Part 7: The Chain Rule¶
7.1 Single Variable Chain Rule¶
If \(y = f(g(x))\), then:
\(\frac{dy}{dx} = f'(g(x)) \cdot g'(x)\)
Example: \(y = (3x + 1)^4\)
Let \(g = 3x + 1\), so \(y = g^4\). Then:
\(\frac{dy}{dx} = 4g^3 \cdot 3 = 12(3x + 1)^3\)
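The chain-rule result \(12(3x+1)^3\) can be confirmed against a central-difference estimate (function names are illustrative):

```python
def f(x):
    return (3 * x + 1) ** 4

def df_dx(x):
    # Chain rule: 4*(3x+1)^3 * 3 = 12*(3x+1)^3
    return 12 * (3 * x + 1) ** 3

h = 1e-6
x = 0.5
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(df_dx(x), numeric)  # 12 * 2.5^3 = 187.5, matched numerically
```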
7.2 Multivariate Chain Rule¶
If \(f\) depends on \(\mathbf{x}\) through intermediate variables \(\mathbf{u} = \mathbf{u}(\mathbf{x})\):
Then:
\(\frac{\partial f}{\partial x_i} = \sum_{j} \frac{\partial f}{\partial u_j} \frac{\partial u_j}{\partial x_i}\)
In matrix form (using Jacobians):
\(\frac{\partial f}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}\)
7.3 Chain Rule for Neural Networks¶
Consider a simple two-layer neural network:
\(\mathbf{z}_1 = \mathbf{W}_1\mathbf{x}, \quad \mathbf{a}_1 = \sigma(\mathbf{z}_1), \quad \hat{\mathbf{y}} = \mathbf{W}_2\mathbf{a}_1, \quad L = \ell(\hat{\mathbf{y}}, \mathbf{y})\)
To find \(\frac{\partial L}{\partial \mathbf{W}_1}\), we apply the chain rule through the entire computation:
\(\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{a}_1} \cdot \frac{\partial \mathbf{a}_1}{\partial \mathbf{z}_1} \cdot \frac{\partial \mathbf{z}_1}{\partial \mathbf{W}_1}\)
Each term in this product corresponds to a specific operation in the network.
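A minimal sketch of this chain applied to a tiny two-layer network, assuming a tanh activation, a squared-error loss, and illustrative layer sizes (none of which are fixed by the text). The analytic gradient of \(\mathbf{W}_1\) is validated with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)           # input
t = rng.normal(size=2)           # target
W1 = rng.normal(size=(4, 3))     # first-layer weights (illustrative shape)
W2 = rng.normal(size=(2, 4))     # second-layer weights

def forward(W1, W2, x, t):
    z1 = W1 @ x                           # z1 = W1 x
    a1 = np.tanh(z1)                      # a1 = sigma(z1), tanh assumed
    yhat = W2 @ a1                        # yhat = W2 a1
    L = 0.5 * np.sum((yhat - t) ** 2)     # squared-error loss assumed
    return L, (z1, a1, yhat)

def grad_W1(W1, W2, x, t):
    # Chain rule: dL/dW1 = (dL/dyhat -> da1 -> dz1) outer x
    L, (z1, a1, yhat) = forward(W1, W2, x, t)
    dL_dyhat = yhat - t
    dL_da1 = W2.T @ dL_dyhat
    dL_dz1 = dL_da1 * (1 - np.tanh(z1) ** 2)   # tanh'(z) = 1 - tanh(z)^2
    return np.outer(dL_dz1, x)

# Finite-difference check of dL/dW1, entry by entry
analytic = grad_W1(W1, W2, x, t)
numeric = np.zeros_like(W1)
h = 1e-6
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        numeric[i, j] = (forward(Wp, W2, x, t)[0] - forward(Wm, W2, x, t)[0]) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # small (finite-difference error only)
```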
Part 8: Backpropagation¶
8.1 Computation Graphs¶
A computation graph represents a function as a directed acyclic graph (DAG) where:
- Nodes represent operations or variables
- Edges represent data flow
Example: For \(f(x, y) = (x + y) \cdot (y + 1)\):
    x ---\
          (+) = a ---\
    y ---/            (*) = f
    y ---\           /
          (+) = b --/
    1 ---/
Here \(a = x + y\), \(b = y + 1\), and \(f = a \cdot b\).
8.2 Forward Pass¶
In the forward pass, we compute the output by evaluating the graph from inputs to output.
Example: With \(x = 2\), \(y = 3\):
| Step | Computation | Value |
|---|---|---|
| 1 | \(a = x + y\) | \(a = 2 + 3 = 5\) |
| 2 | \(b = y + 1\) | \(b = 3 + 1 = 4\) |
| 3 | \(f = a \cdot b\) | \(f = 5 \cdot 4 = 20\) |
8.3 Backward Pass (Backpropagation)¶
In the backward pass, we compute gradients by traversing the graph from output to inputs, applying the chain rule at each node.
Starting from \(\frac{\partial f}{\partial f} = 1\):
| Step | Gradient | Computation | Value |
|---|---|---|---|
| 1 | \(\frac{\partial f}{\partial a}\) | \(b\) | \(4\) |
| 2 | \(\frac{\partial f}{\partial b}\) | \(a\) | \(5\) |
| 3 | \(\frac{\partial f}{\partial x}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} = b \cdot 1\) | \(4\) |
| 4 | \(\frac{\partial f}{\partial y}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial y} + \frac{\partial f}{\partial b} \cdot \frac{\partial b}{\partial y} = b \cdot 1 + a \cdot 1\) | \(4 + 5 = 9\) |
Note: Since \(y\) appears in two paths (\(a\) and \(b\)), we sum the contributions from both paths.
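The forward and backward passes above translate directly into code; the multi-path rule for \(y\) shows up as a sum of two contributions:

```python
def forward_backward(x, y):
    # Forward pass
    a = x + y        # a = x + y
    b = y + 1        # b = y + 1
    f = a * b        # f = a * b

    # Backward pass: chain rule from output to inputs
    df_df = 1.0
    df_da = df_df * b              # d(a*b)/da = b
    df_db = df_df * a              # d(a*b)/db = a
    df_dx = df_da * 1.0            # a = x + y, so da/dx = 1
    # y feeds both a and b, so its contributions are summed
    df_dy = df_da * 1.0 + df_db * 1.0
    return f, df_dx, df_dy

print(forward_backward(2, 3))  # (20, 4.0, 9.0)
```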
8.4 General Backpropagation Algorithm¶
For a computation graph with output \(L\):
- Forward pass: Compute all intermediate values from inputs to output
- Initialize: Set \(\frac{\partial L}{\partial L} = 1\)
- Backward pass: For each node \(v\) in reverse topological order, accumulate
  \(\frac{\partial L}{\partial v} = \sum_{c \in \text{children}(v)} \frac{\partial L}{\partial c} \cdot \frac{\partial c}{\partial v}\)
This is the foundation of training neural networks.
Part 9: Higher-Order Derivatives¶
9.1 Second-Order Partial Derivatives¶
For a function \(f(x_1, x_2, \ldots, x_n)\), we can differentiate partial derivatives again, producing second-order partials \(\frac{\partial^2 f}{\partial x_i \partial x_j}\).
Symmetry of mixed partials (Schwarz's theorem): If \(f\) has continuous second partial derivatives:
\(\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}\)
9.2 The Hessian Matrix¶
The Hessian collects all second-order partial derivatives into a matrix:
\(\mathbf{H} = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\)
Properties:
- The Hessian is symmetric (by Schwarz's theorem): \(\mathbf{H} = \mathbf{H}^T\)
- If \(\mathbf{H}\) is positive definite at a critical point, the point is a local minimum
- If \(\mathbf{H}\) is negative definite, the point is a local maximum
- If \(\mathbf{H}\) has both positive and negative eigenvalues, the point is a saddle point
9.3 Second-Order Taylor Expansion (Multivariate)¶
The multivariate second-order Taylor expansion around \(\mathbf{x}_0\) is:
\(f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T(\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T \mathbf{H}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)\)
This is the basis for Newton's method in optimization.
9.4 Worked Example¶
Example: Find the Hessian of \(f(x_1, x_2) = x_1^3 + 2x_1 x_2^2 - x_2\).
First, compute the gradient:
\(\nabla f = \begin{bmatrix} 3x_1^2 + 2x_2^2 \\ 4x_1 x_2 - 1 \end{bmatrix}\)
Then, compute the Hessian:
\(\mathbf{H} = \begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 \end{bmatrix}\)
Notice that \(\mathbf{H} = \mathbf{H}^T\), confirming symmetry.
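Second derivatives can also be checked numerically, using a second-order central-difference stencil for each entry \(H_{ij}\) (helper names are illustrative):

```python
import numpy as np

def f(v):
    # The worked example: f(x1, x2) = x1^3 + 2*x1*x2^2 - x2
    x1, x2 = v
    return x1**3 + 2 * x1 * x2**2 - x2

def hessian_exact(v):
    # Analytic Hessian: [[6 x1, 4 x2], [4 x2, 4 x1]]
    x1, x2 = v
    return np.array([[6 * x1, 4 * x2],
                     [4 * x2, 4 * x1]])

def hessian_fd(f, v, h=1e-4):
    # Central-difference estimate of each second partial d^2 f / dv_i dv_j
    v = np.asarray(v, dtype=float)
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4 * h**2)
    return H

v = np.array([1.0, 2.0])
print(hessian_exact(v))
print(hessian_fd(f, v))  # matches the analytic Hessian closely
```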
Part 10: Useful Gradient Identities¶
10.1 Reference Table¶
These identities appear frequently in machine learning derivations. Here \(\mathbf{x}, \mathbf{a}, \mathbf{b} \in \mathbb{R}^n\) and \(\mathbf{A} \in \mathbb{R}^{n \times n}\).
| # | Function | Gradient \(\nabla_{\mathbf{x}}\) |
|---|---|---|
| 1 | \(\mathbf{a}^T \mathbf{x}\) | \(\mathbf{a}\) |
| 2 | \(\mathbf{x}^T \mathbf{x}\) | \(2\mathbf{x}\) |
| 3 | \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\) |
| 4 | \((\mathbf{A}\mathbf{x} - \mathbf{b})^T(\mathbf{A}\mathbf{x} - \mathbf{b})\) | \(2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})\) |
| 5 | \(\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}\) | \(2\mathbf{x}\) |
| 6 | \(\mathbf{b}^T \mathbf{A} \mathbf{x}\) | \(\mathbf{A}^T \mathbf{b}\) |
10.2 Deriving Identity 3¶
Let us prove \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\).
Write \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x} = \sum_{i}\sum_{j} x_i \, A_{ij} \, x_j\).
Taking the partial derivative with respect to \(x_k\) (which appears once as \(x_i\) with \(i = k\) and once as \(x_j\) with \(j = k\)):
\(\frac{\partial f}{\partial x_k} = \sum_{j} A_{kj} x_j + \sum_{i} x_i A_{ik}\)
Collecting into a vector:
\(\nabla_{\mathbf{x}} f = \mathbf{A}\mathbf{x} + \mathbf{A}^T\mathbf{x} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
10.3 When \(\mathbf{A}\) is Symmetric¶
If \(\mathbf{A} = \mathbf{A}^T\), then \(\mathbf{A} + \mathbf{A}^T = 2\mathbf{A}\), so:
\(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x}\)
This is a very common case in machine learning, since covariance matrices and Hessians are symmetric.
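Identity 3 holds for any \(\mathbf{A}\), not just symmetric ones, and a quick numerical check on a random non-symmetric matrix makes that concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))          # a general (non-symmetric) matrix
x = rng.normal(size=4)

def f(x):
    return x @ A @ x                 # quadratic form x^T A x

# Identity 3: gradient = (A + A^T) x
grad_identity = (A + A.T) @ x

# Finite-difference gradient, one coordinate at a time
h = 1e-6
grad_fd = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(4)
])

print(np.max(np.abs(grad_identity - grad_fd)))  # numerically zero
```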
Summary: Key Takeaways¶
Differentiation Fundamentals¶
- Derivatives measure rates of change; partial derivatives fix all variables except one
- The gradient \(\nabla f\) collects all partial derivatives into a vector
- The Jacobian generalizes the gradient for vector-valued functions
The Chain Rule and Backpropagation¶
- The multivariate chain rule composes Jacobians through multiplication
- Backpropagation applies the chain rule on a computation graph, working backward from the loss
- Gradients with respect to variables appearing in multiple paths are summed
Higher-Order Information¶
- The Hessian matrix \(\mathbf{H}\) captures second-order (curvature) information
- Positive definite Hessian at a critical point indicates a local minimum
Matrix Calculus¶
- \(\nabla_{\mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}\)
- \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
- The normal equation for least squares: \(\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
Practice Problems¶
Problem 1¶
Find the derivative of \(f(x) = x^3 e^{2x}\) using the product and chain rules.
Problem 2¶
Let \(f(x, y) = x^2 y - 3xy^3 + 2x\). Find \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\), then compute the gradient at the point \((1, -1)\).
Problem 3¶
Compute the Jacobian of the function \(\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2\) defined by:
\(\mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ xy - y^2 \end{bmatrix}\)
Problem 4¶
Find the Hessian of \(f(x_1, x_2) = x_1^2 + 4x_1 x_2 + x_2^2\). Is this Hessian positive definite?
Problem 5¶
Consider the computation graph for \(f(x) = (x + 2)^2\). Perform the forward pass with \(x = 3\), then use backpropagation to compute \(\frac{df}{dx}\).
Problem 6¶
Let \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) and \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\). Compute \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x})\) using the identity from Part 10, and verify by expanding \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) and differentiating directly.
Solutions¶
Solution 1:
Using the product rule with \(u = x^3\) and \(v = e^{2x}\):
\(f'(x) = 3x^2 e^{2x} + x^3 \cdot 2e^{2x} = x^2 e^{2x}(3 + 2x)\)
Solution 2:
\(\frac{\partial f}{\partial x} = 2xy - 3y^3 + 2, \qquad \frac{\partial f}{\partial y} = x^2 - 9xy^2\)
At \((1, -1)\):
\(\nabla f(1, -1) = \begin{bmatrix} 2(1)(-1) - 3(-1)^3 + 2 \\ (1)^2 - 9(1)(-1)^2 \end{bmatrix} = \begin{bmatrix} 3 \\ -8 \end{bmatrix}\)
Solution 3:
For \(f_1 = x^2 + y\): \(\frac{\partial f_1}{\partial x} = 2x\), \(\frac{\partial f_1}{\partial y} = 1\)
For \(f_2 = xy - y^2\): \(\frac{\partial f_2}{\partial x} = y\), \(\frac{\partial f_2}{\partial y} = x - 2y\)
So the Jacobian is:
\(\mathbf{J} = \begin{bmatrix} 2x & 1 \\ y & x - 2y \end{bmatrix}\)
Solution 4:
First, compute the gradient:
\(\nabla f = \begin{bmatrix} 2x_1 + 4x_2 \\ 4x_1 + 2x_2 \end{bmatrix}\)
The Hessian (matrix of second derivatives):
\(\mathbf{H} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix}\)
To check positive definiteness, compute the eigenvalues. For a \(2 \times 2\) matrix:
\(\det(\mathbf{H} - \lambda\mathbf{I}) = (2 - \lambda)^2 - 16 = 0 \quad\Rightarrow\quad 2 - \lambda = \pm 4\)
So \(\lambda_1 = 6\) and \(\lambda_2 = -2\).
Since one eigenvalue is negative, the Hessian is not positive definite. It is indefinite, meaning any critical point of \(f\) would be a saddle point.
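The eigenvalue computation is quick to confirm with NumPy (`eigvalsh` is appropriate here since \(\mathbf{H}\) is symmetric):

```python
import numpy as np

H = np.array([[2.0, 4.0],
              [4.0, 2.0]])
eigvals = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix, ascending
print(eigvals)  # [-2.  6.]
```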
Solution 5:
Decompose \(f(x) = (x + 2)^2\) into elementary steps:
- \(a = x + 2\)
- \(f = a^2\)
Forward pass with \(x = 3\):
| Step | Computation | Value |
|---|---|---|
| 1 | \(a = x + 2\) | \(a = 3 + 2 = 5\) |
| 2 | \(f = a^2\) | \(f = 5^2 = 25\) |
Backward pass:
| Step | Gradient | Computation | Value |
|---|---|---|---|
| 1 | \(\frac{\partial f}{\partial f}\) | (seed) | \(1\) |
| 2 | \(\frac{\partial f}{\partial a}\) | \(2a\) | \(2(5) = 10\) |
| 3 | \(\frac{\partial f}{\partial x}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} = 10 \cdot 1\) | \(10\) |
Verification: \(f'(x) = 2(x + 2)\), so \(f'(3) = 2(5) = 10\). Correct!
Solution 6:
Using the identity:
\(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
Since \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)):
\(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x} = \begin{bmatrix} 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{bmatrix}\)
Direct verification:
Expand \(\mathbf{x}^T \mathbf{A} \mathbf{x}\):
\(\mathbf{x}^T \mathbf{A} \mathbf{x} = 2x_1^2 + 2x_1 x_2 + 3x_2^2\)
Taking partial derivatives:
\(\frac{\partial}{\partial x_1} = 4x_1 + 2x_2, \qquad \frac{\partial}{\partial x_2} = 2x_1 + 6x_2\)
Both methods agree, confirming the identity.
Next: Tutorial 5 - Probability and Distributions