Tutorial 4: Vector Calculus

Course: Mathematics for Machine Learning
Instructor: Mohammed Alnemari


📚 Learning Objectives

By the end of this tutorial, you will understand:

  1. Differentiation of univariate functions and basic derivative rules
  2. Taylor series and polynomial approximation
  3. Partial derivatives and gradients
  4. Jacobians for vector-valued functions
  5. Matrix calculus rules and gradient identities
  6. The chain rule in single-variable and multivariate settings
  7. Backpropagation and computation graphs
  8. Higher-order derivatives and the Hessian matrix
  9. Useful gradient identities for machine learning

Part 1: Differentiation of Univariate Functions

1.1 Definition of the Derivative

The derivative of a function \(f(x)\) measures the instantaneous rate of change of \(f\) with respect to \(x\).

\[\frac{df}{dx} = f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}\]

Geometric interpretation: The derivative at a point gives the slope of the tangent line to the curve at that point.

1.2 Basic Derivative Rules

| Rule | Function \(f(x)\) | Derivative \(f'(x)\) | Example |
|---|---|---|---|
| Constant | \(c\) | \(0\) | \(\frac{d}{dx}(5) = 0\) |
| Power Rule | \(x^n\) | \(nx^{n-1}\) | \(\frac{d}{dx}(x^3) = 3x^2\) |
| Exponential | \(e^x\) | \(e^x\) | \(\frac{d}{dx}(e^x) = e^x\) |
| Logarithm | \(\ln(x)\) | \(\frac{1}{x}\) | \(\frac{d}{dx}(\ln x) = \frac{1}{x}\) |
| Sine | \(\sin(x)\) | \(\cos(x)\) | \(\frac{d}{dx}(\sin x) = \cos x\) |
| Cosine | \(\cos(x)\) | \(-\sin(x)\) | \(\frac{d}{dx}(\cos x) = -\sin x\) |

1.3 Combination Rules

Sum Rule: \(\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)\)

Product Rule: \(\frac{d}{dx}\left[f(x) \cdot g(x)\right] = f'(x) \cdot g(x) + f(x) \cdot g'(x)\)

Quotient Rule: \(\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x) \cdot g(x) - f(x) \cdot g'(x)}{\left[g(x)\right]^2}\)

Chain Rule (single variable): \(\frac{d}{dx}\left[f(g(x))\right] = f'(g(x)) \cdot g'(x)\)

1.4 Worked Examples

Example 1 (Product Rule): Find \(\frac{d}{dx}\left[x^2 \cdot e^x\right]\).

\[\frac{d}{dx}\left[x^2 \cdot e^x\right] = 2x \cdot e^x + x^2 \cdot e^x = e^x(2x + x^2)\]

Example 2 (Chain Rule): Find \(\frac{d}{dx}\left[e^{-x^2}\right]\).

Let \(u = -x^2\), so \(f(u) = e^u\).

\[\frac{d}{dx}\left[e^{-x^2}\right] = e^{-x^2} \cdot (-2x) = -2x \, e^{-x^2}\]

Example 3 (Quotient Rule): Find the derivative of the sigmoid function \(\sigma(x) = \frac{1}{1 + e^{-x}}\).

\[\sigma'(x) = \frac{0 \cdot (1 + e^{-x}) - 1 \cdot (-e^{-x})}{(1 + e^{-x})^2} = \frac{e^{-x}}{(1 + e^{-x})^2}\]

This simplifies to the elegant identity:

\[\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)\]
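
To make this concrete, here is a minimal Python sketch (the test point and step size are arbitrary choices) that checks the identity against a central finite-difference estimate of \(\sigma'(x)\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numerical_derivative(f, x, h=1e-5):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7  # arbitrary test point
print(numerical_derivative(sigmoid, x))   # finite-difference estimate of sigma'(x)
print(sigmoid(x) * (1 - sigmoid(x)))      # identity sigma'(x) = sigma(x)(1 - sigma(x)); the two values agree
```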

Part 2: Taylor Series

2.1 Taylor Series Definition

A Taylor series expands a smooth function \(f(x)\) around a point \(x_0\) as an infinite polynomial:

\[f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k\]
\[= f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f'''(x_0)}{3!}(x - x_0)^3 + \cdots\]

2.2 Taylor Polynomial Approximations

In machine learning, we often use truncated Taylor polynomials for local approximation.

First-order (linear) approximation: \(f(x) \approx f(x_0) + f'(x_0)(x - x_0)\)

Second-order (quadratic) approximation: \(f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2}(x - x_0)^2\)

2.3 Why Taylor Series Matter in ML

| Approximation Order | Use in Machine Learning |
|---|---|
| First-order | Gradient descent (linear approximation of the loss function) |
| Second-order | Newton's method (quadratic approximation of the loss function) |

2.4 Worked Example

Example: Approximate \(e^x\) around \(x_0 = 0\) to second order.

We need \(f(0)\), \(f'(0)\), and \(f''(0)\). Since \(f(x) = e^x\), all derivatives are \(e^x\), so \(f(0) = f'(0) = f''(0) = 1\).

\[e^x \approx 1 + x + \frac{x^2}{2}\]

Checking: at \(x = 0.1\), the true value is \(e^{0.1} = 1.10517...\)

\[1 + 0.1 + \frac{0.01}{2} = 1.105\]

The approximation is excellent near \(x_0\)!
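
A short Python sketch (with test points chosen arbitrarily) shows how the quadratic approximation is accurate near \(x_0 = 0\) and degrades farther away:

```python
import numpy as np

def taylor2_exp(x):
    # Second-order Taylor polynomial of e^x around x0 = 0
    return 1 + x + x**2 / 2

for x in [0.1, 0.5, 1.0]:
    exact = np.exp(x)
    approx = taylor2_exp(x)
    print(f"x={x}: exact={exact:.5f}, approx={approx:.5f}, error={abs(exact - approx):.5f}")
```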


Part 3: Partial Derivatives

3.1 Definition

For a function of multiple variables \(f(x_1, x_2, \ldots, x_n)\), the partial derivative with respect to \(x_i\) measures how \(f\) changes when only \(x_i\) varies, with all other variables held constant.

\[\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}\]

3.2 Notation

Partial derivatives have several equivalent notations:

| Notation | Meaning |
|---|---|
| \(\frac{\partial f}{\partial x}\) | Partial derivative of \(f\) with respect to \(x\) |
| \(f_x\) | Shorthand for \(\frac{\partial f}{\partial x}\) |
| \(\partial_x f\) | Another shorthand |
| \(D_x f\) | Differential operator notation |

3.3 How to Compute Partial Derivatives

Rule: To find \(\frac{\partial f}{\partial x_i}\), treat every variable except \(x_i\) as a constant, then differentiate with respect to \(x_i\) using the standard rules.

Example 1: Let \(f(x, y) = x^2 y + 3xy^2 - 2y\).

\[\frac{\partial f}{\partial x} = 2xy + 3y^2\]
\[\frac{\partial f}{\partial y} = x^2 + 6xy - 2\]

Example 2: Let \(f(x, y) = e^{xy} + \sin(x)\).

\[\frac{\partial f}{\partial x} = y \, e^{xy} + \cos(x)\]
\[\frac{\partial f}{\partial y} = x \, e^{xy}\]
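
As a sanity check, the following sketch (arbitrary test point, assumed step size) verifies the partial derivatives of Example 2 with central finite differences, holding the other variable constant:

```python
import numpy as np

def f(x, y):
    return np.exp(x * y) + np.sin(x)

def partial_x(f, x, y, h=1e-6):
    # Vary x only; y is held constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # Vary y only; x is held constant
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 0.5, 1.5  # arbitrary test point
print(partial_x(f, x, y), y * np.exp(x * y) + np.cos(x))  # should agree
print(partial_y(f, x, y), x * np.exp(x * y))              # should agree
```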

Part 4: Gradients

4.1 Definition

The gradient of a scalar-valued function \(f: \mathbb{R}^n \to \mathbb{R}\) is a vector of all its partial derivatives:

\[\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n\]

The gradient "lives" in the same space as the input \(\mathbf{x}\).

4.2 Gradient as Direction of Steepest Ascent

The gradient has a fundamental geometric meaning:

  • \(\nabla f(\mathbf{x})\) points in the direction of steepest ascent of \(f\) at \(\mathbf{x}\)
  • \(-\nabla f(\mathbf{x})\) points in the direction of steepest descent
  • \(\|\nabla f(\mathbf{x})\|\) gives the rate of steepest ascent

This is why gradient descent updates parameters as:

\[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \nabla f(\mathbf{x}_t)\]

where \(\eta > 0\) is the learning rate.

4.3 Worked Example

Example: Find the gradient of \(f(x_1, x_2, x_3) = x_1^2 + 2x_1 x_2 + x_3^3\).

\[\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \frac{\partial f}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 \\ 3x_3^2 \end{bmatrix}\]

At the point \(\mathbf{x} = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix}\):

\[\nabla f\big|_{\mathbf{x}} = \begin{bmatrix} 2(1) + 2(2) \\ 2(1) \\ 3(-1)^2 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \\ 3 \end{bmatrix}\]

The direction of steepest descent at this point is \(-\nabla f = \begin{bmatrix} -6 \\ -2 \\ -3 \end{bmatrix}\).
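
The sketch below evaluates this gradient in Python and takes a single gradient-descent step; the learning rate \(\eta = 0.1\) is an arbitrary choice for illustration:

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return x1**2 + 2 * x1 * x2 + x3**3

def grad_f(x):
    x1, x2, x3 = x
    return np.array([2 * x1 + 2 * x2, 2 * x1, 3 * x3**2])

x = np.array([1.0, 2.0, -1.0])
g = grad_f(x)
print(g)                       # [6. 2. 3.], matching the hand computation

eta = 0.1                      # assumed learning rate
x_new = x - eta * g            # one gradient-descent step
print(f(x_new), "<", f(x))     # the step decreases f for a small enough eta
```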


Part 5: Jacobians

5.1 Definition

For a vector-valued function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) with components \(f_1, f_2, \ldots, f_m\), the Jacobian is the \(m \times n\) matrix of all first-order partial derivatives:

\[\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}\]

Key observation: Each row of the Jacobian is the gradient (transposed) of one output component \(f_i\).

5.2 Relationship to Gradients

| Object | Input | Output | Derivative |
|---|---|---|---|
| Gradient \(\nabla f\) | \(\mathbb{R}^n\) | \(\mathbb{R}\) (scalar) | Vector in \(\mathbb{R}^n\) |
| Jacobian \(\mathbf{J}\) | \(\mathbb{R}^n\) | \(\mathbb{R}^m\) (vector) | Matrix in \(\mathbb{R}^{m \times n}\) |

5.3 Worked Example

Example: Let \(\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^3\) be defined by:

\[\mathbf{f}(x_1, x_2) = \begin{bmatrix} x_1^2 x_2 \\ 5x_1 + \sin(x_2) \\ x_2^2 \end{bmatrix}\]

The Jacobian is:

\[\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \\ \frac{\partial f_3}{\partial x_1} & \frac{\partial f_3}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 x_2 & x_1^2 \\ 5 & \cos(x_2) \\ 0 & 2x_2 \end{bmatrix} \in \mathbb{R}^{3 \times 2}\]
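
Here is a minimal sketch that compares this analytic Jacobian with a finite-difference Jacobian at an arbitrary test point (the helper `jacobian_numeric` is illustrative, not a library routine):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1**2 * x2, 5 * x1 + np.sin(x2), x2**2])

def jacobian_analytic(x):
    x1, x2 = x
    return np.array([[2 * x1 * x2, x1**2],
                     [5.0,         np.cos(x2)],
                     [0.0,         2 * x2]])

def jacobian_numeric(f, x, h=1e-6):
    # Build the m x n Jacobian column by column with central differences
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(np.allclose(jacobian_analytic(x), jacobian_numeric(f, x), atol=1e-5))  # True
```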

Part 6: Gradients of Matrices

6.1 Matrix Calculus Rules

When working with vectors and matrices, we need special differentiation rules.

Gradient of a linear function: For \(f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}\) where \(\mathbf{a}, \mathbf{x} \in \mathbb{R}^n\):

\[\nabla_{\mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}\]

Gradient of a quadratic form: For \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}\) where \(\mathbf{A} \in \mathbb{R}^{n \times n}\):

\[\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\]

If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)), this simplifies to:

\[\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x}\]

6.2 Useful Matrix Calculus Identities

| Function \(f(\mathbf{x})\) | Gradient \(\nabla_{\mathbf{x}} f\) |
|---|---|
| \(\mathbf{a}^T \mathbf{x}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{a}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{x}\) | \(2\mathbf{x}\) |
| \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\) |
| \(\|\mathbf{x} - \mathbf{b}\|^2\) | \(2(\mathbf{x} - \mathbf{b})\) |
| \(\mathbf{b}^T \mathbf{A} \mathbf{x}\) | \(\mathbf{A}^T \mathbf{b}\) |

6.3 Worked Example

Example: Find the gradient of the least-squares loss.

The loss function is:

\[L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})\]

Expanding:

\[L(\mathbf{w}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2\mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{y}^T \mathbf{y}\]

Taking the gradient with respect to \(\mathbf{w}\):

\[\nabla_{\mathbf{w}} L = 2\mathbf{X}^T \mathbf{X} \mathbf{w} - 2\mathbf{X}^T \mathbf{y}\]

Setting \(\nabla_{\mathbf{w}} L = \mathbf{0}\) gives the normal equation:

\[\mathbf{X}^T \mathbf{X} \mathbf{w}^* = \mathbf{X}^T \mathbf{y} \quad \Longrightarrow \quad \mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]
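
A quick numerical sketch (with synthetic random data, purely an assumption for illustration, and assuming \(\mathbf{X}^T\mathbf{X}\) is invertible) confirms the normal-equation solution and that the gradient vanishes there:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # synthetic design matrix (assumed full column rank)
y = rng.normal(size=50)               # synthetic targets

# Normal-equation solution w* = (X^T X)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))              # True

# Gradient of the loss at the solution should be (numerically) zero
grad = 2 * X.T @ (X @ w_normal - y)
print(np.allclose(grad, 0, atol=1e-6))             # True
```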

Part 7: The Chain Rule

7.1 Single Variable Chain Rule

If \(y = f(g(x))\), then:

\[\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]

Example: \(y = (3x + 1)^4\)

Let \(g = 3x + 1\), so \(y = g^4\).

\[\frac{dy}{dx} = 4g^3 \cdot 3 = 12(3x + 1)^3\]

7.2 Multivariate Chain Rule

If \(f\) depends on \(\mathbf{x}\) through intermediate variables \(\mathbf{u}\):

\[\mathbf{x} \in \mathbb{R}^n \xrightarrow{\mathbf{g}} \mathbf{u} \in \mathbb{R}^m \xrightarrow{f} y \in \mathbb{R}\]

Then:

\[\frac{\partial f}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial f}{\partial u_j} \cdot \frac{\partial u_j}{\partial x_i}\]

In matrix form (using Jacobians):

\[\frac{\partial f}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}\]
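
The sketch below illustrates this matrix form on a small made-up composition (both \(\mathbf{g}\) and \(f\) are chosen only for illustration): the product of the two Jacobians matches a direct finite-difference gradient of the composite function.

```python
import numpy as np

def g(x):                          # intermediate map u = g(x), R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]**2])

def f(u):                          # scalar output y = f(u)
    return np.sin(u[0]) + u[1]**2

def dg_dx(x):                      # Jacobian of g, shape (2, 2)
    return np.array([[x[1], x[0]],
                     [1.0,  2 * x[1]]])

def df_du(u):                      # row of partials of f w.r.t. u, shape (1, 2)
    return np.array([[np.cos(u[0]), 2 * u[1]]])

x = np.array([0.3, -0.8])
chain = df_du(g(x)) @ dg_dx(x)     # chain rule: (1x2) @ (2x2) = (1x2)

def h(x):                          # composite h(x) = f(g(x))
    return f(g(x))

step = 1e-6
e1, e2 = np.array([step, 0.0]), np.array([0.0, step])
h_grad = np.array([(h(x + e1) - h(x - e1)) / (2 * step),
                   (h(x + e2) - h(x - e2)) / (2 * step)])
print(np.allclose(chain.ravel(), h_grad, atol=1e-5))  # True
```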

7.3 Chain Rule for Neural Networks

Consider a simple two-layer neural network:

\[\mathbf{x} \xrightarrow{\mathbf{W}_1} \mathbf{z}_1 \xrightarrow{\sigma} \mathbf{a}_1 \xrightarrow{\mathbf{W}_2} \mathbf{z}_2 \xrightarrow{\text{loss}} L\]

To find \(\frac{\partial L}{\partial \mathbf{W}_1}\), we apply the chain rule through the entire computation:

\[\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot \frac{\partial \mathbf{z}_2}{\partial \mathbf{a}_1} \cdot \frac{\partial \mathbf{a}_1}{\partial \mathbf{z}_1} \cdot \frac{\partial \mathbf{z}_1}{\partial \mathbf{W}_1}\]

Each term in this product corresponds to a specific operation in the network.


Part 8: Backpropagation

8.1 Computation Graphs

A computation graph represents a function as a directed acyclic graph (DAG) where:

  • Nodes represent operations or variables
  • Edges represent data flow

Example: For \(f(x, y) = (x + y) \cdot (y + 1)\):

x ---\
      (+) = a ---\
y ---/            (*) = f
y ---\           /
      (+) = b --/
1 ---/

Here \(a = x + y\), \(b = y + 1\), and \(f = a \cdot b\).

8.2 Forward Pass

In the forward pass, we compute the output by evaluating the graph from inputs to output.

Example: With \(x = 2\), \(y = 3\):

| Step | Computation | Value |
|---|---|---|
| 1 | \(a = x + y\) | \(a = 2 + 3 = 5\) |
| 2 | \(b = y + 1\) | \(b = 3 + 1 = 4\) |
| 3 | \(f = a \cdot b\) | \(f = 5 \cdot 4 = 20\) |

8.3 Backward Pass (Backpropagation)

In the backward pass, we compute gradients by traversing the graph from output to inputs, applying the chain rule at each node.

Starting from \(\frac{\partial f}{\partial f} = 1\):

| Step | Gradient | Computation | Value |
|---|---|---|---|
| 1 | \(\frac{\partial f}{\partial a}\) | \(b\) | \(4\) |
| 2 | \(\frac{\partial f}{\partial b}\) | \(a\) | \(5\) |
| 3 | \(\frac{\partial f}{\partial x}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} = b \cdot 1\) | \(4\) |
| 4 | \(\frac{\partial f}{\partial y}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial y} + \frac{\partial f}{\partial b} \cdot \frac{\partial b}{\partial y} = b \cdot 1 + a \cdot 1\) | \(4 + 5 = 9\) |

Note: Since \(y\) appears in two paths (\(a\) and \(b\)), we sum the contributions from both paths.
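
The same computation can be written as a few lines of plain Python, a minimal sketch of the forward and backward passes without any autodiff library:

```python
# Forward and backward pass for f(x, y) = (x + y) * (y + 1),
# following the computation graph above.
x, y = 2.0, 3.0

# Forward pass
a = x + y          # a = 5
b = y + 1          # b = 4
f = a * b          # f = 20

# Backward pass (reverse topological order), seeded with df/df = 1
df_df = 1.0
df_da = df_df * b                  # df/da = b = 4
df_db = df_df * a                  # df/db = a = 5
df_dx = df_da * 1.0                # da/dx = 1, so df/dx = 4
df_dy = df_da * 1.0 + df_db * 1.0  # y feeds both a and b, so contributions sum: 9

print(f, df_dx, df_dy)             # 20.0 4.0 9.0
```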

8.4 General Backpropagation Algorithm

For a computation graph with output \(L\):

  1. Forward pass: Compute all intermediate values from inputs to output
  2. Initialize: Set \(\frac{\partial L}{\partial L} = 1\)
  3. Backward pass: For each node \(v\) in reverse topological order:
\[\frac{\partial L}{\partial v} = \sum_{u \in \text{children}(v)} \frac{\partial L}{\partial u} \cdot \frac{\partial u}{\partial v}\]

This is the foundation of training neural networks.


Part 9: Higher-Order Derivatives

9.1 Second-Order Partial Derivatives

For a function \(f(x_1, x_2, \ldots, x_n)\), we can differentiate partial derivatives again:

\[\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left(\frac{\partial f}{\partial x_j}\right)\]

Symmetry of mixed partials (Schwarz's theorem): If \(f\) has continuous second partial derivatives:

\[\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}\]

9.2 The Hessian Matrix

The Hessian collects all second-order partial derivatives into a matrix:

\[\mathbf{H} = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \in \mathbb{R}^{n \times n}\]

Properties:

  • The Hessian is symmetric (by Schwarz's theorem): \(\mathbf{H} = \mathbf{H}^T\)
  • If \(\mathbf{H}\) is positive definite at a critical point, the point is a local minimum
  • If \(\mathbf{H}\) is negative definite, the point is a local maximum
  • If \(\mathbf{H}\) has both positive and negative eigenvalues, the point is a saddle point

9.3 Second-Order Taylor Expansion (Multivariate)

The multivariate second-order Taylor expansion around \(\mathbf{x}_0\) is:

\[f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T \mathbf{H}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)\]

This is the basis for Newton's method in optimization.
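
The sketch below performs a few Newton updates \(\mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}(\mathbf{x})^{-1}\nabla f(\mathbf{x})\) on an arbitrary smooth test function (the function, starting point, and iteration count are all assumptions for illustration):

```python
import numpy as np

# Test function: f(x) = x0^4 + x0^2 + x0*x1 + 3*x1^2 (chosen for illustration only)
def grad(x):
    return np.array([4 * x[0]**3 + 2 * x[0] + x[1],
                     x[0] + 6 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2 + 2, 1.0],
                     [1.0,              6.0]])

x = np.array([1.0, -1.0])            # arbitrary starting point
for _ in range(5):
    x = x - np.linalg.solve(hess(x), grad(x))   # Newton update
print(x, grad(x))                    # gradient is close to zero at convergence
```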

9.4 Worked Example

Example: Find the Hessian of \(f(x_1, x_2) = x_1^3 + 2x_1 x_2^2 - x_2\).

First, compute the gradient:

\[\nabla f = \begin{bmatrix} 3x_1^2 + 2x_2^2 \\ 4x_1 x_2 - 1 \end{bmatrix}\]

Then, compute the Hessian:

\[\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 \end{bmatrix}\]

Notice that \(\mathbf{H} = \mathbf{H}^T\), confirming symmetry.
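
A numerical cross-check of this Hessian, using a finite-difference approximation of the second derivatives at an arbitrary point (the helper below is illustrative, not a library routine):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**3 + 2 * x1 * x2**2 - x2

def hessian_analytic(x):
    x1, x2 = x
    return np.array([[6 * x1, 4 * x2],
                     [4 * x2, 4 * x1]])

def hessian_numeric(f, x, h=1e-4):
    # Central finite differences applied twice
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

x = np.array([1.0, 2.0])
print(np.allclose(hessian_analytic(x), hessian_numeric(f, x), atol=1e-4))  # True
```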


Part 10: Useful Gradient Identities

10.1 Reference Table

These identities appear frequently in machine learning derivations. Here \(\mathbf{x}, \mathbf{a}, \mathbf{b} \in \mathbb{R}^n\) and \(\mathbf{A} \in \mathbb{R}^{n \times n}\).

| # | Function | Gradient \(\nabla_{\mathbf{x}}\) |
|---|---|---|
| 1 | \(\mathbf{a}^T \mathbf{x}\) | \(\mathbf{a}\) |
| 2 | \(\mathbf{x}^T \mathbf{x}\) | \(2\mathbf{x}\) |
| 3 | \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) | \((\mathbf{A} + \mathbf{A}^T)\mathbf{x}\) |
| 4 | \((\mathbf{A}\mathbf{x} - \mathbf{b})^T(\mathbf{A}\mathbf{x} - \mathbf{b})\) | \(2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})\) |
| 5 | \(\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}\) | \(2\mathbf{x}\) |
| 6 | \(\mathbf{b}^T \mathbf{A} \mathbf{x}\) | \(\mathbf{A}^T \mathbf{b}\) |

10.2 Deriving Identity 3

Let us prove \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\).

Write \(f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x} = \sum_{i}\sum_{j} x_i \, A_{ij} \, x_j\).

Taking the partial derivative with respect to \(x_k\):

\[\frac{\partial f}{\partial x_k} = \sum_{j} A_{kj} \, x_j + \sum_{i} x_i \, A_{ik}\]
\[= (\mathbf{A}\mathbf{x})_k + (\mathbf{A}^T\mathbf{x})_k\]

Collecting into a vector:

\[\nabla_{\mathbf{x}} f = \mathbf{A}\mathbf{x} + \mathbf{A}^T\mathbf{x} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\]

10.3 When \(\mathbf{A}\) is Symmetric

If \(\mathbf{A} = \mathbf{A}^T\), then \(\mathbf{A} + \mathbf{A}^T = 2\mathbf{A}\), so:

\[\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x}\]

This is a very common case in machine learning, since covariance matrices and Hessians are symmetric.
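
A minimal numerical check of identity 3 and its symmetric special case, using a small random matrix (the specific data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))           # a general (not necessarily symmetric) matrix
x = rng.normal(size=4)

def f(x):
    return x @ A @ x                  # quadratic form x^T A x

# Analytic gradient from identity 3
g_analytic = (A + A.T) @ x

# Finite-difference gradient
h = 1e-6
g_numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                      for e in np.eye(4)])
print(np.allclose(g_analytic, g_numeric, atol=1e-5))   # True

# Symmetric case: A_s = A_s^T, so the gradient reduces to 2 A_s x
A_s = (A + A.T) / 2
print(np.allclose((A_s + A_s.T) @ x, 2 * A_s @ x))     # True
```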


Summary: Key Takeaways

Differentiation Fundamentals

  • Derivatives measure rates of change; partial derivatives fix all variables except one
  • The gradient \(\nabla f\) collects all partial derivatives into a vector
  • The Jacobian generalizes the gradient for vector-valued functions

The Chain Rule and Backpropagation

  • The multivariate chain rule composes Jacobians through multiplication
  • Backpropagation applies the chain rule on a computation graph, working backward from the loss
  • Gradients with respect to variables appearing in multiple paths are summed

Higher-Order Information

  • The Hessian matrix \(\mathbf{H}\) captures second-order (curvature) information
  • Positive definite Hessian at a critical point indicates a local minimum

Matrix Calculus

  • \(\nabla_{\mathbf{x}}(\mathbf{a}^T \mathbf{x}) = \mathbf{a}\)
  • \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
  • The normal equation for least squares: \(\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Practice Problems

Problem 1

Find the derivative of \(f(x) = x^3 e^{2x}\) using the product and chain rules.

Problem 2

Let \(f(x, y) = x^2 y - 3xy^3 + 2x\). Find \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\), then compute the gradient at the point \((1, -1)\).

Problem 3

Compute the Jacobian of the function \(\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2\) defined by:

\[\mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ xy - y^2 \end{bmatrix}\]

Problem 4

Find the Hessian of \(f(x_1, x_2) = x_1^2 + 4x_1 x_2 + x_2^2\). Is this Hessian positive definite?

Problem 5

Consider the computation graph for \(f(x) = (x + 2)^2\). Perform the forward pass with \(x = 3\), then use backpropagation to compute \(\frac{df}{dx}\).

Problem 6

Let \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) and \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\). Compute \(\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x})\) using the identity from Part 10, and verify by expanding \(\mathbf{x}^T \mathbf{A} \mathbf{x}\) and differentiating directly.


Solutions

Solution 1:

Using the product rule with \(u = x^3\) and \(v = e^{2x}\):

\[f'(x) = \frac{d}{dx}(x^3) \cdot e^{2x} + x^3 \cdot \frac{d}{dx}(e^{2x})\]
\[= 3x^2 \cdot e^{2x} + x^3 \cdot 2e^{2x}\]
\[= e^{2x}(3x^2 + 2x^3) = x^2 e^{2x}(3 + 2x)\]

Solution 2:

\[\frac{\partial f}{\partial x} = 2xy - 3y^3 + 2\]
\[\frac{\partial f}{\partial y} = x^2 - 9xy^2\]

At \((1, -1)\):

\[\frac{\partial f}{\partial x}\bigg|_{(1,-1)} = 2(1)(-1) - 3(-1)^3 + 2 = -2 + 3 + 2 = 3\]
\[\frac{\partial f}{\partial y}\bigg|_{(1,-1)} = (1)^2 - 9(1)(-1)^2 = 1 - 9 = -8\]
\[\nabla f\big|_{(1,-1)} = \begin{bmatrix} 3 \\ -8 \end{bmatrix}\]

Solution 3:

\[\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix}\]

For \(f_1 = x^2 + y\): \(\frac{\partial f_1}{\partial x} = 2x\), \(\frac{\partial f_1}{\partial y} = 1\)

For \(f_2 = xy - y^2\): \(\frac{\partial f_2}{\partial x} = y\), \(\frac{\partial f_2}{\partial y} = x - 2y\)

\[\mathbf{J} = \begin{bmatrix} 2x & 1 \\ y & x - 2y \end{bmatrix}\]

Solution 4:

First, compute the gradient:

\[\nabla f = \begin{bmatrix} 2x_1 + 4x_2 \\ 4x_1 + 2x_2 \end{bmatrix}\]

The Hessian (matrix of second derivatives):

\[\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix}\]

To check positive definiteness, compute the eigenvalues. For a \(2 \times 2\) matrix:

\[\det(\mathbf{H} - \lambda \mathbf{I}) = (2 - \lambda)^2 - 16 = 0\]
\[\lambda^2 - 4\lambda + 4 - 16 = 0 \implies \lambda^2 - 4\lambda - 12 = 0\]
\[\lambda = \frac{4 \pm \sqrt{16 + 48}}{2} = \frac{4 \pm 8}{2}\]

So \(\lambda_1 = 6\) and \(\lambda_2 = -2\).

Since one eigenvalue is negative, the Hessian is not positive definite. It is indefinite, meaning any critical point of \(f\) would be a saddle point.


Solution 5:

Decompose \(f(x) = (x + 2)^2\) into elementary steps:

  • \(a = x + 2\)
  • \(f = a^2\)

Forward pass with \(x = 3\):

| Step | Computation | Value |
|---|---|---|
| 1 | \(a = x + 2\) | \(a = 3 + 2 = 5\) |
| 2 | \(f = a^2\) | \(f = 5^2 = 25\) |

Backward pass:

| Step | Gradient | Computation | Value |
|---|---|---|---|
| 1 | \(\frac{\partial f}{\partial f}\) | (seed) | \(1\) |
| 2 | \(\frac{\partial f}{\partial a}\) | \(2a\) | \(2(5) = 10\) |
| 3 | \(\frac{\partial f}{\partial x}\) | \(\frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} = 10 \cdot 1\) | \(10\) |

Verification: \(f'(x) = 2(x + 2)\), so \(f'(3) = 2(5) = 10\). Correct!


Solution 6:

Using the identity:

Since \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)):

\[\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x} = 2\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{bmatrix}\]

Direct verification:

Expand \(\mathbf{x}^T \mathbf{A} \mathbf{x}\):

\[\mathbf{x}^T \mathbf{A} \mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} 2x_1 + x_2 \\ x_1 + 3x_2 \end{bmatrix} = 2x_1^2 + x_1 x_2 + x_1 x_2 + 3x_2^2 = 2x_1^2 + 2x_1 x_2 + 3x_2^2\]

Taking partial derivatives:

\[\frac{\partial}{\partial x_1}(2x_1^2 + 2x_1 x_2 + 3x_2^2) = 4x_1 + 2x_2\]
\[\frac{\partial}{\partial x_2}(2x_1^2 + 2x_1 x_2 + 3x_2^2) = 2x_1 + 6x_2\]
\[\nabla_{\mathbf{x}}(\mathbf{x}^T \mathbf{A} \mathbf{x}) = \begin{bmatrix} 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{bmatrix}\]

Both methods agree, confirming the identity.



Next: Tutorial 5 - Probability and Distributions