Tutorial 3: Matrix Decomposition¶
Course: Mathematics for Machine Learning
Instructor: Mohammed Alnemari
📚 Learning Objectives¶
By the end of this tutorial, you will understand:
- How to compute determinants for 2x2 and 3x3 matrices
- The trace of a matrix and its properties
- Eigenvalues and eigenvectors and how to compute them
- Cholesky decomposition for symmetric positive definite matrices
- Eigendecomposition and diagonalization
- Singular Value Decomposition (SVD) and its geometric meaning
- Matrix approximation using truncated SVD
Part 1: Determinants¶
1.1 What is a Determinant?¶
The determinant is a scalar value computed from a square matrix that captures important information about the matrix. Think of it as a single number that tells you:
- Whether the matrix is invertible (determinant is nonzero)
- How the matrix scales areas or volumes when used as a linear transformation
- The "signed volume" of the parallelepiped formed by the column vectors
Notation: For a matrix \(A\), the determinant is written as \(\det(A)\) or \(|A|\).
1.2 Determinant of a 2x2 Matrix¶
For a \(2 \times 2\) matrix:
\[A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\]
The determinant is:
\[\det(A) = ad - bc\]
In plain English: multiply the diagonals and subtract. The main diagonal product minus the off-diagonal product.
Worked Example:
Since \(\det(A) = 8 \neq 0\), the matrix \(A\) is invertible.
1.3 Determinant of a 3x3 Matrix¶
For a \(3 \times 3\) matrix:
\[A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}\]
Method 1: Sarrus' Rule¶
Write the matrix and repeat the first two columns to the right:
\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{matrix}\]
Then sum the products along the three downward diagonals and subtract the products along the three upward diagonals:
\[\det(A) = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}\]
Method 2: Cofactor Expansion (along the first row)¶
\[\det(A) = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}\]
Each smaller determinant is called a minor, and the signed minor is a cofactor.
Worked Example:
Using cofactor expansion along the first row:
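As a quick numerical sanity check, the cofactor expansion can be written out term by term and compared against `np.linalg.det`; this sketch uses an arbitrary sample matrix, not the worked example above:

```python
import numpy as np

# A sample 3x3 matrix (chosen for illustration)
A = np.array([[2.0, 1.0, 3.0],
              [0.0, 4.0, 1.0],
              [1.0, 2.0, 0.0]])

# Cofactor expansion along the first row, written out explicitly
det_manual = (A[0, 0] * (A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1])
              - A[0, 1] * (A[1, 0] * A[2, 2] - A[1, 2] * A[2, 0])
              + A[0, 2] * (A[1, 0] * A[2, 1] - A[1, 1] * A[2, 0]))

det_numpy = np.linalg.det(A)
print(det_manual, det_numpy)  # the two values agree
```

Both computations give the same value, so the hand method and the library routine can be cross-checked this way on any matrix.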
1.4 Properties of Determinants¶
| Property | Statement | Example / Note |
|---|---|---|
| Identity | \(\det(I) = 1\) | The identity matrix always has determinant 1 |
| Transpose | \(\det(A^T) = \det(A)\) | Transposing does not change the determinant |
| Product | \(\det(AB) = \det(A) \cdot \det(B)\) | Determinant of a product is the product of determinants |
| Inverse | \(\det(A^{-1}) = \frac{1}{\det(A)}\) | Only defined when \(\det(A) \neq 0\) |
| Scalar multiple | \(\det(cA) = c^n \det(A)\) | For an \(n \times n\) matrix |
| Singular matrix | \(\det(A) = 0\) | Matrix is not invertible |
| Row swap | Swapping two rows flips the sign | \(\det(\text{swapped}) = -\det(A)\) |
| Triangular | \(\det(A) = \prod_{i=1}^{n} a_{ii}\) | Product of diagonal entries for triangular matrices |
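Several of these properties can be spot-checked numerically on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
c, n = 2.5, 3

# Product rule: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
# Transpose: det(A^T) = det(A)
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
# Scalar multiple: det(cA) = c^n det(A)
assert np.isclose(np.linalg.det(c * A), c**n * np.linalg.det(A))
# Inverse: det(A^{-1}) = 1 / det(A)
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))
print("all determinant properties verified")
```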
Part 2: Trace¶
2.1 Definition¶
The trace of a square matrix \(A\) is the sum of its diagonal entries:
\[\text{tr}(A) = \sum_{i=1}^{n} a_{ii}\]
In plain English: just add up all the numbers on the main diagonal.
Example:
2.2 Properties of the Trace¶
| Property | Statement |
|---|---|
| Linearity | \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\) |
| Scalar multiplication | \(\text{tr}(cA) = c \cdot \text{tr}(A)\) |
| Transpose | \(\text{tr}(A^T) = \text{tr}(A)\) |
| Cyclic property | \(\text{tr}(AB) = \text{tr}(BA)\) |
| Cyclic property (3 matrices) | \(\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)\) |
| Sum of eigenvalues | \(\text{tr}(A) = \sum_{i=1}^{n} \lambda_i\) |
| Frobenius norm | \(\text{tr}(A^T A) = \sum_{i,j} a_{ij}^2 = \|A\|_F^2\) |
The cyclic property \(\text{tr}(AB) = \text{tr}(BA)\) is especially important in machine learning. It lets you rearrange matrix products inside a trace, which simplifies many derivations in optimization and statistics.
Worked Example: Verify \(\text{tr}(AB) = \text{tr}(BA)\).
Both sides give 69, confirming the property.
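A quick NumPy check of the cyclic property, using a sample pair of matrices:

```python
import numpy as np

# Sample matrices (chosen for illustration)
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

# AB and BA are different matrices, yet their traces agree
print(A @ B)
print(B @ A)
print(np.trace(A @ B), np.trace(B @ A))
assert np.trace(A @ B) == np.trace(B @ A)
```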
Part 3: Eigenvalues and Eigenvectors¶
3.1 Definition¶
Given a square matrix \(A\), a nonzero vector \(\mathbf{v}\) is an eigenvector of \(A\) if multiplying \(A\) by \(\mathbf{v}\) simply scales \(\mathbf{v}\) by some scalar \(\lambda\):
\[A\mathbf{v} = \lambda \mathbf{v}\]
- \(\lambda\) is called the eigenvalue corresponding to \(\mathbf{v}\)
- \(\mathbf{v}\) is the eigenvector corresponding to \(\lambda\)
In plain English: an eigenvector is a special direction that the matrix only stretches (or flips), but does not rotate. The eigenvalue tells you by how much it stretches.
3.2 The Characteristic Polynomial¶
To find eigenvalues, we rearrange \(A\mathbf{v} = \lambda \mathbf{v}\):
\[(A - \lambda I)\mathbf{v} = \mathbf{0}\]
For a nonzero solution \(\mathbf{v}\) to exist, the matrix \((A - \lambda I)\) must be singular, meaning:
\[\det(A - \lambda I) = 0\]
This equation is called the characteristic equation, and the polynomial on the left side is the characteristic polynomial. Solving it gives us the eigenvalues.
3.3 Worked Example: Finding Eigenvalues and Eigenvectors¶
Find the eigenvalues and eigenvectors of:
\[A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\]
Step 1: Characteristic polynomial.
\[\det(A - \lambda I) = \begin{vmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{vmatrix} = (4-\lambda)(3-\lambda) - 2 = \lambda^2 - 7\lambda + 10\]
Step 2: Solve the characteristic equation.
\[\lambda^2 - 7\lambda + 10 = (\lambda - 5)(\lambda - 2) = 0 \quad\Rightarrow\quad \lambda_1 = 5, \quad \lambda_2 = 2\]
Step 3: Find eigenvectors for each eigenvalue.
For \(\lambda_1 = 5\):
\[(A - 5I)\mathbf{v} = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(-v_1 + v_2 = 0\), so \(v_2 = v_1\).
Choosing \(v_1 = 1\):
\[\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\]
For \(\lambda_2 = 2\):
\[(A - 2I)\mathbf{v} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(2v_1 + v_2 = 0\), so \(v_2 = -2v_1\).
Choosing \(v_1 = 1\):
\[\mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}\]
Quick check: \(\text{tr}(A) = 4 + 3 = 7 = 5 + 2 = \lambda_1 + \lambda_2\) and \(\det(A) = 4(3) - 1(2) = 10 = 5 \times 2 = \lambda_1 \cdot \lambda_2\). Both checks pass.
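The same eigenpairs can be recovered numerically with NumPy:

```python
import numpy as np

# The worked example's matrix (entries appear in the trace/determinant check)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvectors are the unit-norm columns

# Verify A v = lambda v for every eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

print(np.sort(eigvals))   # eigenvalues, sorted ascending: approximately [2., 5.]
```

Note that `np.linalg.eig` normalizes eigenvectors to unit length and returns them in no guaranteed order, so they may be scalar multiples of the hand-computed ones.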
3.4 Key Facts About Eigenvalues¶
| Fact | Statement |
|---|---|
| Sum of eigenvalues | \(\sum \lambda_i = \text{tr}(A)\) |
| Product of eigenvalues | \(\prod \lambda_i = \det(A)\) |
| Symmetric matrices | All eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal |
| Positive definite | All eigenvalues are strictly positive |
| Singular matrix | At least one eigenvalue is zero |
Part 4: Cholesky Decomposition¶
4.1 What is Cholesky Decomposition?¶
For a symmetric positive definite (SPD) matrix \(A\), the Cholesky decomposition factors \(A\) into:
\[A = LL^T\]
where \(L\) is a lower triangular matrix with positive diagonal entries.
Think of it as the "square root" of a matrix. Just as every positive number has a square root, every SPD matrix has a Cholesky factor.
4.2 What Does "Symmetric Positive Definite" Mean?¶
A matrix \(A\) is symmetric positive definite if:
- Symmetric: \(A = A^T\) (the matrix equals its transpose)
- Positive definite: \(\mathbf{x}^T A \mathbf{x} > 0\) for all nonzero vectors \(\mathbf{x}\)
Equivalently, \(A\) is SPD if it is symmetric and all its eigenvalues are strictly positive.
4.3 The Cholesky Algorithm (for 2x2)¶
Given:
\[A = \begin{bmatrix} a_{11} & a_{21} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}\begin{bmatrix} l_{11} & l_{21} \\ 0 & l_{22} \end{bmatrix}\]
Expanding the right side and matching entries:
\[l_{11} = \sqrt{a_{11}}, \qquad l_{21} = \frac{a_{21}}{l_{11}}, \qquad l_{22} = \sqrt{a_{22} - l_{21}^2}\]
4.4 The Cholesky Algorithm (general)¶
For an \(n \times n\) SPD matrix, the entries of \(L\) are computed as:
\[l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}, \qquad l_{ij} = \frac{1}{l_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk}\right) \quad \text{for } i > j\]
4.5 Worked Example¶
Find the Cholesky decomposition of:
\[A = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix}\]
Step 1: Check that \(A\) is SPD. It is symmetric (\(A = A^T\)). Its eigenvalues are \(\lambda = \frac{9 \pm \sqrt{17}}{2}\), giving \(\lambda_1 \approx 6.56\) and \(\lambda_2 \approx 2.44\), both positive. (Alternatively, \(\det(A) = 16 > 0\) and \(a_{11} = 4 > 0\).)
Step 2: Compute \(L\).
\[l_{11} = \sqrt{a_{11}} = \sqrt{4} = 2, \qquad l_{21} = \frac{a_{21}}{l_{11}} = \frac{2}{2} = 1, \qquad l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{5 - 1} = 2\]
Result:
\[L = \begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}\]
Verification:
\[LL^T = \begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix} = A \;\checkmark\]
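In practice the factor is computed with `np.linalg.cholesky`; a minimal sketch on a small sample SPD matrix:

```python
import numpy as np

# A sample 2x2 SPD matrix (for illustration)
A = np.array([[4.0, 2.0],
              [2.0, 5.0]])

L = np.linalg.cholesky(A)   # returns the lower-triangular factor
print(L)

assert np.allclose(L @ L.T, A)   # reconstruction: L L^T = A
assert np.all(np.diag(L) > 0)    # positive diagonal entries
```

If the input matrix is not positive definite, `np.linalg.cholesky` raises a `LinAlgError`, which is itself a convenient numerical test for positive definiteness.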
4.6 Why Cholesky Decomposition Matters¶
| Application | How It Helps |
|---|---|
| Solving linear systems | \(Ax = b\) becomes two easy triangular solves: \(Ly = b\), then \(L^Tx = y\) |
| Sampling from multivariate Gaussians | If \(\Sigma = LL^T\), generate \(\mathbf{x} = L\mathbf{z} + \boldsymbol{\mu}\) where \(\mathbf{z} \sim \mathcal{N}(0, I)\) |
| Numerical stability | More stable and about twice as fast as general LU decomposition for SPD matrices |
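The Gaussian-sampling application can be sketched in a few lines; the mean and covariance below are arbitrary sample values:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 5.0]])          # a sample SPD covariance

L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((2, 200_000))   # z ~ N(0, I), one column per sample
x = L @ z + mu[:, None]                 # x = Lz + mu ~ N(mu, Sigma)

# The empirical mean and covariance should be close to mu and Sigma
print(x.mean(axis=1))
print(np.cov(x))
```

The transformation works because \(\text{Cov}(L\mathbf{z}) = L\,\text{Cov}(\mathbf{z})\,L^T = LL^T = \Sigma\).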
Part 5: Eigendecomposition¶
5.1 Definition¶
If a square matrix \(A\) has \(n\) linearly independent eigenvectors, then \(A\) can be factored as:
\[A = PDP^{-1}\]
where:
- \(P\) is the matrix whose columns are the eigenvectors of \(A\): \(P = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_n]\)
- \(D\) is a diagonal matrix with the corresponding eigenvalues on the diagonal: \(D = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)\)
This factorization is called the eigendecomposition or spectral decomposition (for symmetric matrices).
5.2 When Does Eigendecomposition Exist?¶
| Condition | Eigendecomposition Exists? |
|---|---|
| \(A\) has \(n\) distinct eigenvalues | Always yes |
| \(A\) is symmetric (\(A = A^T\)) | Always yes, and \(P\) is orthogonal (\(P^{-1} = P^T\)) |
| \(A\) has repeated eigenvalues | Sometimes (depends on geometric multiplicity) |
| \(A\) is defective (not enough independent eigenvectors) | No |
For symmetric matrices, the decomposition simplifies to:
\[A = PDP^T\]
because the eigenvectors are orthogonal.
5.3 Worked Example¶
Diagonalize the matrix from Part 3:
\[A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\]
We already found \(\lambda_1 = 5\), \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) and \(\lambda_2 = 2\), \(\mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}\). So:
\[P = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}, \qquad D = \begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}\]
Compute \(P^{-1}\). For a \(2 \times 2\) matrix \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\), the inverse is \(\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\):
\[P^{-1} = \frac{1}{(1)(-2) - (1)(1)}\begin{bmatrix} -2 & -1 \\ -1 & 1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix}\]
Verification:
\[PDP^{-1} = \frac{1}{3}\begin{bmatrix} 5 & 2 \\ 5 & -4 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 12 & 3 \\ 6 & 9 \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix} = A \;\checkmark\]
5.4 Why Eigendecomposition is Useful¶
Computing matrix powers: If \(A = PDP^{-1}\), then the inner \(P^{-1}P\) factors cancel when multiplying \(A\) by itself:
\[A^k = (PDP^{-1})(PDP^{-1})\cdots(PDP^{-1}) = PD^kP^{-1}\]
Since \(D\) is diagonal, \(D^k\) is just each diagonal entry raised to the \(k\)-th power:
\[D^k = \begin{bmatrix} \lambda_1^k & & \\ & \ddots & \\ & & \lambda_n^k \end{bmatrix}\]
This makes computing \(A^{100}\) just as easy as computing \(A^2\).
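A small numerical check of this trick, reusing the worked example's matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, P = np.linalg.eig(A)           # columns of P are eigenvectors

# A^10 via eigendecomposition: only the diagonal entries get powered
A10 = P @ np.diag(eigvals**10) @ np.linalg.inv(P)

# Compare against direct repeated multiplication
assert np.allclose(A10, np.linalg.matrix_power(A, 10))
print(A10)
```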
Part 6: Singular Value Decomposition (SVD)¶
6.1 Definition¶
Every matrix \(A\) (of any shape) can be decomposed as:
\[A = U\Sigma V^T\]
where:
- \(A\) is \(m \times n\)
- \(U\) is an \(m \times m\) orthogonal matrix (its columns are the left singular vectors)
- \(\Sigma\) is an \(m \times n\) diagonal matrix (its diagonal entries are the singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
- \(V\) is an \(n \times n\) orthogonal matrix (its columns are the right singular vectors)
This is the most important matrix decomposition in applied mathematics and machine learning.
6.2 Geometric Interpretation¶
Any linear transformation \(A\) can be broken down into three simple steps:
- \(V^T\): Rotate (in the input space)
- \(\Sigma\): Scale along each axis (possibly changing dimensions)
- \(U\): Rotate (in the output space)
In plain English: every matrix transformation is just a rotation, followed by a stretch, followed by another rotation.
6.3 Relationship to Eigendecomposition¶
The singular values and vectors are connected to eigenvalues:
| SVD Component | Obtained From |
|---|---|
| \(V\) (right singular vectors) | Eigenvectors of \(A^T A\) |
| \(U\) (left singular vectors) | Eigenvectors of \(A A^T\) |
| \(\sigma_i\) (singular values) | \(\sigma_i = \sqrt{\lambda_i}\) where \(\lambda_i\) are eigenvalues of \(A^T A\) |
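This relationship is easy to verify numerically on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A)             # s is sorted in descending order

# Singular values are the square roots of the eigenvalues of A^T A
lam = np.linalg.eigvalsh(A.T @ A)       # eigvalsh returns ascending order
assert np.allclose(np.sqrt(np.maximum(lam[::-1], 0)), s)

# U and V are orthogonal
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
print(s)
```

(The `np.maximum(..., 0)` guards against tiny negative eigenvalues produced by floating-point round-off.)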
6.4 Worked Example: Computing the SVD¶
Find the SVD of:
\[A = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}\]
This is already a diagonal matrix, so the SVD is straightforward.
Step 1: Compute \(A^T A\).
\[A^T A = \begin{bmatrix} 9 & 0 \\ 0 & 4 \end{bmatrix}\]
Step 2: Find eigenvalues of \(A^T A\).
Eigenvalues: \(\lambda_1 = 9\), \(\lambda_2 = 4\).
Singular values: \(\sigma_1 = \sqrt{9} = 3\), \(\sigma_2 = \sqrt{4} = 2\).
Step 3: Find \(V\) (eigenvectors of \(A^T A\)). Since \(A^T A\) is already diagonal, the standard basis vectors are its eigenvectors: \(V = I\).
Step 4: Find \(U\) (eigenvectors of \(A A^T\)). Likewise \(AA^T = \begin{bmatrix} 9 & 0 \\ 0 & 4 \end{bmatrix}\), so \(U = I\).
Result:
\[A = U\Sigma V^T = I\begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}I^T\]
6.5 Worked Example: Non-Diagonal Matrix¶
Find the SVD of:
\[A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\]
Step 1: Compute \(A^T A\).
\[A^T A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}\]
Step 2: Find eigenvalues of \(A^T A\).
\[\det(A^T A - \lambda I) = (1-\lambda)(2-\lambda) - 1 = \lambda^2 - 3\lambda + 1 = 0 \quad\Rightarrow\quad \lambda = \frac{3 \pm \sqrt{5}}{2}\]
Singular values:
\[\sigma_1 = \sqrt{\frac{3+\sqrt{5}}{2}} = \frac{1+\sqrt{5}}{2} \approx 1.618, \qquad \sigma_2 = \sqrt{\frac{3-\sqrt{5}}{2}} = \frac{\sqrt{5}-1}{2} \approx 0.618\]
(Notice: \(\sigma_1 = \phi\), the golden ratio, and \(\sigma_2 = 1/\phi\), exactly.)
Step 3: Find \(V\).
For \(\lambda_1 = \frac{3+\sqrt{5}}{2}\), solve \((A^TA - \lambda_1 I)\mathbf{v} = 0\):
\[\begin{bmatrix} 1-\lambda_1 & 1 \\ 1 & 2-\lambda_1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
This gives \(v_2 = (\lambda_1 - 1) v_1\). Normalizing:
\[\mathbf{v}_1 = \frac{1}{\sqrt{1 + (\lambda_1 - 1)^2}}\begin{bmatrix} 1 \\ \lambda_1 - 1 \end{bmatrix}\]
Similarly for \(\lambda_2\). Then \(V = [\mathbf{v}_1 \mid \mathbf{v}_2]\).
Step 4: Find \(U\).
Compute \(\mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i\) for each singular vector.
The full numerical result gives \(A = U\Sigma V^T\) as desired.
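A minimal numerical check, assuming the example matrix is the unit shear \(\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\) (consistent with the golden-ratio singular values above):

```python
import numpy as np

# Unit shear matrix (assumed example); singular values should be phi and 1/phi
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
phi = (1 + np.sqrt(5)) / 2              # golden ratio

assert np.allclose(s, [phi, 1 / phi])
assert np.allclose(U @ np.diag(s) @ Vt, A)   # reconstruction
print(s)
```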
6.6 Key Properties of SVD¶
| Property | Statement |
|---|---|
| Existence | Every matrix has an SVD (unlike eigendecomposition) |
| Rank | \(\text{rank}(A) =\) number of nonzero singular values |
| Norm | \(\|A\|_2 = \sigma_1\) (largest singular value) |
| Frobenius norm | \(\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}\) |
| Condition number | \(\kappa(A) = \sigma_1 / \sigma_n\) (ratio of largest to smallest) |
Part 7: Matrix Approximation¶
7.1 Truncated SVD¶
Given the full SVD \(A = U\Sigma V^T\) with \(r\) nonzero singular values, we can approximate \(A\) by keeping only the \(k\) largest singular values (where \(k < r\)):
\[A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T\]
where:
- \(U_k\) is the first \(k\) columns of \(U\)
- \(\Sigma_k\) is the top-left \(k \times k\) block of \(\Sigma\)
- \(V_k\) is the first \(k\) columns of \(V\)
In plain English: keep the \(k\) most important "layers" of the matrix and throw away the rest. Each layer is a rank-1 matrix \(\sigma_i \mathbf{u}_i \mathbf{v}_i^T\) weighted by its singular value.
7.2 The Eckart-Young Theorem¶
The truncated SVD gives the best rank-\(k\) approximation to \(A\):
\[A_k = \underset{\text{rank}(B) \leq k}{\arg\min} \; \|A - B\|_F\]
In other words, among all matrices with rank at most \(k\), \(A_k\) is the closest to \(A\) in the Frobenius norm. The approximation error is:
\[\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2}\]
This is a remarkable result: the best low-rank approximation is obtained simply by discarding the smallest singular values.
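The theorem's error formula can be verified numerically on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD

# Eckart-Young: the Frobenius error equals the energy of the discarded sigmas
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
print(err)
```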
7.3 Worked Example: Rank-1 Approximation¶
Find the best rank-1 approximation to:
\[A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}\]
Step 1: Compute the SVD.
\[A^T A = \begin{bmatrix} 10 & 6 \\ 6 & 10 \end{bmatrix}\]
Eigenvalues of \(A^T A\): \(\lambda_1 = 16\), \(\lambda_2 = 4\).
Singular values: \(\sigma_1 = 4\), \(\sigma_2 = 2\).
Eigenvectors of \(A^T A\):
For \(\lambda_1 = 16\): \(\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}\)
For \(\lambda_2 = 4\): \(\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}\)
Left singular vectors: \(\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i\)
\[\mathbf{u}_1 = \frac{1}{4} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 4 \\ 4 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \mathbf{u}_2 = \frac{1}{2} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 2 \\ -2 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}\]
Step 2: Compute the rank-1 approximation.
\[A_1 = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T = 4 \cdot \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}\]
Step 3: Check the error.
\[A - A_1 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \qquad \|A - A_1\|_F = \sqrt{1+1+1+1} = 2 = \sigma_2 \;\checkmark\]
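A numerical check of this example, assuming the matrix \(\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}\) (consistent with the singular values and vectors above):

```python
import numpy as np

# Assumed example matrix, consistent with sigma_1 = 4, sigma_2 = 2 above
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])   # rank-1 term sigma_1 u_1 v_1^T

assert np.allclose(A1, [[2.0, 2.0], [2.0, 2.0]])
assert np.isclose(np.linalg.norm(A - A1, 'fro'), s[1])   # error = sigma_2
print(A1)
```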
7.4 Applications of Low-Rank Approximation¶
| Application | How Truncated SVD Helps |
|---|---|
| Image compression | An \(m \times n\) image stored as a matrix can be approximated with rank \(k\), requiring only \(k(m + n + 1)\) numbers instead of \(mn\) |
| Dimensionality reduction (PCA) | Principal Component Analysis keeps the top \(k\) singular vectors to reduce feature dimensions |
| Recommender systems | User-item rating matrices are approximated at low rank to predict missing ratings |
| Noise reduction | Small singular values often correspond to noise; discarding them denoises the data |
| Latent semantic analysis | In NLP, document-term matrices are approximated to find semantic structure |
Image compression example: Suppose you have a \(1000 \times 1000\) grayscale image (1,000,000 pixel values). A rank-50 SVD approximation stores only \(50 \times (1000 + 1000 + 1) = 100{,}050\) numbers, which is about 10% of the original, yet often captures the essential visual content.
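A sketch of this storage arithmetic on a synthetic image (the gradient-plus-noise "image" below is an arbitrary stand-in for a real photo):

```python
import numpy as np

# Synthetic 1000x1000 "image": smooth gradient plus mild noise
rng = np.random.default_rng(0)
m, n, k = 1000, 1000, 50
img = (np.outer(np.linspace(0, 1, m), np.linspace(0, 1, n))
       + 0.01 * rng.standard_normal((m, n)))

U, s, Vt = np.linalg.svd(img, full_matrices=False)
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-50 approximation

stored = k * (m + n + 1)        # numbers kept: U_k, V_k, and the sigmas
print(stored, m * n)            # 100050 vs 1000000, about 10%

rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')
print(rel_err)                  # small: the image is nearly low-rank
```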
Summary: Key Takeaways¶
Matrix Scalars¶
- Determinant \(\det(A)\): tells you invertibility and volume scaling
- Trace \(\text{tr}(A)\): sum of diagonal entries = sum of eigenvalues
Eigenvalues and Eigenvectors¶
- \(A\mathbf{v} = \lambda \mathbf{v}\): eigenvectors are special directions, eigenvalues are scale factors
- Found by solving \(\det(A - \lambda I) = 0\)
Matrix Decompositions¶
- Cholesky: \(A = LL^T\) for symmetric positive definite matrices
- Eigendecomposition: \(A = PDP^{-1}\) for diagonalizable square matrices
- SVD: \(A = U\Sigma V^T\) for any matrix (the universal decomposition)
Matrix Approximation¶
- Truncated SVD gives the best rank-\(k\) approximation (Eckart-Young theorem)
- Central to PCA, image compression, recommender systems, and denoising
Practice Problems¶
Problem 1¶
Compute the determinant:
Problem 2¶
Let \(A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) and \(B = \begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix}\).
Verify that \(\text{tr}(AB) = \text{tr}(BA)\).
Problem 3¶
Find the eigenvalues and eigenvectors of:
\[C = \begin{bmatrix} 5 & 4 \\ 2 & 3 \end{bmatrix}\]
Problem 4¶
Find the Cholesky decomposition of:
Problem 5¶
Diagonalize the matrix from Problem 3. That is, find \(P\) and \(D\) such that \(C = PDP^{-1}\).
Problem 6¶
The singular values of a \(3 \times 3\) matrix \(M\) are \(\sigma_1 = 10\), \(\sigma_2 = 5\), \(\sigma_3 = 1\).
(a) What is \(\|M\|_F\)?
(b) What is the Frobenius-norm error of the best rank-2 approximation \(M_2\)?
(c) What percentage of \(\|M\|_F^2\) is captured by \(M_2\)?
Solutions¶
Solution 1:
Using cofactor expansion along the first row:
Solution 2:
\[AB = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 10 \\ 3 & 10 \end{bmatrix}, \qquad \text{tr}(AB) = 1 + 10 = 11\]
\[BA = \begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 12 \\ 4 & 7 \end{bmatrix}, \qquad \text{tr}(BA) = 4 + 7 = 11\]
Both traces equal 11, confirming \(\text{tr}(AB) = \text{tr}(BA)\).
Solution 3:
Characteristic polynomial:
\[\det(C - \lambda I) = (5-\lambda)(3-\lambda) - 8 = \lambda^2 - 8\lambda + 7 = (\lambda - 7)(\lambda - 1) = 0 \quad\Rightarrow\quad \lambda_1 = 7, \quad \lambda_2 = 1\]
Eigenvector for \(\lambda_1 = 7\):
\[(C - 7I)\mathbf{v} = \begin{bmatrix} -2 & 4 \\ 2 & -4 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(-2v_1 + 4v_2 = 0\), so \(v_1 = 2v_2\). Choosing \(v_2 = 1\):
\[\mathbf{v}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\]
Eigenvector for \(\lambda_2 = 1\):
\[(C - I)\mathbf{v} = \begin{bmatrix} 4 & 4 \\ 2 & 2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(4v_1 + 4v_2 = 0\), so \(v_1 = -v_2\). Choosing \(v_2 = 1\):
\[\mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\]
Check: \(\text{tr}(C) = 5 + 3 = 8 = 7 + 1\) and \(\det(C) = 15 - 8 = 7 = 7 \times 1\). Both pass.
Solution 4:
Verification:
Solution 5:
From Solution 3: \(\lambda_1 = 7\), \(\mathbf{v}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\), \(\lambda_2 = 1\), \(\mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\). So:
\[P = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad D = \begin{bmatrix} 7 & 0 \\ 0 & 1 \end{bmatrix}, \qquad P^{-1} = \frac{1}{3}\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}\]
Verification:
\[PDP^{-1} = \frac{1}{3}\begin{bmatrix} 14 & -1 \\ 7 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 15 & 12 \\ 6 & 9 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ 2 & 3 \end{bmatrix} = C \;\checkmark\]
Solution 6:
(a) The Frobenius norm is:
\[\|M\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \sigma_3^2} = \sqrt{100 + 25 + 1} = \sqrt{126} \approx 11.22\]
(b) The best rank-2 approximation discards \(\sigma_3 = 1\). The error is:
\[\|M - M_2\|_F = \sigma_3 = 1\]
(c) The squared Frobenius norm captured by \(M_2\) is:
\[\frac{\sigma_1^2 + \sigma_2^2}{\|M\|_F^2} = \frac{125}{126} \approx 0.992\]
So the rank-2 approximation captures about 99.2% of the total "energy" of \(M\).
Next: Tutorial 4 - Matrix Calculus and Optimization