Tutorial 3: Matrix Decomposition¶
Course: Mathematics for Machine Learning
Instructor: Mohammed Alnemari
📚 Learning Objectives¶
By the end of this tutorial, you will understand:
- How to compute determinants for 2x2 and 3x3 matrices
- The trace of a matrix and its properties
- Eigenvalues and eigenvectors and how to compute them
- Cholesky decomposition for symmetric positive definite matrices
- Eigendecomposition and diagonalization
- Singular Value Decomposition (SVD) and its geometric meaning
- Matrix approximation using truncated SVD
Part 1: Determinants¶
1.1 What is a Determinant?¶
The determinant is a scalar value computed from a square matrix that captures important information about the matrix. Think of it as a single number that tells you:
- Whether the matrix is invertible (determinant is nonzero)
- How the matrix scales areas or volumes when used as a linear transformation
- The "signed volume" of the parallelepiped formed by the column vectors
Notation: For a matrix \(A\), the determinant is written as \(\det(A)\) or \(|A|\).
1.2 Determinant of a 2x2 Matrix¶
For a \(2 \times 2\) matrix:
\[A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\]
The determinant is:
\[\det(A) = ad - bc\]
In plain English: multiply the diagonals and subtract. The main diagonal product minus the off-diagonal product.
Worked Example:
Since \(\det(A) = 8 \neq 0\), the matrix \(A\) is invertible.
1.3 Determinant of a 3x3 Matrix¶
For a \(3 \times 3\) matrix:
\[A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}\]
Method 1: Sarrus' Rule¶
Write the matrix and repeat the first two columns to the right:
\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{matrix}\]
Then sum the products along the three downward diagonals and subtract the products along the three upward diagonals:
\[\det(A) = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}\]
Method 2: Cofactor Expansion (along the first row)¶
\[\det(A) = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}\]
Each smaller determinant is called a minor, and the signed minor is a cofactor.
Worked Example:
Using cofactor expansion along the first row:
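As a quick numerical sanity check, the cofactor expansion can be written out term by term and compared against `np.linalg.det`; this sketch uses an arbitrary sample matrix, not the worked example above:

```python
import numpy as np

# A sample 3x3 matrix (chosen for illustration)
A = np.array([[2.0, 1.0, 3.0],
              [0.0, 4.0, 1.0],
              [1.0, 2.0, 0.0]])

# Cofactor expansion along the first row, written out explicitly
det_manual = (A[0, 0] * (A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1])
              - A[0, 1] * (A[1, 0] * A[2, 2] - A[1, 2] * A[2, 0])
              + A[0, 2] * (A[1, 0] * A[2, 1] - A[1, 1] * A[2, 0]))

det_numpy = np.linalg.det(A)
print(det_manual, det_numpy)  # the two values agree
```

Both computations give the same value, so the hand method and the library routine can be cross-checked this way on any matrix.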
1.4 Properties of Determinants¶
| Property | Statement | Example / Note |
|---|---|---|
| Identity | \(\det(I) = 1\) | The identity matrix always has determinant 1 |
| Transpose | \(\det(A^T) = \det(A)\) | Transposing does not change the determinant |
| Product | \(\det(AB) = \det(A) \cdot \det(B)\) | Determinant of a product is the product of determinants |
| Inverse | \(\det(A^{-1}) = \frac{1}{\det(A)}\) | Only defined when \(\det(A) \neq 0\) |
| Scalar multiple | \(\det(cA) = c^n \det(A)\) | For an \(n \times n\) matrix |
| Singular matrix | \(\det(A) = 0\) | Matrix is not invertible |
| Row swap | Swapping two rows flips the sign | \(\det(\text{swapped}) = -\det(A)\) |
| Triangular | \(\det(A) = \prod_{i=1}^{n} a_{ii}\) | Product of diagonal entries for triangular matrices |
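Several of these properties can be spot-checked numerically on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
c, n = 2.5, 3

# Product rule: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
# Transpose: det(A^T) = det(A)
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
# Scalar multiple: det(cA) = c^n det(A)
assert np.isclose(np.linalg.det(c * A), c**n * np.linalg.det(A))
# Inverse: det(A^{-1}) = 1 / det(A)
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))
print("all determinant properties verified")
```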
Part 2: Trace¶
2.1 Definition¶
The trace of a square matrix \(A\) is the sum of its diagonal entries:
\[\text{tr}(A) = \sum_{i=1}^{n} a_{ii}\]
In plain English: just add up all the numbers on the main diagonal.
Example:
2.2 Properties of the Trace¶
| Property | Statement |
|---|---|
| Linearity | \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\) |
| Scalar multiplication | \(\text{tr}(cA) = c \cdot \text{tr}(A)\) |
| Transpose | \(\text{tr}(A^T) = \text{tr}(A)\) |
| Cyclic property | \(\text{tr}(AB) = \text{tr}(BA)\) |
| Cyclic property (3 matrices) | \(\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)\) |
| Sum of eigenvalues | \(\text{tr}(A) = \sum_{i=1}^{n} \lambda_i\) |
| Frobenius norm | \(\text{tr}(A^T A) = \sum_{i,j} a_{ij}^2 = \|A\|_F^2\) |
The cyclic property \(\text{tr}(AB) = \text{tr}(BA)\) is especially important in machine learning. It lets you rearrange matrix products inside a trace, which simplifies many derivations in optimization and statistics.
Worked Example: Verify \(\text{tr}(AB) = \text{tr}(BA)\).
Both sides give 69, confirming the property.
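A quick NumPy check of the cyclic property, using a sample pair of matrices:

```python
import numpy as np

# Sample matrices (chosen for illustration)
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

# AB and BA are different matrices, yet their traces agree
print(A @ B)
print(B @ A)
print(np.trace(A @ B), np.trace(B @ A))
assert np.trace(A @ B) == np.trace(B @ A)
```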
Part 3: Eigenvalues and Eigenvectors¶
3.1 Definition¶
Given a square matrix \(A\), a nonzero vector \(\mathbf{v}\) is an eigenvector of \(A\) if multiplying \(A\) by \(\mathbf{v}\) simply scales \(\mathbf{v}\) by some scalar \(\lambda\):
\[A\mathbf{v} = \lambda \mathbf{v}\]
- \(\lambda\) is called the eigenvalue corresponding to \(\mathbf{v}\)
- \(\mathbf{v}\) is the eigenvector corresponding to \(\lambda\)
In plain English: an eigenvector is a special direction that the matrix only stretches (or flips), but does not rotate. The eigenvalue tells you by how much it stretches.
3.2 The Characteristic Polynomial¶
To find eigenvalues, we rearrange \(A\mathbf{v} = \lambda \mathbf{v}\):
\[(A - \lambda I)\mathbf{v} = \mathbf{0}\]
For a nonzero solution \(\mathbf{v}\) to exist, the matrix \((A - \lambda I)\) must be singular, meaning:
\[\det(A - \lambda I) = 0\]
This equation is called the characteristic equation, and the polynomial on the left side is the characteristic polynomial. Solving it gives us the eigenvalues.
3.3 Worked Example: Finding Eigenvalues and Eigenvectors¶
Find the eigenvalues and eigenvectors of:
\[A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\]
Step 1: Characteristic polynomial.
\[\det(A - \lambda I) = \begin{vmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{vmatrix} = (4-\lambda)(3-\lambda) - 2 = \lambda^2 - 7\lambda + 10\]
Step 2: Solve the characteristic equation.
\[\lambda^2 - 7\lambda + 10 = (\lambda - 5)(\lambda - 2) = 0 \quad\Rightarrow\quad \lambda_1 = 5, \quad \lambda_2 = 2\]
Step 3: Find eigenvectors for each eigenvalue.
For \(\lambda_1 = 5\):
\[(A - 5I)\mathbf{v} = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(-v_1 + v_2 = 0\), so \(v_2 = v_1\).
Choosing \(v_1 = 1\):
\[\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\]
For \(\lambda_2 = 2\):
\[(A - 2I)\mathbf{v} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(2v_1 + v_2 = 0\), so \(v_2 = -2v_1\).
Choosing \(v_1 = 1\):
\[\mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}\]
Quick check: \(\text{tr}(A) = 4 + 3 = 7 = 5 + 2 = \lambda_1 + \lambda_2\) and \(\det(A) = 4(3) - 1(2) = 10 = 5 \times 2 = \lambda_1 \cdot \lambda_2\). Both checks pass.
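The same eigenpairs can be recovered numerically with NumPy:

```python
import numpy as np

# The worked example's matrix (entries appear in the trace/determinant check)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvectors are the unit-norm columns

# Verify A v = lambda v for every eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

print(np.sort(eigvals))   # eigenvalues, sorted ascending: approximately [2., 5.]
```

Note that `np.linalg.eig` normalizes eigenvectors to unit length and returns them in no guaranteed order, so they may be scalar multiples of the hand-computed ones.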
3.4 Key Facts About Eigenvalues¶
| Fact | Statement |
|---|---|
| Sum of eigenvalues | \(\sum \lambda_i = \text{tr}(A)\) |
| Product of eigenvalues | \(\prod \lambda_i = \det(A)\) |
| Symmetric matrices | All eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal |
| Positive definite | All eigenvalues are strictly positive |
| Singular matrix | At least one eigenvalue is zero |
Part 4: Cholesky Decomposition¶
4.1 What is Cholesky Decomposition?¶
For a symmetric positive definite (SPD) matrix \(A\), the Cholesky decomposition factors \(A\) into:
\[A = LL^T\]
where \(L\) is a lower triangular matrix with positive diagonal entries.
Think of it as the "square root" of a matrix. Just as every positive number has a square root, every SPD matrix has a Cholesky factor.
4.2 What Does "Symmetric Positive Definite" Mean?¶
A matrix \(A\) is symmetric positive definite if:
- Symmetric: \(A = A^T\) (the matrix equals its transpose)
- Positive definite: \(\mathbf{x}^T A \mathbf{x} > 0\) for all nonzero vectors \(\mathbf{x}\)
Equivalently, \(A\) is SPD if it is symmetric and all its eigenvalues are strictly positive.
4.3 The Cholesky Algorithm (for 2x2)¶
Given:
\[A = \begin{bmatrix} a_{11} & a_{21} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}\begin{bmatrix} l_{11} & l_{21} \\ 0 & l_{22} \end{bmatrix}\]
Expanding the right side and matching entries:
\[l_{11} = \sqrt{a_{11}}, \qquad l_{21} = \frac{a_{21}}{l_{11}}, \qquad l_{22} = \sqrt{a_{22} - l_{21}^2}\]
4.4 The Cholesky Algorithm (general)¶
For an \(n \times n\) SPD matrix, the entries of \(L\) are computed as:
\[l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}, \qquad l_{ij} = \frac{1}{l_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk}\right) \quad \text{for } i > j\]
4.5 Worked Example¶
Find the Cholesky decomposition of:
\[A = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix}\]
Step 1: Check that \(A\) is SPD. It is symmetric (\(A = A^T\)). Its eigenvalues are \(\lambda = \frac{9 \pm \sqrt{17}}{2}\), giving \(\lambda_1 \approx 6.56\) and \(\lambda_2 \approx 2.44\), both positive. (Alternatively, \(\det(A) = 16 > 0\) and \(a_{11} = 4 > 0\).)
Step 2: Compute \(L\).
\[l_{11} = \sqrt{a_{11}} = \sqrt{4} = 2, \qquad l_{21} = \frac{a_{21}}{l_{11}} = \frac{2}{2} = 1, \qquad l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{5 - 1} = 2\]
Result:
\[L = \begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}\]
Verification:
\[LL^T = \begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix} = A \;\checkmark\]
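In practice the factor is computed with `np.linalg.cholesky`; a minimal sketch on a small sample SPD matrix:

```python
import numpy as np

# A sample 2x2 SPD matrix (for illustration)
A = np.array([[4.0, 2.0],
              [2.0, 5.0]])

L = np.linalg.cholesky(A)   # returns the lower-triangular factor
print(L)

assert np.allclose(L @ L.T, A)   # reconstruction: L L^T = A
assert np.all(np.diag(L) > 0)    # positive diagonal entries
```

If the input matrix is not positive definite, `np.linalg.cholesky` raises a `LinAlgError`, which is itself a convenient numerical test for positive definiteness.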
4.6 Why Cholesky Decomposition Matters¶
| Application | How It Helps |
|---|---|
| Solving linear systems | \(Ax = b\) becomes two easy triangular solves: \(Ly = b\), then \(L^Tx = y\) |
| Sampling from multivariate Gaussians | If \(\Sigma = LL^T\), generate \(\mathbf{x} = L\mathbf{z} + \boldsymbol{\mu}\) where \(\mathbf{z} \sim \mathcal{N}(0, I)\) |
| Numerical stability | More stable and about twice as fast as general LU decomposition for SPD matrices |
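The Gaussian-sampling application can be sketched in a few lines; the mean and covariance below are arbitrary sample values:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 5.0]])          # a sample SPD covariance

L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((2, 200_000))   # z ~ N(0, I), one column per sample
x = L @ z + mu[:, None]                 # x = Lz + mu ~ N(mu, Sigma)

# The empirical mean and covariance should be close to mu and Sigma
print(x.mean(axis=1))
print(np.cov(x))
```

The transformation works because \(\text{Cov}(L\mathbf{z}) = L\,\text{Cov}(\mathbf{z})\,L^T = LL^T = \Sigma\).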
Part 5: Eigendecomposition¶
5.1 Definition¶
If a square matrix \(A\) has \(n\) linearly independent eigenvectors, then \(A\) can be factored as:
\[A = PDP^{-1}\]
where:
- \(P\) is the matrix whose columns are the eigenvectors of \(A\): \(P = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_n]\)
- \(D\) is a diagonal matrix with the corresponding eigenvalues on the diagonal: \(D = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)\)
This factorization is called the eigendecomposition or spectral decomposition (for symmetric matrices).
5.2 When Does Eigendecomposition Exist?¶
| Condition | Eigendecomposition Exists? |
|---|---|
| \(A\) has \(n\) distinct eigenvalues | Always yes |
| \(A\) is symmetric (\(A = A^T\)) | Always yes, and \(P\) is orthogonal (\(P^{-1} = P^T\)) |
| \(A\) has repeated eigenvalues | Sometimes (depends on geometric multiplicity) |
| \(A\) is defective (not enough independent eigenvectors) | No |
For symmetric matrices, the decomposition simplifies to:
\[A = PDP^T\]
because the eigenvectors are orthogonal.
5.3 Worked Example¶
Diagonalize the matrix from Part 3:
\[A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\]
We already found \(\lambda_1 = 5\), \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) and \(\lambda_2 = 2\), \(\mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}\). So:
\[P = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}, \qquad D = \begin{bmatrix} 5 & 0 \\ 0 & 2 \end{bmatrix}\]
Compute \(P^{-1}\). For a \(2 \times 2\) matrix \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\), the inverse is \(\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\):
\[P^{-1} = \frac{1}{(1)(-2) - (1)(1)}\begin{bmatrix} -2 & -1 \\ -1 & 1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix}\]
Verification:
\[PDP^{-1} = \frac{1}{3}\begin{bmatrix} 5 & 2 \\ 5 & -4 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 12 & 3 \\ 6 & 9 \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix} = A \;\checkmark\]
5.4 Why Eigendecomposition is Useful¶
Computing matrix powers: If \(A = PDP^{-1}\), then the inner \(P^{-1}P\) factors cancel when multiplying \(A\) by itself:
\[A^k = (PDP^{-1})(PDP^{-1})\cdots(PDP^{-1}) = PD^kP^{-1}\]
Since \(D\) is diagonal, \(D^k\) is just each diagonal entry raised to the \(k\)-th power:
\[D^k = \begin{bmatrix} \lambda_1^k & & \\ & \ddots & \\ & & \lambda_n^k \end{bmatrix}\]
This makes computing \(A^{100}\) just as easy as computing \(A^2\).
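A small numerical check of this trick, reusing the worked example's matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, P = np.linalg.eig(A)           # columns of P are eigenvectors

# A^10 via eigendecomposition: only the diagonal entries get powered
A10 = P @ np.diag(eigvals**10) @ np.linalg.inv(P)

# Compare against direct repeated multiplication
assert np.allclose(A10, np.linalg.matrix_power(A, 10))
print(A10)
```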
Part 6: Singular Value Decomposition (SVD)¶
6.1 Definition¶
Every matrix \(A\) (of any shape) can be decomposed as:
\[A = U\Sigma V^T\]
where:
- \(A\) is \(m \times n\)
- \(U\) is an \(m \times m\) orthogonal matrix (its columns are the left singular vectors)
- \(\Sigma\) is an \(m \times n\) diagonal matrix (its diagonal entries are the singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\))
- \(V\) is an \(n \times n\) orthogonal matrix (its columns are the right singular vectors)
This is the most important matrix decomposition in applied mathematics and machine learning.
6.2 Geometric Interpretation¶
Any linear transformation \(A\) can be broken down into three simple steps:
- \(V^T\): Rotate (in the input space)
- \(\Sigma\): Scale along each axis (possibly changing dimensions)
- \(U\): Rotate (in the output space)
In plain English: every matrix transformation is just a rotation, followed by a stretch, followed by another rotation.
6.3 Relationship to Eigendecomposition¶
The singular values and vectors are connected to eigenvalues:
| SVD Component | Obtained From |
|---|---|
| \(V\) (right singular vectors) | Eigenvectors of \(A^T A\) |
| \(U\) (left singular vectors) | Eigenvectors of \(A A^T\) |
| \(\sigma_i\) (singular values) | \(\sigma_i = \sqrt{\lambda_i}\) where \(\lambda_i\) are eigenvalues of \(A^T A\) |
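This relationship is easy to verify numerically on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A)             # s is sorted in descending order

# Singular values are the square roots of the eigenvalues of A^T A
lam = np.linalg.eigvalsh(A.T @ A)       # eigvalsh returns ascending order
assert np.allclose(np.sqrt(np.maximum(lam[::-1], 0)), s)

# U and V are orthogonal
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
print(s)
```

(The `np.maximum(..., 0)` guards against tiny negative eigenvalues produced by floating-point round-off.)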
6.4 Worked Example: Computing the SVD¶
Find the SVD of:
\[A = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}\]
This is already a diagonal matrix, so the SVD is straightforward.
Step 1: Compute \(A^T A\).
\[A^T A = \begin{bmatrix} 9 & 0 \\ 0 & 4 \end{bmatrix}\]
Step 2: Find eigenvalues of \(A^T A\).
Eigenvalues: \(\lambda_1 = 9\), \(\lambda_2 = 4\).
Singular values: \(\sigma_1 = \sqrt{9} = 3\), \(\sigma_2 = \sqrt{4} = 2\).
Step 3: Find \(V\) (eigenvectors of \(A^T A\)). Since \(A^T A\) is already diagonal, the standard basis vectors are its eigenvectors: \(V = I\).
Step 4: Find \(U\) (eigenvectors of \(A A^T\)). Likewise \(AA^T = \begin{bmatrix} 9 & 0 \\ 0 & 4 \end{bmatrix}\), so \(U = I\).
Result:
\[A = U\Sigma V^T = I\begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}I^T\]
6.5 Worked Example: Non-Diagonal Matrix¶
Find the SVD of:
\[A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\]
Step 1: Compute \(A^T A\).
\[A^T A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}\]
Step 2: Find eigenvalues of \(A^T A\).
\[\det(A^T A - \lambda I) = (1-\lambda)(2-\lambda) - 1 = \lambda^2 - 3\lambda + 1 = 0 \quad\Rightarrow\quad \lambda = \frac{3 \pm \sqrt{5}}{2}\]
Singular values:
\[\sigma_1 = \sqrt{\frac{3+\sqrt{5}}{2}} = \frac{1+\sqrt{5}}{2} \approx 1.618, \qquad \sigma_2 = \sqrt{\frac{3-\sqrt{5}}{2}} = \frac{\sqrt{5}-1}{2} \approx 0.618\]
(Notice: \(\sigma_1 = \phi\), the golden ratio, and \(\sigma_2 = 1/\phi\), exactly.)
Step 3: Find \(V\).
For \(\lambda_1 = \frac{3+\sqrt{5}}{2}\), solve \((A^TA - \lambda_1 I)\mathbf{v} = 0\):
\[\begin{bmatrix} 1-\lambda_1 & 1 \\ 1 & 2-\lambda_1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
This gives \(v_2 = (\lambda_1 - 1) v_1\). Normalizing:
\[\mathbf{v}_1 = \frac{1}{\sqrt{1 + (\lambda_1 - 1)^2}}\begin{bmatrix} 1 \\ \lambda_1 - 1 \end{bmatrix}\]
Similarly for \(\lambda_2\). Then \(V = [\mathbf{v}_1 \mid \mathbf{v}_2]\).
Step 4: Find \(U\).
Compute \(\mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i\) for each singular vector.
The full numerical result gives \(A = U\Sigma V^T\) as desired.
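A minimal numerical check, assuming the example matrix is the unit shear \(\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\) (consistent with the golden-ratio singular values above):

```python
import numpy as np

# Unit shear matrix (assumed example); singular values should be phi and 1/phi
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
phi = (1 + np.sqrt(5)) / 2              # golden ratio

assert np.allclose(s, [phi, 1 / phi])
assert np.allclose(U @ np.diag(s) @ Vt, A)   # reconstruction
print(s)
```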
6.6 Key Properties of SVD¶
| Property | Statement |
|---|---|
| Existence | Every matrix has an SVD (unlike eigendecomposition) |
| Rank | \(\text{rank}(A) =\) number of nonzero singular values |
| Norm | \(\|A\|_2 = \sigma_1\) (largest singular value) |
| Frobenius norm | \(\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}\) |
| Condition number | \(\kappa(A) = \sigma_1 / \sigma_n\) (ratio of largest to smallest) |
Part 7: Matrix Approximation¶
7.1 Truncated SVD¶
Given the full SVD \(A = U\Sigma V^T\) with \(r\) nonzero singular values, we can approximate \(A\) by keeping only the \(k\) largest singular values (where \(k < r\)):
\[A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T\]
where:
- \(U_k\) is the first \(k\) columns of \(U\)
- \(\Sigma_k\) is the top-left \(k \times k\) block of \(\Sigma\)
- \(V_k\) is the first \(k\) columns of \(V\)
In plain English: keep the \(k\) most important "layers" of the matrix and throw away the rest. Each layer is a rank-1 matrix \(\sigma_i \mathbf{u}_i \mathbf{v}_i^T\) weighted by its singular value.
7.2 The Eckart-Young Theorem¶
The truncated SVD gives the best rank-\(k\) approximation to \(A\):
\[A_k = \underset{\text{rank}(B) \leq k}{\arg\min} \; \|A - B\|_F\]
In other words, among all matrices with rank at most \(k\), \(A_k\) is the closest to \(A\) in the Frobenius norm. The approximation error is:
\[\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2}\]
This is a remarkable result: the best low-rank approximation is obtained simply by discarding the smallest singular values.
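The theorem's error formula can be verified numerically on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD

# Eckart-Young: the Frobenius error equals the energy of the discarded sigmas
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
print(err)
```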
7.3 Worked Example: Rank-1 Approximation¶
Find the best rank-1 approximation to:
\[A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}\]
Step 1: Compute the SVD.
\[A^T A = \begin{bmatrix} 10 & 6 \\ 6 & 10 \end{bmatrix}\]
Eigenvalues of \(A^T A\): \(\lambda_1 = 16\), \(\lambda_2 = 4\).
Singular values: \(\sigma_1 = 4\), \(\sigma_2 = 2\).
Eigenvectors of \(A^T A\):
For \(\lambda_1 = 16\): \(\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}\)
For \(\lambda_2 = 4\): \(\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}\)
Left singular vectors: \(\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i\)
\[\mathbf{u}_1 = \frac{1}{4} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 4 \\ 4 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \mathbf{u}_2 = \frac{1}{2} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 2 \\ -2 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}\]
Step 2: Compute the rank-1 approximation.
\[A_1 = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T = 4 \cdot \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}\]
Step 3: Check the error.
\[A - A_1 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \qquad \|A - A_1\|_F = \sqrt{1+1+1+1} = 2 = \sigma_2 \;\checkmark\]
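A numerical check of this example, assuming the matrix \(\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}\) (consistent with the singular values and vectors above):

```python
import numpy as np

# Assumed example matrix, consistent with sigma_1 = 4, sigma_2 = 2 above
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])   # rank-1 term sigma_1 u_1 v_1^T

assert np.allclose(A1, [[2.0, 2.0], [2.0, 2.0]])
assert np.isclose(np.linalg.norm(A - A1, 'fro'), s[1])   # error = sigma_2
print(A1)
```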
7.4 Applications of Low-Rank Approximation¶
| Application | How Truncated SVD Helps |
|---|---|
| Image compression | An \(m \times n\) image stored as a matrix can be approximated with rank \(k\), requiring only \(k(m + n + 1)\) numbers instead of \(mn\) |
| Dimensionality reduction (PCA) | Principal Component Analysis keeps the top \(k\) singular vectors to reduce feature dimensions |
| Recommender systems | User-item rating matrices are approximated at low rank to predict missing ratings |
| Noise reduction | Small singular values often correspond to noise; discarding them denoises the data |
| Latent semantic analysis | In NLP, document-term matrices are approximated to find semantic structure |
Image compression example: Suppose you have a \(1000 \times 1000\) grayscale image (1,000,000 pixel values). A rank-50 SVD approximation stores only \(50 \times (1000 + 1000 + 1) = 100{,}050\) numbers, which is about 10% of the original, yet often captures the essential visual content.
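A sketch of this storage arithmetic on a synthetic image (the gradient-plus-noise "image" below is an arbitrary stand-in for a real photo):

```python
import numpy as np

# Synthetic 1000x1000 "image": smooth gradient plus mild noise
rng = np.random.default_rng(0)
m, n, k = 1000, 1000, 50
img = (np.outer(np.linspace(0, 1, m), np.linspace(0, 1, n))
       + 0.01 * rng.standard_normal((m, n)))

U, s, Vt = np.linalg.svd(img, full_matrices=False)
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-50 approximation

stored = k * (m + n + 1)        # numbers kept: U_k, V_k, and the sigmas
print(stored, m * n)            # 100050 vs 1000000, about 10%

rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')
print(rel_err)                  # small: the image is nearly low-rank
```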
Summary: Key Takeaways¶
Matrix Scalars¶
- Determinant \(\det(A)\): tells you invertibility and volume scaling
- Trace \(\text{tr}(A)\): sum of diagonal entries = sum of eigenvalues
Eigenvalues and Eigenvectors¶
- \(A\mathbf{v} = \lambda \mathbf{v}\): eigenvectors are special directions, eigenvalues are scale factors
- Found by solving \(\det(A - \lambda I) = 0\)
Matrix Decompositions¶
- Cholesky: \(A = LL^T\) for symmetric positive definite matrices
- Eigendecomposition: \(A = PDP^{-1}\) for diagonalizable square matrices
- SVD: \(A = U\Sigma V^T\) for any matrix (the universal decomposition)
Matrix Approximation¶
- Truncated SVD gives the best rank-\(k\) approximation (Eckart-Young theorem)
- Central to PCA, image compression, recommender systems, and denoising
Practice Problems¶
Problem 1¶
Compute the determinant:
Problem 2¶
Let \(A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) and \(B = \begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix}\).
Verify that \(\text{tr}(AB) = \text{tr}(BA)\).
Problem 3¶
Find the eigenvalues and eigenvectors of:
\[C = \begin{bmatrix} 5 & 4 \\ 2 & 3 \end{bmatrix}\]
Problem 4¶
Find the Cholesky decomposition of:
Problem 5¶
Diagonalize the matrix from Problem 3. That is, find \(P\) and \(D\) such that \(C = PDP^{-1}\).
Problem 6¶
The singular values of a \(3 \times 3\) matrix \(M\) are \(\sigma_1 = 10\), \(\sigma_2 = 5\), \(\sigma_3 = 1\).
(a) What is \(\|M\|_F\)?
(b) What is the Frobenius-norm error of the best rank-2 approximation \(M_2\)?
(c) What percentage of \(\|M\|_F^2\) is captured by \(M_2\)?
Solutions¶
Solution 1:
Using cofactor expansion along the first row:
Solution 2:
\[AB = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 10 \\ 3 & 10 \end{bmatrix}, \qquad \text{tr}(AB) = 1 + 10 = 11\]
\[BA = \begin{bmatrix} 0 & 4 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 12 \\ 4 & 7 \end{bmatrix}, \qquad \text{tr}(BA) = 4 + 7 = 11\]
Both traces equal 11, confirming \(\text{tr}(AB) = \text{tr}(BA)\).
Solution 3:
Characteristic polynomial:
\[\det(C - \lambda I) = (5-\lambda)(3-\lambda) - 8 = \lambda^2 - 8\lambda + 7 = (\lambda - 7)(\lambda - 1) = 0 \quad\Rightarrow\quad \lambda_1 = 7, \quad \lambda_2 = 1\]
Eigenvector for \(\lambda_1 = 7\):
\[(C - 7I)\mathbf{v} = \begin{bmatrix} -2 & 4 \\ 2 & -4 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(-2v_1 + 4v_2 = 0\), so \(v_1 = 2v_2\). Choosing \(v_2 = 1\):
\[\mathbf{v}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\]
Eigenvector for \(\lambda_2 = 1\):
\[(C - I)\mathbf{v} = \begin{bmatrix} 4 & 4 \\ 2 & 2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \mathbf{0}\]
From the first row: \(4v_1 + 4v_2 = 0\), so \(v_1 = -v_2\). Choosing \(v_2 = 1\):
\[\mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\]
Check: \(\text{tr}(C) = 5 + 3 = 8 = 7 + 1\) and \(\det(C) = 15 - 8 = 7 = 7 \times 1\). Both pass.
Solution 4:
Verification:
Solution 5:
From Solution 3: \(\lambda_1 = 7\), \(\mathbf{v}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\), \(\lambda_2 = 1\), \(\mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\). So:
\[P = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad D = \begin{bmatrix} 7 & 0 \\ 0 & 1 \end{bmatrix}, \qquad P^{-1} = \frac{1}{3}\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}\]
Verification:
\[PDP^{-1} = \frac{1}{3}\begin{bmatrix} 14 & -1 \\ 7 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 15 & 12 \\ 6 & 9 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ 2 & 3 \end{bmatrix} = C \;\checkmark\]
Solution 6:
(a) The Frobenius norm is:
\[\|M\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \sigma_3^2} = \sqrt{100 + 25 + 1} = \sqrt{126} \approx 11.22\]
(b) The best rank-2 approximation discards \(\sigma_3 = 1\). The error is:
\[\|M - M_2\|_F = \sigma_3 = 1\]
(c) The squared Frobenius norm captured by \(M_2\) is:
\[\frac{\sigma_1^2 + \sigma_2^2}{\|M\|_F^2} = \frac{125}{126} \approx 0.992\]
So the rank-2 approximation captures about 99.2% of the total "energy" of \(M\).
Next: Tutorial 4 - Matrix Calculus and Optimization