Singular Value Decomposition
A step-by-step visual guide to understanding SVD — the most powerful matrix factorization in modern data science, machine learning, and signal processing.
Why SVD? The Problem with Eigenvalues
Eigenvalues are beautiful — but they break down for rectangular matrices. If A is m×n (with m≠n), the equation Ax = λx is impossible: Ax lives in ℝᵐ while λx lives in ℝⁿ. They can't be equal.
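The mismatch is easy to see in numpy. This is a toy 3×2 matrix chosen purely for illustration:

```python
import numpy as np

# A rectangular 3x2 matrix maps R^2 into R^3
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([1.0, 0.0])   # x lives in R^2

print((A @ x).shape)   # (3,)  -- Ax lives in R^3
print(x.shape)         # (2,)  -- lambda*x would live in R^2
```

The two sides have different dimensions, so Ax = λx can never hold.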
The Core Formula
Any matrix A — no matter how large or rectangular — can be written as the product of three special matrices:

A = U Σ Vᵀ

where U (m×m) and V (n×n) are orthogonal and Σ (m×n) is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ … ≥ 0, the singular values. Equivalently, Avᵢ = σᵢuᵢ: each right singular vector vᵢ maps to a scaled left singular vector uᵢ. This replaces Ax = λx.
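The relation Avᵢ = σᵢuᵢ can be checked numerically. A small sketch using the same 2×2 matrix as the worked example below:

```python
import numpy as np

A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

# Check A v_i = sigma_i u_i for every singular triplet
# (rows of Vt are the right singular vectors v_i)
for i in range(len(s)):
    lhs = A @ Vt[i]         # A applied to v_i
    rhs = s[i] * U[:, i]    # sigma_i times u_i
    assert np.allclose(lhs, rhs)
print("A v_i = sigma_i u_i holds for all i")
```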
How to Derive the SVD
The magic ingredient is AᵀA. Let's see why this square, symmetric, positive semi-definite matrix unlocks everything.
For any m×n matrix A, AᵀA is n×n, symmetric, and positive semi-definite (all eigenvalues ≥ 0). This guarantees real, non-negative eigenvalues.
Since AᵀA is symmetric, the Spectral Theorem guarantees orthonormal eigenvectors. These become our right singular vectors V.
Similarly, AAᵀ is m×m, symmetric, and has the same non-zero eigenvalues as AᵀA. Its eigenvectors become our left singular vectors U.
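All three properties are easy to verify numerically. A sketch with an arbitrary 3×2 example matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3x2, so AtA is 2x2 and AAt is 3x3

AtA = A.T @ A
assert np.allclose(AtA, AtA.T)             # symmetric

eigvals, V = np.linalg.eigh(AtA)
assert np.all(eigvals >= -1e-12)           # positive semi-definite
assert np.allclose(V.T @ V, np.eye(2))     # orthonormal eigenvectors (Spectral Theorem)

# AAt shares the same non-zero eigenvalues (plus one extra zero here)
eig_big = np.linalg.eigvalsh(A @ A.T)
assert np.allclose(np.sort(eig_big)[-2:], np.sort(eigvals))
print("AtA is symmetric PSD; AAt shares its non-zero eigenvalues")
```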
We also need the columns of U to be orthogonal: check uᵢᵀuⱼ = 0 for i ≠ j. This is the key orthogonality proof:

uᵢᵀuⱼ = (Avᵢ/σᵢ)ᵀ(Avⱼ/σⱼ) = vᵢᵀ(AᵀA)vⱼ/(σᵢσⱼ) = (σⱼ²/(σᵢσⱼ)) vᵢᵀvⱼ = 0,

using AᵀAvⱼ = σⱼ²vⱼ and the fact that the vᵢ are orthonormal eigenvectors of AᵀA.
Worked Example — Step by Step
Let's compute the SVD of a concrete 2×2 matrix by hand: A = [3 0; 4 5].
Step 1 — Compute AᵀA
First transpose A, then multiply:

Aᵀ = [3 4; 0 5], so AᵀA = [3 4; 0 5] · [3 0; 4 5] = [25 20; 20 25]
Step 2 — Find Eigenvalues (σ²)
Solve det(AᵀA − λI) = 0:

(25 − λ)² − 400 = 0 ⟹ 25 − λ = ±20 ⟹ λ₁ = 45, λ₂ = 5

so σ₁ = √45 ≈ 6.708 and σ₂ = √5 ≈ 2.236.
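We can confirm the roots of this characteristic polynomial with numpy (a quick check of Step 2):

```python
import numpy as np

AtA = np.array([[25.0, 20.0], [20.0, 25.0]])

# (25 - lam)^2 - 400 = 0  =>  lam = 45 or lam = 5
lams = np.linalg.eigvalsh(AtA)     # ascending order: 5, then 45
print(lams)                        # eigenvalues 5 and 45
print(np.sqrt(lams[::-1]))         # singular values ~ 6.708 and 2.236
```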
Step 3 — Find V (eigenvectors of AᵀA)

For λ₁ = 45, (AᵀA − 45I)v = 0 gives v₁ = (1, 1)/√2 ≈ (0.707, 0.707). For λ₂ = 5, v₂ = (1, −1)/√2 ≈ (0.707, −0.707).
Step 4 — Find U using uᵢ = Avᵢ / σᵢ

u₁ = Av₁/σ₁ = (3, 9)/(√2 · √45) = (1, 3)/√10 ≈ (0.316, 0.949)
u₂ = Av₂/σ₂ = (3, −1)/(√2 · √5) = (3, −1)/√10 ≈ (0.949, −0.316)
Step 5 — Final Result

A = [3 0; 4 5] = U Σ Vᵀ, with

U ≈ [0.316 0.949; 0.949 −0.316], Σ ≈ [6.708 0; 0 2.236], Vᵀ ≈ [0.707 0.707; 0.707 −0.707]
Python Code
```python
import numpy as np

# Define matrix A
A = np.array([[3, 0], [4, 5]], dtype=float)

# Compute SVD
U, sigma, Vt = np.linalg.svd(A)
print("U =\n", U)
print("Singular values:", sigma)
print("Vt =\n", Vt)

# Reconstruct A to verify
Sigma = np.diag(sigma)
A_reconstructed = U @ Sigma @ Vt
print("Reconstructed A =\n", A_reconstructed)

# Output:
# Singular values: [6.708 2.236]
# U  ≈ [[0.316, 0.949], [0.949, -0.316]]
# Vt ≈ [[0.707, 0.707], [0.707, -0.707]]
```
Geometric Interpretation
The SVD tells us something profound: every linear transformation is a rotation, followed by a stretch, followed by another rotation.
Multiply by Vᵀ (an orthogonal matrix). This rotates the input unit circle without changing its shape. Lengths are preserved. The unit circle stays a unit circle — just turned.
Multiply by Σ. This stretches along the coordinate axes by σ₁, σ₂, … The unit circle becomes an ellipse with semi-axes of length σ₁ ≥ σ₂ ≥ …
Finally multiply by U. This rotates the ellipse to its final position in the output space. The tip of the longest axis points along u₁, shortest along u₂.
Unit circle → rotation (Vᵀ) → ellipse stretch (Σ) → final rotation (U)
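This picture can be verified numerically. Using the worked example's matrix, the image of the unit circle is an ellipse whose largest and smallest radii are exactly the singular values:

```python
import numpy as np

A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

# Sample the unit circle and push it through A
theta = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # shape (2, 400)
ellipse = A @ circle

# Semi-axes of the image ellipse are sigma_1 (longest) and sigma_2 (shortest)
radii = np.linalg.norm(ellipse, axis=0)
assert np.isclose(radii.max(), s[0], atol=1e-3)
assert np.isclose(radii.min(), s[1], atol=1e-3)

# The rotation Vt preserves lengths: the circle stays a unit circle
assert np.allclose(np.linalg.norm(Vt @ circle, axis=0), 1.0)
print("ellipse semi-axes match the singular values")
```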
Full SVD vs. Compact SVD
There are two flavors of SVD. The compact version discards zeros and is much more efficient for large data.
| Property | Full SVD | Compact (Economy) SVD |
|---|---|---|
| U size | m × m | m × r (smaller) |
| Σ size | m × n (padded with 0s) | r × r (no zeros) |
| V size | n × n | n × r (smaller) |
| Null space | ✓ included | ✗ excluded |
| Storage | large | efficient |
| Use case | Theory, full decomposition | Data science, compression, PCA |
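The size difference shows up directly in numpy's `full_matrices` flag. A sketch with an arbitrary tall 6×3 matrix:

```python
import numpy as np

A = np.random.randn(6, 3)   # tall matrix: m=6, n=3, rank r=3

# Full SVD: U is m x m, Vt is n x n
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)     # (6, 6) (3,) (3, 3)

# Compact SVD: U is m x r -- smaller, no wasted columns
Uc, sc, Vtc = np.linalg.svd(A, full_matrices=False)
print(Uc.shape, sc.shape, Vtc.shape)  # (6, 3) (3,) (3, 3)

# Both reconstruct A exactly
assert np.allclose(Uc @ np.diag(sc) @ Vtc, A)
```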
Rank-k Approximation
One of SVD's superpowers: you can approximate A using only the top-k terms:

Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ⋯ + σₖuₖvₖᵀ
This is the best possible rank-k approximation — no other rank-k matrix is closer to A in Frobenius norm. It's the foundation of image compression, recommendation systems, and PCA.
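This optimality (the Eckart–Young theorem) is easy to check numerically: the Frobenius error of the truncated SVD equals the root-sum-of-squares of the discarded singular values. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k truncation

# Frobenius error of the best rank-k approximation is
# sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
print(f"rank-{k} error = {err:.3f}")
```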
Polar Decomposition
Just as complex numbers have a polar form z = r·eⁱᶿ = (magnitude) × (rotation), matrices have a polar decomposition:

A = S Q

where S is symmetric positive semi-definite (the "stretch") and Q is orthogonal (the "rotation"). We extract it directly from the SVD by inserting UᵀU = I:

A = UΣVᵀ = UΣ(UᵀU)Vᵀ = (UΣUᵀ)(UVᵀ) = S·Q, with S = UΣUᵀ and Q = UVᵀ.
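A minimal numerical check of this factorization, using the worked example's matrix:

```python
import numpy as np

A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

S = U @ np.diag(s) @ U.T   # symmetric PSD "stretch"
Q = U @ Vt                 # orthogonal "rotation"

assert np.allclose(S, S.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)   # positive semi-definite
assert np.allclose(Q @ Q.T, np.eye(2))           # orthogonal
assert np.allclose(S @ Q, A)                     # A = S Q
print("polar decomposition A = S Q verified")
```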
Connection to PCA
Principal Component Analysis (PCA) is SVD — just viewed through a statistics lens. For a centered data matrix A (rows = samples, columns = features), the covariance matrix is

C = AᵀA/(n−1) = V (ΣᵀΣ/(n−1)) Vᵀ
The eigenvectors of the covariance matrix are the principal components — and they are exactly the right singular vectors V of A.
σᵢ² / (n−1) is the variance explained by the i-th principal component. The first component (σ₁, v₁) captures the most variance.
Project data onto the top-k principal components: Z = A · Vₖ where Vₖ contains the first k columns of V. This gives the best k-dimensional representation.
```python
import numpy as np

# Data matrix: n_samples x n_features
X = np.random.randn(100, 10)

# Step 1: Center the data
X_centered = X - X.mean(axis=0)

# Step 2: SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: Keep top-2 components
k = 2
Z = X_centered @ Vt[:k].T   # shape: 100 x 2

# Variance explained by each component
var_explained = sigma**2 / np.sum(sigma**2)
print(f"Top-2 explain: {var_explained[:2].sum()*100:.1f}%")
```
Practical Computation
Complexity
| Method | Time Complexity | Use When |
|---|---|---|
| Full SVD (LAPACK) | O(min(m,n) · mn) | m,n < 10,000 |
| Randomized SVD | O(mnk) | Want top-k, large matrix |
| Sparse SVD (ARPACK) | O(nnz · k) | Sparse data, large n |
Quick API Reference
```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Full SVD (numpy)
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # full U, Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # compact form

# Randomized SVD — top k components (fast for large A)
U, s, Vt = randomized_svd(A, n_components=50, random_state=42)

# Rank-k reconstruction
A_approx = (U * s) @ Vt   # shape restored to original A shape

# Image compression example: storing rank k costs k(m + n + 1) numbers
# instead of m*n
m, n = A.shape
for k in [5, 20, 50, 100]:
    img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    print(f"k={k}: compression ratio = {m * n / (k * (m + n + 1)):.1f}x")
```