Gilbert Strang SVD — Singular Value Decomposition | Sim Vattanac
Prof. Gilbert Strang · MIT 18.065

Singular Value Decomposition

A step-by-step visual guide to understanding SVD — the most powerful matrix factorization in modern data science, machine learning, and signal processing.

Linear Algebra · Matrix Factorization · Data Science · PCA · Compression
01

Why SVD? The Problem with Eigenvalues

Eigenvalues are beautiful — but they break down for rectangular matrices. If A is m×n (with m≠n), the equation Ax = λx is impossible: Ax lives in ℝᵐ while λx lives in ℝⁿ. They can't be equal.

Eigenvalue Problems
Only works for square matrices. Eigenvalues can be complex. Eigenvectors may not be orthogonal. Fails completely for data matrices.
SVD Solution
Works for any matrix — any shape. Always real, non-negative values. Always orthogonal vectors. The universal factorization.
Big idea: SVD replaces one set of eigenvectors with two sets of singular vectors — one for the input space, one for the output space — connected by singular values.
02

The Core Formula

Any matrix A — no matter how large or rectangular — can be written as the product of three special matrices:

A (m×n) = U (m×m) · Σ (m×n) · Vᵀ (n×n)
U
Left Singular Vectors
Columns are orthonormal. They form an orthogonal basis for the output / column space. Think: "where things land after transformation."
Σ
Singular Values
A diagonal matrix with values σ₁ ≥ σ₂ ≥ … ≥ 0. They measure the strength of each component. Always real and non-negative.
V
Right Singular Vectors
Columns are orthonormal. They form an orthogonal basis for the input / row space. Think: "directions to stretch along."
The Key Equation
Avᵢ = σᵢ uᵢ
Each right vector vᵢ maps to a scaled left vector uᵢ. This replaces Ax = λx.
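The key equation can be checked numerically. A minimal sketch with NumPy, using the 2×2 matrix from the worked example below, confirming A·vᵢ = σᵢ·uᵢ for each pair:

```python
import numpy as np

# Check A @ v_i == sigma_i * u_i for the 2x2 example matrix.
A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

for i in range(len(s)):
    v_i = Vt[i]        # i-th right singular vector (row of Vt)
    u_i = U[:, i]      # i-th left singular vector (column of U)
    assert np.allclose(A @ v_i, s[i] * u_i)
```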
03

How to Derive the SVD

The magic ingredient is AᵀA. Let's see why this square, symmetric, positive semi-definite matrix unlocks everything.

1
Form AᵀA — the symmetric powerhouse

For any m×n matrix A, AᵀA is n×n, symmetric, and positive semi-definite (all eigenvalues ≥ 0). This guarantees real, non-negative eigenvalues.

Properties of AᵀA
(AᵀA)ᵀ = AᵀA   →   symmetric

xᵀ(AᵀA)x = ‖Ax‖² ≥ 0   →   PSD
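Both properties are easy to verify numerically. A quick sketch with a random rectangular matrix (the shape 5×3 is an arbitrary choice):

```python
import numpy as np

# Verify: for any rectangular A, AtA is symmetric and positive semi-definite.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
AtA = A.T @ A

assert np.allclose(AtA, AtA.T)          # (AtA)^T = AtA  ->  symmetric
eigvals = np.linalg.eigvalsh(AtA)
assert np.all(eigvals >= -1e-12)        # all eigenvalues >= 0 (up to roundoff)
```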
2
Find V — eigenvectors of AᵀA

Since AᵀA is symmetric, the Spectral Theorem guarantees orthonormal eigenvectors. These become our right singular vectors V.

AᵀA = VΛVᵀ        ← spectral decomposition
AᵀA vᵢ = λᵢ vᵢ     ← the vᵢ are the right singular vectors
σᵢ = √λᵢ           ← singular values are square roots of the eigenvalues
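The relation σᵢ = √λᵢ can be confirmed directly: the sorted eigenvalues of AᵀA, square-rooted, should match the singular values NumPy reports. A small sketch (random 6×4 matrix is an arbitrary choice):

```python
import numpy as np

# Compare eigenvalues of AtA with the squared singular values of A.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # descending
s = np.linalg.svd(A, compute_uv=False)                 # singular values, descending

# sigma_i = sqrt(lambda_i), term by term
assert np.allclose(np.sqrt(np.maximum(eigvals, 0)), s)
```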
3
Find U — eigenvectors of AAᵀ

Similarly, AAᵀ is m×m, symmetric, and has the same non-zero eigenvalues as AᵀA. Its eigenvectors become our left singular vectors U.

AAᵀ = UΛUᵀ         ← same non-zero eigenvalues as AᵀA
uᵢ = Avᵢ / σᵢ       ← practical formula to get uᵢ from vᵢ
4
Verify: u's are actually orthogonal

We need to check u₁ᵀu₂ = 0. This is the key orthogonality proof:

u₁ᵀu₂ = (Av₁/σ₁)ᵀ(Av₂/σ₂)      ← substitute the definition of uᵢ
      = v₁ᵀ(AᵀA)v₂ / (σ₁σ₂)     ← rearrange
      = v₁ᵀ(σ₂²v₂) / (σ₁σ₂)     ← v₂ is an eigenvector of AᵀA with λ₂ = σ₂²
      = (σ₂/σ₁) · v₁ᵀv₂ = 0 ✓   ← the v's are orthonormal
Because the v's are orthonormal eigenvectors of AᵀA, when we multiply by A we get orthonormal u's. Orthogonal inputs → orthogonal outputs.
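The whole construction can be sketched in a few lines: take the eigenvectors of AᵀA as V, build each uᵢ = Avᵢ/σᵢ, and check that the resulting columns are indeed orthonormal (the 5×3 random matrix is an arbitrary full-rank example):

```python
import numpy as np

# Build u_i = A v_i / sigma_i from the eigenvectors of AtA, check UtU = I.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

eigvals, V = np.linalg.eigh(A.T @ A)     # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # sort descending
V = V[:, order]
sigma = np.sqrt(eigvals[order])

U = (A @ V) / sigma                      # column i is A v_i / sigma_i
assert np.allclose(U.T @ U, np.eye(3))   # orthogonal inputs -> orthogonal outputs
```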
04

Worked Example — Step by Step

Let's compute the SVD of a concrete 2×2 matrix by hand.

Our Matrix
A = [[3, 0], [4, 5]]

Step 1 — Compute AᵀA

First transpose A, then multiply:

AᵀA = [3 4] · [3 0] = [25 20]
      [0 5]   [4 5]   [20 25]

Step 2 — Find Eigenvalues (σ²)

Solve det(AᵀA − λI) = 0:

det([[25−λ, 20], [20, 25−λ]]) = 0
(25−λ)² − 400 = 0
λ₁ = 45,  λ₂ = 5
σ₁ = √45 ≈ 6.71,  σ₂ = √5 ≈ 2.24   ← singular values

Step 3 — Find V (eigenvectors of AᵀA)

For λ₁ = 45: v₁ = [1/√2, 1/√2]ᵀ ≈ [0.707, 0.707]ᵀ
For λ₂ = 5:  v₂ = [1/√2, −1/√2]ᵀ ≈ [0.707, −0.707]ᵀ

V = [0.707   0.707]      Σ = [6.71   0  ]
    [0.707  −0.707]          [0     2.24]

Step 4 — Find U using uᵢ = Avᵢ / σᵢ

Av₁ = [3 0] [0.707] = [2.12]      u₁ = Av₁ / 6.71 ≈ [0.316]
      [4 5] [0.707]   [6.36]                         [0.949]

Av₂ = [3 0] [ 0.707] = [ 2.12]    u₂ = Av₂ / 2.24 ≈ [ 0.949]
      [4 5] [−0.707]   [−0.71]                       [−0.316]

Step 5 — Final Result

A            U                  Σ                Vᵀ
[3  0]  =  [0.316   0.949] · [6.71   0  ] · [0.707   0.707]
[4  5]     [0.949  −0.316]   [0     2.24]   [0.707  −0.707]
Verification: multiply U · Σ · Vᵀ back out and you recover the original matrix A = [[3,0],[4,5]] (exactly in exact arithmetic; up to rounding in the decimals shown).

Python Code

import numpy as np

# Define matrix A
A = np.array([[3, 0],
              [4, 5]], dtype=float)

# Compute SVD
U, sigma, Vt = np.linalg.svd(A)

print("U =\n", U)
print("Singular values:", sigma)
print("Vt =\n", Vt)

# Reconstruct A to verify
Sigma = np.diag(sigma)
A_reconstructed = U @ Sigma @ Vt
print("Reconstructed A =\n", A_reconstructed)

# Output:
# Singular values: [6.708  2.236]
# U ≈ [[0.316, 0.949], [0.949, -0.316]]
# Vt ≈ [[0.707, 0.707], [0.707, -0.707]]
# Note: SVD is only unique up to sign — numpy may flip the sign of a u_i
# and its matching v_i together; the product U Σ Vᵀ is unchanged.
05

Geometric Interpretation

The SVD tells us something profound: every linear transformation is a rotation, followed by a stretch, followed by another rotation.

Vᵀ — First Rotation

Multiply by Vᵀ (an orthogonal matrix). This rotates the input unit circle without changing its shape. Lengths are preserved. The unit circle stays a unit circle — just turned.

Σ — Stretching (the key part!)

Multiply by Σ. This stretches along the coordinate axes by σ₁, σ₂, … The unit circle becomes an ellipse with semi-axes of length σ₁ ≥ σ₂ ≥ …

σ₁ is always the largest. The ellipse's longest axis has length σ₁, shortest axis σ₂.
U — Second Rotation

Finally multiply by U. This rotates the ellipse to its final position in the output space. The tip of the longest axis points along u₁, shortest along u₂.

Geometric Summary
Unit circle → rotation (Vᵀ) → ellipse stretch (Σ) → final rotation (U)
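The rotate–stretch–rotate picture can be verified numerically: push points on the unit circle through A, then rotate back into the u₁/u₂ coordinate system and check they satisfy the ellipse equation with semi-axes σ₁, σ₂. A sketch using the worked example's 2×2 matrix:

```python
import numpy as np

# Map the unit circle through A = U S Vt; the image is an ellipse with
# semi-axes sigma_1, sigma_2 along u_1, u_2.
A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # unit circle, 2 x 200

ellipse = A @ circle
coords = U.T @ ellipse                               # express in u_1/u_2 axes
lhs = (coords[0] / s[0])**2 + (coords[1] / s[1])**2
assert np.allclose(lhs, 1.0)                         # (x/s1)^2 + (y/s2)^2 = 1
```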

06

Full SVD vs. Compact SVD

There are two flavors of SVD. The compact version discards zeros and is much more efficient for large data.

| Property   | Full SVD                   | Compact (Economy) SVD          |
|------------|----------------------------|--------------------------------|
| U size     | m × m                      | m × r (smaller)                |
| Σ size     | m × n (padded with 0s)     | r × r (no zeros)               |
| V size     | n × n                      | n × r (smaller)                |
| Null space | ✓ included                 | ✗ excluded                     |
| Storage    | large                      | efficient                      |
| Use case   | theory, full decomposition | data science, compression, PCA |
r = rank(A) — only the r non-zero singular values matter. Everything else is zeros that don't contribute to the reconstruction.
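The shape difference is easy to see with NumPy's `full_matrices` flag; a sketch with a tall 8×3 matrix (generically full rank, so r = 3):

```python
import numpy as np

# Compare full vs compact (economy) SVD shapes for a tall matrix.
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))          # m=8, n=3, r=3 generically

Uf, sf, Vtf = np.linalg.svd(A, full_matrices=True)
Uc, sc, Vtc = np.linalg.svd(A, full_matrices=False)

assert Uf.shape == (8, 8)                # full: U is m x m
assert Uc.shape == (8, 3)                # compact: U is m x r
# Both reconstruct A; the compact form stores far fewer numbers.
assert np.allclose((Uc * sc) @ Vtc, A)
```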

Rank-k Approximation

One of SVD's superpowers: you can approximate A using only the top-k terms:

Best Rank-k Approximation (Eckart–Young Theorem)
Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + … + σₖuₖvₖᵀ

This is the best possible rank-k approximation — no other rank-k matrix is closer to A in Frobenius norm. It's the foundation of image compression, recommendation systems, and PCA.
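Eckart–Young also tells us exactly how big the error is: ‖A − Aₖ‖_F = √(σₖ₊₁² + … + σᵣ²). A sketch verifying both the truncation and the error formula (random 20×10 matrix, k = 3 are arbitrary choices):

```python
import numpy as np

# Rank-k approximation: keep the top-k terms sigma_i u_i v_i^T.
rng = np.random.default_rng(4)
A = rng.standard_normal((20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k]        # rank-k truncation

# Frobenius error equals the root-sum-square of the discarded sigmas.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```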

07

Polar Decomposition

Just as complex numbers have a polar form z = r·e^(iθ) = (magnitude) × (rotation), matrices have a polar decomposition:

Polar Form
A = S · Q

Where S is symmetric positive semi-definite (the "stretch") and Q is orthogonal (the "rotation"). We extract it directly from the SVD by inserting UᵀU = I:

A = UΣVᵀ
  = (UΣUᵀ) · (UVᵀ)     ← insert UᵀU = I between Σ and Vᵀ
  = S · Q
S
S = UΣUᵀ
Symmetric positive semi-definite. Represents the "pure stretch" part of the transformation — no rotation, just scaling.
Q
Q = UVᵀ
Orthogonal matrix. Represents the "pure rotation" part. QᵀQ = I. This is the rotation after all stretching is removed.
In mechanical engineering, this says: any deformation of an elastic body = pure stretch × internal rotation. The SVD separates these two effects cleanly.
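The factors S = UΣUᵀ and Q = UVᵀ drop straight out of `np.linalg.svd`; a sketch using the worked example's matrix, verifying each claimed property:

```python
import numpy as np

# Polar decomposition A = S Q extracted from the SVD.
A = np.array([[3.0, 0.0], [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

S = U @ np.diag(s) @ U.T    # pure stretch: symmetric PSD
Q = U @ Vt                  # pure rotation: orthogonal

assert np.allclose(S, S.T)                         # symmetric
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)     # positive semi-definite
assert np.allclose(Q.T @ Q, np.eye(2))             # orthogonal
assert np.allclose(S @ Q, A)                       # A = S . Q
```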
08

Connection to PCA

Principal Component Analysis (PCA) is SVD — just viewed through a statistics lens. For a centered data matrix A (rows = samples, columns = features):

1
The covariance matrix is AᵀA / (n−1)

The eigenvectors of the covariance matrix are the principal components — and they are exactly the right singular vectors V of A.

2
Singular values encode variance

σᵢ² / (n−1) is the variance explained by the i-th principal component. The first component (σ₁, v₁) captures the most variance.

Variance Explained
Var_i = σᵢ² / Σⱼ σⱼ²
3
Dimensionality reduction

Project data onto the top-k principal components: Z = A · Vₖ where Vₖ contains the first k columns of V. This gives the best k-dimensional representation.

import numpy as np

# Data matrix: n_samples × n_features
X = np.random.randn(100, 10)

# Step 1: Center the data
X_centered = X - X.mean(axis=0)

# Step 2: SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: Keep top-2 components
k = 2
Z = X_centered @ Vt[:k].T  # shape: 100 × 2

# Variance explained by each component
var_explained = sigma**2 / np.sum(sigma**2)
print(f"Top-2 explain: {var_explained[:2].sum()*100:.1f}%")
09

Practical Computation

⚠️ Never compute AᵀA explicitly in practice! The proof used AᵀA, but squaring a matrix squares its condition number, dramatically amplifying numerical errors.
Bidiagonalization
Golub-Kahan-Reinsch algorithm. Reduces A to bidiagonal form first, then applies QR iteration. Numerically stable — industry standard.
Randomized SVD
For huge matrices, random projection finds the top-k singular values/vectors in O(mnk) time. Used in sklearn's TruncatedSVD.
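The danger of forming AᵀA explicitly is visible directly in the condition number: since the eigenvalues of AᵀA are σᵢ², its 2-norm condition number is exactly the square of A's. A sketch (random 50×10 matrix is an arbitrary example):

```python
import numpy as np

# Squaring the matrix squares the condition number: cond(AtA) = cond(A)^2.
rng = np.random.default_rng(5)
A = rng.standard_normal((50, 10))

kappa_A = np.linalg.cond(A)          # sigma_1 / sigma_r
kappa_AtA = np.linalg.cond(A.T @ A)  # lambda_1 / lambda_r = (sigma_1/sigma_r)^2

assert np.isclose(kappa_AtA, kappa_A**2, rtol=1e-4)
```

This is why stable SVD algorithms work on A itself (via bidiagonalization) rather than on AᵀA.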

Complexity

| Method              | Time Complexity  | Use When                 |
|---------------------|------------------|--------------------------|
| Full SVD (LAPACK)   | O(min(m,n) · mn) | m, n < 10,000            |
| Randomized SVD      | O(mnk)           | want top-k, large matrix |
| Sparse SVD (ARPACK) | O(nnz · k)       | sparse data, large n     |

Quick API Reference

import numpy as np
from sklearn.utils.extmath import randomized_svd

# Full SVD (numpy)
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # full U, Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # compact form

# Randomized SVD — top k components (fast for large A)
U, s, Vt = randomized_svd(A, n_components=50, random_state=42)

# Rank-k reconstruction (compact / randomized form)
A_approx = (U * s) @ Vt  # shape restored to original A shape

# Image compression example (k must not exceed n_components above)
m, n = A.shape
for k in [5, 10, 20, 50]:
    img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    ratio = (m * n) / (k * (m + n + 1))  # stored numbers: k·(m + n + 1)
    print(f"k={k}: compression ratio = {ratio:.1f}x")
10

Interactive Demo

Adjust the matrix entries and watch the singular values update in real time.

[Interactive widget: live SVD calculator for a 2×2 matrix, with matrix-entry fields and presets.]