The Newton-Schulz iteration is a matrix analogue of Newton's method, used to approximate matrix inverses, inverse square roots, and orthogonalized matrices.
It is useful because it needs only matrix multiplication, so it runs efficiently on GPUs/TPUs without explicitly computing an inverse or an SVD.
Matrix Inverse Form
Goal: approximate $A^{-1}$ for an invertible square matrix $A$.

Given an initial guess $X_0$, iterate:

$$X_{k+1} = X_k (2I - A X_k)$$

Equivalent form:

$$X_{k+1} = 2X_k - X_k A X_k$$

If $X_k = A^{-1}$ exactly, then $A X_k = I$ and the update leaves $X_k$ unchanged, so the true inverse is a fixed point.

Let:

$$E_k = I - A X_k \quad \text{(the residual)}$$

Then:

$$E_{k+1} = I - A X_{k+1} = I - 2 A X_k + (A X_k)^2 = (I - A X_k)^2 = E_k^2$$

so $\|E_{k+1}\| \le \|E_k\|^2$.

This is the key idea: once the error norm is less than 1, each iteration squares it, so the error shrinks quadratically.
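A minimal NumPy sketch of this inverse iteration; the function name and the demo matrix are illustrative choices, not from any particular library:

```python
import numpy as np

def newton_schulz_inverse(A, X0, steps=8):
    """Approximate A^{-1} via the Newton-Schulz update X <- X (2I - A X)."""
    I = np.eye(A.shape[0])
    X = X0
    for _ in range(steps):
        X = X @ (2.0 * I - A @ X)
    return X

# Demo: a matrix close to the identity, so X0 = I already satisfies
# ||I - A X0|| < 1 and the residual is squared at every step.
rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
X = newton_schulz_inverse(A, np.eye(4))
print(np.linalg.norm(np.eye(4) - A @ X))  # ~ machine precision
```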
Convergence Condition
A common sufficient condition is:

$$\|I - A X_0\| < 1$$

in some induced matrix norm. This means the starting guess $X_0$ must already be a rough approximation of $A^{-1}$.

A simple starting point is:

$$X_0 = \frac{A^T}{\|A\|_1 \, \|A\|_\infty}$$

where $\|A\|_1$ is the maximum absolute column sum and $\|A\|_\infty$ is the maximum absolute row sum. Since $\|A\|_2^2 \le \|A\|_1 \|A\|_\infty$, this choice guarantees $\|I - A X_0\|_2 < 1$ for any invertible $A$.
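As a sketch, this initialization pairs with the `newton_schulz_inverse` function above; `safe_initial_guess` is a hypothetical helper name:

```python
import numpy as np

def safe_initial_guess(A):
    """X0 = A^T / (||A||_1 * ||A||_inf).
    Because ||A||_2^2 <= ||A||_1 * ||A||_inf, this guarantees
    ||I - A X0||_2 < 1 for any invertible A."""
    return A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))

A = np.diag([1.0, 10.0, 100.0])  # badly scaled, far from the identity
X0 = safe_initial_guess(A)
print(np.linalg.norm(np.eye(3) - A @ X0, 2))  # 0.9999 < 1: slow but safe
```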
Inverse Square Root / Orthogonalization Form
Newton-Schulz is often used to approximate the polar factor of a matrix.
For a matrix $G$ with singular value decomposition $G = U \Sigma V^T$, the polar factor is $U V^T$: the matrix obtained by replacing every singular value with 1.

Replacing the singular values with 1 makes the columns (or rows) orthonormal. In optimization, this can turn a raw gradient matrix into an update direction with controlled singular values.
A common Newton-Schulz update for this is:

$$X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^T X_k$$

Under this map each singular value evolves independently as $\sigma \mapsto \tfrac{3}{2}\sigma - \tfrac{1}{2}\sigma^3$, which has 1 as an attracting fixed point. Before applying it, $G$ is normalized, e.g. $X_0 = G / \|G\|_F$, so that all singular values start in $(0, \sqrt{3})$, the interval on which the iteration converges.
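A minimal sketch of this cubic orthogonalization, assuming a tall input matrix and Frobenius normalization (names are illustrative):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=20):
    """Push G toward its polar factor U V^T via X <- 1.5 X - 0.5 X X^T X."""
    X = G / np.linalg.norm(G)  # Frobenius norm, so sigma_max(X0) <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 3))
Q = newton_schulz_orthogonalize(G)
print(np.linalg.svd(Q, compute_uv=False))  # all singular values ~ 1.0
```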
Why It Shows Up In Muon
Muon, an optimizer for the hidden layers of neural networks, uses a function called `zeropower_via_newtonschulz5` to process each update.
Here, Newton-Schulz is not being used to solve a linear system directly. It is used to orthogonalize the momentum/gradient matrix before it is applied as a weight update.
Intuition:
- Raw gradients can have directions with very different scales.
- Newton-Schulz pushes the update matrix toward an orthogonal matrix.
- This makes the update depend more on direction and less on extreme singular values.
Some ML implementations use a tuned fifth-order polynomial version instead of the classic cubic update. The goal is not always exact orthogonalization; sometimes the goal is a fast approximate update that works well in a fixed small number of steps.
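For concreteness, here is a NumPy sketch in the style of Muon's quintic iteration; the coefficients 3.4445, -4.7750, 2.0315 come from the Muon reference implementation, but everything else here (names, NumPy rather than PyTorch) is illustrative:

```python
import numpy as np

def newtonschulz5(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz: X <- a X + (b A + c A^2) X with A = X X^T.
    Tuned to push singular values toward ~1 in a few fixed steps,
    not to orthogonalize exactly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius normalization
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T  # work with the short side so X X^T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```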
Scalar Analogy
For scalars, approximating $1/a$ with Newton's method applied to $f(x) = \frac{1}{x} - a$ gives the update $x_{k+1} = x_k (2 - a x_k)$, which converges quadratically whenever $|1 - a x_0| < 1$.
Newton-Schulz is the same idea lifted from numbers to matrices.
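The scalar version in a few lines:

```python
a, x = 7.0, 0.1  # approximate 1/7; the initial error |1 - a*x| = 0.3 < 1
for _ in range(6):
    x = x * (2.0 - a * x)  # the error is squared at every step
print(x, 1.0 / 7.0)  # both ~ 0.14285714285714285
```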
Key Takeaways
- Newton-Schulz is an iterative method for matrix inverse-like operations.
- It replaces expensive decompositions with repeated matrix multiplications.
- It converges quadratically once the residual norm drops below 1, which is what a good initial guess buys you.
- It is common in GPU-friendly algorithms because matrix multiplication is highly optimized.
- In neural network optimizers like Muon, it is used to approximately orthogonalize gradient updates.