The Newton-Schulz iteration is a matrix version of Newton's method used to approximate matrix inverses, inverse square roots, and orthogonalized matrices.

It is useful because it needs only matrix multiplication, so it can run efficiently on GPUs/TPUs without explicitly computing an inverse or an SVD.

Matrix Inverse Form

Goal: approximate A^{-1} for an invertible matrix A.

Given an initial guess X_0, repeat:

X_{k+1} = X_k (2I - A X_k)

Equivalent form:

X_{k+1} = 2 X_k - X_k A X_k

If X_k is close enough to A^{-1}, the error gets squared each step, so convergence is quadratic.

Let:

E_k = I - A X_k

Then:

E_{k+1} = I - A X_{k+1} = I - A X_k (2I - A X_k) = (I - A X_k)^2 = E_k^2

so ||E_{k+1}|| <= ||E_k||^2 in any submultiplicative norm. This is the key idea: once the error norm is less than 1, each iteration rapidly shrinks the error.
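A minimal NumPy sketch of the update; the test matrix and the perturbed starting guess are illustrative choices, not part of any standard recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 4 * np.eye(n)  # a well-conditioned test matrix
I = np.eye(n)

# Start from a slightly perturbed copy of the true inverse so ||I - A X0|| < 1.
X = np.linalg.inv(A) + 1e-3 * rng.standard_normal((n, n))

errors = []
for _ in range(5):
    errors.append(np.linalg.norm(I - A @ X))
    X = X @ (2 * I - A @ X)  # Newton-Schulz update
```

Printing `errors` shows the quadratic behavior directly: each entry is at most the square of the previous one.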

Convergence Condition

A common sufficient condition is:

||I - A X_0|| < 1

in some submultiplicative matrix norm. This means the starting guess must already be reasonably close to the true inverse. If the initial guess is bad, the iteration can diverge.

A simple starting point is:

X_0 = A^T / (||A||_1 ||A||_inf)

where ||A||_1 is the maximum absolute column sum and ||A||_inf is the maximum absolute row sum. Since ||A||_2^2 <= ||A||_1 ||A||_inf, this choice guarantees ||I - A X_0||_2 < 1 for any invertible A.
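Putting the starting guess and the update together as a sketch; the function name and the test matrix are mine, not from any library:

```python
import numpy as np

def newton_schulz_inverse(A, steps=25):
    """Approximate inv(A) via Newton-Schulz with the norm-scaled starting guess."""
    n = A.shape[0]
    I = np.eye(n)
    norm_1 = np.abs(A).sum(axis=0).max()    # maximum absolute column sum, ||A||_1
    norm_inf = np.abs(A).sum(axis=1).max()  # maximum absolute row sum, ||A||_inf
    X = A.T / (norm_1 * norm_inf)           # guarantees ||I - A X0||_2 < 1
    for _ in range(steps):
        X = X @ (2 * I - A @ X)             # Newton-Schulz update
    return X

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
X = newton_schulz_inverse(A)
```

The guaranteed-convergent start can still be far from A^{-1}, so the first iterations shrink the error slowly before the quadratic phase kicks in; a generous fixed step count covers both phases.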

Inverse Square Root / Orthogonalization Form

Newton-Schulz is often used to approximate the polar factor of a matrix.

For a matrix G with SVD G = U Sigma V^T, the closest orthogonal matrix in the Frobenius norm is the polar factor:

orth(G) = U V^T

This replaces every singular value with 1, making the columns orthonormal. In optimization, this can turn a raw gradient matrix into an update direction with controlled singular values.

A common Newton-Schulz update for this is:

X_{k+1} = (3/2) X_k - (1/2) X_k X_k^T X_k

Before applying it, G is usually scaled, e.g. by its Frobenius norm, so that its singular values fall in the interval (0, sqrt(3)) on which this iteration drives them to 1.
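A sketch of the cubic orthogonalizing iteration; the square test matrix, step count, and helper name are illustrative:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=20):
    """Drive the singular values of G toward 1 while keeping its singular vectors."""
    X = G / np.linalg.norm(G)  # Frobenius scaling puts all singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic Newton-Schulz step
    return X

G = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 0.7]])
Q = newton_schulz_orthogonalize(G)
```

Because each step is an odd polynomial in X, the singular vectors U and V^T are untouched; only the singular values move, so Q approaches the polar factor U V^T of G.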

Why It Shows Up In Muon

Muon, an optimizer for the hidden-layer weight matrices of neural networks, uses a Newton-Schulz iteration as its core orthogonalization step.

Here, Newton-Schulz is not being used to solve a linear system directly. It is used to orthogonalize the momentum/gradient matrix before applying the parameter update.

Intuition:

  • Raw gradients can have directions with very different scales.
  • Newton-Schulz pushes the update matrix toward an orthogonal matrix.
  • This makes the update depend more on direction and less on extreme singular values.

Some ML implementations use a tuned fifth-order polynomial version instead of the classic cubic update. The goal is not always exact orthogonalization; sometimes the goal is a fast approximate update that works well in a fixed small number of steps.
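A sketch of such a fifth-order variant is below. The helper name is mine; the coefficients (3.4445, -4.7750, 2.0315) are those reported for Muon's public implementation. Note that after a fixed five steps the singular values land near 1 (roughly in [0.7, 1.2]) rather than converging to it exactly, which is the intended trade-off:

```python
import numpy as np

def quintic_orthogonalize(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately orthogonalize a square G with a tuned fifth-order polynomial."""
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so all singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # odd quintic acting on singular values
    return X

G = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 0.7]])
Q = quintic_orthogonalize(G)
```

Compared with the cubic update, the large leading coefficient inflates small singular values much faster (factor ~3.4 per step instead of 1.5), which is why a small fixed step count suffices.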

Scalar Analogy

For scalars, approximating 1/a with Newton's method applied to f(x) = 1/x - a gives:

x_{k+1} = x_k (2 - a x_k)

Newton-Schulz is the same idea lifted from numbers to matrices.
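The scalar version fits in a few lines; the value of a and the starting guess are arbitrary:

```python
# Newton's method on f(x) = 1/x - a gives x_{k+1} = x_k * (2 - a * x_k),
# which computes 1/a without any division in the update itself.
a = 7.0
x = 0.1  # starting guess; needs |1 - a*x| < 1 to converge
for _ in range(6):
    x = x * (2 - a * x)
```

Here |1 - a*x_0| = 0.3, and the error sequence 0.3, 0.09, 0.0081, ... squares at each step, reaching machine precision within six iterations.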

Key Takeaways

  • Newton-Schulz is an iterative method for matrix inverse-like operations.
  • It replaces expensive decompositions with repeated matrix multiplications.
  • It converges very fast when the initial guess is good.
  • It is common in GPU-friendly algorithms because matrix multiplication is highly optimized.
  • In neural network optimizers like Muon, it is used to approximately orthogonalize gradient updates.