Muon is an optimizer for the 2D parameters of neural network hidden layers.

Require: Learning rate $\eta$, momentum $\mu$

  1. Initialize $B_0 \leftarrow 0$
  2. for $t = 1, \ldots$ do
  3.   Compute gradient $G_t \leftarrow \nabla_{\theta}\mathcal{L}_t(\theta_{t-1})$
    
  4.   $B_t \leftarrow \mu B_{t-1} + G_t$
    
  5.   $O_t \leftarrow \text{NewtonSchulz5}(B_t)$
    
  6.   Update parameters $\theta_t \leftarrow \theta_{t-1} - \eta O_t$
    
  7. end for
  8. return $\theta_t$
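
The loop above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the algorithm only names `NewtonSchulz5`, so the body of `newton_schulz5` (Frobenius normalization, a five-step quintic iteration, and the coefficients `3.4445, -4.7750, 2.0315`) is assumed from commonly circulated Muon implementations, and the default values of `eta` and `mu` as well as the helper name `muon_step` are hypothetical.

```python
import numpy as np


def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix G.

    Assumption: quintic Newton-Schulz iteration with coefficients taken
    from commonly circulated Muon code; the algorithm text does not
    specify them.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale so that every singular value is at most 1.
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so that X @ X.T is the smaller product.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def muon_step(theta, B, grad, eta=0.02, mu=0.95):
    """One pass through lines 4-6 of the algorithm for a single 2D parameter.

    eta and mu defaults are illustrative placeholders, not prescribed values.
    """
    B = mu * B + grad            # line 4: momentum accumulation
    O = newton_schulz5(B)        # line 5: orthogonalize the momentum
    theta = theta - eta * O      # line 6: parameter update
    return theta, B
```

A caller would initialize `B = np.zeros_like(theta)` (line 1) and call `muon_step` once per iteration with the freshly computed gradient (line 3), carrying `theta` and `B` forward between calls.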