Muon is an optimizer for the 2D parameters of neural network hidden layers. It is defined as follows:
Require: Learning rate $\eta$, momentum $\mu$
- Initialize $B_0 \leftarrow 0$
- for $t = 1, \ldots$ do
- Compute gradient $G_t \leftarrow \nabla_{\theta}\mathcal{L}_t(\theta_{t-1})$
- $B_t \leftarrow \mu B_{t-1} + G_t$
- $O_t \leftarrow \text{NewtonSchulz5}(B_t)$
- Update parameters $\theta_t \leftarrow \theta_{t-1} - \eta O_t$
- end for
- return $\theta_t$
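
The loop above can be sketched in NumPy for a single 2D weight matrix. This is a minimal illustration, not a reference implementation. The listing does not define NewtonSchulz5, so the body of `newton_schulz5` is an assumption: a five-step quintic Newton-Schulz iteration that approximately orthogonalizes its input, with coefficients taken from commonly circulated Muon code. The function names, the default `lr` and `momentum` values, and the `eps` safeguard are likewise illustrative choices.

```python
import numpy as np


def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix (assumed form of NewtonSchulz5).

    Pushes the singular values of G toward 1 while keeping its singular
    vectors, using an odd quintic polynomial iteration.
    """
    assert G.ndim == 2, "expects a 2D parameter matrix"
    # Assumed coefficients; the listing above does not specify them.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm bounds every singular value by 1.
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so that X @ X.T is the smaller product.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X


def muon_step(theta, grad, buf, lr=0.02, momentum=0.95):
    """One pass through the loop body: returns (theta_t, B_t)."""
    buf = momentum * buf + grad       # B_t = mu * B_{t-1} + G_t
    update = newton_schulz5(buf)      # O_t = NewtonSchulz5(B_t)
    theta = theta - lr * update       # theta_t = theta_{t-1} - eta * O_t
    return theta, buf


if __name__ == "__main__":
    # Toy objective: L(theta) = 0.5 * ||theta - target||_F^2, so G = theta - target.
    target = np.zeros((3, 5))
    target[0, 0], target[1, 1], target[2, 2] = 3.0, 2.0, 1.0
    theta = np.zeros_like(target)
    buf = np.zeros_like(theta)        # B_0 = 0
    for t in range(1, 11):
        grad = theta - target
        theta, buf = muon_step(theta, grad, buf)
        loss = 0.5 * np.sum((theta - target) ** 2)
        print(f"step {t:2d}  loss {loss:.4f}")
```

Because the update is orthogonalized, its scale does not depend on the gradient magnitude, so the step size is governed almost entirely by `lr`. Note that with these coefficients the iteration only brings the singular values close to 1 rather than exactly to 1.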