MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time.
So we rebuilt them from the kernel level up and switched to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup.
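For context on what MXFP8 means, here's a minimal sketch of MX-style block quantization: each group of 32 values shares one power-of-two scale and the values themselves are stored in FP8 (E4M3). This is an illustration of the format only, not our actual kernels; the function names and block layout are assumptions based on the OCP Microscaling spec.

```python
# Illustrative MXFP8-style block quantization (not Cursor's kernels).
# Assumes PyTorch >= 2.1 with float8 dtypes.
import torch

BLOCK = 32          # MX spec: 32 elements share one scale
FP8_MAX = 448.0     # max magnitude of float8_e4m3fn
E4M3_EMAX = 8       # exponent of the largest E4M3 binade

def mxfp8_quantize(x: torch.Tensor):
    """Quantize `x` into FP8 values plus one power-of-two scale per 32-element block."""
    blocks = x.reshape(-1, BLOCK).float()

    # Per-block max magnitude -> power-of-two scale so the block fits in E4M3 range.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - E4M3_EMAX)

    # Scale, clamp to the representable range, and cast the payload to FP8.
    q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale.squeeze(-1)

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Reconstruct an approximation of the original tensor from FP8 values and block scales."""
    blocks = q.float().reshape(-1, BLOCK)
    return (blocks * scale.unsqueeze(-1)).reshape(q.shape)

if __name__ == "__main__":
    w = torch.randn(4, 128)
    q, s = mxfp8_quantize(w)
    err = (mxfp8_dequantize(q, s) - w).abs().max()
    print(f"max abs reconstruction error: {err:.4f}")
```

The win comes from doing the MoE matmuls on the FP8 payloads with hardware that understands the per-block scales, rather than round-tripping through higher precision as sketched above.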
We believe our MXFP8 MoE training stack is faster than any open-source alternative available today.
Read more here: