This kind of buries the lede: with one small change you can also compute ReLU(A @ B) @ C on NxN matrices with N intermediates and no recomputation. (That is: a neural network.)
the tiny corp · Aug 8, 23:42
Everyone knows about Flash Attention. But do you know about Flash GEMM? This code computes (A @ B) @ C on NxN matrices with N intermediates and no recomputation. If you don't use a BLAS library, you don't need to materialize the intermediate matrix.
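A minimal numpy sketch of the row-streaming idea (this is an illustration, not the code from the post; `flash_gemm` and the `act` hook are hypothetical names): stream A @ B one row at a time, so only N intermediate values are live at any moment and the NxN intermediate matrix is never materialized.

```python
import numpy as np

def flash_gemm(A, B, C, act=None):
    """Compute act(A @ B) @ C without materializing A @ B.

    Each row of the intermediate product is formed, optionally passed
    through an elementwise nonlinearity, and immediately folded into
    the output, so only N intermediates are live at a time.
    """
    N = A.shape[0]
    out = np.empty((N, C.shape[1]), dtype=A.dtype)
    for i in range(N):
        row = A[i] @ B           # the N intermediates for this row
        if act is not None:
            row = act(row)       # e.g. ReLU, fused in with no recomputation
        out[i] = row @ C         # consume the row before moving on
    return out

# Quick check against the naive two-pass computation.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))
relu = lambda x: np.maximum(x, 0)
assert np.allclose(flash_gemm(A, B, C, act=relu), relu(A @ B) @ C)
```

Because each row of A @ B is fully formed before it is consumed, an elementwise nonlinearity like ReLU slots in for free between the two matmuls, which is the point of the quoted comment above.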