The code below as you may guess multiplies two matrices (see Ulrich Drepper, What Every Programmer Should Know About Memory)
But the main difference as author says is
for (i = 0; i < N; ++i) for (j = 0; j < N; ++j) for (k = 0; k < N; ++k) res[i][j] += mul1[i][k] * mul2[k][j];The next piece of code does the same (as you probably may not guess anymore)
#define SM (CLS / sizeof (double)) for (i = 0; i < N; i += SM) for (j = 0; j < N; j += SM) for (k = 0; k < N; k += SM) for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k]; i2 < SM; ++i2, rres += N, rmul1 += N) for (k2 = 0, rmul2 = &mul2[k][j]; k2 < SM; ++k2, rmul2 += N) for (j2 = 0; j2 < SM; ++j2) rres[j2] += rmul1[k2] * rmul2[j2];where CLS is a linesize of L1d cache
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) ...The speed
Original | Tuned | |
Cycles | 16,765,297,87 | 2,895,041,480 |
Relative | 100% | 17.3% |
But the main difference as author says is
This looks quite scary.
This smacks of premature optimisation and the possibility the user really does not know what they are talking about, not to mention the problem of portability.See Why not cache lines.
No comments:
Post a Comment