Sunday, December 22, 2013

Don't optimize when you are not asked to

The code below as you may guess multiplies two matrices (see Ulrich Drepper, What Every Programmer Should Know About Memory)
for (i = 0;  i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];
The next piece of code does the same (as you probably may not guess anymore)
#define SM (CLS / sizeof (double))

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
        for (k = 0; k < N; k += SM)
            for (i2 = 0, rres = &res[i][j],
                rmul1 = &mul1[i][k]; i2 < SM;
                ++i2, rres += N, rmul1 += N)
                for (k2 = 0, rmul2 = &mul2[k][j];
                    k2 < SM; ++k2, rmul2 += N)
                    for (j2 = 0; j2 < SM; ++j2)
                        rres[j2] += rmul1[k2] * rmul2[j2];
where CLS is a linesize of L1d cache
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) ...
The speed
OriginalTuned
Cycles16,765,297,872,895,041,480
Relative100%17.3%

But the main difference as author says is
This looks quite scary.
This smacks of premature optimisation and the possibility the user really does not know what they are talking about, not to mention the problem of portability.
See Why not cache lines.

No comments:

Post a Comment