GCC makes a mess | Hardwarebug

Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, the exception being 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.

A session with oprofile exposed multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32-bit values and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.

Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:

#include <stdint.h>

/* High 32 bits of the 64-bit product a * b. */
int mulh(int a, int b)
{
    return ((int64_t)a * (int64_t)b) >> 32;
}

It doesn’t take much thinking to see that the PowerPC mulhw instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we really be sure that those low 32 bits are not needed? GCC seems unconvinced:

mulhw   r9,  r4,  r3
mullw   r10, r4,  r3
srawi   r11, r9,  31
srawi   r12, r9,  0
mr      r3,  r12
blr

The second example is slightly more complicated:

int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}

This can, of course, be done with four multiplications and four additions. GCC, however, likes to be thorough, and uses twice the number of both instructions, plus some loads, stores and shifts for completeness:

stwu    r1,  -32(r1)
srawi   r0,  r6,  31
mullw   r0,  r0,  r5
srawi   r8,  r7,  31
stw     r29, 20(r1)
srawi   r29, r5,  31
stw     r27, 12(r1)
stw     r28, 16(r1)
mullw   r11, r29, r6
mulhwu  r9,  r6,  r5
add     r0,  r0,  r11
mullw   r10, r6,  r5
add     r9,  r0,  r9
mullw   r29, r29, r7
addc    r28, r10, r4
adde    r27, r9,  r3
mullw   r8,  r8,  r5
mulhwu  r9,  r7,  r5
add     r8,  r8,  r29
lwz     r29, 20(r1)
mullw   r10, r7,  r5
add     r9,  r8,  r9
addc    r12, r28, r10
adde    r11, r27, r9
lwz     r27, 12(r1)
mr      r4,  r12
lwz     r28, 16(r1)
mr      r3,  r11
addi    r1,  r1,  32
blr
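
To make the count of four multiplications and four additions concrete, here is a rough C model of the shorter sequence. Each 32x32 product needs one low-half and one high-half multiply (mullw and mulhw on PowerPC), and each 64-bit accumulation needs an add followed by an add-with-carry (addc and adde). The names mul_lo, mul_hi and mac_model are mine, invented for illustration, and the 64-bit arithmetic inside mul_hi only stands in for the single instruction it models. Like the original mulh, it assumes that right-shifting a negative value is an arithmetic shift, as it is with GCC.

```c
#include <stdint.h>

/* Models mullw: low 32 bits of the product. */
static uint32_t mul_lo(int32_t x, int32_t y)
{
    return (uint32_t)x * (uint32_t)y;
}

/* Models mulhw: high 32 bits of the signed 64-bit product. */
static int32_t mul_hi(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Illustrative model: the accumulator lives in two 32-bit "registers". */
int64_t mac_model(int64_t a, int32_t b, int32_t c, int32_t d)
{
    uint32_t lo = (uint32_t)a;                    /* low word of a  */
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);  /* high word of a */
    uint32_t t, carry;

    /* a += b * c */
    t = mul_lo(b, c);                             /* multiplication 1 (mullw) */
    lo += t;                                      /* addition 1 (addc)        */
    carry = lo < t;                               /* carry out of the low add */
    hi += (uint32_t)mul_hi(b, c) + carry;         /* multiplication 2 (mulhw),
                                                     addition 2 (adde)        */

    /* a += b * d */
    t = mul_lo(b, d);                             /* multiplication 3 (mullw) */
    lo += t;                                      /* addition 3 (addc)        */
    carry = lo < t;
    hi += (uint32_t)mul_hi(b, d) + carry;         /* multiplication 4 (mulhw),
                                                     addition 4 (adde)        */

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

No sign-extension shifts or extra cross products are needed, because mulhw already delivers the signed high word.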

Fortunately, this madness is easily fixed with a little inline assembler, which more than doubles the speed of the decoder and makes FFmpeg significantly faster than MAD on PowerPC as well.
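
As an illustration of the kind of inline assembler involved, here is a minimal sketch for the first example, assuming GCC-style extended asm. It is not necessarily the code that went into FFmpeg. The name mulh_asm is made up for this example, and the fallback branch is my addition so that the sketch also builds on other targets.

```c
#include <stdint.h>

/* Illustrative helper: high 32 bits of the signed product a * b. */
static inline int mulh_asm(int a, int b)
{
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__ppc__))
    int r;
    /* A single mulhw; the low half of the product is never computed. */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback for non-PowerPC targets (an assumption of this sketch). */
    return (int)(((int64_t)a * (int64_t)b) >> 32);
#endif
}
```

Because the compiler treats the asm statement as an opaque operation, it cannot emit the redundant mullw and srawi instructions around it.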

This entry was posted on Wednesday, May 13th, 2009 at 2:16 am and is filed under Compilers, Optimisation, PowerPC.