GCC makes a mess | Hardwarebug
Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.
A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32 bits in size and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.
Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:
    int mulh(int a, int b)
    {
        return ((int64_t)a * (int64_t)b) >> 32;
    }
It doesn’t take much thinking to see that the PowerPC mulhw instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we be really sure that those low 32 bits are not needed? GCC seems unconvinced:
    mulhw   r9, r4, r3
    mullw   r10, r4, r3
    srawi   r11, r9, 31
    srawi   r12, r9, 0
    mr      r3, r12
    blr
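To make the operation concrete, here is a small worked sketch of what mulh returns for a few inputs. The mulh_selfcheck name and the scaling example are illustrative assumptions, not code from FFmpeg. The sketch also relies on GCC treating >> on a negative signed value as an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Same function as in the text: high 32 bits of the signed 64-bit product. */
static int mulh(int a, int b)
{
    return (int)(((int64_t)a * (int64_t)b) >> 32);
}

/* Hypothetical self-check with hand-computed values. */
static void mulh_selfcheck(void)
{
    /* Treating 0x40000000 as 2^30, mulh(x, 2^30) is x * 2^30 / 2^32 = x / 4. */
    assert(mulh(100, 0x40000000) == 25);
    assert(mulh(-100, 0x40000000) == -25);

    /* A small positive product has a zero high word ... */
    assert(mulh(1, 1) == 0);
    /* ... and a small negative product has an all-ones high word. */
    assert(mulh(-1, 1) == -1);

    /* (2^31 - 1)^2 = 2^62 - 2^32 + 1, whose high word is 2^30 - 1. */
    assert(mulh(INT32_MAX, INT32_MAX) == 0x3fffffff);
}
```

Only the high word of the product is ever returned, which is why the low word computed by mullw in the listing above is wasted work.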
The second example is slightly more complicated:
    int64_t mac(int64_t a, int b, int c, int d)
    {
        a += (int64_t)b * (int64_t)c;
        a += (int64_t)b * (int64_t)d;
        return a;
    }
This can, of course, be done with four multiplications and four additions. GCC, however, likes to be thorough, and uses twice the number of both instructions, plus some loads, stores and shifts for completeness:
    stwu    r1, -32(r1)
    srawi   r0, r6, 31
    mullw   r0, r0, r5
    srawi   r8, r7, 31
    stw     r29, 20(r1)
    srawi   r29, r5, 31
    stw     r27, 12(r1)
    stw     r28, 16(r1)
    mullw   r11, r29, r6
    mulhwu  r9, r6, r5
    add     r0, r0, r11
    mullw   r10, r6, r5
    add     r9, r0, r9
    mullw   r29, r29, r7
    addc    r28, r10, r4
    adde    r27, r9, r3
    mullw   r8, r8, r5
    mulhwu  r9, r7, r5
    add     r8, r8, r29
    lwz     r29, 20(r1)
    mullw   r10, r7, r5
    add     r9, r8, r9
    addc    r12, r28, r10
    adde    r11, r27, r9
    lwz     r27, 12(r1)
    mr      r4, r12
    lwz     r28, 16(r1)
    mr      r3, r11
    addi    r1, r1, 32
    blr
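The count of four multiplications and four additions can be checked with a portable C model of the 32-bit operations. This is only a sketch: the helper names (mul_lo, mul_hi, add64, mac_split, mac_selfcheck) are made up for illustration, and each helper stands in for the instruction named in its comment rather than generating it.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the mac() function from the text. */
static int64_t mac_ref(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}

/* Low 32 bits of the product, standing in for mullw. */
static uint32_t mul_lo(int32_t x, int32_t y)
{
    return (uint32_t)x * (uint32_t)y;
}

/* High 32 bits of the signed product, standing in for mulhw. */
static uint32_t mul_hi(int32_t x, int32_t y)
{
    return (uint32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* 64-bit add on a hi:lo register pair, standing in for addc + adde. */
static void add64(uint32_t *acc_hi, uint32_t *acc_lo, uint32_t hi, uint32_t lo)
{
    uint32_t sum_lo = *acc_lo + lo;   /* addc: add low words, note the carry */
    uint32_t carry = sum_lo < lo;     /* unsigned wrap-around means carry out */
    *acc_lo = sum_lo;
    *acc_hi = *acc_hi + hi + carry;   /* adde: add high words plus carry */
}

/* mac() built from two multiplies and two adds per product. */
static int64_t mac_split(int64_t a, int b, int c, int d)
{
    uint32_t lo = (uint32_t)a;
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);

    add64(&hi, &lo, mul_hi(b, c), mul_lo(b, c));
    add64(&hi, &lo, mul_hi(b, d), mul_lo(b, d));

    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Hypothetical self-check against the reference implementation. */
static void mac_selfcheck(void)
{
    assert(mac_split(0, 3, 4, 5) == 27);
    assert(mac_split(-1, -1, 1, 0) == -2);
    assert(mac_split(0xffffffffLL, 1, 1, 0) == 0x100000000LL);
    assert(mac_split(1000, INT32_MAX, INT32_MAX, -INT32_MAX) == 1000);
    assert(mac_split(-5, 123456, -654321, 789) == mac_ref(-5, 123456, -654321, 789));
}
```

No sign-extension shifts or cross terms are needed, because the signed high-word multiply already accounts for the signs of both operands.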
Fortunately, this madness is easily fixed with a little inline assembler, more than doubling the speed of the decoder and making FFmpeg significantly faster than MAD on PowerPC as well.
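The post does not show the assembler that was used. As a rough idea of what such a fix might look like for the first example, here is a minimal sketch using GCC extended asm. The function name mulh_asm and the preprocessor guard are assumptions; the asm path is restricted to 32-bit PowerPC, and every other target falls back to the plain C version.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical replacement for mulh(): one mulhw on 32-bit PowerPC,
 * plain C everywhere else. */
static inline int mulh_asm(int a, int b)
{
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    int r;
    /* The high word of the signed product is all we ask for. */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback, equivalent to the C version in the text. */
    return (int)(((int64_t)a * (int64_t)b) >> 32);
#endif
}

/* Hypothetical self-check; the expected values hold on either path. */
static void mulh_asm_selfcheck(void)
{
    assert(mulh_asm(100, 0x40000000) == 25);
    assert(mulh_asm(-1, 1) == -1);
    assert(mulh_asm(INT32_MAX, INT32_MAX) == 0x3fffffff);
}
```

Because the asm statement has no side effects beyond its output operand, GCC remains free to schedule it or drop it when the result is unused.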
This entry was posted on Wednesday, May 13th, 2009 at 2:16 am and is filed under Compilers, Optimisation, PowerPC.