Wednesday, September 16, 2009

128-bit MMX

I’m quite sure that Intel would not like to see SSE2 named 128-bit MMX. In fact, MMX has a bad reputation: the Intel marketing hype pushed it as a universal solution to multimedia requirements, but at the same time the gaming industry switched from mostly 2D games to Virtual Reality-like 3D games that were not accelerated by MMX. Bad press coverage spread the news that MMX was meaningless as it did not improve the Quake frame rate. That would be correct if the only applications worth running were 3D games, but the overly simplified vision of the world shared by most hardware sites missed several points: in fact, MMX instructions are constantly used to perform a wide array of tasks. Photoshop users surely remember the performance boost given by MMX, but it should be made clear that each time you play an MP3, view a JPEG image in your browser or play an MPEG video, a lot of MMX instructions are executed. Today all multimedia applications are built on MMX instructions, and they are the key to running computing-intensive tasks such as speech recognition on commonplace PCs.
Writing MMX code is still very hard, as you have to go back to assembler, but the performance benefits are rewarding. The support offered by current compilers is bare-bones. There are a few attempts to write C++ compilers that can automatically turn normal C code into vector MMX code, but they handle only loop vectorization of limited complexity and place too many constraints on the parallelizable code; in general, they appear notably less mature than the vectorizing compilers available in the supercomputing domain.
So we cannot expect to have SSE2-enabled compilers anytime soon. This will not stop large companies that sell shrink-wrapped software from exploiting SSE2 instructions, as they can afford the required development time, but small-scale software firms are not likely to use SSE2 until better development tools appear. In my opinion, the Pentium 4 scenario closely resembles the Pentium MMX one, where lack of software support made the additional investment for the Pentium MMX over the plain old Pentium quite useless.
We have just analyzed the dark side of SSE2, i.e. difficult programming; now we can go on and delve into the technical details.
SSE2 extends MMX by using 128-bit registers instead of 64-bit ones, effectively doubling the level of parallelism. We may be tempted to replace MMX register names with SSE2 ones (e.g. turning MM0 into XMM0), recompile the code and watch it run at twice the speed. Unfortunately, that would not work; in fact, the code would not even compile. These are the steps required to migrate MMX code to SSE2:
1) replace MMX register names with SSE2 ones, e.g. MM0 becomes XMM0;
2) replace MOVQ instructions with MOVDQA (if the memory address is 16-byte aligned) or MOVDQU (if the memory address is not aligned); MOVAPD and MOVUPD move the same 128 bits but are meant for packed double-precision data;
3) replace PSHUFW, which is an SSE extension to MMX, with a combination of the following instructions: PSHUFHW, PSHUFLW, PSHUFD;
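4) replace PSLLQ and PSRLQ with PSLLDQ and PSRLDQ respectively, keeping in mind that the 128-bit shifts take an immediate count expressed in bytes rather than bits;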
5) update loop counters and numeric memory offsets, since we work on 128 bits at once instead of 64.
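To make steps 1, 2 and 5 more concrete, here is a minimal sketch written in C with the SSE2 intrinsics from emmintrin.h rather than in assembler. The function name add_bytes_sse2 and the byte-wise addition it performs are assumptions chosen only for illustration; they do not come from any real code base. An MMX version of the same loop would advance 8 bytes per iteration, while this one advances 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n bytes, wrapping on overflow. */
void add_bytes_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Step 5: the loop stride is 16 bytes, not the 8 bytes of an MMX loop. */
    for (; i + 16 <= n; i += 16) {
        /* Unaligned 128-bit integer loads, so the sketch accepts any pointer.
           _mm_load_si128 could be used instead when both pointers are known
           to be 16-byte aligned. */
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));

        /* PADDB on XMM registers: sixteen byte additions at once. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(a, b));
    }

    /* Scalar tail for the last n % 16 bytes. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}
```

The scalar tail is the price of the wider stride: a buffer whose length was a multiple of 8 is not necessarily a multiple of 16.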
Looks easy, doesn’t it? Actually, it is not that simple. Replacing 64-bit shifts with 128-bit ones is trivial, but SSE2 expects memory references to be 16-byte aligned: while the unaligned move instruction lets you load unaligned memory blocks at the expense of poor performance (so it should not be used unless strictly necessary), every instruction that uses a memory source operand, e.g. PADDB XMM0,[EAX] (converted from PADDB MM0,[EAX]), is a troublesome spot. Using unaligned memory references raises a General Protection fault, and avoiding the GPF requires quite a lot of work. First of all, the memory allocators used in current compilers do not align data blocks on 16-byte boundaries, so you will have to build a wrapper function around malloc() that allocates a slightly larger block than required and correctly aligns the resulting pointer (note: the Processor Pack for Visual C++ features an _aligned_malloc() function that supports user-definable alignment of allocated blocks). Then you will have to find every place in your source code where the data blocks that are processed with SSE2 instructions get allocated, and replace the standard allocation call with a call to your wrapper function: this is fairly easy if you have access to all the source code of your app, but impossible when third-party libraries allocate misaligned memory blocks; in this case, contact the software vendor and ask for an update.
If your MMX routine spills some variables onto the stack, you are in for more trouble: the stack itself has to be aligned, which requires modifying the entry and exit code of the routine.
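The malloc() wrapper described above could look like the following sketch. The names malloc16 and free16 are hypothetical and belong to no library; the approach is the one outlined in the text: allocate a slightly larger block, round the pointer up to a 16-byte boundary, and remember the original pointer so that the block can be released later.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns a 16-byte aligned block of at least `size`
   bytes, or NULL on failure. Release it with free16(), never with free(). */
void *malloc16(size_t size)
{
    void *raw;
    uintptr_t start, aligned;

    /* Guard against overflow of the padded size computed below. */
    if (size > SIZE_MAX - 15 - sizeof(void *))
        return NULL;

    /* 15 bytes of slack for alignment, plus room to stash the raw pointer. */
    raw = malloc(size + 15 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip the stash slot, then round up to the next multiple of 16. */
    start = (uintptr_t)raw + sizeof(void *);
    aligned = (start + 15) & ~(uintptr_t)15;
    assert(aligned % 16 == 0);

    /* Store the pointer malloc() returned just below the aligned block. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Hypothetical counterpart of malloc16(); accepts NULL like free() does. */
void free16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

The cost is up to 15 bytes of padding plus one pointer per allocation, which is negligible for the large buffers that multimedia code usually processes.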
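The same over-allocate-and-round-up trick can be applied to a stack spill area. The sketch below is in C with SSE2 intrinsics, not assembler, so it does not touch the entry and exit code of the routine; it only illustrates the idea of carving an aligned slot out of an unaligned stack buffer. The function sum_bytes_via_spill is a made-up example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical routine: sums the sixteen bytes of v by spilling the
   register to the stack and reading the bytes back one by one. */
unsigned sum_bytes_via_spill(__m128i v)
{
    /* 16 bytes of payload plus 15 bytes of slack, so a 16-byte aligned
       address always exists somewhere inside the buffer. */
    uint8_t raw[16 + 15];
    uint8_t *slot = (uint8_t *)(((uintptr_t)raw + 15) & ~(uintptr_t)15);
    unsigned total = 0;
    int i;

    assert(((uintptr_t)slot & 15) == 0);

    /* The aligned 128-bit store is safe now that the slot is aligned. */
    _mm_store_si128((__m128i *)slot, v);

    for (i = 0; i < 16; i++)
        total += slot[i];
    return total;
}
```

In hand-written assembler the equivalent is to align the stack pointer in the prologue and restore it in the epilogue, which is exactly the entry and exit code change mentioned above.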
The easiest way to fix a PSHUFW instruction is to split it in two, a PSHUFLW and a PSHUFHW with the same selector, operating respectively on the low and high 64-bit halves of the 128-bit register.
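As an illustration, the following sketch reverses the four 16-bit words inside each 64-bit half of a register, using the C intrinsics that correspond to PSHUFLW and PSHUFHW. The function names and the choice of the word-reversing selector 0x1B are assumptions made for the example; an MMX routine would have used a single PSHUFW with the same selector on each 64-bit value.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: reverses the word order within each 64-bit half. */
__m128i reverse_words_in_each_half(__m128i v)
{
    /* Selector 0x1B is 00 01 10 11 in binary: destination word 0 takes
       source word 3, word 1 takes word 2, and so on. */
    v = _mm_shufflelo_epi16(v, 0x1B);  /* PSHUFLW: low half, high half kept  */
    v = _mm_shufflehi_epi16(v, 0x1B);  /* PSHUFHW: high half, low half kept  */
    return v;
}

/* Small self-check of the helper above. */
void check_reverse_words(void)
{
    static const uint16_t in[8]       = {0, 1, 2, 3, 4, 5, 6, 7};
    static const uint16_t expected[8] = {3, 2, 1, 0, 7, 6, 5, 4};
    uint16_t out[8];
    int i;

    __m128i r = reverse_words_in_each_half(
        _mm_loadu_si128((const __m128i *)in));
    _mm_storeu_si128((__m128i *)out, r);

    for (i = 0; i < 8; i++)
        assert(out[i] == expected[i]);
}
```

Note that the two shuffles never move a word from one half of the register to the other; when the migrated algorithm needs that, PSHUFD is the instruction to look at.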
Here is the list of SSE2 instructions that extend MMX (adapted from Intel’s documentation):
