Optimizing for SSE: A Case Study

Originally appeared in Hugi Issue #25

Introduction

In this article I'll be describing the process of optimizing a piece of code to make use of Intel's SSE instruction set. As an example, I will work with a 4x4 matrix-vector multiplication function (commonly found in most 3D graphics engines). Starting from a naïve C++ implementation, I will progressively apply various optimization techniques, until the function is more than four times faster than its original version. The code samples are in C++ with inline Intel x86 assembly (MSVC syntax), but the techniques generalize easily to other environments and platforms.

We'll apply these techniques to a problem that often arises in computer graphics: you want to multiply a 3D vector by a 4x4 matrix. This problem arises when you have a 3D vertex that you wish to transform, using a standard 4x4 transformation matrix. Of course, all graphics cards since the original Nvidia GeForce perform final vertex transformations in hardware. But there are still plenty of reasons to be transforming your vertices before sending them to the graphics card (matrix composition, visibility culling and collision detection, to name a few - not to mention non-hardware-accelerated applications).

Naïve C++

So, hopefully you agree that it's important to have a fast software matrix-vector multiplier. Well, you're in luck - there are plenty of fast implementations available! In Hugi #20, Chris Dragan uses AMD's 3DNow extensions to write a matrix-vector multiplier that runs in 16 cycles! But...what if you don't have an Athlon (or what if you want your demo to run well on non-Athlons)? Well, not to worry - Intel provides a free matrix library, optimized for the Pentium 3 and 4! Unfortunately, to make use of the optimized versions of Intel's matrix functions, you need to be using Intel's proprietary compiler. Well, there's always libSIMD, a free cross-platform optimized matrix library. However, the libSIMD project is still in the early stages of development, and currently doesn't provide any optimization at all. So why not go dig out your Linear Algebra textbook, look up the definition of matrix multiplication, and write some C++ code! I suspect your first draft would look something like this:
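(A minimal sketch, written against raw float arrays; the sample code uses proper Matrix4f and Vector4f classes, but the idea is the same.)

// A naive 4x4 matrix * 4-vector multiply, straight out of the textbook.
// The matrix is assumed to be stored row-major as 16 consecutive floats.
void MatrixMultiply(const float m[16], const float vin[4], float vout[4])
{
    for (int row = 0; row < 4; row++)
    {
        vout[row] = 0.0f;
        for (int col = 0; col < 4; col++)
            vout[row] += m[row*4 + col] * vin[col];
    }
}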

A brief note on cycle timings: the timings I give for each function were achieved when running the sample code on my 500 MHz Pentium 3. Different machines will have different timings (for instance, the Pentium 4 offers significantly faster SIMD performance than the Pentium 3). But the absolute timings are mostly irrelevant - what matters is the relative timings, which show off the performance gains from each successive optimization.

So sure, the function works - but if all you're looking for is code that WORKS, you wouldn't be reading Hugi, would you? Functionality isn't good enough - we want something FAST! And since Chris Dragan's 3DNow version runs more than 5 times faster, you may suspect (correctly, as it turns out) that we can make some pretty heavy optimizations. You can optimize the C++ all you want (see MatrixMultiply2 in the code sample), but you aren't going to get the performance increase you're hoping for.

Basic SSE

Why does Chris's version run so fast? Because it uses 3DNow! 3DNow is an example of a SIMD (Single Instruction, Multiple Data) instruction set, which lets you execute the same operation on multiple pieces of data simultaneously. Intel provides its own SIMD instructions, called SSE (Streaming SIMD Extensions), which are available on the Pentium 3 and up. SSE lets you perform operations on four 32-bit floats in parallel! (The Pentium 4's SSE2 instructions provide similar operations for 64-bit doubles, but most graphics applications don't need the extra precision).

SIMD lends itself amazingly well to matrix operations. Intel's developer site contains descriptions of SSE-optimized matrix multiplication, inversion, and L-U decomposition algorithms. After familiarizing yourself with the SSE instructions (Intel provides an excellent tutorial/reference - see the links at the end of the article for more information), you might update your MatrixMultiply function to look something like this:
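(A sketch of the idea rather than the exact listing from the sample code: the matrix is taken column-major - see the note that follows - and raw float pointers keep the asm independent of the Matrix4f/Vector4f internals.)

void MatrixMultiplySSE(const float *m, const float *vin, float *vout)
{
    __asm {
        mov     edx, m             ; matrix (column-major, 16 floats)
        mov     esi, vin           ; input vector
        mov     edi, vout          ; output vector

        movups  xmm4, [edx]        ; column 0
        movups  xmm5, [edx+16]     ; column 1
        movups  xmm6, [edx+32]     ; column 2
        movups  xmm7, [edx+48]     ; column 3

        movups  xmm0, [esi]        ; xmm0 = x y z w

        movups  xmm1, xmm0         ; move / broadcast / multiply / add,
        shufps  xmm1, xmm1, 0x00   ; once per component of the input vector:
        mulps   xmm1, xmm4         ; result  = x * column 0
        movups  xmm2, xmm0
        shufps  xmm2, xmm2, 0x55
        mulps   xmm2, xmm5         ; result += y * column 1
        addps   xmm1, xmm2
        movups  xmm2, xmm0
        shufps  xmm2, xmm2, 0xAA
        mulps   xmm2, xmm6         ; result += z * column 2
        addps   xmm1, xmm2
        shufps  xmm0, xmm0, 0xFF
        mulps   xmm0, xmm7         ; result += w * column 3
        addps   xmm0, xmm1

        movups  [edi], xmm0        ; store the result
    }
}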

Note: due to the way that SSE instructions work, it ends up being convenient to have the input matrix be in column-major order. If you'd rather keep your matrices in row-major order (the way OpenGL expects them), don't panic - it turns out you can do it either way without a significant performance penalty. Keep reading.

Here's a general overview of the SSE version of the multiplier algorithm: load the four matrix columns into SSE registers, then, for each component of the input vector, broadcast that component across a register, multiply it by the corresponding matrix column, and accumulate the four products into the output vector.

Not bad - about twice as fast as the original version! But there's still plenty of ways to make this function faster.

Batch Processing

Next, let's make a slight algorithmic change. We know matrix-vector multiplication is a common operation, but how often are we going to be transforming each vertex by a different matrix? Not very often - usually, we're doing things like transforming entire models into world-space, which involves multiplying lots of vertices by the exact same matrix (namely, the object's model matrix). This means we don't need to load the matrix before every vector - we can just load it once, and then feed the matrix a stream of vectors to be multiplied! With some slight modifications to our code, we end up with a BatchMultiply function:
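(Again, a sketch rather than the exact listing from the sample code: raw pointers, and the matrix already in column-major order.)

void BatchMultiply(const float *m, const float *vin, float *vout, int count)
{
    __asm {
        mov     edx, m
        movups  xmm4, [edx]        ; load the matrix ONCE...
        movups  xmm5, [edx+16]
        movups  xmm6, [edx+32]
        movups  xmm7, [edx+48]

        mov     esi, vin
        mov     edi, vout
        mov     ecx, count

    vloop:                         ; ...then loop over the vector stream
        movups  xmm0, [esi]
        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 0x00
        mulps   xmm1, xmm4         ; x * column 0
        movups  xmm2, xmm0
        shufps  xmm2, xmm2, 0x55
        mulps   xmm2, xmm5         ; y * column 1
        addps   xmm1, xmm2
        movups  xmm2, xmm0
        shufps  xmm2, xmm2, 0xAA
        mulps   xmm2, xmm6         ; z * column 2
        addps   xmm1, xmm2
        shufps  xmm0, xmm0, 0xFF
        mulps   xmm0, xmm7         ; w * column 3
        addps   xmm0, xmm1
        movups  [edi], xmm0        ; store the result

        add     esi, 16
        add     edi, 16
        dec     ecx
        jnz     vloop
    }
}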

Better still! Note that since we're now processing many vectors at once, it's feasible to insert a transpose operation during the function initialization - the amortized cost across all the input vectors is virtually zero (in any case, there's a pretty fast SIMDified matrix transpose function included in the sample code; a sketch of one common approach follows below). So it doesn't really matter whether you store your matrices in row-major or column-major order, so long as you remember to transpose them into column-major order before loading them into the SSE registers.
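For reference, here's roughly what a SIMDified 4x4 transpose can look like, using the _MM_TRANSPOSE4_PS macro from xmmintrin.h (the routine in the sample code may differ in detail):

#include <xmmintrin.h>

// Transpose a row-major 4x4 matrix (16 consecutive floats) in place,
// so it can be loaded column-by-column into the SSE registers.
void TransposeMatrix(float *m)
{
    __m128 row0 = _mm_loadu_ps(m);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);   // rows become columns
    _mm_storeu_ps(m,      row0);
    _mm_storeu_ps(m + 4,  row1);
    _mm_storeu_ps(m + 8,  row2);
    _mm_storeu_ps(m + 12, row3);
}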

16-byte memory alignment

Okay, next lesson: when looking through the SSE instruction reference, did you notice that there are two different instructions for data movement: movups (unaligned 16-byte MOV) and movaps (aligned 16-byte MOV)? If you tried using movaps, you probably ended up with a "privileged instruction" runtime error, which is why you settled on movups. As it turns out, movaps is much faster than movups, but it requires that its arguments be aligned on 16-byte boundaries when moving to/from memory (otherwise, the instruction throws an exception). You have to do a little extra work in order to get your matrices and vectors to be 16-byte aligned, but it's worth it.

The first thing I tried in Visual C++ was to go into Project -> Settings -> C/C++ -> Code Generation, and change the "struct member alignment" field from "8 bytes" to "16 bytes". Problem solved? No, apparently not. Changing this field doesn't appear to do anything useful (if anybody knows why, please explain it to me!). Instead, I had to resort to making changes in the code itself, using the C/C++ __declspec(align(16)) specifier. You can put this specifier in front of the definition of any variable, data member, struct or class in order to force the object to be 16-byte aligned (in the case of structs or classes declared as __declspec(align(16)), all instances of that struct/class will be properly aligned).

It's important to note that apparently this method doesn't work for objects (or arrays of objects) which are dynamically allocated (with malloc() or new). Fortunately, the MSVC Processor Pack Upgrade includes an _mm_malloc() function (defined in malloc.h) which should do the trick. Otherwise, it may be necessary to write your own memory management routines to make sure your objects are properly aligned. Email me if you'd like suggestions on how to set this up.
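Here's roughly what the two approaches look like in practice (the Vector4f layout below is just an example):

#include <malloc.h>   // _mm_malloc / _mm_free

// Force every instance of the class to start on a 16-byte boundary:
__declspec(align(16)) class Vector4f
{
public:
    float x, y, z, w;
};

// For dynamically allocated arrays, allocate through _mm_malloc instead
// of new, and release the memory with _mm_free:
void AllocateExample(int numVerts)
{
    Vector4f *verts = (Vector4f *)_mm_malloc(numVerts * sizeof(Vector4f), 16);
    // ... transform away ...
    _mm_free(verts);
}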

After modifying my Matrix4f and Vector4f classes to be 16-byte aligned, simply replacing all instances of movups with movaps in our BatchMultiply function results in another speedup! Here's the code:
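Only the data-movement instructions change - the loads and stores become movaps, and everything else in the loop stays exactly as it was:

    movaps  xmm4, [edx]        ; matrix columns (was movups)
    movaps  xmm5, [edx+16]
    movaps  xmm6, [edx+32]
    movaps  xmm7, [edx+48]
    ; ...
    movaps  xmm0, [esi]        ; input vector load, now aligned
    ; ...
    movaps  [edi], xmm0        ; output store, now aligned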

Instruction Pairing

Chris's articles go into a great deal of detail about instruction pairing, so I'll only summarize here. Basically, if you order your code properly, the Pentium III will execute multiple instructions at the same time! There are two main strategies you can use to ensure optimum instruction pairing.

First, try to eliminate data dependencies between adjacent instructions. If at all possible, avoid reading from and writing to the same register in adjacent instructions. In our case, that means instead of executing the "move / broadcast / multiply / add" loop once for each component of the input vector, we execute all the moves, then all the broadcasts, then all the multiplies, and then all the adds. This eliminates data dependencies between instructions.

Secondly, it's important to understand the instruction decoding mechanism of the processor. If your instructions are ordered properly (and free of data dependencies), the CPU can decode up to three instructions in the same clock cycle! Chris Dragan's article in Hugi #21 explains the Pentium III's instruction decoder quite well, so I won't go into too much detail. However, the most important practical lesson of the article is that only one packed SSE instruction (i.e. one whose opcode ends in 'ps' and operates on more than one value at once) can be decoded per cycle. Long strings of packed SSE instructions can be a major performance bottleneck - so much potential parallelism goes to waste! Wherever possible, try to alternate between packed and non-packed instructions.

Here's a version of BatchMultiply whose instructions have been re-ordered to improve performance:
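(A sketch of the re-ordered inner loop; the setup before the loop is unchanged. All the moves, then all the broadcasts, then all the multiplies, then all the adds.)

vloop:
    movaps  xmm0, [esi]        ; load the input vector
    movaps  xmm1, xmm0         ; all the moves...
    movaps  xmm2, xmm0
    movaps  xmm3, xmm0
    shufps  xmm0, xmm0, 0x00   ; ...then all the broadcasts...
    shufps  xmm1, xmm1, 0x55
    shufps  xmm2, xmm2, 0xAA
    shufps  xmm3, xmm3, 0xFF
    mulps   xmm0, xmm4         ; ...then all the multiplies...
    mulps   xmm1, xmm5
    mulps   xmm2, xmm6
    mulps   xmm3, xmm7
    addps   xmm0, xmm1         ; ...then all the adds
    addps   xmm2, xmm3
    addps   xmm0, xmm2
    movaps  [edi], xmm0        ; store the result
    add     esi, 16
    add     edi, 16
    dec     ecx
    jnz     vloop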

Prefetching

The next optimization is more subtle. In addition to the SIMD instructions we've been using so far, SSE also provides instructions that let you specify how you want your data to interact with the system cache. The Intel SSE tutorial provides full coverage of these instructions, but the one that interests us right now is the prefetchnta instruction. It allows us to pre-load a specific address into a special non-temporal area of the L1 cache, which is guaranteed not to displace any data that was loaded into the cache for more legitimate reasons (such as temporal or spatial locality). Essentially, we use prefetchnta to indicate that we want the data at the specified address to be in the cache so we can read from (or write to) it exactly once, at which point it won't be used again.

This is exactly how we want to treat our streams of input and output vectors - we only want each vector to be in the cache long enough to be read from (or written to). Then we want it GONE, so that we have plenty of cache space for future vectors! So we add a pair of prefetchnta instructions to BatchMultiply - one for the input stream, and one for the output stream:
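(The change is just two extra instructions at the top of the loop; the sketch below prefetches the vectors for the next iteration - tuning the prefetch distance is discussed in a moment.)

vloop:
    prefetchnta [esi+16]       ; start pulling the NEXT input vector
    prefetchnta [edi+16]       ; and its output slot into the cache
    movaps  xmm0, [esi]
    ; ... broadcast / multiply / add, exactly as before ...
    movaps  [edi], xmm0
    add     esi, 16
    add     edi, 16
    dec     ecx
    jnz     vloop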

At first, the benefits of prefetching aren't apparent - we only saved a single cycle! However, once we start passing in larger streams of vectors (around 10000), the prefetching version of the code begins to perform significantly better than the non-prefetching version.

Technically, we should be prefetching data which is significantly ahead of the data we're currently operating on. You should experiment to find the specific prefetching offset that provides the best performance for your application (48 bytes seems to be the "sweet spot" for the Pentium 3). It might also be worth investigating the movntps (streaming store) instruction, which is used to write data from SSE registers directly into memory while bypassing the cache entirely (thus further avoiding cache pollution). See the Intel SSE tutorial for more information.
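For illustration, those two refinements might look something like this in the inner loop (48 is the offset that worked best on my machine; treat it as a starting point):

    prefetchnta [esi+48]       ; prefetch three vectors ahead of the one
                               ; we're about to read
    ; ...
    movntps [edi], xmm0        ; streaming store: write the result around
                               ; the cache instead of through it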

To see the benefits of prefetching, run the sample program simdtest.exe with the -f flag. This will flush the entire cache between each test. You'll see that prefetching can double the speed of your code in the worst case. This is a useful technique to remember, since the prefetching routines can be used in non-SIMD code as well!

Increase Temporal Locality of Memory I/O

Our next optimization technique warrants another nod to Chris Dragan. He sped up his 3DNow multiplier from 19 cycles to 16 cycles by operating on two adjacent vectors concurrently! While this didn't change the amount of work he had to do on each vector, it reduced the overall amount of time spent waiting for memory I/O operations to complete. Reading two objects from memory one right after the other is faster than reading one object, then running a lot of code, then reading the other object. With a bit of re-organization, we can apply a similar method to our SSE multiplier.

First we need to reduce the number of registers we're using per vector. We're using four of the 8 SSE registers to store the matrix columns, one register for the input vector, one for the output vector, and either one or two for scratch values. This won't do at all if we want to work on two vectors at once; we need at least two registers to store the output vectors, leaving us only two registers for the input vectors and scratch values! It looks like we're stuck.

Well, not really - we don't really need to store the input vectors in registers. Instead, we can load individual components of the input vectors directly from memory to the scratch registers, and then broadcast, multiply and add them as usual! To load individual floats instead of whole vectors, we use the movss instruction. As an added bonus, movss is a non-packed instruction, which allows us to make better use of the parallel instruction decoders (see above)! Here's a version of BatchMultiply which operates on two vectors at once:
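(In sketch form, with prefetching and the odd-vector-count fixup left out for readability: xmm0 and xmm1 accumulate the two output vectors, xmm2 and xmm3 are scratch, and the individual input components are loaded straight from memory with movss.)

vloop:
    ; x components of both vectors
    movss   xmm0, [esi]        ; x0 (scalar load - a non-packed instruction)
    movss   xmm1, [esi+16]     ; x1
    shufps  xmm0, xmm0, 0x00
    shufps  xmm1, xmm1, 0x00
    mulps   xmm0, xmm4         ; accumulator 0  = x0 * column 0
    mulps   xmm1, xmm4         ; accumulator 1  = x1 * column 0

    ; y components
    movss   xmm2, [esi+4]      ; y0
    movss   xmm3, [esi+20]     ; y1
    shufps  xmm2, xmm2, 0x00
    shufps  xmm3, xmm3, 0x00
    mulps   xmm2, xmm5
    mulps   xmm3, xmm5
    addps   xmm0, xmm2
    addps   xmm1, xmm3

    ; z components
    movss   xmm2, [esi+8]      ; z0
    movss   xmm3, [esi+24]     ; z1
    shufps  xmm2, xmm2, 0x00
    shufps  xmm3, xmm3, 0x00
    mulps   xmm2, xmm6
    mulps   xmm3, xmm6
    addps   xmm0, xmm2
    addps   xmm1, xmm3

    ; w components
    movss   xmm2, [esi+12]     ; w0
    movss   xmm3, [esi+28]     ; w1
    shufps  xmm2, xmm2, 0x00
    shufps  xmm3, xmm3, 0x00
    mulps   xmm2, xmm7
    mulps   xmm3, xmm7
    addps   xmm0, xmm2
    addps   xmm1, xmm3

    movaps  [edi], xmm0        ; store both results
    movaps  [edi+16], xmm1
    add     esi, 32
    add     edi, 32
    sub     ecx, 2
    jnz     vloop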

Of course, there's some additional overhead involved before we enter the main loop: we need to account for the case that we're passed an odd number of vectors. But as always, initialization costs drop to zero as the size of the data set increases, so for a reasonable number of input vectors, the impact of this additional setup is virtually zero.

Application-Specific Specialization

There's one more significant change we can make: by making some assumptions about the input vectors, we can shave off a few more cycles. Until now, we've been working with a general-purpose 4x4 matrix-vector multiplier. But remember that our original application involved transforming 3D vertices. 3D vertices are usually stored as four-element vectors, where the fourth element w is the "homogeneous coordinate", used for perspective correction during the final transformation process. The value of w is initially 1.0, and it almost never changes (until the final stages of the rasterization pipeline, at which point the vertex is out of our control anyway). By assuming that w = 1.0 in the input vectors, we don't need to broadcast w and multiply it by xmm7; we can just add xmm7 directly to the output vector! Thus, we can eliminate three instructions (a move, a shuffle and a multiply) from our function (I'll call the new version BatchTransform, to differentiate it from the slower but more generalized BatchMultiply):
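(In the two-at-a-time sketch above, the whole w block collapses; the tail of the loop body becomes simply:)

    ; z components, as before
    movss   xmm2, [esi+8]
    movss   xmm3, [esi+24]
    shufps  xmm2, xmm2, 0x00
    shufps  xmm3, xmm3, 0x00
    mulps   xmm2, xmm6
    mulps   xmm3, xmm6
    addps   xmm0, xmm2
    addps   xmm1, xmm3

    ; w components: with w assumed to be 1.0, there is nothing to load,
    ; broadcast or multiply - just add column 3 to both accumulators
    addps   xmm0, xmm7
    addps   xmm1, xmm7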

Conclusion and Future Possibilities

Ideally, you want both 3DNow and SSE optimized versions of certain algorithms in your program (in addition to a non-optimized version, for testing or portability). But you certainly don't want to litter your code with conditional statements to choose the right version to use at runtime. Compiling a separate version of your program for each processor variant is obviously the optimal solution performance-wise, but it's a tad inelegant (and it's easy to get the different versions confused). Using the Strategy design pattern, we can create a single monolithic application which runs the appropriate code on all supported platforms, without the additional costs in performance or code readability of abundant conditional statements.
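A bare-bones sketch of the idea (the class and helper names here are mine, not the sample code's, and the CPUID feature detection behind CpuSupportsSSE() and friends isn't shown):

// Common interface for all batch multipliers.
class MatrixBatchMultiplier
{
public:
    virtual ~MatrixBatchMultiplier() {}
    virtual void BatchMultiply(const float *m, const float *vin,
                               float *vout, int count) = 0;
};

// One subclass per implementation, each forwarding to the corresponding
// free function (plain C++, SSE, 3DNow!, ...):
class MultiplierSSE : public MatrixBatchMultiplier
{
public:
    void BatchMultiply(const float *m, const float *vin,
                       float *vout, int count)
    {
        BatchMultiplySSE(m, vin, vout, count);   // the SSE routine sketched above
    }
};

// Pick the right strategy exactly once, at startup:
MatrixBatchMultiplier *CreateMultiplier()
{
    if (CpuSupportsSSE())         return new MultiplierSSE;
    else if (CpuSupports3DNow())  return new Multiplier3DNow;
    else                          return new MultiplierC;
}

The rest of the program only ever calls BatchMultiply through the base-class pointer, so the processor-specific conditional logic lives in exactly one place.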

It's also worth noting that BatchMultiply5(mat, vin[4], vout[4], 4) is a fast, general-purpose drop-in replacement for 4x4 matrix*matrix multiplication!

Sample Code

Download the complete sample code archive for this article here (25 KB, .zip format).

Links and Resources

Document Revision History