*Originally appeared in Hugi Issue #25*

- Introduction
- Techniques
- Conclusion and Future Possibilities
- Download Sample Code
- Links and Resources
- Revision History

In this article I'll be describing the process of optimizing a piece of code to make use of Intel's SSE instruction set. As an example, I will work with a 4x4 matrix-vector multiplication function (commonly found in most 3D graphics engines). Starting from a naïve C++ implementation, I will progressively apply various optimization techniques, until the function is more than four times faster than its original version. The code samples are in C++ with inline Intel x86 assembly (MSVC syntax), but the techniques generalize easily to other environments and platforms.

We'll apply these techniques to a problem that often arises in computer graphics: multiplying a 3D vector by a 4x4 matrix. This comes up whenever you have a 3D vertex that you wish to transform using a standard 4x4 transformation matrix. Of course, all graphics cards since the original Nvidia GeForce perform final vertex transformations in hardware. But there are still plenty of reasons to transform your vertices before sending them to the graphics card (matrix composition, visibility culling and collision detection, to name a few - not to mention non-hardware-accelerated applications).

So, hopefully you agree that it's important to have a fast software matrix-vector multiplier. Well, you're in luck - there are plenty of fast implementations available! In Hugi #20, Chris Dragan uses AMD's 3DNow extensions to write a matrix-vector multiplier that runs in 16 cycles! But...what if you don't have an Athlon (or what if you want your demo to run well on non-Athlons)? Well, not to worry - Intel provides a free matrix library, optimized for the Pentium 3 and 4! Unfortunately, to make use of the optimized versions of Intel's matrix functions, you need to be using Intel's proprietary compiler. Well, there's always libSIMD, a free cross-platform optimized matrix library. However, the libSIMD project is still in the early stages of development, and currently doesn't provide any optimization at all. So why not go dig out your Linear Algebra textbook, look up the definition of matrix multiplication, and write some C++ code! I suspect your first draft would look something like this:

- MatrixMultiply1 (100 cycles/vec)
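
The listing isn't reproduced here, but such a first draft is easy to reconstruct from the textbook definition. A minimal sketch (the row-major layout and member names are my assumption, loosely matching the article's Matrix4f/Vector4f classes):

```cpp
// Naive 4x4 matrix-vector multiply: out = mat * in.
// Layout assumption: row-major storage, m[row][col].
struct Vector4f { float x, y, z, w; };
struct Matrix4f { float m[4][4]; };

void MatrixMultiply1(const Matrix4f& mat, const Vector4f& in, Vector4f* out)
{
    // One dot product per output component - 16 multiplies, 12 adds.
    out->x = mat.m[0][0]*in.x + mat.m[0][1]*in.y + mat.m[0][2]*in.z + mat.m[0][3]*in.w;
    out->y = mat.m[1][0]*in.x + mat.m[1][1]*in.y + mat.m[1][2]*in.z + mat.m[1][3]*in.w;
    out->z = mat.m[2][0]*in.x + mat.m[2][1]*in.y + mat.m[2][2]*in.z + mat.m[2][3]*in.w;
    out->w = mat.m[3][0]*in.x + mat.m[3][1]*in.y + mat.m[3][2]*in.z + mat.m[3][3]*in.w;
}
```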

A brief note on cycle timings: the timings I give for each function were achieved when running the sample code on my 500 MHz Pentium 3. Different machines will have different timings (for instance, the Pentium 4 offers significantly faster SIMD performance than the Pentium 3). But the absolute timings are mostly irrelevant - what matters is the relative timings, which show off the performance gains from each successive optimization.

So sure, the function works - but if all you're looking for is code that WORKS, you wouldn't be reading Hugi, would you? Functionality isn't good enough - we want something FAST! And since Chris Dragan's 3DNow version runs more than 5 times faster, you may suspect (correctly, as it turns out) that we can make some pretty heavy optimizations. You can optimize the C++ all you want (see MatrixMultiply2 in the code sample), but you aren't going to get the performance increase you're hoping for.

Why does Chris's version run so fast? Because it uses 3DNow! 3DNow is an example of a SIMD (Single Instruction, Multiple Data) instruction set, which lets you execute the same operation on multiple pieces of data simultaneously. Intel provides its own SIMD instructions, called SSE (Streaming SIMD Extensions), which are available on the Pentium 3 and up. SSE lets you perform operations on four 32-bit floats in parallel! (The Pentium 4's SSE2 instructions provide similar operations for 64-bit doubles, but most graphics applications don't need the extra precision).

SIMD lends itself amazingly well to matrix operations. Intel's developer site contains descriptions of SSE-optimized matrix multiplication, inversion, and LU decomposition algorithms. After familiarizing yourself with the SSE instructions (Intel provides an excellent tutorial/reference - see the links at the end of the article for more information), you might update your MatrixMultiply function to look something like this:

- MatrixMultiply3 (43 cycles/vec)

Note: due to the way that SSE instructions work, it ends up being convenient to have the input matrix be in column-major order. If you'd rather keep your matrices in row-major order (the way OpenGL expects them), don't panic - it turns out you can do it either way without a significant performance penalty. Keep reading.

Here's a general overview of the SSE version of the multiplier algorithm:

- Load the matrix columns into SSE registers xmm4-xmm7.
- Load the input vector into xmm0.
- Initialize the output vector (xmm2) to zero.
- For each component of the input vector:
  - Broadcast that component into xmm1 (i.e. replicate its value into all four slots of xmm1).
  - Multiply xmm1 by the appropriate column of the matrix (xmm4 for x, xmm5 for y, etc.).
  - Add the result to the output vector, xmm2.
- Write xmm2 to memory.
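
The steps above can be sketched in SSE intrinsics rather than the article's inline MSVC assembly; the function name and the column-array parameter layout are my own:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// col[i] holds the i-th *column* of the matrix (column-major, per the note
// above). Unaligned loads/stores are used; movups is the equivalent instruction.
void MatrixMultiplySSE(const float col[4][4], const float in[4], float out[4])
{
    __m128 c0 = _mm_loadu_ps(col[0]);  // matrix columns -> "xmm4..xmm7"
    __m128 c1 = _mm_loadu_ps(col[1]);
    __m128 c2 = _mm_loadu_ps(col[2]);
    __m128 c3 = _mm_loadu_ps(col[3]);

    __m128 acc = _mm_setzero_ps();     // output vector "xmm2", zeroed
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[0]), c0));  // broadcast x
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[1]), c1));  // broadcast y
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[2]), c2));  // broadcast z
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[3]), c3));  // broadcast w
    _mm_storeu_ps(out, acc);           // write "xmm2" to memory
}
```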

Not bad - about twice as fast as the original version! But there are still plenty of ways to make this function faster.

Next, let's make a slight algorithmic change. We know matrix-vector multiplication is a common operation, but how often are we going to be transforming each vertex by a different matrix? Not very often - usually, we're doing things like transforming entire models into world-space, which involves multiplying lots of vertices by the exact same matrix (namely, the object's model matrix). This means we don't need to load the matrix before every vector - we can just load it once, and then feed the matrix a stream of vectors to be multiplied! With some slight modifications to our code, we end up with a BatchMultiply function:

- BatchMultiply1 (34 cycles/vec)

Better still! Since we're now processing many vectors at once, it's feasible to insert a transpose operation during function initialization - its amortized cost across all the input vectors is virtually zero (and there's a pretty fast SIMDified matrix transpose function included in the sample code). So it doesn't really matter whether you store your matrices in row-major or column-major order, as long as you remember to transpose them into column-major order before loading them into the SSE registers.
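
For reference, the transpose itself can be SIMDified with the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>` - a sketch (whether this matches the sample code's version is an assumption on my part):

```cpp
#include <xmmintrin.h>

// Transpose a 4x4 matrix: row-major in, column-major out (or vice versa).
void Transpose4x4(const float in[4][4], float out[4][4])
{
    __m128 r0 = _mm_loadu_ps(in[0]);
    __m128 r1 = _mm_loadu_ps(in[1]);
    __m128 r2 = _mm_loadu_ps(in[2]);
    __m128 r3 = _mm_loadu_ps(in[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rows become columns, in-register
    _mm_storeu_ps(out[0], r0);
    _mm_storeu_ps(out[1], r1);
    _mm_storeu_ps(out[2], r2);
    _mm_storeu_ps(out[3], r3);
}
```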

Okay, next lesson: when looking through the SSE instruction reference, did you notice that there are two different instructions for data movement: `movups` (unaligned 16-byte MOV) and `movaps` (aligned 16-byte MOV)? If you tried using `movaps`, you probably ended up with a "privileged instruction" runtime error, which is why you settled on `movups`. As it turns out, `movaps` is much faster than `movups`, but it requires that its arguments be aligned on 16-byte boundaries when moving to/from memory (otherwise, the instruction throws an exception). You have to do a little extra work to get your matrices and vectors 16-byte aligned, but it's worth it.

The first thing I tried in Visual C++ was to go into Project -> Settings -> C/C++ -> Code Generation and change the "struct member alignment" field from "8 bytes" to "16 bytes". Problem solved? No, apparently not. Changing this field doesn't appear to do anything useful (if anybody knows why, please explain it to me!). Instead, I had to make changes in the code itself, using the C/C++ `__declspec(align(16))` specifier. You can put this specifier in front of the definition of any variable, data member, struct or class to force the object to be 16-byte aligned (in the case of structs or classes declared `__declspec(align(16))`, all instances of that struct/class will be properly aligned).

It's important to note that this method doesn't work for objects (or arrays of objects) which are dynamically allocated (with `malloc()` or `new`). Fortunately, the MSVC Processor Pack Upgrade includes an `_mm_malloc()` function (defined in `malloc.h`) which should do the trick. Otherwise, it may be necessary to write your own memory management routines to make sure your objects are properly aligned. Email me if you'd like suggestions on how to set this up.
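
To make this concrete, here's a minimal sketch of both techniques. On compilers without `__declspec`, standard C++11 `alignas(16)` is the equivalent, which is what's shown here; the helper names are mine:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // also declares _mm_malloc/_mm_free on most compilers

// All instances of these types are now 16-byte aligned.
// (MSVC-era spelling: __declspec(align(16)) struct Vector4f { ... };)
struct alignas(16) Vector4f { float x, y, z, w; };
struct alignas(16) Matrix4f { float m[4][4]; };

// Heap allocations still need an aligned allocator:
inline Vector4f* AllocVectors(std::size_t n)
{
    return static_cast<Vector4f*>(_mm_malloc(n * sizeof(Vector4f), 16));
}
inline void FreeVectors(Vector4f* p) { _mm_free(p); }
```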

After modifying my Matrix4f and Vector4f classes to be 16-byte aligned, simply replacing all instances of `movups` with `movaps` in our BatchMultiply function results in another speedup! Here's the code:

- BatchMultiply2 (28 cycles/vec)

Chris's articles go into a great deal of detail about instruction pairing, so I'll only summarize here. Basically, if you order your code properly, the Pentium III will execute multiple instructions at the same time! There are two main strategies for ensuring optimal instruction pairing.

First, try to eliminate data dependencies between adjacent instructions. If at all possible, avoid reading from and writing to the same register in adjacent instructions. In our case, that means instead of executing the "move / broadcast / multiply / add" loop once for each component of the input vector, we execute all the moves, then all the broadcasts, then all the multiplies, and then all the adds - so adjacent instructions no longer touch the same registers.

Secondly, it's important to understand the processor's instruction decoding mechanism. If your instructions are ordered properly (and free of data dependencies), the CPU can decode up to three instructions in the same clock cycle! Chris Dragan's article in Hugi #21 explains the Pentium III's instruction decoder quite well, so I won't go into too much detail. The most important practical lesson, however, is that only one packed SSE instruction (i.e. one whose opcode ends in 'ps' - these operate on several values at once) can be decoded per cycle. Long strings of packed SSE instructions are therefore a major performance bottleneck - so much potential parallelism goes to waste! Wherever possible, try to alternate between packed and non-packed instructions.
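
The data-flow side of this can be illustrated in intrinsics (scheduling the actual assembly is the real game, but the dependency structure is the same idea). Computing the four products into independent registers turns the serial accumulate chain into a short tree; the function name is mine:

```cpp
#include <xmmintrin.h>

// Dependency-reduced multiply sketch: the four multiplies are mutually
// independent, and the adds form a two-level tree instead of a serial chain.
void MatrixMultiplyReordered(const float col[4][4], const float in[4], float out[4])
{
    __m128 p0 = _mm_mul_ps(_mm_set1_ps(in[0]), _mm_loadu_ps(col[0]));
    __m128 p1 = _mm_mul_ps(_mm_set1_ps(in[1]), _mm_loadu_ps(col[1]));
    __m128 p2 = _mm_mul_ps(_mm_set1_ps(in[2]), _mm_loadu_ps(col[2]));
    __m128 p3 = _mm_mul_ps(_mm_set1_ps(in[3]), _mm_loadu_ps(col[3]));
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3)));
}
```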

Here's a version of BatchMultiply whose instructions have been re-ordered to improve performance:

- BatchMultiply3 (22 cycles/vec)

The next optimization is more subtle. In addition to the SIMD instructions we've been using so far, SSE also provides instructions that let you specify how you want your data to interact with the system cache. The Intel SSE tutorial provides full coverage of these instructions, but the one that interests us right now is `prefetchnta`. It lets us pre-load a specific address into a special non-temporal area of the L1 cache, which is guaranteed not to displace any data that was loaded into the cache for more legitimate reasons (such as temporal or spatial locality). Essentially, we use `prefetchnta` to indicate that we want the data at the specified address to be in the cache just long enough to read from (or write to) it exactly once, at which point it won't be used again.

This is exactly how we want to treat our streams of input and output vectors - we only want each vector to be in the cache long enough to be read from (or written to). Then we want it *GONE*, so that we have plenty of cache space for future vectors! So we add a pair of `prefetchnta` instructions to BatchMultiply - one for the input stream, and one for the output stream:

- BatchMultiply4 (21 cycles/vec)

At first, the benefits of prefetching aren't apparent - we only saved a single cycle! However, once we start passing in larger streams of vectors (around 10000), the prefetching version of the code begins to perform significantly better than the non-prefetching version.

Technically, we should be prefetching data which is significantly ahead of the data we're currently operating on. You should experiment to find the specific prefetching offset that provides the best performance for your application (48 bytes seems to be the "sweet spot" for the Pentium 3). It might also be worth investigating the `movntps` (streaming store) instruction, which writes data from SSE registers directly to memory while bypassing the cache entirely (thus further avoiding cache pollution). See the Intel SSE tutorial for more information.
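
In intrinsics form, the batch loop with prefetching might look like this sketch (`prefetchnta` corresponds to `_mm_prefetch(..., _MM_HINT_NTA)`; the 48-byte, i.e. 12-float, look-ahead is the Pentium 3 sweet spot mentioned above; the function name is mine):

```cpp
#include <xmmintrin.h>

// Assumes col, vin and vout are 16-byte aligned (movaps-style loads) and
// vin/vout each hold 4*count floats. Prefetch is only a hint, so running
// the prefetch address past the end of the array is harmless.
void BatchMultiplyPrefetch(const float col[4][4], const float* vin,
                           float* vout, int count)
{
    __m128 c0 = _mm_load_ps(col[0]), c1 = _mm_load_ps(col[1]);
    __m128 c2 = _mm_load_ps(col[2]), c3 = _mm_load_ps(col[3]);
    for (int i = 0; i < count; ++i, vin += 4, vout += 4) {
        // Pull the vector 48 bytes ahead into the non-temporal cache area.
        _mm_prefetch(reinterpret_cast<const char*>(vin + 12), _MM_HINT_NTA);
        __m128 acc = _mm_mul_ps(_mm_set1_ps(vin[0]), c0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(vin[1]), c1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(vin[2]), c2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(vin[3]), c3));
        _mm_store_ps(vout, acc);  // or _mm_stream_ps(vout, acc) to bypass cache
    }
}
```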

To see the benefits of prefetching, run the sample program simdtest.exe with the -f flag. This will flush the entire cache between each test. You'll see that prefetching can double the speed of your code in the worst case. This is a useful technique to remember, since the prefetching routines can be used in non-SIMD code as well!

Our next optimization technique warrants another nod to Chris Dragan. He sped up his 3DNow multiplier from 19 cycles to 16 cycles by operating on two adjacent vectors concurrently! While this didn't change the amount of work he had to do on each vector, it reduced the overall amount of time spent waiting for memory I/O operations to complete. Reading two objects from memory one right after the other is faster than reading one object, then running a lot of code, then reading the other object. With a bit of re-organization, we can apply a similar method to our SSE multiplier.

First we need to reduce the number of registers we're using per vector. We're using four of the eight SSE registers to store the matrix columns, one register for the input vector, one for the output vector, and one or two for scratch values. This won't do at all if we want to work on two vectors at once: we need at least two registers to store the output vectors, leaving only two registers for the input vectors and scratch values! It looks like we're stuck.

Well, not really - we don't actually need to store the input vectors in registers. Instead, we can load individual components of the input vectors directly from memory into the scratch registers, and then broadcast, multiply and add them as usual! To load individual floats instead of whole vectors, we use the `movss` instruction. As an added bonus, `movss` is a non-packed instruction, which lets us make better use of the parallel instruction decoders (see above)! Here's a version of BatchMultiply which operates on two vectors at once:

- BatchMultiply5 (20 cycles/vec)

Of course, there's some additional overhead before we enter the main loop: we need to handle the case where we're passed an odd number of vectors. But as always, setup costs amortize away as the size of the data set grows, so for a reasonable number of input vectors the impact of this additional setup is virtually zero.

There's one more significant change we can make: by making some
assumptions about the input vectors, we can shave off a few more
cycles. Until now, we've been working with a general-purpose 4x4
matrix-vector multiplier. But remember that our original application
involved transforming 3D vertices. 3D vertices are usually stored as
four-element vectors, where the fourth element *w* is the
"homogenous coordinate", used for perspective correction during the
final transformation process. The value of *w* is initially 1.0,
and it almost never changes (until the final stages of the
rasterization pipeline, at which point the vertex is out of our
control anyway). By assuming that *w* = 1.0 in the input
vectors, we don't need to broadcast *w* and multiply it by xmm7;
we can just add xmm7 directly to the output vector! Thus, we can
eliminate three instructions (a move, a shuffle and a multiply) from
our function (I'll call the new version BatchTransform, to
differentiate it from the slower but more generalized
BatchMultiply):

- BatchTransform1 (17 cycles/vec)
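
In intrinsics form, the core of the change might look like this sketch (assuming w == 1 in the input, so the fourth matrix column simply seeds the accumulator; the function name is mine):

```cpp
#include <xmmintrin.h>

// Transform sketch assuming in[3] == 1: no broadcast/shuffle/multiply for w -
// the fourth column (the translation) just initializes the accumulator.
void Transform1(const float col[4][4], const float in[4], float out[4])
{
    __m128 acc = _mm_loadu_ps(col[3]);  // 1.0 * col3 == col3
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[0]), _mm_loadu_ps(col[0])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[1]), _mm_loadu_ps(col[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[2]), _mm_loadu_ps(col[2])));
    _mm_storeu_ps(out, acc);
}
```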

Ideally, you want both 3DNow and SSE optimized versions of certain
algorithms in your program (in addition to a non-optimized version,
for testing or portability). But you certainly don't want to litter
your code with conditional statements to choose the right version to
use at runtime. Compiling a separate version of your program for each
processor variant is obviously the optimal solution performance-wise,
but it's a tad inelegant (and it's easy to get the different versions
confused). Using the *Strategy* design pattern, we can create a
single monolithic application which runs the appropriate code on all
supported platforms, without the additional costs in performance or
code readability of abundant conditional statements.
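
A sketch of that dispatch (the detection flags and optimized-variant names are hypothetical; only the portable fallback is shown in full):

```cpp
// Strategy pattern: choose the batch multiplier once at startup, then
// call through a function pointer everywhere else - no scattered ifs.
typedef void (*BatchMultiplyFn)(const float col[4][4], const float* vin,
                                float* vout, int count);

// Portable fallback: plain C++, column-major matrix (col[c][r]).
void BatchMultiplyC(const float col[4][4], const float* vin,
                    float* vout, int count)
{
    for (int i = 0; i < count; ++i, vin += 4, vout += 4)
        for (int r = 0; r < 4; ++r)
            vout[r] = col[0][r] * vin[0] + col[1][r] * vin[1]
                    + col[2][r] * vin[2] + col[3][r] * vin[3];
}

BatchMultiplyFn ChooseBatchMultiply(bool hasSSE, bool has3DNow)
{
    // if (has3DNow) return BatchMultiply3DNow;  // hypothetical variant
    // if (hasSSE)   return BatchMultiplySSE;    // hypothetical variant
    (void)hasSSE; (void)has3DNow;                // CPUID detection omitted
    return BatchMultiplyC;
}
```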

It's also worth noting that BatchMultiply5(mat, vin[4], vout[4], 4) is a fast, general-purpose drop-in replacement for 4x4 matrix*matrix multiplication!

Download the complete sample code archive for this article here (25 KB, .zip format).

- Visual Studio Processor Pack
- Intel's Small Matrix Library
- Intel's SSE tutorial
- Intel's compiler
- Agner Fog's Pentium Optimization Document
- libSIMD
- Thanks to Mark Larson for his suggestions and feedback!
- Thanks to Ben Wendelstein for prompting me to locate the `_mm_malloc()` function!

- 5/8/2002 -- Initial HTML publication