Over at StorageMojo, Robin comments on the challenges of shared memory controllers with multi-core processors. This is actually something that’s been a big problem for regular software development for a while now, and is especially important in the storage space.
There’s a big problem today: processors keep getting faster, but memory latency and bandwidth aren’t keeping up. A processor can perform a complicated operation in a nanosecond, but retrieving the data to operate on might take ten times that long. This is particularly an issue for storage, because nearly everything involves moving data into and out of the system, and all of that data has to pass through main memory. If it has to pass through the processor as well, you can rapidly use up the available memory bandwidth; this is one of the reasons technologies like RDMA were developed.
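To put illustrative numbers on it (mine, invented for the example, not measurements): a 10 Gb/s link delivers roughly 1.25 GB/s of payload, and a conventional receive path touches main memory about three times per byte, once when the NIC DMAs into a kernel buffer, and again when the CPU reads that buffer and writes the data into the application's buffer:

$$
\underbrace{1.25}_{\text{DMA write}} + \underbrace{1.25}_{\text{copy read}} + \underbrace{1.25}_{\text{copy write}} \approx 3.75\ \text{GB/s of memory traffic per } 1.25\ \text{GB/s of payload}
$$

RDMA collapses those extra passes by placing data directly where the application wants it.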
Even worse, there don’t seem to be any good tools for identifying whether a software process is performance-bound by memory bandwidth or by memory latency! You might naively increase the processor speed or the number of cores in your system to boost performance and find no change at all. Let me explain.
Let’s say you’re working on improving performance for an application. It’s easy to observe that the network interface is saturated and conclude that your process is network bandwidth-bound: to improve performance you need to add more I/O bandwidth or change your wire encoding. With a little more instrumentation you can determine whether your process is network latency-bound, waiting on remote requests all the time, and know that you need more network parallelism.
Similarly, it’s easy to tell if you are disk bandwidth or latency-bound — you’ll always be in disk wait.
If you’re not waiting on the disk and you’re not waiting on the network, the default assumption is that you are CPU-bound — add a faster processor and you’re on your way, or optimize the areas that your profiler shows you spending time in to run in fewer cycles. But this frequently doesn’t help today.
Processor speed has greatly outstripped memory speed. If you’re operating on data in registers or in cache, adding a faster processor can help. Most data lives in main memory, however, and has to be brought onto the chip; fulfilling a request from main memory can take dozens or even hundreds of processor clock cycles! The instruction cannot be processed until the data has been retrieved, so even if the processor were twice as fast, it couldn’t get more done.
This is why modern processors have whizz-bang features like out-of-order execution, branch prediction, and hardware multithreading: keep enough work in flight that the processor can always be doing something while it waits for memory requests to be fulfilled. This masks a lot of the memory delay, but at some point you run out of things for the processor to do.
Other than experimentally, how can you tell if a process is memory-bound? As far as I know, all profiling mechanisms will show such a process as CPU-bound, because the sampler will find the instruction pointers sitting in routines doing lots of computation on things in memory, and this will be indistinguishable from any number crunching those routines do. I’m pretty sure I can ask the chipset about cache miss statistics, but that really doesn’t tell me much.
This is really important, because it tells you whether it’s productive to jump through hoops eliminating in-memory copies (for example), or whether you simply need a faster processor. It tells you whether adding memory channels or faster memory would be a win. But I can’t find any way to determine this from a profiling perspective.
Here’s a good paper explaining the problem further, with the money quote “In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.”
The authors divide compute time into processing time, memory latency stall time, and memory bandwidth stall time. This is exactly the data I’d like to see, but they gathered it by running SPECmarks on a simulated processor and memory architecture. I’d love to gather profiling data in situ, or at least on a simulation of a modern Intel architecture… Cachegrind gets you part of the way there, but not far enough.
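Their breakdown, with an illustrative stall model of my own (the numbers are invented to show the shape of the problem, not taken from the paper):

$$T_{\text{total}} = T_{\text{processing}} + T_{\text{latency stall}} + T_{\text{bandwidth stall}}$$

$$\text{CPI} = \text{CPI}_{\text{exec}} + \frac{\text{misses}}{\text{instruction}} \times \text{miss penalty} = 1 + 0.02 \times 200 = 5$$

In that example, 80% of the cycles are memory stalls, yet a sampling profiler attributes every one of them to the instructions doing the loads, which is why the process looks CPU-bound.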
Meanwhile, we work hard to increase system performance with the tools and instrumentation we have available. We’ve produced a 50% performance improvement over the past six months, and we’re on track to do it again!