Kent's Blog
EC Container 5

Weak Ordering is a bad idea

Sorry for the 6 year hiatus. Oh boy.

I'm going to jump to the conclusion I wanted to give 8 years ago: Weakly ordered systems are bad. They make writing high-performance multithreaded software much more difficult. And hardware doesn't gain much performance (or simplification) by being weakly ordered.

The first problem with multithreaded code is it's hard to write. C is notoriously difficult to get the semantics right, but most languages have more problems than they would like to admit.

Even if the user writes up an algorithm which is correct, and codes it in C correctly, then they have one more problem to solve: handling weak ordering. This can actually be very difficult to handle with "lockless" algorithms--the barriers have to be placed in many places, and it's hard to get this right.

Most systems fit into one of three categories: Strongly Ordered; Total-Store-Ordered (TSO, a name Sun gave to their ordering, and it's a good name, and applies to many more systems); and Weakly Ordered.

Basically, Strongly Ordered and Total-Store-Ordered mean (almost) all algorithms, including lockless ones, just work without needing any barrier instructions. But Weakly Ordered needs barriers.

To define the ordering briefly: Strongly Ordered CPU cores hold each load and store which misses the cache until it receives ownership for that line, then the load or store completes. Later instructions do not appear to have completed. So a Store to A which misses the cache, followed by a Store to B which hits, stalls the store to B until the A cacheline returns (or is otherwise completed first). All other CPU agents see the stores done by each CPU in the order they are done. A load to A which misses the cache, followed by a load to B which hits, stalls the load to B until the load to A gets its correct data. If B gets snooped out in the meantime, then the CPU will then refetch the B data to get the correct new version.

Total-Store-Ordered means the loads still stall as I indicated above for Strongly Ordered, but stores are handled slightly differently. All stores get placed into a FIFO, and drain to the cache in the order they were placed in the FIFO. So a Store to A doesn't even have to check the cache--it just goes right to the FIFO and the instruction retires. A following store to B goes right to the FIFO, and that instruction retires. Later loads from this CPU check the entire store buffer, and can hit any older store (even if the cacheline is not present). This maintains the illusion that this one thread of instructions executed in order. But as long as the stores drain from the store buffer in FIFO order, this maintains almost all of the properties of Strongly Ordered. Basic Producer-Consumer coding models just work with no barriers, and almost any locking algorithm works with no barriers (although some will not). The only issues arise due to the fact that stores which are not yet visible to other CPUs are visible to later loads on this same CPU can cause some really tricky algorithms to break. However, these cases are generally easy to avoid.

x86 is TSO (they don't call it that, however). Any architecture which has weaker semantics than x86 has a software problem--x86 algorithms don't just translate over, barriers have to be inserted. This is hard work. This is also very difficult work.

Weakly ordered means loads and stores can execute and complete out of order in appearance to other CPUs, unless barriers are used.

I look at it this way: the hardware, goes to great lengths to make it appear that single-process code runs in linear order, with lots of hardware applied to the problem of making it go fast but still appear to be done in instruction order. The fact that this isn't done for multithreaded code shows a lack of concern for that type of code.

But the real problem is Weakly Ordered systems don't gain much. As I was trying to get at with my low-level pipeline discussions, a simple model can divide CPU complexity into 3 tiers: Basic in-order pipeline; Complex multi-issue sort-of-inorder pipeline; Great-Big-Out-of-Order (GBOoO) pipelines. All "high-end" machines (including cell phones) are GBOoO.

For Basic In-Order, the pipeline is so simple it's just Strongly Ordered. No barriers needed.

For GBOoO, there's so much hardware working so hard to make the single thread code appear to be done in order that implementing full Strongly Ordered or TSO costs basically nothing, either in performance or in area/power. It takes some additional care to get TSO/Strongly Ordered right, but the hardware resources are already there for other reasons,

The problem is the mid-tier Complex multi-issue cores. As I showed with Hit-Under-Miss, this can give a small performance boost (2-5%) at the cost of breaking Strongly Ordered or TSO ordering rules. And this is why Weakly Ordered architectures persist.

TSO/Strongly Ordered CPUs have hardware to enforce ordering. If there are no snoops breaking up the code sequence, then the ordering generally has no performance penalty at all. If there are snoops which need special handling, then hardware generally resolves ordering about as fast as possible, and then goes back to full speed.

But, here's the kicker--it's almost certain that code with explicit software barriers runs slower than code without the barrier, even if there are no ordering issues to resolve. The cores which need the barriers lack the hardware to properly order instructions (otherwise, it would just be TSO/Strongly Ordered). So the barrier has to do something stronger, like stall the pipeline until earlier operations complete. And it has to do this even when there is no ordering issue to resolve, because it doesn't have the hardware to "see" the ordering issue. So the barriers actually DO something, and so cost performance. If they cost 2%, then Weakly Ordering doesn't even pay off on the mid-Tier CPU design which wanted it in the first place!

It gets worse. Unless the architects of the GBOoO decide to enforce full Strongly Ordered or TSO semantics and treat barriers as No-Ops, then even though they could see ordering issues and resolve them quickly in hardware, chances are the barriers do some types of stalls even on the GBOoO cores. Now this is bad--any barriers which slow down GBOoO costs high-end performance. This is a big mistake.

So, the best answer is TSO (or even a little stronger), and just eat the slight performance loss on the mid-tier range. It will be made up for by easier software porting, and better high-end performance.

And writing software for weakly ordered systems is extremely difficult. I suspect only a very small minority of programmers even fully comprehend the problem, let alone are able to code it correctly.

Weak Ordering

Sorry for the 2.5 year hiatus on this blog.

I was working up to Weakly Ordered with Hit-Under-Miss. By letting a CPU complete instructions while a load miss is pending, we’re now weakly ordered.

Here’s why: Assume CPU #0 does a Load to address A, which misses the cache. CPU #0 continues executing, and does a Load to address B, which hits the cache, and so that instruction retires. Then let’s assume there’s another miss, and now CPU #0 is stalled. Meanwhile, CPU #1 does a Store to address B, which misses in its cache, then a Store to address A which is a hit. We’ll assume A starts in CPU #1’s cache, and B starts in CPU #0’s cache.

Effectively, the above all happens in parallel at once. CPUs are allowed to operate mostly independently, with delayed resolution of coherency. A simple way to view coherency is that it’s resolved at the DDR memory controller in an arbitrary order (generally, coherency is managed elsewhere, but there’s always an ordering point somewhere, so let’s just imagine it’s at the DDR controller). So at our DDR memory controller, we’re going to get a Load to A and a Store to B. Let’s just do them in that order. To give CPU #0 the A data, DDR needs to “snoop” CPU #1, and make it give up that cacheline first. So DDR asks CPU #1 for the line, and CPU #1 gives DDR back the new modified data (its Store to A has committed). DDR then gives this data to CPU #0, which will then allow CPU #0 to complete is Load to A.

DDR then processes the Store to B miss from CPU #1. DDR needs to snoop CPU #0, and CPU #0 just gives up the line (or provides modified data if it was previously modified). DDR can then return B to CPU #1.

(How does DDR know where the valid data is? It either keeps track of it in some sort of structure; or it just asks all CPUs all the time; or some combination of the two. This is an interesting problem).

Coherency is maintained, our instructions are complete.

In a Strongly-ordered system, all instructions across all CPUs would have a logical ordering. There are only so many combinations of CPU #0’s Load A; Load B can be ordered with respect to CPU #1’s Store B; Store A. It’s tedious to list them all, but here are some example: 0: Load A; 0: Load B; 1: Store B; 1: Store A (just assume CPU #0’s instructions all happened first); or: 1: Store B; 1: Store A; 0:Load A; 0:Load B (just assume CPU #1’s instruction all happened first); or perfectly interleaved: 0: Load A; 1: Store B; 0: Load B; 1: Store A.

What’s interesting is to describe an impossible Strongly-Ordered case: 0: Load B; 1: Store A; 0: Load A; 1: Store B. CPU #1’s instructions must be ordered Store B then Store A, so a global order of CPU #1’s stores cannot be Store A then Store B.

But I’ve just given the results our DDR controller and current pipeline would give. CPU #0 sees an old value for its Load B since it hits in the cache. But CPU #0’s Load A does get the new value of A from CPU #1, since the Store A executed first on CPU #1 (it hit in the cache). If CPU #1 was writing to B to write some info; and then wrote to A to say, “Info is valid”, and if CPU #0 was reading A to see if “Info is valid”, and reading B to get the info, then we’ve just seriously broken the software ordering. Because CPU #0 thinks “Info is valid”, but it read an old stale value.

This is exactly what Weakly Ordered systems do. They do not give a guarantee of global ordering semantics UNLESS you put in a barrier instruction to indicate that you care about ordering of instructions. Let’s call this MBAR, for Memory BARrier.

In this case, both CPU #0 and CPU #1 need MBAR. CPU #0’s sequence would be Load A; MBAR; Load B, and CPU #1’s sequence would be Store B; MBAR; Store A. What MBAR does is stop the execution of later instructions until all previous instructions are committed to be ordered as completed. This fixes this case.

But what does MBAR do? It basically stops hit-under-miss. So, if you care about ordering, you have to sprinkle in MBARs to stop hit-under-miss. But if you don’t care about ordering, you don’t need MBAR and you can take advantage of hit-under-miss.

You’re probably thinking the compiler inserts these MBARs when necessary and programmers can just ignore this stuff. Unfortunately, that’s generally not true.

Ordering

The CPU pipeline design I just laid out in the previous post was the pinnacle of 1989 RISC design. Mostly.

The simple pipeline has a very nice feature: each instruction appears to execute and complete in the order they pass through the EX stage. Earlier instructions before the current instruction are complete, and instructions not yet in EX have not done any work yet. The CPU is pipelined, so many instructions are in various stages of execution. But to software, the single EX pipeline stage makes it appear as if the instructions are executing in order. And a much stronger order than that: data accesses are strongly ordered.

Most programmers have a simplistic model for a CPU (if they have a model at all), and most expect it to execute instructions one at a time in the order of the assembly language instructions. This is a pretty reasonable mental model, and most CPU architectures go to great lengths to try to achieve it.

Every architecture (excluding Alpha, which fortunately is dead now) makes it appear that the instructions executed on a single CPU core fully execute in order. Storing to address A and immediately loading from address A in the next instruction always gets the right data. Loading from address B and then storing different data to address B never causes the earlier load to see the later store data. With no funny synchronization instructions needed.

So basically, all architectures have some EX pipeline stage which they order all instructions through, from a single core’s perspective. So why aren’t all CPU architectures strongly-ordered, and what are the other ordering models?

Unfortunately, there’s terrible nomenclature around CPU ordering, and even worse, technical descriptions tend to get long and very abstract. To put it simply, there are roughly 3 levels of ordering: Strongly Ordered, Store-Ordered (SPARC called this Total-Store-Ordering, and I like the acronym TSO), and Weakly-Ordered. What’s happening is architectures are breaking certain rules just laid down for the EX stage to try to get higher performance. So I think it’s best to think about what rule is being broken to understand what’s going on.

Let’s look at way to move our simple CPU into the 1990’s. One step was “dual-issue”, where multiple instructions could execute at once. This generally doesn’t affect ordering, so I’ll ignore it. Another step, which is an ordering issue, was called “hit-under-miss”.

Previously, if a Load instruction was a cache miss, we’d stall the pipeline (in the MEM or EX stage) and wait for data to return. Once data returned, the pipeline would restart, and subsequent instructions would then make it to the EX stage.

A very very good way to think about a CPU’s performance is to look at stalls. Stalls are any pipeline stalls where no useful work is done. With this CPU design, each load and store miss stalls the pipeline for the latency to main memory, which can be a fairly long time. The idea of Hit-Under-Miss is to allow one miss to be pending without stalling the pipeline. So, if there are no misses currently pending, when a Load instruction misses the cache, let it go through EX and MEM without a pipeline stall. Instead, mark its target register as “busy”, and stall any instruction in EX if it tries to read the busy register. Hardware on the side (not in the main pipeline) waits for the data from main memory to return. But instructions after the Load can execute and complete, as long as they don’t touch the “busy” register. For simplicity, if another Load or Store misses the cache, we’ll then stall at EX/MEM.

This is a nice speed boost. Let’s assume a code sequence which causes a Cache Miss every 6 cycles (which we’ll assume is 6 instructions), and main memory latency is 20 cycles. And let’s assume we can, on average, execute 3 instructions (3 cycles) after a Cache Miss before causing a stall (either because of using the Loaded value, or causing another miss).

Without Hit-Under-Miss, executing 6 instructions will take 6 cycles plus 1 miss of 20 clocks, for a total of 26 cycles. With Hit-Under-Miss, after 6 instructions, there will be a miss. But we can execute 3 instructions in the shadow of the miss, then stall, waiting for the data to come back. Then, restart and execute 3 more instructions, then miss again. Then execute 3 more instructions during the memory access, then stall waiting for the memory to come back. Repeat this pattern, and you can see a miss is started every 23 cycles. Effectively, the 3 instructions done while waiting for main memory are “free”, so 6 instructions take just 23 cycles. Even with high miss traffic, and only able to execute a few instructions before stalling again, Hit-Under-Miss helps noticeably. (In CPU design, a 10% improvement is pretty big).

Hit-Under-Miss doesn’t affect a single-CPU core’s view of the ordering of its instructions, but it does change the multiprocessor view of main memory.

CPU pipelines in 4 easy stages

I’m not a CPU designer, but I know how CPUs work. And since I have opinions on everything, I have opinions on CPUs. I want to get to an opinion on system architecture involving CPUs, but before I can get to that, it’s going to take several posts of background first.

Imagine every possible CPU design. I’ll wait. Got it yet?

Forget that. I’m going to simplify every possible CPU design to a simple in-order RISC pipeline. I know, your favorite CPU is much more complex. But for the point I want to make, all CPUs effectively have an EX pipeline stage, where all the magic happens.

The classic RISC pipeline consists of the stages IF, ID, EX, MEM, WB. IF is Instruction Fetch, ID is Instruction Decode, EX is Execute, MEM is memory access, and WB is WriteBack. The way I like to view it is EX is the main stage, and the other stages are preparation or cleanup for EX. ID is the important preparation stage, getting the registers ready and handling bypass results from EX. And MEM and WB exists so EX can be chock full of work, so writing results to the register file is pushed off so the clock frequency can be as high as possible.

I should make a drawing here, but that’s too much work for me. Since a RISC CPU instructions generally have 2 input registers and one output register, picture ID ending with flip-flops holding the two input values. ID is the stage where the register file is read to get data ready. Then in EX, an ADD instruction (for example) will take the two input values, add them together, and flop the result at the end of EX. It also will feed that result back to ID’s flip-flop inputs (the bypass path), in case the very next instruction wants to use the result value.

The thing to note is EX is where the instructions are effectively ordered. Instructions in IF or ID haven’t occurred yet, and if the instruction in EX needs to take a trap or exception, those instructions currently in IF or ID will be tossed. And the instruction in WB is committed--nothing the instruction in EX can do will make that instruction disappear. It’s already effectively occurred (even though it’s still technically in the pipeline). (Yes, I realize, a more complex pipeline could have many more stages, and even chase down and kill very late in the pipeline instructions due to certain reasons, and do instructions out-of-order...but that’s just complexity, all pipelines eventually have a commit point, let’s just call it EX).

But ADD instructions are not that interesting: loads and stores are interesting. Let’s assume loads and stores are done in two parts across the EX and MEM stage. Following the rule that before an instruction can get to the WB stage, it has to be committed, we’ll force the rule that any possible exception must be handled in EX. So TLB misses and TLB protection issues must be resolved in the EX stage.

But how are cache misses handled?

Let’s look at how cache accesses work. A load or store has two separate yet closely related lookups to do: one is to access the tag array to see if the data is valid in the data cache; the second is to actually access that data. At the level-1 data cache level, generally the load data access can begin without needing the tag lookup to complete (basically, address bits not in the tag are used to index into the data array). If the cache is associative, the tag results which arrive around the same time as the data results then select which of the associative data entries to choose. So, do the TLB lookup in EX, start the tag read and the data array read in EX, but let them complete in MEM. For stores, let the data write occur in MEM, after the tag read results are known. So exceptions are handled in EX, but actually handling the returned data is done in EX and MEM. And this is why the original RISC architectures had a load-to-use delay of one clock cycle: LOAD followed by an ADD using the loaded data would have a 1-cycle stall.

For a single-CPU simple design, cache misses would probably stall in MEM for loads and stores. If a load or store missed the cache (let’s assume write-back write-allocate caching only), the CPU would fetch the cacheline, then do the access again. Note that the stall is past the commit point--it will be done, it just has to wait on memory. This keeps the design simple, and achieves another important effect: the CPU appears to be strongly-ordered.

Spending $300 to save $10

I live in New England, where frugalness is prized. Well, to some extent, at least. How about: I don’t like to waste money needlessly, although I do value my own time.

When my HP CP2025DN color Laserjet printer ran out of its initial black toner cartridge, I knew how to get more toner out of it. Lots of years with HP black&white laser printers taught me there’s a LOT of toner still in there when the printer thinks it’s “out”, and the failure mechanism of actually running out of toner is the text on the printouts just gets lighter and lighter. When that happens, just shake the toner cartridge, and get another hundred pages or so.

The printer has 4 cartridges, the usual black, magenta, yellow, and cyan. Each cost about $100 from HP, a little less if you want refilled no-name cartridges. I bought 4 new cartridges since I’ll need them all eventually, but didn’t put the black in until I noticed the output fading. The color cartridges were more than 2/3rd full when the block toner initially ran “out”. So after a few hundred pages, I thought the printouts were looking a little less black, and put in a new black cartridge.

And...the printer complained it was still out of toner. What it had done was to drain the color cartridges, probably using all the ink to approximate black. And so all the color toner cartridges were empty.

So to use the last $10 or so of one black cartridge, I’d drained $300 of color cartridges.

But I still won’t replace the color cartridges--I’m using the new black cartridge, but not the new color cartridges. I still get color that looks fine to me even though the printer says all of the color cartridges are completely empty. But what do I have to lose now? Now, I’m running a new experiment, to see how long the printer will print color even with “empty” color cartridges. And to see how HP will try to rip me off this time. I guess it could needlessly drain the black toner, but where would the black ink go?

Moral of the story: Don’t run with scissors? Never start a land war in Asia? Well, I’m not happy HP feels the need to do something underhanded (who would really want to use expensive color ink to approximate black?) to make me buy more toner. I didn’t mention the fact that the printer by default stops and refuses to print when it thinks it still has 6% toner left, although that setting can be overridden on a menu setting (Toner Low Override: On).

I had previously bought a Minolta MC2550 color laser printer, and it did colors great (better than HP), but its small black text was blurry, and since 90% of what I want is to print really small text (2-up duplex text), that was just unacceptable. Then the Minolta began jamming constantly...so I never did buy new toner cartridges for it, I just traded it in towards the HP instead when Staples ran a deal on trade-ins, after I’d run the toner down to nearly empty. Which was a bad deal for Staples since I would have been happy to pay someone to take the Minolta away.

But now I no longer trust HP printers, which is probably going to cost HP more than $300 in the long run.
EC Container 6