Kent's Blog

CPU pipelines in 4 easy stages

I’m not a CPU designer, but I know how CPUs work. And since I have opinions on everything, I have opinions on CPUs. I want to get to an opinion on system architecture involving CPUs, but before I can get to that, it’s going to take several posts of background first.

Imagine every possible CPU design. I’ll wait. Got it yet?

Forget that. I’m going to simplify every possible CPU design to a simple in-order RISC pipeline. I know, your favorite CPU is much more complex. But for the point I want to make, all CPUs effectively have an EX pipeline stage, where all the magic happens.

The classic RISC pipeline consists of five stages: IF, ID, EX, MEM, and WB. IF is Instruction Fetch, ID is Instruction Decode, EX is Execute, MEM is Memory access, and WB is WriteBack. The way I like to view it is that EX is the main stage, and the other stages are preparation or cleanup for EX. ID is the important preparation stage, getting the registers ready and handling bypassed results from EX. And MEM and WB exist so EX can be chock full of work: writing results to the register file is pushed off to later stages so the clock frequency can be as high as possible.
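
If it helps to see the flow as code, here’s a minimal sketch of the five stages as a shift register (the names and structure are mine, not any real design): each clock, every instruction moves one stage down the pipe.

    /* Toy model of the 5-stage pipe: each clock, every instruction moves
       one stage, a new one is fetched, and the one in WB retires. */
    #include <stdio.h>

    enum stage { IF, ID, EX, MEM, WB, NSTAGES };
    static const char *stage_name[NSTAGES] = { "IF", "ID", "EX", "MEM", "WB" };

    int main(void) {
        int pipe[NSTAGES] = { -1, -1, -1, -1, -1 };  /* -1 is a bubble */
        int next_inst = 0;

        for (int cycle = 0; cycle < 7; cycle++) {
            for (int s = NSTAGES - 1; s > 0; s--)    /* advance the pipe */
                pipe[s] = pipe[s - 1];
            pipe[IF] = next_inst++;                  /* fetch a new instruction */

            printf("cycle %d:", cycle);
            for (int s = 0; s < NSTAGES; s++)
                printf("  %s=%2d", stage_name[s], pipe[s]);
            printf("\n");
        }
        return 0;
    }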

I should make a drawing here, but that’s too much work for me. Since RISC instructions generally have two input registers and one output register, picture ID ending with flip-flops holding the two input values. ID is the stage where the register file is read to get data ready. Then in EX, an ADD instruction (for example) will take the two input values, add them together, and flop the result at the end of EX. It will also feed that result back to ID’s flip-flop inputs (the bypass path), in case the very next instruction wants to use the result value.
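
In code form, the bypass is just a mux in front of EX’s inputs: prefer the result flopped at the end of the previous EX if it wrote the register we want, otherwise take the register file’s value. A minimal sketch, with all names invented:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NREGS 32
    static uint32_t regfile[NREGS];

    /* Result flopped at the end of EX last cycle, fed back to ID. */
    struct ex_result {
        bool     valid;  /* did EX produce a register result last cycle? */
        int      rd;     /* destination register of that result */
        uint32_t value;
    };

    /* ID operand selection: use the bypassed EX result if the previous
       instruction wrote the register we want; otherwise read the regfile. */
    static uint32_t read_operand(int rs, struct ex_result fwd) {
        if (fwd.valid && fwd.rd == rs)
            return fwd.value;          /* bypass path: skip the regfile */
        return regfile[rs];            /* normal register file read */
    }

    int main(void) {
        regfile[1] = 10; regfile[2] = 20;

        /* Cycle N, EX: ADD r3, r1, r2 -> flopped, not yet in the regfile */
        struct ex_result fwd = { true, 3, regfile[1] + regfile[2] };

        /* Cycle N+1, ID of "ADD r4, r3, r1": r3 comes from the bypass */
        uint32_t a = read_operand(3, fwd);
        uint32_t b = read_operand(1, fwd);
        printf("operands: %u %u\n", a, b);   /* prints: operands: 30 10 */
        return 0;
    }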

The thing to note is that EX is where the instructions are effectively ordered. Instructions in IF or ID haven’t occurred yet, and if the instruction in EX needs to take a trap or exception, those instructions currently in IF or ID will be tossed. And the instruction in WB is committed: nothing the instruction in EX can do will make that instruction disappear. It’s already effectively occurred (even though it’s still technically in the pipeline). (Yes, I realize a more complex pipeline could have many more stages, could chase down and kill instructions very late in the pipeline for various reasons, and could execute instructions out of order... but that’s just complexity. All pipelines eventually have a commit point; let’s just call it EX.)
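
As a sketch of that commit-point rule (again, toy code, not a real design): a trap in EX tosses everything younger, and leaves MEM and WB alone.

    #include <stdio.h>

    enum stage { IF, ID, EX, MEM, WB, NSTAGES };

    int main(void) {
        int pipe[NSTAGES] = { 5, 4, 3, 2, 1 };  /* instruction ids, oldest in WB */
        int ex_traps = 1;                       /* instruction 3 takes a trap */

        if (ex_traps) {
            pipe[IF] = pipe[ID] = -1;  /* younger instructions: tossed */
            pipe[EX] = -1;             /* the trapping instruction is squashed */
            /* pipe[MEM] and pipe[WB] are untouched: already committed */
        }
        for (int s = 0; s < NSTAGES; s++)
            printf("%d ", pipe[s]);
        printf("\n");                  /* prints: -1 -1 -1 2 1 */
        return 0;
    }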

But ADD instructions are not that interesting: loads and stores are interesting. Let’s assume loads and stores are done in two parts, split across the EX and MEM stages. Following the rule that before an instruction can get to the WB stage it has to be committed, we’ll force the rule that any possible exception must be handled in EX. So TLB misses and TLB protection faults must be resolved in the EX stage.
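
Here’s a rough sketch of what “resolve it in EX” means for the TLB, with invented names and a made-up 4KB page size: the translation either succeeds, or the instruction traps before it can commit.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct tlb_entry { uint32_t vpn, pfn; bool valid, writable; };

    enum ex_outcome { EX_OK, EX_TLB_MISS, EX_PROT_FAULT };

    static enum ex_outcome translate(struct tlb_entry *tlb, int n,
                                     uint32_t vaddr, bool is_store,
                                     uint32_t *paddr) {
        uint32_t vpn = vaddr >> 12;                 /* 4KB pages */
        for (int i = 0; i < n; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                if (is_store && !tlb[i].writable)
                    return EX_PROT_FAULT;           /* trap, still in EX */
                *paddr = (tlb[i].pfn << 12) | (vaddr & 0xfff);
                return EX_OK;                       /* safe to commit */
            }
        }
        return EX_TLB_MISS;                         /* trap, still in EX */
    }

    int main(void) {
        struct tlb_entry tlb[] = { { 0x1000, 0x8000, true, false } };
        uint32_t paddr;
        enum ex_outcome r = translate(tlb, 1, 0x1000123, true, &paddr);
        printf("store outcome: %d\n", r);  /* EX_PROT_FAULT: read-only page */
        return 0;
    }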

But how are cache misses handled?

Let’s look at how cache accesses work. A load or store has two separate yet closely related lookups to do: one is to access the tag array to see if the data is valid in the data cache; the second is to actually access that data. At the level-1 data cache, generally the load data access can begin without needing the tag lookup to complete (basically, address bits not in the tag are used to index into the data array). If the cache is associative, the tag results, which arrive around the same time as the data results, then select which way’s data entry to use. So: do the TLB lookup in EX, start the tag read and the data array read in EX, but let them complete in MEM. For stores, let the data write occur in MEM, after the tag read results are known. So exceptions are handled in EX, but actually handling the returned data spans EX and MEM. And this is why the original RISC architectures had a load-to-use delay of one clock cycle: a LOAD followed immediately by an ADD using the loaded data would have a one-cycle stall.
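
A sketch of that parallel lookup, for a hypothetical 2-way set-associative cache with 32-byte lines and 64 sets (all sizes and names invented): the data array is read using only the index bits, and the tag compare, finishing around the same time, picks the way.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NSETS 64          /* 6 index bits */
    #define NWAYS 2
    #define LINE_SHIFT 5      /* 32-byte lines */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NSETS][NWAYS];

    static bool load(uint32_t paddr, uint32_t *out) {
        uint32_t index = (paddr >> LINE_SHIFT) % NSETS;   /* bits below the tag */
        uint32_t tag   = paddr >> (LINE_SHIFT + 6);       /* 6 = log2(NSETS) */

        /* "EX": both ways' data are read speculatively, index bits only. */
        uint32_t way_data[NWAYS];
        for (int w = 0; w < NWAYS; w++)
            way_data[w] = cache[index][w].data;

        /* "MEM": tag results arrive and select (or reject) a way. */
        for (int w = 0; w < NWAYS; w++) {
            if (cache[index][w].valid && cache[index][w].tag == tag) {
                *out = way_data[w];
                return true;       /* hit */
            }
        }
        return false;              /* miss */
    }

    int main(void) {
        uint32_t paddr = 0x12345680;
        uint32_t index = (paddr >> LINE_SHIFT) % NSETS;
        cache[index][1] = (struct line){ true, paddr >> (LINE_SHIFT + 6), 42 };

        uint32_t v;
        if (load(paddr, &v)) printf("hit: %u\n", v);   /* hit: 42 */
        else                 printf("miss\n");
        return 0;
    }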

For a single-CPU simple design, cache misses would probably stall in MEM for loads and stores. If a load or store missed the cache (let’s assume write-back, write-allocate caching only), the CPU would fetch the cacheline, then do the access again. Note that the stall is past the commit point: the instruction will be done, it just has to wait on memory. This keeps the design simple, and achieves another important effect: the CPU appears to be strongly ordered.
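
One last sketch of that stall-and-retry loop in MEM (toy code: a one-line “cache” and an instant “memory”): the miss doesn’t cancel anything, it just spins until the refill lands and then redoes the access.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Toy single-line "cache" and "memory", just enough to show the retry. */
    static uint32_t memory[16] = { [3] = 99 };
    static struct { bool valid; uint32_t addr, data; } line;

    static bool cache_access(uint32_t addr, uint32_t *out) {
        if (line.valid && line.addr == addr) { *out = line.data; return true; }
        return false;
    }

    static void fetch_line(uint32_t addr) {        /* write-allocate refill */
        line.valid = true;
        line.addr  = addr;
        line.data  = memory[addr % 16];
    }

    /* MEM stage: the load is already committed; a miss stalls and retries. */
    static uint32_t mem_stage_load(uint32_t addr) {
        uint32_t v;
        while (!cache_access(addr, &v))
            fetch_line(addr);                      /* stall until the refill */
        return v;
    }

    int main(void) {
        printf("%u\n", mem_stage_load(3));  /* miss, refill, then hit: 99 */
        return 0;
    }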