FACTOID # 152: Of the eight countries which include the word "democratic" in their conventional long form name, three are dictatorships: North Korea (Democratic People's Republic of Korea), Laos (Lao People's Democratic Republic) and the Democratic republic of the Congo.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

SEARCH ALL

FACTS & STATISTICS    Advanced view

Search encyclopedia, statistics and forums:

 

 

(* = Graphable)

 

 


Encyclopedia > Classic RISC pipeline

In history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later DLX. Computing hardware has been an important component of the process of calculation and data storage since it became useful for numerical values to be processed and shared. ... This article does not cite any references or sources. ... CPU redirects here. ... A MIPS R4400 microprocessor made by Toshiba. ... Sun UltraSPARC II Microprocessor Sun UltraSPARC T1 (Niagara 8 Core) SPARC (Scalable Processor Architecture) is a RISC microprocessor instruction set architecture originally designed in 1985 by Sun Microsystems. ... The 88000 (m88k for short) is a microprocessor design produced by Motorola. ... The DLX is a RISC processor architecture by John L. Hennessy and David A. Patterson, the principal designers of the MIPS and the Berkeley RISC designs (respectively), the two benchmark examples of RISC design. ...


Each of these classic scalar RISC designs, according to the basic SISD concept, fetched and attempted to execute one instruction per cycle. The main common concept of each design was a five stage execution instruction pipeline. During operation, each pipeline stage would work on one instruction at a time. SISD is an acronym for Single Instruction stream over a Single Data stream. ... Cycles per instruction, also known as clock cycles per instruction, or clocks per instruction (CPI) is the number of clock cycles that happen when a instruction is being executed by a computer with a given clock frequency. ... Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back) An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their performance. ...


Each of these stages consisted of an initial set of flip-flops, and combinatorial logic which operated on the outputs of those flops. In digital circuits, the flip-flop, latch, or bistable multivibrator is an electronic circuit which has two stable states and thereby is capable of serving as one bit of memory. ...

Contents

The classic five stage RISC pipeline

Instruction fetch

The Instruction Cache on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache.


The instruction fetcher sends the Program Counter (PC) to the memory to read the current instruction and at the same time the instruction was fetched, these machines predicted the address of the next instruction by incrementing the PC with 4 bytes. This prediction was always wrong in the case of a taken branch, jump, or exception, but the penalty for being wrong was small and simply incrementing takes very little circuitry. Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address. The program counter (also called the instruction pointer in some computers) is a register in a computer processor which indicates where the computer is in its instruction sequence. ... In computer architecture, a branch predictor is the part of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not. ... In computer architecture, a branch target predictor is the part of a processor that predicts the target of a conditional branch or unconditional jump instruction before that instruction has been fetched from the instruction cache. ...


Decode

The instruction decoder of a processor is a combinatorial circuit sometimes in the form of a read-only memory. Its purpose is to translate an instruction code into the address in the micro memory where the micro code for the instruction starts.


All MIPS and SPARC instructions have at most two register inputs. During the decode stage, these two register names are identified within the instruction, and the two registers named are read from the register file. In the MIPS design, the register file had 32 entries. A register file is an array of processor registers in a central processing unit (CPU). ...


Decoding is done in parallel with the reading of the register file. Since the registers are at a fixed location, this technique is called fixed-field decoding.


At the same time the register file was read, instruction issue logic in this stage determined if the pipeline was ready to execute the instruction in this stage. If not, the issue logic would cause both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the stages would prevent their initial flops from accepting new bits.


If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is also determined in parallel, and if the branch is taken or if the instruction is a jump, the target address is the next cycle's instruction fetch pointer.


Execute

Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

  • Register-Register Operation (Single cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.
  • Memory Reference(Two cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
  • Multi Cycle Instructions (Many cycle latency). Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instruction wrote their results to a separate set of registers.

A typical schematic symbol for an ALU: A & B are operands; R is the output; F is the input from the Control Unit; D is an output status In computing, an arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. ... This article or section is in need of attention from an expert on the subject. ...

Access

During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single and two cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file can be used, and it is always available. If the instruction is a memory load, the data is read from the location computed at the previous stage. If it is a store, the memory writes the data from the second register using the effective address.


There have been a wide variety of data cache organizations. By far the simplest is direct mapped and virtually tagged, which is what will be assumed for the following explanation.


The cache has two SRAMS (Static RAM), one storing data, the other storing tags. During a load, the SRAMs are read in parallel during the access stage. The single tag read is compared with the virtual address specified by the load. Diagram of a CPU memory cache A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. ... Static Random Access Memory (SRAM) is a type of semiconductor memory. ...


If the two are equal, then the datum we are looking for is resident in the cache, and has just been read from the data SRAM. The load is a hit, and the load instruction can complete the writeback stage normally.


If the tag and virtual address are not equal, then the line is not in the cache. The CPU pipeline must suspend operation (described below under Cache Miss Handling) while a state machine reads the required data from memory into the cache, and optionally writes any dirty data evicted from the cache back into memory.


During a store, the tag SRAM is read to determine if the store is a hit or miss, and if a miss, the same state machine brings the datum into the cache. The store data cannot be written to the cache during the access stage because the processor does not yet know if the correct line is resident. Instead, the store data is written to the cache during the next store instruction. In the interim, the store data is held in a Store Data Queue, which, in a classic RISC pipeline is just a single 32 bit register. For reference, the virtual address written is held in the Store Address Queue, again just a single 32 bit register. On more complicated pipelines, the queues actually have multiple hardware registers and variable length.


There is one more complication. A load immediately after a store could reference the same memory address, in which case the data must come from the Store Data Queue rather than from the cache's data SRAM. So, during a load, the load address is also checked against the Store Address Queue. If there is a match, the data from the Store Data Queue is forwarded to the writeback stage, rather than the data from the data SRAM. This operation does not change the flow of the load operation through the pipeline.


Writeback

During this stage, both single cycle and two cycle instructions write their results into the register file.


Pipeline Registers

The Pipeline stages are independent from each other so they should be decoupled to prevent any interference between the stages. This is done by separating the pipeline stages from each other using the pipeline registers. At the end of each stage the results are stored back into this register and then the following stage works on what is inside this register. This decoupling allows for passing values between stages as there is no hard wire with the stage itself but rather with its result. For instance the store instruction should store a certain register value, this value is read during the Decode Stage & Register Fetch (RF), and it remains during the execution unchanged and passed to the MEM stage and stored in the memory as explained above.


Hazards

Structural Hazards

Structural hazards result from conflicts in the resources due to the pipelining of the instructions. In other words, if two instructions require the same register at two different stages in the pipeline, this would be considered a structural hazard. For example, if the processor were using the same memory for both the data and the instruction (such that the LOAD instructions would affect another instruction being fetched from the memory) and if this LOAD instruction now is in the MEM stage and another instruction at the same cycle is in the Fetch stage, which one will read from the memory? There are two solutions for structural hazards: (1) - split the cache into instruction and data parts, (2) stall the IF stage until the MEM stage finishes (this would then increase the instruction time by one cycle). This stall is termed a "pipeline bubble" since it floats in the pipeline, like an air bubble, occupying time uselessly. The direct solution for structural hazards is the duplication of all the resources in order to eliminate resource conflicts. However, this solution is much more expensive, a drawback not encountered with the stalling solution ((2) above).


Data Hazards

The data hazards result from a conflict over the sharing of data. Although the instructions are written with sequential execution in mind, a previous instruction's output may be the current instruction's input, such that this data dependency creates a problem while pipelining the instructions.


There are two types of hazards according to their solution:

  • By Passing

Suppose the CPU is executing the following piece of code:

 SUB r3,r4 -> r10 AND r10,3 -> r11 

The instruction fetch and decode stages will send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

cycle 0 cycle 1 cycle 2 cycle 3 cycle 4 cycle 5
Fetch SUB AND
Decode SUB AND
Execute SUB AND
Access SUB AND
Writeback SUB AND

In cycle 2, the Decode stage fetches r10 from the register file. Because the SUB instruction writing to r10 is simultaneously in the execute stage, the value read from the register file is wrong.


The solution to this problem is a pair of bypass multiplexors. These multiplexors sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexor selects between a register file read port, the current register pipeline of the ALU, and the current register pipeline of the access stage (which is either a loaded value or a forwarded ALU result).


Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and cause the multiplexors to select the most recent data. These bypass multiplexors make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexor, and a flip-flop. Without the multiplexors, the latency of reading and writing the register file would have to be included in the latency of these instructions.

  • Pipeline Interlock

Suppose we have the following instruction set

 LD R1,0(R2) DSUB R4,R1,R5 

According to the order above the value in R1 won't be available until the MEM stage of the pipeline, which means that the ALU stage of the next instruction would be executing with the value of R1 in place which is not case since it will be there after this cycle. The solution for this is introducing a bubble which is done by a specific hardware that predicts such hazard and introduce a new bubble to stall the pipeline until the data is available. This would again increase the instruction time.


Control Hazards

The classic RISC pipeline resolves branches in the Decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:


The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32 bit compare on the MIPS CPUs), and the next instruction address multiplexer.


Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions which branch to a register+offset address. Jump to register is supported.


On any branch taken, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty, which is quite large.


There are four schemes to solve this branch problem

  • Pipeline Freeze: Just freeze the pipeline until you know the result of the branch
  • Predicted Not Taken: Take the next instruction as if the branch will not be taken. If the branch is taken, flush all instructions and take the branch.
  • Predicted Taken: Compute the target and branch to it. If the branch is not taken, flush all instructions and don't take the branch.
  • Branch Delay: Execute the Branch Instruction, but do not do any action, just take the next instructions after the branch until the branch is known. This can be done on hardware by Branch delay slot that introduces a Branch Delay to take the next transaction after the branch. It can also be done by software where it reorders the instructions. The SPARC, MIPS, and MC88K designers decided to avoid this penalty by designing a Branch delay slot into their ISA.

A delayed branch specifies that the jump to a new location happens after the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch. Delayed branches have been criticized as a poor short-term choice in ISA design. Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot). Superscalar processors, which fetch multiple instructions per cycle and have superior branch prediction, do not benefit from delayed branches. The Alpha ISA left out delayed branches, as it was intended for superscalar processors. The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than that next instruction. Exceptions now have essentially two addresses, the exception address and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs. In computer architecture, a branch delay instruction is an instruction immediately following a conditional branch instruction which is executed whether or not the branch is taken. ... Simple superscalar pipeline. ... DEC Alpha AXP 21064 Microprocessor die photo Package for DEC Alpha AXP 21064 Microprocessor Alpha AXP 21064 bare die mounted on a business card with some statistics The DEC Alpha, also known as the Alpha AXP, is a 64-bit RISC microprocessor originally developed and fabricated by Digital Equipment Corp...


Exceptions

Suppose a 32-bit RISC is processing an ADD instruction which adds two large numbers together. Suppose the resulting number does not fit in 32 bits. What happens?


The simplest solution, provided by every architecture, is wrapping arithmetic. This is the standard arithmetic provided by the language C. Numbers greater than the maximum possible encoded value have their most significant bits chopped off until they fit. In the usual integer number system, 3000000000+3000000000=6000000000. With unsigned 32 bit wrapping arithmetic, 3000000000+3000000000=1705032704. This may not seem terribly useful. The largest benefit of wrapping arithmetic is that every operation has a well defined result. C is a general-purpose, block structured, procedural, imperative computer programming language developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system. ...


But the programmer, especially if programming in a language supporting Bignums (e.g. lisp or scheme), may not want wrapping arithmetic. Some architectures (e.g. MIPS), define special addition operations that branch to special locations on overflow, rather than wrapping the result. Software at the target location is responsible for fixing the problem. This special branch is called an exception. Exceptions differ from regular branches in that the target address is not specified by the instruction itself, and the branch decision is dependent on the outcome of the instruction.


The most common kind of software-visible exception on one of the classic RISC machines is a TLB miss (see virtual memory). The program thinks it has a large range of contiguous addresses; but in reality the parts it is currently using are scattered around RAM, and the inactive parts are saved in a disk file. ...


Exceptions are different from branches and jumps, because those other control flow changes are resolved in the decode stage. Exceptions are resolved in the writeback stage. When an exception is detected, the following instructions (earlier in the pipeline) are marked as invalid, and as they flow to the end of the pipe their results are discarded. The program counter is set to the address of a special exception handler, and special registers are written with the exception location and cause.


To make it easy (and fast) for the software to fix the problem and restart the program, the CPU must take a precise exception. A precise exception means that all instructions up to the excepting instruction have been executed, and the excepting instruction and everything afterwards have not been executed.


In order to take precise exceptions, the CPU must commit changes to the software visible state in the program order. This in-order commit happens very naturally in the classic RISC pipeline. Most instructions write their results to the register file in the writeback stage, and so those writes automatically happen in program order. Store instructions, however, write their results to the Store Data Queue in the access stage. If the store instruction takes an exception, the Store Data Queue entry is invalidated so that it is not written to the cache data SRAM later.


Cache miss handling

Occasionally, either the data or instruction cache will not have a datum or instruction required. In these cases, the CPU must suspend operation until the cache can be filled with the necessary data, and then must resume execution. The problem of filling the cache with the required data (and potentially writing back to memory the evicted cache line) is not specific to the pipeline organization, and is not discussed here.


There are two strategies to handle the suspend/resume problem. The first is a global stall signal. This signal, when activated, prevents instructions from advancing down the pipeline, generally by gating off the clock to the flip-flops at the start of each stage. The disadvantage of this strategy is that there are a large number of flip flops, so the global stall signal takes a long time to propagate. Since the machine generally has to stall in the same cycle that it identifies the condition requiring the stall, the stall signal becomes a speed-limiting critical path.


Another strategy to handle suspend/resume is to reuse the exception logic. The machine takes an exception on the offending instruction, and all further instructions are invalidated. When the cache has been filled with the necessary data, the instruction which caused the cache miss is restarted. To expedite data cache miss handling, the instruction can be restarted so that its access cycle happens one cycle after the data cache has been filled.


  Results from FactBites:
 
php-deluxe.net - description RISC (3939 words)
Reduced Instruction Set Computer (RISC), is a microprocessor CPU design philosophy that favors a smaller and simpler set of Instructions that all take about the same amount of time to execute.
The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the last, meaning that there are now two instructions being worked on (one is being read, the next is being decoded), and after another cycle there will be three.
However, despite many successes, RISC has made few inroads into the desktop PC and commodity server markets, where Intel s x86 platform remains the dominant processor architecture (Intel is facing increased competition from AMD, but even AMD s processors implement the x86 platform, or a 64-bit superset known as x86-64).
ooBdoo (7033 words)
This section describes what is generally referred to as the "Classic RISC pipeline," which in fact is quite common among the simple CPUs used in many electronic devices (often called microcontrollers).
Pipelining does, however, introduce the possibility for a situation where the result of the previous operation is needed to complete the next operation; a condition often termed data dependency conflict.
This requires that the instruction pipeline is filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache.
  More results at FactBites »


 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments
Please enter the 5-letter protection code

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.