Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back)

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). Pipelining assumes that, under the single instruction stream, single data stream (SISD) concept, successive instructions in a program sequence will overlap in execution, as suggested in the diagram (instructions 'i' on the vertical axis, time 't' on the horizontal axis).

Most modern CPUs are driven by a clock. The CPU consists internally of logic and flip flops. When the clock signal arrives, the flip flops take their new values and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage:
- Instruction fetch
- Instruction decode and register fetch
- Execute
- Memory access
- Register write back
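The overlap between successive instructions in the five stages listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the pipeline is ideal (no stalls or hazards), one instruction enters per cycle, and the function names `pipeline_diagram` and `format_diagram` are hypothetical, chosen for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; column c holds the stage occupied in cycle c + 1.

    Assumes an ideal pipeline: instruction i enters the first stage in cycle i + 1
    and advances one stage per cycle without ever stalling.
    """
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        cells = []
        for cycle in range(total_cycles):
            stage_index = cycle - i  # how far instruction i has advanced
            if 0 <= stage_index < len(stages):
                cells.append(stages[stage_index])
            else:
                cells.append("")  # not yet fetched, or already completed
        rows.append(cells)
    return rows


def format_diagram(rows):
    """Render the rows as fixed-width text, one instruction per line."""
    return "\n".join(
        f"i{n}: " + " ".join(f"{cell:<3}" for cell in cells).rstrip()
        for n, cells in enumerate(rows)
    )


if __name__ == "__main__":
    print(format_diagram(pipeline_diagram(4)))
```

With four instructions the diagram spans 4 + 5 - 1 = 8 cycles, and from the fifth cycle onward one instruction completes in every cycle.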
Hazards: When a programmer (or compiler) writes assembly code, they assume that each instruction finishes executing before the next one begins. Pipelining invalidates this assumption. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards exist, such as forwarding and stalling.

A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those modules work in parallel improves program execution significantly.

Processors with pipelining are organised internally into stages which can semi-independently work on separate jobs. The stages are linked into a 'chain' so that each stage's output is fed to the next stage until the job is done. This organisation of the processor allows overall processing time to be significantly reduced.

Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline needs to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall, or wasted clock cycle, into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.

Advantages and Disadvantages
Pipelining does not help in all cases, and it has several disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining: - The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.
Disadvantages of Pipelining: - A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
- The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.
- A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
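The trade-off in the lists above (a shorter cycle time, but slightly higher instruction latency because of the extra flip flops) can be made concrete with some arithmetic. The sketch below assumes, purely for illustration, that the logic splits evenly across the stages and that each stage adds a fixed flip-flop overhead; the figures of 10 ns of logic and 0.5 ns of overhead are made-up numbers, and `timing` is a hypothetical helper.

```python
def timing(total_logic_ns, register_overhead_ns, stages):
    """Return (clock_period_ns, instruction_latency_ns).

    Assumptions (illustrative only): the combinational logic divides evenly
    across the stages, and every stage ends in one set of flip flops that
    costs register_overhead_ns.
    """
    clock_period = total_logic_ns / stages + register_overhead_ns
    latency = clock_period * stages  # one instruction passes through every stage
    return clock_period, latency


if __name__ == "__main__":
    for stages in (1, 5):
        period, latency = timing(10.0, 0.5, stages)
        print(
            f"{stages} stage(s): clock period {period:.2f} ns, "
            f"latency {latency:.2f} ns, "
            f"peak rate {1.0 / period:.3f} instructions/ns"
        )
```

Under these assumed numbers the five-stage design has a clock period of 2.5 ns instead of 10.5 ns, while a single instruction takes 12.5 ns instead of 10.5 ns to pass through.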
Examples
Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other

To the right is a generic pipeline with four stages:
- Fetch
- Decode
- Execute
- Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline. Execution is as follows:

| Time | Execution |
| 0 | Four instructions are waiting to be executed |
| 1 | the green instruction is fetched from memory |
| 2 | the green instruction is decoded; the purple instruction is fetched from memory |
| 3 | the green instruction is executed (actual operation is performed); the purple instruction is decoded; the blue instruction is fetched |
| 4 | the green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched |
| 5 | the green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded |
| 6 | the purple instruction is completed; the blue instruction is written back; the red instruction is executed |
| 7 | the blue instruction is completed; the red instruction is written back |
| 8 | the red instruction is completed |
| 9 | All instructions are executed |

Bubble
A bubble in cycle 3 delays execution

Main article: Bubble (computing)

When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well, but everything "ahead" of the purple instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7. Bubbles are like stalls, in which nothing useful happens in the fetch, decode, execute and write-back stages. A bubble can be filled with a NOP (no-operation) code.
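The 7-versus-8 tick comparison can be reproduced with a small simulation of the generic four-stage pipeline. This is a rough sketch, assuming an in-order pipeline in which a delayed fetch also delays every later instruction by the same amount; `simulate`, `last_busy_cycle` and the `fetch_delays` parameter are hypothetical names introduced for this example.

```python
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]


def simulate(instructions, fetch_delays=None):
    """Return {name: [cycle spent in each stage]} for an in-order pipeline.

    fetch_delays maps an instruction name to the number of extra cycles its
    fetch is held back (an assumed, simplified cause of a bubble).
    """
    fetch_delays = fetch_delays or {}
    schedule = {}
    next_fetch = 1  # the first instruction is fetched in cycle 1
    for name in instructions:
        fetch_cycle = next_fetch + fetch_delays.get(name, 0)
        # Once fetched, an instruction advances one stage per cycle.
        schedule[name] = [fetch_cycle + s for s in range(len(STAGES))]
        next_fetch = fetch_cycle + 1  # later instructions queue up behind it
    return schedule


def last_busy_cycle(schedule):
    """Cycle in which the final instruction is written back."""
    return max(cycles[-1] for cycles in schedule.values())


if __name__ == "__main__":
    colors = ["green", "purple", "blue", "red"]
    ideal = simulate(colors)
    bubbled = simulate(colors, fetch_delays={"purple": 1})
    print("ideal:", last_busy_cycle(ideal), "cycles")
    print("with bubble:", last_busy_cycle(bubbled), "cycles")
    decode_cycles = [cycles[1] for cycles in bubbled.values()]
    print("decode stage empty in cycle 3:", 3 not in decode_cycles)
```

Delaying the purple fetch by one cycle leaves the decode stage empty in cycle 3 and pushes the last write-back from cycle 7 to cycle 8, matching the description above.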
Example 1
A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:

    LOAD A, R1
    LOAD B, R2
    ADD R1, R2, R3
    STORE R3, C
    LOAD next instruction

The locations 'R1', 'R2' and 'R3' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into R1 and R2, then added, and the result (held in R3) is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long: load, execute, and store. Each of these steps is called a pipeline stage. On a non-pipelined processor, only one stage can be working at a time, so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the store stage, a second instruction will be at the execute stage and a third instruction will be at the load stage. Pipelining doesn't reduce the time it takes to complete an instruction; rather, it increases the number of instructions that can be processed at once and reduces the delay between completed instructions, which raises 'throughput'. The more pipeline stages a processor has, the more instructions it can be working on at once and the shorter the delay between completed instructions. Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel AVR and the PIC microcontroller each have a 2-stage pipeline.) Intel Pentium 4 processors have 20-stage pipelines.
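The point that pipelining raises throughput without shortening any single instruction can be checked by counting cycles. The sketch below assumes that every stage takes one cycle, that the non-pipelined processor runs each instruction through all stages before starting the next, and that the pipeline never stalls; the function names are hypothetical.

```python
def cycles_unpipelined(num_instructions, stages):
    """Each instruction occupies the whole processor for `stages` cycles."""
    return num_instructions * stages


def cycles_pipelined(num_instructions, stages):
    """The first instruction needs `stages` cycles; each later one finishes a cycle after."""
    if num_instructions == 0:
        return 0
    return stages + num_instructions - 1


def speedup(num_instructions, stages):
    """Ratio of unpipelined to pipelined cycle counts (ideal, stall-free case)."""
    return cycles_unpipelined(num_instructions, stages) / cycles_pipelined(
        num_instructions, stages
    )


if __name__ == "__main__":
    for n in (1, 4, 1000):
        print(
            f"{n} instruction(s), 3 stages: "
            f"{cycles_unpipelined(n, 3)} vs {cycles_pipelined(n, 3)} cycles, "
            f"speedup {speedup(n, 3):.3f}"
        )
```

For a single instruction both counts are equal (3 cycles), so latency is unchanged; as the instruction count grows, the ratio approaches the number of stages.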
Example 2
To better visualize the concept, we can look at a theoretical 3-stage pipeline:

| Stage | Description |
| Load | Read instruction from memory |
| Execute | Execute instruction |
| Store | Store result in memory and/or registers |

and a pseudo-code assembly listing to be executed:
    LOAD #40, A      ; load 40 in A
    MOVE A, B        ; copy A in B
    ADD #20, B       ; add 20 to B
    STORE B, 0x300   ; store B into memory cell 0x300

This is how it would be executed:

Clock 1
| Load | Execute | Store |
| LOAD | | |
The LOAD instruction is fetched from memory.

Clock 2
| Load | Execute | Store |
| MOVE | LOAD | |
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3
| Load | Execute | Store |
| ADD | MOVE | LOAD |
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the LOAD instruction to finish.

Clock 4
| Load | Execute | Store |
| STORE | ADD | MOVE |
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.

And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.
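The dependency between LOAD and MOVE is an instance of a read-after-write relationship, and such pairs can be found mechanically. The following sketch is not how any real pipeline controller works; it simply encodes the four-instruction listing by hand (the `reads` and `writes` sets are this example's own interpretation of each instruction) and reports every instruction that reads a location written by the instruction immediately before it. `Instr` and `raw_hazards` are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    reads: frozenset   # locations used as inputs
    writes: frozenset  # locations used as outputs


# Hand-written encoding of the listing above (an assumption of this sketch).
PROGRAM = [
    Instr("LOAD #40, A", frozenset(), frozenset({"A"})),
    Instr("MOVE A, B", frozenset({"A"}), frozenset({"B"})),
    Instr("ADD #20, B", frozenset({"B"}), frozenset({"B"})),
    Instr("STORE B, 0x300", frozenset({"B"}), frozenset({"0x300"})),
]


def raw_hazards(program, window=1):
    """List (producer, consumer, location) read-after-write pairs.

    Only producers at most `window` instructions before the consumer are
    considered, since older results are assumed to be stored already.
    """
    found = []
    for i, consumer in enumerate(program):
        for producer in program[max(0, i - window):i]:
            for location in sorted(producer.writes & consumer.reads):
                found.append((producer.text, consumer.text, location))
    return found


if __name__ == "__main__":
    for producer, consumer, location in raw_hazards(PROGRAM):
        print(f"{consumer!r} needs {location} from {producer!r}")
```

In this listing every instruction depends on its predecessor, so each adjacent pair is reported; a pipeline would have to stall or forward the value in each case.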
Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (as in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long [1].

The downside of a long pipeline is that when a program branches, the entire pipeline must be flushed, a problem that branch prediction helps to alleviate. Branch prediction itself can end up exacerbating the problem if branches are predicted poorly. In certain applications, such as supercomputing, programs are specially written to branch rarely, and so very long pipelines are ideal to speed up the computations, because a long pipeline allows a short clock period while code that rarely branches keeps the clocks per instruction (CPI) low. If branching happens constantly, re-ordering branches so that the instructions more likely to be needed are placed into the pipeline can significantly reduce the speed losses associated with flushing mispredicted branches. Programs such as gcov can be used to examine how often particular branches are actually executed, using a technique known as coverage analysis; however, such analysis is often a last resort for optimization.
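The cost of flushing on branches can be estimated with a simple first-order model. The sketch below assumes that each mispredicted branch wastes a fixed number of cycles and, as a deliberate simplification, takes that penalty to equal the pipeline depth; the branch fraction and misprediction rate are made-up illustrative values, and `effective_cpi` is a hypothetical helper.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average clocks per instruction including branch flush penalties.

    First-order model: every mispredicted branch adds flush_penalty_cycles
    of wasted work on top of the base CPI.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% of instructions are branches, 10% of them mispredicted.
    for depth in (5, 20, 31):
        cpi = effective_cpi(1.0, 0.20, 0.10, flush_penalty_cycles=depth)
        print(f"{depth}-stage pipeline: effective CPI about {cpi:.2f}")
```

Under these assumptions the penalty term grows in proportion to the pipeline depth, which is why code that rarely branches, or branches predictably, suits long pipelines best.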
The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction from, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even be slightly worse if all but one of the pipeline stages are idle and a small overhead is present between stages.

Because of the instruction pipeline, code that the processor loads will not execute immediately. As a result, updates to code very near the current location of execution may not take effect, because the old instructions have already been loaded into the prefetch input queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.
See also
- Wait state
- Classic RISC pipeline
- Parallel computing
External links - ArsTechnica article on pipelining