Chap. 6 Enhancing Performance with Pipelining 臺大電機系吳安宇教授 V1. 2007/04/20 臺大電機吳安宇教授 - 計算機結構 1
Outline 6.1 An Overview of Pipelining 6.2 A Pipelined Datapath 6.3 Pipelined Control 6.4 Data Hazards and Forwarding 6.5 Data Hazards and Stalls 6.6 Branch Hazards 6.8 Exceptions (not covered) 6.9 Superscalar and dynamic pipelining (not covered) 臺大電機吳安宇教授 - 計算機結構 2
Pipelining is Natural! Ann, Brian, Cathy, and Don each have dirty clothes to be washed, dried, folded, and put away The washer, dryer, folder, storer each take 30 minutes for their task 臺大電機吳安宇教授 - 計算機結構 3
Sequential laundry If they learned pipelining, how long would it take? Sequential laundry takes 8 hours for 4 loads 臺大電機吳安宇教授 - 計算機結構 4
Pipelined laundry Pipelined laundry takes 3.5 hours for 4 loads 臺大電機吳安宇教授 - 計算機結構 5
Single-, Multi-cycle, vs. Pipeline 臺大電機吳安宇教授 - 計算機結構 6
Why Pipeline? Because the Resources Are There! 臺大電機吳安宇教授 - 計算機結構 7
Pipelining MIPS Execution 臺大電機吳安宇教授 - 計算機結構 8
Pipeline Hazards Structural hazard An occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle. Data hazard Also called pipeline data hazard. An occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available. Control hazard Also called branch hazard. An occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is NOT the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected. 臺大電機吳安宇教授 - 計算機結構 9
Data hazard Load-use data hazard A specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested. Pipeline stall Also called bubble. A stall initiated in order to resolve a hazard Solution: Forwarding Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible register or memory. 臺大電機吳安宇教授 - 計算機結構 10
Control hazard Untaken branch One that falls through to the successive instruction. A taken branch is one that causes transfer to the branch target Solutions: Branch prediction A method of resolving a branch hazard that assumes a given outcome for the branch, and proceeds from that assumption rather than waiting to ascertain the actual outcome 臺大電機吳安宇教授 - 計算機結構 11
Performance Index of Pipelining Latency (pipeline) The number of stages in a pipeline or the number of stages between two instructions during execution. Throughput (pipeline) The number of instructions executed per unit time. 臺大電機吳安宇教授 - 計算機結構 12
Outline 6.1 An Overview of Pipelining 6.2 A Pipelined Datapath 6.3 Pipelined Control 6.4 Data Hazards and Forwarding 6.5 Data Hazards and Stalls 6.6 Branch Hazards 6.8 Exceptions 6.9 Superscalar and dynamic pipelining 臺大電機吳安宇教授 - 計算機結構 13
Designing a Pipelined Processor Examine the datapath and control diagram Starting with single-or multi-cycle datapath? Single-or multi-cycle control? Partition datapath into stages: IF (instruction fetch), ID (instruction decode and register file read), EX (execution or address calculation), MEM (data memory access), WB (write back) Associate resources with states Ensure that flows do not conflict, or figure out how to resolve Assert control in appropriate stage 臺大電機吳安宇教授 - 計算機結構 14
Use Multi-cycle Execution Steps But, use single-cycle datapath.. (separate memory, why??) 臺大電機吳安宇教授 - 計算機結構 15
Split Single-cycle Datapath What to add to split the datapath into stages 臺大電機吳安宇教授 - 計算機結構 16
Add Pipeline Registers Use registers between stages to carry data and control 臺大電機吳安宇教授 - 計算機結構 17
Consider load IF: Instruction Fetch Fetch the instruction from the Instruction Memory ID: Instruction Decode Registers fetch and instruction decode EX: Calculate the memory address MEM: Read the data from the Data Memory WB: Write the data back to the register file 臺大電機吳安宇教授 - 計算機結構 18
Pipelining load 5 functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File s Read ports (busa and busb) for the Reg/Dec stage ALU for the Exec stage Data Memory for the MEM stage Register File s Write port (busw) for the WB stage 臺大電機吳安宇教授 - 計算機結構 19
IF Stage of load word IF/ID= mem[pc] ; PC = PC + 4 臺大電機吳安宇教授 - 計算機結構 20
ID Stage of load word ID/EX(A)= Reg[IR[25-21]]; ID/EX(B)= Reg[IR[20-16]]; ID/EX = Sign-extension of ID[15:0] 臺大電機吳安宇教授 - 計算機結構 21
EX Stage of load word EX/MEM = A + sign-ext(ir[15-0]) % address computation 臺大電機吳安宇教授 - 計算機結構 22
MEM Stage of load word MEM/WB = mem[aluout] 臺大電機吳安宇教授 - 計算機結構 23
WB Stage of load Reg[IR[20-16]] = MEM/WB 臺大電機吳安宇教授 - 計算機結構 24
Pipelined Datapath 臺大電機吳安宇教授 - 計算機結構 25
The Four Stages of R-type IF: fetch the instruction from the Instruction Memory ID: registers fetch and instruction decode EX: ALU operates on the two register operands Update PC WB: write ALU output back to the register file 臺大電機吳安宇教授 - 計算機結構 26
Pipelining R-type and load We have a structural hazard: Two instructions try to write to the register file at the same time! Only one write port 臺大電機吳安宇教授 - 計算機結構 27
Important Observation Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File s write port during its 5th stage R-type uses Register File s write port during its 4th stage Several ways to solve: 1) forwarding, 2) adding pipeline bubble, 3) making instructions same length 臺大電機吳安宇教授 - 計算機結構 28
Solution 1: Insert Bubble Insert a bubble into the pipeline to prevent two writes at the same cycle The control logic can be complex Lose instruction fetch and issue opportunity No instruction is started in Cycle 6! 臺大電機吳安宇教授 - 計算機結構 29
Solution 2: Delay R-type s Write Delay R-type s register write by one cycle: R-type also use Reg File s write port at Stage 5 MEM is a NOP stage: nothing is being done. 臺大電機吳安宇教授 - 計算機結構 30
The Four Stages of store IF: fetch the instruction from the Instruction Memory ID: registers fetch and instruction decode EX: calculate the memory address MEM: write the data into the Data Memory Add an extra stage: WB: NOP 臺大電機吳安宇教授 - 計算機結構 31
The Four Stages of beq IF: fetch the instruction from the Instruction Memory ID: registers fetch and instruction decode EX: compares the two register operand select correct branch target address latch into PC Add two extra stages: MEM: NOP WB: NOP 臺大電機吳安宇教授 - 計算機結構 32
Multiple-clock-cycle pipeline diagram Can help with answering questions like: How many cycles to execute this code? What is the ALU doing during cycle 4? Help understand datapaths 臺大電機吳安宇教授 - 計算機結構 33
Multiple-clock-cycle pipeline Timing diagram 臺大電機吳安宇教授 - 計算機結構 34
Traditional multi-clock-cycle pipeline Timing diagram IF 臺大電機吳安宇教授 - 計算機結構 35
Single-clock-cycle Timing Diagram A vertical slice through a multiple-clock-cycle diagram 臺大電機吳安宇教授 - 計算機結構 36
Example 1: Cycle 1 臺大電機吳安宇教授 - 計算機結構 37
Example 1: Cycle 2 臺大電機吳安宇教授 - 計算機結構 38
Example 1: Cycle 3 臺大電機吳安宇教授 - 計算機結構 39
Example 1: Cycle 4 臺大電機吳安宇教授 - 計算機結構 40
Example 1: Cycle 5 臺大電機吳安宇教授 - 計算機結構 41
Example 1: Cycle 6 臺大電機吳安宇教授 - 計算機結構 42
Outline 6.1 An Overview of Pipelining 6.2 A Pipelined Datapath 6.3 Pipelined Control 6.4 Data Hazards and Forwarding 6.5 Data Hazards and Stalls 6.6 Branch Hazards 6.8 Exceptions 6.9 Superscalar and dynamic pipelining 臺大電機吳安宇教授 - 計算機結構 43
Pipeline Control: Control Signals 臺大電機吳安宇教授 - 計算機結構 44
Group Signals According to Stages Can use control signals of single-cycle CPU 臺大電機吳安宇教授 - 計算機結構 45
Control Signal Details 臺大電機吳安宇教授 - 計算機結構 46
Control Signal Details 臺大電機吳安宇教授 - 計算機結構 47
Data Stationary Control Pass control signals along just like the data Main control generates control signals during ID 臺大電機吳安宇教授 - 計算機結構 48
Data Stationary Control (cont.) Signals for EX (ExtOp, ALUSrc,...) are used 1 cycle later Signals for MEM (MemWr, Branch) are used 2 cycles later Signals for WB (MemtoReg, MemWr) are used 3 cycles later 臺大電機吳安宇教授 - 計算機結構 49
Datapath with Control 臺大電機吳安宇教授 - 計算機結構 50
Let s Try it Out Sample Assembly Program lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9 臺大電機吳安宇教授 - 計算機結構 51
Example 2: Cycle 1 臺大電機吳安宇教授 - 計算機結構 52
Example 2: Cycle 2 臺大電機吳安宇教授 - 計算機結構 53
Example 2: Cycle 3 臺大電機吳安宇教授 - 計算機結構 54
Example 2: Cycle 4 臺大電機吳安宇教授 - 計算機結構 55
Example 2: Cycle 5 臺大電機吳安宇教授 - 計算機結構 56
Example 2: Cycle 6 臺大電機吳安宇教授 - 計算機結構 57
Example 2: Cycle 7 臺大電機吳安宇教授 - 計算機結構 58
Example 2: Cycle 8 臺大電機吳安宇教授 - 計算機結構 59
Example 2: Cycle 9 臺大電機吳安宇教授 - 計算機結構 60
Summary of Pipeline Basics Pipelining is a fundamental concept Multiple steps using distinct resources Utilize capabilities of datapath by pipelined instruction processing Start next instruction while working on the current one Limited by length of longest stage (plus fill/flush) Need to detect and resolve hazards What makes it easy in MIPS? All instructions are of the same length Just a few instruction formats Memory operands only in loads and stores What makes pipelining hard? hazards 臺大電機吳安宇教授 - 計算機結構 61
Outline 6.1 An Overview of Pipelining 6.2 A Pipelined Datapath 6.3 Pipelined Control 6.4 Data Hazards and Forwarding 6.5 Data Hazards and Stalls 6.6 Branch Hazards 6.8 Exceptions 6.9 Superscalar and dynamic pipelining 臺大電機吳安宇教授 - 計算機結構 62
Data Hazards Order of operand accesses changed by pipeline Starting next instruction before first is finished Dependencies go backward in time 臺大電機吳安宇教授 - 計算機結構 63
Handling Data Hazards Detect Resolve remaining ones Compiler inserts NOP Stall Forward 臺大電機吳安宇教授 - 計算機結構 64
Software Solution Have compiler guarantee no hazards Where do we insert the NOPs? sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Problem: not efficient enough! 臺大電機吳安宇教授 - 計算機結構 65
Detecting Data Hazards Hazard conditions: EX/MEM.RegisterRd = ID/EX.RegisterRs (1a) EX/MEM.RegisterRd = ID/EX.RegisterRt (1b) MEM/WB.RegisterRd = ID/EX.RegisterRs (2a) MEM/WB.RegisterRd = ID/EX.RegisterRt (2b) Two optimizations: Don t forward if instruction does not write register check if RegWrite is asserted Don t forward if destination register is $0 check if RegisterRd = 0 臺大電機吳安宇教授 - 計算機結構 66
Detecting Data Hazards (cont.) Hazard conditions using control signals: At EX stage (EX hazard): If ( EX/MEM.RegWrite and (EX/MEM.RegRd 0) and (EX/MEM.RegRd=ID/EX.RegRs ) ForwardA = 10 臺大電機吳安宇教授 - 計算機結構 67
Detecting Data Hazards (cont.) Hazard conditions using control signals: At MEM stage: MEM/WB.RegWrite and (MEM/WB.RegRd 0) and (MEM/WB.RegRd=ID/EX.RegRs) (replace ID/EX.RegRt for ID/EX.RegRs for the other two conditions) 臺大電機吳安宇教授 - 計算機結構 68
Resolving Hazards: Forwarding Use temporary results, e.g., those in pipeline registers, don t wait for them to be written 臺大電機吳安宇教授 - 計算機結構 69
Forwarding Logic Forwarding: input to ALU from any pipe registers Add multiplexors to ALU input Control forwarding in EX carry Rs in ID/EX Control signals for forwarding: If both WB and MEM forward, e.g., add $1,$1,$2; add $1,$1,$3; add $1,$1,$4; => let MEM forward EX hazard: if (EX/MEM.RegWrite and (EX/MEM.RegRd 0) and (EX/MEM.RegRd=ID/EX.RegRs)) ForwardA=10 MEM hazard: if (MEM/WB.RegWriteand (MEM/WB.RegRd 0) and (EX/MEM.RegRd ID/EX.Reg.Rs) and (MEM/WB.RegRd=ID/EX.RegRs)) ForwardA=01 (ID/EX.RegRt <-> ID/EX.RegRs, ForwardB <-> ForwardA) 臺大電機吳安宇教授 - 計算機結構 70
No Forwarding 臺大電機吳安宇教授 - 計算機結構 71
With Forwarding 臺大電機吳安宇教授 - 計算機結構 72
Pipeline with Forwarding 臺大電機吳安宇教授 - 計算機結構 73
Example 3: Cycle 3 臺大電機吳安宇教授 - 計算機結構 74
Example 3: Cycle 4 臺大電機吳安宇教授 - 計算機結構 75
Example 3: Cycle 5 臺大電機吳安宇教授 - 計算機結構 76
Example 3: Cycle 6 臺大電機吳安宇教授 - 計算機結構 77
Can't Always Forward lw can still cause a hazard: if is followed by an instruction to read the loaded reg. 臺大電機吳安宇教授 - 計算機結構 78
Stalling Stall pipeline by keeping instructions in same stage and inserting an NOP instead 臺大電機吳安宇教授 - 計算機結構 79
Handling Stalls Hazard detection unit in ID to insert stall between a load instruction and its use: if (ID/EX.MemRead and ((ID/EX.RegisterRt= IF/ID.RegisterRs) or (ID/EX.RegisterRt= IF/ID.registerRt)) stall the pipeline for one cycle (ID/EX.MemRead=1 indicates a load instruction) How to stall? Stall instruction in IF and ID: not change PC and IF/ID => the stages re-execute the instructions What to move into EX: insert an NOP by changing EX, MEM, WB control fields of ID/EX pipeline register to 0 as control signals propagate, all control signals to EX, MEM, WB are deasserted and no registers or memories are written 臺大電機吳安宇教授 - 計算機結構 80
Pipeline with Stalling Unit Forwarding controls ALU inputs, hazard detection controls PC, IF/ID, control signals 臺大電機吳安宇教授 - 計算機結構 81
Example 4: Cycle 2 臺大電機吳安宇教授 - 計算機結構 82
Example 4: Cycle 3 臺大電機吳安宇教授 - 計算機結構 83
Example 4: Cycle 4 臺大電機吳安宇教授 - 計算機結構 84
Example 4: Cycle 5 臺大電機吳安宇教授 - 計算機結構 85
Example 4: Cycle 6 臺大電機吳安宇教授 - 計算機結構 86
Example 4: Cycle 7 臺大電機吳安宇教授 - 計算機結構 87
Outline 6.1 An Overview of Pipelining 6.2 A Pipelined Datapath 6.3 Pipelined Control 6.4 Data Hazards and Forwarding 6.5 Data Hazards and Stalls 6.6 Branch Hazards 6.8 Exceptions (optional) 6.9 Superscalar and dynamic pipelining (optional) 臺大電機吳安宇教授 - 計算機結構 88
Branch Hazards When decide to branch, other inst. are in pipeline! 臺大電機吳安宇教授 - 計算機結構 89
Handling Branch Hazard Predict branch always not taken Need to add hardware for flushing inst. if wrong Branch decision made at MEM => need to flush inst. in IF, ID, EX by changing control values to 0 Reduce delay of taken branch by moving branch execution earlier in the pipeline Move up branch address calculation to ID Check branch equality at ID (using XOR) by comparing the two registers read during ID Branch decision made at EX => one inst. to flush Add a control signal, IF.Flush, to zero instruction field of IF/ID => making the instruction an NOP Dynamic branch prediction Compiler rescheduling, delay branch 臺大電機吳安宇教授 - 計算機結構 90
Delayed Branch Predict-not-taken + branch decision at ID => the following inst. is always executed => branches take effect 1 cycle later 0 clock cycle per branch instruction if can find instruction to put in slot ( 50% of time) 臺大電機吳安宇教授 - 計算機結構 91
Pipeline with Flushing 臺大電機吳安宇教授 - 計算機結構 92
Example 5: Cycle 3 臺大電機吳安宇教授 - 計算機結構 93
Example 5: Cycle 4 臺大電機吳安宇教授 - 計算機結構 94
Summary Pipelines pass control information down the pipe just as data moves down pipe Forwarding/stalls handled by local control Exceptions stop the pipeline MIPS instruction set architecture made pipeline visible (delayed branch, delayed load) More performance from deeper pipelines, parallelism 臺大電機吳安宇教授 - 計算機結構 95