Chapter 4 (Part II) The Processor: Datapath and Control (Enhancing Performance with Pipelining) 陳瑞奇 (J.C. Chen) 亞洲大學資訊工程學系 Adapted from class notes by Prof. M.J. Irwin, PSU and Prof. D. Patterson, UCB 1 Single Cycle vs. Multiple Cycle Timing Single Cycle Implementation: Clk Cycle 1 Cycle 2 lw sw Waste multicycle clock slower than 1/5 th of Multiple Cycle Implementation: single cycle clock due to stage register overhead Clk Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10 lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem R-type IFetch 2
How Can We Make It Even Faster? 1. Split the multiple instruction cycle into smaller and smaller steps (stages) 2. Start fetching and executing the next instruction before the current one has completed Pipelining modern processors are pipelined for performance Multiple instructions can be overlapped in execution. Remember the performance equation: CPU time = CPI * CCT * IC 3. Fetch (and execute) more than one instruction at a time (Parallel Processing; Superscalar) 3 Pipelining( 管線化 ): 觀念來自於生產線輸送帶 http://www.dtdsmt.com/upload/photo/ee99e86b50b43054bfd4217649475fba.jpg 4
Pipelining #1 5 Pipelining #2 6
Pipelining: It s Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and organize Washer takes 30 minutes A B C D Dryer takes 30 minutes Folder takes 30 minutes Closet takes 30 minutes 7 Sequential laundry takes 8 hours for 4 loads! pipelining Only takes 3.5 hours for 4 loads! 2.3 times faster 40/11.5=3.5 times faster Twenty loads (11.5hrs) would take about 5.75 times as long as one load (2 hrs)! p.261( 頁 271) Fig. 4.25 10
MIPS instructions classically take five Stages Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 lw IFetch Dec Exec Mem WB Five stages, one step per stage IFetch: Instruction Fetch Dec: Instruction Decode and register file read Exec: Execution or address calculation Mem: data Memory access WB: Write Back (to a register) 11 A Pipelined MIPS Processor Start the next instruction before the current one has completed improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB - clock cycle (pipeline stage time) is limited by the slowest stage - for some instructions, some stages are wasted cycles 12
Single Cycle, Multiple Cycle, vs. Pipeline Single Cycle Implementation: Cycle 1 Cycle 2 Clk lw sw Waste Multiple Cycle Implementation: Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10 Clk lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem R-type IFetch Pipeline Implementation: lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB 13 Pipeline Performance Single-cycle (T c = 800ps) p.264( 頁 273) Fig. 4.27 Pipelined (T c = 200ps) 15
PC IFetch/Dec Dec/Exec Exec/Mem Mem/WB 4.6 The single-cycle datapath from Fig. 4.17 p. 275 ( 頁 285) Fig. 4.33 What do we need to add to actually split the datapath into stages? 17 MIPS Pipeline Datapath Modifications What do we need to add/modify in our MIPS datapath? State registers between each pipeline stage to isolate them IF:IFetch ID:Decode EX:Execute MEM: MemAccess WB: WriteBack 4 Instruction Memory Read Address Add p.277 ( 頁 287) Fig.4.35 System Clock 64bits 128bits 97bits 64bits IF/ID ID/EX EX/MEM MEM/WB Read Addr 1 Register Read Read Addr 2Data 1 File Write Addr Read Data 2 Write Data Sign 16 Extend Shift left 2 Add ALU zero Address Write Data Data Memory Read Data 18
Graphically Representing Pipelines p.276( 頁 286) Fig.4.34 Totally R R used R W Not used 19 The first pipe stage (IF) of an instruction (Load) (Instruction fetch) IF System clock (Write) p.279( 頁 288) Fig.4.36 top System clock R 20
The second pipe stage (ID) of an instruction (Load) System clock (Write) (Instruction decode) Dec Write p.279( 頁 288) Fig.4.36 bottom decoding decoded R System clock 21 The third pipe stage (EX) of an instruction (Load) p.280( 頁 289) Fig.4.37 Dec Write (Execution) EX Write decoded effective address 22
The fourth pipe stage (MEM) of an instruction (Load) p.281( 頁 290) Fig.4.38 top EX write (Memory) MEM write R effective address 23 The fifth pipe stage (WB) of an instruction System clock (Write back) MEM W p.281( 頁 290) Fig.4.38 bottom 24
The third pipe stage (EX) of an instruction (store) p.282( 頁 292) Fig.4.39 Dec Write (Execution) EX Write store effective address 25 The fourth pipe stage (MEM) of an instruction (store) p.283( 頁 293) Fig.4.40 top EX Write (Memory) MEM Write W effective address 26
PC The fifth pipe stage (WB) of an instruction (store) p.283( 頁 293) Fig.4.40 bottom (Write back) MEM Write MEM Write decoded 27 Corrected Datapath to Save RegWrite Addr Need to preserve the destination register address in the pipeline state registers IF/ID ID/EX EX/MEM 4 Instruction Memory Read Address Add Read Addr 1 Register Read Read Addr 2Data 1 File Write Addr Read Data 2 Write Data Shift left 2 Add ALU Address Write Data Data Memory Read Data MEM/WB Sign 16 Extend 28
PC 鑰匙怕丟掉, 怎麼辦? http://sy.police.taipei/ 放進口袋, 帶著走! 29 Corrected Datapath to Save RegWrite Addr Need to preserve the destination register address in the pipeline state registers IF/ID ID/EX EX/MEM 4 Instruction Memory Read Address Add Read Addr 1 Register Read Read Addr 2Data 1 File Write Addr Read Data 2 Write Data Shift left 2 Add ALU Address Write Data Data Memory Read Data MEM/WB Sign 16 Extend rt/rd p.284( 頁 294) Fig.4.41 30
Corrected Datapath for Load p.284( 頁 294) Fig.4.41 31 Why Pipeline? For Performance! Time (clock cycles) I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 ALU IM Reg DM Reg ALU IM Reg DM Reg ALU IM Reg DM Reg ALU IM Reg DM Reg Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 Inst 4 Time to fill the pipeline ALU IM Reg DM Reg ALU IM Reg DM Reg 34
Multiple-clock-cycle pipeline diagram Totally R R used R W Not used p.286( 頁 296) Fig. 4.43 35 The single-clock-cycle pipeline diagram 5 4 3 2 1 1 2 rt/rd 3 4 5 p.287( 頁 297) Fig. 4.45 Clock 5 37
Includes control lines 4 control Lines for EXE 3 control Lines for MEM 2 control Lines for WB p.289( 頁 299) Fig. 4.46 rt rd 39 傳說中的無敵鐵金剛 http://www.hbyty.com/images/product_images/info_images/001eyd000001-2.jpg http://www.chara-net.com/images-item-big/ref4-7522.jpg 40
https://www.youtube.com/watch?v=ojzp_zv5dwg 41 2 control Lines for WB decoding 4 control Lines for EXE 3 control Lines for MEM p.291( 頁 301) Fig. 4.50 42
https://farm4.staticflickr.com/3691/13534389474_87e00fdb79.jpg http://pic.pimg.tw/cgboy26/1404897101-38112107.jpg 43 Pipelined Control p.292( 頁 302) Fig. 4.51 45
Pipeline Summary The BIG Picture All modern day processors use pipelining Pipelining doesn t help latency of single task, it helps throughput of entire workload Potential speedup: a CPI of 1 and fast a CC Pipeline rate limited by slowest pipeline stage Unbalanced pipe stages makes for inefficiencies Must detect and resolve hazards Stalling negatively affects CPI (makes CPI less than the ideal of 1) 56 第三次作業 : 第四章前半部習題 (Due in 2 weeks) 4.1 考慮下列指令 : 指令 :AND Rd,Rs,Rt 說明 :Reg[Rd] = Reg[Rs] AND Reg[Rt] 4.1.1(15%) 圖 4.2 中的控制器為了上述指令所產生的控制訊號其值為何? 4.1.2(5%) 哪些資源 ( 區塊 ) 會為該指令做出有用的功能? 4.1.3(5%) 哪些資源 ( 區塊 ) 會產生並不被該指令用到的輸出? 哪些資源 ( 區塊 ) 並不會對該指令產生輸出? 58
4.4 本習題的問題假設所需用以製作處理器資料通道的邏輯區塊具有以下延遲 : I-Mem Add Mux ALU Regs D-Mem Sign- Shift-Left-2 Extend 200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps 4.4.1(10%) 設若我們在處理器中唯一需要做的事是擷取連續的指令 ( 圖 4.6), 則時脈週期時間可以為若干? 4.4.2(10%) 考慮一個類似圖 4.11 中所示的資料通道, 然其處理器只有一種類型的指令 : 無條件 PC- 相對位址的分支 對於這個資料通道的週期時間為若干? 4.4.3(10%) 當我們僅需支援有條件 PC- 相對位址的分支時, 重複 4.4.2 題本習題中剩下的三題與資料通道中的 Shift-left-2 有關 : 4.4.4(5%) 哪些指令需要用到這項資源? 4.4.5(5%) 這項資源對哪些種指令 ( 如果有的話 ) 會位於關鍵路徑上? 4.4.6(5%) 假設我們僅支援 beq 及 add 指令, 討論這項資源在延遲上的變化會如何影響處理器的週期時間 假設其他資源的延遲不變 4 Add Instruction Memory 圖 4.6 PC Read Address Instruction 59 圖 4.11 PCSrc ALUSrc ALU operation MemWrite MemtoReg RegWrite MemRead 60
4.7 本習題中我們仔細檢視一道指令是如何在單週期的資料通道中執行的 習題中的各問題請參考擷取下列指令字當時的時脈週期 : 1010 1100 0110 0010 0000 0000 0001 0100 假設資料記憶體中的資料全為 0 且處理器的暫存器在擷取上述指令的週期開始時含有下列值 : r0 r1 r2 r3 r4 r5 r6 r8 r12 r31 0-1 2-3 -4 10 6 8 2-16 4.7.1(10%) 符號延伸及跳躍的 左移 2 單元 ( 位於圖 4.24 中的上方 ) 對該指令字所產出的輸出各為何? 4.7.2(5%) 對該指令而言 ALU 控制單元的輸入值應為何? 4.7.3(5%) 該指令執行後 PC 的新位址應為何? 標示出決定這個 PC 新位置所需用到的路徑 ( 按 : 這裡應是指在圖 4.24 中標示計算 PC 新位置所需用到的路徑 ) 4.7.4(5%) 在該指令執行期間以及上述暫存器內容值的情形下, 則每一個 Mux 的輸出值各為何? 4.7.5(5%)ALU 及兩個加法單元的輸入值各為何? 61 圖 4.24 62