
Advanced Computer Architecture: Modern Instruction-Level Parallelism Techniques (Lecture 4). Cheng Xu, March 26, 2012

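Pipeline performance
More complex pipelines and dynamic scheduling exploit implicit instruction-level parallelism. Instructions execute out of order while the hardware still honors true data dependences (RAW) and precise interrupts. Register renaming eliminates WAR and WAW hazards. The reorder buffer holds results that have completed but not yet committed, which is what makes precise interrupts possible. Frequent branch instructions create control hazards and therefore limit the performance gain.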

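Overall structure of the instruction pipeline
[Figure: Fetch and Decode (in-order) -> Reorder Buffer -> Execute (out-of-order) -> Commit (in-order). An "Exception?" check at commit kills the instructions in the earlier stages and injects the handler PC.]
Instructions are fetched and decoded into the instruction reorder buffer in order. Execution is out of order, so completion is out of order. Commit (write-back to architectural state, i.e., register file and memory) is in order. Temporary storage (shadow registers and store buffers) is needed to hold results before commit.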

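Performance loss caused by control flow
In many modern processors there are more than 10 pipeline stages between next-PC calculation and final branch resolution.
[Figure: PC -> I-cache -> Fetch Buffer -> Decode -> Issue Buffer -> Functional Units -> Result Buffer -> Commit -> Architectural State. The next fetch starts at the PC stage; the branch is executed at the functional units.]
How much work is lost if the pipeline cannot select the correct instructions in time? Roughly loop length x pipeline width.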

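MIPS branches and jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?

  Instruction   Taken known?          Target known?
  J             After Inst. Decode    After Inst. Decode
  JR            After Inst. Decode    After Reg. Fetch
  BEQZ/BNEZ     After Reg. Fetch *    After Inst. Decode

* Assuming the zero test is done during register read.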

Branch penalties in deep instruction pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750 MHz, 2000):
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional Units
  R  Register File Read
  E  Integer Execute
  followed by the remainder of the execute pipeline (+ another 6 stages)
The figure marks two points on this pipeline: where the branch target address is known, and the later point where the branch direction and jump register target are known.

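Reducing branch penalties
Software solutions:
  Eliminate branches: loop unrolling increases the run length.
  Reduce the resolution time: instruction scheduling computes the branch condition as early as possible.
Hardware solutions:
  Find something else to do: delay slots replace pipeline bubbles with useful work (requires software cooperation).
  Speculate: branch prediction, with speculative execution of instructions beyond the branch.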

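Branch prediction
Motivation: branch penalties limit the performance of deeply pipelined processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.
Required hardware support:
  Prediction structures: branch history tables, branch target buffers, etc.
  Mispredict recovery mechanisms: keep result computation separate from commit; kill the instructions that follow a mispredicted branch in the pipeline; restore the state to the correct state following the branch.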

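Static branch prediction
Overall, the probability that a branch is taken is about 60-70%, but it depends on direction: backward branches are taken about 90% of the time, forward branches about 50%.
The ISA can also attach preferred-direction semantics to branch instructions, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken).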

Dynamic branch prediction: learning based on past behavior
Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at the next execution.
Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution).

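Branch prediction bits
Assume 2 branch prediction bits per instruction. Change the prediction direction only after two consecutive mispredictions.
BP state: (predict take / not take) x (last prediction right / wrong)
[Figure: state diagram with four states (take right, take wrong, not-take wrong, not-take right) and transitions labeled taken / not taken.]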
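The two-bit state described on this slide can be sketched as a small Python class. This is a minimal sketch, not the slide's exact state diagram: the class name `TwoBitPredictor` is hypothetical, and the choice to enter the "right" state immediately after flipping direction is an assumption. The property taken from the slide is that the direction changes only after two consecutive mispredictions.

```python
class TwoBitPredictor:
    """State = (predicted direction) x (was the last prediction right?)."""

    def __init__(self, predict_taken=False):
        self.predict_taken = predict_taken
        self.last_right = True

    def predict(self):
        return self.predict_taken

    def update(self, taken):
        if taken == self.predict_taken:
            self.last_right = True          # correct prediction
        elif self.last_right:
            self.last_right = False         # first miss: keep the direction
        else:
            self.predict_taken = taken      # second miss in a row: flip
            self.last_right = True          # assumption: start in "right"
```

For a loop branch that is taken several times and then falls through once, this scheme mispredicts only the loop exit, because a single miss does not change the direction.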

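Branch history table (BHT)
[Figure: k bits of the fetch PC (the two low-order zero bits are skipped) index a 2^k-entry BHT with 2 bits per entry, which answers Taken / Not taken? In parallel, the instruction read from the I-cache supplies the opcode (Branch?) and the offset, which is added to the PC to form the target PC.]
A 4K-entry BHT with 2 bits per entry gives about 80-90% correct predictions.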
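A PC-indexed table of two-bit entries can be sketched as follows. This is an illustrative sketch under stated assumptions: the class name `BranchHistoryTable` is hypothetical, the two bits are encoded as a saturating counter (a common alternative to the direction/right-wrong encoding), and entries start in the weakly not-taken state.

```python
class BranchHistoryTable:
    def __init__(self, k=12):
        self.mask = (1 << k) - 1
        # 2-bit saturating counters, 0..3; 1 = weakly not taken (assumption)
        self.counters = [1] * (1 << k)

    def _index(self, pc):
        # Drop the two always-zero low bits, then keep k bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
```

Because the table has no tags, two branches whose PCs differ only above the indexed bits share one entry and influence each other's predictions.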

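Exploiting spatial correlation (Yeh and Patt, 1992)
  if (x[i] < 7) then y += 1;
  if (x[i] < 5) then c -= 4;
If the first condition is false, the second condition must also be false.
A history register, H, records the direction of the last N branches executed by the processor.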

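Two-level branch predictor
The Pentium Pro uses the outcomes of the last two branches to select one of four sets of BHT bits (~95% correct).
[Figure: k bits of the fetch PC index the BHT sets; a 2-bit global branch history shift register, into which the Taken / Not taken result of each branch is shifted, selects the set that supplies the Taken / Not taken? prediction.]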
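The selection of one BHT set by a global history register can be sketched as below. This is a simplified model, not the Pentium Pro implementation: the class name `TwoLevelPredictor`, the saturating-counter encoding, and the table size are assumptions made for illustration.

```python
class TwoLevelPredictor:
    def __init__(self, k=10, history_bits=2):
        self.mask = (1 << k) - 1
        self.hmask = (1 << history_bits) - 1
        self.history = 0  # global shift register of recent branch outcomes
        # One table of 2-bit counters per history pattern (4 sets for 2 bits).
        self.tables = [[1] * (1 << k) for _ in range(1 << history_bits)]

    def predict(self, pc):
        return self.tables[self.history][(pc >> 2) & self.mask] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = (pc >> 2) & self.mask
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the resolved outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.hmask
```

A branch that strictly alternates between taken and not taken defeats a single two-bit entry, but here each history pattern gets its own entry, so after a short warm-up every prediction is correct.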

Limitations of BHTs
A BHT predicts only the branch direction. The fetch stream therefore cannot be redirected to the branch target until the target has been determined.
[Figure: UltraSPARC-III fetch pipeline with the correctly predicted taken-branch penalty and the jump register penalty marked. Stages: A PC Generation/Mux; P Instruction Fetch Stage 1; F Instruction Fetch Stage 2; B Branch Address Calc/Begin Decode; I Complete Decode; J Steer Instructions to Functional Units; R Register File Read; E Integer Execute; then the remainder of the execute pipeline (+ another 6 stages).]

Branch target buffer (BTB)
[Figure: k bits of the PC index a 2^k-entry branch target buffer in parallel with the instruction memory (IMEM) access; each entry supplies a predicted target and branch prediction bits (BPb).]
BP bits are stored with the predicted target address.
IF stage: if (BP = taken) then nPC = target else nPC = PC + 4.
Later: check the prediction; if wrong, kill the instruction and update the BTB and BPb, else update BPb.

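Address collisions
Assume a 128-entry BTB. Instruction memory holds "Jump 100" at address 132 and "Add ..." at address 1028, and both addresses select the same BTB entry. The entry written by the jump holds target 236 with BPb = take.
What will be fetched after the instruction at 1028? The BTB prediction is 236, but the correct target is 1032, so the pipeline must kill PC=236 and fetch PC=1032.
Is this a common occurrence? Can we avoid these bubbles?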

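The BTB is only for control instructions
The BTB contains useful information only for branch and jump instructions, so other instructions must not change its state. For all other instructions the next PC is PC+4. How can this be achieved before the instruction is decoded?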

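Branch target buffer (BTB)
[Figure: the PC accesses the I-cache and, in parallel, k bits of it index a 2^k-entry direct-mapped BTB (it can also be associative). Each entry holds an entry PC, a valid bit, and a predicted target PC; the entry PC is compared (=) with the fetch PC to produce match, valid, and target.]
Keep both the branch PC and the target PC in the BTB.
PC+4 is fetched if the match fails.
Only taken branches and jumps are held in the BTB.
The next PC is determined before the branch is fetched and decoded.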

Consulting the BTB before decoding
Same example: "Jump 100" at address 132 and "Add ..." at address 1028; the BTB entry holds entry PC 132, target 236, BPb = take.
The match for PC=1028 fails and 1028+4 is fetched, which eliminates false predictions after ALU instructions.
The BTB contains entries only for control transfer instructions, which leaves more room to store branch targets.
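The effect of storing the branch PC in each entry can be sketched as follows. This is a minimal model under stated assumptions: the class name `BranchTargetBuffer` and the `check_tag` flag are hypothetical, and the index function `pc % entries` is chosen only so that addresses 132 and 1028 collide in a 128-entry table as they do in the slide's example.

```python
class BranchTargetBuffer:
    def __init__(self, entries=128, check_tag=True):
        self.entries = entries
        self.check_tag = check_tag
        self.table = [None] * entries   # each slot: (branch_pc, target) or None

    def _index(self, pc):
        return pc % self.entries        # assumption: simple modulo indexing

    def insert(self, branch_pc, target):
        # Only taken branches and jumps are inserted.
        self.table[self._index(branch_pc)] = (branch_pc, target)

    def next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and (not self.check_tag or slot[0] == pc):
            return slot[1]              # predicted target
        return pc + 4                   # no match: fall through


btb = BranchTargetBuffer()
btb.insert(132, 236)
print(btb.next_pc(132), btb.next_pc(1028))
```

With `check_tag=False` the same lookup for PC 1028 returns 236, which is the false prediction shown on the address-collision slide.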

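Combining the BTB and the BHT
BTB entries are more expensive to implement than BHT entries, but a BTB can redirect the fetch stream earlier in the pipeline and can accelerate indirect branches (JR).
A BHT can hold many more entries and is more accurate.
A BHT in a later pipeline stage corrects the fetch stream when the BTB misses a predicted-taken branch.
[Figure: BTB and BHT placed along the UltraSPARC-III fetch pipeline. Stages: A PC Generation/Mux; P Instruction Fetch Stage 1; F Instruction Fetch Stage 2; B Branch Address Calc/Begin Decode; I Complete Decode; J Steer Instructions to Functional Units; R Register File Read; E Integer Execute.]
The BTB and BHT are updated only after the branch resolves in the E stage.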

Uses of jump register (JR)
Switch statements (jump to the address of the matching case): the BTB works well if the same case is used repeatedly.
Dynamic function calls (jump to a run-time function address): the BTB works well if the same function is usually called (e.g., in C++ programming, when the objects in a virtual function call have the same type).
Subroutine returns (jump to the return address): the BTB works well if the return usually goes to the same place. Often, however, one function is called from many distinct call sites!
Does the BTB work well in all of these cases?

Subroutine return stack
A small dedicated structure accelerates JR for subroutine returns and is typically more accurate than a BTB.
  fa() { fb(); }
  fb() { fc(); }
  fc() { fd(); }
Push the call address when a function call is executed. Pop the return address when a subroutine return is decoded.
[Figure: stack of k entries (typically k = 8-16) holding &fd(), &fc(), &fb().]
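The push-on-call, pop-on-return behavior can be sketched as below. The class name `ReturnStack` and the overflow policy (discard the oldest entry) are assumptions for illustration; the slide specifies only the k-entry size.

```python
class ReturnStack:
    def __init__(self, k=8):
        self.k = k
        self.stack = []

    def on_call(self, return_pc):
        # Assumption: on overflow the oldest entry is discarded.
        if len(self.stack) == self.k:
            self.stack.pop(0)
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted return address, or None if the stack is empty.
        return self.stack.pop() if self.stack else None
```

Unlike a BTB entry for the return instruction, the stack yields the right target even when the same function is called from many different call sites, as long as the nesting depth stays within k.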

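In-order machines: mispredict recovery
Assume that no instruction issued after the branch can write back its result before the branch resolves. Recovery then consists of killing all instructions behind the mispredicted branch.
Out-of-order execution? Several instructions that follow the branch in program order may already have completed before the branch resolves.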

In-order commit for precise exceptions
[Figure: Fetch and Decode (in-order) -> Reorder Buffer -> Execute (out-of-order) -> Commit (in-order). An "Exception?" check at commit kills the instructions in the earlier stages and injects the handler PC.]
Instructions are fetched and decoded into the instruction reorder buffer in order.
Execution is out of order (so completion is out of order).
Commit (write-back to architectural state, i.e., regfile and memory) is in order.
Temporary storage is needed in the ROB to hold results before commit.

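Branch misprediction in the pipeline
[Figure: PC -> Fetch -> Decode -> Reorder Buffer -> Commit, with Execute writing completed results back to the reorder buffer. Branch prediction steers the PC; branch resolution kills the instructions in fetch, decode, and the reorder buffer and injects the correct PC.]
There can be multiple unresolved branches in the ROB.
Branches can be resolved out of order by killing all the instructions in the ROB that follow a mispredicted branch.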

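Recovering the ROB and the rename table
[Figure: a rename table (entries r1, r2, ..., each with a tag t and a valid bit v) with rename snapshots, next to the register file. The reorder buffer has entries t1 ... tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data. Ptr2 marks the next entry to commit and Ptr1 the next available entry; on rollback the next-available pointer is moved back. The load unit, functional units, and store unit return <t, result>; committed results go to the register file.]
Take a snapshot of the register rename table at each predicted branch; recover the earlier snapshot if the branch was mispredicted.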
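Snapshot-based recovery of the rename table can be sketched as follows. This is a simplified model: the class name `RenameTable` and its method names are hypothetical, and branch identifiers are assumed to increase in program order so that snapshots of younger branches can be discarded on a misprediction.

```python
class RenameTable:
    def __init__(self, mapping):
        self.mapping = dict(mapping)    # architectural reg -> physical reg/tag
        self.snapshots = {}             # branch id -> saved copy of mapping

    def rename_dest(self, arch_reg, phys_reg):
        self.mapping[arch_reg] = phys_reg

    def snapshot(self, branch_id):
        # Taken when a predicted branch is renamed.
        self.snapshots[branch_id] = dict(self.mapping)

    def resolve(self, branch_id, mispredicted):
        saved = self.snapshots.pop(branch_id)
        if mispredicted:
            self.mapping = saved
            # Younger branches were on the wrong path: drop their snapshots.
            for b in [b for b in self.snapshots if b > branch_id]:
                del self.snapshots[b]
```

Restoring a whole-table copy makes recovery a single step, at the cost of storage for one copy per unresolved branch.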

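Speculating both directions
As an alternative to branch prediction, both possible directions of a branch can be executed speculatively at the same time.
The resources required are proportional to the number of instruction streams being speculated concurrently. When both paths of one branch are executed speculatively, only half of the resources do useful work.
Speculation guided by branch prediction needs fewer resources than speculating down every direction of a branch.
When prediction accuracy is high, dedicating all resources to the predicted direction is the more cost-effective choice.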

Data-in-ROB design (HP PA8000, Pentium Pro, Core2Duo)
[Figure: the register file holds only committed state. The reorder buffer has entries t1 ... tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data. The load unit, functional units, and store unit return <t, result>; commit writes to the register file.]
On dispatch into the ROB, ready sources can be in the regfile or in a ROB dest field (copied into src1/src2 if ready before dispatch).
On completion, write to the dest field and broadcast to the src fields.
On issue, read from the ROB src fields.

Unified physical register file (MIPS R10K, Alpha 21264, Pentium 4)
[Figure: a rename table maps r1, r2, ... to physical registers ti, tj, with snapshots for mispredict recovery. One register file holds t1 ... tn and feeds the load unit, functional units, and store unit, which return <t, result>. The ROB is not shown.]
One regfile holds both committed and speculative values (no data in the ROB).
During decode, the instruction result is allocated a new physical register, and source registers are translated to physical registers through the rename table.
An instruction reads its data from the regfile at the start of execute (not in decode).
Write-back updates the register busy bits on instructions in the ROB (associative search).
Snapshots of the rename table are taken at every branch to recover from mispredicts.
On an exception, renaming is undone in reverse order of issue (MIPS R10000).

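Pipeline design with a physical regfile
[Figure: PC -> Fetch -> Decode & Rename (in-order) -> Reorder Buffer -> Commit (in-order). The out-of-order execute section contains the physical register file, branch unit, ALU, and MEM with a store buffer and D$. Branch prediction steers the PC; branch resolution kills the instructions in the earlier stages and updates the predictors.]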

Lifetime of physical registers
The physical regfile holds committed and speculative values.
Physical registers are decoupled from ROB entries (no data in the ROB).

  Original              Renamed
  ld  r1, (r3)          ld  P1, (Px)
  add r3, r1, #4        add P2, P1, #4
  sub r6, r7, r9        sub P3, Py, Pz
  add r3, r3, r6        add P4, P2, P3
  ld  r6, (r1)          ld  P5, (P1)
  add r6, r6, r3        add P6, P5, P4
  st  r6, (r1)          st  P6, (P1)
  ld  r6, (r11)         ld  P7, (Pw)

When can a physical register be reused? When the next write of the same architectural register commits.

Physical register management (step-by-step example)
Instruction sequence:
  ld  r1, 0(r3)
  add r3, r1, #4
  sub r6, r7, r6
  add r3, r3, r6
  ld  r6, 0(r1)
Initial state: the rename table maps R1 to P8, R3 to P7, R6 to P5, and R7 to P6, so physical registers P5, P6, P7, P8 hold <R6>, <R7>, <R3>, <R1>. The free list is P0, P1, P3, P2, P4.
Each ROB entry has the fields use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd. PRd is the newly allocated destination register and LPRd is the physical register previously mapped to Rd. (LPRd requires a third read port on the rename table for each instruction.)

ROB contents after the five instructions have been renamed in order:

  op    PR1   PR2   Rd   LPRd   PRd
  ld    P7          r1   P8     P0
  add   P0          r3   P7     P1
  sub   P6    P5    r6   P5     P3
  add   P1    P3    r3   P1     P2
  ld    P0          r6   P3     P4

Rename table updates, one per step: R1 = P0 after the first ld, R3 = P1 after the first add, R6 = P3 after the sub, R3 = P2 after the second add, R6 = P4 after the second ld. All five registers of the initial free list are then in use.

Execute and commit: when the first ld executes and commits, P0 holds <R1> and its LPRd, P8, is returned to the free list. When the first add executes and commits, P1 holds <R3> and P7 is returned to the free list.
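The renaming and freeing steps of this walk-through can be reproduced with a short Python sketch. The function names `rename` and `commit` and the tuple encoding of instructions are hypothetical; the initial mapping, free list, and instruction sequence are the ones from the slides.

```python
from collections import deque


def rename(program, rename_table, free_list):
    """Rename instructions in program order; return ROB, table, free list."""
    table = dict(rename_table)
    free = deque(free_list)
    rob = []
    for op, rd, srcs in program:
        psrcs = [table[s] for s in srcs]  # read sources before remapping Rd
        lprd = table[rd]                  # old mapping, freed at commit
        prd = free.popleft()              # allocate a new physical register
        table[rd] = prd
        rob.append({"op": op, "srcs": psrcs, "Rd": rd,
                    "LPRd": lprd, "PRd": prd})
    return rob, table, free


def commit(entry, free):
    # The previous mapping of Rd can no longer be referenced: recycle it.
    free.append(entry["LPRd"])


program = [("ld", "r1", ["r3"]), ("add", "r3", ["r1"]),
           ("sub", "r6", ["r7", "r6"]), ("add", "r3", ["r3", "r6"]),
           ("ld", "r6", ["r1"])]
rob, table, free = rename(program,
                          {"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"},
                          ["P0", "P1", "P3", "P2", "P4"])
commit(rob[0], free)   # first ld commits: P8 becomes free
commit(rob[1], free)   # first add commits: P7 becomes free
```

Freeing LPRd at commit, rather than when the new mapping is created, is what keeps the old value available if the instruction has to be squashed.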

The reorder buffer holds the active instruction window
[Figure: the same instruction list shown at cycle t and at cycle t + 1, ordered from older instructions (top) to newer instructions (bottom), with Commit, Execute, and Fetch pointers marking positions in the window.]
  ld  r1, (r3)
  add r3, r1, r2
  sub r6, r7, r9
  add r3, r3, r6
  ld  r6, (r1)
  add r6, r6, r3
  st  r6, (r1)
  ld  r6, (r1)

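Superscalar register renaming
During decode, each instruction is allocated a new physical destination register.
Source operands are renamed to the physical register that holds the newest value.
The execution units see only physical register numbers.
[Figure: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) supply read addresses to the rename table; the read data become PSrc1 and PSrc2, the register free list supplies PDest, and the write ports update the mapping. Output per instruction: Op, PDest, PSrc1, PSrc2.]
Does this work?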

Superscalar register renaming (continued)
[Figure: the same rename datapath with comparators (=?) between the destination of Inst 1 and the sources of Inst 2.]
RAW hazards between instructions issuing in the same cycle must be checked. The check can be done in parallel with the rename lookup.
The MIPS R10K renames 4 serially RAW-dependent instructions per cycle.
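The same-cycle RAW check can be sketched as follows. This is an illustrative model, not the R10K circuit: the function name `rename_group` and the `(dest, sources)` tuple encoding are hypothetical. All table lookups use the mapping as it stood at the start of the cycle, and the comparators then override any source that matches the destination of an earlier instruction in the group.

```python
def rename_group(group, table, free):
    """Rename instructions decoded in the same cycle.

    group: list of (dest_reg, [source_regs]) in program order.
    """
    table = dict(table)
    free = list(free)
    # Parallel lookups against the start-of-cycle rename table.
    looked_up = [[table[s] for s in srcs] for _, srcs in group]
    new_dests = []
    renamed = []
    for i, (rd, srcs) in enumerate(group):
        psrcs = list(looked_up[i])
        for j, s in enumerate(srcs):
            # Comparators: the nearest earlier writer in the group wins.
            for earlier in range(i - 1, -1, -1):
                if group[earlier][0] == s:
                    psrcs[j] = new_dests[earlier]
                    break
        pd = free.pop(0)
        new_dests.append(pd)
        renamed.append((pd, psrcs))
    for (rd, _), pd in zip(group, new_dests):
        table[rd] = pd  # the last writer in the group sets the final mapping
    return renamed, table, free
```

Without the comparison, the second instruction of the pair `ld r1, 0(r3)` / `add r3, r1, #4` would read the stale mapping of r1 from the table instead of the register just allocated to the load.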

Memory dependences
  st r1, (r2)
  ld r3, (r4)
When can the load be executed?

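In-order memory queue
Execute all loads and stores in program order, so a load or store cannot leave the ROB for execution until all earlier loads and stores have completed.
The goal is still to execute loads and stores speculatively, and out of order with respect to other instructions.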

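Conservative out-of-order load execution
  st r1, (r2)
  ld r3, (r4)
Split the execution of a store into two phases: address calculation and data write.
The load can execute before the store if both addresses are known and r4 != r2.
Each load address is compared with the addresses of all earlier uncommitted stores (a partial, conservative comparison can be used, e.g., the low 12 bits of the address).
The load must not execute if the address of any earlier store is still unknown.
(MIPS R10K, 16-entry address queue)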
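The conservative check can be written as a small predicate. The function name `load_may_issue` and the use of `None` for a store whose address has not yet been computed are assumptions made for this sketch.

```python
def load_may_issue(load_addr, older_store_addrs, bits=12):
    """Return True only if the load provably misses every older store.

    older_store_addrs: addresses of earlier uncommitted stores,
    with None standing for an address that is not yet known.
    """
    mask = (1 << bits) - 1
    for store_addr in older_store_addrs:
        if store_addr is None:
            return False          # unknown store address: must wait
        if (store_addr & mask) == (load_addr & mask):
            return False          # possible alias under the partial compare
    return True
```

Comparing only the low bits can report a conflict for addresses that actually differ, which costs some performance but never lets a dependent load run early.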

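Address speculation
  st r1, (r2)
  ld r3, (r4)
Guess that r4 != r2 and execute the load before the store address is known.
All completed but uncommitted load and store addresses must be kept in program order.
If it later turns out that r4 == r2, squash the load and all following instructions, so an inaccurate address guess can carry a large penalty.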

Memory dependence prediction (Alpha 21264)
  st r1, (r2)
  ld r3, (r4)
Guess that r4 != r2 and execute the load before the store.
If it later turns out that r4 == r2, squash the load and all following instructions, but mark the load instruction as store-wait.
Subsequent executions of the same load instruction will wait for all previous stores to complete.
Periodically clear the store-wait bits.
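The store-wait mechanism can be sketched as below. This is a behavioral model only: the class name `StoreWaitPredictor` and the use of a Python set keyed by load PC are assumptions, and the sketch does not model how the hardware actually stores or indexes the bits.

```python
class StoreWaitPredictor:
    def __init__(self):
        self.store_wait = set()   # PCs of loads currently marked store-wait

    def may_bypass_stores(self, load_pc):
        # Unmarked loads speculate past earlier stores.
        return load_pc not in self.store_wait

    def on_violation(self, load_pc):
        # The load ran before a store to the same address and was squashed.
        self.store_wait.add(load_pc)

    def periodic_clear(self):
        # Give marked loads another chance to speculate.
        self.store_wait.clear()
```

The periodic clear matters because a load that conflicted once may stop conflicting later, and a permanently set bit would make it wait on every execution.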

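Speculative loads and stores
Just like register updates, a store must not modify memory until all earlier instructions have committed.
This calls for a new structure, the speculative store buffer, to hold the data of speculative stores.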

Speculative store buffer
[Figure: a speculative store buffer whose entries each hold a valid bit (V), a speculative bit (S), a tag, and data, next to the L1 data cache (tags and data). The load address is looked up in both structures to produce the load data; a store commit path moves data from the buffer to the cache.]
On store execute: mark the entry valid and speculative, and save the data and the tag of the instruction.
On store commit: clear the speculative bit and eventually move the data to the cache.
On store abort: clear the valid bit.

Speculative store buffer (continued)
[Figure: same structure as on the previous slide.]
If the data is in both the store buffer and the cache, which should be used? The speculative store buffer.
If the same address is in the store buffer twice, which should be used? The youngest store that is older than the load.
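The two lookup rules can be sketched as follows. This is a simplified model under stated assumptions: the class name `SpeculativeStoreBuffer` is hypothetical, program order is represented by an integer sequence number, presence in the list stands in for the valid bit, and a committed store is written to the cache immediately rather than "eventually", so the speculative bit is not modeled.

```python
class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []   # dicts with seq (program order), addr, data

    def store_execute(self, seq, addr, data):
        self.entries.append({"seq": seq, "addr": addr, "data": data})

    def store_commit(self, seq, cache):
        for e in self.entries:
            if e["seq"] == seq:
                cache[e["addr"]] = e["data"]   # simplification: move at once
        self.entries = [e for e in self.entries if e["seq"] != seq]

    def store_abort(self, seq):
        self.entries = [e for e in self.entries if e["seq"] != seq]

    def load(self, seq, addr, cache):
        # Candidates: buffered stores to this address that precede the load.
        older = [e for e in self.entries
                 if e["addr"] == addr and e["seq"] < seq]
        if older:
            # Youngest store older than the load wins over the cache.
            return max(older, key=lambda e: e["seq"])["data"]
        return cache.get(addr)
```

A load therefore never sees data from a store that follows it in program order, even if that store has already executed.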

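Datapath: branch prediction and speculative execution
[Figure: PC -> Fetch -> Decode & Rename -> Reorder Buffer -> Commit. The execute section contains the register file, branch unit, ALU, and MEM with a store buffer and D$. Branch prediction steers the PC; branch resolution kills the instructions in the earlier stages and updates the predictors.]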