Training - PDF Free Download

计算机组织与系统结构存储系统 Memory System ( 第九讲 ) 程旭 2017.11.30

局部性原理局部性原理 : CPU 访问存储器时, 无论是取指令还是存取数据, 所访问的存储单元都趋于聚集在一个较小的连续区域中两种不同类型的局部性 : 时间局部性 (Temporal Locality): 如果一个信息项正在被访问, 那么在近期她很可能还会被再次访问 - 原因 : 程序循环堆栈空间局部性 (Spatial Locality): 在最近的将来将用到的信息很可能与现在正在使用的信息在空间地址上是临近的 - 原因 : 指令顺序执行数组存放

本讲提纲存储系统引论存储工艺技术 : SRAM 寄存器堆和 DRAM 高速缓冲存储器 (cache) 虚拟存储系统总结

教学目标 : 已经掌握的内容计算机的五个基本部件处理器输入控制存储器数据通路输出本讲主题 : 存储系统

存储系统的扩展图示处理器控制 Memory 数据通路 Memory Memory Memory Memory 速度 : 最快最慢容量 : 最小最大每位成本 : 最高最低

Typical Memory Hierarchy Datapath On-Chip Components Control RegFile Instr Data Cache Cache Second -Level Cache (SRAM) Third- Level Cache (SRA M) Main Memory (DRAM) Secondar y Memory (Disk Or Flash) Speed (cycles): ½ s 1 s 10 s 100 s-1000 1,000,000 s Size (bytes): 100 s 10K s M s G s T s Cost/bit: highest lowest Principle of locality + memory hierarchy presents programmer with as much memory as is available in the cheapest technology at the speed offered by the fastest technology

存储层次 : 工作原理在任何指定时间, 数据只能在相邻的两级之间拷贝 : 较高级 : 与处理器较近的存储级 - 较小较快使用较昂贵的技术工艺较低级 : 与处理器较远的存储级 - 较大较慢使用较廉价的技术工艺存储块 : 在两级存储层次中, 信息出现或者不出现的最小单位至处理器来自处理器较高级存储器 X 块较低级存储器 Y 块

存储层次 : 术语命中 (Hit): 数据在较高层的某一块中出现 ( 例如 X 块 ) 命中率 (Hit Rate): 在较高层发现存储访问的比率命中时间 (Hit Time): 访问在较高层命中数据的时间 RAM 访问时间 + 确定命中 / 失效的时间失效 (Miss): 需要从较低层中的块中找回数据 (Y 块 ) 失效率 (Miss Rate) = 1 - ( 命中率 ) 失效损失 (Miss Penalty): 替换较高层存储的一个数据块的时间 + 将该块交付给处理器的时间命中时间 << 失效损失至处理器来自处理器较高级存储器 X 块较低级存储器 Y 块

存储层次 : 如何工作? 时间局部性 (Temporal Locality): 如果一个信息项正在被访问, 那么在近期她很可能还会被再次访问将最近被访问的信息项放在离处理器较近的地方空间局部性 (Spatial Locality): 在最近的将来将用到的信息很可能与现在正在使用的信息在空间地址上是临近的将包括临近存储字的数据块一起移动到较高层存储中至处理器来自处理器较高级存储器 X 块较低级存储器 Y 块

Memory Address (one dot per access) Memory Reference Patterns Temporal Locality Spatial Locality Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971) Time

Principle of Locality Principle of Locality: Programs access small portion of address space at any instant of time (spatial locality) and repeatedly access that portion (temporal locality) What program structures lead to temporal and spatial locality in instruction accesses? In data accesses? What structures defeat temporal and spatial locality in instruction and data accesses?

Memory Reference Patterns Address n loop iterations Instruction fetches Stack accesses subroutine call argument access subroutine return Data accesses scalar accesses Time

现代计算机中的存储层次通过利用程序访问的局部性 : 向用户提供最廉价工艺实现的存储器的存储容量向用户提供最快速工艺可能实现的访问速度处理器控制数据通路寄存器堆片载 cache 第二级 Cache (SRAM) 主存 (DRAM) 二级存储 ( 磁盘 )

存储层次技术随机访问 : 随机是好事 : 对所有位置的访问时间是相同的 DRAM: Dynamic Random Access Memory - 高密低功耗价廉慢速 - 动态 : 需要定期刷新 SRAM: Static Random Access Memory - 低密高功耗昂贵快速 - 静态 : 如果不掉电, 内容将永久保持! 不是太随机的访问技术 : 访问时间因位置的不同而不同, 因访问的时刻不同而不同例如 : 磁盘磁带 CDROM 后续讲授内容将侧重于随机访问技术主存 : DRAMs Caches: SRAMs

随机存储器 (RAM) 技术为什么计算机设计人员需要了解 RAM 技术? 处理器的性能通常受到存储器代宽的限制随着集成电路密度的增加, 一些存储器将和处理器集成在同一芯片上 - 片载存储器来满足特殊需求 - 指令 cache - 数据 cache - 写缓冲器为什么不用触发器技术来实现 RAM? 密度 :RAM 需要更高的密度

1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Performance 微处理器 -DRAM 的延迟差距 1000 CPU Proc 60%/yr. 100 10 1 Processor-Memory Performance Gap: (grows 50% / year) DRAM DRAM 7%/yr. Time

静态 RAM 单元 6 管 SRAM 单元 0 1 word ( 行选 ) word 0 1 bit bit 写操作 : bit 1. 驱动位线 (bit) 2. 选择行读操作 : 拉高 1. 对两条位线预充电, 使得 bit Vdd 2. 选择行 3. 存储单元将一条线拉为低 4. 列上的信号放大器检测 bit 和 bit 之间的差异 bit

典型的 SRAM 组织 : 16 字 4 位 Din 3 Din 2 Din 1 Din 0 Precharge WrEn Wr Driver & Wr Driver & Wr Driver & Wr Driver & - Precharger + - Precharger + - Precharger + - Precharger + SRAM Cell SRAM Cell SRAM Cell SRAM Cell SRAM Cell SRAM Cell SRAM Cell SRAM Cell Word 0 Word 1 Address Decoder A0 A1 A2 A3 : : : : SRAM Cell SRAM Cell SRAM Cell SRAM Cell - Sense Amp + - Sense Amp + - Sense Amp + - Sense Amp + Word 15 Dout 3 Dout 2 Dout 1 Dout 0

典型 SRAM 的逻辑图 A N WE_L OE_L 2 N words x M bit SRAM M D 写使能信号通常是低电平有效 (WE_L) Din 和 Dout 是结合在一起的 : 需要一个新的控制信号输出使能信号 (OE_L) WE_L 有效 (Low), OE_L 禁止 (High) - D 为数据输入 WE_L 禁止 (High), OE_L 有效 (Low) - D 为数据输出 WE_L 和 OE_L 都有效 : - 结果不确定. 千万不要这样做!!!

典型的 SRAM 定时 A N WE_L OE_L 2 N words x M bit SRAM M D Write Timing: Read Timing: D Data In High Z Garbage Data Out Junk Data Out A Write Address Junk Read Address Read Address OE_L WE_L Write Hold Time Write Setup Time Read Access Time Read Access Time

进一步分析 SRAM 单元 6 管 SRAM 单元 word ( 行选择 ) word P1 P2 N1 N2 bit bit bit bit 通常 SRAM 具有许多存储字 ( 行 ) 位线 (bit lines) 就很长, 因而也就具有较大的电容晶体管 N1 N2 P1 和 P2 就必须非常小晶体管 N1 P1 没有足够的能量来快速驱动位线 (Bit): 需要增设一个信号放大器 (sense amplifier) 来比较 Bit 和 Bit

寄存器堆所用的双端口 ( 读 ) SRAM 单元 6-Transistor SRAM Cell SelA SelB SelA SelB P1 P2 N1 N2 b a b a 具有较短的位线 (bit lines) 和较大的晶体管 : N1 和 P1 能够很快地驱动位线 a N2 和 P2 能够很快地驱动位线 b 可以同时读取两个存储字

寄存器堆所用的单端口 ( 写 ) SRAM 单元 SelA SelB SelW w b a w 为了将新值写入存储单元 : 我们需要同时驱动两边每次只能写入一个存储字增设另外一对位线 (w 和 w) 可以同时进行读和写

双读端口单写端口的寄存器堆 : : busw<31> busw<1> busw<0> WrEn - Wr Driver + - Wr Driver + - Wr Driver + SelA0 Register Cell Register Cell Register Cell : : : Register Cell Register Cell Register Cell SelB0 SelW0 SelA31 SelB31 Address Decoder 5 5 5 Ra Rb Rw SelW31 busa<31> busa<1> busa<0> busb<31> busb<1> busb<0>

单管单元写操作 : 1. 驱动位线 2. 选择行读操作 : 1. 预充电, 使得位线 Vdd 2. 选择行 3. 单元和位线共享电荷位线 - 在位线上只有非常小的电压变化 4. 感应 ( 非常奇妙的感应放大器 ) - 可以检测到大约一百万电子伏特的变化 5. 写 : 恢复电压值刷新 1. 仅仅需要对每个单元进行一次假读操作行选择

DRAM 引论 Dynamic RAM (DRAM): 需要刷新密度非常高耗电非常低 ( 工作时 0.1~0.5 W, 等待 (standby)0.25 ~10 mw) 每位的成本非常低管脚敏感 : - 输出使能 (Output Enable: OE_L) - 写使能 (Write Enable:WE_L) addr - 行地址过滤 (Row address strobe: ras) - 列地址过滤 (Col address strobe:cas) log N 为了提高性能页模式操作 (Page mode operation) 2 r o w cell array N bits c o l sense 单感应放大器耗电较少, 面积小 D

传统的 DRAM 组成位线 ( 数据 ) 行译码器 RAM 单元阵列 RAM Cell Array 每个交叉点代表一个单管 DRAM 单元字选择 ( 行选择 ) 行地址列选择器 & I/O 电路列地址数据行和列地址在一起 : 每次选择一位

512 行典型的 DRAM 组成典型 DRAMs: 并行访问多位例如 : 2 Mb DRAM = 256K x 8 = 512 行 x 512 列 x 8 位行和列地址并行作用于所有 8 个位面 (planes) 位面 7 512 列位面 1 256 Kb DRAM 位面 0 256 Kb DRAM 的一个位面 256 Kb DRAM D<7> D<1> D<0>

典型 DRAM 的逻辑框图 RAS_L CAS_L WE_L OE_L A 256K x 8 DRAM 9 8 D 控制信号 (RAS_L, CAS_L, WE_L, OE_L) 都是低电平有效 Din 和 Dout 合并在一起 (D): WE_L 有效 ( 低 ), OE_L 禁止 ( 高 ) 时, - D 作为数据输入管脚 WE_L 禁止 ( 高 ), OE_L 有效 ( 低 ) - D 作为数据输出管脚行和列地址共享相同的一组管脚 (A) RAS_L 变成低 : 管脚 A 被锁定为行地址 CAS_L 变成低 : 管脚 A 被锁定为列地址

周期时间与访问周期访问时间周期时间时间 DRAM ( 读 / 写 ) 周期时间 >> DRAM ( 读 / 写 ) 访问时间 DRAM ( 读 / 写 ) 周期时间 : 我们可以以多快的频率来开始进行存储访问? 比喻 : 我们只能在 4x 的年度的夏天, 才能收看到奥运会足球赛 DRAM ( 读 / 写 ) 访问时间 : 一旦我们开始进行访问, 那么要过多长时间可以获得数据? 比喻 : 在奥运会期间, 一旦我们想看, 最多等一天就可以收看到下一场比赛 DRAM 的带宽限制 : 比喻 : 如果我们 2002 年还想看新的世界级足球比赛?

增加带宽交叉访问 (Interleaving) 非交叉访问的访问模式 : CPU Memory 得到 D1 开始访问 D1 四路交叉访问的访问模式 : 访问体 0 访问体 1 访问体 2 开始访问 D2 访问体 3 我们可以再次访问体 0 CPU Memory Bank 0 Memory Bank 1 Memory Bank 2 Memory Bank 3

N 行快速页模式 (Fast Page Mode)DRAM 常规 DRAM 组成 : N 行 x N 列 x M 位同时读和写 M 位每 M 位访问需要一个 RAS / CAS 周期列地址 N 列 DRAM 行地址快速页模式 DRAM N x M 寄存器来保存一行 M 位输出 M 位 RAS_L CAS_L 第一个 M 位访问第二次 M 位访问 A Row Address Col Address Junk Row Address Col Address Junk

N 行快速页模式操作列地址 N 列快速页模式 DRAM N x M SRAM 来保存行 DRAM 行地址在读取一行到寄存器后仅仅需要 CAS 来访问该行中的其他 M 位存储块在 RAS_L 保持有效, 同时 CAS_L 不断变化 M 位输出 N x M SRAM M 位第一个 M 位访问第二个 M 位访问第三个 M 位访问第四个 M 位访问 RAS_L CAS_L A Row Address Col Address Col Address Col Address Col Address

快速页模式操作 DRAM 性能指标 :(x-y-y-y, 例如 6-3-3-3) x:first data access time in clock/bus cycles y:successive burst data access time in clock/bus cycles

EDO DRAM(Extended Data Out) (20%-40% 性能提升 ) EDO DRAM 性能指标 :5-2-2-2 at 66MHz

Burst EDO DRAM

SDRAM(Synchronous DRAM) 基于 DRAM 的技术 (CAS RAS,etc) 允许在一个 DIMM 中包含多个 BANK DIMM SDRAM 168 pin 增加了 ba0 ba1 两个管脚与 CPU 或芯片组使用同步时钟信号五组控制信号, 可组成多种命令 CS:chip select RAS:raw address select CAS:col address select WE:write enable DQM:output enable 更好的支持 Burst 方式可编程设置模式 : Bust length,sequence...

SDRAM(Synchronous DRAM) SDRAM Mode Register

SDRAM read

SDRAM performance CAS Latency is important x-y-y( 例如 :3-2-2) CAS Latency the RAS-to-CAS delay RAS precharge time 时钟主频 PC66:66MHz PC100:100MHz PC133:133MHz

DDR SDRAM DDR:Double data rate 时钟上升沿和下降沿均可以发送数据 ( 带宽 X2!!) 在原有的 SDRAM 的架构基础上加以较小的改进 ( 可复用原有生产线 ) SDRAM 和 DDR 均为开放标准 (JEDEC)(Important!!) SDRAM DDR

DDR-SDRAM Timing Diagram

图 1: 图 2: SDRAM 的 Bank 和内存规范 4M X 1bit X 32chip 4bank in a dimm SIMM DIMM single/doul in-line memory module 目前使用的都是 DIMM 时钟频率 PC1600 100MHz - 100 2 8 MB/s PC2100 133MHz PC2400 150MHz 图 1 图 2

DDR3 SDRAM

DDR vs DDR2 vs DDR3 vs DDR4 All about increasing the rate at the pins Not an improvement in latency In fact, latency can sometimes be worse Internal banks often consumed for increased bandwidth DDR4 (January 2011) Samsung, Currently 2.13Gb/sec Target: 4 Gb/sec 4/18/2011 46

DRAM Packaging Clock and control signals Address lines multiplexed row/column address ~7 DRAM chip ~12 Data bus (4b,8b,16b,32b) DIMM (Dual Inline Memory Module) contains multiple chips arranged in ranks Each rank has clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips), and data pins work together to return wide word e.g., a rank could implement a 64-bit data bus using 16x4-bit chips, or a 64-bit data bus using 8x8-bit chips. A modern DIMM usually has one or two ranks (occasionally 4 if high capacity) A rank will contain the same number of banks as each constituent chip (e.g., 4-8)

Main Memory Overview

DRAM: Banks Enable concurrent DRAM accesses (overlapping)

2Gb x8 DDR3 Chip [Micron] Observe: bank organization

Observe: row width, 64 8 bit datapath

小结两种不同类型的程序局部性时间局部性 (Temporal Locality): 如果正在访问某一信息项, 那么很可能近期还可能再次访问它空间局部性 (Spatial Locality): 如果正在访问某一信息项, 那么很可能近期还可能访问它附近的其他信息为了利用程序的局部性原理 : 向用户提供最廉价的工艺技术所能提供的那么大的存储容量向用户提供最快速的工艺技术所能提供的那么快的访问速度 DRAM 速度虽慢, 但是却价廉高密 : 向用户提供大存储系统的较好选择 SRAM 速度虽快, 但是昂贵密度也不很高 : 向用户提供快速访问时间的较好选择

微处理器 - 主存 (DRAM) 的延迟差距 1980 microprocessor executes ~one instruction in same time as DRAM access 2017 microprocessor executes ~1000 instructions in same time as DRAM access Slow DRAM access has disastrous impact on CPU performance!

Adding Cache to Computer Processor Control Enable? Read/Write Memory Input Datapath PC Registers Address Write Data Cache Program Bytes Memory (including cache) organized around blocks, which are typically multiple words Arithmetic & Logic Unit (ALU) Read Data Data Output Processor organized around words and bytes Processor-Memory Interface I/O-Memory Interfaces 54

Chip Photos

Alpha 微处理器 Time of a full cache miss in instructions executed: 1st Alpha : 340 ns/5.0 ns = 68 clks x 2 or 136 2nd Alpha : 266 ns/3.3 ns = 80 clks x 4 or 320 3rd Alpha : 180 ns/1.7 ns =108 clks x 6 or 648 1/2X latency x 3X clock rate x 3X Instr/clock?X

Cache 如何工作? 时间局部性 (Temporal Locality): 如果一个信息项正在被访问, 那么在近期她很可能还会被再次访问将最近被访问的信息项放在离处理器较近的地方空间局部性 (Spatial Locality): 在最近的将来将用到的信息很可能与现在正在使用的信息在空间地址上是临近的将包括临近存储字的数据块一起移动到较高层存储中至处理器来自处理器较高级存储器 (Cache) X 块较低级存储器 ( 主存 ) Y 块

最简单的 Cache: 直接映射 Cache 主存地址 0 1 2 3 4 5 6 7 8 9 A B C D E F 主存 4 字节直接映射 Cache Cache 索引 (Index) 0 1 2 3 位置 0 可以被来自以下位置的数据占用 : 主存位置 0 4 8,... 等等一般情况 : 任何地址最低两位为 0 的主存位置地址 <1:0> cache 索引我们应该将哪一个放置在 cache 中? 我们如何知道到底哪一个在 cache 中呢?

Cache 标签 (Tag) 和 Cache 索引 (Index) 假设存储器字节地址为 32 位 : 2 N 字节的直接映射 cache: - Cache 索引 : 存储器地址的较低 N 位 31 - Cache 标签 : 存储器地址的较高 (32 - N) 位 Cache 标签假设为 0x50 N Cache 索引假设为 0x03 0 有效位 Valid Bit : 作为 cache 状态的一部分保存起来 0x50 : 2 N 字节的直接映射 cache 字节 0 字节 0 字节 0 字节 0 : 0 1 2 3 字节 2 N -1 2 N- 1

Cache 访问示例开始 :Cache 空访问 000 01 ( 失效 ) V 标签数据访问 000 01 ( 命中 ) 000 M [00001] 010 M [01010] 写 Tag & Set V 失效处理 : 装载数据访问 010 10 ( 失效 ) 000 M [00001] 访问 010 10 ( 命中 ) 000 M [00001] 010 M [01010] 装载数据写 Tag & Set V 000 M [00001] 010 M [01010] 人生名言 : 失败是成功之母 : 义务失效 (Compulsory Misses) - ( 冷启失效 )

Cache 数据块的定义 Cache 块 : 相同的 cache 标志所指明的 cache 数据进一步分析前面介绍的示例 : 4 字节直接映射 cache: 块大小 = 1 Byte 可以利用时间局部性 : 如果正在访问某一字节, 那么在最近的将来还可能再次访问这一字节不能利用空间局部性 : 如果正在访问某一字节, 那么在最近的将来还可能访问这一字节邻近的字节为了利用空间局部性 : 增加数据块的大小有效位 Cache 标签直接映射 Cache 数据字节 0 字节 1 字节 2 字节 3

Direct-Mapped Cache Example One word blocks, cache size = 1K words (or 4KB) 31 30... 13 12 11... 2 1 0 Byte offset Valid bit ensures something useful in cache for this index Compare Tag with upper part of Address to see if a Hit Hit Tag 20 10 Index Index Valid 0 1 2... 1021 1022 1023 Tag 20 Data 32 Comparator Data Read data from cache instead of memory if a Hit What kind of locality are we taking advantage of?

Multiword-Block Direct-Mapped Cache Four words/block, cache size = 1K words Hit 31 30... 13 12 11... 4 3 2 1 0 Byte offset Data Tag 20 Index 8 2 Word offset IndexValid 0 1 2... 253 254 255 Tag 20 Data What kind of locality are we taking advantage of? 32

块大小的权衡一般而言, 数据块较大有利于利用空间局部性的优势, 但是 : 较大的数据块较大的失效损失 : - 需要较长的时间来填充数据块如果数据块的大小相对于 cache 大小而言太大了, 那么失效率也将增加平均访问时间等于 : 命中时间 x (1 - 失效率 ) + 失效损失 x 失效率失效损失失效率开发空间局部性数据块较少 : 弥补了时间局部性平均访问时间增加了失效损失和失效率块大小块大小块大小

另一个极端的示例有效位 Cache 标签 Cache 数据 Byte 3 Byte 2 Byte 1 Byte 0 0 Cache 大小 = 4 bytes 块大小 = 4 bytes cache 中只能有唯一一个数据块真 : 如果正在访问一个信息项, 那么很可能在最近的将来还将访问该信息项但是, 通常却不会立即再次使用该信息项!!! 那么, 再次访问该信息项时, 仍然很可能会失效 - 继续向 cache 中装入数据, 但是在它们被再次使用之前会被强行排出 cache - cache 设计人员最害怕出现的事情 : 乒乓现象冲突失效 (Conflict Misses) 是下述原因导致的失效 : 不同的存储位置映射到同一 cache 索引 - 解决方案 1: 增大 cache 容量 - 解决方案 2: 对同一 Cache 索引可以有多个信息项

两路组相联 Cache 有效位 N 路组相联 : 对每个 Cache 索引有 N 个数据块 (entries) N 个直接映射 cache 并行工作示例 : 两路组相联 cache Cache 索引来选择 cache 中的某一组该组种的两个标签进行并行比较根据标签比较的结果来选择数据 Cache 标签 Cache 数据 Cache 块 0 : : : Cache 索引 Cache 数据 Cache 块 0 : Cache 标签 : 有效位 : 地址标签比较 Sel1 1 Mux 0 Sel0 比较 Hit 或 Cache 块

组相联 Cache 缺点 N 路组相联 Cache 与直接映射 Cache: N 个比较器与 1 个比较器在输出数据时, 增加了额外的多路选择延迟在判断命中 / 失效时候, 才可以得到数据在直接映射 Cache 中, Cache 块在命中 / 失效之前就可以得到 : 可以先假设命中并继续执行, 如果以后发现失效再恢复有效位 Cache 标签 Cache 数据 Cache 块 0 Cache 索引 Cache 数据 Cache 块 0 Cache 标签有效位 : : : : : : 地址标签比较 Sel1 1 Mux 0 Sel0 比较或 Hit Cache 块

另一个极端的示例 : 全相联 : : 全相联 Cache -- 将组相联的思想推广到极限! 完全不要 Cache 索引并行地比较所有的 Cache 标签示例 : 块大小 = 32 字节, 需要 N 个 27 位的比较器根据定义 : 对于全相联 cache, 冲突失效 = 0 31 Cache 标签 (27 位长 ) 4 0 字节选择示例 : 0x01 Cache 标签 X X X X X 有效位 : : Cache 数据字节 31 字节 1 字节 63 字节 33 : 字节 0 字节 32

6 9 Miss Rate vs. Cache Size on the Integer Portion of SPECCPU2000

Cache 失效原因小结义务失效 (Compulsory cold start first reference): 第一次访问某一信息块人生名言 : 失败是成功之母冲突失效 (Conflict collision): 多个存储器位置映射到同一 cache 位置解决方案 1: 增加 cache 容量解决方案 2: 增加相联度容量失效 (Capacity): Cache 并不能保存程序访问所需的所有数据块解决方案 : 增加 cache 容量无效 (Invalidation): 其他进程 ( 例如, I/O) 修改存储器

3Cs Analysis Three sources of misses (SPEC2000 integer and floating-point benchmarks) Compulsory misses 0.006%; not visible Capacity misses, function of cache size Conflict portion depends on associativity and cache size

Cache 失效原因问答直接映射 N 路组相联全相联 Cache 容量义务失效冲突失效容量失效无效失效大中小高中低高中无低中高相同相同相同注释 : 如果我们运行上百万条的指令, 那么义务失效就微不足道了!

三种映射策略比较直接映射 Cache: 每个存储器位置只能映射到为一一个 cache 位置在定位时, 不需要进行任何判断 - 当前的信息项替换掉 Cache 中该位置上的前一个信息项 N 路组相联 Cache: 每个存储器位置可以选择 N 个 cache 位置之一全相联 Cache: 每个 memory 位置可以存放在任何 cache 位置 N 路组相联或全相联 Cache 中的访问失效 : 从存储器中取出新的信息块为给新块腾出空间, 替换出一个 cache 数据块于是, 就必须确定到底需要替换掉 Cache 中的哪个数据块!

Cache 块替换策略随机替换策略 : 硬件随机选择一个 Cache 数据项替换出去最近最少使用策略 (Least Recently Used): 硬件纪录访问历史信息替换出离现在最长时间没有使用的数据项一种简单的伪 LRU 策略的实现 : 假设有 64 个全相联的数据块替换指针硬件替换指针指向一个 cache 数据块每次对该数据块进行了一次访问, 就将该指针 : - 移向下一个数据项否则 : 不移动该指针数据块 0 数据块 1 : 数据块 63

Cache 写入策略 : 写穿透与写返回 Cache 读操作比 cache 写操作更容易处理 : 指令 cache 比数据 cache 更容易设计 Cache 写操作 : 如何保证 Cache 和存储器中数据的一致性? 两种选择 : 写返回 (Write Back): 只写入 Cache 当出现 Cache 失效某 Cache 数据块需要替换出 Cache 时, 才将该 Cache 块的数据回写到存储器中 - 对每个 Cache 数据块, 需要增加一个脏位 ( Dirty bit) - 极大地减小存储器的带宽需求 - 控制可能很复杂写穿透 (Write Through): 同时写入 cache 和存储器 - 什么!!! 这样可以吗? 这样, 存储器不是太慢了吗?

写穿透的写缓冲器 Processor Cache DRAM Write Buffer 在 Cache 和存储器之间需要一个写缓冲器处理器 : 将数据写入 cache 和写缓冲器存储器控制器 : 将缓冲器中的内容写入存储器写缓冲器就像一个先进先出栈 (FIFO): 典型数据项的数目 :4 可以很好地工作, 如果 : 存储操作的频率 ( 写时间 ) << 1 / DRAM 写周期存储器设计人员最害怕的事情 : 存储频率 ( 写时间 ) 1 / DRAM 写操作写缓冲器饱和

写缓冲器饱和 Processor Cache DRAM Write Buffer 存储操作频率 ( 写时间 ) 1 / DRAM 写周期时间如果这种情况持续很长时间 (CPU 周期时间太快和 / 或连续来了很多存储指令 ): - 无论设置了多大的写缓冲器, 它都有可能溢出 - CPU 时钟周期时间 <= DRAM 写周期时间对写缓冲器饱和的解决方案 : 使用一个写返回 cache 安装一个第二级 (L2) cache: Processor Cache L2 Cache DRAM Write Buffer

写分配与写不分配 : : : 假设 : 将一个 16 位数据写到存储器位置 0x0, 它产生了失效那么, 我们是否需要读取该块 ( 字节 2 3... 31) 的其他数据? 是的 : 写分配 (Write Allocate) 不是 : 写不分配 ( Write Not Allocate) 31 Cache Tag Example: 0x00 9 Cache Index Ex: 0x00 4 Byte Select Ex: 0x00 0 Valid Bit : Cache Tag 0x00 : Cache Data Byte 31 Byte 1 Byte 63 Byte 33 : Byte 0 Byte 32 0 1 2 3 Byte 1023 Byte 992 31

什么是一个子块? SB3 s V Bit SB2 s V Bit SB1 s V Bit SB0 s V Bit 子块 (Sub-block): 一个数据块中具有自己的有效位的单位数据例如 :1 KB 直接映射 Cache 32-B 数据块 8-B 子块 - 每个 cache 数据项将具有 : 32/8 = 4 个有效位写失效 : 只引进该子块中的数据 Cache 标签 : : : : : Cache 数据 B31:B24 Sub-block3 Sub-block2 : Sub-block1 B7 :B0 Sub-block0 0 1 2 3 Byte 1023 Byte 992 31

减少存储器传输时间 CPU CPU CPU Cache bus M mux cache bus M Cache bus M M M M 第一种解决方案高带宽 DRAM 例如 : 页模式 DRAM SDRAM DDRRAM RAMbus 第二种解决方案存储器和 Cache 之间宽数据通路 Cost 第三种解决方案存储模块交叉访问

Memory Hierarchy Processor Inner Levels in memory hierarchy Outer Level 1 Level 2 Level 3... Level n Increasing distance from processor, decreasing speed Size of memory at each level As we move to outer levels the latency goes up and price per bit goes down.

Local vs. Global Miss Rates Local miss rate: fraction of references to a given level of a cache that miss Local Miss rate L2$ = L2$ Misses / L1$ Misses = L2$ Misses / total_l2_accesses Global miss rate: the fraction of references that miss in all levels of a multilevel cache and go all the way to memory L2$ local miss rate >> than the global miss rate

Local vs. Global Miss Rates Local miss rate the fraction of references to one level of a cache that miss Local Miss rate L2$ = $L2 Misses / L1$ Misses Global miss rate the fraction of references that miss in all levels of a multilevel cache L2$ local miss rate >> than the global miss rate Global Miss rate = L2$ Misses / Total Accesses = (L2$ Misses / L1$ Misses) (L1$ Misses / Total Accesses) = Local Miss rate L2$ Local Miss rate L1$ AMAT = Time for a hit + Miss rate Miss penalty AMAT = Time for a L1$ hit + (local) Miss rate L1$ (Time for a L2$ hit + (local) Miss rate L2$ L2$ Miss penalty)

8 4 Multilevel Cache Considerations Different design considerations for L1$ and L2$ L1$ focuses on fast access: minimize hit time to achieve shorter clock cycle, e.g., smaller $ L2$, L3$ focus on low miss rate: reduce penalty of long main memory access times: e.g., Larger $ with larger block sizes/higher levels of associativity Miss penalty of L1$ is significantly reduced by presence of L2$, so can be smaller/faster even with higher miss rate For the L2$, fast hit time is less important than low miss rate L2$ hit time determines L1$ s miss penalty L2$ local miss rate >> than the global miss rate

8 5 L1 Cache: 32KB I$, 32KB D$ L2 Cache: 256 KB L3 Cache: 4 MB Cache Buster

Real world example caches

8 7 CPI/Miss Rates/DRAM Access SpecInt2006 Data Only Data Only Instructions and Data

8 8 Skylark: Intel s Recent Generation Laptop/Tablet Class CPUs

小结 : 程序局部性原理 : 在任何一段时间内, 程序都趋于访问较小的一段地址空间 - 时间局部性 (Temporal Locality) - 空间局部性 (Spatial Locality) 三种主要的 Cache 失效类型 : 义务失效 (Compulsory Misses): 例如, 冷启失效冲突失效 (Conflict Misses): 增加 cache 容量和 / 或相联度最要命的现象 : 颠簸现象! 容量失效 (Capacity Misses): 增加 cache 容量写入策略 (Write Policy): 写穿透 (Write Through): 需要一个写缓冲器要命的现象 : 写缓冲器饱和! 写返回 : 控制复杂

容量访问时间成本 CPU 寄存器 1000s Bytes <1s ns Cache K Bytes 1~10 ns $0.01~0.001/bit 主存 M Bytes 100ns-1us $0.01~0.001 磁盘 T Bytes 毫秒 10-5 ~10-6 美分磁带无限秒 ~ 分 10-6 存储层次的级别寄存器堆 Cache 存储器磁盘磁带指令操作数块 (Blocks) 页 (Pages) 文件 (Files) 分段传输单位程序 / 编译器 1-8 字节 Cache 控制 8-128 字节操作系统 512-4K 字节较高级别较快用户 / 操作员 Mbytes 较大较低级别

存储所有我们已读 ( 写 ) 听 ( 说 ) 看的信息人类数据类型 / 小时 / 天 (/4 年 ) / 一生阅读文字 ( 含少量图片 ) 200 K 2-10 M/G 60-300 G 以 120wpm 说 43 K 0.5 M/G 15 G 以 1KBps 说 3.6 M 40 M/G 1.2 T 50Kb/s POTS 的视频 22 M.25 G/T 25 T 200Kb/s VHS 质量的视频 90 M 1 G/T 100 T 4.3Mb/s HDTV/DVD 视频 1.8 G 20 G/T 1 P

虚拟存储提供一种理想中的非常非常大的存储器许多工作所需存储器的总和大于实际的物理存储器每个工作的地址空间大于实际的物理存储器使得可利用的 ( 快速昂贵 ) 的物理存储器得以很好地利用简化存储器的管理 ( 当今, 使用虚拟存储技术的主要原因 ) 使用存储层次, 保持平均访问时间很低至少包括两级存储层次 : 主存和二级存储虚拟地址程序员使用的地址虚拟地址空间上述地址的集合存储器地址物理存储器中存储字的地址, 也称为物理地址或实地址 ( Real address)

虚拟存储系统设计的基本问题从二级存贮器向主存传输信息块的大小如果某信息块需要引入主存 M, 但是 M 已经装满, 那么 M 中的一些区域就必须释放, 以为新的信息块腾出空间替换策略 (replacement policy) M 中的哪个区域将安放新的信息块放置策略 (placement policy ) 失效的信息项只有在出现页面失效 (fault) 时, 才从第二级存储器中取出读取 / 装入策略 (fetch/load policy) reg cache mem disk frame pages 分页组织虚拟和物理地址被划分为同等大小的信息块页帧 (page frames) 页 (pages)

Paged Memory Page Table Memory (DRAM) Page N Page Table Page Table Page 2 Page 1 Page 0 Each process has a dedicated page table. Physical memory non-consecutive.

Paged Memory Address Translation OS keeps track of which process is active Chooses correct page table Memory manager extracts page number from virtual address Looks up page address in page table Virtual address (e.g. 32 Bits) Physical address page number page table entry offset offset Physical addresses may (but do not have to) have more or fewer bits than virtual addresses Computes physical memory address from sum of Page address and Offset (from virtual address)

Protection Assigning different pages in DRAM to processes also keeps them from accessing each others memory Isolation Page tables handled by OS (in supervisory mode) Sharing is also possible OS may assign same physical page to several processes

Write Protection Write protected Page Table Memory (DRAM) Page N 0 0 1 Page Table 1 0 0 1 Page Table 0 Page 2 Page 1 Page 0 Exception when writing to protected page (e.g., program code)

地址映射 V = {0, 1,..., n - 1} 虚拟地址空间 M = {0, 1,..., m - 1} 物理地址空间 n > m 映射 : V M U {0} 地址映射函数 MAP(a) = a 如果在虚拟地址 a 的数据出现在物理地址 a, 并且 a 在 M 中 = 0 如果在虚拟地址 a 的数据不出现在 M 中处理器 a 名字空间 V 失效信息项故障缺页处理 a 地址变换机制 0 a' 主存二级存储物理地址 OS 完成这一传输

分页组织物理地址 0 1024 7168 页帧 0 1 7 物理存储器 1K 1K 1K 地址变换映射 0 1024 页 0 1 1K 1K 映射单位也是从虚拟存储器到物理存储器的传输单位 31744 31 1K 虚拟存储器地址映射 10 虚拟地址页号偏移量页表基址寄存器到页表的索引 V 页表访问权限 PA + 实际上, 经常采用串联方式位于物理存储器的表物理存储器地址

地址映射算法 If V = 1 then 所需页存放在页表中存放的页帧地址指示的主存地址中 else 该虚拟地址指示的页存放在第二级存储器中访问权限 R = 只读, R/W = 读 / 写, X = 只可执行 If 访问的类别与指定的访问权限不相符, then 保护违例故障 (protection_violation_fault) If 没有设置有效位 (V)then 缺页 (page fault)

地址映射算法保护故障 (Protection Fault): 访问权限违例 ( access rights violation); 向硬件微码或软件故障处理机制发出自陷处理请求缺页 (Page Fault): 所需页没有驻留在物理存储器中, 也将导致自陷 ; 通常还伴随着上下文切换 (context switch): 当从第二级存储中访取页面时, 当前进程挂起例如,VAX 11/780 每个进程可以看到 4 G (2 32 字节 ) 的虚拟地址空间 1/2 用于用户区域,1/2 用于被所有进程共享的系统区域页面大小为 512 字节

碎片和再分配碎片 (Fragmentation) 是由于某种原因, 存储空间中的一些区域变得不可利用再分配 (Relocation): 将程序或数据移动到地址空间中一个新的区域 ( 可能需要调整所有的指针 ) 外部碎片 (External Fragmentation): 数据块之间的剩余空间内部碎片 (Internal Fragmentation): 程序大小通常并不是页面大小的整数倍, 最后一个页帧中就会出现浪费 0 1. 已占用.. k-1

最佳页面大小选择页面大小使得碎片最少页面大小大内部碎片变得严重但是增加页面数量 / 名字空间更大的页表一般而论, 通常需要增加页面大小, 这是因为, 随着 RAM 价格下降, 存储器变得更大处理器速度和磁盘速度之间的差距变大编程人员需要越来越大的虚拟地址空间当今, 大多数机器的页面大小为 4K 字节, 另外, 页面大小有继续增加的趋势

碎片 ( 续 ) 页表碎片 (Table Fragmentation): 当页表非常大时, 就出现了页表碎片, 这是因为虚拟地址空间很大 ; 直接映射页表可能占据很大的存储空间例如 : VAX 系统结构注意 : 这意味着页表将需要 2 21 表项, 每个表项 4 字节长, 总共 8 M 字节可供选择的方案 : 21 9 XX 页号偏移量 00 用户进程的 P0 区域 01 用户进程的 P1 区域 10 系统名字空间 (1) 硬件相联映射 : 每个页帧需要一个表项 (O( M )), 而不是每个页一个表项 (O( N )) (2) 基于散列表的软件方案 ( 反向页表 :inverted page table) 页号偏移量是否出现访问权限页号物理地址页表相联存储或散列查找页号 ( 页号域 )

Size of Page Tables E.g., 32-Bit virtual address, 4-KiB pages Single page table size: - 4 x 2 20 Bytes = 4-MiB - 0.1% of 4-GiB memory Total size for 256 processes (each needs a page table) - 256 x 4 x 2 20 Bytes = 256 x 4-MiB = 1-GiB - 25% of 4-GiB memory! What about 64-bit addresses? How can we keep the size of page tables reasonable?

Options Increase page size E.g., doubling page size cuts PT size in half At the expense of potentially wasted memory Hierarchical page tables With decreasing page size Most programs use only fraction of memory Split PT in two (or more) parts

Hierarchical Page Table Exploits Sparsity of Virtual Address Space Use Physical Memory Virtual Address 3 1 2 2 1 1 p12 1 p2 2 1offset 0 10-bit L1 index 10-bit L2 index Root of the Current Page Table p1 p2 (Processor Register) Level 1 Page Table Page size 10b 1024 x 4096B page in primary memory page in secondary memory Level 2 Page Tables 12b 4096B PTE of a nonexistent page

页替换算法就像 Cache 块替换算法一样! 最近最少使用 : 选择出最近最少使用的页面来替换需要有关过去访问页面的信息, 非常难以实现 ( 将页表项根据从最近最多访问到最近最少访问的次序排序 ; 当一个页面被访问时, 将它移动到列表的头部 ; 于在列表尾部的页表项就是将被替换出的页表项 ) 性能很好, 遵从了局部性原则

页面替换 ( 续 ) 非最近使用 : 与每个页面相关联的是一个访问标志 : 访问标志 (reference flag) = 1 如果该页在最近的过去被访问过 = 0 否则如果需要替换, 选择任意一个访问标志为 0 的页帧替换掉该页是在最近的过去没有被访问过的页非最近使用 (NRU) 替换策略的实现 : 页表表项 1 0 1 0 1 0 0 0 访问位页表表项最近替换指针 (lrp) 如果需要进行替换, 将 lrp 前移到下一个页表项 ( 与表的大小取模 ) 直到发现一个具有 0 位的页表项 ; 这就是要替换的目标 ; 在上述工作的同时, 将所有检查过的页表项的访问位置为 0 一种优化的策略是查找最近既没有被访问, 又不脏的页

按需调页和预取页面访取策略 (Fetch Policy) 何时将该页换入存储器? 如果只有在缺页时才将页面调入, 那么这种策略就称为按需调页 (demand paging) 另外一种选择方案是预取 (prefetching): 预测未来的访问, 在实际使用这些页面之前, 预先装入 + 减少了页面传输的开销 - 将移出已经在页帧中的页, 从而不利于缺页率 - 预测未来访问通常很困难绝大多数系统只实现了按需调页, 而没有实现预测

虚拟地址和 Cache CPU VA PA miss Translation Cache hit Main Memory data 需要一次额外的存储访问来将虚拟地址变换成物理地址这使得 cache 访问的代价非常高, 而且这也是我们希望尽可能快得执行的最内层循环思考 : 为什么总是用物理地址来访问 cache? 虚拟地址 cache 有一个很难对付的问题! 同义问题 (synonym problem): 两个不同的虚拟地址可以映射到同一物理地址 => 两个不同的 cache 表项包含对同一物理地址的数据! 对于更新 : 必须更新所有具有同一物理地址的 cache 表项或者允许存储器不一致确定上述情况需要大量的硬件, 特别是需要对物理地址标签进行相联查找以确定是否出现多次命中

Translation Lookaside Buffers (TLB) Address translation is very expensive! In a two-level page table, each reference becomes three memory accesses Solution: Cache some translations in TLB TLB hit TLB miss Single-Cycle Translation Page-Table Walk to refill virtual address VPN offset V D tag PPN (VPN = virtual page number) (PPN = physical page number) hit? physical address PPN offset

快表 (TLB) 加速地址变换的一种方式是使用一种关于最近使用页表表项的特殊 cache 它有很多名字, 但最经常使用的名字是变换旁视缓冲器 (Translation Lookaside Buffer:TLB) 或快表! 虚拟地址物理地址脏位访问有效位访问方式 TLB 访问时间与 Cache 访问时间是可比的 ( 但要小一些 ) ( 也比主存访问时间要小得多 )

变换旁视缓冲器就像其他的 cache, TLB 可以是全相联的组相联的, 也可以是直接映射的 TLB 一般较小, 通常即使在高性能机器中也只有 128-256 个表项这样就可以实现全相联查找大多数中等性能的机器使用 n 路组相联的组成使用 TLB 的地址变换 CPU hit VA PA miss TLB Cache Lookup miss Translation hit Main Memory 1/2 t data t 20 t

VM-related Events in Pipeline PC Ins t TLB Inst. Cache D Decode E + M Data TLB Data Cache W TLB miss? Page Fault? Protection violation? TLB miss? Page Fault? Protection violation? Handling a TLB miss needs a hardware or software mechanism to refill TLB Usually done in hardware Handling a page fault (e.g., page is on disk) needs a precise trap so software handler can easily resume after retrieving page Protection violation may abort process

Page-Based Virtual-Memory Machine (Hardware Page-Table Walk) Page Fault? Protection violation? Virtual Address PC Inst. TLB Physical Address Inst. Cache D Decode E + M Page Fault? Protection violation? Virtual Address Data TLB Physical Address Data Cache W Miss? Page-Table Base Register Miss? Hardware Page Table Walker Physical Address Memory Controller Main Memory (DRAM) Physical Address Physical Address Assumes page tables held in untranslated physical memory

Address Translation: putting it all together Virtual Address TLB Lookup hardware hardware or software software miss hit Page Table Walk Protection Check the page is memory memory denied permitted Page Fault (OS loads page) Update TLB Protection Fault Physical Address (to cache) Where? SEGFAULT

Modern Virtual Memory Systems Illusion of a large, private, uniform store Protection & Privacy Several users/processes, each with their private address space Demand Paging Provides the ability to run programs larger than the primary memory OS user i Swapping Space (Disk) Primary Memory Hides differences in machine configurations The price is address translation on each memory reference VA mapping PA

减少变换时间具有 TLB 的机器可以进一步降低时钟周期数 / Cache 访问可以将 Cache 访问与 TLB 访问重叠由于使用虚拟地址中的高位来查找 TLB, 而用低位来作为访问 Cache 的索引, 因而上述策略可以工作

Cache 和 TLB 访问重叠 32 TLB 相联查找 index Cache 1 K 10 2 4 bytes 00 PA Hit/ Miss 20 12 page # disp PA Data Hit/ Miss IF cache 命中 AND (cache 标签 = 物理地址 ) THEN 将数据提交给 CPU ELSE IF [cache 失效 OR (cache 标签 = 物理地址 )] AND TLB 命中 THEN 用 TLB 中的物理地址来访问存储器 ELSE 做正常的虚拟地址变换 =

重叠 TLB 访问的问题重叠的访问只能当用于对 cache 进行索引的地址位不会作为虚拟地址变换的结果而改变时, 才能正常工作这通常只适用于小 cache 和大页面大小或者高 n 路组相联的大 cache 例如 : 假设 cache 的容量从 4K 增长到 8K, 而其他都不变 11 2 cache 索引 00 20 12 虚拟页号偏移量解决方案 : 变成 8K 字节的页面大小变成 2 路祖相联 cache ( 仍然使用 10 位索引 ) 这一位被虚拟地址变换所改变, 但是 cache 查找时也需要 1K 10 2 路组相联 cache 4 4

再论选择页面大小增大页面大小的原因页表大小与页大小成反比 ; 因而可以节省存储器当 cache 容量 = 页面大小时, 易于加快 cache 命中时间 ; 较大的页面使得单页 cache 是可行的在二级存储与存储器之间传输较大的页 ( 或者在网络上传输 ) 是比较有效的 TLB 中表项的数量受到时钟周期时间的限制, 因此较大的页可以映射更多的存储空间, 从而减少 TLB 的失效减小页面大小的原因不浪费存储 ; 页内的数据地址必须是连续的对较小的进程可以加快进程的启动? 混合方案 : 多页面大小 Alpha: 8KB 64KB 512 KB 和 4 MB 大小的页面

分段 ( 参见 x86) 相对于分页 ( 通常与分页结合使用 ) 每个程序模块分配一段 ; 可能具有不同的大小 ; 段是物理存储器和磁盘之间传输的基本单位段号偏移量 Present 访问方式段长物理始地址 BR 段表物理地址 + 出现位故障 : 段失效 (Present = 0) 溢出 (Displacement 超出了段长 ) 保护违例 ( 访问与段保护不相符 ) 基于段的寻址通常用于实现存储器的能力 (capabilities), 例如, 硬件支持复杂的保护机制

缺点 : 基于段的寻址方式 (1) 对可变大小的块进行存储分配 (2) 外部碎片 : 这种物理存储器分配可能导致所有可用的空间片都不足以再分配给任何一段在运行时解决这一问题非常费时段式和页式结合 : 段页式策略虚拟地址段号页号偏移量 IBM 中使用 : 4K 字节页面 16 x 1 M 字节或 64 x 64 K 字节段

虚拟存储器的历史自从发明虚拟存储器至今, DRAM 已经增大了 64,000 倍今天, 系统缺页率已经非常低那么, 是否可以放弃虚拟存储呢?

Typical Performance Stats secondary memory CPU cache primary memory CPU primary memory Caching Demand paging cache entry page frame cache block ( 32 bytes) page ( 4Ki bytes) cache miss rate (1% to 20%) page miss rate (<0.001%) cache hit ( 1 cycle) page hit ( 100 cycles) cache miss ( 100 cycles) page miss ( 5M cycles) 126

Impact of Paging on AMAT (1/2) Memory Parameters: L1 cache hit = 1 clock cycles, hit 95% of accesses L2 cache hit = 10 clock cycles, hit 60% of L1 misses DRAM = 200 clock cycles ( 100 nanoseconds) Disk = 20,000,000 clock cycles ( 10 milliseconds) Average Memory Access Time (no paging): 1 + 5% 10 + 5% 40% 200 = 5.5 clock cycles Average Memory Access Time (with paging): 5.5 (AMAT with no paging) +? 127

Impact of Paging on AMAT (2/2) Average Memory Access Time (with paging) = 5.5 + 5% 40% (1-HR Mem ) 20,000,000 AMAT if HR Mem = 99%? 5.5 + 0.02 0.01 20,000,000 = 4005.5 ( 728x slower) 1 in 20,000 memory accesses goes to disk: 10 sec program takes 2 hours! AMAT if HR Mem = 99.9%? 5.5 + 0.02 0.001 20,000,000 = 405.5 AMAT if HR Mem = 99.9999% 5.5 + 0.02 0.000001 20,000,000 = 5.9 128

Impact of TLBs on Performance Each TLB miss to Page Table ~ L1 Cache miss TLB Reach: Amount of virtual address space that can be simultaneously mapped by TLB: TLB typically has 128 entries of page size 4-8 KiB 128 4 KiB = 512 KiB = just 0.5 MiB What can you do to have better performance? Multi-level TLBs Variable page size (segments) Special situationally-used superpages Conceptually same as multi-level caches 129

Aside: Context Switching How does a single processor run many programs at once? Context switch: Changing of internal state of processor (switching between processes) Save register values (and PC) and change value in Page Table Base register What happens to the TLB? Current entries are for different process Set all entries to invalid on context switch 130

Virtual Memory Summary User program view: Contiguous memory Start from some set VA Infinitely large Is the only running program Reality: Non-contiguous memory Start wherever available memory is Finite size Many programs running simultaneously Virtual memory provides: Illusion of contiguous memory All programs starting at same set address Illusion of ~ infinite memory (2 32 or 2 64 bytes) Protection Implementation:, Sharing Divide memory into chunks (pages) OS controls page table that maps virtual into physical addresses memory as a cache for disk TLB is a cache for the page table 131

结论一虚拟存储是作为存储层次的另外一层发明的当时的争论是 : 软件是否可以自动在许多程序之间管理 64KB 存储器? DRAM 容量的迅猛增长消除了这一争论今天, 虚拟存储器可以使得多个进程共享同一存储, 而无需将所有进程交换到磁盘, 这时保护就非常重要 ( 多级 ) 页表将虚地址映射到物理地址 TLB 对于快速变换非常重要 TLB 失效对性能影响很大

结论二算法理论和编译基于操作的数目编译将消除一些操作, 并简化一些操作 : 整数加法 << 整数乘法 << 浮点加法 << 浮点乘法高级流水线 => 这些操作将需要相同的时间随着时钟频率增高和流水线加深, 指令将需要越来越少的时间, 但是 DRAM 的速度提高却低得多今天, 执行时间是 ( 操作数,cache 失效 ) 的函数考虑到 cache 的重要性, 这对下面主题意味着什么 : 编译器? 数据结构? 算法?

Uniprocessor Performance From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th edition, 2011

共享地址空间模型机器的物理地址空间进程组的虚拟地址空间的通信通过共享地址完成 Load P n P n 私用公共物理空间 Store P 0 P 1 P 2 地址空间共享部分 P 2 私用地址空间私用部分 P 1 P 0 私用私用

示例 : 小规多处理器设计存储器 : 具有相同访问时间 (uniform access time :UMA) 和总线互联 I/O 处理器处理器处理器处理器一级或多级一级或多级一级或多级一级或多级 Cache Cache Cache Cache 主存 I/O 系统

Symmetric Multiprocessors Processor Processor Memory symmetric All memory is equally far away from all processors Any processor can do any I/O (set up a DMA transfer) CPU-Memory bus I/O controller bridge I/O bus I/O controller Graphics output I/O controller Networks 137

Cache 一致性 (coherence) 问题时间事件 CPU A 的 Cache 内容 CPU B 的 Cache 内容位置 X 的存储器内容 0 1 1 A 读 X 1 1 2 B 读 X 1 1 1 3 A 将 0 存入 X 0 1 0

一致性的含义如何? 非正式 : 任何读操作都必须返回最近写的内容太严格, 也太难实现内存严格一致性 (Strict Consistency) 2 different aspects of memory system behavior Coherence: defines what values can be returned by a read Consistency:determines when a written value will be returned by a read

存储系统如何具有一致性 ( Coherent)? 写序列化 (write serialization) : 任何写操作的结果最终都会被任何一次读操作看见所有的写操作都被所有处理器以正确的次序看见 A memory system is coherent if : 如果 P 写 x 后,P 读 x, 并其间没有其他对 x 的写操作, 那么 P 读的都是 P 的写结果如果 P 写 x 后,P1 读 x, 且读和写之间足够远并其间没有其他对 x 的写操作, 那么 P 的写结果将被 P1 看见对单一位置的写操作是序列化的 : 以一确定次序可见 - 将看见最后的写 - 否则将以不合逻辑的次序看见多次写 ( 在较新的数值写后, 还看见较旧的数值 )

可能的硬件一致性解决方案 Snooping Solution (Snoopy Bus): Send all requests for data to all processors Processors snoop to see if they have a copy and respond accordingly Requires broadcast, since caching information is at processors Works well with bus (natural broadcast medium) Dominates for small scale machines (most of the market) Directory-Based Schemes Keep track of what is being shared in one centralized place Distributed memory => distributed directory for scalability (avoids bottlenecks) Send point-to-point requests to processors via network Scales better than Snooping Actually existed BEFORE Snooping-based schemes

Warmup: Parallel I/O Proc. Address (A) Data (D) Cache Memory Bus Physical Memory R/W Either Cache or DMA can be the Bus Master and effect transfers A D R/W Page transfers occur while the Processor is running DMA DISK (DMA stands for Direct Memory Access, means the I/O device can read/write memory autonomous from the CPU) 142

Problems with Parallel I/O Proc. Cached portions of page Memory Bus Cache DMA Physical Memory DMA transfers Memory Disk: Physical memory may be stale if cache copy is dirty DISK Disk Memory: Cache may hold stale data and not see memory writes 143

Snoopy Cache, Goodman 1983 Idea: Have cache watch (or snoop upon) DMA transfers, and then do the right thing Snoopy cache tags are dual-ported Used to drive Memory Bus when Cache is Bus Master Proc. A R/W D Tags and State Data (lines) A R/W Snoopy read port attached to Memory Bus Cache 144

Snoopy Cache Actions for DMA Observed Bus Cycle Cache State Cache Action Address not cached DMA Read Cached, unmodified Memory Disk Cached, modified Address not cached DMA Write Cached, unmodified Disk Memory Cached, modified No action No action Cache intervenes No action Cache purges its copy??? 145

基本窥探的协议 Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Read Miss: - Write-through: memory is always up-to-date - Write-back: snoop in caches to find most recent copy Write Broadcast Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies Read miss: memory is always up-to-date Write serialization: bus serializes requests! Bus is single point of arbitration

窥探协议的一个例子 Invalidation protocol, write-back cache Each block of memory is in one state: Clean in all caches and up-to-date in memory (Shared) OR Dirty in exactly one cache (Exclusive) OR Not in any caches Each cache block is in one state (track these): Shared : block can be read OR Exclusive : cache has only copy, its writeable, and dirty OR Invalid : block contains no data Read misses: cause all caches to snoop bus Writes to clean line are treated as misses

Snoopy-Cache 状态机 -I State machine for CPU requests for each cache block Invalid CPU Write Place Write Miss on bus CPU Read Place read miss on bus CPU read miss Write-back block Write back block CPU Read hit Shared (read/only) CPU Read miss Place read miss on bus Cache Block State CPU read hit CPU write hit Exclusive (read/write) CPU Write Place Write Miss on Bus CPU Write Miss Write back cache block Place write miss on bus

Snoopy-Cache 状态机 -II State machine for bus requests Write miss for each for this block cache block Invalid Shared (read/only) Write miss for this block Exclusive (read/write) Write Back Block; (abort memory access) Read miss for this block Write Back Block; (abort memory access) CPU Read miss

窥探 Cache: 状态机 CPU Read hit Remote Write or Miss due to address conflict Write back block Invalid CPU Write Place Write Miss on bus Remote Read Write back block Remote Write or Miss due to address conflict CPU Read Place read miss on bus CPU Read Miss CPU Write Place Write Miss on Bus Shared (read/only) CPU Read miss Place rd miss on bus CPU read hit CPU write hit Exclusive (read/write) CPU Write Miss Write back cache block Place write miss on bus

示例 P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 P1: P1: Read A1 P2: Read A1 P2: Write 20 to A1 A1 P2: Write 40 to A2 A2 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2 Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

示例 ( 续一 ) P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2. Active arrow = Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

示例 ( 续二 ) P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2 Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

示例 ( 续三 ) P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 10 Shar. A1 10 RdDa P2 A1 10 10 P2: Write 20 to A1 10 P2: Write 40 to A2 10 10 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2. Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

示例 ( 续四 ) P1 P2 Bus Memory step State Addr Value State Addr ValueActionProc. Addr ValueAddrValue P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: P1: Read A1 A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 10 Shar. A1 10 RdDa P2 A1 10 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 10 P2: Write 40 to A2 10 10 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2 Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

示例 ( 续五 ) P1 P2 Bus Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 10 Shar. A1 10 RdDa P2 A1 10 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 10 P2: Write 40 to A2 WrMs P2 A2 10 Excl. A2 40 WrBk P2 A1 20 20 Assumes initial cache state is invalid and A1 and A2 map to same cache block, but A1 A2 Remote Write or Miss Write Back Invalid Remote Write or Miss Read miss on bus Write miss on bus Remote Read Write Back Shared CPU Write Place Write Miss on Bus CPU Read hit CPU read hit CPU write hit Exclusive

实现复杂 Write Races: Cannot update cache until bus is obtained - Otherwise, another processor may get bus first, and then write the same cache block! Two step process: - Arbitrate for bus - Place miss on bus and complete operation If miss occurs to block while waiting for bus, handle miss (invalidate may be needed) and then restart. Split transaction bus: - Bus transaction is not atomic: can have multiple outstanding transactions for a block - Multiple misses can interleave, allowing two caches to grab block in the Exclusive state - Must track and prevent multiple misses for one block Must support interventions and invalidations

Snoopy Cache Coherence Protocols write miss: the address is invalidated in all other caches before the write is performed read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read

Cache State Transition Diagram The MSI protocol Each cache line has state bits state bits Address tag Write miss (P1 gets line from memory) Other processor reads (P 1 writes back) M: Modified S: Shared I: Invalid M P 1 reads or writes Read miss (P1 gets line from memory) Read by any processor S Other processor intent to write I Other processor intent to write (P 1 writes back) Cache state in processor P 1

Two Processor Example (Reading and writing the same cache line) P 1 reads P 1 writes P 2 reads P 2 writes P 1 reads P 1 writes P 2 writes P 1 writes P 1 Read miss P 2 reads, P 1 writes back S P 2 intent to write M I P 1 reads or writes Write miss P 2 intent to write P 2 P 1 reads, P 2 writes back M P 2 reads or writes Write miss Read miss S P 1 intent to write I P 1 intent to write

存储器一致性模型 Schemes faster execution to sequential consistency Not really an issue for most programs; they are synchronized A program is synchronized if all access to shared data are ordered by synchronization operations write (x)... release (s) {unlock}... acquire (s) {lock}... read(x) Only those programs willing to be nondeterministic are not synchronized: data race : outcome f(proc. speed) Several Relaxed Models for Memory Consistency since most programs are synchronized; characterized by their attitude towards: RAR, WAR, RAW, WAW to different addresses

Synchronization The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system) Two classes of synchronization: Producer-Consumer: A consumer process must wait until the producer process has produced data Mutual Exclusion: Ensure that only one process uses a resource at a given time producer P1 consumer P2 Shared Resource

Sequential Consistency A Memory Model P P P P P P A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program Leslie Lamport Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs First defined by [Lamport 1979], execution is the same as if: (R1) Memory ops of each processor appear in program order (R2) Memory ops of all processors were executed in some global sequential order M

Programmer s view of sequential consistency Informally, sequential consistency requires that all memory operations appear to execute one at a time, and the operations of a single processor appear to execute in the order described by that processor s program.

Memory Consistency Models -Who Should Care? As the interface between the programmer and the system, memory consistency affects: Programmability because programmers must use it to reason about the correctness of their programs. Performance of the system because it determines the types of optimizations that may be exploited by the hardware and the system software. Finally, Portability can be affected when moving software across systems supporting different models, due to a lack of consensus on a single model MCM influences the writing of parallel programs from the programmer s perspective, and virtually all aspects of designing a parallel system (including the processor, memory system, interconnection network, compiler, and programming languages) from a system designer s perspective

Sequential Consistency Sequential concurrent tasks: T1, T2 Shared variables: X, Y (initially X = 0, Y = 10) T1: T2: Store (X), 1 (X = 1) Load R 1, (Y) Store (Y), 11 (Y = 11) Store (Y ), R 1 (Y = Y) Load R 2, (X) Store (X ), R 2 (X = X) what are the legitimate answers for X and Y? (X,Y ) {(1,11), (0,10), (1,10), (0,11)}? If y is 11 then x cannot be 0

Sequential Consistency Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies ( ) What are these in our example? T1: T2: Store (X), 1 (X = 1) Load R 1, (Y) Store (Y), 11 (Y = 11) Store (Y ), R 1 (Y = Y) Load R 2, (X) additional SC requirements Store (X ), R 2 (X = X) Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? more on this later

Issues in Implementing Sequential Consistency P P P P P P M Implementation of SC is complicated by two issues Out-of-order execution capability Load(a); Load(b) yes Load(a); Store(b) yes if a b Store(a); Load(b) yes if a b Store(a); Store(b) yes if a b Caches Caches can prevent the effect of a store from being seen by other processors No common commercial architecture has a sequentially consistent memory model!

Relaxed Consistency Models Relaxed consistency models allow reads and writes to complete out of order, but to use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent. ~ H&P Fifth Edition

Virtual Machine Protection via Virtual machine The increasing importance of isolation and security in modern systems The failures in security and reliability of standard operation systems The sharing of a single computer among many unrelated users The dramatic increases in raw speed of processors, which makes the overhead of VMs more acceptable

Virtual Machine Virtualization of Whole Machine A complete system-level environment at binary ISA level (Operating) System Virtual Machine Run different ISAs in the VM from native hardware

System Virtual Machine

Virtual machine 分类

System Virtual Machine CPU Virtualization Memory Virtualization I/O virtualization

Virtual Machine Monitor Virtual machine monitor(vmm) or Hypervisor Managing software Managing hardware Requirements Guest software should behave on a VM exactly as if it were running on the native hardware, except for performance-related behavior or limitations of fixed resources shared by multiple VMs. Guest software should not be able to change allocation of real system At least two processor modes, system and user. A privileged subset of instructions that is available only in system mode, resulting in a trap if executed in user mode. All system resources must be controllable only via these instructions.

Cloud Computing