Pipelining: CPI ≈ 1? The iron law?
[Figure 4-25: pipelined execution of ld x1, 100(x4); ld x2, 200(x4); ld x3, 400(x4) in program order. Stages: Instruction fetch, Reg, ALU, Data access, Reg; 200 ps per stage; time axis 200 to 1400 ps]
[Figure: memory hierarchy seen by the CPU. Registers, Cache, Memory, Disk memory; size: 500 bytes, 64 KB, 1 GB, 1 TB; speed: 250 ps, 1 ns, 100 ns, 10 ms]
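As a reminder of what the question on this slide refers to, the iron law and the ideal pipeline timing can be written out as below. This is a sketch under the assumption of a stall-free pipeline (every memory access completes in one cycle); the symbols IC (instruction count), k (pipeline depth), n (number of instructions) and T_clk (cycle time) are introduced here for illustration. The numeric line uses the three loads and the 200 ps stages of Figure 4-25.

```latex
\begin{align*}
T_{\text{CPU}} &= IC \times CPI \times T_{\text{clk}} \\
\text{cycles}(n, k) &= k + (n - 1) \\
CPI(n, k) &= \frac{k + n - 1}{n} \longrightarrow 1 \quad (n \to \infty) \\
n = 3,\ k = 5,\ T_{\text{clk}} = 200\,\text{ps}: &\quad (5 + 3 - 1) \times 200\,\text{ps} = 1400\,\text{ps}
\end{align*}
```

CPI approaches 1 only while instruction fetch and data access really take a single cycle, which is the assumption the following slides call into question.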
Memory Wall: 1995, Wulf @ Univ. of Virginia
• Main memory speed has not kept up with CPU performance (since the 25 MHz 80386)
  – A 100 MHz Pentium executes one instruction every 10 ns on average, while a typical DRAM access takes 60~120 ns.
  – The instruction pipeline assumes single-cycle memory access
• "The contribution of processor performance improvements to the system is masked by DRAM performance"
[Figure: Processor-DRAM Memory Gap. Performance (log scale) vs. year, 1980 to 2010; µProc curves labeled 1.20/yr and 1.52/yr (2X/1.5 yr, "Moore's Law"); DRAM 7%/yr (2X/10 yrs); Processor-Memory Performance Gap grows 50%/year]
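The arithmetic behind the Pentium example can be spelled out in a few lines. This is a minimal sketch: the clock rate and DRAM access times are the figures quoted on the slide, while the function names `cycle_time_ns` and `access_cycles` are made up for illustration.

```python
import math

# Figures quoted on the slide.
CLOCK_HZ = 100_000_000        # 100 MHz Pentium
DRAM_ACCESS_NS = (60, 120)    # typical DRAM access time range, in ns


def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def access_cycles(access_ns, clock_hz):
    """Whole clock cycles occupied by one memory access (rounded up)."""
    return math.ceil(access_ns / cycle_time_ns(clock_hz))


if __name__ == "__main__":
    print(f"cycle time: {cycle_time_ns(CLOCK_HZ)} ns")
    for t in DRAM_ACCESS_NS:
        print(f"{t} ns DRAM access = {access_cycles(t, CLOCK_HZ)} cycles")
```

A pipeline that expects memory to respond in one cycle would therefore wait several cycles on every DRAM access, which is the gap a cache is meant to hide.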
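Hierarchical storage: performance, capacity, price
• Each upper level holds a copy of data from the level below
[Figure 5-2: storage hierarchy. CPU (registers / RegFile), cache (SRAM; small, fast memory that holds frequently used data), main memory (DRAM; big, slow memory), secondary storage (disk, tape). Data are transferred between cache and main memory by auxiliary hardware, and between main memory and secondary storage by auxiliary hardware and software]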
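Differences between the "Cache-main memory" level and the "main memory-secondary storage" level

Comparison item | "Cache-main memory" level | "Main memory-secondary storage" level
Purpose | Make up for insufficient main memory speed | Make up for insufficient main memory capacity
Storage management | Mainly implemented by dedicated hardware | Implemented by hardware and software
Access speed ratio (first level to second level) | A few to one | A few hundred to one
Typical block (page) size | Tens of bytes | Hundreds to thousands of bytes
CPU access to the second level | Can access it directly | Always goes through the first level
Does the CPU switch on a miss? | No switch | Switches to another process

[Figure: Processor / Registers, Cache, Main Memory, Secondary Storage; transfer units are words, blocks, and pages; cache mapping applies between cache and main memory, virtual memory mapping between main memory and secondary storage]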
The memory subsystem in a PC
[Figure Ov.3: Typical PC organization. The memory subsystem is one part of a relatively complex whole. This figure illustrates a two-way multiprocessor, with each processor having its own dedicated off-chip cache. The parts most relevant to this text are shaded in grey: the CPU and its cache system, the system and memory controllers, the DIMMs and their component DRAMs, and the hard drive/s. Labeled components: CPU with on-chip cache/s, off-chip cache (backside bus), frontside bus, North Bridge (system and memory controllers), DRAM bus, DIMMs, graphics co-processor (AGP), PCI bus, South Bridge, SCSI controller and bus, hard drive/s, network interface, I/O controller, keyboard, mouse, other low-bandwidth I/O devices]
Source: Bruce Jacob, Memory Systems: Cache, DRAM, Disk, 2008
This lecture: cache systems (uniprocessor)
• Why do we need a cache?
  – Performance, structure
• Theoretical basis for cache effectiveness
  – Principle of locality: temporal, spatial
• Factors that affect the cache hit rate
• Basic structure of a cache, 5.3
• Cache read and write operations, 5.3, 5.8
  – Cache consistency
  – Blocking cache
• Cache-to-memory mapping, 5.4
  – Where does a block go?
• Cache replacement policy, 5.4
• Cache controller: 5.9, 5.12
• Cache performance analysis
  – Covered mainly in the computer architecture course
• Cache coherence, 5.10
  – See the computer architecture course

Three design questions
• The write policy
  – how the processor writes data to the cache so that main memory eventually gets updated
• The mapping function
  – the link between a block's address in memory and its location in the cache
  – block placement schemes
• The replacement algorithm
  – the method used to figure out which block to remove from the cache in order to free up a line

Reading
• COD5: 5.3, 5.4, 5.9, 5.12, 5.8
• Tang: 4.3, Appendix 4A
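To make "the link between a block's address in memory and its location in the cache" concrete, here is a minimal sketch of one possible mapping function, direct mapping. The block size, the number of lines, and the name `split_address` are assumptions chosen for illustration, not values from the lecture; other placement schemes use a different rule.

```python
# Assumed cache geometry (hypothetical values, not from the slides).
BLOCK_SIZE = 64    # bytes per block
NUM_LINES = 256    # lines in the cache


def split_address(addr, block_size=BLOCK_SIZE, num_lines=NUM_LINES):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache."""
    block_number = addr // block_size   # which memory block the byte belongs to
    offset = addr % block_size          # byte position inside that block
    index = block_number % num_lines    # the only cache line this block may occupy
    tag = block_number // num_lines     # distinguishes blocks that share a line
    return tag, index, offset


if __name__ == "__main__":
    a = 0x12345
    b = a + BLOCK_SIZE * NUM_LINES      # one full "cache size" further on
    print(split_address(a))
    print(split_address(b))             # same index and offset, different tag
```

Two addresses that differ by a multiple of the cache size land on the same line, so with direct mapping no replacement choice is needed; the choice only arises once a block may go in more than one line.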
Impact of the cache on system performance
[Figure: access latency by level, from the execution unit outward: first-level instruction cache and first-level data cache, second-level cache, system bus, main memory; latency labels ≤1 cycle, ~3 cycles, ~14 cycles, and ~240 cycles (main memory)]
[Figure: MIPS R4000 instruction pipeline. Stages IF, IS, RF, EX, DF, DS, TC, WB; instruction memory access spans IF and IS (first half, second half); Reg read in RF; data memory access spans DF and DS (first half, second half); TC is the tag check; Reg write in WB]
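One common way to quantify this impact is the average memory access time (AMAT). The sketch below is illustrative only: it assumes the figure's ~3, ~14 and ~240 cycle labels belong to a first-level cache, a second-level cache and main memory respectively, and the miss rates are invented for the example. The function name `amat` is likewise hypothetical.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_time_cycles, local_miss_rate), ordered from the
            level closest to the CPU outward.
    memory_latency: cycles for an access that misses in every cache level.
    """
    time = memory_latency
    # Work from the outermost cache back toward the CPU:
    # time(level) = hit_time + miss_rate * time(next level out)
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


if __name__ == "__main__":
    # Latencies taken from the figure's labels (assumed mapping);
    # miss rates are made up for illustration.
    two_level = [(3, 0.05), (14, 0.20)]
    print("with caches:   ", amat(two_level, 240), "cycles")
    print("without caches:", amat([], 240), "cycles")
```

Under these assumed numbers the average access costs a few cycles instead of hundreds, which is why the hit rate matters so much more than the raw DRAM latency.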
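Impact of the cache on system structure
• Memory conflicts: instruction fetch vs. data reads and writes
  ➢ Split cache (separate I-Cache and D-Cache)
• Bus occupancy: the CPU and I/O contend for access to main memory
  ➢ Reduce the number of CPU accesses to main memory
• Side effects: consistency, timing predictability
[Figure: pipeline stages Fetch, Decode, Execute, Memory, Write-back; Fetch reads the I-Cache and the Memory stage accesses the D-Cache, both backed by memory]
[Figure: host containing processor and memory, connected through I/O interfaces (adapters) to I/O devices]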
Memory access characteristics of programs
• Temporal locality
  – A recently accessed item (instruction or data) is likely to be accessed again in the near future
  – This tends to concentrate accesses on recently used regions
  – Strategy: keep the data and reuse it
    • The memory addresses are not necessarily clustered!
• Spatial locality
  – Items accessed by a process tend to have addresses close to one another
  – Accesses tend to fall in the same region of the memory space
  – Strategy: keep the data and its neighbors; prefetch
    • The memory addresses are contiguous!
• Example
  – for i := 0 to 10000 do A[i] := 0;
• How locality is exploited: memory is divided into blocks
[Figure: Typical Access Address Pattern. Address vs. time; instruction fetches (n loop iterations, subroutine call and return), stack accesses (argument access), data accesses (vector access, scalar accesses)]
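The effect of dividing memory into blocks can be seen by replaying the loop example against a toy cache model. This is a sketch under stated assumptions: 4-byte array elements, an array starting at address 0, and a direct-mapped cache with 256 lines; none of these values come from the slide, and `simulate` is a hypothetical helper. Only the data accesses to A are modeled.

```python
def simulate(addresses, block_size, num_lines):
    """Replay byte addresses against a direct-mapped cache; return (hits, misses)."""
    lines = [None] * num_lines          # block number currently held by each line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size
        idx = block % num_lines
        if lines[idx] == block:
            hits += 1
        else:
            misses += 1
            lines[idx] = block          # bring the whole block into the cache
    return hits, misses


ELEM_SIZE = 4                           # assumed element size in bytes
# Addresses touched by: for i := 0 to 10000 do A[i] := 0;
loop_addresses = [i * ELEM_SIZE for i in range(10001)]

if __name__ == "__main__":
    for block_size in (4, 64):
        hits, misses = simulate(loop_addresses, block_size, num_lines=256)
        print(f"block size {block_size:2d} B: {hits} hits, {misses} misses")
```

With one element per block every access misses, because each element is touched only once and the loop's data has no temporal reuse. With 64-byte blocks, one miss brings in 16 neighboring elements, so the remaining accesses to that block hit purely through spatial locality.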