Burger, D., Goodman, J.R., Sohi, G.S. “Memory Systems.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
88 Memory Systems

Doug Burger, University of Wisconsin-Madison
James R. Goodman, University of Wisconsin-Madison
Gurindar S. Sohi, University of Wisconsin-Madison

88.1 Introduction
88.2 Memory Hierarchies
88.3 Cache Memories
88.4 Parallel and Interleaved Memories
88.5 Virtual Memory
88.6 Research Issues

88.1 Introduction

A memory system serves as a repository of information (data) in a computer system. The processor [also called the central processing unit (CPU)] accesses (reads or loads) data from the memory system, performs computations on them, and stores (writes) them back to memory. The memory system is a collection of storage locations. Each storage location, or memory word, has a numerical address. A collection of storage locations forms an address space. Figure 88.1 shows the essentials of how a processor is connected to a memory system via address, data, and control lines.

FIGURE 88.1 The memory interface.

When a processor attempts to load the contents of a memory location, the request is very urgent. In virtually all computers, the work soon comes to a halt (in other words, the processor stalls) if the memory request does not return quickly. Modern computers are generally able to continue briefly by overlapping memory requests, but even the most sophisticated computers will frequently exhaust their ability to process data and stall momentarily in the face of long memory delays. Thus, a key performance parameter in the design of any computer, fast or slow, is the effective speed of its memory.

Ideally, the memory system must be both infinitely large, so that it can contain an arbitrarily large amount of information, and infinitely fast, so that it does not limit the processing unit. Practically, however, this is not possible. There are three properties of memory that are inherently in conflict: speed, capacity, and cost. In general, technology tradeoffs can be employed to optimize any two of the three factors at the expense of the third. Thus it is possible to have memories that are (1) large and cheap, but not fast; (2) cheap and fast, but small; or (3) large and fast, but expensive. The last of the three is further limited by physical constraints. A large-capacity memory that is very fast is also physically large, and speed-of-light delays place a limit on the speed of such a memory system.

The latency (L) of the memory is the delay from when the processor first requests a word from memory until that word arrives and is available for use by the processor. The latency of a memory system is one attribute of performance. The other is bandwidth (BW), which is the rate at which information can be transferred from the memory system. The bandwidth and the latency are related. If R is the number of requests that the memory can service simultaneously, then:

BW = R / L    (88.1)
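As a quick numerical check of Eq. (88.1), the minimal C sketch below computes the bandwidth of a memory system that needs 20 ns to deliver a 32-bit word, first with one request in flight and then with a new request accepted every 5 ns; these are the same figures used in the worked example that follows, and the code itself is purely illustrative.

```c
#include <stdio.h>

/* Eq. (88.1): BW = R / L, where R is the number of requests the memory
 * system can service simultaneously and L is the latency of one request.
 * Figures match the worked example in the text: a 32-bit (4-byte) word
 * with a 20-ns access time, optionally overlapped every 5 ns. */
int main(void)
{
    const double latency = 20e-9;          /* L, in seconds   */
    const double bytes_per_request = 4.0;  /* one 32-bit word */

    /* One request at a time: R = 1. */
    double bw_single = 1.0 * bytes_per_request / latency;

    /* A new request every 5 ns while each still takes 20 ns, so
     * R = 20 ns / 5 ns = 4 requests are in flight at once. */
    double r = 20e-9 / 5e-9;
    double bw_overlapped = r * bytes_per_request / latency;

    printf("R = 1: %.0f Mbytes/s\n", bw_single / 1e6);      /* 200 */
    printf("R = 4: %.0f Mbytes/s\n", bw_overlapped / 1e6);  /* 800 */
    return 0;
}
```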
From Eq. (88.1) we see that a decrease in the latency will result in an increase in bandwidth, and vice versa, if R is unchanged. We can also see that the bandwidth can be increased by increasing R, if L does not increase proportionately. For example, we can build a memory system that takes 20 ns to service the access of a single 32-bit word. Its latency is 20 ns per 32-bit word, and its bandwidth is

32 bits / (20 × 10⁻⁹ sec)

or 200 Mbytes/s. If the memory system is modified to accept a new (still 20 ns) request for a 32-bit word every 5 ns by overlapping results, then its bandwidth is

32 bits / (5 × 10⁻⁹ sec)

or 800 Mbytes/s. This memory system must be able to handle four requests at a given time.

Building an ideal memory system (infinite capacity, zero latency, and infinite bandwidth, with affordable cost) is not feasible. The challenge is, given a set of cost and technology constraints, to engineer a memory system whose abilities match the abilities that the processor demands of it. That is, engineering a memory system that performs as close to an ideal memory system (for the given processing unit) as is possible. For a processor that stalls when it makes a memory request (some current microprocessors are in this category), it is important to engineer a memory system with the lowest possible latency. For those processors that can handle multiple outstanding memory requests (vector processors and high-end CPUs), it is important not only to reduce latency, but also to increase bandwidth (over what is possible by latency reduction alone) by designing a memory system that is capable of servicing multiple requests simultaneously.

Memory hierarchies provide decreased average latency and reduced bandwidth requirements, whereas parallel or interleaved memories provide higher bandwidth.

88.2 Memory Hierarchies

Technology does not permit memories that are cheap, large, and fast. By recognizing the nonrandom nature of memory requests, and emphasizing the average rather than worst-case latency, it is possible to implement a hierarchical memory system that performs well. A small amount of very fast memory, placed in front of a large, slow memory, can be designed to satisfy most requests at the speed of the small memory. This, in fact, is the primary motivation for the use of registers in the CPU: in this case, the programmer or compiler makes sure that the most commonly accessed variables are allocated to registers.

A variety of techniques, employing either hardware, software, or a combination of the two, can be employed to assure that most memory references are satisfied by the faster memory. The foremost of these techniques is the exploitation of the locality of reference principle. This principle captures the fact that some memory locations are referenced much more frequently than others. Spatial locality is the property that an access to a given memory location greatly increases the probability that neighboring locations will soon be accessed. This is largely, but not exclusively, a result of the tendency to access memory locations sequentially. Temporal locality is the property that an access to a given memory location greatly increases the probability that the same location
will be accessed again soon. This is largely, but not exclusively, a result of the frequent looping behavior of programs. Particularly for temporal locality, a good predictor of the future is the past: the longer a variable has gone unreferenced, the less likely it is to be accessed soon.

Figure 88.2 depicts a common construction of a memory hierarchy. At the top of the hierarchy are the CPU registers, which are small and extremely fast. The next level down in the hierarchy is a special, high-speed semiconductor memory known as a cache memory. The cache can actually be divided into multiple distinct levels; most current systems have between one and three levels of cache. Some of the levels of cache may be on the CPU chip itself, they may be on the same module as the CPU, or they may be entirely distinct. Below the cache is the conventional memory, referred to as main memory, or backing storage. Like a cache, main memory is semiconductor memory, but it is slower, cheaper, and denser than a cache. Below the main memory is the virtual memory, which is generally stored on magnetic or optical disk. Accessing the virtual memory can be tens of thousands of times slower than accessing the main memory because it involves moving mechanical parts.

FIGURE 88.2 A memory hierarchy.

As requests go deeper into the memory hierarchy, they encounter levels that are larger (in terms of capacity) and slower than the higher levels (moving left to right in Fig. 88.2). In addition to size and speed, the bandwidth between adjacent levels in the memory hierarchy is smaller for the lower levels. The bandwidth between the registers and the top cache level, for example, is higher than that between cache and main memory or between main memory and virtual memory. Since each level presumably intercepts a fraction of the requests, the bandwidth to the level below need not be as great as that to the intercepting level.

A useful performance parameter is the effective latency. If the needed word is found in a level of the hierarchy, it is a hit; if a request must be sent to the next lower level, the request is said to miss. If the latency L_HIT is known in the case of a hit and the latency in the case of a miss is L_MISS, the effective latency for that level in the hierarchy can be determined from the hit ratio (H), the fraction of memory accesses that are hits:

L_average = L_HIT · H + L_MISS · (1 – H)    (88.2)

The portion of memory accesses that miss is called the miss ratio (M = 1 – H). The hit ratio is strongly influenced by the program being executed, but is largely independent of the ratio of cache size to memory size. It is not uncommon for a cache with a capacity of a few thousand bytes to exhibit a hit ratio greater than 90%.

88.3 Cache Memories

The basic unit of construction of a semiconductor memory system is a module or bank. A memory bank, constructed from several memory chips, can service a single request at a time. The time that a bank is busy servicing a request is called the bank busy time. The bank busy time limits the bandwidth of a memory bank.
Both caches and main memories are constructed in this fashion, although caches have significantly shorter bank busy times than do main memory banks. The hardware can dynamically allocate parts of the cache memory for addresses deemed most likely to be accessed soon. The cache contains only redundant copies of the address space. The cache memory is associative, or content-addressable. In an associative memory, the address of a memory location is stored, along with its content. Rather than reading data directly from a memory location, the cache is given an address and responds by providing data which may or may not be the data requested. When a cache miss occurs, the memory access is then performed with respect to the backing storage, and the cache is updated to include the new data.

The cache is intended to hold the most active portions of the memory, and the hardware dynamically selects portions of main memory to store in the cache. When the cache is full, bringing in new data must be matched by deleting old data. Thus a strategy for cache management is necessary. Cache management strategies exploit the principle of locality. Spatial locality is exploited by the choice of what is brought into the cache. Temporal locality is exploited in the choice of which block is removed. When a cache miss occurs, hardware copies a large, contiguous block of memory into the cache, which includes the word requested. This fixed-size region of memory, known as a cache line or block, may be as small as a single word, or up to several hundred bytes.

A block is a set of contiguous memory locations, the number of which is usually a power of two. A block is said to be aligned if the lowest address in the block is exactly divisible by the block size. That is to say, for a block of size B beginning at location A, the block is aligned if

A modulo B = 0    (88.3)

Conventional caches require that all blocks be aligned.

When a block is brought into the cache, it is likely that another block must be evicted. The selection of the evicted block is based on some attempt to capture temporal locality. Since prescience is so difficult to achieve, other methods are generally used to predict future memory accesses. A least-recently-used (LRU) policy is often the basis for the choice. Other replacement policies are sometimes used, particularly because true LRU replacement requires extensive logic and bookkeeping.

The cache often comprises two conventional memories: the data memory and the tag memory, shown in Fig. 88.3. The address of each cache line contained in the data memory is stored in the tag memory, along with other state information, particularly the fact that a valid cache line is present. The state also keeps track of which cache lines the processor has modified. Each line contained in the data memory is allocated a corresponding entry in the tag memory to indicate the full address of the cache line.

FIGURE 88.3 Components of a cache memory. (Source: Adapted from M. D. Hill, “A case for direct-mapped caches,” IEEE Computer, 21(12), 27, 1988.)
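To make the tag/data split concrete, here is a minimal sketch of an associative (content-addressable) tag memory; the structure and field names are illustrative assumptions, and where real hardware compares all stored tags concurrently, software can only model that with a loop.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy fully associative tag memory: each entry holds the block address
 * (the tag), a valid bit, and a bit recording whether the processor has
 * modified the cached line.  Sizes and names are illustrative only. */
#define NUM_LINES 8

struct tag_entry {
    unsigned tag;    /* block address of the line held in the data memory */
    bool     valid;  /* a valid cache line is present                     */
    bool     dirty;  /* line modified since it was brought in             */
};

static struct tag_entry tags[NUM_LINES];

/* Hardware performs all the comparisons in parallel; this loop is the
 * sequential software equivalent.  Returns the matching line or -1 (miss). */
static int lookup(unsigned block_addr)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (tags[i].valid && tags[i].tag == block_addr)
            return i;
    return -1;
}

int main(void)
{
    tags[3] = (struct tag_entry){ .tag = 0x1234, .valid = true, .dirty = false };
    printf("block 0x1234 -> line %d\n", lookup(0x1234));  /* hit: line 3 */
    printf("block 0x5678 -> line %d\n", lookup(0x5678));  /* miss: -1    */
    return 0;
}
```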
The requirement that the cache memory be associative (content-addressable) complicates the design. Addressing data by content is inherently more complicated than addressing it by its address. All the tags must be compared concurrently, of course, because the whole point of the cache is to achieve low latency. The cache can be made simpler, however, by introducing a mapping of memory locations to cache cells. This mapping limits the number of possible cells in which a particular line may reside. The extreme case is known as direct mapping, in which each memory location is mapped to a single location in the cache. Direct mapping makes many aspects of the design simpler, since there is no choice of where the line might reside, and no choice as to which line must be replaced. However, direct mapping can result in poor utilization of the cache when two memory locations are alternately accessed and must share a single cache cell.

A hashing algorithm is used to determine the cache address from the memory address. The conventional mapping algorithm consists of a function of the form

A_cache = A_memory mod (cache_size / cache_line_size)    (88.4)

where A_cache is the address within the cache for main memory location A_memory, cache_size is the capacity of the cache in addressable units (usually bytes), and cache_line_size is the size of the cache line in addressable units. Since the hashing function is simple bit selection, the tag memory need only contain the part of the address not implied by the hashing function. That is,

A_tag = A_memory div cache_size    (88.5)

where A_tag is stored in the tag memory and div is the integer divide operation. In testing for a match, the complete address of a line stored in the cache can be inferred from the tag and its storage location within the cache.

A two-way set-associative cache maps each memory location into either of two locations in the cache and can be constructed essentially as two identical direct-mapped caches. However, both caches must be searched at each memory access and the appropriate data selected and multiplexed on a tag match (hit). On a miss, a choice must be made between the two possible cache lines as to which is to be replaced. A single LRU bit can be saved for each such pair of lines to remember which line has been accessed more recently. This bit must be toggled to the current state each time either of the cache lines is accessed. In the same way, an M-way set-associative cache maps each memory location into any of M locations in the cache and can be constructed from M identical direct-mapped caches. The problem of maintaining the LRU ordering of M cache lines quickly becomes hard, however, since there are M! possible orderings, and therefore it takes at least

⌈log₂(M!)⌉    (88.6)

bits to store the ordering. In practice, this requirement limits true LRU replacement to 3- or 4-way set associativity.

Figure 88.4 shows how a cache is organized into sets, blocks, and words. The cache shown is a 2-Kbyte, 4-way set-associative cache, with 16 sets. Each set consists of four blocks. The cache block size in this example is 32 bytes, so each block contains eight 4-byte words. Also depicted at the bottom of Fig. 88.4 is a 4-way interleaved main memory system (see the next section for details). Each successive word in the cache block maps into a different main memory bank. Because of the cache’s mapping restrictions, each cache block obtained from main memory will be loaded into its corresponding set, but may appear anywhere within that set.

FIGURE 88.4 Organization of a cache.
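Returning to the direct-mapped lookup of Eqs. (88.4) and (88.5), the sketch below applies the same bit-selection idea to byte addresses: the low-order bits give the offset within the line, the middle bits the line index, and the remaining high-order bits the tag. The 16-Kbyte, 32-byte-line geometry and the helper names are illustrative assumptions, not values from the chapter.

```c
#include <stdio.h>

/* Illustrative geometry: a 16-Kbyte direct-mapped cache with 32-byte
 * lines, i.e., 512 lines.  Powers of two make the mod and div of
 * Eqs. (88.4) and (88.5) simple bit selection. */
#define CACHE_SIZE (16u * 1024u)
#define LINE_SIZE  32u
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

struct mapping {
    unsigned line_index;  /* which cache line the address maps to   */
    unsigned tag;         /* high-order bits kept in the tag memory */
    unsigned offset;      /* byte within the cache line             */
};

static struct mapping map_address(unsigned addr)
{
    struct mapping m;
    m.offset     = addr % LINE_SIZE;                /* low-order bits           */
    m.line_index = (addr / LINE_SIZE) % NUM_LINES;  /* line selection, cf. 88.4 */
    m.tag        = addr / CACHE_SIZE;               /* tag, as in Eq. (88.5)    */
    return m;
}

int main(void)
{
    /* Two addresses exactly CACHE_SIZE apart map to the same line and can
     * only be told apart by their tags: the alternating-access case that
     * makes a direct-mapped cache perform poorly. */
    unsigned a = 0x00012344u, b = a + CACHE_SIZE;
    struct mapping ma = map_address(a), mb = map_address(b);
    printf("a: line %u, tag %u\n", ma.line_index, ma.tag);
    printf("b: line %u, tag %u\n", mb.line_index, mb.tag);
    return 0;
}
```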
Write operations require special handling in the cache. If the main memory copy is updated with each write operation—a technique known as write-through or store-through—the writes may force operations to stall while the write operations are completing. This can happen after a series of write operations even if the processor is allowed to proceed before the write to the memory has completed. If the main memory copy is not updated
with each write operation—a technique known as write-back or copy-back or deferred writes—the main memory locations become stale; that is, memory no longer contains the correct values and must not be relied upon to provide data. This is generally permissible, but care must be exercised to make sure that main memory is always updated before the line is purged from the cache and that the cache is never bypassed. Such a bypass could occur with DMA (direct memory access), in which the I/O system writes directly into main memory without the involvement of the processor.

Even for a system that implements write-through, care must be exercised if memory requests bypass the cache. While the main memory is never stale, a write that bypasses the cache, such as from I/O, could have the effect of making the cached copy stale. A later access by the CPU could then provide an incorrect value. This can only be avoided by making sure that cached entries are invalidated even if the cache is bypassed. The problem is relatively easy to solve for a single processor with I/O, but becomes very difficult to solve for multiple processors, particularly so if multiple caches are involved as well. This is known in general as the cache coherence or consistency problem.

The cache exploits spatial locality by loading an entire cache line after a miss. This tends to result in bursty traffic to the main memory, since most accesses are filtered out by the cache. After a miss, however, the memory system must provide an entire line at once. Cache memory nicely complements an interleaved, high-bandwidth main memory (described in the next section), since a cache line can be interleaved across many banks in a regular manner, thus avoiding memory conflicts, and thus can be loaded rapidly into the cache. The example main memory shown in Fig. 88.4 can provide an entire cache line with two parallel memory accesses.

Conventional caches traditionally could not accept requests while they were servicing a miss request. In other words, they locked up or blocked when servicing a miss. The growing penalty for cache misses has made it necessary for high-end commodity memory systems to continue to accept (and service) requests from the processor while a miss is being serviced. Some systems are able to service multiple miss requests simultaneously. To allow this mode of operation, the cache design is lockup-free or non-blocking [Kroft, 1981]. Lockup-free caches have one structure for each simultaneous outstanding miss that they can service. This structure holds the information necessary to correctly return the loaded data to the processor, even if the misses come back in a different order than that in which they were sent.

Two factors drive the existence of multiple levels of cache memory in the memory hierarchy: access times and a limited number of transistors on the CPU chip. Larger banks with greater capacity are slower than smaller banks. If the time needed to access the cache limits the clock frequency of the CPU, then the first-level cache size may need to be constrained. Much of the benefit of a large cache may be obtained by placing a small first-level cache above a larger second-level cache; the first is accessed quickly and the second holds more data close to the processor. Since many modern CPUs have caches on the CPU chip itself, the size of the cache is limited by the CPU silicon real estate.
Some CPU designers have assumed that system designers will add large off-chip caches to the one or two levels of caches on the processor chip. The complexity of this part of the memory hierarchy may continue to grow as main memory access penalties continue to increase. FIGURE 88.4 Organization of a cache
Caches that appear on the CPU chip are manufactured by the CPU vendor. Off-chip caches, however, are a commodity part sold in large volume. An incomplete list of major cache manufacturers is Hitachi, IBM Micro, Micron, Motorola, NEC, Samsung, SGS-Thomson, Sony, and Toshiba. Although most personal computers and all major workstations now contain caches, very high-end machines (such as multi-million dollar supercomputers) do not usually have caches. These ultra-expensive computers can afford to implement their main memory in a comparatively fast semiconductor technology such as static RAM (SRAM), and can afford so many banks that cacheless bandwidth out of the main memory system is sufficient. Massively parallel processors (MPPs), however, are often constructed out of workstation-like nodes to reduce cost. MPPs therefore contain cache hierarchies similar to those found in the workstations on which the nodes of the MPPs are based.

Cache sizes have been steadily increasing on personal computers and workstations. Intel Pentium-based personal computers come with 8 Kbyte each of instruction and data caches. Two of the Pentium chip sets, manufactured by Intel and OPTi, allow level-two caches ranging from 256 to 512 Kbyte and 64 Kbyte to 2 Mbyte, respectively. The newer Pentium Pro systems also have 8-Kbyte first-level instruction and data caches, but they also have either a 256-Kbyte or a 512-Kbyte second-level cache on the same module as the processor chip. Higher-end workstations—such as DEC Alpha 21164-based systems—are configured with substantially more cache. The 21164 also has 8-Kbyte first-level instruction and data caches. Its second-level cache is entirely on-chip, and is 96 Kbyte. The third-level cache is off-chip, and can have a size ranging from 1 to 64 Mbyte. For all desktop machines, cache sizes are likely to continue to grow—although the rate of growth compared to processor speed increases and main memory size increases is unclear.

88.4 Parallel and Interleaved Memories

Main memories are composed of a series of semiconductor memory chips. A number of these chips, like caches, form a bank. Multiple memory banks can be connected together to form an interleaved (or parallel) memory system. Since each bank can service a request, an interleaved memory system with K banks can service K requests simultaneously, increasing the peak bandwidth of the memory system to K times the bandwidth of a single bank. In most interleaved memory systems, the number of banks is a power of two, that is, K = 2^k. An n-bit memory word address is broken into two parts: a k-bit bank number and an m-bit address of a word within a bank. Though the k bits used to select a bank number could be any k bits of the n-bit word address, typical interleaved memory systems use the low-order k address bits to select the bank number; the higher-order m = n – k bits of the word address are used to access a word in the selected bank. The reason for using the low-order k bits will be discussed shortly. An interleaved memory system which uses the low-order k bits to select the bank is referred to as a low-order or a standard interleaved memory.

There are two ways of connecting multiple memory banks: simple interleaving and complex interleaving. Sometimes simple interleaving is also referred to simply as interleaving, and complex interleaving as banking.

Figure 88.5 shows the structure of a simple interleaved memory system. m address bits are simultaneously supplied to every memory bank.
All banks are also connected to the same read/write control line (not shown in Fig. 88.5). For a read operation, the banks start the read operation and deposit the data in their latches. Data can then be read from the latches, one by one, by setting the switch appropriately. Meanwhile, the banks could be accessed again, to carry out another read or write operation. For a write operation, the latches are loaded, one by one. When all the latches have been written, their contents can be written into the memory banks by supplying m bits of address (they will be written into the same word in each of the different banks). In a simple interleaved memory, all banks are cycled at the same time; each bank starts and completes its individual operations at the same time as every other bank; a new memory cycle can start (for all banks) once the previous cycle is complete. Timing details of the accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].

FIGURE 88.5 A simple interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined Computers, 1st ed., New York: McGraw-Hill, 1981, p. 41.)

One use of a simple interleaved memory system is to back up a cache memory. To do so, the memory must be able to read blocks of contiguous words (a cache block) and supply them to the cache. If the low-order k bits of the address are used to select the bank number, then consecutive words of the block reside in different banks, and they can all be read in parallel and supplied to the cache one by one. If some other address bits are used for bank selection, then multiple words from the block might fall in the same memory bank, requiring multiple accesses to the same bank to fetch the block.
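A minimal sketch of the low-order (standard) interleaving just described, assuming four banks: the low-order k bits of the word address select the bank and the remaining m bits the word within the bank, so the consecutive words of a cache block rotate across all the banks, while a stride equal to the number of banks sends every reference to the same bank.

```c
#include <stdio.h>

/* Low-order (standard) interleaving with K = 2^k banks: the low-order
 * k bits of a word address give the bank number, the remaining m bits
 * the word within the bank.  Four banks (k = 2) is an assumed example. */
#define K_BITS 2u
#define BANKS  (1u << K_BITS)

static unsigned bank_of(unsigned word_addr)          { return word_addr & (BANKS - 1u); }
static unsigned word_within_bank(unsigned word_addr) { return word_addr >> K_BITS; }

int main(void)
{
    /* Stride 1, e.g. the consecutive words of a cache block: the bank
     * number rotates, so all the words can be fetched in parallel. */
    printf("stride 1:\n");
    for (unsigned a = 64; a < 64 + 8; a++)
        printf("  word address %3u -> bank %u, word %u within bank\n",
               a, bank_of(a), word_within_bank(a));

    /* Stride equal to the number of banks: every access falls in the
     * same bank and must wait out the full bank busy time. */
    printf("stride %u:\n", BANKS);
    for (unsigned a = 64; a < 64 + 8 * BANKS; a += BANKS)
        printf("  word address %3u -> bank %u\n", a, bank_of(a));
    return 0;
}
```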
Figure 88.6 shows the structure of a complex interleaved memory system. In such a system, each bank is set up to operate on its own, independent of the operation of the other banks. For example, bank 1 could carry out a read operation on a particular memory address, while bank 2 carries out a write operation on a completely unrelated memory address. (Contrast this with the operation in a simple interleaved memory, where all banks carry out the same operation, read or write, and the locations accessed within each bank represent a contiguous block of memory.)

FIGURE 88.6 A complex interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined Computers, 1st ed., New York: McGraw-Hill, 1981, p. 42.)

Complex interleaving is accomplished by providing an address latch and a read/write command line for each bank. The memory controller handles the overall operation of the interleaved memory. The processing unit submits the memory request to the memory controller, which determines which bank needs to be accessed. The controller then determines if the bank is busy (by monitoring a busy line for
each bank). The controller holds the request if the bank is busy, submitting it later when the bank is available to accept the request. When the bank responds to a read request, the switch is set by the controller to accept the data from the bank and forward it to the processing unit. Details of the timing of accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].

A typical use of a complex interleaved memory system is in a vector processor. In a vector processor, the processing units operate on a vector, for example a portion of a row or a column of a matrix. If consecutive elements of a vector are present in different memory banks, then the memory system can sustain a bandwidth of one element per clock cycle. By arranging the data suitably in memory and using standard interleaving (for example, storing the matrix in row-major order will place consecutive elements in consecutive memory banks), the vector can be accessed at the rate of one element per clock cycle as long as the number of banks is greater than the bank busy time.

Memory systems that are built for current machines vary widely, the price and purpose of the machine being the main determinant of the memory system design. The actual memory chips, which are the components of the memory systems, are generally commodity parts built by a number of manufacturers. The major commodity DRAM manufacturers include (but certainly are not limited to) Hitachi, Fujitsu, LG Semicon, NEC, Oki, Samsung, Texas Instruments, and Toshiba.

The low end of the price/performance spectrum is the personal computer, presently typified by Intel Pentium systems. Three of the manufacturers of Pentium-compatible chip sets (which include the memory controllers) are Intel, OPTi, and VLSI Technologies. Their controllers provide for memory systems that are simply interleaved, all with minimum bank depths of 256 Kbyte, and maximum system sizes of 192 Mbyte, 128 Mbyte, and 1 Gbyte, respectively.

Both higher-end personal computers and workstations tend to have more main memory than the lower-end systems, although they usually have similar upper limits. Two examples of such systems are workstations built with the DEC Alpha 21164, and servers built with the Intel Pentium Pro. The Alpha systems, using the 21171 chip set, are limited to 128 Mbyte of main memory using 16-Mbit DRAMs, although they will be expandable to 512 Mbyte when 64-Mbit DRAMs are available. Their memory systems are 8-way simply interleaved, providing 128 bits per DRAM access. The Pentium Pro systems support slightly different features. The 82450KX and 82450GX chip sets include memory controllers that allow reads to bypass writes (performing writes when the memory banks are idle). These controllers can also buffer eight outstanding requests simultaneously. The 82450KX controller permits 1- or 2-way interleaving, and up to 256 Mbyte of memory when 16-Mbit DRAMs are used. The 82450GX chip set is more aggressive, allowing up to four separate (complex-interleaved) memory controllers, each of which can be up to 4-way interleaved and have up to 1 Gbyte of memory (again with 16-Mbit DRAMs).

Interleaved memory systems found in high-end vector supercomputers are slight variants on the basic complex interleaved memory system of Fig. 88.6. Such memory systems may have hundreds of banks, with multiple memory controllers that allow multiple independent memory requests to be made every clock cycle.
Two examples of modern vector supercomputers are the Cray T-90 series and the NEC SX series. The Cray T-90 models come with varying numbers of processors—up to 32 in the largest configuration. Each of these processors is coupled with 256 Mbyte of memory, split into 16 banks of 16 Mbyte each. The T-90 has complex interleaving among banks. the largest configuration (the T-932) has 32 processors, for a total of 512 banks and 8 Gbyte of main memory. The T-932 can provide a peak of 800 Gbyte/s bandwidth out of its memory system. NEC’s SX-4 product line, their most recent vector supercomputer series, has numerous models. Their largest single-node model (with one processor per node) contains 32 processors, with a maximum of 8 Gbyte of memory, and a peak bandwidth of 512 Gbyte/s out of main memory. Although the sizes of the memory systems are vastly different between workstations and vector machines, the techniques that both use to increase total bandwidth and minimize bank conflicts are similar. 88.5 Virtual Memory Cache memory contains portions of the main memory in dynamically allocated cache lines. Since the data portion of the cache memory is itself a conventional memory, each line present in the cache has two addresses associated with it: its main memory address and its cache address. Thus, the main memory address of a word