Memory System
Based on slides from Jones & Bartlett
6.1 Introduction
• Memory lies at the heart of the stored-program computer.
• In previous chapters, we studied the components from which memory is built and the ways in which memory is accessed by various ISAs.
• In this chapter, we focus on memory organization. A clear understanding of these ideas is essential for the analysis of system performance.
6.2 Types of Memory
• Two typical kinds of main memory: random access memory (RAM) and read-only memory (ROM).
• Two typical types of RAM: dynamic RAM (DRAM) and static RAM (SRAM).
• Dynamic RAM consists of memory cells implemented as charge stored in capacitors. The charge leaks over time, so the cells must be refreshed every few milliseconds to prevent data loss.
• DRAM is cheaper owing to its simpler design.
6.2 Types of Memory
• SRAM consists of circuits similar to D flip-flops; its contents remain intact until changed or until power is turned off.
• SRAM is faster and does not need to be refreshed like DRAM does. It can be used to build both main memory and cache memory, which we will discuss in detail later.
• Technology has evolved into many variations of memory cell and chip designs, some allowing overlapping of accesses.
• ROM is used to store permanent or semipermanent data that persists even while power is turned off.
6.3 The Memory Hierarchy
• Generally speaking, faster memory is more expensive than slower memory.
• To provide the best performance at the lowest cost, memory is organized in a hierarchical fashion.
• Small, fast storage elements are kept in the CPU; larger, slower main memory is accessed through the system bus.
• Larger, (almost) permanent storage in the form of disk and tape drives is physically farther away from the CPU.
6.3 The Memory Hierarchy
• This storage organization can be thought of as a pyramid:
6.3 The Memory Hierarchy
• To access a particular piece of data from "memory", the CPU first sends a request to its nearest memory, usually cache.
• If the data is not in cache, then main memory is queried. If the data is not in main memory, then the request goes to disk and the program is temporarily suspended by the operating system.
• Once the data is located, the data and a number of its nearby data elements (a block) are fetched into cache memory.
6.3 The Memory Hierarchy
• This leads us to some general terminology.
– A hit occurs when data is found at a given memory level.
– A miss occurs when data is not found at a memory level.
– The hit rate is the percentage of accesses for which data is found at a given memory level.
– The miss rate is the percentage of accesses for which it is not.
– Miss rate = 1 − hit rate.
– The hit time is the time required to access data at a given memory level (when a hit occurs).
– The miss penalty is the additional time required to process a miss, including the time it takes to replace a block of memory plus the time it takes to deliver the data to the processor.
6.3 The Memory Hierarchy
• An entire block of data is copied after a hit at a lower level because the principle of locality asserts that once a byte is accessed, it is likely that a nearby data element will be needed soon.
• There are two forms of locality:
– Temporal locality: recently accessed information tends to be accessed again in the near future.
– Spatial locality: accesses tend to cluster in a memory region.
6.4 Cache Memory
• The purpose of cache memory is to speed up accesses by buffering data closer to the CPU, exploiting temporal/spatial locality, instead of having to access it in main memory every time.
• Although cache is much smaller than main memory, its access time is a fraction of that of main memory.
6.4 Cache Memory
• Both main memory and the cache are partitioned into a set of fixed-size blocks.
• To avoid confusion with memory blocks, we use the term cache line to refer to a block in cache.
• Cache mapping refers to the mapping of memory blocks to cache lines.
• Generally, a cache line can buffer any one of a set of memory blocks. Each cache line has a tag register that identifies the memory block currently buffered in the line.
6.4 Cache Memory
• The simplest cache mapping scheme is direct mapped cache.
• In a direct mapped cache consisting of N lines, block X of main memory is mapped to cache line Y = X mod N.
• Thus, if we have 4 lines of cache, line 2 of cache may hold blocks 2, 6, 10, 14, . . . of main memory.
• Once a block of memory is copied into a line in cache, a valid bit is set for the cache line to let the system know that the line contains valid data. What could happen if there were no valid bit?
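• As a rough sketch (an illustration only, not code from the textbook), the direct-mapped placement rule can be written in a couple of lines of Python:

    def direct_mapped_line(block: int, num_lines: int) -> int:
        # Block X of main memory maps to cache line X mod N.
        return block % num_lines

    # With 4 cache lines, blocks 2, 6, 10 and 14 all compete for line 2.
    print([direct_mapped_line(b, 4) for b in (2, 6, 10, 14)])   # -> [2, 2, 2, 2]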
6.4 Cache Memory
• The diagram below is a schematic of what a cache looks like.
• Line 0 contains block 0 from main memory, identified with the tag 00000000. Line 1 contains a memory block that is identified with the tag 11110101.
• Line 1 is buffering memory block 1111010101 (binary) = 3D5 (hex).
• The other two lines are not valid.
6.4 Cache Memory
• The size of each field into which a memory address is divided depends on the size of the cache.
• Suppose our memory consists of 2^14 words, the cache has 16 = 2^4 lines, and each line holds 8 = 2^3 words.
– Thus memory is divided into 2^14 / 2^3 = 2^11 blocks.
• For our field sizes, we need 4 bits for the cache line #, 3 bits for the word #, and the tag gets what is left over: 14 − 4 − 3 = 7 bits.
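• A small Python sketch (an illustration under the parameters above, not textbook code) that derives these field widths:

    from math import log2

    mem_words  = 2**14   # size of main memory (words)
    num_lines  = 16      # cache lines
    words_line = 8       # words per line/block

    word_bits = int(log2(words_line))                          # 3 bits select the word in a block
    line_bits = int(log2(num_lines))                           # 4 bits select the cache line
    tag_bits  = int(log2(mem_words)) - line_bits - word_bits   # 7 bits are left for the tag

    print(tag_bits, line_bits, word_bits)   # -> 7 4 3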
6.4 Cache Memory
• Suppose a program generates the address 1AA (hex). In 14-bit binary, this number is: 00000110101010.
• The first 7 bits of this address go in the tag field, the next 4 bits go in the line field, and the final 3 bits indicate the word within the line/block.
• The (7+4) bits together identify the memory block.
• Memory block # 35 (hex) is mapped into cache line # 5.
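• Continuing the sketch above (illustrative Python, assuming the 7/4/3 field split), the address 1AA can be decomposed with shifts and masks:

    addr = 0x1AA                       # 14-bit address 00000110101010

    word  = addr & 0b111               # low 3 bits: word within the block
    line  = (addr >> 3) & 0b1111       # next 4 bits: cache line
    tag   = addr >> 7                  # remaining 7 bits: tag
    block = addr >> 3                  # tag and line together form the block number

    print(tag, line, word, hex(block)) # -> 3 5 2 0x35  (block 35 hex maps to line 5)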
6.4 Cache Memory
• If the program subsequently generates the address 1AB, it will find the data it is looking for in line 0101, word 011.
• However, if the program generates the address 3AB instead, the block loaded for address 1AA will be evicted from the cache and replaced by the block associated with the 3AB reference.
6.4 Cache Memory
• Suppose a program generates a series of memory references such as: 1AB, 3AB, 1AB, 3AB, . . . The cache will continually evict and replace blocks.
• The theoretical advantage offered by the cache is lost in this extreme case.
• This is the main disadvantage of direct mapped cache: if the locality involves multiple blocks that are mapped into the same line, continual cache misses may result.
• Other cache mapping schemes are designed to prevent this kind of thrashing.
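• A quick Python sketch (illustrative only, using the 7/4/3 field split assumed above) showing that the alternating references 1AB, 3AB, 1AB, 3AB, . . . miss on every access in a direct mapped cache:

    lines = {}                         # line # -> tag currently buffered
    misses = 0

    for addr in [0x1AB, 0x3AB] * 4:    # 1AB, 3AB, 1AB, 3AB, ...
        line, tag = (addr >> 3) & 0b1111, addr >> 7
        if lines.get(line) != tag:     # tag mismatch: miss, evict and refill the line
            misses += 1
            lines[line] = tag

    print(misses)                      # -> 8: every reference is a miss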
Clicker 6.1
Suppose: Main memory is 1M words. Cache is 1K words. Block size is 16 words.
Memory address = <tag, line #, word #>
Cache address = <line #, word #>
How many cache lines are there?
A: 16   B: 32   C: 64   D: 128
Clicker 6.2
How many memory blocks are there?
A: 16K   B: 32K   C: 64K   D: 128K
Clicker 6.3
Which cache line is used to buffer memory block # 600?
A: Line # 0   B: Line # 20   C: Line # 40   D: Line # 60   E: None of the above
Clicker 6.4
What is the length of each field in the memory address (written as <tag, line #, word #>)?
A: <6, 10, 4>   B: <10, 6, 6>   C: <10, 10, 4>   D: <10, 6, 4>   E: None of the above
Clicker 6.5
What is the cache line # used to buffer the memory address FFF00 (in hex)?
A: Line # 16   B: Line # 32   C: Line # 48   D: Line # 64   E: None of the above
Clicker 6.6
Suppose a program accesses memory locations FFF00, FFF01, ..., FFFFF sequentially. How many cache misses will occur?
A: 16   B: 32   C: 128   D: 256   E: None of the above
Clicker 6.7
Continuing with the previous sequence of 256 memory accesses (FFF00 to FFFFF), what is the resulting hit ratio?
A: 9/10   B: 8/9   C: 7/8   D: 3/4   E: None of the above
6.4 Cache Memory
• Instead of mapping a memory block to a specific cache line, we could allow a block to go anywhere in cache.
• In this way, cache would have to fill up before any block eviction is needed.
• This is how fully associative cache works.
• As a result, a memory address contains only two fields: tag and word. The tag is simply the memory block #.
6.4 Cache Memory
• Suppose, as before, we have 14-bit memory addresses and a cache with 16 blocks, each block of size 8. The field format of a memory reference is an 11-bit tag followed by a 3-bit word field.
• When the cache is searched, all tags are compared in parallel so the data can be retrieved quickly.
• This requires special, costly hardware, so fully associative mapping is rarely used except for very small caches.
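• As a software analogue (an illustrative Python sketch; real hardware performs all the tag comparisons simultaneously rather than in a loop), a fully associative lookup simply checks every line's tag:

    cache = [None] * 16                 # 16 lines, each holding (tag, data) or None

    def lookup(addr):
        tag = addr >> 3                 # 11-bit tag = block #; low 3 bits select the word
        for entry in cache:             # hardware does these comparisons in parallel
            if entry is not None and entry[0] == tag:
                return entry[1]         # hit
        return None                     # miss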
Memory System
25
Clicker 6.8
In a fully associative memory, the memory address is <tag, word #> = <16 bits, 4 bits>. How many comparisons with tags are needed in checking a memory address for a cache hit?
A: 1   B: 16   C: 64K   D: None of the above
6.4 Cache Memory
• You will recall that a direct mapped cache evicts a block whenever another memory block that maps to the same line is referenced.
• With fully associative cache, we have no such fixed mapping, so we must devise an algorithm to determine which block to evict from the cache.
• The block that is evicted is the victim block.
• There are a number of ways to pick a victim; we will discuss them shortly.
6.4 Cache Memory
• Set associative cache combines the ideas of direct mapped cache and fully associative cache to form a compromise solution.
• An N-way set associative cache mapping is like direct mapped cache in that a cache line can buffer only a proper subset of the memory blocks.
• An alternative view is that each memory block can be mapped into any of a set of N cache lines. As a result, each memory access needs to be checked against a set of N cache lines, rather than a single line as in direct mapping or all of the cache lines as in fully associative mapping.
6.4 Cache Memory
• The number of cache lines per set in set associative cache varies according to overall system design.
• For example, a 2-way set associative cache can be conceptualized as shown in the schematic below.
• Each set may buffer two different memory blocks.
6.4 Cache Memory
• In set associative cache mapping, a memory reference is divided into three fields: tag, set, and word, as shown below.
• As with direct-mapped cache, the word field chooses the word within the cache block, and the tag field uniquely identifies the memory address.
• The set field determines the set to which the memory block maps. Note that tag + set = block # (of memory).
6.4 Cache Memory
• Suppose we have a main memory of 2^14 words.
• This memory is mapped to a 2-way set associative cache having 16 lines where each line contains 8 words.
• Since this is a 2-way cache, each set consists of 2 lines, and there are 8 sets.
• Thus, we need 3 bits for the set, 3 bits for the word, giving 8 leftover bits for the tag.
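• A small Python sketch (an illustration under the 2-way parameters above, not textbook code) of how an address splits into tag, set, and word:

    # Parameters from the example: 16 lines, 2-way => 8 sets; 8 words per line.
    def split_address(addr):
        word = addr & 0b111           # 3-bit word field
        set_ = (addr >> 3) & 0b111    # 3-bit set field (8 sets)
        tag  = addr >> 6              # 8-bit tag
        return tag, set_, word

    tag, set_, word = split_address(0x1AA)
    print(tag, set_, word)            # -> 6 5 2: only the 2 lines of set 5 must be checked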
Clicker 6.9
In a 4-way set-associative mapping system, the memory address is <tag, set #, word #> = <10 bits, 4 bits, 4 bits>. How many comparisons are needed for checking a memory access for a cache hit?
A: 1K   B: 16   C: 4   D: None of the above
Clicker 6.10
Consider the direct-mapped cache (address = <10, 6, 4>) and the 4-way set-associative cache (address = <10, 4, 4>).
1. The set-associative cache will always have a higher hit ratio for the same program.
2. The hit ratio of both caches will improve by increasing the block size (without changing the cache size).
A: 1 and 2 are true   B: Only 1 is true   C: Only 2 is true   D: Neither is true
6.4 Cache Memory
• With fully associative and set associative cache, a replacement policy is invoked when it becomes necessary to evict a block from cache.
• An optimal replacement policy would require knowledge of the future so as to evict the block whose next reference is farthest away in the future.
• It is impossible to implement an optimal replacement algorithm because this future knowledge is not available in general.
6.4 Cache Memory
• The replacement policy that we choose depends upon the locality that we are trying to optimize: usually, we are interested in temporal locality.
• A least recently used (LRU) algorithm keeps track of the last time each block was accessed and evicts the block that has been unused for the longest period of time.
• The disadvantage of this approach is its implementation complexity: LRU has to maintain an access history for each block, which ultimately slows down the cache.
• Spatial locality is exploited by using blocks as the basic units of mapping.
6.4 Cache Memory
• First-in, first-out (FIFO) is a popular cache replacement policy.
• In FIFO, the block that has been in the cache the longest is evicted, regardless of when it was last used. Notice that FIFO is not equivalent to LRU.
• A random replacement policy does what its name implies: it picks a block at random and replaces it with a new block.
• Random replacement can certainly evict a block that will be needed often or needed soon, but the randomness avoids thrashing.
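• A minimal Python sketch (illustrative only, modelling a single fully associative set with made-up block numbers) contrasting the LRU and FIFO choices of victim:

    from collections import OrderedDict

    def simulate(refs, capacity, policy):
        # Returns the number of misses for a tiny fully associative cache.
        cache, misses = OrderedDict(), 0          # insertion order records age
        for block in refs:
            if block in cache:
                if policy == "LRU":
                    cache.move_to_end(block)      # refresh recency; FIFO ignores reuse
            else:
                misses += 1
                if len(cache) == capacity:
                    cache.popitem(last=False)     # evict oldest (by recency or by arrival)
                cache[block] = True
        return misses

    refs = [1, 2, 3, 1, 4, 1, 5]
    print(simulate(refs, 3, "LRU"), simulate(refs, 3, "FIFO"))   # -> 5 6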
6.4 Cache Memory
• The performance of hierarchical memory is measured by its effective access time (EAT).
• EAT is a weighted average that takes into account the hit ratio and the relative access times of successive levels of memory.
• The EAT for a two-level memory is given by:
EAT = H × T_C + (1 − H) × T_M
where H is the cache hit rate, T_C is the access time in case of a cache hit, and T_M is the access time in case of a cache miss.
• Equivalently, we can rewrite the above as:
EAT = T_C + (1 − H) × T_P
where T_P = T_M − T_C is the miss penalty (the additional time in case of a cache miss).
Clicker 6.11
Consider the following parameters:
main memory access time = 10 ns
cache hit time = 2 ns
cache miss penalty = 20 ns
hit ratio = 85%
What is the memory speedup provided by the cache?
A: 5   B: 4   C: 2   D: None of the above
Clicker 6.12
For the same memory/cache system as the previous, what is the cache hit ratio required to double the previous speedup (of 2)?
A: Impossible   B: between 98% and 99%   C: between 99% and 99.5%   D: between 99.5% and 100%   E: between 85% and 98%
6.4 Cache Memory
• For example, consider a system with a main memory access time of 200 ns supported by a cache having a 10 ns access time and a hit rate of 99%. If we assume T_M = 200 ns as well, then the EAT is:
EAT = 0.99(10 ns) + 0.01(200 ns) = 9.9 ns + 2 ns = 11.9 ns.
• This equation for determining the effective access time can be extended to any number of memory levels, as we will see in later sections.
• Usually T_M > the main memory access time, because a whole block is transferred, not just a single word.
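• The two forms of the EAT formula are easy to check in a few lines of Python (an illustrative sketch using the numbers from the example above):

    def eat(hit_rate, t_hit, t_miss):
        # Weighted-average form: EAT = H * T_C + (1 - H) * T_M
        return hit_rate * t_hit + (1 - hit_rate) * t_miss

    def eat_penalty(hit_rate, t_hit, miss_penalty):
        # Equivalent miss-penalty form: EAT = T_C + (1 - H) * T_P
        return t_hit + (1 - hit_rate) * miss_penalty

    print(round(eat(0.99, 10, 200), 1))           # -> 11.9 (ns)
    print(round(eat_penalty(0.99, 10, 190), 1))   # -> 11.9 (ns), since T_P = 200 - 10 = 190 ns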
6.4 Cache Memory
• Cache replacement policies must also take into account dirty blocks: blocks that have been updated while they were in the cache (as a result, the copy in memory differs from the copy in cache).
• Dirty blocks must be written back to memory. A write policy determines how this will be done.
• There are two common types of write policies: write through and write back.
• Write through updates cache and main memory simultaneously on every write.
• Write back updates main memory only when the block is evicted.
6.4 Cache Memory
• Write back cache updates memory only when the block is selected for replacement.
• The disadvantage of write through is that memory must be updated with each cache write, which may tie up the system bus and memory unnecessarily (when a location is updated multiple times).
• The advantage of write back is that memory traffic is minimized; its disadvantage is potential inconsistency between data in memory and data in cache, which causes problems when memory is shared among processors and I/O devices.
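• A rough Python sketch of the difference between the two write policies for a single cached block (an illustration of the idea only; the class and names are hypothetical, not real cache code):

    class CachedBlock:
        def __init__(self, data, memory, addr, write_back=True):
            self.data, self.memory, self.addr = data, memory, addr
            self.write_back = write_back
            self.dirty = False

        def write(self, value):
            self.data = value
            if self.write_back:
                self.dirty = True                    # defer the memory update
            else:
                self.memory[self.addr] = value       # write through: update memory now

        def evict(self):
            if self.write_back and self.dirty:
                self.memory[self.addr] = self.data   # flush the dirty block on eviction

    memory = {0x10: 0}
    blk = CachedBlock(0, memory, 0x10, write_back=True)
    blk.write(42); print(memory[0x10])   # -> 0   (memory still stale)
    blk.evict();   print(memory[0x10])   # -> 42  (updated only at eviction)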
6.4 Cache Memory
• The cache we have been discussing is called a unified or integrated cache: both instructions and data are stored in the same cache.
• Many modern systems employ separate caches for data and instructions. These are called the instruction cache and the data cache.
• The separation of data from instructions provides better locality and concurrency, at the extra cost of separate management.
6.4 Cache Memory
• Cache performance can also be improved by adding a small associative cache to hold blocks that have been evicted recently. This is called a victim cache. A victim cache can be accessed upon a cache miss before memory is accessed.
• A trace cache is a variant of an instruction cache that holds decoded instructions for program branches, giving faster access to instructions from branch targets. This allows more efficient processing of program loops, especially in the context of a pipelined processor.
6.4 Cache Memory
• Most of today's small systems employ multilevel cache hierarchies.
• The levels of cache form their own small memory hierarchy.
• Level 1 cache (small size) is situated on the processor itself.
– Access time is typically about 1 ns.
• Level 2 cache (medium size) may be on the motherboard or on an expansion card.
– Access time is usually around a few ns.
6.4 Cache Memory
• In systems that employ three levels of cache, the Level 2 cache is placed on the same die as the CPU (reducing access time to about 10 ns).
• Accordingly, the Level 3 cache (2 MB to 256 MB) refers to cache that is situated between the processor and main memory.
• Once the number of cache levels is determined, the next thing to consider is whether data (or instructions) can exist in more than one cache level.
6.4 Cache Memory
• If the cache system uses an inclusive cache, the same data may be present at multiple levels of cache.
• Strictly inclusive caches guarantee that all data in a smaller cache also exists at the next higher level.
• Exclusive caches permit only one copy of the data.
• The tradeoffs in choosing one over the other involve weighing the variables of access time, memory size (cost), and consistency in use.
Program Performance Analysis
Consider the MARIE program:

S1,   Load     Sum
      AddI     X
      Store    Sum
      Load     X
      Add      One
      Store    X
      Load     C
      Subt     One
      Store    C
      Skipcond 400
      Jump     S1
      Halt
Sum,  Dec      0
One,  Dec      1
C,    Dec      128
X,    Dec      128
Program Performance Analysis
Assumptions:
– Cache size = 128 words
– Block/line size = 8 words
– Memory access time = 10 ns
– Cache hit time = 1 ns
– Miss penalty = 30 ns
Analysis:
– # Cache lines = 128/8 = 16.
– Code is stored in memory blocks # 0 to # 1, buffered in cache lines # 0 to # 1.
– Data (the 128 words summed through the indirect references) is stored in memory blocks # 16 to # 31, buffered in cache lines # 0 to # 15.
– # Instructions executed per iteration = 10.
– # Memory operands accessed per iteration = 10.
Program Performance Analysis
Cache hit/miss tracing for the first eight iterations:

Instruction     Instruction fetch            Data access
Load Sum        Miss (Block 0 → Line 0)*     Miss (Block 1 → Line 1)*
AddI X                                       Miss (Block 16 → Line 0)
Store Sum       Miss (Block 0 → Line 0)
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1

* only in the first iteration
Program Performance Analysis
Cache hit/miss tracing for the second set of eight iterations:

Instruction     Instruction fetch            Data access
Load Sum
AddI X                                       Miss (Block 17 → Line 1)
Store Sum                                    Miss (Block 1 → Line 1)
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1
Program Performance Analysis
Cache hit/miss tracing for the 17th iteration (and similarly the 25th, 33rd, …):

Instruction     Instruction fetch            Data access
Load Sum
AddI X                                       Miss (Block 18 → Line 2)
Store Sum
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1
Program Performance Analysis
Total number of cache misses = (2 + 2×8) + 2×8 + 14 = 48
  (first 8 iterations: 2 first-iteration misses + 2 misses per iteration)
  (second 8 iterations: 2 misses per iteration)
  (remaining iterations: 1 miss per new data block, 14 blocks)
Miss ratio = 48 / (20 × 128) = 3/160
Speedup = t_M / [t_C + (1 − r) t_P] = 10 / [1 + (3/160) × 30] = 10 / [1 + 9/16] = 6.4
(where r is the hit ratio, so 1 − r = 3/160)
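The figures above can be double-checked with a few lines of Python (an illustrative calculation only, using the slide's own counts of 20 memory accesses per iteration):

    from fractions import Fraction

    misses   = (2 + 2*8) + 2*8 + 14          # first 8 iters + second 8 iters + rest
    accesses = 20 * 128                      # 20 memory accesses per iteration, 128 iterations
    miss_ratio = Fraction(misses, accesses)

    t_c, t_p, t_m = 1, 30, 10                # hit time, miss penalty, memory access time (ns)
    eat = t_c + miss_ratio * t_p             # effective access time with the cache
    speedup = Fraction(t_m) / eat            # versus memory alone

    print(misses, miss_ratio, float(speedup))   # -> 48 3/160 6.4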
Clicker 6.13
For the previous example, how would the miss ratio change if the initial content of C is 1K instead?
A: not much change   B: increase noticeably   C: decrease noticeably
Clicker 6.14
For the same example, how would the miss ratio change if the block size is changed from 8 to 16?
A: not much change   B: increase noticeably   C: decrease noticeably
Clicker 6.15
For the same example, if 2-way set-associative mapping is used, how will the miss ratio change?
A: not much change   B: increase noticeably   C: decrease noticeably