





# Calculating Tag Size

- "4KB cache" means cache holds 4KB of data
  - Called capacity
  - Tag storage is considered overhead (not included in capacity)
- Calculate tag overhead of 4KB cache with 1024 4B frames
  - Not including valid bits
  - 4B frames → 2-bit offset
  - 1024 frames → 10-bit index
  - 32-bit address - 2-bit offset - 10-bit index = 20-bit tag
  - 20-bit tag \* 1024 frames = 20Kb tags = 2.5KB tags
  - 63% overhead

# Measuring Cache Performance

- Ultimate metric is t<sub>avg</sub>
  - Cache capacity roughly determines t<sub>hit</sub>
  - Lower-level memory structures determine t<sub>miss</sub>
  - Measure %<sub>miss</sub>
    - Hardware performance counters (Pentium, Sun, etc.)
    - Simulation (write a program that mimics behavior)
    - Hand simulation (next slide)
  - %<sub>miss</sub> depends on program that is running
    - Why?

© 2012 Daniel J. Sorin from Roth and Lebeck ECE 152 32

# **Cache Performance Simulation**

- Parameters: 8-bit addresses, 32B cache, 4B blocks
  - Addresses initially in cache: 0, 4, 8, 12, 16, 20, 24, 28
  - To find location in cache, do mod32 arithmetic (why 32?)

| Cache contents (prior to access)            | Address         | Outcome |
|---------------------------------------------|-----------------|---------|
| 0, 4, 8, 12, 16, 20, 24, 28                 | 200 (200%32=8)  | Miss    |
| 0, 4, <b>200</b> , 12, 16, 20, 24, 28       | 204 (204%32=12) | Miss    |
| 0, 4, 200, <b>204</b> , 16, 20, 24, 28      | 144 (144%32=16) | Miss    |
| 0, 4, 200, 204, <b>144</b> , 20, 24, 28     | 6               | Hit     |
| 0, 4, 200, 204, 144, 20, 24, 28             | 8               | Miss    |
| 0, 4, <b>8</b> , 204, 144, 20, 24, 28       | 12              | Miss    |
| 0, 4, 8, <b>12</b> , 144, 20, 24, 28        | 20              | Hit     |
| 0, 4, 8, 12, 144, 20, 24, 28                | 16              | Miss    |
| 0, 4, 8, 12, <b>16</b> , 20, 24, 28         | 144             | Miss    |
| 0, 4, 8, 12, <b>144</b> , 20, 24, 28        | 200             | Miss    |
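
The hand simulation in the table can be cross-checked with a short program. The sketch below is a minimal direct-mapped model that assumes the parameters on this slide (32B cache, 4B blocks, decimal addresses); the function name `simulate_direct_mapped` is illustrative and does not come from the lecture.

```python
def simulate_direct_mapped(trace, initial, cache_bytes=32, block_bytes=4):
    """Replay `trace` on a direct-mapped cache and return 'Hit'/'Miss' per access."""
    num_frames = cache_bytes // block_bytes
    frames = [None] * num_frames          # each frame remembers one block number
    for addr in initial:                  # preload the initial contents
        block = addr // block_bytes
        frames[block % num_frames] = block
    outcomes = []
    for addr in trace:
        block = addr // block_bytes
        frame = block % num_frames        # same frame as (addr % cache_bytes) // block_bytes
        if frames[frame] == block:
            outcomes.append("Hit")
        else:
            outcomes.append("Miss")
            frames[frame] = block         # fill the frame, evicting the old block
    return outcomes


initial = [0, 4, 8, 12, 16, 20, 24, 28]
trace = [200, 204, 144, 6, 8, 12, 20, 16, 144, 200]
for addr, outcome in zip(trace, simulate_direct_mapped(trace, initial)):
    print(addr, outcome)
```

The frame computation is the mod-32 arithmetic from the slide: 32 is the cache size in bytes, so `addr % 32` selects the byte position within the cache.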



# Calculating Tag Size

- Calculate tag overhead of 4KB cache with 512 8B frames
  - Not including valid bits
  - 8B frames → 3-bit offset
  - 512 frames → 9-bit index
  - 32-bit address - 3-bit offset - 9-bit index = 20-bit tag
  - 20-bit tag \* 512 frames = 10Kb tags = 1.25KB tags
  - + 31% overhead (1.25KB of tags / 4KB of data)
    - + Less tag overhead with larger blocks
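
The tag arithmetic for the 4B-frame and 8B-frame configurations can be checked with a few lines of code. This is a sketch that assumes a direct-mapped cache, power-of-two sizes, and 32-bit addresses; `tag_overhead` is a hypothetical helper name, and valid bits are ignored as on the slide.

```python
def tag_overhead(capacity_bytes, block_bytes, address_bits=32):
    """Return (tag_bits, tag_storage_bytes, overhead_fraction) for a direct-mapped cache."""
    frames = capacity_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = frames.bit_length() - 1
    tag_bits = address_bits - offset_bits - index_bits
    tag_bytes = tag_bits * frames / 8            # one tag per frame, 8 bits per byte
    return tag_bits, tag_bytes, tag_bytes / capacity_bytes


for block_bytes in (4, 8):
    bits, size, fraction = tag_overhead(4096, block_bytes)
    print(f"{block_bytes}B frames: {bits}-bit tag, {size / 1024:.2f}KB of tags, "
          f"{fraction * 100:.2f}% overhead")
```

Doubling the block size halves the number of frames while the tag width stays at 20 bits, which is why the overhead halves.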


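# **Cache Performance Simulation**

- Parameters: 8-bit addresses, 32B cache, 8B blocks
  - Addresses in base-4 notation (two bits per digit, so an 8-bit address has four digits)
  - Each 8B block is written as its two 4B words, e.g. 0000(0010)
  - Initial contents: 0000(0010), 0020(0030), 0100(0110), 0120(0130)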

| Cache contents (prior to access)                       | Address | Outcome                 |
|--------------------------------------------------------|---------|-------------------------|
| 0000(0010), 0020(0030), 0100(0110), 0120(0130)         | 3020    | Miss                    |
| 0000(0010), <b>3020(3030)</b> , 0100(0110), 0120(0130) | 3030    | Hit (spatial locality!) |
| 0000(0010), 3020(3030), 0100(0110), 0120(0130)         | 2100    | Miss                    |
| 0000(0010), 3020(3030), <b>2100(2110)</b> , 0120(0130) | 0012    | Hit                     |
| 0000(0010), 3020(3030), 2100(2110), 0120(0130)         | 0020    | Miss                    |
| 0000(0010), <b>0020(0030)</b> , 2100(2110), 0120(0130) | 0030    | Hit (spatial locality)  |
| 0000(0010), 0020(0030), 2100(2110), 0120(0130)         | 0110    | Miss (conflict)         |
| 0000(0010), 0020(0030), <b>0100(0110)</b> , 0120(0130) | 0100    | Hit (spatial locality)  |
| 0000(0010), 0020(0030), 0100(0110), 0120(0130)         | 2100    | Miss                    |
| 0000(0010), 0020(0030), <b>2100(2110)</b> , 0120(0130) | 3020    | Miss                    |
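
To see why 3030 hits after 3020 while 0110 misses after 2100, it helps to split each address into tag, frame index, and block offset. The sketch below assumes the slide's parameters (32B direct-mapped cache, 8B blocks, base-4 address strings); `split_address` is an illustrative name.

```python
def split_address(addr_base4, cache_bytes=32, block_bytes=8):
    """Split a base-4 address string into (tag, frame index, block offset)."""
    addr = int(addr_base4, 4)                    # e.g. "3020" is 200 in decimal
    num_frames = cache_bytes // block_bytes
    offset = addr % block_bytes                  # byte within the block
    index = (addr // block_bytes) % num_frames   # frame the block maps to
    tag = addr // cache_bytes                    # remaining high-order bits
    return tag, index, offset


for address in ("3020", "3030", "2100", "0110"):
    print(address, split_address(address))
```

Addresses that share both tag and index lie in the same block (spatial locality); addresses that share an index but differ in tag compete for one frame (conflict).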

# Effect of Block Size

- Increasing block size has two effects (one good, one bad)
  - + Spatial prefetching
    - For blocks with adjacent addresses
    - Turns miss/miss pairs into miss/hit pairs
    - Example from previous slide: 3020,3030
  - – Conflicts
    - For blocks with non-adjacent addresses (but adjacent frames)
    - Turns hits into misses by disallowing simultaneous residence
    - Example: 2100,0110
- Both effects always present to some degree
  - Spatial prefetching dominates initially (until 64–128B)
  - Interference dominates afterwards
  - Optimal block size is 32–128B (varies across programs)
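
One way to explore the trade-off is to replay a single reference stream at several block sizes and count misses. The sketch below reuses the ten-access trace from the earlier simulation slides on a 32B direct-mapped cache; `count_misses` is a hypothetical helper, and the trace is far too short to show where interference starts to dominate in real programs.

```python
def count_misses(trace, initial, block_bytes, cache_bytes=32):
    """Count misses for a direct-mapped cache with the given block size."""
    num_frames = cache_bytes // block_bytes
    frames = [None] * num_frames
    for addr in initial:                      # preload the initial contents
        block = addr // block_bytes
        frames[block % num_frames] = block
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        frame = block % num_frames
        if frames[frame] != block:
            misses += 1
            frames[frame] = block             # replace whatever was there
    return misses


# Addresses are written in base 4, as on the slides.
initial = [int(a, 4) for a in ("0000", "0010", "0020", "0030",
                               "0100", "0110", "0120", "0130")]
trace = [int(a, 4) for a in ("3020", "3030", "2100", "0012", "0020",
                             "0030", "0110", "0100", "2100", "3020")]
for block_bytes in (4, 8, 16, 32):
    print(f"{block_bytes}B blocks: {count_misses(trace, initial, block_bytes)} misses")
```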

# Conflicts

- What about pairs like 3030/0030, 0100/2100?
  - These will **conflict** in a direct-mapped cache of this capacity, regardless of block size
  - Will keep generating misses
- Can we allow pairs like these to simultaneously reside?
  - Yes, but we have to reorganize cache to do so

| Cache contents (prior to access)                       | Address | Outcome |
|--------------------------------------------------------|---------|---------|
| 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130         | 3020    | Miss    |
| 0000, 0010, 3020, 0030, 0100, 0110, 0120, 0130         | 3030    | Miss    |
| 0000, 0010, 3020, <b>3030</b> , 0100, 0110, 0120, 0130 | 2100    | Miss    |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130         | 0012    | Hit     |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130         | 0020    | Miss    |
| 0000, 0010, 0020, 3030, 2100, 0110, 0120, 0130         | 0030    | Miss    |
| 0000, 0010, 0020, <b>0030</b> , 2100, 0110, 0120, 0130 | 0110    | Hit     |





# **Cache Performance Simulation**

- Parameters: 32B cache, 4B blocks, 2-way set-associative
  - Initial contents: 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130

| Cache contents                                              | Address | Outcome              |
|-------------------------------------------------------------|---------|----------------------|
| [0000,0100], [0010,0110], [0020,0120], [0030,0130]          | 3020    | Miss                 |
| [0000,0100], [0010,0110], [0120, <b>3020</b> ], [0030,0130] | 3030    | Miss                 |
| [0000,0100], [0010,0110], [0120,3020], [0130, <b>3030</b> ] | 2100    | Miss                 |
| [0100, <b>2100</b> ], [0010,0110], [0120,3020], [0130,3030] | 0012    | Hit                  |
| [0100,2100], [0010,0110], [0120,3020], [0130,3030]          | 0020    | Miss                 |
| [0100,2100], [0010,0110], [3020, <b>0020</b> ], [0130,3030] | 0030    | Miss                 |
| [0100,2100], [0010,0110], [3020,0020], [3030, <b>0030</b> ] | 0110    | Hit                  |
| [0100,2100], [0010,0110], [3020,0020], [3030,0030]          | 0100    | Hit (avoid conflict) |
| [2100,0100], [0010,0110], [3020,0020], [3030,0030]          | 2100    | Hit (avoid conflict) |
| [0100,2100], [0010,0110], [3020,0020], [3030,0030]          | 3020    | Hit (avoid conflict) |
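
The set-associative table can be replayed in code as well. The sketch below assumes LRU replacement and that the initial blocks were loaded in increasing address order (both are assumptions, chosen because they match the set contents shown); `simulate_set_associative` is an illustrative name.

```python
from collections import OrderedDict


def simulate_set_associative(trace, initial, cache_bytes=32, block_bytes=4, ways=2):
    """Replay base-4 addresses on a set-associative LRU cache; return outcomes."""
    num_sets = cache_bytes // (block_bytes * ways)
    # One OrderedDict per set; keys are block numbers, least recently used first.
    sets = [OrderedDict() for _ in range(num_sets)]

    def access(addr_base4):
        block = int(addr_base4, 4) // block_bytes
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)       # mark as most recently used
            return "Hit"
        if len(cache_set) == ways:
            cache_set.popitem(last=False)      # evict the least recently used block
        cache_set[block] = True
        return "Miss"

    for addr in initial:                       # warm up with the initial contents
        access(addr)
    return [access(addr) for addr in trace]


initial = ["0000", "0010", "0020", "0030", "0100", "0110", "0120", "0130"]
trace = ["3020", "3030", "2100", "0012", "0020", "0030", "0110", "0100", "2100", "3020"]
for addr, outcome in zip(trace, simulate_set_associative(trace, initial)):
    print(addr, outcome)
```

With two ways per set, 0100 and 2100 can reside together in set 0, which is what turns the last three accesses into hits.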



# Replacement Policies

- Set-associative caches present a new design choice
  - On cache miss, which block in set to replace (kick out)?
- Some options
  - Random
  - FIFO (first-in first-out)
    - When is this a good idea?
  - LRU (least recently used)
    - Fits with temporal locality, LRU = least likely to be used in future
  - NMRU (not most recently used)
    - An easier-to-implement approximation of LRU
    - NMRU=LRU for 2-way set-associative caches
  - Belady's: replace block that will be used furthest in future
    - Unachievable optimum (but good for comparisons)
- Which policy is simulated in previous slide?
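
The policies can be compared on a single set with a small model. The sketch below is an illustration under simplifying assumptions: one set, symbolic block names, and a trace known in advance (which is what makes Belady's policy computable here but unachievable in hardware). The name `misses_under_policy` is hypothetical.

```python
def misses_under_policy(trace, ways, policy):
    """Count misses for one `ways`-entry set under 'LRU', 'FIFO', or 'Belady'."""
    resident = []                          # front of the list is the next LRU/FIFO victim
    misses = 0
    for i, block in enumerate(trace):
        if block in resident:
            if policy == "LRU":            # only LRU reorders on a hit
                resident.remove(block)
                resident.append(block)
            continue
        misses += 1
        if len(resident) == ways:
            if policy == "Belady":
                future = trace[i + 1:]
                # Evict the block whose next use is furthest away (or never comes).
                victim = max(resident,
                             key=lambda b: future.index(b) if b in future else len(future))
            else:
                victim = resident[0]       # oldest by use (LRU) or by fill (FIFO)
            resident.remove(victim)
        resident.append(block)
    return misses


trace = list("ABCABCABC")                  # three blocks cycling through a 2-way set
for policy in ("LRU", "FIFO", "Belady"):
    print(policy, misses_under_policy(trace, ways=2, policy=policy))
```

The cyclic trace is a worst case for LRU and FIFO in a 2-way set, which makes the gap to the optimal policy easy to see.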


















# Miss Classification (3C Model)

- Parameters: 8-bit addresses, 32B …
  - Initial contents: 0000, 0010, 0020, 0030, 0100, …
  - Initial blocks accessed in increasing …

| Cache contents                                         | Address | Outcome           |
|--------------------------------------------------------|---------|-------------------|
| 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130         | 3020    | Miss (compulsory) |
| 0000, 0010, <b>3020</b> , 0030, 0100, 0110, 0120, 0130 | 3030    | Miss (compulsory) |
| 0000, 0010, 3020, <b>3030</b> , 0100, 0110, 0120, 0130 | 2100    | Miss (compulsory) |
| 0000, 0010, 3020, 3030, <b>2100</b> , 0110, 0120, 0130 | 0012    | Hit               |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130         | 0020    | Miss (capacity)   |
| 0000, 0010, <b>0020</b> , 3030, 2100, 0110, 0120, 0130 | 0030    | Miss (capacity)   |
| 0000, 0010, 0020, <b>0030</b> , 2100, 0110, 0120, 0130 | 0110    | Hit               |
| 0000, 0010, 0020, 0030, 2100, 0110, 0120, 0130         | 0100    | Miss (capacity)   |
| 0000, 0010, 0020, 0030, <b>0100</b> , 0110, 0120, 0130 | 2100    | Miss (conflict)   |
| 0000, 0010, 0020, 0030, <b>2100</b> , 0110, 0120, 0130 | 3020    | Miss (capacity)   |
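
Miss classification can be automated by running a reference cache alongside the cache under study. The sketch below assumes a common textbook definition that is not spelled out on this slide: a miss is compulsory if the block was never referenced before, capacity if a fully-associative LRU cache of the same size would also miss, and conflict otherwise. It also assumes a 32B direct-mapped cache with 4B blocks and initial blocks touched in increasing address order; `classify_misses` is an illustrative name.

```python
from collections import OrderedDict


def classify_misses(trace, initial, cache_bytes=32, block_bytes=4):
    """Label each access as Hit or as a compulsory/capacity/conflict miss."""
    num_frames = cache_bytes // block_bytes
    direct = [None] * num_frames           # direct-mapped cache under study
    fully = OrderedDict()                  # fully-associative LRU cache, same capacity
    seen = set()                           # every block referenced so far

    def access(addr_base4):
        block = int(addr_base4, 4) // block_bytes
        frame = block % num_frames
        dm_hit = direct[frame] == block
        direct[frame] = block
        fa_hit = block in fully
        if fa_hit:
            fully.move_to_end(block)
        else:
            if len(fully) == num_frames:
                fully.popitem(last=False)  # evict the least recently used block
            fully[block] = True
        first_use = block not in seen
        seen.add(block)
        if dm_hit:
            return "Hit"
        if first_use:
            return "Miss (compulsory)"
        return "Miss (conflict)" if fa_hit else "Miss (capacity)"

    for addr in initial:
        access(addr)
    return [access(addr) for addr in trace]


initial = ["0000", "0010", "0020", "0030", "0100", "0110", "0120", "0130"]
trace = ["3020", "3030", "2100", "0012", "0020", "0030", "0110", "0100", "2100", "3020"]
for addr, label in zip(trace, classify_misses(trace, initial)):
    print(addr, label)
```

Under this assumed definition the final access to 3020 comes out as a conflict miss, because only seven other distinct blocks are touched since its previous use. The table labels it a capacity miss, so treat the definition here as an assumption rather than the lecture's exact rule.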

# ABC

- Associativity (increase)
  - + Decreases conflict misses
  - – Increases t<sub>hit</sub>
- Block size (increase)
  - – Increases conflict misses
  - + Decreases compulsory misses
  - ± Increases or decreases capacity misses
  - Negligible effect on t<sub>hit</sub>
- Capacity (increase)
  - + Decreases capacity misses
  - – Increases t<sub>hit</sub>


# Victim Buffer

- Conflict misses: not enough associativity
  - High associativity is expensive, but also rarely needed
    - 3 blocks mapping to same 2-way set and accessed (ABC)\*
- Victim buffer (VB): small fully-associative (FA) cache (e.g., 4 entries)
  - Sits on I\$/D\$ fill path
  - VB is small → very fast
  - Blocks kicked out of I\$/D\$ placed in VB
  - On miss, check VB: hit? Place block back in I\$/D\$
  - 4 extra ways, shared among all sets
    - + Only a few sets will need it at any given time
  - + Very effective in practice

[Figure: VB on the fill path between I\$/D\$ and L2]
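
The effect of a victim buffer on the (ABC)\* pattern can be illustrated with a toy model. The sketch below makes several assumptions that the slide does not specify: a single 2-way LRU set, a FIFO-managed victim buffer, and a count of fetches from the next level as the metric. `next_level_fetches` is a hypothetical name.

```python
from collections import OrderedDict, deque


def next_level_fetches(trace, ways=2, vb_entries=0):
    """Count fetches from the next level for one LRU set with an optional victim buffer."""
    cache_set = OrderedDict()              # LRU order, least recently used first
    victims = deque(maxlen=vb_entries)     # FIFO victim buffer (assumed policy)
    fetches = 0
    for block in trace:
        if block in cache_set:
            cache_set.move_to_end(block)   # ordinary hit
            continue
        if block in victims:
            victims.remove(block)          # VB hit: block moves back into the set
        else:
            fetches += 1                   # missed in both: fetch from the next level
        if len(cache_set) == ways:
            evicted, _ = cache_set.popitem(last=False)
            victims.append(evicted)        # evicted block is parked in the VB
        cache_set[block] = True
    return fetches


trace = list("ABCABCABC")
print("no VB:     ", next_level_fetches(trace, vb_entries=0))
print("4-entry VB:", next_level_fetches(trace, vb_entries=4))
```

With the buffer in place, only the first touch of each block goes to the next level; later set misses are satisfied by swapping with the VB.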





# Write Issues

- So far we have looked at reading from cache (loads)
- What about writing into cache (stores)?
- Several new issues
  - Tag/data access
  - Write-through vs. write-back
  - Write-allocate vs. write-non-allocate









# Write-allocate vs. Write-non-allocate

- What to do on a write miss?
  - Write-allocate: read block from lower level, write value into it
    - + Decreases read misses
    - – Requires additional bandwidth
    - Use with write-back
  - Write-non-allocate: just write to next level
    - – Potentially more read misses
    - + Uses less bandwidth
    - Use with write-through
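
The read-miss difference between the two policies shows up when a store is followed by a load to the same block. The sketch below is a simplified model under stated assumptions: a direct-mapped cache with 8 frames of 4B, a trace of `("load" | "store", address)` pairs, and only read misses counted (bandwidth is not modeled). `read_misses` is an illustrative name.

```python
def read_misses(trace, write_allocate, num_frames=8, block_bytes=4):
    """Count load misses in a direct-mapped cache under the chosen write-miss policy."""
    frames = [None] * num_frames
    misses = 0
    for op, addr in trace:
        block = addr // block_bytes
        frame = block % num_frames
        hit = frames[frame] == block
        if op == "load":
            if not hit:
                misses += 1
                frames[frame] = block      # loads always fill on a miss
        elif not hit and write_allocate:
            frames[frame] = block          # store miss: bring the block into the cache
        # Store miss without allocation: the write goes to the next level only.
    return misses


trace = [("store", 64), ("load", 64), ("store", 128), ("load", 128)]
print("write-allocate:    ", read_misses(trace, write_allocate=True))
print("write-non-allocate:", read_misses(trace, write_allocate=False))
```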



# Typical Processor Cache Hierarchy

- First level caches: optimized for t<sub>hit</sub> and parallel access
  - Insns and data in separate caches (I\$, D\$)
  - Capacity: 8–64KB, block size: 16–64B, associativity: 1–4
  - Other: write-through or write-back
  - t<sub>hit</sub>: 1–4 cycles
- Second level cache (L2): optimized for %<sub>miss</sub>
  - Insns and data in one cache for better utilization
  - Capacity: 128KB–1MB, block size: 64–256B, associativity: 4–16
  - Other: write-back
  - t<sub>hit</sub>: 10–20 cycles
- Third level caches (L3): also optimized for %<sub>miss</sub>
  - Capacity: 1–8MB
  - t<sub>hit</sub>: 30 cycles


# Performance Calculation Example

#### Parameters

- Reference stream: 20% stores, 80% loads
- L1 D\$: $t_{hit} = 1ns$, $\%_{miss} = 5\%$, write-through + write-buffer
- L2: $t_{hit} = 10ns$, $\%_{miss} = 20\%$, write-back, 50% dirty blocks
- Main memory: $t_{hit} = 50ns$, $\%_{miss} = 0\%$
- What is $t_{avgL1D\$}$ without an L2?
  - Write-through+write-buffer means all stores effectively hit
  - $t_{missL1D\$} = t_{hitM}$
  - $t_{avgL1D\$} = t_{hitL1D\$} + \%_{loads} * \%_{missL1D\$} * t_{hitM} = 1ns + (0.8*0.05*50ns) = 3ns$
- What is $t_{avgL1D\$}$ with an L2?
  - $t_{missL1D\$} = t_{avgL2}$
  - Write-back (no buffer) means dirty misses cost double
  - $t_{avgL2} = t_{hitL2} + (1+\%_{dirty}) * \%_{missL2} * t_{hitM} = 10ns + (1.5*0.2*50ns) = 25ns$
  - $t_{avgL1D\$} = t_{hitL1D\$} + \%_{loads} * \%_{missL1D\$} * t_{avgL2} = 1ns + (0.8*0.05*25ns) = 2ns$
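
The same arithmetic can be written as a short script, which makes it easy to vary the parameters. This is a sketch using the numbers from this example; the helper `t_avg` and the variable names are illustrative.

```python
def t_avg(t_hit, miss_rate, t_miss):
    """Average access time: t_hit + %miss * t_miss (all times in ns)."""
    return t_hit + miss_rate * t_miss


LOADS = 0.8          # fraction of references that are loads; stores effectively hit in L1
T_MEM = 50.0         # main memory hit time in ns
L1_HIT, L1_MISS = 1.0, 0.05
L2_HIT, L2_MISS, L2_DIRTY = 10.0, 0.20, 0.5

# Without an L2, an L1 load miss goes straight to main memory.
without_l2 = t_avg(L1_HIT, LOADS * L1_MISS, T_MEM)

# With a write-back L2, a miss that evicts a dirty block costs two memory accesses.
l2_avg = t_avg(L2_HIT, (1 + L2_DIRTY) * L2_MISS, T_MEM)
with_l2 = t_avg(L1_HIT, LOADS * L1_MISS, l2_avg)

print(f"t_avg L1 D$ without L2: {without_l2:.1f} ns")
print(f"t_avg L2:               {l2_avg:.1f} ns")
print(f"t_avg L1 D$ with L2:    {with_l2:.1f} ns")
```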


# Summary

- Average access time of a memory component
  - $t_{avg} = t_{hit} + \%_{miss} * t_{miss}$
  - Hard to get low $t_{hit}$ and $\%_{miss}$ in one structure → hierarchy
- Memory hierarchy
  - Cache (SRAM) → memory (DRAM) → swap (Disk)
  - Smaller, faster, more expensive → bigger, slower, cheaper
- SRAM
  - Analog technology for implementing big storage arrays
  - Cross-coupled inverters + bitlines + wordlines
  - Delay ~ √#bits \* #ports


# Summary, cont'd

- Cache ABCs
  - Capacity, associativity, block size
  - 3C miss model: compulsory, capacity, conflict
- Some optimizations
  - Victim buffer for conflict misses
  - Prefetching for capacity, compulsory misses
- Write issues
  - Pipelined tag/data access
  - Write-back vs. write-through/write-allocate vs. write-non-allocate
  - Write buffer

#### Next Course Unit: Main Memory
