

Slides are derived from work by Amir Roth (Penn) Spring 2011

## Where We Are in This Course Right Now

- So far:
  - We know how to design a processor that can fetch, decode, and execute the instructions in an ISA
  - We can pipeline this processor
  - We understand how to design caches
- Now:
  - We learn how to implement main memory in DRAM
  - We learn about virtual memory
- Next:
  - We learn about the lowest level of storage (disks) and I/O

ECE 152


© 2009 Daniel J. Sorin from Roth



## Readings

Patterson and Hennessy
Still in Chapter 5


## Memory Hierarchy Review

- Memory component performance
  - $t_{avg} = t_{hit} + \%_{miss} * t_{miss}$
  - Can't get both low $t_{hit}$ and low $\%_{miss}$ in a single structure
- Memory hierarchy
  - Upper components: small, fast, expensive
  - Lower components: big, slow, cheap
  - $t_{avg}$ of hierarchy is close to $t_{hit}$ of upper (fastest) component
    - 10/90 rule: 90% of stuff found in fastest component
    - Temporal/spatial locality: automatic up-down data movement
- Storage: registers, memory, disk
  - Memory is fundamental element (unlike caches or disk)
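The $t_{avg}$ formula can be applied level by level: the $t_{avg}$ of one component serves as the $t_{miss}$ of the component above it. The sketch below is a minimal illustration of that idea; the function name `average_access_time` and all latencies and miss rates are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def average_access_time(levels, backing_latency_ns):
    """Return t_avg seen at the top of a memory hierarchy.

    levels: list of (t_hit_ns, miss_rate) tuples, fastest component first.
    backing_latency_ns: t_miss of the lowest listed component.
    """
    t_miss = backing_latency_ns
    # Work upward: each level's t_avg becomes the t_miss of the level above.
    for t_hit, miss_rate in reversed(levels):
        t_miss = t_hit + miss_rate * t_miss
    return t_miss


# Hypothetical hierarchy: L1 (1 ns, 10% miss), L2 (10 ns, 20% miss),
# backed by a 100 ns main memory.
hierarchy = [(1.0, 0.10), (10.0, 0.20)]
t_avg = average_access_time(hierarchy, backing_latency_ns=100.0)
print(f"t_avg = {t_avg:.1f} ns")
```

With these assumed numbers the result stays close to the 1 ns $t_{hit}$ of the fastest component, which is the point of the hierarchy.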









## Moore's Law (DRAM capacity)

| Year       | Capacity     | \$/MB   | Access time |
|------------|--------------|---------|-------------|
| 1980       | 64Kb         | \$1500  | 250ns       |
| 1988       | 4Mb          | \$50    | 120ns       |
| 1996       | 64Mb         | \$10    | 60ns        |
| 2004       | 1Gb          | \$0.5   | 35ns        |
| 2008       | 4Gb          | ~\$0.15 | 20ns        |

Commodity DRAM parameters







## Brief History of DRAM

- DRAM (memory): a major force behind computer industry
  - Modern DRAM came with introduction of IC (1970)
  - Preceded by magnetic "core" memory (1950s)
    - Core more closely resembles today's disks than memory
    - "Core dump" is legacy terminology
  - And by mercury delay lines before that (ENIAC)
    - Re-circulating vibrations in mercury tubes

"the one single development that put computers on their feet was the invention of a reliable form of memory, namely the core memory... Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large"

Maurice Wilkes, Memoirs of a Computer Pioneer, 1985

## Access Time and Cycle Time

- DRAM access much slower than SRAM
  - More bits $\rightarrow$ longer wires
  - Buffered access with two-level addressing
  - SRAM access latency: 2–3ns
  - DRAM access latency: 20–35ns
- DRAM cycle time also longer than access time
  - Cycle time: time between start of consecutive accesses
  - SRAM: cycle time = access time
    - Begin second access as soon as first access finishes
  - DRAM: cycle time = 2 \* access time
    - Why? Can't begin new access while DRAM is refreshing row
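The difference between access time and cycle time only shows up on back-to-back accesses. A minimal sketch, assuming accesses are issued as fast as the part allows; the helper name `time_for_accesses` is hypothetical, and the latencies are picked from the ranges quoted on the slide:

```python
def time_for_accesses(n, access_ns, cycle_ns):
    """Time until the n-th consecutive access returns its data.

    A new access can start every cycle_ns; each access returns its data
    access_ns after it starts.
    """
    if n <= 0:
        return 0
    return (n - 1) * cycle_ns + access_ns


# SRAM: cycle time = access time (3 ns assumed).
sram_ns = time_for_accesses(4, access_ns=3, cycle_ns=3)
# DRAM: cycle time = 2 * access time (20 ns access assumed).
dram_ns = time_for_accesses(4, access_ns=20, cycle_ns=40)
print(sram_ns, dram_ns)
```

For a single access the DRAM penalty is just the longer access time; for a stream of accesses the cycle time dominates.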





## DRAM: A Vast Topic

- Many flavors of DRAMs
  - DDR3 SDRAM, RDRAM, etc.
- Many ways to package them
   SIMM, DIMM, FB-DIMM, etc.
- Many different parameters to characterize their timing
   t<sub>RCr</sub> t<sub>RACr</sub> t<sub>RACr</sub> t<sub>RASr</sub> etc.
- Many ways of using row buffer for "caching"
- Etc.
- There's at least one whole textbook on this topic!
   And it has ~1K pages
- We could, but won't, spend rest of semester on DRAM







- 1 DRAM + 4b bus
  - One DRAM chip, don't need 16b bus
  - DRAM: 2B / 40ns $\rightarrow$ 4b / 10ns
  - Balanced system $\rightarrow$ match bandwidths
- Access time: 660ns (30ns longer = +4%)
- Cycle time: 640ns (same as before)
- + Much cheaper!

Diagram: one 4M x 2B DRAM chip on a 4b bus

| T (ns) | DRAM    | Bus   |
|--------|---------|-------|
| 10     | [31:30] |       |
| 20     | [31:30] |       |
| 30     | refresh | [31H] |
| 40     | refresh | [31L] |
| 50     | [29:28] | [30H] |
| 60     | [29:28] | [30L] |
| 70     | refresh | [29H] |
| 80     | refresh | [29L] |
| ...    | ...     | ...   |
| 600    | [1:0]   | [2H]  |
| 610    | [1:0]   | [2L]  |
| 620    | refresh | [1H]  |
| 640    | refresh | [1L]  |
| 650    |         | [0H]  |
| 660    |         | [0L]  |





- 2B bus
  - Bus b/w: 2B/10ns
  - DRAM b/w: 2B/40ns
  - 4 DRAM chips
- Access time: 180ns
- Cycle time: 160ns

Diagram: four 4M x 2B DRAM chips (0–3) sharing a 2B bus

| T (ns) | DRAM0   | DRAM1   | DRAM2   | DRAM3   | Bus     |
|--------|---------|---------|---------|---------|---------|
| 10     | [31:30] | [29:28] | [27:26] | [25:24] |         |
| 20     | [31:30] | [29:28] | [27:26] | [25:24] |         |
| 30     | refresh | refresh | refresh | refresh | [31:30] |
| 40     | refresh | refresh | refresh | refresh | [29:28] |
| 50     | [23:22] | [21:20] | [19:18] | [17:16] | [27:26] |
| 60     | [23:22] | [21:20] | [19:18] | [17:16] | [25:24] |
| ...    | ...     | ...     | ...     | ...     | ...     |
| 110    | refresh | refresh | refresh | refresh | [15:14] |
| 120    | refresh | refresh | refresh | refresh | [13:12] |
| 130    | [7:6]   | [5:4]   | [3:2]   | [1:0]   | [11:10] |
| 140    | [7:6]   | [5:4]   | [3:2]   | [1:0]   | [9:8]   |
| 150    | refresh | refresh | refresh | refresh | [7:6]   |
| 160    | refresh | refresh | refresh | refresh | [5:4]   |
| 170    |         |         |         |         | [3:2]   |
| 180    |         |         |         |         | [1:0]   |
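The access and cycle times quoted for the two balanced configurations (one chip on a 4b bus, four chips on a 2B bus) can be reproduced with a small timing model. This is a sketch under stated assumptions, not the slides' own derivation: a 32B block, 2B per chip per access, 20ns access plus 20ns refresh per DRAM cycle, and 10ns bus slots. The function name `block_transfer` and its parameters are hypothetical.

```python
def block_transfer(block_bytes, chips, bus_bytes_per_slot,
                   chip_bytes=2, access_ns=20, cycle_ns=40, slot_ns=10):
    """Return (access_time_ns, cycle_time_ns) for reading one block.

    All chips are accessed in lockstep; the bus drains one round of
    DRAM data while the chips refresh and start the next round.
    """
    per_round = chips * chip_bytes                # bytes produced per DRAM cycle
    rounds = -(-block_bytes // per_round)         # ceiling division
    bus_ns_per_round = per_round / bus_bytes_per_slot * slot_ns
    if bus_ns_per_round > cycle_ns:
        raise ValueError("bus is slower than DRAM: system is not balanced")
    # The last round starts after (rounds - 1) full cycles, its data is ready
    # access_ns later, and the bus then needs bus_ns_per_round to deliver it.
    access_time = (rounds - 1) * cycle_ns + access_ns + bus_ns_per_round
    cycle_time = rounds * cycle_ns
    return access_time, cycle_time


narrow = block_transfer(32, chips=1, bus_bytes_per_slot=0.5)  # 4b bus
wide = block_transfer(32, chips=4, bus_bytes_per_slot=2)      # 2B bus
print(narrow, wide)
```

Under these assumptions the model gives 660ns/640ns for the narrow system and 180ns/160ns for the four-chip system, matching the figures on the two timelines.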

- … own clock, typically much s… bus clock
- Performance metri…
- Clock frequency increases don't reduce memory or bus latency
- May make misses …
- At some point … become a **bottleneck**

## Error Detection and Correction

- One last thing about DRAM technology: errors
  - DRAM fails at a higher rate than SRAM or CPU logic
    - Capacitor wear
    - Bit flips from energetic α-particle strikes
    - Many more bits
  - Modern DRAM systems: built-in error detection/correction
- Key idea: checksum-style redundancy
  - Main DRAM chips store data, additional chips store f(data)
    - |f(data)| < |data|
  - On read: re-compute f(data), compare with stored f(data)
    - Different? Error...
  - Option I (detect): kill program
  - Option II (correct): enough information to fix error? Fix and go on




## Error Detection Example: Parity

- **Parity**: simplest scheme
  - f(data<sub>N-1...0</sub>) = XOR(data<sub>N-1</sub>, ..., data<sub>1</sub>, data<sub>0</sub>)
  - + Single-error detect: detects a single bit flip (common case)
    - Will miss two simultaneous bit flips...
    - But what are the odds of that happening?
  - – Zero-error correct: no way to tell which bit flipped
- Many other schemes exist for detecting/correcting errors
  - Take ECE 254 (Fault Tolerant Computing) for more info
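The parity scheme above can be sketched in a few lines. This is an illustration only: the helper names `parity` and `check` and the 8-bit data word are made up for the example, and real DRAM computes parity in hardware over wider words.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """f(data): XOR of all data bits (0 if the number of 1s is even)."""
    return reduce(xor, bits, 0)


def check(bits, stored_parity):
    """On read: re-compute f(data) and compare with the stored f(data)."""
    return parity(bits) == stored_parity


data = [1, 0, 1, 1, 0, 0, 1, 0]
stored = parity(data)            # written alongside the data

one_flip = data.copy()
one_flip[3] ^= 1                 # a single bit flip

two_flips = one_flip.copy()
two_flips[6] ^= 1                # a second, simultaneous bit flip

print(check(data, stored), check(one_flip, stored), check(two_flips, stored))
```

The single flip changes the recomputed parity and is detected; the double flip restores it and slips through, and in neither case does the parity bit say which position flipped.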


## Memory Organization

- So data is striped across DRAM chips
- But how is it organized?
  - Block size?
  - Associativity?
  - Replacement policy?
  - Write-back vs. write-thru?
  - Write-allocate vs. write-non-allocate?
  - Write buffer?
  - Optimizations: victim buffer, prefetching, anything else?


## Low %<sub>miss</sub> At All Costs

- For a memory component: $t_{hit}$ vs. $\%_{miss}$ tradeoff
- Upper components (I\$, D\$) emphasize low $t_{hit}$
  - Frequent access $\rightarrow$ minimal $t_{hit}$ important
  - $t_{miss}$ is not bad $\rightarrow$ minimal $\%_{miss}$ less important
  - Low capacity/associativity/block-size, write-back or write-through
- Moving down (L2) emphasis turns to $\%_{miss}$
  - Infrequent access $\rightarrow$ minimal $t_{hit}$ less important
  - $t_{miss}$ is bad $\rightarrow$ minimal $\%_{miss}$ important
  - High capacity/associativity/block size, write-back
- For memory, emphasis entirely on $\%_{miss}$
  - $t_{miss}$ is disk access time (measured in ms, not ns)

## **Typical Memory Organization Parameters**

| Parameter          | I\$/D\$   | L2        | Main Memory   |
|--------------------|-----------|-----------|---------------|
| t <sub>hit</sub>   | 1-2ns     | 10ns      | 30ns          |
| t <sub>miss</sub>  | 10ns      | 30ns      | 10ms (10M ns) |
| Capacity           | 8–64KB    | 128KB-2MB | 512MB-8GB     |
| Block size         | 16-32B    | 32-256B   | 8-64KB pages  |
| Associativity      | 1-4       | 4-16      | Full          |
| Replacement Policy | NMRU      | NMRU      | working set   |
| Write-through?     | Sometimes | No        | No            |
| Write-allocate?    | Sometimes | Yes       | Yes           |
| Write buffer?      | Yes       | Yes       | No            |
| Victim buffer?     | Yes       | No        | No            |
| Prefetching?       | Sometimes | Yes       | Sometimes     |
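
To see why main memory is tuned entirely for low $\%_{miss}$, it helps to plug the table's main-memory values into the $t_{avg}$ formula. The miss rate of 0.001% (one access in 100,000) is an assumed figure for illustration, not a measurement:

$$
t_{avg} = t_{hit} + \%_{miss} * t_{miss} = 30\,\text{ns} + 10^{-5} * 10^{7}\,\text{ns} = 130\,\text{ns}
$$

Even at that assumed rate, the miss term (100ns) is more than three times the hit time, because $t_{miss}$ is a 10ms disk access.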

## One Last Gotcha

- On a 32-bit architecture, there are 2<sup>32</sup> byte addresses
  - Requires 4 GB of memory
  - But not everyone buys machines with 4 GB of memory
  - And what about 64-bit architectures?
- Let's take a step back...

## Virtual Memory

- Idea of treating memory like a cache
  - Contents are a dynamic subset of program's address space
  - Dynamic content management is transparent to program
  - Actually predates "caches" (by a little)
- Original motivation: compatibility
  - IBM System/370: a family of computers with one software suite
  - + Same program could run on machines with different memory sizes
  - Caching mechanism made it appear as if memory was $2^N$ bytes
    - Regardless of how much memory there actually was
  - Prior, programmers explicitly accounted for memory size
- Virtual memory
  - Virtual: "in effect, but not in actuality" (i.e., appears to be, but isn't)



## Other Uses of Virtual Memory

- Virtual memory is quite useful
  - Automatic, transparent memory management just one use
  - "Functionality problems are solved by adding levels of indirection"
- Example: multiprogramming
  - Each process thinks it has 2<sup>N</sup> bytes of address space
  - Each thinks its stack starts at address 0xFFFFFFFF
  - "System" maps VPs from different processes to different PPs
    - + Prevents processes from reading/writing each other's memory


## Still More Uses of Virtual Memory

- Inter-process communication
  - Map VPs in different processes to same PPs
- Direct memory access I/O
  - Think of I/O device as another process
  - Will talk more about I/O in a few lectures
- Protection
  - Piggy-back mechanism to implement page-level protection
  - Map VP to PP ... and RWX protection bits
  - Attempt to execute data, or attempt to write insn/read-only data?
    - Exception $\rightarrow$ OS terminates program
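Page-level protection piggy-backs on translation: the same lookup that maps a VP to a PP also returns the RWX bits. The sketch below is a minimal model under assumptions that are not from the slides: 4KB pages, a flat dictionary as the page table, and hypothetical names (`PAGE_BITS`, `ProtectionFault`, `translate`). A missing mapping (a page fault) is not modeled here.

```python
PAGE_BITS = 12                          # assumed 4 KB pages
OFFSET_MASK = (1 << PAGE_BITS) - 1


class ProtectionFault(Exception):
    """Raised when an access violates a page's RWX bits."""


def translate(page_table, va, access):
    """Map a virtual address to a physical address, checking protection.

    page_table: dict mapping VPN -> (PPN, permission string such as "rx").
    access: one of "r", "w", "x".
    """
    vpn, offset = va >> PAGE_BITS, va & OFFSET_MASK
    ppn, perms = page_table[vpn]        # KeyError here would be a page fault
    if access not in perms:
        raise ProtectionFault(f"{access!r} not allowed on VPN {vpn:#x}")
    return (ppn << PAGE_BITS) | offset


# Hypothetical process: one code page (read/execute), one data page (read/write).
page_table = {0x00400: (0x00012, "rx"), 0x10000: (0x00345, "rw")}

print(hex(translate(page_table, 0x00400ABC, "x")))   # instruction fetch
print(hex(translate(page_table, 0x10000010, "w")))   # store to data page
try:
    translate(page_table, 0x00400ABC, "w")           # write to code page
except ProtectionFault as fault:
    print("exception:", fault)
```

In a real system the exception would transfer control to the OS, which would typically terminate the offending program.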













## **Address Translation Mechanics**

- The six questions
  - What? address translation
  - Why? compatibility, multi-programming, protection
  - How? page table
  - Who performs it?
  - When?
  - Where does page table reside?
- Option I: process (program) translates its own addresses
  - Page table resides in process visible virtual address space
  - Bad idea: implies that program (and programmer)...
    - ...must know about physical addresses
      - Isn't that what virtual memory is designed to avoid?
    - ...can forge physical addresses and mess with other programs
  - Translation on L2 miss or always? How would program know?


## Who? Where? When? Take II

- Option II: operating system (OS) translates for process
  - Page table resides in OS virtual address space
  - + User-level processes cannot view/modify their own tables
  - + User-level processes need not know about physical addresses
  - Translation on L2 miss
  - Otherwise, OS SYSCALL before any fetch, load, or store
- L2 miss: interrupt transfers control to OS handler
  - Handler translates VA by accessing process's page table
  - Accesses memory using PA
  - Returns to user process when L2 fill completes
  - Still slow: added interrupt handler and PT lookup to memory access

  - What if PT lookup itself requires memory access? Head spinning...



## **TB** Misses

- **TB miss:** requested PTE not in TB, but in PT
  - Two ways of handling
- 1) OS routine: reads PT, loads entry into TB (e.g., Alpha)
  - Privileged instructions in ISA for accessing TB directly
  - Latency: one or two memory accesses + OS call
- 2) Hardware FSM: does same thing (e.g., IA-32)
  - Store PT root pointer in hardware register
  - Make PT root and 1st-level table pointers physical addresses
    - So FSM doesn't have to translate them
  - + Latency: saves cost of OS call
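Whether the miss handler is an OS routine or a hardware FSM, the logic is the same: read the PTE from the page table and load it into the TB. A minimal software model of that flow follows; the class name `TranslationBuffer`, the two-entry capacity, the single-level dictionary page table, and the LRU replacement policy are all assumptions made for the example.

```python
from collections import OrderedDict


class TranslationBuffer:
    """Tiny fully-associative TB in front of a page table (VPN -> PPN)."""

    def __init__(self, page_table, entries=2):
        self.page_table = page_table
        self.entries = entries
        self.tb = OrderedDict()          # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.tb:
            self.hits += 1
            self.tb.move_to_end(vpn)     # mark as most recently used
            return self.tb[vpn]
        # TB miss: the PTE is in the page table, so read it and fill the TB.
        self.misses += 1
        ppn = self.page_table[vpn]
        if len(self.tb) >= self.entries:
            self.tb.popitem(last=False)  # evict the least recently used entry
        self.tb[vpn] = ppn
        return ppn


tb = TranslationBuffer({1: 10, 2: 20, 3: 30}, entries=2)
ppns = [tb.lookup(vpn) for vpn in (1, 2, 1, 3, 2)]
print(ppns, tb.hits, tb.misses)
```

The page-table read inside `lookup` stands in for the "one or two memory accesses" of the handler; the cost of the OS call itself is not modeled.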

## Nested TB Misses

- Nested TB miss: when OS handler itself has a TB miss
  - TB miss on handler instructions
  - TB miss on page table VAs
  - Not a problem for hardware FSM: no instructions, PAs in page table
- Handling is tricky for SW handler, but possible
  - First, save current TB miss info before accessing page table
    - So that nested TB miss info doesn't overwrite it
  - Second, lock nested miss entries into TB
    - Prevent TB conflicts that result in infinite loop
    - Another good reason to have a highly-associative TB


## Page Faults

- Page fault: PTE not in TB or in PT
  - Page is simply not in memory
  - Starts out as a TB miss, detected by OS handler/hardware FSM

- OS routine
  - OS software chooses a physical page to replace
    - "Working set": more refined software version of LRU
      - Tries to see which pages are actively being used
      - Balances needs of all current running applications
    - If dirty, write to disk (like dirty cache block with write-back \$)
  - Read missing page from disk (done by OS)
    - Takes so long (10ms), OS schedules another task
  - Treat like a normal TB miss from here
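The page-fault steps can be sketched as a small handler. This is an illustration under loud assumptions: FIFO replacement stands in for the working-set policy, disk traffic is just appended to a log instead of being performed, and the names `handle_page_fault`, `resident`, `free_frames`, and `disk_log` are hypothetical.

```python
def handle_page_fault(vpn, page_table, resident, free_frames, disk_log):
    """Bring page `vpn` into memory and return the physical page it gets.

    page_table: dict VPN -> {"ppn": int, "dirty": bool} for resident pages.
    resident: list of resident VPNs, oldest first (FIFO stand-in for
              the working-set policy).
    free_frames: list of unused physical page numbers.
    disk_log: list collecting ("write" | "read", vpn) disk operations.
    """
    if free_frames:
        ppn = free_frames.pop()
    else:
        victim = resident.pop(0)                 # choose a page to replace
        entry = page_table.pop(victim)
        if entry["dirty"]:
            disk_log.append(("write", victim))   # write dirty page back to disk
        ppn = entry["ppn"]
    disk_log.append(("read", vpn))               # read missing page from disk
    page_table[vpn] = {"ppn": ppn, "dirty": False}
    resident.append(vpn)
    return ppn                                   # now handled like a TB miss


page_table = {7: {"ppn": 0, "dirty": True}, 8: {"ppn": 1, "dirty": False}}
resident, free_frames, disk_log = [7, 8], [], []
first = handle_page_fault(9, page_table, resident, free_frames, disk_log)
second = handle_page_fault(10, page_table, resident, free_frames, disk_log)
print(first, second, disk_log)
```

A real OS would also schedule another task during the roughly 10ms disk read; that overlap is left out of this sketch.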













## Flavors of Virtual Memory

- Virtual memory almost ubiquitous today
  - Certainly in general-purpose (in a computer) processors
  - But even some embedded (in non-computer) processors support it
- Several forms of virtual memory
  - Paging (aka flat memory): equal sized translation blocks
    - Most systems do this
  - Segmentation: variable sized (overlapping?) translation blocks
    - IA32 uses this
    - Makes life very difficult
  - Paged segments: don't ask

## Summary

- DRAM
  - Two-level addressing
  - Refresh, access time, cycle time
- Building a memory system
  - DRAM/bus bandwidth matching
- Memory organization
- Virtual memory
  - Page tables and address translation
  - Page faults and handling
  - Virtual, physical, and virtual-physical caches and TLBs

- Next part of course: I/O