DV1460 / DV1492:

Real-Time Operating Systems

08:15-10:00 Thursday, September 22nd, 2016

Virtual Memory.

§3.3.1-3.4, pp. 198-222
Table of Contents
  • Page Tables
  • TLB
  • Page Replacement

1. Introduction

  • We've seen one implementation of an address space (Base+Limit)
    • Issue: limited by size of physical memory.
    • Issue: contiguous, external fragmentation.
  • Virtual Memory is an implementation that solves these issues.
    • Cut up the process address space into smaller blocks (pages).
    • Map each page independently to a physical frame.
    • Pages can be unmapped to allow a larger logical space.
    • Read/write idle pages to a backing store (e.g. hard-drive).

2. Using The Page Table

  • Each process has its own page-table.
    • One entry per page (e.g. 4KB).
    • If the page is currently in physical memory, the entry shows the frame number (physical address).
    • If not mapped in, the entry indicates this (e.g. 48-52KB).
  • All memory accesses in the program go through the table.
  • MOV REG,0.
    • Page is \(\frac{0}{4096}=0\).
    • Table entry 0 shows frame 2.
    • Address is \(2\times4096+0=8192\).

3. Indirection Through A Page-Table

  • The steps needed per-access are more complex (slower) than Base+Limit.
  • Easy to see in the C-style pseudocode below.
  • Running this procedure per-instruction would be very slow.
  • Needs hardware support to be feasible.
uint32 memRead(uint32 address) {
    uint32 page   = address >> 12;     /* top 20 bits: page number   */
    uint32 offset = address & 0xFFF;   /* bottom 12 bits: offset     */
    if (pageTable[page].present) {
        /* parenthesise the shift: + binds tighter than << in C */
        uint32* physical = (uint32*)((pageTable[page].frame << 12) + offset);
        return *physical;
    }
    throw PageFault;                   /* trap: page is not resident */
}

4. Memory Management Unit (MMU)

  • The MMU is a hardware implementation of the previous pseudocode (and the corresponding writes).
  • Sits between the processor and the bus.
  • The page table is held in memory inside the MMU.
  • Performs:
    • Decoding of page / offset.
    • Lookup of frame numbers.
    • Mapping to target address.
    • Interrupts for page faults.
  • The table is fixed size.

5. Page Table Entries

  • Each entry in the table stores information about a single page.
    • The present/absent bit shows if the page is resident in memory.
    • When the page is written to, the modified bit records this.
    • Protection bits indicate read/write privileges.
    • Referenced shows the page has been used (read or write).
  • Dirty pages are those altered since they were loaded/created (marked by the modified bit).
  • This memory is very expensive (MMU is generally on-die, similar to cache).
  • The hardware table is limited in size.
  • A 4GB machine with 4KB pages requires \(2^{20}\) table entries...
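A minimal sketch of how these fields might be packed into a 32-bit entry, in the same C-style pseudocode as the earlier memRead (the exact layout and field widths are an assumption; real MMUs differ):

typedef struct {
    uint32 frame      : 20;   /* physical frame number               */
    uint32 present    : 1;    /* present/absent: resident in memory? */
    uint32 referenced : 1;    /* set on any read or write            */
    uint32 modified   : 1;    /* dirty: written since load/creation  */
    uint32 writable   : 1;    /* protection: write access allowed?   */
    uint32 unused     : 8;    /* padding / OS-specific bits          */
} PageTableEntry;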

6. MMU Price/Performance

  • Consider the different design choices in placing the page table.
  • Extreme: put it entirely in main memory.
    • Size is no longer an issue: the price is low, but so is the performance.
    • Each memory access in the program will require one extra access.
  • Extreme: put it entirely inside the MMU.
    • Fast memory in a processor is expensive: registers / L1-cache.
    • Using MMU to cache the page-table: reduces price / similar performance.

7. Caching the page-table

  • This economic tradeoff is similar to memory use in general.
    • The memory hierarchy optimises this with a cache.
  • We can use a similar solution here.
    • A small (expensive) memory inside the MMU.
    • A large page-table in main memory.
    • The frequently used table entries are cached.
  • This works because programs exhibit locality of reference at the large scale (pages in their working set) as well as the small scale.

8. Example TLB

  • The Translation Lookaside Buffer (TLB) is an associative memory.
  • Stores some of the page-table entries (caches table in main memory).
    • Policies for deciding which entries to keep come in the second part of the lecture.
  • In the MMU table shown earlier the entry index was the page number.
    • Now it is an explicit field within the table entry.
    • The associative map uses this field as a key.

9. Using the TLB

  • Example shows the hot pages used by a program.
  • Executing a loop in pages 19,20,21 (readable and executable).
  • Array in pages 129 and 130.
  • Heap variables (including indices) in page 140.
  • Stack is in pages 860 and 861.
  • On an access the page number is compared against all entries in parallel.
  • If it is found (a hit) the frame number is used.
  • On a miss the MMU looks up the table entry in memory.
  • After reading it, one entry in the TLB is overwritten with the entry.
  • Idea: temporal locality results in good performance.
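A software model of this lookup, as a sketch in the same C-style pseudocode (TLB_SIZE, the TLBEntry fields, tableLookup and pickVictim are illustrative assumptions; real hardware compares all entries in parallel rather than looping):

#define TLB_SIZE 64                      /* illustrative size        */

typedef struct {
    uint32 page;                         /* explicit page-number key */
    uint32 frame;
    uint32 valid;
} TLBEntry;

TLBEntry tlb[TLB_SIZE];

uint32 translate(uint32 page) {
    for (uint32 i = 0; i < TLB_SIZE; i++)     /* parallel in hardware */
        if (tlb[i].valid && tlb[i].page == page)
            return tlb[i].frame;              /* hit                  */
    uint32 frame = tableLookup(page);         /* miss: walk the table in memory */
    uint32 v = pickVictim();                  /* replacement policy   */
    tlb[v].page = page; tlb[v].frame = frame; tlb[v].valid = 1;
    return frame;                             /* one entry overwritten */
}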

10. Managing the TLB

  • The MMU hardware can be simplified by handling TLB misses in software.
    • Extra trap on the processor: serviced by overwriting the table entry.
    • This is slower than the MMU updating the TLB.
    • The advantage is that the MMU hardware can be smaller.
  • A larger TLB is required to achieve the same performance.
  • A strange tradeoff to make, as fully-associative sets are expensive.
    • It relates to late-90s research on partitioned caches.
  • Idea: Perhaps the software can estimate usage better than a simple circuit?
Design tradeoffs for software-managed TLBs.
Richard Uhlig et al. ACM TOCS, Volume 12, Issue 3, Aug. 1994.
http://dl.acm.org/citation.cfm?id=185515

11. Page Table Sizes

  • A single page-table for a virtual address space is large.
    • Assume a 32-bit physical address space and 4KB frames: the frame number is 20 bits.
    • Needs to be aligned, assume 4-bytes per table entry.
    • Assume the virtual address space is also 32-bits: 4MB per table.
  • Each process in the system has a table.
  • A typical desktop has about 100 processes: 400MB of tables!
  • It is rare for a process to use all entries (virtual space is sparse)...

12. Multi-level Page Tables

  • The first step in reducing the memory footprint is to split the table into two levels.
  • Bits in the virtual address indicate the position in two tables.
  • The entry in the first table stores the address of a second-level table.
  • Some of the second-level tables will be completely empty...
    • If we avoid storing them the structure is sparse.
    • It uses much less memory than the single-level structure.

13. Multi-level Page Table Example

  • Each table contains 1024 entries; with 4KB pages/frames, each second-level table maps 4MB.
  • Virtual address 0x00403004
  • PT1=1 PT2=3 Offset=4
  • MMU looks at index 1 in the first table.
  • Retrieves the frame number of the second table, reads index 3.
  • This yields the frame backing virtual address 0x00403004.
  • The offset is added to produce the target physical address.
  • Present/absent bits in the first-level indicate if the table exists.
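In the same C-style pseudocode as the earlier memRead, the walk might look like this (topLevel and secondLevelTable are assumed names for illustration):

uint32 walk2(uint32 vaddr) {
    uint32 pt1    = vaddr >> 22;             /* 0x00403004 -> 1 */
    uint32 pt2    = (vaddr >> 12) & 0x3FF;   /* 0x00403004 -> 3 */
    uint32 offset = vaddr & 0xFFF;           /* 0x00403004 -> 4 */
    if (!topLevel[pt1].present) throw PageFault;  /* second table absent */
    PageTableEntry* second = secondLevelTable(topLevel[pt1].frame);
    if (!second[pt2].present)   throw PageFault;
    return (second[pt2].frame << 12) + offset;
}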

14. Inverted Page Tables

  • In a 64-bit address space further table levels are needed.
  • An alternative is to reverse the mapping: store page number per frame.
  • The MMU must search the table to find the frame for a page number.
  • TLB can cache this mapping once found, but it is slow on a TLB-miss.
  • The search can be accelerated by hashing the page numbers, storing frames matching the hash in a chain.
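A sketch of the hashed lookup (hash, BUCKETS and the chain node layout are assumptions, in the same pseudocode style):

uint32 invertedLookup(uint32 page) {
    Node* n = bucket[hash(page) % BUCKETS];  /* hash the page number     */
    for (; n != NULL; n = n->next)           /* walk the chain of frames */
        if (n->page == page)                 /* matching this hash       */
            return n->frame;
    throw PageFault;                         /* page not resident        */
}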

Break (15 mins)

15. Page Replacement Context

  • Page tables and TLBs are both caches.
    • Quick reminder of cache design.
    • Small, fast, fixed size storage.
    • In front of slower, larger, possibly variable-size storage.
  • Assume the cache is always full.
  • If a request is made that misses the cache.
    • Retrieve from memory: need a free slot.
    • Cache must evict an entry.
  • The second part today is algorithms for picking which page entry to evict from the table.

16. Algorithms Overview

  • Complex software makes the system slow.
  • Complex hardware is expensive.
  • We can arrange the algorithms we will see into categories:
    • Magic: we can only see the theoretically optimal behaviour after it has happened.
    • We can record this in a simulation, use it for evaluation of algorithms.
    • Best achievable: this is the highest performance that we could actually build.
  • LRU is too complex to be practical: prediction performance is high, but the speed of operation is too slow. Needs approximations...

17. Page Replacement: LRU

  • Idea: count clock cycles in the MMU.
  • On every memory access, store the counter in the page table (TLB).
  • Problem: increases the entry size by at least 100% (64-bit).
  • Cache memory is very expensive: impractical, but it does exist.
  • Problem 2: The low-latency counter is also tricky to implement.
  • On a page fault:
    • Evict the page not used for longest.
    • Probably not going to be used soon.
  • Temporal locality: approx. of optimal.
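If each entry really did carry such a counter, eviction would be a scan for the minimum. A sketch in the same pseudocode style (entry, NUM_ENTRIES and the lastUsed field are assumptions):

uint32 evictLRU(void) {
    uint32 victim = 0;
    for (uint32 i = 1; i < NUM_ENTRIES; i++)
        if (entry[i].lastUsed < entry[victim].lastUsed)
            victim = i;                  /* oldest access wins */
    return victim;
}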

18. Page Replacement: Simple Counters

  • Storing a counter was too expensive.
  • Referenced flag, Modified flag: 2 bits.
  • Flags are set by MMU hardware.
  • When the entry is loaded into the table, both are zero.
  • Manual operation to reset the R flag.
  • Simple algorithm for eviction:
    • Not Recently Used (NRU); classes are written as the bit pair RM.
    • Prefer 00 - cheapest to evict.
    • Prefer 01 to 10 - needs a flush, but might not reload.
    • Avoid 11 - needs a flush and will probably reload.
  • Pick any entry from the best category available (coarse approximation of LRU).
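A sketch of the selection in the same pseudocode style (entry and NUM_ENTRIES are assumed, as above; the class number is the RM bit pair):

uint32 evictNRU(void) {
    for (uint32 cls = 0; cls < 4; cls++)         /* classes 00, 01, 10, 11 */
        for (uint32 i = 0; i < NUM_ENTRIES; i++)
            if (entry[i].referenced * 2 + entry[i].modified == cls)
                return i;                        /* any page in best class */
    return 0;                                    /* unreachable: table is full */
}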

19. Page Replacement: Further simple approaches

  • Very simple approach (no hardware needed): FIFO.
    • Queue pages as they are added during page-faults.
    • Pick the oldest for eviction.
    • Problem: doesn't care if busy or not.
  • 2nd chance: combine FIFO with the simple R and M flags.
    • Page fault at time 20: A is the oldest page in the table.
    • If R is clear: evict the page.
    • If R is set: move it to the back and reset its time; it becomes the newest page.
  • If all pages have R set: drop back to evicting oldest.
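A sketch of second chance (dequeueOldest, enqueueNewest and now are assumed queue/clock helpers):

uint32 evictSecondChance(void) {
    for (;;) {                           /* if all R are set, one pass   */
        Page* p = dequeueOldest();       /* clears them and then evicts  */
        if (!p->referenced)              /* the original oldest page     */
            return p->index;             /* R clear: evict               */
        p->referenced = 0;               /* R set: reset R and time,     */
        p->loaded = now();               /* page becomes the newest      */
        enqueueNewest(p);
    }
}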

20. Page Replacement: Clock

  • Singly-linked list: head takes \(\mathcal{O}(1)\), tail takes \(\mathcal{O}(n)\).
  • Doubly-linked list: head takes \(\mathcal{O}(1)\), tail also takes \(\mathcal{O}(1)\).
  • Double-linked list forms a loop, has a single entry point.
  • Looks vaguely like a clock...
  • On an eviction look at the entry (C).
  • Is R=0?
    • Oldest hasn't been used - evict.
  • Is R=1?
    • Second chance: reset time and R.
    • Move the entry forward one step.
    • C is now the oldest.
  • Just a faster way to implement second chance.
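As a sketch, the clock version just advances a hand around the circular list instead of re-queuing pages (hand and now are assumptions, as before):

uint32 evictClock(void) {
    for (;;) {
        if (!hand->referenced) {         /* oldest hasn't been used      */
            uint32 victim = hand->index;
            hand = hand->next;
            return victim;               /* evict                        */
        }
        hand->referenced = 0;            /* second chance: reset R, time */
        hand->loaded = now();
        hand = hand->next;               /* this entry is now the oldest */
    }
}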

21. Page Replacement: NFU

  • Now we add a timing interrupt to the simple 1-bit flags.
    • Software handler can add some functionality.
    • Not as expensive as actions on every access.
    • Sample and reset R and M periodically.
  • Store the samples in a counter per page.
    • Estimate of how frequently the page is accessed.
  • NFU: Not Frequently Used.
    • Evict the page with the lowest Referenced count.
    • Advantage: least popular page, hope it remains so.
    • Problem: old counts are as valuable as new counts.
  • Back in lecture 4 we saw a similar problem: estimating process behaviour.
    • The solution was called "aging": it discounted older information.
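A sketch of the periodic sampling, run from the timer interrupt (count is an assumed per-entry field):

void nfuTick(void) {
    for (uint32 i = 0; i < NUM_ENTRIES; i++) {
        entry[i].count += entry[i].referenced;   /* sample R              */
        entry[i].referenced = 0;                 /* reset for next period */
    }
}
/* on a page fault, evict the entry with the smallest count */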

22. Page Replacement: Aging

  • Aging uses exponential decay of the value of R counts.
    • In the process context we used exponentially weighted average.
    • Not much difference: \(x' = \frac{x}{2} + R\).
  • As R is either 0 or 1, the difference is that we do not divide R by 2.
  • Both NFU and Aging approximate LRU in software.
    • Aging is more accurate.
    • Even so, it does not store the exact access order.
    • Two pages referenced in the same tick have the same score.
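A sketch of the aging update, assuming an 8-bit counter with R shifted in at the most significant bit so older samples decay (field name age is an assumption):

void agingTick(void) {
    for (uint32 i = 0; i < NUM_ENTRIES; i++) {
        entry[i].age = (entry[i].age >> 1)            /* halve old evidence  */
                     | (entry[i].referenced << 7);    /* R enters at the top */
        entry[i].referenced = 0;
    }
}
/* on a page fault, evict the entry with the smallest age */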

23. Page Replacement: Working-set

  • The working set of a program is the current set of pages it is actively using at a point in time.
  • We cannot know exactly what this is.
  • Some data that is used will never be used again.
  • Some data will be used soon (stays in the working set for a short time).
  • Some data will be used later (stays in set for a long time).
  • While we cannot know exactly, we can estimate by accesses in the last \(k\) cycles.
  • At clock cycle \(t\) during the execution, the size of the working set \(w(k,t)\) is shown in the graph.
  • Idea: Programs do not access all data at once
  • Consequence: If \(k\) is large enough the approximation is accurate.

24. Page Replacement: Working-set

  • If we knew \(w(k,t)\) for a program:
    • Preload pages before they are accessed: minimise page-faults.
  • As with all great ideas this is fantastic in theory, impossible in reality.
    • But leads to good approximations...
  • Drop the "last \(k\) memory accesses" definition: counting accesses is expensive.
  • Instead use a window of time: similar to NFU / Aging.
  • Difference to NFU / Aging:
    • Don't count - store the last reference time.
  • Simulating LRU with the coarseness of the timer tick.
    • Evict pages older than parameter \(\tau\) (picked by tuning).
    • Expensive to implement: needs a tweak...
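A sketch of the scan that makes the basic algorithm expensive (virtual time now and the threshold \(\tau\) are parameters; lastUse is an assumed per-entry field):

void workingSetScan(uint32 now, uint32 tau) {   /* run on a page fault  */
    for (uint32 i = 0; i < NUM_ENTRIES; i++) {
        if (entry[i].referenced) {
            entry[i].lastUse = now;             /* still in working set */
            entry[i].referenced = 0;
        } else if (now - entry[i].lastUse > tau) {
            evict(i);                           /* outside working set  */
        }
    }
}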

25. Page Replacement: WSClock

  • The clock algorithm was an efficient way to find the oldest page.
    • Age referred to the time since loading into the page table.
  • The Working Set approach provided more information about which page to evict.
    • But it required a slow scan.
  • The final algorithm that we look at is a combination of the two:
    • Store coarse access information in a timer tick.
    • Use a double-linked list for a cheaper search.

26. Page Replacement: WSClock

  • Pages are added into the list in the order they are loaded.
  • Time of last use field is updated on a timer tick.
    • Same as the basic Working Set approach.
  • Leads to the state in a).
  • A page fault occurs: first page has R set.
    • Part b) shows the hand advances and R is reset on the page.
    • Note: b) and c) appear to be identical, publisher screwed up?
  • Part b) also shows a case where R=0.
    • Decide if the access time is older than \(\tau\).
    • Estimates if it is in the current working-set.

27. Page Replacement: WSClock

  • Looking at c) and d), access time is older than \(\tau\).
    • Not dirty, evict and load new.
  • If WSClock finds a dirty page it schedules it to be written to the disk.
    • It is not in the working set, so we want to evict it later.
  • Limit the number of dirty pages scheduled for writing in one pass.
    • Spreads out the disk traffic.
  • What if we do a full loop and don't find a page to evict?
    • If a write was scheduled during the first loop: wait for it to finish.
    • No writes - everything was in the working set: evict any page.
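Putting the rules together, a sketch of one pass in the same pseudocode style (scheduleWrite, waitForWrite, anyPage and MAX_WRITES are assumed helpers/parameters):

uint32 wsClock(uint32 now, uint32 tau) {
    uint32 writes = 0;
    for (uint32 step = 0; step < NUM_ENTRIES; step++, hand = hand->next) {
        if (hand->referenced) {              /* recently used: skip it   */
            hand->referenced = 0;
            continue;
        }
        if (now - hand->lastUse <= tau)      /* inside the working set   */
            continue;
        if (hand->modified) {                /* old but dirty: schedule  */
            if (writes < MAX_WRITES) {       /* a write, capped to       */
                scheduleWrite(hand);         /* spread the disk traffic  */
                writes++;
            }
            continue;
        }
        return hand->index;                  /* old and clean: evict     */
    }
    if (writes > 0) {                        /* full loop, no candidate  */
        waitForWrite();                      /* wait for a write, retry  */
        return wsClock(now, tau);
    }
    return anyPage();                        /* all in the working set   */
}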

28. Summary

  • Revision Guide:
    • I don't know what is going into the exam, but I would expect:
    • How does algorithm X implement property / feature Y?
    • Explain how algorithms X and Y differ in property / feature Z.