# Real-Time Operating Systems

08:15-10:00 Thursday, September 22nd, 2016

Virtual Memory.

# 1. Introduction

• We've seen one implementation of an address space (Base+Limit)
• Issue: limited by size of physical memory.
• Issue: contiguous, external fragmentation.
• Virtual Memory is an implementation that solves these issues.
• Cut up the process address space into smaller blocks (pages).
• Map each page independently to a physical frame.
• Pages can be unmapped to allow a larger logical space.
• Read/write idle pages to a backing store (e.g. hard-drive).

# 2. Using The Page Table

• Each process has its own page-table.
• One entry per page (e.g. 4KB).
• If the page is currently in physical memory, the entry shows the frame number (physical address).
• If it is not mapped in, the entry indicates this (e.g. the 48-52KB page).
• All memory accesses in the program go through the table.
• MOV REG,0.
• Page is $$\frac{0}{4096}=0$$.
• Table entry 0 shows frame 2.
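The translation just described can be sketched in C. This is a toy model, assuming 4KB pages and a hypothetical `pageTable` array whose entry 0 maps to frame 2, as in the example:

```c
#include <stdint.h>

#define PAGE_SHIFT 12              /* 4KB pages */
#define PAGE_MASK  0xFFF

/* Hypothetical table: entry 0 -> frame 2, as in the example. */
static uint32_t pageTable[4] = { 2, 0, 0, 0 };

/* Translate a virtual address to a physical address. */
uint32_t translate(uint32_t vaddr) {
    uint32_t page   = vaddr >> PAGE_SHIFT;   /* e.g. 0 / 4096 = 0 */
    uint32_t offset = vaddr & PAGE_MASK;     /* byte within page  */
    return (pageTable[page] << PAGE_SHIFT) | offset;
}
```

So `MOV REG,0` accesses virtual address 0, which lands in frame 2, i.e. physical address 0x2000.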

# 3. Indirection Through A Page-Table

• The steps needed per-access are more complex (slower) than Base+Limit.
• Easy to see in the C-style pseudo code below.
• Running this procedure per-instruction would be very slow.
• Needs hardware support to be feasible.
```c
uint32 memRead(uint32 address) {
    uint32 page   = address >> 12;      /* which page?        */
    uint32 offset = address & 0xFFF;    /* where in the page? */
    if (pageTable[page].present) {
        /* parentheses needed: + binds tighter than << */
        uint32* physical = (uint32*)((pageTable[page].frame << 12) + offset);
        return *physical;
    }
    throw PageFault;
}
```

# 4. Memory Management Unit (MMU)

• The MMU is a hardware implementation of the previous pseudo code (and writes).
• Sits between processor and bus.
• Page table is held in memory in the MMU.
• Performs:
• Decoding of page / offset.
• Lookup of frame numbers.
• Interrupts for page faults.
• The table is fixed size.

# 5. Page Table Entries

• Each entry in the table stores information about a single page.
• The present/absent bit shows if the page is resident in memory.
• When the page is written to, the modified bit records this.
• Protection bits indicate read/write privileges.
• Referenced shows the page has been used (read or write).
• A dirty page is one that has been altered since it was loaded/created.
• This memory is very expensive (MMU is generally on-die, similar to cache).
• The hardware table is limited in size.
• A 4GB machine with 4KB pages requires $$2^{20}$$ table entries...
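One possible layout for an entry is sketched below as a C bit-field. The field order and widths are illustrative only (real MMUs differ); it packs a 20-bit frame number and the status bits above into 32 bits:

```c
#include <stdint.h>

/* Sketch of a 32-bit page-table entry: 20-bit frame number plus the
   status bits described above. Field layout is illustrative. */
typedef struct {
    uint32_t frame      : 20;  /* physical frame number             */
    uint32_t present    : 1;   /* resident in memory?               */
    uint32_t referenced : 1;   /* read or written since last reset  */
    uint32_t modified   : 1;   /* dirty: written since loaded       */
    uint32_t readable   : 1;   /* protection bits                   */
    uint32_t writable   : 1;
    uint32_t unused     : 7;   /* pad to 32 bits                    */
} PageTableEntry;
```

With 4-byte entries, those $$2^{20}$$ entries already cost 4MB per process.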

# 6. MMU Price/Performance

• Consider the different design choices in placing the page table.
• Extreme: put it entirely in main memory.
• The size is no longer an issue: price and performance are low.
• Each memory access in the program will require one extra access.
• Extreme: put it entirely inside the MMU.
• Fast memory in a processor is expensive: registers / L1-cache.
• Using MMU to cache the page-table: reduces price / similar performance.

# 7. Caching the page-table

• This economic tradeoff is similar to memory use in general.
• The memory hierarchy optimises this with a cache.
• We can use a similar solution here.
• A small (expensive) memory inside the MMU.
• A large page-table in main memory.
• The frequently used table entries are cached.
• This works because programs exhibit locality of reference on the large scale (pages in their working set) as well as the small scale.

# 8. Example TLB

• The Translation Lookaside Buffer (TLB) is an associative memory.
• Stores some of the page-table entries (caches table in main memory).
• Policies for deciding which entries to cache come in the second part of the lecture.
• In the MMU table shown earlier the entry index was the page number.
• Now it is an explicit field within the table entry.
• The associative map uses this field as a key.

# 9. Using the TLB

• Example shows the hot pages used by a program.
• Executing a loop in pages 19,20,21 (readable and executable).
• Array in pages 129 and 130.
• Heap variables (including indices) in page 140.
• Stack is in page 860 and 861.
• On an access the page number is compared against all entries in parallel.
• If it is found (a hit) the frame number is used.
• On a miss the MMU looks up the table entry in memory.
• After reading it, one entry in the TLB is overwritten with the entry.
• Idea: temporal locality results in good performance.
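A software model of the lookup, assuming a small fully-associative TLB. The hardware compares all entries in parallel; the loop here only simulates that, and the round-robin victim choice is a placeholder for the real replacement policy:

```c
#include <stdint.h>

typedef struct {
    uint32_t page;    /* key: virtual page number */
    uint32_t frame;   /* value: physical frame    */
    int      valid;
} TLBEntry;

#define TLB_SIZE 8
static TLBEntry tlb[TLB_SIZE];
static int next_victim = 0;    /* simplistic round-robin replacement */

/* Returns 1 on a hit (frame written to *frame), 0 on a miss. */
int tlb_lookup(uint32_t page, uint32_t *frame) {
    for (int i = 0; i < TLB_SIZE; i++) {   /* parallel in hardware */
        if (tlb[i].valid && tlb[i].page == page) {
            *frame = tlb[i].frame;
            return 1;
        }
    }
    return 0;
}

/* On a miss the MMU walks the table in memory, then overwrites one entry. */
void tlb_insert(uint32_t page, uint32_t frame) {
    tlb[next_victim] = (TLBEntry){ page, frame, 1 };
    next_victim = (next_victim + 1) % TLB_SIZE;
}
```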

# 10. Managing the TLB

• The MMU hardware can be simplified by handling TLB misses in software.
• Extra trap on the processor: serviced by overwriting the table entry.
• This is slower than the MMU updating the TLB.
• The advantage is that the MMU hardware can be smaller.
• A larger TLB is required to achieve the same performance.
• A strange tradeoff to make as fully-associative sets are expensive.
• It relates to late-90s research on partitioned caches.
• Idea: Perhaps the software can estimate usage better than a simple circuit?
Design tradeoffs for software-managed TLBs. Richard Uhlig et al. ACM TOCS, Volume 12, Issue 3, Aug. 1994. http://dl.acm.org/citation.cfm?id=185515

# 11. Page Table Sizes

• A single page-table for a virtual address space is large.
• Assume a 32-bit physical address space, 4KB frames: the frame number is 20 bits.
• Needs to be aligned, assume 4-bytes per table entry.
• Assume the virtual address space is also 32-bits: 4MB per table.
• Each process in the system has a table.
• A typical desktop has about 100 processes: around 400MB of tables!
• It is rare for a process to use all entries (virtual space is sparse)...
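The 4MB-per-table figure follows directly from the assumptions above; a tiny sketch of the arithmetic:

```c
#include <stdint.h>

/* Size of one flat page table: 32-bit virtual addresses,
   4KB (2^12 byte) pages, 4-byte entries. */
uint64_t flat_table_bytes(void) {
    uint64_t entries = 1ULL << (32 - 12);  /* 2^20 pages */
    return entries * 4;                    /* 4MB        */
}
```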

# 12. Multi-level Page Tables

• The first step in reducing the memory footprint is to split the table into two-levels.
• Bits in the virtual address indicate the position in two tables.
• The entry in the first table stores the address of a second-level table.
• Some of the second-level tables will be completely empty...
• If we avoid storing them the structure is sparse.
• It uses much less memory than the single-level structure.

# 13. Multi-level Page Table Example

• Each table contains 1024 entries, 4KB pages/frames, 4MB total.
• PT1=1 PT2=3 Offset=4
• MMU looks at index 1 in the first table.
• Retrieves the frame number of the second table, reads index 3.
• This yields the frame number, and hence the physical address, for virtual address 0x00403004.
• Present/absent bits in the first-level indicate if the table exists.
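The index extraction for the 0x00403004 example can be sketched as a 10/10/12 bit split of the virtual address:

```c
#include <stdint.h>

/* Split a 32-bit virtual address into two 10-bit table indices
   and a 12-bit offset, as in the example above. */
void split(uint32_t vaddr, uint32_t *pt1, uint32_t *pt2, uint32_t *offset) {
    *pt1    = (vaddr >> 22) & 0x3FF;  /* index into first-level table  */
    *pt2    = (vaddr >> 12) & 0x3FF;  /* index into second-level table */
    *offset =  vaddr        & 0xFFF;  /* byte within the page          */
}
```

For 0x00403004 this gives PT1=1, PT2=3, Offset=4.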

# 14. Inverted Page Tables

• In a 64-bit address space further table levels are needed.
• An alternative is to reverse the mapping: store page number per frame.
• MMU must search the table to find frame from a page number.
• TLB can cache this mapping once found, but it is slow on a TLB-miss.
• The search can be accelerated by hashing the page numbers, storing frames matching the hash in a chain.
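The hashed search can be sketched as below. This is a toy model (sizes and names are illustrative): one record per physical frame, chained together by the hash of the page number:

```c
#include <stdint.h>

#define NFRAMES  16
#define NBUCKETS 8

/* Inverted table: one record per physical frame, chained by hash(page). */
typedef struct {
    uint32_t page;   /* virtual page mapped into this frame          */
    int      used;
    int      next;   /* next frame in the same hash chain, -1 = end  */
} Frame;

static Frame frames[NFRAMES];
static int buckets[NBUCKETS];   /* head frame per hash value, -1 = empty */

static unsigned hash(uint32_t page) { return page % NBUCKETS; }

void ipt_init(void) {
    for (int i = 0; i < NBUCKETS; i++) buckets[i] = -1;
}

void ipt_insert(uint32_t page, int frame) {
    frames[frame] = (Frame){ page, 1, buckets[hash(page)] };
    buckets[hash(page)] = frame;
}

/* Find the frame holding `page`, or -1 (a page fault). */
int find_frame(uint32_t page) {
    for (int f = buckets[hash(page)]; f != -1; f = frames[f].next)
        if (frames[f].used && frames[f].page == page)
            return f;
    return -1;
}
```

Only the chain for one hash value is walked, rather than the whole table.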

Intermission

# 15. Page Replacement Context

• Page tables and TLBs are both caches.
• Quick reminder of cache design.
• Small, fast, fixed size storage.
• In front of slower, larger, possibly variable-size storage.
• Assume the cache is always full.
• If a request is made that misses the cache.
• Retrieve from memory: need a free slot.
• Cache must evict an entry.
• The second part today is algorithms for picking which page entry to evict from the table.

# 16. Algorithms Overview

• Complex software makes the system slow.
• Complex hardware is expensive.
• We can arrange the algorithms we will see into four categories:
• Magic: we can only see the theoretically optimal behaviour after it has happened.
• We can record this in a simulation, use it for evaluation of algorithms.
• Best achievable: this is the highest performance that we could actually build.
• LRU is too complex to be practical: prediction performance is high, but the speed of operation is too slow. Needs approximations...

# 17. Page Replacement: LRU

• Idea: count clock cycles in the MMU.
• On every memory access, store the counter in the page table (TLB).
• Problem: increases the entry size by at least 100% (64-bit).
• Cache memory is very expensive: impractical, but it does exist.
• Problem 2: The low-latency counter is also tricky to implement.
• On a page fault:
• Evict the page not used for longest.
• Probably not going to be used soon.
• Temporal locality: approx. of optimal.
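A sketch of exact LRU as the hardware would have to implement it: a 64-bit timestamp per entry, updated on every single access, which is exactly the cost objected to above. Names are illustrative:

```c
#include <stdint.h>

#define NPAGES 4
static uint64_t last_used[NPAGES];   /* 64-bit counter per entry */
static uint64_t clock_cycles = 0;

/* Called on EVERY memory access: store the cycle counter. */
void touch(int page) { last_used[page] = ++clock_cycles; }

/* On a page fault: evict the page not used for longest. */
int lru_victim(void) {
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (last_used[p] < last_used[victim])
            victim = p;
    return victim;
}
```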

# 18. Page Replacement: Simple Counters

• Storing a counter was too expensive.
• Referenced flag, Modified flag: 2 bits.
• Flags are set by MMU hardware.
• When the entry is loaded into the table, both zero.
• Manual operation to reset the R flag.
• Simple algorithm for eviction:
• Not Recently Used (NRU).
• Prefer 00 - cheapest to evict.
• Prefer 01 to 10 - flush, might not reload.
• Avoid 11 - flush and probably will reload.
• Pick any entry from the best category available (coarse approximation of LRU).
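The class ordering can be sketched directly. Treating (R,M) as a two-bit number gives exactly the preference order 00 < 01 < 10 < 11; this version returns the first page in the best class found:

```c
/* NRU: classify each page by (R,M) and evict from the cheapest
   non-empty class: 00 < 01 < 10 < 11. */
int nru_victim(const int R[], const int M[], int n) {
    int best = -1, best_cls = 4;
    for (int p = 0; p < n; p++) {
        int cls = (R[p] << 1) | M[p];     /* 0..3 */
        if (cls < best_cls) { best_cls = cls; best = p; }
    }
    return best;
}
```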

# 19. Page Replacement: Further simple approaches

• Very simple approach (no hardware needed): FIFO.
• Queue pages as they are added during page-faults.
• Pick the oldest for eviction.
• Problem: doesn't care if busy or not.
• 2nd chance: combine FIFO with the simple R and M flags.
• Page fault at time 20: A is the oldest page in the table.
• If R is clear: evict the page.
• If R is set: move to back, reset time; becomes newest page.
• If all pages have R set: drop back to evicting oldest.
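Second chance can be sketched over a simple FIFO array (index 0 is oldest; names are illustrative). If every page has R set, the first pass clears all the bits and the second pass evicts the oldest, as described:

```c
#define QLEN 4

/* Second chance: scan from the oldest; a page with R set has its R
   cleared and moves to the back, the first page with R clear is evicted. */
int second_chance(int pages[QLEN], int R[]) {
    for (;;) {
        int oldest = pages[0];
        if (!R[oldest])
            return oldest;                  /* evict */
        R[oldest] = 0;                      /* second chance: reset R */
        for (int i = 0; i < QLEN - 1; i++)  /* move to back of queue  */
            pages[i] = pages[i + 1];
        pages[QLEN - 1] = oldest;
    }
}
```

The repeated shuffling to the back of the queue is what the clock algorithm below avoids.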

# 20. Page Replacement: Clock

• Single linked list: head takes $$\mathcal{O}(1)$$, tail takes $$\mathcal{O}(n)$$
• Double-links: head takes $$\mathcal{O}(1)$$, tail also takes $$\mathcal{O}(1)$$
• Double-linked list forms a loop, has a single entry point.
• Looks vaguely like a clock...
• On an eviction look at the entry (C).
• Is R=0?
• Oldest hasn't been used - evict.
• Is R=1?
• Second chance: reset time and R.
• Move the entry forward one step.
• C is now the oldest.
• Just a faster way to implement second chance.
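Because the list is circular, "move to back" becomes just advancing the hand. A minimal sketch with the circle as an array and a hand index (illustrative names):

```c
#define NP 4
static int Rbits[NP];
static int hand = 0;   /* single entry point into the circular list */

/* Clock: advance the hand until a page with R=0 is found; pages
   with R=1 have their bit reset (second chance) as the hand passes. */
int clock_victim(void) {
    for (;;) {
        if (Rbits[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NP;   /* victim's slot becomes newest */
            return victim;
        }
        Rbits[hand] = 0;
        hand = (hand + 1) % NP;
    }
}
```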

# 21. Page Replacement: NFU

• Now we add a timing interrupt to the simple 1-bit flags.
• Software handler can add some functionality.
• Not as expensive as actions on every access.
• Sample and reset R and M periodically.
• Store the samples in a counter per page.
• Estimate of how frequently the page is accessed.
• NFU: Not Frequently Used.
• Evict the page with the lowest Referenced count.
• Advantage: least popular page, hope it remains so.
• Problem: old counts are as valuable as new counts.
• Back in lecture 4 we saw a similar problem: estimating process behaviour.
• The solution was called "aging": discounted older information.

# 22. Page Replacement: Aging

• Aging uses exponential decay of the value of R counts.
• In the process context we used exponentially weighted average.
• Not much difference: $$x' = \frac{x}{2} + R$$
• As R is either 0 or 1, the difference is that we do not divide by 2.
• Both NFU and Aging approximate LRU in software.
• Aging is more accurate.
• Does not store the exact order.
• Two pages referenced in the same tick have the same score.
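A sketch of the aging update with 8-bit counters. Note this common formulation shifts the sampled R bit into the *top* of the counter so recent references carry the most weight; the halving is the same $$\frac{x}{2}$$ decay as in the formula above:

```c
#include <stdint.h>

#define NP 4
static uint8_t age[NP];   /* 8-bit aging counter per page */

/* On each timer tick: halve every counter (exponential decay of old
   R samples) and shift the freshly sampled R bit into the top. */
void aging_tick(const int R[NP]) {
    for (int p = 0; p < NP; p++)
        age[p] = (uint8_t)((age[p] >> 1) | (R[p] ? 0x80 : 0));
}

/* Evict the page with the lowest counter. */
int aging_victim(void) {
    int victim = 0;
    for (int p = 1; p < NP; p++)
        if (age[p] < age[victim]) victim = p;
    return victim;
}
```

The limited accuracy is also visible: two pages referenced in the same tick get identical updates, and history older than 8 ticks is lost entirely.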

# 23. Page Replacement: Working-set

• The working set of a program is the current set of pages it is actively using at a point in time.
• We cannot know exactly what this is.
• Some data that is used will never be used again.
• Some data will be used soon (stays in the working set for a short time).
• Some data will be used later (stays in set for a long time).
• While we cannot know exactly, we can estimate by accesses in the last $$k$$ cycles.
• At clock cycles $$t$$ during the execution.
• Size of the working set is shown in the graph.
• Idea: Programs do not access all data at once
• Consequence: If $$k$$ is large enough the approximation is accurate.

# 24. Page Replacement: Working-set

• If we knew $$w(k,t)$$ for a program:
• Preload pages before they are accessed: minimise page-faults.
• As with all great ideas this is fantastic in theory, impossible in reality.
• But leads to good approximations...
• Drop using last $$k$$ memory accesses: counting is expensive.
• Instead use a window of time: similar to NFU / Aging.
• Difference to NFU / Aging:
• Don't count - store the last reference time.
• Simulating LRU with the coarseness of the timer tick.
• Evict pages older than parameter $$\tau$$ (picked by tuning).
• Expensive to implement: needs a tweak...
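The per-page test at the heart of the approach is tiny; the expense is applying it. A sketch, with $$\tau$$ and the times measured in timer ticks (names illustrative):

```c
#include <stdint.h>

/* Working-set test: a page whose last reference (recorded at timer-tick
   granularity) is older than tau is considered outside the working set. */
int outside_working_set(uint32_t now, uint32_t last_use, uint32_t tau) {
    return (now - last_use) > tau;
}
```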

# 25. Page Replacement: WSClock

• The clock algorithm was an efficient way to find the oldest page.
• Age referred to the time since loading into the page table.
• But it required a slow scan.
• The final algorithm that we look at is a combination of the two:
• Store coarse access information in a timer tick.
• Use a double-linked list for a cheaper search.

# 26. Page Replacement: WSClock

• Pages are added into the list in the order they are loaded.
• Time of last use field is updated on a timer tick.
• Same as the basic Working Set approach.
• Leads to the state in a).
• A page fault occurs: first page has R set.
• Part b) shows the hand advances and R is reset on the page.
• Note: b) and c) appear to be identical, publisher screwed up?
• Part c) also shows a case where R=0.
• Decide if the access time is older than $$\tau$$.
• Estimates if it is in the current working-set.

# 27. Page Replacement: WSClock

• Looking at c) and d), access time is older than $$\tau$$.
• Not dirty, evict and load new.
• If WSClock finds a dirty page it schedules it to be written to the disk.
• It is not in the working set, so we want to evict it later.
• Limit the number of old clean pages to evict in one pass.
• Spreads out the disk traffic.
• What if we do a full loop and don't find a page to evict?
• If a write was scheduled during first loop: wait for it to finish.
• No writes - everything was in the working set: evict any page.
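One step of the scan can be sketched as below (entry layout and names are illustrative): the R test gives the second chance, the $$\tau$$ test estimates working-set membership, and dirty pages are scheduled for write-back rather than evicted immediately:

```c
#include <stdint.h>

typedef struct {
    uint32_t last_use;            /* updated at timer-tick granularity */
    int R, dirty, scheduled;      /* scheduled: write-back in flight   */
} WSEntry;

/* One step of the WSClock scan at the current hand position.
   Returns 1 if this entry can be evicted now, 0 to advance the hand. */
int wsclock_step(WSEntry *e, uint32_t now, uint32_t tau) {
    if (e->R) {                    /* recently used: second chance */
        e->R = 0;
        return 0;
    }
    if (now - e->last_use <= tau)  /* still in the working set */
        return 0;
    if (e->dirty) {                /* old but dirty: schedule the write, */
        e->scheduled = 1;          /* evict on a later pass              */
        return 0;
    }
    return 1;                      /* old and clean: evict */
}
```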

# 28. Summary

• Revision Guide:
• I don't know what is going into the exam, but I would expect:
• How does algorithm X implement property / feature Y?
• Explain how algorithm X and Y differ in property / feature Z?