DV1460 / DV1492:

Realtime (and) Operating Systems

I/O : Software

§5.1-5.10 (pg 351-428)

Table of Contents
  • Programming styles
  • I/O Software Stack
  • Device Drivers
  • RAID
  • Misc

1. Introduction

  • Each device is essentially unique.
    • The low-level interface of shared registers is tied directly to the hardware design of the device.
    • Bits map onto specific functionality in hardware.
    • Timing constraints differ from device to device.
    • Valid sequences of commands and responses (protocols) may be unique.
  • Differences in these areas impose different trade-offs in the choice between interrupts, DMA and direct control.
  • The I/O sub-system has to hide the madness.
    • Each device will have a custom piece of software: the driver.
    • All of the drivers should present a common(-ish) interface.

2. Goals

Device Independence
Programs should be able to access devices without knowing exactly what kind they are.
  • Example: cat /x/y should not care what kind of device stores the file.
  • All storage devices should operate the same way, all ethernet cards...
  • Only categories of device should have differences.
Uniform Naming
The OS (not the device) should choose the names as strings.
  • Example: /dev/sda6 and /dev/disk/by-uuid/...
  • Gives the OS the flexibility to substitute one device for another.

3. Goals

Error Handling
Only propagate errors as far up the stack as necessary.
  • If an error can be handled at a low level, do it silently.
  • Example: if the driver can retry a failed read, it should be transparent to the FS.
Support synchronous operations
Supply synchronous APIs for asynchronous operations.
  • Most interactions with devices are asynchronous.
    • Actions are initiated, some time later a reply will arrive.
    • Hide these details from layers above: asynchronous programming is harder.
    • Where performance is critical, support asynchronous access as well.
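A minimal sketch of how a blocking call can wrap an asynchronous device operation, using a semaphore that the interrupt handler posts. All names here (start_device_read, device_isr) are illustrative, and a real kernel would use its own sleep/wake primitives rather than POSIX semaphores:

#include <semaphore.h>

extern void start_device_read(char *buf, int size);  /* hypothetical: kicks off the device */

static sem_t io_done;              /* initialise with sem_init(&io_done, 0, 0) */

int blocking_read(char *buf, int size)
{
    start_device_read(buf, size); /* asynchronous: returns immediately    */
    sem_wait(&io_done);           /* sleep here until the ISR posts       */
    return size;                  /* the caller never sees the asynchrony */
}

void device_isr(void)             /* hypothetical interrupt handler */
{
    sem_post(&io_done);           /* wake the blocked caller */
}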

4. Goals

Decouple processing steps from device operation.
Related to the above issue: supply buffers where necessary to allow batching of computation.
  • Supplying enough memory means that the device can always write data.
  • Latency between device operations is probably much longer than between processing steps.
  • Too much buffering introduces overhead (e.g. cost of copying data).
Support resource-sharing.
Single-user devices (e.g. printers) need some form of resource locking for sharing.
  • We saw concurrency primitives in process scheduling.
    • Shared resources need a form of mutual exclusion.

5. Programming styles for I/O

  • Using a printer as an example we will look at different I/O styles.
    • These match the hardware mechanisms in the previous lecture.
    • Three styles will be hidden beneath the API provided to the user.
  • Diagram a) shows a process about to access the printer.
    • It has built up some data inside its memory space.
  • b) shows the kernel copying the data into kernel memory space.
    • This allows control to return to the user process and provides the illusion of an atomic operation; c) shows the asynchronous operation proceeding.

6. Style: Programmed I/O

copy_from_user(buffer, kbuffer, size);       /* user data -> kernel buffer */
for (i = 0; i < size; i++) {
    while (!(inb(PRINTER_STATUS) & READY_MASK))
        ;                                    /* poll until the printer is ready */
    outb(PRINTER_DATA, kbuffer[i]);
}
return_to_user();
  • Code is roughly the same as the textbook: uses the real macros.
  • When data is written, device flips status flag until it can take more.
  • Busy-waiting (polling) loop, ties up CPU waiting for the device.
  • Takes a long time to return control to user.
  • Very simple to write: might do it this way in an embedded system.
    • Depends on the timing constraints - can put real-time delays in the loop easily.
    • Also referred to as "bit banging", or "bare metal".

7. Style: Interrupt-driven I/O

void handle_write() {
    copy_from_user(buffer, kbuffer, size);
    enable_interrupts();
    while (!(inb(PRINTER_STATUS) & READY_MASK))
        ;                                    /* wait for ready */
    outb(PRINTER_DATA, kbuffer[0]);          /* prime with the first byte */
    pos = 1;
    yield();
}

// Not a real function: ISV jumps here.
{
    if (pos == size)
        unblock_user();
    else {
        outb(PRINTER_DATA, kbuffer[pos]);
        pos++;
    }
    finish_interrupt();                      // Ack and resume.
}
  • The first byte is written directly; each printer interrupt sends the next.
    • The CPU is free to run other processes between interrupts.
    • When the last byte has been sent, the handler unblocks the user process.

8. Style: I/O using DMA

void handle_write() {
    copy_from_user(buffer, kbuffer, size);
    dma_write(PRINTER_DATA, PRINTER_IRQ, kbuffer, size);
    yield();
}

// ISV jumps here when DMA finishes.
{
    unblock_user();
    finish_interrupt();
}
  • The coprocessor runs the loop to transfer the data.
    • Printer interrupt triggers the DMA controller.
    • CPU is not involved: single interrupt when transfer is finished.

9. Summary of programming styles.

  • In all three cases:
    • The operation is blocking - asynchronous steps hidden from user process.
  • DMA is obviously preferable:
    • Least complex code (although setting up the DMA controller is fiddly).
    • Least overhead for the CPU.
    • Requires extra hardware (standard on PC, common on ARM).
    • Controller must be fast enough to keep the device supplied.
  • Interrupt style is a good fall-back in a non-realtime environment.
  • Programmed I/O allows precise control over timing.

10. I/O Software Layers

  • The typical software stack for I/O is shown in the diagram.
  • As with all layers of abstraction: hide the messiest details at the bottom.
  • Interrupt handlers are very low-level (assembly).
  • The actions that they take will interact with the device driver code directly (C).
  • When an interrupt handler is written:
    • It interacts with one specific piece of hardware.
    • It has knowledge of the internal details of the device driver it supports.
  • We can provide a general overview of interrupt handlers.

11. Interrupt Handlers

  • Save any state required to resume the program (including condition codes).
    • On Linux/x86 there is a specific stack to use.
    • Hardware saves codes / flags / user-stack for us.
  • Set up the memory context for the handler (e.g. TLB, MMU and page-table).
    • x86 stores the data segment for interrupt handling.
    • The switch occurs before jumping into the handler.
  • Execute the handler.
    • Normally this is an asm stub, C body.
    • Has a stack to work in, can access kernel data structures.
  • If the handler unblocks a process, inform the scheduler.
  • Note: steps 7-10 in the textbook are specific to context-switches.
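As a sketch, the C body that the assembly stub might call after saving state; all names here (irq_table, acknowledge_irq, request_reschedule) are hypothetical, not from any particular kernel:

struct driver { int (*handle)(struct driver *); };
extern struct driver *irq_table[256];      /* one driver per interrupt line */
extern void acknowledge_irq(int irq);
extern void request_reschedule(void);

void irq_handler(int irq)                  /* called by the asm stub */
{
    acknowledge_irq(irq);                  /* ack at the interrupt controller  */
    struct driver *drv = irq_table[irq];   /* dispatch into the device driver  */
    if (drv->handle(drv))                  /* nonzero: a process was unblocked */
        request_reschedule();              /* so inform the scheduler          */
}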

12. Typical driver structure

  • This is what the big ol' wall-of-text on pg 360 is describing.
  • Each category of device (e.g. block, character, NIC) has a generic API.
  • The driver starts with an initialisation step (sets parameters).
  • It handles mapping generic addressing (e.g. block addresses) onto device-specific addresses.
  • Builds up a queue of commands, monitors progress (including interrupts).
  • Handles any scheduler interactions to provide blocking API.

13. Device Independence

  • The generic API at the top provides device independence.
    • We can plug in another device/driver as replacement without changing any software higher up the stack.
  • Each category of device then has common device-independent functionality.
    • e.g. the block-device part of the I/O subsystem.
  • Device naming scheme (e.g. sda1, sda2 etc) lives in the device-independent part.

14. Buffering

  • Buffering is device-independent (as far as possible).
    • General schemes that are repeated.
    • Small buffer in kernel memory (pinned page).
    • Larger buffer for user process (paged memory).
    • Small buffer hides latency in the larger buffer.
  • This is extended to n-levels of buffering in a ring.
  • Another scheme is a circular buffer.
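A minimal sketch of the circular-buffer scheme (single producer, single consumer; a real kernel adds locking or memory barriers). Sizes and names are illustrative:

#define BUF_SIZE 256

struct ring {
    unsigned char data[BUF_SIZE];
    unsigned head;    /* next slot the producer (device) fills   */
    unsigned tail;    /* next slot the consumer (process) drains */
};

int ring_put(struct ring *r, unsigned char c)   /* 0 on success, -1 if full */
{
    unsigned next = (r->head + 1) % BUF_SIZE;
    if (next == r->tail) return -1;             /* full: would overwrite */
    r->data[r->head] = c;
    r->head = next;
    return 0;
}

int ring_get(struct ring *r, unsigned char *c)  /* 0 on success, -1 if empty */
{
    if (r->head == r->tail) return -1;          /* empty */
    *c = r->data[r->tail];
    r->tail = (r->tail + 1) % BUF_SIZE;
    return 0;
}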

Break (15mins)

15. Motivation for RAID

  • A fast mechanical drive (7200RPM) can read 150MB/s.
    • Assume an expensive drive, and sequential reads of large files.
  • A single SATA-3 port is bottlenecked at 6Gb/s (≈600MB/s after encoding overhead).
  • A single (mechanical) drive cannot saturate a single controller.
Performance motivation
Using multiple drives in parallel produces higher throughput.
  • Drives do not last forever; their end is always sudden and unexpected.
  • Files die, users cry.
Redundancy motivation
Using multiple drives in parallel we can store copies.
  • What are our options for maximising redundancy, speed and storage size?

16. Striping for performance (RAID-0)

  • Imagine two models of HDD:
    • Model A is 1TB made of 2G (≈2 billion) 512-byte sectors.
    • Model B is 2TB made of 4G 512-byte sectors.
    • Model B probably costs more than twice the price of model A*.
  • In both cases, a sequential read of sectors 0-2000...

17. RAID-0 Overview.

  • The drive presented to the OS is stored on multiple physical disks.
  • If the OS assumes that contiguous data will lead to higher performance...
    • The simple striping scheme will distribute evenly over both drives.
  • Random access is a bit more complex.
    • If the drives are similar (e.g. same model) the disk geometry should be the same.
    • Arm motions should correlate between the two drives.
    • Otherwise we have to wait for the larger seek time of the two.
    • There will be a small overhead in the RAID controller.
  • Summary:
    • Latency will be slightly worse than a single drive.
    • Bandwidth should double for realistic workloads.
    • Probability of a fatal error has also doubled...
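As a sketch, the address arithmetic behind striping: logical blocks are dealt out strip by strip across the disks (parameter names are illustrative):

struct location { int disk; long block; };

/* Map a logical block onto (disk, local block) under RAID-0 striping. */
struct location raid0_map(long logical, int ndisks, long strip_blocks)
{
    long strip  = logical / strip_blocks;   /* which strip overall          */
    long offset = logical % strip_blocks;   /* position inside that strip   */
    struct location loc;
    loc.disk  = strip % ndisks;             /* strips rotate over the disks */
    loc.block = (strip / ndisks) * strip_blocks + offset;
    return loc;
}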

18. Mirroring for redundancy (RAID-1)

  • We trade storage size for redundancy.
    • In the simplest scheme we store two copies of everything.
    • The storage size is halved, but if one drive fails we can continue.
    • Read/write performance differs: reads can be served from either copy, while writes must update both (see the sketch below).
  • If the hardware supports hot-swap the failed drive can be replaced.
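A sketch of why the two directions differ under mirroring (disk_read/disk_write are assumed helpers, not a real API):

extern int disk_read(int disk, long block, void *buf);
extern int disk_write(int disk, long block, const void *buf);

int mirror_write(long block, const void *buf)
{
    int a = disk_write(0, block, buf);      /* both copies must be updated, */
    int b = disk_write(1, block, buf);      /* so writes gain no speed      */
    return (a == 0 && b == 0) ? 0 : -1;
}

int mirror_read(long block, void *buf)
{
    static int next = 0;                    /* trivial round-robin: reads   */
    int disk = next; next ^= 1;             /* can be spread over the pair  */
    if (disk_read(disk, block, buf) == 0) return 0;
    return disk_read(disk ^ 1, block, buf); /* one drive failed: use the copy */
}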

19. How parity works

  • Consider storing four decimal digits:
    • x = (0, 6, 2, 9), \(\Sigma x_i \equiv 7 \; \pmod{10}\)
  • The parity is the remainder when divided by the base.
  • If one piece of storage is lost we can recompute.
    • e.g. \( x_1 = (7-(0+2+9)) \bmod 10 = 6 \)
  • For efficiency we normally do this in binary.
    • XOR is addition (subtraction) modulo 2.
    • The familiar XOR truth table is just the final bit of binary addition.
  • So parity can be calculated quickly in software, or cheaply in hardware...
XOR | 0 | 1
  0 | 0 | 1
  1 | 1 | 0
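A sketch of the binary version: the parity strip is the XOR of the data strips, and a lost strip is rebuilt by XOR-ing the parity with the survivors (sizes and names illustrative):

#define STRIP 4096   /* bytes per strip */

void make_parity(unsigned char *strips[], int n, unsigned char *parity)
{
    for (int i = 0; i < STRIP; i++) {
        unsigned char p = 0;
        for (int d = 0; d < n; d++)
            p ^= strips[d][i];              /* addition modulo 2 */
        parity[i] = p;
    }
}

void rebuild(unsigned char *strips[], int n, int lost,
             const unsigned char *parity)
{
    for (int i = 0; i < STRIP; i++) {
        unsigned char p = parity[i];
        for (int d = 0; d < n; d++)
            if (d != lost)
                p ^= strips[d][i];          /* x^x = 0: survivors cancel out */
        strips[lost][i] = p;                /* what remains is the lost data */
    }
}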

20. Cake, and eating it too (RAID-5)

  • RAID-0 maximised storage size and speed.
  • RAID-1 maximised redundancy (and read speed) at the cost of storage size.
  • There are configurations that offer other trade-offs.
  • RAID-5 stripes data plus a rotating parity strip across the drives (e.g. five in the diagram).
    • e.g. Parity strip 0-3 = 0 XOR 1 XOR 2 XOR 3
  • One failed drive can be tolerated.
    • Should be removed quickly.
    • Rebuilding the array is slow...
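A sketch of the layout arithmetic, assuming one common rotation scheme (real controllers differ in the details):

struct loc5 { int disk; long strip; };

/* With n disks each stripe holds n-1 data strips plus one parity strip; */
/* the parity position rotates so no single disk becomes a hot spot.     */
struct loc5 raid5_map(long logical, int ndisks)
{
    long stripe = logical / (ndisks - 1);   /* which stripe               */
    long index  = logical % (ndisks - 1);   /* which data strip within it */
    int  pdisk  = (int)(stripe % ndisks);   /* rotating parity disk       */
    struct loc5 loc;
    loc.disk  = (index >= pdisk) ? (int)index + 1 : (int)index;  /* skip parity */
    loc.strip = stripe;
    return loc;
}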

21. Taking RAID to a higher level

  • RAID-6: Add a second parity check, orthogonal to the first.
    • Can tolerate the loss of two disks.
    • Rebuilding an array is quite intense; the most probable failure is losing a second disk while rebuilding RAID-5 after the first loss.
  • RAID can be nested, one RAID scheme exporting a "disk" to another.
    • e.g. RAID0+1 shown in diagram.
  • Also non-standard RAID levels, custom error-checking codes... storage management at scale is a large area.

22. Clocks

  • We use clocks (timers) to measure time:
    • How long has a piece of code been executing for?
    • To communicate with the user using the time of day (real time).
    • To trigger code at particular times (timeouts, alarms etc).
  • Every clock is discrete: ticks at a particular rate.
  • Faster clocks give finer resolution - they measure shorter periods of time with the same number of bits counting the ticks.
  • Most systems feature a real-time clock:
    • Programmable rate (length of a tick) and period (number of ticks).
    • Generates a hardware interrupt when the period expires.
    • Allows sleeping processes (suspended for a period of time).
    • Allows pre-emptive multitasking (trigger the scheduler every period).
    • Basis of any real-time resource measurement (e.g. fair scheduling).

23. Clocks

  • The highest accuracy clock in the system is provided by the CPU itself.
    • Originally called multimedia timers, these tick at the frequency of the processor (e.g. 4GHz).
    • Can be accessed with inline assembly - read a 64-bit clock cycle counter (see the sketch after this list).
  • Generally the system ticks at a slower rate - 10-1000Hz on a PC.
  • Watchdog timers (timeouts) guard protocols.
    • Send a message, expect a response within n-ms: start a timeout.
    • If the response is received, cancel the timeout.
    • If the timeout expires, run some error-handling code (e.g. resend the data).
  • Between user-process requests and OS accounting, the system needs to run a lot of clocks.
    • Like any hardware resource - there are a limited number of clocks.
    • The system needs to multiplex virtual clocks onto the hardware.
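A minimal sketch of reading the 64-bit cycle counter (the x86 TSC) with gcc/clang inline assembly; one caveat: on many modern CPUs the counter ticks at a constant reference rate rather than the instantaneous core frequency:

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  /* result in EDX:EAX */
    return ((uint64_t)hi << 32) | lo;
}

/* Usage: uint64_t t0 = rdtsc(); work(); uint64_t cycles = rdtsc() - t0; */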

24. Clocks

  • Store each pending timer in a linked list, sorted by expiry time.
  • Timers are measured in ticks (rounded where necessary).
  • Next event: the difference between the current tick count and the head of the list.
    • Program the next interrupt for that many ticks (see the sketch after this list).
  • Soft timers: no interrupt, best effort expiry.
    • For frequent events it is likely we will enter the kernel (e.g. a syscall) soon enough to process them (soft real-time).
    • No interrupt overhead; feasible for 10\(\mu s\) timers.
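A minimal sketch of the sorted-list scheme (illustrative names, no locking shown):

struct timer {
    unsigned long expires;            /* absolute tick count           */
    void (*callback)(void);
    struct timer *next;
};

static struct timer *pending;         /* sorted, earliest expiry first */

void timer_add(struct timer *t)
{
    struct timer **p = &pending;
    while (*p && (*p)->expires <= t->expires)
        p = &(*p)->next;              /* walk to the insertion point   */
    t->next = *p;
    *p = t;
}

void timer_tick(unsigned long now)    /* called from the clock interrupt */
{
    while (pending && pending->expires <= now) {
        struct timer *t = pending;
        pending = t->next;
        t->callback();                /* expire in order               */
    }
    /* program the next interrupt for pending->expires - now ticks */
}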

25. Event Loops

  • Most GUI programming styles operate around event loops.
  • Event queue with a semaphore: program sleeps until event / message.
  • Program must execute its handler code quickly and return (beachball / zzzz).
  • Need for longer processing can be handled using threads.
  • All I/O presented through consistent interface (message passing).
while (!quit) {
    getEventMsg(&event);             // blocks until event/message
    switch (event.type) {
        case TIMER: timerHandler(); break;
        case PAINT: render();       break;
        case INPUT: updateState();  break;
        ...
    }
}

26. Power Management

  • Computers use a lot of power.
    • Laptop: 54Whr battery. 3 hours without power management (18W for system).
    • 14 hours with power management: 3.9W (entire system!).
    • Desktop (performance): 800W (under high load).
  • How does power management work?
    • If no processes are running, the system has an idle loop.
    • Naive approach (early PCs): busy loop for a while, check scheduler.
    • Simple approach: sleep for a quantum, check scheduler.
  • In the sleep state the processor drops voltage and frequency.
    • Hard-disks can power down between uses.
    • Network can power down antenna, check less frequently.

27. Power Management

  • Smartest power management:
    • It is tempting to think that doubling frequency would double power.
    • Static power always dissipates: inefficiencies in physical process.
    • Dynamic power dissipates under load: energy to drive transistors.
    • Static power is constant(-ish) and dynamic power is cubic(-ish) in frequency (see the sketch after this list).
    • Exact numbers change for each generation of technology.
  • So running at 2x the frequency uses more than 2x the power.
    • Consequence: operating in bursts is more power efficient.
  • A system can bundle together some operations (timers, I/O, scheduling...).
    • Run them quickly at a high processor frequency.
    • Switch down to a lower frequency for longer (hurry up and sleep).
  • Lots of current research in this area.
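As a rough sketch of the reasoning (assuming voltage scales roughly linearly with frequency): dynamic power is \( P_{dyn} \approx \alpha C V^2 f \), and with \( V \propto f \) this gives \( P_{dyn} \propto f^3 \). For a fixed task the run time is \( t \propto 1/f \), so dynamic energy is \( E_{dyn} = P_{dyn} \, t \propto f^2 \) while static energy is \( E_{static} = P_{static} \, t \propto 1/f \). Sprinting costs more dynamic energy but shortens the window in which static power leaks; "hurry up and sleep" wins when the subsequent idle state also cuts the static term.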

28. Summary

  • Drivers hide device specific details from the system.
    • Handle awkward programming models like interrupt-driven code.
  • The device-independent layer provides APIs for each category.
  • The goal is device independence.
  • Storage is quite complex (big business); there are many layers of complexity beyond basic I/O on disks.
  • Clocks (and real-time behaviour) underpin most modern OS primitives.
  • Some global system properties do not break nicely into independent problems:
    • Power management often cuts across abstraction boundaries in the OS.
    • Cannot be solved by purely local (nicely abstracted) means.