# Realtime- (and) Operating-Systems

I/O : Software

§5.1-5.10 (pg 351-428)

Programming styles
I/O Software Stack
Device Drivers
RAID
Misc

# 1. Introduction

• Each device is essentially unique.
• The low-level interface of shared registers is tied directly to the hardware design of the device.
• Bits map onto specific functionality in hardware.
• Timing constraints differ from device to device.
• Valid sequences of commands and responses (protocols) may be unique.
• Differences in these areas impose different trade-offs in the choice between interrupts, DMA and direct control.
• The I/O sub-system has to hide the madness.
• Each device will have a custom piece of software: the driver.
• All of the drivers should present a common(-ish) interface.

# 2. Goals

Device Independence
Programs should be able to access devices without knowing exactly what kind they are.
• Example: cat /x/y should not care what kind of device stores the file.
• All storage devices should operate the same way, all ethernet cards...
• Only categories of device should have differences.
Uniform Naming
The OS (not the device) should choose the names as strings.
• Example: /dev/sda6 and /dev/disk/by-uuid/...
• Gives the OS the flexibility to substitute one device for another.

# 3. Goals

Error Handling
Only propagate errors as far up the stack as necessary.
• If an error can be handled at a low-level, do it silently.
• Example: if the driver can retry a failed read - transparent to the FS.
Support synchronous operations
Supply synchronous APIs for asynchronous operations.
• Most interactions with devices are asynchronous.
• Actions are initiated, some time later a reply will arrive.
• Hide these details from layers above: asynchronous programming is harder.
• Where performance is critical, support asynchronous access as well.

# 4. Goals

Decouple processing steps from device operation.
Related to the above issue: supply buffers where necessary to allow batching of computation.
• Supplying enough memory means that the device can always write data.
• Latency between device operations is probably much longer than between processing steps.
• Too much buffering introduces overhead (e.g. cost of copying data).
Support resource-sharing.
Single-user devices (e.g. printers) need some form of resource locking for sharing.
• We saw concurrency primitives in process scheduling.
• Shared resources need a form of mutual exclusion.

# 5. Programming styles for I/O

• Using a printer as an example we will look at different I/O styles.
• These match the hardware mechanisms in the previous lecture.
• Three styles will be hidden beneath the API provided to the user.
• Diagram a) shows a process about to access the printer.
• Built up some data inside the process memory space.
• b) shows the kernel copying it into kernel memory space.
• This allows control to return to the user process, provides the illusion of an atomic operation, while c) shows the asynchronous operation.

# 6. Style: Programmed I/O

```c
copy_from_user(buffer, kbuffer, size);
for (i = 0; i < size; i++) {
    while (!(inb(PRINTER_STATUS) & READY_MASK))
        ;                       /* busy-wait until the device is ready */
    outb(PRINTER_DATA, kbuffer[i]);
}
return_to_user();
```
• Code is roughly the same as the textbook: uses the real macros.
• When data is written, device flips status flag until it can take more.
• Busy-waiting (polling) loop, ties up CPU waiting for the device.
• Takes a long time to return control to user.
• Very simple to write: might do it this way in an embedded system.
• Depends on the timing constraints - can put real-time delays in the loop easily.
• Also referred to as "bit banging", or "bare metal".

# 7. Style: Interrupt-driven I/O

```c
void handle_write() {
    copy_from_user(buffer, kbuffer, size);
    enable_interrupts();
    while (!(inb(PRINTER_STATUS) & READY_MASK))
        ;                       /* wait for the first free slot */
    outb(PRINTER_DATA, kbuffer[0]);
    pos = 1;
    yield();
}

// Not a real function: the ISV jumps here on each printer interrupt.
{
    if (pos == size)
        unblock_user();
    else {
        outb(PRINTER_DATA, kbuffer[pos]);
        pos++;
    }
    finish_interrupt();         // Ack and resume.
}
```

# 8. Style: I/O using DMA

```c
void handle_write() {
    copy_from_user(buffer, kbuffer, size);
    dma_write(PRINTER_DATA, PRINTER_IRQ, kbuffer, size);
    yield();
}

// ISV jumps here when the DMA transfer finishes.
{
    unblock_user();
    finish_interrupt();
}
```
• The coprocessor runs the loop to transfer the data.
• Printer interrupt triggers the DMA controller.
• CPU is not involved: single interrupt when transfer is finished.

# 9. Summary of programming styles.

• In all three cases:
• The operation is blocking - asynchronous steps hidden from user process.
• DMA is obviously preferable:
• Least complex code (although setting up the DMA controller is fiddly).
• Least overhead for the CPU.
• Requires extra hardware (standard on PCs, common on ARM).
• Controller must be fast enough to keep the device supplied.
• Interrupt style is a good fall-back in a non-realtime environment.
• Programmed I/O allows precise control over timing.

# 10. I/O Software Layers

• The typical software stack for I/O is shown in the diagram.
• As with all layers of abstraction: hide the messiest details at the bottom.
• Interrupt handlers are very low-level (assembly).
• The actions that they take will interact with the device driver code directly (C).
• When an interrupt handler is written:
• It interacts with one specific piece of hardware.
• It has knowledge of the internal details of the device driver it supports.
• We can provide a general overview of interrupt handlers.

# 11. Interrupt Handlers

• Save any state required to resume the program (including condition codes).
• On linux/x86 there is a specific stack to use.
• Hardware saves codes / flags / user-stack for us.
• Setup the memory context for the handler (e.g. TLB, MMU and page-table).
• x86 stores the data segment for interrupt handling.
• The switch occurs before jumping into the handler.
• Execute the handler.
• Normally this is an asm stub, C body.
• Has a stack to work in, can access kernel data structures.
• If the handler unblocks a process, inform the scheduler.
• Note: steps 7-10 in the textbook are specific to context-switches.

# 12. Typical driver structure

• This is what the big ol' wall-of-text on pg 360 is describing.
• Each category of device (e.g. block, character, NIC) has a generic API.
• The driver is started up with an initialisation (sets parameters).
• Builds up a queue of commands, monitors progress (including interrupts).
• Handles any scheduler interactions to provide blocking API.

# 13. Device Independence

• The generic API at the top provides device independence.
• We can plug in another device/driver as replacement without changing any software higher up the stack.
• Each category of device then has common device-independent functionality.
• e.g. the block-device part of the I/O subsystem.
• Device naming scheme (e.g. sda1, sda2 etc) lives in the device-independent part.

# 14. Buffering

• Buffering is device-independent (as far as possible).
• General schemes that are repeated.
• Small buffer in kernel memory (pinned page).
• Larger buffer for user process (paged memory).
• Small buffer hides latency in the larger buffer.
• This is extended to n-levels of buffering in a ring.
• Another scheme is a circular buffer.
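The circular-buffer scheme can be sketched in a few lines of C. This is an illustrative sketch, not code from any particular kernel; the names (`ring`, `ring_put`, `ring_get`) are made up. One slot is kept empty so that `head == tail` unambiguously means "empty".

```c
#include <stddef.h>

#define RING_SIZE 8             /* capacity is RING_SIZE - 1 */

struct ring {
    unsigned char data[RING_SIZE];
    size_t head;                /* next slot to write */
    size_t tail;                /* next slot to read  */
};

static int ring_put(struct ring *r, unsigned char b) {
    size_t next = (r->head + 1) % RING_SIZE;
    if (next == r->tail)
        return -1;              /* full: producer must wait or drop */
    r->data[r->head] = b;
    r->head = next;
    return 0;
}

static int ring_get(struct ring *r, unsigned char *b) {
    if (r->head == r->tail)
        return -1;              /* empty: consumer must wait */
    *b = r->data[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 0;
}
```

In a driver, the interrupt handler would be the producer and the device-independent layer the consumer (or vice versa for output).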

Intermission

# 15. Motivation for RAID

• A fast mechanical drive (7200RPM) can read 150MB/s.
• Assume an expensive drive, and sequential reads of large files.
• A single SATA port is bottlenecked at 6Gb/s (750MB/s).
• A single (mechanical) drive cannot saturate a single controller.
Performance motivation
Using multiple drives in parallel will produce higher speeds
• Drives do not last forever, their end is always sudden and unexpected.
• Files die, users cry.
Redundancy motivation
Using multiple drives in parallel we can store copies.
• What are our options for maximising redundancy, speed and storage size?

# 16. Striping for performance (RAID-0)

• Imagine two models of HDD:
• Model A is 1TB made of 2G (roughly 2 billion) 512-byte sectors.
• Model B is 2TB made of 4G (roughly 4 billion) 512-byte sectors.
• Model B probably costs more than twice the price of model A*.
• In both cases, a sequential read of sectors 0-2000...

# 17. RAID-0 Overview.

• The drive presented to the OS is stored on multiple physical disks.
• If the OS assumes that contiguous data will lead to higher performance...
• The simple striping scheme will distribute evenly over both drives.
• Random access is a bit more complex.
• If the drives are similar (e.g. same model) the disk geometry should be the same.
• Arm motions should correlate between the two drives.
• Otherwise we have to wait for the larger seek time of the two.
• There will be a small overhead in the RAID controller.
• Summary:
• Latency will be slightly worse than a single drive.
• Bandwidth should double for realistic workloads.
• Probability of a fatal error has also doubled...
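The striping scheme above amounts to a simple address translation, sketched below. This is an illustration (one sector per strip for clarity; real arrays stripe in larger chunks), and the names are hypothetical.

```c
/* RAID-0 sketch: map a logical sector on the array to a
 * (disk, sector) pair by round-robin striping. */
struct location {
    int  disk;
    long sector;
};

static struct location raid0_map(long logical, int ndisks) {
    struct location loc;
    loc.disk   = (int)(logical % ndisks);   /* which drive         */
    loc.sector = logical / ndisks;          /* offset on that drive */
    return loc;
}
```

A sequential read of logical sectors 0, 1, 2, 3... alternates between the drives, which is where the bandwidth doubling comes from.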

# 18. Mirroring for redundancy (RAID-1)

• We trade storage size for redundancy.
• In the simplest scheme we store two copies of everything.
• The storage size is halved, but if one drive fails we can continue.
• If the hardware supports hot-swap the failed drive can be replaced.

# 19. How parity works

• Consider storing four decimal digits:
• x = (0, 6, 2, 9), $$\Sigma x_i \equiv 7 \; \pmod{10}$$
• The parity is the remainder when divided by the base.
• If one piece of storage is lost we can recompute.
• e.g. $$x_1 \equiv 7-(0+2+9) \equiv 6 \; \pmod{10}$$
• For efficiency we normally do this in binary.
• XOR is addition (subtraction) modulo 2.
• The logic table that you are familiar with is the final bit.
• So parity can be calculated quickly in software, or cheaply in hardware...
| + | 0 | 1 |
|---|---|---|
| **0** | 00 | 01 |
| **1** | 01 | 10 |
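The same recomputation works in binary: the parity byte is the XOR of the data bytes, and XOR-ing the parity with the survivors rebuilds any one lost byte. A minimal sketch (function names are illustrative):

```c
#include <stddef.h>

/* Parity = XOR of all data strips (bitwise addition mod 2). */
static unsigned char parity(const unsigned char *strips, size_t n) {
    unsigned char p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= strips[i];
    return p;
}

/* Rebuild strip `lost` from the parity and the surviving strips. */
static unsigned char rebuild(const unsigned char *strips, size_t n,
                             size_t lost, unsigned char par) {
    unsigned char p = par;
    for (size_t i = 0; i < n; i++)
        if (i != lost)
            p ^= strips[i];
    return p;                   /* the missing byte */
}
```

This works because XOR is its own inverse: adding a value twice mod 2 cancels it out.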

# 20. Cake, and eating it too (RAID-5)

• RAID-0 maximised storage size and speed.
• RAID-1 maximised speed and redundancy.
• There are configurations that offer other trade-offs.
• RAID-5 stripes data with distributed parity (the example here uses five drives).
• e.g. parity strip for strips 0-3 = strip 0 XOR strip 1 XOR strip 2 XOR strip 3
• One failed drive can be tolerated.
• Should be removed quickly.
• Rebuilding the array is slow...
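The "distributed" part of RAID-5 means the parity strip rotates across the drives from stripe to stripe, so no single disk absorbs every parity write. The rotation below is one simple choice for illustration; real controllers use several standard layouts, and the function name is made up.

```c
/* RAID-5 sketch: which disk holds the parity for a given stripe.
 * This particular rotation walks the parity backwards through the
 * array; other layouts exist. */
static int raid5_parity_disk(long stripe, int ndisks) {
    return (int)((ndisks - 1) - (stripe % ndisks));
}
```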

# 21. Taking RAID to a higher level

• RAID-6: Add a second parity check, orthogonal to the first.
• Can tolerate the loss of two disks.
• Rebuilding an array is quite intense: the most probable failure of a RAID-5 is during the rebuild after losing a disk.
• RAID can be nested, one RAID scheme exporting a "disk" to another.
• e.g. RAID0+1 shown in diagram.
• Also non-standard RAID levels, custom error-checking codes... storage management at scale is a large area.

# 22. Clocks

• We use clocks (timers) to measure time:
• How long has a piece of code been executing for?
• To communicate with the user using the time of day (real time).
• To trigger code at particular times (timeouts, alarms etc).
• Every clock is discrete: ticks at a particular rate.
• Faster clocks give finer resolution: they measure shorter periods of time with the same number of bits counting the ticks.
• Most systems feature a real-time clock:
• Programmable rate (length of a tick) and period (number of ticks).
• Generates a hardware interrupt when the period expires.
• Allows sleeping processes (suspended for a period of time).
• Allows pre-emptive multitasking (trigger the scheduler every period).
• Basis of any real-time resource measurement (e.g. fair scheduling).

# 23. Clocks

• The highest accuracy clock in the system is provided by the CPU itself.
• Originally called multimedia timers, these tick at the frequency of the processor (e.g. 4GHz).
• Can be accessed with inline assembly - read a 64-bit clock cycle counter.
• Generally the system ticks at a slower rate - 10-1000Hz on a PC.
• Watchdog timers (timeouts) guard protocols.
• Send a message, expect a response within n-ms: start a timeout.
• If the response is received, cancel the timeout.
• If the timeout expires, run some error-handling code (e.g. resend the data).
• Between user process requests (and OS accounting) the system needs to run a lot of clocks.
• Like any hardware resource - there are a limited number of clocks.
• The system needs to multiplex virtual clocks onto the hardware.

# 24. Clocks

• For each timer expiry, store in a linked list: sort by time.
• Timers are measured in ticks (rounded where necessary).
• Next event: difference between current tick count and head of list.
• Program next interrupt for that many ticks.
• Soft timers: no interrupt, best effort expiry.
• For frequent events it is likely we will enter the kernel (e.g. a syscall) soon enough to process them (soft real-time).
• No interrupt overhead, feasible for 10 µs timers.
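The sorted-list mechanism above can be sketched directly. This is illustrative, not taken from a real kernel; timers hold absolute expiry ticks, insertion keeps the list sorted, and the hardware is programmed for the gap to the head.

```c
#include <stddef.h>

struct timer {
    unsigned long expiry;       /* absolute tick count */
    struct timer *next;
};

/* Insert keeping ascending expiry order. */
static void timer_insert(struct timer **head, struct timer *t) {
    while (*head && (*head)->expiry <= t->expiry)
        head = &(*head)->next;
    t->next = *head;
    *head = t;
}

/* Ticks until the next expiry: this is what the hardware clock
 * interrupt is programmed for. 0 means nothing is pending. */
static unsigned long ticks_to_next(const struct timer *head,
                                   unsigned long now) {
    return head ? head->expiry - now : 0;
}
```

On each clock interrupt the kernel pops every expired timer from the head, runs its handler, and reprograms the hardware for the new head.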

# 25. Event Loops

• Most GUI programming styles operate around event loops.
• Event queue with a semaphore: program sleeps until event / message.
• Program must execute its handler code quickly and return (beachball / zzzz).
• Need for longer processing can be handled using threads.
• All I/O presented through consistent interface (message passing).
```c
while (!quit) {
    getEventMsg(&event);        // blocks until an event/message arrives
    switch (event.type) {
    case TIMER: timerHandler(); break;
    case PAINT: render();       break;
    case INPUT: updateState();  break;
    ...
    }
}
```

# 26. Power Management

• Computers use a lot of power.
• Laptop: 54Whr battery. 3 hours without power management (18W for system).
• 14 hours with power management: 3.9W (entire system!).
• Desktop (performance): 800W (under high load).
• How does power management work?
• If no processes are running, the system has an idle loop.
• Naive approach (early PCs): busy loop for a while, check scheduler.
• Simple approach: sleep for a quantum, check scheduler.
• In the sleep state processor drops voltage and frequency.
• Hard-disks can power down between uses.
• Network can power down antenna, check less frequently.

# 27. Power Management

• Smartest power management:
• It is tempting to think that doubling frequency would double power.
• Static power always dissipates: inefficiencies in physical process.
• Dynamic power dissipates under load: energy to drive transistors.
• Static power is constant(-ish) and dynamic power is cubic(-ish).
• Exact numbers change for each generation of technology.
• So running at 2x the frequency uses more than 2x the power.
• Consequence: operating in bursts is more power efficient.
• A system can bundle together some operations (timers, I/O, scheduling...).
• Run them quickly at a high processor frequency.
• Switch down to a lower frequency for longer (hurry up and sleep).
• Lots of current research in this area.
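The cubic(-ish) claim can be made concrete with a rough model (constants illustrative). Dynamic power scales with capacitance, voltage squared, and frequency, and a stable voltage scales roughly with frequency:

$$P_{\text{dyn}} \approx C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3$$

The energy to execute $N$ instructions at frequency $f$ takes time $t = N/f$, so:

$$E = (P_{\text{static}} + P_{\text{dyn}})\,\frac{N}{f} \;\propto\; \frac{P_{\text{static}}\,N}{f} + k\,N\,f^2$$

The static term favours finishing quickly; the dynamic term favours running slowly. This tension is why bursting at high frequency and then dropping into a low-power sleep state ("hurry up and sleep") can beat running continuously at a middling frequency.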

# 28. Summary

• Drivers hide device specific details from the system.
• Handle awkward programming models like interrupt-driven code.
• Device-independent layer provides APIs for each category.
• The goal is device independence.
• Storage is quite complex (big business), there are many layers of complexity beyond the basic I/O on disks.
• Clocks (and real-time behaviour) underpin most modern OS primitives.
• Some global system properties do not break nicely into independent problems:
• Power management often cuts across abstraction boundaries in the OS.
• Cannot be solved by purely local (nicely abstracted) means.