DV1460 / DV1492:

Realtime- (and) Operating-Systems

08:15-10:00 Thursday September 29th, 2016

Further FS Implementation, Design Choices, External Tools.

§ 4.3.4-4.6 pg 290-331

Table of Contents

  • Links
  • Journalling FS
  • Virtual FS Layer
  • Design Choices
  • External Tools

1. Introduction

  • Last time we looked at how to implement the bulk of a FS.
    • Disk layout.
    • Representing free space.
    • Logical structure of directories.
    • Pointers to file contents as FAT.
    • Pointers to file contents as inodes.
  • This is sufficient to build a simple FS implementation - the project.
  • Today we look at more advanced implementation issues.
    • Links.
    • Journals.
    • VFS.

2. Shared Files

  • Directory entries point to file contents.
  • File-system on multi-user machine.
  • Users B and C would like to share a file.
  • We allow entries in multiple directories to point to a single file.
  • Opening either path gives access to one file; allows sharing.
  • Problem: synchronising the views from multiple locations.
  • Problem: B's and C's directory entries both contain copies of the file's disk addresses.
    • If the file is modified through one directory (say B) then those changes need to be propagated to C somehow.
    • Otherwise synchronisation is lost: B and C see different versions of the file.

3. Hard and Soft Links

  • Two approaches to synchronising multiple views of a file:
Hard Link
Directory entries list the disk block of a file-structure (i-node), both directories point to same i-node
Soft (symbolic) Link
One entry lists the disk block of the file-structure, the other sets an attribute (link flag), file contains the (textual) path to the other file
  • Neither approach is perfect: hard-links break the semantics of files in a tree.
    • Files really appear in multiple places under different names.
    • How should a program cope with this? (e.g. counting, traversing...)
  • Symbolic links have more overhead, and semantic issues of their own.
    • If the file moves, how do we find all the links that point to it?
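The hard-link mechanics can be sketched as a link count stored in the i-node (a hypothetical sketch: `inode_t`, `fs_link` and `fs_unlink` are made-up names, not a real FS API):

```c
#include <assert.h>

/* Hypothetical sketch of hard-link bookkeeping: every directory entry
   that names this i-node bumps nlink; the data is only released when
   the last name disappears. */
typedef struct {
    int nlink;   /* number of directory entries pointing at this i-node */
    int freed;   /* set when the i-node and its blocks are released     */
} inode_t;

void fs_link(inode_t *ino)   { ino->nlink++; }

void fs_unlink(inode_t *ino) {
    if (--ino->nlink == 0)
        ino->freed = 1;   /* last name gone: release the blocks */
}
```

Note that the blocks survive as long as any name remains, regardless of which directory removed its entry.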

4. Issues with links

  • A semantic issue with hard-links.
    • a) shows an initial part of the DAG from earlier: C owns a file.
    • b) B creates a link to the file.
    • c) C deletes the file from their directory.
  • Deleting B's access breaks a fundamental assumption of storage: things should not disappear by themselves.
  • Not deleting it leaves a file that C owns (and is in their quota) but which only B can access.
  • The expressive power of links is useful.
    • Users learn to live with the issues.

5. Journalling File-Systems

  • Description so far is enough to implement a simple file-system:
    • FAT / FAT32 / Ext / Ext2
  • These simple systems are not robust.
    • Power-loss during a write can destroy the file-system.
    • We cannot rely on this for important data.
  • Increasing the robustness of the system is vital.
    • NTFS / Ext3 and onwards are journaling file systems.
    • They are all designed to survive a loss of power while updating the FS structures.

6. Journalling File-Systems

Example: delete a file
1. Remove entry from directory
2. Release i-node to free pool.
3. Release each block of file to free pool.
  • A power-loss can stop this process partway through.
    • Will the system be consistent if this is only partly done?
    • Example shows deletion - applies to any operation with multiple steps.
  • Together, the steps should form one atomic operation.
    • In scheduling we looked at how to prevent atomic operations overlapping.
    • Here the issue is how to restart them from an unknown position.
  • Basic Idea: record what will be done first, then do it.

7. Journalling File-Systems

  • Each operation that alters the FS structure is split into atomic steps.
    • The FS keeps a journal that records what each step will do.
    • The journal entry is written to a known block on the disk.
      • This does not change the FS structure.
    • The journal structure is updated to include the new entry.
      • This is designed to be atomic (e.g. a single block).
    • Now nothing happens for a while...
    • When the journal writes have all really occurred, the FS starts to execute the steps.
  • If something goes wrong, the record of what is being done allows the system to repeat the steps until they succeed.
    • Once the operation is finished the log entry is deleted.
  • Each step must be idempotent: it reaches the same state no matter how many times it is run.
    • e.g. repeating "write(7)" vs "write(i++)"
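The replay idea can be illustrated with a toy example (hypothetical names: `replay_absolute` stands for a journalled step recorded as "write(7)", `replay_relative` for "write(i++)"):

```c
#include <assert.h>

/* Replaying a journal step must be idempotent: after a crash we do not
   know whether the step already ran, so running it again must leave
   the same state. */
static int block;   /* stands in for some piece of FS metadata */

void replay_absolute(void) { block = 7; }          /* idempotent     */
void replay_relative(void) { block = block + 1; }  /* NOT idempotent */
```

Running `replay_absolute` once or twice leaves `block == 7` either way; running `replay_relative` twice does not match running it once, so it cannot safely be replayed from an unknown position.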

8. Virtual File Systems

  • UNIX: one large DAG, many kinds of file-systems (heterogeneous).
    • Need to make all FS look similar.
    • Otherwise processes need to handle corner cases, users see more complexity.
    • Standardise the interface to each individual file-system.
  • VFS has an interface to processes above.
    • POSIX: open, read, write, lseek etc.
  • VFS has an interface to FS below.
    • OO-style: objects (e.g. superblocks, directories) and methods.
    • Uses function pointers when accessed from C.
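The "objects and methods" interface might look roughly like this in C (an illustrative sketch, not the actual Linux VFS structures; `vnode`, `ramfs_read` and `vfs_read` are made-up names):

```c
#include <assert.h>
#include <string.h>

/* A v-node carries a table of function pointers; each concrete FS
   fills the table in with its own implementations. */
typedef struct vnode vnode;
struct vnode {
    long (*read)(vnode *self, char *buf, long n);
    void *fs_private;            /* concrete-FS state for this file */
};

/* A toy in-memory FS's implementation of the read method. */
static long ramfs_read(vnode *self, char *buf, long n) {
    const char *data = self->fs_private;
    long len = (long)strlen(data);
    if (n > len) n = len;
    memcpy(buf, data, (size_t)n);
    return n;
}

/* The VFS layer calls through the pointer, ignorant of the FS type. */
long vfs_read(vnode *v, char *buf, long n) { return v->read(v, buf, n); }
```

A new file-system is added by providing another function table; the VFS layer and the processes above it are unchanged.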

9. VFS example

  • Process structure contains table of open files.
  • Each file descriptor points to a v-node in the VFS layer.
  • Each v-node contains function pointers for each VFS method.
  • These point into the implementation inside the concrete file-system that stores the file.
  • Example shows a read() being dispatched to the concrete FS.
  • System can be extended with different file-systems: they export the common VFS interface.

10. The story so far...

11. Block Size

  • Important issue in file-system management: block-size.
    • Choosing an appropriate block-size determines performance.
    • It also determines storage efficiency.
  • Reading a file:
    • Work out which blocks it is stored in.
    • Issue read(k) requests to the disk.
    • Wait for the data to arrive.
  • There is latency between each request and response.
    • Fewer (larger) blocks will decrease the latency.
    • Larger blocks will waste more space - decrease storage efficiency.
  • Individual read latencies can be correlated.
    • e.g. mechanical disk reads from different areas on the disk have higher latencies.

12. Block Size: Performance

  • Dashed line is the data rate in MB/s.
  • Time to read a k-byte block: 5 + 4.165 + (k/1000000) × 8.33 ms.
  • Average seek: 5ms.
  • Average rotational delay: half a rotation, 4.165ms.
  • 1MB tracks, each rotation takes 8.33ms.
  • A single byte would take 9.165ms (about 100 blocks/s).
  • One track (1MB) would take about 17.5ms (about 60 blocks/s).
  • Within this range (1KB vs 1MB) 1000x more data takes only 2x as long.
  • Curve shows increased performance for bigger blocks.
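The timing model above translates directly into code (assuming the slide's parameters: 5 ms average seek, 4.165 ms rotational delay, 8.33 ms to transfer one full 1 MB track):

```c
#include <assert.h>

/* Time in ms to read a k-byte block:
   seek + rotational delay + (fraction of a track) * rotation time. */
double read_time_ms(double k) {
    return 5.0 + 4.165 + (k / 1000000.0) * 8.33;
}

/* Effective data rate in MB/s for blocks of k bytes. */
double rate_mb_per_s(double k) {
    return (k / 1000000.0) / (read_time_ms(k) / 1000.0);
}
```

The fixed 9.165 ms overhead dominates small blocks (about 0.1 MB/s at 1 KB) and is amortised away for large ones (about 57 MB/s at 1 MB), which is the rising curve on the slide.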

13. Block Size: Space Efficiency

  • Space efficiency is more complex: depends on the distribution of file-sizes within the FS.
  • Three empirical studies: median size (50% files) was less than 4KB.
  • Wasted space changes drastically between 1KB, 2KB and 4KB...
  • ...but only for the least common files.
  • Most space (93%) in 10% largest files.
  • No good tradeoff for the data shown.
  • Sequential runs of file blocks (extents) can allow higher speeds for smaller block sizes.
  • Most file-systems use some kind of i-node (linked-list) approach but attempt to lay files out contiguously as far as possible.
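The wasted space per file is the internal fragmentation in its final block; a minimal sketch:

```c
#include <assert.h>

/* Bytes wasted storing a file of `size` bytes in `block`-byte blocks:
   the unused tail of the last block. */
long wasted(long size, long block) {
    long rem = size % block;
    return rem == 0 ? 0 : block - rem;
}
```

For the median 4 KB-or-smaller file, moving from 1 KB to 4 KB blocks can multiply the waste, which is why the block-size tradeoff is so sensitive to the file-size distribution.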

14. Tracking Free Blocks

  • Blocks are fixed-size, so tracking free space is similar to memory allocation.
    • Linked-list vs bitmap approach.
  • Major difference: the free list can be in free-space on the disk.
    • As we use the free-space the list shrinks in size.
    • Tradeoff is different to memory allocation as the storage is essentially free.
  • Space difference between the two is easy to model (diagram).
  • Run-time performance is difficult to tune and optimise.
    • Relevant data-set is "all the files in the world".
    • Large-scale corpora (VU 1984, 1995 etc.) are rare.
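A rough space model for the two approaches (illustrative only; real layouts vary):

```c
#include <assert.h>

/* With n_blocks blocks in total, a bitmap always needs n_blocks bits. */
double bitmap_blocks(double n_blocks, double block_bytes) {
    return n_blocks / (block_bytes * 8.0);
}

/* A free list needs one pointer per free block, packed into blocks
   that are themselves free; each block of pointers reserves one slot
   to chain to the next. */
double freelist_blocks(double n_free, double block_bytes, double ptr_bytes) {
    double ptrs_per_block = block_bytes / ptr_bytes - 1.0;
    return n_free / ptrs_per_block;
}
```

For a disk of 2^20 1 KB blocks with 32-bit pointers, the bitmap costs a constant 128 blocks, while the free list shrinks as the disk fills and lives entirely in space that was free anyway.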

Break (15mins)


15. Performance Issues in Free Blocks

  • Some of the performance issues can only be seen from realistic test cases.
  • Many systems use temporary files: many short-lived small files.
  • a) one block of free-block pointers in memory, full blocks of free pointers on disk.
  • Only space for two pointers: deleting a 3-block file overflows the free-list.
  • b) shows the situation with the new, almost empty, block of free pointers.
  • Allocating a small file flips back into a).
  • Lots of extra disk I/O close to the boundary.
  • Tolerating more slack in the in-memory block (keeping it about 50% full) avoids disk I/O in these boundary cases.

16. Quotas

  • Multi-user systems need to prevent resources being unfairly consumed.
  • Disk quotas share out the space.
  • Hard-limit is an error case.
    • Any attempt to write new blocks from a process will cause an exception.
  • Soft-limit triggers a warning.
    • Ignoring the warning too many times locks the account.
  • Turns a hard technical problem (zero available disk) into a softer policy issue.
    • Long "direct" talk with sysadmin.

17. Backups

  • Data integrity goes far beyond the implementation.
    • Even when a FS is "production quality" it still might fail.
    • Even when the code is perfect - the hardware is not.
  • Failure is a one-time event: it is too late to plan recovery.
    • To minimise the risk that data is destroyed, we need to plan recovery first, and implement a layer around the FS.
  • Types of failure:
    • 1. Unexpected disaster.
    • 2. Expected stupidity.
  • Easiest form of recovery is a copy of the data.
    • Problem: Data tends to change, needs to be copied frequently.
    • Problem: Copying an entire FS takes a long time, taxes the IO subsystem.

18. Backups

  • Avoid copying data that can already be recovered.
    • Non-unique data: system install, standard applications etc.
    • Data that has not changed since the last backup.
  • Incremental backups:
    • Dump an entire copy of the FS periodically, e.g. once per month.
    • Dump the changes (deltas) from the last copy frequently, e.g. every hour.
    • The set that has changed is generally quite small vs the FS.
  • Synchronisation:
    • Backup is an independent application, reading entire FS.
    • Are all of the files consistent at this point in time?
    • Unlike synchronisation in the scheduling context, we cannot control the other party.
      • Do it quickly, and frequently, hope for the best.

19. Backups

  • Physical dump: copy every block in sequence.
    • e.g. sudo dd if=/dev/sda1 of=jessie-bitfunky.ext4
    • Creates a big file - raw disk image.
  • Logical dump: copy the directory tree.
    • e.g. sudo cp -R /media/jessie /backups/jessie/bitfunky/
    • Creates a copy of the directory tree.
  • Complications: sparse files (e.g. cores) should not be expanded in the copy.
  • Device nodes (reading them may not terminate).
  • Freezing the FS to get a consistent copy.

20. Logical vs Physical Dumps

  • Incremental Logical Dump shown in diagrams.
    • The tree diagram shows the directory hierarchy; shaded parts changed since the last backup.
    • Bitmaps show the progress of deciding what to dump.
    • a) Mark changed files + all directories.
    • b) Remove directories without any changes below them.
    • c) Output directory inodes first.
    • d) Output file inodes.

21. Consistency

  • Idea: checking if the FS state is valid.
  • An error or corruption in the structure may cause data loss.
  • Idea: don't store the FS structure as compactly as possible.
    • Make the representation slightly redundant.
    • Check if multiple copies of the same information are consistent.

22. Consistency

  • Idea: checking if the FS state is valid.
  • Detect problems as early as possible.
  • Early fix can prevent further damage to data.
  • Diagram a) consistent file-system: Every block is free, or used, but not both.
  • Diagram b) block 2 is not listed anywhere: add it to free-list.
  • Diagram c) shows a block with two entries in the free-list: rebuild the list.
  • Diagram d) block 4 occurs in two separate files: no clean fix.
    • This is really bad - we have already lost a block of data.
    • Duplicate the block and give one copy to each file.
    • One file is probably already corrupted: warn the user.
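The check can be sketched with two counter arrays, as in the diagrams (illustrative names; a real checker walks the inodes and free list to fill the arrays in):

```c
#include <assert.h>

/* Consistency check: count how many times each block appears in files
   (in_use[]) and in the free list (in_free[]). A consistent FS has
   in_use[b] + in_free[b] == 1 for every block b. */
enum { NBLOCKS = 8 };

/* Returns the first inconsistent block number, or -1 if all is well. */
int check_blocks(const int in_use[], const int in_free[]) {
    for (int b = 0; b < NBLOCKS; b++)
        if (in_use[b] + in_free[b] != 1)
            return b;
    return -1;
}
```

A sum of 0 is the "missing block" case (add it to the free list), a free count of 2 is the duplicate free entry (rebuild the list), and a use count of 2 is the no-clean-fix case from diagram d).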

23. Caching

  • Disk cache is very similar to the page-cache seen in VM.
  • Different time-scale: LRU is feasible.
  • Not all blocks are equal.
  • Directory listings, data blocks and free-list blocks have different degrees of reuse.
  • Deviate from pure LRU depending on the likelihood of block reuse.
  • A single long doubly-linked list records the LRU order (+ deviations).
  • A hash-table is used to accelerate finding a block.
  • The list nodes have extra pointers to store the hash-chains independently.
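The structure might be sketched like this (illustrative names, heavily simplified; a real buffer cache also tracks dirty bits, reference counts etc.):

```c
#include <assert.h>
#include <stddef.h>

/* Cached blocks live on one doubly-linked LRU list; a small hash table
   over the block number finds a block without walking that list. */
enum { NBUCKETS = 8 };

typedef struct buf buf;
struct buf {
    int blockno;
    buf *prev, *next;     /* position in LRU order          */
    buf *hash_next;       /* hash chain, independent of LRU */
};

typedef struct {
    buf *mru, *lru;       /* the two ends of the LRU list */
    buf *bucket[NBUCKETS];
} cache;

static unsigned hashfn(int blockno) { return (unsigned)blockno % NBUCKETS; }

/* Insert at the most-recently-used end and into the hash chain. */
void cache_insert(cache *c, buf *b) {
    b->prev = NULL;
    b->next = c->mru;
    if (c->mru) c->mru->prev = b; else c->lru = b;
    c->mru = b;
    unsigned h = hashfn(b->blockno);
    b->hash_next = c->bucket[h];
    c->bucket[h] = b;
}

/* Lookup goes through the hash table, not the LRU list. */
buf *cache_lookup(cache *c, int blockno) {
    for (buf *b = c->bucket[hashfn(blockno)]; b; b = b->hash_next)
        if (b->blockno == blockno)
            return b;
    return NULL;
}
```

Because the hash chains use their own pointer, a block can be moved around in the LRU list (the "deviations" above) without touching the hash table at all.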

24. Arm Movement

  • Disk geometry can be taken into account.
    • The outer edge is larger, spins faster (linear vs angular).
    • Bandwidth is higher on outer cylinders.
  • File blocks on the same cylinder do not require lateral arm motion.
    • Waiting for the data to spin under the head is faster.
  • System can use bigger allocation units than actual block size.
    • e.g. paired blocks would be consecutive.
  • SSDs have uniform access costs (no geometry to worry about).
    • But elements have limited life (so wear leveling is needed instead).

25. Finished here

  • Note to self: this is where you ran out of time two years running.
  • Would make sense to unpack some of the previous slides a little...