DV1460 / DV1492:
Realtime- (and) Operating-Systems
08:15-10:00 Thursday September 29th, 2016
Further FS Implementation, Design Choices, External Tools.
§ 4.3.4-4.6 pg 290-331
Table of Contents
- Journalling FS
- Virtual FS Layer
- External Tools
- Last time we looked at how to implement the bulk of a FS.
- Disk layout.
- Representing free space.
- Logical structure of directories.
- Pointers to file contents as FAT.
- Pointers to file contents as inodes.
- This is sufficient to build a simple FS implementation - the project.
- Today we look at more advanced implementation issues.
2. Shared Files
- Directory entries point to file contents.
- File-system on multi-user machine.
- Users B and C would like to share a file.
- We allow entries in multiple directories to point to a single file.
- Opening either path gives access to one file; allows sharing.
- Problem: synchronise views from multiple locations.
- Problem: B's and C's directory entries both contain disk addresses of the file.
- If the file is modified in one directory (say B) then those changes need to be propagated to C somehow.
- Otherwise synchronisation is lost, B and C view different versions of the file.
3. Hard and Soft Links
- Two approaches to synchronising multiple views of a file:
Hard Link: both directory entries list the disk block of the file-structure (i-node), so both directories point to the same i-node.
Soft (symbolic) Link: one entry lists the disk block of the file-structure; the other sets an attribute (link flag) and the file contains the (textual) path to the other file.
- Neither approach is perfect: hard-links break the semantics of files in a tree.
- Files really appear in multiple places under different names.
- How should a program cope with this? (e.g. counting, traversing...)
- Symbolic links have more overhead, and semantic issues of their own.
- If the file moves, how do we find all the links that point to it?
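The two kinds of link above can be created from a C program with the POSIX calls link() and symlink(); a minimal sketch (the paths are made up for illustration):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hard link: a second directory entry that refers to the same i-node.
       Both names are now equally "the file"; the i-node's link count becomes 2. */
    if (link("/home/c/report.txt", "/home/b/report.txt") == -1)
        perror("link");

    /* Soft (symbolic) link: a new file whose contents are the textual path
       of the target.  If the target moves, the link silently breaks. */
    if (symlink("/home/c/report.txt", "/home/b/report-symlink.txt") == -1)
        perror("symlink");

    return 0;
}
```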
4. Issues with links
- A semantic issue with hard-links.
- a) shows the initial state of the DAG seen earlier: C owns a file.
- b) B links to the file
- c) C deletes the file from their directory.
- If deleting C's entry also removes the file, B's access disappears: this breaks a fundamental assumption of storage, that things should not vanish by themselves.
- Not deleting it leaves a file that C owns (and that counts against C's quota) but which only B can access.
- The expressive power of links is useful.
- Users learn to live with the issues.
5. Journalling File-Systems
- Description so far is enough to implement a simple file-system:
- These simple systems are not robust.
- Power-loss during a write can destroy the file-system.
- We cannot rely on this for important data.
- Increasing the robustness of the system is vital.
- NTFS / Ext3 and onwards are journalling file-systems.
- They are all designed to survive a loss of power while updating the FS structures.
6. Journalling File-Systems
Example: delete a file
1. Remove entry from directory
2. Release i-node to free pool.
3. Release each block of file to free pool.
- A power-loss can stop this process partway through.
- Will the system be consistent if this is only partly done?
- Example shows deletion - applies to any operation with multiple steps.
- Together, the steps need to form an atomic operation.
- In scheduling we looked at how to prevent atomic operations overlapping.
- Here the issue is how to restart them from an unknown position.
- Basic Idea: record what will be done first, then do it.
7. Journalling File-Systems
- Each operation that alters the FS structure is split into atomic steps.
- The FS keeps a journal that records what each step will do.
- The journal entry is written to a known block on the disk.
- This does not change the FS structure.
- The journal structure is updated to include the new entry.
- This is designed to be atomic (e.g. a single block).
- Now nothing happens for a while...
- When the writes have all really occurred the FS starts to execute the steps.
- If something goes wrong - the record of what is being done allows the system to repeat until it works.
- Once it is finished the log entry is deleted.
- Idempotent: always reaches the same state.
- e.g. Repeating "write(7)" vs "write(i++)"
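A toy sketch of the idea (not the real NTFS or Ext3 journal format): the entry is recorded first, and every step is idempotent, so replaying it after a crash always converges to the same state.

```c
#include <stdio.h>

#define NBLOCKS 64

/* Toy in-memory "disk" state, just enough to show the idea. */
static int inode_free[8];
static int block_free[NBLOCKS];

struct journal_entry {
    int valid;          /* entry present?          */
    int inode;          /* i-node being deleted    */
    int nblocks;
    int blocks[12];     /* data blocks to release  */
};
static struct journal_entry journal;   /* stands in for a known disk block */

/* Each step is idempotent: repeating it has no further effect,
 * like write(7) as opposed to write(i++). */
static void inode_mark_free(int i)  { inode_free[i] = 1; }
static void block_mark_free(int b)  { block_free[b] = 1; }

static void delete_file(struct journal_entry e)
{
    e.valid = 1;
    journal = e;                         /* 0. record what will be done (atomic) */
    /* 1. remove directory entry (omitted in this toy)                           */
    inode_mark_free(e.inode);            /* 2. release the i-node                */
    for (int i = 0; i < e.nblocks; i++)
        block_mark_free(e.blocks[i]);    /* 3. release each data block           */
    journal.valid = 0;                   /* 4. delete the log entry              */
}

int main(void)
{
    /* Recovery after a crash: if an entry is still valid, just replay it;
     * because the steps are idempotent the final state is the same.       */
    if (journal.valid)
        delete_file(journal);

    struct journal_entry e = { .inode = 3, .nblocks = 2, .blocks = {10, 11} };
    delete_file(e);
    printf("inode 3 free: %d, block 10 free: %d\n", inode_free[3], block_free[10]);
    return 0;
}
```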
8. Virtual File Systems
- UNIX: one large DAG, many kinds of file-systems (heterogeneous).
- Need to make all FS look similar.
- Otherwise processes need to handle corner cases, users see more complexity.
- Standardise the interface to each individual file-system.
- VFS has an interface to processes above.
- POSIX: open, read, write, lseek etc.
- VFS has an interface to FS below.
- OO: objects (e.g. superblocks, directory) and methods.
- Uses function pointers when accessed from C.
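A minimal sketch of this "OO in C" style (the struct and function names are invented, not the real Linux VFS types): each concrete FS fills in a table of function pointers and the VFS layer calls through it.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical VFS method table: each concrete FS fills one in. */
struct vfs_file_ops {
    size_t (*read)(void *fs_private, size_t offset, void *buf, size_t len);
    size_t (*write)(void *fs_private, size_t offset, const void *buf, size_t len);
};

/* The v-node the VFS layer hands out to processes. */
struct v_node {
    const struct vfs_file_ops *ops;   /* points into the concrete FS */
    void *fs_private;                 /* e.g. an in-memory i-node    */
};

/* A toy "concrete" file-system implementation of read(). */
static size_t toyfs_read(void *fs_private, size_t offset, void *buf, size_t len)
{
    (void)fs_private; (void)offset;
    const char *data = "hello from toyfs";
    size_t i = 0;
    for (; i < len && data[i] != '\0'; i++)
        ((char *)buf)[i] = data[i];
    return i;
}

static const struct vfs_file_ops toyfs_ops = { .read = toyfs_read, .write = NULL };

int main(void)
{
    struct v_node vn = { .ops = &toyfs_ops, .fs_private = NULL };
    char buf[32] = {0};
    /* A POSIX read() on this file is dispatched through the v-node like this: */
    size_t n = vn.ops->read(vn.fs_private, 0, buf, sizeof buf - 1);
    printf("read %zu bytes: %s\n", n, buf);
    return 0;
}
```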
9. VFS example
- Process structure contains table of open files.
- Each file descriptor points to a v-node in the VFS layer.
- V-node contains function pointers for each VFS method.
- Point into implementation in the actual file-system the file is stored within.
- Example shows a read() being despatched to the concrete FS.
- System can be extended with different file-systems: they export the common VFS interface.
11. Block Size
- Important issue in file-system management: block-size.
- Choosing an appropriate block-size determines performance.
- It also determines storage efficiency.
- Reading a file:
- Work out which blocks it is stored in.
- Issue read(k) requests to the disk.
- Wait for the data to arrive.
- There is latency between each request and response.
- Fewer (larger) blocks mean fewer requests, so lower total latency.
- Larger blocks will waste more space - decrease storage efficiency.
- Individual read latencies can be correlated.
- e.g. mechanical disk reads from different areas on the disk have higher latencies.
12. Block Size: Performance
- Dashed line is the data rate in MB/s.
- Time to read a block of k bytes ≈ 5 + 4.165 + (k / 1,000,000) × 8.33 ms.
- Average seek: 5ms.
- Average rotational delay: half a rotation, 4.165ms (the bus is so much faster than the disk that its transfer time is ignored).
- 1MB tracks, each rotation 8.33ms, so transferring k bytes takes (k / 1,000,000) × 8.33ms.
- A single byte would take 9.165ms (about 100 blocks/s).
- One track (1MB) would be about 17ms (about 60 blocks/s).
- Within this range a million times more data (1 byte vs 1MB) takes less than 2x as long.
- Curve shows increased performance for bigger blocks.
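A small calculation reproducing the curve, using the slide's figures (5ms average seek, 4.165ms average rotational delay, 1MB tracks at 8.33ms per rotation):

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the slide: 5 ms average seek, 4.165 ms average
     * rotational delay (half of 8.33 ms), 1 MB per track.          */
    const double seek_ms = 5.0, half_rot_ms = 4.165, rot_ms = 8.33;
    const double track_bytes = 1e6;
    const double sizes[] = { 1, 512, 1024, 4096, 65536, 1000000 };

    for (int i = 0; i < 6; i++) {
        double k = sizes[i];
        double t_ms = seek_ms + half_rot_ms + (k / track_bytes) * rot_ms;
        double rate_mb_s = (k / 1e6) / (t_ms / 1000.0);
        printf("%8.0f bytes: %6.2f ms per read, %7.3f MB/s\n", k, t_ms, rate_mb_s);
    }
    return 0;
}
```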
13. Block Size: Space Efficiency
- Space efficiency is more complex: depends on the distribution of file-sizes within the FS.
- Three empirical studies: median size (50% files) was less than 4KB.
- Wasted space changes drastically between 1KB, 2KB and 4KB...
- ...but only for the least common files.
- Most space (93%) in 10% largest files.
- No good tradeoff for the data shown.
- Sequential parts of files (extents) can allow higher speeds for smaller block sizes.
- Most file-systems use some kind of i-node (linked list) approach but attempt to organise files contiguously as far as possible.
14. Tracking Free Blocks
- Blocks are fixed-size; tracking free blocks is similar to memory allocation.
- Linked-list vs bitmap approach.
- Major difference: the free list can be in free-space on the disk.
- As we use the free-space the list shrinks in size.
- Tradeoff is different to memory allocation as the storage is essentially free.
- Space difference between the two is easy to model (diagram).
- Run-time performance is difficult to tune and optimise.
- Relevant data-set is "all the files in the world".
- Large-scale corpora (VU 1984, 1995 etc.) are rare.
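A rough model of the space the two schemes need, in the spirit of the diagram (assuming 1KB blocks, 32-bit block numbers and a 4GB disk; the numbers are illustrative only):

```c
#include <stdio.h>

int main(void)
{
    /* Assumptions: 1 KB blocks, 32-bit block numbers, a disk of
     * 2^22 blocks (4 GB), and some fraction of the disk free.     */
    const long block_size = 1024;
    const long total_blocks = 1L << 22;
    const long ptrs_per_block = block_size / 4;   /* 256 pointers  */
    const long bits_per_block = block_size * 8;   /* 8192 bits     */

    for (int free_pct = 10; free_pct <= 90; free_pct += 40) {
        long free_blocks = total_blocks * free_pct / 100;
        /* Free list: only the free blocks need pointers, and the list
         * itself lives inside blocks that are free anyway.            */
        long list_blocks   = (free_blocks + ptrs_per_block - 1) / ptrs_per_block;
        /* Bitmap: fixed size, one bit per block on the disk.          */
        long bitmap_blocks = (total_blocks + bits_per_block - 1) / bits_per_block;
        printf("%2d%% free: free-list %6ld blocks, bitmap %6ld blocks\n",
               free_pct, list_blocks, bitmap_blocks);
    }
    return 0;
}
```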
15. Performance Issues in Free Blocks
- Some of the performance issues can only be seen from realistic test cases.
- Many systems use temporary files: many short-lived small files.
- a) one block of free-blocks in memory, full blocks of free pointers on disk.
- Only space for two more pointers: deleting a 3-block file overflows the in-memory block.
- b) shows the situation with the new almost empty block of free pointers.
- Allocating a small file flips back into a).
- Lots of extra disk I/O close to the boundary.
- Tolerating more slack in the in-memory block (keeping it about 50% full) avoids disk I/O in these boundary cases, as sketched below.
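A toy sketch of that policy (the helper functions stand in for real disk I/O, and the block size is tiny so the behaviour is visible):

```c
#include <stdio.h>

#define PTRS_PER_BLOCK 8          /* tiny, to make the behaviour visible */

static int mem_block[PTRS_PER_BLOCK];   /* in-memory block of free pointers */
static int mem_count;

/* Stand-ins for real disk I/O on the on-disk free list. */
static void spill_to_disk(int *ptrs, int n)
{
    (void)ptrs;
    printf("write %d pointers to disk\n", n);
}
static int refill_from_disk(int *ptrs, int n)
{
    for (int i = 0; i < n; i++) ptrs[i] = 1000 + i;
    printf("read %d pointers from disk\n", n);
    return n;
}

void free_block(int b)
{
    if (mem_count == PTRS_PER_BLOCK) {
        /* Instead of writing the whole full block out (and flipping back on
         * the next allocation), spill only half, keeping the rest cached.  */
        spill_to_disk(mem_block + PTRS_PER_BLOCK / 2, PTRS_PER_BLOCK / 2);
        mem_count = PTRS_PER_BLOCK / 2;
    }
    mem_block[mem_count++] = b;
}

int alloc_block(void)
{
    if (mem_count == 0)
        mem_count = refill_from_disk(mem_block, PTRS_PER_BLOCK / 2);
    return mem_block[--mem_count];
}

int main(void)
{
    for (int b = 0; b < 12; b++) free_block(b);   /* a burst of deletes */
    for (int i = 0; i < 3; i++)  alloc_block();   /* then some creates  */
    return 0;
}
```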
- Multi-user systems need to prevent resources being unfairly consumed.
- Disk quotas share out the space.
- Hard-limit is an error case.
- Any attempt to write new blocks from a process will cause an exception.
- Soft-limit triggers a warning.
- Ignoring the warning too many times locks the account.
- Turns a hard technical problem (zero available disk) into a softer policy issue.
- Long "direct" talk with sysadmin.
- Data integrity goes far beyond the implementation.
- Even when a FS is "production quality" it still might fail.
- Even when the code is perfect - the hardware is not.
- Failure is a one-time event: it is too late to plan recovery.
- If we want to minimise the risk that data is destroyed, we need to plan recovery first and implement a layer around the FS.
- Types of failure:
- 1. Unexpected disaster.
- 2. Expected stupidity.
- Easiest form of recovery is a copy of the data.
- Problem: Data tends to change, needs to be copied frequently.
- Problem: Copying an entire FS takes a long time, taxes the IO subsystem.
- Avoid copying data that can already be recovered.
- Non-unique data: system install, standard applications etc.
- Data that has not changed since the last backup.
- Incremental backups:
- Dump an entire copy of the FS periodically, e.g. once per month.
- Dump the changes (deltas) from the last copy frequently, e.g. every hour.
- The set that has changed is generally quite small vs the FS.
- Backup is an independent application, reading entire FS.
- Are all of the files consistent at this point in time?
- Unlike synchronisation in the scheduling context, we cannot control the other party.
- Do it quickly, and frequently, hope for the best.
- Physical dump: copy every block in sequence.
- e.g. sudo dd if=/dev/sda1 of=jessie-bitfunky.ext4
- Creates a big file - raw disk image.
- Logical dump: copy the directory tree.
- e.g. sudo cp -R /media/jessie /backups/jessie/bitfunky/
- Creates a copy of the directory tree.
- Problems: sparse files (e.g. cores) grow when copied naively.
- Device nodes (reading them may not terminate).
- Freezing the FS to get a consistent copy.
20. Logical vs Physical Dumps
- Incremental Logical Dump shown in diagrams.
- The tree diagram shows the directory tree; shaded parts have changed since the last backup.
- Bitmaps show progress of deciding what to dump.
- a) Changed files + all directories.
- b) Remove directories without any changes below.
- c) Output directory inodes first
- d) Output file inodes.
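A compact sketch of the four phases over a toy in-memory tree (not real on-disk structures):

```c
#include <stdio.h>
#include <stdbool.h>

struct node {
    const char *name;
    bool is_dir;
    bool modified;              /* changed since the last backup? */
    bool marked;                /* the "bitmap" bit for this i-node */
    struct node *child, *next;  /* first child, next sibling        */
};

/* Phase a): mark every modified file and, to start with, every directory. */
static void phase1(struct node *n)
{
    for (; n; n = n->next) {
        n->marked = n->is_dir || n->modified;
        phase1(n->child);
    }
}

/* Phase b): unmark directories with nothing marked anywhere below them. */
static bool phase2(struct node *n)
{
    bool any = false;
    for (; n; n = n->next) {
        bool below = phase2(n->child);
        if (n->is_dir && !below && !n->modified)
            n->marked = false;
        any = any || n->marked;
    }
    return any;
}

/* Phases c) and d): dump marked directories first, then marked files. */
static void dump(struct node *n, bool dirs)
{
    for (; n; n = n->next) {
        if (n->marked && n->is_dir == dirs)
            printf("dump %s %s\n", dirs ? "dir " : "file", n->name);
        dump(n->child, dirs);
    }
}

int main(void)
{
    struct node f1   = { "a.txt", false, true,  false, NULL, NULL };
    struct node f2   = { "b.txt", false, false, false, NULL, &f1 };
    struct node d1   = { "docs",  true,  false, false, &f2,  NULL };
    struct node root = { "/",     true,  false, false, &d1,  NULL };

    phase1(&root);
    phase2(&root);
    dump(&root, true);   /* c) directory i-nodes first */
    dump(&root, false);  /* d) then file i-nodes       */
    return 0;
}
```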
- Idea: checking if the FS state is valid.
- An error or corruption in the structure may cause data loss.
- Idea: don't store the FS structure as compactly as possible.
- Make the representation slightly redundant.
- Check if multiple copies of the same information are consistent.
- Idea: checking if the FS state is valid.
- Detect problems as early as possible.
- Early fix can prevent further damage to data.
- Diagram a) consistent file-system: Every block is free, or used, but not both.
- Diagram b) block 2 is not listed anywhere: add it to free-list.
- Diagram c) shows a block with two entries in the free-list: rebuild list
- Diagram d) block 4 occurs in two separate files: no clean fix.
- This is really bad - we have already lost a block of data.
- Duplicate the block and give one copy to each file.
- One file is probably already corrupted: warn the user.
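The check can be sketched with two counter tables, one for blocks referenced by files and one for the free list (the block numbers below are toy data chosen to trigger the cases above):

```c
#include <stdio.h>

#define NBLOCKS 8

int main(void)
{
    /* Toy inputs: which blocks the files use, and the free list. */
    int file_blocks[] = { 1, 3, 4, 4 };          /* block 4 used twice! */
    int free_blocks[] = { 0, 5, 6, 7 };          /* block 2 is missing  */

    int in_use[NBLOCKS] = {0}, in_free[NBLOCKS] = {0};
    for (int i = 0; i < 4; i++) in_use[file_blocks[i]]++;
    for (int i = 0; i < 4; i++) in_free[free_blocks[i]]++;

    for (int b = 0; b < NBLOCKS; b++) {
        if (in_use[b] + in_free[b] == 1) continue;              /* a) consistent */
        if (in_use[b] == 0 && in_free[b] == 0)
            printf("block %d missing: add it to the free list\n", b);       /* b) */
        else if (in_free[b] > 1)
            printf("block %d duplicated in free list: rebuild it\n", b);    /* c) */
        else if (in_use[b] > 1)
            printf("block %d in two files: copy it, warn the user\n", b);   /* d) */
    }
    return 0;
}
```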
- Disk cache is very similar to the page-cache seen in VM.
- Different time-scale: LRU is feasible.
- Not all blocks are equal.
- Directory listings, data blocks and free-list blocks have different degrees of reuse.
- Deviate from pure-LRU depending on likelihood of block reuse.
- A single long doubly-linked list records the LRU order (+ deviations).
- Hash-table used to accelerate finding a block.
- The list nodes have an extra pointer so that the hash-chains are stored independently.
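A sketch of that structure (sizes and the hash function are illustrative): one doubly-linked list in LRU order plus a hash table whose chains reuse the same nodes through a separate pointer.

```c
#include <stdio.h>
#include <stdlib.h>

#define HASH_SIZE 16

struct buffer {
    int block_no;
    struct buffer *lru_prev, *lru_next;  /* position in the LRU list    */
    struct buffer *hash_next;            /* separate chain for the hash */
};

static struct buffer *lru_head, *lru_tail;   /* most / least recently used */
static struct buffer *hash_tab[HASH_SIZE];

static void lru_push_front(struct buffer *b)
{
    b->lru_prev = NULL;
    b->lru_next = lru_head;
    if (lru_head) lru_head->lru_prev = b; else lru_tail = b;
    lru_head = b;
}

static void lru_unlink(struct buffer *b)
{
    if (b->lru_prev) b->lru_prev->lru_next = b->lru_next; else lru_head = b->lru_next;
    if (b->lru_next) b->lru_next->lru_prev = b->lru_prev; else lru_tail = b->lru_prev;
}

/* Find a block quickly via the hash table, then move it to the front of the
 * LRU list (pure LRU here; a real FS might deviate by block type). */
struct buffer *cache_lookup(int block_no)
{
    for (struct buffer *b = hash_tab[block_no % HASH_SIZE]; b; b = b->hash_next)
        if (b->block_no == block_no) {
            lru_unlink(b);
            lru_push_front(b);
            return b;
        }
    return NULL;   /* miss: caller reads from disk and inserts */
}

void cache_insert(int block_no)
{
    struct buffer *b = calloc(1, sizeof *b);
    b->block_no = block_no;
    b->hash_next = hash_tab[block_no % HASH_SIZE];
    hash_tab[block_no % HASH_SIZE] = b;
    lru_push_front(b);
}

int main(void)
{
    cache_insert(7); cache_insert(3); cache_insert(9);
    cache_lookup(7);                           /* 7 becomes most recent */
    printf("LRU order:");
    for (struct buffer *b = lru_head; b; b = b->lru_next)
        printf(" %d", b->block_no);
    printf("  (evict from %d)\n", lru_tail->block_no);
    return 0;
}
```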
24. Arm Movement
- Disk geometry can be taken into account.
- The outer edge is longer, so it moves faster under the head (same angular speed, higher linear speed).
- Bandwidth is higher on outer cylinders.
- File blocks on the same cylinder do not require lateral arm motion.
- Waiting for the data to spin under the head is faster.
- System can use bigger allocation units than actual block size.
- e.g. paired blocks would be consecutive.
- SSDs have uniform access costs (no geometry to worry about).
- But elements have limited life (so wear leveling is needed instead).
25. Finished here
- Note to self: this is where you ran out of time two years running.
- Would make sense to unpack some of the previous slides a little...