DV1460 / DV1492:

Realtime- (and) Operating-Systems

08:15-10:00 Thursday October 4th, 2016

FS Examples

§4.5 (pg 320-331), §10.6 (pg 775-798) and §11.8 (pg 953-964)

Table of Contents
Limits
File Allocation / Freelists
Disk Structures

1. Introduction

  • In the first half we look at disk-structures in the context of ext.
  • Simplified explanation: put some of the previous lecture in context.
  • The gory details are documented extensively here, here and here
  • (background only for people who are really interested in the details).
  • After the break we look at hardcoded limits and their evolution.
  • Finish up with some other examples of FS.
    • ISO9660 (optical discs)
    • MFT in Windows.
    • Network File System (NFS).

2. Disk Layout

  • The disk is simply an array of fixed size blocks (0..n).
  • When a PC boots the BIOS sees a large array of 512-byte sectors.
  • First sector is the Master Boot Record (MBR).
  • Stage-1 boot-loader is only 446-bytes of code.
  • It calls BIOS routines to load/execute a larger stage-2.
  • The rest of the sector is the partition table.
    • Only 4 slots: size, type and position for each.
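The sector layout described above can be sketched in C. This is a minimal sketch, assuming the classic PC layout (446 bytes of code, four 16-byte slots, then the two signature bytes 0x55 0xAA); the names part_slot and mbr_read_slot are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE   512
#define BOOT_CODE_LEN 446   /* stage-1 boot-loader code */
#define SLOT_LEN      16    /* one partition-table slot */
#define NUM_SLOTS     4

/* Hypothetical in-memory view of one partition slot. */
struct part_slot {
    uint8_t  type;          /* partition type, 0 = unused slot */
    uint32_t first_lba;     /* position: first sector of the partition */
    uint32_t num_sectors;   /* size in 512-byte sectors */
};

/* Read a little-endian 32-bit value byte by byte (portable). */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode slot i (0..3) of an MBR sector. Returns 0 on success, -1 if
 * the boot signature is missing or the index is out of range. */
static int mbr_read_slot(const uint8_t sector[SECTOR_SIZE], int i,
                         struct part_slot *out)
{
    assert(sector != NULL && out != NULL);
    if (i < 0 || i >= NUM_SLOTS) return -1;
    if (sector[510] != 0x55 || sector[511] != 0xAA) return -1;
    const uint8_t *e = sector + BOOT_CODE_LEN + i * SLOT_LEN;
    out->type        = e[4];
    out->first_lba   = le32(e + 8);
    out->num_sectors = le32(e + 12);
    return 0;
}
```

Note that 446 + 4 × 16 + 2 = 512, so the three pieces fill the sector exactly.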

3. Ext Layout

  • Each partition contains a File-System.
  • The ext family starts with a boot record.
  • Defines kernel location, block-size, group-size.
  • The partition is split into many equal-sized groups.
  • Partly for robustness, also helps in allocation later.
  Block size    Inodes/data blocks per group    Group size
  1KB           8192                            8MB
  2KB           16384                           32MB
  4KB           32768                           128MB
  64KB          0.5M                            32GB
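The group sizes listed above follow from one assumption: each group's bitmap occupies exactly one block, so a group holds 8 × block-size entries. A minimal sketch of that arithmetic (the function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Blocks per group, assuming the group's bitmap fills exactly one block
 * (one bit per block, 8 bits per byte). */
static uint64_t blocks_per_group(uint64_t block_size)
{
    assert(block_size > 0);
    return 8 * block_size;
}

/* Group size in bytes = blocks per group * block size. */
static uint64_t group_bytes(uint64_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}
```

Because both factors scale with the block size, doubling the block size quadruples the group size.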

4. Ext Superblocks

  • Superblock contains FS metadata.
  • We need it to work out where the groups are in the partition.
  • How each group is split up into separate pieces.
  • Very important, so a copy is kept in every group.
  • There are few possible configurations, so a brute-force search for superblocks is feasible.
  • Helpful in disk recovery.

5. Ext Allocation

  • Two structures on the disk indicate free inodes and data blocks.
  • FS must scan the bitmaps to find free entries to allocate.
  • Two techniques to make this faster.
  • Preallocation: reserve contiguous ranges, so fewer scans are needed.
  • Deferred allocation: ext4 postpones the allocation as long as possible.
  • i.e. it keeps data from multiple writes in a memory buffer until it can see how big the data is.
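The bitmap scan and the preallocation of a contiguous range can be sketched as follows. This is an illustrative sketch only, assuming one bit per entry with bit i stored in byte i/8; it is not the ext implementation and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the first clear bit in bitmap (nbits long), set it, and return
 * its index; return -1 if every entry is in use. */
static long bitmap_alloc(uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask;          /* mark as allocated */
            return (long)i;
        }
    }
    return -1;
}

/* Find `want` consecutive free bits and claim them all in one scan
 * (preallocation). Returns the first index of the run, or -1. */
static long bitmap_alloc_run(uint8_t *bitmap, size_t nbits, size_t want)
{
    assert(bitmap != NULL);
    size_t run = 0;
    for (size_t i = 0; i < nbits; i++) {
        int used = bitmap[i / 8] & (1u << (i % 8));
        run = used ? 0 : run + 1;           /* length of current free run */
        if (want > 0 && run == want) {
            size_t start = i + 1 - want;
            for (size_t j = start; j <= i; j++)
                bitmap[j / 8] |= (uint8_t)(1u << (j % 8));
            return (long)start;
        }
    }
    return -1;
}
```

Claiming a whole run costs one scan, whereas allocating the same blocks one at a time would rescan the bitmap for each block.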

6. Special Groups

  • The ext family tries to keep directories and their contents in a single group.
  • Improves performance: they are commonly read together and sit on nearby cylinders on disk.
  • The first group contains the root directory: hardcoded to inode 2.
  • Opening a path (file or directory) starts from a known location.
  • Read the superblock of the first group, find inode 2.
  • One other block has a special purpose.
  • It contains the journal file for ext3/ext4.
  • A "normal" file in the logical structure / at a known location.

7. UNIX v7 File Allocation

  • Directory structures in ext evolved from UNIX v7 - good design.
  • An inode is a single block on the disk; describes a file/dir.
  • Size, creation, modification, last access, protection, incoming link count.
  • Ten slots hold block addresses for data (a simple list suffices for small files).
  • With more than 10 blocks, final three addresses indicate indirection.
  • Point to blocks dedicated to disk pointers.
  • Uses 1,2 or 3 levels of indirection (trees).
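The 10 + 3 addressing scheme above can be sketched as a function that reports how many levels of indirection are needed to reach a given block of a file. The number of pointers per indirect block (ppb) is a parameter here; the value 128 used in the note below is an assumption (512-byte blocks with 4-byte pointers), not a figure from these slides.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRECT 10  /* direct slots in the inode, as described above */

/* Levels of indirection needed to reach logical block n of a file:
 * 0 = direct slot, 1..3 = single/double/triple indirect, -1 = too large.
 * ppb is the number of block pointers that fit in one indirect block. */
static int indirection_level(uint64_t n, uint64_t ppb)
{
    assert(ppb > 0);
    if (n < NDIRECT) return 0;
    n -= NDIRECT;
    uint64_t span = ppb;            /* blocks reachable at this level */
    for (int level = 1; level <= 3; level++) {
        if (n < span) return level;
        n -= span;
        span *= ppb;                /* each level multiplies the reach */
    }
    return -1;
}

/* Largest file, in blocks, that the 10 + 3 scheme can describe. */
static uint64_t max_file_blocks(uint64_t ppb)
{
    return NDIRECT + ppb + ppb * ppb + ppb * ppb * ppb;
}
```

With ppb = 128, blocks 0..9 are direct, blocks 10..137 need one level, and the scheme tops out at 2113674 blocks.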

8. UNIX v7 Directory Structures

  • Directories are tables of fixed-length entries.
  • Limits filenames to 14 characters.
  • Maximum of 64K inodes in total.
  • inodes for both directories and files use the same code.
  • Only a flag in the attributes to indicate type.
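A fixed-length entry of this kind can be sketched as a 16-byte struct: a 2-byte inode number (hence the 64K limit) followed by a 14-byte name. The lookup function is a hypothetical illustration of the linear scan, not code taken from UNIX v7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define V7_NAME_LEN 14

/* One fixed-length directory entry: 2-byte inode number + 14-byte name. */
struct v7_dirent {
    uint16_t ino;                 /* 0 marks an unused slot */
    char     name[V7_NAME_LEN];   /* NUL-padded; no NUL if exactly 14 chars */
};

/* Linear scan of a directory table; returns the inode number or 0. */
static uint16_t v7_lookup(const struct v7_dirent *dir, size_t n,
                          const char *name)
{
    assert(dir != NULL && name != NULL);
    if (strlen(name) > V7_NAME_LEN) return 0;   /* such a name cannot exist */
    for (size_t i = 0; i < n; i++) {
        if (dir[i].ino == 0) continue;
        /* strncmp stops after 14 bytes, so full-length names still match */
        if (strncmp(dir[i].name, name, V7_NAME_LEN) == 0)
            return dir[i].ino;
    }
    return 0;
}
```

Fixed-size entries make the table trivial to index and to reuse after deletion, at the price of the hard 14-character limit.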

9. UNIX v7 walking a path example

  • Relative paths depend on processing . and ..
  • FS uses hard-links to make this simple.
  • Resolving an absolute path starts at root directory.
  • This is a known location on the disk.
  • Each entry lists the block number for that file's / directory's inode.
  • Directory i-nodes store the block number for entries.
  • Both . and .. are normal entries, linking to self or parent.
  • CWD for process is used as start for resolving path (same resolution).
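The walk described above can be sketched against a toy in-memory file system. Everything here (toy_fs, add_entry, the table sizes, using inode 2 as the root) is an assumption made for illustration; a real implementation reads the tables from disk blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROOT_INO 2    /* assumed known location of the root directory */
#define MAX_INO  16
#define MAX_ENT  8

struct entry { uint16_t ino; char name[15]; };

/* Toy file system: dirs[i] is the entry table of inode i.
 * Plain files simply have an empty table (count 0). */
struct toy_fs {
    struct entry dirs[MAX_INO][MAX_ENT];
    int          count[MAX_INO];
};

/* Helper to add an entry to a directory table (illustration only). */
static void add_entry(struct toy_fs *fs, uint16_t dir, const char *name,
                      uint16_t ino)
{
    assert(dir < MAX_INO && fs->count[dir] < MAX_ENT && strlen(name) < 15);
    struct entry *e = &fs->dirs[dir][fs->count[dir]++];
    e->ino = ino;
    strcpy(e->name, name);
}

/* Linear scan of one directory; returns the inode number or 0. */
static uint16_t lookup(const struct toy_fs *fs, uint16_t dir,
                       const char *name, size_t len)
{
    for (int i = 0; i < fs->count[dir]; i++) {
        const char *n = fs->dirs[dir][i].name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return fs->dirs[dir][i].ino;
    }
    return 0;
}

/* Resolve path one component at a time. Absolute paths start at the
 * root inode, relative paths at cwd. "." and ".." need no special case
 * because they are ordinary entries in every directory table. */
static uint16_t walk(const struct toy_fs *fs, uint16_t cwd, const char *path)
{
    uint16_t cur = (path[0] == '/') ? ROOT_INO : cwd;
    while (*path && cur != 0) {
        while (*path == '/') path++;        /* skip separators */
        size_t len = strcspn(path, "/");
        if (len == 0) break;                /* trailing slash */
        cur = lookup(fs, cur, path, len);
        path += len;
    }
    return cur;                             /* 0 = not found */
}
```

The same loop serves both cases; only the starting inode differs between absolute and relative paths.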

10. Directory Structure in ext2

  • Table stored in directories is a variable length structure (max 255 chars).
  • Two size fields: entry / name - allows internal padding.
  • To resolve a component in a path: linear scan through table.
  • inode in each entry - overall process is as in UNIX v7.
  • Part b) shows effect of deleting "voluminous" on structure.
  • Will attempt to reuse for newer files if they fit.
  • When files are deleted - no attempt to reclaim space. Compact manually.
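The two size fields can be sketched with the ext2 entry header: rec_len gives the distance to the next entry (including padding), name_len the length of the name itself. A minimal sketch, assuming host byte order for simplicity (on disk the fields are little-endian); ext2_lookup and ext2_put are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed header of a directory entry; the name follows it directly. */
struct ext2_dirent_hdr {
    uint32_t inode;      /* 0 = entry is unused */
    uint16_t rec_len;    /* size of the whole entry, including padding */
    uint8_t  name_len;   /* size of the name itself (max 255) */
    uint8_t  file_type;
};

/* Linear scan of one directory block; returns the inode number or 0. */
static uint32_t ext2_lookup(const uint8_t *block, size_t block_size,
                            const char *name)
{
    assert(block != NULL && name != NULL);
    size_t want = strlen(name), off = 0;
    while (off + sizeof(struct ext2_dirent_hdr) <= block_size) {
        struct ext2_dirent_hdr h;
        memcpy(&h, block + off, sizeof h);     /* avoids unaligned access */
        if (h.rec_len < sizeof h || off + h.rec_len > block_size)
            break;                             /* corrupt entry: stop */
        if (h.inode != 0 && h.name_len == want &&
            sizeof h + h.name_len <= h.rec_len &&
            memcmp(block + off + sizeof h, name, want) == 0)
            return h.inode;
        off += h.rec_len;                      /* skips name and padding */
    }
    return 0;
}

/* Write an entry at off (helper for building an example block).
 * Returns the offset just past the entry. */
static size_t ext2_put(uint8_t *block, size_t off, uint32_t ino,
                       const char *name, uint16_t rec_len)
{
    struct ext2_dirent_hdr h = { ino, rec_len, (uint8_t)strlen(name), 0 };
    memcpy(block + off, &h, sizeof h);
    memcpy(block + off + sizeof h, name, h.name_len);
    return off + rec_len;
}
```

In this representation an entry can be deleted by enlarging the rec_len of the entry before it, so the scan steps over the dead space; that space stays allocated until a new name fits into it.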

11. Worked example

  • Consider what happens when a user types cp passwd /home/bob/passwd.
  • The cp program is a simple wrapper for a common sequence of syscalls.
  • The system needs to find the data for path passwd.
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buffer[512];
        ssize_t len;
        int src  = open(argv[1], O_RDONLY);
        int dest = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);  // assume file
        while ((len = read(src, buffer, 512)) > 0)
            write(dest, buffer, len);
        return 0;
    }

12. Opening the source

  • System needs to walk path supplied to find file data on the disk.
  • Path is relative, starts walk from inode in the process CWD.

13. Opening the destination

  • This time from root, and allocate free disk blocks.
  • Note: changed filename slightly from previous code.

14. The copy loop

  • File API aims for simplicity - all state is implicit.
  • It's not quite obvious where to put this state.
  • Options: 1. In the inode 2. In the fd-table 3. A global table
  • Consider (echo "Results:"; grep foo data) >t
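The shell example above hints at the answer: both commands must write to t through one shared position, otherwise the second would overwrite the first. A minimal sketch, assuming a POSIX system, where descriptors inherited across fork() share a single file offset; run_child and two_writers are hypothetical names.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one "command" in a child process: it inherits fd and writes msg.
 * Returns 0 on success, -1 on failure. */
static int run_child(int fd, const char *msg)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                         /* child */
        ssize_t n = write(fd, msg, strlen(msg));
        _exit(n == (ssize_t)strlen(msg) ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Mimic (cmd1; cmd2) >path: open once, then run two children in turn.
 * Returns the final file offset as seen by the parent, or -1 on error. */
static off_t two_writers(const char *path, const char *a, const char *b)
{
    assert(path != NULL && a != NULL && b != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (run_child(fd, a) != 0 || run_child(fd, b) != 0) {
        close(fd);
        return -1;
    }
    off_t end = lseek(fd, 0, SEEK_CUR);     /* offset moved by the children */
    close(fd);
    return end;
}
```

The parent never writes, yet its offset advances: the position lives in neither the inode nor the per-process fd-table, but in a table entry shared by every descriptor that came from the same open().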

Break (15mins)


15. FS Families

  • We've seen the ext-family disk layout in detail.
  • Historically it evolved alongside two other families of FS.
  • MSDOS (and Windows up to Win98).
    • FAT-12, FAT-16, FAT-32.
  • NTFS
    • v1 (Windows NT), v3 (Win2000), v3.1 (XP onwards).
  • Across all three families of FS we can see some similar design choices.
    • Structures - ext is closest to the "theory".
    • File Allocation / Freelist - we will see the Windows approach.
    • Limits - Volume sizes, file sizes, filename lengths.
    • Block-sizes.
  • Limits changed rapidly so we need to see the historical context...

16. FS Historical Context

17. Limits : Volume-size

  • The history of MSDOS showed frequent revision to hardcoded limits.
    • Question: should they not have used larger limits to start with?
    • Maximum volume size was revised many times.
    • FAT12: 2MB (32MB addressable sectors, <4096 "clusters").
    • FAT16: 32MB (as \(2^{16}\) 512-byte clusters)
    • FAT16B: 2GB (using 32KB clusters)
    • FAT32: \(2^{28}\) clusters of 8KB (2TB) ... 64KB (16TB)
  • Harddrive sizes increased rapidly, limits were frequently increased.
  • FATs are cached in memory to improve performance.
  • A 12-bit FAT only takes 1.5B * 4096 = 6KB of memory.
  • FAT-32 requires 4B * \(2^{28}\) = 1GB of memory.
  • Small limits are inconvenient, large limits impact memory usage.
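The figures above can be reproduced with two one-line formulas: volume size is entries × cluster size, and the memory needed to cache the table is entries × entry width. A minimal sketch (function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to cache a FAT with `entries` slots of `bits` bits each. */
static uint64_t fat_table_bytes(uint64_t entries, unsigned bits)
{
    assert(bits > 0);
    return (entries * bits + 7) / 8;    /* round up to whole bytes */
}

/* Largest volume addressable with `entries` clusters of cluster_bytes. */
static uint64_t max_volume_bytes(uint64_t entries, uint64_t cluster_bytes)
{
    return entries * cluster_bytes;
}
```

For example, fat_table_bytes(4096, 12) gives 6144 bytes (6KB), while fat_table_bytes(1ULL << 28, 32) gives 1GB. The table grows with the number of entries, so raising the limit by adding entries costs memory, while raising it by enlarging clusters does not.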

18. Limits : Filename-length

  • We can see a similar issue with limits on filename lengths.
  • Until '95 MSDOS used an 8+3 filename length, Win95 introduced 255.
    • Backward compatibility was a requirement...
    • "Long" filenames live in VFAT - an extra mapping on the drive.
    • FS structures still use 8+3 filenames, extra dictionary for "long" versions.
  • NTFS in 2000 switched to direct support for 255-codepoint filenames.
  • In the UNIX world, v7 used a 14-character limit, ext used 255.
  • We saw the difference in implementation difficulty between fixed sizes and variable.
Hardcoded limits are tricky in exponential growth
Variable limits cause implementation difficulties. Set too large, wastes space. Set too small, awkward to use.

19. Limits: Summary

  • We saw two kinds of hard-limits in the FS (volume-size and filename length).
    • In both cases they were revised upwards during ongoing development.
Alternative 1
Set the limit large enough to start with.
  • It is difficult (and costly) to future-proof against exponential growth.
    • What if five years in the future means 32x larger?
    • Making the limit 32x larger today, may make some structures 32x larger.
    • That can be a large cost in today's resources, e.g. 32x more memory for a table.
  • Allowing for 32x larger volumes in 5-years...
    • ...means slow performance using 32x more memory.
    • (this is actually the worst-case, e.g. FAT tables).

20. Limits: Summary

Alternative 2
Implement variable-length structures.
  • Variable-length structures are much harder to get right.
    • The changes to representation are small, the extra code is not so difficult.
    • But how do we avoid fragmentation?
  • No good answer for that question: e.g. wasting "padding" in ext directory entries.

21. ISO 9660 (read-only)

  • In a read-only file-system variable length structures cause fewer problems.
  • Numbers are stored twice: big/little-endian.
  • Directory entries are variable length.
  • Files are contiguous: described by start/length.
  • Directory structure is a list of these entries.
    • Only 1 byte for length (256 entries), depth is limited to 8.
    • Depending on target profile ("level") filenames are 8.3 / 31-char.
    • Files up to 4GB (double encoded in 8 bytes).
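The double encoding can be sketched as follows: a 32-bit value is written twice into 8 bytes, little-endian first and then big-endian, so either kind of machine can read its native half without swapping. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store v twice in 8 bytes: little-endian copy, then big-endian copy. */
static void put_both_endian32(uint8_t out[8], uint32_t v)
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(v >> (8 * i));   /* bytes 0..3: LSB first */
        out[7 - i] = (uint8_t)(v >> (8 * i));   /* bytes 4..7: MSB first */
    }
}

/* Read the value back. Returns 0 and sets *v only if both copies agree,
 * which doubles as a cheap consistency check. */
static int get_both_endian32(const uint8_t in[8], uint32_t *v)
{
    assert(in != NULL && v != NULL);
    uint32_t le = 0, be = 0;
    for (int i = 0; i < 4; i++) {
        le |= (uint32_t)in[i] << (8 * i);
        be  = (be << 8) | in[4 + i];
    }
    if (le != be) return -1;
    *v = le;
    return 0;
}
```

Since each copy is only 32 bits wide, the largest length that fits is just under 4GB, which matches the file-size limit above.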

22. ISO Extensions

  • While UNIXv7 is regarded as an example of good design...
  • ...ISO9660 was designed by a committee.
  • Some of the ISO9660 limitations are for portability.
    • Committee involved Unix vendors, Microsoft and embedded systems vendors.
    • Least-common-denominator of FS on different platforms.
  • Extensions are now used to plug some of the holes.
Rock Ridge (UNIX)
Filenames, file attributes, symbolic links, timestamps and unlimited depth
Joliet (Windows)
Long Unicode filenames, deeper nesting, extension names on directory names

23. NTFS Disk Organisation

  • Directory entries are simple in NTFS - a name and a file ID.
    • The file ID gives the index of a record in the Master File Table (MFT).
  • The MFT is a sequence of fixed-size 1KB records; equivalent to inodes.
    • Each record is a directory or file: list of attribute fields.
    • Each field is an identifier and a length.
    • Small data may be inlined (resident); otherwise the field stores a pointer to another block (non-resident).
    • Hence, small files can be stored directly inside the MFT.
    • Larger files have a list of pointers to their data blocks.
    • Attribute for extension records: span multiple 1KB records.

24. NTFS Storage Allocation

  • File storage is in extents (contiguous runs).
    • As shown, the data attributes record extents where the file is stored.
    • The system makes an effort to arrange contiguous storage where possible.
    • Generally, this means preallocating blocks for files at creation time.
  • Sparse files are supported by placing contiguous runs in separate MFT records.
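Finding the disk block for a given file block in an extent-based layout means walking the list of runs. A minimal sketch using a hypothetical in-memory struct, not the on-disk NTFS encoding:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run: count file blocks stored from disk block start. */
struct extent {
    uint64_t start;
    uint64_t count;
};

/* Map file block vbn to a disk block by walking the run list.
 * Returns the disk block, or -1 if vbn lies past the end of the file. */
static int64_t extent_map(const struct extent *runs, size_t n, uint64_t vbn)
{
    assert(runs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (vbn < runs[i].count)
            return (int64_t)(runs[i].start + vbn);
        vbn -= runs[i].count;       /* skip this run, keep searching */
    }
    return -1;
}
```

A file stored in three runs (disk blocks 20-23, 64-65 and 80-82) needs only three list entries for nine blocks, and a fully contiguous file needs a single entry regardless of its size.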

25. Network File System (NFS)

  • All of the FS we've seen so far map API calls onto local harddisks.
  • But the FS abstraction can be used over a network as well.
  • Allows folders that can be shared between users / machines.
  • Cloud clients (e.g. Dropbox, Google Drive) provide a similar service.
    • But they try to sync a local folder with a remote server.
Network File System (NFS)
Implement the basic syscalls as a network protocol
  • Avoids the synchronisation issue: are local/remote changes correct?
  • Only a single state: on the server.
    • Although it is partly cached for performance.

26. NFS overview

  • Shared namespace across multiple servers and clients.
  • Different sub-trees can reside on different physical machines.
  • Heterogeneous protocol - implemented on many different platforms.

27. NFS operation

  • Stateless operation - no open/close, just read/write.
  • NFS client has to fake stateful operation to translate VFS to NFS.
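The split between a stateless server and a client that fakes the state can be sketched as follows. This is a toy model only: server_read stands in for a network request, and none of the names correspond to the real NFS protocol or to any library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy "server": one file held in memory (hypothetical stand-in). */
static const char server_file[] = "hello, stateless world";

/* Stateless read: the request itself carries handle, offset and count,
 * so the server remembers nothing between calls. Returns bytes read. */
static size_t server_read(int handle, uint64_t offset, size_t count,
                          char *buf)
{
    size_t size = sizeof server_file - 1;
    if (handle != 1 || offset >= size) return 0;
    if (count > size - offset) count = size - offset;
    memcpy(buf, server_file + offset, count);
    return count;
}

/* Client-side record that fakes open()/read() semantics for the caller. */
struct client_file {
    int      handle;
    uint64_t offset;    /* the current position lives on the client only */
};

static struct client_file client_open(int handle)
{
    struct client_file f = { handle, 0 };   /* server is not told */
    return f;
}

static size_t client_read(struct client_file *f, char *buf, size_t count)
{
    assert(f != NULL && buf != NULL);
    size_t n = server_read(f->handle, f->offset, count, buf);
    f->offset += n;                         /* advance the local position */
    return n;
}
```

Because every request is self-describing, the toy server could restart between two client_read calls without the client noticing; the price is that the client must track the position itself.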

28. Summary

  • End of the File-System chapter.
    • We've looked at the key issues in File System design.
    • The implementation structures / algorithms.
    • Walked through some real examples to see how it works.
  • Try it out for real in the project...

Old. The ext family

  • The first modern file-system that we look at is the ext family.
    • ext: Designed to replace original MINIX FS in Linux.
    • ext2: Redesigned around the Berkeley Unix File System.
    • ext3: Added support for journals.
    • ext4: Increases limits and performance.
  • Because of the ext2 redesign, ext inherits a lot from UNIXv7.
  • Overall layout on the disk, splits blocks into groups.
  • Each block-group looks like a small file-system.
    • Separate pool of inodes and data blocks.

Old. Disk Organisation of ext2

  • Extensive documentation available online.
    • Organisation based on 1KB blocks: boot block, superblock.
    • The superblock describes the boundaries of the block groups.
    • Some block groups contain backups of the superblock.
  • Block groups are based on BSD cylinder groups: prevent thrashing.
    • Modern disks tend to hide their real geometry.
    • Block-groups are fixed regions on the disk (hopefully cylinders).
    • Preserves some locality of access: inodes are close to data blocks.
  • Each block group has an independent allocation of inodes / data-blocks.
    • Uses two bitmaps to track free blocks within each group.
    • Preallocation of space for files reduces fragmentation.
    • System attempts to keep file blocks within block groups.