We Are Doing Files Wrong

2021/01/29

... in which we discuss files, digress towards C64s and other weird kinds of files, and eventually come back to describing a new way of handling files on computers. Maybe it's not new... but maybe interesting to think about.

tl;dr: make directories and files literally the same thing! All file formats are broken; we must mount all the things; let's rewrite the world.

Where we currently stand

Back in the 90s, if you wanted to explain "computers" to people who had never seen one, you'd surely get to the following fact in the first 5 minutes:

Documents, pictures etc are "files". You can put files in folders. You could even put folders in folders.

a screenshot of Norton Commander

This really didn't change a lot in the 2000s.

win98 folder tree

(... technically, the screenshot isn't from the 2000s, but the ability of GUIs to show what's actually happening has been steadily declining ever since. But... this is pretty much how people have been imagining file systems.)

Now, pedantic Linux / Unix people might say: but directories are files, too! In Linux, you can get a file descriptor pointing to a folder, and they do indeed behave fairly similarly on a file system level. However... as far as users and developers are concerned, the following is still true:

Files are blobs of bytes. Directories hold files and other directories. They're two distinct things.

As trivial as this might seem, this is not the only way this could be, and indeed, it's not the way things always were.

Counterexamples!

To begin with, "directories" weren't always infinitely hierarchical things. DOS 1.0, for example, didn't support subdirectories at all; you had a single catalog. The same goes for Commodore 64 disk catalogs.

"Sure", you might say, "computers in the past were simple and primitive; Progress has happened since, so by now we have the Most Full-Featured System Imaginable already. Especially if you consider symlinks and hard links."

Well, no. There were, for example, resource forks. On the "classic" Mac OS line (up to 9), you could store multiple binary blobs under a single file name, identified by a four-byte resource type, an id and (optionally) a name. They were used to store e.g. bitmaps, localized strings, and apparently sometimes even program code. So much for our assumption that "files are a single binary blob".

However, things get even weirder once we go back a bit further. Namely, Commodore 64 files are... even more alien from a modern point of view. To begin with, unlike with the PC, the 1541 disk drive was its own standalone intelligence, coming with its own 6502 CPU, a close sibling of the 6510 found in the C64 itself. Thus, most of what we currently consider part of a file system driver (NTFS, ext3, etc.) was done by the drive itself.

The drive, and thus the "file system", could handle several distinct kinds of files. "PRG" ones were for programs; "SEQ" and "USR" could only be used sequentially, while "REL" ones also had random access, coming with a concept of fixed-length "records": e.g. you create the file as one that contains records of 26 bytes, and then you seek to record number 10 (... see this example). This is not just a bunch of OS convenience functions covering up a binary blob: this is the Actual Way the OS talks to the disk drive (... which will then translate all this to tracks and sectors; no one really has to think about a flat array of bytes in the entire process).
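For contrast, here is roughly what the same "seek to record 10 of a 26-byte-record file" looks like when all you have is a flat array of bytes. This is only a sketch of the modern emulation, not of how the 1541 did it: the function names, the NUL padding and the 0-based record numbering are all my assumptions.

```python
import io

RECORD_SIZE = 26  # bytes per record, as in the example above


def write_record(f, index, payload):
    """Store payload in record number `index` (0-based here), NUL-padded."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in one record")
    f.seek(index * RECORD_SIZE)  # the offset arithmetic is on us now
    f.write(payload.ljust(RECORD_SIZE, b"\x00"))


def read_record(f, index):
    """Return the payload of record number `index`, without the padding."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")


disk = io.BytesIO()  # stands in for a regular file opened in "r+b" mode
write_record(disk, 10, b"HELLO")
print(read_record(disk, 10))
```

The record concept lives entirely in the application here; the "file system" underneath only ever sees offsets.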

Of course, this is yet another example of a combination of "we didn't yet know better" and "we needed to optimize because computers were stupid". Fair point. A point I really wanted to get to. Since... I'm going to argue for the same thing in the next section.

The thesis

Why isn't every file also a directory?

Currently we have files, which are binary blobs. We have directories that contain other things. Directories aren't binary blobs. Files cannot contain other files. And yet... why not???

If files were allowed to contain other files on the file system level, who would need "directories" anymore? They're just empty files with a bunch of sub-files, after all.

I'll argue that this would make life a lot easier from many perspectives.

... but why?

What's the point of sub-files? Why would you want to put a sub-thing into a JPEG file, for example?

Well... here is the second part of the thesis: most file formats come with an ad-hoc, rigid, poorly expandable, application-specific implementation of a subdirectory tree.

(... yes, this intentionally sounds like Greenspun's tenth rule. The one about Common Lisp.)

Example time!

- JPEG files consist of a series of chunks, with lengths prepended (... mostly). They come with embedded color space info, EXIF, XMP (... which is an XML metadata format), and, optionally, another entire @#$% JPEG file, embedded, as a thumbnail.
- MP4 files are a tree (!) of boxes (length, 4-byte box type, contents with possibly more boxes... imagine a binary version of XML), describing a set of video tracks. You can also add the same XMP XML (... of course you need a different tool this time to extract it). Most of the file is taken up by a binary blob, containing the little video / audio packets, pointed to by offsets in the (much smaller) box tree.
- HEIC files (the fancy new image format used by iPhones) are remarkably similar to MP4s (you could even create ones with a video in them). As in: tree of boxes, with the actual compressed images sitting in a blob.
- ZIP files are, um, well, I don't think any more explanation is needed here. Admittedly, the files contained within are stored in a compressed format. This, actually, also happens to cover:
  - the new Office docx / xlsx / etc formats, which are, in fact, a bunch of XML (and some other) files in a ZIP file, with a special extension.
  - Android APK files: ditto. Contains: binary-ified resource XMLs, images, .dex files for the code. Some of them are stored without any compression so you don't even have to uncompress them to memory-map and run what's inside.
  - Java's jar files, with a bunch of .class files inside.
- Windows EXE (PE) files come with a set of resources, e.g. icons, dialogs, localized text, etc. Pretty much like the Mac resource forks, except without explicit file system support.
- Linux ELF files come with a header and several sections: the native code to be mapped and executed, the constant strings to be mapped but not to be executed, etc.
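To make the "tree of boxes" point a bit more tangible, here is a minimal sketch of walking an MP4-style box tree. The sample data is synthetic (built by the hypothetical make_box helper, not a playable file), and the CONTAINERS set is my own shortlist of a few common container boxes rather than the full list from the spec.

```python
import struct

# Box types treated as pure containers here: an assumption covering only a
# handful of common ones, not everything the format defines.
CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}


def make_box(box_type, payload):
    """Build one box: 4-byte big-endian total size, 4-byte type, payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def walk_boxes(data, start=0, end=None, depth=0):
    """Yield (depth, type, payload_size) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # a 64-bit size follows the type field
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:  # box runs to the end of the enclosing data
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated; stop instead of guessing
        yield depth, box_type.decode("ascii", "replace"), size - header
        if box_type in CONTAINERS:
            yield from walk_boxes(data, pos + header, pos + size, depth + 1)
        pos += size


sample = (
    make_box(b"ftyp", b"isom")
    + make_box(b"moov", make_box(b"trak", make_box(b"mdia", b"")))
    + make_box(b"mdat", b"\x00" * 16)
)
for depth, box_type, payload_size in walk_boxes(sample):
    print("  " * depth + f"{box_type} ({payload_size} bytes)")
```

Squint a little and walk_boxes is a recursive directory listing, just for a file system that only this one format understands.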

The list goes on and on. Actually, it's somewhat harder to find file formats that don't contain trees, or at least a flat list, of other blobs: text files and MP3s come to mind as ones that don't really. Or raw audio.

Plus, there are some "formats" which aren't even really file formats: Mac OS (OSX) app packages are in fact directories, which show up as files in the GUI.

Returning to the question of "why we want files to have sub-files": because they mostly already do have them, and acknowledging this would help reduce all the chaos by a lot.

In an imaginary world...

... where files are indeed directories, what would things look like?

From a "user" perspective: not radically different. "Folders with documents in them" would still be a thing; it'd perhaps be easier to copy an image out of a Word document. Otherwise: Mac OS bundles work well today; it wouldn't be that much different. From a "complexity" perspective though...

... imagine that most overcomplicated file formats just do not exist.

Why come up with an ad-hoc scheme of addressing resources, if you could just rely on the OS keeping them in order? Why embed resources into your binary if you could just add them as sub-files?

Remember: every file can have sub-files. If you start out as a binary blob but suddenly need a sub-file, you can add one. If you need to add a JPEG thumbnail to anything, you can specify that it should go into the "/thumbnail.jpg" subfile. And thereby we solved "thumbnails" for every kind of file, even those that our OS doesn't support. Done.

Also, if you are ever tempted to embed that JPEG into one of your blobs... well, remember that a JPEG file is an entire tree now. So you either turn it into a flat binary first (ZIP it up?), or... it's actually much simpler for you to keep it as a sub-file. So now your format has sub-files, too, further motivating everyone else to make use of them.

If this sounds horrible from a "tooling" perspective: it doesn't need to be. In this world, "attaching a file to an email" would imply all sub-files automatically. Or a configurable subset. Want to redact EXIF location? Well, you just don't send that "/location.txt" subfile of the JPEG. Since everything has subfiles now, all the tools would include them by default, with "just the binary blob" being the weird special case (... analogous to using dd to extract a JPEG header today: doable, but why). File copy: copies subfiles by default. (Compare this with today's world: try declaring that your JPEG file is suddenly a directory... and good luck with getting everyone reading JPEG files on board.)
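A toy model of that "copy everything by default, leave out a configurable subset" behavior could look like the sketch below. Node, copy_tree and the placeholder contents are all made up for illustration; no existing OS exposes files like this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical file that is also a directory: own bytes plus sub-files."""
    data: bytes = b""
    children: dict = field(default_factory=dict)


def copy_tree(node, exclude=(), prefix=""):
    """Deep-copy node, skipping any sub-file whose path is listed in exclude."""
    result = Node(node.data)
    for name, child in node.children.items():
        path = f"{prefix}/{name}"
        if path not in exclude:
            result.children[name] = copy_tree(child, exclude, path)
    return result


photo = Node(b"<image data>", {
    "thumbnail.jpg": Node(b"<small image data>"),
    "location.txt": Node(b"<coordinates>"),
})
redacted = copy_tree(photo, exclude={"/location.txt"})
print(sorted(redacted.children))  # ['thumbnail.jpg']
```

Note that redaction needs no knowledge of JPEG at all: the tool only has to know sub-file names.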

Of course, we'd need wire formats for this: you can't send a file tree over an Ethernet cable. But... can you really send a binary blob either? (It's broken up into packets eventually, after all.) But all of this can be lower-level standards. "The Internet Wire Format for Trees" can equally apply to a JPEG file and a video, and would probably be an OS-level feature instead, perhaps transparently supplied by HTTP. E.g.

GET http://some.site.example.com/background.jpg

would get you the picture, subtrees and all, while

GET http://some.site.example.com/background.jpg/thumbnail.jpg

would supply the thumbnail, just because it's there in the actual file. Transparently. Imagine trying to do this with a current-world JPEG. Technically, you could fetch the header... figure out binary offsets... send a GET request to fetch part of the file... and, in the process, do so many round trips that "just fetch the entire thing" would've been simpler.
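On the server side, resolving such a URL path would need nothing format-specific. Here is a sketch with an in-memory stand-in for the tree; Node and resolve are hypothetical names, and this is just the path lookup, not an actual HTTP server.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical file that is also a directory: own bytes plus sub-files."""
    data: bytes = b""
    children: dict = field(default_factory=dict)


def resolve(root, path):
    """Follow a slash-separated path through sub-files; None if a step is missing."""
    node = root
    for part in path.strip("/").split("/"):
        if part == "":
            continue  # empty path means the root itself
        node = node.children.get(part)
        if node is None:
            return None
    return node


site = Node(children={
    "background.jpg": Node(b"<full image>", {
        "thumbnail.jpg": Node(b"<small image>"),
    }),
})
print(resolve(site, "/background.jpg").data)
print(resolve(site, "/background.jpg/thumbnail.jpg").data)
print(resolve(site, "/background.jpg/missing.txt"))
```

The same loop serves the picture and its thumbnail; there are no offsets to compute and no extra round trips.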

ZIP files wouldn't exist. Or... in fact, there would be ZIP "files": file-directories with some blobs compressed inside, with extra info attached on how to decompress them. No need for an extra catalog (think of the way gzip doesn't bundle files). You could even do compression as an OS service, since, suddenly, the OS would know the internal structure of files! And you don't need them for the other popular usage, "let's send multiple files in one go" either: if every file is multiple files, you could just create a file/directory, throw your documents in, and send that one (as a single file, no more complex than a single JPEG).

The way forward

Well, there is bad news:

... this is not how we've been doing "files" for the past... um... 30-40-50 years.

So, if someone writes an OS that works like this, it'd still need to deal with all the half-broken mini-file-systems that are our file formats.

Here is the good news though: who said we can't mount mini file systems? In fact, we could on-demand automount everything. The way you can browse into an ISO image. (I guess you could even cd into it on a low level with a sufficiently sophisticated automounter.)
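As a rough idea of what such a mini-file-system driver might expose, here is a read-only sketch on top of Python's zipfile module. ZipMount and build_sample_archive are hypothetical names; the member names mimic a .docx layout, but the contents are placeholders.

```python
import io
import zipfile


def build_sample_archive():
    """Create a small in-memory ZIP whose member names mimic a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/media/image1.png", b"<png bytes>")
    return buf.getvalue()


class ZipMount:
    """Read-only view of a ZIP blob as a tree of sub-files."""

    def __init__(self, blob):
        self._zf = zipfile.ZipFile(io.BytesIO(blob))

    def listdir(self, path=""):
        """List the names directly below path, like a directory listing."""
        prefix = path.strip("/")
        prefix = prefix + "/" if prefix else ""
        names = set()
        for name in self._zf.namelist():
            if name.startswith(prefix) and name != prefix:
                names.add(name[len(prefix):].split("/")[0])
        return sorted(names)

    def read(self, path):
        """Return the uncompressed bytes of one member."""
        return self._zf.read(path.strip("/"))


mount = ZipMount(build_sample_archive())
print(mount.listdir())
print(mount.listdir("word"))
print(mount.read("word/media/image1.png"))
```

An automounter would pick a driver like this by file type and graft its listdir / read onto the regular path lookup.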

Of course, a lot of this would be read only. You can't just attach a thumbnail to, say, a random MP4 file, if your mini-file-system driver has no idea where to put thumbnails in MP4s (... because it's just not a thing in the format spec). On the other hand, you could definitely have an EXIF editor by loading up

our_image.jpg/exif.txt

in your favorite text editor.

Meanwhile, tools on this hypothetical new OS would use these old, flat-binary formats as mounted file systems (unlike current OSes, which open them directly as binaries). In the other direction, new-OS-native file formats can just use a wire format (a bit similar to ZIP files, with an explicit catalog) to produce self-contained binary blob files for the rest of the world that still doesn't think of them as trees.
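Such a wire format doesn't need to be fancy. The sketch below flattens a tree into a single blob with an explicit catalog up front; the layout (4-byte catalog length, JSON catalog, then the raw blobs) is something I made up on the spot, not an existing standard, and the empty path "" is assumed to mean the file's own bytes.

```python
import json
import struct


def pack(tree):
    """Serialize {path: bytes} into catalog length + JSON catalog + raw blobs."""
    catalog, blobs, offset = [], [], 0
    for path in sorted(tree):
        blob = tree[path]
        catalog.append({"path": path, "offset": offset, "length": len(blob)})
        blobs.append(blob)
        offset += len(blob)
    header = json.dumps(catalog).encode("utf-8")
    return struct.pack(">I", len(header)) + header + b"".join(blobs)


def unpack(blob):
    """Inverse of pack(): rebuild the {path: bytes} mapping."""
    (header_len,) = struct.unpack_from(">I", blob, 0)
    catalog = json.loads(blob[4:4 + header_len].decode("utf-8"))
    body = blob[4 + header_len:]
    return {
        entry["path"]: body[entry["offset"]:entry["offset"] + entry["length"]]
        for entry in catalog
    }


tree = {
    "": b"<image data>",                      # the file's own bytes
    "thumbnail.jpg": b"<small image data>",   # a sub-file
    "exif.txt": b"<metadata>",
}
wire = pack(tree)
print(unpack(wire) == tree)
```

Because the catalog comes first, a reader can find any sub-file without scanning the blobs, and the same format works for every kind of file.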

You could even use traditional-style file systems to emulate a sub-file one: you could just have a directory tree with every directory containing a file called "data". It'd be seriously ugly, with way too many directories around, but it'd work.
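A sketch of that emulation, using a temporary directory as the "traditional" file system. The helper names are mine, and the approach has an obvious limitation: "data" becomes a reserved name that no sub-file can use.

```python
import tempfile
from pathlib import Path

DATA_NAME = "data"  # reserved file name holding each node's own bytes


def write_node(root, path, content):
    """Store content as the blob of the node at path, creating parents."""
    node_dir = root / path.strip("/")
    node_dir.mkdir(parents=True, exist_ok=True)
    (node_dir / DATA_NAME).write_bytes(content)


def read_node(root, path):
    """Return the blob stored for the node at path."""
    return (root / path.strip("/") / DATA_NAME).read_bytes()


def list_subfiles(root, path):
    """Names of the node's sub-files: every child directory is one."""
    node_dir = root / path.strip("/")
    return sorted(p.name for p in node_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_node(root, "photo.jpg", b"<image data>")
    write_node(root, "photo.jpg/thumbnail.jpg", b"<small image data>")
    blob = read_node(root, "photo.jpg")
    thumb = read_node(root, "photo.jpg/thumbnail.jpg")
    subfiles = list_subfiles(root, "photo.jpg")

print(blob, thumb, subfiles)
```

Here "photo.jpg" is a directory on disk, yet it reads like a file with one sub-file, which is exactly the ugly-but-working trade-off described above.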

Summary

It is a historical accident that files can't have sub-files. This made everything worse, with all file formats trying to be their own miniature file systems that can do this. We can do better.

We can partially fix this by mounting them as file systems in OSes that do support sub-files. (I don't currently know of any that do, though.) We can't get all the nice things this way, but it's a viable upgrade route.

... comments welcome, either in email or on the (eventual) Mastodon post on Fosstodon.