First, I'd like to talk about the structure of a Google Doc. It's like a tree. It has inherently nested parts: lists, tables, paragraphs, text elements, images, and so on. This lends itself to a directory structure - especially lists, which have IDs and nesting levels and whatnot.
But did I take advantage of this fact? Of course not.
I did take advantage of bookmarks, however. They act as convenient pointers (because document positions and named ranges seemed stupid). When I get to the description of the data structures written to disk, we'll see how they're used to reference files.
At a more abstract level, this is a flat file system. Directories are an illusion. Paths can be any JavaScript string, but it makes me feel more comfortable to use slashes and pretend I have directories, so that's what I do in my example code. It also makes searching for files a little easier: I implemented a function that searches the drive for files based on a regular expression. You can't really do that quickly without a flat file system, can you? Ha!
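The idea is simple enough to sketch. This isn't the project's actual code - it assumes you've already pulled the list of every path out of the directory table (described below) - but a flat namespace means "search" is just one pass with a RegExp:

```javascript
// Sketch only: regex search over a flat namespace.
// `paths` is assumed to be the array of every path in the directory table.
function findFiles(paths, pattern) {
  var regex = new RegExp(pattern);
  return paths.filter(function (path) {
    return regex.test(path);
  });
}

// e.g. findFiles(paths, '^/images/.*\\.jpg$');
```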
The directory data structure in the document header is a JSON object containing an array of file entries that map paths to bookmark IDs, plus some mostly useless metadata about when the file system was created. Each bookmark points at a body paragraph whose text is a JSON object representing a file: its contents plus some metadata (creation time, modification time, and a placeholder for a permissions system).
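To make that concrete, here's roughly what the two shapes look like. The field names and layout are illustrative, not copied from the project:

```javascript
// Directory object stored in the document header (field names illustrative):
var directory = {
  created: 1455064800000,                // the mostly useless creation-time metadata
  files: [
    { path: '/images/photo.jpg', bookmark: 'id.abc123xyz' }  // path -> bookmark ID
  ]
};

// File object serialized into the bookmarked body paragraph:
var file = {
  created: 1455064800000,
  modified: 1455065900000,
  permissions: null,                     // placeholder, never implemented
  contents: '...LZString-compressed UTF-16 string...'
};
```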
I guess you could call it a non-fragmenting file system, in that it's literally impossible to have a fragmented file. I guess if you screw up the directory listing somehow, you can end up with dead data on the disk that can never be removed. I haven't written a utility to get rid of it, but doing so would be trivial: find all of the bookmarks that aren't in the directory table and delete them along with the paragraphs they're in. The system also has the notion of inodes, given that the data is separated from the directory that points to it: paragraphs are inodes, bookmarks are pointers. So you could have hard links here as well, which would also be sorta easy to implement. But it's hard to draw too many more parallels between this file system and a real one, because it doesn't have to deal with actual hardware structure or physical space.
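A sketch of what that cleanup utility could look like, since it doesn't exist in the project. It assumes the illustrative directory shape from above and that every file lives in a plain paragraph (no list items):

```javascript
// Sketch: delete any bookmarked paragraph the directory no longer points to.
function removeOrphans(doc, directory) {
  var live = {};
  directory.files.forEach(function (entry) { live[entry.bookmark] = true; });

  doc.getBookmarks().forEach(function (bookmark) {
    if (live[bookmark.getId()]) return;   // still referenced by the directory
    // Walk up from the bookmark's position to its containing paragraph.
    var element = bookmark.getPosition().getElement();
    while (element.getType() !== DocumentApp.ElementType.PARAGRAPH) {
      element = element.getParent();
    }
    bookmark.remove();
    element.removeFromParent();           // drop the dead file's paragraph
  });
}
```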
Another important thing you have to keep in mind when you write a file system is how efficient it is with disk space. Google Docs are not infinite - you have a space limit of "Up to 1.02 million characters." So we have to squeeze as much storage as we can out of each individual character. To that end, we use a small library called LZString to compress strings into UTF-16 strings. Google Docs happily stores these and we get massive compression ratios. It's slower, but it's worth it.
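The compression step itself is just a pair of calls; roughly this (a sketch - the surrounding read/write plumbing belongs to the file system and isn't shown). compressToUTF16 and decompressFromUTF16 are the LZString functions that produce output safe to store as document text:

```javascript
// Compress on the way in, decompress on the way out.
var stored = LZString.compressToUTF16(contents);      // UTF-16 output Docs will store
// ... write `stored` into the file's paragraph ...
var restored = LZString.decompressFromUTF16(stored);  // back to the original string
```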
In my tests, I store some small strings, a large passage of Lorem Ipsum, and a 1008x756 JPEG in the file system. I had to convert the JPEG to Base64 because the compression library only works with strings, and I'm still not 100% sure how to convert it back. Uncompressed, all of this was too big for the Docs editor to handle, so I dumped the character counts to the log instead:
Uncompressed: 226302 characters in 4.531 seconds
Compressed: 107853 characters in 76.396 seconds
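(As an aside, the JPEG-to-Base64 step was just Apps Script's Utilities service. A sketch of the round trip is below - the file ID is hypothetical, and the decode half is the part I still haven't verified.)

```javascript
// Sketch: JPEG <-> Base64 string via the Utilities service.
function jpegToString(fileId) {
  var blob = DriveApp.getFileById(fileId).getBlob();
  return Utilities.base64Encode(blob.getBytes());
}

function stringToJpeg(base64, name) {
  var bytes = Utilities.base64Decode(base64);
  return Utilities.newBlob(bytes, 'image/jpeg', name);
}
```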
Now, I know what you're thinking. The performance drop-off is horrible. And you're right. The performance drop-off is horrible. You get about half of the space back at a cost of roughly 17 times the time. This is the fault of the compression library (obviously), and I'll quote its website here:
For performance comparison, I use LZMA level 1 as a comparison point.

- For strings smaller than 750 characters, this program is 10x faster than LZMA level 1. It produces smaller output.
- For strings smaller than 100 000 characters, this program is 10x faster than LZMA level 1. It produces bigger output.
- For strings bigger than 750 000 characters, this program is slower than LZMA level 1. It produces bigger output.

The string library doesn't like bigger strings. But that's okay, the file system is useless anyway. You can disable compression by replacing the methods in the LZString library I use with the identity function; I have them commented out in the LZString.gs file.
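The swap looks something like this (a sketch - it assumes the file system calls the compressToUTF16/decompressFromUTF16 pair, and your LZString.gs may name things differently):

```javascript
// Sketch: turn compression off by swapping in identity functions.
LZString.compressToUTF16 = function (s) { return s; };
LZString.decompressFromUTF16 = function (s) { return s; };
```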
You can view the project here. I don't think you have to sign in - that's what the permissions dialog box says. And if you do, you shouldn't need to request access; it's probably just to run the code. If you even want to. I don't know why you'd want to. But here it is, in existence.
Also, if you want to see the contents on disk, here's what it looks like compressed: