By: Scott Courtney Monday, January 22, 2001 08:42:21 AM EST URL: http://www.linuxplanet.com/linuxplanet/tutorials/2926/1/
Reiserfs will soon become the first journaled file system to be bundled as part of the standard Linux kernel tree. What is a journaled file system, how does Reiserfs fit into that category, and why should you care that it's about to become part of the Linux core?
Let's start with a discussion of filesystems in general. If you are coming to Linux from a Windows or DOS environment, then you have been using filesystems already — you just haven't called them that. A filesystem can mean either a specific disk drive or partition, or it can mean, in a more general way, the internal format of how the data is organized on a mass storage device. For example, you have a root filesystem on your Linux machine and perhaps another filesystem for /home, and another for /opt, and so on. Each of these corresponds to a partition on a disk drive. Other directories that are underneath these may not necessarily have their own disk partition, so they aren't filesystems.
On the other hand, we use the term "filesystem" to represent the particular way that the data is stored and how the operating system keeps track of it. Information such as the date of a file's creation and last modification, which user and group own it, what permissions are granted for reading and modifying the file, how large it is, and where it is located on the drive or partition is all part of the filesystem's responsibility. If the file itself is "data" then all these other items are "data about the data" and they are collectively called "metadata." So any filesystem must manage all the files and all of their metadata.
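If you want to see some metadata for yourself on a Linux system, the shell makes it easy (the file name here is just an example):

ls -l /etc/passwd     # permissions, owner, group, size, and modification time
ls -li /etc/passwd    # the same information, plus the inode number the filesystem uses internally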
In Windows, the most common filesystems are File Allocation Table (FAT) and its newer flavors such as FAT32 and VFAT. FAT is a holdover from the dark ages of DOS and is very primitive internally. To be fair, it was created in the days of 8- and 16-bit computers and single-tasking operating systems, and it was as complex as the systems of the day could really support. Windows NT introduced a much more sophisticated filesystem called NTFS which is more reliable, faster, and capable of supporting extremely large drives and partitions. NTFS, by the way, is quite similar to its ancestor, the High Performance File System (HPFS) from IBM's OS/2 operating system.
FAT and its variants support essentially no user-level security, and they are extremely vulnerable to data corruption after a system crash. Given the poor reliability of Windows 95 and 98, the fact that their filesystems don't do a good job of recovering from crashes is a recipe for disaster! The problem isn't so much that the files may get corrupted, but rather that the metadata about those files can be corrupted.
Here's an example of how this happens: Suppose you have a word processing document, which you open in the application and to which you add some content. If the machine crashes before you save the file, you have lost all your changes but your original file will still be okay. If the machine crashes after you save the file, then you really haven't lost anything except the time it takes to reboot and reload the program. But what happens if the machine crashes during the exact moment when the disk is being written?
The answer is, "Things get very ugly." Since the new version of the file is physically overwriting all or part of the old version, the data can have some of each at the moment the drive stops writing. You end up with a file that you can't open because the internal format of its data is inconsistent with what the application expects.
This gets even worse if the drive was writing the metadata areas, such as the directory itself. Now instead of one corrupted file, you have one corrupted filesystem — in other words, you can lose an entire directory or all the data on an entire disk partition. On a large system this can mean hundreds of thousands of files. Even if you have a backup, restoring such a large amount of data can take a long time.
Most PC operating systems have no good way to prevent the loss of a single file that was being written during a system failure. Modern systems, such as Linux, OS/2, and NT, however, do make an attempt to prevent and recover from the horrible metadata corruption case. To accomplish this, the system performs an extensive filesystem analysis during bootup. Well-designed filesystems often incorporate redundant copies of critical metadata, so that it is extremely unlikely for that data to be completely lost. The system figures out where the corrupt metadata is, and then either repairs the damage by copying from the redundant version or simply deletes the file or files whose metadata is affected. Losing files this way is bad, but it is much better than losing the whole partition.
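On ext2, that boot-time analysis is essentially the e2fsck program. You can run the same kind of full check by hand on an unmounted partition, substituting a device name appropriate to your system (the one shown here is only an example):

e2fsck -f /dev/hda1    # force a full check even if the filesystem is marked clean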
Unfortunately, such an extensive diagnostic analysis requires a great deal of time. Even on a very fast PC, a large and heavily-used partition can require several minutes to check. Most of the time, however, the check is not really needed because the system was shut down normally, without a sudden crash. To prevent unnecessary delays, the operating system's normal shutdown process puts a status flag on the filesystem as it is unmounted, marking it as a "clean" filesystem. If a crash occurs, the system never gets the chance to mark the filesystem as "clean" and the bootup process knows that it needs to run the extensive filesystem tests just to be safe. A filesystem that has not been shut down cleanly is called, appropriately enough, a "dirty" filesystem.
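On ext2 you can inspect this flag yourself with the tune2fs utility (run as root; the device name is again just an example):

tune2fs -l /dev/hda1 | grep -i 'filesystem state'    # reports "clean" or "not clean"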
Modern filesystem designs, such as OS/2's HPFS, NT's NTFS, and Linux's popular ext2, do a very good job of implementing the things discussed in the previous section. If you have a system crash, it may take a great deal of time to check the metadata during bootup but the odds are good that you will still have all your files when it's done. As Linux begins to take on more complex applications, on larger servers, and with less tolerance for downtime, there is a need for more sophisticated filesystems that do an even better job of protecting data and metadata. The journaled filesystems now available for Linux are the answer to this need.
It's important to note here that we are talking about journaled filesystems in general. There are a number of such systems available, including "xfs" from Silicon Graphics, "Reiserfs" from The Naming System Venture, "ext3" currently hosted at Red Hat, and "Journaled File System" from IBM. In this article, I use "journaled filesystem" in lower case to mean the generic type of system, as opposed to the capitalized version that refers specifically to IBM's software. You can find links to all of these projects in the references attached to this article.
No matter which journaled filesystem is used, there are certain principles that always apply. The term "journaled" means that the filesystem maintains a log or record of what it is doing to the main data areas of the disk, so that if a crash occurs it can re-create anything that was lost. That can be a little confusing, so let's take a closer look at this process.
In public speaking classes, there is an old saying that goes, "Tell them what you're going to tell them, then tell them, then tell them what you told them." This is similar to what the journal does in a filesystem. When the system is about to alter the metadata, it first makes an entry in the journal saying, "Here is what I'm going to change." Then it makes the change. Finally, it goes back to the journal and either marks that change as "completed" or simply deletes the journal entry entirely. There are variations on this sequence, and other ways to accomplish the same thing, but this simplified view will suffice for our purposes.
The idea is that the system can crash at any point in this process but that such a crash won't have a lasting effect. If the crash happens before the first journal entry, then the original data is still on the disk. You lost your new changes, but you didn't lose the file in its previous state. If the crash happens during the actual disk update, you still have the journal entry showing what was supposed to have happened. So when the system reboots, it can simply replay the journal entries and complete the update that was interrupted, or it can back out a partially completed update to restore the file's previous state. In either case, you have valid data and not a trashed partition.
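As a very rough sketch of the write-ahead idea only (a real filesystem does this inside the kernel, at the level of disk blocks rather than shell commands, and the file names here are made up):

echo "replace /data/config with /data/config.new" > /data/.journal
sync                                  # make sure the intent record reaches the disk first
mv /data/config.new /data/config      # perform the actual change
rm /data/.journal                     # mark the operation as completed
sync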
These concepts will be familiar to anyone who works with SQL databases and their transaction logic. Replaying and completing an operation that was interrupted is called "roll forward" and backing out such an operation to its previous, consistent state is called "roll back." Ideas that were developed to prevent lost data in SQL databases are also valuable on regular mass storage devices. That is the real benefit of journaled filesystems.
Now that you understand the need for journaled filesystems, we can take a look at one particular type, called Reiserfs. Originally designed by Hans Reiser, Reiserfs carries the analogy between databases and filesystems to its logical conclusion. In essence, Reiserfs treats the entire disk partition as if it were a single database table. Directories, files, and file metadata are organized in an efficient data structure called a "balanced tree." This differs somewhat from the way in which traditional filesystems operate, but it offers large speed improvements for many applications, especially those which use lots of small files.
Reading and writing of large files, such as CDROM images, is often limited by the speed of the disk hardware or the I/O channel, but access to small files such as shell scripts is often limited by the efficiency of the filesystem design. The reason for this is that opening a file requires the system first to locate the file, and that means reading directories off the disk. Furthermore, the system needs to examine the security metadata to see if the user has permission to access the file, and that means additional disk reads. The system can literally spend more time deciding whether to allow the access, and then locating the data on the drive, than it does actually reading such a small amount of information from the file itself.
Reiserfs uses its balanced trees to streamline the process of finding the files and retrieving their security (and other) metadata. For extremely small files, the entire file's data can actually be stored physically near the file's metadata, so that both can be retrieved together with little or no movement of the disk seek mechanism. If an application needs to open many small files rapidly, this approach significantly improves performance.
Another feature of Reiserfs is that the balanced tree stores not just metadata, but also the file data itself. In a traditional filesystem such as ext2, space on the disk is allocated in blocks ranging in size from 512 bytes to 4096 bytes, or even larger. If a file's size happens to be anything other than an exact multiple of the block size, space will be wasted. For example, suppose the block size is 1024 bytes but you need to store a file that is 8195 bytes long. Eight blocks is 8192 bytes, so almost all of the file will fit into eight blocks. The remaining three bytes have their own block, which is mostly empty! The wasted space is almost one whole block out of nine, or about 11 percent. Now imagine a file 1025 bytes long. It almost, but not quite, fits into one block, so it requires two. The wasted space is nearly 50 percent. The worst case is a very tiny file, such as a trivial (but useful) one-line shell script. Such a file may be only 50 bytes or so (for example) and would fit into just one block. But if the block is 1024 bytes, then the file has wasted about 95 percent of its allocated space. As you can see, the wasted space (as a percentage) is smaller if the files are larger.
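You can see this overhead for yourself by comparing a small file's logical size with the space actually allocated to it (the file name is an example, and the exact numbers depend on your filesystem's block size):

echo 'echo "Hello, world"' > tiny.sh    # a one-line script of roughly twenty bytes
ls -l tiny.sh                           # the logical size, in bytes
du -k tiny.sh                           # the allocated space, rounded up to whole blocks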
Reiserfs doesn't use a traditional block approach to allocating space, instead relying on the tree structure to keep track of exact byte counts. On small files, this can save a lot of storage space. Furthermore, since more files are placed closer together, the system is able to open and read many small files with just one physical access to the drive. This further improves performance by eliminating time-consuming head seek operations.
Some applications benefit more than others from this type of optimization. Imagine a directory with hundreds of tiny PNG or GIF files used as web page icons, on a busy site. This situation is tailor-made for something like Reiserfs. Likewise, a web site with thousands of HTML files, each just a few kilobytes in size, is an excellent candidate. On the other hand, a disk partition that stores ISO9660 CDROM images, each hundreds of megabytes in size, will see little performance gain from Reiserfs. As with so many other things in the world of computing, the best performance is gained by matching the right tool with the job at hand. (Note that I'm not saying Reiserfs is slower than ext2 on large files — only that there won't be much difference in some cases.)
On top of everything else, Reiserfs is a true journaled filesystem like xfs, ext3, and IBM's JFS. Each of these systems implements the journaling feature in a different way, but the effect is the same: extremely good reliability, and extremely fast recovery after an abrupt shutdown or crash. On my system, I have found that filesystems that took several minutes to check using ext2 take only a second or two under Reiserfs. This difference is typical of any journaled filesystem versus a traditional filesystem.
For all its benefits, Reiserfs requires a bit of effort to install and configure. It is supplied as a set of kernel patches, which are applied to the kernel source code. After extracting the standard, generic kernel into a directory such as /usr/src/linux, you make that the current directory and run the patch command. If the Reiserfs patch is stored in /usr/local/src/reiserfs-patch.gz (note that the ".gz" means it's gzipped), then you would run a command like this:
zcat /usr/local/src/reiserfs-patch.gz | patch -p1
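For the cautious, recent versions of GNU patch also accept a --dry-run flag that reports whether the patch would apply cleanly without changing any files, so the full step might look like this:

cd /usr/src/linux
zcat /usr/local/src/reiserfs-patch.gz | patch -p1 --dry-run    # test first; nothing is modified
zcat /usr/local/src/reiserfs-patch.gz | patch -p1              # then apply the patch for real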
After that, you can configure the kernel as usual. There will now be a new option in the "filesystems" section that allows you to enable Reiserfs as either a built-in kernel feature or as a loadable module. You have to choose the built-in option if you want to boot the system from a Reiserfs partition.
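The exact commands depend on your kernel version and your habits, but a typical 2.4-era build sequence, run from the top of the kernel source tree, looks something like this:

make menuconfig          # enable Reiserfs in the filesystems section
make dep                 # regenerate dependency information
make bzImage modules     # build the kernel image and the modules
make modules_install     # install the modules; installing the kernel image itself varies by distribution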
Once the new kernel is built and tested, you can use the mkreiserfs utility to format partitions as Reiserfs filesystems. As with the regular mke2fs command, this will erase all the data on the partition! (By the way, after you build the kernel you have to go into a subdirectory of the kernel source and separately build the Reiserfs utilities, which include the mkreiserfs command. Instructions for doing this are found on the Reiserfs web site.) Once the desired Reiserfs partitions are formatted, you can use the mount command (with -t reiserfs options) to mount them just as you would with regular ext2 partitions. You can even put Reiserfs filesystems in the /etc/fstab file to have them mount automatically during system initialization.
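As a concrete sketch (the device name and mount point are examples only, and remember that mkreiserfs destroys any existing data on the partition):

mkreiserfs /dev/hdb1                  # format the partition as Reiserfs
mount -t reiserfs /dev/hdb1 /home     # mount it like any other filesystem

A matching /etc/fstab entry might look like this:

/dev/hdb1   /home   reiserfs   defaults   0 0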
Each time a new kernel version is installed, a new version of Reiserfs is required. The data on disk is upward-compatible so you don't have to reformat each time, but the filesystem code changes slightly. It takes a certain amount of care to ensure that you don't try to put the wrong Reiserfs version with the wrong kernel, and there is a slight delay between the release of a new kernel and the release of the Reiserfs patch for that kernel.
None of this is especially difficult if you have previous experience building kernels. The process is quite similar to applying the Alan Cox kernel patches, or the international crypto patches. Nonetheless, it can be quite daunting for someone new to Linux. It's especially tricky to create a Reiserfs root partition, since this requires temporarily booting off a backup partition as well as copying and recopying the files from the root partition. One false step, and you can be left with an unbootable machine and be forced to reinstall Linux from scratch or get help from a guru. It's certainly not something a new user should try!
Reiserfs isn't perfect, and has problems and limitations like any other software. Because it changes the conceptual way in which the disk is allocated and managed, Reiserfs doesn't work well with network file system (NFS) servers. There are some patches available to remedy part of the problem, but they don't completely solve it. Likewise, using software RAID to create fault-tolerant drive arrays doesn't work under Reiserfs (but hardware RAID is fine). As with any other piece of software, you have to look at Reiserfs in relation to your needs and the system's intended purpose, and then make a reasoned decision as to whether it's the right tool to use.
Performance gains under Reiserfs can be substantial, or can be minuscule, depending on what you are doing. I have found that Reiserfs is extremely responsive for most of my work, and I wouldn't want to live without it. Compiling source code, something that typically opens hundreds or thousands of files in rapid succession, really zooms. The biggest difference I have noticed is when using the find command to scan large directory trees. Scans that used to take thirty seconds or more now take just five or ten seconds. Copying large files takes just about the same amount of time as with ext2, though deleting unwanted files is significantly faster.
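If you want to try a similar comparison on your own machine, something as simple as this gives a rough feel for directory-scan speed (the results depend heavily on your hardware and on what is already cached in memory):

time find /usr -type f > /dev/null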
Traditional filesystems such as ext2 can be well-designed and reliable, and I certainly have found ext2 to be quite acceptable in the past. I have never, ever, lost a filesystem after a crash under ext2. Yet the long bootup delay while ext2 does its checking is annoying, especially on a test machine where crashes are more frequent because, well, it's a test machine. All things considered, I am thoroughly sold on journaled filesystems in general and Reiserfs is certainly a fine implementation.
A lot of people are very happy that Reiserfs is being added to the standard Linux kernel. Instead of being a separate, complex process that has to be done on a complete and working system, Reiserfs becomes a part of the normal installation process, just another option that can be selected in your favorite distribution's install tool.
The fast crash-recovery of journaled filesystems, such as Reiserfs, makes Linux more friendly toward novice users. I have seen new users, when faced with a system that sat for a minute or more at the "checking local filesystems…" message, decide that the machine is completely hung when in fact it's just very, very busy. They instinctively reach for the power switch or reset button, a habit that was probably acquired under Windows. OUCH! There is not much worse than killing power during a filesystem check, and if you didn't have disk corruption before, you probably do now! So a journaled filesystem makes Linux behave in a more intuitive way and makes it more forgiving of mistakes like accidentally hitting the power switch or reset button. And, let's face it, even advanced users don't enjoy waiting ten minutes for their systems to reboot.
Having world-class journaled filesystems in Linux also makes it more enterprise-ready for corporate deployments. We all know how seldom Linux crashes if properly installed, but in a major data center application even a few minutes of downtime once a year may be too much, and even a small risk of corrupt filesystems cannot be tolerated. Journaled filesystems bring Linux to parity with commercial UNIX-like systems such as Irix and AIX, and this can only help Linux in the corporate marketplace.
It will take time for the commercial distributions to catch up with the kernel, so that Reiserfs is an integral part of the installation. Yet it will happen, and when it does Linux will take another leap forward in usability.
There is, of course, no reason why this benefit is gained only with Reiserfs. The other journaled filesystems (xfs, JFS, and ext3) each have their own advantages, and each offers something the others do not. Reiserfs is the most widely-used right now, and it has the longest track record in the Linux world. Both xfs and JFS are Linux versions of proven commercial filesystems (on Irix and AIX, respectively) but they are still considered beta quality in their Linux incarnations, and their development teams still recommend against using them for production systems. ext3 simply adds the journaling capability to regular ext2, and as such it is less disruptive and potentially less risky — but it is still called a beta. Hopefully, all four of the journaled filesystems will eventually be part of the standard kernel, letting distribution vendors and users choose the right one for their individual needs.
Copyright © 1999 internet.com Corp. All Rights Reserved.