
Challenges of Scale: Solving a Linux Kernel Bug

As an engineering-first organization, we have a natural tendency to observe how our systems behave at their limits. This allows us to uncover issues that few other users of the same technologies have had the chance to explore. One such issue even led us to discover a bug in the Linux kernel, the core of a widely used operating system.

For those less familiar, a kernel is the central part of an operating system, responsible for how the software and hardware on the system interact. The Linux kernel is among the most scrutinized codebases in the world, as it sits at the core of many enterprise systems. After a thorough investigation, we solved a problem in which a file system would, under certain circumstances, create invisible files that could not be deleted by conventional means. What started off as an unexplainable error led to a deep dive into the internals of the Linux kernel itself.

How a standard cleanup job unveiled a Linux kernel bug

The issue began when our servers started reporting errors while attempting to delete a directory, typically a trivial operation. Specifically, a clean-up job on one of our ad servers was regularly reporting that it was unable to delete supposedly empty directories on a tmpfs mount.

Every minute. Same ad server. Same directory. Even worse, identical symptoms started appearing on other ad servers over the next few days. With no discernible cause and many more servers around the world available to fall victim, the situation was becoming troubling.

Digging deeper into the issue

Given that the problem wasn’t going away on its own, we logged on to one of the problem servers to further investigate. Here is what we noticed:

  • The bad directories were indeed empty  (ls -a /some/directory)
  • There were no open file handles under the bad directories (lsof +D /some/directory)
  • Manually trying to remove the bad directories gave the same error (rm -rf and rmdir)
  • Moving the bad directories to another directory gave the same error (mv /some/directory /other/directory)
  • Renaming the bad directories within the same directory was possible (mv /some/directory /some/ghost_directory)

Suppressing the error by renaming the directories allowed us to exclude them from the clean-up job's checks, which seemed like an easy stopgap until we had the resources to conduct a deeper investigation. We applied it to the few affected servers and monitored the situation to see whether the problem would stay confined to exceptional circumstances.

One month later, three percent of our ad servers were afflicted with these unexplainable, empty yet somehow not-empty directories. While the effects were benign, not knowing the root cause worried us: it might have been a symptom of something more serious.

After a deeper investigation, we found the following clues (see the short check sketched after this list):

  • Empty directories on a tmpfs mount take up 40 bytes and an additional 20 bytes per file contained within them
  • Our “ghost” directories had a size of 60 bytes
  • The inode numbers of our “ghost” directories were very close to 2³²
    • Inode numbers for newly created files and directories on a tmpfs mount seemed to increase sequentially
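
These numbers are easy to check. A minimal Python sketch along the lines of what we ran (the path below is a placeholder for one of the renamed directories):

    import os

    # Placeholder path; substitute one of the renamed "ghost" directories on the tmpfs mount.
    path = "/some/ghost_directory"

    st = os.stat(path)
    print(f"size : {st.st_size} bytes")  # 40 bytes for a truly empty tmpfs directory, 60 for ours
    print(f"inode: {st.st_ino}")         # suspiciously close to 2**32 on the affected servers
    print(f"left : {2**32 - st.st_ino} inode numbers until the 32-bit boundary")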

Since 2³² is close to the largest inode number that can be assigned, we wondered what would happen if the file system tried to assign an inode number beyond this limit. Typically, when the variable holding a number exceeds its maximum value, it wraps around to its minimum value, which in this case would be 0. Interestingly enough, research online suggested that a file with an inode number of 0 would be ignored by certain file systems. What if this was the cause of the “ghost” directories?
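
To make the wrap-around concrete, here is a tiny illustration; Python integers never overflow on their own, so we mask to 32 bits to mimic a 32-bit counter:

    # Emulate a 32-bit unsigned counter like the one handing out default inode numbers.
    MASK = 0xFFFFFFFF               # 2**32 - 1, the largest value a 32-bit counter can hold

    counter = 0xFFFFFFFF            # the counter sitting at its maximum value
    next_value = (counter + 1) & MASK
    print(next_value)               # prints 0, the inode number the next file would receive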

To prove this, we wrote a script to create and delete files on a tmpfs mount, keeping the inode numbers close to the 2³² boundary and stopping file creation once they were slightly past it.
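
We can't reproduce the original script here, but a rough sketch of the idea looks like this; the mount point, file names, and margins are placeholders, and on a real system the first loop runs for hours:

    import os

    MOUNT = "/mnt/tmpfs_test"   # hypothetical tmpfs mount used only for this experiment
    BOUNDARY = 2**32            # a 32-bit inode counter wraps back to 0 past this value

    # Phase 1: burn through inode numbers by repeatedly creating and deleting a
    # throwaway file until the numbers being handed out sit just below the boundary.
    scratch = os.path.join(MOUNT, "scratch")
    ino = 0
    while ino < BOUNDARY - 5:
        fd = os.open(scratch, os.O_CREAT | os.O_WRONLY)
        ino = os.fstat(fd).st_ino
        os.close(fd)
        os.unlink(scratch)

    # Phase 2: keep a handful of files that straddle the boundary; on an affected
    # kernel, one of them is assigned inode number 0 and vanishes from listings.
    target = os.path.join(MOUNT, "ghost_dir")
    os.mkdir(target)
    for i in range(1, 11):
        path = os.path.join(target, f"file_{i}")
        with open(path, "w"):
            pass
        print(path, os.stat(path).st_ino)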

Nine hours later, we had a directory with a file missing from its listing: there should have been a file named file_3, which had seemingly been assigned an inode number of 0. Trying to remove this directory produced the same error we had seen with the “ghost” directories. Even more interesting, the invisible file itself could be opened, edited, and deleted, as long as the filename was used explicitly.

(Screenshot: the file looks like it’s there, but its inode number isn’t recognized.)
(Screenshot: confirmation that it has an inode number of 0.)

With strong evidence that our hypothesis was correct, we dug into the Linux kernel code to confirm that there was no protection against integer overflow when assigning inode numbers to tmpfs files. We found that the default inode number for any new inode was taken from a globally incremented counter. Most file systems ignore this default and use their own inode number allocation algorithm, but the tmpfs implementation does not. This confirmed our suspicions and made us wonder how the issue had never been noticed before. The answer was likely the combination of conditions needed to trigger it:

  • Have a system running long enough such that 2³² (~4.3 billion) files are created
  • Have the 4.3 billionth file be created on a tmpfs mount
  • Try to delete the 4.3 billionth file indirectly, e.g. by deleting its parent directory

Filing a bug report and creating a workaround

Our first action was to file a bug report with Red Hat so that the issue could be fixed at the source. In the meantime, now that we had a complete understanding of what caused the “ghost” directories, we were able to put an automated workaround in place.
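
The workaround itself is specific to our clean-up tooling, but the sizing clue from earlier (40 bytes for an empty tmpfs directory, roughly 20 more per entry) suggests one way such directories could be detected automatically. A rough Python sketch, with a placeholder mount point:

    import os

    EMPTY_TMPFS_DIR_SIZE = 40   # bytes reported by a truly empty directory on tmpfs

    def find_ghost_directories(root):
        """Yield directories that list as empty but whose reported size says otherwise."""
        for dirpath, dirnames, filenames in os.walk(root):
            if not dirnames and not filenames:
                size = os.stat(dirpath).st_size
                if size > EMPTY_TMPFS_DIR_SIZE:
                    yield dirpath, size

    # Hypothetical tmpfs mount point watched by the clean-up job.
    for path, size in find_ghost_directories("/mnt/tmpfs_cache"):
        print(f"possible ghost directory: {path} ({size} bytes)")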

What started out as a mysterious error resulted in a deep investigation, some crazy theories and a satisfying patch to the Linux kernel. This is not the only time we’ve uncovered issues with a widely used codebase due to the sheer scale at which we use it, and as Index Exchange continues to grow, we will undoubtedly have more such issues to overcome.
