
A peek at the DragonFly Virtual Kernel (part 1)


In this article, we will describe several aspects of the architecture of DragonFly BSD's virtual kernel infrastructure, which allows the kernel to be run as a user-space process. Its design and implementation are largely the work of the project's lead developer, Matthew Dillon, who first announced his intention of modifying the kernel to run in userspace on September 2nd, 2006. The first stable DragonFly BSD version to feature virtual kernel (vkernel) support was DragonFly 1.8, released on January 30th, 2007.

The motivation for this work (as can be found in the initial mail linked to above) was finding an elegant solution to one immediate and one long-term issue in pursuing the project's main goal of Single System Image clustering over the Internet. First, as anyone who is familiar with distributed algorithms will attest, implementing cache coherency without hardware support is a complex task. It would not be made any easier by enduring a 2-3 minute delay in the edit-compile-run cycle while each machine goes through the boot sequence. As a nice side effect, userspace programming errors are unlikely to bring the machine down, and one has the benefit of working with superior debugging tools (and can more easily develop new ones).

The second, long-term, issue that virtual kernels are intended to address is finding a way to securely and efficiently dedicate system resources to a cluster that operates over the (hostile) Internet. Because a kernel is a more or less standalone environment, it should be possible to completely isolate the process a virtual kernel runs in from the rest of the system. While the problem of process isolation is far from solved, there exist a number of promising approaches. One option, for example, would be to use systrace (refer to [Provos03]) to mask out all but the few (and hopefully carefully audited) system calls that the vkernel requires after initialization has taken place. This setup would allow for a significantly higher degree of protection for the host system in the event that the virtualized environment was compromised. Moreover, the host kernel already has well-tested facilities for arbitrating resources, although these facilities are not necessarily sufficient or dependable; the CPU scheduler is not infallible and mechanisms for allocating disk I/O bandwidth will need to be implemented or expanded. In any case, leveraging preexisting mechanisms reduces the burden on the project's development team, which can't be all bad.

Preparatory work

Getting the kernel to build as a regular, user-space ELF executable required tidying up large portions of the source tree. In this section we will focus on the two large sets of changes that took place as part of this cleanup. The second set might seem superficial and hardly worthy of mention as such, but in explaining the reasoning that led to it, we shall discuss an important decision that was made in the implementation of the virtual kernel.

The first set of changes was separating machine-dependent code into platform- and CPU-specific parts. The real and virtual kernels can be considered to run on two different platforms; the first runs (only, as must reluctantly be admitted) on 32-bit PC-style hardware, while the second runs on a DragonFly kernel. Regardless of the differences between the two platforms, both kernels expect the same processor architecture. After the separation, the cpu/i386 directory of the kernel tree is left with hand-optimized assembly versions of certain kernel routines, headers relevant only to x86 CPUs, and code that deals with object relocation and debug information. The real kernel's platform directory (platform/pc32) is familiar with things like programmable interrupt controllers, power management, and the PC BIOS (which the vkernel doesn't need), while the virtual kernel's platform/vkernel directory happily uses the system calls that the real kernel can't have. Of course this does not imply that there is absolutely no code duplication, but fixing that is not a pressing problem.

The massive second set of changes primarily involved renaming quite a few kernel symbols so that they do not clash with the libc ones (e.g. *printf(), qsort(), errno, etc.) and using kdev_t for the POSIX dev_t type in the kernel. As should be plain, this was a prerequisite for having the virtual kernel link with the standard C library. Given that the kernel is self-hosted (that is, since it cannot generally rely on support software after it has been loaded, the kernel includes its own helper routines), one can question the decision to pull in all of libc instead of simply adding the (few) system calls that the vkernel actually uses. A controversial choice at the time, it prevailed because it was deemed that it would allow future vkernel code to leverage the extended functionality provided by libc. In particular, thread awareness in the system C library should accommodate the (medium-term) plan to mimic multi-processor operation by using one vkernel thread for each hypothetical CPU. It is safe to say that if that plan materializes, linking against libc will prove to be a most profitable tradeoff.

The Virtual Kernel

In this section, we will study the architecture of the virtual kernel and the design choices made in its development, focusing on its differences from a kernel running on actual hardware. In the process, we'll need to describe the changes made in the real (host) kernel code, specifically in order to support a DragonFly kernel running as a user process.

Address Space Model

The first design choice made in the development of the vkernel is that the whole virtualized environment executes as part of the same real-kernel process. This imposes well-defined limits on the amount of real-kernel resources that may be consumed by it and makes containment straightforward. Processes running under the vkernel are not in direct competition with host processes for CPU time, and most of the bookkeeping that is expected from a kernel during the lifetime of a process is handled by the virtual kernel. The alternative[1], running each vkernel process[2] in the context of a real-kernel process, imposes an extra burden on the host kernel and requires additional mechanisms for effective isolation of vkernel processes from the host system. That said, the real kernel still has to deal with some amount of VM work and reserve some memory space that is proportional to the number of processes running under the vkernel. This statement will be made clear after we examine the new system calls for the manipulation of vmspace objects.

In the kernel, the main purpose of a vmspace object is to describe the address space of one or more processes. Each process normally has one vmspace, but a vmspace may be shared by several processes. An address space is logically partitioned into sets of pages, so that all pages in a set are backed by the same VM object (and are linearly mapped on it) and have the same protection bits. All such sets are represented as vm_map_entry structures. VM map entries are linked together both by a tree and a linked list so that lookups, additions, deletions and merges can be performed efficiently (with low time complexity). Control information and pointers to these data structures are encapsulated in the vm_map object that is contained in every vmspace (see the diagram below).

[diagram]

A VM object (vm_object) is an interface to a data store and can be of various types (default, swap, vnode, ...) depending on where it gets its pages from. The existence of shadow objects somewhat complicates matters, but for our purposes this simplified model should be sufficient. For more information you're urged to have a look at the source and refer to [McKusick04] and [Dillon00].
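
To help keep these relationships straight, here is a heavily simplified C sketch of how the structures described above fit together. The field names and layout are illustrative assumptions, not the actual DragonFly definitions, which live under sys/vm in the source tree and carry considerably more state.

    #include <stdint.h>

    /* Illustrative sketch only; field names are assumptions. */
    struct vm_object;                      /* interface to a data store: default, swap, vnode, ... */

    struct vm_map_entry {                  /* a set of pages with common backing and protection */
        uintptr_t start, end;              /* address range covered by this entry */
        int prot;                          /* protection bits shared by the whole range */
        struct vm_object *object;          /* backing object; the pages map onto it linearly */
        struct vm_map_entry *next;         /* entries are linked in a list ... */
        struct vm_map_entry *left, *right; /* ... and in a tree, for efficient lookups */
    };

    struct vm_map {                        /* control information plus the entry list/tree */
        struct vm_map_entry *entries;
        int nentries;
    };

    struct vmspace {                       /* the address space of one or more processes */
        struct vm_map vm_map;
    };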

In the first stages of the development of the vkernel, a number of system calls were added to the kernel that allow a process to associate itself with more than one vmspace. The creation of a vmspace is accomplished by vmspace_create(). The new vmspace is uniquely identified by an arbitrary value supplied as an argument. Similarly, the vmspace_destroy() call deletes the vmspace identified by the value of its only parameter. It is expected that only a virtual kernel running as a user process will need access to alternate address spaces. Also, it should be made clear that while a process can have many vmspaces associated with it, only one vmspace is active at any given time. The active vmspace is the one operated on by mmap()/munmap()/madvise()/etc.
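
As a rough illustration of how a vkernel might use these calls, consider the sketch below. The prototypes shown are assumptions inferred from the description above (the article does not give them), and struct vproc is a hypothetical bookkeeping structure; the authoritative interface is documented in DragonFly's vmspace(2) manual page.

    /* Assumed prototypes; the id is an arbitrary, caller-chosen value. */
    int vmspace_create(void *id, int type, void *data);
    int vmspace_destroy(void *id);

    struct vproc;                      /* hypothetical per-vproc bookkeeping structure */

    /* Create an alternate address space for a new vproc, using the
     * address of its bookkeeping structure as the unique identifier. */
    static int
    vproc_alloc_vmspace(struct vproc *vp)
    {
        return vmspace_create((void *)vp, 0, NULL);
    }

    /* Dispose of the alternate address space when the vproc terminates. */
    static void
    vproc_free_vmspace(struct vproc *vp)
    {
        vmspace_destroy((void *)vp);
    }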

The virtual kernel creates a vmspace for each of its processes and it destroys the associated vmspace when a vproc is terminated, but this behavior is not compulsory. Since, just like in the real kernel, all information about a process and its address space is stored in kernel memory[3], the vmspace can be disposed of and reinstantiated at will; its existence is only necessary while the vproc is running. One can imagine the vkernel destroying the vproc vmspaces in response to a low memory situation in the host system.

When it decides that it needs to run a certain process, the vkernel issues a vmspace_ctl() system call with VMSPACE_CTL_RUN as the command (currently there are no other commands available), specifying the desired vmspace to activate. Naturally, it also needs to supply the necessary context (values of general-purpose registers, instruction/stack pointers, descriptors) in which execution will resume. The original vmspace is special; if, while running on an alternate address space, a condition occurs which requires kernel intervention (for example, a floating point operation throws an exception or a system call is made), the host kernel automatically switches back to the previous vmspace, handing over the execution context at the time the exceptional condition caused entry into the kernel and leaving it to the vkernel to resolve matters. Signals from other host processes are likewise delivered after switching back to the vkernel vmspace.
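
Schematically, the vkernel's dispatch path for a vproc might then look like the loop below. VMSPACE_CTL_RUN and vmspace_ctl() are real, but the prototype, the trapframe handling, and the helper names are simplified assumptions for the sake of illustration.

    struct trapframe;                  /* register context passed in and handed back */

    /* Assumed prototype; the real one also carries extended machine state,
     * and VMSPACE_CTL_RUN comes from the host system's headers. */
    int vmspace_ctl(void *id, int cmd, struct trapframe *tf, void *extframe);

    /* Hypothetical helper and return code standing in for the vkernel's
     * real trap/syscall dispatch machinery. */
    int vkernel_handle_trap(struct vproc *vp, struct trapframe *tf);
    #define VPROC_EXITED 1

    static void
    vproc_run(struct vproc *vp, struct trapframe *tf)
    {
        for (;;) {
            /* Switch to the vproc's vmspace and resume execution with the
             * supplied register context.  The call returns once the vproc
             * traps, makes a system call, or a host signal arrives; at that
             * point we are back in the vkernel's original vmspace and tf
             * describes the vproc's state at the time of the event. */
            vmspace_ctl((void *)vp, VMSPACE_CTL_RUN, tf, NULL);

            /* Let the virtual kernel service the exception or system call. */
            if (vkernel_handle_trap(vp, tf) == VPROC_EXITED)
                break;
        }
    }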

Support for creating and managing alternate vmspaces is also available to vkernel processes. This requires special care so that all the relevant code sections can operate in a recursive manner. The result is that vkernels can be nested, that is, one can have a vkernel running as a process under a second vkernel running as a process under a third vkernel and so on. Naturally, the overhead incurred for each level of recursion does not make this an attractive setup performance-wise, but it is a neat feature nonetheless.

The previous paragraphs have described the background of vkernel development and have given a high-level overview of how the vkernel fits in with the abstractions provided by the real kernel. We are now ready to dive into the most interesting parts of the code, where we will get acquainted with a new type of page table and discuss the details of FPU virtualization and vproc <-> vkernel communication. But this discussion needs an article of its own, therefore it will have to wait for a future week.

Bibliography

[McKusick04] The Design and Implementation of the FreeBSD Operating System, Kirk McKusick and George Neville-Neil.

[Dillon00] Design elements of the FreeBSD VM system, Matthew Dillon.

[Lemon00] Kqueue: A generic and scalable event notification facility, Jonathan Lemon.

[AST06] Operating Systems Design and Implementation, Andrew Tanenbaum and Albert Woodhull.

[Provos03] Improving Host Security with System Call Policies, Niels Provos.

[Stevens99] UNIX Network Programming, Volume 1: Sockets and XTI, Richard Stevens.

Notes

[1]

There are of course other alternatives, the most obvious being to have one process for the virtual kernel and another for contained processes, which is mostly equivalent to the choice made in DragonFly.

[2]

A process running under a virtual kernel will also be referred to as a "vproc" to distinguish it from host kernel processes.

[3]

The small matter of the actual data belonging to the vproc is not an issue, but you will have to wait until we get to the RAM file in the next subsection to see why.

Index entries for this article

Guest Articles: Economopoulos, Aggelos