LKSM: How It Works

LKSM (Live Kernel Self Migration) migrates the state of a running system to a replacement system without taking the system down and with only a brief service interruption. A video demo of LKSM performing a migration is available.

First, a replacement server (the target) is connected to the running server (the source) over a network. For best performance, the migration should run over a separate NIC from the one handling the server's normal LAN traffic. A crossover cable is ideal for this.

The target system boots into a special environment running in "crashkernel" space. It is vital that the source system be booted with the same "crashkernel" parameter on the command line to reserve this space and ensure that it is not overwritten on the target. The only device active on the target system is the migration NIC.
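As a rough illustration of the "same crashkernel parameter" requirement, the sketch below compares the crashkernel= value on two kernel command lines. The helper names and the example command lines are made up for this illustration and are not part of LKSM.

```python
from typing import Optional

PREFIX = "crashkernel="


def crashkernel_param(cmdline: str) -> Optional[str]:
    """Return the value of the crashkernel= parameter, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith(PREFIX):
            return token[len(PREFIX):]
    return None


def reservations_match(source_cmdline: str, target_cmdline: str) -> bool:
    """True only if both command lines carry an identical crashkernel= value."""
    src = crashkernel_param(source_cmdline)
    dst = crashkernel_param(target_cmdline)
    return src is not None and src == dst


# Hypothetical command lines, for illustration only.
source = "root=/dev/sda1 ro crashkernel=128M@16M quiet"
target = "root=/dev/sda1 ro crashkernel=128M@16M"
print(reservations_match(source, target))
```

On a Linux system the command line of the running kernel can be read from /proc/cmdline, so a check along these lines could be run on the source before starting a migration.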

Once the target system is ready, the migrationer application is run on the source system to begin the migration. The migrationer loads the appropriate kernel support module and invokes it to communicate with the target. One of the first things it does is copy all of memory (other than the reserved crashkernel space) over to the target system.
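The initial copy can be pictured with a toy model in which memory is a list of pages and the reserved crashkernel space is a range of page numbers that is skipped. This is only a sketch: the function names and the half-open range convention are assumptions made for illustration, and the real copy happens in the kernel support module over the migration NIC.

```python
from typing import Dict, Iterator, List


def pages_to_copy(total_pages: int, reserved_start: int, reserved_end: int) -> Iterator[int]:
    """Yield page numbers outside the half-open reserved range [start, end)."""
    for page in range(total_pages):
        if reserved_start <= page < reserved_end:
            continue  # the target's migration environment lives here; leave it alone
        yield page


def initial_copy(source: List[bytes], reserved_start: int, reserved_end: int) -> Dict[int, bytes]:
    """Copy every non-reserved page and return the target's view of memory."""
    return {
        page: source[page]
        for page in pages_to_copy(len(source), reserved_start, reserved_end)
    }


# Eight tiny "pages"; pages 2 and 3 stand in for the reserved crashkernel space.
source_memory = [bytes([i]) * 4 for i in range(8)]
target_memory = initial_copy(source_memory, 2, 4)
print(sorted(target_memory))  # [0, 1, 4, 5, 6, 7]
```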

Note that this copy is performed while the system is running. The problem with this, of course, is that the memory state is a moving target: while the copy is taking place, other processes on the system are modifying pages. To be able to perform the migration on a running system (and thus keep service interruption to a minimum), we have special code in the VM subsystem that keeps track of pages that are "re-dirtied" after we have copied them, or whose state has otherwise changed. After the initial memory copy, LKSM performs a series of "brownout" passes in which the changed pages are copied to the target system.
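The brownout idea can be sketched as a small simulation: a toy memory records which pages are written, and each pass copies only the pages dirtied since the previous pass while a workload keeps writing. The class, the function names, and the scripted workload are hypothetical; the real tracking lives in the VM subsystem.

```python
from typing import Callable, List, Set


class TrackedMemory:
    """Toy memory that records which pages were written since the last drain."""

    def __init__(self, num_pages: int) -> None:
        self.pages: List[int] = [0] * num_pages
        self.dirty: Set[int] = set()

    def write(self, page: int, value: int) -> None:
        self.pages[page] = value
        self.dirty.add(page)

    def drain_dirty(self) -> Set[int]:
        """Return the current dirty set and start tracking a fresh one."""
        dirty, self.dirty = self.dirty, set()
        return dirty


def brownout_pass(mem: TrackedMemory, target: List[int],
                  workload: Callable[[TrackedMemory], None]) -> int:
    """Copy pages dirtied since the last pass; the workload runs in between copies."""
    to_copy = mem.drain_dirty()
    for page in sorted(to_copy):
        target[page] = mem.pages[page]
        workload(mem)  # the system keeps running, so pages may be re-dirtied
    return len(to_copy)


mem = TrackedMemory(8)
target = list(mem.pages)   # stands in for the initial full copy
mem.write(1, 11)           # writes that landed during the initial copy
mem.write(2, 22)

scripted_writes = iter([(3, 7), (5, 9), (3, 8)])


def workload(m: TrackedMemory) -> None:
    """Perform the next scripted write, if any remain."""
    nxt = next(scripted_writes, None)
    if nxt is not None:
        m.write(*nxt)


copied_per_pass = []
while True:
    copied = brownout_pass(mem, target, workload)
    if copied == 0:
        break
    copied_per_pass.append(copied)

print(copied_per_pass)      # [2, 2, 1]
print(target == mem.pages)  # True
```

The toy workload eventually stops writing, so the loop can run until nothing is dirty. A live system never goes quiet like this, which is why a final pass with the system suspended is needed.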

After several brownout passes, the set of pages that still need to be copied is small enough that we can suspend system operation to copy them. This phase is known as "blackout" and typically lasts less than a minute. During blackout the PCI devices are removed and the remaining pages are copied to the target system, along with information about the CPU state at the moment of blackout.
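One way to reason about "small enough" is to estimate how long the remaining pages would take to transfer and compare that with an acceptable pause. The sketch below assumes a fixed page size, a steady link bandwidth, and a pause budget, all of which are illustrative parameters; it ignores the time needed to remove PCI devices and transfer CPU state, and it is not LKSM's actual criterion.

```python
def estimated_transfer_seconds(dirty_pages: int, page_size: int, bytes_per_second: float) -> float:
    """Time to push the remaining dirty pages over the migration link."""
    return dirty_pages * page_size / bytes_per_second


def should_enter_blackout(dirty_pages: int, page_size: int,
                          bytes_per_second: float, max_pause_seconds: float) -> bool:
    """True when the remaining copy fits inside the allowed pause."""
    return estimated_transfer_seconds(dirty_pages, page_size, bytes_per_second) <= max_pause_seconds


PAGE_SIZE = 4096        # assumed page size in bytes
LINK_BPS = 100_000_000  # assumed usable bandwidth: 100 MB/s
MAX_PAUSE = 1.0         # assumed pause budget in seconds

# 500,000 dirty pages is about 2 GB: keep doing brownout passes.
print(should_enter_blackout(500_000, PAGE_SIZE, LINK_BPS, MAX_PAUSE))  # False
# 20,000 dirty pages is about 82 MB: short enough to suspend and copy.
print(should_enter_blackout(20_000, PAGE_SIZE, LINK_BPS, MAX_PAUSE))   # True
```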

After the blackout pass is complete, operations cease on the source system. On the target, the crash kernel jumps to "trampoline" code in main memory, which completes the migration by adding the PCI devices and restoring the CPU state. Operation then resumes on the target where it left off on the source system.

News :

View a video demo of LKSM in action!

Links :

Sourceforge Project Page
LKSM talk, Part 1
LKSM talk, Part 2
LKSM talk, Part 3

Live Kernel Self Migration (LKSM)
