LKSM: How It Works
LKSM migrates the state of a
running system to a replacement system with no downtime and minimal
service interruption. A video demo of LKSM performing a migration
is also available.
First, a replacement server is connected to the running server over a network link. For best performance, the migration should run over a separate NIC from the one handling the server's normal LAN traffic. A crossover cable is ideal for this.

The target system boots into a special
environment running in "crashkernel" space. It is vital that the
source system be booted with the same "crashkernel" parameter on the
command line to reserve this space and ensure that it is not
overwritten on the target. The only device active on the target
system is the migration NIC.
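
The requirement that both machines use the same "crashkernel" parameter can be checked before starting. The following is a minimal sketch, not part of LKSM itself: the helper names are hypothetical, and it assumes you have each machine's kernel command line as a string (on Linux it can be read from /proc/cmdline).

```python
def crashkernel_setting(cmdline):
    """Return the value of the crashkernel= parameter, or None if absent."""
    prefix = "crashkernel="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


def same_reservation(source_cmdline, target_cmdline):
    """True only if both command lines carry an identical crashkernel= value."""
    src = crashkernel_setting(source_cmdline)
    tgt = crashkernel_setting(target_cmdline)
    return src is not None and src == tgt
```

For example, same_reservation("ro crashkernel=64M@16M", "crashkernel=64M@16M quiet") is true, while a source booted without the parameter fails the check.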
Once the target system is ready, the migrationer application is run on the source system to begin the migration. The migrationer loads the appropriate kernel support module and invokes it to communicate with the target. One of the first things it does is copy all of memory (other than the reserved crashkernel space) over to the target system.
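
To illustrate what "all of memory other than the reserved crashkernel space" means, here is a small sketch that splits one memory range around a reserved region. It is an illustration only: the function name and the flat single-range memory layout are assumptions, and real systems have several discontiguous regions.

```python
def ranges_to_copy(mem_start, mem_end, reserved_start, reserved_end):
    """Return half-open (start, end) ranges of memory outside the reserved region."""
    ranges = []
    # Memory below the reserved region, if any.
    if reserved_start > mem_start:
        ranges.append((mem_start, min(reserved_start, mem_end)))
    # Memory above the reserved region, if any.
    if reserved_end < mem_end:
        ranges.append((max(reserved_end, mem_start), mem_end))
    return ranges
```

With memory spanning 0 to 1024 (in MiB, say) and a reservation from 16 to 80, the ranges to copy are (0, 16) and (80, 1024).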

Note that this copy is performed while
the system is running. The problem with this, of course, is that
the memory state is a moving target: while the copy is taking place,
other processes on the system are dirtying and modifying pages.
In order to be able to perform the migration on a running system (and
thus keep service interruption to a minimum), we have special code in
the VM subsystem to keep track of pages that are "re-dirtied" after we
have copied them or whose state has otherwise changed. After the
initial memory copy, LKSM performs a series of "brownout" passes where
the changed pages are copied to the target system.
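
The tracking idea can be sketched in user-space terms. This is a toy model under stated assumptions, not the actual VM subsystem code: the DirtyTracker class and its method names are hypothetical, and pages are represented by plain integers.

```python
class DirtyTracker:
    """Toy model of tracking pages that change after they were copied."""

    def __init__(self, num_pages):
        # Before the initial copy, every page still needs to be sent.
        self.pending = set(range(num_pages))

    def mark_written(self, page):
        # Called whenever a page is modified; it must be sent (again).
        self.pending.add(page)

    def take_pending(self):
        # Hand the current set to the copier and start a fresh one,
        # so writes made during this pass land in the next pass.
        batch, self.pending = self.pending, set()
        return sorted(batch)


tracker = DirtyTracker(8)
initial_copy = tracker.take_pending()   # all eight pages
for page in (2, 5, 5):                  # writes arriving while the copy runs
    tracker.mark_written(page)
brownout_pass = tracker.take_pending()  # only the re-dirtied pages
```

Here the initial copy covers pages 0 through 7, and the first brownout pass only has to resend pages 2 and 5.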


After several brownout passes, the set
of pages that still need to be copied is small enough that we can
suspend system operation to perform this final copy. This phase is
known as "blackout" and typically lasts less than a minute.
During blackout the PCI devices are removed and the remaining pages are
copied to the target system along with information about the CPU state
at the moment of blackout.
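
The progression from the initial copy through brownout passes to blackout can be sketched as a simple loop. This is a simulation under assumptions: the blackout_threshold and max_passes parameters are hypothetical (the text above says only that the remaining set becomes small enough after several passes), and the pages written during each pass are supplied as input rather than observed.

```python
def simulate_migration(num_pages, writes_per_pass, blackout_threshold, max_passes=10):
    """Return a log of (phase, pages_copied) tuples.

    writes_per_pass[i] is the set of pages modified while pass i runs.
    """
    pending = set(range(num_pages))
    log = []
    for pass_no in range(max_passes):
        if len(pending) <= blackout_threshold:
            break  # few enough pages left to copy with the system suspended
        log.append(("initial" if pass_no == 0 else "brownout", len(pending)))
        # Pages written during this pass must be sent again in the next one.
        if pass_no < len(writes_per_pass):
            pending = set(writes_per_pass[pass_no])
        else:
            pending = set()
    # System suspended: copy whatever is left (CPU state would go here too).
    log.append(("blackout", len(pending)))
    return log


log = simulate_migration(
    num_pages=1000,
    writes_per_pass=[set(range(200)), set(range(40)), set(range(5))],
    blackout_threshold=10,
)
```

In this run the log records an initial copy of 1000 pages, brownout passes of 200 and 40 pages, and a blackout copy of the last 5. The cap on passes matters in the sketch because a workload that dirties pages faster than they can be copied would otherwise never reach the threshold.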


After the blackout pass is complete,
operations cease on the source system. On the target, the crash
kernel jumps to some "trampoline" code in main memory which then
completes the migration by adding the PCI devices and resuming CPU
state. Operation then resumes where it left off on the source
system.