===============================================================================
Process Checkpointing and Restarting using the core dump

http://www.geocities.com/asimshankar/chekpointing/
README Last Updated: March 1, 2005
VERSION: 1.1
===============================================================================
This README contains information on installing and running
this process checkpointing and restarting system.

For details on how it works, refer to the doc/ directory, the source, and the
website.

If you have any questions, contact the author (see AUTHORS file).

Also, make sure you read the COPYING file for some licensing issues.

CONTENTS:
---------
	0. CHANGELOG
	I. INSTALLATION
		A. Requirements
		B. Building
	II. EXAMPLE RUNS
		A. fprintf - Checkpointing along with file descriptors
		B. linklist - Checkpointing with gdb
	III. LIMITATIONS

===============================================================================
0. CHANGELOG
===============================================================================

[Version 1.1 - March 01, 2005]
- Fixed the "Could not read name of note #1" error. The problem arose because
  the kernel rounds up the "name" and "description" fields of the Elf32_Note
  structure to a multiple of 4 bytes (see the functions notesize() and
  writenote(), called by elf_core_dump() in fs/binfmt_elf.c of the kernel
  sources), and I wasn't taking this into account.

- Modified the fprintf example a bit: if the filename supplied is "-", no file
  is created and the numbers are just dumped to stderr.

[Version 1.0 - April 19, 2003]
- Initial release

===============================================================================
I. INSTALLATION
===============================================================================

---------------
A. Requirements
---------------
	1. Linux, kernel 2.4 or above
	   (It seems kernel 2.2 and below don't really support the mmap2 system
	   call the way I use it)
	   (Seems to work with 2.6 as well)
	2. gcc
	3. gdb 5.2 [Optional]
	   (We need the gcore command to be implemented. I know gdb 5.0 doesn't
	   recognize the gcore command)

-----------
B. Building
-----------
	1. Unpack the tarball (which you have probably done already)
	2. Execute "make"
	   In case of any trouble, contact the author (see AUTHORS file)
	3. Installation is now complete: the restart utility and some
	   example programs are in bin/, and a library used to
	   checkpoint file descriptors is in lib/


===============================================================================
II. EXAMPLE RUNS
===============================================================================

In this section we demonstrate how a process can be checkpointed and then
restarted, using some of the example programs provided. The source of these
example programs is available in examples/.

NOTE: In the following, "$" stands for the shell (bash) prompt.

NOTE: Some other examples might be present in the examples/ directory but
      are not documented here. Play around with them.

-------------------------------------------------------
A. fprintf  - Checkpointing along with file descriptors
-------------------------------------------------------
bin/fprintf is a program that takes as input a filename and a number, and then
prints all numbers from 1 to the given number into the given file, sleep()ing
for 2 seconds between prints. The program uses the standard C library's
fprintf() function, which may not write to the file immediately but buffers
its output first.

For checkpointing file descriptors, you need to add the libsavefds.so library
to the LD_PRELOAD environment variable.

$ cd bin
$ ulimit -c unlimited

[This is for Bourne-style shells such as bash. It raises the maximum size that
a core dump is ALLOWED to take. Often this limit is set to zero, in which case
cores are not dumped. The csh equivalent is "limit coredumpsize unlimited".]

$ export LD_PRELOAD=../lib/libsavefds.so
$ fprintf

[Now, send the process a SIGQUIT signal using Ctrl+'\' or the kill command]

Checkpoint information is in core.<pid> and information on the open file 
descriptors is in a file called 'filedescriptors'. This is a simple text file 
with lines of the format:
<fd> : <filename> : <offset>

You can edit this file too. If you wish to restart the process on another
computer where the file is in a different location, just change the filename
in the corresponding line.

Now, to restart:
$ restart -f -n -w fprintf <core filename>

And it was as if the process never stopped!

NOTE: Run
$ unset LD_PRELOAD
when you're done. Above we gave a relative path to the library, so it will not
be found once you change directories. If you give a fully qualified path, you
won't have to worry about this.


------------------------------------
B. linklist - Checkpointing with gdb
------------------------------------

bin/linklist is an example that takes as input a number from the user, then
creates a linked list of nodes containing integers from 0 to the supplied
number. If you give a large number, a lot of memory is allocated from the heap.
The program then prints out all the numbers onto stdout.

Here we demonstrate how to checkpoint a process using gdb's gcore command.

$ cd bin
$ gdb linklist
(gdb) break 29
(gdb) run
[Enter a large number, say 4000]
(gdb) cont 500
(gdb) gcore core.linklist
(gdb) quit (say yes to the confirmation question)

What just happened here is that the linklist program was run, and a linked
list of 4000 nodes was to be created.

break 29 inserted a breakpoint at the source line 29 
(use list to see the source within gdb).

cont 500 told gdb to continue execution of the program until it had passed the
breakpoint 500 times. Source line 29 is where the linked list is built, so
after cont 500 we had a list of 500 nodes; the remainder hadn't been created
yet.

gcore core.linklist created a core dump file of the process state at that
moment (building the linked list, 500 of 4000 nodes created). THIS IS OUR
CHECKPOINT.

Now, we will restart the program from this point. As a result, the rest of the
linked list will be created and all numbers will be printed on stdout, just as
would have happened had we not checkpointed the process.

To do this, do
$ restart -n -w linklist core.linklist

And voila, it was as if the process never stopped!
You can experiment with checkpointing the program at different states
(for example, break at line 37 instead of 29 and allow a few numbers
to be printed onto stdout, then use gcore to checkpoint and restart to see
only the remaining numbers printed).

NOTE: You can also use gdb to checkpoint a RUNNING process. Use:
gdb <executable filename> <process id>
and then use gcore to checkpoint the running process.

===============================================================================
III. LIMITATIONS
===============================================================================

The way things work as of now, there are some restrictions on the processes
that can be successfully restarted from a checkpoint. Some things that come to
mind:

* Processes that use the dlopen() call to open dynamic libraries CANNOT be
restarted as of now.
* LD_PRELOAD must be the same when restart is called as it was when the
checkpoint was made.
* Signal handlers are NOT restored.
* Processes that use mmap() to map files into their address space CANNOT be
restarted as of now.

===============================================================================
===============================================================================
