[Information Coding Laboratory]

Suggestions on running long jobs

Reducing the impact on other users

To minimize the impact on the responsiveness of the system, run the program with a high 'nice' level (the highest is usually 19), which corresponds to a low priority. See the web page description of nice and nohup, and the man pages on 'nice' and 'renice', for details. You can also change the nice level of a running process using the program 'top'. With a high nice level, the program should not interfere with someone actively using the system, but it will still get all of the CPU when the machine would otherwise be idle.

Also consider the memory usage of the program. If the program uses a significant portion of the physical memory, say to within 10-15 Mbytes of the physical memory size, then the performance of the machine will probably be severely degraded when someone is running X windows and a few applications (X plus a few applications can easily take up more than 10-15 Mbytes). The degradation occurs despite the high nice level, because the nice level only affects CPU scheduling and the machine still has to swap memory in and out to disk. Keep this in mind when writing simulations: once the machine starts swapping, performance drops sharply. If at all possible, design the simulation so that it does not need a large portion of physical memory at any given time.

If you really need to use large blocks of memory, another idea is to stop the program during normal working hours and let it run at night, when there is less impact on the other users. You can stop a program without killing it by sending it the stop signal (i.e. kill -STOP pid) and restart it later by sending the continue signal (i.e. kill -CONT pid). This could be automated with cron jobs that stop the process in the morning and continue it at night.
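
The nice level can also be set from inside the program, so that a long job is never started at normal priority by accident. The following is only a minimal sketch, assuming a POSIX system; `lower_priority` is a hypothetical helper name chosen for illustration and is not part of any code referenced on this page. It uses setpriority() and getpriority() rather than nice() because their error reporting is easier to check.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: set the calling process to the given nice level
 * (19 is usually the highest level, i.e. the lowest priority).
 * Returns the nice level now in effect, or -1 on failure. */
int lower_priority(int level)
{
    /* who == 0 with PRIO_PROCESS means "the calling process". */
    if (setpriority(PRIO_PROCESS, 0, level) != 0) {
        perror("setpriority");
        return -1;
    }

    /* getpriority() can legitimately return -1 as a nice level, so
     * errno has to be cleared first and checked afterwards. */
    errno = 0;
    int now = getpriority(PRIO_PROCESS, 0);
    if (now == -1 && errno != 0) {
        perror("getpriority");
        return -1;
    }
    return now;
}
```

Calling `lower_priority(19)` once at the top of main(), before the simulation starts, should have the same effect as starting the program with `nice -n 19`. An ordinary user can raise the nice level this way but normally cannot lower it again afterwards.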

Improving the odds of getting output from the program

The other area I'll mention is getting results from a simulation. If the program will take a long time to finish, it is important to have some mechanism for obtaining partial results. Otherwise, if the power fails 5 minutes before the end of a simulation that has been running for a week, you may have to start over from the beginning unless you have some idea where it left off.

The most likely cause of a program terminating early is that the machine needs to be rebooted for maintenance or because of software problems. You can have the program periodically dump some state information to a file to guard against a power failure. When the machine is shut down, the program will be sent the TERM signal, so it is possible to catch this signal and dump state information at that point before terminating. You could also have the program catch other signals, such as HUP, so that it dumps state and continues. This allows you to get state information from the program at any time with 'kill -HUP pid' without killing the program.

I have some example code which implements this signal catching/state dumping concept. There are two source files and a header file in the example code:

state_lib.c contains functions which could be placed in a library and linked with simulation programs to allow registering and dumping state values. Its header file is state_lib.h.

test_term.c contains a simple example showing how the library routines are used.

A compressed tar file containing all the source files as well as a makefile is also provided: state_lib.tgz.


© 1997 Information Coding Laboratory

Last Updated: $Date: 1997/11/21 23:04:52 $