Saturday, June 16, 2007

Controlling your Linux system processes

All modern operating systems are able to run many programs at the same time. For example, a typical Linux server might include a Web server, an email server, and probably a database service. Each of these programs runs as a separate process. What do you do if one of your services stops working? Here are some handy command-line tools for managing processes.

Each process uses time on a system's CPU, as well as other system resources such as memory and disk space. If a program goes wrong, it can start to use too much CPU time or memory and so deny other programs the resources they need to run.

Knowing how to manage rogue processes is an essential part of Linux system management. To help, turn to command-line tools such as ps, top, service, kill, and killall.

ps

ps shows the current processes running on the machine. ps has many options, but one of the most useful invocations is ps aux, which shows every process on the system.

A normal Linux server may have 100 processes running after boot up, so the output from the ps command can be quite long. Here are the first few lines from my CentOS 5 test machine:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 10308 668 ? S 15:03 0:00 init [5]
root 2 0.0 0.0 0 0 ? S 15:03 0:00 [migration/0]
root 3 0.0 0.0 0 0 ? SN 15:03 0:00 [ksoftirqd/0]
root 4 0.0 0.0 0 0 ? S 15:03 0:00 [watchdog/0]
root 5 0.0 0.0 0 0 ? S

Here is a brief explanation of each of the columns:

USER is the name of the user that owns the process.
Each process has a unique process ID (or PID for short).
%CPU shows the CPU utilization of the process. It is the CPU time used divided by the time the process has been running, expressed as a percentage.
%MEM is the percentage of physical memory the process is using.
VSZ shows the virtual memory size of the process in kilobytes.
RSS is similar to VSZ, but rather than virtual memory size, RSS shows how much non-swapped, physical memory the process is using in kilobytes.
TTY is the controlling terminal.
STAT is the status of the process, where S means the process is sleeping and can be woken at any time, N means the process has a low priority, and < means the process has a high priority. Other letters to watch for are l, which means the process is multi-threaded, and R, which means the process is running.
START shows when the process was started.
TIME is the accumulated CPU time. This includes time spent running the process and time spent in the kernel on behalf of that process.

For a complete explanation see the ps man page.
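
As a rough illustration of the STAT letters described above, the following shell sketch spells out a STAT value in words. The function name decode_stat is made up for this example, and the s and + entries come from the ps man page rather than from the list above. Any other letter is simply reported as-is.

```shell
# decode_stat STAT: print a plain-English reading of a ps STAT value.
# decode_stat is a hypothetical helper written for this article.
decode_stat() {
    stat=$1
    out=""
    while [ -n "$stat" ]; do
        rest=${stat#?}          # everything after the first character
        c=${stat%"$rest"}       # the first character on its own
        case $c in
            S)   word="sleeping" ;;
            R)   word="running" ;;
            N)   word="low priority" ;;
            '<') word="high priority" ;;
            l)   word="multi-threaded" ;;
            s)   word="session leader" ;;   # from the ps man page
            '+') word="foreground" ;;       # from the ps man page
            *)   word="other ($c)" ;;       # letters not covered here
        esac
        out="${out:+$out, }$word"
        stat=$rest
    done
    printf '%s\n' "$out"
}
```

For example, decode_stat SN reads the ksoftirqd/0 line above as a sleeping, low-priority process.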

Finding a specific process in such a long list can be a problem. To help, you can use the grep command to look for matches in the text. For example, to look for the sendmail process, use the command:

ps aux | grep sendmail
root 2401 0.0 0.4 66444 2064 ? Ss 15:04 0:00 sendmail: accepting connections
smmsp 2409 0.0 0.3 53040 1752 ? Ss 15:04 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
gary 3807 0.0 0.1 60224 700 pts/2 R+ 15:17 0:00 grep sendmail

When you run it, the grep command itself will be shown (in this case PID 3807) as it matches the string we are looking for, namely sendmail. But of course it isn't part of the sendmail service.
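
One way to keep grep out of its own results is to filter on the COMMAND column instead of the whole line. The sketch below assumes the 11-column ps aux layout shown earlier; match_command is a hypothetical helper name, not a standard tool.

```shell
# match_command NAME: read "ps aux" output on stdin and print the lines
# whose COMMAND column contains NAME, skipping any grep process.
# Assumes COMMAND starts at field 11, as in the ps aux output above.
match_command() {
    awk -v pat="$1" '{
        # Rebuild the COMMAND column from field 11 onward.
        cmd = $11
        for (i = 12; i <= NF; i++) cmd = cmd " " $i
        # Plain substring match, ignoring lines where the program is grep.
        if (index(cmd, pat) > 0 && $11 != "grep") print
    }'
}
```

It would be used as ps aux | match_command sendmail. A common shortcut with the same effect is ps aux | grep '[s]endmail': the bracketed pattern still matches sendmail but no longer matches grep's own command line.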

top

While ps shows only a snapshot of the system's processes, the top program provides a dynamic, real-time view of the system. It displays a system summary (with CPU usage, memory usage, and other statistics) as well as a list of running processes that changes dynamically as the system is in use. It lists the processes using the most CPU first.

The first few lines of top look something like this:

top - 15:18:00 up 54 min, 0 users, load average: 0.00, 0.10, 0.11
Tasks: 115 total, 2 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.0%sy, 0.0%ni, 99.0%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 467888k total, 458476k used, 9412k free, 15264k buffers
Swap: 3204884k total, 0k used, 3204884k free, 222108k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
554 root 15 0 229m 9940 4548 S 0.7 2.1 0:10.29 Xorg
1 root 15 0 10308 668 552 S 0.0 0.1 0:00.11 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0

The bottom part of the output is similar to the output from the ps command. In the top part, the Swap: line is useful for checking how much swap space is being used. For more information see the top man page.
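
If you want the swap figure without reading the screen, a small filter can pull it out of top's summary. This is only a sketch: it assumes the Swap: line is laid out exactly as in the sample above (other versions of top format the summary differently), and swap_used_kb is a name invented for this example.

```shell
# swap_used_kb: read top output on stdin and print the "used" figure
# from the Swap: line, in kilobytes, without the trailing "k".
# Assumes the summary format shown in the sample output above.
swap_used_kb() {
    awk '$1 == "Swap:" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^used/) {
                value = $(i - 1)        # the number just before "used,"
                sub(/k$/, "", value)    # drop the unit suffix
                print value
                exit
            }
        }
    }'
}
```

top can be run once in batch mode to feed it: top -b -n 1 | swap_used_kb.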

service

The easiest way to start and stop services such as sendmail or the Apache Web server from the command line is to use the service command. Each service provides a script for easily starting and stopping the service.

To discover the status of a service, type service sendmail status. This should output something similar to:

sendmail (pid 4660 4652) is running...

If you want to shut down a running sendmail, you can type service sendmail stop. To start it again, use service sendmail start. To stop and restart sendmail, use service sendmail restart.

If you can't stop a running or rogue service using the service command then you may need to resort to the kill and killall commands.

kill and killall

The kill command attempts to shut down a running process. In Linux, a process is stopped when the operating system sends it a signal telling it to shut down. The default signal for kill is TERM (signal 15), meaning software terminate. A process that receives this signal should shut down in an orderly way. If the process has become rogue, chances are that it won't respond to being told politely to shut down. In that case you have to send the KILL signal (signal 9), which a process cannot catch or ignore. So to kill off a running process (e.g. process 1234) we would use kill -9 1234.
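
The polite-then-forceful sequence can be wrapped in a small function. The following is a minimal sketch, not a standard tool: stop_process and its five-second default are assumptions made for this example. It sends TERM first and resorts to KILL only if the process is still there when the timeout expires.

```shell
# stop_process PID [TIMEOUT]: send TERM, wait up to TIMEOUT seconds
# (default 5), then send KILL if the process is still alive.
# stop_process is a hypothetical helper written for this article.
stop_process() {
    pid=$1
    timeout=${2:-5}
    # Ask politely first; if the signal cannot be sent, the process
    # has already gone (or is not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    # kill -0 sends no signal; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid ignored TERM, sending KILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With this in place, stop_process 1234 gives process 1234 a chance to exit cleanly before falling back to the equivalent of kill -9 1234.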

The killall command kills running processes by name rather than by PID. This brings two immediate advantages. First, to kill a process we don't need to look for the PID using the ps command. Second, if there are multiple processes with the same name (as is the case with the Apache Web server) then all the processes will be killed in one fell swoop. As with kill, killall takes a signal parameter, and -9 forcibly terminates the processes. So to kill off all the Apache processes you would use killall -9 httpd.

Restarting an unresponsive Web server

Let's look at an example of how to use these commands to solve a real-life problem. If you find that your Web server has stopped responding and needs to be restarted, first try the service command. The start/stop script for your Web server should be able to get it running again. For Apache on CentOS 5 we would type:

service httpd restart

If that fails, next try the killall command to eliminate the old instance of the Web server:

killall -9 httpd

Run ps to check that all the Apache processes died:

ps aux | grep httpd

If there are any strays, kill them off individually with the kill command. Finally, restart the Web server with:

service httpd start
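
The steps above could be strung together in a script along the following lines. Treat it as a sketch under assumptions: restart_service is a made-up name, the service and its processes are assumed to share one name (as httpd does here), and strays are found with ps -C, which selects processes by command name, instead of by reading grep output by eye.

```shell
# restart_service NAME: try a normal restart, and fall back to
# killall, a stray check, and a fresh start if that fails.
# Assumes the service name and the process name are the same.
restart_service() {
    name=$1
    # Step 1: let the service's own script try first.
    if service "$name" restart; then
        return 0
    fi
    # Step 2: eliminate the old instance by name.
    killall -9 "$name" 2>/dev/null || true
    sleep 1
    # Step 3: kill any strays individually by PID.
    for pid in $(ps -C "$name" -o pid=); do
        kill -9 "$pid" 2>/dev/null || true
    done
    # Step 4: start a fresh instance.
    service "$name" start
}
```

It would be called as restart_service httpd, and it returns the exit status of the last service command it ran.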

A friend of mine recently had a problem with the fetchmail process. Fetchmail is a program that fetches mail from external mail servers and pulls it down onto the local server. One morning he discovered that his system was running slowly. A quick use of the top command revealed that the fetchmail process was using 99% of the system memory. He noted the fetchmail process's PID, then killed the process and restarted it using the service command. The memory was freed and the system sprang back to life.

Conclusion

You should monitor your system to ensure that none of your processes have gone haywire. One simple method is to permanently run a terminal window with the top command. A quick glance every so often will assure you that all is OK. If something does start to go bad, Linux provides useful tools to stop and restart processes. Only rarely will a full system reboot be needed.
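
For an occasional automated glance, something like the following could flag heavy processes from ps output. It is a sketch only: flag_hogs and the default limit of 50 are assumptions for this example, and it relies on the ps aux column order shown earlier (%CPU in column 3, %MEM in column 4).

```shell
# flag_hogs [LIMIT]: read "ps aux" output on stdin and print the PID,
# %CPU, %MEM and command name of any process above LIMIT percent
# (default 50) in either column. The header line is skipped.
flag_hogs() {
    awk -v limit="${1:-50}" '
        NR > 1 && ($3 + 0 > limit || $4 + 0 > limit) {
            print $2, $3, $4, $11
        }'
}
```

Running ps aux | flag_hogs 80 prints nothing on a healthy system, and would have listed a process like the runaway fetchmail described above.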


