"The server's been slow all day. Fix it."
Sooner or later, it happens to everyone. Maybe the new marketing campaign doubled your server load. Maybe a new application is more demanding than you thought it would be. For whatever reason, the server is too slow, and you have to fix it.
But wait a minute - before you call your vendor, or hold a three-hour engineering meeting, or scour the net for tuning tips, take the time to monitor the server when it's slow. You should be able to see where the bottleneck is: CPU, memory, disk, or network. Knowing this from the start will make it easier to find a solution, and you'll waste less time chasing bottlenecks that aren't there.
The first step is to figure out when the server is slowing down. There's no point looking for a leak unless it's raining. Servers tend to see peak loads diurnally: there's a lot of load in the morning as everyone logs in for the first time, and another load peak in the afternoon or early evening. But other factors can interfere with this pattern: maybe your slowdown is caused by a monthly report, instead. There may be a pattern in the user complaints, or maybe the server is always slow.
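One way to find the pattern is to record the load average every few minutes for a week or so, then see which hours stand out. The Python sketch below is only an illustration, not a monitoring tool: the function names sample_now and busiest_hours are invented for this example, and os.getloadavg() is only available on Unix-like systems.

```python
import os
import time
from collections import defaultdict
from statistics import mean


def sample_now():
    """Return (hour_of_day, 1-minute load average) for the current moment.

    Assumes a Unix-like system; os.getloadavg() is not available on Windows.
    """
    return time.localtime().tm_hour, os.getloadavg()[0]


def busiest_hours(samples, top_n=3):
    """Return the top_n hours of the day, ranked by mean load, busiest first."""
    by_hour = defaultdict(list)
    for hour, load in samples:
        by_hour[hour].append(load)
    # Average the samples for each hour so one spike does not dominate.
    averages = {hour: mean(loads) for hour, loads in by_hour.items()}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]
```

Call sample_now() from a scheduled job, keep the results, and pass the collected pairs to busiest_hours. The hours it returns are the ones worth watching in person.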
Finding the Bottleneck
Once you've figured out when to monitor the server, look at one subsystem at a time. Let's start by ruling out the network.
If your server is on the Internet, chances are you have a limited-bandwidth pipe from your ISP, or metered bandwidth from a web hosting environment. Either way, they should give you a web page that shows your bandwidth consumption for the past day, week, month, and year. Take a glance at this, and look for trends and peaks.
Do your bandwidth peaks match up with your server slowdowns? Are you regularly using all of your available bandwidth?
If your server only communicates with 10BaseT and Fast Ethernet clients, then the network is less likely to be a bottleneck. But it's still worth looking at your network utilization - especially if you can't pin your performance problems on the CPU, memory, or disk.
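To answer the "are you using all of your bandwidth" question with a number, you can turn a byte count into a percentage of the link's capacity. This is a small arithmetic sketch; the function name and the figures in the usage note are hypothetical, not taken from any particular ISP's report.

```python
def link_utilization(bytes_transferred, seconds, link_bits_per_second):
    """Return average link utilization over the interval, as a percentage.

    bytes_transferred    -- bytes moved during the interval
    seconds              -- length of the interval in seconds
    link_bits_per_second -- nominal capacity of the link in bits per second
    """
    if seconds <= 0 or link_bits_per_second <= 0:
        raise ValueError("seconds and link_bits_per_second must be positive")
    bits = bytes_transferred * 8  # link speeds are quoted in bits, not bytes
    return 100.0 * bits / (seconds * link_bits_per_second)
```

For example, moving 11,250,000 bytes in 60 seconds over a 1.5 Mbit/s line works out to 100% utilization: the pipe is full. Sustained figures anywhere near that during your slow periods point at the network.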
Checking memory utilization is fairly easy. On Windows NT or 2000, press Ctrl-Alt-Delete to bring up the Task Manager. On Solaris or Linux, run the top program (Linux's /proc/meminfo file also contains useful information).
How much physical memory do you have, and how much is being used? Note that the amount of free memory can be a little deceptive in Linux, since the system tends to use all available free memory for caching: it's more reliable to look at /proc/meminfo, which will tell you the exact amount of RAM used as cache.
If all of your server's memory is in use during peak periods, your server may spend much of its time moving data between RAM and swap space on disk (the disk-backed part of virtual memory). The top program shows swap time on Solaris, and on Linux you can watch how much CPU time the kswapd kernel thread is using. If lots of time is being spent on swap, then memory is probably your bottleneck.
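The cache accounting can be done by hand, but a few lines of Python make it repeatable. This sketch assumes the "Name: value kB" layout of /proc/meminfo and the field names MemTotal, MemFree, Buffers, Cached, SwapTotal, and SwapFree; the function names are invented for this example, and lines in any other layout are simply skipped.

```python
def parse_meminfo(text):
    """Parse 'Name:  12345 kB' lines into a dict of name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that is not "Name: <number> ..." (e.g. header lines).
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(values):
    """Separate memory used by programs from memory used as cache (in kB)."""
    total = values["MemTotal"]
    free = values["MemFree"]
    cache = values.get("Buffers", 0) + values.get("Cached", 0)
    swap_used = values.get("SwapTotal", 0) - values.get("SwapFree", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "cache_kb": cache,
        "used_by_programs_kb": total - free - cache,
        "swap_used_kb": swap_used,
    }
```

On a Linux box you would feed it open("/proc/meminfo").read(). A small free_kb with a large cache_kb is normal; a large used_by_programs_kb together with a growing swap_used_kb is the pattern to worry about.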
While we're in Task Manager or top, let's take a look at CPU utilization. Does the server have idle time? Idle time of 10% or less generally indicates a CPU-bound application. Also, if you're looking at top, note the load average (roughly, the average number of processes running or waiting to run). Ideally, the load average should be less than or equal to the number of CPUs.
Note that both of these tools report system (or kernel) time, and user time. This is important: a server that spends all of its time in user tasks is doing work for user applications (such as web servers, Java applications, mail engines, etc.). Time spent on system or kernel tasks is more likely to be network I/O, disk I/O, virtual memory, or other core server activities. A CPU that's 100% busy with user tasks can usually be helped through application tuning or code changes. A CPU that's 100% busy with system tasks might benefit from system tuning, or might indicate a memory, disk, or network bottleneck.
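If you would rather log the user/system/idle split than watch it scroll by, Linux exposes the raw counters in the first line of /proc/stat. The sketch below assumes that line's usual layout (user, nice, system, idle, then optional extra counters) and compares two snapshots taken some seconds apart; the function names are hypothetical, chosen for this example.

```python
def cpu_times(stat_line):
    """Split the aggregate 'cpu' line of /proc/stat into broad categories."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # Use at most the first eight counters; any later ones are ignored.
    numbers = [int(value) for value in fields[1:9]]
    if len(numbers) < 4:
        raise ValueError("expected at least user, nice, system and idle")
    user, nice, system, idle = numbers[:4]
    return {
        "user": user + nice,        # time spent running user-level code
        "system": system,           # time spent in the kernel
        "idle": idle,
        "other": sum(numbers[4:]),  # extra counters on newer kernels, if any
    }


def cpu_breakdown(before, after):
    """Return the percentage of time in each category between two snapshots."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {key: 100.0 * value / total for key, value in delta.items()}


def load_exceeds_cpus(load_average, cpu_count):
    """True when the load average is above the number of CPUs."""
    return load_average > cpu_count
```

Take one snapshot, wait a few seconds, take another, and pass both to cpu_breakdown. An idle figure of 10 or less matches the CPU-bound rule of thumb above, and the user-versus-system split tells you which kind of tuning to look at first.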
One way to see if your disk is the bottleneck is to stand in front of the server when it's running slowly. If the disk light looks like the Vegas Strip, or you can hear the drive seeking constantly, you might be disk-bound.
For a closer look, you can use the Windows Performance Monitor or the Unix iostat program. With Performance Monitor, watch the disk busy percentage and see if it's exceeding 50%. With iostat, try the -x parameter and look for high busy times. Access times greater than 50 ms also indicate that the disk is too slow.
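If you are collecting these numbers over time, the two rules of thumb above (more than 50% busy, more than 50 ms per access) are easy to encode. This is a deliberately small sketch: the function name is invented, and you would supply the figures yourself from whatever Performance Monitor or iostat reports on your system.

```python
BUSY_LIMIT_PERCENT = 50.0  # rule-of-thumb threshold from the text
ACCESS_LIMIT_MS = 50.0     # rule-of-thumb threshold from the text


def disk_warnings(busy_percent, access_ms):
    """Return a list of reasons this disk sample looks overloaded."""
    warnings = []
    if busy_percent > BUSY_LIMIT_PERCENT:
        warnings.append("disk busy %.0f%% of the time" % busy_percent)
    if access_ms > ACCESS_LIMIT_MS:
        warnings.append("average access time %.0f ms" % access_ms)
    return warnings
```

An empty list means the sample is within both limits. Warnings that show up only during your slow periods are a strong hint that the disk is the bottleneck.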
Fixing the Bottleneck
Now that you know which subsystem is the bottleneck, you can make more intelligent choices about how to fix it. Memory limits can often be fixed by buying more RAM, and RAM is cheap.
You can also buy more network bandwidth, though it gets expensive. If the bottleneck is your Ethernet interface, you may be able to add another port, or upgrade to a faster one.
A CPU-bound system isn't easy to fix. Even if you can physically add a second, third, or fourth CPU, your application isn't likely to show much benefit past the second CPU. This is because many applications have serious lock-contention problems, and so adding CPUs merely turns a CPU-bound server into a lock-bound server.
There are exceptions to this rule: SSL is one of them, and well-written static web servers are another. Even so, don't count on your server making full use of additional CPUs. You might get a 75% boost from the second CPU, and diminishing returns beyond that.
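To get a feel for "diminishing returns," you can plug the 75% figure into Amdahl's law. That model is my assumption for illustration, not something measured on your server: it treats a fixed fraction of the work as perfectly parallel and the rest as serial, and real lock contention can behave worse than this.

```python
def parallel_fraction(speedup_on_two_cpus):
    """Solve Amdahl's law, 1 / ((1 - p) + p / 2) = s, for the fraction p."""
    return 2.0 * (1.0 - 1.0 / speedup_on_two_cpus)


def amdahl_speedup(p, cpus):
    """Speedup predicted by Amdahl's law for parallel fraction p on cpus CPUs."""
    return 1.0 / ((1.0 - p) + p / cpus)


# A 75% boost from the second CPU means a speedup of 1.75 on two CPUs.
p = parallel_fraction(1.75)
for cpus in (1, 2, 3, 4):
    print(cpus, "CPUs:", round(amdahl_speedup(p, cpus), 2))
```

Under this model, four CPUs give about 2.8 times the throughput of one: the third and fourth CPUs together add less than the first two did. Treat the numbers as a best case.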
If your system is disk-bound, ask yourself why this might be. A web server with static content should cache most of its documents, so the only writes are to the access log (make sure you're using a buffered access log). If your server also runs a local database, mail accounts, or other applications that do lots of reads and writes, you probably need to profile your disk activity and split it up across several physical disks (and possibly several disk buses).
Systems that are disk-bound or CPU-bound may also be candidates for load-balancing. This basically means running two servers in parallel, and letting each server handle half of the user requests. Load-balancing is a complex subject: I'll talk about it more in another article.
All Linux distributions come with top. The manual for top is also available on-line at http://man-pages.net/linux/man1/top.1.html.
Depending on the version, Solaris may not include top. You can always download it from http://sunfreeware.com/.
Microsoft provides Task Manager and Performance Monitor with Windows NT and 2000. Support is available at http://support.microsoft.com/.
This was first published in July 2001