We've been getting anecdotal reports of "your web site is slow" every once in a while, and with increasing frequency, over the last 2 months. I've long noticed that sometimes even opening small files with vi can take SECONDS on the production machine. This seemed bad, especially with monit emailing me alerts about high iowait and load averages every hour. The current situation gave me the opportunity/excuse to dive in and see what's up. Being an optimization freak, and knowing that I hadn't ever really optimized the server much, I figured I could squeeze out a big bump in performance pretty quickly with a few tweaks.
The first order of business was getting some good instrumentation set up. Here's a list of my favorite IO performance tools that I've found:
- munin
- sysstat / pidstat
- iotop
- dtrace (Mac/Solaris/bsd?) or SystemTap (Linux)
We've had Cacti installed for years, but only a few reports, and frankly trying to add more was very frustrating. Cacti seems over-architected to the point of being useless. It is nowhere near simple to use.
When we set up cacti, we also looked at munin. I'm not sure why we chose cacti over munin at the time (2-3 years ago), but in hindsight it seems like a bad move. In under 30 minutes my sysadmin managed to download, install, configure, and customize munin with a bunch of plugins. We have now gone from information scarcity to information overload.
Armed with this massive pile of new charts and graphs, I noticed that our iowait time averages about 5% during the day, and ~12% overall (some painful overnight cron jobs drag the overall number up). There is no memory swapping, which is a good sign. CPU utilization is only ~7%, so I knew my bottleneck was IO.
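For anyone who wants to sanity-check the iowait number on a chart, here is a minimal sketch of how that percentage can be derived by hand. It assumes a Linux box, where the first line of /proc/stat holds cumulative CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The function names are made up for illustration and are not part of munin or sysstat.

```python
def parse_cpu_line(line):
    """Return the cumulative counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples of the 'cpu' line."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Only the first 8 counters are summed: the guest counters that newer
    # kernels append are already included in user and nice.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (the first line of /proc/stat on Linux)."""
    with open(path) as f:
        return f.readline()
```

To use it, call read_cpu_line() twice with a pause in between (say 60 seconds) and pass both samples to iowait_percent().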

But how much iowait is too much? I have no idea. The best way I have seen to answer that question is with "iostat -x 60", which reports an approximate utilization percentage (the %util column) for each disk over every 60-second interval. Mine varied on subsequent runs from 3-4% to 100%. While that does corroborate the "sometimes it's fast, sometimes it's REAL slow" reports, frankly it wasn't that useful. I was back to "it seems laggy, must be something wrong."
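As far as I understand it, the utilization figure comes from a per-device "time spent doing I/O" counter that Linux exposes in /proc/diskstats (the 13th column, in milliseconds). The sketch below shows the arithmetic under that assumption. The helper names and the device name "sda" are hypothetical, and this is not iostat's actual code.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the cumulative 'time spent doing I/O' counter (ms) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, then the per-device counters.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization_percent(before_text, after_text, device, interval_s):
    """Share of the interval during which the device had I/O in flight."""
    busy_ms = io_ticks_ms(after_text, device) - io_ticks_ms(before_text, device)
    return min(100.0, 100.0 * busy_ms / (interval_s * 1000.0))


def read_diskstats(path="/proc/diskstats"):
    """Read the raw counters (Linux only)."""
    with open(path) as f:
        return f.read()
```

Taking two read_diskstats() snapshots 60 seconds apart and calling utilization_percent(first, second, "sda", 60) should land in the same ballpark as the iostat number. A single 60-second window can still hide short bursts, which may be part of why the readings jump around so much.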
Our server is a Dell 2950 with a 4x500GB RAID 5. This one machine runs the DB for 3 web apps as well as the web servers, so I figured that the competition for IO among all of these processes could really be thrashing the drives.
I spent a few hours tuning postgres, thinking that it was thrashing the drive. That turned out to be a waste of time; I should've just restarted PG so that I could see the graph of DB disk vs cache reads in munin.
So then I pretty much started guessing, and that's where it all went wrong. I researched common disk latency problems and came up with the expected list of culprits, plus some best practices:
- Use RAID10 for database loads (or any write-heavy loads) as RAID5 is slow at writing. (too hard to change on this server)
- Use different spindles for db transaction logs vs db disk storage. (too hard to change on this server)
- Use the noatime mount option - heavily recommended by Linus and Ingo - this did drop my iostat write requests by half, but didn't affect my iowait %
- Move the PHP session store to a ramdisk - NOT memcached; it has no locking mechanism - worked, but didn't affect iowait % noticeably
- Use Apache mod_log_config and BufferedLogs On - did reduce disk writes, but didn't affect iowait %
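To put a rough number on the "RAID5 is slow at writing" item, here is a back-of-the-envelope sketch using the common rule of thumb that a small random write costs 4 disk operations on RAID 5 (read data, read parity, write data, write parity) and 2 on RAID 10 (one per mirror side). The 80 IOPS per disk used in the usage note is an assumed figure for illustration, not something measured on this server. Controller caches and full-stripe writes can change the picture considerably.

```python
# Rule-of-thumb disk operations needed per small random write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4}


def effective_write_iops(level, disks, iops_per_disk):
    """Rough random-write IOPS an array can sustain, ignoring controller cache."""
    return disks * iops_per_disk / WRITE_PENALTY[level]
```

With 4 disks at an assumed 80 IOPS each, effective_write_iops("raid5", 4, 80) gives 80.0 and effective_write_iops("raid10", 4, 80) gives 160.0, so by this estimate the same four drives handle about twice the random writes as RAID 10.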
What I found is that it's actually httpd that's doing most of the reading & writing. These are the summary results of 10 minutes of IO on my server:

As you can see, it's really not all that much IO. A few hundred kBytes/second is something you'd think a laptop could handle, right? The database is hardly used. Apache is the only one really pounding the disks. I think the writing mostly comes from uploads (I run some picture sharing sites, so there are lots of uploads), which is another reason my RAID5 is working against me. On the serving side, I have about 6-7 million images occupying ~400GB of disk space on a server with only 4GB of RAM, so I think nearly every image served over http misses the cache and thrashes pretty hard on the seek side. (I have yet to try mod_cache, because from what I understand it's tough to beat the Linux disk page cache for simple file caching.)
In the end, unfortunately, I was not able to decrease the iowait of my machine in a noticeable way. Either I don't understand what I'm doing (probably true), the server isn't really all that slow (maybe it's handling the load pretty well most of the time), or, because of the variety of load put on my server by the multitude of apps running on it, the RAID 5 just can't keep up and stays busy at a certain baseline level.
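The cache argument can be made concrete with a little arithmetic. This sketch assumes the whole 4GB of RAM is available as page cache (it isn't, since Postgres and Apache need memory too) and that images are requested uniformly at random (real traffic favors popular images), so treat the result as a rough bound rather than a measurement. The function names are illustrative only.

```python
def average_file_kb(total_gb, file_count):
    """Average file size in KB for a data set of total_gb spread over file_count files."""
    return total_gb * 1024 * 1024 / file_count


def max_cached_fraction(ram_gb, dataset_gb):
    """Upper bound on the fraction of the data set that fits in RAM at once."""
    return min(1.0, ram_gb / dataset_gb)
```

For ~400GB across roughly 6.5 million images, average_file_kb(400, 6500000) comes out at about 65 KB per image, and max_cached_fraction(4, 400) is 0.01. Under these assumptions at most 1% of the images can be in memory at any moment, so the bulk of requests would have to go to disk.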
We are spinning up another server and we're upgrading to a 2x500GB RAID 1 for the OS and db transaction logs and a 4x500GB RAID10 for the web apps and db storage. I am hopeful that will help.
While sadly my epic blog post didn't have a happy ending, I hope that my trials will a) inspire you not to waste time over-optimizing, and b) point you to a bunch of resources that might help save you some time and sanity in your own future.
[UPDATE]
I did one more thing last night that's worth noting. It didn't help overall system baseline load like I'd been hoping, but did provide a noticeable decrease in load during my overnight cron jobs.
On the left you'll see what normally happened when my cron jobs were running: horrible iowait and almost no CPU usage. Then, 24h later, you see the spikes again, but it's not as consistently slow, and the big indicator of success is the much larger CPU load. Previously the scripts couldn't do their thing b/c they were bottlenecked at IO. I originally had 3 postgres full vacuumdb + analyze runs kicking off at 3AM, 1 for each database. From reading the PG 8.1 docs, it seems that a full vacuum isn't necessary every day, only a maintenance vacuum. So I spread out the 3 vacuumdb commands and am no longer doing a full vacuum. Major improvement!
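A sketch of what the staggered schedule might look like, written as a small generator of crontab lines. The database names and the 30-minute gap are placeholders (the actual spacing isn't recorded here); the only real commands used are vacuumdb with its --analyze and --full flags.

```python
def vacuum_cron_lines(databases, start_hour=3, gap_minutes=30, full=False):
    """Build one crontab line per database, each starting gap_minutes after the last."""
    lines = []
    for i, db in enumerate(databases):
        offset = i * gap_minutes
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        # A plain (maintenance) vacuum is the default; --full is opt-in.
        flags = "--full --analyze" if full else "--analyze"
        lines.append(f"{minute} {hour} * * * vacuumdb {flags} {db}")
    return lines
```

Calling vacuum_cron_lines(["app1", "app2", "app3"]) yields jobs at 3:00, 3:30 and 4:00, each running "vacuumdb --analyze" against one database, so the three runs no longer compete for the disks at the same moment.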

