Troubleshooting common WordPress performance issues: a systems perspective

This one is primarily for the people responsible for keeping a WordPress site available and running well. “Systems” people, if we must name them. If you’re a WordPress developer you might want to ride along on this one as well, so you and the systems or DevOps team can speak a common language when things go bad. Oftentimes, systems people will immediately blame developers for writing bad code, but the two disciplines must cooperate to keep things running smoothly, especially at scale. It’s important for both systems people and developers to understand how code behaves and scales on servers.

What I’m about to cover are some common performance issues that I see come up and then get misdiagnosed or “fixed” incorrectly. They’re the kind of thing that causes a WordPress site to become completely unresponsive or very slow. These may seem obvious to some, and they are certainly very generalized, but I’ve seen enough bad calls to know there are a number of people out there who get tripped up by these situations. None of the issues are necessarily code related, nor are they strictly WordPress related; they apply to many PHP based apps, because it’s all about how sites behave at scale. I’m going to frame them as WordPress site performance issues since that’s where my talents are currently focused.

In all scenarios I’m assuming you are running something that gets a decent amount of traffic at the server(s). I also assume you are running a LEMP stack consisting of Linux, Nginx, PHP-FPM and MySQL. Maybe you even have a caching layer like Memcached or Redis (and you really should). I’m also assuming you have basic visibility into the app using something like New Relic.

Let’s get started.

Your site is down!

The first thing is to determine where to focus your efforts. Is DNS ok? Is each part of the system actually running? Is any part of the system CPU bound? Are you receiving traffic at the servers, and so on? Most of the time you should find your issue falls into one of the following buckets, of which I’ll only discuss a few:

  • DNS failed
  • Your CDN went boom
  • Your load balancers went boom
  • Some service went boom
  • Your web servers are CPU bound
  • Your database servers are CPU bound
  • Your database servers are not CPU bound but there are a lot of queries
  • Nothing is CPU bound but you have a lot of PHP processes doing nothing
  • Nothing is CPU bound but you have a lot of database connections doing nothing
  • The previous two items are both true

Basic connectivity issues

If you are seeing obvious signs of connectivity issues, like DNS errors, 502/504 errors and so on, work through ensuring you have basic connectivity into the servers. That means confirming DNS resolves correctly from multiple geographic locations, the load balancers are up, SSL is terminating correctly (and the certificate hasn’t expired) and so on.
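For example, a couple of quick checks from a shell will confirm resolution and certificate validity (substitute your own hostname for www.example.com):

    # Confirm the name resolves, via both your local resolver and a public one
    dig +short www.example.com
    dig +short www.example.com @8.8.8.8

    # Confirm the certificate being served on port 443 hasn't expired
    echo | openssl s_client -connect www.example.com:443 -servername www.example.com 2>/dev/null \
      | openssl x509 -noout -dates

Running the same checks from a second location (or a VPN exit in another region) helps separate “DNS is broken” from “DNS is broken for me”.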

CPU bound web servers

Assuming no new code was pushed and the issue came on suddenly, check whether you have an increase in uncached traffic to the site. Check the logs to see if you are being hit by a worm, bot or other attacker. You may find that a marketing push went out with unique tracking parameters appended to links for the same content; unless the caching layers are configured to handle that, they will not be effective. If none of this is true, you may have missed the signs that your servers were reaching capacity; you’ve now crossed the threshold of what they can handle and are suffering the consequences.
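A quick pass over the access logs will usually tell you whether the traffic pattern actually changed. A rough sketch, assuming Nginx is writing the default combined log format to /var/log/nginx/access.log (adjust the path and field positions for your setup):

    # Requests per minute -- did volume really jump, and when?
    awk '{print substr($4, 2, 17)}' /var/log/nginx/access.log | uniq -c | tail -20

    # Top requested URLs -- look for one URL being hammered, or the same content
    # hiding behind thousands of unique query strings that bypass the cache
    awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

    # Top client IPs -- a handful of IPs doing most of the traffic suggests a bot
    awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20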

CPU bound database servers

Similar to CPU bound web servers, CPU bound database servers are usually caused by an increase in traffic, an increase in pages that run bad queries, batch processing with a lot of queries, or a combination of the three. Before yelling at developers to fix their code, do your homework to rule out any and all system configuration problems that could lead to poor performance.

When looking at the database servers, ask:

Are the processors waiting to get data from storage?

If you have high i/o wait then you need to determine whether it is being caused by poor queries doing large table scans or by the system simply lacking memory. There are two main reasons to have a lot of memory in a database server. On a dedicated database server, most of the memory should be set aside for MySQL’s InnoDB buffer pool. This allows MySQL to load data into memory and work with it there. Remember, the buffer pool is an LRU cache, so it is NOT necessary to hold your entire database in memory, only what is actively used. If the amount of data being actively used is larger than the buffer pool, you will see problems.
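To see whether the box really is waiting on storage, and whether InnoDB is having to go to disk for data it should be serving from memory, something like this is a reasonable first look (MySQL credentials are assumed to be in your ~/.my.cnf or passed on the command line):

    # %iowait in the CPU summary plus per-device utilization, sampled every 5s
    iostat -x 5

    # The "wa" column is i/o wait; non-zero "si"/"so" means swapping, which is worse
    vmstat 5

    # Disk reads (Innodb_buffer_pool_reads) should be a tiny fraction of
    # memory reads (Innodb_buffer_pool_read_requests)
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"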

The second reason to have a lot of memory (and not give it all to MySQL) is the operating system’s disk cache. For data that does need to be loaded from disk, it is quite useful if it can be served from memory even when it isn’t in the buffer pool.

There is a lot that goes into tuning MySQL, but sizing your memory correctly and setting innodb_buffer_pool_size appropriately are the most important aspects.
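A quick sanity check of the current setting against the memory in the machine (the 70–80% of RAM figure often quoted for dedicated database servers is a starting point, not a rule):

    # Current buffer pool size in GB
    mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;"

    # Total memory in the machine, for comparison
    free -g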

Do I have a lot of backed up queries?

The first thing to determine, if you see a lot of backed up queries (using SHOW PROCESSLIST), is whether they are backed up because of CPU usage or because of a table alter.
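A minimal way to take that first look, assuming you can reach the server with the mysql client:

    # Full, untruncated list of what every connection is doing right now
    mysql -e "SHOW FULL PROCESSLIST;"

    # Or the same thing via mysqladmin, refreshed every 5 seconds
    mysqladmin -i 5 processlist

Lots of rows stuck in “Waiting for table metadata lock” points at a table alter; lots of rows in states like “Sending data” or “Copying to tmp table” points at heavy queries eating CPU.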

In the case of too much CPU usage, try to determine if there is a common query that is running slow. Once you find it, first go back and confirm you have configured memory correctly. If you have, then record the query and work with developers to either call it fewer times or optimize it. If you’re a database wizard, go ahead and see if you are missing an index that would speed the query up. If neither is your problem, you need more CPU: add more servers, spread the load, etc.
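If nothing obvious jumps out of the processlist, the slow query log will usually surface the offender, and EXPLAIN will tell you whether a candidate query is scanning far more rows than it should. A sketch, assuming you’re comfortable toggling these at runtime (put the settings in my.cnf to make them permanent; the database name and query below are placeholders):

    # Log anything slower than one second
    mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"

    # Then run EXPLAIN on the query you captured, for example:
    mysql wordpress -e "EXPLAIN SELECT * FROM wp_posts WHERE post_status = 'publish' ORDER BY post_date DESC LIMIT 10;"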

If you have a lot of pending queries and almost no CPU load then someone probably ran a table alter on a large table and you just have to wait. Table alters can be nearly invisible in development or on unloaded systems; they run quickly either because the system isn’t loaded or because there is far less data than in production. On production, where there is more data and more work being done, a table alter will ruin everything because queries touching that table must wait for the alter to finish, and everything else backs up behind them.

Do I have too many database connections?

A variation on having too many backed up queries is hitting the connection limit. Usually this manifests as a lot of idle connections doing nothing and almost no CPU load. This is one of the more confusing situations for less experienced admins. It is caused either by persistent connections being enabled, which you should almost never do, or by PHP itself having opened a connection it hasn’t yet had the chance to close. Oftentimes I see people respond to this scenario by raising the connection limit. Never raise the connection limit in response to a lot of idle connections or backed up queries. Connection limits must be carefully evaluated based on the expected throughput of the servers when operating optimally, while balancing memory and CPU usage. Every connection uses memory, so you need to ensure you have calculated the total amount of memory MySQL will use from connections, buffers/caches and the buffer pool. Blindly raising the connection limit will not solve the issue and will create a ticking time bomb that will ruin a future weekend.
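To see how close you actually are to the limit, and how many of those connections belong to your web tier, something like this helps (credentials, and MySQL listening on the default port 3306, are assumptions about your setup):

    # Current, peak and maximum allowed connections
    mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
              SHOW GLOBAL STATUS LIKE 'Max_used_connections';
              SHOW VARIABLES LIKE 'max_connections';"

    # Run on a web server: how many connections it is holding open to MySQL
    # (subtract one for the header line)
    ss -tn state established '( dport = :3306 )' | wc -l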

Is MySQL being OOM killed?

If you find that MySQL was recently restarted (or worse, isn’t running at all) then check to see if the operating system OOM killed it using dmesg. If it was OOM killed, you need to adjust settings to reduce MySQL’s memory footprint so it fits the machine. Use mysqltuner.pl to help set up your server. On anything more than a basic personal blog you should consider running a dedicated database server or cluster.
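A quick check for OOM killer activity (the -T flag just makes the timestamps readable):

    # Did the kernel kill mysqld to free memory?
    dmesg -T | grep -i -E 'out of memory|killed process'

    # On systemd machines the kernel log is also available via the journal
    journalctl -k | grep -i -E 'out of memory|killed process'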

Nothing is CPU bound but there are a lot of PHP processes doing nothing

One of the best ways to ruin the performance of any site is to make (web based) external requests during user requests. Sometimes these are necessary due to the nature of the site but, as a general rule, they should be minimized as much as possible. Not only do they greatly slow any request where an external call is being made, they can also lead to entire sites going down because of resource exhaustion, aka “Denial of Service”.

Whenever I encounter a system with low CPU utilization but a lot of PHP processes, I usually suspect hanging external requests to be the issue. Try to determine if the cause is due to some kind of external request. This includes all external requests, even requests to Memcached/Redis or MySQL. Confirm that the site is talking to Memcached/Redis properly and confirm that your database server is not struggling, particularly with a table alter as described above. If these services look ok then start looking for any external APIs that are down or responding slowly.
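Here are a few quick ways to see what those idle-looking workers are actually stuck on. The first assumes you have enabled the PHP-FPM status page (pm.status_path in the pool config plus a matching Nginx location, which you may or may not have set up); the others only need root on the web server:

    # How many workers are active vs. idle, and how long each request has been running
    curl -s 'http://127.0.0.1/status?full'

    # What the workers are talking to over the network; a pile of connections to a
    # single external host is a strong hint
    sudo ss -tnp | grep php-fpm

    # Attach to one worker and watch what it is blocked on
    # (replace <worker-pid> with the pid of a busy php-fpm worker)
    sudo strace -p <worker-pid> -e trace=network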

Like idle database connections, a common response I see to this issue is to raise the number of PHP processes. Again, never raise the number of PHP processes in response to a lot of bored processes that can’t get any work done. This will not solve the issue. Instead, focus on determining why the existing processes aren’t getting work done and resolve that. Only adjust the number of PHP processes as a way to tune a server’s CPU and memory utilization so you can squeeze the most out of it before adding more.

Oftentimes, PHP processes that are waiting on external web requests will have opened database connections that they won’t close until the page is finished. In that case you may end up with a lot of idle database connections and find that you hit either the connection limit or your PHP process limit.

Closing thoughts

I’ve covered just a few of the more common WordPress related issues I run into and some of the ways I’ve seen people incorrectly respond to them. Every site is different and will have its own unique ways of failing. Hopefully what I’ve outlined here will help you more quickly determine the cause of performance issues and resolve them properly. Let me know on Twitter if you feel I left something out or got something wrong.