Won’t Someone Think of the Unicorns?!
Early last week, we noticed an increase in 504 errors on customer-facing web pages. This usually means that a request is taking far too long and the Unicorn master is killing the worker process.
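For context, Unicorn's hard timeout is the mechanism behind those 504s: any worker that doesn't finish a request within the limit gets SIGKILLed by the master, and the proxy in front returns a 504 for the dropped request. A minimal unicorn.rb sketch (the value shown is illustrative, not our production setting):

```ruby
# unicorn.rb (sketch) -- illustrative value, not our production number.
# Workers that don't complete a request within this many seconds are
# SIGKILLed by the master process; the client then sees a 504.
timeout 30
```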
After digging through some of our slow requests in New Relic, we found that Redis was struggling when the site was under heavy load. Killing the other services that shared a server with Redis eliminated the 504 errors. After probing the server, we found that Redis was swapping roughly 1-2 GB of data to disk at any given time, and when Redis starts swapping large chunks of data to disk, it stops performing predictably. Clearly we had a resource issue. To understand this problem, it's important to look at why we started using Redis in the first place.
Our Introduction to Redis
Several years ago, when we were having problems scaling with Workling/Starling and DelayedJob, we started looking for options. After Chris Wanstrath gave a talk at a Gilt Groupe event about GitHub growing up, Jay and I grilled him with questions about Resque. They had been hitting a lot of the same problems we were, and Resque seemed like a perfect fit for our infrastructure. We moved to Resque and never looked back. It just worked.
The queue for Resque is stored and managed in Redis, and even at our peak of over 20 jobs/sec with ~400 background workers running, it held up fine. Redis was quite literally the last thing we expected to give us problems. It dutifully sat on one of the boxes running several hundred workers and never gave us an ounce of trouble.
Redis, for those who don't know, is a key/value store like Memcached that also persists the contents of your database to disk at varying intervals. It supports atomic operations, sorting, sets, and more. It's extremely fast; in some of our testing it was even faster than Memcached.
Today we use Redis as an overflow for session-related items (to prevent cookie bloat) and as a cache for things that we could technically store in Memcached, but would be expensive to recache on the fly if they were evicted from Memcached. We continue to use it for background workers as well. It’s a perfect fit and it handles a lot of load really well. We have several web servers and tons of background workers all pounding on it.
When this problem started, we took a look at the Redis database and realized it was using nearly 12 GB of space. After further investigation, we determined that a lot of the session data we thought was being expired was actually sticking around forever. Several years of accumulated session data, along with a bunch of other garbage, meant that Redis was holding more data than it needed and consequently consuming more memory than we expected.
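The fix for the session bloat is to always write overflow data with an explicit TTL so stale sessions expire on their own. A sketch of the idea, with a hypothetical key name and TTL; `setex` is the relevant redis gem call, and the in-memory stand-in below only demonstrates the call shape without a live server:

```ruby
# Sketch: write session-overflow data with an explicit TTL so stale
# sessions expire instead of accumulating for years. The key prefix
# and TTL here are hypothetical, not our production values.
SESSION_TTL = 14 * 24 * 60 * 60 # two weeks, in seconds

def write_session_overflow(redis, session_id, payload)
  redis.setex("session:#{session_id}", SESSION_TTL, payload)
end

# Minimal in-memory stand-in for a Redis client, for illustration only.
class FakeRedis
  attr_reader :store

  def initialize
    @store = {}
  end

  def setex(key, ttl, value)
    @store[key] = { value: value, ttl: ttl }
  end
end

client = FakeRedis.new
write_session_overflow(client, "abc123", "serialized-session-data")
```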
At this point, we were pretty stuck. We could easily identify about half of the keys in the database and mark them for expiration. That still left us with 6 GB of data that we couldn't clean up or mark for expiration because we weren't sure what it was. We could have dumped all of the keys and evaluated them, but that's an expensive operation that can lock up your Redis server, especially with that volume of data.
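On Redis versions that support it, SCAN lets you walk the keyspace incrementally instead of issuing a blocking KEYS *. A sketch of the kind of audit we would have needed; the key prefixes are hypothetical examples, the grouping logic is plain Ruby, and the live-server feed is shown in a comment:

```ruby
# Sketch: group keys by known namespace prefixes so the unrecognized
# remainder can be audited separately. Prefixes are hypothetical.
KNOWN_PREFIXES = %w[session: resque: cache:].freeze

def classify_keys(keys)
  keys.group_by do |key|
    KNOWN_PREFIXES.find { |prefix| key.start_with?(prefix) } || "unknown"
  end
end

# Against a live server you would feed this from SCAN rather than KEYS
# (which blocks on a 12 GB keyspace), e.g. with the redis gem:
#   redis.scan_each(count: 1000) { |key| ... }
sample  = ["session:42", "resque:queue:high", "mystery-key"]
buckets = classify_keys(sample)
```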
We evaluated our code base, found all of the areas that might be causing data bloat, upgraded or modified them, and rolled the changes out to production. That stopped the bleeding, but we still had to fix the underlying problem. Instead of wasting time trying to identify all of the problematic keys, we decided to do a wholesale flip to new Redis servers. We spun up a few cloud servers under Rackspace's RackConnect and ensured that our bare-metal machines could talk to them.
Initially, we attempted to use redis_failover with ZooKeeper, but in production load tests all of our workers began dying from segmentation faults. Unable to determine the cause (likely related to REE), we looked at Redis Sentinel. There is a pretty lightweight redis-sentinel gem that you can use with the normal redis client for Ruby, but it has a few caveats that we resolved in our fork of the gem. Here's a quick summary of the fixes:
- Compatibility with the redis 2.2.2 gem (3.0.x has socket-related problems with REE).
- Proper timeouts using SystemTimer.
- Resolved a few infinite loops triggered when all of your sentinels are down.
In development and on staging, we try to run pretty lean, so we built a Redis connector that attempts to use Sentinel and, if it can't, falls back to connecting to the local Redis server without failover. This is why raising proper errors when no Sentinels are connectable is important. You can get an idea of how we handle these errors in this gist.
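The connector boils down to "try Sentinel, rescue, connect directly." A sketch of that logic with the connection builders injected as lambdas so it stands alone; in the real app the factories would construct clients via the redis-sentinel and redis gems, and the names here are hypothetical:

```ruby
# Sketch of the Sentinel-with-fallback idea behind our connector. The
# factory lambdas are hypothetical stand-ins for code that builds real
# Redis client objects.
def connect_with_fallback(sentinel_factory, direct_factory, warnings = [])
  sentinel_factory.call
rescue StandardError => e
  warnings << "No Sentinels reachable (#{e.class}); using local Redis"
  direct_factory.call
end

# In development there is no Sentinel, so the first factory raises and
# we fall back to a direct local connection.
sentinel   = -> { raise "connection refused" }
direct     = -> { :direct_local_connection }
connection = connect_with_fallback(sentinel, direct)
```

The key design point is that a dead Sentinel must raise instead of looping forever, which is exactly the infinite-loop fix mentioned above.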
Ultimately, we feel that Redis Cluster is where we want to be, but a beta is still several months out. In the meantime, we're pretty happy with this solution until Redis Cluster is production-ready.
Deploying was actually pretty trivial. This is how the rollout went:
- Brought down half of our workers and made sure they exited cleanly.
- Configured workers to use the new cluster and brought them back up.
- Threw a bunch of bogus jobs at the newly configured workers to make sure they were able to complete the whole job cycle.
- Configured all web servers to use the new Redis setup.
- Deployed to production with the newly configured web servers.
- Once all of the production web servers were moved over, we let whatever was still queued up in the old Redis server finish processing.
- Finally, we restarted the other half of workers on the new Redis servers.
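The "bogus jobs" in step three can be as simple as a no-op Resque job that proves the full enqueue, reserve, and perform cycle against the new servers. A hypothetical sketch (the job name and queue are made up for illustration):

```ruby
# Hypothetical no-op job used purely to verify the job cycle on the
# newly configured workers.
class SmokeTestJob
  @queue = :smoke_test

  def self.perform(marker)
    # A real check might write a marker key back to Redis so a script
    # can confirm that every enqueued job actually completed.
    "completed-#{marker}"
  end
end

# With Resque pointed at the new cluster, you would enqueue a batch:
#   100.times { |i| Resque.enqueue(SmokeTestJob, i) }
result = SmokeTestJob.perform(7)
```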
After we deployed, we kept an eye on things, and after a while the difference was obvious.
If you were doing this and couldn't simply cut over to a new Redis cluster, one approach would be to set up the new servers as slaves of your existing setup. You could then run queries against your old Redis server to see what additional keys you could clear out, trigger a failover via Redis Sentinel, and shut down your old Redis server.
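In command form, that migration might look roughly like the following; the host names are placeholders, and SLAVEOF was the replication command in the Redis versions of that era:

```shell
# On the new server: replicate from the old master (placeholder hosts).
redis-cli -h new-redis slaveof old-redis 6379

# Audit and expire keys on the old master while the slave stays in sync.

# Promote the new server by detaching it from the old master...
redis-cli -h new-redis slaveof no one

# ...or, with Sentinel monitoring the pair, force the failover instead:
redis-cli -h sentinel-host -p 26379 sentinel failover mymaster
```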
In our case, anything that was in Redis would automatically regenerate so this wasn’t entirely necessary. Any data “loss” was acceptable and not critical. We opted for the solution that would net us a nice clean rig.
All things considered, Redis is a fantastic piece of software. We’re looking forward to utilizing it for more of our operations that don’t really fall into our traditional MySQL with Memcached arrangement. Atomic operations are blazing fast and the flexibility you get is super.
The Redis Sentinel solution isn't as robust as redis_failover with ZooKeeper was, because there can be a small window of time where your master is gone and the Sentinels haven't yet triggered a failover. If anyone has a solution for that, we'd love to hear it.