A little bit of background. The 2nd of November was the launch of SnapShooter’s new backups engine. Now, this brought a lot of improvements to the way we do backups with databases and file systems as well as directly supporting applications. One of the most significant improvements I found was that we do real-time logging from a customers server back to SnapShooter. Unlike before where we would SSH into someone's server and pull the log periodically. This process was far more efficient; the logs would stream directly from customer servers back to SnapShooter. This was probably the biggest unknown for the launch because previously the traffic was relatively low. Server load was all dependent on the queue workers, but now we were going to be processing millions of inbound connections per day.
SnapShooter allows people to set hourly backups and most people generally stick to that, doing backups on the hour. We do support cron, and we do support going all the way down to a five-minute resolution. But most customers have backups on the hour and most of them happen at midnight UTC. So with us, we ended up with a big spike. All was good, the backup system went out, and we noticed on the DigitalOcean load balancer, that we were roughly processing a hundred requests per second at the peak of the popular hours. But then I started to notice that we were also receiving quite a lot of timeout issues on the load balancer!
The week previous, I had increased the servers in preparation for releasing this new system. I thought, you know, might as well scale them up to 80GB CPU instances. So I was a bit perplexed why we were getting some requests processing so slowly they timed out. For a bit of background, DigitalOcean’s load balancer limited to 60 seconds, requests over this will always time out.
DigitalOcean Load Balancer
Now I will add at this point, SnapShooter's backup script has a retrying policy around sending logs back. Now, this will retry countless times to make sure that SnapShooter is aware of the current status of a backup. So no customers backups were being affected. However, the performance was terrible our end, and more worryingly, this was effecting the dashboard at SnapShooter, which looks horrible to customers and prospective customers, who trusts a business when their website randomly times out!
Finding the fault
The first thing I did was I split up the load balances between the ingesting of data from backups and the serving of SnapShooter's website. Now, I should've done this from the beginning, but I didn’t expect to hit performance issues at sub 100 reqs. So I separated them apart, and I created a domain just for the ingestion of data, and this helped with debugging. I then put behind it two brand new 4GB FPM droplets, and I chose a dedicated CPU. And I noticed that during load testing this, we were still getting requests timing out and just getting more and more concerning.
There was no rhythm or reason to it; sometimes under load, it would run fine, sometimes under no load (1-5 reqs) we would get timeouts.
2020/11/05 14:36:51 [error] 17352#17352: *1319260 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.131.19.157, server: snapshooter.io, request: “POST [Retracted] HTTP/1.1”, upstream: “fastcgi://unix:/var/run/php/php7.4-fpm.sock”, host: “snapshooter.io”
There is nothing more annoying than debugging messages like this, there was an issue, but nope we can’t tell you more!
The first logical place in my mind was to look at MySQL to see if there were any issue, and missing index is a straightforward issue to solve, and I thought with the new system very possible I have overlooked a heavy query. Especially around our backup logs table that was doing ~150k inserts a day. A query without an index will run final locally, but as soon as the table starts to fill, full scans will be a disaster.
DigitalOcean Managed database does not have a slow query logger, at least not one that would have worked for me. So I quickly wrote up a Laravel Service Provider todo it myself.
60ms was my cut off, any query over that was getting sent to Papertrail and easily filtered by the emergency tag + SlowQuery. No binding into the logs because in the unlikely event that there is sensitive data in the query binding, I don't want it getting logged.
Yep, I’d missed some indexes! Not many, but our new key backup logging database was missing an index on the timestamp which does the ordering. I am also missing a team_id index on the backup jobs, which made the UI a little slow for customers wanting to see all their new backup jobs. A Laravel migration later, and everything index wise seemed good.
DigitalOcean managed database has a graph for Index vs Sequential reads, compare this to when you have a high load situation, if you have a low number of indexed reads, you should investigate more.
The database query work improved to page load times. However we still got the 60s timeouts, and the query logger was returning that some perfectly normal queries, like selecting a single user where taking over 2000ms to return. I didn’t forget to add a primary key to the user’s table!
So I did the sensible thing of increasing the size of the database, managed DigitalOcean databases make it easy to increase size, and I also added a read replica. And again, a performance benchmark and it was no better. We were still getting slow queries. At this point, I'm pulling my hair out. It makes no sense. Mighty servers, total overkill, and it's still not able to cope! An individual server was not able to cope with more than 20 requests per second. It makes no sense.
Looking at PHP
So I started talking to friends in the community. I got on a call with Leo Sjöberg, who helped me run through some of the obvious things. OPcache settings was checked, R-limit on the box was checked, but we still couldn't understand. Turning to these are a final level of optimisation, not something you need to do to get timeouts solved. If you use PHP-FPM out the box, you will be using something called dynamic mode. In my case, because I'm only running the server FPM, it makes sense to change to a static method, which I'll go into more detail later which we did, but still getting this query issue.
We did see a small performance increase, which was to be expected; however, the timeout still there.
In the process of trying to tune PHP FPM with Leo, I realised that there was a setting in there for slow requests. I enabled slow requests logging and waited, then I got my first stack traceback.
Every slow request stack trace ended the same
[0x00007f092a215dc0] auth() …./PhpRedisConnector.php:87
A request to Redis what's going on here? There's an issue with Redis. So, I did the sensible thing, which was to increase the size of the Redis cluster, even though we're nowhere near performance limit on the Redis cluster managed by DigitalOcean. Still, it makes sense just to increase it, like it's the quickest way to solve an issue, right?
And it still didn't solve this high timeout. And I posted this in the Laravel UK Slack where I've been discussing the issues. And I left that to have lunch, and when I came back, someone had posted a bug report on GitHub saying that Redis extension with PHP7.4 had an authentication issue. Oh, my goodness. I spent all week trying to solve a problem that doesn't exist where I think it exists.
The crux of this bug is that when the Redis extension tries to connect to a Redis server over TLS version 1.3, every so often, you'll get a random timeout. And there's still no resolution for this. A bit of shifting of blame between, is it Redis? Is it PHP? Is it something else? However, the fix in my case was to simply force Laravel to connect to Redis over TLS version 1.2. As soon as I did this, I didn't have a single timeout issue again. And I'd solved the problem. Well, more accurately, Lee Willis and Laravel UK had solved the issue.
Now I could get back to what I wanted to be working on, marketing, but I'd left a little bit of a mess. I've increased the size of the MySQL server and the Redis server. And in DigitalOcean there's no path for shrinking a managed service. So you would have to create a new cluster and move the data back. With the Redis that was relatively easy, with the database I still have not got around to migrating it back,
But to satisfy my curiosity, I decided to spin up a new FPM worker for the ingesting system, a $5 server. Just to see what it could do. And my goodness can a $5 server that's configured with OPcache and FPM set in the right mode. With the correct number of workers, you can deal with a serious number of requests. So we'll go into more detail about that. So in SnapShooter's case, the FPM servers don't do anything other than FPM process, I like to keep things nicely segregated. The servers that run the queues are not web-facing and dedicated for that particular purpose. And you can change the number of workers based on the hardware and the load. But for the FPM, they only process HTTP requests, and they run behind the load balancer.
So the first thing you can do is you can set FPM to run in static mode. There are three modes, on-demand, dynamic and static. Static, you just set the exact number of worker processes the FPM has. Dynamic, you set the minimum and maximum, and it will scale up and down between them, keeping active spares running. And then on demand is a way that you can basically spin down to nothing when not serving traffic. Only ever look at that one, if you're doing shared hosting with separate FPM workers.
Dynamic is potentially useful if you have many different processes on the server, and you want to be freeing up memory when you're not serving web requests. Still, in my case, with a fixed amount of resources, you can calculate how much memory each FPM worker is going to be using, serving your application. And then you can tune that to the amount of RAM you've got with a bit of overhead and set up. So for the $5 server, I think it worked out at about 30 workers. And a $5 server with 30 FPM workers was more than happy to serve over 200 requests per second. Yes, the response time became a little bit sluggish with high concurrent, but for the case of ingesting data, it wasn't an issue.
And that's how I managed to go from ridiculous servers to even more ridiculous servers. And then back to $5 servers. And now SnapShooter is run on three, $5 servers, doing the ingesting of data, which mainly happens at the hour. And then we have three workers for serving the dashboard, and they are three $10 servers just to get a little bit higher performance on the server processing time.
Other useful resources
Alex Vanderbist Some notes on slow queues and Redis over TLS