My team and I came across a puzzling 503 error from our httpd / php-fpm application, and the answer turned out to be so different from what we expected that I felt this was worth sharing. What initially looked like a server-side problem turned out to be something else entirely.
We were doing some basic modernization of one of our older applications. It was a simple API endpoint, written in php with an httpd front end. The application was several years old and was running directly on the linux OS, with php running directly inside httpd as module.
Definitely time for its 75,000 mile service. We were embedding it into docker containers to make it more portable, upgrading the php to the latest version, and switching to the more modern php-fpm arrangement. Instead of running php directly inside httpd, the web server would now use fastcgi proxy over to a separate php-fpm container to do the real work.
Everything was looking good through our testing, so finally the cutover came. We re-routed our traffic to the new containerized applications, and everything looked beautiful. Application was working like a charm, throughput was good… job well done.
Being an obsessive sort, I kept checking back in on the systems over the course of the day, and after a couple of hours, I saw we had recorded a half dozen 503 errors amongst tens of thousands of successful requests. Accompanying the 503 message in the access log was this message in the error log (ip addresses and urls obfuscated):
[proxy_fcgi:error] [pid 111] (70007)The timeout specified has expired: [client 10.11.12.13:39610] AH01075: Error dispatching request to :, referer: https://www.example.com/
Uh-oh, that’s a proxy timeout with the dreaded 503 response code. You typically see these when something is going way too slow on your server, and your proxy (load balancer, httpd, nginx, etc) eventually gives up waiting around and sends a 503 error back to the client. They are hard to debug because usually it is some sort of data or production scale issue that doesn’t show up in normal testing, and they can come and go. I’ve basically been trained that when I see a 503 error, it’s time to get out my performance tuning toolbox and get to serious work.
However, nothing about this presented like a performance tuning situation. The API calls were super fast, taking just a fraction of a second, and the system resources looked healthy. What could it possibly be doing that would take a 100 millisecond request and trigger a proxy timeout that was set to one minute?
Digging further, the problem became more mysterious. In the PHP logs, we found some application errors that would match up with the recording of the 503 event, but the errors suggested that the application had received a post request without any actual data being posted. That was a useful clue, but it didn’t make any sense – an empty request should be an immediate failure, not the sort of thing that would cause a timeout error.
We tried simulating a POST request without a body using Postman, and we saw what we expected… the same application errors, but a response within a few tens of milliseconds and a 200 from the server. Nothing matching the 503.
Continuing to look further, more strange things popped up. Looking at the httpd log entry for when the request was received versus when it actually responded, we saw what looked like a 20 second drift. httpd was triggering the 503 timeout after 20 seconds, not the 60 seconds it was configured for. Furthermore, those application errors appeared right at the end of the 20 seconds…not the beginning.
As far as we could tell, it seemed like httpd was hanging out for 20 seconds, and then finally telling php-fpm to go do its thing. php-fpm responded that it was missing post data, but httpd had already given up and sent the 503 to the client.
Now a possible explanation was starting to form. Our clients have users all over the world, some of whom have some very poor internet connections. Perhaps what we were seeing was a request starting from a client but somehow never finishing. If httpd then decided to give up after 20 seconds and pass php-fpm whatever it had, this would all start to make sense.
It felt like a good explanation, but I wanted to be sure. Under the covers, the http protocol is just a fancy telnet session. If I could figure out the right series of commands, maybe I could trigger it.
It took some monkeying around, but I finally stumbled on the right combination (IPs and urls again obfuscated):
computer:~ jeremy$ telnet 10.11.12.14 80 Trying 10.11.12.14... Connected to 10.11.12.14. Escape character is '^]'. POST /app/apiendpoint HTTP/1.1 Host: 10.11.12.14 Content-Type: application/x-www-form-urlencoded Content-Length: 100 helloworld
In the block above, I am opening up a connection on port 80 to the system using old fashioned telnet and then manually entering the http protocol commands. I first tell it I’m going to do a POST operation, then a few headers, and then the data.
Most significantly, I am giving it a content-length of 100. I am telling the server… “get ready, I’m going to send you 100 bytes of data.” I then send it “helloworld” and go silent. The server has gotten 10 bytes of data, and it is now waiting expectantly for the other 90 bytes.
But I have gone quiet. I haven’t closed the connection, butI haven’t sent the promised data, and the server is waiting. After 20 seconds, the server decides that it isn’t going to wait around any longer and severs the connection.
And then, it response with this:
HTTP/1.1 503 Service Unavailable Server: Apache Content-Type: text/html; charset=iso-8859-1 Strict-Transport-Security: max-age=16070400; preload Date: Wed, 11 Jan 2023 18:02:39 GMT Connection: close Content-Length: 299 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>503 Service Unavailable</title> </head><body> <h1>Service Unavailable</h1> <p>The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.</p> </body></html> Connection closed by foreign host.
And sure enough, we get the 503 in the access log and the php logs generate the application errors about the missing data.
So, this wasn’t a problem in our application to fix, this was a client connection error.
For me it was eye opening. 503 has always meant “problem on the server” to me. It turns out it might also be triggered by a client with a bad connection!
I know there are lots of posts on 503 errors out there, but I’m hoping to add mine to the mix in case maybe it will point somebody in the right direction.