SQLSTATE[HY000] [2002] Connection timed out: FreeBSD PF, CARP, and MySQL

A-Team Systems investigated a production issue where a PHP application intermittently reported this MySQL connection error:

SQLSTATE[HY000] [2002] Connection timed out

Unfortunately, this error can mean a lot of different things. You can get it when the database engine is down, when a firewall rule is wrong, when routing is broken, when a cable or switch path is having trouble, or when something in the middle prevents the TCP handshake from completing.

Most people who land on this article will probably have a simpler cause than the one described here. That is worth saying up front. This case was unusual because almost everything was working, including almost all of the MySQL connections.

The application was making a very large number of database connections, and only about 1 in every 10,000 new TCP connections failed.

Everything else looked healthy: other application traffic was working, most MySQL traffic was working, and the database was not globally unavailable. The database endpoint path showed no obvious signs of CPU, memory, swap, or listen-backlog exhaustion.

The environment: FreeBSD PF, CARP, and a high-churn MySQL path

The production path looked roughly like this:

Application servers

FreeBSD PF / CARP firewall pair

Database service endpoint

Backend database writer

In this case, there was a SQL proxy layer in the database path, and we suspected it for a while. That was a reasonable suspicion. Bypassing that layer stopped the visible application errors.

But the proxy software was not the problem. The same kind of issue could have shown up on a direct-to-database path with enough short-lived connections and the same firewall behavior in the middle.

The pieces that mattered were:

FreeBSD PF doing stateful forwarding
CARP providing firewall/router failover
no PF state synchronization between the two firewall nodes
a high rate of short-lived MySQL connections
repeated connections from the same application hosts to the same database destination IPs and port 3306

Bypassing the proxy layer told us the path mattered. It did not prove that the service at the end of the path was failing.

Why the error was misleading

The application error was a MySQL connection timeout:

SQLSTATE[HY000] [2002] Connection timed out

This was not the same as a slow query, an authentication failure, or a connection that was established and then dropped later.

The failing requests were not getting far enough for the MySQL protocol to matter. They were failing at TCP connect time.

In other words, this was a TCP connect timeout, not a MySQL query problem. So the useful question became very basic:

Did the TCP handshake complete?
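
One way to ask that question without involving the MySQL protocol at all is a bare TCP connect with a deadline. The following is a minimal Python sketch, not the tooling used in this incident; the function name classify_connect and the 3-second default timeout are assumptions made for illustration.

```python
import errno
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a bare TCP connect and report how the attempt ended."""
    try:
        # create_connection returns only after the three-way handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            return "established"
    except (socket.timeout, TimeoutError):
        # Nothing usable came back before the deadline (no SYN/ACK, no RST).
        return "timed out"
    except ConnectionRefusedError:
        # An RST came back: the host was reachable but nothing accepted the port.
        return "refused"
    except OSError as exc:
        # Anything else (unreachable network, name resolution failure, ...).
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

A "timed out" result corresponds to the failure described here. A "refused" result would point at a different problem, because it means a reply did make it back to the client.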

What made this hard to see

The failure rate was extremely low.

Roughly 1 out of every 10,000 new database connections would hit the timeout. The rest succeeded.

The other problem was that it would go away on its own. The first couple of times we saw it, by the time we had enough tooling in place to capture the right traffic and compare it against firewall state, the errors had already stopped showing up.

That made it easy to chase the wrong thing. A service can pass health checks while still losing a tiny fraction of new TCP connections. A network can move large volumes of traffic while one exact flow fails. A database endpoint can accept thousands of connections while a small number of new handshakes never complete.

What the packet capture showed

The most useful capture was intentionally narrow. We did not need full MySQL payloads, and capturing them would have created unnecessary volume and sensitivity.

We captured the TCP handshake.
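
A narrow capture of this kind can be expressed as a tcpdump filter that matches only packets with the SYN flag set, which covers both the client's SYN and the server's SYN/ACK. The sketch below only builds the command line; the helper name, the interface name, the address, and the output file are hypothetical, and the 96-byte snap length is an assumed value chosen to keep headers and drop payload.

```python
def handshake_capture_argv(interface, db_host, db_port=3306, out_file="handshake.pcap"):
    """Build a tcpdump command that records only SYN and SYN/ACK packets."""
    # tcp[tcpflags] & tcp-syn != 0 matches any TCP segment with the SYN bit set.
    bpf = f"host {db_host} and tcp port {db_port} and (tcp[tcpflags] & tcp-syn != 0)"
    # -n: no name resolution, -s 96: truncate packets so payload is not stored.
    return ["tcpdump", "-n", "-i", interface, "-s", "96", "-w", out_file, bpf]


# Hypothetical interface and database address, for illustration only.
print(" ".join(handshake_capture_argv("em0", "198.51.100.20")))
```

Because the filter ignores everything after the handshake, the capture stays small even on a path with heavy connection churn.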

On an affected application server, a failed connection looked like this:

app-server:ephemeral-port -> database-endpoint:3306

SYN
SYN retransmission
SYN retransmission
SYN retransmission

No SYN/ACK came back to the client.

That matched the application error. The client tried to open a TCP connection to the database endpoint, retransmitted the SYN, and eventually timed out.

This was not a failed SQL query. It was a TCP connection that never established.
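
With roughly one failure per 10,000 connections, finding that pattern by eye is impractical. A minimal sketch of the search, assuming the capture has already been decoded into simple records (the Packet type and its "S" / "SA" flag strings are hypothetical, not the output of any particular tool):

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    flags: str  # "S" for a bare SYN, "SA" for a SYN/ACK


def unanswered_syns(packets, min_syns=2):
    """Return {flow: syn_count} for flows with repeated SYNs and no SYN/ACK."""
    syn_counts = defaultdict(int)
    answered = set()
    for p in packets:
        if p.flags == "S":
            syn_counts[(p.src, p.sport, p.dst, p.dport)] += 1
        elif p.flags == "SA":
            # A SYN/ACK travels server -> client, so flip it to the client's view.
            answered.add((p.dst, p.dport, p.src, p.sport))
    return {
        flow: count
        for flow, count in syn_counts.items()
        if count >= min_syns and flow not in answered
    }
```

Each key in the result is a client-side flow that retransmitted its SYN and never got an answer, which is the tuple to look up on the firewall.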

What PF showed for the same flow

The next step was to compare the packet capture with PF state on the active FreeBSD firewall.

For the same exact 5-tuple:

source IP
source port
destination IP
destination port
protocol

the application server was still sending SYN retransmissions.

But PF showed that same flow as:

ESTABLISHED:ESTABLISHED

That was the useful contradiction.

The client was trying to start a new TCP connection. PF believed the flow was already established.

Once those two observations were matched to the same 5-tuple, the problem stopped looking like a general database endpoint failure. It looked like stale or incorrect firewall state for a specific flow.
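
The matching step can be sketched in Python. This assumes state lines in the general shape printed by pfctl -ss for plain IPv4 TCP states without NAT ("all tcp A:port -> B:port STATE:STATE", or "<-" with the initiator on the right); the exact format can differ between versions and rule sets, so the regular expression is an assumption to adapt, and lines it does not recognize are skipped. The function names are hypothetical.

```python
import re

# Assumed line shape: "<iface> tcp <ip>:<port> <arrow> <ip>:<port>   <state>:<state>"
STATE_RE = re.compile(
    r"^\S+\s+tcp\s+(\S+):(\d+)\s+(<-|->)\s+(\S+):(\d+)\s+(\S+:\S+)\s*$"
)


def parse_pf_states(text):
    """Map (client_ip, client_port, server_ip, server_port) -> PF state string."""
    states = {}
    for line in text.splitlines():
        m = STATE_RE.match(line.strip())
        if not m:
            continue  # NAT, IPv6, or non-TCP lines are ignored in this sketch
        a_ip, a_port, arrow, b_ip, b_port, state = m.groups()
        if arrow == "->":
            key = (a_ip, int(a_port), b_ip, int(b_port))
        else:
            # With "<-" the initiating side is printed on the right.
            key = (b_ip, int(b_port), a_ip, int(a_port))
        states[key] = state
    return states


def contradictions(unanswered_flows, states):
    """Flows still retransmitting SYNs that PF already lists as established."""
    return [
        flow
        for flow in unanswered_flows
        if states.get(flow, "").startswith("ESTABLISHED")
    ]
```

Any flow returned by contradictions is one where the client's view and the firewall's view disagree in the way described above.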

Why most traffic still worked

This is the part that is easy to get wrong during an incident.

A stateful firewall is not just deciding whether one host can reach another host. It is tracking individual flows.

The unit of failure can be one specific TCP 5-tuple:

client IP + client port + server IP + server port + protocol

If one of those flow states is wrong, that one connection can fail while other connections between the same systems keep working.
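
A deliberately simplified toy model makes the point concrete. It is not how PF is implemented (a real stateful firewall also tracks sequence numbers, windows, and timeouts); it only assumes that a state table is keyed by the 5-tuple and that a new SYN which lands on an entry already marked established is not forwarded. All names and addresses are hypothetical.

```python
def forward(state_table, flow, tcp_flags):
    """Toy forwarding decision for one packet; returns "pass" or "drop"."""
    state = state_table.get(flow)
    if tcp_flags == "S":
        if state is None:
            # No entry yet: a new connection, so create state and forward.
            state_table[flow] = "SYN_SENT"
            return "pass"
        # Assumption of this toy: a fresh SYN does not fit an existing entry.
        return "drop"
    # Non-SYN packets are forwarded only when a state entry exists.
    return "pass" if state is not None else "drop"


# One stale entry left over for a single client port (hypothetical addresses).
stale = ("192.0.2.10", 51515, "198.51.100.20", 3306, "tcp")
table = {stale: "ESTABLISHED"}

print(forward(table, stale, "S"))                                           # drop
print(forward(table, ("192.0.2.10", 51516, "198.51.100.20", 3306, "tcp"), "S"))  # pass
```

Both flows share the same hosts and the same destination port. Only the one whose source port matches the stale entry fails.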

That explains why the symptoms looked contradictory:

routing was working
the database endpoint was reachable
most MySQL connections succeeded
other application traffic was unaffected
only a very small number of new MySQL TCP connections timed out

"Everything else works" did not rule out the firewall. In this case, it helped rule out broader explanations like general routing failure, database outage, or full-path packet loss.

We also looked for the usual firewall explanations, including PF state table exhaustion. That did not line up with what we were seeing. If the state table were full, or if the firewall were hitting a broad capacity limit, we would expect a much more consistent failure pattern: waves of connection errors, broader impact, or counters that clearly matched the timing.

That was not the pattern here. The failures were too isolated and too random. They looked more like individual flow state mismatches than a firewall running out of capacity.
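
The capacity check itself is a simple ratio: current state entries against the configured state limit. A minimal sketch, assuming text in the shape printed by pfctl -si (a "current entries" line in the State Table section) and pfctl -sm (a "states hard limit" line); the sample numbers in the usage below are hypothetical, not figures from this incident.

```python
import re


def state_table_utilization(info_text, limits_text):
    """Return current PF state entries as a fraction of the hard limit."""
    current = re.search(r"current entries\s+(\d+)", info_text)
    limit = re.search(r"^states\s+hard limit\s+(\d+)", limits_text, re.MULTILINE)
    if current is None or limit is None:
        raise ValueError("unexpected pfctl output format")
    return int(current.group(1)) / int(limit.group(1))


# Hypothetical sample output for illustration.
info = "State Table                          Total             Rate\n  current entries                    41200\n"
limits = "states        hard limit   200000\nsrc-nodes     hard limit    10000\n"
print(f"{state_table_utilization(info, limits):.1%}")
```

A value far below 100% at the time of the failures is consistent with ruling out exhaustion. It says nothing about whether individual entries are correct.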

Why connection churn and source-port reuse mattered

The affected path had a high rate of short-lived MySQL connections.

That churn matters because TCP connections are identified by their 5-tuple. When a client creates many short-lived connections to the same destination IP and port, it will eventually reuse ephemeral source ports.

That ephemeral source-port reuse is normally fine. It became relevant here because the firewall's view of a flow did not always match the endpoint's view after the firewall restart and failover/failback sequence.
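
A back-of-the-envelope calculation shows how quickly reuse can happen. The numbers are hypothetical: a 16,384-port ephemeral range (49152 to 65535, the IANA dynamic range; the actual range depends on the client operating system and its tuning) and 200 new connections per second from one client to one destination. The sketch also assumes ports are handed out roughly in sequence; with randomized selection a given port can come around sooner.

```python
def seconds_until_port_reuse(port_range_size, connections_per_second):
    """Rough time for one client to cycle through its whole ephemeral range."""
    return port_range_size / connections_per_second


ports = 65535 - 49152 + 1   # 16384 ports in the assumed range
rate = 200                  # hypothetical new connections per second
print(round(seconds_until_port_reuse(ports, rate), 2))  # 81.92
```

Under those assumptions the same source port comes back to the same destination in under a minute and a half. A firewall state entry that wrongly stays established can outlive that interval by a wide margin, since established-state timeouts are typically measured in hours unless tuned.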

The likely sequence was:

One FreeBSD firewall node was active.
Maintenance shifted traffic to the other CARP node.
Many short-lived MySQL connections were interrupted and recreated.
Traffic later shifted back after firewall restart/failback activity.
Application servers continued creating many new connections to the same database endpoint on port 3306.
Some client source ports were reused for the same destination.
PF had or reconstructed state that did not match the actual TCP state at the endpoints.
PF treated a new SYN as belonging to a flow it believed was already established.
That specific connection timed out.
Other flows continued normally.

That matched what we were seeing better than database overload, DNS failure, general packet loss, or state table exhaustion.

What reproduced it

A simple CARP role change did not reliably reproduce the issue by itself.

The problem returned after a restart and failover/failback sequence involving both FreeBSD firewall nodes.

That detail matters because "CARP failed over" is too simple a description. The issue appeared after the full maintenance pattern: firewall restart behavior, traffic moving between nodes, high MySQL connection churn, and reused flow tuples.

What the evidence pointed to

The evidence pointed to stale or incorrect PF state on the FreeBSD firewall path after the maintenance sequence.

The strongest evidence was not an aggregate counter. It was the mismatch for the same exact flow:

client view:
  SYN retransmissions, no SYN/ACK

PF view:
  same 5-tuple marked ESTABLISHED:ESTABLISHED

The PF state table was not near exhaustion. The database endpoint was not globally unavailable. DNS testing did not point to name resolution as the active cause. Most MySQL traffic continued to succeed. The failures also did not arrive in waves the way we would expect from a broad capacity problem.

This is not the same thing as proving a FreeBSD PF software bug. What we had evidence for was a PF state mismatch for specific flows after the maintenance sequence.

That also explains why the problem faded over time. Bad or conflicting states eventually age out. Once the stale state is gone, new connections stop colliding with it.

Practical takeaway

If you are debugging rare MySQL SQLSTATE[HY000] [2002] Connection timed out errors through a FreeBSD PF and CARP firewall path, do not stop at broad health checks.

Capture the handshake and compare it with firewall state for the exact same 5-tuple.

In this incident, the useful evidence was:

the client sent SYN retransmissions
the client never saw a SYN/ACK
PF showed the same flow as already established
most other traffic continued to work

That combination points away from a normal MySQL failure and toward a per-flow state problem.

The fix in this environment was to clear the affected PF states for the database endpoint path during the maintenance window, rather than flushing the entire firewall state table.
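
The article does not record the exact commands used. One way to do a targeted clear on FreeBSD is pfctl -k given twice, which kills the state entries from the first host or network to the second. The sketch below only builds those commands and defaults to a dry run; the helper names and addresses are hypothetical.

```python
import subprocess


def kill_states_argv(src, dst):
    """pfctl command that kills states originating at src and going to dst."""
    return ["pfctl", "-k", src, "-k", dst]


def kill_states(pairs, dry_run=True):
    """Build (and optionally run) one pfctl -k command per (src, dst) pair."""
    commands = [kill_states_argv(src, dst) for src, dst in pairs]
    if not dry_run:
        for argv in commands:
            # Needs root on the firewall; check=True raises if pfctl fails.
            subprocess.run(argv, check=True)
    return commands


# Hypothetical application server and database endpoint addresses.
for argv in kill_states([("192.0.2.10", "198.51.100.20")]):
    print(" ".join(argv))
```

Scoping the kill to the application-to-database pairs leaves every unrelated connection through the firewall untouched, which is the reason to prefer it over flushing the whole table.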

For rare failures in high-churn TCP paths, the stateful firewall's view of the exact flow matters more than broad "is the network up?" checks.
