SQLSTATE[HY000] [2002] Connection timed out: FreeBSD PF, CARP, and MySQL TCP Failures

A production debugging note on rare MySQL connection timeouts where about 1 in 10,000 TCP connections failed while the rest of the application and database traffic continued normally.

Published: May 5, 2026
Updated: June 6, 2026
Author: Adam Strohl
Reading time: 8 minutes

SQLSTATE[HY000] [2002] Connection timed out

A-Team Systems investigated a production issue where a PHP application intermittently reported this MySQL connection error:

SQLSTATE[HY000] [2002] Connection timed out

This error can come from several places: a down database engine, a firewall rule, a routing problem, a bad switch path, or anything in the middle that prevents the TCP handshake from completing.

In many environments, the cause will be simpler than the case described here. This incident was unusual because almost everything was working, including almost all of the MySQL connections.

The application was making a very large number of database connections, and only about 1 in every 10,000 new TCP connections failed.

Everything else looked healthy: other application traffic was working, most MySQL traffic was working, and the database was not globally unavailable. The database endpoint path showed no obvious signs of CPU, memory, swap, or listen-backlog exhaustion.

Environment: FreeBSD PF, CARP, and MySQL connection churn

The production path looked roughly like this:

Application servers
FreeBSD PF / CARP firewall pair
Database service endpoint
Backend database writer

In this case, there was a SQL proxy layer in the database path, and we suspected it for a while. That was a reasonable suspicion. Bypassing that layer stopped the visible application errors.

But the proxy software was not the problem. The same kind of issue could have shown up on a direct-to-database path with enough short-lived connections and the same firewall behavior in the middle.

The relevant pieces were:

  • FreeBSD PF doing stateful forwarding
  • CARP providing firewall/router failover
  • no PF state synchronization between the two firewall nodes
  • a high rate of short-lived MySQL connections
  • repeated connections from the same application hosts to the same database destination IPs and port 3306

That made the network path part of the investigation. It did not prove that the service at the end of the path was failing.

The error was a TCP connect timeout

The application error was a MySQL connection timeout:

SQLSTATE[HY000] [2002] Connection timed out

This was not the same as a slow query, an authentication failure, or a connection that was established and then dropped later.

The failing requests were not getting far enough for the MySQL protocol to matter. They were failing at TCP connect time.

With the MySQL protocol out of the picture, the useful question was basic:

Did the TCP handshake complete?
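One way to ask that question without involving a MySQL client at all is to attempt only the TCP handshake and record how it ends. The sketch below is an illustration, not the tooling used in this incident; the function name and the outcome labels are our own.

```python
import socket
import time


def probe_tcp_connect(host, port, timeout=5.0):
    """Attempt only the TCP handshake; no MySQL bytes are sent or parsed.

    Returns (outcome, seconds), where outcome is "connected", "timeout",
    "refused", or "error:<ExceptionName>".
    """
    started = time.monotonic()
    try:
        # create_connection returns once the three-way handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        # No SYN/ACK arrived in time: the case described in this article.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The far end (or something on the path) answered with a reset.
        outcome = "refused"
    except OSError as exc:
        outcome = "error:" + type(exc).__name__
    return outcome, time.monotonic() - started
```

Calling something like probe_tcp_connect("db.example.internal", 3306) in a loop (the hostname is a placeholder) separates "the handshake never completed" from every failure that can only happen after a connection exists. At a failure rate near 1 in 10,000, expect to need a very large number of probes before one fails.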

Why the failure was hard to catch

The failure rate was extremely low.

Roughly 1 out of every 10,000 new database connections would hit the timeout. The rest succeeded.

The other problem was that the failure would disappear on its own. The first couple of times we saw it, the errors had stopped by the time we had enough tooling in place to capture the right traffic and compare it against firewall state.

That made it easy to chase the wrong layer. A service can pass health checks while still losing a tiny fraction of new TCP connections. A network can move large volumes of traffic while one exact flow fails. A database endpoint can accept thousands of connections while a small number of new handshakes never complete.
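A little arithmetic shows why health checks are a weak signal at this failure rate. The sketch below assumes each new connection fails independently with the same probability, which is a simplification; the real failures were irregular.

```python
def chance_all_checks_pass(failure_rate, checks):
    """Probability that `checks` independent connection attempts all succeed."""
    return (1.0 - failure_rate) ** checks


# About 1 in 10,000 new connections failed in this incident.
rate = 1 / 10_000
for checks in (100, 1_000, 10_000):
    print(checks, "checks all pass with probability",
          round(chance_all_checks_pass(rate, checks), 3))
```

Under that assumption, a monitor that opens 1,000 connections still has roughly a 90% chance of seeing no failure at all, and even 10,000 probes miss the problem more than a third of the time.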

Packet capture results

The most useful capture was intentionally narrow. We did not need full MySQL payloads, and capturing them would have created unnecessary volume and sensitivity.

The capture focused on the TCP handshake.

On an affected application server, a failed connection looked like this:

app-server:ephemeral-port -> database-endpoint:3306

SYN
SYN retransmission
SYN retransmission
SYN retransmission

No SYN/ACK came back to the client.

The capture matched the application error. The client tried to open a TCP connection to the database endpoint, retransmitted the SYN, and eventually timed out.

This was not a failed SQL query. It was a TCP connection that never established.
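If the capture is exported to simple per-packet records, picking out the failed handshakes can be automated. The sketch below is hypothetical: the Packet record, the flag labels, and the function name are ours, and the capture filter in the comment is one plausible handshake-only filter, not necessarily the one used during this investigation.

```python
from collections import defaultdict
from typing import NamedTuple

# One possible handshake-only capture filter in pcap-filter syntax (assumption):
#   tcp port 3306 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0


class Packet(NamedTuple):
    src: str    # "ip:port" of the sender
    dst: str    # "ip:port" of the receiver
    flags: str  # simplified flag label: "SYN", "SYN/ACK", "ACK", ...


def unanswered_syn_flows(packets, min_syns=2):
    """Return {(client, server): syn_count} for flows that sent at least
    `min_syns` SYNs and never received a SYN/ACK."""
    syn_counts = defaultdict(int)
    answered = set()
    for pkt in packets:
        if pkt.flags == "SYN":
            syn_counts[(pkt.src, pkt.dst)] += 1
        elif pkt.flags == "SYN/ACK":
            # The reply travels server -> client, so flip it to the client's key.
            answered.add((pkt.dst, pkt.src))
    return {flow: count for flow, count in syn_counts.items()
            if count >= min_syns and flow not in answered}
```

A flow with one SYN followed by a SYN/ACK is ignored. A flow with an initial SYN and three retransmissions and no reply is reported with a count of 4.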

PF state for the same flow

The next step was to compare the packet capture with PF state on the active FreeBSD firewall.

For the exact same 5-tuple (source IP, source port, destination IP, destination port, protocol), the application server was still sending SYN retransmissions.

But PF showed that same flow as:

ESTABLISHED:ESTABLISHED

That was the key mismatch.

The client was trying to start a new TCP connection. PF believed the flow was already established.

Once those two observations were matched to the same 5-tuple, the problem no longer looked like a general database endpoint failure. It looked like stale or incorrect firewall state for a specific flow.
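Matching the two views by hand is error-prone when the state table is large. The sketch below parses saved `pfctl -ss` output and looks up one flow. It assumes the common line layout of interface, protocol, two address:port pairs separated by a direction arrow, and a state pair; the layout can differ between PF versions, and NAT entries with extra parenthesized addresses are simply skipped. The addresses are documentation placeholders.

```python
import re

# Assumed layout: IFACE PROTO ADDR:PORT (<-|->) ADDR:PORT STATE:STATE
STATE_LINE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+"
    r"(?P<left>\S+)\s+(?P<dir><-|->)\s+(?P<right>\S+)\s+"
    r"(?P<state>\S+:\S+)$"
)


def find_states(pfctl_ss_text, src, dst, proto="tcp"):
    """Return (iface, direction, state) for every state line whose
    source and destination match the given "ip:port" strings."""
    matches = []
    for line in pfctl_ss_text.splitlines():
        m = STATE_LINE.match(line.strip())
        if not m or m["proto"] != proto:
            continue
        if m["dir"] == "->":
            line_src, line_dst = m["left"], m["right"]
        else:
            # "<-" lines list the destination first, then the source.
            line_src, line_dst = m["right"], m["left"]
        if (line_src, line_dst) == (src, dst):
            matches.append((m["iface"], m["dir"], m["state"]))
    return matches


SAMPLE = """\
all tcp 198.51.100.20:3306 <- 192.0.2.10:51234       ESTABLISHED:ESTABLISHED
all tcp 192.0.2.10:51234 -> 198.51.100.20:3306       ESTABLISHED:ESTABLISHED
all tcp 198.51.100.20:3306 <- 192.0.2.10:51240       FIN_WAIT_2:FIN_WAIT_2
"""

for entry in find_states(SAMPLE, "192.0.2.10:51234", "198.51.100.20:3306"):
    print(entry)
```

If the client-side capture shows only SYN retransmissions for a flow and this lookup returns ESTABLISHED:ESTABLISHED for the same tuple, that is the mismatch described above.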

Why most connections kept working

This is an easy distinction to miss during an incident.

A stateful firewall is not just deciding whether one host can reach another host. It is tracking individual flows.

The unit of failure can be one specific TCP 5-tuple:

client IP + client port + server IP + server port + protocol

If one of those flow states is wrong, that one connection can fail while other connections between the same systems keep working.

This explains the otherwise contradictory symptoms:

  • routing was working
  • the database endpoint was reachable
  • most MySQL connections succeeded
  • other application traffic was unaffected
  • only a very small number of new MySQL TCP connections timed out

"Everything else works" did not rule out the firewall. In this case, it helped rule out broader explanations like general routing failure, database outage, or full-path packet loss.

We also looked for the usual firewall explanations, including PF state table exhaustion. That did not line up with what we were seeing. If the state table were full, or if the firewall were hitting a broad capacity limit, we would expect a much more consistent failure pattern: waves of connection errors, broader impact, or counters that clearly matched the timing.

That was not the pattern here. The failures were isolated and irregular. They looked more like individual flow state mismatches than a firewall running out of capacity.
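Ruling out state table exhaustion can be reduced to one number: current entries divided by the hard limit. The sketch below reads saved output from `pfctl -si` and `pfctl -sm`. The line formats it expects are an assumption based on typical output, and the sample numbers are invented for illustration.

```python
import re


def state_table_usage(info_text, limits_text):
    """Return current state entries as a fraction of the states hard limit."""
    current = re.search(r"^\s*current entries\s+(\d+)", info_text, re.MULTILINE)
    limit = re.search(r"^\s*states\s+hard limit\s+(\d+)", limits_text, re.MULTILINE)
    if current is None or limit is None:
        raise ValueError("could not find state count or limit in the given text")
    return int(current.group(1)) / int(limit.group(1))


# Invented sample values, shaped like typical pfctl output.
INFO_SAMPLE = """\
State Table                          Total             Rate
  current entries                    12500
  searches                       987654321        12345.6/s
"""
LIMITS_SAMPLE = """\
states        hard limit   100000
src-nodes     hard limit    10000
"""

print(f"state table usage: {state_table_usage(INFO_SAMPLE, LIMITS_SAMPLE):.1%}")
```

A table sitting well below its limit while individual handshakes fail points away from a capacity explanation and back toward per-flow state.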

Connection churn and source-port reuse

The affected path had a high rate of short-lived MySQL connections.

TCP connections are identified by their 5-tuple. When a client creates many short-lived connections to the same destination IP and port, it will eventually reuse ephemeral source ports.

Ephemeral source-port reuse is normally fine. It became relevant here because the firewall's view of a flow did not always match the endpoint's view after a firewall restart and failover/failback sequence.
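A rough estimate shows how quickly reuse can happen. Every number below is an assumption: the port range is the common Linux default (32768 to 60999), the connection rate is hypothetical, and the model pretends the client cycles through the whole range before reusing a port. Real allocators are often randomized, so some ports come back sooner.

```python
def seconds_until_port_reuse(first_port, last_port, new_connections_per_second):
    """Time for one client to cycle through its whole ephemeral range when
    every new connection targets the same destination IP and port."""
    ports = last_port - first_port + 1
    return ports / new_connections_per_second


# Assumed values: Linux default ephemeral range, hypothetical 200 connections/s.
cycle = seconds_until_port_reuse(32768, 60999, 200)
print(f"full ephemeral range reused after about {cycle:.0f} seconds")

# Assumed long-lived firewall state timeout, for comparison only.
assumed_state_timeout = 86400
print("reuse cycles within that timeout:", int(assumed_state_timeout // cycle))
```

With these assumed numbers the client returns to the same source port in a little over two minutes. Any firewall state that outlives that interval can be met by a new SYN on the same 5-tuple. The actual timeouts in effect can be checked with `pfctl -s timeouts`.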

The likely sequence was:

  1. One FreeBSD firewall node was active.
  2. Maintenance shifted traffic to the other CARP node.
  3. Many short-lived MySQL connections were interrupted and recreated.
  4. Traffic later shifted back after firewall restart/failback activity.
  5. Application servers continued creating many new connections to the same database endpoint on port 3306.
  6. Some client source ports were reused for the same destination.
  7. PF held, or reconstructed, state that did not match the actual TCP state at the endpoints.
  8. PF treated a new SYN as belonging to a flow it believed was already established.
  9. That specific connection timed out.
  10. Other flows continued normally.

This fit the observed behavior better than database overload, DNS failure, general packet loss, or state table exhaustion.
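The sequence above can be illustrated with a deliberately tiny model. This is a toy, not PF: real PF tracks sequence numbers, windows, and timeouts, and the drop rule here is our simplification of the behavior we observed. It only shows how one stale entry can break one flow while its neighbors keep working.

```python
class ToyStatefulFirewall:
    """Toy per-flow state table keyed by the TCP 5-tuple (not PF's real logic)."""

    def __init__(self):
        self.states = {}  # (src_ip, src_port, dst_ip, dst_port, proto) -> label

    def handle_syn(self, flow):
        if self.states.get(flow) == "ESTABLISHED":
            # Simplifying assumption: a fresh SYN does not fit a connection the
            # firewall already considers established, so it is not forwarded.
            return "dropped"
        self.states[flow] = "SYN_SENT"
        return "forwarded"


fw = ToyStatefulFirewall()
stale = ("192.0.2.10", 51234, "198.51.100.20", 3306, "tcp")
fresh = ("192.0.2.10", 51235, "198.51.100.20", 3306, "tcp")

# A leftover entry from before the maintenance sequence (hypothetical).
fw.states[stale] = "ESTABLISHED"

print(fw.handle_syn(stale))  # reused source port collides with stale state
print(fw.handle_syn(fresh))  # neighboring flow from the same host is unaffected
```

The two flows differ only in source port, yet only the one that collides with the leftover entry fails, matching the "everything else works" symptom.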

Reproduction pattern

A simple CARP role change did not reliably reproduce the issue by itself.

The problem returned after a restart and failover/failback sequence involving both FreeBSD firewall nodes.

A simple "CARP failed over" description was too broad. The issue appeared after the full maintenance pattern: firewall restart behavior, traffic moving between nodes, high MySQL connection churn, and reused flow tuples.

Where the evidence pointed

The evidence pointed to stale or incorrect PF state on the FreeBSD firewall path after the maintenance sequence.

The strongest evidence was not an aggregate counter. It was the mismatch for the same exact flow:

client view:
  SYN retransmissions, no SYN/ACK

PF view:
  same 5-tuple marked ESTABLISHED:ESTABLISHED

The PF state table was not near exhaustion. The database endpoint was not globally unavailable. DNS testing did not point to name resolution as the active cause. Most MySQL traffic continued to succeed. The failures also did not arrive in waves the way we would expect from a broad capacity problem.

This is not the same thing as proving a FreeBSD PF software bug. What we had evidence for was a PF state mismatch for specific flows after the maintenance sequence.

This also explains why the problem faded over time. Bad or conflicting states eventually age out. Once the stale state is gone, new connections stop colliding with it.

Operational takeaway

If you are debugging rare MySQL SQLSTATE[HY000] [2002] Connection timed out errors through a FreeBSD PF and CARP firewall path, do not stop at broad health checks.

Capture the handshake and compare it with firewall state for the exact same 5-tuple.

In this incident, the useful evidence was:

  • the client sent SYN retransmissions
  • the client never saw a SYN/ACK
  • PF showed the same flow as already established
  • most other traffic continued to work

Taken together, those observations point away from a normal MySQL failure and toward a per-flow state problem.

The fix in this environment was to clear the affected PF states for the database endpoint path during the maintenance window, rather than flushing the entire firewall state table.
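As an illustration of what a targeted clear can look like, `pfctl -k source -k destination` kills states from one host to another without touching the rest of the table. The sketch below only builds the command lines for review and does not run anything. The addresses are placeholders, and the exact commands used in this environment are not reproduced here.

```python
import shlex


def pf_kill_commands(app_servers, db_endpoint):
    """Build one `pfctl -k <source> -k <destination>` command per application
    server, limiting the clear to states toward the database endpoint."""
    return [["pfctl", "-k", src, "-k", db_endpoint] for src in app_servers]


# Placeholder addresses from the documentation ranges.
commands = pf_kill_commands(["192.0.2.10", "192.0.2.11"], "198.51.100.20")
for argv in commands:
    print(shlex.join(argv))
```

Generating the commands first and reviewing them before running them on the firewall keeps the change narrow and auditable during a maintenance window.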

For rare failures in high-churn TCP paths, compare the stateful firewall's view of the exact flow with the client's view before relying on broad "is the network up?" checks.

Need help with Linux or FreeBSD production infrastructure?

A-Team Systems provides engineer-led support for production Linux and FreeBSD environments, including troubleshooting, operational oversight, performance work, and long-term infrastructure management.

If your team is dealing with intermittent production behavior that does not line up with simple service health checks, we can help isolate the failure path and define the operational steps needed to prevent a repeat.
