
Diagnosing TIME_WAIT and CLOSE_WAIT

The two most common TCP-state pathologies under high concurrency — one is by design, the other is a bug.

Tags: TIME_WAIT · CLOSE_WAIT · State Machine
Key Idea

In one line: TIME_WAIT is a mandatory wait period in the TCP spec for the active closer; CLOSE_WAIT is a bug in your application — you received the peer's FIN but never called close().

What it is

Active closer:

ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT (2*MSL ≈ 60s) → CLOSED

Passive closer:

ESTABLISHED → CLOSE_WAIT → LAST_ACK → CLOSED
                  ↑
       app forgot close() — gets stuck here

Analogy

TIME_WAIT = after hanging up, you keep the receiver to your ear for a moment to make sure no late echo arrives. CLOSE_WAIT = the other side said "I'm hanging up", you murmured "OK" but never actually put the receiver down, and the line stays busy.

Key concepts

MSL (Max Segment Lifetime)
Maximum lifetime of a segment in the network (~30 s). Waiting 2*MSL ensures any in-flight packets fully die before the 4-tuple is reused.
TIME_WAIT risk: source-port exhaustion
A client doing many short-lived connections quickly burns through the ~28K local ports, leaving them stuck in TIME_WAIT.
CLOSE_WAIT risk: fd leak
Application file descriptors leak slowly, ending in "too many open files".
tcp_tw_reuse (TW reuse)
A Linux knob that lets new outbound connections reuse a local port held by a TIME_WAIT socket.
SO_LINGER = 0 (forced RST close)
Skips the four-way teardown and emits an RST. **Dangerous**: in-flight data is lost.

How it works

Whichever side calls close() first becomes the active closer and is the one that must sit out TIME_WAIT; the other side becomes the passive closer and passes through CLOSE_WAIT.

Practical notes

  • Count states:

    ss -tan | awk '{print $1}' | sort | uniq -c
  • Too many TIME_WAITs (10k+):

    • On the client: net.ipv4.tcp_tw_reuse=1, widen ip_local_port_range.
    • On the server: usually harmless; TIME_WAIT sockets there cost only a little memory, and the server's fixed listening port cannot be exhausted (prefer keep-alive / persistent connections anyway).
    • Do not enable tcp_tw_recycle (removed from Linux).
  • CLOSE_WAIT keeps climbing:

    • Find leaked fds: lsof -p <pid> | grep CLOSE_WAIT.
    • Audit your code's error paths for missing defer conn.Close().
    • Reverse-proxy backends: check whether your idle-timeout actually closes the socket.
  • HTTP keep-alive / connection pools: reusing connections instead of opening one per request is the most permanent fix.

Easy confusions

  • TIME_WAIT: a **protocol requirement**. Disappears on its own after 2*MSL. Usually solvable by tuning sysctls.
  • CLOSE_WAIT: an **application bug**. Stays forever until fixed. Must change code.

Further reading