Troubleshooting with an NTP Time Server Monitor: Common Issues & Fixes

1. Server unreachable / low or zero “reach” values

  • Symptom: the reach register shows 0 or keeps dropping below 377 (some of the last eight polls went unanswered); polls fail.
  • Cause: network/DNS, firewall (UDP 123), server down or rate-limited.
  • Fixes:
    1. Network: ping/tracepath to server; verify routes.
    2. Firewall: allow UDP/123 both directions; check NAT rules.
    3. DNS: query A/AAAA records; use IP in config to test.
    4. Rate limiting: consult remote provider (NTP pool often rate-limits); reduce poll frequency or use closer servers.
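
The reach value shown by ntpq -p and chronyc sources is an 8-bit shift register printed in octal, so 377 means the last eight polls were all answered. As a rough illustration of how to read it, the sketch below decodes a reach string into a poll history. The function names decode_reach and summarize_reach are made up for this example and are not part of any NTP tool.

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Return the last eight poll results, most recent first."""
    value = int(reach_octal, 8) & 0xFF
    # Bit 0 holds the most recent poll; bit 7 holds the oldest of the eight.
    return [bool((value >> bit) & 1) for bit in range(8)]


def summarize_reach(reach_octal: str) -> str:
    """Turn a reach string into a one-line, human-readable summary."""
    successes = sum(decode_reach(reach_octal))
    if successes == 8:
        return "all of the last 8 polls answered"
    if successes == 0:
        return "no replies in the last 8 polls"
    return f"{successes} of the last 8 polls answered"


if __name__ == "__main__":
    for sample in ("377", "357", "1", "0"):
        print(sample, "->", summarize_reach(sample))
```

A value such as 357 points to a single lost reply a few polls ago, which usually suggests packet loss or rate limiting rather than a hard firewall block.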

2. Large offset (clock drift) or unstable time

  • Symptom: persistent large offset or frequent step adjustments.
  • Cause: wrong ntpd/chrony config, absent/bad reference clocks, VM timekeeping issues, bad hardware clock.
  • Fixes:
    1. Confirm sources: ntpq -p / chronyc sources; ensure multiple healthy servers (prefer 3+).
    2. Daemon choice: use chrony for VMs/unstable networks; ntpd/ntpsec for classic setups.
    3. Initial correction: use chrony's makestep or initstepslew directives (or start ntpd with -g) to correct large initial offsets.
    4. Hardware: check the RTC battery; verify that the kernel clocksource (e.g., tsc) is stable, and enable kernel PPS discipline if available.
    5. VMs: enable paravirtualized clock (kvm-clock), avoid suspend/resume drift; consider host-based time sync.
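
To make the "3+ healthy servers" check repeatable, the peer table can be parsed by a script. The following is a minimal sketch that assumes the classic ntpq -p column layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter) with the tally code in the first column. The sample text, the Peer class, and the definition of "healthy" (tally *, + or o with a non-zero reach) are assumptions for illustration, not output from a real system.

```python
from dataclasses import dataclass

TALLY_CHARS = " *+-x.#o"

# Illustrative sample only; real output comes from running `ntpq -p`.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    1.234    0.412   0.087
+192.0.2.11      198.51.100.1     2 u   12   64  377    8.901   -0.733   0.215
-192.0.2.12      198.51.100.2     2 u   40   64  357   25.410    3.902   1.904
 192.0.2.13      .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""


@dataclass
class Peer:
    tally: str
    remote: str
    stratum: int
    reach: int        # decoded from octal
    delay_ms: float
    offset_ms: float
    jitter_ms: float


def parse_ntpq_peers(text: str) -> list[Peer]:
    peers = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith(("remote", "=")):
            continue  # skip blank lines and the two header lines
        if line[0] in TALLY_CHARS:
            tally, fields = line[0], line[1:].split()
        else:
            tally, fields = " ", line.split()
        if len(fields) != 10:
            continue  # not a peer row in the assumed layout
        peers.append(Peer(tally, fields[0], int(fields[2]), int(fields[6], 8),
                          float(fields[7]), float(fields[8]), float(fields[9])))
    return peers


def healthy_sources(peers: list[Peer]) -> list[Peer]:
    """Peers that are selected or candidates and have answered recently."""
    return [p for p in peers if p.tally in ("*", "+", "o") and p.reach != 0]


if __name__ == "__main__":
    good = healthy_sources(parse_ntpq_peers(SAMPLE))
    print(f"{len(good)} healthy source(s); 3 or more is preferred")
```

With the sample above only two sources qualify, which is the situation where adding another server is worth considering.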

3. High jitter and fluctuating delay

  • Symptom: jitter large relative to offset; unstable selection of system peer.
  • Cause: network congestion, asymmetric routes, overloaded NTP server, poor GPS/serial connection.
  • Fixes:
    1. Network: perform ping/traceroute; avoid asymmetric routing; move to lower-latency servers.
    2. Server load: pick less-busy servers or a local stratum-2 server.
    3. GPS/PPS: check cabling, serial/UART settings; use PPS kernel discipline for precision.
    4. Poll intervals: increase poll interval to smooth measurements.
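
Asymmetric routes matter because NTP assumes the outbound and return paths take equal time. The sketch below uses the standard on-wire formulas (offset = ((t2 - t1) + (t3 - t4)) / 2, delay = (t4 - t1) - (t3 - t2)) to show that the measured offset is wrong by half the difference between the two path delays. The simulate_exchange helper and its numbers are invented for illustration.

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """t1 client send, t2 server receive, t3 server send, t4 client receive (seconds)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate_exchange(true_offset: float, forward_delay: float,
                      return_delay: float, server_hold: float = 0.0001):
    """Hypothetical exchange; true_offset is server clock minus client clock."""
    t1 = 100.0                                  # read on the client clock
    t2 = t1 + forward_delay + true_offset       # read on the server clock
    t3 = t2 + server_hold                       # read on the server clock
    t4 = t3 - true_offset + return_delay        # read on the client clock
    return ntp_offset_delay(t1, t2, t3, t4)


if __name__ == "__main__":
    for fwd, ret in ((0.010, 0.010), (0.030, 0.010)):
        offset, delay = simulate_exchange(0.005, fwd, ret)
        print(f"forward={fwd * 1e3:.0f} ms return={ret * 1e3:.0f} ms "
              f"-> measured offset={offset * 1e3:.3f} ms, delay={delay * 1e3:.3f} ms")
```

In the second case the true offset is still 5 ms, but the 20 ms asymmetry adds a 10 ms error that no amount of filtering can remove, which is why a closer or symmetrically routed server tends to help more than tuning.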

4. Wrong stratum or “kiss-o'-death” (KoD) / rate-limited responses

  • Symptom: unexpected high stratum or frequent KoD (rate limiting).
  • Cause: misconfigured reference chains, using servers that intentionally limit load.
  • Fixes:
    1. Validate refid chain: confirm that each upstream server is itself synchronized to a valid source (check its stratum and refid).
    2. Avoid abusive polling: use iburst on start and reasonable minpoll/maxpoll values.
    3. Respect pools: follow NTP Pool guidelines; add your server to pool responsibly.
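
On the wire, a kiss-o'-death reply is an NTP packet with stratum 0 whose 4-byte reference ID carries an ASCII code such as RATE, DENY, or RSTR. The sketch below shows one way to spot such replies in a captured 48-byte header. The classify_reply and make_packet helpers are hypothetical, and the packets are built locally rather than taken from a real server.

```python
import struct

# Kiss codes defined for NTPv4; other codes exist and are reported generically here.
KOD_MEANINGS = {
    "RATE": "rate exceeded, reduce polling",
    "DENY": "access denied by remote server",
    "RSTR": "access denied by local policy on the server",
}


def classify_reply(packet: bytes) -> str:
    """Classify a raw NTP reply using only the fixed 48-byte header."""
    if len(packet) < 48:
        raise ValueError("NTP header must be at least 48 bytes")
    stratum = packet[1]
    refid = packet[12:16]
    if stratum == 0:
        code = refid.decode("ascii", errors="replace").rstrip("\x00")
        meaning = KOD_MEANINGS.get(code, "unrecognized kiss code")
        return f"KoD {code}: {meaning}"
    return f"stratum {stratum} reply"


def make_packet(stratum: int, refid: bytes, leap: int = 0) -> bytes:
    """Build a minimal server-mode (mode 4, version 4) header for testing."""
    first = (leap << 6) | (4 << 3) | 4
    # LI/VN/mode, stratum, poll, precision, root delay, root dispersion,
    # reference ID, then four zeroed 8-byte timestamps.
    return struct.pack("!BBbb4x4x4s32x", first, stratum, 6, -20, refid)


if __name__ == "__main__":
    print(classify_reply(make_packet(0, b"RATE", leap=3)))
    print(classify_reply(make_packet(2, bytes([192, 0, 2, 10]))))
```

A client that keeps receiving RATE should back off its poll interval rather than retry faster.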

5. Leap second or epoch-related errors

  • Symptom: sudden multi-second offsets around leap seconds or NTP era boundaries.
  • Cause: outdated leap-second file, old daemon lacking leap handling, or servers that disagree about the current NTP era.
  • Fixes:
    1. Update: ensure ntpd/chrony and the leap-seconds file are current.
    2. Use pools/providers that properly handle leap seconds.
    3. Monitor: schedule checks around known leap events.
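
One check that is easy to automate is whether the leap-seconds file has expired. The sketch below assumes the common leap-seconds.list layout, in which a line starting with #@ holds the expiry time as seconds since 1900-01-01 (the NTP epoch). The sample text is illustrative and the function names are made up for this example.

```python
from datetime import datetime, timezone

NTP_TO_UNIX = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

# Illustrative excerpt in the assumed leap-seconds.list layout.
SAMPLE = "#$\t3676924800\n#@\t3960057600\n3692217600\t37\t# 1 Jan 2017\n"


def leap_file_expiry(text: str):
    """Return the expiry time from the '#@' line, or None if it is missing."""
    for line in text.splitlines():
        if line.startswith("#@"):
            ntp_seconds = int(line[2:].split()[0])
            return datetime.fromtimestamp(ntp_seconds - NTP_TO_UNIX, tz=timezone.utc)
    return None


def is_expired(text: str, now: datetime) -> bool:
    """Treat a file without an expiry line as expired, to be on the safe side."""
    expiry = leap_file_expiry(text)
    return expiry is None or now >= expiry


if __name__ == "__main__":
    print("expires:", leap_file_expiry(SAMPLE))
    print("expired now:", is_expired(SAMPLE, datetime.now(timezone.utc)))
```

Running a check like this from the monitoring system gives warning well before a stale file can affect leap-second handling.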

6. PPS / GPS reference miscalibration

  • Symptom: offsets differ when using PPS vs NMEA; PPS shows odd bias.
  • Cause: serial latency, incorrect time1/time2 offsets, missing kernel PPS support.
  • Fixes:
    1. Measure offset: collect peerstats, compute the average offset of the GPS (NMEA) source, and apply it as a fudge correction (time1 or time2, depending on the driver).
    2. Enable PPS kernel discipline and point daemon to /dev/pps0.
    3. Use gpsd as intermediary if direct driver issues occur; prefer chrony for GPS+PPS.
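
Measuring the offset can be scripted. The sketch below averages the offset column for one source, assuming the usual ntpd peerstats layout (MJD, seconds past midnight, peer address, status, offset, delay, dispersion, jitter). The sample lines and the refclock address 127.127.20.0 are illustrative, and the sign to use for the fudge value should be checked against the driver documentation before applying it.

```python
from statistics import mean

# Illustrative peerstats lines; real ones come from ntpd's statistics directory.
SAMPLE = [
    "60000 100.000 127.127.20.0 9014 -0.130000 0.000000 0.000240 0.000310",
    "60000 164.000 127.127.20.0 9014 -0.132000 0.000000 0.000240 0.000290",
    "60000 180.000 192.0.2.10 963a 0.000400 0.012000 0.001000 0.000150",
    "60000 228.000 127.127.20.0 9014 -0.131000 0.000000 0.000240 0.000300",
]


def mean_peer_offset(lines, address: str) -> float:
    """Average the offset (seconds) reported for one peer address."""
    offsets = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 5 and fields[2] == address:
            offsets.append(float(fields[4]))
    if not offsets:
        raise ValueError(f"no peerstats entries for {address}")
    return mean(offsets)


if __name__ == "__main__":
    avg = mean_peer_offset(SAMPLE, "127.127.20.0")
    print(f"mean NMEA offset: {avg * 1e3:.1f} ms over the sample")
```

Averaging over several hours, while the system clock is held by PPS or good network sources, tends to give a more trustworthy correction than a handful of samples.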

7. Monitoring tool shows stale or inconsistent data

  • Symptom: monitor displays outdated metrics, incomplete MRU lists, or hangs.
  • Cause: tool limitations (mode-6 limits), insufficient terminal/connection capacity, daemon version mismatch.
  • Fixes:
    1. Upgrade: use up-to-date ntp/ntpsec/chrony and monitoring tool versions.
    2. Capacity: make sure the link to the daemon has enough headroom to return full MRU lists on top of the normal NTP load.
    3. Poll interval: increase monitor poll interval; enable DNS caching if supported.

8. Time sync intermittently breaks at predictable times

  • Symptom: offsets occur at same times daily.
  • Cause: cron jobs, backups, snapshots, CPU governor changes, virtualization host tasks.
  • Fixes:
    1. Correlate: check cron, backups, snapshots, host maintenance windows.
    2. Adjust thresholds: set the monitor's warning/critical thresholds to match the expected sync cadence.
    3. Use chrony on systems subject to suspend/resume or periodic stalls.
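
Correlating offsets with scheduled jobs is easier once the data is grouped by time of day. The sketch below buckets loopstats entries by hour and reports the hours whose worst absolute offset exceeds a threshold, assuming the usual ntpd loopstats layout (MJD, seconds past midnight, offset, frequency, jitter, wander, time constant). The sample lines and the 5 ms threshold are invented, and loopstats times are in UTC, so they need converting before comparing with local cron schedules.

```python
from collections import defaultdict

# Illustrative loopstats lines spanning two days.
SAMPLE = [
    "60000 7260.000 0.012000000 13.778 0.000351733 0.013380 6",
    "60000 43200.000 0.000120000 13.779 0.000101000 0.002100 6",
    "60001 7320.500 0.015300000 13.781 0.000420000 0.015000 6",
    "60001 50400.000 -0.000200000 13.780 0.000098000 0.001900 6",
]


def worst_offset_by_hour(lines) -> dict[int, float]:
    """Map hour of day (UTC) to the largest absolute offset seen in that hour."""
    worst = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a loopstats row in the assumed layout
        hour = int(float(fields[1]) // 3600)
        worst[hour] = max(worst[hour], abs(float(fields[2])))
    return dict(worst)


def suspicious_hours(lines, threshold_s: float) -> list[int]:
    """Hours whose worst offset exceeds the threshold, in ascending order."""
    return sorted(hour for hour, offset in worst_offset_by_hour(lines).items()
                  if offset > threshold_s)


if __name__ == "__main__":
    print("hours (UTC) with offsets over 5 ms:", suspicious_hours(SAMPLE, 0.005))
```

An hour that shows up on several consecutive days is a good candidate to compare against backup, snapshot, and host maintenance schedules.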

Quick diagnostics checklist

  1. Check reach/offset/jitter: ntpq -p or chronyc sources
  2. Verify connectivity: ping, traceroute, UDP/123 allowed
  3. Review daemon logs and loopstats for trends
  4. Ensure 3+ healthy time sources and proper minpoll/maxpoll
  5. For GPS: verify gpsd/PPS and apply measured offsets

When to escalate

  • Persistent >100 ms offsets after configuration and network checks.
  • Hardware reference (GPS/PPS) with unexplained large jitter.
  • Suspected kernel timekeeping bugs — collect logs/loopstats and open vendor/OS bug report.
