Troubleshooting with an NTP Time Server Monitor: Common Issues & Fixes

Symptom: reach register shows zeros or intermittent reach; polls fail.
Cause: network/DNS, firewall (UDP 123), server down or rate-limited.
Fixes:
1. Network: ping/tracepath to server; verify routes.
2. Firewall: allow UDP/123 both directions; check NAT rules.
3. DNS: query A/AAAA records; use IP in config to test.
4. Rate limiting: consult remote provider (NTP pool often rate-limits); reduce poll frequency or use closer servers.
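
The reach value that ntpq -p prints is an 8-bit shift register shown in octal, with the most recent poll in the lowest bit. A minimal sketch of decoding it follows; the function names are illustrative and not part of any NTP tool.

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Return the last eight poll results, most recent first."""
    value = int(reach_octal, 8) & 0xFF
    # Bit 0 is the newest poll; each new poll shifts older results left.
    return [bool((value >> bit) & 1) for bit in range(8)]


def successful_polls(reach_octal: str) -> int:
    """Count how many of the last eight polls got a reply."""
    return sum(decode_reach(reach_octal))


for reach in ("377", "360", "17", "0"):
    history = "".join("1" if ok else "0" for ok in decode_reach(reach))
    print(f"reach {reach:>3}: newest-first {history}, "
          f"{successful_polls(reach)}/8 polls answered")
```

A value of 377 means all eight recent polls were answered. A value such as 360 means the four newest polls failed after four earlier successes, which points at a fault that started recently.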

Symptom: persistent large offset or frequent step adjustments.
Cause: wrong ntpd/chrony config, absent/bad reference clocks, VM timekeeping issues, bad hardware clock.
Fixes:
1. Confirm sources: ntpq -p / chronyc sources; ensure multiple healthy servers (prefer 3+).
2. Daemon choice: use chrony for VMs/unstable networks; ntpd/ntpsec for classic setups.
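3. Initial correction: in chrony, use makestep (or initstepslew) to step away large initial offsets; with ntpd, start the daemon with -g so a large initial offset does not make it exit.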
4. Hardware: check the RTC battery; verify that the kernel clocksource (e.g., tsc) is stable, and enable PPS kernel discipline if available.
5. VMs: enable paravirtualized clock (kvm-clock), avoid suspend/resume drift; consider host-based time sync.
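
The source check in step 1 can be scripted. The sketch below parses the peer billboard printed by ntpq -pn, counts sources the daemon currently selects, and flags large offsets. The sample text, the 100 ms threshold, and all names are hypothetical; it assumes the usual ten-column layout with the tally code in the first character and offsets in milliseconds.

```python
from dataclasses import dataclass

# Hypothetical sample in the usual `ntpq -pn` layout (documentation addresses).
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    1.204    0.312   0.087
+192.0.2.11      192.0.2.10       2 u   12   64  377    2.915   -0.451   0.140
-192.0.2.12      198.51.100.7     3 u   40   64  177   18.330  152.700   9.812
 192.0.2.13      .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""


@dataclass
class Peer:
    tally: str        # selection code: '*' system peer, '+' candidate, ...
    remote: str
    stratum: int
    reach: int        # parsed from octal
    offset_ms: float
    jitter_ms: float


def parse_peers(text: str) -> list[Peer]:
    peers = []
    for line in text.splitlines():
        fields = line[1:].split()  # column 0 holds the tally code
        if len(fields) != 10 or fields[0] == "remote":
            continue  # skip the header, the separator, and blank lines
        peers.append(Peer(line[0], fields[0], int(fields[2]),
                          int(fields[6], 8), float(fields[8]), float(fields[9])))
    return peers


def usable(peers: list[Peer]) -> list[Peer]:
    """Sources the daemon currently selects and that have answered recently."""
    return [p for p in peers if p.tally in "*+#o" and p.reach]


peers = parse_peers(SAMPLE)
print("usable sources:", [p.remote for p in usable(peers)])
print("offset above 100 ms:", [p.remote for p in peers if abs(p.offset_ms) > 100])
```

With fewer than three usable sources the daemon cannot reliably outvote a single bad server, which is why the count matters as much as any individual offset.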

Symptom: jitter large relative to offset; unstable selection of system peer.
Cause: network congestion, asymmetric routes, overloaded NTP server, poor GPS/serial connection.
Fixes:
1. Network: perform ping/traceroute; avoid asymmetric routing; move to lower-latency servers.
2. Server load: pick less-busy servers or a local stratum-2 server.
3. GPS/PPS: check cabling, serial/UART settings; use PPS kernel discipline for precision.
4. Poll intervals: increase poll interval to smooth measurements.
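
To see why asymmetric routes matter, recall the standard NTP on-wire calculation from the four timestamps of one exchange. The sketch below simulates an exchange with a known true offset and unequal path delays; the simulate helper and its numbers are illustrative assumptions.

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Standard on-wire formulas: t1/t4 are client times, t2/t3 server times."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate(true_offset: float, forward: float, backward: float) -> tuple[float, float]:
    """Hypothetical exchange: server clock = client clock + true_offset."""
    t1 = 0.0                           # client sends (client clock)
    t2 = t1 + forward + true_offset    # server receives (server clock)
    t3 = t2                            # server replies at once
    t4 = t3 - true_offset + backward   # client receives (client clock)
    return ntp_offset_delay(t1, t2, t3, t4)


for forward, backward in ((0.020, 0.020), (0.030, 0.010)):
    offset, delay = simulate(true_offset=0.0, forward=forward, backward=backward)
    print(f"forward={forward * 1e3:.0f} ms backward={backward * 1e3:.0f} ms "
          f"-> measured offset {offset * 1e3:.1f} ms, delay {delay * 1e3:.1f} ms")
```

Both cases have the same round-trip delay, but the asymmetric one reports an offset error equal to half the difference between the two path delays. The daemon cannot detect this from the delay alone, so the fix has to come from the network side.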

Symptom: unexpected high stratum or frequent KoD (rate limiting).
Cause: misconfigured reference chains, using servers that intentionally limit load.
Fixes:
1. Validate refid chain: ensure your upstream servers have correct upstreams.
2. Avoid abusive polling: use iburst on start and reasonable minpoll/maxpoll values.
3. Respect pools: follow NTP Pool guidelines; add your server to pool responsibly.
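
When choosing minpoll/maxpoll, keep in mind that the values are exponents of two, in seconds, so each step down doubles the load on the server. A small sketch of the arithmetic (helper names are illustrative; 6 and 10 are the usual defaults in ntpd and chrony):

```python
SECONDS_PER_DAY = 86_400


def poll_seconds(exponent: int) -> int:
    """Poll interval in seconds for a minpoll/maxpoll exponent."""
    return 2 ** exponent


def queries_per_day(exponent: int) -> float:
    """Steady-state requests per day sent to one server at that interval."""
    return SECONDS_PER_DAY / poll_seconds(exponent)


for exponent in (4, 6, 10):
    print(f"poll {exponent}: every {poll_seconds(exponent)} s, "
          f"about {queries_per_day(exponent):.0f} queries/day per server")
```

Multiply the per-server figure by the number of clients behind one NAT address to estimate what a public server sees from your site.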

Symptom: sudden multi-second offsets around leap seconds or NTP era boundaries.
Cause: outdated leap-second file, old daemon lacking leap handling, or cross-era servers.
Fixes:
1. Update: ensure ntpd/chrony and leapseconds file are current.
2. Use pools/providers that properly handle leap seconds.
3. Monitor: schedule checks around known leap events.

Symptom: offsets differ when using PPS vs NMEA; PPS shows odd bias.
Cause: serial latency, incorrect time1/time2 offsets, missing kernel PPS support.
Fixes:
1. Measure offset: collect peerstats and compute average GPS offset; apply time1 correction.
2. Enable PPS kernel discipline and point daemon to /dev/pps0.
3. Use gpsd as intermediary if direct driver issues occur; prefer chrony for GPS+PPS.
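
Step 1 can be sketched as follows. The code assumes the ntpd peerstats column order (MJD, seconds past midnight, peer address, status, offset, delay, dispersion, jitter, with offset in seconds); the sample lines and the refclock address are hypothetical. Check the sign convention for time1 in your driver's documentation before applying the result.

```python
import statistics

# Hypothetical peerstats lines for an NMEA refclock at 127.127.20.0.
SAMPLE = """\
60000 100.000 127.127.20.0 9014 -0.118000 0.000000 0.000240 0.004100
60000 116.000 127.127.20.0 9014 -0.122000 0.000000 0.000240 0.003900
60000 132.000 127.127.20.0 9014 -0.120000 0.000000 0.000240 0.004000
60000 140.000 192.0.2.10 961a 0.000310 0.001204 0.000950 0.000087
"""


def offsets_for(text: str, address: str) -> list[float]:
    """Collect the offset column (seconds) for one peer address."""
    offsets = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[2] == address:
            offsets.append(float(fields[4]))
    return offsets


samples = offsets_for(SAMPLE, "127.127.20.0")
print(f"samples: {len(samples)}")
print(f"mean offset:   {statistics.mean(samples):+.6f} s")
print(f"median offset: {statistics.median(samples):+.6f} s")
```

Use at least several hours of data, and prefer the median if the serial link produces occasional outliers.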

Symptom: monitor displays outdated metrics, incomplete MRU lists, or hangs.
Cause: tool limitations (mode-6 limits), insufficient terminal/connection capacity, daemon version mismatch.
Fixes:
1. Upgrade: use up-to-date ntp/ntpsec/chrony and monitoring tool versions.
2. Capacity: ensure link can carry entire NTP load when fetching MRU lists.
3. Poll interval: increase monitor poll interval; enable DNS caching if supported.

Symptom: offsets occur at same times daily.
Cause: cron jobs, backups, snapshots, CPU governor changes, virtualization host tasks.
Fixes:
1. Correlate: check cron, backups, snapshots, host maintenance windows.
2. Adjust thresholds: set monitoring warn/crit thresholds to match the expected sync cadence.
3. Use chrony on systems subject to suspend/resume or periodic stalls.
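
The correlation step can start with a simple grouping of offset spikes by hour of day. The sketch below assumes you already have (timestamp, offset in ms) samples from your monitor; the sample data, threshold, and function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def recurring_hours(samples, threshold_ms: float, min_days: int = 2) -> list[int]:
    """Hours of day in which spikes occurred on at least min_days distinct days."""
    days_by_hour = defaultdict(set)
    for timestamp, offset_ms in samples:
        if abs(offset_ms) >= threshold_ms:
            days_by_hour[timestamp.hour].add(timestamp.date())
    return sorted(hour for hour, days in days_by_hour.items()
                  if len(days) >= min_days)


# Hypothetical samples: spikes at 02:xx on two days, one isolated spike at 14:xx.
SAMPLES = [
    (datetime(2024, 3, 1, 2, 5), 48.0),
    (datetime(2024, 3, 1, 9, 0), 0.4),
    (datetime(2024, 3, 1, 14, 30), 35.0),
    (datetime(2024, 3, 2, 2, 7), -52.0),
    (datetime(2024, 3, 2, 9, 0), 0.3),
]

print("hours with recurring spikes:", recurring_hours(SAMPLES, threshold_ms=20.0))
```

Compare the hours it reports against crontab entries, backup schedules, and snapshot windows on both the guest and the virtualization host.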

When to escalate:
1. Offsets persistently above 100 ms after configuration and network checks.
2. A hardware reference (GPS/PPS) with unexplained large jitter.
3. Suspected kernel timekeeping bugs: collect logs and loopstats, then open a vendor or OS bug report.
