Troubleshooting with an NTP Time Server Monitor: Common Issues & Fixes
1. Server unreachable / frequent reach failures
- Symptom: reach register shows 0 or gaps (a value other than 377 after eight polls); polls fail.
- Cause: network/DNS, firewall (UDP 123), server down or rate-limited.
- Fixes:
- Network: ping/tracepath to server; verify routes.
- Firewall: allow UDP/123 both directions; check NAT rules.
- DNS: query A/AAAA records; use IP in config to test.
- Rate limiting: consult remote provider (NTP pool often rate-limits); reduce poll frequency or use closer servers.
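The reach column shown by ntpq -p and chronyc sources is an 8-bit shift register printed in octal, so it can be decoded into the outcome of the last eight polls. The following is a minimal sketch of that decoding; the function names decode_reach and summarize_reach are made up for this illustration and are not part of any NTP tool.

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Return the last 8 poll results, oldest first (True = reply received)."""
    value = int(reach_octal, 8)
    if not 0 <= value <= 0o377:
        raise ValueError(f"reach must be 0..377 octal, got {reach_octal!r}")
    # The register shifts left on every poll: bit 7 is the oldest poll,
    # bit 0 is the most recent one.
    return [bool((value >> bit) & 1) for bit in range(7, -1, -1)]


def summarize_reach(reach_octal: str) -> str:
    """Render a reach value as a +/- history string plus a reply count."""
    polls = decode_reach(reach_octal)
    marks = "".join("+" if ok else "-" for ok in polls)
    return f"{reach_octal:>3} -> {marks} ({sum(polls)}/8 replies)"


if __name__ == "__main__":
    for sample in ("377", "0", "357", "1"):
        print(summarize_reach(sample))
```

A value of 377 means all of the last eight polls were answered, 357 means one poll among the last eight was missed, and 1 means only the most recent poll got a reply (typical right after startup).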
2. Large offset (clock drift) or unstable time
- Symptom: persistent large offset or frequent step adjustments.
- Cause: wrong ntpd/chrony config, absent/bad reference clocks, VM timekeeping issues, bad hardware clock.
- Fixes:
- Confirm sources: ntpq -p / chronyc sources; ensure multiple healthy servers (prefer 3+).
- Daemon choice: use chrony for VMs/unstable networks; ntpd/ntpsec for classic setups.
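- Initial correction: with chrony, use makestep or initstepslew to correct large initial offsets; with ntpd, start the daemon with -g so it may apply one large initial correction.
- Hardware: check RTC battery; verify the kernel clocksource (e.g., tsc) is stable and enable the kernel PPS discipline if available.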
- VMs: enable paravirtualized clock (kvm-clock), avoid suspend/resume drift; consider host-based time sync.
3. High jitter and fluctuating delay
- Symptom: jitter large relative to offset; unstable selection of system peer.
- Cause: network congestion, asymmetric routes, overloaded NTP server, poor GPS/serial connection.
- Fixes:
- Network: perform ping/traceroute; avoid asymmetric routing; move to lower-latency servers.
- Server load: pick less-busy servers or a local stratum-2 server.
- GPS/PPS: check cabling, serial/UART settings; use PPS kernel discipline for precision.
- Poll intervals: increase poll interval to smooth measurements.
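To see why asymmetric routes matter, it helps to look at how offset and delay are derived from the four timestamps of a single NTP exchange. The sketch below uses the standard on-wire formulas; the timestamp values are hypothetical and chosen so that the outbound path (30 ms) is slower than the return path (10 ms).

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Compute (offset, delay) in seconds from one request/response pair.

    t1: client transmit, t2: server receive,
    t3: server transmit, t4: client receive.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0  # estimated server-minus-client offset
    delay = (t4 - t1) - (t3 - t2)           # round trip minus server processing time
    return offset, delay


# Hypothetical exchange: the server is truly 50 ms ahead of the client,
# the request takes 30 ms, the reply takes 10 ms, the server spends 1 ms.
T1, T2, T3, T4 = 100.000, 100.080, 100.081, 100.041

offset, delay = ntp_offset_delay(T1, T2, T3, T4)
print(f"estimated offset: {offset * 1000:.1f} ms, delay: {delay * 1000:.1f} ms")
```

With these numbers the estimate is 60 ms against a true offset of 50 ms. The 10 ms error is half the difference between the two path delays, and it can never exceed half the measured round-trip delay, which is why lower-latency servers limit the damage from asymmetry.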
4. Wrong stratum or “kiss-o'-death” / rate-limited responses
- Symptom: unexpectedly high stratum or frequent kiss-o'-death (KoD) packets signalling rate limiting.
- Cause: misconfigured reference chains, using servers that intentionally limit load.
- Fixes:
- Validate refid chain: ensure your upstream servers have correct upstreams.
- Avoid abusive polling: use iburst on start and reasonable minpoll/maxpoll values.
- Respect pools: follow NTP Pool guidelines; add your server to the pool responsibly.
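A KoD reply can be recognized on the wire: it is a server-mode packet with stratum 0 whose reference ID field carries a four-character ASCII kiss code such as RATE or DENY. The sketch below checks a raw 48-byte NTP header for that pattern. The helper name kiss_code and the hand-built test packets are assumptions for illustration only; capturing real packets is out of scope here.

```python
from typing import Optional

NTP_HEADER_LEN = 48
MODE_SERVER = 4


def kiss_code(packet: bytes) -> Optional[str]:
    """Return the kiss code (e.g. 'RATE') if packet is a KoD reply, else None."""
    if len(packet) < NTP_HEADER_LEN:
        raise ValueError("NTP packet must be at least 48 bytes")
    mode = packet[0] & 0x07   # low 3 bits of the first byte
    stratum = packet[1]       # second byte
    if mode != MODE_SERVER or stratum != 0:
        return None
    # Bytes 12..15 hold the reference ID; in a KoD packet it is ASCII text.
    return packet[12:16].decode("ascii", errors="replace").rstrip("\x00")


def build_header(leap: int, version: int, mode: int, stratum: int, refid: bytes) -> bytes:
    """Build a minimal, otherwise zero-filled header for demonstration."""
    first = (leap << 6) | (version << 3) | mode
    return bytes([first, stratum, 6, 0]) + bytes(8) + refid.ljust(4, b"\x00") + bytes(32)


if __name__ == "__main__":
    kod = build_header(leap=3, version=4, mode=MODE_SERVER, stratum=0, refid=b"RATE")
    normal = build_header(leap=0, version=4, mode=MODE_SERVER, stratum=2, refid=bytes([192, 0, 2, 10]))
    print(kiss_code(kod), kiss_code(normal))
```

If RATE shows up repeatedly for a server, that is a strong hint to raise minpoll or pick a different upstream rather than to retry harder.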
5. Leap second or epoch-related errors
- Symptom: sudden multi-second offsets around leap seconds or NTP era boundaries.
- Cause: outdated leap-second file, old daemon lacking leap handling, or cross-era servers.
- Fixes:
- Update: ensure ntpd/chrony and leapseconds file are current.
- Use pools/providers that properly handle leap seconds.
- Monitor: schedule checks around known leap events.
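One way to catch an outdated leap-second file before it causes trouble is to read its expiry line. In the leap-seconds.list format, the line starting with #@ holds the expiry time as seconds since 1900-01-01 (the NTP epoch). The sketch below parses that line from text already read from the file; the sample content is illustrative, and the file location on a given system is not assumed.

```python
from datetime import datetime, timezone
from typing import Optional

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2_208_988_800


def leap_file_expiry(text: str) -> Optional[datetime]:
    """Return the expiry time from the '#@' line, or None if it is absent."""
    for line in text.splitlines():
        if line.startswith("#@"):
            ntp_seconds = int(line[2:].split()[0])
            return datetime.fromtimestamp(ntp_seconds - NTP_UNIX_EPOCH_DELTA, tz=timezone.utc)
    return None


def is_expired(text: str, now: datetime) -> bool:
    """Treat a file without an expiry line as expired, to be on the safe side."""
    expiry = leap_file_expiry(text)
    return expiry is None or now >= expiry


# Illustrative excerpt in leap-seconds.list layout.
SAMPLE_LEAP_FILE = "\n".join([
    "#@\t3928521600",
    "2272060800\t10\t# 1 Jan 1972",
    "3692217600\t37\t# 1 Jan 2017",
])

if __name__ == "__main__":
    print("expires:", leap_file_expiry(SAMPLE_LEAP_FILE))
    print("expired now:", is_expired(SAMPLE_LEAP_FILE, datetime.now(timezone.utc)))
```

A monitor could run a check like this daily and warn some weeks before the expiry date so the file is refreshed in time.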
6. PPS / GPS reference miscalibration
- Symptom: offsets differ when using PPS vs NMEA; PPS shows odd bias.
- Cause: serial latency, incorrect time1/time2 offsets, missing kernel PPS support.
- Fixes:
- Measure offset: collect peerstats and compute average GPS offset; apply time1 correction.
- Enable PPS kernel discipline and point daemon to /dev/pps0.
- Use gpsd as intermediary if direct driver issues occur; prefer chrony for GPS+PPS.
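The "measure offset" step can be sketched as averaging the offset column of ntpd peerstats entries for the NMEA source while a trusted source (such as PPS or a good network peer) disciplines the clock. This assumes the usual peerstats layout of MJD, seconds past midnight, peer address, status, offset, delay, dispersion, jitter; the sample lines and the 127.127.20.0 address are hypothetical. The sign convention for turning the result into a time1 value should be checked against the driver documentation.

```python
from statistics import mean
from typing import Iterable


def mean_peer_offset(lines: Iterable[str], peer_address: str) -> float:
    """Average the offset (seconds) of all peerstats lines for one peer."""
    offsets = []
    for line in lines:
        fields = line.split()
        # Field 2 is the peer address, field 4 the measured offset in seconds.
        if len(fields) < 8 or fields[2] != peer_address:
            continue
        offsets.append(float(fields[4]))
    if not offsets:
        raise ValueError(f"no peerstats entries for {peer_address}")
    return mean(offsets)


# Hypothetical peerstats excerpt: an NMEA refclock lagging by roughly 410 ms.
SAMPLE_PEERSTATS = [
    "60000 100.000 127.127.20.0 961a -0.412000000 0.000000000 0.001000000 0.002000000",
    "60000 164.000 127.127.20.0 961a -0.408000000 0.000000000 0.001000000 0.002000000",
    "60000 170.000 192.0.2.10 963a 0.000120000 0.012000000 0.001000000 0.000300000",
    "60000 228.000 127.127.20.0 961a -0.410000000 0.000000000 0.001000000 0.002000000",
]

if __name__ == "__main__":
    avg = mean_peer_offset(SAMPLE_PEERSTATS, "127.127.20.0")
    print(f"mean NMEA offset: {avg * 1000:.1f} ms")
```

Averaging over several hours of data gives a more trustworthy figure than a handful of samples, since NMEA sentence timing tends to wander.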
7. Monitoring tool shows stale or inconsistent data
- Symptom: monitor displays outdated metrics, incomplete MRU lists, or hangs.
- Cause: tool limitations (mode-6 limits), insufficient terminal/connection capacity, daemon version mismatch.
- Fixes:
- Upgrade: use up-to-date ntp/ntpsec/chrony and monitoring tool versions.
- Capacity: on busy servers, make sure the link and the monitoring host can retrieve the full MRU list while the server keeps handling its normal NTP load.
- Poll interval: increase monitor poll interval; enable DNS caching if supported.
8. Time sync intermittently breaks at predictable times
- Symptom: offset spikes recur at the same times each day.
- Cause: cron jobs, backups, snapshots, CPU governor changes, virtualization host tasks.
- Fixes:
- Correlate: check cron, backups, snapshots, host maintenance windows.
- Adjust thresholds: set monitoring warn/crit thresholds to match the expected sync cadence.
- Use chrony on systems subject to suspend/resume or periodic stalls.
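The correlation step can be started from the daemon's own statistics by grouping offsets by hour of day and looking for hours that stand out. The sketch below does this for ntpd loopstats lines, assuming the usual layout in which the second field is seconds past midnight (UTC) and the third is the offset in seconds; the sample data is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def worst_hours(lines: Iterable[str], top: int = 3) -> list[tuple[int, float]]:
    """Return (hour, mean absolute offset in seconds), largest first."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        hour = int(float(fields[1]) // 3600) % 24  # seconds past midnight -> hour
        buckets[hour].append(abs(float(fields[2])))
    means = {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical loopstats excerpt with a recurring disturbance shortly after 02:00 UTC.
SAMPLE_LOOPSTATS = [
    "60000 3600.000 0.000100000 12.345 0.000050000 0.001000 6",
    "60000 7260.000 0.045000000 12.350 0.010000000 0.002000 6",
    "60000 43200.000 -0.000200000 12.346 0.000060000 0.001000 6",
    "60001 7320.000 -0.051000000 12.360 0.012000000 0.002000 6",
    "60001 50400.000 0.000150000 12.347 0.000050000 0.001000 6",
]

if __name__ == "__main__":
    for hour, avg in worst_hours(SAMPLE_LOOPSTATS):
        print(f"{hour:02d}:00 UTC  mean |offset| = {avg * 1000:.3f} ms")
```

Because loopstats timestamps are in UTC while cron schedules are often in local time, the reported hour needs converting before it is compared with job schedules.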
Quick diagnostics checklist
- Check reach/offset/jitter: ntpq -p or chronyc sources
- Verify connectivity: ping, traceroute, UDP/123 allowed
- Confirm daemon logs and loopstats for trends
- Ensure 3+ healthy time sources and proper minpoll/maxpoll
- For GPS: verify gpsd/PPS and apply measured offsets
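Part of this checklist can be automated by parsing the peers billboard. The sketch below counts healthy sources in captured ntpq -pn output, under the assumption that "healthy" means a tally code of *, + or o together with a non-zero reach. That definition, the function name healthy_sources, and the sample output are choices made for this example, not something ntpq itself defines.

```python
HEALTHY_TALLIES = {"*", "+", "o"}  # system peer, candidate, PPS peer
MIN_HEALTHY = 3                    # the "3+ healthy sources" rule of thumb


def healthy_sources(ntpq_output: str) -> list[str]:
    """Return the remote names of sources that look usable."""
    healthy = []
    for line in ntpq_output.splitlines():
        # Skip blank lines, the column header and the ===== separator.
        if not line.strip() or line.lstrip().startswith(("remote", "=")):
            continue
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue
        reach = int(fields[6], 8)  # reach is printed in octal
        if tally in HEALTHY_TALLIES and reach != 0:
            healthy.append(fields[0])
    return healthy


# Hypothetical capture of `ntpq -pn`; the first character of each row is the tally code.
SAMPLE_NTPQ = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    "*192.0.2.10      .GPS.            1 u   33   64  377    1.234    0.120   0.050",
    "+192.0.2.11      192.0.2.10       2 u   12   64  377    5.678   -0.340   0.210",
    "x192.0.2.12      198.51.100.7     3 u   40   64  377   20.100  150.000   9.900",
    " 192.0.2.13      .INIT.          16 u    -   64    0    0.000    0.000   0.000",
])

if __name__ == "__main__":
    good = healthy_sources(SAMPLE_NTPQ)
    status = "OK" if len(good) >= MIN_HEALTHY else "WARN"
    print(f"{status}: {len(good)} healthy source(s): {', '.join(good)}")
```

In the sample, only two of four sources qualify, so the check would warn even though the daemon currently has a system peer.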
When to escalate
- Persistent >100 ms offsets after configuration and network checks.
- Hardware reference (GPS/PPS) with unexplained large jitter.
- Suspected kernel timekeeping bugs: collect logs and loopstats, then open a vendor/OS bug report.