When the ntpd on aeron lost sync, 2003-2006.
03apr06
The ntp daemon (ntpd) keeps time on the aeron cpu by
communicating with time servers on the ao net. The message log /var/ntp.log
records when events occurred (status monitoring has not been enabled). To
check how the clock is performing, the ntp.log file for 13may03 through
30mar06 was searched for "synchronization lost" events.
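The log search can be sketched in Python as follows. This is only an illustration, not the processing actually used for these notes. It assumes each log line looks like "13 May 04:21:07 ntpd[612]: synchronization lost" (day, month, time, no year); the real layout of /var/ntp.log may differ. The function name sync_lost_events and the sample lines are made up for the example.

```python
import re
from datetime import datetime

# Assumed line layout (not taken from the real log):
#   "13 May 04:21:07 ntpd[612]: synchronization lost"
LINE_RE = re.compile(
    r"^\s*(\d{1,2}) (\w{3}) (\d{2}):(\d{2}):(\d{2}) ntpd\[\d+\]: (.*)$")
MONTHS = {name: i + 1 for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}


def sync_lost_events(lines, start_year):
    """Return the datetimes of all 'synchronization lost' lines.

    The assumed format carries no year, so the year starts at start_year
    and is advanced whenever the month number goes backwards.
    """
    events = []
    year = start_year
    last_month = None
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not in the assumed layout, skip it
        day, mon, hh, mm, ss, msg = m.groups()
        month = MONTHS.get(mon)
        if month is None:
            continue
        if last_month is not None and month < last_month:
            year += 1  # the log rolled over into a new year
        last_month = month
        if "synchronization lost" in msg:
            events.append(
                datetime(year, month, int(day), int(hh), int(mm), int(ss)))
    return events


# Invented sample lines, only to exercise the function.
sample = [
    "30 Dec 04:21:07 ntpd[612]: synchronization lost",
    "30 Dec 04:38:12 ntpd[612]: time reset 0.412 s",
    "02 Jan 04:25:40 ntpd[612]: synchronization lost",
]
events = sync_lost_events(sample, 2003)
```

For the real file one would pass open("/var/ntp.log") as lines and 2003 as start_year, after checking the regular expression against a few actual lines.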
The plot shows when the
ntpd clock synchronization was lost (.ps) (.pdf):
-
Top plot: Hour of day when ntpd lost sync versus date. The red * are
when data was recorded. The black * are when the ntpd lost sync.
-
Middle plot: The fraction of the failures that occurred in each hour of the day.
The histogram used 1 hour bins. 80% of the failures occurred between 4
and 6 am.
-
Bottom plot: A blowup of the histogram of when the failures occurred. This
uses a 10 minute histogram bin. The failures jump up at 4:20 am and stay
there for about 20 minutes.
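The two histograms (1 hour bins and 10 minute bins) amount to the same computation with a different bin width. A minimal Python sketch, assuming the event times are already available as datetime objects; the function name fraction_by_bin and the demo times are invented for illustration and are not the data behind the plots:

```python
from collections import Counter
from datetime import datetime


def fraction_by_bin(events, bin_minutes):
    """Map bin start (minutes after midnight) -> fraction of events in the bin."""
    counts = Counter()
    for t in events:
        minute_of_day = t.hour * 60 + t.minute
        counts[(minute_of_day // bin_minutes) * bin_minutes] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {start: n / total for start, n in sorted(counts.items())}


# Invented event times, only to exercise the function.
demo = [
    datetime(2005, 1, 1, 4, 21),
    datetime(2005, 1, 2, 4, 25),
    datetime(2005, 1, 3, 4, 38),
    datetime(2005, 1, 4, 13, 5),
]
hourly = fraction_by_bin(demo, 60)   # 1 hour bins, as in the middle plot
fine = fraction_by_bin(demo, 10)     # 10 minute bins, as in the bottom plot
```

A bin key of 260 means the bin starting at 4:20 (260 minutes after midnight).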
Looking at the log file, the synchronization lost events
occur when the clock has drifted off. On the wapps (see ntpd
on wapps, and wapp
clock drift), the local clock would drift when the local cpu was busy.
So something is making the aeron cpu busy around 4 am.
Cron jobs on aeron:
The crontab for aeron was:
min | hour | day of month | month | day of week | script dir |
01 | * | * | * | * | /etc/cron.hourly |
02 | 4 | * | * | * | /etc/cron.daily |
22 | 4 | * | * | * | /etc/cron.weekly |
42 | 4 | 1 | | 0 | /etc/cron.monthly |
-
cron.hourly: no scripts
-
cron.daily: 00-logwatch, logrotate.cron, makewhatis.cron, rpm,
slocate.cron, tetex.cron, tmpwatch
-
cron.weekly: makewhatis.cron
-
cron.monthly: no scripts
The problems around 4:20 to 4:40 could come from cron.daily scripts that had not
finished by 4:20 or from cron.weekly scripts that started at 4:22.
A plot of
the time difference between time resets (.ps) (.pdf)
shows lots of time resets spaced by 1 day but only one spaced by a week.
So the problem is the cron.daily scripts. The cpu hog in cron.daily
is probably slocate.cron, which scans and indexes the entire file system.
The failures probably did not start until around 4:20 because the scripts
that run before slocate.cron took a while.
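The daily versus weekly comparison can be sketched by taking the gaps between consecutive reset times and counting how many fall near 1 day and how many near 7 days. The Python below is a hedged illustration: the function name count_spacings, the 30 minute tolerance, and the demo times are assumptions made for the example, not values from the actual analysis.

```python
from datetime import datetime, timedelta


def count_spacings(events, period, tolerance):
    """Count gaps between consecutive events that lie within tolerance of period."""
    ordered = sorted(events)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(1 for gap in gaps if abs(gap - period) <= tolerance)


# Invented reset times, only to exercise the function.
demo = [
    datetime(2005, 3, 1, 4, 22),
    datetime(2005, 3, 2, 4, 25),
    datetime(2005, 3, 3, 4, 21),
    datetime(2005, 3, 10, 4, 30),
]
tol = timedelta(minutes=30)  # assumed tolerance
daily = count_spacings(demo, timedelta(days=1), tol)
weekly = count_spacings(demo, timedelta(days=7), tol)
```

If a weekly job were the cause, the count near 7 days would dominate; a count dominated by 1 day gaps points at the daily scripts.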
Conclusions:
-
The aeron cpu was having lots of time resets. These were clustered around
4 am.
-
The cron.daily scripts were being run every morning at 4 am. They included
a scan of the entire file system (slocate.cron). This was probably keeping
the cpu from processing the ntp clock ticks.
-
On 03apr04 the cron.daily directory was changed to only include the
rotation of the log files. This was done on aeron and on all 4 wapps (which
had the same cron config).
-
The wapps
had a large number of time resets when they were running kernel 2.4.18.
When they were upgraded to 2.4.31 the jumps decreased dramatically. Aeron
is still running 2.4.18. We should upgrade it to 2.4.31.
processing: x101/ntp/aeronLosesSync.pro