wapp ntpd time jumps 2003-2006.
05apr06
The ntpd (network time protocol) daemon is run on
each of the 4 wapps to keep the clocks synchronized with our time servers.
Previous measurements have shown that:
-
on 15apr05 an ntpd
clock jump was documented for the wapps.
-
on 19apr05 ntpd was disabled and the local
clock drift on the wapps was measured while the cpu was exercised.
The local clock on the computer would drift by up to 300 milliseconds in
a few minutes. The local clock just counts the 100 Hz tick interrupts,
one every 10 milliseconds. For the clock to drift so fast (with ntpd
disabled) the kernel must be missing these interrupts. An interrupt is
missed if a second one occurs before the previous one was serviced. This
means that interrupts must have been left disabled for over 10 milliseconds.
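The arithmetic behind that argument can be sketched in a few lines of Python. The 100 Hz tick rate and the 300 millisecond drift come from the measurements above; the 180 seconds used for "a few minutes" is only an assumed value for illustration, and the function names are made up for this sketch.

```python
TICK_HZ = 100                  # timer interrupt rate quoted above
TICK_MS = 1000 / TICK_HZ       # clock time credited per tick: 10 ms

def missed_ticks(drift_ms):
    """Number of lost tick interrupts needed to explain a clock lag of drift_ms."""
    return drift_ms / TICK_MS

def lost_fraction(drift_ms, interval_s):
    """Fraction of all ticks lost if the lag built up over interval_s seconds."""
    return missed_ticks(drift_ms) / (interval_s * TICK_HZ)

print(missed_ticks(300))         # 30.0 ticks for a 300 ms lag
print(lost_fraction(300, 180))   # assumed 180 s interval: roughly 0.0017
```

So a 300 millisecond lag needs only about 30 lost interrupts, each one implying a stretch of more than 10 milliseconds with interrupts disabled.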
To further investigate the ntp clock jumps,
the ntp log files for the 4 wapps from 2003 thru March 2006 were scanned
and the ntpd clock resets were examined. A reset occurs when the local
clock has drifted beyond a maximum clock offset or frequency offset set
by ntpd. Plots were made of when the time resets occurred as well as
the clock and frequency offsets:
wapp1
time resets and offsets (.ps) (.pdf)
wapp2
time resets and offsets (.ps) (.pdf)
wapp3
time resets and offsets (.ps) (.pdf)
wapp4
time resets and offsets (.ps) (.pdf)
-
The dashed red lines are the start of each year
-
The dashed green line is when the wapp kernel was upgraded from 2.4.18
to 2.4.31. This occurred on 15jul05 for wapp1,2,3 and on 09jun05 for wapp4.
-
The top plot shows when the ntpd time resets occurred and how far out the
clock was when ntpd reset it. I've clipped values larger than 1 second
to 1 second. You can see a marked decrease when the kernels were upgraded.
-
The middle plot has the wapp clock offset from the ntp servers. Its offsets
also decreased markedly when the kernels were upgraded.
-
The bottom plot is the frequency offset that ntpd was applying to the wapp
clock (in parts per million). It leveled out after the upgrades (it also
jumped by about 150 ppm).
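The log scan and the 1 second clipping described above could look roughly like the Python sketch below. It is not the processing code used for the plots (that was done in the .pro routines listed at the end). The "time reset ... s" line format is an assumption about what ntpd writes to the log, the sample lines are made up, and clipping both signs to +/-1 second is a guess at how the clipping was applied. The last function just converts a frequency offset in ppm to seconds of drift per day, to give a feel for the size of a 150 ppm jump.

```python
import re

# Assumed log line shape: "... ntpd[812]: time reset -0.359022 s"
RESET_RE = re.compile(r"time reset\s+([+-]?\d+(?:\.\d+)?)\s*s\b")

def reset_sizes(lines, clip=1.0):
    """Return the size in seconds of each time reset, clipped to +/-clip."""
    sizes = []
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            value = float(m.group(1))
            sizes.append(max(-clip, min(clip, value)))
    return sizes

def ppm_to_seconds_per_day(ppm):
    """Clock drift per day that a frequency offset of ppm corresponds to."""
    return ppm * 1e-6 * 86400

sample = [  # made-up lines, for illustration only
    "15 Apr 04:02:11 ntpd[812]: time reset -0.359022 s",
    "15 Apr 04:20:45 ntpd[812]: synchronized to 10.0.0.1",
    "16 Apr 13:10:02 ntpd[812]: time reset 2.750000 s",
]
print(reset_sizes(sample))            # [-0.359022, 1.0]
print(ppm_to_seconds_per_day(150))    # about 12.96 seconds per day
```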
To figure out why the wapps were busy, a plot was made of the
time of day when the time jumps occurred (.ps) (.pdf):
-
top plot: the size of the time resets versus date. Each wapp is plotted
in a different color. Although they decreased, there were still some time
jumps after the kernel upgrade.
-
2nd plot: The hour of day (AST) of the jumps plotted versus date.
There are 2 diagonal bands. There are also a number of jumps around 5 am.
-
3rd plot: The LST hour of the jumps plotted versus date. The diagonal
bands of the previous plot line up with observations at 6 and 19 hours
LST (galactic time). I checked and most of these were the alfa p2030
pulsar search, which can use lots of cpu time if snap is run.
-
4th plot: A histogram of when the jumps occurred versus AST hour. There
is an excess a little before 5 am. This was caused by the cron jobs
that were being run every morning around 4 am. The main culprit was the
slocate routine, which scans all the disks to enable fast file searches
(which are not needed on this machine).
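A histogram like the one in the 4th plot can be sketched in Python as below. This assumes the jump times are available as UTC timestamps (the logs may well be in local time, in which case no conversion is needed), and the sample times are made up. AST is taken as a fixed UTC-4 with no daylight saving.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

AST = timezone(timedelta(hours=-4))   # Atlantic Standard Time, fixed offset

def jumps_per_ast_hour(utc_times):
    """Count jumps in each AST hour of day; returns a list of 24 counts."""
    counts = Counter(t.astimezone(AST).hour for t in utc_times)
    return [counts.get(h, 0) for h in range(24)]

sample = [  # made-up jump times, for illustration only
    datetime(2005, 4, 15, 8, 40, tzinfo=timezone.utc),   # 04:40 AST
    datetime(2005, 4, 16, 8, 55, tzinfo=timezone.utc),   # 04:55 AST
    datetime(2005, 4, 16, 17, 5, tzinfo=timezone.utc),   # 13:05 AST
]
hist = jumps_per_ast_hour(sample)
print(hist[4], hist[13])   # 2 1
```

An excess in one or two bins that stays at the same AST hour all year points at something scheduled by the wall clock (like cron), as opposed to the diagonal bands of a source observed at a fixed LST.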
Conclusions:
-
When the wapp cpus were busy the local clock would drift. This caused ntpd
to eventually generate a time jump.
-
For the wapp clock to drift so much (without ntp running) it must have
been missing clock interrupts. Something must have been disabling interrupts
for over 10 milliseconds.
-
When we upgraded from kernel 2.4.18 to 2.4.31 the number of jumps and clock
drifts decreased dramatically.
-
By default cron runs cpu-intensive jobs every day (around 3 or 4
am). There was an increase of time jumps at 4 am (also seen
on the aeron cpu). These jobs should be disabled on the data
taking cpus (we've now done that for the wapps and aeron).
processing: x101/ntp/wapps/plotwapplogs.pro, plotwapptime_reset.pro