wapp ntpd time jumps 2003-2006.

05apr06

The ntpd (network time protocol) daemon is run on each of the 4 wapps to keep the clocks synchronized with our time servers. Previous measurements have shown that:

on 15apr05 an ntpd clock jump was documented for the wapps.
on 19apr05 ntpd was disabled and the local clock drift on the wapps was measured while the cpu was exercised. The local clock drifted by up to 300 milliseconds in a few minutes. The local clock just counts the 100 Hz tick interrupts, so for it to drift that fast (with ntpd disabled) the kernel must be missing these interrupts. An interrupt is missed if a second one occurs before the previous one has been serviced. With a tick every 10 milliseconds, this means that interrupts must have been left disabled for over 10 milliseconds.
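
The arithmetic behind the missed-interrupt argument can be sketched as follows. At 100 Hz each tick advances the clock by 10 ms, so a 300 ms drift corresponds to roughly 30 lost ticks. The function name is hypothetical, and the sketch assumes every lost tick costs exactly one tick period.

```python
# Sketch: how many 100 Hz tick interrupts must be lost to explain a given drift.
# ASSUMPTION: each lost tick makes the clock fall behind by exactly one tick period.
TICK_HZ = 100  # tick rate quoted in the text


def missed_ticks(drift_ms, tick_hz=TICK_HZ):
    """Return the number of lost ticks that would produce drift_ms of drift."""
    tick_period_ms = 1000.0 / tick_hz  # 10 ms at 100 Hz
    return round(drift_ms / tick_period_ms)


print(missed_ticks(300))  # 30
```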

To further investigate the ntpd clock jumps, the ntp log files for the 4 wapps from 2003 through March 2006 were scanned and the ntpd clock resets were examined. A reset occurs when the local clock has drifted beyond a maximum clock offset or frequency offset set by ntpd. Plots were made of when the time resets occurred as well as the clock and frequency offsets:
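
A scan of this kind might look roughly like the sketch below. It is not the processing used for the plots (that was done in the .pro routines listed at the end). The log line format, the sample lines, and the function name are all assumptions for illustration; the pattern would need adjusting to match the actual ntp log files.

```python
import re

# ASSUMPTION: reset lines look something like
#   "15 Apr 03:12:45 ntpd[1234]: time reset -0.345678 s"
# The real wapp log format may differ.
RESET_RE = re.compile(r"time reset\s+([+-]?\d+(?:\.\d+)?)\s*s\b")


def find_resets(lines):
    """Return (line_text, offset_seconds) for every 'time reset' line."""
    resets = []
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            resets.append((line.rstrip("\n"), float(m.group(1))))
    return resets


# Made-up sample lines, not taken from the wapp logs.
sample = [
    "15 Apr 03:12:45 ntpd[1234]: time reset -0.345678 s\n",
    "15 Apr 03:13:10 ntpd[1234]: synchronized to 10.0.0.1, stratum 2\n",
    "15 Apr 04:02:01 ntpd[1234]: time reset 1.750000 s\n",
]
for text, offset in find_resets(sample):
    print(f"{offset:+.6f} s  <- {text}")
```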

wapp1 time resets and offsets (.ps) (.pdf)
wapp2 time resets and offsets (.ps) (.pdf)
wapp3 time resets and offsets (.ps) (.pdf)
wapp4 time resets and offsets (.ps) (.pdf)

The dashed red lines mark the start of each year.
The dashed green line is when the wapp kernel was upgraded from 2.4.18 to 2.4.31. This occurred on 15jul05 for wapp1,2,3 and on 09jun05 for wapp4.
The top plot shows when the ntpd time resets occurred and how far off the clock was when ntpd reset it. I've clipped values larger than 1 second to 1 second. You can see a marked decrease when the kernels were upgraded.
The middle plot shows the wapp clock offset from the ntp servers. Its offsets also decreased markedly when the kernels were upgraded.
The bottom plot is the frequency offset that ntpd was applying to the wapp clock (in parts per million). It leveled out after the upgrades (it also jumped by about 150 ppm).
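
To put the numbers in these plots in perspective, the sketch below converts a frequency offset in parts per million into clock drift per day and shows the 1 second clipping. The function names are hypothetical, and clipping negative offsets symmetrically at -1 second is an assumption; the text only says that values larger than 1 second were clipped.

```python
SECONDS_PER_DAY = 86400


def drift_per_day(ppm):
    """Seconds of drift accumulated in one day at a given frequency offset."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def clip_offset(offset_s, limit_s=1.0):
    """Clip a reset size to +/- limit_s (symmetric clipping is an assumption)."""
    return max(-limit_s, min(limit_s, offset_s))


print(drift_per_day(150))  # about 12.96 seconds per day
print(clip_offset(1.75))   # 1.0
```

So the 150 ppm jump in the frequency offset corresponds to a correction of roughly 13 seconds per day.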

To figure out why the wapp was busy, a plot was made of the time of day when the time jumps occurred (.ps) (.pdf):

top plot: the size of the time resets versus date. Each wapp is plotted in a different color. Although they decreased, there were still some time jumps after the kernel upgrade.
2nd plot: the hour of day (AST) of the jumps plotted versus date. There are 2 diagonal bands. There are also a number of jumps around 5 am.
3rd plot: the LST hour of the jumps plotted versus date. The diagonal bands of the previous plot line up with observations at 6 and 19 hours LST (galactic time); a fixed LST falls about 4 minutes earlier in AST each day, which makes the bands diagonal. I checked and most of these were the alfa p2030 pulsar search which can use lots of cpu time if snap is run.
4th plot: a histogram of when the jumps occurred versus AST hour. There is an excess a little before 5am. This was caused by the cron jobs that ran every morning around 4am. The main culprit was the slocate routine that would scan all the disks to enable fast disk searches (which are not needed on this machine).
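
The hour-of-day histogram in the 4th plot can be sketched as below. This is only an illustration, not the code that made the plot. It assumes the reset timestamps are already in local AST, and the sample timestamps and function name are made up.

```python
from collections import Counter
from datetime import datetime


def jumps_per_hour(timestamps):
    """Count jumps in each of the 24 hour-of-day bins (index = hour)."""
    counts = Counter(t.hour for t in timestamps)
    return [counts.get(h, 0) for h in range(24)]


# Made-up timestamps, assumed to be in AST already.
sample = [
    datetime(2005, 4, 15, 4, 40),
    datetime(2005, 4, 16, 4, 52),
    datetime(2005, 4, 16, 13, 5),
]
hist = jumps_per_hour(sample)
for hour, n in enumerate(hist):
    if n:
        print(f"{hour:02d}h: {n}")
```

An excess in one bin that shows up on every machine, like the one just before 5am here, points at something scheduled rather than at observing load.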

Conclusions:

When the wapp cpus were busy the local clock would drift. This caused ntpd to eventually generate a time jump.
For the wapp clock to drift so much (without ntp running) it must have been missing clock interrupts. Something must have been disabling interrupts for over 10 milliseconds.
When we upgraded from kernel 2.4.18 to 2.4.31 the number of jumps and clock drifts decreased dramatically.
By default cron is running cpu intensive jobs every day (around 3 or 4 am). There was an increase of time jumps at 4 am (also seen on the aeron cpu). These jobs should be disabled on the data taking cpus (we've now done that for the wapps and aeron).

processing: x101/ntp/wapps/plotwapplogs.pro, plotwapptime_reset.pro
