Glitch in ri A/D system Dig Q1
July 2003
Diego Janches had problems with his meteor data taken on
07jul03. His setup was:
- I,Q sample both channels (fifo 12) at 1 usec
- input voltage levels about 13 mV rms.
- Start the sampler running and leave it running for 12 hours (4mb/sec, 172
  Gbytes total)
How the ri works:
The ri has four 12-bit a/d converters (I1,Q1,I2,Q2). Each
12-bit sample is sign extended into a 16-bit number. The data from each
digitizer is clocked into a 32K by 16-bit fifo by a delayed sample pulse.
The data sits in the fifo until a read request comes from the vme bus.
At that time a rdFifo pulse is generated and one sample at a time
is read out of the fifo. Each 32K by 16-bit fifo is built by combining
two 32K by 8-bit fifo chips (one for the high byte, one for the low byte).
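The sign extension and the hi/lo byte split can be sketched in a few lines of Python. This is only an illustration and not the ri firmware: the function names are made up for this note, and the raw 12-bit inputs are inferred from the 16-bit words in the failure table further down.

```python
def sign_extend_12(raw12):
    """Sign-extend a 12-bit two's-complement value (0..0xFFF) to a signed int."""
    raw12 &= 0xFFF
    return raw12 - 0x1000 if raw12 & 0x800 else raw12


def split_bytes(value):
    """Return the (hi, lo) bytes of the 16-bit two's-complement word for value."""
    word = value & 0xFFFF
    return word >> 8, word & 0xFF


def join_bytes(hi, lo):
    """Rebuild the signed 16-bit sample from its hi and lo bytes."""
    word = (hi << 8) | lo
    return word - 0x10000 if word & 0x8000 else word


# Hypothetical raw a/d values; the resulting words match table rows 0, 1 and 4.
for raw in (0xFFD, 0x00E, 0xFF9):
    value = sign_extend_12(raw)
    hi, lo = split_bytes(value)
    print(f"raw=0x{raw:03X} value={value:4d} hi=0x{hi:02X} lo=0x{lo:02X}")
```

For small signals the hi byte is only ever 0xFF (negative) or 0x00 (positive), which is why a hi byte paired with the wrong lo byte shows up as a jump of about 256 counts.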
Symptoms:
The problem occurred with digitizer Q1. The data
values (a/d counts with 1 count=1.2 mV) ranged from +20 to -20 counts.
At some point there was a glitch and the data would jump from a negative number
to about +250, then to about -250, and then back to +/-20. Once it started,
this jumping continued for the rest of the run. The problem
did not show up when the test patterns were run through the system (zeros,
toggle, staircase tests were run for many hours without a failure).
A histogram
of the meteor data shows that the voltage distribution is skewed. There
are about 15e6 samples in the distribution.
- Fig 1 Top: The vertical scale is full scale counts. The horizontal scale shows
  +/- 40 a/d counts. The black line is skewed for levels -10 to 0.
  The other a/d's have jumps about 0.
- Fig 1 Bottom: The vertical scale is limited to 1e6 counts. The horizontal
  scale is extended to show a/d values +/- 300. There are peaks at +/- 250.
  Digitizer Q1 counts around zero have been moved out here.
To debug the problem, we took the same base band noise
and passed it through 4 opamps and then into the 4 a/d converters. The
signals should then be pretty close for all 4 a/d converters (the offsets
and gains can be a little different). When digitizer Q1 jumps, we can look
at the other digitizers to see what it should have done. The plots show
the glitch with the same signal input to all 4 digitizers:
- Fig 1 Top: 50 samples with the glitch evident at sample 11. Black
  is digitizer Q1.
- Fig 1 Bottom: a blowup of the top plot showing that the 4 digitizers have
  the same signal (with differing offsets). They follow each other until
  sample 6.
- Fig 2: Plot of sample[i+1]-sample[i] for 20 samples. You can see the black
  curve diverge at sample 6. The yellow curve is the Q1 data with the low Q1
  byte at sample 6 thrown out. All the Q1 low8 samples after this are shifted
  forward in time.
The key to the error is that once the error occurs,
it continues for hours. This implies that there is a memory somewhere that
remembers that the error has occurred. This is what made us suspicious
of the fifo chips. The table below shows the data samples around
the start of a failure.
- Col 1. This is the relative sample number as it appears in the data file
  on disc.
- Col 2. This shows the corresponding a/d samples in the data file.
  The two numbers are the samples coming out of fifo Q1Hi and fifo Q1Lo. The
  problem occurs at sample 6. The rdfifo pulse to Q1Lo did not last long enough
  to make the fifo output an 8-bit value. The computer
  ends up reading 0xff for Q1Lo on sample 6 since the fifo output is tristated
  and the bus reads high. Sample 6 Q1Lo stays in the fifo and gets read out on
  sample number 7. The two fifos are now out of sync.
- Col 3. These are the 16-bit data samples in hex. The leftmost two
  hex digits are from Q1Hi, the rightmost two hex digits are from Q1Lo.
  The data at sample 6 should have been FF FA.
- Col 4. This is the col 3 data in decimal.
- Col 5. This has the col 3 data after correction. I removed the Q1Lo byte read
  at sample 6 (the 0xff) and moved all the Q1Lo samples below this up by one.
  There are no longer any jumps.
- Col 6. This is the col 5 data in decimal.
The table below shows the data values (in hex and decimal) where the failure
started.
17 16-bit samples from the Q1 digitizer at the failure

Col 1     | Col 2            | Col 3      | Col 4      | Col 5             | Col 6
Data file | A/D sample       | Data (hex) | Data (dec) | Data (hex)        | Data (dec)
sample    | Q1Hi,Q1Lo        | Q1Hi Q1Lo  |            | Corrected         | Corrected
          |                  |            |            | Q1Hi Q1Lo         |
----------+------------------+------------+------------+-------------------+-----------
0         | 0,0              | FF FD      | -3         | FF FD             | -3
1         | 1,1              | 00 0E      | 14         | 00 0E             | 14
2         | 2,2              | FF FE      | -2         | FF FE             | -2
3         | 3,3              | 00 02      | 2          | 00 02             | 2
4         | 4,4              | FF F9      | -7         | FF F9             | -7
5         | 5,5              | 00 02      | 2          | 00 02             | 2
6 ***     | 6,noQ1Lo rdpulse | FF FF      | -1         | FF FA SKIP FF LO8 | -6
7         | 7,6              | FF FA      | -6         | FF FA             | -6
8         | 8,7              | FF FA      | -6         | FF F7             | -9
9         | 9,8              | FF F7      | -9         | FF ED             | -19
10        | 10,9             | FF ED      | -19        | FF FB             | -5
11        | 11,10            | 00 FB      | 251        | 00 01             | 1
12        | 12,11            | 00 01      | 1          | 00 09             | 9
13        | 13,12            | FF 09      | -247       | FF F1             | -15
14        | 14,13            | 00 F1      | 241        | 00 04             | 4
15        | 15,14            | 00 04      | 4          | 00 00             | 0
16        | 16,15            | FF 00      | -256       | FF XX             |
What is the problem:
The problem is that the rdFifo pulse for fifo Q1Lo was
not long enough to reliably clock the fifo output (we could see this when
we put a scope on the pin). When the clock pulse was missed, the computer
still read the output databus. The "missed" sample would come out on the
next fifo read. The high and low 8 bits would then be out of sync.
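The failure can be reproduced with a small software model of the two byte-wide fifos. This is a sketch, not the ri hardware: the function name and the missed_lo_reads argument are made up for illustration, and the input is the corrected col 6 data for samples 0 to 15 from the table above.

```python
from collections import deque


def read_samples(true_samples, missed_lo_reads):
    """Read 16-bit samples from a hi-byte fifo and a lo-byte fifo.

    On the read numbers in missed_lo_reads the lo fifo misses its read
    pulse: nothing is clocked out and the computer reads 0xFF instead.
    """
    hi_fifo = deque((s & 0xFFFF) >> 8 for s in true_samples)
    lo_fifo = deque(s & 0xFF for s in true_samples)
    out = []
    for n in range(len(true_samples)):
        hi = hi_fifo.popleft()
        if n in missed_lo_reads:
            lo = 0xFF  # lo fifo stayed tristated, bus reads high
        else:
            lo = lo_fifo.popleft()  # after a miss this byte is one sample late
        word = (hi << 8) | lo
        out.append(word - 0x10000 if word & 0x8000 else word)
    return out


# Corrected data (col 6 of the table) for samples 0..15.
true_samples = [-3, 14, -2, 2, -7, 2, -6, -6, -9, -19, -5, 1, 9, -15, 4, 0]
as_read = read_samples(true_samples, missed_lo_reads={6})
print(as_read)
# [-3, 14, -2, 2, -7, 2, -1, -6, -6, -9, -19, 251, 1, -247, 241, 4]
```

The printed values match col 4 of the table. The first big jump is at sample 11 even though the read pulse was missed at sample 6, because samples 6 to 10 are all negative: the hi byte stays 0xFF, so a late lo byte does not produce a large error until the sign changes.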
When did it first occur?
Diego has data from feb02 that has the jumps. Data from
Apr03 does not have the jumps. So it has been around awhile.
Who else might see this jump?
When the two bytes (hi,lo) get out of sync, data
that goes from negative (0xff nn) to positive (0x00 mm) will show jumps.
If the voltage levels are small then these jumps will be obvious. If the
voltage levels are large, then you won't be able to distinguish these jumps
(of order 255) from the actual noise.
Most aeronomy/sband radar programs will take data
continuously for 10 seconds to several minutes. Each new cycle of data taking
will clear the fifo (resyncing the hi/lo data if an error has occurred).
Since Diego took data continuously for hours and he had very low voltage
levels, his experiment was more likely than most to show the problem.
When I tried to generate the error using continuous
sampling, it took on average about 200e6 samples before the error occurred.
Resolution:
On 16jul03 the rdfifo pulse for all the fifos was lengthened
(by 7.5 ns??). After this I ran the ri for 6 hours continuously (about
5e10 samples) and saw no glitches. I've also set up a macro for Diego (rdloop)
that will take data for a specified number of buffers, stop, and then restart
taking ri data. This will clear the fifo and not let any errors continue.
Diego's data can be corrected. You need to search
through the data until you start seeing these jumps. At that point shift
the hi8,lo8 bytes to be in sync again. Continue reading data, applying the
shift, until you start to see jumps after the shift is applied. This is where
another missed fifo pulse occurred. You need to do this for all of the
data. You also need to be careful that you don't mistake rfi/meteors for
the jumps. That shouldn't be too hard since the jumps occur very often
after they start (every time the voltage crosses 0 volts).
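A minimal Python sketch of the shift step, checked against the failure table. It is not the code in the .pro files listed below. The function names and the threshold of 100 counts are assumptions for illustration (the good data stay within +/-20 counts and the jumps are near +/-250). The sketch takes the slip position as an input: the read pulse is missed a few samples before the first visible jump (sample 6 versus sample 11 in the table), and pinning it down needs a comparison with another digitizer or some judgment.

```python
def to_bytes(sample):
    """Return the (hi, lo) bytes of a signed 16-bit sample."""
    word = sample & 0xFFFF
    return word >> 8, word & 0xFF


def to_signed(hi, lo):
    """Rebuild a signed 16-bit sample from its hi and lo bytes."""
    word = (hi << 8) | lo
    return word - 0x10000 if word & 0x8000 else word


def first_jump(samples, start=0, threshold=100):
    """Index of the first sample at or after start with |value| > threshold."""
    for i in range(start, len(samples)):
        if abs(samples[i]) > threshold:
            return i
    return None


def realign(samples, slip_index):
    """Drop the lo byte read at slip_index and move later lo bytes up by one.

    The last sample is left without a lo byte, so the result is one
    sample shorter than the input.
    """
    his = [to_bytes(s)[0] for s in samples]
    los = [to_bytes(s)[1] for s in samples]
    del los[slip_index]
    return [to_signed(h, l) for h, l in zip(his, los)]


# Col 4 of the table: the data as read from disc, samples 0..16.
as_read = [-3, 14, -2, 2, -7, 2, -1, -6, -6, -9, -19,
           251, 1, -247, 241, 4, -256]
print(first_jump(as_read))   # 11
fixed = realign(as_read, 6)  # slip position taken from the table
print(fixed)
# [-3, 14, -2, 2, -7, 2, -6, -6, -9, -19, -5, 1, 9, -15, 4, 0]
```

The output matches col 6 of the table. For a long run you would call first_jump again on the realigned data, starting past the slip, to find where the next missed pulse shows up.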
processing: x101/030715/doplot.pro, usr/t1748/doit.pro