Dump status timeout feb12
LAST MOD:22feb 2012
Intro:
On 11feb12 and again on 19/20feb12 cima returned
a dump status timeout from the mock spectrometers.
- 11feb12: integration stopped early. msg:
Dump status timeout from
b6sNg0..(file server pdevs7)
- 20feb12:
b6sNg0 continues to have dump status timeouts.
- Resolution
- 21feb12:
- we replaced the mock spectrometer (pdev-109) with
(pdev-127). This is used by b6sNg0 (or pdevs7 fileserver).
The spectrometer pdev-109 had ethernet failures in its log
file. The problems (dump timeouts) continued later in the
day.
- 22feb12:
- After much soul searching (actually after much running
of some of the diagnostics, and switching
cables, switch ports, etc.) we finally convinced
ourselves (Luis and I) that the "new" spectrometer box
(pdev-127) that we installed on 21feb12 had its own set of
problems.
- On 22feb12 we replaced the spectrometer pdev-127 with
another "new" spectrometer (pdev-111).
- The diagnostics ran without problems. We took data thru
cima for 2 hours with no errors.
20feb12: b6sXg0 (pdevs7) continues stopping early
On 11feb12 the mock spectrometer aborted a number of times with
"dump status timeout from:". On 19/20feb12 these errors were
repeated. The /tmp/pnetctl.log file on pdevs1 was searched
for mock errors ("staterr"). The file had data from 03feb12 to
20feb12.
The listing shows how often these errors have been occurring (.text):
- The errors occurred on 11,12,14,15,15,17,18,19,20feb12. It looks
like they are occurring more often (assuming the usage has remained
constant).
- The errors are only occurring with b6sNg0.
- This is spectrometer pdev-109 and uses pdevs7 as its
file server.
- There are no xxxxG1 errors. I don't know whether any data
was taken on this file server.
- Looking at the /tmp/pnetctl.stderr on pdevs1 there are a
number of errors:
- Cannot connect to pdev-109:1421
- (this is the psrv port on pdevs7). I'm not sure of the
sequence of these errors and the timeout errors.
- Some things "different" about pdevs7
- For alfa, bm6b is dead. If there was any processing
of the spectra, there might be overflow/underflow interrupts
that could slow things down.
- Looking at the code (bpInp, bpOut), there are no
multiplies, divides, adds, or subtracts going on with the
data, just block moves. So there shouldn't be any
over/underflow errors whatever the data is.
- pdevs7 might be the file server that occasionally has
trouble connecting via the switch on power resets (need
to check with arun).
- It might be a good idea to check the switch used to
connect pdevs7 file server with spectrometer pdev-109
to see if it has any info about the communications.
- This might be trouble with the spectrometer itself.
- I checked the disc i/o rates on 11feb12 and they
seemed to be ok.
20feb12 15:25
The spectrometer failed while I was around. I
used the routine testbpcmd to dump the shared memory
that holds the disc buffers.
For pdevs7 I got:
--> Debug Reply
---->pdevs7
dbg ok
name        Ncur  Nmax  Nmin      Nput      Nget     NgetE  semSt  semMxVal  semCntEntry
aoqSbpfrlL    51    51     0  12948936  12948885         5      2         1           51
aoqSbpfrlS   959  1000   956     50802     49843         0      2         1          959
aoqSbpinp      0     4     0   6866138   6866138  14441844      2         1            0
aoqSbpout      0    50     0   6866138   6866138  13951460      2         1            0
aoqSbpcmd      0     1     0     11911     11911     11908      2         1            0
aoqSbppnt     20    20    19  12440152  12440132         0      2         1           20
aoqSbpagc     20    20    19  12441284  12441264         0      2         1           20
ScramMxSemVal: 1
This may not be meaningful since pdevs2 also had Nmin=0... This may
have occurred a long time ago.
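A quick way to sanity check a dump like this is to parse the rows and compare the counters. The sketch below is only an illustration: it assumes that Ncur is the current number of entries in a queue, that Nput and Nget are running totals (so Ncur should equal Nput - Nget), and that Nmin is the lowest count ever seen. Those readings are guesses from the column names, not taken from the testbpcmd source.

```python
# Rows copied from the pdevs7 debug reply above.
DEBUG_REPLY = """\
aoqSbpfrlL 51 51 0 12948936 12948885 5 2 1 51
aoqSbpfrlS 959 1000 956 50802 49843 0 2 1 959
aoqSbpinp 0 4 0 6866138 6866138 14441844 2 1 0
aoqSbpout 0 50 0 6866138 6866138 13951460 2 1 0
aoqSbpcmd 0 1 0 11911 11911 11908 2 1 0
aoqSbppnt 20 20 19 12440152 12440132 0 2 1 20
aoqSbpagc 20 20 19 12441284 12441264 0 2 1 20
"""

COLUMNS = ["Ncur", "Nmax", "Nmin", "Nput", "Nget",
           "NgetE", "semSt", "semMxVal", "semCntEntry"]


def parse_debug_reply(text):
    """Return {queue name: {column: int}} for each well-formed row."""
    queues = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) + 1:
            continue  # skip blank or malformed rows
        queues[fields[0]] = dict(zip(COLUMNS, map(int, fields[1:])))
    return queues


queues = parse_debug_reply(DEBUG_REPLY)

# Assumed reading: Ncur should equal Nput - Nget if the counters agree.
inconsistent = [name for name, q in queues.items()
                if q["Nput"] - q["Nget"] != q["Ncur"]]

# Queues whose minimum count has reached 0 at some point.
hit_zero = [name for name, q in queues.items() if q["Nmin"] == 0]

print("inconsistent:", inconsistent)
print("Nmin == 0:", hit_zero)
```

Under those assumptions every row is self-consistent, and the queues with Nmin=0 are aoqSbpfrlL, aoqSbpinp, aoqSbpout and aoqSbpcmd. The dump does not say when any of those minimums were reached.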
11feb12: integration stopped early. msg: Dump status timeout from
- 04:38:24 a2623. happened twice
- cima log message:
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0:Dump status timeout from:
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b6s1g0
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0:Received:
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b0s1g0: 768.082 1.31 125440 327680
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b1s1g0: 768.868 1.31 125440 327680
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b2s1g0: 768.606 1.31 125440 131072
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b3s1g0: 768.213 1.31 125440 131072
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b4s1g0: 767.820 1.31 125440 196608
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b5s1g0: 768.606 1.31 125440 131072
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: prun: b6s1g0: 761.528 1.31 125440 131072
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b0s1g0: 883736 1179720 5856
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b1s1g0: 1011840 1179720 5866
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b2s1g0: 1007496 1179720 5866
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b3s1g0: 1004600 1179720 5866
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b4s1g0: 1024872 1179720 5866
2012-Feb-11 04:38:24 NOTE from_mock: From MOCK: infook pnetg0: psrv: b5s1g0: 873600 1179720 5856
2012-Feb-11 04:38:24 ERROR got_mock_message: From MOCK: observation crashed!
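The "Received:" part of the log can be checked mechanically to see which box reported a prun status but no psrv status. The following is a minimal sketch, assuming each status message has been joined onto one line and the timestamp prefix dropped. The regular expression is my own guess, based only on the lines shown here.

```python
import re

# Status lines from the cima log above (timestamp prefix dropped).
RECEIVED = """\
pnetg0: prun: b0s1g0: 768.082 1.31 125440 327680
pnetg0: prun: b1s1g0: 768.868 1.31 125440 327680
pnetg0: prun: b2s1g0: 768.606 1.31 125440 131072
pnetg0: prun: b3s1g0: 768.213 1.31 125440 131072
pnetg0: prun: b4s1g0: 767.820 1.31 125440 196608
pnetg0: prun: b5s1g0: 768.606 1.31 125440 131072
pnetg0: prun: b6s1g0: 761.528 1.31 125440 131072
pnetg0: psrv: b0s1g0: 883736 1179720 5856
pnetg0: psrv: b1s1g0: 1011840 1179720 5866
pnetg0: psrv: b2s1g0: 1007496 1179720 5866
pnetg0: psrv: b3s1g0: 1004600 1179720 5866
pnetg0: psrv: b4s1g0: 1024872 1179720 5866
pnetg0: psrv: b5s1g0: 873600 1179720 5856
"""

# The trailing ':' after the box name keeps the "timeout from" line
# ("pnetg0: psrv: b6s1g0", no status numbers) from matching.
STATUS_RE = re.compile(r"pnetg0:\s*(prun|psrv):\s*(b\d+s\d+g\d+):")


def reporters(text):
    """Return the set of box names that reported, per program."""
    seen = {"prun": set(), "psrv": set()}
    for line in text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            seen[match.group(1)].add(match.group(2))
    return seen


seen = reporters(RECEIVED)
missing_psrv = sorted(seen["prun"] - seen["psrv"])
print(missing_psrv)  # ['b6s1g0']
```

For this log the only box with a prun status and no psrv status is b6s1g0, the same one named in the timeout message.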
What it means:
- Every second, messages are sent back to pnet (running on
pdevs1) reporting that things are still running. They come from:
- prun (program running on each spectrometer). It controls
the fpga, grabs the data and sends it to psrv on the file
server.
- psrv (running on each file server) takes the data from the
spectrometer (via ethernet) and puts it in the shared memory
buffers for bpInp,bpOut to process.
- If pnet does not receive all N messages (where N is the
number of spectrometers you are using) from the pruns and
from the psrvs within 5 seconds of the completion of the last
set, then it aborts with the above message.
- In this case pdevs7 (b6s1g0) did not reply in time.
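The timeout rule above can be sketched as a small check. This is not the pnet code; the function name, the (program, box) keys and the use of seconds as plain floats are all hypothetical, chosen only to illustrate the rule.

```python
def late_or_missing(expected, arrivals, last_set_done, timeout=5.0):
    """Return the (program, box) pairs that missed the deadline.

    expected      -- iterable of (program, box) pairs that must report
    arrivals      -- {(program, box): arrival time in seconds}
    last_set_done -- time the previous complete set finished
    timeout       -- allowed delay after last_set_done (5 secs here)
    """
    deadline = last_set_done + timeout
    # A pair that never arrived is treated as arriving infinitely late.
    return sorted(key for key in expected
                  if arrivals.get(key, float("inf")) > deadline)


# Seven boxes, each with a prun and a psrv, as in the 11feb12 run.
boxes = [f"b{i}s1g0" for i in range(7)]
expected = [(prog, box) for prog in ("prun", "psrv") for box in boxes]

# Everything reports 1 second after the last complete set,
# except psrv for b6s1g0, which never reports.
arrivals = {key: 1.0 for key in expected}
del arrivals[("psrv", "b6s1g0")]

missing = late_or_missing(expected, arrivals, last_set_done=0.0)
print(missing)  # [('psrv', 'b6s1g0')]
```

With one message missing, the check returns a non-empty list, which corresponds to the case where pnet aborts with "Dump status timeout from:".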
- psrv status message info:
- psrv: b0s1g0: 883736 1179720 5856
- b0s1g0 is the spectrometer and band it came from (a separate
psrv for high and low bands).
- 883736 - number of bytes in the current buffer that psrv is
filling. When it is filled, this buffer is passed via
shared memory to bpInp.
- 1179720 - bytes needed to fill a buffer before passing it on.
In this case they were dumping .1 sec stokes spectra and 1
buffer held .9 secs of data.
- 5856 - number of .1 second average spectra completed.
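The three psrv fields can be pulled apart with a few lines of code. This is a minimal sketch that assumes the message always has exactly the layout shown above. The class and field names are made up for the example, and the .1 sec spectrum length applies to this observation only.

```python
from dataclasses import dataclass


@dataclass
class PsrvStatus:
    name: str              # spectrometer and band, e.g. "b0s1g0"
    bytes_in_buffer: int   # bytes in the buffer psrv is filling
    bytes_per_buffer: int  # bytes needed before the buffer is passed on
    spectra_done: int      # averaged spectra completed so far

    @property
    def fill_fraction(self):
        """Fraction of the current buffer that has been filled."""
        return self.bytes_in_buffer / self.bytes_per_buffer


def parse_psrv_status(msg):
    """Parse a message such as 'psrv: b0s1g0: 883736 1179720 5856'."""
    prog, name, *numbers = msg.replace(":", " ").split()
    if prog != "psrv" or len(numbers) != 3:
        raise ValueError(f"not a psrv status message: {msg!r}")
    return PsrvStatus(name, *map(int, numbers))


status = parse_psrv_status("psrv: b0s1g0: 883736 1179720 5856")
print(status.name, round(status.fill_fraction, 3))

# Each averaged spectrum was .1 sec in this observation (10 per second).
print(status.spectra_done / 10, "secs of data completed")
```

For the b0s1g0 message the current buffer was about three quarters full, with 585.6 secs of spectra completed.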
- Looking at /tmp/pnetctl.log on pdevs1 (pnetctl logfile) we see
the previous message:
- msgInp:9.17 MB/s 5324.54 MB [5806:5806]/5997 blocks (96.8%)
- This says that the last complete set of messages from the
psrvs,pruns had completed 5806 out of a requested 5997
averaged spectra.
- You'll notice that those psrv msgs that did reply didn't
have the same number of avg spectra completed (5856 and 5866).
It probably means that some of the pdevs got the message in for
the next second before the timeout occurred.
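The spectra counts can be turned into times with a little arithmetic. The sketch below uses only numbers quoted above and assumes each averaged spectrum is .1 sec. Reading the result as "the psrvs that replied were 5 or 6 seconds past the last complete set" is my inference from the numbers, not something the log states.

```python
SPECTRA_PER_SEC = 10   # .1 sec averaged spectra in this observation

last_complete = 5806   # from the msgInp line in pnetctl.log
requested = 5997
# Spectra counts from the psrv replies for b0..b5 in the cima log.
reported = [5856, 5866, 5866, 5866, 5866, 5856]

percent_done = 100.0 * last_complete / requested
secs_past_last_set = [(n - last_complete) / SPECTRA_PER_SEC
                      for n in reported]

print(f"{percent_done:.1f}% of requested spectra")
print(secs_past_last_set)
```

The percentage reproduces the 96.8% in the log. The psrvs that did reply were 5.0 or 6.0 secs of data past the last complete set, which would be consistent with a 5 second timeout and with some of them getting one more message in.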
What caused the problem?
- psrv was sending the messages back, but they weren't getting
to pnet on pdevs1.
- This means that there was a communications problem on the
local ethernet of the mock spectrometers (or the switch).
- pdevs7 (b6s1g0) never sent the message back in time.
- psrv has to wait too long for a memory buffer to become
available (there is about 1 gb of 20Mbyte buffers
available in the shared memory).
- This would happen if the disc writes (bpOut) for some
reason slowed down or stopped.
- We were outputting a buffer every .9 seconds, so bpOut
would have to get 50 seconds behind??
- Looking at the file sizes:
- pdevs1: 770014080 bytes
- pdevs7: 762935040 bytes
- difference: 7079040 bytes, or about 6 buffers of 1179720
bytes (roughly 5 seconds of data at .9 secs per buffer).
- So it looks like psrv -> bpOut on pdevs7 stopped getting
data from the spectrometer for 5 seconds, or bpOut hung up
trying to output the data
- It doesn't look like a network "messaging error" since
psrv/bpOut continue to take data without any messages from
pnet.
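The file size comparison can be redone as a short calculation. This is only a rough sketch: it uses the buffer size (1179720 bytes) and buffer duration (.9 secs) from the psrv status discussion, and it assumes the two files differ only in the amount of spectral data, ignoring any headers.

```python
size_pdevs1 = 770014080    # bytes, file on pdevs1
size_pdevs7 = 762935040    # bytes, file on pdevs7

bytes_per_buffer = 1179720  # from the psrv status message
secs_per_buffer = 0.9       # one buffer held .9 secs of data

diff_bytes = size_pdevs1 - size_pdevs7
buffers_short = diff_bytes / bytes_per_buffer
secs_short = buffers_short * secs_per_buffer

print(diff_bytes, "bytes")
print(f"{buffers_short:.2f} buffers, {secs_short:.1f} secs of data")
```

The result is close to 6 buffers, or a bit over 5 seconds of data, which agrees with pdevs7 having stopped for about the length of the 5 second timeout.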
Summary:
- pdevs7 psrv stopped sending the 1 sec heartbeats back to pnet
- The data file on pdevs7 had 5 seconds of data less than the
others. So it looks like it actually stopped writing data to
disc.
- there could have been a communication problem between the
pdevs7 spectrometer and psrv on the fileserver, or more likely
something caused bpOut to stop writing data to disc.
- Need to check pdevs7's disc i/o rates.