Timeouts from disc i/o slowdown

29mar23

The problem

12meter data taking with the mock spectrometers was occasionally having an error and causing the datataking to stop.

On 29mar23 at 13:23:56 one of these error occurred.

The error printed on the user terminal was:

2023-03-29_13:23:56 mockpset:error mockpset grp g1 (0). stat:er retVal:pseterr pdevs2.pdev.naic.edu psetReply:.. retrying
This occurred when the tcl code tried to send the pset hdr info to the mock output program before the start of a scan.

The mock datataking programs consist of:

tcl program that user interacts with
pnetctl - running on pdevs1.

main program that talks to all of the programs on each mock spectrometer
the tcl program talks to to pnetctl.

For each spectrometer (up to 7) there are a set of programs used to take the data.

bufferpool:the local programs communicate via input and output queues that sit in shared memory.

the memory is referred to as the buffer pool

there is about 4 Gbytes of memory in the pool
it contains large buffers (for data)
small buffers to pass commands back and forth.
this allows queuing of data and commands.

prun - control program running on the mock fpga.

collects the data and send it via local ethernet to the psrv program running on the mock cpu

psrv - runs on the mock cpu

started by xinetd when the prun program connects to psrv's port
buffers the data from prun and then passes it on to the buffer pool in shared memory.
this program was originally written by jeff, it also has a number of diagnostic routines.

bpInp :buffer pool input program.

Takes data that psrv put on bpInp's input queue
Bit flips all spectra
Adds some info to the header
passes the data to the input queue of the bpOut program

bpOut: program to continue filling the header and then output the data to disc.

receives data buffers on it's input queue.
fills in the header
outputs the data to disc
It also receives commands/info from pnetctl via it's input queue (from bpCmd)

bpCmd: runs on each spectrometer. interfaces pnetctl to bpInp,bpOut

bpInp,bpOut read from their input queues in shared memory.
pnetctl needs to send info,commands to these programs for configuration. but pnetctl is on a remote computer
so bpCmd listens on an input socket port.

when it receives a request from pnetctl, it can pass them on to bpInp,bpOut on the local shared memory input queues.

Logging info around the error

The tcl log file had:

2023-03-29_13:22:48 13 308800075 offsets: -400.691 0.000 start:132252
2023-03-29_13:22:49 Mock scan starting filenumber:07300
2023-03-29_13:22:52 Observation begins Wed Mar 29 13:22:52 2023. scan:308800075
2023-03-29_13:23:53 Scan 308800075 completed.
2023-03-29_13:23:56 mockpset:error mockpset grp g1 (0). stat:er retVal:pseterr pdevs2.pdev.naic.edu psetReply:.. retrying
2023-03-29_13:23:56 mockpset:error mockpset grp g1 (0). stat:to retVal:.. failed retry..quitting

The previous scan finished at 2023-03-29_13:23:53 ( when prun exited after completing the integration time)
The tcl program then sends the pset info for the next scan (this was an ra/dec map with 60 sec scans).
The pnetctl program got the request

it sent the request to the 7 mock spectrometers

it went to bpCmd and then to bpOut

pnetctl then waits up to 1 second for the reply
I think this is where we are timing out.

message log of pnetctl

msgOut:76:msg:infook pnetg0:Observation begins at 1680110572.01 Wed Mar 29 13:22:52 2023:

scan started 13:22:52 .. should last for 60 seconds.

msgInp:pseterr
tm sendBpcmd pset: tot:1.0022439956665
send:pdevs1.pdev.naic.edu=0.000104 pdevs2.pdev.naic.edu=0.000060 pdevs3.pdev.naic.edu=0.000056 pdevs4.pdev.naic.edu=0.000048 pdevs5.pdev.naic.edu=0.000050 pdevs6.pdev.naic.edu=0.000051 pdevs7.pdev.naic.edu=0.000050

The command was sent correctly..

rcvCum:pdevs2.pdev.naic.edu=1.00163

The reply took longer than 1 second to get back to pnetctl. this is the timeout failure.

The message log on the mock spectrometer cpu

I chose to look at pdevs2.
the datataking programs send messages to /var/log/messages.
the file had:
Mar 29 13:22:48 pdevs2 xinetd[3292]: START: pdev pid=11930 from=192.168.7.123

prun connects to psrv port. xinetd starts psrv in preparation for the scan.

Mar 29 13:22:51 pdevs2 bpInp[3759]: Got hdr band 1

hdr info bpInp arrives (via bpCmd)
note: When sending info to bpOut, it always goes through bpInp's input queue to make sure that things are synchronized.
we don't want to setup a new scan if there are still buffers in bpOut's input queue from the previous scan).

Mar 29 13:22:51 pdevs2 bpOut[3763]: opened file:/dump/pdev/sun.20230329.b1s1g0.07300.fits

file number passed to bpOut via bpCmd telling next file number to use.

Mar 29 13:22:51 pdevs2 bpOut[3763]: Got hdr band 1

header info for new scan gets to bpOut.

Mar 29 13:23:53 pdevs2 xinetd[3292]: EXIT: pdev status=1 pid=11930 duration=65(sec)

psrv exits when prun finishes sending it data.
all of the buffers from psrv have been put on bpInp's input queue.
psrv ran for 65 seconds. the first 5 seconds setup info was passed around.

Mar 29 13:24:26 pdevs2 bpOut[3763]: subBand:1 done:blks:60 bytes:34360300 lastFile:/dump/pdev/sun.20230329.b1s1g0.07300.fits

bpOut reports that it has finished writing all of the buffers from the previous block at 13:24:26.
the datataking finished at 13:22:52 1 min=13:23:52 .. so the data took 34 extra seconds to write to disc.
the pset command request was put on bpOut's input queue around 13:23:55

It was not operated on until all of the previous data buffers were output (at 13:24:26).
this is 31 seconds after it was issued. Much longer than the 1 second timeout.

The data rate for the experiment is 8192channels*2pols*4bytes per second

This is 65KBytes/second.
I recently ran the disc i/o test on all the mock discs. they all wrote greater the 120MBytes/second.
So don't think there is a bad disc.

Plotting the buffer pools status around the error.

I had started the bufpool monitoring routine around 13:00 on 29mar23 to see what was happening with the buffers.

the bpMon program reads the bookkeeping info from the bufpool and outputs the into every 5 seconds to disc files.

this is done for each mock cpu
there is info for

the large (data) and small (command) buffer lists
the bpInp input queue
the bpOut input queue.

The plots show the number of buffers in the various queues around the time of the failure (.ps) (.pdf)

Each page has info on a different queue or buffer list
The 7 frames on a page are the 7 mock spectrometers
The data is sampled once every 5 seconds
the green lines are the 1 minute scan prior to the failure
the red line shows the failure
Page 1: Large (data) buffer freelist queue

the queue holds 50 buffers.
When bpInp gets data, it extracts a buffer and places it on bpOut's input queue.
when bpOut finished writing the data, it gets put back on the freelist queue.
The during the 60sec scan, the freelist maintains the 50 buffers. the input, output happens fast enough to not see a change
around 10 seconds before the end of the scan, the buffer count begins to decrement. It ends up having up to 10 buffers in use by bpout
since spectra were integrated for 1 sec each, no data was written to disc for up to 15 seconds.
It took about 30secs after the end of the scan for the i/o to complete.

The required data rate was 65KBytes/second. The discs were later timed to write at 120MBytes/second.

All 7 mocks showed identical response, so this is not a local disc problem on a mock.,

Page 2: Small buffer (command) free list queue.

No commands were sent during the failure.

There should be 1024 buffers in the freelist. looks like we've probably got sum buffers that have not been freed (aborts?)

Page 3: bpInp input queue

This holds the buffers on bpInp's queue waiting to be processed.
There are no buffers backed up waiting for bpInp to process.

Page 4: bpOut input queue.

These are the buffers waiting on bpOut's input queue.
The 10 buffers removed from the freelist are waiting for bpOut to process.
bpOut can not process a buffer on it's input queue until the previous write has completed.
This shows the problem is with bpOut writing to disc
The same thing happened to all 7 mocks, so it is not a single mock disc problem

Summary

mock datataking has been aborting with pset errors
The problem occurs because:

prun reports that the integration has completed
the user tcl program sends a new pset command to bpOut to setup the parameters for the next scan.
To be sure the command is synchronized with any buffers left in bpInp,bpOut queues, the command is placed on bpInp's input queue
bpInp passes it on to bpOut's input queue
For some reason there are 10 buffers (of 1 sec of data each) still sitting on bpOut's queue.

The pset command is placed behind these

30 seconds after the scan ends, the last data buffer is finished being written to disc
the pset command is then accepted by bpOUt
Meanwhile the program sending the pset command times out after waiting 1 second for the reply from bpOut.

The problem occurs identically for all 7 mocks.

this shows it is not an individual mock or mock disc problem

processing:x101/230329/pseterr.info, x101/pdev/bpmon/bpmonplot_230329.pro

<- page up
home_~phil