clp processing program
jan09
Intro:
The clp processing programs input the clp data
from the telescope, decode each ipp, and then accumulate for a
specified number of ipps (usually 1000 ipps, or 10 seconds). The output is
a 2D image of spectral density vs range.
Types of input data:
- .shs files: created by the echotek card:
  - usually two 5 Mhz bandwidth bands (upshifted and downshifted)
  - processing usually uses a 4K fft
- .rdev files: created by Tamarra's machine (mock box with
  Tamarra's filters).
  - the bandwidth is still evolving (it was 32 Mhz at one point).
The different flavors of the clp processing programs are:
- clp - multi-threaded fftw (multiple threads doing 1 fft). This
  works on the .rdev data (circa 2011).
- clp1 - single threaded fftw, multiple ipps processed
  simultaneously (multiple threads), using .rdev files.
- clp1shs - single threaded fftw, multiple ipps, using echotek
  .shs files.
- As of 12jan12:
  - clp1shs is the most up to date. It works.
  - the bug fixes made in clp1shs need to be backported to
    clp1 and clp.
clp - multiThreaded fftw (top)
What it does:
- 1 input thread to read 10 second blocks of data
- 1 inpProcessing thread to find the clp data, multiply by the conjugate
  of the code, and put it into 1 large buffer
- 1 fftProcessing thread. It processes the entire 10 second block
  with 1 call to threaded fftw using N threads (a sketch follows this list)
- 1 power and accumulate thread.
- 1 output thread.
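A minimal sketch of that threaded-fftw step (an illustration, not the clp.c
source). The sizes, NTHREADS, and the use of fftwf_plan_many_dft are
assumptions; the real program gets its sizes from the data headers and uses a
full 10 second buffer.

    /* link with -lfftw3f_threads -lfftw3f -lpthread -lm */
    #include <stdio.h>
    #include <fftw3.h>

    #define FFTLEN   4096
    #define NHGHTS   190
    #define NIPPS    100          /* demo size; a full 10 sec block is 1000 ipps */
    #define NTHREADS 8

    int main(void)
    {
        int ntrans = NHGHTS * NIPPS;          /* number of 4K transforms in the block */
        fftwf_complex *buf = fftwf_malloc(sizeof(fftwf_complex) * FFTLEN * (long)ntrans);
        if (!buf) { fprintf(stderr, "no memory\n"); return 1; }

        fftwf_init_threads();                 /* enable the threaded fftw routines */
        fftwf_plan_with_nthreads(NTHREADS);   /* plans created next use NTHREADS threads */

        /* one plan that transforms the entire buffer in place */
        int n[] = { FFTLEN };
        fftwf_plan p = fftwf_plan_many_dft(1, n, ntrans,
                                           buf, NULL, 1, FFTLEN,
                                           buf, NULL, 1, FFTLEN,
                                           FFTW_FORWARD, FFTW_ESTIMATE);

        /* ... the inpProcessing thread would have decoded the ipps into buf ... */
        fftwf_execute(p);                     /* the single threaded-fftw call */
        /* ... power/accumulate and output threads take it from here ... */

        fftwf_destroy_plan(p);
        fftwf_free(buf);
        fftwf_cleanup_threads();
        return 0;
    }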
Info:
- Program is compiled to run on 64bit intel. The current fftw
  libs were compiled on aserv11, but it is probably equivalent to
  use the libs in /usr/lib64 since fftw doesn't use sse4.
  The fftw libs were compiled with --sse.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1shs
- Scripts: /pkg/aosoft/common/bin/clpserv.sc (use --help for options)
clp1 - single Threaded fftw, multi threaded
processing. (top)
- 1 input thread to input 10 sec blocks of clp data and
  then pass them to a processing thread.
- N processing threads, one for each 10 secs of data (a per-ipp
  sketch follows this list):
  - decode by multiplying by the conjugate of the code
  - fftw, single threaded, 1 height at a time
  - power and accumulate
  - output the block when done.
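A minimal per-ipp sketch of that inner loop (an illustration of the steps
above, not the clp1.c source; the sizes, names, and the in-place fftw plan
are assumptions):

    /* link with -lfftw3f -lm */
    #include <complex.h>
    #include <fftw3.h>

    #define FFTLEN 4096
    #define NHGHTS 190      /* illustrative; the real value comes from the setup */

    /*
     * Decode, fft, and accumulate one ipp:
     *   raw       - raw voltage samples for this ipp (at least NHGHTS-1+FFTLEN samples)
     *   codeConj  - conjugate of the tx code, FFTLEN long (zero padded past the code)
     *   specAccum - NHGHTS*FFTLEN accumulated power image
     *   work/plan - in-place single threaded 4K fft workspace and plan
     */
    void process_ipp(const float complex *raw, const float complex *codeConj,
                     float *specAccum, fftwf_complex *work, fftwf_plan plan)
    {
        for (int h = 0; h < NHGHTS; h++) {
            /* decode: multiply the FFTLEN samples starting at this height by the code conjugate */
            for (int i = 0; i < FFTLEN; i++)
                work[i] = raw[h + i] * codeConj[i];

            fftwf_execute(plan);             /* single threaded 4K fft, 1 height at a time */

            /* power and accumulate */
            float *acc = specAccum + (long)h * FFTLEN;
            for (int i = 0; i < FFTLEN; i++) {
                float re = crealf(work[i]), im = cimagf(work[i]);
                acc[i] += re * re + im * im;
            }
        }
    }

    /* typical per-thread setup (done once):
     *   work = fftwf_malloc(sizeof(fftwf_complex) * FFTLEN);
     *   plan = fftwf_plan_dft_1d(FFTLEN, work, work, FFTW_FORWARD, FFTW_MEASURE);
     */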
Info:
- Program is compiled to run on 64bit intel. The current fftw
  libs were compiled on aserv11, but it is probably equivalent to
  use the libs in /usr/lib64 since fftw doesn't use sse4.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp1.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1
- Scripts: source/clpaserv.sc (use --help for options)
clp1shs - single Threaded fftw, multi
threaded processing for .shs files. (top)
- 1 input thread to input T second blocks of clp data (by
  default T=10 secs) and then pass each block to a processing
  thread.
- N processing threads. 1 for each T secs of data:
- Decode by multiply by code conjugate
- fftw single threaded 1 height at a time
- power and accumulate.
- output block when done.
- the program was created from clp1. It has some bug fixes that need
  to be backported to clp1.
Info:
Input files:
- naming convention: projId_yyyymmdd(x)_fnum.shs
  - e.g. t2876_20140901_001.shs, t2876_20140901b_001.shs
- fnum runs from 000 to 999. If fnum goes beyond 999
  then it switches to 4 digits at that point.
- the (x) letter (b, c, d, ...) gets added if we restart the program
  on a single day.
Output files:
- A binary decode file and an ascii hdr file are output for each T
  seconds of processed data.
- Each input file will contain multiple T second blocks.
- The output filename prefix is:
  projId_yyyymmdd(x)_fnum_blkNum
- the output suffixes are .dcd (binary file) and .hdr (header file)
- Example: if the input .shs file is t2876_20140901_035.shs and
  it contains 6 T second blocks, then the output files are:
- t2876_20140901_035_001.dcd,
t2876_20140901_035_001.hdr .. to
- t2876_20140901_035_006.dcd, t2876_20140901_035_006.hdr
- ASCII hdr file contents (example values shown):
  - FILE_NUM 0
    - the file number from the input disc file
  - BLK_IN_FILE 1
    - the T second block within the file (count from 1)
  - NIPPS_ACCUM 6000
    - number of ipps that were processed and accumulated
  - CUM_IPP_START 8175
    - ipp number of the 1st ipp in this block, counting from the
      start of processing of the first file.
    - If processing is restarted in the middle of the dataset then
      this counter starts counting from that point.
  - SMP_TM_USEC 0.200
  - HGHT_RES_USEC 5.000
    - the computed heights are spaced by this many usecs
  - NUM_CHAN 1
    - 1 --> 1 frequency block recorded, 2 --> 2 frequency blocks
      recorded (typically centered at 435.5 and 439.5 Mhz)
  - NUM_HGHTS 190
  - HGHT_DELAY_USEC 320.000
    - usecs from the start of the rf pulse to the first height recorded
  - FFTLEN 4096
  - TX_SMP_IPP 1250
    - samples in the transmitter pulse
  - HGHT_SMP_IPP 6000
  - CODE_LEN_USEC 250.000
  - THRIND_ITERATION 0 1
    - thread index (0..45) and the number of times this thread has
      processed a block (1..)
  - DATE_SECMID 20140901 41614
    - date for the start of the T second block: yyyymmdd and
      seconds from midnight (AST)
  - POS_AZGRCH 321.4690 15.2170 0.0001
    - az, dome, ch positions in degrees (azimuth is measured from
      the dome side: 0 is north, 90 deg is east)
  - AvgTMING wSt 0.0 fftU 82.9 accumU 3.9 ipp 0.019 inp 4.48 out 0.04 tot 116.8
    - average timing to process the T second block:
      - wSt: seconds this thread waited before data became available
      - fftU: time for 1 fft (usecs)
      - accumU: time to accumulate the power (usecs)
      - ipp: seconds to process 1 ipp (excluding i/o)
      - inp: secs to input the T second block of data
      - out: secs to output the T second block of data
      - tot: total time (secs) to process the T second block of data
  - Start/stop times: Tue Sep 2 14:53:49 2014 / Tue Sep 2 14:55:46 2014
    - when the processing started and finished for this T second block (AST)
- Binary image files:
  - 4 byte little endian floating point numbers
  - a T second frequency block, followed (if NUM_CHAN=2) by the
    second frequency block
  - freqBlk = float(FFTLEN, NUM_HGHTS): FFTLEN spectral values for
    height 0, then FFTLEN values for height 1, ... FFTLEN values
    for height NUM_HGHTS-1
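A minimal reader sketch (an illustration, not an official tool). It assumes
the keyword and value sit on the same line in the .hdr file and that the
machine is little endian (intel), so no byte swapping is done; the example
use of the data is hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file.hdr file.dcd\n", argv[0]);
            return 1;
        }

        /* pull FFTLEN, NUM_HGHTS, NUM_CHAN out of the ascii hdr file */
        long fftlen = 0, nhghts = 0, nchan = 1;
        char line[256];
        FILE *hdr = fopen(argv[1], "r");
        if (!hdr) { perror("hdr"); return 1; }
        while (fgets(line, sizeof(line), hdr)) {
            sscanf(line, "FFTLEN %ld", &fftlen);
            sscanf(line, "NUM_HGHTS %ld", &nhghts);
            sscanf(line, "NUM_CHAN %ld", &nchan);
        }
        fclose(hdr);
        if (fftlen <= 0 || nhghts <= 0) {
            fprintf(stderr, "FFTLEN/NUM_HGHTS not found in %s\n", argv[1]);
            return 1;
        }

        /* each frequency block is float[NUM_HGHTS][FFTLEN], NUM_CHAN blocks per file */
        FILE *dcd = fopen(argv[2], "rb");
        if (!dcd) { perror("dcd"); return 1; }
        long npts = fftlen * nhghts;
        float *img = malloc(sizeof(float) * npts);
        for (long chan = 0; chan < nchan; chan++) {
            if (fread(img, sizeof(float), npts, dcd) != (size_t)npts) {
                fprintf(stderr, "short read on block %ld\n", chan);
                break;
            }
            /* example use: total power in height 0 of this block */
            double sum = 0.;
            for (long i = 0; i < fftlen; i++) sum += img[i];
            printf("chan %ld  hght 0 total power: %g\n", chan, sum);
        }
        free(img);
        fclose(dcd);
        return 0;
    }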
Processing Notes:
- Be sure to check NIPPS_ACCUM for each image.
- It will only process contiguous ipps.
  - If it finds another type of ipp it will only process the ipps
    found up to that point
  - (if fewer than 100 ipps are found then the block is
    skipped)
- RFI processing is done on each ipp (before decoding). This
can be altered with a program option.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp1shs.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1shs
- and /share/megs/phil/svn/aosoft/bin64/clp1shs
- Scripts: /pkg/aosoft/common/bin/clp1shs.sc (use --help for options)
Selecting the clp ipps for the .shs
files. (top)
The echotek card tries to write only clp ipps to
disc. Its algorithm lets a few mracf, power, and topsd ipps slip
through. The algorithm used to discard any non-clp ipps in the .shs
files is (a sketch follows this list):
- Search each ipp until a clp ipp is found
- accumulate the requested number of ipps or until a non-clp
  ipp is found
- If you have accumulated more than N ipps (I've set N to 100
  ipps, or 1 second) then output the accumulated block to a
  file.
- go back to the top.
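A self-contained sketch of that selection loop (an illustration, not the
clp1shs.c source). Here each ipp has already been reduced to a flag saying
whether the clp test in the next section passed; the real program works on
the raw samples. NIPPS_REQ and MIN_IPPS are illustrative names.

    #include <stdio.h>

    #define NIPPS_REQ 1000   /* requested ipps per block (10 sec of 10 ms ipps) */
    #define MIN_IPPS   100   /* blocks shorter than 1 second are skipped        */

    /* isClp[i] = 1 if ipp i passed the clp test, 0 otherwise */
    static void selectBlocks(const int *isClp, long nIpps)
    {
        long i = 0;
        while (i < nIpps) {
            /* search until a clp ipp is found */
            while (i < nIpps && !isClp[i]) i++;

            /* accumulate the requested number of ipps or until a non-clp ipp */
            long start = i, nAccum = 0;
            while (i < nIpps && isClp[i] && nAccum < NIPPS_REQ) { i++; nAccum++; }

            /* output the block only if enough ipps were accumulated */
            if (nAccum >= MIN_IPPS)
                printf("output block: ipps %ld..%ld (%ld accumulated)\n",
                       start, i - 1, nAccum);
            /* go back to the top */
        }
    }

    int main(void)
    {
        /* toy run: 5 clp ipps, 1 non-clp ipp, then 200 clp ipps */
        int flags[206];
        for (int k = 0; k < 206; k++) flags[k] = 1;
        flags[5] = 0;
        selectBlocks(flags, 206);   /* first 5 ipps are skipped (< MIN_IPPS) */
        return 0;
    }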
Deciding if an ipp is a clp ipp:
- Each program had a different length transmitter pulse:
  - power: 13*4 = 52 usecs
  - mracf: 308 usecs
  - clp: 440 usecs
  - topsd: 500 usecs
- 100 samples (20 usecs @ 5 Mhz) are averaged at various positions
  in the tx pulse.
- A threshold is set. To be a clp ipp the averaged power at the
  various positions must be above or below this threshold:
  - 150 usecs: power > threshold
    - This gets rid of power profiles
  - 430 usecs: power > threshold
    - This gets rid of mracf profiles.
  - 490 usecs: power < threshold.
- 5.57 usecs is added to all positions.
  - This allows for the rise time of the tx signal in the klystrons and
    filters.
- 5 usecs is subtracted from the 430 and 490 positions to stay
  away from the turn off edge.
- After apr14:
  - An acf of N lags (default=50) is computed for the tx samples.
  - lags 20 -> 50 are averaged.
  - if the averaged acf is below a threshold (.4) then it is
    clp; if it is above, it is not clp.
  - This gets rid of the need for the topside test (490 usecs
    power < threshold).
  - I'm also starting to parameterize these values so that we
    can process clp data with different rf pulse lengths.
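A sketch of the pre-apr14 threshold test (an illustration, not the actual
clp1shs.c code). The sample offsets follow the rules above; THRESHOLD uses the
1.5e6 power counts chosen in the analysis below; the buffer handling and the
names are assumptions. The post-apr14 acf test, which replaces the 490 usec
check, is only noted in a comment.

    #include <complex.h>

    #define SMP_PER_USEC 5        /* 5 Mhz complex sampling                 */
    #define AVG_SMP      100      /* 100 samples (20 usecs) averaged        */
    #define RISE_USEC    5.57     /* tx rise time in the klystrons/filters  */
    #define EDGE_USEC    5.0      /* stay away from the turn off edge       */
    #define THRESHOLD    1.5e6    /* power counts (see the analysis below)  */

    /* average power over AVG_SMP samples starting 'usecs' into the tx pulse */
    static double pwrAt(const float complex *tx, double usecs)
    {
        long i0 = (long)(usecs * SMP_PER_USEC);
        double sum = 0.;
        for (long i = i0; i < i0 + AVG_SMP; i++) {
            double re = crealf(tx[i]), im = cimagf(tx[i]);
            sum += re * re + im * im;
        }
        return sum / AVG_SMP;
    }

    /*
     * Pre-apr14 test. tx points to the raw samples at the start of the rf
     * pulse (at least ~2600 samples). Returns 1 if this looks like a clp ipp
     * (440 usec tx pulse).
     * After apr14 the 490 usec check is replaced by an acf test on the tx
     * samples (lags 20-50 averaged; below .4 --> clp).
     */
    int isClpIpp(const float complex *tx)
    {
        if (pwrAt(tx, 150. + RISE_USEC)             <= THRESHOLD) return 0; /* rejects power (52 usec)  */
        if (pwrAt(tx, 430. + RISE_USEC - EDGE_USEC) <= THRESHOLD) return 0; /* rejects mracf (308 usec) */
        if (pwrAt(tx, 490. + RISE_USEC - EDGE_USEC) >= THRESHOLD) return 0; /* rejects topsd (500 usec) */
        return 1;
    }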
Data from the file t2573_28dec2011_000.shs was processed to see what
a reasonable threshold would be. The ipp used was 10 milliseconds.
For each ipp, 20 usecs of data (100 samples) were averaged around
150, 430, and 490 usecs (this file had mracf, power, and clp).
The plots show the power at
each Tx position versus the ipp in the file (.ps) (.pdf):
- data was taken: 10secs mracf, 10 secs power, then 4*10 secs of
clp.
- The clp runs should then last for 40 secs or 4000 ipps.
- the echotek tried to only output the clp data.
- The three frames on each page show the power at 150, 430, and
490 usecs in the tx pulse.
- page 1: full vertical and horizontal scale:
- Top: 150 usecs:
- the dropout at the start is power.
- Around ipp 8000 there was another mracf, power cycle
visible
- Middle: 430 usecs
- An extra dropout is seen at ipp 4100. This must be a mracf
  ipp that got past the echotek.
- bottom: 490 usecs. Not much here since no topside was run.
- The average value is 8e6 at 150 usecs and 6.5e6 at 430
usecs. This falloff is the droop in the transmitter pulse.
- Page 2: blowup vertical scale, full horizontal scale.
- 430usecs:
- 6e5 pwr counts during power profile
- 490usecs:
- 6e5 pwr counts during power profile
- 2e5 pwr counts during mracf.
- These are outside the transmitter pulse for these programs.
- I'm not sure what the blanking duration was .. I'd guess 200
  usecs ..
- for power profile: 52 + 200 + 25 means that blanking was
over
- for mracf: 308+25 + 200 : blanking was just
finishing.
Summary:
- I set the threshold to 1.5e6 power counts.
- You need to set the threshold high enough so that the Tsys
values at 430 and 490 usecs for power and mracf don't end up
looking like tx power.
processing:
x101/120111/testclp.pro
Using rserv1: 48 core amd machine: (top)
rserv1 is a 48 core amd machine. clp1shs data was
processed using 45 cores. Some notes on the processing are:
- The program uses a single reader and then multiple processing
threads (one for each 10 second block).
- The input data is being read via nfs over a 1gb/sec ethernet
link.
- 193Mbytes of data is input and then passed to each thread
before it starts processing.
- The input data is read sequentially; it is not overlapped
  with a thread's processing (a dispatch sketch follows this list):
  - wait for a thread to be available
  - input that thread's data
  - start the thread processing
  - go to the next thread
- Some tests (dd from daeron to rserv1) showed about 100
  mb/sec transferring 2gb files with 9mb blocks.
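A sketch of that single-reader dispatch (an illustration, not the clp1shs.c
source): the reader waits for a free worker, fills that worker's buffer,
releases it, and moves on. The worker count, block count, and the
readBlock/processBlock stand-ins are assumptions.

    /* link with -lpthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NWORKERS 4            /* 45 on rserv1; small here for illustration */
    #define NBLOCKS  8            /* number of 10 second blocks to dispatch    */

    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int busy;                 /* 1 while the worker owns its buffer        */
        int blkNum;               /* block number the worker should process    */
        int done;                 /* set when there is no more data            */
    } worker_t;

    static worker_t w[NWORKERS];

    static void *workerLoop(void *arg)
    {
        worker_t *me = arg;
        for (;;) {
            pthread_mutex_lock(&me->mtx);
            while (!me->busy && !me->done)
                pthread_cond_wait(&me->cv, &me->mtx);
            if (!me->busy) { pthread_mutex_unlock(&me->mtx); return NULL; }
            int blk = me->blkNum;
            pthread_mutex_unlock(&me->mtx);

            /* processBlock(blk): decode, fft, power/accumulate, output */
            printf("worker processing block %d\n", blk);
            usleep(100000);

            pthread_mutex_lock(&me->mtx);
            me->busy = 0;                      /* buffer is free again */
            pthread_cond_signal(&me->cv);
            pthread_mutex_unlock(&me->mtx);
        }
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (int i = 0; i < NWORKERS; i++) {
            pthread_mutex_init(&w[i].mtx, NULL);
            pthread_cond_init(&w[i].cv, NULL);
            pthread_create(&tid[i], NULL, workerLoop, &w[i]);
        }

        /* reader: round robin over the workers; reading is not overlapped */
        for (int blk = 0; blk < NBLOCKS; blk++) {
            worker_t *t = &w[blk % NWORKERS];
            pthread_mutex_lock(&t->mtx);
            while (t->busy)                            /* wait for thread to be available */
                pthread_cond_wait(&t->cv, &t->mtx);
            /* readBlock(blk, buffer): input the 10 secs (193 Mbytes) of data here */
            t->blkNum = blk;
            t->busy = 1;                               /* start the thread processing */
            pthread_cond_signal(&t->cv);
            pthread_mutex_unlock(&t->mtx);
        }

        for (int i = 0; i < NWORKERS; i++) {           /* shut the workers down */
            pthread_mutex_lock(&w[i].mtx);
            while (w[i].busy)
                pthread_cond_wait(&w[i].cv, &w[i].mtx);
            w[i].done = 1;
            pthread_cond_signal(&w[i].cv);
            pthread_mutex_unlock(&w[i].mtx);
            pthread_join(tid[i], NULL);
        }
        return 0;
    }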
Plots were made of the daily processing of .shs clp data. Data is
plotted vs processed block number (each block normally 10 seconds of
data).
The plots show:
- top: number of accumulated ipps (10 millisecs
  each) in each block. Normally this is 1000 (10 seconds).
- middle: Wall clock spacing between clp blocks.
- If power, mracf, and clp are each run for 10 seconds,
then the clp spacing will be 30 seconds.
- Bottom: Number of blocks each thread processed.
- This shows how well the process was spread between the
45 threads.
- If there are a number of short blocks then some threads
  may finish early and do more blocks.
- Page 2: timing:
- Top: Wait time for each thread startup.
- The first 45 threads have no startup wait time.
- The 46th block must wait for the processing time (880
secs) before it can be started. This is the wait time.
- the wait time should have a spike every 45 block starts
(approximately)
- 2nd frame: Read time for each block.
  - The i/o time to read the 10 seconds (193 Mbytes) of data.
  - I tried some dd's from daeron to rserv1 of 2gb files.
    I was getting close to 100 mb/sec, so it should only
    take a few seconds to read the data.
  - If the datataking was writing to the
    daeron disc at the same time, things could go slower.
  - This read occurs before the thread can be dispatched to start
    its processing.
- The data is read in blocks of 1 second.
- 3rd frame: Thread processing and output time.
  - The output time is usually very short (fractions of a
    second) since the write completes once the data
    gets to the memory buffers.
- the total time is the sum of this time and the read time
for each block.
- 4th frame: average time for 1 4K fft
  - The average and rms are printed in the header.
- 5th frame:
  - the average time to compute the power and accumulate 1 4K
    spectrum.
Processing times by date:

Date     | Plots        | proj  | thrProc avg secs | 4K fft avg usecs | pwrAccum avg | Notes
20111220 | (.ps) (.pdf) | t2574 | 921 +/- 39.6     | 85.7 +/- 1       | 13.7 +/- 3.9 |
20111222 | (.ps) (.pdf) | t2574 | 917 +/- 27.8     | 86.9 +/- 1.5     | 11.7 +/- 2.6 |
20111223 | (.ps) (.pdf) | t2574 | 887 +/- 19.2     | 83.9 +/- .6      | 11.5 +/- 2.2 | blk 1200 to 1600 input took 20 secs
20111224 | (.ps) (.pdf) | t2574 | 913 +/- 28.9     | 85.8 +/- .7      | 12.8 +/- 3.0 |
20111228 | (.ps) (.pdf) | t2573 | 864.9 +/- 17     | 83.0 +/- .4      | 9.8 +/- 2.0  |
20111229 | (.ps) (.pdf) | t2573 | 870.4 +/- 22.1   | 83.0 +/- .4      | 10.4 +/- 2.6 |
20111230 | (.ps) (.pdf) | t2573 | 864.9 +/- 18.0   | 83.9 +/- .4      | 9.9 +/- 2.1  | first 45 blocks ran 40 secs faster than the rest
20120112 | (.ps) (.pdf) | t1193 | 864.9 +/- 18.0   | 83.0 +/- .4      | 9.8          | first 45 blocks ran 40 secs faster; looks like the pwr and accum code started to take longer
20120113 | (.ps) (.pdf) | t1193 | 866.2 +/- 19.2   | 83.1 +/- .3      | 9.9 +/- 2.3  | first 45 blocks ran 40 secs faster, then pwr,accum slowed down
processing:
x101/120114/clp1shstmingchk.pro, tmingchk.pro
09dec13: clp processing times on wombat
The wombat computer was installed 06dec13. It has:
- 2*8-core E5-2680 Xeons running at 2.7 Ghz with a 20Mb cache, and hyperthreading
- 128 GB RAM, 4 TB in a RAID 6 on a 3Ware card
I ran the clp code on wombat using 15 and 30
threads.
- the input data came from rserv2
- the output data went to the wombat disc. (so the times may be
a few seconds faster because the disc write did not have to go
back out over the network).
- This code uses the fftw routines. I did not recompile the
program on wombat (but it did link with the .so libs on wombat).
The plots show the processing
times for clp on wombat (.ps) (.pdf):
- Top: time to process a single 10 sec clp block of data. black
used 30 threads, red used 15 threads
- 2nd: ratio wombat 30/15 thread processing time = 1.7
- hyperthreading does speed up the throughput
- 3rd: 4k complex fft time
- 30 threads: 20 usecs, 15 threads: 10 usecs
- So hyperthreading does not increase the throughput of fft's
- Bottom: wombat 30thread/15thread 4K fft times
- the ratio was 2.
- Hyperthreading does not improve the fft times
processing times, throughput:

clp processing:
cpu    | threads | tm 1 blk (secs) | 1 thread ratio (cpu/rserv1) | (tm 1 blk)/nthreads (secs)
wombat | 15      | 180             | 5                           | 12
wombat | 30      | 300             | 3                           | 10 *
rserv1 | 46      | 900             | 1                           | 19

4K fft processing:
cpu    | threads | tm 1 fft (usecs) | 1 thread ratio (cpu/rserv1) | throughput (nthreads*ffts/sec) | throughput ratio (cpu/rserv1)
wombat | 15      | 10               | 8.3                         | 1.5e6                          | 3
wombat | 30      | 20               | 4.15                        | 1.5e6                          | 3
rserv1 | 46      | 83               | 1                           | .536                           | 1
* with 30 threads, wombat processes 10 secs of data in 10 seconds, so it can
keep up with real time.
Summary:
- fft times.
- wombat is 8 times faster than rserv1 for a 4k fft
(1-15 threads on wombat)
- including the number of threads in the cpu, wombat is 3
times faster than rserv1
- This is probably from
- wombat is 2.7 Ghz, rserv1 is probably 2.1 Ghz
- The cache on rserv1 is smaller
- Buying a version of wombat with a faster clock would
probably speed this up even more.
- clp processing times:
- wombat has a 2 times throughput improvement compared
to rserv1
- With 30 threads, wombat can process clp (5 Mhz, 2 band) data
  in real time (clp running 100% of the time).
- Note: wombat has the same problem with NAN that rserv2 has.
  Until I figure out the problem, it can't be used to reduce the
  data.
  - this is some craziness in my clp program that
    occurs on rserv2 and wombat (but never on rserv1).
processing:
x101/131207/clpwombat.pro
The clp1shs.c program was originally written and
run on rserv1 with an older version of centos (6.x?). The code
ran fine.
I tried recompiling and running the program on rserv2 and megs3.
Both of these versions generated infinities in the output spectra.
In oct18 giacomo upgraded rserv1 to centos
7.5. I ran the program on some .shs files:
- there were no infinities
- the fft code ran 3 times faster.
The 181101-09 echotek data was also processed
on rserv1 with centos 7.5 with no infinities.
On 05dec18 we started a new echotek run, and we now
started getting infinities again:
- The processing used 2 5 Mhz bands
- the infinities only occurred in the 2nd 5 Mhz image. The first
  image was always clean.
I made some plots looking at the data
- t3274 (accidentally placed in t2374) on dstor1
- 50 secs clp, 10 sec tpsd
- rfi and arc removal enabled (these were also enabled for the
hf run)
The plots show the infinities from
the 2nd 5 Mhz block (.ps) (.pdf):
- the first 450 files were processed
- there were 4095 10 sec spectral files generated
- Each file had 2 5 Mhz blocks of 4096 freq channels * 3400 heights
- top: Number of infinities in block 2 vs file index
  - of the 4095 images, 783 files had infinities
  - Each bad file averaged about 20,000 infinities (out of 14e6
    points per image)
- 2nd: maximum height index (count from 0) for the infinities
  in block 2.
  - There were no infinities above height index 60 (out of
    3399).
- bottom: blowup showing how often the infinities occurred.
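A sketch in C of this kind of scan (the actual analysis was done in idl,
x101/181206/rserv1_inf.pro; the fixed sizes below are the values from this
run's hdr files and the command line handling is an assumption):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define FFTLEN 4096
    #define NHGHTS 3400

    int main(int argc, char *argv[])
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file.dcd\n", argv[0]); return 1; }
        FILE *fp = fopen(argv[1], "rb");
        if (!fp) { perror("dcd"); return 1; }

        long npts = (long)FFTLEN * NHGHTS;
        float *img = malloc(sizeof(float) * npts);

        /* skip the 1st frequency block, read the 2nd */
        fseek(fp, sizeof(float) * npts, SEEK_SET);
        if (fread(img, sizeof(float), npts, fp) != (size_t)npts) {
            fprintf(stderr, "short read\n"); return 1;
        }
        fclose(fp);

        long nbad = 0, maxHght = -1;
        for (long i = 0; i < npts; i++) {
            if (!isfinite(img[i])) {          /* catches inf and nan */
                nbad++;
                long h = i / FFTLEN;          /* height index, counting from 0 */
                if (h > maxHght) maxHght = h;
            }
        }
        printf("%s: %ld non-finite points in block 2, max height index %ld\n",
               argv[1], nbad, maxHght);
        free(img);
        return 0;
    }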
Not sure why rserv1 decided to start
generating infinities again. I don't think anything has changed
since
the hf run.
processing:x101/181206/rserv1_inf.pro