clp processing program
jan09
Intro:
The clp processing programs input the clp data
from the telescope, decode each ipp, and then accumulate for a
specified number of ipps (usually 1000 ipps, or 10 seconds). The output is
a 2D image of spectral density vs range.
Types of input data:
- .shs files: created by the echotek card:
  - usually two 5 Mhz bandwidth bands (upshifted and downshifted)
  - processing usually uses a 4K fft
- .rdev files: created by Tamarra's machine (mock box with
  Tamarra's filters).
  - the bandwidth is still evolving (it was 32 Mhz at one point).
The different flavors of the clp processing programs are:
- clp - multi-threaded fftw (multiple threads doing 1 fft). This
  works on the .rdev data (circa 2011).
- clp1 - single threaded fftw, multiple ipps processed
  simultaneously (multiple threads), using .rdev files.
- clp1shs - single threaded fftw, multiple ipps, using echotek
  .shs files.
- As of 12jan12:
  - clp1shs is the most up to date. It works.
  - the bug fixes made in clp1shs need to be backported to
    clp1 and clp.
clp - multiThreaded fftw (top)
What it does:
- 1 input thread to read 10 second blocks of data
- 1 inpProcessing thread to find the clp data, multiply by the conjugate
  of the code, and put it into 1 large buffer
- 1 fftProcessing thread. It processes the entire 10 second block
  with 1 call to threaded fftw using N threads (a sketch follows this list)
- 1 power and accumulate thread.
- 1 output thread.
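A minimal sketch of that threaded-fftw step (an illustration, not the clp.c
source). The sizes, NTHREADS, and the use of fftwf_plan_many_dft are
assumptions; the real program gets its sizes from the data headers and uses a
full 10 second buffer.

    /* link with -lfftw3f_threads -lfftw3f -lpthread -lm */
    #include <stdio.h>
    #include <fftw3.h>

    #define FFTLEN   4096
    #define NHGHTS   190
    #define NIPPS    100          /* demo size; a full 10 sec block is 1000 ipps */
    #define NTHREADS 8

    int main(void)
    {
        int ntrans = NHGHTS * NIPPS;          /* number of 4K transforms in the block */
        fftwf_complex *buf = fftwf_malloc(sizeof(fftwf_complex) * FFTLEN * (long)ntrans);
        if (!buf) { fprintf(stderr, "no memory\n"); return 1; }

        fftwf_init_threads();                 /* enable the threaded fftw routines */
        fftwf_plan_with_nthreads(NTHREADS);   /* plans created next use NTHREADS threads */

        /* one plan that transforms the entire buffer in place */
        int n[] = { FFTLEN };
        fftwf_plan p = fftwf_plan_many_dft(1, n, ntrans,
                                           buf, NULL, 1, FFTLEN,
                                           buf, NULL, 1, FFTLEN,
                                           FFTW_FORWARD, FFTW_ESTIMATE);

        /* ... the inpProcessing thread would have decoded the ipps into buf ... */
        fftwf_execute(p);                     /* the single threaded-fftw call */
        /* ... power/accumulate and output threads take it from here ... */

        fftwf_destroy_plan(p);
        fftwf_free(buf);
        fftwf_cleanup_threads();
        return 0;
    }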
Info:
- Program is compiled to run on 64bit intel. The current fftw
  libs were compiled on aserv11, but it is probably equivalent to
  use the libs in /usr/lib64 since fftw doesn't use sse4.
  The fftw libs were compiled with --sse.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1shs
- Scripts: /pkg/aosoft/common/bin/clpserv.sc (use --help for options)
clp1 - single Threaded fftw, multi threaded
processing. (top)
- 1 input thread to input 10 sec blocks of clp data and
  then pass them to a processing thread.
- N processing threads, one for each 10 secs of data (a per-ipp
  sketch follows this list):
  - decode by multiplying by the conjugate of the code
  - fftw, single threaded, 1 height at a time
  - power and accumulate
  - output the block when done.
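A minimal per-ipp sketch of that inner loop (an illustration of the steps
above, not the clp1.c source; the sizes, names, and the in-place fftw plan
are assumptions):

    /* link with -lfftw3f -lm */
    #include <complex.h>
    #include <fftw3.h>

    #define FFTLEN 4096
    #define NHGHTS 190      /* illustrative; the real value comes from the setup */

    /*
     * Decode, fft, and accumulate one ipp:
     *   raw       - raw voltage samples for this ipp (at least NHGHTS-1+FFTLEN samples)
     *   codeConj  - conjugate of the tx code, FFTLEN long (zero padded past the code)
     *   specAccum - NHGHTS*FFTLEN accumulated power image
     *   work/plan - in-place single threaded 4K fft workspace and plan
     */
    void process_ipp(const float complex *raw, const float complex *codeConj,
                     float *specAccum, fftwf_complex *work, fftwf_plan plan)
    {
        for (int h = 0; h < NHGHTS; h++) {
            /* decode: multiply the FFTLEN samples starting at this height by the code conjugate */
            for (int i = 0; i < FFTLEN; i++)
                work[i] = raw[h + i] * codeConj[i];

            fftwf_execute(plan);             /* single threaded 4K fft, 1 height at a time */

            /* power and accumulate */
            float *acc = specAccum + (long)h * FFTLEN;
            for (int i = 0; i < FFTLEN; i++) {
                float re = crealf(work[i]), im = cimagf(work[i]);
                acc[i] += re * re + im * im;
            }
        }
    }

    /* typical per-thread setup (done once):
     *   work = fftwf_malloc(sizeof(fftwf_complex) * FFTLEN);
     *   plan = fftwf_plan_dft_1d(FFTLEN, work, work, FFTW_FORWARD, FFTW_MEASURE);
     */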
Info:
- Program is compiled to run on 64bit intel. The current fftw
  libs were compiled on aserv11, but it is probably equivalent to
  use the libs in /usr/lib64 since fftw doesn't use sse4.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp1.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1
- Scripts: source/clpaserv.sc (use --help for options)
clp1shs - single Threaded fftw, multi
threaded processing for .shs files. (top)
- 1 input thread to input T second blocks of clp data (by
  default T=10 secs) and then pass each block to a processing
  thread.
- N processing threads. 1 for each T secs of data:
- Decode by multiply by code conjugate
- fftw single threaded 1 height at a time
- power and accumulate.
- output block when done.
- the program was created from clp1. It has some bug fixes that need
  to be backported to clp1.
Info:
Input files:
- naming convention: projId_yyyymmdd(x)_fnum.shs
  - e.g. t2876_20140901_001.shs, t2876_20140901b_001.shs
- fnum runs from 000 to 999. If fnum goes beyond 999
  then it switches to 4 digits at that point.
- the (x) letter (b, c, d, ...) gets added if we restart the program
  on a single day.
Output files:
- A binary decode file and an ascii hdr file are output for each T
  seconds of processed data.
- Each input file will contain multiple T second blocks.
- The output filename prefix is:
  projId_yyyymmdd(x)_fnum_blkNum
- the output suffixes are .dcd (binary file) and .hdr (header file)
- Example: if the input .shs file is t2876_20140901_035.shs and
  it contains 6 T second blocks, then the output files are:
- t2876_20140901_035_001.dcd,
t2876_20140901_035_001.hdr .. to
- t2876_20140901_035_006.dcd, t2876_20140901_035_006.hdr
- ASCII hdr file contents (example values shown):
  - FILE_NUM 0
    - the file number from the input disc file
  - BLK_IN_FILE 1
    - the T second block within the file (count from 1)
  - NIPPS_ACCUM 6000
    - number of ipps that were processed and accumulated
  - CUM_IPP_START 8175
    - ipp number of the 1st ipp in this block, counting from the
      start of processing of the first file.
    - If processing is restarted in the middle of the dataset then
      this counter starts counting from that point.
  - SMP_TM_USEC 0.200
  - HGHT_RES_USEC 5.000
    - the computed heights are spaced by this many usecs
  - NUM_CHAN 1
    - 1 --> 1 frequency block recorded, 2 --> 2 frequency blocks
      recorded (typically centered at 435.5 and 439.5 Mhz)
  - NUM_HGHTS 190
  - HGHT_DELAY_USEC 320.000
    - usecs from the start of the rf pulse to the first height recorded
  - FFTLEN 4096
  - TX_SMP_IPP 1250
    - samples in the transmitter pulse
  - HGHT_SMP_IPP 6000
  - CODE_LEN_USEC 250.000
  - THRIND_ITERATION 0 1
    - thread index (0..45) and the number of times this thread has
      processed a block (1..)
  - DATE_SECMID 20140901 41614
    - date for the start of the T second block: yyyymmdd and
      seconds from midnight (AST)
  - POS_AZGRCH 321.4690 15.2170 0.0001
    - az, dome, ch positions in degrees (azimuth is measured from
      the dome side: 0 is north, 90 deg is east)
  - AvgTMING wSt 0.0 fftU 82.9 accumU 3.9 ipp 0.019 inp 4.48 out 0.04 tot 116.8
    - average timing to process the T second block:
      - wSt: seconds this thread waited before data became available
      - fftU: time for 1 fft (usecs)
      - accumU: time to accumulate the power (usecs)
      - ipp: seconds to process 1 ipp (excluding i/o)
      - inp: secs to input the T second block of data
      - out: secs to output the T second block of data
      - tot: total time (secs) to process the T second block of data
  - Start/stop times: Tue Sep 2 14:53:49 2014 / Tue Sep 2 14:55:46 2014
    - when the processing started and finished for this T second block (AST)
- Binary image files:
  - 4 byte little endian floating point numbers
  - a T second frequency block, followed (if NUM_CHAN=2) by the
    second frequency block
  - freqBlk = float(FFTLEN, NUM_HGHTS): FFTLEN spectral values for
    height 0, then FFTLEN values for height 1, ... FFTLEN values
    for height NUM_HGHTS-1
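A minimal reader sketch (an illustration, not an official tool). It assumes
the keyword and value sit on the same line in the .hdr file and that the
machine is little endian (intel), so no byte swapping is done; the example
use of the data is hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file.hdr file.dcd\n", argv[0]);
            return 1;
        }

        /* pull FFTLEN, NUM_HGHTS, NUM_CHAN out of the ascii hdr file */
        long fftlen = 0, nhghts = 0, nchan = 1;
        char line[256];
        FILE *hdr = fopen(argv[1], "r");
        if (!hdr) { perror("hdr"); return 1; }
        while (fgets(line, sizeof(line), hdr)) {
            sscanf(line, "FFTLEN %ld", &fftlen);
            sscanf(line, "NUM_HGHTS %ld", &nhghts);
            sscanf(line, "NUM_CHAN %ld", &nchan);
        }
        fclose(hdr);
        if (fftlen <= 0 || nhghts <= 0) {
            fprintf(stderr, "FFTLEN/NUM_HGHTS not found in %s\n", argv[1]);
            return 1;
        }

        /* each frequency block is float[NUM_HGHTS][FFTLEN], NUM_CHAN blocks per file */
        FILE *dcd = fopen(argv[2], "rb");
        if (!dcd) { perror("dcd"); return 1; }
        long npts = fftlen * nhghts;
        float *img = malloc(sizeof(float) * npts);
        for (long chan = 0; chan < nchan; chan++) {
            if (fread(img, sizeof(float), npts, dcd) != (size_t)npts) {
                fprintf(stderr, "short read on block %ld\n", chan);
                break;
            }
            /* example use: total power in height 0 of this block */
            double sum = 0.;
            for (long i = 0; i < fftlen; i++) sum += img[i];
            printf("chan %ld  hght 0 total power: %g\n", chan, sum);
        }
        free(img);
        fclose(dcd);
        return 0;
    }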
Processing Notes:
- Be sure to check NIPPS_ACCUM for each image.
- It will only process contiguous ipps.
  - If it finds another type of ipp it will only process the ipps
    found up to that point
  - (if fewer than 100 ipps are found then the block is
    skipped)
- RFI processing is done on each ipp (before decoding). This
can be altered with a program option.
File locations:
- Source: /share/megs/phil/svn/aosoft/src/clp/clp1shs.c
- Binary: /pkg/aosoft/fedora4/x86_64/bin/clp1shs
- and /share/megs/phil/svn/aosoft/bin64/clp1shs
- Scripts: /pkg/aosoft/common/bin/clp1shs.sc (use --help for options)
Selecting the clp ipps for the .shs
files. (top)
The echotek card tries to write only clp ipps to
disc. Its algorithm lets a few mracf, power, and topsd ipps slip
through. The algorithm used to discard any non-clp ipps in the .shs
files is (a sketch follows this list):
- Search each ipp until a clp ipp is found
- accumulate the requested number of ipps or until a non-clp
  ipp is found
- If you have accumulated more than N ipps (I've set N to 100
  ipps, or 1 second) then output the accumulated block to a
  file.
- go back to the top.
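A self-contained sketch of that selection loop (an illustration, not the
clp1shs.c source). Here each ipp has already been reduced to a flag saying
whether the clp test in the next section passed; the real program works on
the raw samples. NIPPS_REQ and MIN_IPPS are illustrative names.

    #include <stdio.h>

    #define NIPPS_REQ 1000   /* requested ipps per block (10 sec of 10 ms ipps) */
    #define MIN_IPPS   100   /* blocks shorter than 1 second are skipped        */

    /* isClp[i] = 1 if ipp i passed the clp test, 0 otherwise */
    static void selectBlocks(const int *isClp, long nIpps)
    {
        long i = 0;
        while (i < nIpps) {
            /* search until a clp ipp is found */
            while (i < nIpps && !isClp[i]) i++;

            /* accumulate the requested number of ipps or until a non-clp ipp */
            long start = i, nAccum = 0;
            while (i < nIpps && isClp[i] && nAccum < NIPPS_REQ) { i++; nAccum++; }

            /* output the block only if enough ipps were accumulated */
            if (nAccum >= MIN_IPPS)
                printf("output block: ipps %ld..%ld (%ld accumulated)\n",
                       start, i - 1, nAccum);
            /* go back to the top */
        }
    }

    int main(void)
    {
        /* toy run: 5 clp ipps, 1 non-clp ipp, then 200 clp ipps */
        int flags[206];
        for (int k = 0; k < 206; k++) flags[k] = 1;
        flags[5] = 0;
        selectBlocks(flags, 206);   /* first 5 ipps are skipped (< MIN_IPPS) */
        return 0;
    }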
Deciding if an ipp is a clp ipp:
- Each program had a different length transmitter pulse:
  - power: 13*4 = 52 usecs
  - mracf: 308 usecs
  - clp: 440 usecs
  - topsd: 500 usecs
- 100 samples (20 usecs @ 5 Mhz) are averaged at various positions
  in the tx pulse.
- A threshold is set. To be a clp ipp the averaged power at the
  various positions must be above or below this threshold:
  - 150 usecs: power > threshold
    - This gets rid of power profiles
  - 430 usecs: power > threshold
    - This gets rid of mracf profiles.
  - 490 usecs: power < threshold.
- 5.57 usecs is added to all positions.
  - This allows for the rise time of the tx signal in the klystrons and
    filters.
- 5 usecs is subtracted from the 430 and 490 positions to stay
  away from the turn off edge.
- After apr14:
  - An acf of N lags (default=50) is computed for the tx samples.
  - lags 20 -> 50 are averaged.
  - if the averaged acf is below a threshold (.4) then it is
    clp; if it is above, it is not clp.
  - This gets rid of the need for the topside test (490 usecs
    power < threshold).
  - I'm also starting to parameterize these values so that we
    can process clp data with different rf pulse lengths.
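A sketch of the pre-apr14 threshold test (an illustration, not the actual
clp1shs.c code). The sample offsets follow the rules above; THRESHOLD uses the
1.5e6 power counts chosen in the analysis below; the buffer handling and the
names are assumptions. The post-apr14 acf test, which replaces the 490 usec
check, is only noted in a comment.

    #include <complex.h>

    #define SMP_PER_USEC 5        /* 5 Mhz complex sampling                 */
    #define AVG_SMP      100      /* 100 samples (20 usecs) averaged        */
    #define RISE_USEC    5.57     /* tx rise time in the klystrons/filters  */
    #define EDGE_USEC    5.0      /* stay away from the turn off edge       */
    #define THRESHOLD    1.5e6    /* power counts (see the analysis below)  */

    /* average power over AVG_SMP samples starting 'usecs' into the tx pulse */
    static double pwrAt(const float complex *tx, double usecs)
    {
        long i0 = (long)(usecs * SMP_PER_USEC);
        double sum = 0.;
        for (long i = i0; i < i0 + AVG_SMP; i++) {
            double re = crealf(tx[i]), im = cimagf(tx[i]);
            sum += re * re + im * im;
        }
        return sum / AVG_SMP;
    }

    /*
     * Pre-apr14 test. tx points to the raw samples at the start of the rf
     * pulse (at least ~2600 samples). Returns 1 if this looks like a clp ipp
     * (440 usec tx pulse).
     * After apr14 the 490 usec check is replaced by an acf test on the tx
     * samples (lags 20-50 averaged; below .4 --> clp).
     */
    int isClpIpp(const float complex *tx)
    {
        if (pwrAt(tx, 150. + RISE_USEC)             <= THRESHOLD) return 0; /* rejects power (52 usec)  */
        if (pwrAt(tx, 430. + RISE_USEC - EDGE_USEC) <= THRESHOLD) return 0; /* rejects mracf (308 usec) */
        if (pwrAt(tx, 490. + RISE_USEC - EDGE_USEC) >= THRESHOLD) return 0; /* rejects topsd (500 usec) */
        return 1;
    }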
Data from the file t2573_28dec2011_000.shs was processed to see what
a reasonable threshold would be. The ipp used was 10 milliseconds.
For each ipp, 20 usecs of data (100 samples) were averaged around
150, 430, and 490 usecs (this file had mracf, power, and clp).
The plots show the power at
each Tx position versus the ipp in the file (.ps) (.pdf):
- data was taken: 10secs mracf, 10 secs power, then 4*10 secs of
clp.
- The clp runs should then last for 40 secs or 4000 ipps.
- the echotek tried to only output the clp data.
- The three frames on each page show the power at 150, 430, and
490 usecs in the tx pulse.
- page 1: full vertical and horizontal scale:
- Top: 150 usecs:
- the dropout at the start is power.
- Around ipp 8000 there was another mracf, power cycle
visible
- Middle: 430 usecs
- An extra dropout is seen at ipp 4100. This must be a mracf
  ipp that got past the echotek.
- bottom: 490 usecs. Not much here since no topside was run.
- The average value is 8e6 at 150 usecs and 6.5e6 at 430
usecs. This falloff is the droop in the transmitter pulse.
- Page 2: blowup vertical scale, full horizontal scale.
- 430usecs:
- 6e5 pwr counts during power profile
- 490usecs:
- 6e5 pwr counts during power profile
- 2e5 pwr counts during mracf.
- These are outside the transmitter pulse for these programs.
- I'm not sure what the blanking duration was .. I'd guess 200
  usecs ..
- for power profile: 52 + 200 + 25 means that blanking was
over
- for mracf: 308+25 + 200 : blanking was just
finishing.
Summary:
- I set the threshold to 1.5e6 power counts.
- You need to set the threshold high enough so that the Tsys
values at 430 and 490 usecs for power and mracf don't end up
looking like tx power.
processing:
x101/120111/testclp.pro
Using rserv1: 48 core amd machine: (top)
rserv1 is a 48 core amd machine. clp1shs data was
processed using 45 cores. Some notes on the processing are:
- The program uses a single reader and then multiple processing
threads (one for each 10 second block).
- The input data is being read via nfs over a 1gb/sec ethernet
link.
- 193Mbytes of data is input and then passed to each thread
before it starts processing.
- The input data is read sequentially; it is not overlapped
  with a thread's processing (a dispatch sketch follows this list):
  - wait for a thread to be available
  - input that thread's data
  - start the thread processing
  - go to the next thread
- Some tests (dd from daeron to rserv1) showed about 100
  mb/sec transferring 2gb files with 9mb blocks.
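A sketch of that single-reader dispatch (an illustration, not the clp1shs.c
source): the reader waits for a free worker, fills that worker's buffer,
releases it, and moves on. The worker count, block count, and the
readBlock/processBlock stand-ins are assumptions.

    /* link with -lpthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NWORKERS 4            /* 45 on rserv1; small here for illustration */
    #define NBLOCKS  8            /* number of 10 second blocks to dispatch    */

    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int busy;                 /* 1 while the worker owns its buffer        */
        int blkNum;               /* block number the worker should process    */
        int done;                 /* set when there is no more data            */
    } worker_t;

    static worker_t w[NWORKERS];

    static void *workerLoop(void *arg)
    {
        worker_t *me = arg;
        for (;;) {
            pthread_mutex_lock(&me->mtx);
            while (!me->busy && !me->done)
                pthread_cond_wait(&me->cv, &me->mtx);
            if (!me->busy) { pthread_mutex_unlock(&me->mtx); return NULL; }
            int blk = me->blkNum;
            pthread_mutex_unlock(&me->mtx);

            /* processBlock(blk): decode, fft, power/accumulate, output */
            printf("worker processing block %d\n", blk);
            usleep(100000);

            pthread_mutex_lock(&me->mtx);
            me->busy = 0;                      /* buffer is free again */
            pthread_cond_signal(&me->cv);
            pthread_mutex_unlock(&me->mtx);
        }
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (int i = 0; i < NWORKERS; i++) {
            pthread_mutex_init(&w[i].mtx, NULL);
            pthread_cond_init(&w[i].cv, NULL);
            pthread_create(&tid[i], NULL, workerLoop, &w[i]);
        }

        /* reader: round robin over the workers; reading is not overlapped */
        for (int blk = 0; blk < NBLOCKS; blk++) {
            worker_t *t = &w[blk % NWORKERS];
            pthread_mutex_lock(&t->mtx);
            while (t->busy)                            /* wait for thread to be available */
                pthread_cond_wait(&t->cv, &t->mtx);
            /* readBlock(blk, buffer): input the 10 secs (193 Mbytes) of data here */
            t->blkNum = blk;
            t->busy = 1;                               /* start the thread processing */
            pthread_cond_signal(&t->cv);
            pthread_mutex_unlock(&t->mtx);
        }

        for (int i = 0; i < NWORKERS; i++) {           /* shut the workers down */
            pthread_mutex_lock(&w[i].mtx);
            while (w[i].busy)
                pthread_cond_wait(&w[i].cv, &w[i].mtx);
            w[i].done = 1;
            pthread_cond_signal(&w[i].cv);
            pthread_mutex_unlock(&w[i].mtx);
            pthread_join(tid[i], NULL);
        }
        return 0;
    }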
Plots were made of the daily processing of .shs clp data. Data is
plotted vs processed block number (each block normally 10 seconds of
data).
The plots show:
- top: number of accumulated ipps (10 millisecs
  each) in each block. Normally this is 1000 (10 seconds).
- middle: Wall clock spacing between clp blocks.
- If power, mracf, and clp are each run for 10 seconds,
then the clp spacing will be 30 seconds.
- Bottom: Number of blocks each thread processed.
- This shows how well the process was spread between the
45 threads.
- If there are a number of short blocks then some threads
  may finish early and do more blocks.
- Page 2: timing:
- Top: Wait time for each thread startup.
- The first 45 threads have no startup wait time.
- The 46th block must wait for the processing time (880
secs) before it can be started. This is the wait time.
- the wait time should have a spike every 45 block starts
(approximately)
- 2nd frame: Read time for each block.
  - The i/o time to read the 10 seconds (193 Mbytes) of data.
  - I tried some dd's from daeron to rserv1 of 2gb files.
    I was getting close to 100 mb/sec, so it should only
    take a few seconds to read the data.
  - If the datataking was writing to the
    daeron disc at the same time, things could go slower.
  - This read occurs before the thread can be dispatched to start
    its processing.
- The data is read in blocks of 1 second.
- 3rd frame: Thread processing and output time.
  - The output time is usually very short (fractions of a
    second) since the write completes once the data
    gets to the memory buffers.
- the total time is the sum of this time and the read time
for each block.
- 4th frame: average time for 1 4K fft
  - The average and rms are printed in the header.
- 5th frame:
  - the average time to compute the power and accumulate 1 4K
    spectrum.
Processing times by date:

Date     | Plots        | proj  | thrProc avg secs | 4K fft avg usecs | pwrAccum avg | Notes
20111220 | (.ps) (.pdf) | t2574 | 921 +/- 39.6     | 85.7 +/- 1       | 13.7 +/- 3.9 |
20111222 | (.ps) (.pdf) | t2574 | 917 +/- 27.8     | 86.9 +/- 1.5     | 11.7 +/- 2.6 |
20111223 | (.ps) (.pdf) | t2574 | 887 +/- 19.2     | 83.9 +/- .6      | 11.5 +/- 2.2 | blk 1200 to 1600 input took 20 secs
20111224 | (.ps) (.pdf) | t2574 | 913 +/- 28.9     | 85.8 +/- .7      | 12.8 +/- 3.0 |
20111228 | (.ps) (.pdf) | t2573 | 864.9 +/- 17     | 83.0 +/- .4      | 9.8 +/- 2.0  |
20111229 | (.ps) (.pdf) | t2573 | 870.4 +/- 22.1   | 83.0 +/- .4      | 10.4 +/- 2.6 |
20111230 | (.ps) (.pdf) | t2573 | 864.9 +/- 18.0   | 83.9 +/- .4      | 9.9 +/- 2.1  | first 45 blocks ran 40 secs faster than the rest
20120112 | (.ps) (.pdf) | t1193 | 864.9 +/- 18.0   | 83.0 +/- .4      | 9.8          | first 45 blocks ran 40 secs faster; looks like the pwr and accum code started to take longer
20120113 | (.ps) (.pdf) | t1193 | 866.2 +/- 19.2   | 83.1 +/- .3      | 9.9 +/- 2.3  | first 45 blocks ran 40 secs faster, then pwr,accum slowed down
processing:
x101/120114/clp1shstmingchk.pro, tmingchk.pro
09dec13: clp processing times on wombat
The wombat computer was installed 06dec13. It has:
- 2*8-core E5-2680 Xeons running at 2.7 Ghz with a 20Mb cache, and hyperthreading
- 128 GB RAM, 4 TB in a RAID 6 on a 3Ware card
I ran the clp code on wombat using 15 and 30
threads.
- the input data came from rserv2
- the output data went to the wombat disc. (so the times may be
a few seconds faster because the disc write did not have to go
back out over the network).
- This code uses the fftw routines. I did not recompile the
program on wombat (but it did link with the .so libs on wombat).
The plots show the processing
times for clp on wombat (.ps) (.pdf):
- Top: time to process a single 10 sec clp block of data. black
used 30 threads, red used 15 threads
- 2nd: ratio wombat 30/15 thread processing time = 1.7
- hyperthreading does speed up the throughput
- 3rd: 4k complex fft time
- 30 threads: 20 usecs, 15 threads: 10 usecs
- So hyperthreading does not increase the throughput of fft's
- Bottom: wombat 30thread/15thread 4K fft times
- the ratio was 2.
- Hyperthreading does not improve the fft times
processing times, throughput:

clp processing:
cpu    | threads | tm 1 blk (secs) | 1 thread ratio (cpu/rserv1) | (tm 1 blk)/nthreads (secs)
wombat | 15      | 180             | 5                           | 12
wombat | 30      | 300             | 3                           | 10 *
rserv1 | 46      | 900             | 1                           | 19

4K fft processing:
cpu    | threads | tm 1 fft (usecs) | 1 thread ratio (cpu/rserv1) | throughput (nthreads*ffts/sec) | throughput ratio (cpu/rserv1)
wombat | 15      | 10               | 8.3                         | 1.5e6                          | 3
wombat | 30      | 20               | 4.15                        | 1.5e6                          | 3
rserv1 | 46      | 83               | 1                           | .536                           | 1
* with 30 threads, wombat processes 10 secs of data in 10 seconds, so it can
keep up with real time.
Summary:
- fft times.
- wombat is 8 times faster than rserv1 for a 4k fft
(1-15 threads on wombat)
- including the number of threads in the cpu, wombat is 3
times faster than rserv1
- This is probably from
- wombat is 2.7 Ghz, rserv1 is probably 2.1 Ghz
- The cache on rserv1 is smaller
- Buying a version of wombat with a faster clock would
probably speed this up even more.
- clp processing times:
- wombat has a 2 times throughput improvement compared
to rserv1
- With 30 threads, wombat can process clp (5 Mhz, 2 band) data
  in real time (clp running 100% of the time).
- Note: wombat has the same problem with NAN that rserv2 has.
  Until I figure out the problem, it can't be used to reduce the
  data.
  - this is some craziness in my clp program that
    occurs on rserv2 and wombat (but never on rserv1).
processing:
x101/131207/clpwombat.pro
The clp1shs.c program was originally written and
run on rserv1 with an older version of centos (6.x?). The code
ran fine.
I tried recompiling and running the program on rserv2 and megs3.
Both of these versions generated infinities in the output spectra.
In oct18 giacomo upgraded rserv1 to centos
7.5. I ran the program on some .shs files:
- there were no infinities
- the fft code ran 3 times faster.
The 181101-09 echotek data was also processed
on rserv1 with centos 7.5 with no infinities.
On 05dec18 we started a new echotek run, and we now
started getting infinities again:
- The processing used 2 5 Mhz bands
- the infinities only occurred in the 2nd 5 Mhz image. The first
  image was always clean.
I made some plots looking at the data
- t3274 (accidentally placed in t2374) on dstor1
- 50 secs clp, 10 sec tpsd
- rfi and arc removal enabled (these were also enabled for the
hf run)
The plots show the infinities from
the 2nd 5 Mhz block (.ps) (.pdf):
- the first 450 files were processed
- there were 4095 10 sec spectral files generated
- Each file had 2 5 Mhz blocks of 4096 freq channels * 3400 heights
- top: Number of infinities in block 2 vs file index
  - of the 4095 images, 783 files had infinities
  - Each bad file averaged about 20,000 infinities (out of 14e6
    points per image)
- 2nd: maximum height index (count from 0) for the infinities
  in block 2.
  - There were no infinities above height index 60 (out of
    3399).
- bottom: blowup showing how often the infinities occurred.
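A sketch in C of this kind of scan (the actual analysis was done in idl,
x101/181206/rserv1_inf.pro; the fixed sizes below are the values from this
run's hdr files and the command line handling is an assumption):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define FFTLEN 4096
    #define NHGHTS 3400

    int main(int argc, char *argv[])
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file.dcd\n", argv[0]); return 1; }
        FILE *fp = fopen(argv[1], "rb");
        if (!fp) { perror("dcd"); return 1; }

        long npts = (long)FFTLEN * NHGHTS;
        float *img = malloc(sizeof(float) * npts);

        /* skip the 1st frequency block, read the 2nd */
        fseek(fp, sizeof(float) * npts, SEEK_SET);
        if (fread(img, sizeof(float), npts, fp) != (size_t)npts) {
            fprintf(stderr, "short read\n"); return 1;
        }
        fclose(fp);

        long nbad = 0, maxHght = -1;
        for (long i = 0; i < npts; i++) {
            if (!isfinite(img[i])) {          /* catches inf and nan */
                nbad++;
                long h = i / FFTLEN;          /* height index, counting from 0 */
                if (h > maxHght) maxHght = h;
            }
        }
        printf("%s: %ld non-finite points in block 2, max height index %ld\n",
               argv[1], nbad, maxHght);
        free(img);
        return 0;
    }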
Not sure why rserv1 decided to start
generating infinities again. I don't think anything has changed
since
the hf run.
processing:x101/181206/rserv1_inf.pro