cpu | date | version | serverType |
summer1 | 27sep09 | | intel L5420 harpertown cpu |
aserv11 | xx.xx | 3.3.2 | intel X5550 nehalem cpu fftw |
megs3 | 21feb13 | 3.3.2 | intel a2600 I7 quad core sandybridge cpu |
megs3 | 21feb13 | 3.3.3 | |
rserv2 | 27mar13 | 3.3.3 | AMD 6272 bulldozer interlagos with 64 cores. openSUSE 12.2 (kernel 3.4.33, gcc 4.7.1) |
cpu | len | mem align | flag marks (of -O3, -march=bdver1, -mavx, -mprefer-avx128) | per loop usecs | per fft usecs | Date | lib |
rserv2 | 64k | - | x | 1373 | 1290 | 20feb13 | _fma4 |
rserv2 | 64k | 32 | x | 1135 | 1052 | | |
rserv2 | 64k | - | x x | 2183 | 1310 | | |
rserv2 | 64k | 32 | x x | 2142 | 1057 | | |
rserv2 | 64k | 32 | x x x | 1108 | 1047 | | |
rserv2 | 64k | 32 | x x | 2133 | 1048 | | |
rserv2 | 64k | 32 | x x x | 1111 | 1050 | | |
rserv2 | 64k | 32 | x x | 1114 | 1053 | | |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
1024 | 1 | 0 | 18.67 ms | 14.87 us | 3442.0 |
1024 | 4 | 0 | 29.46 ms | 43.46 us | 1178.1 |
1024 | 8 | 0 | 40.54 ms | 68.62 us | 746.1 |
1024 | 1 | 1 | 36.43 ms | 8.37 us | 6115.2 |
1024 | 4 | 1 | 61.23 ms | 54.53 us | 938.9 |
1024 | 8 | 1 | 107.30 ms | 69.15 us | 740.4 |
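The MFLOPS column in these tables appears consistent with the usual FFT benchmark convention of 5 N log2(N) floating point operations per complex transform, divided by the run time in microseconds. That is inferred from the numbers, not stated in the notes. A minimal sketch (the function name `fft_mflops` is made up for illustration):

```python
import math

def fft_mflops(fft_len: int, run_time_us: float) -> float:
    """Nominal MFLOPS for one complex FFT of length fft_len.

    Assumes the 5*N*log2(N) operation count convention; the tables
    do not say which convention was used.
    """
    return 5.0 * fft_len * math.log2(fft_len) / run_time_us

# First row of the 1024 point table: 14.87 us, table lists 3442.0 MFLOPS.
# The small difference comes from the run time being rounded in the table.
print(fft_mflops(1024, 14.87))
```

The same check on the 1 thread, UseSSE2=1 row (8.37 us) lands within a few MFLOPS of the listed 6115.2.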
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
2048 | 1 | 0 | 37.96 ms | 35.25 us | 3195.5 |
2048 | 4 | 0 | 65.88 ms | 75.48 us | 1492.2 |
2048 | 8 | 0 | 84.82 ms | 69.73 us | 1615.3 |
2048 | 1 | 1 | 71.98 ms | 19.30 us | 5834.9 |
2048 | 4 | 1 | 97.61 ms | 52.47 us | 2146.6 |
2048 | 8 | 1 | 144.96 ms | 82.36 us | 1367.7 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
4096 | 1 | 0 | 74.14 ms | 78.51 us | 3130.4 |
4096 | 4 | 0 | 100.95 ms | 81.80 us | 3004.2 |
4096 | 8 | 0 | 117.69 ms | 100.98 us | 2433.8 |
4096 | 1 | 1 | 142.61 ms | 49.21 us | 4993.6 |
4096 | 4 | 1 | 217.56 ms | 72.66 us | 3382.5 |
4096 | 8 | 1 | 310.06 ms | 99.26 us | 2476.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
8192 | 1 | 0 | 147.49 ms | 170.78 us | 3117.9 |
8192 | 4 | 0 | 170.44 ms | 134.49 us | 3959.2 |
8192 | 8 | 0 | 216.80 ms | 152.18 us | 3499.0 |
8192 | 1 | 1 | 284.35 ms | 105.45 us | 5049.8 |
8192 | 4 | 1 | 363.84 ms | 132.57 us | 4016.6 |
8192 | 8 | 1 | 530.91 ms | 142.82 us | 3728.3 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
16384 | 1 | 0 | 292.04 ms | 366.41 us | 3130.1 |
16384 | 4 | 0 | 318.86 ms | 227.42 us | 5043.0 |
16384 | 8 | 0 | 376.15 ms | 175.45 us | 6536.7 |
16384 | 1 | 1 | 555.23 ms | 231.91 us | 4945.4 |
16384 | 4 | 1 | 670.96 ms | 203.48 us | 5636.2 |
16384 | 8 | 1 | 799.97 ms | 224.48 us | 5109.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
32768 | 1 | 0 | 601.26 ms | 795.00 us | 3091.3 |
32768 | 4 | 0 | 629.55 ms | 435.97 us | 5637.1 |
32768 | 8 | 0 | 641.25 ms | 296.72 us | 8282.6 |
32768 | 1 | 1 | 1.05 s | 509.09 us | 4827.4 |
32768 | 4 | 1 | 1.18 s | 431.06 us | 5701.3 |
32768 | 8 | 1 | 1.30 s | 276.88 us | 8876.2 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
65536 | 1 | 0 | 1.66 s | 1.73 ms | 3032.8 |
65536 | 4 | 0 | 1.53 s | 884.81 us | 5925.4 |
65536 | 8 | 0 | 1.48 s | 589.28 us | 8897.1 |
65536 | 1 | 1 | 2.39 s | 1.11 ms | 4740.7 |
65536 | 4 | 1 | 2.47 s | 727.94 us | 7202.4 |
65536 | 8 | 1 | 2.61 s | 521.56 us | 10052.0 |
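To compare thread counts in these tables it helps to turn the RunTm cells into a speedup (1 thread time over N thread time) and a parallel efficiency (speedup over N). The sketch below does that for cells written as in the tables ("589.28 us", "1.73 ms", "1.05 s"). The helper names are made up for illustration.

```python
# Conversion factors from the units used in the RunTm column to microseconds.
UNIT_TO_US = {"us": 1.0, "ms": 1e3, "s": 1e6}

def to_usecs(cell: str) -> float:
    """Convert a table cell such as '1.73 ms' to microseconds."""
    value, unit = cell.split()
    return float(value) * UNIT_TO_US[unit]

def speedup(t1: str, tn: str) -> float:
    """Ratio of the 1 thread run time to the N thread run time."""
    return to_usecs(t1) / to_usecs(tn)

def efficiency(t1: str, tn: str, nthreads: int) -> float:
    """Speedup divided by the number of threads (1.0 is ideal scaling)."""
    return speedup(t1, tn) / nthreads

# 65536 point rows with UseSSE2=0: 1 thread 1.73 ms, 8 threads 589.28 us.
print(speedup("1.73 ms", "589.28 us"))
print(efficiency("1.73 ms", "589.28 us", 8))
```

For these two rows the speedup comes out a little under 3 on 8 threads, so well under half of ideal scaling.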
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
131072 | 1 | 0 | 4.46 s | 4.92 ms | 2263.1 |
131072 | 4 | 0 | 3.73 s | 1.83 ms | 6096.8 |
131072 | 8 | 0 | 3.35 s | 1.14 ms | 9732.9 |
131072 | 1 | 1 | 6.01 s | 3.73 ms | 2984.9 |
131072 | 4 | 1 | 5.47 s | 1.56 ms | 7138.3 |
131072 | 8 | 1 | 5.33 s | 1.59 ms | 7009.2 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
262144 | 1 | 0 | 13.77 s | 12.99 ms | 1816.2 |
262144 | 4 | 0 | 10.27 s | 4.68 ms | 5041.8 |
262144 | 8 | 0 | 9.10 s | 3.12 ms | 7553.4 |
262144 | 1 | 1 | 18.21 s | 10.51 ms | 2244.6 |
262144 | 4 | 1 | 13.95 s | 4.02 ms | 5872.5 |
262144 | 8 | 1 | 13.42 s | 2.93 ms | 8060.5 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
524288 | 1 | 0 | 6.38 s | 31.44 ms | 1584.1 |
524288 | 4 | 0 | 3.75 s | 11.61 ms | 4291.5 |
524288 | 8 | 0 | 2.99 s | 7.40 ms | 6731.2 |
524288 | 1 | 1 | 5.71 s | 24.60 ms | 2024.9 |
524288 | 4 | 1 | 3.68 s | 10.32 ms | 4825.4 |
524288 | 8 | 1 | 3.12 s | 7.03 ms | 7083.5 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
1048576 | 1 | 0 | 18.22 s | 65.67 ms | 1596.7 |
1048576 | 4 | 0 | 10.73 s | 21.61 ms | 4853.4 |
1048576 | 8 | 0 | 8.96 s | 14.69 ms | 7139.0 |
1048576 | 1 | 1 | 16.44 s | 53.33 ms | 1966.2 |
1048576 | 4 | 1 | 11.50 s | 18.18 ms | 5766.8 |
1048576 | 8 | 1 | 8.57 s | 13.14 ms | 7978.8 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
1024 | 1 | 1 | 36.08 ms | 3.20 us | 15994.0 |
1024 | 4 | 1 | 100.03 ms | 38.96 us | 1314.1 |
1024 | 8 | 1 | 143.75 ms | 36.01 us | 1421.8 |
1024 | 12 | 1 | 166.72 ms | 32.17 us | 1591.6 |
1024 | 16 | 1 | 173.61 ms | 61.41 us | 833.7 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
2048 | 1 | 1 | 58.58 ms | 7.32 us | 15398.0 |
2048 | 4 | 1 | 163.33 ms | 44.80 us | 2514.0 |
2048 | 8 | 1 | 192.12 ms | 45.64 us | 2467.8 |
2048 | 12 | 1 | 240.42 ms | 59.04 us | 1908.0 |
2048 | 16 | 1 | 304.54 ms | 83.88 us | 1342.8 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
4096 | 1 | 1 | 101.56 ms | 17.22 us | 14268.0 |
4096 | 4 | 1 | 226.86 ms | 56.85 us | 4322.8 |
4096 | 8 | 1 | 311.78 ms | 65.09 us | 3775.9 |
4096 | 12 | 1 | 301.65 ms | 84.76 us | 2899.6 |
4096 | 16 | 1 | 375.20 ms | 67.17 us | 3658.7 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
8192 | 1 | 1 | 190.83 ms | 48.95 us | 10877.0 |
8192 | 4 | 1 | 346.32 ms | 60.65 us | 8779.8 |
8192 | 8 | 1 | 449.22 ms | 70.93 us | 7506.7 |
8192 | 12 | 1 | 511.94 ms | 114.89 us | 4634.7 |
8192 | 16 | 1 | 564.90 ms | 80.33 us | 6628.8 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
16384 | 1 | 1 | 349.97 ms | 115.52 us | 9927.7 |
16384 | 4 | 1 | 569.72 ms | 82.45 us | 13911.0 |
16384 | 8 | 1 | 635.73 ms | 122.51 us | 9361.7 |
16384 | 12 | 1 | 660.19 ms | 112.47 us | 10197.0 |
16384 | 16 | 1 | 800.49 ms | 93.67 us | 12244.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
32768 | 1 | 1 | 655.07 ms | 257.95 us | 9527.3 |
32768 | 4 | 1 | 1.03 s | 133.98 us | 18344.0 |
32768 | 8 | 1 | 1.11 s | 162.08 us | 15163.0 |
32768 | 12 | 1 | 1.15 s | 135.22 us | 18175.0 |
32768 | 16 | 1 | 1.12 s | 139.38 us | 17632.0 |
Intel ps_ipps benchmark: |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
65536 | 1 | 1 | 1.49 s | 577.28 us | 9082.0 |
65536 | 4 | 1 | 2.01 s | 314.28 us | 16682.0 |
65536 | 8 | 1 | 1.81 s | 188.17 us | 27862.0 |
65536 | 12 | 1 | 1.60 s | 185.53 us | 28259.0 |
65536 | 16 | 1 | 1.97 s | 293.89 us | 17840.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
131072 | 1 | 1 | 3.23 s | 1.20 ms | 9295.4 |
131072 | 4 | 1 | 4.11 s | 493.94 us | 22556.0 |
131072 | 8 | 1 | 3.55 s | 299.66 us | 37180.0 |
131072 | 12 | 1 | 3.06 s | 270.83 us | 41137.0 |
131072 | 16 | 1 | 3.26 s | 319.06 us | 34918.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
262144 | 1 | 1 | 7.03 s | 2.68 ms | 8792.7 |
262144 | 4 | 1 | 8.19 s | 858.25 us | 27490.0 |
262144 | 8 | 1 | 7.56 s | 795.50 us | 29658.0 |
262144 | 12 | 1 | 5.40 s | 783.50 us | 30112.0 |
262144 | 16 | 1 | 6.20 s | 1.09 ms | 21605.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
524288 | 1 | 1 | 2.13 s | 11.29 ms | 4413.6 |
524288 | 4 | 1 | 1.27 s | 3.17 ms | 15708.0 |
524288 | 8 | 1 | 1.36 s | 3.07 ms | 16215.0 |
524288 | 12 | 1 | 12.45 s | 2.19 ms | 22737.0 |
524288 | 16 | 1 | 1.06 s | 2.56 ms | 19441.0 |
FFTLEN | Nthreads | UseSSE2 | SetupTm | RunTm | MFLOPS |
1048576 | 1 | 1 | 6.26 s | 24.73 ms | 4239.2 |
1048576 | 4 | 1 | 3.61 s | 7.54 ms | 13911.0 |
1048576 | 8 | 1 | 3.21 s | 6.00 ms | 17484.0 |
1048576 | 12 | 1 | 41.21 s | 6.19 ms | 16933.0 |
1048576 | 16 | 1 | 2.89 s | 5.31 ms | 19740.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
1024 | 1 | avx | 12.34 ms | 1.36 us | 37644.0 |
1024 | 4 | avx | 41.38 ms | 10.02 us | 5107.8 |
1024 | 8 | avx | 46.98 ms | 13.88 us | 3687.5 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
2048 | 1 | avx | 21.92 ms | 3.70 us | 30434.0 |
2048 | 4 | avx | 50.52 ms | 9.39 us | 12000.0 |
2048 | 8 | avx | 77.85 ms | 16.82 us | 6696.3 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
4096 | 1 | avx | 38.60 ms | 8.80 us | 27916.0 |
4096 | 4 | avx | 84.91 ms | 11.71 us | 20984.0 |
4096 | 8 | avx | 97.09 ms | 17.57 us | 13984.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
8192 | 1 | avx | 73.73 ms | 22.43 us | 23740.0 |
8192 | 4 | avx | 119.75 ms | 16.35 us | 32563.0 |
8192 | 8 | avx | 142.31 ms | 25.87 us | 20580.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
16384 | 1 | avx | 137.73 ms | 52.25 us | 21950.0 |
16384 | 4 | avx | 206.76 ms | 25.70 us | 44627.0 |
16384 | 8 | avx | 213.62 ms | 29.53 us | 38841.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
32768 | 1 | avx | 271.86 ms | 124.62 us | 19721.0 |
32768 | 4 | avx | 373.26 ms | 45.91 us | 53526.0 |
32768 | 8 | avx | 401.74 ms | 88.19 us | 27867.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
65536 | 1 | avx | 678.04 ms | 275.86 us | 19006.0 |
65536 | 4 | avx | 813.69 ms | 90.05 us | 58219.0 |
65536 | 8 | avx | 752.61 ms | 84.72 us | 61886.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
131072 | 1 | avx | 1.53 s | 602.25 us | 18499.0 |
131072 | 4 | avx | 1.77 s | 176.06 us | 63279.0 |
131072 | 8 | avx | 1.49 s | 163.75 us | 68037.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
262144 | 1 | avx | 3.42 s | 1.26 ms | 18673.0 |
262144 | 4 | avx | 3.93 s | 381.19 us | 61893.0 |
262144 | 8 | avx | 3.26 s | 396.69 us | 59475.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
524288 | 1 | avx | 1.22 s | 7.05 ms | 7060.4 |
524288 | 4 | avx | 753.81 ms | 2.10 ms | 23697.0 |
524288 | 8 | avx | 617.56 ms | 2.02 ms | 24694.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
1048576 | 1 | avx | 3.92 s | 16.20 ms | 6472.7 |
1048576 | 4 | avx | 2.44 s | 5.50 ms | 19070.0 |
1048576 | 8 | avx | 2.30 s | 5.18 ms | 20223.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
1024 | 1 | avx | 13.32 ms | 1.31 us | 39188.0 |
1024 | 4 | avx | 32.82 ms | 12.11 us | 4228.5 |
1024 | 8 | avx | 48.87 ms | 12.99 us | 3942.3 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
2048 | 1 | avx | 24.41 ms | 3.42 us | 32948.0 |
2048 | 4 | avx | 49.20 ms | 10.85 us | 10384.0 |
2048 | 8 | avx | 72.02 ms | 15.54 us | 7247.5 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
4096 | 1 | avx | 39.52 ms | 8.39 us | 29292.0 |
4096 | 4 | avx | 74.76 ms | 14.93 us | 16463.0 |
4096 | 8 | avx | 100.88 ms | 16.94 us | 14511.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
8192 | 1 | avx | 74.70 ms | 22.03 us | 24174.0 |
8192 | 4 | avx | 121.43 ms | 17.62 us | 30215.0 |
8192 | 8 | avx | 139.91 ms | 22.96 us | 23197.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
16384 | 1 | avx | 145.28 ms | 52.37 us | 21901.0 |
16384 | 4 | avx | 204.60 ms | 25.44 us | 45083.0 |
16384 | 8 | avx | 217.31 ms | 39.05 us | 29366.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
32768 | 1 | avx | 278.92 ms | 124.88 us | 19679.0 |
32768 | 4 | avx | 387.19 ms | 45.80 us | 53658.0 |
32768 | 8 | avx | 408.18 ms | 62.48 us | 39336.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
65536 | 1 | avx | 677.89 ms | 280.27 us | 18707.0 |
65536 | 4 | avx | 815.73 ms | 91.06 us | 57575.0 |
65536 | 8 | avx | 729.79 ms | 132.50 us | 39569.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
131072 | 1 | avx | 1.51 s | 591.62 us | 18831.0 |
131072 | 4 | avx | 1.84 s | 189.84 us | 58686.0 |
131072 | 8 | avx | 1.61 s | 251.11 us | 44368.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
262144 | 1 | avx | 3.45 s | 1.30 ms | 18176.0 |
262144 | 4 | avx | 3.75 s | 379.97 us | 62092.0 |
262144 | 8 | avx | 3.38 s | 320.37 us | 73642.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
524288 | 1 | avx | 1.20 s | 7.02 ms | 7095.6 |
524288 | 4 | avx | 721.68 ms | 2.27 ms | 21948.0 |
524288 | 8 | avx | 611.24 ms | 2.06 ms | 24143.0 |
FFTLEN | Nthreads | SSE/AVX | SetupTm | RunTm | MFLOPS |
1048576 | 1 | avx | 3.93 s | 15.97 ms | 6566.3 |
1048576 | 4 | avx | 2.71 s | 5.50 ms | 19062.0 |
1048576 | 8 | avx | 2.24 s | 5.27 ms | 19910.0 |
The table below has the avx times:
The table below shows some timing differences between fftw versions:
cpu | length | threads | time v3.3.2 (usecs) | time v3.3.3 (usecs) |
megs3 | 64k | 8 | 85 | 132 |
megs3 | 128k | 8 | 163 | 251 |
megs3 | 256k | 8 | 397 | 320 |
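As a quick way to read the version comparison, the ratio of the v3.3.3 time to the v3.3.2 time can be computed per FFT length; a ratio above 1 means v3.3.3 was slower. This is only a sketch using the three rows of the table, and the variable and function names are made up for illustration.

```python
# (length, v3.3.2 usecs, v3.3.3 usecs) for megs3 with 8 threads, from the table.
TIMES_USECS = [
    ("64k", 85, 132),
    ("128k", 163, 251),
    ("256k", 397, 320),
]

def version_ratio(t_v332: float, t_v333: float) -> float:
    """Time ratio v3.3.3 / v3.3.2; above 1 means v3.3.3 took longer."""
    return t_v333 / t_v332

for length, t_old, t_new in TIMES_USECS:
    print(f"{length:>5}: v3.3.3/v3.3.2 = {version_ratio(t_old, t_new):.2f}")
```

In these measurements v3.3.3 took roughly one and a half times as long at 64k and 128k, but was faster at 256k.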
notes:
FFTlen | 1 thread aserv11 (usecs) | 1 thread adslinux (usecs) | 1 thread megs3 (usecs) | 2 threads 1 fft aserv11 (usecs) | 2 threads 1 fft adslinux (usecs) |
1K | 3.4 | 1.4 | 1.5 | - | |
2K | 7.6 | 3.3 | 3.3 | - | |
4K | 17.3 | 8.4 | 8.3 | - | |
8K | 37.4 | 18.8 | 18.7 | 26.9 | |
16K | 89.0 | 44.6 | 43.0 | 55.3 | |
32K | 193. | 102.5 | 106.5 | 111.6 | |
64K | 435. | 231 | 227.6 | 234. | |
128K | 933. | 487 | 483.4 | 496. | |
256K | 2296. | 1153 | 1143.1 | - | |
512K | 5709 | 3290 | 3227.1 | - | |
1024K | 12391 | 8088 | 8150 | - | |
2048K | 26776 | 17580 | 18134 | - | |
4096K | 55970. | 37458 | 38025 | - | |
8192K | 129655. | 83112 | 96197 | - | |
16384K | 270900. | 178718 | 213809 | - | |
The idl routine (atmclp) processes the coded long process atm data. It was used to benchmark some of the dual processor cpus at the observatory. The data set used was:
| freq(ghz) | thread | kernel | secs | secs |
53 | 53 (repeat) |
(but cpu was busy) |
For the aolc computers you should spread the jobs out over
multiple cpus rather than trying to run two of the same on the
same cpu (until arun gets a chance to update the kernels).