OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison
Tim Cramer*, Dirk Schmidl*, Michael Klemm°, Dieter an Mey*
*JARA, RWTH Aachen University, Germany, Email: {cramer, schmidl, anmey}@rz.rwth-aachen.de
°Intel Corporation, Email: [email protected]
Rechen- und Kommunikationszentrum (RZ)
Motivation & Content
Portability of OpenMP applications for the Intel Xeon Phi¹
• Effort
• Programming models
• Performance
Synchronization overhead of OpenMP
Performance Modeling
• Performance evaluation with simple kernel benchmarks and a sparse Conjugate Gradient kernel
• Applying the Roofline performance model to investigate the utilization
• Comparison with large SMP production systems

¹ preproduction system
The Comparison: Intel Xeon Phi Coprocessor vs. Bull’s Coherence Switch (BCS) System

Intel Xeon Phi Coprocessor (Source: Intel)
• 1 x Intel Xeon Phi @ 1090 MHz
• 61 Cores
• ~1 TFLOPS DP Peak
• 4 Hardware threads per Core
• 8 GB GDDR5 memory
• 512-bit SIMD vectors
• Plugged into PCI Express bus
BCS System (Source: D. Both, Bull GmbH)
• 16 x Intel Xeon X7550 @ 2 GHz
• 128 Cores
• 1 TFLOPS DP Peak (back-of-the-envelope check below)
• 1 Hardware thread per Core (Hyper-Threading deactivated)
• Up to 2 TB DDR3 memory
• 128-bit SIMD vectors
• 6 HU in a rack

David vs. Goliath
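For orientation, a hedged back-of-the-envelope check of the two DP peak numbers, assuming the usual per-core pipelines (8-wide DP SIMD with FMA on the Xeon Phi, 2-wide DP SSE with separate add and multiply units on the Nehalem-EX):

% Xeon Phi: 61 cores x 8 DP lanes x 2 FLOPS/lane (FMA) x 1.09 GHz
P_{\mathrm{Phi}} = 61 \cdot 8 \cdot 2 \cdot 1.09\,\mathrm{GHz} \approx 1064\,\mathrm{GFLOPS} \approx 1\,\mathrm{TFLOPS}
% BCS: 128 cores x 2 DP lanes x 2 FLOPS/cycle (add + mul) x 2.0 GHz
P_{\mathrm{BCS}} = 128 \cdot 2 \cdot 2 \cdot 2.0\,\mathrm{GHz} = 1024\,\mathrm{GFLOPS} \approx 1\,\mathrm{TFLOPS}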
Bull’s Coherence Switch (BCS)
High-level overview
• 4 boards with 4 sockets each, connected over xQPI
• 16 sockets in total
• 2 levels of NUMAness
• 128 Cores
[Block diagram of the BCS system: 16 NHM-EX sockets on 4 boards, each board with its own I/O hub with InfiniBand (IB) and a BCS chip; the BCS chips are coupled by 2 XCSI cables. Source: D. Gutfreund, Bull GmbH]
Intel Xeon Phi Programming Models
• Host only: main(), foo(), MPI_*() run on the host CPU
• Language Extension for Offload (“LEO”): main() and MPI_*() run on the host; foo() is offloaded to the coprocessor via #pragma offload (minimal sketch below)
• Symmetric: main(), foo(), MPI_*() run on both host and coprocessor
• Coprocessor only (“native”): main(), foo(), MPI_*() run entirely on the coprocessor; additional compiler flag -mmic
[Diagram: host CPU and Intel Xeon Phi connected via the PCI Express bus]
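As an illustration of the LEO model, a minimal hedged sketch in C (array names and sizes are illustrative, not taken from the authors' code). Compiled for the host it offloads foo(); a native build would instead use -mmic and drop the offload pragma.

/* Minimal LEO sketch (Intel C/C++ compiler): offload a parallel loop to the coprocessor. */
#include <stdio.h>

#define N 1000000

/* The offloaded function must also be compiled for the coprocessor side. */
__attribute__((target(mic)))
void foo(double *a, double *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; ++i) b[i] = (double)i;

    /* LEO: copy b in, run foo() on the coprocessor, copy a back out. */
    #pragma offload target(mic) in(b) out(a)
    foo(a, b, N);

    printf("a[42] = %f\n", a[42]);
    return 0;
}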
STREAM Memory Bandwidth Results
• 2 GB memory footprint
• Balanced thread placement achieves the best performance on the coprocessor (60 threads): up to 156 GB/s
• BCS (16 memory controllers) has a higher total bandwidth than the Intel Xeon Phi
• KNC achieves a higher bandwidth than 8 Nehalem-EX sockets
A triad-style kernel sketch follows below.
[Plot: bandwidth in GB/s (0-250) over 1-256 threads for BCS scatter, BCS compact, coprocessor balanced, and coprocessor compact]
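A minimal STREAM-triad-like bandwidth kernel in C with OpenMP (a sketch, not the official STREAM benchmark; the array size is illustrative). The placements named in the plot legend can be requested from the Intel OpenMP runtime via KMP_AFFINITY=balanced, compact, or scatter.

/* Triad-style bandwidth sketch. Run e.g. with KMP_AFFINITY=balanced on the coprocessor. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const size_t n = 1 << 26;                  /* ~0.5 GB per double array (illustrative) */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialization in parallel so pages are distributed across the threads. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];           /* triad: 24 bytes moved per iteration */
    t = omp_get_wtime() - t;

    printf("bandwidth: %.1f GB/s\n", 3.0 * n * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}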
EPCC Microbenchmarks: OpenMP Synchronization Overhead
• No big difference in performance between LEO (“offload”) and native execution
• Faster synchronization on MIC with 120 threads than on the BCS with 128 threads
• Overhead of “#pragma offload”: 91.1 µs (3x higher than a “parallel for” with 240 threads)
• These relatively small overheads will not prevent OpenMP applications from scaling on the Intel Xeon Phi (a minimal measurement sketch follows below)
[Plot: overhead of a parallel-for in microseconds (0-70) over 1-256 threads for the BCS system, Intel Xeon Phi native, and Intel Xeon Phi LEO]
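In the spirit of the EPCC syncbench, a minimal hedged sketch of how such a construct overhead can be measured (not the original benchmark code): the time of a repeated, nearly empty parallel construct is compared with the same work done serially.

/* Construct-overhead sketch: overhead ≈ (parallel time - serial reference time) / repetitions. */
#include <stdio.h>
#include <omp.h>

#define REPS 10000

volatile double sink;               /* keeps the compiler from removing the loop body */

static void delay(void) {
    double x = 0.0;
    for (int i = 0; i < 128; ++i) x += i * 0.5;
    sink = x;
}

int main(void) {
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r)
        delay();                    /* reference: serial execution of the body */
    double t_ref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel        /* measured construct */
        delay();
    }
    double t_par = omp_get_wtime() - t0;

    printf("construct overhead: %.2f us\n", (t_par - t_ref) / REPS * 1e6);
    return 0;
}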
Conjugate Gradient Method: Implementation Variants and Testcase
Implementation variants
• Intel Math Kernel Library (MKL)
• OpenMP with first-touch optimal distribution (no static schedule)
Testcase
• Fluorem/HV15R: N = 2,017,169, nnz = 283,073,458, 3.2 GB memory footprint

Runtime Shares of the Linear Algebra Kernels (OpenMP only)

System   | #Threads | Serial Time [s] | Parallel Time [s] | daxpy / dxpay | dot product | SMXV
Xeon Phi | 244      | 2387.40         | 32.24             | 3.71 %        | 1.89 %      | 94.03 %
BCS      | 128      | 1176.81         | 18.10             | 11.41 %       | 4.45 %      | 84.01 %

• SMXV is more dominant on MIC than on BCS
• The share of daxpy (y = a*x + y) / dxpay (y = x + a*y) is much higher on BCS (sketches of both kernels below)
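For reference, the two vector-update kernels from the table as they might look in a plain OpenMP implementation (a sketch with illustrative signatures, not the authors' code):

/* daxpy: y = a*x + y */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* dxpay: y = x + a*y */
void dxpay(size_t n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] + a * y[i];
}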
Sparse Matrix-Vector Multiplication (1/4): Roofline Model
• Using read memory bandwidth BW and theoretical peak performance P
• Model for SMXV y = A * x
• Assumptions: x, y can be kept in the cache (~15 MB each); n << nnz
• Compressed Row Storage (CRS) format: one value (double) and one index (int) element have to be loaded per non-zero (dimension nnz) → 12 bytes
• Operational intensity: O = 2 FLOPS / 12 bytes = 1/6 FLOPS/byte (→ memory-bound)
• Performance limit: L = min{P, O * BW} (kernel sketch below)
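A minimal CRS-based SMXV kernel in C with OpenMP makes the counting above concrete: per non-zero, one double value and one int index are streamed and two FLOPS are performed. This is a sketch with illustrative names, not the authors' implementation, which uses a first-touch optimal row distribution rather than a plain parallel for.

/* SMXV y = A*x in CRS format.
   Per non-zero: 8-byte value + 4-byte column index -> 12 bytes, 1 multiply + 1 add -> 2 FLOPS,
   hence O = 1/6 FLOPS/byte and the roofline limit L = min{P, O * BW} is bandwidth-bound. */
void smxv_crs(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   /* 2 FLOPS, 12 bytes per non-zero */
        y[row] = sum;
    }
}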
Sparse Matrix-Vector Multiplication (2/4)
[Roofline plot for the BCS (read memory bandwidth 236.5 GB/s): GFLOPS over operational intensity in FLOPS/byte; peak FP performance and the SMXV kernel limit of 39.4 GFLOPS are marked]
Sparse Matrix-Vector Multiplication (3/4)
[Roofline plot for the Xeon Phi (read memory bandwidth 122.1 GB/s): GFLOPS over operational intensity in FLOPS/byte; peak FP performance and the SMXV kernel limit of 20.35 GFLOPS are marked]
Sparse Matrix-Vector Multiplication (4/4): Reached Absolute Performance
• Best results are close to the theoretical maximum predicted by the Roofline model (BCS: 37.2 of 39.4 GFLOPS; coprocessor: 18.7 of 20.4 GFLOPS)
• BCS: performance with 128 threads is only slightly better than with 64 (also reflected in the STREAM measurements)
• All results on the Xeon Phi were reached without special tuning (unmodified code)
• MIC: hyperthreads really help to reach this performance
[Plot: GFLOPS (0-40) over 1-256 threads for OpenMP and MKL on the coprocessor (balanced) and on the BCS (scatter, compact)]
Conjugate Gradient Method (Scalability): Speedup for 1000 Iterations
• OpenMP, BCS: speedup of 65 with compact thread placement
• OpenMP, Xeon Phi: speedup of 53 without SMT / 74 with all hardware threads (MKL: 79)
• MKL: absolute performance lower on the BCS / similar on the Intel Xeon Phi
[Plot: speedup (0-90) over 1-256 threads for OpenMP and MKL on the coprocessor (balanced) and on the BCS (scatter, compact)]
SIMD Vectorization for SMXV: Results (OpenMP implementation)
• Only small differences between the vectorized and the non-vectorized (-no-vec) version
• For a small number of threads: vectorization speedup of 2
• With the Roofline model: using the complete bandwidth is also possible with non-vectorizable code
• (Compiler-flag sketch below)
[Plot, logarithmic scale: GFLOPS (0.1-100) over 1-240 threads for KNC, -no-vec and KNC, vectorized]
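A hedged sketch of how the two series in the plot can be produced with the Intel compiler: the same CRS inner loop is built once with auto-vectorization enabled and once with -no-vec; an ivdep hint may help the vectorizer with the indirect access. Function name and signature are illustrative.

/* SMXV inner loop with an Intel-compiler vectorization hint.
   Build the two variants shown in the plot e.g. as:
     icc -O3 -mmic         smxv.c -o smxv_vec      (auto-vectorized)
     icc -O3 -mmic -no-vec smxv.c -o smxv_novec    (vectorization disabled) */
double row_sum(int begin, int end, const int *col_idx,
               const double *val, const double *x) {
    double sum = 0.0;
    #pragma ivdep            /* assert no loop-carried dependence despite x[col_idx[j]] */
    for (int j = begin; j < end; ++j)
        sum += val[j] * x[col_idx[j]];
    return sum;
}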
Conclusion & Future Work
OpenMP Programming on Intel Xeon Phi
• Easy cross-compiling / porting possible
• Smaller synchronization overhead on MIC than on big SMP machines
Bandwidth
• Impressive bandwidth on the Intel Xeon Phi (up to 156 GB/s)
• Higher than on 8 Nehalem-EX sockets
Productivity / Performance
• Good performance results with cross-compiling
• No code modifications or special tuning necessary
Future Work
• Performance investigations for distributed memory paradigms like MPI on one or multiple coprocessors
The End
Thank you for your attention.