OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison
Tim Cramer*, Dirk Schmidl*, Michael Klemm°, Dieter an Mey*
*JARA, RWTH Aachen University, Germany, Email: {cramer, schmidl, anmey}@rz.rwth-aachen.de
°Intel Corporation, Email: [email protected]
Rechen- und Kommunikationszentrum (RZ)
Motivation & Content
Portability of OpenMP applications for the Intel Xeon Phi¹
• Effort
• Programming models
• Performance
Synchronization overhead of OpenMP
Performance Modeling
• Performance evaluation with simple kernel benchmarks and a sparse Conjugate Gradient kernel
• Applying the Roofline performance model to investigate the utilization
• Comparison with large SMP production systems

¹ preproduction system
The Comparison: Intel Xeon Phi Coprocessor vs. Bull’s Coherence Switch (BCS) System

Intel Xeon Phi Coprocessor (Source: Intel)
• 1 x Intel Xeon Phi @ 1090 MHz
• 61 Cores
• ~1 TFLOPS DP Peak
• 4 Hardware threads per Core
• 8 GB GDDR5 memory
• 512-bit SIMD vectors
• Plugged into PCI Express bus
BCS System (Source: D. Both, Bull GmbH)
• 16 x Intel Xeon X7550 @ 2 GHz
• 128 Cores
• 1 TFLOPS DP Peak (back-of-the-envelope check below)
• 1 Hardware thread per Core (Hyper-Threading deactivated)
• Up to 2 TB DDR3 memory
• 128-bit SIMD vectors
• 6 HU in a rack

David vs. Goliath
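For orientation, a hedged back-of-the-envelope check of the two DP peak numbers, assuming the usual per-core pipelines (8-wide DP SIMD with FMA on the Xeon Phi, 2-wide DP SSE with separate add and multiply units on the Nehalem-EX):

% Xeon Phi: 61 cores x 8 DP lanes x 2 FLOPS/lane (FMA) x 1.09 GHz
P_{\mathrm{Phi}} = 61 \cdot 8 \cdot 2 \cdot 1.09\,\mathrm{GHz} \approx 1064\,\mathrm{GFLOPS} \approx 1\,\mathrm{TFLOPS}
% BCS: 128 cores x 2 DP lanes x 2 FLOPS/cycle (add + mul) x 2.0 GHz
P_{\mathrm{BCS}} = 128 \cdot 2 \cdot 2 \cdot 2.0\,\mathrm{GHz} = 1024\,\mathrm{GFLOPS} \approx 1\,\mathrm{TFLOPS}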
Bull’s Coherence Switch (BCS)
High-level overview
• 4 boards with 4 sockets each, connected over xQPI
• 16 sockets in total
• 2 levels of NUMAness
• 128 Cores
[Block diagram of the BCS system: 16 NHM-EX sockets on 4 boards, each board with its own I/O hub with InfiniBand (IB) and a BCS chip; the BCS chips are coupled by 2 XCSI cables. Source: D. Gutfreund, Bull GmbH]
Intel Xeon Phi Programming Models
• Host only: main(), foo(), MPI_*() run on the host CPU
• Language Extension for Offload (“LEO”): main() and MPI_*() run on the host; foo() is offloaded to the coprocessor via #pragma offload (minimal sketch below)
• Symmetric: main(), foo(), MPI_*() run on both host and coprocessor
• Coprocessor only (“native”): main(), foo(), MPI_*() run entirely on the coprocessor; additional compiler flag -mmic
[Diagram: host CPU and Intel Xeon Phi connected via the PCI Express bus]
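As an illustration of the LEO model, a minimal hedged sketch in C (array names and sizes are illustrative, not taken from the authors' code). Compiled for the host it offloads foo(); a native build would instead use -mmic and drop the offload pragma.

/* Minimal LEO sketch (Intel C/C++ compiler): offload a parallel loop to the coprocessor. */
#include <stdio.h>

#define N 1000000

/* The offloaded function must also be compiled for the coprocessor side. */
__attribute__((target(mic)))
void foo(double *a, double *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; ++i) b[i] = (double)i;

    /* LEO: copy b in, run foo() on the coprocessor, copy a back out. */
    #pragma offload target(mic) in(b) out(a)
    foo(a, b, N);

    printf("a[42] = %f\n", a[42]);
    return 0;
}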
STREAM Memory Bandwidth Results
• 2 GB memory footprint
• Balanced thread placement achieves the best performance on the coprocessor (60 threads): up to 156 GB/s
• BCS (16 memory controllers) has a higher total bandwidth than the Intel Xeon Phi
• KNC achieves a higher bandwidth than 8 Nehalem-EX sockets
A triad-style kernel sketch follows below.
[Plot: bandwidth in GB/s (0-250) over 1-256 threads for BCS scatter, BCS compact, coprocessor balanced, and coprocessor compact]
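A minimal STREAM-triad-like bandwidth kernel in C with OpenMP (a sketch, not the official STREAM benchmark; the array size is illustrative). The placements named in the plot legend can be requested from the Intel OpenMP runtime via KMP_AFFINITY=balanced, compact, or scatter.

/* Triad-style bandwidth sketch. Run e.g. with KMP_AFFINITY=balanced on the coprocessor. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const size_t n = 1 << 26;                  /* ~0.5 GB per double array (illustrative) */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialization in parallel so pages are distributed across the threads. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];           /* triad: 24 bytes moved per iteration */
    t = omp_get_wtime() - t;

    printf("bandwidth: %.1f GB/s\n", 3.0 * n * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}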
EPCC Microbenchmarks: OpenMP Synchronization Overhead
• No big difference in performance between LEO (“offload”) and native execution
• Faster synchronization on MIC with 120 threads than on the BCS with 128 threads
• Overhead of “#pragma offload”: 91.1 µs (3x higher than a “parallel for” with 240 threads)
• These relatively small overheads will not prevent OpenMP applications from scaling on the Intel Xeon Phi (a minimal measurement sketch follows below)
[Plot: overhead of a parallel-for in microseconds (0-70) over 1-256 threads for the BCS system, Intel Xeon Phi native, and Intel Xeon Phi LEO]
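In the spirit of the EPCC syncbench, a minimal hedged sketch of how such a construct overhead can be measured (not the original benchmark code): the time of a repeated, nearly empty parallel construct is compared with the same work done serially.

/* Construct-overhead sketch: overhead ≈ (parallel time - serial reference time) / repetitions. */
#include <stdio.h>
#include <omp.h>

#define REPS 10000

volatile double sink;               /* keeps the compiler from removing the loop body */

static void delay(void) {
    double x = 0.0;
    for (int i = 0; i < 128; ++i) x += i * 0.5;
    sink = x;
}

int main(void) {
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r)
        delay();                    /* reference: serial execution of the body */
    double t_ref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel        /* measured construct */
        delay();
    }
    double t_par = omp_get_wtime() - t0;

    printf("construct overhead: %.2f us\n", (t_par - t_ref) / REPS * 1e6);
    return 0;
}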
Conjugate Gradient Method: Implementation Variants and Testcase
Implementation variants
• Intel Math Kernel Library (MKL)
• OpenMP with first-touch optimal distribution (no static schedule)
Testcase
• Fluorem/HV15R: N = 2,017,169, nnz = 283,073,458, 3.2 GB memory footprint

Runtime Shares of the Linear Algebra Kernels (OpenMP only)

System   | #Threads | Serial Time [s] | Parallel Time [s] | daxpy / dxpay | dot product | SMXV
Xeon Phi | 244      | 2387.40         | 32.24             | 3.71 %        | 1.89 %      | 94.03 %
BCS      | 128      | 1176.81         | 18.10             | 11.41 %       | 4.45 %      | 84.01 %

• SMXV is more dominant on MIC than on BCS
• The share of daxpy (y = a*x + y) / dxpay (y = x + a*y) is much higher on BCS (sketches of both kernels below)
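For reference, the two vector-update kernels from the table as they might look in a plain OpenMP implementation (a sketch with illustrative signatures, not the authors' code):

/* daxpy: y = a*x + y */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* dxpay: y = x + a*y */
void dxpay(size_t n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] + a * y[i];
}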
Sparse Matrix-Vector Multiplication (1/4): Roofline Model
• Using read memory bandwidth BW and theoretical peak performance P
• Model for SMXV y = A * x
• Assumptions: x, y can be kept in the cache (~15 MB each); n << nnz
• Compressed Row Storage (CRS) format: one value (double) and one index (int) element have to be loaded per non-zero (dimension nnz) → 12 bytes
• Operational intensity: O = 2 FLOPS / 12 bytes = 1/6 FLOPS/byte (→ memory-bound)
• Performance limit: L = min{P, O * BW} (kernel sketch below)
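A minimal CRS-based SMXV kernel in C with OpenMP makes the counting above concrete: per non-zero, one double value and one int index are streamed and two FLOPS are performed. This is a sketch with illustrative names, not the authors' implementation, which uses a first-touch optimal row distribution rather than a plain parallel for.

/* SMXV y = A*x in CRS format.
   Per non-zero: 8-byte value + 4-byte column index -> 12 bytes, 1 multiply + 1 add -> 2 FLOPS,
   hence O = 1/6 FLOPS/byte and the roofline limit L = min{P, O * BW} is bandwidth-bound. */
void smxv_crs(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   /* 2 FLOPS, 12 bytes per non-zero */
        y[row] = sum;
    }
}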
Sparse Matrix-Vector Multiplication (2/4)
[Roofline plot for the BCS (read memory bandwidth 236.5 GB/s): GFLOPS over operational intensity in FLOPS/byte; peak FP performance and the SMXV kernel limit of 39.4 GFLOPS are marked]
Sparse Matrix-Vector Multiplication (3/4)
[Roofline plot for the Xeon Phi (read memory bandwidth 122.1 GB/s): GFLOPS over operational intensity in FLOPS/byte; peak FP performance and the SMXV kernel limit of 20.35 GFLOPS are marked]
Sparse Matrix-Vector Multiplication (4/4): Reached Absolute Performance
• Best results are close to the theoretical maximum predicted by the Roofline model (BCS: 37.2 of 39.4 GFLOPS; coprocessor: 18.7 of 20.4 GFLOPS)
• BCS: performance with 128 threads is only slightly better than with 64 (also reflected in the STREAM measurements)
• All results on the Xeon Phi were reached without special tuning (unmodified code)
• MIC: hyperthreads really help to reach this performance
[Plot: GFLOPS (0-40) over 1-256 threads for OpenMP and MKL on the coprocessor (balanced) and on the BCS (scatter, compact)]
Conjugate Gradient Method (Scalability): Speedup for 1000 Iterations
• OpenMP, BCS: speedup of 65 with compact thread placement
• OpenMP, Xeon Phi: speedup of 53 without SMT / 74 with all hardware threads (MKL: 79)
• MKL: absolute performance lower on the BCS / similar on the Intel Xeon Phi
[Plot: speedup (0-90) over 1-256 threads for OpenMP and MKL on the coprocessor (balanced) and on the BCS (scatter, compact)]
SIMD Vectorization for SMXV: Results (OpenMP implementation)
• Only small differences between the vectorized and the non-vectorized (-no-vec) version
• For a small number of threads: vectorization speedup of 2
• With the Roofline model: using the complete bandwidth is also possible with non-vectorizable code
• (Compiler-flag sketch below)
[Plot, logarithmic scale: GFLOPS (0.1-100) over 1-240 threads for KNC, -no-vec and KNC, vectorized]
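A hedged sketch of how the two series in the plot can be produced with the Intel compiler: the same CRS inner loop is built once with auto-vectorization enabled and once with -no-vec; an ivdep hint may help the vectorizer with the indirect access. Function name and signature are illustrative.

/* SMXV inner loop with an Intel-compiler vectorization hint.
   Build the two variants shown in the plot e.g. as:
     icc -O3 -mmic         smxv.c -o smxv_vec      (auto-vectorized)
     icc -O3 -mmic -no-vec smxv.c -o smxv_novec    (vectorization disabled) */
double row_sum(int begin, int end, const int *col_idx,
               const double *val, const double *x) {
    double sum = 0.0;
    #pragma ivdep            /* assert no loop-carried dependence despite x[col_idx[j]] */
    for (int j = begin; j < end; ++j)
        sum += val[j] * x[col_idx[j]];
    return sum;
}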
Conclusion & Future Work
OpenMP Programming on Intel Xeon Phi
• Easy cross-compiling / porting possible
• Smaller synchronization overhead on MIC than on big SMP machines
Bandwidth
• Impressive bandwidth on the Intel Xeon Phi (up to 156 GB/s)
• Higher than on 8 Nehalem-EX sockets
Productivity / Performance
• Good performance results with cross-compiling
• No code modifications or special tuning necessary
Future Work
• Performance investigations for distributed memory paradigms like MPI on one or multiple coprocessors
The End
Thank you for your attention.