Performance Modeling for Dense Linear Algebra - RWTH-Aachen

Performance Modeling for Dense Linear Algebra

Elmar Peise, Paolo Bientinesi

Aachen Institute for Advanced Study in Computational Engineering Science, RWTH Aachen University

November 12th 2012, PMBS 2012

Elmar Peise (AICES, RWTH)

[email protected]

11/12/2012

1 / 25

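Blocked algorithms

Inversion of a triangular matrix: $L \leftarrow L^{-1}$, $L \in \mathbb{R}^{n \times n}$

Partitioning of $L$:

  L00
  L10  L11
  L20  L21  L22

Variant 1:
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 2:
  $L_{21} \leftarrow L_{22}^{-1} L_{21}$
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 3:
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{20} \leftarrow L_{20} + L_{21} L_{10}$
  $L_{10} \leftarrow L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 4:
  $L_{21} \leftarrow -L_{22}^{-1} L_{21}$
  $L_{20} \leftarrow L_{20} - L_{21} L_{10}$
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{11} \leftarrow L_{11}^{-1}$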
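The four variants can be tried out with a small NumPy sketch. This is an illustration under assumptions, not the authors' implementation: the function name `trinv`, the use of `np.linalg.inv` as a stand-in for the unblocked inversion of the diagonal block, and the use of `np.linalg.solve` in place of a triangular solve are all choices made here. The loop sweeps the diagonal from top left to bottom right with block-size `b`.

```python
import numpy as np


def trinv(L, variant, b):
    """Return the inverse of lower-triangular L using blocked variant 1-4."""
    A = np.tril(np.asarray(L, dtype=float))  # working copy, overwritten in place
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Views into A for the current 3x3 partitioning.
        L00, L10, L11 = A[:k, :k], A[k:e, :k], A[k:e, k:e]
        L20, L21, L22 = A[e:, :k], A[e:, k:e], A[e:, e:]
        # Stand-in for the unblocked inversion of the diagonal block.
        inv11 = np.linalg.inv(L11)
        if variant == 1:
            L10[:] = L10 @ L00
            L10[:] = -inv11 @ L10
        elif variant == 2:
            if e < n:  # skip the solve once L22 is empty
                L21[:] = np.linalg.solve(L22, L21)
            L21[:] = -L21 @ inv11
        elif variant == 3:
            L21[:] = -L21 @ inv11
            L20[:] = L20 + L21 @ L10
            L10[:] = inv11 @ L10
        elif variant == 4:
            if e < n:  # skip the solve once L22 is empty
                L21[:] = -np.linalg.solve(L22, L21)
            L20[:] = L20 - L21 @ L10
            L10[:] = L10 @ L00
        else:
            raise ValueError("variant must be 1, 2, 3, or 4")
        L11[:] = inv11
    return A
```

All four variants compute the same result; they differ only in which blocks are read and written per step, which is what makes their performance differ.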

Execution time

Inversion of a triangular matrix: $L \leftarrow L^{-1}$, $L \in \mathbb{R}^{n \times n}$

[Figure: execution time in seconds (0 to 0.8) against matrix size n (0 to 2,048) for variants 1 to 4.]

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS

Efficiency

Inversion of a triangular matrix: $L \leftarrow L^{-1}$, $L \in \mathbb{R}^{n \times n}$

[Figure: efficiency in percent (0 to 100) against matrix size n (0 to 2,048) for variants 1 to 4, with the DGEMM peak as reference.]

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS

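Blocked algorithms

Inversion of a triangular matrix: $L \leftarrow L^{-1}$, $L \in \mathbb{R}^{n \times n}$

Rank labels on the slide: 1st, 2nd, 3rd, 4th.

Variant 1:
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$
  Operation count: $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$

Variant 2:
  $L_{21} \leftarrow L_{22}^{-1} L_{21}$
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{11} \leftarrow L_{11}^{-1}$
  Operation count: $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$

Variant 3:
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{20} \leftarrow L_{20} + L_{21} L_{10}$
  $L_{10} \leftarrow L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$
  Operation count: $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$

Variant 4:
  $L_{21} \leftarrow -L_{22}^{-1} L_{21}$
  $L_{20} \leftarrow L_{20} - L_{21} L_{10}$
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{11} \leftarrow L_{11}^{-1}$
  Operation count: $\frac{n^3}{2} - \frac{n^2}{2} + n$

Can we rank the algorithms based on their update statements?

No!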
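The operation counts can be checked by tallying the work of each update for block-size 1. The counting convention below is an assumption made for this sketch: one multiply-add counts as one operation, a product or solve with an s-by-s triangle and a vector costs s(s+1)/2, scaling a vector of length s costs s, and inverting the 1-by-1 diagonal block costs 1. The function name `count_ops` is hypothetical.

```python
def count_ops(variant, n):
    """Count operations of one trinv variant for block-size 1."""

    def tri(s):
        # Product or solve with an s-by-s triangle applied to one vector.
        return s * (s + 1) // 2

    total = 0
    for k in range(n):
        m = n - k - 1  # number of rows below the current diagonal entry
        if variant == 1:
            total += tri(k) + k          # L10 <- L10 L00, then L10 <- -L10 / L11
        elif variant == 2:
            total += tri(m) + m          # L21 <- L22^{-1} L21, then L21 <- -L21 / L11
        elif variant == 3:
            total += m + m * k + k       # scale L21, rank-1 update of L20, scale L10
        elif variant == 4:
            total += tri(m) + m * k + tri(k)  # solve, rank-1 update, triangular product
        else:
            raise ValueError("variant must be 1, 2, 3, or 4")
        total += 1                       # invert the 1x1 diagonal block
    return total


def closed_form(variant, n):
    """Closed forms: n^3/6 + n^2/2 + n/3 for variants 1-3, n^3/2 - n^2/2 + n for 4."""
    if variant in (1, 2, 3):
        return n * (n + 1) * (n + 2) // 6
    return n * n * (n - 1) // 2 + n
```

Under this convention variants 1 to 3 perform exactly the same number of operations, so the count alone cannot separate them.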

Outline

Measuring and Understanding BLAS Performance

Generation of Performance Models

Ranking Blocked Algorithms


Measuring and Understanding BLAS Performance

Routines and arguments

The update statements of variants 1 to 4 map to BLAS routine calls. A triangular solve $B \leftarrow A^{-1} B$, for example, corresponds to:

dtrsm(side, uplo, transA, diag, m, n, alpha, A, ldA, B, ldB)

Argument kinds:
  flag: side, uplo, transA, diag
  size: m, n
  scalar: alpha
  data: matrix (A, B), leading dimension (ldA, ldB)
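To make the role of the flag arguments concrete, here is a NumPy sketch of what a dtrsm call computes. It follows the standard BLAS definition (solve op(A) X = alpha B for side L, or X op(A) = alpha B for side R), but the function `dtrsm_ref` itself is hypothetical: it returns X instead of overwriting B, and the size and leading-dimension arguments are implied by the array shapes.

```python
import numpy as np


def dtrsm_ref(side, uplo, transA, diag, alpha, A, B):
    """Reference semantics of dtrsm, returning X instead of overwriting B."""
    # uplo selects which triangle of A is read; the other one is ignored.
    T = np.tril(A) if uplo == "L" else np.triu(A)
    if diag == "U":
        # A unit diagonal is implied; the stored diagonal is not read.
        T = T - np.diag(np.diag(T)) + np.eye(T.shape[0])
    if transA == "T":
        T = T.T
    if side == "L":
        # Solve op(A) X = alpha B.
        return np.linalg.solve(T, alpha * B)
    # Solve X op(A) = alpha B, i.e. op(A)^T X^T = alpha B^T.
    return np.linalg.solve(T.T, alpha * B.T).T
```

With two values per flag there are 16 flag combinations, and each may follow a different code path inside a BLAS library.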

Size arguments

dtrsm(L, L, N, N, n, n, 0.5, A, n, B, n)

[Figure: time in cycles (0 to 5·10^8) against n (0 to 1,024) for GotoBLAS2, MKL, and ATLAS.]

[Figure: error in cycles (scale 10^7) against n (0 to 1,024) for GotoBLAS2, MKL, and ATLAS.]

AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core

Flag argument

dtrsm(side, uplo, transA, diag, n, n, 0.5, A, n, B, n)

[Figure: cycles (0 to 6·10^6) for GotoBLAS2, MKL, and ATLAS, for each of the 16 flag combinations of side (L, R), uplo (L, U), transA (N, T), and diag (N, U).]

AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core

Generation of Performance Models

Model structure

input: (dtrsm, R, L, T, N, 256, 512, 0.7, A, 512, B, 512)

model for dtrsm:
  parameter extraction: (R, L, T), (256, 512)
  discrete cases: LLN, LLT, LUN, LUT, RLN, RLT, RUN, RUT
  piecewise polynomials: (256, 512)
  statistical quantities: min, avg, std, med
  polynomial: (256, 512)

output: min = ..., avg = ..., std = ..., med = ...
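The lookup described above can be sketched as a small data structure. Everything named here is an assumption for illustration: the class `RoutineModel`, the rectangular regions given by lower and upper corners, and the degree-2 monomial basis are not taken from the slides, which only fix the overall structure (discrete case, piecewise polynomial, one polynomial per statistical quantity).

```python
import numpy as np

STATS = ("min", "avg", "std", "med")


def monomials(m, n):
    """Bivariate monomial basis up to total degree 2 (an assumed choice)."""
    return np.array([1.0, m, n, m * m, m * n, n * n])


class RoutineModel:
    """Piecewise-polynomial model of one routine, keyed by flag combination."""

    def __init__(self):
        # case string such as "RLT" -> list of (lo, hi, {stat: coefficients})
        self.cases = {}

    def add_region(self, case, lo, hi, coeffs):
        self.cases.setdefault(case, []).append((lo, hi, coeffs))

    def evaluate(self, call):
        # Parameter extraction: three flags select the case, two sizes the point.
        case = "".join(call[1:4])
        m, n = call[5], call[6]
        for lo, hi, coeffs in self.cases[case]:
            # Select the polynomial piece whose region contains (m, n).
            if lo[0] <= m <= hi[0] and lo[1] <= n <= hi[1]:
                basis = monomials(m, n)
                return {s: float(coeffs[s] @ basis) for s in STATS}
        raise KeyError("no region covers these sizes")
```

A call such as `model.evaluate(("dtrsm", "R", "L", "T", "N", 256, 512, 0.7, "A", 512, "B", 512))` then returns one estimate per statistical quantity.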

Adaptive Refinement

dtrsm(L, L, N, N, m, n, .5, L, 2500, B, 2500)

[Figure: map over m (0 to 1,024) and n (0 to 1,024) with a color scale from 0% to 20%.]

AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core, GotoBLAS2

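The slides do not spell out the refinement procedure, so the following is only a generic one-dimensional sketch of the idea: fit a polynomial to samples of a region, and split the region in half whenever the relative fit error exceeds a threshold. The names `refine`, `evaluate`, and `synthetic_cycles`, the thresholds, and the synthetic timing function with a jump at n = 300 are all assumptions made for illustration.

```python
import numpy as np


def synthetic_cycles(n):
    """Made-up timing curve with a sudden slowdown at n = 300."""
    return (2.0 if n < 300 else 2.6) * n * n + 50.0


def refine(f, lo, hi, degree=2, tol=0.01, min_width=8.0, samples=16):
    """Return a list of (lo, hi, coefficients) pieces covering [lo, hi]."""
    x = np.linspace(lo, hi, samples)
    y = np.array([f(v) for v in x])
    t = (x - lo) / (hi - lo)  # normalized coordinate keeps the fit well scaled
    coeffs = np.polyfit(t, y, degree)
    err = np.max(np.abs(np.polyval(coeffs, t) - y) / np.abs(y))
    if err <= tol or hi - lo <= min_width:
        return [(lo, hi, coeffs)]
    mid = (lo + hi) / 2
    kwargs = dict(degree=degree, tol=tol, min_width=min_width, samples=samples)
    return refine(f, lo, mid, **kwargs) + refine(f, mid, hi, **kwargs)


def evaluate(pieces, x):
    """Evaluate the piecewise model at x."""
    for lo, hi, coeffs in pieces:
        if lo <= x <= hi:
            return float(np.polyval(coeffs, (x - lo) / (hi - lo)))
    raise ValueError("x outside the modeled range")
```

Smooth parts of the curve end up covered by a few large pieces, while the samples concentrate around the jump, which is the behavior one would expect from a refinement scheme of this kind.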

Comparison

dtrsm(L, L, N, N, m, n, .5, L, 2500, B, 2500)

[Figure: average error in percent (0 to 8) against number of samples (0 to 2.4·10^5) for Adaptive Refinement and Model Expansion.]

AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core, GotoBLAS2

Ranking Blocked Algorithms

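Ranking

Blocked algorithm revisited: $L \leftarrow L^{-1}$

Partitioning of $L$:

  L00
  L10  L11
  L20  L21  L22

Variant 1:
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 2:
  $L_{21} \leftarrow L_{22}^{-1} L_{21}$
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 3:
  $L_{21} \leftarrow -L_{21} L_{11}^{-1}$
  $L_{20} \leftarrow L_{20} + L_{21} L_{10}$
  $L_{10} \leftarrow L_{11}^{-1} L_{10}$
  $L_{11} \leftarrow L_{11}^{-1}$

Variant 4:
  $L_{21} \leftarrow -L_{22}^{-1} L_{21}$
  $L_{20} \leftarrow L_{20} - L_{21} L_{10}$
  $L_{10} \leftarrow L_{10} L_{00}$
  $L_{11} \leftarrow L_{11}^{-1}$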

Prediction

Input: trinv1(N, 300, A, 300, 100)

1. Compute routine invocations

  update                                     routine invocation
  $L_{10} \leftarrow L_{10} L_{00}$          (dtrmm, R, L, N, N, 100, 0, 1, ·, 300, ·, 300)
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$    (dtrsm, L, L, N, N, 100, 0, -1, ·, 300, ·, 300)
  $L_{11} \leftarrow L_{11}^{-1}$            (trinv1, N, 100, ·, 300, 1)
  $L_{10} \leftarrow L_{10} L_{00}$          (dtrmm, R, L, N, N, 100, 100, 1, ·, 300, ·, 300)
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$    (dtrsm, L, L, N, N, 100, 100, -1, ·, 300, ·, 300)
  $L_{11} \leftarrow L_{11}^{-1}$            (trinv1, N, 100, ·, 300, 1)
  $L_{10} \leftarrow L_{10} L_{00}$          (dtrmm, R, L, N, N, 100, 200, 1, ·, 300, ·, 300)
  $L_{10} \leftarrow -L_{11}^{-1} L_{10}$    (dtrsm, L, L, N, N, 100, 200, -1, ·, 300, ·, 300)
  $L_{11} \leftarrow L_{11}^{-1}$            (trinv1, N, 100, ·, 300, 1)

2. Evaluate performance models

3. Accumulate
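The three prediction steps can be sketched in a few lines. The function names `trinv1_invocations` and `predict` are hypothetical, matrix arguments are represented by `None`, and the scaling factor -1 for dtrsm is an assumption that matches a negated triangular solve. The `model` passed to `predict` stands for any callable that maps an invocation to an estimate; how such a model is built is outside this sketch.

```python
def trinv1_invocations(n, ld, b):
    """List the routine invocations performed by trinv1(N, n, A, ld, b)."""
    calls = []
    for k in range(0, n, b):
        s = min(b, n - k)  # size of the current diagonal block
        # L10 <- L10 L00: triangular product from the right, L10 is s x k
        calls.append(("dtrmm", "R", "L", "N", "N", s, k, 1, None, ld, None, ld))
        # L10 <- -L11^{-1} L10: triangular solve from the left
        calls.append(("dtrsm", "L", "L", "N", "N", s, k, -1, None, ld, None, ld))
        # L11 <- L11^{-1}: inversion of the diagonal block with block-size 1
        calls.append(("trinv1", "N", s, None, ld, 1))
    return calls


def predict(calls, model):
    """Evaluate the model for every invocation and accumulate the estimates."""
    return sum(model(call) for call in calls)
```

For `trinv1_invocations(300, 300, 100)` this yields nine invocations, three per traversal step, with the second size argument of dtrmm and dtrsm growing from 0 to 200.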

Results for trinv

trinv_i(N, n, A, n, 96)

[Figure: efficiency (0 to 1) against n (0 to 1,024) for variants 1 to 4, showing measurements and predictions.]

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS

Results for trinv: probabilistic prediction

trinv_i(N, n, A, n, 96)

[Figure: efficiency (0.5 to 0.8) against n (512 to 1,024) for variants 1 to 4, showing measurements and predictions (median, with the range from min to max).]

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS

Block-size

[Figure: the partitioning of L into L00, L10, L11, L20, L21, L22, drawn twice with the block-size marked on each drawing.]

Tuning the block-size

trinv_i(N, 1000, A, 1000, blocksize)

[Figure: efficiency (0 to 1) against blocksize (0 to 256) for variants 1 to 4, showing measurements and predictions.]

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
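Once predictions are cheap, tuning reduces to a search over candidate configurations. The sketch below assumes a predictor with the signature `predict_time(variant, n, b)`; the `toy_predict` function is a made-up cost curve used only to make the example runnable and does not reflect any measurement from the slides.

```python
def toy_predict(variant, n, b):
    """Made-up cost curve: small blocks run inefficiently, large blocks
    spend too long in the unblocked part. Not based on real data."""
    blocked = n ** 3 * (b + 32) / (6 * b)
    unblocked = 4 * n * b * b
    return (1 + 0.05 * variant) * (blocked + unblocked)


def tune(predict_time, n, variants, blocksizes):
    """Return the (variant, blocksize) pair with the smallest predicted time."""
    return min(
        ((v, b) for v in variants for b in blocksizes),
        key=lambda vb: predict_time(vb[0], n, vb[1]),
    )


best_variant, best_blocksize = tune(
    toy_predict, 1000, variants=(1, 2, 3, 4), blocksizes=range(8, 257, 8)
)
```

The search never executes the algorithm itself, so its cost depends only on how fast the predictor can be evaluated.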

Sandy Bridge

trinv_i(N, n, A, n, 96)

[Figure: efficiency (0.4 to 0.6) against n (512 to 1,024) for variants 1 to 4, showing measurements and predictions (median, with the range from min to max).]

Intel Sandy Bridge-EP E5-2670 @ 2.6 GHz, 1 core, OpenBLAS

Parallel trinv

[Figure: efficiency (0 to 0.3) against n (0 to 1,024) for variants 1 to 4, showing measurements and predictions (median, with the range from min to max).]

Intel Sandy Bridge-EP E5-2670 @ 2.6 GHz, 8 cores, OpenBLAS

Sylvester Equation

$L X + X U = C$, $X \in \mathbb{R}^{1000 \times 1000}$

Variant 1 of 16:
  $X_{01} \leftarrow C_{01} - X_{00} U_{01}$
  $X_{10} \leftarrow C_{10} - L_{10} X_{00}$
  $X_{01} \leftarrow \Omega(L_{00}, U_{11}, X_{01})$
  $X_{10} \leftarrow \Omega(L_{11}, U_{00}, X_{10})$
  $X_{11} \leftarrow C_{11} - X_{10} U_{01}$
  $X_{11} \leftarrow X_{11} - L_{10} X_{01}$
  $X_{11} \leftarrow \Omega(L_{11}, U_{11}, X_{11})$

Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS

  var.   predicted   measured
   1      27.03%      24.04%
   2      22.52%      21.07%
   5      15.51%      18.82%
   6      13.72%      18.51%
  16       1.79%       2.21%
   3       1.52%       1.52%
   4       1.50%       1.45%
   8       1.49%       1.37%
  10       1.43%       1.53%
  15       1.43%       1.52%
   9       1.40%       1.48%
  14       1.34%       1.33%
  12       1.29%       1.43%
   7       1.06%       1.16%
  11       1.04%       1.07%
  13       1.01%       1.01%
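The update statements of variant 1 can be checked on a 2-by-2 block partitioning. In this sketch `omega` plays the role of $\Omega(L, U, C)$, the solution of $L X + X U = C$; it is implemented here with a dense Kronecker system, which is an assumption chosen for brevity and is only practical for small matrices. The helper name `sylvester_variant1_step` and the initial solve for $X_{00}$ are likewise assumptions of this illustration.

```python
import numpy as np


def omega(L, U, C):
    """Solve L X + X U = C through the equivalent Kronecker system."""
    m, n = C.shape
    # vec(L X) = (I kron L) vec(X), vec(X U) = (U^T kron I) vec(X), column-major vec
    K = np.kron(np.eye(n), L) + np.kron(U.T, np.eye(m))
    x = np.linalg.solve(K, C.reshape(-1, order="F"))
    return x.reshape((m, n), order="F")


def sylvester_variant1_step(L, U, C, s):
    """Apply the variant 1 updates once, splitting all matrices at index s."""
    L00, L10, L11 = L[:s, :s], L[s:, :s], L[s:, s:]
    U00, U01, U11 = U[:s, :s], U[:s, s:], U[s:, s:]
    X = np.array(C, dtype=float)              # X starts out holding C
    X[:s, :s] = omega(L00, U00, X[:s, :s])    # assumed: X00 is already solved
    X[:s, s:] = X[:s, s:] - X[:s, :s] @ U01   # X01 <- C01 - X00 U01
    X[s:, :s] = X[s:, :s] - L10 @ X[:s, :s]   # X10 <- C10 - L10 X00
    X[:s, s:] = omega(L00, U11, X[:s, s:])    # X01 <- Omega(L00, U11, X01)
    X[s:, :s] = omega(L11, U00, X[s:, :s])    # X10 <- Omega(L11, U00, X10)
    X[s:, s:] = X[s:, s:] - X[s:, :s] @ U01   # X11 <- C11 - X10 U01
    X[s:, s:] = X[s:, s:] - L10 @ X[:s, s:]   # X11 <- X11 - L10 X01
    X[s:, s:] = omega(L11, U11, X[s:, s:])    # X11 <- Omega(L11, U11, X11)
    return X
```

The result satisfies $L X + X U = C$ whenever $L$ and $-U$ share no eigenvalue, for instance when both triangular matrices have positive diagonals.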

Performance Modeling for Dense Linear Algebra

Thank You

Funding from DFG is gratefully acknowledged
