Performance Modeling for Dense Linear Algebra
Elmar Peise
Paolo Bientinesi
Aachen Institute for Advanced Study in Computational Engineering Science RWTH Aachen University
November 12th 2012 — PMBS 2012
Elmar Peise (AICES, RWTH)
[email protected]
11/12/2012
1 / 25
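Blocked algorithms
Inversion of a triangular matrix: $L \leftarrow L^{-1} \in \mathbb{R}^{n \times n}$
Partitioning: $L = \begin{pmatrix} L_{00} & & \\ L_{10} & L_{11} & \\ L_{20} & L_{21} & L_{22} \end{pmatrix}$
Variant 1: $L_{10} \leftarrow L_{10} L_{00}$; $L_{10} \leftarrow -L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 2: $L_{21} \leftarrow L_{22}^{-1} L_{21}$; $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 3: $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{20} \leftarrow L_{20} + L_{21} L_{10}$; $L_{10} \leftarrow L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 4: $L_{21} \leftarrow -L_{22}^{-1} L_{21}$; $L_{20} \leftarrow L_{20} - L_{21} L_{10}$; $L_{10} \leftarrow L_{10} L_{00}$; $L_{11} \leftarrow L_{11}^{-1}$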
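To make the update statements concrete, here is a minimal pure-Python sketch of variant 1 on a list-of-lists matrix. It is an illustration under assumptions, not the authors' implementation: the function names `trinv_variant1` and `matmul` are made up here, exact `Fraction` arithmetic is used only so the check is exact, and the diagonal block is inverted by calling the same routine with block size 1.

```python
from fractions import Fraction

def trinv_variant1(L, b, lo=0, hi=None):
    """Invert the lower-triangular block L[lo:hi][lo:hi] in place (variant 1)."""
    hi = len(L) if hi is None else hi
    for k in range(lo, hi, b):
        e = min(k + b, hi)  # the diagonal block L11 spans rows and columns k..e-1
        # L10 <- L10 * L00, where L00 already holds its own inverse
        for i in range(k, e):
            L[i][lo:k] = [sum(L[i][p] * L[p][j] for p in range(j, k))
                          for j in range(lo, k)]
        # L10 <- -L11^{-1} * L10, by forward substitution with the original L11
        for i in range(k, e):
            for j in range(lo, k):
                s = -L[i][j] - sum(L[i][p] * L[p][j] for p in range(k, i))
                L[i][j] = s / L[i][i]
        # L11 <- L11^{-1}: scalar reciprocal, or the same algorithm with block size 1
        if e - k == 1:
            L[k][k] = 1 / L[k][k]
        else:
            trinv_variant1(L, 1, k, e)
    return L

def matmul(A, B):
    """Plain dense product of two square matrices, used only for checking."""
    n = len(A)
    return [[sum(A[i][p] * B[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]

original = [[Fraction(v) for v in row] for row in [[2, 0, 0], [1, 4, 0], [3, 5, 8]]]
inverse = trinv_variant1([row[:] for row in original], b=2)
identity = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
```

Multiplying `original` by `inverse` gives the identity, which is a quick way to check any of the four variants against each other.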
Execution time
Inversion of a triangular matrix: $L \leftarrow L^{-1} \in \mathbb{R}^{n \times n}$
[Figure: execution time in seconds (axis 0 to 0.8) versus matrix size n (0 to 2,048) for variants 1 to 4.]
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
Efficiency
Inversion of a triangular matrix: $L \leftarrow L^{-1} \in \mathbb{R}^{n \times n}$
[Figure: efficiency (axis 0 to 100) versus matrix size n (0 to 2,048) for variants 1 to 4, with DGEMM peak as reference.]
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
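Blocked algorithms
Inversion of a triangular matrix: $L \leftarrow L^{-1} \in \mathbb{R}^{n \times n}$
Variant 1, cost $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$: $L_{10} \leftarrow L_{10} L_{00}$; $L_{10} \leftarrow -L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 2, cost $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$: $L_{21} \leftarrow L_{22}^{-1} L_{21}$; $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 3, cost $\frac{n^3}{6} + \frac{n^2}{2} + \frac{n}{3}$: $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{20} \leftarrow L_{20} + L_{21} L_{10}$; $L_{10} \leftarrow L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 4, cost $\frac{n^3}{2} - \frac{n^2}{2} + n$: $L_{21} \leftarrow -L_{22}^{-1} L_{21}$; $L_{20} \leftarrow L_{20} - L_{21} L_{10}$; $L_{10} \leftarrow L_{10} L_{00}$; $L_{11} \leftarrow L_{11}^{-1}$
[Annotation: each variant carries a rank label, 1st to 4th.]
Can we rank the algorithms based on their update statements?
No!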
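The cost expressions can be checked by counting operations directly. The sketch below assumes block size 1 and counts one operation per scalar multiplication or division; that convention is an assumption made here because it reproduces the closed forms on the slide. The function name `multiplication_counts` is hypothetical.

```python
def multiplication_counts(n):
    """Multiplications (including divisions) of the four variants with block size 1."""
    v1 = v2 = v3 = v4 = 0
    for k in range(n):
        m = n - k - 1              # number of rows below the current diagonal element
        tri_k = k * (k + 1) // 2   # triangular product or solve of order k
        tri_m = m * (m + 1) // 2   # triangular product or solve of order m
        v1 += tri_k + k + 1        # L10*L00, scale L10, invert L11
        v2 += tri_m + m + 1        # L22^{-1}*L21, scale L21, invert L11
        v3 += m + m * k + k + 1    # scale L21, L20 + L21*L10, scale L10, invert L11
        v4 += tri_m + m * k + tri_k + 1  # L22^{-1}*L21, L20 - L21*L10, L10*L00, invert L11
    return v1, v2, v3, v4

counts_10 = multiplication_counts(10)
```

Under this convention variants 1 to 3 have identical counts, so an operation count alone cannot separate them, and variant 4 does roughly three times the work.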
Outline
Measuring and Understanding BLAS Performance
Generation of Performance Models
Ranking Blocked Algorithms
Measuring and Understanding BLAS Performance
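Routines and arguments
Variant 1: $L_{10} \leftarrow L_{10} L_{00}$; $L_{10} \leftarrow -L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 2: $L_{21} \leftarrow L_{22}^{-1} L_{21}$; $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 3: $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{20} \leftarrow L_{20} + L_{21} L_{10}$; $L_{10} \leftarrow L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 4: $L_{21} \leftarrow -L_{22}^{-1} L_{21}$; $L_{20} \leftarrow L_{20} - L_{21} L_{10}$; $L_{10} \leftarrow L_{10} L_{00}$; $L_{11} \leftarrow L_{11}^{-1}$
$B \leftarrow \alpha A^{-1} B$
dtrsm(side, uplo, transA, diag, m, n, alpha, A, ldA, B, ldB)
Argument kinds: flag (side, uplo, transA, diag), size (m, n), scalar (alpha), data (matrix: A, B; leading dimension: ldA, ldB)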
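For readers unfamiliar with dtrsm, the following pure-Python sketch shows what the operation computes for one flag combination (side = L, uplo = L, transA = N, diag = N). It is a reference illustration only: the name `trsm_left_lower` is made up, the result is returned as a new list rather than overwriting B, and the leading-dimension arguments have no counterpart because nested lists carry their own shape.

```python
def trsm_left_lower(alpha, A, B):
    """Return alpha * A^{-1} * B for lower-triangular, non-unit A (flags L, L, N, N)."""
    m, n = len(B), len(B[0])
    X = [[alpha * B[i][j] for j in range(n)] for i in range(m)]
    for j in range(n):          # each column of B is an independent triangular solve
        for i in range(m):      # forward substitution down the column
            s = X[i][j] - sum(A[i][p] * X[p][j] for p in range(i))
            X[i][j] = s / A[i][i]
    return X

X = trsm_left_lower(1.0, [[2.0, 0.0], [1.0, 4.0]], [[2.0], [9.0]])
```

Here m and n are the size arguments (B is m by n and A is m by m), and alpha is the scalar argument.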
Size arguments
dtrsm(L, L, N, N, n, n, 0.5, A, n, B, n)
[Figure: time in cycles (axis 0 to 5, scale $10^8$) versus n (0 to 1,024) for GotoBLAS2, MKL, and ATLAS.]
AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core
Size arguments
dtrsm(L, L, N, N, n, n, 0.5, A, n, B, n)
[Figure: error in cycles (scale $10^7$) versus n (0 to 1,024) for GotoBLAS2, MKL, and ATLAS.]
AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core
Flag argument
dtrsm(side, uplo, transA, diag, n, n, 0.5, A, n, B, n)
[Figure: cycles (axis 0 to 6, scale $10^6$) for GotoBLAS2, MKL, and ATLAS across all 16 combinations of side (L, R), uplo (L, U), transA (N, T), and diag (N, U).]
AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core
Generation of Performance Models
Model structure
input: (dtrsm, R, L, T, N, 256, 512, 0.7, A, 512, B, 512)
model for dtrsm:
parameter extraction: flags (R, L, T) and sizes (256, 512)
discrete cases: LLN, LLT, LUN, LUT, RLN, RLT, RUN, RUT; the flags (R, L, T) select one case
piecewise polynomials: the sizes (256, 512) select one piece
statistical quantities: min, avg, std, med
polynomial: evaluated at (256, 512)
output: min = . . ., avg = . . ., std = . . ., med = . . .
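The lookup described on this slide can be sketched as a dictionary keyed by the flag combination, whose entries are size regions with one polynomial per statistic. Everything concrete below is an assumption for illustration: the names `MODEL_DTRSM` and `estimate`, the single region, the bilinear polynomial form, and all coefficients are made up and do not come from the slides.

```python
# Hypothetical model: flags -> list of (m range, n range, coefficients per statistic).
# A coefficient tuple (c0, c1, c2, c3) stands for c0 + c1*m + c2*n + c3*m*n.
MODEL_DTRSM = {
    ("R", "L", "T"): [
        ((0, 512), (0, 1024), {"min": (100, 1, 1, 2), "avg": (120, 1, 1, 2),
                               "std": (5, 0, 0, 0), "med": (110, 1, 1, 2)}),
    ],
}

def estimate(model, call):
    """Split a call tuple into flags and sizes, then evaluate the matching polynomials."""
    _name, side, uplo, trans, _diag, m, n = call[:7]
    # The flags pick the discrete case; the sizes pick the piece within that case.
    for (m_lo, m_hi), (n_lo, n_hi), polys in model[(side, uplo, trans)]:
        if m_lo <= m <= m_hi and n_lo <= n <= n_hi:
            return {stat: c0 + c1 * m + c2 * n + c3 * m * n
                    for stat, (c0, c1, c2, c3) in polys.items()}
    raise ValueError("sizes outside the modeled domain")

result = estimate(MODEL_DTRSM,
                  ("dtrsm", "R", "L", "T", "N", 256, 512, 0.7, "A", 512, "B", 512))
```

The scalar, matrix, and leading-dimension arguments are ignored in this sketch, matching the slide's parameter extraction, which keeps only (R, L, T) and (256, 512).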
Adaptive Refinement
dtrsm(L, L, N, N, m, n, .5, L, 2500, B, 2500)
[Figure: heat map over the (m, n) plane, m and n from 0 to 1,024, with a color scale from 0% to 20%.]
AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core, GotoBLAS2
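The slides do not spell out the refinement procedure, so the following is only a simplified sketch of the general idea of adaptive refinement, not the authors' method: it works in one dimension instead of the (m, n) plane, fits a straight line through the interval endpoints, checks the error at the midpoint only, and halves the interval when the relative error exceeds a tolerance. The names `refine` and `toy_cost`, the tolerance, and the cost function are all hypothetical.

```python
def refine(cost, lo, hi, tol=0.05, min_width=8):
    """Return a list of (lo, hi) intervals on which a straight line fits `cost`."""
    mid = (lo + hi) // 2
    # Linear model through the two endpoints, evaluated at the midpoint.
    predicted = cost(lo) + (cost(hi) - cost(lo)) * (mid - lo) / (hi - lo)
    actual = cost(mid)
    rel_err = abs(predicted - actual) / abs(actual)
    if rel_err <= tol or hi - lo <= min_width:
        return [(lo, hi)]   # accurate enough, or too small to split further
    return refine(cost, lo, mid, tol, min_width) + refine(cost, mid, hi, tol, min_width)

def toy_cost(n):
    """Hypothetical stand-in for a measured timing: cubic growth plus a constant."""
    return n ** 3 + 1000

regions = refine(toy_cost, 0, 1024)
```

In a real setting `cost` would be a timing measurement, so every split costs additional samples; the refinement trades sample count against model error.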
Comparison
dtrsm(L, L, N, N, m, n, .5, L, 2500, B, 2500)
[Figure: average error in % (axis 0 to 8) versus number of samples (0 to $2.4 \cdot 10^5$) for Adaptive Refinement and Model Expansion.]
AMD Opteron 8356 (Barcelona) @ 2.3 GHz, 1 core, GotoBLAS2
Ranking Blocked Algorithms
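Ranking
Blocked algorithm revisited: $L \leftarrow L^{-1}$
Partitioning: $L = \begin{pmatrix} L_{00} & & \\ L_{10} & L_{11} & \\ L_{20} & L_{21} & L_{22} \end{pmatrix}$
Variant 1: $L_{10} \leftarrow L_{10} L_{00}$; $L_{10} \leftarrow -L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 2: $L_{21} \leftarrow L_{22}^{-1} L_{21}$; $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 3: $L_{21} \leftarrow -L_{21} L_{11}^{-1}$; $L_{20} \leftarrow L_{20} + L_{21} L_{10}$; $L_{10} \leftarrow L_{11}^{-1} L_{10}$; $L_{11} \leftarrow L_{11}^{-1}$
Variant 4: $L_{21} \leftarrow -L_{22}^{-1} L_{21}$; $L_{20} \leftarrow L_{20} - L_{21} L_{10}$; $L_{10} \leftarrow L_{10} L_{00}$; $L_{11} \leftarrow L_{11}^{-1}$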
Prediction
Input: trinv1(N, 300, A, 300, 100)
Compute routine invocations:
update                                      routine invocation
$L_{10} \leftarrow L_{10} L_{00}$           (dtrmm, R, L, N, N, 100, 0, 1, ·, 300, ·, 300)
$L_{10} \leftarrow -L_{11}^{-1} L_{10}$     (dtrsm, L, L, N, N, 100, 0, -1, ·, 300, ·, 300)
$L_{11} \leftarrow L_{11}^{-1}$             (trinv1, N, 100, ·, 300, 1)
$L_{10} \leftarrow L_{10} L_{00}$           (dtrmm, R, L, N, N, 100, 100, 1, ·, 300, ·, 300)
$L_{10} \leftarrow -L_{11}^{-1} L_{10}$     (dtrsm, L, L, N, N, 100, 100, -1, ·, 300, ·, 300)
$L_{11} \leftarrow L_{11}^{-1}$             (trinv1, N, 100, ·, 300, 1)
$L_{10} \leftarrow L_{10} L_{00}$           (dtrmm, R, L, N, N, 100, 200, 1, ·, 300, ·, 300)
$L_{10} \leftarrow -L_{11}^{-1} L_{10}$     (dtrsm, L, L, N, N, 100, 200, -1, ·, 300, ·, 300)
$L_{11} \leftarrow L_{11}^{-1}$             (trinv1, N, 100, ·, 300, 1)
Evaluate performance models
Accumulate
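The three steps on this slide (list the invocations, evaluate a model for each, accumulate) can be sketched as follows. The invocation list mirrors the one shown for trinv1 with n = 300 and block size 100, reduced to routine name, flags, and sizes. The cost function `toy_estimate` is a made-up placeholder for a real performance model, and both function names are hypothetical.

```python
def trinv1_invocations(n, b):
    """List the calls made by variant 1 for an n x n matrix with block size b."""
    calls = []
    for k in range(0, n, b):
        size = min(b, n - k)  # order of L11; k is the order of L00
        calls.append(("dtrmm", "R", "L", "N", "N", size, k))  # L10 <- L10 * L00
        calls.append(("dtrsm", "L", "L", "N", "N", size, k))  # L10 <- -L11^{-1} * L10
        calls.append(("trinv1", "N", size))                   # L11 <- L11^{-1}
    return calls

def toy_estimate(call):
    """Hypothetical stand-in for a performance model: a made-up cost per call."""
    if call[0] == "trinv1":
        return call[2] ** 3 / 3
    m, n = call[5], call[6]
    return m * m * n if call[0] == "dtrsm" else m * n * n

calls = trinv1_invocations(300, 100)
predicted_total = sum(toy_estimate(c) for c in calls)  # accumulate the estimates
```

Swapping `toy_estimate` for a lookup into measured models turns the same loop into a prediction without running the algorithm itself.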
Results for trinv
trinv_i(N, n, A, n, 96)
[Figure: efficiency (axis 0 to 1) versus n (0 to 1,024) for variants 1 to 4, showing measurements and prediction.]
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
Results for trinv: probabilistic prediction
trinv_i(N, n, A, n, 96)
[Figure: efficiency (axis 0.5 to 0.8) versus n (512 to 1,024) for variants 1 to 4, showing measurements and the prediction as median with min to max range.]
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
Block-size
[Figure: the partitioning of L into blocks $L_{00}$, $L_{10}$, $L_{11}$, $L_{20}$, $L_{21}$, $L_{22}$, drawn twice, each with the block-size marked.]
Tuning the block-size
trinv_i(N, 1000, A, 1000, blocksize)
[Figure: efficiency (axis 0 to 1) versus blocksize (0 to 256) for variants 1 to 4, showing measurements and prediction.]
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
Sandy Bridge
trinv_i(N, n, A, n, 96)
[Figure: efficiency (axis 0.4 to 0.6) versus n (512 to 1,024) for variants 1 to 4, showing measurements and the prediction as median with min to max range.]
Intel Sandy Bridge-EP E5-2670 @ 2.6 GHz, 1 core, OpenBLAS
Parallel trinv
[Figure: efficiency (axis 0 to 0.3) versus n (0 to 1,024) for variants 1 to 4, showing measurements and the prediction as median with min to max range.]
Intel Sandy Bridge-EP E5-2670 @ 2.6 GHz, 8 cores, OpenBLAS
Sylvester Equation
$L X + X U = C$, $X \in \mathbb{R}^{1000 \times 1000}$
Variant 1 of 16:
$X_{01} \leftarrow C_{01} - X_{00} U_{01}$
$X_{10} \leftarrow C_{10} - L_{10} X_{00}$
$X_{01} \leftarrow \Omega(L_{00}, U_{11}, X_{01})$
$X_{10} \leftarrow \Omega(L_{11}, U_{00}, X_{10})$
$X_{11} \leftarrow C_{11} - X_{10} U_{01}$
$X_{11} \leftarrow X_{11} - L_{10} X_{01}$
$X_{11} \leftarrow \Omega(L_{11}, U_{11}, X_{11})$
Intel Harpertown E5450 @ 2.99 GHz, 1 core, Intel MKL BLAS
var.  predicted  measured
1     27.03%     24.04%
2     22.52%     21.07%
5     15.51%     18.82%
6     13.72%     18.51%
16    1.79%      2.21%
3     1.52%      1.52%
4     1.50%      1.45%
8     1.49%      1.37%
10    1.43%      1.53%
15    1.43%      1.52%
9     1.40%      1.48%
14    1.34%      1.33%
12    1.29%      1.43%
7     1.06%      1.16%
11    1.04%      1.07%
13    1.01%      1.01%
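One way to read the table is to compare the orderings that the predicted and measured columns induce. The short sketch below does this with the values copied from the table; the helper name `ranking` is made up for illustration.

```python
predicted = {1: 27.03, 2: 22.52, 5: 15.51, 6: 13.72, 16: 1.79, 3: 1.52, 4: 1.50, 8: 1.49,
             10: 1.43, 15: 1.43, 9: 1.40, 14: 1.34, 12: 1.29, 7: 1.06, 11: 1.04, 13: 1.01}
measured = {1: 24.04, 2: 21.07, 5: 18.82, 6: 18.51, 16: 2.21, 3: 1.52, 4: 1.45, 8: 1.37,
            10: 1.53, 15: 1.52, 9: 1.48, 14: 1.33, 12: 1.43, 7: 1.16, 11: 1.07, 13: 1.01}

def ranking(values):
    """Variant numbers ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)

top5_predicted = ranking(predicted)[:5]
top5_measured = ranking(measured)[:5]
```

Both orderings start with variants 1, 2, 5, 6, and 16, so the prediction identifies the same five leading variants in the same order; the remaining variants lie within a fraction of a percentage point of each other in both columns.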
Performance Modeling for Dense Linear Algebra
Thank You
Funding from DFG is gratefully acknowledged