Lecture 8: Parallel Sparse Methods
Objective
• To learn to regularize irregular data with
  – Limiting variations with clamping
  – Sorting
  – Transposition
• To learn to write a high-performance SpMV kernel based on the JDS transposed format
• To learn the key techniques for compacting input data in parallel sparse methods for reduced consumption of memory bandwidth
  – Better utilization of on-chip memory
  – Fewer bytes transferred to on-chip memory
  – Better utilization of global memory
  – Challenge: retaining regularity
Sparse Matrix
• Many real-world systems are sparse in nature
  – Linear systems described as sparse matrices
• Solving sparse linear systems
  – Traditional inversion algorithms such as Gaussian elimination can create too many "fill-in" elements and explode the size of the matrix
  – Iterative Conjugate Gradient solvers based on sparse matrix-vector multiplication are preferred
• Solution of PDE systems can be formulated into linear operations expressed as sparse matrix-vector multiplication
Sparse Data Motivation for Compaction
▪ Many real-world inputs are sparse/non-uniform
▪ Signal samples, mesh models, transportation networks, communication networks, etc.
Sparse Matrix in Analytics and AI
Matrix factorization and tensor decomposition in machine learning and HPC applications:
• Recommender systems – predict missing ratings; group similar users/items
• Complex networks – link prediction; vertices clustering
• Web search – match query and document
• Natural language processing – latent semantic model; word embedding as input to DNN
• Deep learning – model compression; embedding layer
[Figure: an m-users × n-items ratings matrix R approximated by a low-rank factorization, R ≈ U × Vᵀ]
Science Areas and Codes
Science Area (number of teams): Codes
• Climate and Weather (3): CESM, GCRM, CM1/WRF, HOMME
• Plasmas/Magnetosphere (2): H3D(M), VPIC, OSIRIS, Magtail/UPIC
• Stellar Atmospheres and Supernovae (5): PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS
• Cosmology (2): Enzo, pGADGET
• Combustion/Turbulence (2): PSDNS, DISTUF
• General Relativity (2): Cactus, Harm3D, LazEV
• Molecular Dynamics (4): AMBER, Gromacs, NAMD, LAMMPS
• Quantum Chemistry (2): SIAL, GAMESS, NWChem
• Material Science (3): NEMOS, OMEN, GW, QMCPACK
• Earthquakes/Seismology (2): AWP-ODC, HERCULES, PLSQR, SPECFEM3D
• Quantum Chromodynamics (1): Chroma, MILC, USQCD
• Social Networks (1): EPISIMDEMICS
• Evolution (1): Eve
• Engineering/System of Systems (1): GRIPS, Revisit
• Computer Science (1)
[Table: each area is marked (X) against the methods it uses: Structured Grids, Unstructured Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Significant I/O]
Sparse Matrix-Vector Multiplication (SpMV)

A × X + Y → Y
[Figure: a sparse matrix A multiplied by a dense vector X, accumulated into a dense vector Y]
Challenges
• Compared to dense matrix multiplication, SpMV
  – Is irregular/unstructured
  – Has little input data reuse
• Key to maximal performance
  – Maximize regularity (by reducing divergence and load imbalance)
  – Maximize DRAM burst utilization (layout arrangement)
A Simple Parallel SpMV

Row 0: 3 0 1 0   → Thread 0
Row 1: 0 0 0 0   → Thread 1
Row 2: 0 2 4 1   → Thread 2
Row 3: 1 0 0 1   → Thread 3

• Each thread processes one row
Compressed Sparse Row (CSR) Format

Dense representation:
Row 0: 3 0 1 0   → Thread 0
Row 1: 0 0 0 0   → Thread 1
Row 2: 0 2 4 1   → Thread 2
Row 3: 1 0 0 1   → Thread 3

CSR representation:
                               Row 0    Row 2     Row 3
Nonzero values   data[7]      { 3, 1,   2, 4, 1,  1, 1 }
Column indices   col_index[7] { 0, 2,   1, 2, 3,  0, 3 }
Row pointers     ptr[5]       { 0, 2, 2, 5, 7 }

(Row 1 is empty, so ptr[1] == ptr[2]; ptr[4] == 7 is the total number of nonzeros.)