Practical vectorization
Sébastien Ponce
sebastien.ponce@cern.ch
CERN
Thematic CERN School of Computing 2022

Outline
1 Introduction
2 Measuring vectorization
3 Vectorization Prerequisite
4 Vectorizing techniques in C++
    - Autovectorization
    - Inline assembly
    - Intrinsics
    - Compiler extensions
    - Libraries
5 What to expect?

1 Introduction

Goal of this course
- Make the theory explained by Andrzej concerning SIMD and vectorization more concrete
- Detail the impact of vectorization
    - on your code
    - on your data model
    - on actual C++ code
- Give an idea of what to expect from vectorized code

SIMD - Single Instruction Multiple Data
Concept
- Run the same operation in parallel on multiple data
- The operation is as fast as in the single data case
- The data live in a "vector"
Practically
    A + B = R, computed element by element:
    A1     B1     R1
    A2  +  B2  =  R2
    A3     B3     R3
    A4     B4     R4
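
The A + B = R picture can be written as a plain C++ loop. This is a minimal sketch, not code from the course: the names `Vec4` and `add4` are made up for illustration, and four floats are chosen only because they fill one 128-bit register. Whether the compiler really emits a single vector instruction depends on the compiler, its flags and the target.

```cpp
#include <array>
#include <cstddef>

// Four floats: the content of one 128-bit vector register (hypothetical alias).
using Vec4 = std::array<float, 4>;

// R = A + B, element by element. Every iteration applies the same operation
// to independent data, which is the pattern a SIMD instruction executes in
// one go.
Vec4 add4(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = a[i] + b[i];
    }
    return r;
}
```

The four additions have no dependency on each other, which is what allows them to run in parallel.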

Promises of vectorization
Theoretical gains
- Computation speed-up corresponding to the vector width
- Note that it depends on the type of data
    - float vs double
    - shorts vs ints
Various units for various vector widths

    Name             Arch   nb bits    nb floats/ints   nb doubles/longs
    SSE4 [1]         X86    128        4                2
    AVX [2]          X86    256        8                4
    AVX2 (FMA) [2]   X86    256        8                4
    AVX-512 [2]      X86    512        16               8
    SVE [3]          ARM    128-2048   4-64             2-32

    [1] Streaming SIMD Extensions
    [2] Advanced Vector eXtension
    [3] Scalable Vector Extension
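
The numbers in the table follow from dividing the register width by the element size. A small sketch of that arithmetic, assuming 8-bit bytes, a 4-byte `float` and an 8-byte `double` (true on the usual x86 and ARM platforms, but not guaranteed by the C++ standard); the helper name `lanes` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in a register of `register_bits` bits.
// This is also the upper bound on the theoretical speed-up for that type.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// Rows of the table above (assuming sizeof(float) == 4, sizeof(double) == 8).
static_assert(lanes<float>(128) == 4, "SSE: 4 floats");
static_assert(lanes<double>(128) == 2, "SSE: 2 doubles");
static_assert(lanes<float>(256) == 8, "AVX: 8 floats");
static_assert(lanes<double>(512) == 8, "AVX-512: 8 doubles");

// "shorts vs ints": halving the element size doubles the lane count.
static_assert(lanes<std::int16_t>(256) == 2 * lanes<std::int32_t>(256),
              "16-bit elements give twice the lanes of 32-bit ones");
```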

How to know what you can use
Manually
- Look for sse, avx, etc. in your processor flags

    lscpu | egrep 'mmx|sse|avx'
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
        pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse
        sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc
        arch_perfmon pebs bts rep_good nopl xtopology
        nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
        monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm
        pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer
        aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti
        ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid
        fsgsbase smep erms xsaveopt dtherm ida arat pln pts
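
The same check can be done from a program by searching the flags string for an exact name. This is only a sketch: `has_flag` is a hypothetical helper, and it expects the caller to supply the space-separated flags text (for example the "Flags:" line shown above).

```cpp
#include <sstream>
#include <string>

// Returns true if `name` appears as a whole word in a space-separated
// flags string. Matching whole words avoids false positives such as
// "sse" matching inside "ssse3".
bool has_flag(const std::string& flags, const std::string& name) {
    std::istringstream in(flags);
    std::string token;
    while (in >> token) {
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

With the flags listed above, `has_flag(flags, "avx")` is true and `has_flag(flags, "avx2")` is false: that processor supports 256-bit AVX but not AVX2. On x86 with GCC or Clang, `__builtin_cpu_supports("avx2")` answers the same question at run time without parsing text.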

Situation for Intel processors
[Figure: SIMD instruction sets per Intel processor generation. Nehalem (2009) and Westmere (2010): 128-bit SSE*. Sandy Bridge (2012) and Ivy Bridge (2013): 256-bit AVX. Haswell (2014) and Broadwell (2015): 256-bit AVX2. Knights Corner (2012): 512-bit IMCI. Knights Landing (2016): 512-bit AVX-512F, CD, ER, PF. Skylake (2017): 512-bit AVX-512F, CD, VL, DQ, BW.]

2 Measuring vectorization

Am I using vector registers?
- Yes you are
    - As vector registers are used for scalar operations
    - Remember Andrzej's picture
      [Figure: a vector register with one part labelled "Used" and the rest labelled "Wasted"]
Am I efficiently using vector registers?
- Here we have to look at the generated assembly code
    - Looking for specific instructions
    - Or for the use of specific register names
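
A small test case to try this on, as a sketch. The function names are hypothetical, and the instruction names in the comments are what GCC or Clang typically produce on x86-64; other compilers and targets will differ. The assembly can be generated with, for example, `g++ -O3 -S file.cpp`.

```cpp
#include <cstddef>

// Scalar operation: on x86-64 this still goes through a vector register
// (xmm0), but only its lowest lane is used, typically via `addss`
// ("add scalar single"). The other lanes are wasted.
float add_one(float x) {
    return x + 1.0f;
}

// Loop over independent elements: a candidate for autovectorization.
// If the compiler vectorizes it, the assembly contains packed instructions
// such as `addps` or `vaddps` ("add packed single"), operating on xmm
// (128-bit), ymm (256-bit) or zmm (512-bit) registers.
void add_arrays(const float* a, const float* b, float* r, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = a[i] + b[i];
    }
}
```

In the generated assembly, the `ss`/`sd` suffix marks scalar operations and `ps`/`pd` marks packed ones; the register name tells how wide the vector is.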