Double Buffering

- One could double buffer the computation, getting a better instruction mix within each thread
- This is classic software pipelining in ILP compilers

Baseline loop:

    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

Double-buffered loop:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        syncthreads()
        Load next tile from global memory
        Compute current tile
        syncthreads()
    }
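
The double-buffered loop can be sketched as a CUDA kernel for tiled matrix multiplication. This is a minimal illustration, not the slide author's code: the kernel name `matmulPrefetch`, the tile width `TILE`, the square row-major layout, and the requirement that `n` be a multiple of `TILE` are all assumptions made for the example. The "next tile" is held in per-thread registers (`aNext`, `bNext`), so the global loads for tile t+1 are issued before the arithmetic on tile t and can overlap with it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; n must be a multiple of TILE

// C = A * B for square n x n row-major matrices (hypothetical example kernel).
__global__ void matmulPrefetch(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    // Prologue: load the first tile from global memory into registers.
    float aNext = A[row * n + tx];
    float bNext = B[ty * n + col];

    float acc = 0.0f;
    int numTiles = n / TILE;
    for (int t = 0; t < numTiles; ++t) {
        // Deposit the current tile (already in registers) to shared memory.
        As[ty][tx] = aNext;
        Bs[ty][tx] = bNext;
        __syncthreads();

        // Load the next tile from global memory into registers, if one exists.
        // The condition is uniform across the block, so no thread diverges
        // around a barrier.
        if (t + 1 < numTiles) {
            aNext = A[row * n + (t + 1) * TILE + tx];
            bNext = B[((t + 1) * TILE + ty) * n + col];
        }

        // Compute on the current tile while the loads above are in flight.
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();  // all reads of As/Bs finish before the next deposit
    }
    C[row * n + col] = acc;
}

int main()
{
    const int n = 64;  // assumed size, a multiple of TILE
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmulPrefetch<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();

    // Every element of C should equal 2 * n for these inputs.
    int bad = 0;
    for (int i = 0; i < n * n; ++i)
        if (C[i] != 2.0f * n) ++bad;
    printf("mismatches: %d\n", bad);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return bad != 0;
}
```

The trade-off in this sketch is register pressure: each thread keeps two extra values live across the compute loop, which may reduce occupancy on some devices, so the benefit should be measured rather than assumed.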