
Using Blocking to Increase Temporal Locality

In the previous essay, Rearranging Loops to Increase Spatial Locality, we saw how some simple rearrangements of loops can increase spatial locality. But observe that even with good loop nestings, the time per loop iteration increases with the array size. What is happening is that as the array size grows, temporal locality decreases, and the cache experiences an increasing number of capacity misses. To fix this, we can use a general technique called blocking.

      The general idea of blocking is to organize the data structures in a program into large chunks called blocks. (In this context, the term “block” refers to an application-level chunk of data, not a cache block.) The program is structured so that it loads a chunk into the L1 cache, does all the reads and writes that it needs to on that chunk, then discards the chunk, loads in the next chunk, and so on.
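As an illustrative sketch (not from the text), the same load-a-chunk, use-it-fully, discard-it pattern appears in a blocked matrix transpose; the function name and the 8×8/4×4 sizes below are our own choices for the example:

```c
#include <assert.h>

enum { N = 8, BSIZE = 4 };  /* illustrative sizes chosen for this sketch */

/* Transpose src into dst one BSIZE x BSIZE chunk at a time, so each chunk
   of src is read completely while it is resident in the cache before the
   next chunk is loaded. Assumes BSIZE divides N evenly. */
static void transpose_blocked(double dst[N][N], double src[N][N])
{
    for (int ii = 0; ii < N; ii += BSIZE)
        for (int jj = 0; jj < N; jj += BSIZE)
            for (int i = ii; i < ii + BSIZE; i++)
                for (int j = jj; j < jj + BSIZE; j++)
                    dst[j][i] = src[i][j];
}

/* Returns 1 if transpose_blocked agrees with a direct transpose. */
static int transpose_is_correct(void)
{
    double src[N][N], dst[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            src[i][j] = N * i + j;   /* distinct value per element */
    transpose_blocked(dst, src);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (dst[i][j] != src[j][i])
                return 0;
    return 1;
}
```

The result is identical to an unblocked transpose; only the order of the memory accesses changes, which is the whole point of the technique.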

      Blocking a matrix multiply routine works by partitioning the matrices into submatrices and then exploiting the mathematical fact that these submatrices can be manipulated just like scalars. For example, if n = 8, then we could partition each matrix into four 4×4 submatrices:
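The partitioned form the text refers to can be reconstructed from standard block-matrix algebra. Writing each n×n matrix as a 2×2 grid of 4×4 submatrices:

```latex
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\quad
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix},
\quad
C = AB = \begin{bmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{bmatrix}
```

Each product such as $A_{11}B_{11}$ is itself a 4×4 matrix multiply, so the submatrices combine exactly as scalars would in a 2×2 multiplication.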


The blocked version of matrix multiplication, which we call the bijk version, is presented below. The basic idea behind this code is to partition A and C into 1×bsize row slivers and to partition B into bsize×bsize blocks. The innermost (j, k) loop pair multiplies a sliver of A by a block of B and accumulates the result into a sliver of C. The i loop iterates through the n row slivers of A and C, using the same block of B.

void bijk(array A, array B, array C, int n, int bsize)
{
    double sum;
    int en = bsize * (n / bsize); /* Amount that fits evenly into blocks */

    for (int i = 0; i != n; ++i)
        for (int j = 0; j != n; ++j)
            C[i][j] = 0.0;

    /* For simplicity, the loops below cover only the en x en portion that
       divides evenly into blocks; any leftover fringe is not handled. */
    for (int kk = 0; kk < en; kk += bsize) {
        for (int jj = 0; jj < en; jj += bsize) {
            for (int i = 0; i != n; ++i) {
                for (int j = jj; j != jj + bsize; ++j) {
                    sum = C[i][j];
                    for (int k = kk; k != kk + bsize; ++k)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
            }
        }
    }
}
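To make the routine above runnable in isolation, the sketch below re-expresses it with C99 variable-length array parameters (the text never shows the `array` typedef, so that signature is an assumption of ours) and checks it against a straightforward ijk multiply:

```c
#include <assert.h>
#include <math.h>

/* Blocked multiply, adapted from the bijk routine in the text. The `array`
   typedef is not shown there, so this sketch assumes C99 VLA parameters. */
static void bijk(int n, int bsize, double A[n][n], double B[n][n],
                 double C[n][n])
{
    int en = bsize * (n / bsize); /* amount that fits evenly into blocks */

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] = 0.0;

    for (int kk = 0; kk < en; kk += bsize)
        for (int jj = 0; jj < en; jj += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bsize; j++) {
                    double sum = C[i][j];
                    for (int k = kk; k < kk + bsize; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
}

/* Straightforward ijk multiply, used as a reference. */
static void naive(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Returns 1 if the blocked result matches the reference on a small case
   where bsize divides n evenly. */
static int blocked_matches_naive(void)
{
    enum { SZ = 8, BS = 4 };
    double A[SZ][SZ], B[SZ][SZ], C1[SZ][SZ], C2[SZ][SZ];
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }
    bijk(SZ, BS, A, B, C1);
    naive(SZ, A, B, C2);
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++)
            if (fabs(C1[i][j] - C2[i][j]) > 1e-9)
                return 0;
    return 1;
}
```

Because blocking only reorders the additions into C[i][j], the two versions compute the same result (up to floating-point summation order).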
           

      The key idea is that it loads a block of B into the cache, uses it up, and then discards it. References to A enjoy good spatial locality because each sliver is accessed with a stride of 1, and good temporal locality because the entire sliver is referenced bsize times in succession. References to B enjoy good temporal locality because the entire bsize×bsize block is accessed n times in succession. Finally, references to C have good spatial locality because each element of the sliver is written in succession. Notice that references to C do not have good temporal locality, because each sliver is only accessed one time.

      Blocking can make code harder to read, but it can also pay big performance dividends. On the system measured here, blocking improves the running time by roughly a factor of two over the best non-blocked version, from about 20 cycles per iteration down to about 10 cycles per iteration.
