下面談談我對這個問題的看法,如#3所說,這種橫縱轉換必然是打散了計算,後來我想利用memory cache來
提高速度,對于記憶體資料,硬碟資料是一個慢速裝置,而對于當今的cpu來說,記憶體又是慢速的了。
我先上網查了一下我的cpu參數:E8400,L1 cache 64k,L2 cache 6M。
然後我設計了資料是1023*1024的矩陣,double類型,這樣每一行資料就是1024 * sizeof(double)=8k
這樣L1 cache能夠裝下8行的資料,但是不知道這個緩存是否能夠全部利用,經試驗裝入6行效率最高。我猜
測緩存會裝入讀取資料的前面一小段和後面一大段(這個隻是猜測)。
接下來說說我的算法。
将原始資料6行6行劃分成條狀,最後的餘數不夠6行特殊處理,每次讀取條狀第一個資料的時候,可預計這個
條帶會整個被放入L1 cache,對于目标資料,這個條帶是縱向的。一條一條的旋轉,這樣原始資料就不用頻繁讀取了。
另一種方法是對原始資料進行列劃分,而對目标資料進行行劃分,這樣就相當于對寫記憶體資料進行了緩存,經試驗,劃分為128時效率最好,我估計寫記憶體的時候應該和L2 cache關系較大,這樣算下來,8k*128=1M,也就是說6M 的L2 cache可以利用大約1M來優化寫記憶體?或者是作業系統規定大約緩存1M的資料就開始寫記憶體?
我也不知道是否如此。
以上說法都是個人觀點,不一定準确。
我的環境:win7 64bit VS2008 debug/x64
下面是我的測試時間,可以看出讀緩存可以快大約4倍,另外,先測試的函數往往要多耗時間:
編譯器選項/Od
CacheRead: 1985498775
CacheWrite: 2279309904
Simple: 7816269690
Press any key to continue . . .
編譯器選項/O2
CacheRead: 1334952851
CacheWrite: 1714044429
Simple: 4617906147
Press any key to continue . . .
- C/C++ code
-
#include < stdio.h > #include < stdlib.h > #include < memory.h > #include < intrin.h > #pragma intrinsic(__rdtsc) #define M 2047 #define N 2048 __declspec( align( 128 ) ) double aMatrix[M * N]; __declspec( align( 128 ) ) double bMatrix[N * M]; void Rot90_Simple( double * pDst, double * pSrc, int width, int height ) { int dataSize = width * height; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; ++ dataSize; while ( pDst < pDstEnd ) { * pDst ++ = * pRunner; pRunner += width; if ( pRunner > pSrcEnd ) pRunner -= dataSize; } } void Rot90_CacheWrite( double * pDst, double * pSrc, int width, int height ) { int blockSize = 128 ; int wblock = width / blockSize; int hblock = height / blockSize; int wlast = width - blockSize * wblock; int hlast = height - blockSize * hblock; int nextRow = width + blockSize; int nextCol = blockSize * height - 1 ; int dataSize = width * height; double preRead = * pSrc; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; for ( int i = 0 ; i < wblock; ++ i ) { while ( pRunner < pSrcEnd ) { for ( int j = 0 ; j < blockSize; ++ j ) { * pDst = * pRunner -- ; pDst += height; } pDst -= nextCol; pRunner += nextRow; } pDst += ( nextCol - height + 1 ); // preRead = pRunner[nextCol]; pRunner -= ( dataSize + blockSize ); } nextCol = wlast * height - 1 ; nextRow = width + wlast; while ( pRunner < pSrcEnd ) { for ( int j = 0 ; j < wlast; ++ j ) { * pDst = * pRunner -- ; pDst += height; } pDst -= nextCol; pRunner += nextRow; } } void Rot90_CacheRead( double * pDst, double * pSrc, int width, int height ) { int blockSize = 6 ; int wblock = width / blockSize; int hblock = height / blockSize; int wlast = width - blockSize * wblock; int hlast = height - blockSize * hblock; int nextRow = height - blockSize; int nextCol = blockSize * width + 1 ; int dataSize = width * height; double preRead = * pSrc; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; for ( int i = 0 ; i < hblock; ++ i ) { while ( pDst < pDstEnd ) { for ( int j = 0 ; j < blockSize; ++ j ) { * pDst ++ = * pRunner; pRunner += width; } pDst += nextRow; pRunner -= nextCol; } pDst -= ( dataSize - blockSize ); preRead = pRunner[nextCol]; pRunner += ( nextCol + width - 1 ); } nextRow = height - hlast; nextCol = hlast * width + 1 ; while ( pDst < pDstEnd ) { for ( int j = 0 ; j < hlast; ++ j ) { * pDst ++ = * pRunner; pRunner += width; } pDst += nextRow; pRunner -= nextCol; } } float main() { unsigned __int64 t1, t2, t; int times = 20 ; for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_CacheRead( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " CacheRead:/t%llu/n " , t ); for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_CacheWrite( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " CacheWrite:/t%llu/n " , t ); for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_Simple( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " Simple:/t/t%llu/n " , t ); return 0 .f; }