天天看點

矩陣90°旋轉

下面談談我對這個問題的看法,如#3所說,這種橫縱轉換必然是打散了計算,後來我想利用memory cache來

提高速度,對于記憶體資料,硬碟資料是一個慢速裝置,而對于當今的cpu來說,記憶體又是慢速的了。

我先上網查了一下我的cpu參數:E8400,L1 cache 64k,L2 cache 6M。

然後我設計了資料是1023*1024的矩陣,double類型,這樣每一行資料就是1024 * sizeof(double)=8k

這樣L1 cache能夠裝下8行的資料,但是不知道這個緩存是否能夠全部利用,經試驗裝入6行效率最高。我猜

測緩存會裝入讀取資料的前面一小段和後面一大段(這個隻是猜測)。

接下來說說我的算法。

将原始資料6行6行劃分成條狀,最後的餘數不夠6行特殊處理,每次讀取條狀第一個資料的時候,可預計這個

條帶會整個被放入L1 cache,對于目标資料,這個條帶是縱向的。一條一條的旋轉,這樣原始資料就不用頻繁讀取了。

另一種方法是對原始資料進行列劃分,而對目标資料進行行劃分,這樣就相當于對寫記憶體資料進行了緩存,經試驗,劃分為128時效率最好,我估計寫記憶體的時候應該和L2 cache關系較大,這樣算下來,8k*128=1M,也就是說6M 的L2 cache可以利用大約1M來優化寫記憶體?或者是作業系統規定大約緩存1M的資料就開始寫記憶體?

我也不知道是否如此。

以上說法都是個人觀點,不一定準确。

我的環境:win7 64bit VS2008 debug/x64

下面是我的測試時間,可以看出讀緩存可以快大約4倍,另外,先測試的函數往往要多耗時間:

編譯器選項/Od

CacheRead:      1985498775

CacheWrite:    2279309904

Simple:        7816269690

Press any key to continue . . .

編譯器選項/O2

CacheRead:      1334952851

CacheWrite:    1714044429

Simple:        4617906147

Press any key to continue . . .

C/C++ code
#include 
    <
    stdio.h
    >
    
#include 
    <
    stdlib.h
    >
    
#include 
    <
    memory.h
    >
    
#include 
    <
    intrin.h
    >
    

    #pragma
     intrinsic(__rdtsc)
    



    #define
     M 2047
    

    #define
     N 2048
    

__declspec( align(
    128
    ) ) 
    double
     aMatrix[M
    *
    N];
__declspec( align(
    128
    ) ) 
    double
     bMatrix[N
    *
    M];


    void
     Rot90_Simple( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;

    
    ++
    dataSize;

    
    while
    ( pDst 
    <
     pDstEnd )
    {
        
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
        pRunner 
    +=
     width;
        
    if
    ( pRunner 
    >
     pSrcEnd ) pRunner 
    -=
     dataSize;
    }
}


    void
     Rot90_CacheWrite( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     blockSize 
    =
     
    128
    ;
    
    int
     wblock 
    =
     width 
    /
     blockSize;
    
    int
     hblock 
    =
     height 
    /
     blockSize;
    
    int
     wlast 
    =
     width 
    -
     blockSize 
    *
     wblock;
    
    int
     hlast 
    =
     height 
    -
     blockSize 
    *
     hblock;

    
    int
     nextRow 
    =
     width 
    +
     blockSize;
    
    int
     nextCol 
    =
     blockSize 
    *
     height 
    -
     
    1
    ;

    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
     preRead 
    =
     
    *
    pSrc;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     wblock; 
    ++
    i )
    {
        
    while
    ( pRunner 
    <
     pSrcEnd )
        {
            
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     blockSize; 
    ++
    j )
            {
                
    *
    pDst 
    =
     
    *
    pRunner
    --
    ;
                pDst 
    +=
     height;
            }
            pDst 
    -=
     nextCol;
            pRunner 
    +=
     nextRow;
        }
        pDst 
    +=
     ( nextCol 
    -
     height 
    +
     
    1
     );
        
    //
    preRead = pRunner[nextCol];
    

            pRunner 
    -=
     ( dataSize 
    +
     blockSize );
    }

    nextCol 
    =
     wlast 
    *
     height 
    -
     
    1
    ;
    nextRow 
    =
     width 
    +
     wlast;

    
    while
    ( pRunner 
    <
     pSrcEnd )
    {
        
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     wlast; 
    ++
    j )
        {
            
    *
    pDst 
    =
     
    *
    pRunner
    --
    ;
            pDst 
    +=
     height;
        }
        pDst 
    -=
     nextCol;
        pRunner 
    +=
     nextRow;
    }

}


    void
     Rot90_CacheRead( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     blockSize 
    =
     
    6
    ;
    
    int
     wblock 
    =
     width 
    /
     blockSize;
    
    int
     hblock 
    =
     height 
    /
     blockSize;
    
    int
     wlast 
    =
     width 
    -
     blockSize 
    *
     wblock;
    
    int
     hlast 
    =
     height 
    -
     blockSize 
    *
     hblock;

    
    int
     nextRow 
    =
     height 
    -
     blockSize;
    
    int
     nextCol 
    =
     blockSize 
    *
     width 
    +
     
    1
    ;

    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
     preRead 
    =
     
    *
    pSrc;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;
    
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     hblock; 
    ++
    i )
    {
        
    while
    ( pDst 
    <
     pDstEnd )
        {
            
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     blockSize; 
    ++
    j )
            {
                
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
                pRunner 
    +=
     width;
            }
            pDst 
    +=
     nextRow;
            pRunner 
    -=
     nextCol;
        }
        pDst 
    -=
     ( dataSize 
    -
     blockSize );
        preRead 
    =
     pRunner[nextCol];
        pRunner 
    +=
     ( nextCol 
    +
     width 
    -
     
    1
     );
    }

    nextRow 
    =
     height 
    -
     hlast;
    nextCol 
    =
     hlast 
    *
     width 
    +
     
    1
    ;

    
    while
    ( pDst 
    <
     pDstEnd )
    {
        
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     hlast; 
    ++
    j )
        {
            
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
            pRunner 
    +=
     width;
        }
        pDst 
    +=
     nextRow;
        pRunner 
    -=
     nextCol;
    }
}


    float
     main()
{
    unsigned __int64 t1, t2, t;
    
    int
     times 
    =
     
    20
    ;

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_CacheRead( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    CacheRead:/t%llu/n
    "
    , t );

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_CacheWrite( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    CacheWrite:/t%llu/n
    "
    , t );

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_Simple( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    Simple:/t/t%llu/n
    "
    , t );

    
    return
     
    0
    .f;
}