天天看点

矩阵90°旋转

下面谈谈我对这个问题的看法,如#3所说,这种横纵转换必然是打散了计算,后来我想利用memory cache来

提高速度,对于内存数据,硬盘数据是一个慢速设备,而对于当今的cpu来说,内存又是慢速的了。

我先上网查了一下我的cpu参数:E8400,L1 cache 64k,L2 cache 6M。

然后我设计了数据是1023*1024的矩阵,double类型,这样每一行数据就是1024 * sizeof(double)=8k

这样L1 cache能够装下8行的数据,但是不知道这个缓存是否能够全部利用,经试验装入6行效率最高。我猜

测缓存会装入读取数据的前面一小段和后面一大段(这个只是猜测)。

接下来说说我的算法。

将原始数据6行6行划分成条状,最后的余数不够6行特殊处理,每次读取条状第一个数据的时候,可预计这个

条带会整个被放入L1 cache,对于目标数据,这个条带是纵向的。一条一条的旋转,这样原始数据就不用频繁读取了。

另一种方法是对原始数据进行列划分,而对目标数据进行行划分,这样就相当于对写内存数据进行了缓存,经试验,划分为128时效率最好,我估计写内存的时候应该和L2 cache关系较大,这样算下来,8k*128=1M,也就是说6M 的L2 cache可以利用大约1M来优化写内存?或者是操作系统规定大约缓存1M的数据就开始写内存?

我也不知道是否如此。

以上说法都是个人观点,不一定准确。

我的环境:win7 64bit VS2008 debug/x64

下面是我的测试时间,可以看出读缓存可以快大约4倍,另外,先测试的函数往往要多耗时间:

编译器选项/Od

CacheRead:      1985498775

CacheWrite:    2279309904

Simple:        7816269690

Press any key to continue . . .

编译器选项/O2

CacheRead:      1334952851

CacheWrite:    1714044429

Simple:        4617906147

Press any key to continue . . .

C/C++ code
#include 
    <
    stdio.h
    >
    
#include 
    <
    stdlib.h
    >
    
#include 
    <
    memory.h
    >
    
#include 
    <
    intrin.h
    >
    

    #pragma
     intrinsic(__rdtsc)
    



    #define
     M 2047
    

    #define
     N 2048
    

__declspec( align(
    128
    ) ) 
    double
     aMatrix[M
    *
    N];
__declspec( align(
    128
    ) ) 
    double
     bMatrix[N
    *
    M];


    void
     Rot90_Simple( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;

    
    ++
    dataSize;

    
    while
    ( pDst 
    <
     pDstEnd )
    {
        
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
        pRunner 
    +=
     width;
        
    if
    ( pRunner 
    >
     pSrcEnd ) pRunner 
    -=
     dataSize;
    }
}


    void
     Rot90_CacheWrite( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     blockSize 
    =
     
    128
    ;
    
    int
     wblock 
    =
     width 
    /
     blockSize;
    
    int
     hblock 
    =
     height 
    /
     blockSize;
    
    int
     wlast 
    =
     width 
    -
     blockSize 
    *
     wblock;
    
    int
     hlast 
    =
     height 
    -
     blockSize 
    *
     hblock;

    
    int
     nextRow 
    =
     width 
    +
     blockSize;
    
    int
     nextCol 
    =
     blockSize 
    *
     height 
    -
     
    1
    ;

    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
     preRead 
    =
     
    *
    pSrc;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     wblock; 
    ++
    i )
    {
        
    while
    ( pRunner 
    <
     pSrcEnd )
        {
            
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     blockSize; 
    ++
    j )
            {
                
    *
    pDst 
    =
     
    *
    pRunner
    --
    ;
                pDst 
    +=
     height;
            }
            pDst 
    -=
     nextCol;
            pRunner 
    +=
     nextRow;
        }
        pDst 
    +=
     ( nextCol 
    -
     height 
    +
     
    1
     );
        
    //
    preRead = pRunner[nextCol];
    

            pRunner 
    -=
     ( dataSize 
    +
     blockSize );
    }

    nextCol 
    =
     wlast 
    *
     height 
    -
     
    1
    ;
    nextRow 
    =
     width 
    +
     wlast;

    
    while
    ( pRunner 
    <
     pSrcEnd )
    {
        
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     wlast; 
    ++
    j )
        {
            
    *
    pDst 
    =
     
    *
    pRunner
    --
    ;
            pDst 
    +=
     height;
        }
        pDst 
    -=
     nextCol;
        pRunner 
    +=
     nextRow;
    }

}


    void
     Rot90_CacheRead( 
    double
     
    *
    pDst, 
    double
     
    *
    pSrc, 
    int
     width, 
    int
     height )
{
    
    int
     blockSize 
    =
     
    6
    ;
    
    int
     wblock 
    =
     width 
    /
     blockSize;
    
    int
     hblock 
    =
     height 
    /
     blockSize;
    
    int
     wlast 
    =
     width 
    -
     blockSize 
    *
     wblock;
    
    int
     hlast 
    =
     height 
    -
     blockSize 
    *
     hblock;

    
    int
     nextRow 
    =
     height 
    -
     blockSize;
    
    int
     nextCol 
    =
     blockSize 
    *
     width 
    +
     
    1
    ;

    
    int
     dataSize 
    =
     width 
    *
     height;
    
    double
     preRead 
    =
     
    *
    pSrc;
    
    double
    *
     pRunner 
    =
     pSrc 
    +
     width 
    -
     
    1
    ;
    
    double
    *
     pSrcEnd 
    =
     pSrc 
    +
     dataSize;
    
    double
    *
     pDstEnd 
    =
     pDst 
    +
     dataSize;
    
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     hblock; 
    ++
    i )
    {
        
    while
    ( pDst 
    <
     pDstEnd )
        {
            
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     blockSize; 
    ++
    j )
            {
                
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
                pRunner 
    +=
     width;
            }
            pDst 
    +=
     nextRow;
            pRunner 
    -=
     nextCol;
        }
        pDst 
    -=
     ( dataSize 
    -
     blockSize );
        preRead 
    =
     pRunner[nextCol];
        pRunner 
    +=
     ( nextCol 
    +
     width 
    -
     
    1
     );
    }

    nextRow 
    =
     height 
    -
     hlast;
    nextCol 
    =
     hlast 
    *
     width 
    +
     
    1
    ;

    
    while
    ( pDst 
    <
     pDstEnd )
    {
        
    for
    ( 
    int
     j 
    =
     
    0
    ; j 
    <
     hlast; 
    ++
    j )
        {
            
    *
    pDst
    ++
     
    =
     
    *
    pRunner;
            pRunner 
    +=
     width;
        }
        pDst 
    +=
     nextRow;
        pRunner 
    -=
     nextCol;
    }
}


    float
     main()
{
    unsigned __int64 t1, t2, t;
    
    int
     times 
    =
     
    20
    ;

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_CacheRead( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    CacheRead:/t%llu/n
    "
    , t );

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_CacheWrite( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    CacheWrite:/t%llu/n
    "
    , t );

    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     N 
    *
     M; i
    ++
     ) aMatrix[i] 
    =
     
    1.0
     
    *
     rand() 
    /
     RAND_MAX;
    t1 
    =
     __rdtsc();
    
    for
    ( 
    int
     i 
    =
     
    0
    ; i 
    <
     times ; 
    ++
    i ) Rot90_Simple( bMatrix, aMatrix, N, M );
    
    //
    memcpy( aMatrix, bMatrix, M*N*sizeof(double) );
    

        t2 
    =
     __rdtsc();
    t 
    =
     t2 
    -
     t1;
    printf( 
    "
    Simple:/t/t%llu/n
    "
    , t );

    
    return
     
    0
    .f;
}