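Pthreads Matrix Multiplication Implementation

Having picked up a Markdown editor over the past couple of days, I can't stop writing in it, so here is the next installment: matrix multiplication with Pthreads. Compared with the MPI implementation, the Pthreads version is much simpler. The main reason is that MPI communicates between processes by passing messages, whereas Pthreads (and OpenMP, which comes later) communicate between threads through shared memory. Judging by code size and implementation effort, thread-based communication is the simpler of the two: partition the matrix into blocks and hand each block to its thread.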

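Parallelization approach

Let the matrix product be A * B = C. Each thread computes a different part of the result C: with p threads and 1000 rows, each thread computes 1000/p consecutive rows of C. The program first creates thread_count threads with pthread_create, assigns the rows of A to the threads block by block, and declares B as a global variable so that every thread can compute its share of the product. Each thread writes its rows directly into the global result matrix C. The main thread calls pthread_join to wait for all threads to finish and then writes the result matrix to a file.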
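The row assignment described above can be sketched as a small helper. This is an illustration rather than the code used in this post: `row_range` and `rows_for_thread` are hypothetical names, the range is half-open instead of the inclusive first/last row used in the listing, and the remainder handling is an addition so that the split still works when the row count is not a multiple of the thread count.

```c
#include <assert.h>

/* Half-open row range [first, last) owned by one thread. */
typedef struct {
    int first;
    int last;
} row_range;

/* Split `size` rows among `thread_count` threads as evenly as possible.
 * The first (size % thread_count) threads receive one extra row, so the
 * split also covers every row when size is not a multiple of thread_count. */
row_range rows_for_thread(int rank, int thread_count, int size) {
    assert(thread_count > 0 && rank >= 0 && rank < thread_count);
    int base = size / thread_count;   /* rows every thread gets */
    int extra = size % thread_count;  /* leftover rows to hand out */
    row_range r;
    r.first = rank * base + (rank < extra ? rank : extra);
    r.last = r.first + base + (rank < extra ? 1 : 0);
    return r;
}
```

With 1000 rows and 4 threads, thread 0 gets rows 0 to 249 and thread 3 gets rows 750 to 999, which matches the 1000/p split described above.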

Enough talk, on to the code!

Pthreads matrix multiplication implementation

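#include<stdio.h>
#include<stdlib.h>
#include<pthread.h>
#include<sys/time.h>   // gettimeofday, struct timeval

int thread_count;      // number of worker threads
int size, local_size;  // matrix dimension and rows per thread
int *a, *b, *c;        // global matrices, shared by all threads
FILE *fp;

// Transpose a size x size matrix in place
int* transpose_matrix(int *m, int size);

// Allocate the matrices and read A and B from file
void* Init();

// Thread function: compute this thread's block of rows of C
void* pthread_mult(void* rank);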

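int main(int argc, char* argv[]){
    int i, j;
    long thread;

    double time_use = 0;
    struct timeval start;
    struct timeval end;

    // Matrix dimension (1000 assumed, matching the 1000/P in the description)
    size = 1000;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <thread_count>\n", argv[0]);
        return 1;
    }

    gettimeofday(&start, NULL);

    pthread_t* thread_handles;

    // Number of threads comes from the first command-line argument
    thread_count = strtol(argv[1], NULL, 10);
    if (thread_count < 1) {
        fprintf(stderr, "thread_count must be at least 1\n");
        return 1;
    }

    // Rows per thread; assumes size is divisible by thread_count
    local_size = size/thread_count;

    thread_handles = malloc(thread_count * sizeof(pthread_t));

    Init();

    // Start the workers, passing each its rank
    for(thread = 0; thread<thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL, pthread_mult, (void*) thread);

    // Wait for all workers to finish
    for (thread=0; thread<thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    fp=fopen("c.txt","w"); // open the output file
    for(i=0;i<size;i++) { // write the result
        for(j=0;j<size;j++)
            fprintf(fp,"%d ",c[i*size+j]);
        fputc('\n',fp);
    }
    fclose(fp); // close the file

    gettimeofday(&end, NULL);
    // Elapsed time in microseconds
    time_use = (end.tv_sec-start.tv_sec)*1000000.0+(end.tv_usec-start.tv_usec);

    // Report the elapsed time in seconds
    printf("time_use is %f\n", time_use/1000000);

    free(thread_handles);
    free(a);
    free(b);
    free(c);
    return 0;
}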

int* transpose_matrix(int *m, int size){
    int i, j;
    // Swap each element above the diagonal with its mirror below it
    for(i=0; i<size; i++){
        for(j=i+1; j<size; j++){
            int temp = m[i*size+j];
            m[i*size+j] = m[j*size+i];
            m[j*size+i] = temp;
        }
    }
    return m;
}

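void* Init(){

    int i, j;

    a = (int*)malloc(sizeof(int)*size*size);
    b = (int*)malloc(sizeof(int)*size*size);
    c = (int*)malloc(sizeof(int)*size*size);

    // Read matrix A from file
    fp=fopen("a.txt","r"); // open the file
    if (fp == NULL) { perror("a.txt"); exit(1); }
    for(i=0;i<size;i++) // read the data
        for(j=0;j<size;j++)
            fscanf(fp,"%d",&a[i*size+j]);
    fclose(fp); // close the file

    // Read matrix B from file
    fp=fopen("b.txt","r");
    if (fp == NULL) { perror("b.txt"); exit(1); }
    for(i=0;i<size;i++)
        for(j=0;j<size;j++)
            fscanf(fp,"%d",&b[i*size+j]);
    fclose(fp);

    // Transpose B in place so the multiplication can read it row by row
    b = transpose_matrix(b, size);
    return NULL;
}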

void* pthread_mult(void* rank){
    long my_rank = (long) rank;
    int i, j, k, temp;
    // This thread owns rows my_first_row..my_last_row (inclusive)
    int my_first_row = my_rank*local_size;
    int my_last_row = (my_rank+1)*local_size - 1;

    for(i = my_first_row; i <= my_last_row; i++){
        for(j = 0; j<size; j++){
            temp = 0;
            // b holds the transpose of B, so row j of b is column j of B
            for(k = 0; k<size; k++)
                temp += a[i*size+k] * b[j*size+k];
            c[i*size+j] = temp;
        }

    }
    return NULL;
}
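The listing transposes B before multiplying but does not say why. A plausible reason, stated here as an assumption, is memory locality: with B transposed, the inner loop walks one row of each operand contiguously instead of striding down a column of B. The sketch below isolates that kernel so it can be checked on a small matrix; `mult_rows_transposed` is a hypothetical name, and the row range is half-open.

```c
#include <assert.h>
#include <stddef.h>

/* Compute rows [first_row, last_row) of c = a * b, where bt holds the
 * transpose of b. All matrices are n x n and stored row-major. */
void mult_rows_transposed(const int *a, const int *bt, int *c,
                          int n, int first_row, int last_row) {
    assert(a != NULL && bt != NULL && c != NULL && n > 0);
    for (int i = first_row; i < last_row; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            /* Row i of a and row j of bt are both contiguous in memory. */
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}
```

For the 2 x 2 case a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, passing bt = {5, 7, 6, 8} fills c with {19, 22, 43, 50}, the ordinary product a * b.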

Results and speedup

Serial matrix multiplication run time: 3.490950 s

① Run time and speedup for different thread counts (speedup = serial time / parallel time):

Threads   Run time (s)   Speedup
1         3.500414       0.997296
2         2.116865       1.649113
4         2.079719       1.678568
8         2.088567       1.671457
20        2.036915       1.713842
25        2.047089       1.705324

② Efficiency for different thread counts (efficiency = speedup / threads):

Threads   Efficiency
1         0.997296
2         0.824557
4         0.419642
8         0.208932
20        0.085692
25        0.068213
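The speedup and efficiency figures follow from the measured run times. As a minimal sketch, assuming the usual definitions (speedup is serial time divided by parallel time, efficiency is speedup divided by the number of threads), they can be computed as follows; the function names are illustrative.

```c
#include <assert.h>

/* Speedup S = T_serial / T_parallel. */
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency E = S / p, where p is the number of threads. */
double efficiency(double t_serial, double t_parallel, int threads) {
    assert(threads > 0);
    return speedup(t_serial, t_parallel) / threads;
}
```

For the 2-thread run, speedup(3.490950, 2.116865) is about 1.649 and efficiency(3.490950, 2.116865, 2) is about 0.825.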

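Analysis of results

① Run-time analysis:

Going from 1 thread to 2, the run time dropped from about 3.50 s to about 2.12 s, a reduction of roughly 40%. That is the direction parallel computation predicts, although it falls short of halving. As more threads were added the run time stopped improving and settled at around 2 seconds.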

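② Speedup analysis:

Going from 1 thread to 2, the speedup rose from about 1.0 to about 1.65, which is consistent with parallel execution. As the thread count increased further, the speedup levelled off at roughly 1.6 to 1.7 and never doubled again.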

③ Efficiency analysis:

Efficiency falls steadily as the number of threads grows, although the rate of decline keeps shrinking.

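④ Cause analysis:

The Pthreads program was tested on an Intel Core i5, a dual-core CPU: one processor with two cores, able to run two threads truly in parallel. With one thread the run time is about the same as the serial multiplication. With two threads it drops to about 60% of the single-thread time. Beyond two threads the extra threads have to share the same two cores, so the parallel run time no longer changes noticeably. That explains the results above.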
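Since the results stop improving once the thread count exceeds the number of cores, one option is to ask the system how many cores are online before choosing thread_count. The sketch below is not part of the program in this post; `suggested_thread_count` is a hypothetical helper, and `_SC_NPROCESSORS_ONLN` is a widely available extension (Linux, macOS, the BSDs) rather than part of the POSIX base standard.

```c
#include <assert.h>
#include <unistd.h>

/* Suggest a thread count equal to the number of cores currently online.
 * Falls back to 1 if the system cannot report a value. */
long suggested_thread_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1)
        n = 1;  /* sysconf returned -1 or 0: assume a single core */
    assert(n >= 1);
    return n;
}
```

The returned value could replace the command-line argument as a default for thread_count. On a dual-core machine without hyper-threading it would normally be 2.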
