Hadoop in Action, 2nd Edition (《Hadoop實戰第2版》): Section 3.1, Why Use MapReduce

2021-11-09 21:22:04

3.1　Why Use MapReduce

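MapReduce is popular for good reasons: it is simple, easy to implement, and highly scalable. With it you can easily write programs that run on many hosts at the same time. You can write the map or reduce programs in non-Java languages such as Ruby, Python, PHP, and C++, and you can run the same program on any cluster that has Hadoop installed, no matter how many hosts the cluster contains. MapReduce is well suited to processing massive data sets because the data is processed by many hosts in parallel, which usually makes the job finish faster.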

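Consider an example.

Citation analysis is an important part of evaluating the quality of a paper. This example covers only its simplest component: counting how many times each paper is cited. Suppose there are a great many papers (on the order of millions), and the reference list of each paper has the following form: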

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In NAACL '10.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98, pages 335–336.

Dennis Chong and James N. Druckman. 2010. Identifying frames in political news. In Erik P. Bucy and R. Lance Holbert, editors, Sourcebook for Political Communication Research: Methods, Measures, and Analytical Techniques. Routledge.

Cindy Chung and James W. Pennebaker. 2007. The psychological function of function words. Social Communication: Frontiers of Social Psychology, pages 343–359.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.

Stephan Greene and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. In NAACL '09, pages 503–511.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In NAACL '09, pages 362–370.

Sanda Harabagiu, Andrew Hickl, and Finley Lacatusu. Negation, contrast and contradiction in text processing.

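On a single machine, this counting task requires first extracting the titles of all papers and storing them in a hash table, then traversing every paper, reading its citation information, and incrementing the counts one by one. Because there are so many papers, the program has to swap data between memory and disk many times, which inevitably lengthens its running time. In MapReduce, however, this is a problem that a single WordCount-style job can solve.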
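To make the WordCount analogy concrete, here is a minimal sketch in Python that simulates the map, shuffle, and reduce phases inside one process. It is not the book's code and it does not use Hadoop itself. The names `map_citations`, `reduce_counts`, and `run_job` are hypothetical, as is the toy input, and the sketch assumes the cited titles have already been extracted from each reference list (a step that is not trivial for entries like those shown above).

```python
from itertools import groupby
from operator import itemgetter


def map_citations(paper_id, cited_titles):
    """Map step: emit (cited_title, 1) for every reference in one paper."""
    for title in cited_titles:
        # Assumption: lowercasing and trimming is enough to match titles.
        yield title.strip().lower(), 1


def reduce_counts(title, counts):
    """Reduce step: sum the 1s emitted for a single cited title."""
    return title, sum(counts)


def run_job(papers):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    # Map phase: on a cluster, different hosts could handle different papers.
    pairs = [kv for paper_id, refs in papers.items()
             for kv in map_citations(paper_id, refs)]
    # Shuffle phase: sort so equal keys are adjacent, then group them.
    pairs.sort(key=itemgetter(0))
    result = {}
    for title, group in groupby(pairs, key=itemgetter(0)):
        key, total = reduce_counts(title, (count for _, count in group))
        result[key] = total
    return result


# Hypothetical toy input: paper id -> titles it cites (already extracted).
papers = {
    "paper_a": ["Latent Dirichlet Allocation", "LexRank"],
    "paper_b": ["latent dirichlet allocation"],
    "paper_c": ["LexRank", "Latent Dirichlet Allocation "],
}
citation_counts = run_job(papers)
print(citation_counts)
```

Each call to `map_citations` depends only on one paper, and each call to `reduce_counts` depends only on one title, so neither step needs a shared hash table covering every paper. That independence is what lets a framework such as MapReduce spread the work across hosts.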
