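Hadoop currently has two open-source distributions: the Apache release, and Cloudera's distribution, which is optimized on top of the Apache code base and is also known as CDH3.
The two compare as follows: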
Component | CDH3 | Apache | Description |
Hadoop Common | ● | ● | The common utilities that support the other Hadoop subprojects. |
Hadoop Distributed File System (HDFS) | ● | ● | A distributed file system that provides high-throughput access to application data. |
Hadoop MapReduce | ● | ● | A software framework for distributed processing of large data sets on compute clusters. |
Flume | ● | | A distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced. |
Sqoop | ● | | A tool that imports data from relational databases into Hadoop clusters. |
Hue | ● | | A graphical user interface to work with CDH. |
Pig | ● | ● | A high-level data-flow language and execution framework for parallel computation. Enables you to analyze large amounts of data using Pig's query language, called Pig Latin. |
Hive | ● | ● | A data warehouse infrastructure that provides data summarization and ad hoc querying. A powerful data warehousing application built on top of Hadoop that lets you access your data using Hive QL, a language similar to SQL. |
HBase | ● | ● | A scalable, distributed database that supports structured data storage for large tables. Provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). |
ZooKeeper | ● | ● | A high-performance coordination service for distributed applications. A highly reliable and available service that provides coordination between distributed processes. |
Oozie | ● | | A server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs. |
Whirr | ● | | Provides a fast way to run cloud services. |
Snappy | ● | | A compression/decompression library. |
Avro | | ● | A data serialization system. |
Cassandra | | ● | A scalable multi-master database with no single points of failure. |
Chukwa | | ● | A data collection system for managing large distributed systems. |
Mahout | | ● | A scalable machine learning and data mining library. |
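In theory, CDH3 should support all of the components of the Apache version and their subprojects.
The similarities and differences between the two Hadoop distributions are as follows: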
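Incompatible changes: Starting with CDH3b3, the hadoop.job.ugi parameter is no longer supported; use the UserGroupInformation.doAs() method instead. For other changes see: https://ccp.cloudera.com/display/CDHDOC/Incompatible+Changes
Installation: Cloudera CDH3 is based on the stable Hadoop 0.20.2 release and includes many built-in patches. CDH is available both as RPM packages and as a tarball (Cloudera recommends the RPM route), whereas Hadoop 0.20.2 is only distributed as a tarball. Cloudera CDH3 sets the JAVA_HOME environment variable automatically; with Apache Hadoop it has to be configured by hand.
Apache Hadoop maintains the cluster with the start/stop-dfs.sh and start/stop-all.sh scripts. CDH starts and stops services by running the /etc/init.d/hadoop-0.20-* scripts as root. This approach only manages the current server; if you want something like start/stop-all.sh, you have to write the script yourself.
After a successful installation, Cloudera CDH adds two users: hdfs (for the HDFS file system) and mapred (for MapReduce). With Apache Hadoop, the usual practice is to add a single hadoop user that does everything.
Cloudera CDH switches between multiple configurations through alternatives, whereas Apache Hadoop keeps its configuration files only under $HADOOP_HOME/conf.
Eclipse plugin: Cloudera CDH does not ship an Eclipse plugin by default. You have to compile it yourself, and its plugin is not compatible with the Apache Hadoop plugin.
Security: CDH3 supports Kerberos authentication, while Apache Hadoop relies on rudimentary user-name matching.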
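The note above says that a CDH cluster needs a home-made script to get start/stop-all.sh behaviour. Below is a minimal Python sketch of one way to do that, not Cloudera's own tooling. The host names, the role layout in CLUSTER, the start order, and the use of passwordless SSH as root are all assumptions made for illustration. Only the /etc/init.d/hadoop-0.20-* script prefix comes from the text.

```python
import subprocess

# Hypothetical cluster layout: host names and role assignments are assumptions.
CLUSTER = {
    "master01": ["namenode", "jobtracker"],
    "worker01": ["datanode", "tasktracker"],
    "worker02": ["datanode", "tasktracker"],
}

# Assumed start order: HDFS daemons first, then MapReduce. Stopping reverses it.
START_ORDER = ["namenode", "secondarynamenode", "datanode",
               "jobtracker", "tasktracker"]


def build_commands(cluster, action):
    """Return one ssh command (as an argv list) per daemon, in order."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    order = START_ORDER if action == "start" else list(reversed(START_ORDER))
    commands = []
    for role in order:
        for host in sorted(cluster):
            if role in cluster[host]:
                # The init scripts must run as root on the target host.
                script = "/etc/init.d/hadoop-0.20-" + role
                commands.append(["ssh", "root@" + host, script, action])
    return commands


def run_all(cluster, action, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_commands(cluster, action):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(CLUSTER, "start")  # dry run: only prints the commands
```

Because dry_run defaults to True, the sketch only prints the commands it would run; calling run_all(CLUSTER, "stop", dry_run=False) would execute them over SSH.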