
An Overview of the Hadoop Ecosystem

Drawing on the descriptions published on the official Hadoop project sites and on the software actually used in practice, this post briefly introduces the main tools in the Hadoop ecosystem, to broaden the reader's picture of the ecosystem as a whole.

[Figure: the evolution of the Hadoop ecosystem, which began with Google's three seminal papers and has grown into a full ecosystem that is still expanding rapidly]

[Figure: the ecosystem diagram from the official Hadoop site, covering most of the commonly used Hadoop-related tools]

[Figure: the Hadoop ecosystem laid out as a bottom-up stack, showing where each tool sits in the overall architecture]

[Figure: the dependencies between Hadoop's core components and the underlying system]

What follows is a brief introduction to some of the tools in the Hadoop ecosystem.

Hadoop: http://hadoop.apache.org/

Original text from the official site:

What is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.

Avro™: A data serialization system.

Cassandra™: A scalable multi-master database with no single points of failure.

Chukwa™: A data collection system for managing large distributed systems.

HBase™: A scalable, distributed database that supports structured data storage for large tables.

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout™: A scalable machine learning and data mining library.

Pig™: A high-level data-flow language and execution framework for parallel computation.

Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

ZooKeeper™: A high-performance coordination service for distributed applications.

Translation:

What is Apache Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows large data sets to be processed in a distributed fashion across clusters of computers using simple programming models. It is designed to scale from a single server up to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it can deliver a highly available service on top of a cluster of machines, each of which may fail.

The project includes the following modules:

Hadoop Common: common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.

Hadoop YARN: a framework for job scheduling and cluster resource management.

Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
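To make the "simple programming models" concrete, here is a minimal word-count sketch written against the MapReduce Java API. The class names and the input/output paths are placeholders chosen for illustration, not anything prescribed by the project.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional, but cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```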

Other Hadoop-related projects at Apache include:

Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heatmaps) and lets you inspect MapReduce, Pig and Hive applications visually, with user-friendly features for diagnosing their performance.

Avro: a data serialization system.

Cassandra: a scalable multi-master database with no single points of failure.

Chukwa: a data collection system for managing large distributed systems.

HBase: a scalable, distributed database that supports structured data storage for large tables.

Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: a scalable machine learning and data mining library.

Pig: a high-level data-flow language and execution framework for parallel computation.

Spark: a fast and general compute engine for Hadoop data. Spark offers a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Tez: a generalized data-flow framework, built on Hadoop YARN, that provides a powerful and flexible engine for executing an arbitrary DAG of tasks to process data in both batch and interactive use cases. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, as well as by commercial software (e.g. ETL tools), as a replacement for Hadoop MapReduce as the underlying execution engine.

ZooKeeper: a high-performance coordination service for distributed applications.

/****************************************************************************/

[Figure: the Ambari monitoring dashboard]

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ambari enables system administrators to:

Provision a Hadoop cluster

Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.

Ambari handles configuration of Hadoop services for the cluster.

Manage a Hadoop cluster

Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

Monitor a Hadoop cluster

Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.

Ambari leverages the Ambari Metrics System for metrics collection.

Ambari leverages the Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

Ambari enables application developers and system integrators to:

Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.

Translation: The Apache Ambari project aims to make Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.

Ambari provides an intuitive, easy-to-use web UI for Hadoop management, backed by its own RESTful APIs.

Ambari enables system administrators to:

1. Provision a Hadoop cluster

Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.

Ambari handles the configuration of the cluster's Hadoop services.

2. Manage a Hadoop cluster

Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

3. Monitor a Hadoop cluster

Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.

Ambari uses the Ambari Metrics System for metrics collection.

Ambari uses the Ambari Alert Framework for system alerting, and it notifies you when your attention is needed (for example, a node goes down or remaining disk space is low).

Ambari enables application developers and system integrators to:

Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications through the Ambari REST APIs.
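As a rough sketch of what "integrating through the REST APIs" can look like, the example below issues a plain HTTP GET against an Ambari server. The host, port, credentials, and the /api/v1/clusters path are assumptions made for illustration, not details taken from this article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClustersQuery {
  public static void main(String[] args) throws Exception {
    // Assumed endpoint: an Ambari server on localhost:8080; adjust for your cluster.
    URL url = new URL("http://localhost:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Ambari uses HTTP basic auth; "admin:admin" is only the out-of-the-box default.
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);

    // Print the JSON response describing the clusters this server manages.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```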

Latest release at the time of writing: Ambari 2.0.0.

Apache Avro™ is a data serialization system.

Avro provides:

Rich data structures.

A compact, fast, binary data format.

A container file, to store persistent data.

Remote procedure call (RPC).

Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

Avro is a data serialization system.

Avro provides:

1. Rich data structures.

2. A compact, fast, binary data format.

3. A container file for storing persistent data.

4. Remote procedure call (RPC).

5. Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; it is only an optional optimization, worth implementing only for statically typed languages.

Original text from the official site:

Schemas

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.

Translation: Schemas

Avro relies on schemas. When Avro data is read, the schema that was used to write it is always available. This allows each datum to be written with no per-value overhead, making serialization both fast and compact. It also makes Avro convenient to use from dynamic scripting languages, because the data together with its schema is fully self-describing.

When Avro data is stored in a file, its schema is stored along with it, so the file can be processed later by any program. If the program reading the data expects a different schema, the mismatch is easy to resolve, because both schemas are available.

When Avro is used for RPC, the client and server exchange schemas during the connection handshake. (This can be optimized so that, for most calls, no schema is actually transmitted.) Because the client and the server each have the other's full schema, mismatches such as same-named fields, missing fields, and extra fields can all be resolved easily.

Avro schemas are defined in JSON, which makes Avro easy to implement in languages that already have JSON libraries.
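As a rough illustration of these points, the sketch below defines a small made-up schema in JSON, writes one record to an Avro container file, and reads it back with Avro's Java GenericRecord API. The schema, field names, and file name are assumptions for the example.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
  public static void main(String[] args) throws Exception {
    // A hypothetical record schema, defined in JSON as described above.
    String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // No generated classes are needed: GenericRecord works directly from the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Write a container file; the schema is embedded in the file itself.
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back; the reader recovers the writer's schema from the file.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}
```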

Comparison with other systems

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.

Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.

No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.


Comparison with other systems

Avro provides functionality similar to systems such as Thrift and Protocol Buffers, but it differs from them in a few fundamental ways:

1. Dynamic typing: Avro does not require code generation. Data is always accompanied by a schema that allows the data to be processed fully without generated code, static datatypes, and so on. This makes it easier to build generic data-processing systems and languages.

2. Untagged data: Because the schema is available when the data is read, considerably less type information needs to be encoded with the data, which keeps the serialized size small.

3. No manually assigned field IDs: When a schema changes, both the old and the new schema are available while processing the data, so differences can be resolved symbolically, by field name.

Latest release at the time of writing: Avro 1.7.7 (23 July 2014).

Getting started with installing and configuring Cassandra: http://www.ibm.com/developerworks/cn/opensource/os-cn-cassandra/


Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's Bigtable. Like Dynamo, Cassandra is eventually consistent. Like Bigtable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (Facebook engineer). In a lot of ways you can think of Cassandra as Dynamo 2.0 or a marriage of Dynamo and Bigtable. Cassandra is in production use at Facebook but is still under heavy development.

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It combines the distributed systems techniques of Dynamo with the data model of Google's Bigtable. Like Dynamo, Cassandra is eventually consistent; like Bigtable, it offers a column-family-based data model that is richer than a typical key-value system.

Cassandra was open-sourced by Facebook in 2008. It was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In many ways you can think of Cassandra as Dynamo 2.0, or as a marriage of Dynamo and Bigtable. Cassandra is in production use at Facebook but is still under heavy development.

Latest release at the time of writing: Apache Cassandra 2.1.4 (released 2015-04-01).

Chukwa in practice at Baidu: http://baidutech.blog.51cto.com/4114344/748261/

[Figure: Chukwa architecture]

Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Chukwa is an open-source data collection system for monitoring large distributed systems. It is built on top of HDFS and MapReduce and inherits Hadoop's scalability and robustness. To make the best use of the collected data, Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing the results.

Latest release at the time of writing: version 0.6.0 (published 2015-03-24).

Pseudo-distributed HBase installation: http://blog.csdn.net/kisssun0608/article/details/44872027

HBase cluster installation: http://blog.csdn.net/kisssun0608/article/details/44872059

HBase basics and shell operations: http://blog.csdn.net/kisssun0608/article/details/44872099

An introduction to HBase: http://www.uml.org.cn/sjjm/201212141.asp

[Figure: HBase architecture]

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

When would I use Apache HBase?

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.

Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

Linear and modular scalability.

Strictly consistent reads and writes.

Automatic and configurable sharding of tables.

Automatic failover support between RegionServers.

Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.

Easy to use Java API for client access.

Block cache and Bloom filters for real-time queries.

Query predicate push down via server side filters.

Thrift gateway and a REST-ful web service that supports XML, Protobuf, and binary data encoding options.

Extensible JRuby-based (JIRB) shell.

Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

Translation: Apache HBase is the Hadoop database, a distributed, scalable big data store.

When would you use Apache HBase?

Use HBase when you need random, real-time read/write access to your big data. The project's goal is to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware.

HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable ("Bigtable: A Distributed Storage System for Structured Data" by Chang et al.). Just as Bigtable builds on the distributed storage provided by the Google File System (GFS), HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features:

Linear and modular scalability.

Strictly consistent reads and writes.

Automatic, configurable sharding of tables.

Automatic failover between RegionServers.

Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.

An easy-to-use Java API for client access (see the sketch after this list).

Block cache and Bloom filters for real-time queries.

Query predicate push-down via server-side filters.

A Thrift gateway and a RESTful web service supporting XML, Protobuf, and binary data encodings.

An extensible JRuby-based (JIRB) shell.

Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
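Below is a minimal sketch of that client Java API using the Connection/Table interfaces introduced around HBase 1.0; the table name, column family, and values are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "row1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
      table.put(put);

      // Random, real-time read of the same row.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(name));
    }
  }
}
```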

Latest release at the time of writing: 1.0.0.

[Figure: how Hive works]

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The Apache Hive data warehouse software makes it easy to query and manage large datasets that reside in distributed storage. Hive provides a mechanism for projecting structure onto this data and querying it with a SQL-like language called HiveQL. At the same time, when it is inconvenient or inefficient to express some logic in HiveQL, the language lets traditional MapReduce programmers plug in their own custom mappers and reducers.
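To give a feel for what querying with HiveQL looks like from an application, here is a minimal sketch that talks to HiveServer2 through Hive's JDBC driver; the connection URL, credentials, table, and query are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the host, port, and database below are assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {

      // A HiveQL query over a hypothetical "page_views" table.
      // Hive compiles this into MapReduce (or Tez) jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```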

Latest release at the time of writing: 1.1.0 (8 March 2015).

[Figure: where Mahout's algorithms apply]

The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications.

The three major components of Mahout are an environment for building scalable algorithms, many new Scala + Spark (H2O in progress) algorithms, and Mahout's mature Hadoop MapReduce algorithms.

The Mahout project's goal is to build an environment for quickly creating scalable, high-performance machine learning applications.

Mahout's three major components are an environment for building scalable algorithms, many new Scala + Spark algorithms (with H2O support in progress), and Mahout's mature Hadoop MapReduce algorithms.

Version 0.10 was released on 11 April 2015.

Apache Mahout introduces a new math environment we call Samsara, for its theme of universal renewal. It reflects a fundamental rethinking of how scalable machine learning algorithms are built and customized. Mahout-Samsara is here to help people create their own math while providing some off-the-shelf algorithm implementations. At its core are general linear algebra and statistical operations along with the data structures to support them. You can use it as a library or customize it in Scala with Mahout-specific extensions that look something like R. Mahout-Samsara comes with an interactive shell that runs distributed operations on a Spark cluster. This makes prototyping or task submission much easier and allows users to customize algorithms with a whole new degree of freedom.

Mahout algorithms include many new implementations built for speed on Mahout-Samsara. They run on Spark and some on H2O, which means as much as a 10x speed increase. You'll find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity enables the next generation of cooccurrence recommenders that can use entire user click streams and context in making recommendations.

Mahout introduces a new math environment called Samsara, named for its theme of universal renewal. It reflects a fundamental rethinking of how scalable machine learning algorithms are built and customized. Mahout-Samsara helps people create their own math while also providing some off-the-shelf algorithm implementations. At its core are general linear algebra and statistical operations, together with the data structures to support them. You can use it as a library, or customize it in Scala with Mahout-specific extensions that look somewhat like R. Mahout-Samsara ships with an interactive shell that runs distributed operations on a Spark cluster, which makes prototyping and job submission much easier and gives users a whole new degree of freedom in customizing algorithms.

Mahout's algorithms include many new implementations built for speed on Mahout-Samsara. They run on Spark, and some on H2O, which can mean speedups of as much as 10x. You will find robust matrix decomposition algorithms, as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence recommenders, which can use entire user click streams and context when making recommendations.

Latest release at the time of writing: 0.10.0.

Processing data with Pig: http://www.ibm.com/developerworks/cn/linux/l-apachepigdataquery/

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility. Users can create their own functions to do special-purpose processing.

Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with the infrastructure for evaluating those programs. The salient property of Pig programs is that their structure lends itself to substantial parallelization, which in turn lets them handle very large data sets.

At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming: achieving parallel execution of simple, "embarrassingly parallel" data analysis tasks is trivial. Complex tasks made up of multiple interrelated data transformations are encoded explicitly as data flow sequences, which makes them easy to write, understand, and maintain.

Optimization opportunities: the way tasks are encoded allows the system to optimize their execution automatically, so the user can focus on semantics rather than efficiency.

Extensibility: users can create their own functions for special-purpose processing.

Latest release at the time of writing: Apache Pig 0.14.0, which includes Pig on Tez and OrcStorage.

Apache Spark™ is a fast and general engine for large-scale data processing.

Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Ease of use: Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.

Generality: Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3. You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Spark is a fast, general-purpose engine for large-scale data processing.

Speed: programs run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory computing.

Ease of use: you can write applications quickly in Java, Scala, or Python. Spark offers more than 80 high-level operators that make it easy to build parallel applications (a small example follows below), and you can use it interactively from the Scala and Python shells.

Generality: Spark combines SQL, streaming, and complex analytics. It powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming, and you can combine these libraries seamlessly in the same application.

Runs everywhere: Spark runs on Hadoop, on Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can easily run Spark in its standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos, and it can read from HDFS, HBase, Cassandra, and any other Hadoop data source.
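To illustrate those high-level operators, here is a minimal word-count sketch against the Spark 1.x Java API that was current when this article was written (Spark 2.x changed flatMap to expect an Iterator); the application name and input path are placeholders.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs inside the current JVM; on a cluster you would use spark-submit instead.
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Transformations only build up the DAG; nothing executes until an action is called.
    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");   // placeholder path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")))   // Spark 1.x signature (returns Iterable)
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // collect() is the action that triggers the distributed computation.
    for (Tuple2<String, Integer> pair : counts.collect()) {
      System.out.println(pair._1() + "\t" + pair._2());
    }

    sc.stop();
  }
}
```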

[Figures from the Spark site: speed, generality, runs everywhere, ease of use]

Latest releases at the time of writing: Spark 1.2.2 and 1.3.1.

[Figure: a workflow made up of multiple MapReduce jobs; each job writes its intermediate results to HDFS, and the reducers of one step feed data to the mappers of the next step]

[Figure: the same workflow on Tez; the whole processing pipeline completes within a single job, and the tasks do not need to touch HDFS in between]

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The 2 main design themes for Tez are:

Empowering end users by:

Expressive dataflow definition APIs

Flexible Input-Processor-Output runtime model

Data type agnostic

Simplifying deployment

Execution performance:

Performance gains over Map Reduce

Optimal resource management

Plan reconfiguration at runtime

Dynamic physical data flow decisions

By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, that earlier took multiple MR jobs, now in a single Tez job as shown below.

The Tez project aims to build an application framework that allows a complex directed acyclic graph (DAG) of tasks to process data. It is currently built on top of YARN.

Tez's two main design themes are:

Empowering end users with:

Expressive dataflow-definition APIs

A flexible Input-Processor-Output runtime model

Data-type agnosticism

Simplified deployment

Execution performance:

Performance gains over MapReduce

Optimal resource management

Plan reconfiguration at runtime

Dynamic physical data-flow decisions

By letting projects such as Hive and Pig run a complex DAG of tasks, Tez can process data that previously required multiple MapReduce jobs within a single Tez job, as the figures above show.

The Tez API includes the following components:

DAG (directed acyclic graph): defines the overall job. One DAG object corresponds to one job.

Vertex: defines the user logic, plus the resources and environment needed to execute it. One vertex corresponds to one step in the job.

Edge: defines the connection between a producer vertex and a consumer vertex.

Edges must be assigned properties; Tez requires them in order to expand the logical graph at runtime into the physical set of tasks that run in parallel on the cluster. Some of these properties are:

Data-movement properties, which define how data moves from a producer to a consumer.

Scheduling properties (sequential or concurrent), which define when the producer and consumer tasks should be scheduled relative to each other.

Data-source properties (persisted, reliable, or ephemeral), which define the lifetime and durability of a task's output and so determine when it can safely be discarded.

Latest release at the time of writing: Apache Tez 0.6.0.

ZooKeeper fundamentals and cluster installation/configuration: http://blog.csdn.net/kisssun0608/article/details/44871853


Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

ZooKeeper is an effort to develop and maintain an open-source server that enables highly reliable distributed coordination.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Distributed applications use all of these kinds of services in one form or another. Every time they are re-implemented, a great deal of work goes into fixing the inevitable bugs and race conditions. Because these services are hard to implement, applications usually skimp on them at first, which makes the applications brittle in the face of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
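A minimal sketch of ZooKeeper's Java client API, creating a znode and reading it back; the connection string, session timeout, and znode path are illustrative assumptions.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
  public static void main(String[] args) throws Exception {
    // Block until the session with the ensemble (assumed at localhost:2181) is established.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();

    // Create a znode holding a small piece of shared configuration.
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "v1".getBytes("UTF-8"),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back; other processes coordinating through ZooKeeper see the same value.
    byte[] data = zk.getData(path, false, null);
    System.out.println(path + " = " + new String(data, "UTF-8"));

    zk.close();
  }
}
```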

Latest release at the time of writing: 3.4.6 (10 March 2014).

[Figure: Sqoop workflow]

Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache project.

Latest stable release is 1.4.5. Latest cut of Sqoop2 is 1.99.5. Note that 1.99.5 is not compatible with 1.4.5 and not feature complete, it is not intended for production deployment.

Sqoop is a tool for moving data back and forth between Hadoop and relational databases: it can import data from a relational database (for example MySQL, Oracle, or Postgres) into HDFS, and export data from HDFS back into a relational database.

It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses a metadata model to infer data types and to ensure type-safe handling as data moves from the source into Hadoop. Sqoop is designed for bulk transfer of big data: it can split a data set into partitions and launch Hadoop tasks to process each partition.

[Figure: Flume workflow]

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Translation: Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms, and it uses a simple, extensible data model that supports online analytic applications.

Log collection

Flume was originally a log collection system provided by Cloudera and is currently an Apache incubator project. Flume lets you plug custom data senders into your logging pipeline in order to collect data.

Data processing

Flume can perform simple processing of the data and write it to a variety of (customizable) data sinks. It can collect data from sources such as the console, RPC (Thrift-RPC), text files, tail (Unix tail), syslog (the syslog logging system, supporting both TCP and UDP), and exec (command execution).

Latest release at the time of writing: Apache Flume 1.5.2 (November 18, 2014).

Impala

[Figure: how Impala works]

Impala architecture analysis

Impala is a new query system whose development is led by Cloudera. It offers SQL semantics and can query petabytes of data stored in Hadoop's HDFS and in HBase. The existing Hive system also offers SQL semantics, but because Hive executes on the MapReduce engine underneath, it remains a batch process and struggles to deliver interactive queries. By contrast, Impala's biggest feature, and its biggest selling point, is speed. So how does Impala query big data so quickly? Before answering that question, it helps to introduce Google's Dremel system, because Impala was originally designed with Dremel as its reference.

Dremel is Google's interactive data analysis system. It is built on top of systems such as GFS (the Google File System) and underpins many Google services, including the BigQuery data analysis service. Dremel has two main technical highlights: first, it implements columnar storage for nested data; second, it uses a multi-level query tree, so a query can be executed, and its results aggregated, across thousands of nodes in parallel. Columnar storage is nothing new in relational databases: it reduces the amount of data that has to be read for a query and so improves query efficiency. What is different about Dremel's columnar storage is that it targets nested data rather than traditional relational data. Dremel can convert records with nested structure into columnar form; at query time it reads only the columns the query needs, applies the filter conditions, and then reassembles the columns into nested records for output. Both the forward and the reverse conversion are implemented with efficient state machines. In addition, Dremel's multi-level query tree borrows from the design of distributed search engines: the root node of the query tree receives a query and dispatches it to the next level, the bottom-level nodes do the actual data reading and query execution, and the results are then passed back up to the higher levels.

In Cloudera's tests, Impala's query performance is an order of magnitude better than Hive's. From a technical standpoint, Impala owes its performance mainly to the following:

Impala does not need to write intermediate results to disk, which saves a large amount of I/O.

It avoids the overhead of launching MapReduce jobs. MapReduce starts tasks slowly (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.

Impala abandons MapReduce entirely, since MapReduce is not well suited to SQL queries. Like Dremel, it instead borrows ideas from MPP parallel databases and starts from scratch, which allows more query optimizations and avoids unnecessary shuffle and sort overhead.

It uses LLVM to compile the runtime query code, avoiding the unnecessary overhead that comes with supporting generic, interpreted execution.

It is implemented in C++ and includes many hardware-specific optimizations, such as the use of SSE instructions.

It uses an I/O scheduling mechanism that is aware of data locality, placing data and computation on the same machine whenever possible and thereby reducing network overhead.

The translations come from the documentation on the Apache project sites; given my limited English and domain expertise, corrections and discussion of any mistranslations are welcome.