Source: https://github.com/nathanmarz/storm/wiki/Rationale
Translated: 2012-04014
Rationale
The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.
However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:
- Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about makes up a relatively small percentage of your codebase.
- Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.
- Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers so they know the new locations to send messages to. This introduces moving parts and new pieces that can fail.
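To make the limitations above concrete, here is a minimal sketch of a hand-built queues-and-workers pipeline using only Python's standard library. Everything in it (the `worker` and `run_pipeline` names, the two toy stages, the `STOP` sentinel) is hypothetical and chosen for illustration; it is not taken from Storm or from any particular queueing system.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def worker(inbox, outbox, transform):
    """Pull messages off `inbox`, transform them, push results to `outbox`."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            outbox.put(STOP)  # pass the shutdown signal to the next stage
            return
        # If transform() raises, this thread dies and nothing restarts it.
        outbox.put(transform(msg))


def run_pipeline(messages):
    # Routing is wired by hand: every queue and every hop is spelled out.
    q_raw, q_parsed, q_done = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=worker, args=(q_raw, q_parsed, str.strip)),
        threading.Thread(target=worker, args=(q_parsed, q_done, str.upper)),
    ]
    for t in stages:
        t.start()
    for m in messages:
        q_raw.put(m)
    q_raw.put(STOP)
    for t in stages:
        t.join()

    # Drain the final queue up to the shutdown sentinel.
    results = []
    while True:
        item = q_done.get()
        if item is STOP:
            break
        results.append(item)
    return results
```

Even in this toy version, most of the code is wiring rather than processing logic (the two `transform` functions). Adding a second worker to a stage would mean creating more queues and changing every sender that feeds it, which is the scaling pain described in the list.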
Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?
Storm satisfies these goals.
Why Storm is important
Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.
The key properties of Storm are:
- Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfies a stunning number of use cases.
- Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's use of ZooKeeper for cluster coordination makes it scale to much larger cluster sizes.
- Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, in direct contrast with other systems like S4.
- Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
- Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
- Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
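The scalability property above treats parallelism as a setting rather than something wired into the code. The following toy sketch illustrates that idea only; it does not use or imitate Storm's API, and the names `task_for` and `count_words` are hypothetical. It assumes messages are routed to tasks by a stable hash of their key, so the same processing logic gives the same answer at any parallelism.

```python
import zlib


def task_for(key, parallelism):
    """Pick a task index for `key` with a stable hash (crc32, not hash())."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def count_words(words, parallelism):
    """Count words across `parallelism` independent tasks, then merge."""
    tasks = [dict() for _ in range(parallelism)]
    for w in words:
        counts = tasks[task_for(w, parallelism)]
        counts[w] = counts.get(w, 0) + 1

    merged = {}
    for counts in tasks:
        # Keys never overlap: each word always maps to exactly one task.
        merged.update(counts)
    return merged
```

Calling `count_words(words, 1)` and `count_words(words, 4)` returns the same counts; only the number of tasks sharing the work changes. In this sketch the tasks run one after another in a single process, so it shows the routing idea and not actual distributed execution.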