
Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka


This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms. While the SMACK stack is really concise and consists of only several components, it is possible to implement different system designs within it, which include not only purely batch or stream processing but also more complex Lambda and Kappa architectures.

<b>What Is the SMACK Stack?</b>

First, let's talk a little bit about what SMACK is. Here's a quick rundown of the technologies included in it:


• Spark - a fast and general engine for distributed, large-scale data processing.
• Mesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.
• Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
• Cassandra - a distributed, highly available database designed to handle large amounts of data across multiple datacenters.
• Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds.

<b>Storage Layer: Cassandra</b>

Although not in alphabetical order, let's first start with the C in SMACK. Cassandra is well known for its high-availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster node failures. In terms of the CAP theorem, Cassandra provides tunable consistency/availability for operations.


What is most interesting here is that when it comes to data processing, Cassandra is linearly scalable (increased loads can be addressed by just adding more nodes to the cluster) and it provides cross-datacenter replication (XDCR) capabilities. Actually, XDCR provides not only data replication but also supports a whole set of use cases, including:

• Geo-distributed datacenters handling data specific to the region or located closer to customers.
• Data migration across datacenters: recovery after failures or moving data to a new datacenter.
• Separate operational and analytics workloads.

However, all these features come at a price, and with Cassandra this price is its data model. It can be thought of as a nested sorted map distributed across cluster nodes by partition key, with entries sorted/grouped by clustering columns. An example is provided below:

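To make the nested-sorted-map analogy concrete, here is a purely conceptual Scala sketch (the key and value types are illustrative and are not an actual Cassandra API):

    // Conceptual only: a Cassandra table seen as a nested sorted map,
    // distributed across nodes by partition key and sorted by clustering columns.
    object CassandraDataModel {
      type PartitionKey  = String                           // e.g. campaign id
      type ClusteringKey = (String, java.util.UUID)         // e.g. (day, event id)
      type Columns       = Map[String, Any]                 // remaining column values
      type Table         = Map[PartitionKey, scala.collection.SortedMap[ClusteringKey, Columns]]
    }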

To get specific data in some range, the full key must be specified, and no range clauses are allowed except for the last column in the list. This constraint is introduced to limit multiple scans over different ranges, which would produce random disk access and lower the performance. This means that the data model should be carefully designed against the read queries, to limit the number of reads/scans, which in turn leads to less flexibility when it comes to supporting new queries.

But what if some tables need to be somehow joined with other tables? Let's consider the following case: calculate the total views per campaign for a given month, for all campaigns.

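For the rest of the example, assume two hypothetical Cassandra tables behind this query, sketched here as Scala case classes (the column layout is an assumption made for illustration):

    // Hypothetical tables:
    //   campaigns(id, name)                      -- campaign dictionary
    //   events(campaign_id, month, day, views)   -- raw per-day view counts
    case class Campaign(id: java.util.UUID, name: String)
    case class Event(campaignId: java.util.UUID, month: String, day: String, views: Long)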

With the given model, the only way to achieve this goal is to read all campaigns, read all events, sum the properties (matching on campaign IDs), and assign them to campaigns. Implementing such applications can be really challenging because the amount of data stored in Cassandra may be huge and exceed the memory capacity. Therefore, this sort of data should be processed in a distributed manner, and Spark perfectly fits this type of use case.


The main abstraction Spark operates with is the RDD (Resilient Distributed Dataset, a distributed collection of elements), and the workflow consists of four main phases:

1. RDD operations (transformations and actions) form a DAG (directed acyclic graph).
2. The DAG is split into stages of tasks, which are then submitted to the cluster manager.
3. Stages combine tasks which don't require shuffling/repartitioning.
4. Tasks run on workers, and the results are then returned to the client.

Here's how one can solve the above problem with Spark and Cassandra:

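Below is a rough sketch of such a job using the spark-cassandra-connector and the hypothetical schema above (keyspace, table, and column names are assumptions, not the article's original code):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // spark-cassandra-connector

    object TotalViewsPerCampaign {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("total-views-per-campaign")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        val month = "2015-11"

        // read raw events, keep only the requested month, sum views per campaign
        val viewsPerCampaign = sc
          .cassandraTable("analytics", "events")
          .filter(row => row.getString("month") == month)
          .map(row => (row.getUUID("campaign_id"), row.getLong("views")))
          .reduceByKey(_ + _)

        // join with the campaign dictionary and write the result back to Cassandra
        val campaigns = sc
          .cassandraTable("analytics", "campaigns")
          .map(row => (row.getUUID("id"), row.getString("name")))

        viewsPerCampaign
          .join(campaigns)
          .map { case (id, (views, name)) => (id, name, month, views) }
          .saveToCassandra("analytics", "campaign_monthly_views",
            SomeColumns("campaign_id", "name", "month", "views"))

        sc.stop()
      }
    }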

Interaction with Cassandra is performed via the spark-cassandra-connector, which makes the whole process easy and straightforward. There's one more interesting option for working with NoSQL stores, and that's SparkSQL, which translates SQL statements into a series of RDD operations.

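A minimal sketch (keyspace and table names are, again, assumptions): the Cassandra table is registered as a temporary table and queried with plain SQL, which Spark translates into RDD operations under the hood.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: the SparkContext from the previous sketch

    // expose the Cassandra table to SparkSQL via the connector's data source
    sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()
      .registerTempTable("events")

    val monthlyViews = sqlContext.sql(
      """SELECT campaign_id, SUM(views) AS total_views
        |FROM events
        |WHERE month = '2015-11'
        |GROUP BY campaign_id""".stripMargin)

    monthlyViews.show()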

With just several lines of code like these, it's possible to implement a naive Lambda design (merging a pre-computed batch view with fresh data at query time), which of course could be much more sophisticated, but this example shows just how easily this can be achieved.

The spark-cassandra-connector is data-locality aware and reads the data from the closest node in the cluster, thus minimizing the amount of data transferred over the network. To fully facilitate the spark-c* connector's data locality awareness, Spark workers should be collocated with Cassandra nodes.

[Image: Spark workers collocated with Cassandra nodes]

The above image illustrates Spark collocation with Cassandra. It also makes sense to separate your operational (or write-heavy) cluster from the one used for analytics. Here's why:

• Clusters can be scaled independently.
• Data is replicated by Cassandra, with no extra work needed.
• The analytics cluster has different read/write load patterns.
• The analytics cluster can contain additional data (for example, dictionaries) and processing results.
• Spark's resource impact is limited to only one cluster.

Let's look at the Spark application deployment options one more time:

[Image: Spark application deployment options with different cluster resource managers]

As can be seen above, there are three main options available for the cluster resource manager:
• Standalone - Spark (as the master node) and workers are installed and executed as standalone applications (which obviously introduces some overhead and supports only static resource allocation per worker).
• YARN - works very well if you already have Hadoop.
• Mesos - from the beginning, Mesos was designed for dynamic allocation of cluster resources, not only for running Hadoop applications but for handling heterogeneous workloads.

<b>Mesos Architecture</b>

The M in SMACK stands for Mesos. A Mesos cluster consists of master nodes, which are responsible for resource offers and scheduling, and slave nodes, which do the actual heavy lifting of task execution.


In HA mode with multiple master nodes, ZooKeeper is used for leader election and service discovery. Applications executed on Mesos are called frameworks and utilize APIs to handle resource offers and submit tasks to Mesos. Generally, the task execution process consists of the following steps:

1. Slave nodes provide available resources to the master node.
2. The master node sends resource offers to frameworks.
3. The scheduler replies with tasks and the resources needed per task.
4. The master node sends the tasks to slave nodes.

<b>Bringing Spark, Mesos, and Cassandra Together</b>

As said before, Spark workers should be collocated with Cassandra nodes to enforce data locality awareness, thus lowering the amount of network traffic and the Cassandra cluster load. Here's one of the possible deployment scenarios showing how to achieve this with Mesos:


• Mesos master nodes and ZooKeepers are collocated.
• Mesos slave nodes and Cassandra nodes are collocated to enforce better data locality for Spark.
• Spark binaries are deployed to all worker nodes, and spark-env.sh is configured with the proper master endpoints and executor jar location.
• The Spark executor jar is uploaded to S3/HDFS.

With the provided setup, a Spark job can be submitted to the cluster with a simple spark-submit invocation from any worker node that has the Spark binaries installed and the assembly jar with the actual job logic uploaded.

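For illustration, such an invocation might look roughly like this (the job class, ZooKeeper hosts, executor package URL, and jar path are placeholders):

    # placeholders: job class, ZooKeeper quorum, Spark package location, assembly jar path
    spark-submit \
      --class io.example.TotalViewsPerCampaign \
      --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos \
      --conf spark.executor.uri=https://s3.amazonaws.com/example-bucket/spark-1.5.0-bin-hadoop2.6.tgz \
      /opt/jobs/analytics-assembly.jar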

There are also options to run Dockerized Spark, so that there's no need to distribute binaries to every single cluster node.

<b>Scheduled and Long-Running Task Execution</b>

Every data processing system sooner or later faces the necessity of running two types of jobs: scheduled/periodic jobs, like periodic batch aggregations, and long-running ones, which is the case for stream processing. The main requirement for both of these types is fault tolerance: jobs must continue running even in the case of cluster node failures. The Mesos ecosystem comes with two outstanding frameworks, each supporting one of these types of jobs.

Marathon is a framework for fault-tolerant execution of long-running tasks. It supports HA mode with ZooKeeper, can run Docker containers, and has a nice REST API. Here's an example of a simple job configuration that uses a shell command to run spark-submit:

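For illustration, an application definition posted to Marathon's REST API might look roughly like this (the id, command, and resource values are made up):

    {
      "id": "/analytics/streaming-job",
      "cmd": "/opt/spark/bin/spark-submit --class io.example.StreamingJob --master mesos://zk://zk1:2181/mesos /opt/jobs/analytics-assembly.jar",
      "cpus": 1,
      "mem": 1024,
      "instances": 1
    }

If the process dies, Marathon restarts it, which gives long-running jobs the required fault tolerance.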

Chronos has the same characteristics as Marathon but is designed for running scheduled jobs; in general, it is a distributed, HA cron supporting graphs of jobs. Here's an example configuration of an S3 compaction job, which is implemented as a simple bash script:

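For illustration, a Chronos job definition might look roughly like this (the script path, schedule, and owner are made up):

    {
      "name": "s3-compaction",
      "command": "/opt/jobs/compact-s3.sh",
      "schedule": "R/2015-11-30T04:00:00Z/PT24H",
      "epsilon": "PT30M",
      "owner": "analytics@example.com"
    }

The schedule field uses the ISO 8601 repeating-interval notation; here the job would run every 24 hours.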

Up till now we have the storage layer designed, resource management set up, and jobs configured. The only thing which is not there yet is the data to process:


Assuming that incoming data will arrive at high rates, the endpoints which receive it should meet the following requirements:
• Provide high throughput/low latency
• Be resilient
• Allow easy scalability
• Support back pressure

On second thought, back pressure is not a must, but it would be nice to have it as an option to handle load spikes. Akka perfectly fits these requirements, and basically it was designed to provide this feature set. Here is a quick rundown of the benefits you can expect to get from Akka:

• An actor model implementation for the JVM
• Message-based and asynchronous
• Enforcement of non-shared mutable state
• Easily scalable from one process to a cluster of machines
• Actors form hierarchies with parental supervision
• Not only a concurrency framework: akka-http, akka-streams, and akka-persistence

Here is a simplified example of three actors which handle a JSON HttpRequest, parse it into a domain model case class, and save it to Cassandra:

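Below is a condensed sketch of the idea, collapsed into a single persisting actor (the class names, endpoint, and table layout are made up):

    import akka.actor.{Actor, ActorSystem, Props}
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer
    import com.datastax.driver.core.Cluster
    import spray.json.DefaultJsonProtocol._

    final case class Event(campaignId: String, eventType: String, value: Long)

    class EventPersistenceActor extends Actor {
      // one Cassandra session per actor for simplicity; a real setup would share it
      private val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("analytics")
      private val insert = session.prepare(
        "INSERT INTO events_raw (campaign_id, event_type, value) VALUES (?, ?, ?)")

      def receive: Receive = {
        case e: Event => session.execute(insert.bind(e.campaignId, e.eventType, Long.box(e.value)))
      }
    }

    object EventIngestionApp extends App {
      implicit val system = ActorSystem("ingestion")
      implicit val materializer = ActorMaterializer()
      implicit val eventFormat = jsonFormat3(Event)   // spray-json (un)marshalling for Event

      private val persister = system.actorOf(Props[EventPersistenceActor], "persister")

      val route =
        path("events") {
          post {
            entity(as[Event]) { event =>      // JSON request body parsed into the case class
              persister ! event               // fire-and-forget write to Cassandra
              complete("ok")
            }
          }
        }

      Http().bindAndHandle(route, "0.0.0.0", 8080)
    }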

It looks like only several lines of code are needed to make everything work, but writing raw data (events) to Cassandra with Akka raises the following problems:

• Cassandra is still designed for fast serving rather than batch processing, so pre-aggregation of incoming data is needed.
• The computation time of aggregations/rollups will grow with the amount of data.
• Actors are not suitable for performing aggregation due to the stateless design model.
• Micro-batches could partially solve the problem.
• Some sort of reliable buffer for raw data is still needed.


Here's an example of publishing JSON data received over HTTP to Kafka with akka-http:

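A rough sketch (the topic name, endpoint, and broker addresses are assumptions): the service does not interpret the JSON at all, it simply forwards the raw payload to Kafka, which acts as the reliable buffer for raw data mentioned above.

    import java.util.Properties
    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaIngestionApp extends App {
      implicit val system = ActorSystem("kafka-ingestion")
      implicit val materializer = ActorMaterializer()

      private val props = new Properties()
      props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      private val producer = new KafkaProducer[String, String](props)

      val route =
        path("events") {
          post {
            entity(as[String]) { json =>
              // forward the raw JSON payload to the buffer topic
              producer.send(new ProducerRecord[String, String]("events_raw", json))
              complete("ok")
            }
          }
        }

      Http().bindAndHandle(route, "0.0.0.0", 8080)
    }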

While Akka could still be used for consuming stream data from Kafka, having Spark in your ecosystem brings Spark Streaming as an alternative option that:
• Supports a variety of data sources
• Provides at-least-once semantics
• Offers exactly-once semantics with Kafka Direct and idempotent storage


Here's an example of consuming an event stream from Kinesis with Spark Streaming:

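A sketch under assumptions (the stream name, region, application name, and batch interval are made up): consume raw events from Kinesis and count them per micro-batch.

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    object KinesisConsumerApp extends App {
      val conf = new SparkConf().setAppName("kinesis-consumer")
      val ssc  = new StreamingContext(conf, Seconds(10))

      val events = KinesisUtils.createStream(
        ssc, "kinesis-consumer", "events_stream",
        "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
        InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

      events
        .map(bytes => new String(bytes, "UTF-8"))   // records arrive as byte arrays
        .count()
        .print()

      ssc.start()
      ssc.awaitTermination()
    }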

Backups are usually the most boring part of any system, but it's really important to protect the data from loss in every possible way, whether a datacenter becomes unavailable or a breakdown needs to be analyzed afterwards.

So why not store the data in Kafka/Kinesis?


At the moment of writing this article, Kinesis is the only one of the two that can retain the data without backups if all processing results are lost. While Kafka also supports long retention periods, the cost of hardware ownership should be considered: for example, S3 storage is much cheaper than multiple instances running Kafka, and S3's SLAs are really good.

Apart from having backups, restore/patch strategies should be designed upfront and tested so that any problems with data can be quickly fixed. Programmers' mistakes in an aggregation job or deletion of duplicated data may break the accuracy of the computation results, so it is very important to have the capability of fixing such errors. One thing that makes all these operations easier is enforcing idempotency in the data model, so that multiple repetitions of the same operation produce the same result (for example, a SQL UPDATE is an idempotent operation, while a counter increment is not).

Here is an example of a Spark job which reads an S3 backup and loads it into Cassandra:

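A sketch under assumptions (the bucket, path, record layout, and table names are made up): restore one day of raw events from an S3 backup into Cassandra.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object RestoreFromS3 extends App {
      val sc = new SparkContext(new SparkConf().setAppName("restore-from-s3"))

      sc.textFile("s3n://backups/events/2015-11-30/*.gz")
        .map(_.split("\t"))                                       // assumed tab-separated backup records
        .map(fields => (fields(0), fields(1), fields(2).toLong))  // (campaign_id, event_type, value)
        .saveToCassandra("analytics", "events_raw",
          SomeColumns("campaign_id", "event_type", "value"))

      sc.stop()
    }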

<b>SMACK: The Big Picture</b>

This concludes our broad description of SMACK. To allow you to better visualize the design of a data platform built with SMACK, here's a visual depiction of the architecture:

[Image: the architecture of a data platform built with SMACK]

In the article above, we talked about some of the basics of using SMACK. To finish, here is a quick rundown of its main advantages:

• A concise toolbox for a wide variety of data processing scenarios
• Battle-tested and widely used software with large support communities
• Easy scalability and replication of data while preserving low latencies
• Unified cluster management for heterogeneous loads
• A single platform for any kind of application
• An implementation platform for different architecture designs (batch, streaming, Lambda, or Kappa)
• Really fast time-to-market (for example, for MVP verification)

Thanks for reading. If anyone has experience developing applications using SMACK, please leave some comments.
