
Setting up an environment with Shark 0.9.1, Spark 0.9.1, and Hadoop 2.3.0 (OS: RedHat 6.4)

OS version: RedHat 6.4

1. Download Hadoop 2.3.0 from the official Apache site and configure it correctly (not covered in detail here).

2. Download the Spark 0.9.1 source code: http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.1/spark-0.9.1.tgz

Note that, as of this writing, the officially provided Spark binaries are built against CDH5 / Hadoop 2.2.0, and Spark 0.9.1 still has a small problem on Hadoop 2.3.0:

When Spark starts it reads yarn.application.classpath from yarn-site.xml; if this parameter is not explicitly configured, the value it gets back is null, and the following exception is thrown:

Exception in thread "main" java.lang.NullPointerException

        at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)

        at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)

        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)

        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

        at org.apache.spark.deploy.yarn.Client$.populateHadoopClasspath(Client.scala:498)

        at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:519)

        at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:333)

        at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:94)

        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:78)

        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:125)

        at org.apache.spark.SparkContext.<init>(SparkContext.scala:200)

        at shark.SharkContext.<init>(SharkContext.scala:42)

        at shark.SharkContext.<init>(SharkContext.scala:61)

        at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:78)

        at shark.SharkEnv$.init(SharkEnv.scala:38)

        at shark.SharkCliDriver.<init>(SharkCliDriver.scala:278)

        at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)

        at shark.SharkCliDriver.main(SharkCliDriver.scala)

The source needs a small modification and then has to be recompiled against Hadoop 2.3.0. Because the build downloads dependencies over the internet and the dependency relationships are fairly complex, machines without public internet access are out of luck for the moment. The build machine does not need Hadoop installed, however, so any Linux machine running a comparable OS version can be used for compilation.

Building Spark against Hadoop 2.3.0:

3. Modify the Spark source code:

After downloading, the Spark source code is stored under: /data/spark-0.9.1-source

Go to the directory containing the YARN client source (note that the file to edit is the .scala source under src/main/scala, not the compiled output under target/):

/data/spark-0.9.1-source/spark-0.9.1/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn

Edit the Client.scala source file and search for populateHadoopClasspath to find the function.

Modify the function as follows:

def populateHadoopClasspath(conf: Configuration, env: HashMap[String, String]) {
  // Guard against conf.getStrings() returning null when yarn.application.classpath is unset,
  // falling back to YARN's built-in default classpath instead.
  val classpathEntries = Option(conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH)).getOrElse(
    YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH)
  for (c <- classpathEntries) {
    Apps.addToEnvironment(env, Environment.CLASSPATH.name, c.trim)
  }
}

The problem in this function is that conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH) may return null; iterating over that null collection is what produces the exception above.
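To make the failure mode concrete, here is a minimal, self-contained Scala sketch (not part of the Spark source; the fallback paths are made up) showing that calling foreach through a null array reference blows up inside ArrayOps exactly as in the stack trace, while the Option(...).getOrElse(...) guard used in the patched function falls back to a default instead:

object ClasspathGuardDemo {
  def main(args: Array[String]): Unit = {
    // Simulates conf.getStrings(...) returning null when yarn.application.classpath is unset
    val fromConf: Array[String] = null
    // fromConf.foreach(println)   // would throw NullPointerException in ArrayOps, as in the trace above
    val entries = Option(fromConf).getOrElse(Array("/app/hadoop/etc/hadoop", "/app/hadoop/share/hadoop/common/*"))
    entries.foreach(println)       // prints the fallback entries instead of failing
  }
}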

4. Compile Spark against the matching Hadoop version. The cluster environment is:

Hadoop version: 2.3.0

server1: nn (NameNode), rm (ResourceManager), zk (ZooKeeper), hive
server2: dn (DataNode), nm (NodeManager), zk (ZooKeeper), hive
server3: dn (DataNode), nm (NodeManager), zk (ZooKeeper), hive

Go to the directory containing the Spark source:

$ cd /data/spark-0.9.1-source

Rebuild with sbt (must be run with internet access; sbt-launcher and other components and dependencies are downloaded):

$ SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt clean assembly

After some waiting (how long depends on network speed; if the build gets stuck, press Ctrl+C and re-run the command above), the output shows success and the build is complete.
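If the build succeeded, the assembly jar that the configuration below points at should now exist in the source tree (the relative path here assumes the default sbt output layout; adjust if yours differs):

$ ls assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.3.0.jar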

5. Configure Spark:

$ cd /app/spark/conf

$ mv spark-env.sh.template spark-env.sh

$ mv log4j.properties.template log4j.properties

$ vi spark-env.sh

Set the following parameters:

# Directory containing the Hadoop configuration files (hadoop-env.sh and the *.xml files)

export YARN_CONF_DIR=/app/hadoop/etc/hadoop

# Number of Spark workers

export SPARK_WORKER_INSTANCES=2

# Memory for the Spark worker and master. The value must not be below 384M (a Spark requirement), and it must also stay within the memory limits in yarn-site.xml, i.e. it cannot exceed yarn.nodemanager.resource.memory-mb (see the yarn-site.xml example after this block).

export SPARK_WORKER_MEMORY=512M

export SPARK_MASTER_MEMORY=512M

# Spark jar settings

export SPARK_JAR=/app/spark/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.3.0.jar

export SPARK_YARN_APP_JAR=/app/spark/examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar

# Spark run mode

export MASTER=yarn-client
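For reference, the YARN-side memory ceiling mentioned in the comment above is set in yarn-site.xml roughly as follows (the property name is the standard YARN one; the value is only illustrative, use what your nodes actually have):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>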

6、将編譯後的spark分發到hadoop叢集上,因為後面要使用shark,是以叢集中的節點都要分發,分發後安裝路徑為為/app/spark,建議将此路徑配置到使用者的環境變量中.

Building Shark against Hadoop 2.3.0:

7. Download Shark 0.9.1 from the GitHub releases page: https://github.com/amplab/shark/releases

As of this writing only two builds are offered in the download links, so choose "Shark with Hadoop 2 (cdh5)" and extract it after downloading:

$ tar -zxvf shark-0.9.1-bin-hadoop2.tgz 

Enter the extracted directory and build with sbt against the matching Hadoop version (requires internet access; sbt-launcher and other components and dependencies are downloaded):

$ SHARK_HADOOP_VERSION=2.3.0 SHARK_YARN=true sbt/sbt clean package

As before, wait until the build reports success, then distribute the result to every node in the Hadoop cluster. After distribution the install path is /app/shark; again, adding this path to the user's environment variables is recommended.

8. Configure Shark:

$ cd /app/shark/conf

$ mv shark-env.sh.template shark-env.sh

$ vi shark-env.sh

# Memory settings for Shark and Spark; these can match the Spark settings above

export SPARK_MEM=512m

export SHARK_MASTER_MEM=512m

# Hive paths; the AMPLab Hive build is used here, installed under /app/hive and already configured

export HIVE_CONF_DIR=/app/hive/conf

export HIVE_HOME=/app/hive

# Spark and Hadoop related settings

export HADOOP_HOME="/app/hadoop"

export SPARK_HOME="/app/spark"

export MASTER="yarn-client"

# Settings required for running Shark on YARN

export SHARK_EXEC_MODE=yarn

export SPARK_ASSEMBLY_JAR=/app/spark/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.3.0.jar

export SHARK_ASSEMBLY_JAR=/app/shark/target/scala-2.10/shark_2.10-0.9.1.jar

# On EC2, change the local.dir to /mnt/tmp

SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "

SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "

SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "

export SPARK_JAVA_OPTS

9、将配置後的shark分發到hadoop叢集中的所有節點,分發後安裝路徑為為/app/shark,建議将此路徑配置到使用者的環境變量中

At this point Shark and Spark are both installed and configured; starting spark-shell directly already gives you Spark's interactive command-line mode.

However, due to an issue in Shark 0.9.1 itself, starting Shark at this point does not produce an error, and creating a table also succeeds, but DML statements fail:

shark> select * from tb01;

336.301: [GC 148256K->24886K(501248K), 0.0209300 secs]

org.apache.spark.SparkException: Job aborted: Task 1.0:0 failed 4 times (most recent failure: Exception failure: java.lang.RuntimeException: readObject can't find class org.apache.hadoop.hive.conf.HiveConf)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

        at scala.Option.foreach(Option.scala:236)

        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

        at akka.actor.ActorCell.invoke(ActorCell.scala:456)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)

        at akka.dispatch.Mailbox.run(Mailbox.scala:219)

        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

FAILED: Execution Error, return code -101 from shark.execution.SparkTask

The fix for this error: copy all jar files under the /app/shark/lib, /app/shark/lib_managed/jars, and /app/shark/lib_managed/bundles directories (and their subdirectories) into one fixed path, and add that path to yarn.application.classpath:

10. Execute the following on every node in the Hadoop cluster:

10.1 Create a consolidated lib directory for Shark

$ mkdir -p /app/sharklib

10.2 Write a script that collects (copies) the jar files

$ cd /app

$ vi mvjar_script.sh

#!/bin/bash
# Copy every jar from the Shark lib directories (including subdirectories) into /app/sharklib
for dir in /app/shark/lib /app/shark/lib_managed/jars /app/shark/lib_managed/bundles; do
  for jar in $(find "$dir" -name '*.jar'); do
    cp "$jar" /app/sharklib/
  done
done

10.3 Run the script

$ chmod +x mvjar_script.sh

$ ./mvjar_script.sh

11. Update the yarn-site.xml parameter, distribute it to all nodes in the cluster, and restart Hadoop (one way to do this is sketched after the property below):

将/app/sharklib加入到yarn.application.classpath中,此處“…………”用來表示之前已有的的配置:

<property>

  <name>yarn.application.classpath</name>

  <value>/app/sharklib/*,………………</value>

</property>
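One way to push the updated file to the other nodes and restart the YARN daemons, assuming the hostnames from step 4, passwordless ssh, and that the commands are run from server1 (a sketch, adjust to your own operational practice):

$ scp /app/hadoop/etc/hadoop/yarn-site.xml server2:/app/hadoop/etc/hadoop/
$ scp /app/hadoop/etc/hadoop/yarn-site.xml server3:/app/hadoop/etc/hadoop/
$ /app/hadoop/sbin/stop-yarn.sh
$ /app/hadoop/sbin/start-yarn.sh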

12. Start the Shark command line (the shark script) and run a quick test:

shark> CREATE TABLE log01(

     userid int,

     name STRING)

 ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '\t'

 STORED AS TEXTFILE;

shark> LOAD DATA LOCAL INPATH '/home/hadoop/t01.txt'  INTO TABLE log01;

shark> select count(1) from log01;

65.461: [GC 150825K->24878K(500736K), 0.0185750 secs]

69.203: [GC 152366K->22449K(501760K), 0.0169070 secs]

OK

28

Time taken: 5.625 seconds

shark> select count(1) from log01;

OK

28

Time taken: 1.533 seconds

shark> select count(1) from log01;

75.695: [GC 148274K->22303K(501248K), 0.0193210 secs]

OK

28

Time taken: 1.293 seconds

You will notice that the first count takes noticeably longer, while from the second execution onward the time drops significantly.
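The improvement most likely comes from the executors and JVMs already being warm. If you want the table itself held in Spark memory, Shark's documented convention is that tables whose names end in _cached are stored in memory; a hedged sketch (table name hypothetical, not verified on this exact setup):

shark> CREATE TABLE log01_cached AS SELECT * FROM log01;
shark> select count(1) from log01_cached;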

This completes the installation and configuration of Hadoop 2.3.0, Spark 0.9.1, and Shark 0.9.1. You can now use shark-withinfo, shark, and spark-shell, or start the Shark server and connect with command-line tools such as beeline, and start playing happily.
