
Spark Cluster Setup (Hadoop + Spark + Zookeeper + HBase)

Contents

  • Hardware Requirements and Software Versions
  • Basic Environment Configuration
  • Installing Java
  • Installing Hadoop
  • Installing Spark
  • Installing Python
  • Installing Scala
  • Installing SBT
  • Installing Zookeeper
  • Installing HBase
  • Installing PyCharm
  • Cluster Start and Stop Scripts
  • Web Management UIs
  • Additional Notes

Hardware Requirements and Software Versions

1. Hardware: three machines running CentOS 7

2. Software versions: the following versions have been tested and work together

Java:      openjdk version "1.8.0_252"
Hadoop:    2.7.7
Spark:     2.4.5
Python:    3.6.8
Scala:     2.11.12
SBT:       1.3.13
Zookeeper: 3.4.14
HBase:     1.2.7
           

Basic Environment Configuration

1. Create a new user and set its password:

adduser spark

passwd spark

2. Add the new user to the wheel group (so it can use sudo):

usermod -aG wheel spark

3. Configure machine names and addresses

  1. Change the hostname:

    sudo vi /etc/hostname

  2. Set a static IP for the machine:

    vi /etc/sysconfig/network-scripts/ifcfg-ens33

    (ens33 is the name of the network interface in use on this machine)
  3. Edit the hosts file with

    sudo vi /etc/hosts

    and add the following entries
192.168.11.137  master
192.168.28.54   slave1
192.168.28.51   slave2
           

4. Set up passwordless SSH between the machines (so each can control the others remotely)

  1. On every machine, generate a key pair:

    ssh-keygen -t rsa

  2. Send id_rsa.pub from slave1 and slave2 to master with scp:

    scp ~/.ssh/id_rsa.pub spark@master:~/.ssh/id_rsa.pub.slave1

    scp ~/.ssh/id_rsa.pub spark@master:~/.ssh/id_rsa.pub.slave2

  3. On master, merge all public keys:

    cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys

    and send the authorized_keys file back to each slave:

    scp ~/.ssh/authorized_keys spark@slave1:~/.ssh/

    scp ~/.ssh/authorized_keys spark@slave2:~/.ssh/
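
To confirm the keys work in every direction, a quick check such as the following can be run from each machine (a minimal sketch, assuming the hostnames defined in /etc/hosts above):

#!/bin/bash
# Each ssh call should print the remote hostname without prompting for a password
for h in master slave1 slave2; do
  echo "checking ${h} ..."
  ssh "${h}" hostname
done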

Installing Java

1. Install Java

  1. Install the JRE:

    sudo yum install java-1.8.0-openjdk.x86_64

  2. Install the JDK:

    sudo yum -y install java-1.8.0-openjdk-devel.x86_64

2. Configure environment variables

  1. vi /etc/profile

    and add the following:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
           
  2. Apply the changes:

    source /etc/profile

    and check that they took effect:

    echo $JAVA_HOME
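
As an extra check, the sketch below verifies that the JDK is usable and that the exported JAVA_HOME points at a real directory (the readlink/sed line is only an illustration for locating the install directory behind the yum alternatives symlinks):

# Both commands should report version 1.8.0_252
java -version
javac -version

# The exported variable should point at an existing install
ls "$JAVA_HOME/bin/java"

# Resolve the real install directory behind /usr/bin/javac
readlink -f "$(which javac)" | sed 's|/bin/javac$||'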

Installing Hadoop

For the full installation procedure, see the reference link.

1. Download Hadoop and unpack it

2. Edit the configuration files (located in /home/spark/allBigData/hadoop/etc/hadoop)

  1. hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
# The log directory setting takes effect, but some logs still end up in the default $HADOOP_HOME/logs
export HADOOP_LOG_DIR=/home/spark/allBigData/data/hadoop/log 
export HADOOP_PID_DIR=/home/spark/allBigData/data/hadoop/pid
           
  2. yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
export YARN_PID_DIR=/home/spark/allBigData/data/hadoop/pid
           
  3. slaves
master
slave1
slave2
           
  4. core-site.xml
<configuration>
        <property>
             <name>hadoop.tmp.dir</name>
             <value>file:/home/spark/allBigData/data/hadoop/tmp</value>
             <description>A base for other temporary directories.</description>
        </property>
        <property>
             <name>fs.defaultFS</name>
             <value>hdfs://master:9000/</value>
        </property>
</configuration>
           
  5. hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
        <property>
             <name>dfs.replication</name>
             <value>1</value>
        </property>
        <property>
             <name>dfs.namenode.name.dir</name>
             <value>file:/home/spark/allBigData/data/hadoop/dfs/name</value>
        </property>
        <property>
             <name>dfs.datanode.data.dir</name>
             <value>file:/home/spark/allBigData/data/hadoop/dfs/data</value>
        </property>
</configuration>
           
  6. mapred-site.xml
<configuration>
        <property>
          <name>mapreduce.framework.name</name>
           <value>yarn</value>
         </property>
</configuration>
           

Optionally, the following job history properties can also be added to mapred-site.xml:

<property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
     
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
           
  7. yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
           

Optionally, the following properties can also be added to yarn-site.xml:

<property>
                <name>yarn.application.classpath</name>
                <value>
                        /home/spark/allBigData/hadoop/etc/*,
                        /home/spark/allBigData/hadoop/etc/hadoop/*,
                        /home/spark/allBigData/hadoop/lib/*,
                        /home/spark/allBigData/hadoop/lib/native/*,
                        /home/spark/allBigData/hadoop/share/hadoop/common/*,
                        /home/spark/allBigData/hadoop/share/hadoop/common/lib/*,
                        /home/spark/allBigData/hadoop/share/hadoop/mapreduce/*,
                        /home/spark/allBigData/hadoop/share/hadoop/mapreduce/lib/*,
                        /home/spark/allBigData/hadoop/share/hadoop/hdfs/*,
                        /home/spark/allBigData/hadoop/share/hadoop/hdfs/lib/*,
                        /home/spark/allBigData/hadoop/share/hadoop/yarn/*,
                        /home/spark/allBigData/hadoop/share/hadoop/yarn/lib/*,
                        /home/spark/allBigData/hbase/lib/*
                </value>
        </property>
	<property>
	    <name>yarn.log-aggregation-enable</name>
	    <value>true</value>
	</property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://master:19888/jobhistory/logs</value>
    </property>
           
  8. mapred-env.sh
export HADOOP_MAPRED_PID_DIR=/home/spark/allBigData/data/hadoop/pid
           

3. Copy the Hadoop directory to the slave machines

scp -r /home/spark/allBigData/hadoop spark@slave1:/home/spark/allBigData/

scp -r /home/spark/allBigData/hadoop spark@slave2:/home/spark/allBigData/

4. Start Hadoop (a preparation and verification sketch follows below)

Format the NameNode before the very first start:

bin/hadoop namenode -format

Start the cluster with:

/home/spark/allBigData/hadoop/sbin/start-all.sh
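
Before the very first start it also helps to create the data, log and PID directories referenced in the configuration files on every node. The sketch below does that, formats HDFS, starts the cluster and verifies it, assuming the directory layout used throughout this guide:

#!/bin/bash
# Create the directories referenced in hadoop-env.sh, core-site.xml and hdfs-site.xml
for h in master slave1 slave2; do
  ssh "${h}" "mkdir -p /home/spark/allBigData/data/hadoop/{log,pid,tmp,dfs/name,dfs/data}"
done

# Format HDFS only once, before the very first start (this wipes existing metadata)
/home/spark/allBigData/hadoop/bin/hadoop namenode -format

# Start the cluster and check the daemons
/home/spark/allBigData/hadoop/sbin/start-all.sh
jps                                                       # NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager on master
/home/spark/allBigData/hadoop/bin/hdfs dfsadmin -report   # should list all three DataNodes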

Installing Spark

For the full installation procedure, see the reference link.

1. Download Spark and unpack it

2. Edit the configuration files (located in /home/spark/allBigData/spark/conf)

Note that many files in this directory end in .template; these are the template configuration files Spark ships with. Copy one, remove the .template suffix, and then edit the resulting file.

  1. spark-env.sh
export SPARK_PID_DIR=/home/spark/allBigData/data/spark/pid
export SPARK_LOG_DIR=/home/spark/allBigData/data/spark/log
export SPARK_WORKER_DIR=/home/spark/allBigData/data/spark/worker
export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
export HADOOP_HOME=/home/spark/allBigData/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=master
export SPARK_LOCAL_DIRS=/home/spark/allBigData/data/spark/local
           
  2. slaves
master
slave1
slave2
           

3. Copy the Spark directory to the slave machines

scp -r /home/spark/allBigData/spark spark@slave1:/home/spark/allBigData/

scp -r /home/spark/allBigData/spark spark@slave2:/home/spark/allBigData/

4. Start Spark (a smoke-test sketch follows this list)

/home/spark/allBigData/spark/sbin/start-all.sh

5. Configure the Spark history server

See the reference link.
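
A minimal smoke test after step 4, assuming the standalone master listens at spark://master:7077 (which follows from the configuration above); the examples jar name below depends on the downloaded build and may need adjusting:

# Master and Worker processes should appear in jps on every node
jps

# Run the bundled SparkPi example against the standalone cluster
/home/spark/allBigData/spark/bin/spark-submit \
  --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /home/spark/allBigData/spark/examples/jars/spark-examples_2.11-2.4.5.jar 10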

Installing Python

See the reference link.

Additional notes:

1. In step two, download the Gzipped source tarball from the official site

2. Notes on compiling and installing:

  1. Run configure as

    ./configure --prefix=/usr/local/python3 --enable-shared CFLAGS=-fPIC

    so that libpython3.6m.so.1.0 is produced in the Python install directory
  2. Do not run

    make && make install

    in a single step; it fails with an error. Run the two commands separately instead:

    sudo make

    sudo make install

3. Change the default python and pip: repoint /usr/bin/python and /usr/bin/pip3 (adjust the paths below to match the actual install prefix):

sudo ln -s /usr/local/python3.6/bin/python3.6 /usr/bin/python

sudo ln -s /usr/local/python3.6.5/bin/pip3.6 /usr/bin/pip3

Additional commands:

Check the default python version:

python --version

List the installed python binaries:

ll /usr/bin/python*

4. As a final step after installation, run

cp /usr/local/python3.6.3/lib/libpython3.6m.so.1.0 /usr/lib64/

to avoid the error "error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory". (A verification sketch follows at the end of this section.)

5. Switch pip to a domestic mirror

See the reference link.

Write pip.conf as:

[global]
index-url = http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host = mirrors.aliyun.com
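
A small verification sketch for items 3 to 5 above, assuming the symlinks and pip.conf are in place (the pip config subcommand requires a reasonably recent pip):

# The repointed symlinks should resolve to the newly built interpreter
python --version        # Python 3.6.x
pip3 --version

# Must not print "not found" for libpython -- this is the error mentioned in step 4
ldd "$(readlink -f /usr/bin/python)" | grep libpython

# Should show the Aliyun index configured in pip.conf
pip3 config list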
           

Installing Scala

Installing SBT

  • For the full installation procedure, see the reference link. When creating the sbt launch script, make sure the path after

    -jar

    matches the actual install location (a sketch of such a script follows this list)
  • To switch sbt to a domestic mirror, see the reference link
  • When creating an sbt Scala project in IDEA, be sure to select the already-installed Scala 2.12.11 / SBT 1.3.13 versions, otherwise the build will fail
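
For reference, a typical sbt launch script looks like the sketch below; the /usr/local/sbt location is only an assumption, and the -jar path must match where sbt-launch.jar actually lives:

#!/bin/bash
# Save e.g. as /usr/local/sbt/sbt and make it executable with chmod +x
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled"
java $SBT_OPTS -jar /usr/local/sbt/bin/sbt-launch.jar "$@"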

Installing Zookeeper

For the full installation procedure, see the reference link.

Additional notes:

1. Create the data and log directories:

mkdir ./data

mkdir ./logs

2. Edit the configuration file:

vim ./conf/zoo.cfg

dataDir=/home/spark/allBigData/data/zookeeper/data
dataLogDir=/home/spark/allBigData/data/zookeeper/log
server.1=192.168.11.137:2888:3888
server.2=192.168.28.54:2888:3888
server.3=192.168.28.51:2888:3888
           

3. Create a myid file on each machine containing that machine's server number (a distribution sketch follows this list)

Just to be safe, also copy myid into ./data/; otherwise you may hit the error "Error contacting service. It is probably not running"

4. Send the modified configuration files to the other machines
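
The myid values must match the server.N lines in zoo.cfg above (master=1, slave1=2, slave2=3). A sketch for writing the files and then checking the ensemble, assuming the dataDir configured above:

#!/bin/bash
# Write each server's id into the configured dataDir
ssh master "mkdir -p /home/spark/allBigData/data/zookeeper/data && echo 1 > /home/spark/allBigData/data/zookeeper/data/myid"
ssh slave1 "mkdir -p /home/spark/allBigData/data/zookeeper/data && echo 2 > /home/spark/allBigData/data/zookeeper/data/myid"
ssh slave2 "mkdir -p /home/spark/allBigData/data/zookeeper/data && echo 3 > /home/spark/allBigData/data/zookeeper/data/myid"

# After starting the ensemble, every node should report Mode: leader or Mode: follower
for h in master slave1 slave2; do
  echo "---- ${h} ----"
  ssh "${h}" "source /etc/profile; /home/spark/allBigData/zookeeper/bin/zkServer.sh status"
done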

Installing HBase

For the full installation procedure, see the reference link.

Additional notes on editing the configuration files:

1. Edit hbase-site.xml:

vim ./conf/hbase-site.xml

<configuration>
    <property>
      <name>hbase.tmp.dir</name>
      <value>/home/spark/allBigData/data/hbase/data</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://master:9000/hbase</value>
    </property>
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.master</name>
      <value>master:60000</value>
    </property>
<property>
      <name>hbase.zookeeper.quorum</name>
      <value>master,slave1,slave2</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/spark/allBigData/data/zookeeper/data</value>
      <description>property from zoo.cfg,the directory where the snapshot is stored</description>
    </property>
</configuration>
           

To keep the connection from dropping automatically when Python talks to HBase, add the following properties (see reference links 1 and 2):

<property>
	 <name>hbase.thrift.server.socket.read.timeout</name>
	 <value>31104000</value>
	 <description>eg:second</description>
</property>  
<property>
	 <name>hbase.thrift.connection.max-idletime</name>
	 <value>31104000</value>
	 <description>eg:second</description>
</property>  
           

Note that after changing these settings, HBase and the Thrift server must be restarted for them to take effect.

2. Edit hbase-env.sh:

vim ./conf/hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64
export HBASE_CLASSPATH=/home/spark/allBigData/hadoop/etc/hadoop
export HBASE_PID_DIR=/home/spark/allBigData/data/hbase/pid
export HBASE_LOG_DIR=/home/spark/allBigData/data/hbase/log
export HBASE_MANAGES_ZK=false
           

3. Edit regionservers:

vim ./conf/regionservers

master
slave1
slave2
           

4. Edit /etc/profile:

vim /etc/profile

export HBASE_HOME=/home/spark/allBigData/hbase
export PATH=$PATH:$HBASE_HOME/bin
           

Other notes:

  • Start the HBase cluster:

    ./bin/start-hbase.sh

  • Stop the HBase cluster:

    ./bin/stop-hbase.sh

  • Interacting with HBase from Python requires the Thrift service; start it from the HBase install directory with

    ./bin/hbase-daemon.sh start thrift

    (use stop instead of start to shut the Thrift service down)
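
A quick way to confirm that both the cluster and the Thrift server are up before pointing Python at them (the Thrift server listens on port 9090 by default):

# The Thrift server should be listening on its default port 9090
/home/spark/allBigData/hbase/bin/hbase-daemon.sh start thrift
ss -lnt | grep 9090

# Smoke test for the cluster itself: 'status' should report the live region servers
echo "status" | /home/spark/allBigData/hbase/bin/hbase shell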

Installing PyCharm

For the full installation procedure, see the reference link.

Cluster Start and Stop Scripts

  • Cluster start script:

    vim startCluster.sh

#!/bin/bash

echo "==============Starting hadoop=============="
/home/spark/allBigData/hadoop/sbin/start-all.sh

echo "==============Starting hadoop job history=============="
echo "---Starting master---"
/home/spark/allBigData/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
for i in slave1 slave2
  do
    echo "---Starting ${i}---"
    ssh $i "/home/spark/allBigData/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"
  done

echo "==============Starting spark and job history=============="
/home/spark/allBigData/spark/sbin/start-all.sh
/home/spark/allBigData/spark/sbin/start-history-server.sh

echo "==============Starting zkServer=============="
echo "---Starting master---"
/home/spark/allBigData/zookeeper/bin/zkServer.sh start
/home/spark/allBigData/zookeeper/bin/zkServer.sh status
for i in slave1 slave2
  do
    echo "---Starting ${i}---"
    ssh $i "source /etc/profile;/home/spark/allBigData/zookeeper/bin/zkServer.sh start;/home/spark/allBigData/zookeeper/bin/zkServer.sh status"
  done

echo "==============Starting hbase=============="
/home/spark/allBigData/hbase/bin/start-hbase.sh
/home/spark/allBigData/hbase/bin/hbase-daemon.sh start thrift
           
  • Cluster stop script:

    vim stopCluster.sh

#!/bin/bash

echo "==============Stopping hbase=============="
/home/spark/allBigData/hbase/bin/hbase-daemon.sh stop thrift;/home/spark/allBigData/hbase/bin/stop-hbase.sh

echo "==============Stopping hadoop job history=============="
echo "---Stopping master---"
/home/spark/allBigData/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver
for i in slave1 slave2
  do
    echo "---Stopping ${i}---"
    ssh $i "/home/spark/allBigData/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver"
  done

echo "==============Stopping zkServer=============="
echo "---Stopping master---"
/home/spark/allBigData/zookeeper/bin/zkServer.sh stop
/home/spark/allBigData/zookeeper/bin/zkServer.sh status
for i in slave1 slave2
  do
    echo "---Stopping ${i}---"
    ssh $i "source /etc/profile;/home/spark/allBigData/zookeeper/bin/zkServer.sh stop;/home/spark/allBigData/zookeeper/bin/zkServer.sh status"
  done

echo "==============Stopping spark and job history=============="
/home/spark/allBigData/spark/sbin/stop-history-server.sh
/home/spark/allBigData/spark/sbin/stop-all.sh

echo "==============Stopping hadoop=============="
/home/spark/allBigData/hadoop/sbin/stop-all.sh
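
Both scripts need to be made executable with chmod +x startCluster.sh stopCluster.sh. A small companion sketch for checking which Java daemons are running on each node after a start or stop:

#!/bin/bash
# checkCluster.sh -- list the running Hadoop/Spark/Zookeeper/HBase daemons per node
for h in master slave1 slave2; do
  echo "---- ${h} ----"
  ssh "${h}" "jps | grep -v Jps"
done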
           

Web Management UIs

  • YARN UI: master:8088
  • HDFS web UI: master:50070
  • Spark UI: master:8080
  • HBase UI: master:16010
  • Hadoop job history UI: master:19888
  • Spark job history UI: master:4000
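
These ports can be probed from any machine that resolves the hostnames; a minimal sketch (an HTTP status of 200 or 302 means the service is answering):

#!/bin/bash
for url in master:8088 master:50070 master:8080 master:16010 master:19888 master:4000; do
  printf '%-14s %s\n' "${url}" "$(curl -s -o /dev/null -w '%{http_code}' "http://${url}")"
done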

Additional Notes

  • When building a Spark application, HDFS files are referenced with paths of the form:

    hdfs://master:9000/data/station_info.csv

    (once fs.defaultFS is set in Hadoop's core-site.xml, the hdfs://master:9000 prefix can be omitted)
  • Command lines for submitting applications to the cluster (a sketch for staging the input data in HDFS appears at the end of this section):
# Submit a PySpark application to YARN
/home/spark/allBigData/spark/bin/spark-submit \
--master yarn --deploy-mode cluster --num-executors 3 \
--files file:/home/spark/workspace_python/test/data/deviceInfo.csv \
--jars /home/spark/allBigData/useful_jars/json4s-jackson_2.11-3.5.3.jar \
--conf spark.yarn.dist.archives=file:/home/spark/workspace_python/venv.zip#pyenv \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
/home/spark/workspace_python/test/t1/TipSpdRatioWarn_H_RDD_10M.py

# Submit a PySpark application in standalone mode
/home/spark/allBigData/spark/bin/spark-submit \
--master spark://master:7077 --num-executors 3 \
--files file:/home/spark/workspace_python/test/data/deviceInfo.csv \
--jars /home/spark/allBigData/useful_jars/json4s-jackson_2.11-3.5.3.jar \
--conf spark.yarn.dist.archives=file:/home/spark/workspace_python/venv.zip#pyenv \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
/home/spark/workspace_python/test/t1/TipSpdRatioWarn_H_RDD_10M.py

# Submit a Scala application to YARN; note that the --queue parameter selects the queue
spark-submit \
--class wordCount \
--master yarn --deploy-mode cluster --queue test \
--executor-memory 512M \
--total-executor-cores 3 \
/home/spark/IdeaProjects/SparkScalaTest5/target/scala-2.11/sparkscalatest4_2.11-0.1.jar \
hdfs://master:9000/data/README.md \
hdfs://master:9000/result/wordCount2
           
  • For problems encountered while configuring and running the cluster, and their solutions, see the reference link
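
The spark-submit examples above read their input from HDFS, so the data has to be staged there first; a minimal sketch, assuming a local README.md and the /data directory used in the wordCount example:

# Create the target directory in HDFS and upload the input file
/home/spark/allBigData/hadoop/bin/hdfs dfs -mkdir -p /data
/home/spark/allBigData/hadoop/bin/hdfs dfs -put -f README.md /data/README.md
/home/spark/allBigData/hadoop/bin/hdfs dfs -ls /data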
