
Spark 2.4 to 3.1 Migration Guide (Scala and PySpark), with Detailed Configuration and Code

Table of Contents

  • 1 Scala
  • 1.1 Using an IDE
  • 1.1.1 Spark 3.1
  • 1.1.1.1 Configuration
  • 1.1.1.2 Submitting the jar
  • 1.1.2 Spark 2.4
  • 1.2 Using the shell
  • 1.2.1 Spark 3.1
  • 1.2.2 Spark 2.4
  • 2 PySpark
  • 2.1 Using IDEA
  • 2.1.1 Spark 3.1
  • 2.1.2 Spark 2.4
  • 2.2 Using the shell
  • 2.2.1 Spark 3.1
  • 2.2.2 Spark 2.4

1 Scala

1.1 Using an IDE

The Maven pom.xml configured in IDEA is:

1.1.1 Spark 3.1

1.1.1.1 Configuration

version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>bigdataartifactId>
        <groupId>com.hahahagroupId>
        <version>1.1.1version>
    parent>
    <modelVersion>4.0.0modelVersion>

    <artifactId>scalatestartifactId>

    <dependencies>

        <dependency>
            <groupId>org.apache.sparkgroupId>
            <artifactId>spark-sql_${scala-compat-version}artifactId>
            <version>${spark-version}version>
        dependency>

        <dependency>
            <groupId>org.apache.sparkgroupId>
            <artifactId>spark-core_${scala-compat-version}artifactId>
            <version>${spark-version}version>
        dependency>

        <dependency>
            <groupId>org.scala-langgroupId>
            <artifactId>scala-libraryartifactId>
            <version>${scala-version}version>
        dependency>

        <dependency>
            <groupId>org.apache.commonsgroupId>
            <artifactId>commons-lang3artifactId>
            <version>3.1version>
        dependency>

        <dependency>
            <groupId>org.apache.hadoopgroupId>
            <artifactId>hadoop-clientartifactId>
            <version>${hadoop-version}version>
        dependency>

        <dependency>
            <groupId>org.apache.hudigroupId>
            <artifactId>hudi-spark-bundle_${scala-compat-version}artifactId>
            <version>${hudi-version}version>
        dependency>

        <dependency>
            <groupId>org.apache.httpcomponentsgroupId>
            <artifactId>httpclientartifactId>
            <version>4.5.11version>
        dependency>

        <dependency>
            <groupId>org.apache.sparkgroupId>
            <artifactId>spark-avro_${scala-compat-version}artifactId>
            <version>${spark-version}version>
        dependency>

        <dependency>
            <groupId>org.json4sgroupId>
            <artifactId>json4s-jackson_${scala-compat-version}artifactId>
            <version>${org.json4s-version}version>
        dependency>

        <dependency>
            <groupId>org.scala-lang.modulesgroupId>
            <artifactId>scala-xml_${scala-compat-version}artifactId>
            <version>${org.scala-lang.modules-version}version>
        dependency>

        <dependency>
            <groupId>com.fasterxml.jackson.modulegroupId>
            <artifactId>jackson-module-scala_${scala-compat-version}artifactId>
            <version>2.10.3version>
        dependency>

        <dependency>
            <groupId>org.scala-lang.modulesgroupId>
            <artifactId>scala-parser-combinators_${scala-compat-version}artifactId>
            <version>1.1.2version>
            <exclusions>
                <exclusion>
                    <groupId>org.scala-langgroupId>
                    <artifactId>scala-libraryartifactId>
                exclusion>
            exclusions>
        dependency>

        <dependency>
            <groupId>org.apache.sparkgroupId>
            <artifactId>spark-sql-kafka-0-10_${scala-compat-version}artifactId>
            <version>${spark-version}version>
        dependency>

    dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.mavengroupId>
                <artifactId>scala-maven-pluginartifactId>
                <executions>
                    <execution>
                        <id>scala-compile-firstid>
                        <phase>process-resourcesphase>
                        <goals>
                            <goal>add-sourcegoal>
                            <goal>compilegoal>
                        goals>
                    execution>
                    <execution>
                        <id>scala-test-compileid>
                        <phase>process-test-resourcesphase>
                        <goals>
                            <goal>testCompilegoal>
                        goals>
                    execution>
                executions>
            plugin>

            
            <plugin>
                <groupId>org.apache.maven.pluginsgroupId>
                <artifactId>maven-shade-pluginartifactId>
                <version>2.4.3version>
                <executions>
                    <execution>
                        <phase>packagephase>
                        <goals>
                            <goal>shadegoal>
                        goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SFexclude>
                                        <exclude>META-INF/*.DSAexclude>
                                        <exclude>META-INF/*.RSAexclude>
                                    excludes>
                                filter>
                            filters>
                        configuration>
                    execution>
                executions>
            plugin>
        plugins>
    build>
project>      

Version properties:

<properties>
    <org.scala-lang.modules-version>1.3.0</org.scala-lang.modules-version>
    <org.json4s-version>3.7.0-M2</org.json4s-version>
    <spark-version>3.1.2</spark-version>
    <scala-version>2.12.10</scala-version>
    <scala-compat-version>2.12</scala-compat-version>
    <hadoop-version>2.10.1</hadoop-version>
    <hudi-version>0.10.0</hudi-version>
    <mybatis.generator.configurationFile>${project.basedir}/src/main/resources/generatorConfig.xml</mybatis.generator.configurationFile>
    <mysql.version>8.0.13</mysql.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>1.8</java.version>
    <hahaha.comm.version>1.0.0</hahaha.comm.version>
    <hahaha.comm.base.version>1.0.0</hahaha.comm.base.version>
    <hahaha.comm.web.version>1.0.0</hahaha.comm.web.version>
    <hahaha.comm.database.version>1.0.0</hahaha.comm.database.version>
    <hahaha.parent.version>1.0.0</hahaha.parent.version>
    <hahaha.springcloud.version>1.0.0</hahaha.springcloud.version>
    <hahaha.springcloud.erueka.version>1.0.0</hahaha.springcloud.erueka.version>
    <hahaha.springcloud.configserver.version>1.0.0</hahaha.springcloud.configserver.version>
    <codehaus-jackson.version>1.9.13</codehaus-jackson.version>
    <jackson.version>2.9.7</jackson.version>
    <minio.version>4.0.0</minio.version>
    <spring.cloud.version>Greenwich.M3</spring.cloud.version>
    <spring.boot.version>2.1.0.RELEASE</spring.boot.version>
    <durid.version>1.1.16</durid.version>
    <jjwt.version>0.7.0</jjwt.version>
</properties>

1.1.1.2 Submitting the jar

The commands used to submit the jar are:

  • Local mode:

Without --packages (all dependencies were already bundled into the jar when it was built):

/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit  --class com.ali.bigdata.hudistreamer.jk.KafkaData2Hudi /user/scalatest-1.1.1.jar /software/member/config/jkKafkaHudi.json      

With --packages (the jar is built without its dependencies):

/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit  --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar  --driver-class-path /software/hadoop-2.10.1/etc/hadoop/:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2  /software/member/config/scalatest-1.1.1.jar /software/member/config/kafkaHudi9.json      
  • YARN mode:
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit   --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 4G --executor-cores 2  --conf spark.default.parallelism=20 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3 --conf spark.sql.shuffle.partitions=200 --driver-memory 4g  --class com.hahaha.bigdata.scalatest.jk.KafkaData2Hudi /software/member/config/scalatest-1.1.1.jar /software/member/config/kafkaHudi6.json      

Preparation before launching in YARN mode:

  • 1. Update the configuration:
find / -name capacity-scheduler.xml

vi /software/hadoop-2.10.1/etc/hadoop/capacity-scheduler.xml      

Configuration:

<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.8</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
</property>

Enable the event log:

vi /software/spark-3.1.2-bin-hadoop2.7/conf/spark-defaults.conf      

Configuration:

spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.20.3.67:8020/sparklog
spark.yarn.historyServer.address 10.20.3.67:4000      
  • 2. Stop existing Spark processes:
  • 1. Kill any running YARN applications:
yarn application -list
yarn application -kill application_1496703976885_00567      
  • 2. Kill the Spark processes:
ps aux | grep spark
sudo kill -9 4567 7865      

1.1.2 Spark 2.4

Replace the following property lines:
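The original post does not show the replacement values at this point. A plausible set, inferred from the Spark 2.4 shell command in section 1.2.2 (Scala 2.11, Spark 2.4.x, Hudi 0.8.0) rather than taken from the post, would be:

<spark-version>2.4.8</spark-version>
<scala-version>2.11.12</scala-version>
<scala-compat-version>2.11</scala-compat-version>
<hudi-version>0.8.0</hudi-version>

The exact patch versions may differ in the original setup; the key change is switching the Scala compat version to 2.11 and the Spark artifacts to a 2.4.x release.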

1.2 Using the shell

1.2.1 Spark 3.1

/software/spark-3.1.2-bin-hadoop2.7/bin/spark-shell  --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar  --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2      

1.2.2 Spark 2.4

spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.8,com.googlecode.json-simple:json-simple:1.1,com.alibaba:fastjson:1.2.51,net.minidev:json-smart:2.4.7 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --jars $HUDI_SPARK_BUNDLE --master spark://10.20.3.72:7077 --driver-class-path $HADOOP_CONF_DIR:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3      

2 PySpark

2.1 Using IDEA

2.1.1 Spark 3.1

#### Spark environment
import findspark
findspark.add_jars("/root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar")
findspark.add_packages("org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0") # Hudi bundle
findspark.add_packages("org.apache.spark:spark-avro_2.12:3.0.1") # needed for Avro serialization
findspark.add_packages("org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2") # needed when processing Kafka data
findspark._add_to_submit_args("--driver-class-path $HADOOP_CONF_DIR:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar")
findspark.init(spark_home="/software/spark-3.1.2-bin-hadoop2.7", python_path="/software/Python-3.9.5/bin/python3")
## Create the Spark session
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.default.parallelism", 2) \
    .appName("hudi-datalake-test9") \
    .getOrCreate()      
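To give the session something concrete to do, here is a minimal, hedged sketch of a Kafka-to-Hudi batch job using the spark session created above. The broker address, topic, table name, field choices, and output path are placeholders (not from the original post); the Hudi settings are the standard datasource write options.

# Minimal sketch with placeholder names: read a Kafka topic in batch mode and upsert it into a Hudi table.
from pyspark.sql import functions as F

kafka_df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "10.20.3.67:9092")   # placeholder broker
    .option("subscribe", "member_topic")                     # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; keep the payload as a string plus a few technical columns.
events = kafka_df.select(
    F.concat_ws("-", F.col("partition"), F.col("offset")).alias("uid"),
    F.col("value").cast("string").alias("json_str"),
    F.col("timestamp").alias("ts"),
    F.to_date(F.col("timestamp")).alias("dt"),
)

hudi_options = {
    "hoodie.table.name": "member_hudi",                      # placeholder table name
    "hoodie.datasource.write.recordkey.field": "uid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
}

(events.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/member_hudi"))                          # placeholder base path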

The configuration that needs to be modified:

vi /etc/profile      

Add the following:

# Point both variables at the python3 executable (not the install root) and put its bin directory on PATH
export PYSPARK_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/software/Python-3.9.5/bin/python3
export PATH=/software/Python-3.9.5/bin:$PATH      

On Linux, submit it to spark-submit with the following script:

#!/usr/bin/env bash

fileName=$1
param=$2

/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --jars /software/hudi-0.7.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar  --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 /software/spark-3.1.2-bin-hadoop2.7/Script/$fileName $param      
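For reference, here is a hypothetical sketch (not from the original post) of how the submitted script might pick up the JSON config path that the wrapper passes through as $param:

# Hypothetical: read the config path handed over by the wrapper script above.
import json
import sys

if __name__ == "__main__":
    config = {}
    if len(sys.argv) > 1:                       # e.g. /software/member/config/kafkaHudi9.json
        with open(sys.argv[1], "r", encoding="utf-8") as f:
            config = json.load(f)
    print("loaded config keys:", list(config.keys()))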

⚠️ Note:

  • Adding --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 works around the problem of insufficient Python permissions.

If you use YARN cluster mode (--master yarn --deploy-mode cluster), the common environment that Spark relies on must be identical on every node.

The command is as follows:

/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --jars /software/hudi-0.7.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar  --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 2G --executor-cores 3  --conf spark.default.parallelism=20 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3 --conf spark.sql.shuffle.partitions=200 --driver-memory 4g --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 --py-files /software/spark-3.1.2-bin-hadoop2.7/PySparkProject/bigDataAnalysisModeling/hahaha.zip  /software/spark-3.1.2-bin-hadoop2.7/PySparkProject/bigDataAnalysisModeling/main.py      

⚠️ Note: use the --py-files option for packaged (zipped) Python code, and the --files option for plain files; see the sketch below for how a zipped package is used.
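As a hedged illustration of the --py-files mechanism, here is a hypothetical skeleton of main.py; the module and function names inside hahaha.zip are made up, since the post does not show its contents.

# Hypothetical main.py: modules packed into hahaha.zip (shipped via --py-files)
# are importable on the driver and executors because Spark puts the zip on sys.path.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigDataAnalysisModeling").getOrCreate()
print("hahaha.zip on sys.path:", any("hahaha.zip" in p for p in sys.path))

# e.g. if hahaha.zip contained hahaha/udfs.py with a function clean_name(s):
# from hahaha.udfs import clean_name
# spark.udf.register("clean_name", clean_name)

spark.stop()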

2.1.2 Spark 2.4

import findspark
findspark.add_jars("/root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar")
findspark.add_packages("org.apache.spark:spark-avro_2.11:2.4.4")  # Avro support for Spark 2.4 / Scala 2.11
findspark._add_to_submit_args("--driver-class-path /usr/app/hadoop-2.10.1/etc/hadoop:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar")
# spark_home must point at the Spark 2.4 installation (the path used by spark-submit below), not the Hadoop directory
findspark.init(spark_home="/usr/app/spark-2.4.7-bin-hadoop2.7", python_path="/software/Python-3.9.5/bin/python3")

from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.default.parallelism", 2) \
    .appName("hudi-datalake-test9") \
    .getOrCreate()      
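Since this 2.4 setup only registers the external spark-avro package, a quick sanity check is an Avro round trip with the session created above. This is a minimal sketch with a placeholder output path, not something from the original post.

# Write a tiny DataFrame as Avro and read it back; "avro" is the short name
# registered by the external spark-avro module in Spark 2.4.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
df.write.format("avro").mode("overwrite").save("/tmp/avro_check")   # placeholder path
spark.read.format("avro").load("/tmp/avro_check").show()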

The command submitted to spark-submit:

#!/usr/bin/env bash
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export  PATH=$PATH:$LD_LIBRARY_PATH
fileName=$1
/usr/app/spark-2.4.7-bin-hadoop2.7/bin/spark-submit --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --master spark://10.20.3.72:7077 --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.spark:spark-avro_2.11:2.4.4 /usr/app/spark-2.4.7-bin-hadoop2.7/Script/$fileName      

Usage:

bash runPySpark.py spark.py opt/sparkParam.py      

2.2 Using the shell

2.2.1 Spark 3.1

The script (runPySpark.py) is as follows:

/software/spark-3.1.2-bin-hadoop2.7/bin/pyspark --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2      

Change the configuration to:

export PYSPARK_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON_OPTS      
  • (1) Install the libffi-devel package:
yum install libffi-devel -y      
  • (2) Recompile in the Python 3 source directory:
make install      

2.2.2 Spark 2.4

bin/pyspark --jars $HUDI_SPARK_BUNDLE --master spark://10.20.3.72:7077 --driver-class-path $HADOOP_CONF_DIR:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.spark:spark-avro_2.11:2.4.4