Table of Contents
- 1 Scala
- 1.1 Using the IDE
- 1.1.1 Spark3.1 version
- 1.1.1.1 Configuration
- 1.1.1.2 Submitting the jar
- 1.1.2 Spark2.4 version
- 1.2 Using the shell
- 1.2.1 spark3.1 version
- 1.2.2 spark2.4 version
- 2 PySpark
- 2.1 Using IDEA
- 2.1.1 spark3.1
- 2.1.2 spark2.4
- 2.2 Using the shell
- 2.2.1 spark3.1
- 2.2.2 spark2.4
1 Scala
1.1 Using the IDE
1.1.1 Spark3.1 version
1.1.1.1 Configuration
The Maven pom.xml configured in IDEA is:
version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>bigdataartifactId>
<groupId>com.hahahagroupId>
<version>1.1.1version>
parent>
<modelVersion>4.0.0modelVersion>
<artifactId>scalatestartifactId>
<dependencies>
<dependency>
<groupId>org.apache.sparkgroupId>
<artifactId>spark-sql_${scala-compat-version}artifactId>
<version>${spark-version}version>
dependency>
<dependency>
<groupId>org.apache.sparkgroupId>
<artifactId>spark-core_${scala-compat-version}artifactId>
<version>${spark-version}version>
dependency>
<dependency>
<groupId>org.scala-langgroupId>
<artifactId>scala-libraryartifactId>
<version>${scala-version}version>
dependency>
<dependency>
<groupId>org.apache.commonsgroupId>
<artifactId>commons-lang3artifactId>
<version>3.1version>
dependency>
<dependency>
<groupId>org.apache.hadoopgroupId>
<artifactId>hadoop-clientartifactId>
<version>${hadoop-version}version>
dependency>
<dependency>
<groupId>org.apache.hudigroupId>
<artifactId>hudi-spark-bundle_${scala-compat-version}artifactId>
<version>${hudi-version}version>
dependency>
<dependency>
<groupId>org.apache.httpcomponentsgroupId>
<artifactId>httpclientartifactId>
<version>4.5.11version>
dependency>
<dependency>
<groupId>org.apache.sparkgroupId>
<artifactId>spark-avro_${scala-compat-version}artifactId>
<version>${spark-version}version>
dependency>
<dependency>
<groupId>org.json4sgroupId>
<artifactId>json4s-jackson_${scala-compat-version}artifactId>
<version>${org.json4s-version}version>
dependency>
<dependency>
<groupId>org.scala-lang.modulesgroupId>
<artifactId>scala-xml_${scala-compat-version}artifactId>
<version>${org.scala-lang.modules-version}version>
dependency>
<dependency>
<groupId>com.fasterxml.jackson.modulegroupId>
<artifactId>jackson-module-scala_${scala-compat-version}artifactId>
<version>2.10.3version>
dependency>
<dependency>
<groupId>org.scala-lang.modulesgroupId>
<artifactId>scala-parser-combinators_${scala-compat-version}artifactId>
<version>1.1.2version>
<exclusions>
<exclusion>
<groupId>org.scala-langgroupId>
<artifactId>scala-libraryartifactId>
exclusion>
exclusions>
dependency>
<dependency>
<groupId>org.apache.sparkgroupId>
<artifactId>spark-sql-kafka-0-10_${scala-compat-version}artifactId>
<version>${spark-version}version>
dependency>
dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.mavengroupId>
<artifactId>scala-maven-pluginartifactId>
<executions>
<execution>
<id>scala-compile-firstid>
<phase>process-resourcesphase>
<goals>
<goal>add-sourcegoal>
<goal>compilegoal>
goals>
execution>
<execution>
<id>scala-test-compileid>
<phase>process-test-resourcesphase>
<goals>
<goal>testCompilegoal>
goals>
execution>
executions>
plugin>
<plugin>
<groupId>org.apache.maven.pluginsgroupId>
<artifactId>maven-shade-pluginartifactId>
<version>2.4.3version>
<executions>
<execution>
<phase>packagephase>
<goals>
<goal>shadegoal>
goals>
<configuration>
<filters>
<filter>
<artifact>*:*artifact>
<excludes>
<exclude>META-INF/*.SFexclude>
<exclude>META-INF/*.DSAexclude>
<exclude>META-INF/*.RSAexclude>
excludes>
filter>
filters>
configuration>
execution>
executions>
plugin>
plugins>
build>
project>
The version properties referenced above are:
<properties>
    <org.scala-lang.modules-version>1.3.0</org.scala-lang.modules-version>
    <org.json4s-version>3.7.0-M2</org.json4s-version>
    <spark-version>3.1.2</spark-version>
    <scala-version>2.12.10</scala-version>
    <scala-compat-version>2.12</scala-compat-version>
    <hadoop-version>2.10.1</hadoop-version>
    <hudi-version>0.10.0</hudi-version>
    <mybatis.generator.configurationFile>${project.basedir}/src/main/resources/generatorConfig.xml</mybatis.generator.configurationFile>
    <mysql.version>8.0.13</mysql.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>1.8</java.version>
    <hahaha.comm.version>1.0.0</hahaha.comm.version>
    <hahaha.comm.base.version>1.0.0</hahaha.comm.base.version>
    <hahaha.comm.web.version>1.0.0</hahaha.comm.web.version>
    <hahaha.comm.database.version>1.0.0</hahaha.comm.database.version>
    <hahaha.parent.version>1.0.0</hahaha.parent.version>
    <hahaha.springcloud.version>1.0.0</hahaha.springcloud.version>
    <hahaha.springcloud.erueka.version>1.0.0</hahaha.springcloud.erueka.version>
    <hahaha.springcloud.configserver.version>1.0.0</hahaha.springcloud.configserver.version>
    <codehaus-jackson.version>1.9.13</codehaus-jackson.version>
    <jackson.version>2.9.7</jackson.version>
    <minio.version>4.0.0</minio.version>
    <spring.cloud.version>Greenwich.M3</spring.cloud.version>
    <spring.boot.version>2.1.0.RELEASE</spring.boot.version>
    <durid.version>1.1.16</durid.version>
    <jjwt.version>0.7.0</jjwt.version>
</properties>
1.1.1.2 Submitting the jar
The commands used to submit the jar are:
- Local mode:
Without --packages (all dependencies were already bundled into the jar when it was built):
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --class com.ali.bigdata.hudistreamer.jk.KafkaData2Hudi /user/scalatest-1.1.1.jar /software/member/config/jkKafkaHudi.json
With --packages (the jar does not bundle its dependencies):
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop/:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 /software/member/config/scalatest-1.1.1.jar /software/member/config/kafkaHudi9.json
- YARN mode:
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 4G --executor-cores 2 --conf spark.default.parallelism=20 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3 --conf spark.sql.shuffle.partitions=200 --driver-memory 4g --class com.hahaha.bigdata.scalatest.jk.KafkaData2Hudi /software/member/config/scalatest-1.1.1.jar /software/member/config/kafkaHudi6.json
Preparation before starting in YARN mode:
- 1. Update the configuration:
find / -name capacity-scheduler.xml
vi /software/hadoop-2.10.1/etc/hadoop/capacity-scheduler.xml
Configuration:
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.8</value>
    <description>
        Maximum percent of resources in the cluster which can be used to run
        application masters i.e. controls number of concurrent running
        applications.
    </description>
</property>
Enable the event log:
vi /software/spark-3.1.2-bin-hadoop2.7/conf/spark-defaults.conf
Configuration:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.20.3.67:8020/sparklog
spark.yarn.historyServer.address 10.20.3.67:4000
- 2. Stop existing Spark processes:
- 1. Kill the running YARN applications:
yarn application -list
yarn application -kill application_1496703976885_00567
- 2. Kill the Spark processes:
ps aux | grep spark
sudo kill -9 4567 7865
1.1.2 Spark2.4 version
Replace the following lines:
1.2 Using the shell
1.2.1 spark3.1 version
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-shell --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
1.2.2 spark2.4 version
spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.8,com.googlecode.json-simple:json-simple:1.1,com.alibaba:fastjson:1.2.51,net.minidev:json-smart:2.4.7 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --jars $HUDI_SPARK_BUNDLE --master spark://10.20.3.72:7077 --driver-class-path $HADOOP_CONF_DIR:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3
2 PySpark
2.1 Using IDEA
2.1.1 spark3.1
#### Spark environment
import findspark
findspark.add_jars("/root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar")
findspark.add_packages("org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0")  # Hudi bundle
findspark.add_packages("org.apache.spark:spark-avro_2.12:3.0.1")  # needed for Avro (de)serialization
findspark.add_packages("org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")  # needed when processing Kafka data
# Note: $HADOOP_CONF_DIR may not be expanded here; if in doubt, use the full path (e.g. /software/hadoop-2.10.1/etc/hadoop)
findspark._add_to_submit_args("--driver-class-path $HADOOP_CONF_DIR:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar")
findspark.init(spark_home="/software/spark-3.1.2-bin-hadoop2.7", python_path="/software/Python-3.9.5/bin/python3")
## Create the Spark session
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.default.parallelism", 2) \
.appName("hudi-datalake-test9") \
.getOrCreate()
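With the session available, a quick way to verify that the Hudi bundle and Avro package loaded above were actually picked up is to write a small DataFrame to a Hudi table. This is only a minimal sketch: the table name, column names, record key / partition / precombine fields, and the /tmp path are illustrative assumptions, not part of the original setup.
## Minimal smoke test for the Hudi packages loaded above (all names and paths below are assumptions).
df = spark.createDataFrame([(1, "a", 1000), (2, "b", 2000)], ["id", "name", "ts"])
hudi_options = {
    "hoodie.table.name": "hudi_smoke_test",
    "hoodie.datasource.write.recordkey.field": "id",        # unique record key
    "hoodie.datasource.write.partitionpath.field": "name",  # partition column
    "hoodie.datasource.write.precombine.field": "ts",       # latest-wins field for upserts
    "hoodie.datasource.write.operation": "upsert",
}
(df.write.format("org.apache.hudi")  # long-form source name works on both Hudi 0.8 and 0.10
   .options(**hudi_options)
   .mode("overwrite")                # the first write creates the table
   .save("/tmp/hudi_smoke_test"))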
The configuration that needs to be modified:
vi /etc/profile
The configuration to add (both variables must point at the python3 binary, as in section 2.2.1):
export PYSPARK_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/software/Python-3.9.5/bin/python3
export PATH=/software/Python-3.9.5/bin:$PATH
On Linux it is submitted to spark-submit with the following wrapper script:
#!/usr/bin/env bash
fileName=$1
param=$2
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --jars /software/hudi-0.7.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 /software/spark-3.1.2-bin-hadoop2.7/Script/$fileName $param
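For reference, a minimal sketch of what the submitted file ($fileName above) could look like. Only the argument wiring is taken from the wrapper (the extra $param arrives as sys.argv[1]); treating it as a JSON config path and the app name are assumptions.
#!/usr/bin/env python3
# Sketch of a script run through the wrapper above; $param arrives as sys.argv[1].
import json
import sys
from pyspark.sql.session import SparkSession

if __name__ == "__main__":
    config_path = sys.argv[1]  # the $param passed by the wrapper (assumed to be a JSON file)
    with open(config_path) as f:
        conf = json.load(f)
    spark = (SparkSession.builder
             .appName("hudi-submit-example")  # illustrative app name
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())
    print("loaded config keys:", list(conf.keys()))
    spark.stop()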
⚠️ Note:
- Adding --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 works around the default python lacking sufficient permissions.
- If YARN cluster mode is used (--master yarn --deploy-mode cluster), the shared environment Spark depends on must be identical on every node.
The command is as follows:
/software/spark-3.1.2-bin-hadoop2.7/bin/spark-submit --jars /software/hudi-0.7.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 2G --executor-cores 3 --conf spark.default.parallelism=20 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3 --conf spark.sql.shuffle.partitions=200 --driver-memory 4g --conf spark.pyspark.python=/software/Python-3.9.5/bin/python3 --conf spark.pyspark.driver.python=/software/Python-3.9.5/bin/python3 --py-files /software/spark-3.1.2-bin-hadoop2.7/PySparkProject/bigDataAnalysisModeling/hahaha.zip /software/spark-3.1.2-bin-hadoop2.7/PySparkProject/bigDataAnalysisModeling/main.py
⚠️ Note: if you are shipping packaged (zipped) Python code, use the --py-files option; for ordinary files, use the --files option (see the sketch below).
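A minimal sketch of how --py-files behaves, under the assumption that hahaha.zip contains a Python package named hahaha (the package and module names here are hypothetical): once the zip is distributed with the job, its modules can be imported from main.py as if they were installed locally.
# main.py — modules inside hahaha.zip (distributed via --py-files) can be imported directly.
# The package name "hahaha" and module "utils" are hypothetical, for illustration only.
from hahaha import utils  # resolved from hahaha.zip on the driver and every executor
print(utils.__name__)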
2.1.2 spark2.4
import findspark
findspark.add_jars("/root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar")  # Hudi bundle jar
findspark.add_packages("org.apache.spark:spark-avro_2.11:2.4.4")  # needed for Avro (de)serialization
findspark._add_to_submit_args("--driver-class-path /usr/app/hadoop-2.10.1/etc/hadoop:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar")
# spark_home must point at the Spark 2.4 installation (as used in the submit command below), not the Hadoop directory
findspark.init(spark_home="/usr/app/spark-2.4.7-bin-hadoop2.7", python_path="/software/Python-3.9.5/bin/python3")
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.default.parallelism", 2) \
.appName("hudi-datalake-test9") \
.getOrCreate()
The command to submit it with spark-submit:
#!/usr/bin/env bash
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$LD_LIBRARY_PATH
fileName=$1
/usr/app/spark-2.4.7-bin-hadoop2.7/bin/spark-submit --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --master spark://10.20.3.72:7077 --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.spark:spark-avro_2.11:2.4.4 /usr/app/spark-2.4.7-bin-hadoop2.7/Script/$fileName
Usage:
bash runPySpark.py spark.py opt/sparkParam.py
2.2 Using the shell
2.2.1 spark3.1
The launch script [runPySpark.py] is as follows:
/software/spark-3.1.2-bin-hadoop2.7/bin/pyspark --jars /root/yl/hudi-0.10.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar --driver-class-path /software/hadoop-2.10.1/etc/hadoop:/software/apache-hive-2.3.8-bin/conf/:/software/member/config/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
Change the configuration to:
export PYSPARK_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/software/Python-3.9.5/bin/python3
export PYSPARK_DRIVER_PYTHON_OPTS
- (1) Install the libffi-devel package:
yum install libffi-devel -y
- (2) Recompile and reinstall in the Python 3 source directory:
make install
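Once the shell starts successfully, a quick sanity check is to read back an existing Hudi table; the table path below is an illustrative assumption, not from the original setup.
# Sanity check inside the pyspark shell (the path is an assumption).
path = "/tmp/hudi_smoke_test"
df = spark.read.format("org.apache.hudi").load(path)  # older Hudi releases may need a glob, e.g. load(path + "/*")
df.createOrReplaceTempView("hudi_smoke_test")
spark.sql("SELECT * FROM hudi_smoke_test").show()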
2.2.2 spark2.4
bin/pyspark --jars $HUDI_SPARK_BUNDLE --master spark://10.20.3.72:7077 --driver-class-path $HADOOP_CONF_DIR:/usr/app/apache-hive-2.3.8-bin/conf/:/software/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --deploy-mode client --driver-memory 1G --executor-memory 1G --num-executors 3 --packages org.apache.spark:spark-avro_2.11:2.4.4