First Spark Program: WordCount
1. Using spark-shell
- Prepare the data: create an `input` directory and a `Words.txt` file inside it, then enter the data:

```shell
[[email protected] spark-2.1.1]$ mkdir input
[[email protected] input]$ vim Words.txt
```

Contents of `Words.txt`:

```
hello spark hello scala hello world
```
- Start spark-shell:

```shell
[[email protected] spark-2.1.1]$ bin/spark-shell
```
- Write and run the WordCount program (`sc` is the SparkContext that spark-shell creates for you):

```scala
scala> sc.textFile("input/").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((scala,1), (hello,3), (world,1), (spark,1))
```
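To see what each step of the chain does, the same pipeline can be sketched with plain Scala collections, no Spark required. `reduceByKey` has no direct collections equivalent, so it is approximated here with `groupBy` plus a sum (the names `WordCountLocal`, `lines`, and `counts` are just for this sketch):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello spark hello scala hello world") // contents of Words.txt
    val counts = lines
      .flatMap(_.split(" "))   // one element per word
      .map((_, 1))             // pair each word with a count of 1
      .groupBy(_._1)           // stand-in for reduceByKey's grouping by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.toSeq.sorted.foreach(println)
  }
}
```

The only conceptual difference is that Spark runs these steps in parallel across partitions, and `reduceByKey` combines counts per partition before shuffling.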
2. Using the IDEA development tool
- Create a Maven project and add the following dependencies:

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- Packaging plugin; without it the Scala classes are not compiled and packaged -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
- Create a `WordCount.scala` file with the following code:

```scala
package com.guli

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark locally with one worker thread per CPU core
    val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val wcArray: Array[(String, Int)] = sc.textFile("/Users/zgl/Desktop/input")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
    wcArray.foreach(println)
    sc.stop()
  }
}
```
- Output:

```
(scala,1)
(hello,3)
(world,1)
(spark,1)
```
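Note that `collect` returns the pairs in no particular order. If you want the most frequent words first, you can sort the collected array on the driver; a minimal sketch, using a hypothetical `wcArray` holding the result above:

```scala
object SortedWordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical local copy of the collected result from above
    val wcArray = Array(("scala", 1), ("hello", 3), ("world", 1), ("spark", 1))
    // Sort by descending count; ties broken alphabetically by word
    val sorted = wcArray.sortBy { case (word, count) => (-count, word) }
    sorted.foreach(println)
  }
}
```

For large result sets this sorting should instead happen on the cluster (e.g. with the RDD `sortBy` transformation) before collecting, since `collect` pulls everything into driver memory.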