
Implementing Hadoop's setup and cleanup in Spark

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

def main(args: Array[String]) {
    val sc = new SparkContext("local", "xxx")
    val inputData = sc.textFile("hdfs://master:8020/data/spark/user-history-data")
    val lines = inputData.map(line => (line, line.length))

    val result = lines.mapPartitions { valueIterator =>
      if (valueIterator.isEmpty) {
        Iterator[ListBuffer[String]]()
      } else {
        val transformedItem = new ListBuffer[String]()            // setup: per-partition ListBuffer
        val fs: FileSystem = FileSystem.get(new Configuration())  // setup: per-partition FileSystem

        valueIterator.map { item =>
          transformedItem += item._1 + ":" + item._2
          // one output file per line, named after the first (tab-delimited) word of the line
          val outputFile = fs.create(new Path("/home/xxx/opt/data/spark/" + item._1.substring(0, item._1.indexOf("\t")) + ".txt"))
          outputFile.write((item._1 + ":" + item._2).getBytes())
          outputFile.close()
          if (!valueIterator.hasNext) {
            transformedItem.clear() // cleanup: transformedItem
            fs.close()              // cleanup: fs
          }
          transformedItem
        }
      }
    }

    result.foreach(println(_))
    sc.stop()
}
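The skeleton is easier to see without the per-user file handling. Below is a minimal, self-contained sketch of the same mapPartitions setup/cleanup pattern (the object name, sample data, and /tmp output path are illustrative, not taken from the code above): a PrintWriter is opened once per non-empty partition (setup), every record is written through it, and it is closed when the partition's iterator is exhausted (cleanup).

import java.io.PrintWriter
import org.apache.spark.SparkContext

// Illustrative demo of the mapPartitions setup/cleanup pattern; names and
// paths are made up for this example.
object SetupCleanupDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "setup-cleanup-demo")
    val data = sc.parallelize(Seq("zhangsan", "lisi", "wangwu"), 2)

    val lengths = data.mapPartitionsWithIndex { (idx, iter) =>
      if (iter.isEmpty) {
        Iterator[Int]()
      } else {
        // setup: runs once per non-empty partition
        val writer = new PrintWriter(s"/tmp/partition-$idx.txt")
        iter.map { line =>
          writer.println(s"$line:${line.length}")
          // cleanup: runs when the last element of the partition is produced
          if (!iter.hasNext) writer.close()
          line.length
        }
      }
    }

    lengths.collect().foreach(println)
    sc.stop()
  }
}

Because iter.map is lazy, the if (!iter.hasNext) check is evaluated as each element is consumed, so the writer is closed exactly when the downstream action pulls the last record of the partition; this laziness is what the pattern relies on.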

Take the following HDFS data:

zhangsan 1 2015-07-30 20:01:01 127.0.0.1
zhangsan 2 2015-07-30 20:01:01 127.0.0.1
zhangsan 3 2015-07-30 20:01:01 127.0.0.1
zhangsan 4 2015-07-31 20:01:01 127.0.0.1
zhangsan 5 2015-07-31 20:21:01 127.0.0.1
lisi 1 2015-07-30 21:01:01 127.0.0.1
lisi 2 2015-07-30 22:01:01 127.0.0.1
lisi 3 2015-07-31 23:31:01 127.0.0.1
lisi 4 2015-07-31 22:21:01 127.0.0.1
lisi 5 2015-07-31 23:11:01 127.0.0.1
wangwu 1 2015-07-30 21:01:01 127.0.0.1
wangwu 2 2015-07-30 22:01:01 127.0.0.1
wangwu 3 2015-07-31 23:31:01 127.0.0.1
wangwu 4 2015-07-31 22:21:01 127.0.0.1
wangwu 5 2015-07-31 23:11:01 127.0.0.1

read it into Spark, compute the length of each line, and write the results to local files (one file named after the first word of each line), ultimately reproducing Hadoop's setup and cleanup behavior in Spark.
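For comparison, this is roughly where setup and cleanup sit on the Hadoop MapReduce side. The Mapper below is a hypothetical sketch, not code from this post: setup() and cleanup() run once per map task, while map() runs once per input record.

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Hypothetical Hadoop-side equivalent, shown only to illustrate the
// setup()/map()/cleanup() lifecycle that the Spark code emulates.
class LineLengthMapper extends Mapper[LongWritable, Text, Text, NullWritable] {
  private var fs: FileSystem = _

  // setup(): called once per map task, before any map() call
  override def setup(context: Mapper[LongWritable, Text, Text, NullWritable]#Context): Unit = {
    fs = FileSystem.get(context.getConfiguration)
  }

  // map(): called once per input record
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, NullWritable]#Context): Unit = {
    val line = value.toString
    context.write(new Text(line + ":" + line.length), NullWritable.get())
  }

  // cleanup(): called once per map task, after the last map() call
  override def cleanup(context: Mapper[LongWritable, Text, Text, NullWritable]#Context): Unit = {
    fs.close()
  }
}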

The following links are strongly recommended reading:

http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%[email protected].com%3E

http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983

http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark
