Suppose we have a terabyte-scale RDD named data on which we need to perform a series of actions. This is typically done in map-reduce style: the map stage takes either a named function (def) or a lambda and returns a new RDD, and the reduce stage then aggregates the values (e.g. accumulates counts). For example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test')
try:
    # Stop any existing SparkContext before creating a new one
    sc.stop()
except Exception:
    pass
sc = SparkContext(conf=conf)

data = ["hello", "world", "hello", "world"]

rdd = sc.parallelize(data)
# Word count: map each word to (word, 1), then sum the counts per key
res_rdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

res_rdd.first()
Common operator APIs for reference: map() applies a function to every element of the RDD in parallel; reduce() and reduceByKey() aggregate values (e.g. for counting); filter() keeps only the elements matching a predicate; join() and union() combine two RDDs; and so on.
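To make the semantics of these operators concrete without needing a running Spark cluster, here is a plain-Python sketch that mirrors, on ordinary lists, what filter(), union(), and join() compute over an RDD's elements. The variable names are illustrative only; in Spark the same logic runs in parallel across partitions.

```python
from collections import defaultdict

words = ["hello", "world", "hello", "world"]

# map + reduceByKey: word count
# rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# filter: keep elements matching a predicate
# rdd.filter(lambda w: w == "hello")
hellos = [w for w in words if w == "hello"]

# union: concatenate two datasets (duplicates are kept)
# rdd.union(other_rdd)
more = ["spark"]
all_words = words + more

# join: inner join of two pair-RDDs on key
# rdd1.join(rdd2) pairs up values that share a key
left = [("hello", 1), ("world", 2)]
right = [("hello", "greeting")]
joined = [(k, (v1, v2))
          for (k, v1) in left
          for (k2, v2) in right
          if k == k2]
# joined -> [("hello", (1, "greeting"))]
```

Note that union() keeps duplicates (unlike a set union), and join() is an inner join: keys present in only one side, like "world" above, are dropped from the result.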