Implementing the TopK Algorithm with MapReduce: Principle and Code

2023-08-06 11:53:01

1. Map phase

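The map method inserts each record into a TreeMap that never holds more than K entries. After every insertion, the size of the TreeMap is compared with K; when it exceeds K, the smallest key is removed. Once the map task has processed all of its input, the cleanup method runs and passes the K largest values still held in the TreeMap to the reduce task.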
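The bounded TreeMap idea can be tried outside Hadoop. The following is a minimal sketch, not part of the original job: the class name BoundedTopK, the method topK, and the sample data are all hypothetical and only illustrate the insert-then-evict step described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-alone illustration of the map-side logic (no Hadoop needed).
public class BoundedTopK {

    // Returns the k largest distinct values of data, in ascending order.
    static List<Long> topK(long[] data, int k) {
        TreeMap<Long, Long> tree = new TreeMap<>();
        for (long v : data) {
            tree.put(v, v);                    // a repeated value overwrites itself
            if (tree.size() > k) {
                tree.remove(tree.firstKey());  // firstKey() is the smallest key
            }
        }
        return new ArrayList<>(tree.values()); // TreeMap iterates in ascending key order
    }

    public static void main(String[] args) {
        long[] split = {5, 1, 9, 3, 7, 9, 2};  // stands in for one input split
        System.out.println(topK(split, 3));    // prints [5, 7, 9]
    }
}
```

Because the tree never grows beyond K + 1 entries, memory per map task stays bounded regardless of how large the split is.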

2. Reduce phase

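In the reduce method, the K values emitted by each map task are inserted one by one into a TreeMap, which is again trimmed to K entries. A TreeMap is backed by a red-black tree, so its keys are always sorted: firstKey returns the smallest key and lastKey the largest, which means the retained values can be written out in either ascending or descending order. What remains is the overall top K.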
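The merge performed on the reduce side can be sketched in plain Java as well. This is an illustration under assumptions, not the job's actual reducer: the class MergeTopK, the method merge, and the two sample candidate lists are hypothetical stand-ins for the per-mapper outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-alone illustration of the reduce-side merge (no Hadoop needed).
public class MergeTopK {

    // Merges the candidate lists from all mappers and returns the k largest
    // distinct values in descending order.
    static List<Long> merge(List<List<Long>> mapperOutputs, int k) {
        TreeMap<Long, Long> tree = new TreeMap<>();
        for (List<Long> candidates : mapperOutputs) {
            for (long v : candidates) {
                tree.put(v, v);
                if (tree.size() > k) {
                    tree.remove(tree.firstKey()); // evict the current minimum
                }
            }
        }
        return new ArrayList<>(tree.descendingKeySet()); // largest first
    }

    public static void main(String[] args) {
        List<List<Long>> outputs = Arrays.asList(
                Arrays.asList(5L, 7L, 9L),    // top 3 from a first mapper
                Arrays.asList(4L, 8L, 10L));  // top 3 from a second mapper
        System.out.println(merge(outputs, 3)); // prints [10, 9, 8]
    }
}
```

With M map tasks the reducer sees at most M × K values, which is why a single reduce task is enough for this job.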

3. Code: find the 100 largest numbers among 10 million records.

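import java.io.IOException;
import java.net.URI;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class TopKAapp {
	private static final String INPUT_PATH = "hdfs://xxx/topk_input";
	private static final String OUT_PATH = "hdfs://xxx/topk_out";

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// Delete the output directory if it already exists; otherwise the job fails on startup.
		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		final Path outPath = new Path(OUT_PATH);
		if (fileSystem.exists(outPath)) {
			fileSystem.delete(outPath, true);
		}

		// Job.getInstance replaces the deprecated new Job(conf, name) constructor.
		final Job job = Job.getInstance(conf, TopKAapp.class.getSimpleName());
		job.setJarByClass(TopKAapp.class);
		FileInputFormat.setInputPaths(job, INPUT_PATH);
		job.setMapperClass(MyMapper.class);
		job.setPartitionerClass(HashPartitioner.class);
		// A single reducer sees every mapper's candidates and produces the global top K.
		job.setNumReduceTasks(1);
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(NullWritable.class);
		job.setOutputValueClass(LongWritable.class);
		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
		job.setOutputFormatClass(TextOutputFormat.class);
		job.waitForCompletion(true);
	}

	static class MyMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
		public static final int K = 100;
		// Sorted by key; because the value is its own key, repeated numbers are stored once.
		private TreeMap<Long, Long> tree = new TreeMap<Long, Long>();

		// Each input line is expected to hold one integer.
		@Override
		public void map(LongWritable key, Text text, Context context) throws IOException, InterruptedException {
			long temp = Long.parseLong(text.toString().trim());
			tree.put(temp, temp);
			// Keep only the K largest values seen so far by evicting the smallest.
			if (tree.size() > K)
				tree.remove(tree.firstKey());
		}

		// Runs once after the last map call: emit this task's K candidates.
		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			for (Long text : tree.values()) {
				context.write(NullWritable.get(), new LongWritable(text));
			}
		}
	}

	static class MyReducer extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
		public static final int K = 100;
		private TreeMap<Long, Long> tree = new TreeMap<Long, Long>();

		// Runs once after reduce: write the global top K from largest to smallest.
		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
			for (Long val : tree.descendingKeySet()) {
				context.write(NullWritable.get(), new LongWritable(val));
			}
		}

		// All candidates share the NullWritable key, so they arrive in a single reduce call.
		@Override
		protected void reduce(NullWritable key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
			for (LongWritable value : values) {
				tree.put(value.get(), value.get());
				if (tree.size() > K)
					tree.remove(tree.firstKey());
			}
		}
	}
}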
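One property of the code above is worth noting: the TreeMap uses each number as its own key, so a value that occurs several times is stored only once and the result is the top K distinct values. If duplicates should count separately, one possible alternative is a min-heap. The sketch below swaps in java.util.PriorityQueue for the TreeMap; it is not the approach used in this article, and the class name HeapTopK and its sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical variant: a min-heap keeps duplicates, unlike a TreeMap keyed by value.
public class HeapTopK {

    // Returns the k largest values of data (duplicates included), in descending order.
    static List<Long> topK(long[] data, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // natural order: smallest on top
        for (long v : data) {
            heap.offer(v);
            if (heap.size() > k) {
                heap.poll();                              // drop the current minimum
            }
        }
        List<Long> result = new ArrayList<>(heap);        // heap iteration order is unspecified
        result.sort(Collections.reverseOrder());          // so sort explicitly, largest first
        return result;
    }

    public static void main(String[] args) {
        long[] data = {5, 1, 9, 3, 7, 9, 2};
        System.out.println(topK(data, 3)); // prints [9, 9, 7]
    }
}
```

The same eviction rule applies in both structures; the only difference is whether equal values collapse into one entry.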
