Search Engines: MapReduce in Practice: The Inverted Index

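An inverted index (also called a postings file or inverted file) is an index structure that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.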

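There are two forms of inverted index:

A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.

A word-level inverted index (or full inverted index) additionally contains the positions of each word within a document.

The latter form supports more functionality (such as phrase search), but takes more time and space to build.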

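Example:

Taking English as an example, here are the texts to be indexed:

<code>T0 = "it is what it is"</code>

<code>T1 = "what is it"</code>

<code>T2 = "it is a banana"</code>

From these we get the following inverted file index:

<code>"a": {2}</code>

<code>"banana": {2}</code>

<code>"is": {0, 1, 2}</code>

<code>"it": {0, 1, 2}</code>

<code>"what": {0, 1}</code>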

A search for the terms <code>"what"</code>, <code>"is"</code> and <code>"it"</code> corresponds to the set intersection {0,1}∩{0,1,2}∩{0,1,2}={0,1}.
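The term search above can be sketched in a few lines of plain Java. This is only an illustration, not part of the MapReduce job: the class name `RecordLevelIndex` and its methods are made up for this sketch, and the texts for documents 0 and 1 are assumed to be "it is what it is" and "what is it", which is consistent with the sets quoted above.

```java
import java.util.*;

public class RecordLevelIndex {
    // Build word -> sorted set of document numbers (numbering starts at 0).
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String word : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND query: intersect the document sets of all query terms.
    static Set<Integer> search(Map<String, TreeSet<Integer>> index, String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings); // copy, so the index is not modified
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        // Assumed sample texts T0, T1, T2.
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, TreeSet<Integer>> index = build(docs);
        System.out.println(index);
        System.out.println(search(index, "what", "is", "it")); // prints [0, 1]
    }
}
```

A sorted set per word keeps the intersection simple; a real search engine would more likely store sorted posting lists and merge them.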

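For the same texts we get the following full inverted index, made up of pairs of a document number and a word position within that document. Like the document numbers, the word positions start at zero.

So <code>"banana": {(2, 3)}</code> means that "banana" occurs in the third document (T2), where it is the fourth word (position 3).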

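<code>"a": {(2, 2)}</code>

<code>"banana": {(2, 3)}</code>

<code>"is": {(0, 1), (0, 4), (1, 1), (2, 1)}</code>

<code>"it": {(0, 0), (0, 3), (1, 2), (2, 0)}</code>

<code>"what": {(0, 2), (1, 0)}</code>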

If we run a phrase search for <code>"what is it"</code>, every word of the phrase is found in both document 0 and document 1, but the words occur consecutively only in document 1.
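To make the phrase search concrete, here is a minimal sketch of a word-level index in plain Java. The class `PositionalIndex` and its methods are hypothetical names for this illustration, and the texts for documents 0 and 1 are again assumed to be "it is what it is" and "what is it". A phrase matches a document when each following word appears exactly one position after the previous one.

```java
import java.util.*;

public class PositionalIndex {
    // word -> (document number -> positions of the word in that document), all zero-based.
    static Map<String, TreeMap<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, TreeMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] words = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                index.computeIfAbsent(words[pos], w -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    // Returns the documents in which the words of the phrase occur consecutively.
    static List<Integer> phraseSearch(Map<String, TreeMap<Integer, List<Integer>>> index,
                                      String phrase) {
        String[] terms = phrase.split("\\s+");
        List<Integer> hits = new ArrayList<>();
        TreeMap<Integer, List<Integer>> first = index.get(terms[0]);
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
            int docId = e.getKey();
            for (int start : e.getValue()) {
                if (matchesAt(index, terms, docId, start)) {
                    hits.add(docId);
                    break; // one match per document is enough
                }
            }
        }
        return hits;
    }

    // True if terms[1..] appear at positions start+1, start+2, ... in the given document.
    static boolean matchesAt(Map<String, TreeMap<Integer, List<Integer>>> index,
                             String[] terms, int docId, int start) {
        for (int i = 1; i < terms.length; i++) {
            TreeMap<Integer, List<Integer>> postings = index.get(terms[i]);
            if (postings == null) {
                return false;
            }
            List<Integer> positions = postings.get(docId);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed sample texts T0, T1, T2.
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, TreeMap<Integer, List<Integer>>> index = build(docs);
        System.out.println(index.get("banana"));               // prints {2=[3]}
        System.out.println(phraseSearch(index, "what is it")); // prints [1]
    }
}
```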

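(1) Map phase

First, the default TextInputFormat class processes the input files and yields, for each line of text, its byte offset and its content. The Map phase then has to parse each input <key, value> pair to obtain the three pieces of information the inverted index needs: the word, the document URI, and the term frequency, as shown in the figure:

This raises two problems. First, a <key, value> pair can hold only two values, so unless a custom Hadoop data type is used, two of the three pieces have to be merged into one and used as the key or the value. Here the word and the URI are merged into the key, so that identical keys can later be summed into a per-document term frequency.

Second, a single Reduce phase cannot both count term frequencies and build the document list, so a Combine phase is added to do the counting. Keep in mind that Hadoop treats a combiner as an optional optimization that may run zero, one, or several times, and that records are partitioned by the key the mapper emits. A combiner that rewrites the key, as this one does, is therefore fragile and best limited to small tests like this one, with a single reducer.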
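The key packing described above can be illustrated without any Hadoop types. The sketch below is a variation, not the code used in this post: the class `CompositeKey` is hypothetical, and it joins word and URI with a tab instead of "+". The reasoning is that StringTokenizer's default delimiters include the tab, so a token can never contain one, whereas both a word such as "C++" and a file path may contain "+".

```java
public class CompositeKey {
    // Assumption: tab as separator, since a StringTokenizer token cannot contain it.
    static final char SEP = '\t';

    // Merge word and URI into a single string usable as a map output key.
    static String pack(String word, String uri) {
        return word + SEP + uri;
    }

    // Returns {word, uri}; splits at the first separator only,
    // so a URI that happens to contain the separator stays intact.
    static String[] unpack(String key) {
        int i = key.indexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("not a composite key: " + key);
        }
        return new String[] { key.substring(0, i), key.substring(i + 1) };
    }

    public static void main(String[] args) {
        // Hypothetical word and path, both containing '+'.
        String key = pack("C++", "/input/a+b.txt");
        String[] parts = unpack(key);
        System.out.println(parts[0]); // prints C++
        System.out.println(parts[1]); // prints /input/a+b.txt
    }
}
```

With the "+" separator the same input would be split into more than two pieces, so the word and the URI could not be recovered reliably.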

public static class Map extends Mapper<Object, Text, Text, Text> {
    private Text keyInfo = new Text();
    private Text valueInfo = new Text();
    private FileSplit split; // input split, used to look up the path of the file being read

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The split of the current task gives the path of the file the words come from
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // The key is made up of the word and the URI
            keyInfo.set(itr.nextToken() + "+" + split.getPath().toString());
            // The value is set to 1
            valueInfo.set("1");
            context.write(keyInfo, valueInfo);
        }
    }
}

(2) Combine phase

Values that share the same key are summed, which gives the frequency of a word within a document, as shown in the figure:

public static class Combiner extends Reducer<Text, Text, Text, Text> {
    private Text info = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word+URI key
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // The three commented-out lines do the same job as the four lines after them
        // int index = key.toString().indexOf("+");
        // info.set(key.toString().substring(index+1)+":"+sum);
        // key.set(key.toString().substring(0,index));
        // Split the incoming key at "+": str[0] is the word, str[1] is the URI
        String record = key.toString();
        String[] str = record.split("[+]");
        info.set(str[1] + ":" + sum);
        key.set(str[0]);
        context.write(key, info);
    }
}

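(3) Reduce phase

After the two phases above, the Reduce phase only needs to combine the values that share a key into the format required by the inverted index file. The MapReduce framework handles the rest.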

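public static class Reduce extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all "URI:count" entries of this word into one line
        String value = "";
        for (Text value1 : values) {
            value += value1.toString() + " ; ";
        }
        result.set(value);
        context.write(key, result);
    }
}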
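The data flow through the three phases can be traced with a small in-memory simulation. This is a sketch in plain Java with no Hadoop dependency: the class `InvertedIndexSimulation`, its method names, and the file names a.txt and b.txt are all hypothetical, and the combine step is applied once to all map output, whereas Hadoop runs a combiner per map task and does not guarantee how often.

```java
import java.util.*;

public class InvertedIndexSimulation {
    // Map phase: emit ("word+uri", "1") for every token of the line.
    static List<String[]> mapStep(String uri, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new String[] { itr.nextToken() + "+" + uri, "1" });
        }
        return out;
    }

    // Combine phase: sum the counts per "word+uri", then re-key to word -> "uri:count".
    static List<String[]> combineStep(List<String[]> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] kv : mapped) {
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            String[] parts = e.getKey().split("[+]");
            out.add(new String[] { parts[0], parts[1] + ":" + e.getValue() });
        }
        return out;
    }

    // Reduce phase: join all "uri:count" values of a word with " ; ".
    static Map<String, String> reduceStep(List<String[]> combined) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (String[] kv : combined) {
            grouped.computeIfAbsent(kv[0], k -> new StringBuilder())
                   .append(kv[1]).append(" ; ");
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical input files with one line each.
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(mapStep("a.txt", "it is what it is"));
        mapped.addAll(mapStep("b.txt", "what is it"));
        reduceStep(combineStep(mapped))
            .forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

For these two inputs the line for "is" reads `a.txt:2 ; b.txt:1 ; `, which matches the output format the Reduce class above produces.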

The complete code is as follows:

package ReverseIndex;

import java.io.*;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReverseIndex {

    public static class Map extends Mapper<Object, Text, Text, Text> {
        // Body as listed in step (1) above.
    }

    public static class Combiner extends Reducer<Text, Text, Text, Text> {
        // Body as listed in step (2) above. The three commented-out lines there do the
        // same job as the four lines after them, just implemented differently:
        // they split the incoming key at "+".
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // Body as listed in step (3) above.
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Job job = new Job();
        job.setJarByClass(ReverseIndex.class);
        // Use a single reduce task; a small test does not need more
        job.setNumReduceTasks(1);

        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(Combiner.class);

        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/thinkgamer/input"));
        FileOutputFormat.setOutputPath(job, new Path("/thinkgamer/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
