迷你搜尋引擎

這幾天在思考如何對項目做出擴充，當一個網站通路量上升之後随之而來的便是使用者的大量交流，根據現在主流的交流方式來看，一般都是一個使用者先進行發帖，然後其他使用者在下面對之評論，評論系統暫且擱置一邊不談，現在有一個問題就是當文章數量越來越多，如何快速找到與關鍵詞相對應的文章，使用關鍵字在資料庫中周遊不太現實，但其實仔細想想搜尋引擎不就是做這個的嗎，于是就想着實作一個簡單的搜尋引擎。

要實作一個簡單的搜尋引擎，必要的相關知識一定是必不可少的，簡單來說，第一步先要對已有的資源進行整理，根據整理出來的結果建構索引，搜尋子產品根據給定關鍵詞查找索引給出排序結果

對已有資源進行整理

搜尋引擎使用的前提就是有大量的html文檔，需要搜尋引擎來做出某種處理來給出正确的結果，是以搜尋引擎内部就需要對web雜亂的資源進行管理，以此來完成後面的工作，我們現在假設某一個html目錄内有幾千條html頁面，我們需要對這些頁面依次周遊進行統計，将頁面中的标題，正文以及html對應在web中的url以某種格式儲存下來，這裡是儲存在一個檔案中。

C++提供的庫中還不足以讓我們擁有周遊整個目錄這個功能，是以這裡需要借助boost庫。

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <boost/filesystem/path.hpp>
#include <boost/filesystem/operations.hpp>

const std::string g_input_path =  "../data/input"; // 待解析的html文檔都在這個目錄下
const std::string g_output_path = "../data/output/pase_file"; //将解析結果放在這個檔案中

struct docInfo {
    std::string title;
    std::string url;
    std::string content;
};

bool enumFile(const std::string& input_path, std::vector<std::string>& file_list) {
    namespace fs = boost::filesystem;
    fs::path root_path(input_path);
    if(!fs::exists(root_path)) {
        std::cout << "input err " << __FILE__ << ":" << __LINE__ << std::endl;
        return false;
    }

    // boost遞歸周遊目錄，借助疊代器
    fs::recursive_directory_iterator end_iter;    
    for(fs::recursive_directory_iterator iter(root_path); iter != end_iter; ++iter) {
        if(!fs::is_regular_file(*iter)) {
            continue; // 過濾掉目錄
        }

        if(iter->path().extension() != ".html") {
            continue;
        }

        file_list.push_back(iter->path().string());
    }
    return true;
}

int main()
{
    std::vector<std::string> file_list; // 将所有出現過的頁面路徑都放在這個vector中

    if(enumFile(g_input_path, file_list)) {
        std::cout << "enmu file err " << __FILE__ << ":" << __LINE__ << std::endl;
        return 1;
    }

    std::ofstream out_file(g_output_path.c_str());
    if(!out_file.is_open()) {
        std::cout << "out file err " << __FILE__ << ":" << __LINE__ << std::endl;
        return -1;
    }

    // 依次處理每個html頁面
    for(const auto& file_path : file_list) {
        struct docInfo info;
        if(parseFile(file_path, info)) {
            std::cout << "pase file err" << std::endl;
            continue;
        }
        writeOutPut(info, out_file);
    }
    out_file.close();
    return 0;
}

解析完畢之後在最終生成的配置檔案中每一行就對應一個html文檔

建構索引

認識反向索引
- 在普通情況下的索引都是會根據文章的标題或者序号給出文章正文的内容，這種通常情況下的索引我們叫做正排索引
- 反向索引就是在給出關鍵詞的情況下找出關鍵詞所出現的文章标題或者序号。
搜尋引擎的工作流程
- 先将使用者輸入詞或句進行拆分（這裡使用結巴分詞）
- 根據拆分詞結合權重在反向索引中查找，根據傳回id找到正排索引
- 根據正排索引将搜尋結果進行界面包裝進行傳回
建構索引

struct docInfo {
	uint64_t doc_id;
	std::string title;
	std::string content;
	std::string url;
};

struct Weight {
	uint64_t doc_id;
	int weight; // 權重 為排序做準備
	std::string key;
};

std::vector<docInfo> forward_index; //正排索引容器
std::unordered_map<std::string, std::vector<Weight>> inverted_index; // 倒排容器

bool buildIndex(std::string& file_path) {
	// ... 判斷檔案相關
	
	// 讀取配置檔案 周遊每個解析過的文檔
	std::string line;
	while(std::getline(file, line)) {
		// 構造正排索引 
		const docInfo *doc_info = buildForward(line);
		// 更新反向索引
		buildInverted(*doc_info);
	}
}

const docInfo* buildForward(std::string html) {
	std::vector<std::string> tokens;
	// ...使用boost對讀取的配置檔案進行切分
	
	docInfo doc_info;
	doc_info.doc_id = forward_index.size();
	doc_info.title = tokens[0];
	doc_info.content = tokens[1];
	doc_info.url = tokens[2];
	
	forward_index.push_back(doc_info);
	return &forward_index.back();
}

void buildInverted(const docInfo& doc_info) {
	// 對doc_info中的标題和正文進行分詞
	std::vector<std::string> title_tokens;
	cutWord(doc_info.title, title_tokens);
	
	std::vector<std::string> content_tokens;
	curWord(doc_info.content, content_tokens);
	
	// 對分詞結果進行詞頻統計
	struct wordCount {
		int title_count;
		int content_count;
	};
	std::unordered_map<std::string, wordCount> word_count;
	
	for(std::string word : title_tokens) {
		boost::to_lower(word); // 全部轉換成小寫
		++word_count[word].title_count;
	}
	
	for(std::string word : content_tokens) {
		boost::to_lower(word);
		++word_count[word].content_count;
	}
	
	// 周遊結果 更新反向索引
	for(const auto& word_pair : word_count) {
		Weight weight;
		weight.doc_id = doc_info.doc_id;
		weight.weight = 10 * word_pair.second.title_count + word_pair.second.content_count;
		weight.key = word_pair.first;
		
		auto& inverted_list = inverted_index[word_pair.first];
		inverted_index.push_back(weight);
	}
}

索引已經正常構造完畢，剩下的就是将使用者的輸入拆分進行搜尋

bool Search(const std::string& input_query, std::string& result) {
	// 分詞 對查詢詞進行分詞
	std::vector<std::string> tokens;
	curWord(input_query, tokens);
	
	// 觸發	對分詞結果進行查詢
	std::vector<Weight> all_tokens_result;
	for(std::string word : tokens) {
		boost::to_lower(word);
		auto *inverted_list = index->getInvertedList(word); // 此處傳回的是指向倒排連結清單的指針
		if(inverted_list == nullptr) {
			continue;
		}
		
		all_tokens_result.insert(all_tokens_result.end(),\
		inverted_list.begin(), inverted_list.end());
	}
	
	// 排序 對結果進行一定的排序
	std::sort(all_tokens_result.begin(), all_tokens_result.end(), 
				[](const Weight& w1, const Weight& w2){
					return w1.weight > w2.weight;
				});
				
	// 構造結果
	// ...
}

至此全部結構都已經完成，點我擷取源碼

迷你搜尋引擎

對已有資源進行整理

建構索引

繼續閱讀

Ubuntu14.04 LTS下安裝mongodb

C++實作簡單順序表

httpd服務的部署、啟動、配置和簡單優化一、部署二、啟動三、配置檔案

配置網頁内容通路

手動安裝Intel network I217-LM網卡的Linux驅動

禁止ubuntu系統彈出報錯界面

Ubuntu Linux下Apache的配置檔案

C經典書籍筆記——C陷阱與缺陷②(文法陷阱之優先級)一、錯誤案列二、優先級規律

線性表之順序表的實作

samba伺服器的功能

C++判斷素數、求最大公約數代碼判斷一個數是否為素數求兩個數的最大公約數

【Linux】UDP廣播封包接收速率問題

SequoiaDB巨杉資料庫C++驅動概述

Linux裝置模型（中）之上層容器

PowerPC平台 Linux移植三