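Study Notes CB011: the Lucene search engine library, the IKAnalyzer Chinese word segmenter, the retrieval service, querying the index, driving traffic, word2vec...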

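A chat corpus built from film and TV subtitles has a useful property: it lists more than thirty million lines of spoken Chinese, one utterance per line, and the line that follows an utterance is very likely a good reply to it. A question can have many possible replies, so all candidates can be ranked by relevance and by the chat history to find the best one. Answering is therefore a search-and-rank process.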
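
The pairing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `adjacent_pairs` and the sample lines are assumptions made up for the example.

```python
def adjacent_pairs(lines):
    """Yield (question, answer) pairs from consecutive non-empty lines."""
    last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no utterance and are skipped
        if last is not None:
            yield last, line  # the previous line is treated as the question
        last = line


if __name__ == "__main__":
    # Hypothetical sample standing in for a few subtitle lines
    sample = ["How are you?", "", "Fine, thanks.", "Where are you going?"]
    for question, answer in adjacent_pairs(sample):
        print(question, "->", answer)
```

Every utterance except the first and the last appears twice, once as an answer and once as a question, which is what makes a plain list of lines usable as question/answer data.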

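Lucene + IK. Lucene is a free, open-source search engine library written in Java. IK (IKAnalyzer) is an open-source Chinese word segmenter. The pipeline is: segment the corpus and build an index, use text search to retrieve relevant questions, take the line that follows each hit as the answer candidate set, rank the answers, and analyze the question.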

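Building the index. Create a Maven project in Eclipse. Maven generates the pom.xml file, which holds the package dependency information. Add the following dependencies inside the dependencies tag: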

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>5.0.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.1.41</version>
</dependency>

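Add the following configuration inside the project tag so that the dependency jars are copied into the lib directory automatically: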

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>prepare-package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        <overWriteReleases>false</overWriteReleases>
                        <overWriteSnapshots>false</overWriteSnapshots>
                        <overWriteIfNewer>true</overWriteIfNewer>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>theMainClass</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>

Download the IK source code from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ik-analyzer/IK%20Analyzer%202012FF_hf1_source.rar , copy its src/org directory into src/main/java of the chatbotv1 project, and refresh the Maven project.

Maven generates App.java in the com.shareditor.chatbotv1 package. Rename it to Indexer.java and change it as follows:

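package com.shareditor.chatbotv1;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class Indexer {
    public static void main(String[] args) throws Exception {
        // Assumed argument order, matching the command shown after this listing:
        // corpus file first, then index directory.
        String corpusPath = args[0];
        String indexPath = args[1];

        Analyzer analyzer = new IKAnalyzer(true);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        iwc.setOpenMode(OpenMode.CREATE); // overwrite any existing index
        iwc.setUseCompoundFile(true);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File(indexPath)), iwc);
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(corpusPath), "UTF-8"));
        String line;
        String last = "";
        long lineNum = 0;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!last.isEmpty()) {
                // The previous line is the question; the current line is its answer.
                Document doc = new Document();
                doc.add(new TextField("question", last, Store.YES));
                doc.add(new StoredField("answer", line));
                indexWriter.addDocument(doc);
            }
            last = line;
            lineNum++;
            if (lineNum % 100000 == 0) {
                System.out.println("add doc " + lineNum);
            }
        }
        br.close();
        indexWriter.forceMerge(1); // merge the index into a single segment
        indexWriter.close();
    }
}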

Build the project, copy all files under src/main/resources into the target directory, and run the following from the target directory:

java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Indexer ../../subtitle/raw_subtitles/subtitle.corpus ./index

This generates the index directory index, which can be inspected with lukeall-4.9.0.jar.
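
Conceptually, the index maps each term of a question to the documents that contain it. The following is a toy sketch of that idea in Python, not Lucene's actual data structures or scoring: whitespace splitting stands in for IKAnalyzer, a count of shared terms stands in for Lucene's relevance score, and the names `build_index` and `candidates` are invented for the example.

```python
from collections import defaultdict


def build_index(pairs, tokenize=str.split):
    """Map each term of every question to the ids of the pairs containing it."""
    postings = defaultdict(set)
    for doc_id, (question, _answer) in enumerate(pairs):
        for term in tokenize(question):
            postings[term].add(doc_id)
    return postings


def candidates(query, pairs, postings, tokenize=str.split, top_n=3):
    """Return the answers whose questions share the most terms with the query."""
    overlap = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in postings.get(term, ()):
            overlap[doc_id] += 1
    # Most shared terms first; ties broken by document id for a stable order
    ranked = sorted(overlap, key=lambda d: (-overlap[d], d))
    return [pairs[d][1] for d in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical question/answer pairs
    pairs = [("how are you", "fine thanks"),
             ("where are you going", "to the market")]
    postings = build_index(pairs)
    print(candidates("how are you today", pairs, postings))
```

Only the question text is searchable here, while the answer is merely stored and returned, which mirrors the roles of the question and answer fields in the index.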

Retrieval service. An HTTP server is built on Netty. The code is in the chatbotv1 directory of https://github.com/warmheartli/ChatBotCourse :

// Fragment of the request handler. q, query, indexSearcher, collector, topDocs
// and ret are set up earlier in the full source (see the repository).
Analyzer analyzer = new IKAnalyzer(true);
QueryParser qp = new QueryParser(Version.LUCENE_4_9, "question", analyzer);

// No hits so far: require every term to match (AND).
if (topDocs.totalHits == 0) {
    qp.setDefaultOperator(Operator.AND);
    query = qp.parse(q);
    System.out.println(query.toString());
    indexSearcher.search(query, collector);
    topDocs = collector.topDocs();
}

// Still no hits: relax the query so that any term may match (OR).
if (topDocs.totalHits == 0) {
    qp.setDefaultOperator(Operator.OR);
    query = qp.parse(q);
    System.out.println(query.toString());
    indexSearcher.search(query, collector);
    topDocs = collector.topDocs();
}

// Build the JSON response: hit count, the query, and one item per hit.
ret.put("total", topDocs.totalHits);
ret.put("q", q);
JSONArray result = new JSONArray();
for (ScoreDoc d : topDocs.scoreDocs) {
    Document doc = indexSearcher.doc(d.doc);
    String question = doc.get("question");
    String answer = doc.get("answer");
    JSONObject item = new JSONObject();
    item.put("question", question);
    item.put("answer", answer);
    item.put("score", d.score);
    item.put("doc", d.doc);
    result.add(item);
}
ret.put("result", result);
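
The fallback in the handler above, strict matching first and looser matching only when nothing is found, can be illustrated without Lucene. This is a simplified sketch under stated assumptions: `postings` is a plain dict from term to a set of document ids, there is no scoring, and the name `search_with_fallback` is made up for the example.

```python
def search_with_fallback(terms, postings):
    """Try a strict AND match first; fall back to OR when nothing matches.

    Returns the mode that produced the result and the matching document ids.
    """
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return "NONE", set()
    strict = set.intersection(*sets)  # documents containing every term
    if strict:
        return "AND", strict
    return "OR", set.union(*sets)     # documents containing any term


if __name__ == "__main__":
    # Hypothetical posting lists
    postings = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
    print(search_with_fallback(["a", "b"], postings))
    print(search_with_fallback(["a", "c"], postings))
```

Trying AND first keeps precision high for queries the corpus covers well, and the OR pass trades precision for recall so that the bot still has something to say.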

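Querying the index. The query string is segmented and assembled into a Lucene query, which is run against the question field of the index. The answer field of each match is returned as the candidate set, and one candidate is picked as the reply. The server is accessed over HTTP, for example http://127.0.0.1:8765/?q=hello . Chinese text must be URL-encoded before it is sent, and the Java side decodes it accordingly. Start the server with: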

java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Searcher
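
A client for this service only has to URL-encode the query and read the JSON reply. The sketch below assumes the response shape produced by the handler shown earlier (total, q, result) and the host and port from the example URL; the function names `build_query_url` and `first_answer` are invented for the illustration.

```python
import json
from urllib.parse import urlencode


def build_query_url(q, host="127.0.0.1", port=8765):
    """Build the GET URL; urlencode percent-encodes the UTF-8 bytes of q."""
    return "http://%s:%d/?%s" % (host, port, urlencode({"q": q}))


def first_answer(body):
    """Pick the top answer from a JSON reply, or '' when there are no hits."""
    res = json.loads(body)
    if res.get("total", 0) > 0 and res.get("result"):
        return res["result"][0]["answer"]
    return ""


if __name__ == "__main__":
    print(build_query_url("你好"))
    # A hand-written reply standing in for what the server might return
    reply = '{"total": 1, "q": "hi", "result": [{"question": "hi", "answer": "hello", "score": 1.0, "doc": 0}]}'
    print(first_answer(reply))
```

With the server running, the URL could be fetched with `urllib.request.urlopen(build_query_url("你好"), timeout=60)` and the response body passed to `first_answer`.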

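Chat interface. It needs a box that displays the conversation, for which CKEditor is chosen because it can render HTML content, plus an input field and a send button. The HTML: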

<div class="col-sm-4 col-xs-10">
    <div class="row">
        <textarea id="chatarea">
            <div style='color: blue; text-align: left; padding: 5px;'>Bot: Hey there, you finally want to chat with me! Come on, let's talk, I'm up for anything!</div>
            <div style='color: blue; text-align: left; padding: 5px;'>Bot: What? You ask how I got smart enough to chat? I just ate a pile of film and TV subtitles!</div>
        </textarea>
    </div>
    <br />
    <div class="row">
        <div class="input-group">
            <input type="text" id="input" class="form-control" autofocus="autofocus" onkeydown="submitByEnter()" />
            <span class="input-group-btn">
                <button class="btn btn-default" type="button" onclick="submit()">Send</button>
            </span>
        </div>
    </div>
</div>
<script type="text/javascript">
    CKEDITOR.replace('chatarea',
    {
        readOnly: true,
        toolbar: ['Source'],
        height: 500,
        removePlugins: 'elementspath',
        resize_enabled: false,
        allowedContent: true
    });
</script>

Calling the chat server requires a controller that sends the request and fetches the result:

public function queryAction(Request $request)
{
    $q = $request->get('input');
    $opts = array(
        'http' => array(
            'method' => "GET",
            'timeout' => 60,
        )
    );
    $context = stream_context_create($opts);
    $clientIp = $request->getClientIp();
    // URL-encode both parameters before building the query string.
    $url = 'http://127.0.0.1:8765/?q=' . urlencode($q) . '&clientIp=' . urlencode($clientIp);
    $response = file_get_contents($url, false, $context);
    $result = '';
    // file_get_contents returns false on failure; reply with an empty string in that case.
    if ($response !== false) {
        $res = json_decode($response, true);
        if (is_array($res) && isset($res['total']) && $res['total'] > 0) {
            $result = $res['result'][0]['answer'];
        }
    }
    return new Response($result);
}

Route configuration for the controller:

chatbot_query:
    path: /chatbot/query
    defaults: { _controller: AppBundle:ChatBot:query }

The chat server can take a while to respond. To keep the web page from freezing, submit sends the request and receives the result asynchronously:

var xmlHttp;

function submit() {
    if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    }
    else if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    }
    var input = $("#input").val().trim();
    if (input == '') {
        $('#input').val('');
        return;
    }
    addText(input, false);
    $('#input').val('');
    // encodeURIComponent also escapes & = and +, which encodeURI leaves untouched.
    var datastr = "input=" + encodeURIComponent(input);
    var url = "/chatbot/query";
    xmlHttp.open("POST", url, true);
    xmlHttp.onreadystatechange = callback;
    xmlHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    xmlHttp.send(datastr);
}

function callback() {
    // readyState 4 means the response is complete.
    if (xmlHttp.readyState == 4 && xmlHttp.status == 200) {
        var responseText = xmlHttp.responseText;
        addText(responseText, true);
    }
}
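
Because the request body is form-encoded, the user's text has to be escaped as a single component, otherwise characters such as & and = are read as field separators. The sketch below shows the effect with Python's urllib.parse; `form_body` is a name invented for this illustration, and the sample input is hypothetical.

```python
from urllib.parse import parse_qs, quote


def form_body(text):
    """Encode text as the value of the 'input' form field."""
    # safe="" makes quote escape every reserved character, including & and =
    return "input=" + quote(text, safe="")


if __name__ == "__main__":
    text = "a&b=c"
    # Unescaped: the server sees two fields, and 'input' is truncated to 'a'
    print(parse_qs("input=" + text))
    # Escaped: the server recovers the full text
    print(form_body(text), parse_qs(form_body(text)))
```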

addText appends a piece of text to the CKEditor instance:

function addText(text, is_response) {
    var oldText = CKEDITOR.instances.chatarea.getData();
    var prefix = '';
    if (is_response) {
        prefix = "<div style='color: blue; text-align: left; padding: 5px;'>Bot: ";
    } else {
        prefix = "<div style='color: darkgreen; text-align: right; padding: 5px;'>Me: ";
    }
    // Append the new line to the existing conversation and re-render it.
    CKEDITOR.instances.chatarea.setData(oldText + "" + prefix + text + "</div>");
}
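
addText places the message text directly inside markup, so any HTML in a message is rendered rather than shown. One possible hardening, which is not part of the original code, is to escape the text before wrapping it. The Python sketch below shows the idea; `render_message` is a hypothetical name, and the same approach would apply on the JavaScript side.

```python
import html


def render_message(text, is_response):
    """Wrap one chat line in a styled div, escaping HTML-special characters."""
    if is_response:
        prefix = "<div style='color: blue; text-align: left; padding: 5px;'>Bot: "
    else:
        prefix = "<div style='color: darkgreen; text-align: right; padding: 5px;'>Me: "
    # html.escape turns <, >, & and quotes into character references
    return prefix + html.escape(text) + "</div>"


if __name__ == "__main__":
    print(render_message("<b>hi</b>", False))
```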

Code:

https://github.com/warmheartli/ChatBotCourse

https://github.com/warmheartli/shareditor.com

Demo: http://www.shareditor.com/chatbot/

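Driving traffic. Start from the site's traffic statistics: the CNZZ report of visited pages over the last half month shows which pages users concentrate on. To attract clicks, add an animated image button: a small animated icon in the bottom-right corner of every page that stays in place while the page scrolls and takes the user straight to the page we want to drive traffic to. A starting point can be found by searching for floating customer-service widget code.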

Create a JS file, lrtk.js:

$(function()
{
    // Build the floating widget: a link to the chatbot page wrapping the button container.
    var tophtml="<a href=\"http://www.shareditor.com/chatbot/\" target=\"_blank\"><div id=\"izl_rmenu\" class=\"izl-rmenu\"><div class=\"btn btn-phone\"></div><div class=\"btn btn-top\"></div></div></a>";
    $("#top").html(tophtml);
    $("#izl_rmenu").each(function()
    {
        $(this).find(".btn-phone").mouseenter(function()
        {
            $(this).find(".phone").fadeIn("fast");
        });
        $(this).find(".btn-phone").mouseleave(function()
        {
            $(this).find(".phone").fadeOut("fast");
        });
        $(this).find(".btn-top").click(function()
        {
            $("html, body").animate({
                scrollTop:0
            },"fast");
        });
    });
    var lastRmenuStatus=false;
    $(window).scroll(function()
    {
        var _top=$(window).scrollTop();
        // scrollTop() is never negative, so the widget is marked expanded on the first scroll event.
        if(_top>=0)
        {
            $("#izl_rmenu").data("expanded",true);
        }
        else
        {
            $("#izl_rmenu").data("expanded",false);
        }
        // Animate only when the expanded state actually changes.
        if($("#izl_rmenu").data("expanded")!=lastRmenuStatus)
        {
            lastRmenuStatus=$("#izl_rmenu").data("expanded");
            if(lastRmenuStatus)
            {
                $("#izl_rmenu .btn-top").slideDown();
            }
            else
            {
                $("#izl_rmenu .btn-top").slideUp();
            }
        }
    });
});

The upper half defines the content of the div tag with id=top: a div with the id izl_rmenu, whose CSS is defined in a separate file, lrtk.css:

.izl-rmenu{position:fixed;left:85%;bottom:10px;padding-bottom:73px;z-index:999;}

.izl-rmenu .btn{width:72px;height:73px;margin-bottom:1px;cursor:pointer;position:relative;}

.izl-rmenu .btn-top{background:url(http://www.shareditor.com/uploads/media/default/0001/01/thumb_416_default_big.png) 0px 0px no-repeat;background-size: 70px 70px;display:none;}

The lower half expands the div when the page is scrolled.

Add the following to the code shared by all pages:

<div id="top"></div>

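Putting the large corpus to use. To train an LSTM-RNN on it, the Chinese text first has to be converted into vectors that the algorithm can work with. The tool used for this is word2vec, a widely used word-embedding tool.

word2vec takes a text file of segmented words as input. The subtitle corpus consists of complete sentences separated by newlines, so we segment it first with word_segment.py: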

# coding:utf-8
import sys

import jieba


def segment(input_path, output_path):
    """Segment each line with jieba and write the words separated by spaces."""
    with open(input_path, "r", encoding="utf-8") as input_file, \
            open(output_path, "w", encoding="utf-8") as output_file:
        for line in input_file:
            words = jieba.cut(line.strip())
            output_file.write(" ".join(words) + "\n")


if __name__ == '__main__':
    if 3 != len(sys.argv):
        print("Usage: ", sys.argv[0], "input output")
        sys.exit(-1)
    segment(sys.argv[1], sys.argv[2])

Usage:

python word_segment.py subtitle/raw_subtitles/subtitle.corpus segment_result

Generating word vectors with word2vec. The word2vec source is available at https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec ; run make to build the binaries.

Run:

./word2vec -train ../segment_result -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

This produces vectors.bin, the word vectors in binary format. The distance tool that ships with word2vec can be used to check them:

./distance vectors.bin
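
The distance tool ranks words by the cosine similarity of their vectors. A minimal pure-Python sketch of that measure follows; the function names `cosine` and `nearest` and the tiny two-dimensional vectors are assumptions made up for the example, whereas real vectors here would have 200 components.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=5):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(cosine(target, v), w) for w, v in vectors.items() if w != word]
    scored.sort(reverse=True)  # highest similarity first
    return [w for _score, w in scored[:k]]


if __name__ == "__main__":
    # Hypothetical toy vectors
    vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(nearest("a", vectors, k=2))
```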

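Loading the binary word-vector file. The binary file that word2vec writes starts with a text header line: the number of words, a space, and the vector dimension. Each entry that follows holds the word, a space, the vector components as 4-byte floats, and a newline.

A Python script that loads the binary word-vector file: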
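
To make the layout concrete: the original word2vec tool writes a text header, then for each word the word itself, a space, the components as native 4-byte floats, and a newline. The sketch below serializes a tiny dictionary in that layout so the bytes can be inspected or fed to a loader; `dump_vectors` and the sample words are invented for the illustration.

```python
import struct


def dump_vectors(vectors):
    """Serialize {word: [floats]} in the word2vec binary layout."""
    size = len(next(iter(vectors.values())))
    # Header line: "<word count> <dimension>\n"
    out = [("%d %d\n" % (len(vectors), size)).encode("utf-8")]
    for word, vec in vectors.items():
        out.append(word.encode("utf-8") + b" ")      # word, then one space
        out.append(struct.pack("%df" % size, *vec))  # native 4-byte floats
        out.append(b"\n")                            # entry terminator
    return b"".join(out)


if __name__ == "__main__":
    data = dump_vectors({"hi": [1.0, 2.0, 3.0], "yo": [0.0, 0.5, 1.0]})
    print(data[:4], len(data))
```

Writing a small file this way gives a quick means of checking a loader before running it against a full vectors.bin.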

# coding:utf-8
import sys

import numpy as np

float_size = 4  # bytes per float32 component


def load_vectors(input_path):
    print("begin load vectors")
    word_vector = {}
    with open(input_path, "rb") as input_file:
        # Header line: vocabulary size and vector dimension
        words, size = (int(n) for n in input_file.readline().split())
        print("words =", words)
        print("size =", size)
        for _ in range(words):
            # Read one word, byte by byte, up to the separating space
            chars = bytearray()
            while True:
                c = input_file.read(1)
                if not c or c == b' ':
                    break
                if c != b'\n':  # skip the newline that ends the previous entry
                    chars += c
            # Read the vector: size float32 values
            m = input_file.read(float_size * size)
            vector = np.frombuffer(m, dtype=np.float32).copy()
            # Store the word and its vector in the dict
            word_vector[chars.decode('utf-8')] = vector
    print("load vectors finish")
    return word_vector


if __name__ == '__main__':
    if 2 != len(sys.argv):
        print("Usage: ", sys.argv[0], "vectors.bin")
        sys.exit(-1)
    d = load_vectors(sys.argv[1])
    print(d['真的'])

Run it as follows:

python word_vectors_loader.py vectors.bin

References:

Natural Language Processing with Python

http://www.shareditor.com/blogshow?blogId=113

http://www.shareditor.com/blogshow?blogId=114

http://www.shareditor.com/blogshow?blogId=115


Reposted from: https://www.cnblogs.com/libinggen/p/8898062.html