[python] Converting Sogou cell dictionaries (.scel) to mmseg dictionaries

From: https://github.com/aboutstudy/scel2mmseg

------------------------------------------------------------

Convert Sogou cell dictionaries (.scel) into mmseg dictionaries.

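Features:

scel2mmseg.py: converts a .scel file into an mmseg-format .txt file

Usage: python scel2mmseg.py a.scel a.txt

Batch conversion: python scel2mmseg.py scel_directory a.txt

Note: every newly added word gets a frequency of 1. The format is explained as follows [quoted from http://jjw.in/server/226 ]:

Each record takes two lines. The first line is the term, in the format [term]\t[frequency]. For a single character, the value is the frequency with which that character stands alone as a word. This frequency has to be counted over a large pre-segmented corpus, so users who add or remove words normally do not need to change it. For multi-character words, the frequency must be 1. The second line is a placeholder. It exists because LibMMSeg's code was adapted from another Coreseek segmentation library (an N-gram model), where the second line held the word's frequency distribution across parts of speech. LibMMSeg users simply fill in "x:1" on the second line.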
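
The two-line record layout described above can be sketched as follows. The helper names `format_entry` and `parse_entries` are hypothetical and not part of scel2mmseg.py; the parser assumes strictly alternating term and placeholder lines.

```python
def format_entry(word, freq=1):
    """Return one record: the term line followed by the placeholder line."""
    return "%s\t%d\nx:1\n" % (word, freq)


def parse_entries(text):
    """Yield (word, freq) pairs, assuming term and placeholder lines alternate."""
    lines = [line for line in text.splitlines() if line]
    for term_line in lines[0::2]:  # every second line is a placeholder
        word, _, freq = term_line.partition("\t")
        yield word, int(freq)


sample = format_entry("搜狗") + format_entry("词库")
print(sample, end="")
print(list(parse_entries(sample)))
```

Each word produced by scel2mmseg.py is written in exactly this shape, with the frequency fixed at 1.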
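
mergedict.py: merges several mmseg .txt files into one .txt file

Usage: python mergedict.py unigram.txt b.txt c.txt new.txt

Note: each .txt file may be in mmseg format or in one-word-per-line format (in which case the frequency defaults to 1)

Caution: the merge removes duplicates, so a word that has already appeared in an earlier file is not appended to the output again. For this reason unigram.txt must come first.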
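
mergedict.py itself is not reproduced in this post. The following is only a minimal sketch of the behaviour described above, not the actual script. The names `iter_words` and `merge` and the sample data are made up, and it assumes that no real word starts with "x:".

```python
def iter_words(lines):
    """Yield (word, freq) from mmseg-format or one-word-per-line input."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines and placeholder lines (assumed to start with "x:").
        if not line.strip() or line.startswith("x:"):
            continue
        word, sep, freq = line.partition("\t")
        # Plain word lists carry no frequency, so default to 1.
        yield word, (int(freq) if sep else 1)


def merge(sources):
    """Merge line sequences in order, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for lines in sources:
        for word, freq in iter_words(lines):
            if word not in seen:
                seen.add(word)
                merged.append((word, freq))
    return merged


unigram = ["的\t5\n", "x:1\n"]      # made-up frequency, mmseg format
extra = ["的\n", "细胞词库\n"]       # one word per line
print(merge([unigram, extra]))      # frequency from unigram is kept
print(merge([extra, unigram]))      # frequency from unigram is lost
```

The second call shows why the order matters: whichever file mentions a word first determines its frequency, so putting unigram.txt first preserves its corpus-derived values.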

------------------------------------------------------------

scel2mmseg.py:

------------------------------------------------------------

# Requires Python 3.
import glob
import os
import struct
import sys


def read_utf16_str(f, offset=-1, length=2):
    """Read `length` bytes (from `offset` if given) and decode them as UTF-16LE."""
    if offset >= 0:
        f.seek(offset)
    return f.read(length).decode('UTF-16LE')


def read_uint16(f):
    """Read one little-endian unsigned 16-bit integer."""
    return struct.unpack('<H', f.read(2))[0]


def get_word_from_sogou_cell_dict(fname):
    """Yield (pinyin, word) pairs from a Sogou .scel cell dictionary."""
    file_size = os.path.getsize(fname)

    with open(fname, 'rb') as f:
        # Byte 4 of the header selects where the word table starts.
        mask = f.read(128)[4]
        if mask == 0x44:
            hz_offset = 0x2628
        elif mask == 0x45:
            hz_offset = 0x26c4
        else:
            sys.exit('%s: unsupported .scel header byte 0x%02x' % (fname, mask))

        # Fixed-size metadata fields (read but not used by the conversion).
        title = read_utf16_str(f, 0x130, 0x338 - 0x130)
        type_ = read_utf16_str(f, 0x338, 0x540 - 0x338)
        desc = read_utf16_str(f, 0x540, 0xd40 - 0x540)
        samples = read_utf16_str(f, 0xd40, 0x1540 - 0xd40)

        # Pinyin table: entries of (code, byte length, UTF-16LE string).
        py_map = {}
        f.seek(0x1540 + 4)
        while True:
            py_code = read_uint16(f)
            py_len = read_uint16(f)
            py_str = read_utf16_str(f, -1, py_len)
            if py_code not in py_map:
                py_map[py_code] = py_str
            # 'zuo' is treated as the last entry of the table.
            if py_str == 'zuo':
                break

        # Word table: each group of words shares one pinyin sequence.
        f.seek(hz_offset)
        while f.tell() != file_size:
            word_count = read_uint16(f)
            pinyin_count = read_uint16(f) // 2  # byte length -> number of codes

            py_set = [py_map[read_uint16(f)] for _ in range(pinyin_count)]
            py_str = "'".join(py_set)

            for _ in range(word_count):
                word_len = read_uint16(f)
                word_str = read_utf16_str(f, -1, word_len)
                f.read(12)  # skip the 12 bytes that follow each word
                yield py_str, word_str


def showtxt(records):
    """Debug helper: print each word together with its length."""
    for _pystr, word in records:
        print(len(word), word)


def store(records, f):
    """Write records to the text file `f` in mmseg format, with frequency 1."""
    for _pystr, word in records:
        f.write('%s\t1\n' % word)
        f.write('x:1\n')


def main():
    if len(sys.argv) != 3:
        print('Usage: python %s file.scel|scel_dir new.txt' % sys.argv[0])
        sys.exit(1)

    src, dst = sys.argv[1], sys.argv[2]

    # If src is a directory, convert every .scel file in it and combine
    # the results in one output file.
    if os.path.isdir(src):
        # Append mode: an existing output file is extended, not overwritten.
        with open(dst, 'a', encoding='utf-8') as out:
            for file_name in glob.glob(os.path.join(src, '*.scel')):
                print(file_name)
                store(get_word_from_sogou_cell_dict(file_name), out)
    else:
        with open(dst, 'w', encoding='utf-8') as out:
            store(get_word_from_sogou_cell_dict(src), out)
            # showtxt(get_word_from_sogou_cell_dict(src))


if __name__ == "__main__":
    main()

------------------------------------------------------------
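
To make the word-table layout read by get_word_from_sogou_cell_dict easier to follow, here is a small sketch that builds one synthetic word group in memory and parses it back. It uses made-up bytes rather than a real .scel file. The helpers `pack_group` and `parse_group` and the pinyin codes 0 and 1 are hypothetical, and the 12 trailing bytes are simply zero-filled because the script skips them without interpreting them.

```python
import io
import struct


def pack_group(py_codes, words):
    """Build one synthetic word group in the layout the script reads."""
    # Header: word count, then pinyin length in bytes (2 bytes per code).
    data = struct.pack('<HH', len(words), 2 * len(py_codes))
    data += struct.pack('<%dH' % len(py_codes), *py_codes)
    for word in words:
        encoded = word.encode('UTF-16LE')
        # Word length in bytes, the word, then 12 bytes the script skips.
        data += struct.pack('<H', len(encoded)) + encoded + b'\x00' * 12
    return data


def parse_group(f, py_map):
    """Read one word group and return (pinyin, [words])."""
    word_count, py_bytes = struct.unpack('<HH', f.read(4))
    codes = struct.unpack('<%dH' % (py_bytes // 2), f.read(py_bytes))
    pinyin = "'".join(py_map[c] for c in codes)
    words = []
    for _ in range(word_count):
        (word_len,) = struct.unpack('<H', f.read(2))
        words.append(f.read(word_len).decode('UTF-16LE'))
        f.read(12)  # skip the trailing bytes, as the script does
    return pinyin, words


py_map = {0: 'sou', 1: 'gou'}  # made-up codes for illustration
blob = pack_group([0, 1], ['搜狗'])
print(len(blob), parse_group(io.BytesIO(blob), py_map))
```

Here the group is 26 bytes long: 4 bytes of header, 4 bytes of pinyin codes, and 2 + 4 + 12 bytes for the single word.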
