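An earlier post covered how to call the tesseract API from Python; this one covers how to use tesseract's LSTM mode. LSTM support was added in tesseract 4.0, and on the command line you only need to add the "--oem 1" parameter. However, the pythonocr module does not provide an init function that accepts an oem parameter. Looking at the tesseract source, around line 257 of capi.cpp we find: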
TESS_API int TESS_CALL TessBaseAPIInit1(TessBaseAPI* handle, const char* datapath, const char* language, TessOcrEngineMode oem,
                                        char** configs, int configs_size)
{
    return handle->Init(datapath, language, oem, configs, configs_size, nullptr, nullptr, false);
}
TESS_API int TESS_CALL TessBaseAPIInit2(TessBaseAPI* handle, const char* datapath, const char* language, TessOcrEngineMode oem)
{
    return handle->Init(datapath, language, oem);
}
TESS_API int TESS_CALL TessBaseAPIInit3(TessBaseAPI* handle, const char* datapath, const char* language)
{
    return handle->Init(datapath, language);
}
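The TessOcrEngineMode argument in these signatures is a plain C enum, so on the Python side it is passed as an integer. As a small sketch, the values can be given names instead of using bare numbers. The class name OcrEngineMode and the helper oem_from_cli are my own inventions for illustration; the numeric values are the ones I understand Tesseract 4.x to define in publictypes.h, so check them against your installed version.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Hypothetical Python mirror of Tesseract 4's TessOcrEngineMode enum."""

    TESSERACT_ONLY = 0           # legacy engine only
    LSTM_ONLY = 1                # LSTM engine only, same as "--oem 1"
    TESSERACT_LSTM_COMBINED = 2  # legacy and LSTM engines together
    DEFAULT = 3                  # let tesseract pick based on available data


def oem_from_cli(value):
    """Convert a command-line style "--oem N" value into the enum."""
    return OcrEngineMode(int(value))
```

Because IntEnum members are integers, a value such as OcrEngineMode.LSTM_ONLY can be passed anywhere a ctypes.c_int argument is expected.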
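Of these, TessBaseAPIInit2() is the function we need. It is already exported from the tesseract.so file; it only has to be declared before we can use it. Open tesseract_raw.py in the pythonocr installation directory and go to line 148, where the declarations for Init1 and Init3 are. Add a declaration for Init2 there; after the change it looks like this: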
g_libtesseract.TessBaseAPIInit1.argtypes = [
    ctypes.c_void_p,  # TessBaseAPI*
    ctypes.c_char_p,  # datapath
    ctypes.c_char_p,  # language
    ctypes.c_int,  # TessOcrEngineMode
    ctypes.POINTER(ctypes.c_char_p),  # configs
    ctypes.c_int,  # configs_size
]
g_libtesseract.TessBaseAPIInit1.restype = ctypes.c_int
# Added: declaration for Init2
g_libtesseract.TessBaseAPIInit2.argtypes = [
    ctypes.c_void_p,  # TessBaseAPI*
    ctypes.c_char_p,  # datapath
    ctypes.c_char_p,  # language
    ctypes.c_int,  # TessOcrEngineMode
]
g_libtesseract.TessBaseAPIInit2.restype = ctypes.c_int
g_libtesseract.TessBaseAPIInit3.argtypes = [
    ctypes.c_void_p,  # TessBaseAPI*
    ctypes.c_char_p,  # datapath
    ctypes.c_char_p,  # language
]
g_libtesseract.TessBaseAPIInit3.restype = ctypes.c_int
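Before relying on a declaration like this, it can be worth confirming that the installed shared library really exports the symbol. A minimal sketch follows; the helper names exports_symbol and missing_init_functions are hypothetical and not part of pythonocr. It relies on the fact that a ctypes library object raises AttributeError when a symbol cannot be found.

```python
def exports_symbol(lib, name):
    """Return True if the loaded library object exposes `name`."""
    try:
        getattr(lib, name)  # ctypes resolves the symbol on attribute access
    except AttributeError:
        return False
    return True


def missing_init_functions(lib):
    """List the TessBaseAPIInit* entry points that `lib` does not export."""
    names = ("TessBaseAPIInit1", "TessBaseAPIInit2", "TessBaseAPIInit3")
    return [name for name in names if not exports_symbol(lib, name)]


# Example (the library file name is an assumption and varies by system):
#   import ctypes
#   lib = ctypes.CDLL("libtesseract.so.4")
#   print(missing_init_functions(lib))
```

An empty list means all three functions are available and the argtypes declarations can be applied safely.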
Next, go to line 351, which is pythonocr's implementation of the init function, and change it to the following:
def init(lang=None, oem=0):
    assert g_libtesseract
    handle = g_libtesseract.TessBaseAPICreate()
    try:
        # The C API expects byte strings, so encode the Python strings first.
        if lang:
            lang = lang.encode("utf-8")
        prefix = None
        if TESSDATA_PREFIX:
            prefix = TESSDATA_PREFIX.encode("utf-8")
        # Init2 takes the engine mode (oem) as an extra argument.
        g_libtesseract.TessBaseAPIInit2(
            ctypes.c_void_p(handle),
            ctypes.c_char_p(prefix),
            ctypes.c_char_p(lang),
            oem
        )
        g_libtesseract.TessBaseAPISetVariable(
            ctypes.c_void_p(handle),
            b"tessedit_zero_rejection",
            b"F"
        )
    except:  # noqa: E722 (free the handle on any error, then re-raise)
        g_libtesseract.TessBaseAPIDelete(ctypes.c_void_p(handle))
        raise
    return handle
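Note that the function above ignores the value returned by TessBaseAPIInit2. The underlying Init call returns 0 on success and -1 on failure (for example when the traineddata for the requested language is missing), so a failure passes silently. A minimal sketch of a check is given here; TesseractInitError and check_init_result are hypothetical names of my own, not something pythonocr provides.

```python
class TesseractInitError(RuntimeError):
    """Raised when a TessBaseAPIInit* call reports failure (hypothetical)."""


def check_init_result(rc, lang, oem):
    """Raise if the return code of a TessBaseAPIInit* call is non-zero."""
    if rc != 0:
        raise TesseractInitError(
            "tesseract init failed (rc=%d, lang=%r, oem=%r)" % (rc, lang, oem)
        )
    return rc
```

Inside init, the result of the g_libtesseract.TessBaseAPIInit2(...) call could be passed to check_init_result; since the call sits inside the try block, the existing except clause would still free the handle before the error propagates.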
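When calling it from outside, you only need to change the previous
handle = tesseract_raw.init(lang='eng')
to:
handle = tesseract_raw.init(lang='eng', oem=1)
and that is all. Download the latest tessdata package that supports LSTM, and recognition results improve considerably over before! How to use multiple languages when calling the API, as with -l eng+chi on the command line, is something I am still working out. If anyone knows, please let me know. Thanks!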
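To cross-check the API results against the command-line mode mentioned at the start, the equivalent command can be assembled from Python. This is only a sketch: build_tesseract_cmd is a hypothetical helper, it assumes the tesseract executable is on PATH, and it covers the command-line route only, not the API question above.

```python
def build_tesseract_cmd(image_path, langs=("eng",), oem=1, output_base="stdout"):
    """Build an argument list for the tesseract command-line tool.

    Languages are joined with "+", as in "-l eng+chi" on the command line.
    """
    return [
        "tesseract",
        str(image_path),
        output_base,           # "stdout" prints the recognized text
        "--oem", str(int(oem)),
        "-l", "+".join(langs),
    ]
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True), and the captured output compared with what the API returns for the same image.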