Tesseract OCR 圖檔文字識别

2021-10-24 10:42:00

Tesseract 介紹

Tesseract是一個開源的文本識别引擎，支援多種語言。4.0.0版本增加了LSTM神經網絡。Tesseract最初是由惠普公司研發，2005年開源。

Tesseract安裝

下載下傳Tesseract的安裝包，位址

安裝過程：

選擇常用的數學公式包，其他的語言包可以先不勾選，後續需要時再下載下傳。如果勾選了安裝過程可能極慢甚至中斷。

設定環境變量

設定
TESSDATA_PREFIX 環境變量到tesseract的data目錄。

Tesseract OCR 圖檔文字識别

選擇語言包：

使用Tesseract進行文本識别時，需要下載下傳相應的語言包，如本文需要對中文進行識别在data下載下傳

chi_sim.traineddata

放到

TESSDATA_PREFIX

目錄下。

Tesseract中文識别

Tesseract沒有提供圖形界面，隻能通過指令行或者程式設計語言來調用。

需要注意的是，在使用Tessearct對中文進行識别的時候需要指定使用的語言模型，否則會識别失敗出現一堆亂碼。

指令行調用Tesseract

tesseract 1.png result -l chi_sim   # -l 參數指定語言模型

python調用Tessearct

使用python調用Tessearct需要首先安裝兩個python lib

pip install pillow
pip install pytesseract

使用python調用Tessearct進行圖檔中文識别

# coding = utf-8
from PIL import Image
import pytesseract
image = Image.open("1.png")
# 這裡lang='chi_sim'參數很重要，意思是對中文進行識别，如果加這個參數預設應該是英文的，中文識别出來的是亂碼
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)

'''
類似于
919@400 ROK
1X
< Aah @ Fix
arta
ExT, 2%
Med Ea
BAAR ALFRE RIE tS
| Be Be
cai | = LRT +R
'''

Reference

Python:文本識别抛棄pytesser，直接使用Tesseract - Penguin (polarxiong.com)

tesseract官方文檔：Tesseract User Manual | tessdoc (tesseract-ocr.github.io)