
Installing Tesseract

【1】Direct installation

1) On Ubuntu 14.04, the distribution package tesseract-ocr can be installed directly:

sudo apt-get install tesseract-ocr      

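This installs the program under /usr/bin and the data files under /usr/share/tesseract-ocr/tessdata (the eng package is already included).

Under /usr/local/lib/python*.*/dist-packages there is a folder named pytesseract

(I may have installed it by accident; the GitHub page [https://github.com/madmaze/pytesseract] says to install it with sudo pip install pytesseract),

so tesseract can now be used from Python. For example: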

from PIL import Image

import pytesseract

print(pytesseract.image_to_string(Image.open('./Test/Python/t2.png')))

print(pytesseract.image_to_string(Image.open('./Test/Python/t2.png'), ))

Copy my trained digit sample file num.traineddata into the data directory:

print(pytesseract.image_to_string(Image.open('./Test/Python/t2.png'), ))

Recognition of these special digits is then very accurate!
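The calls above do not show a language argument. A minimal sketch of selecting the custom data file, assuming it is chosen by passing lang='num' to pytesseract.image_to_string (the value 'num' is inferred from the file name num.traineddata, not taken from the original calls). The helper names traineddata_path and ocr_digits are hypothetical, and the code targets Python 3:

```python
import os


def traineddata_path(tessdata_dir, lang):
    """Return the path where the data file for `lang` is expected."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def ocr_digits(image_path,
               tessdata_dir="/usr/share/tesseract-ocr/tessdata",
               lang="num"):
    """Run OCR with a custom language file, failing early if it is missing."""
    data_file = traineddata_path(tessdata_dir, lang)
    if not os.path.isfile(data_file):
        raise FileNotFoundError("missing language data: " + data_file)
    # Imported lazily so the path check works without these packages.
    from PIL import Image
    import pytesseract
    # Assumption: lang must match the traineddata file name (here 'num').
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Checking for the data file first gives a clearer message than the engine's own "Failed loading language" error when the copy step was skipped.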

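2) The tesseract-ocr installed this way has one problem: in a terminal the tesseract command cannot process images and reports the following errors (although it works from Python):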

Tesseract Open Source OCR Engine v3.03 with Leptonica

Error in pixReadStreamPng: function not present

Error in pixReadStream: png: no pix returned

Error in pixRead: pix not read

Error in pixGetInputFormat: pix not defined

Reading ./Test/Python/t2.png as a list of filenames...

Error in fopenReadStream: file not found

Error in pixRead: image file not found: �PNG

Image file �PNG cannot be read!

Error during processing.

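Online sources say this is because Leptonica does not recognize the png, tif, or jpg formats (in fact it recognizes almost no format at all; I really do not know why Tesseract is still built on this library).

(I have not solved this problem yet.)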
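One way to narrow this down is to see which image libraries the installed binary reports. A rough sketch, assuming the `tesseract --version` banner lists the image libraries Leptonica was built with (recent builds appear to print a line such as "libjpeg ... : libpng ... : libtiff ..."; older ones may not). The function names are hypothetical and the code targets Python 3:

```python
import re
import subprocess


def image_libs(version_text):
    """Return the image library names mentioned in a version banner."""
    pattern = r"\b(libpng|libjpeg|libtiff|libgif|libwebp)\b"
    return set(re.findall(pattern, version_text.lower()))


def missing_image_libs(version_text, wanted=("libpng", "libjpeg", "libtiff")):
    """Return the wanted libraries that the banner does not mention."""
    return sorted(set(wanted) - image_libs(version_text))


def tesseract_version_text():
    """Capture the banner; some releases print it to stderr, so read both."""
    proc = subprocess.Popen(["tesseract", "--version"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return (out + err).decode("utf-8", "replace")
```

Calling missing_image_libs(tesseract_version_text()) on the affected machine would show whether libpng is absent from the banner, which would be consistent with the "function not present" messages above.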

--------------------------------------------------------------------------------------------

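【2】Installing from source

1) First install leptonica. Download page: www.leptonica.org/download.html; for example, download leptonica-1.68.tar.gz

Then install it. The basic installation procedure below is sufficient (customize the leptonica build if you are interested):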

./configure         [build the Makefile]

make                [builds the library and shared library versions of all the progs]

sudo make install   [as root; this puts liblept.a into /usr/local/lib/ and all the progs into /usr/local/bin/ ]
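Leptonica's configure step appears to enable each image format only when it finds the corresponding library, so it may be worth checking for them before building. A small sketch, with the caveat that ctypes.util.find_library only locates shared libraries at run time and does not prove that the development headers needed for compilation are installed. The function name image_runtime_libs is hypothetical:

```python
from ctypes.util import find_library


def image_runtime_libs(names=("png", "jpeg", "tiff")):
    """Map each library name to the shared library found, or None."""
    return {name: find_library(name) for name in names}


def report(libs):
    """Format one line per library for printing."""
    return ["%-5s %s" % (name, libs[name] or "NOT FOUND") for name in sorted(libs)]
```

Printing "\n".join(report(image_runtime_libs())) before running ./configure gives a quick overview; any entry marked NOT FOUND is a format the resulting build will probably be unable to read.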

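2) Download Tesseract. Tesseract is now hosted on GitHub (https://github.com/tesseract-ocr), so there is no longer any need to get around the firewall to download it from Google Code.

Download the code from GitHub and extract it to a directory (for example /tmp/tesseract)

3) Install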

./autogen.sh

./configure

make

sudo make install

sudo ldconfig

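Note that this installs the program under /usr/local/bin and the data files under /usr/local/share/tessdata!

The following errors may occur:

[1] If ./autogen.sh reports that a number of tools are missing, install them:

aclocal missing        sudo apt-get install automake

libtoolize missing     sudo apt-get install libtool

If it reports any other missing tool, run that tool's command and Ubuntu will tell you how to install it.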
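The tool check can also be done up front instead of waiting for ./autogen.sh to fail. A minimal sketch for Python 3, using only the two tool and package pairs listed above; the names TOOL_PACKAGES and missing_packages are hypothetical:

```python
import shutil

# Tool -> Ubuntu package that provides it (the two pairs listed above).
TOOL_PACKAGES = {"aclocal": "automake", "libtoolize": "libtool"}


def missing_packages(tool_packages=TOOL_PACKAGES, which=shutil.which):
    """Return the sorted packages whose tool is not found on PATH."""
    return sorted({pkg for tool, pkg in tool_packages.items()
                   if which(tool) is None})
```

An empty list from missing_packages() means both tools are already on PATH; otherwise each returned name can be passed to sudo apt-get install.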

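[2] Data problems

The program built from source ships with no data. At least one data package (usually eng) must be installed before it can run. To install:

Download the data package, then extract it into /usr/local/share/tessdata

[3] Testing the installation

First test the program itself: run tesseract. The following output indicates a successful installation!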

/usr/local/share/tessdata$ tesseract

Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:

  -v --version: version info

  --list-langs: list available languages for tesseract engine

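A common error is missing language data, shown below. In that case install the language data as described above (eng is recommended: it is the default language and you will certainly need it):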

Error opening data file /usr/local/share/tessdata/eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.

Failed loading language 'eng'

Tesseract couldn't load any languages!

Could not initialize tesseract.
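The error message says TESSDATA_PREFIX should point at the parent of the tessdata directory. A short sketch of how the expected file path follows from that, assuming the 3.x behaviour described in the message and the source-build default of /usr/local/share; the function name expected_data_file is hypothetical:

```python
import os


def expected_data_file(lang="eng", environ=None,
                       default_prefix="/usr/local/share"):
    """Return the traineddata path implied by TESSDATA_PREFIX.

    TESSDATA_PREFIX is treated as the parent of the tessdata directory,
    as the error message above describes.
    """
    environ = os.environ if environ is None else environ
    prefix = environ.get("TESSDATA_PREFIX", default_prefix)
    return os.path.join(prefix, "tessdata", lang + ".traineddata")
```

If os.path.isfile(expected_data_file()) is False, either the data package was not extracted to that location or TESSDATA_PREFIX points somewhere else.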

Then test recognition on a file. The source directory contains a file named phototest.tif that can be used for testing.

tesseract phototest.tif test1 -l eng
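The same command can be driven from a script. A sketch for Python 3 that builds the argument list in the order the usage text requires (-l and -psm before any config file) and reads back the outputbase.txt file that tesseract writes. The function names are hypothetical, and -psm is the 3.x spelling shown in the usage text:

```python
import subprocess


def tesseract_command(image, outputbase, lang=None, psm=None, configs=()):
    """Build the argument list in the order the usage text requires."""
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]  # 3.x spelling, as in the usage text
    cmd += list(configs)  # config files must come last
    return cmd


def run_ocr(image, outputbase, **options):
    """Run tesseract and return the text written to outputbase.txt."""
    subprocess.check_call(tesseract_command(image, outputbase, **options))
    with open(outputbase + ".txt") as handle:
        return handle.read()
```

For example, run_ocr("phototest.tif", "test1", lang="eng") corresponds to the command above; check_call raises CalledProcessError if tesseract exits with an error.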

A common error is a Leptonica mismatch, shown below:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

Error in findTiffCompression: function not present

Error in pixReadStreamTiff: function not present

Error in pixReadStream: tiff: no pix returned

Error in pixRead: pix not read

Unsupported image type.

I have not solved this problem yet; the methods suggested online did not work (I could not get them to work on Ubuntu 14.04).

Reposted from: https://www.cnblogs.com/searchware/p/4825138.html