Tesseract-OCR Training Process (v3.02)

Software:

jTessBoxEditor Version 0.9 (30 April 2013)

Tesseract-OCR win32 v3.02 with Leptonica

Training steps:

1. In jTessBoxEditor, use Tools -> Merge TIFF to generate the TIFF file

2. Generate the box file

tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox

3. Open the TIFF in jTessBoxEditor, use Insert or Delete to add or remove characters, and adjust each box's coordinates with the x, y, w, h fields
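The box file edited in this step is plain text, one character per line, in the form "char left bottom right top page", with coordinates measured from the bottom-left corner of the image. The sketch below shows how such a line might be read and converted to x, y, w, h values. It assumes the editor's x and y are measured from the top-left corner; the names Box, parse_box_line and to_xywh are hypothetical helpers, not part of Tesseract or jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of a Tesseract 3.x box file (origin at the bottom-left)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse 'char left bottom right top page' into a Box.

    Assumes the six-field format; older five-field files (no page
    number) would need separate handling.
    """
    char, left, bottom, right, top, page = line.split()
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def to_xywh(box: Box, image_height: int) -> tuple:
    """Convert box corners to (x, y, w, h) with a top-left origin.

    Assumption: x/y are measured from the top-left corner of the image,
    so the vertical axis has to be flipped using the image height.
    """
    x = box.left
    y = image_height - box.top
    w = box.right - box.left
    h = box.top - box.bottom
    return (x, y, w, h)
```

For example, on an image 100 pixels high, the line "A 10 20 30 50 0" would correspond to x=10, y=50, w=20, h=30 under these assumptions.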

4. Train (if a character cannot be recognized and you see "Couldn't find a matching blob", try moving the box or adjusting its coordinates)

tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train

5. Preprocess the font data (extract the character set into a unicharset file)

unicharset_extractor.exe eng.arial.01.box

6. Create font_properties.txt with the following content (font name, then the italic, bold, fixed, serif and fraktur flags): arial 0 0 0 0 0
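Each line of font_properties.txt is the font name followed by five 0/1 flags in the order italic, bold, fixed, serif, fraktur. As a minimal sketch, a line could be built like this (font_properties_line is a hypothetical helper, not a Tesseract tool):

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: '<font> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return font + " " + " ".join("1" if flag else "0" for flag in flags)
```

Calling font_properties_line("arial") gives the line used in this step; the font name should match the one embedded in the training file name (eng.arial.01).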

7. Process the font (feature training)

mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr

8. Run character normalization training

cntraining.exe eng.arial.01.tr

9. Add the prefix "eng.arial.01." to these four files: unicharset, inttemp, normproto, pffmtable
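The renaming in step 9 can be done by hand, but a small script avoids typos in the prefix. The following is a sketch only: add_prefix is a hypothetical helper, and it assumes all four files sit in the same working directory.

```python
import os

# The four training outputs named in step 9.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(directory, prefix="eng.arial.01."):
    """Rename the four training outputs in `directory` and return the new names.

    Raises FileNotFoundError if one of the files is missing, which usually
    means an earlier training step did not complete.
    """
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = os.path.join(directory, name)
        dst = os.path.join(directory, prefix + name)
        os.replace(src, dst)  # overwrites an existing destination file
        renamed.append(prefix + name)
    return renamed
```

Note that the prefix ends with a dot, matching the argument passed to combine_tessdata.exe.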

10. Combine the data files

combine_tessdata.exe eng.arial.01.

Output:

Combining tessdata files

TessdataManager combined tesseract data files.

Offset for type 0 is -1

Offset for type 1 is 108

Offset for type 2 is -1

Offset for type 3 is 1660

Offset for type 4 is 327545

Offset for type 5 is 327781

Offset for type 6 is -1

Offset for type 7 is -1

Offset for type 8 is -1

Offset for type 9 is -1

Offset for type 10 is -1

Offset for type 11 is -1

Offset for type 12 is -1

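Make sure that the values on lines 2, 4, 5 and 6 of this listing (types 1, 3, 4 and 5) are not -1. If they are not, the new language data file has been generated successfully.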
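This check can also be automated by scanning the captured output of combine_tessdata. The sketch below assumes the output has already been saved as a string; parse_offsets and missing_types are hypothetical helper names.

```python
import re

# Types that must have a real offset: lines 2, 4, 5 and 6 of the listing.
REQUIRED_TYPES = (1, 3, 4, 5)


def parse_offsets(output):
    """Map each type number to its offset from 'Offset for type N is M' lines."""
    offsets = {}
    for match in re.finditer(r"Offset for type\s+(\d+) is\s+(-?\d+)", output):
        offsets[int(match.group(1))] = int(match.group(2))
    return offsets


def missing_types(output, required=REQUIRED_TYPES):
    """Return the required types whose offset is -1 or absent from the output."""
    offsets = parse_offsets(output)
    return [t for t in required if offsets.get(t, -1) == -1]
```

An empty list from missing_types means all four required components were found; any numbers in the list point to the training step that needs to be rerun.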

11. Copy the file "eng.arial.01.traineddata" from the current directory into the "tessdata" directory under the Tesseract program directory

12. Test recognition with the new language data

tesseract.exe test.jpg result -l eng.arial.01

tesseract.exe a.bmp result2 -l eng.arial.01

Specifying the layout (page segmentation) mode:

tesseract.exe 42.png result2 -l eng.arial.01 -psm 7

Description of the layout parameter:

-psm N

    Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

    0 = Orientation and script detection (OSD) only.

    1 = Automatic page segmentation with OSD.

    2 = Automatic page segmentation, but no OSD, or OCR.

    3 = Fully automatic page segmentation, but no OSD. (Default)

    4 = Assume a single column of text of variable sizes.

    5 = Assume a single uniform block of vertically aligned text.

    6 = Assume a single uniform block of text.

    7 = Treat the image as a single text line.

    8 = Treat the image as a single word.

    9 = Treat the image as a single word in a circle.

    10 = Treat the image as a single character.
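When recognition is driven from a script, the command line above can be assembled programmatically so that an invalid mode number is caught before Tesseract is started. This is a sketch under assumptions: build_command and PSM_MODES are hypothetical names, the executable is assumed to be on the PATH as tesseract.exe, and the default language is the one trained in this walkthrough.

```python
# Mode numbers and short labels, taken from the -psm list above.
PSM_MODES = {
    0: "OSD only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, no OSD or OCR",
    3: "Fully automatic page segmentation, no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
}


def build_command(image, output_base, lang="eng.arial.01", psm=None,
                  exe="tesseract.exe"):
    """Return the argument list for a Tesseract 3.02 run.

    `psm` is optional; when given it must be one of the keys of PSM_MODES.
    """
    cmd = [exe, image, output_base, "-l", lang]
    if psm is not None:
        if psm not in PSM_MODES:
            raise ValueError("unknown -psm value: %r" % (psm,))
        cmd += ["-psm", str(psm)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True); for example, build_command("42.png", "result2", psm=7) reproduces the single-line command shown earlier.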