Software:
jTessBoxEditor Version 0.9 (30 April 2013)
Tesseract-OCR win32 v3.02 with Leptonica
Training steps:
1. In jTessBoxEditor, use Tools -> Merge TIFF to combine the sample images into a single TIFF file.
2. Generate the box file:
tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox
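3. Open the TIFF in jTessBoxEditor. Use Insert or Delete to add or remove character boxes, and adjust each box's coordinates through the x, y, w, h fields.
4. Train. If Tesseract reports an unrecognizable character ("couldn't find a matching blob"), go back to the box file and try moving the box or adjusting its coordinates.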
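Before training, it can help to sanity-check the edited box file from a script. The sketch below assumes the plain-text box format written by makebox, one box per line as `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The names `Box`, `parse_box_line` and `suspicious_boxes` are made up for this illustration and are not part of Tesseract or jTessBoxEditor.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: 'symbol left bottom right top page'."""
    # Split from the right so the symbol field is kept in one piece.
    symbol, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def suspicious_boxes(lines: Iterable[str]) -> List[Tuple[int, Box]]:
    """Return (line_number, box) pairs whose width or height is not positive."""
    bad = []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        box = parse_box_line(line)
        if box.right <= box.left or box.top <= box.bottom:
            bad.append((number, box))
    return bad
```

For example, `suspicious_boxes(open("eng.arial.01.box", encoding="utf-8"))` would list the boxes worth revisiting in the editor. A box that passes this check can still fail to match a blob during training; the check only catches degenerate rectangles.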
tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train
5. Font preprocessing (extract the character set into the unicharset file):
unicharset_extractor.exe eng.arial.01.box
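6. Create font_properties.txt with the following content: arial 0 0 0 0 0
(The line is the font name followed by five 0/1 flags: italic, bold, fixed, serif, fraktur.)
7. Font processing: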
mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr
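8. Run cntraining, which produces the normproto file:
cntraining.exe eng.arial.01.tr
9. Add the prefix "eng.arial.01." to these four files: unicharset, inttemp, normproto, pffmtable.
10. Combine the files into a traineddata file:
combine_tessdata.exe eng.arial.01.
Output: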
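The renaming in step 9 can also be scripted. This is a minimal sketch, assuming the four files sit in one directory; the function name `add_prefix` and its parameters are invented for this example.

```python
import os

PREFIX = "eng.arial.01."
FILES = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(directory=".", prefix=PREFIX, names=FILES):
    """Rename each training output file to prefix + name; return the new paths."""
    sources = [os.path.join(directory, name) for name in names]
    # Check everything up front so a missing file does not leave a half-renamed set.
    missing = [path for path in sources if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("missing training files: " + ", ".join(missing))
    renamed = []
    for name, src in zip(names, sources):
        dst = os.path.join(directory, prefix + name)
        os.replace(src, dst)  # overwrites an older prefixed file if present
        renamed.append(dst)
    return renamed
```

Calling `add_prefix()` in the training directory leaves eng.arial.01.unicharset, eng.arial.01.inttemp, eng.arial.01.normproto and eng.arial.01.pffmtable ready for the next step.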
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
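Make sure the offsets on lines 2, 4, 5 and 6 of the list (types 1, 3, 4 and 5, the four files renamed in step 9) are not -1. If they are all set, the new traineddata file has been generated successfully.
11. Copy the eng.arial.01.traineddata file from the current directory into the "tessdata" directory under the Tesseract program directory.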
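The offset check can be automated by capturing the combine_tessdata output as text. The following is a rough sketch under that assumption; `parse_offsets` and `missing_types` are hypothetical helper names, and the set of required types (1, 3, 4, 5) is taken from the rule stated above.

```python
import re

REQUIRED_TYPES = (1, 3, 4, 5)
OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(output):
    """Map each type number in the combine_tessdata output to its offset."""
    return {int(kind): int(offset) for kind, offset in OFFSET_RE.findall(output)}


def missing_types(output, required=REQUIRED_TYPES):
    """Return the required types whose offset is -1 or not reported at all."""
    offsets = parse_offsets(output)
    return [kind for kind in required if offsets.get(kind, -1) == -1]
```

An empty list from `missing_types(text)` means all four required components were packed into the traineddata file; any number in the list points at a file that was missing or not renamed in step 9.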
12. Run recognition with the new language data:
tesseract.exe test.jpg result -l eng.arial.01
tesseract.exe a.bmp result2 -l eng.arial.01
To specify the page segmentation (layout) mode:
tesseract.exe 42.png result2 -l eng.arial.01 -psm 7
Page segmentation mode options:
-psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
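When many images need to be recognized, the command line above can be assembled from a script. This sketch only mirrors the commands shown in step 12; `build_command` and `recognize` are illustrative names, and the default executable name and language are assumptions taken from this walkthrough.

```python
import subprocess


def build_command(image, output_base, lang="eng.arial.01", psm=None,
                  exe="tesseract.exe"):
    """Build the argument list for a tesseract 3.02 command line."""
    cmd = [exe, image, output_base, "-l", lang]
    if psm is not None:
        if psm not in range(0, 11):
            raise ValueError("psm must be between 0 and 10")
        # Version 3.02 takes a single dash; Tesseract 4 and later use --psm.
        cmd += ["-psm", str(psm)]
    return cmd


def recognize(image, output_base, **options):
    """Run tesseract and return the text it wrote to output_base + '.txt'."""
    subprocess.run(build_command(image, output_base, **options), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For instance, `recognize("42.png", "result2", psm=7)` runs the same command as the single-line example above and returns the recognized text, provided tesseract.exe is on the PATH.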