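Recognizing Image CAPTCHAs with the Tesseract OCR Engine

1. Introduction to Tesseract

Tesseract is an open-source OCR project backed by Google. The project lives at https://github.com/tesseract-ocr/tesseract, where the latest source code can be downloaded.

There are two ways to use Tesseract OCR in practice: (1) as a dynamic library, libtesseract; (2) as an executable, tesseract.exe.

2. Downloading the Tesseract installer

Release builds of Tesseract can be downloaded from https://github.com/tesseract-ocr/tesseract/wiki/Downloads. That page notes: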

Currently, there is no official Windows installer for newer versions.

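In other words, there is no official Windows installer for the latest versions, only the somewhat older 3.02.02 release, available at https://sourceforge.net/projects/tesseract-ocr-alt/files/.

Installers for the newer 3.03 and 3.05 versions are all maintained by third parties. There are several distributors: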

3rd party Windows exe's/installer

binaries compiled by @egorpugin (ref issue #209): https://www.dropbox.com/s/8t54mz39i58qslh/tesseract-3.05.00dev-win32-vc19.zip?dl=1

You have to install VC2015 x86 redist from microsoft.com in order to run them. Leptonica is built with all libs except for libjp2k.

https://github.com/UB-Mannheim/tesseract/wiki

http://domasofan.spdns.eu/tesseract/

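3. Using Tesseract OCR

After installation, the default directory is C:\Program Files (x86)\Tesseract-OCR. Add this path to your operating system's PATH so that tesseract is convenient to invoke.

In the installation directory C:\Program Files (x86)\Tesseract-OCR you will find tesseract.exe, the command-line executable.

Note: the installed directory can be packed into an archive, copied elsewhere or to another computer, and used directly after unpacking.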
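To check from Java whether the PATH change took effect, something like the following sketch could be used. The class name PathLookup and its methods are hypothetical helpers written for this illustration and are not part of Tesseract; the sketch only looks for a file with the given name in each PATH entry and does not try to run it.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists where an executable would be looked up on a PATH-style string. */
final class PathLookup {

    /** Splits pathValue on the platform path separator and appends exeName to each entry. */
    static List<File> candidates(String pathValue, String exeName) {
        List<File> result = new ArrayList<>();
        if (pathValue == null) {
            return result;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (!dir.isEmpty()) {
                result.add(new File(dir, exeName));
            }
        }
        return result;
    }

    /** Returns the first candidate that exists as a regular file, or null if none does. */
    static File findFirst(String pathValue, String exeName) {
        for (File f : candidates(pathValue, exeName)) {
            if (f.isFile()) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String exe = windows ? "tesseract.exe" : "tesseract";
        File found = findFirst(System.getenv("PATH"), exe);
        System.out.println(found == null ? exe + " not found on PATH" : "Found " + found);
    }
}
```

If the lookup reports that nothing was found, the installation directory is probably missing from PATH, or the shell that launched the JVM was started before PATH was changed.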

The tesseract command line takes an input image, an output base name, and optional flags such as -l and -psm:

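For example, tesseract 1.png output -l eng -psm 7 recognizes the image file 1.png as a single line of text using the English language data, and writes the result to output.txt in the current directory. Here -psm 7 selects single-line recognition and -l eng selects English. English is the default language, so to run with the default options you can simply use “tesseract 1.png output”.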
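The same command can be assembled as an argument list for ProcessBuilder. The following is a minimal sketch, not the author's code: TesseractCommand and build are hypothetical names, and the -psm spelling follows the 3.x command line used in this article (later Tesseract releases spell the option --psm).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles the argument list for the tesseract 3.x command line. */
final class TesseractCommand {

    /**
     * @param executable path or name of the tesseract binary, e.g. "tesseract"
     * @param image      input image file name
     * @param outputBase output base name; tesseract appends ".txt" to it
     * @param lang       language code such as "eng", or null to use the default
     * @param psm        page segmentation mode, or a negative value to use the default
     */
    static List<String> build(String executable, String image, String outputBase, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null) {
            cmd.add("-l");
            cmd.add(lang);
        }
        if (psm >= 0) {
            // 3.x spelling, as used in this article; adjust for newer releases.
            cmd.add("-psm");
            cmd.add(String.valueOf(psm));
        }
        return cmd;
    }

    public static void main(String[] args) {
        // prints: tesseract 1.png output -l eng -psm 7
        System.out.println(String.join(" ", build("tesseract", "1.png", "output", "eng", 7)));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when the installation path contains spaces, as the default Windows directory does.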

4. A Java utility class for Tesseract OCR

package com.shanhy.unifiedintegral.common.ocr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.jdesktop.swingx.util.OS;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

/**
 * Image CAPTCHA recognition
 *
 * @author   單紅宇(365384722)
 * @myblog  http://blog.csdn.net/catoop/
 * @create    2016-09-02
 */
@Component
public class OCRHelper {
	
	private final String LANG_OPTION = "-l";
	private final String EOL = System.getProperty("line.separator");
	/**
	 * Tesseract installation directory
	 */
	@Value("${tesseractDirPath:C://Program Files (x86)//Tesseract-OCR}")// default: C://Program Files (x86)//Tesseract-OCR
	private String tessPath;

	/**
	 * @param imageFile
	 *            the image file to recognize
	 * @return the recognized text, with all whitespace removed
	 */
	public String recognizeText(File imageFile) throws Exception {
		// Tesseract writes its result to <outputFile>.txt next to the image
		File outputFile = new File(imageFile.getParentFile(), "output");

		StringBuilder strB = new StringBuilder();
		List<String> cmd = new ArrayList<String>();
		if (OS.isLinux()) {
			// On Linux, tesseract is expected to be on the PATH
			cmd.add("tesseract");
		} else {
			cmd.add(tessPath + "\\tesseract");
		}
		cmd.add(imageFile.getName());
		cmd.add(outputFile.getName());
		cmd.add(LANG_OPTION);
		// Language parameter
		// cmd.add("chi_sim");// Simplified Chinese (requires an extra language pack)
		cmd.add("eng");// English (bundled with the default installation)

		ProcessBuilder pb = new ProcessBuilder();
		// Run in the image's directory so the relative file names resolve
		pb.directory(imageFile.getParentFile());
		pb.command(cmd);
		pb.redirectErrorStream(true);
		Process process = pb.start();
		// tesseract.exe 1.jpg 1 -l chi_sim
		// Runtime.getRuntime().exec("tesseract.exe 1.jpg 1 -l chi_sim");
		/**
		 * the exit value of the process. By convention, 0 indicates normal
		 * termination.
		 */
		// System.out.println(cmd.toString());
		int w = process.waitFor();
		if (w == 0) { // exit code 0 means normal termination
			// try-with-resources closes the reader even if reading fails
			try (BufferedReader in = new BufferedReader(
					new InputStreamReader(new FileInputStream(outputFile.getAbsolutePath() + ".txt"), "UTF-8"))) {
				String str;
				while ((str = in.readLine()) != null) {
					strB.append(str).append(EOL);
				}
			}
		} else {
			String msg;
			switch (w) {
			case 1:// usually a permission problem: the running Java process lacks sufficient rights
				msg = "Errors accessing files. There may be spaces in your image's filename.";
				break;
			case 29:
				msg = "Cannot recognize the image or its selected region.";
				break;
			case 31:
				msg = "Unsupported image format.";
				break;
			default:
				msg = "Errors occurred.";
			}
			throw new RuntimeException(msg);
		}
		new File(outputFile.getAbsolutePath() + ".txt").delete();
		return strB.toString().replaceAll("\\s*", "");
	}
	
	public void setTessPath(String path){
		this.tessPath = path;
	}

//	public static void main(String[] args) {
//		try {
//			OCRHelper ocr = new OCRHelper();
//			ocr.setTessPath("D://Tesseract-OCR");
//			System.out.println(ocr.recognizeText(new File("G://vcode.jpg")));
//		} catch (Exception e) {
//			e.printStackTrace();
//		}
//	}
}

The tesseractDirPath property exists because I keep the installation directory in a configuration file.
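As a rough illustration of what the ${tesseractDirPath:...} placeholder does (use the configured value if present, otherwise fall back to the default), here is a minimal sketch that uses plain java.util.Properties instead of Spring. The class name TessPathConfig is hypothetical, and the sample value D:/Tesseract-OCR is only an example.

```java
import java.util.Properties;

/** Hypothetical illustration of a "configured value or default" lookup. */
final class TessPathConfig {

    static final String KEY = "tesseractDirPath";
    static final String DEFAULT_DIR = "C:/Program Files (x86)/Tesseract-OCR";

    /** Returns the configured directory, or DEFAULT_DIR when the key is absent. */
    static String resolve(Properties props) {
        return props.getProperty(KEY, DEFAULT_DIR);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props)); // falls back to DEFAULT_DIR
        props.setProperty(KEY, "D:/Tesseract-OCR");
        System.out.println(resolve(props)); // uses the configured value
    }
}
```

In the Spring version, the equivalent configuration entry would be a single line such as tesseractDirPath=D:/Tesseract-OCR in the application's properties file.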

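CAPTCHAs that look reasonably clean can usually be recognized directly. For noisy images, preprocess the image yourself first, then run this utility class on the denoised result.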
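The article does not say which preprocessing to apply. One simple and common first step is thresholding (binarization), sketched below under that assumption: every pixel becomes either black or white depending on its brightness, which removes faint background noise. CaptchaPreprocessor is a hypothetical name, and the threshold has to be tuned per CAPTCHA style.

```java
import java.awt.image.BufferedImage;

/** Hypothetical preprocessing helper: converts an image to pure black and white. */
final class CaptchaPreprocessor {

    /**
     * Returns a copy in which every pixel with luminance below threshold (0-255)
     * becomes black and every other pixel becomes white.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common luminance weights 0.299 / 0.587 / 0.114.
                int gray = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, gray < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

The result can be saved with javax.imageio.ImageIO.write(image, "png", file) and the saved file passed to recognizeText. Heavier noise such as interference lines or speckles usually needs additional steps beyond a single threshold.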
