I. Introduction to pytesseract
1. About pytesseract
The latest pytesseract release at the time of writing is 0.1.6; project page: https://pypi.python.org/pypi/pytesseract
Python-tesseract is a wrapper for google's Tesseract-OCR
(http://code.google.com/p/tesseract-ocr/). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text instead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.
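In short:
a. Python-tesseract is a wrapper around Google's Tesseract-OCR that can also be run as a stand-alone script;
b. Python-tesseract recognizes the text in an image file and returns the recognized text as the function's return value;
c. tesseract-ocr on its own supports only tiff and bmp images by default; because Python-tesseract opens images through PIL, it can also handle jpeg, gif, png, and other formats.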
2. Installing pytesseract
INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
  the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
  You must be able to invoke the tesseract command as "tesseract". If this
  isn't the case, for example because tesseract isn't in your PATH, you will
  have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
  Under Debian/Ubuntu you can use the package "tesseract-ocr".

Installing via pip:
See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)
```
$> sudo pip install pytesseract
```
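In short:
a. Python-tesseract supports Python 2.5 and later, as well as Python 3;
b. Python-tesseract requires PIL (Python Imaging Library), which lets it read more image formats;
c. Python-tesseract requires the tesseract-ocr package to be installed; see the previous post for details: http://www.cnblogs.com/zhongtang/p/5554784.html .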
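To sum up, here is how pytesseract works:
1. As mentioned in the previous post (http://www.cnblogs.com/zhongtang/p/5554784.html), running the command line tesseract.exe 1.png output -l eng recognizes the text in 1.png and writes the result to output.txt;
2. pytesseract wraps that process: it invokes tesseract.exe automatically, reads the contents of output.txt, and returns them as the function's return value.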
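The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not pytesseract's actual source: the function names `build_tesseract_command` and `ocr_via_command_line` are made up for this post, and the sketch assumes the `tesseract` executable can be found on your PATH.

```python
import os
import shutil
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, lang=None,
                            tesseract_cmd="tesseract"):
    """Build the command line: tesseract <image> <output_base> [-l <lang>]."""
    command = [tesseract_cmd, image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def ocr_via_command_line(image_path, lang=None, tesseract_cmd="tesseract"):
    """Run tesseract on an image file and return the recognized text."""
    # Work in a temporary directory so the output file never clashes
    # with anything else on disk.
    temp_dir = tempfile.mkdtemp()
    output_base = os.path.join(temp_dir, "output")
    command = build_tesseract_command(image_path, output_base, lang,
                                      tesseract_cmd)
    try:
        proc = subprocess.Popen(command, stderr=subprocess.PIPE)
        _, errors = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(errors.decode("utf-8", "replace"))
        # tesseract appends ".txt" to the output base name it was given.
        with open(output_base + ".txt", "rb") as result_file:
            return result_file.read().decode("utf-8").strip()
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
```

Calling `ocr_via_command_line('1.png', lang='eng')` would then play the role of the manual command above, with the text handed back as a return value rather than left in output.txt.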
II. Using pytesseract
USAGE:
```
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
```
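As you can see:
1. The core of the code is the image_to_string function, which also accepts the -l eng option (through lang) and the -psm option (through config).
Usage:
image_to_string(Image.open('test.png'), lang="eng", config="-psm 7")
2. pytesseract uses Image internally, which is why PIL is required; tesseract.exe itself can in fact read jpeg, png, and other image formats.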
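To make the lang and config arguments a bit more concrete, here is a small sketch of a wrapper around image_to_string. The helper names `build_psm_config` and `recognize_text` are hypothetical and exist only for this example. Newer Tesseract releases expect the option to be spelled `--psm` rather than `-psm`, so the flag is left as a parameter; which spelling you need depends on the Tesseract version you have installed.

```python
def build_psm_config(psm, flag="-psm"):
    """Return the config string that selects a page segmentation mode."""
    return "{} {}".format(flag, psm)


def recognize_text(image_path, lang="eng", psm=7, flag="-psm"):
    """OCR a single image file with an explicit language and segmentation mode."""
    # Imported here so that build_psm_config can be used (and tried out)
    # even on a machine where PIL and pytesseract are not installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    return pytesseract.image_to_string(
        image, lang=lang, config=build_psm_config(psm, flag))
```

With this in place, `recognize_text('test.png')` corresponds to the call shown above, and `recognize_text('test.png', flag='--psm')` is the variant to try if your Tesseract rejects the single-dash form.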
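Sample code for recognizing the captcha of a public website (please don't do anything bad with it; after much deliberation I ended up hiding the site's domain, so go try it on some other site...):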
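III. Tweaking the pytesseract code
When the program above runs on Windows, a black console window flashes briefly on screen, which is not very friendly.
I made a small change to pytesseract.py (under C:\Python27\Lib\site-packages\pytesseract) to hide that window.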
    # modified by zhongtang hide console window
    # new code
    IS_WIN32 = 'win32' in str(sys.platform).lower()
    # startupinfo stays None on other platforms, so the Popen call below still works there
    startupinfo = None
    if IS_WIN32:
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
    proc = subprocess.Popen(command,
                            stderr=subprocess.PIPE, startupinfo=startupinfo)
    '''
    # old code
    proc = subprocess.Popen(command,
                            stderr=subprocess.PIPE)
    '''
    # modified end
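If you would rather not edit the installed pytesseract.py, the same trick can be tried out on its own. Below is a minimal, self-contained sketch; the helper names `hidden_window_startupinfo` and `run_quietly` are invented for this post and are not part of pytesseract. On platforms other than Windows the helper simply returns None, which is also the default value of Popen's startupinfo argument.

```python
import subprocess
import sys


def hidden_window_startupinfo():
    """Return a STARTUPINFO that hides the console window on Windows, else None."""
    if not sys.platform.startswith("win"):
        return None
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return startupinfo


def run_quietly(command):
    """Run a command without flashing a console window; return (code, stdout, stderr)."""
    proc = subprocess.Popen(command,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            startupinfo=hidden_window_startupinfo())
    out, err = proc.communicate()
    return proc.returncode, out, err
```

For example, `run_quietly(['tesseract', '1.png', 'output', '-l', 'eng'])` would run the recognition step without the console window popping up, assuming tesseract is on your PATH.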
For the convenience of beginners, I am also posting pytesseract.py; experts can simply skip it.
That's all...
Source: http://www.bubuko.com/infodetail-2656476.html