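Crawlers often run into garbled text (mojibake). The simplest fix is to read the encoding from the HTTP response headers. But if the target site's response carries no encoding information, or declares the wrong encoding, the content the crawler fetches is likely to come out garbled.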
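As a rough illustration of the header-based approach, the sketch below pulls the `charset` parameter out of a `Content-Type` header value using only the JDK. The class and method names (`ContentTypeCharset`, `fromContentType`) are hypothetical, and real headers can be messier than this handles; treat it as a minimal sketch rather than a complete parser.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of httpclient or juniversalchardet.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, if present and supported by this JVM. */
    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as "no usable encoding"
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8"));
        System.out.println(fromContentType("text/html"));
    }
}
```

An empty result is the case where the crawler has to fall back on inspecting the page content itself.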
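A better solution is to detect the page's encoding automatically from its content, as Firefox does with Mozilla's universalchardet encoding detector.
juniversalchardet is the Java version of universalchardet. Its source code is available at https://github.com/thkoch2001/juniversalchardet
Automatic detection relies mainly on statistical methods. For the underlying principles, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html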
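To give a feel for why statistics are needed, here is a toy check, written against the JDK only, that asks whether a byte sequence is even legal in a candidate encoding. This is not how universalchardet is implemented; it is an illustrative sketch with hypothetical names (`LegalityCheck`, `isLegal`).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Toy illustration only; not the algorithm used by universalchardet.
final class LegalityCheck {

    /** Returns true if the bytes decode without error in the given charset. */
    static boolean isLegal(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Four bytes that start with a UTF-8 continuation byte, so they are not valid UTF-8
        byte[] sample = {(byte) 0xB1, (byte) 0xE0, (byte) 0xC2, (byte) 0xEB};
        System.out.println("UTF-8 legal:      " + isLegal(sample, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 legal: " + isLegal(sample, StandardCharsets.ISO_8859_1));
    }
}
```

A legality check can rule encodings out, but single-byte encodings such as ISO-8859-1 accept any byte sequence, so it cannot pick a winner on its own. That gap is what the statistical analysis of character frequencies is meant to close.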
The following uses httpclient, a common choice for Java crawlers, to show how to use it. Here is the key code:
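
    // Excerpt from a method that returns the page as a String.
    // myStream is the response body InputStream, tmp is a byte[] read buffer,
    // and buffer is assumed to be an org.apache.http.util.ByteArrayBuffer.
    UniversalDetector encDetector = new UniversalDetector(null);
    int l;
    while ((l = myStream.read(tmp)) != -1) {
        // Keep every byte so the whole page can be decoded afterwards
        buffer.append(tmp, 0, l);
        // Feed the detector only until it has reached a decision
        if (!encDetector.isDone()) {
            encDetector.handleData(tmp, 0, l);
        }
    }
    // Signal end of input, then read the result (null if nothing was detected)
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    // Reset before returning so the detector can be reused on either path
    encDetector.reset();
    if (encoding != null) {
        return new String(buffer.toByteArray(), encoding);
    }
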
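The excerpt above passes the detected name straight to `new String(...)`, which throws `UnsupportedEncodingException` if the JVM does not know that name, and it produces no result when detection fails. A more defensive decoding step might look like the sketch below. `SafeDecode` and `decode` are hypothetical names, and choosing a fallback charset is an assumption about what the crawler wants, not something the library prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for decoding with a detected charset name.
final class SafeDecode {

    /** Decodes bytes with the detected charset, or with the fallback if the name is null or unusable. */
    static String decode(byte[] bytes, String detectedName, Charset fallback) {
        Charset charset = fallback;
        if (detectedName != null) {
            try {
                if (Charset.isSupported(detectedName)) {
                    charset = Charset.forName(detectedName);
                }
            } catch (IllegalCharsetNameException e) {
                // Name is not syntactically valid; keep the fallback
            }
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] page = "hello".getBytes(StandardCharsets.UTF_8);
        // A null name stands in for "detector found nothing"
        System.out.println(decode(page, null, StandardCharsets.UTF_8));
    }
}
```

In the crawler, `detectedName` would be the value returned by `getDetectedCharset()`, and the fallback could be the charset from the HTTP headers when one is declared.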
Maven dependency (http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3):

    <!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
    <dependency>
        <groupId>com.googlecode.juniversalchardet</groupId>
        <artifactId>juniversalchardet</artifactId>
        <version>1.0.3</version>
    </dependency>

The project page, https://code.google.com/archive/p/juniversalchardet/, describes the library as follows:
juniversalchardet is a Java port of 'universalchardet', the encoding detector library of Mozilla.
The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Note: some encodings that the detector can report are currently not supported by Java.
Sample usage:
```java
import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];

        // try-with-resources closes the file even if reading fails
        try (FileInputStream fis = new FileInputStream(fileName)) {
            // (1) Construct a detector; no listener is passed here
            UniversalDetector detector = new UniversalDetector(null);

            // (2) Feed data until end of file or until the detector is done
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }

            // (3) Signal end of input
            detector.dataEnd();

            // (4) Read the result; null means nothing was detected
            String encoding = detector.getDetectedCharset();
            if (encoding != null) {
                System.out.println("Detected encoding = " + encoding);
            } else {
                System.out.println("No encoding detected.");
            }

            // (5) Reset before reusing the detector
            detector.reset();
        }
    }
}
```
The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.
Solving garbled-text problems in crawlers with juniversalchardet
Source: http://www.bubuko.com/infodetail-2081919.html