详解 Huffman 编码算法之 Java 实现
这里有新鲜出炉的 Java 设计模式, 程序狗速度看过来!
Java 程序设计语言
java 是一种可以撰写跨平台应用软件的面向对象的程序设计语言, 是由 Sun Microsystems 公司于 1995 年 5 月推出的 Java 程序设计语言和 Java 平台 (即 JavaEE(j2ee), JavaME(j2me), JavaSE(j2se)) 的总称
Huffman 编码是一种编码方式, 常用于无损压缩本文只介绍用 Java 语言来实现该编码方式的算法和数据结构有兴趣的可以了解一下
Huffman 编码介绍
Huffman 编码处理的是字符以及字符对应的二进制的编码配对问题, 分为编码和解码, 目的是压缩字符对应的二进制数据长度我们知道字符存贮和传输的时候都是二进制的 (计算机只认识 0/1), 那么就有字符与二进制之间的 mapping 关系字符属于字符集(Charset), 字符需要通过编码(encode) 为二进制进行存贮和传输, 显示的时候需要解码 (decode) 回字符, 字符集与编码方法是一对多关系 (Unicode 可以用 UTF-8,UTF-16 等编码) 理解了字符集, 编码以及解码, 满天飞的乱码问题也就游刃而解了以英文字母小写 a 为例, ASCII 编码中, 十进制为 97, 二进制为 01100001ASCII 的每一个字符都用 8 个 Bit(1Byte)编码, 假如有 1000 个字符要传输, 那么就要传输 8000 个 Bit 问题来了, 英文中字母 e 的使用频率为 12.702%, 而 z 为 0.074%, 前者是后者的 100 多倍, 但是确使用相同位数的二进制可以做得更好, 方法就是可变长度编码, 指导原则就是频率高的用较短的位数编码, 频率低的用较长位数编码 Huffman 编码算法就是处理这样的问题
Huffman 编码 Java 实现
Huffman 编码算法主要用到的数据结构是完全二叉树 (full binary tree) 和优先级队列后者用的是 Java.util.PriorityQueue, 前者自己实现(都为内部类), 代码如下:
- static class Tree {
- private Node root;
- public Node getRoot() {
- return root;
- }
- public void setRoot(Node root) {
- this.root = root;
- }
- }
- static class Node implements Comparable < Node > {
- private String chars = "";
- private int frequence = 0;
- private Node parent;
- private Node leftNode;
- private Node rightNode;@Override public int compareTo(Node n) {
- return frequence - n.frequence;
- }
- public boolean isLeaf() {
- return chars.length() == 1;
- }
- public boolean isRoot() {
- return parent == null;
- }
- public boolean isLeftChild() {
- return parent != null && this == parent.leftNode;
- }
- public int getFrequence() {
- return frequence;
- }
- public void setFrequence(int frequence) {
- this.frequence = frequence;
- }
- public String getChars() {
- return chars;
- }
- public void setChars(String chars) {
- this.chars = chars;
- }
- public Node getParent() {
- return parent;
- }
- public void setParent(Node parent) {
- this.parent = parent;
- }
- public Node getLeftNode() {
- return leftNode;
- }
- public void setLeftNode(Node leftNode) {
- this.leftNode = leftNode;
- }
- public Node getRightNode() {
- return rightNode;
- }
- public void setRightNode(Node rightNode) {
- this.rightNode = rightNode;
- }
- }
统计数据
既然要按频率来安排编码表, 那么首先当然得获得频率的统计信息我实现了一个方法处理这样的问题如果已经有统计信息, 那么转为 Map<Character,Integer > 即可如果你得到的信息是百分比, 乘以 100 或 1000, 或 10000 总是可以转为整数比如 12.702% 乘以 1000 为 12702,Huffman 编码只关心大小问题统计方法实现如下:
- public static Map<Character, Integer> statistics(char[] charArray) {
- Map<Character, Integer> map = new HashMap<Character, Integer>();
- for (char c : charArray) {
- Character character = new Character(c);
- if (map.containsKey(character)) {
- map.put(character, map.get(character) + 1);
- } else {
- map.put(character, 1);
- }
- }
- return map;
- }
构建树
构建树是 Huffman 编码算法的核心步骤思想是把所有的字符挂到一颗完全二叉树的叶子节点, 任何一个非页子节点的左节点出现频率不大于右节点算法为把统计信息转为 Node 存放到一个优先级队列里面, 每一次从队列里面弹出两个最小频率的节点, 构建一个新的父 Node(非叶子节点), 字符内容刚弹出来的两个节点字符内容之和, 频率也是它们的和, 最开始的弹出来的作为左子节点, 后面一个作为右子节点, 并且把刚构建的父节点放到队列里面重复以上的动作 N-1 次, N 为不同字符的个数 (每一次队列里面个数减 1) 结束以上步骤, 队列里面剩一个节点, 弹出作为树的根节点代码如下:
- private static Tree buildTree(Map < Character, Integer > statistics, List < Node > leafs) {
- Character[] keys = statistics.keySet().toArray(new Character[0]);
- PriorityQueue < Node > priorityQueue = new PriorityQueue < Node > ();
- for (Character character: keys) {
- Node node = new Node();
- node.chars = character.toString();
- node.frequence = statistics.get(character);
- priorityQueue.add(node);
- leafs.add(node);
- }
- int size = priorityQueue.size();
- for (int i = 1; i <= size - 1; i++) {
- Node node1 = priorityQueue.poll();
- Node node2 = priorityQueue.poll();
- Node sumNode = new Node();
- sumNode.chars = node1.chars + node2.chars;
- sumNode.frequence = node1.frequence + node2.frequence;
- sumNode.leftNode = node1;
- sumNode.rightNode = node2;
- node1.parent = sumNode;
- node2.parent = sumNode;
- priorityQueue.add(sumNode);
- }
- Tree tree = new Tree();
- tree.root = priorityQueue.poll();
- return tree;
- }
编码
某个字符对应的编码为, 从该字符所在的叶子节点向上搜索, 如果该字符节点是父节点的左节点, 编码字符之前加 0, 反之如果是右节点, 加 1, 直到根节点只要获取了字符和二进制码之间的 mapping 关系, 编码就非常简单代码如下:
- public static String encode(String originalStr, Map < Character, Integer > statistics) {
- if (originalStr == null || originalStr.equals("")) {
- return "";
- }
- char[] charArray = originalStr.toCharArray();
- List < Node > leafNodes = new ArrayList < Node > ();
- buildTree(statistics, leafNodes);
- Map < Character,
- String > encodInfo = buildEncodingInfo(leafNodes);
- StringBuffer buffer = new StringBuffer();
- for (char c: charArray) {
- Character character = new Character(c);
- buffer.append(encodInfo.get(character));
- }
- return buffer.toString();
- }
- private static Map < Character,
- String > buildEncodingInfo(List < Node > leafNodes) {
- Map < Character,
- String > codewords = new HashMap < Character,
- String > ();
- for (Node leafNode: leafNodes) {
- Character character = new Character(leafNode.getChars().charAt(0));
- String codeword = "";
- Node currentNode = leafNode;
- do {
- if (currentNode.isLeftChild()) {
- codeword = "0" + codeword;
- } else {
- codeword = "1" + codeword;
- }
- currentNode = currentNode.parent;
- } while ( currentNode . parent != null );
- codewords.put(character, codeword);
- }
- return codewords;
- }
解码
因为 Huffman 编码算法能够保证任何的二进制码都不会是另外一个码的前缀, 解码非常简单, 依次取出二进制的每一位, 从树根向下搜索, 1 向右, 0 向左, 到了叶子节点(命中), 退回根节点继续重复以上动作代码如下:
- public static String decode(String binaryStr, Map < Character, Integer > statistics) {
- if (binaryStr == null || binaryStr.equals("")) {
- return "";
- }
- char[] binaryCharArray = binaryStr.toCharArray();
- LinkedList < Character > binaryList = new LinkedList < Character > ();
- int size = binaryCharArray.length;
- for (int i = 0; i < size; i++) {
- binaryList.addLast(new Character(binaryCharArray[i]));
- }
- List < Node > leafNodes = new ArrayList < Node > ();
- Tree tree = buildTree(statistics, leafNodes);
- StringBuffer buffer = new StringBuffer();
- while (binaryList.size() > 0) {
- Node node = tree.root;
- do {
- Character c = binaryList.removeFirst();
- if (c.charValue() == '0') {
- node = node.leftNode;
- } else {
- node = node.rightNode;
- }
- } while (! node . isLeaf ());
- buffer.append(node.chars);
- }
- return buffer.toString();
- }
测试以及比较
以下测试 Huffman 编码的正确性(先编码, 后解码, 包括中文), 以及 Huffman 编码与常见的字符编码的二进制字符串比较代码如下:
- public static void main(String[] args) {
- String oriStr = "Huffman codes compress data very effectively: savings of 20% to 90% are typical,"
- + "depending on the characteristics of the data being compressed. 中华崛起";
- Map<Character, Integer> statistics = statistics(oriStr.toCharArray());
- String encodedBinariStr = encode(oriStr, statistics);
- String decodedStr = decode(encodedBinariStr, statistics);
- System.out.println("Original sstring:" + oriStr);
- System.out.println("Huffman encoed binary string:" + encodedBinariStr);
- System.out.println("decoded string from binariy string:" + decodedStr);
- System.out.println("binary string of UTF-8:"
- + getStringOfByte(oriStr, Charset.forName("UTF-8")));
- System.out.println("binary string of UTF-16:"
- + getStringOfByte(oriStr, Charset.forName("UTF-16")));
- System.out.println("binary string of US-ASCII:"
- + getStringOfByte(oriStr, Charset.forName("US-ASCII")));
- System.out.println("binary string of GB2312:"
- + getStringOfByte(oriStr, Charset.forName("GB2312")));
- }
- public static String getStringOfByte(String str, Charset charset) {
- if (str == null || str.equals("")) {
- return "";
- }
- byte[] byteArray = str.getBytes(charset);
- int size = byteArray.length;
- StringBuffer buffer = new StringBuffer();
- for (int i = 0; i < size; i++) {
- byte temp = byteArray[i];
- buffer.append(getStringOfByte(temp));
- }
- return buffer.toString();
- }
- public static String getStringOfByte(byte b) {
- StringBuffer buffer = new StringBuffer();
- for (int i = 7; i >= 0; i--) {
- byte temp = (byte) ((b >> i) & 0x1);
- buffer.append(String.valueOf(temp));
- }
- return buffer.toString();
- }
来源: http://www.phperz.com/article/18/0209/359176.html