How UTF-8 encoded strings are stored in Java


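Key points:
Calling byte[] bytes = "xxxx".getBytes("utf-8") encodes the string as UTF-8 and returns the resulting byte array. In UTF-8, a character in the ASCII range is stored in 1 byte, and a commonly used Chinese character is stored in 3 bytes.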
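The byte counts described above can be checked directly. The following is a minimal sketch; the class name Utf8ByteCounts and the helper utf8Length are illustrative names, not part of the original article. Unicode escapes are used so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCounts {
    // Returns the number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // ASCII letter: 1 byte
        System.out.println(utf8Length("\u00e9"));       // U+00E9 (Latin e with acute): 2 bytes
        System.out.println(utf8Length("\u4e16"));       // U+4E16 (a common Chinese character): 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (outside the BMP): 4 bytes
        System.out.println(utf8Length("a\u4e16\u754c")); // 1 + 3 + 3 = 7 bytes
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "utf-8" avoids the checked UnsupportedEncodingException.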

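UTF-8 is a variable-length encoding. If a character is encoded as a single byte, the highest bit of that byte is 0. If it is encoded as several bytes, the number of consecutive 1 bits at the top of the first byte gives the total number of bytes in the sequence, and every following byte starts with 10. The original design allowed sequences of up to 6 bytes; since RFC 3629, UTF-8 is limited to 4 bytes, so the 5- and 6-byte forms below are obsolete and listed only for reference.
Byte layout:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (obsolete)
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (obsolete)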
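The prefix rules in the table can be turned into a small classifier for a single byte. This is a sketch under stated assumptions: the class Utf8LeadByte and the method sequenceLength are hypothetical names, only the 1- to 4-byte forms are accepted, and the check looks at the bit pattern alone (it does not reject overlong or otherwise invalid sequences).

```java
class Utf8LeadByte {
    // Returns the sequence length announced by a lead byte (1 to 4),
    // or -1 for a continuation byte (10xxxxxx) or an unsupported prefix.
    static int sequenceLength(byte b) {
        int v = b & 0xff;                 // work with the unsigned value 0..255
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx, or the obsolete 5/6-byte prefixes
    }

    public static void main(String[] args) {
        // e8 bf 99 is the UTF-8 encoding of U+8FD9.
        byte[] bytes = {(byte) 0xE8, (byte) 0xBF, (byte) 0x99};
        for (byte b : bytes) {
            System.out.print(sequenceLength(b) + " "); // 3 -1 -1
        }
        System.out.println();
    }
}
```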
 
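Note: every byte of a multi-byte UTF-8 sequence has its highest bit set to 1, and UTF-8 does not treat that bit as a sign bit. Java's byte type, however, is a signed 8-bit integer, so printing these bytes directly yields negative numbers.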

byte[] bss = "这是一个神奇的世界".getBytes("utf-8");
System.out.println("bss length: " + bss.length); // prints 27: each Chinese character takes three bytes

// prints: -24 -65 -103 -26 -104 -81 -28 -72 -128 -28 -72 -86 -25 -91 -98 -27 -91 -121 -25 -102 -124 -28 -72 -106 -25 -107 -116
for (byte b : bss) {
    System.out.print(b + " ");
}
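To see where a value such as -24 comes from, it helps to look at one byte in isolation. The sketch below is illustrative only (SignedByteDemo and unsignedValue are made-up names); it shows that the same 8 bits read as two's complement give -24 and read as unsigned give 232, and that the two differ by exactly 256.

```java
class SignedByteDemo {
    // Returns the value 0..255 that the byte's bit pattern has when read as unsigned.
    static int unsignedValue(byte b) {
        return Byte.toUnsignedInt(b); // available since Java 8
    }

    public static void main(String[] args) {
        byte first = (byte) 0xE8; // bit pattern 11101000, first UTF-8 byte of U+8FD9
        System.out.println(first);                // -24, read as two's complement
        System.out.println(first + 256);          // 232, the same bits read as unsigned
        System.out.println(unsignedValue(first)); // 232
    }
}
```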

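To get the actual (unsigned) value that each byte represents, mask it with 0xff as shown in the following approaches. (This assumes familiarity with bitwise operations and with sign-magnitude, ones' complement, and two's complement representations.)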

1. Decimal

byte[] bss = "这是一个神奇的世界".getBytes("utf-8");
System.out.println("bss length: " + bss.length); // prints 27: each Chinese character takes three bytes
// prints: 232 191 153 230 152 175 228 184 128 228 184 170 231 165 158 229 165 135 231 154 132 228 184 150 231 149 140
for (byte b : bss) {
    // b is sign-extended to int; the mask keeps only the low 8 bits
    System.out.print((b & 0xff) + " ");
}

2. Hexadecimal

byte[] bss = "这是一个神奇的世界".getBytes("utf-8");
System.out.println("bss length: " + bss.length); // prints 27: each Chinese character takes three bytes
// prints: e8 bf 99 e6 98 af e4 b8 80 e4 b8 aa e7 a5 9e e5 a5 87 e7 9a 84 e4 b8 96 e7 95 8c
for (byte b : bss) {
    System.out.print(Integer.toHexString(b & 0xff) + " ");
}

3. Binary

byte[] bss = "这是一个神奇的世界".getBytes("utf-8");
System.out.println("bss length: " + bss.length); // prints 27: each Chinese character takes three bytes
// prints (on a single line; wrapped here for readability):
// 11101000 10111111 10011001 11100110 10011000 10101111 11100100
// 10111000 10000000 11100100 10111000 10101010 11100111 10100101 10011110
// 11100101 10100101 10000111 11100111 10011010 10000100 11100100
// 10111000 10010110 11100111 10010101 10001100
for (byte b : bss) {
    System.out.print(Integer.toBinaryString(b & 0xff) + " ");
}
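One caveat with the hexadecimal and binary loops above: Integer.toHexString and Integer.toBinaryString drop leading zeros. That does not show in the all-Chinese example, because every byte there is at least 0x80, but ASCII bytes would print with fewer digits. A possible fixed-width variant is sketched below; ByteFormats, toHex, and toBinary are hypothetical helper names.

```java
import java.nio.charset.StandardCharsets;

class ByteFormats {
    // Two lowercase hex digits, e.g. 0x0a -> "0a".
    static String toHex(byte b) {
        return String.format("%02x", b & 0xff);
    }

    // Eight binary digits, left-padded with zeros, e.g. 0x61 -> "01100001".
    static String toBinary(byte b) {
        String bits = Integer.toBinaryString(b & 0xff); // 1 to 8 digits
        return "00000000".substring(bits.length()) + bits;
    }

    public static void main(String[] args) {
        // "a" followed by U+4E16: one ASCII byte, then a three-byte sequence.
        for (byte b : "a\u4e16".getBytes(StandardCharsets.UTF_8)) {
            System.out.println(toHex(b) + " " + toBinary(b));
        }
    }
}
```

With fixed-width output, the leading 0, 110, 1110, and 10 prefixes line up and can be read off directly.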

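Exercise: truncating a mixed Chinese and English string

* Given a string and a byte count, return the prefix of the string that fits in that many UTF-8 bytes. Non-English characters occupy several bytes in UTF-8,
* so if the cut position falls in the middle of such a character, the partially cut character is dropped.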

import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class StrTruncate {

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        System.out.println("Enter (string,byteCount)");
        String[] parts = scanner.nextLine().split(",");

        String sub = new StrTruncate().getSubStr(parts[0], Integer.parseInt(parts[1].trim()));
        System.out.println("Truncated string: " + sub);
    }

    // charLen is the maximum length of the result in UTF-8 bytes (not in chars).
    public String getSubStr(String resource, int charLen) {
        if (charLen <= 0) {
            return ""; // nothing fits
        }
        byte[] bytes = resource.getBytes(StandardCharsets.UTF_8);
        if (charLen >= bytes.length) {
            return resource; // nothing to cut; also avoids indexing past the end of the array
        }
        // A continuation byte (10xxxxxx) at the cut position means the cut would split
        // a character, so move back to that character's first byte.
        while ((bytes[charLen] & 0xC0) == 0x80) {
            charLen--;
        }
        return new String(bytes, 0, charLen, StandardCharsets.UTF_8);
    }
}
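The exercise can also be approached without encoding the whole string first. The sketch below is an alternative technique, not the article's method: it walks the string code point by code point, adds up the number of UTF-8 bytes each one needs, and stops before the budget is exceeded. The names Utf8Truncate and truncate are illustrative, and the size calculation assumes the string contains no unpaired surrogates.

```java
class Utf8Truncate {
    // Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes
    // without splitting a character. Assumes well-formed UTF-16 input.
    static String truncate(String text, int maxBytes) {
        int used = 0; // UTF-8 bytes consumed so far
        int i = 0;    // index into text, in chars
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            // UTF-8 size of this code point: 1, 2, 3, or 4 bytes
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (used + size > maxBytes) {
                break; // this character would not fit completely
            }
            used += size;
            i += Character.charCount(cp); // 2 chars for code points outside the BMP
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        String s = "a\u4e16b"; // 1 + 3 + 1 = 5 bytes in UTF-8
        for (int n = 0; n <= 5; n++) {
            System.out.println(n + " -> \"" + truncate(s, n) + "\"");
        }
    }
}
```

For the 5-byte sample string, budgets of 1, 2, and 3 bytes all keep only "a", because the three-byte character fits only once 4 bytes are allowed.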



Original article: http://www.cnblogs.com/shuais16969/p/7896089.html

Source: http://www.bubuko.com/infodetail-2406825.html
