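Preface: The previous article, HBase Filter Overview (https://mp.weixin.qq.com/s/76y5NIBQMwvR11Cx2Mbt3w), briefly introduced the building blocks of HBase filters and their class hierarchy. This article supplements it with a closer look at the comparators that filters rely on, an essential entry-level skill for learning HBase Filter. The source code shown here is based on the HDP build HBase 1.1.2.2.6.5.0-292.
Every comparator implementation in HBase extends the parent class ByteArrayComparable, which in turn implements the Comparable interface; the comparators differ only in how they override the compareTo() method.
The sections below walk through the seven comparators that HBase Filter ships with by default.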
1. BinaryComparator
Introduction: a binary comparator that compares the given byte array in lexicographic order.
Start with a small example:
- public class BinaryComparatorDemo {
- public static void main(String[] args) {
- BinaryComparator bc = new BinaryComparator(Bytes.toBytes("bbb"));
- int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
- System.out.println(code1); // 0
- int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
- System.out.println(code2); // 1
- int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
- System.out.println(code3); // -1
- int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
- System.out.println(code4); // -4
- int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
- System.out.println(code5); // -3
- }
- }
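From the output, the comparison rules are as follows (only the sign of the result is meaningful; the magnitude depends on which implementation runs):
If the first bytes of the two arrays differ, the result is positive or negative according to which byte is larger. The pure-Java implementation returns the difference of the two byte values, while the Unsafe implementation returns just -1 or 1 when the difference falls inside a multi-byte chunk.
If the first bytes are equal, the next bytes are compared, and so on until a differing byte is found; the result for that byte follows the rule above.
If the arrays have different lengths and all comparable bytes are equal, the result is the difference of the two lengths.
Here is the compareTo() source code behind these rules:
Implementation 1: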
- static enum UnsafeComparer implements Bytes.Comparer<byte[]> {
- INSTANCE;
- ....
- public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
- if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
- return 0;
- } else {
- int minLength = Math.min(length1, length2);
- int minWords = minLength / 8;
- long offset1Adj = (long)(offset1 + BYTE_ARRAY_BASE_OFFSET);
- long offset2Adj = (long)(offset2 + BYTE_ARRAY_BASE_OFFSET);
- int j = minWords << 3;
- int offset;
- for(offset = 0; offset < j; offset += 8) {
- long lw = theUnsafe.getLong(buffer1, offset1Adj + (long)offset);
- long rw = theUnsafe.getLong(buffer2, offset2Adj + (long)offset);
- long diff = lw ^ rw;
- if (diff != 0L) {
- return lessThanUnsignedLong(lw, rw) ? -1 : 1;
- }
- }
- offset = j;
- int b;
- int a;
- if (minLength - j >= 4) {
- a = theUnsafe.getInt(buffer1, offset1Adj + (long)j);
- b = theUnsafe.getInt(buffer2, offset2Adj + (long)j);
- if (a != b) {
- return lessThanUnsignedInt(a, b) ? -1 : 1;
- }
- offset = j + 4;
- }
- if (minLength - offset >= 2) {
- short sl = theUnsafe.getShort(buffer1, offset1Adj + (long)offset);
- short sr = theUnsafe.getShort(buffer2, offset2Adj + (long)offset);
- if (sl != sr) {
- return lessThanUnsignedShort(sl, sr) ? -1 : 1;
- }
- offset += 2;
- }
- if (minLength - offset == 1) {
- a = buffer1[offset1 + offset] & 255;
- b = buffer2[offset2 + offset] & 255;
- if (a != b) {
- return a - b;
- }
- }
- return length1 - length2;
- }
- }
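To see why the word-at-a-time path returns -1 or 1 instead of a byte difference, here is a minimal sketch that uses only the JDK. It is not the HBase code: the class name WordCompareSketch is hypothetical, and it reads big-endian words through ByteBuffer rather than sun.misc.Unsafe, so an unsigned comparison of two words agrees with the lexicographic order of their bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not HBase code: compares 8 bytes at a time, then the tail.
final class WordCompareSketch {
    static int compare(byte[] a, byte[] b) {
        int minLength = Math.min(a.length, b.length);
        int wordBytes = (minLength / 8) * 8; // bytes covered by whole 8-byte words
        // ByteBuffer is big-endian by default: the first byte is the most significant,
        // so unsigned word order equals lexicographic byte order.
        ByteBuffer wa = ByteBuffer.wrap(a);
        ByteBuffer wb = ByteBuffer.wrap(b);
        for (int i = 0; i < wordBytes; i += 8) {
            long lw = wa.getLong(i);
            long rw = wb.getLong(i);
            if (lw != rw) {
                // Only the ordering is known here, not which byte differs.
                return Long.compareUnsigned(lw, rw) < 0 ? -1 : 1;
            }
        }
        // Remaining bytes are compared one by one as unsigned values.
        for (int i = wordBytes; i < minLength; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] left = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        byte[] right = "abcdefgz".getBytes(StandardCharsets.US_ASCII);
        System.out.println(compare(left, right)); // -1 (decided on the word path)
        byte[] bbb = "bbb".getBytes(StandardCharsets.US_ASCII);
        byte[] bbf = "bbf".getBytes(StandardCharsets.US_ASCII);
        System.out.println(compare(bbb, bbf)); // -4 (decided on the byte path)
    }
}
```

Callers should therefore rely on the sign of the result only, never on its magnitude.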
Implementation 2:
- static enum PureJavaComparer implements Bytes.Comparer<byte[]> {
- INSTANCE;
- private PureJavaComparer() {
- }
- public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
- if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
- return 0;
- } else {
- int end1 = offset1 + length1;
- int end2 = offset2 + length2;
- int i = offset1;
- for(int j = offset2; i < end1 && j < end2; ++j) {
- int a = buffer1[i] & 255;
- int b = buffer2[j] & 255;
- if (a != b) {
- return a - b;
- }
- ++i;
- }
- return length1 - length2;
- }
- }
- }
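The `& 255` mask in the pure-Java loop is what makes the comparison unsigned: a Java byte ranges from -128 to 127, and the mask maps it to 0..255 before subtracting. The sketch below contrasts the masked and unmasked versions; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical sketch: shows the effect of masking bytes with 0xff before comparing.
final class UnsignedByteCompareSketch {
    // Mirrors the masked comparison: bytes are treated as values 0..255.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Same loop without the mask: bytes keep their signed values -128..127.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] - b[i];
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] high = {(byte) 0x80};
        byte[] low = {0x7f};
        System.out.println(compareUnsigned(high, low)); // 1: 0x80 sorts after 0x7f
        System.out.println(compareSigned(high, low));   // -255: the order is flipped
    }
}
```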
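Implementation 1 is an optimized version of Implementation 2, and both live in the Bytes class. HBase prefers Implementation 1: getBestComparer() tries to load UnsafeComparer, and if loading throws anything it falls back to Implementation 2. As follows: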
- public static int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
- return Bytes.LexicographicalComparerHolder.BEST_COMPARER.compareTo(buffer1, offset1, length1, buffer2, offset2, length2);
- }
- ...
- ...
- static final String UNSAFE_COMPARER_NAME = Bytes.LexicographicalComparerHolder.class.getName() + "$UnsafeComparer";
- static final Bytes.Comparer<byte[]> BEST_COMPARER = getBestComparer();
- static Bytes.Comparer<byte[]> getBestComparer() {
- try {
- Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);
- Bytes.Comparer<byte[]> comparer = (Bytes.Comparer)theClass.getEnumConstants()[0];
- return comparer;
- } catch (Throwable var2) {
- return Bytes.lexicographicalComparerJavaImpl();
- }
- }
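The same load-or-fall-back pattern can be tried out with a few lines of plain Java. This is a sketch under assumptions: ComparerHolderSketch, ByteComparer, FastComparer, and PortableComparer are hypothetical names, and the "fast" comparer simply delegates to the portable one because the point here is the selection logic, not the speed.

```java
// Hypothetical sketch of choosing a comparer by class name with a safe fallback.
final class ComparerHolderSketch {
    interface ByteComparer {
        int compare(byte[] a, byte[] b);
    }

    // Stand-in for an optimized comparer that may fail to load on some platforms.
    enum FastComparer implements ByteComparer {
        INSTANCE;

        @Override
        public int compare(byte[] a, byte[] b) {
            return PortableComparer.INSTANCE.compare(a, b);
        }
    }

    // Always-available comparer: unsigned, byte by byte.
    enum PortableComparer implements ByteComparer {
        INSTANCE;

        @Override
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int x = a[i] & 0xff;
                int y = b[i] & 0xff;
                if (x != y) {
                    return x - y;
                }
            }
            return a.length - b.length;
        }
    }

    // Loads the preferred enum by name; any failure selects the portable comparer.
    static ByteComparer bestComparer(String preferredClassName) {
        try {
            Class<?> cls = Class.forName(preferredClassName);
            return (ByteComparer) cls.getEnumConstants()[0];
        } catch (Throwable t) {
            return PortableComparer.INSTANCE;
        }
    }

    public static void main(String[] args) {
        String fast = ComparerHolderSketch.class.getName() + "$FastComparer";
        System.out.println(bestComparer(fast) == FastComparer.INSTANCE);                   // true
        System.out.println(bestComparer("no.such.Comparer") == PortableComparer.INSTANCE); // true
    }
}
```

Catching Throwable rather than Exception matters in this pattern, because a class that fails to initialize raises an Error, not an Exception.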
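2. BinaryPrefixComparator
Introduction: a binary comparator that compares only the leading bytes of the value against the given byte array.
Start with a small example: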
- public class BinaryPrefixComparatorDemo {
- public static void main(String[] args) {
- BinaryPrefixComparator bc = new BinaryPrefixComparator(Bytes.toBytes("b"));
- int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
- System.out.println(code1); // 0
- int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
- System.out.println(code2); // 1
- int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
- System.out.println(code3); // -1
- int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
- System.out.println(code4); // 0
- int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
- System.out.println(code5); // 0
- int code6 = bc.compareTo(Bytes.toBytes("ebbedf"), 0, 6);
- System.out.println(code6); // -3
- }
- }
This comparator is just a slight variation of BinaryComparator, as the following code makes clear:
- public int compareTo(byte[] value, int offset, int length) {
- return Bytes.compareTo(this.value, 0, this.value.length, value, offset, this.value.length <= length ? this.value.length : length);
- }
Compare it with the corresponding BinaryComparator method:
- public int compareTo(byte[] value, int offset, int length) {
- return Bytes.compareTo(this.value, 0, this.value.length, value, offset, length);
- }
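The only difference is the last argument, which becomes min(this.value.length, length), where length is the length passed in for the value. The byte-by-byte comparison therefore examines at most that many bytes of the value.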
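A minimal JDK-only sketch of this prefix rule follows, assuming a plain unsigned byte-by-byte comparison stands in for Bytes.compareTo(); PrefixCompareSketch and its methods are hypothetical names. It also shows an edge case the demo above does not cover: a value shorter than the prefix never compares as equal.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the prefix comparison rule, using only the JDK.
final class PrefixCompareSketch {
    // Unsigned lexicographic comparison of two byte ranges.
    static int compareBytes(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen;
    }

    // Compares the prefix with at most prefix.length leading bytes of the value.
    static int comparePrefix(byte[] prefix, byte[] value, int offset, int length) {
        int n = Math.min(prefix.length, length);
        return compareBytes(prefix, 0, prefix.length, value, offset, n);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(comparePrefix(ascii("b"), ascii("bbbedf"), 0, 6)); // 0: value starts with "b"
        System.out.println(comparePrefix(ascii("b"), ascii("ebbedf"), 0, 6)); // -3: 'b' - 'e'
        System.out.println(comparePrefix(ascii("bbb"), ascii("b"), 0, 1));    // 2: value shorter than prefix
    }
}
```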
3. BitComparator
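Introduction: a bitwise comparator that compares using the AND, OR, and XOR operations provided by BitwiseOp. The result is always either 1 or 0, so it only supports EQUAL and NOT_EQUAL.
Start with a small example: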
- public class BitComparatorDemo {
- public static void main(String[] args) {
- // Same length, OR: bytes are combined starting from the last one; returns 1 if every OR result is 0, otherwise 0.
- BitComparator bc1 = new BitComparator(new byte[]{0,0,0,0}, BitComparator.BitwiseOp.OR);
- int i = bc1.compareTo(new byte[]{0,0,0,0}, 0, 4);
- System.out.println(i); // 1
- // Same length, AND: bytes are combined starting from the last one; returns 1 if every AND result is 0, otherwise 0.
- BitComparator bc2 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.AND);
- int j = bc2.compareTo(new byte[]{0,1,0,1}, 0, 4);
- System.out.println(j); // 1
- // Same length, XOR: bytes are combined starting from the last one; returns 1 if every XOR result is 0, otherwise 0.
- BitComparator bc3 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
- int x = bc3.compareTo(new byte[]{1,0,1,0}, 0, 4);
- System.out.println(x); // 1
- // Different lengths always return 1; only equal lengths are compared bitwise.
- BitComparator bc4 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
- int y = bc4.compareTo(new byte[]{1,0,1}, 0, 3);
- System.out.println(y); // 1
- }
- }
The rules described in the comments above correspond to the following code:
- ...
- public int compareTo(byte[] value, int offset, int length) {
- if (length != this.value.length) {
- return 1;
- } else {
- int b = 0;
- for(int i = length - 1; i >= 0 && b == 0; --i) {
- switch(this.bitOperator) {
- case AND:
- b = this.value[i] & value[i + offset] & 255;
- break;
- case OR:
- b = (this.value[i] | value[i + offset]) & 255;
- break;
- case XOR:
- b = (this.value[i] ^ value[i + offset]) & 255;
- }
- }
- return b == 0 ? 1 : 0;
- }
- }
- ...
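The core idea: walk from the last byte toward the first, apply the operator to each pair of bytes, and leave the loop as soon as a result b != 0 is found.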
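The return convention is easy to misread, because 0 (the "equal" result) means the operator produced at least one nonzero byte. The following JDK-only sketch mirrors the loop above so the cases can be run without HBase; BitCompareSketch and its Op enum are hypothetical stand-ins for BitComparator and BitwiseOp.

```java
// Hypothetical sketch mirroring the bitwise comparison loop, using only the JDK.
final class BitCompareSketch {
    enum Op { AND, OR, XOR }

    static int compare(byte[] mine, Op op, byte[] value, int offset, int length) {
        if (length != mine.length) {
            return 1; // different lengths are never "equal"
        }
        int b = 0;
        // Walk from the last byte to the first; stop at the first nonzero result.
        for (int i = length - 1; i >= 0 && b == 0; i--) {
            switch (op) {
                case AND:
                    b = (mine[i] & value[i + offset]) & 0xff;
                    break;
                case OR:
                    b = (mine[i] | value[i + offset]) & 0xff;
                    break;
                case XOR:
                    b = (mine[i] ^ value[i + offset]) & 0xff;
                    break;
            }
        }
        return b == 0 ? 1 : 0; // 0 means some byte pair produced a nonzero result
    }

    public static void main(String[] args) {
        byte[] mask = {1, 0, 1, 0};
        System.out.println(compare(mask, Op.AND, new byte[]{0, 1, 0, 1}, 0, 4)); // 1: no common bits
        System.out.println(compare(mask, Op.AND, new byte[]{1, 0, 0, 0}, 0, 4)); // 0: first bytes share a bit
        System.out.println(compare(mask, Op.XOR, new byte[]{1, 0, 1}, 0, 3));    // 1: lengths differ
    }
}
```

Read this way, pairing AND with EQUAL selects values that share at least one set bit with the configured byte array.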
4. LongComparator
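Introduction: a comparator dedicated to Long values; it returns 0, -1, or 1. The previous overview article did not mention it, so it is covered here.
Start with a small example: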
- public class LongComparatorDemo {
- public static void main(String[] args) {
- LongComparator longComparator = new LongComparator(1000L);
- int i = longComparator.compareTo(Bytes.toBytes(1000L), 0, 8);
- System.out.println(i); // 0
- int i2 = longComparator.compareTo(Bytes.toBytes(1001L), 0, 8);
- System.out.println(i2); // -1
- int i3 = longComparator.compareTo(Bytes.toBytes(998L), 0, 8);
- System.out.println(i3); // 1
- }
- }
The implementation is very simple and needs little explanation:
- public int compareTo(byte[] value, int offset, int length) {
- Long that = Bytes.toLong(value, offset, length);
- return this.longValue.compareTo(that);
- }
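One reason a dedicated Long comparator is useful is that an unsigned byte-wise comparison orders negative numbers after positive ones. The sketch below illustrates this with the JDK only; it assumes the 8-byte big-endian layout that ByteBuffer.putLong() produces, and LongCompareSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: numeric comparison versus byte-wise comparison of longs.
final class LongCompareSketch {
    // Encodes a long as 8 big-endian bytes.
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Decodes 8 bytes at the given offset and compares numerically.
    static int compareAsLong(long mine, byte[] value, int offset) {
        long that = ByteBuffer.wrap(value, offset, 8).getLong();
        return Long.compare(mine, that);
    }

    // Unsigned lexicographic comparison of the raw bytes.
    static int compareAsBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compareAsLong(1000L, toBytes(1001L), 0));     // -1
        System.out.println(compareAsLong(1L, toBytes(-1L), 0));          // 1: numerically 1 > -1
        System.out.println(compareAsBytes(toBytes(1L), toBytes(-1L)));   // -255: bytes rank 1 before -1
    }
}
```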
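5. NullComparator
Introduction: a null-value comparator that checks whether the current value is null. It returns 0 for null and 1 for non-null, so it only supports EQUAL and NOT_EQUAL.
Start with a small example: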
- public class NullComparatorDemo {
- public static void main(String[] args) {
- NullComparator nc = new NullComparator();
- int i1 = nc.compareTo(Bytes.toBytes("abc"));
- int i2 = nc.compareTo(Bytes.toBytes(""));
- int i3 = nc.compareTo(null);
- System.out.println(i1); // 1
- System.out.println(i2); // 1
- System.out.println(i3); // 0
- }
- }
The implementation is very simple and needs little explanation:
- public int compareTo(byte[] value) {
- return value != null ? 1 : 0;
- }
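Why a 0-or-1 result limits the comparator to EQUAL and NOT_EQUAL can be sketched as follows. This is an illustration under assumptions, not HBase's filter code: CompareOpSketch, its Op enum, and keep() are hypothetical, and the sign convention chosen for LESS and GREATER (the comparator's own value is on the left of the comparison) is an assumption of this sketch.

```java
// Hypothetical sketch: how a filter might interpret a comparator result.
final class CompareOpSketch {
    enum Op { LESS, EQUAL, NOT_EQUAL, GREATER }

    // Returns true when a value with comparator result cmp should be kept.
    // Assumption: cmp compares the comparator's value (left) with the cell value (right).
    static boolean keep(Op op, int cmp) {
        switch (op) {
            case EQUAL:
                return cmp == 0;
            case NOT_EQUAL:
                return cmp != 0;
            case LESS:
                return cmp > 0; // cell value is smaller than the comparator's value
            case GREATER:
                return cmp < 0; // cell value is larger than the comparator's value
            default:
                throw new IllegalArgumentException("unknown op: " + op);
        }
    }

    // Same rule as the null check above: 0 for null, 1 otherwise.
    static int nullCompare(byte[] value) {
        return value != null ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(keep(Op.EQUAL, nullCompare(null)));               // true: null values match
        System.out.println(keep(Op.NOT_EQUAL, nullCompare(new byte[]{1})));  // true: non-null values match
        System.out.println(keep(Op.GREATER, nullCompare(new byte[]{1})));    // false: result is never negative
    }
}
```

Since the result is never negative, an ordering operator either matches nothing or duplicates NOT_EQUAL under this convention, which is why only the two equality operators are meaningful.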
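6. RegexStringComparator
Introduction: a regular-expression comparator that matches values against a regular expression; it only supports EQUAL and NOT_EQUAL. It returns 0 on a match and 1 otherwise.
Start with a small example: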
- public class RegexStringComparatorDemo {
- public static void main(String[] args) {
- RegexStringComparator rsc = new RegexStringComparator("abc");
- int abc = rsc.compareTo(Bytes.toBytes("abcd"), 0, 3);
- System.out.println(abc); // 0
- int bcd = rsc.compareTo(Bytes.toBytes("bcd"), 0, 3);
- System.out.println(bcd); // 1
- String check = "^([a-z0-9A-Z]+[-|\\.]?)+[a-z0-9A-Z]@([a-z0-9A-Z]+(-[a-z0-9A-Z]+)?\\.)+[a-zA-Z]{2,}$";
- RegexStringComparator rsc2 = new RegexStringComparator(check);
- int code = rsc2.compareTo(Bytes.toBytes("zpb@163.com"), 0, "zpb@163.com".length());
- System.out.println(code); // 0
- int code2 = rsc2.compareTo(Bytes.toBytes("zpb#163.com"), 0, "zpb#163.com".length());
- System.out.println(code2); // 1
- }
- }
Its compareTo() method is backed by one of two engines, each with its own regular-expression dialect: JAVA and JONI (aimed at JRuby). The default is RegexStringComparator.EngineType.JAVA. As follows:
- public int compareTo(byte[] value, int offset, int length) {
- return this.engine.compareTo(value, offset, length);
- }
- public static enum EngineType {
- JAVA,
- JONI;
- private EngineType() {
- }
- }
Both implementations are simple; each delegates to its regex library. Here is the JAVA EngineType implementation:
- public int compareTo(byte[] value, int offset, int length) {
- String tmp;
- if (length < value.length / 2) {
- tmp = new String(Arrays.copyOfRange(value, offset, offset + length), this.charset);
- } else {
- tmp = new String(value, offset, length, this.charset);
- }
- return this.pattern.matcher(tmp).find() ? 0 : 1;
- }
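Note that the JAVA engine calls find(), so the pattern may match anywhere in the decoded text unless it is anchored, and only the bytes inside offset and length are considered. A minimal JDK-only sketch of that behavior follows; RegexCompareSketch is a hypothetical name, and decoding with UTF-8 is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical sketch of find()-based regex comparison, using only the JDK.
final class RegexCompareSketch {
    // Returns 0 when the pattern is found anywhere in the selected bytes, else 1.
    static int compare(Pattern pattern, byte[] value, int offset, int length) {
        // Assumption: the bytes are UTF-8 text.
        String text = new String(value, offset, length, StandardCharsets.UTF_8);
        return pattern.matcher(text).find() ? 0 : 1;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Pattern abc = Pattern.compile("abc");
        Pattern anchored = Pattern.compile("^abc$");
        System.out.println(compare(abc, utf8("xxabcxx"), 0, 7));      // 0: found in the middle
        System.out.println(compare(abc, utf8("abcd"), 0, 2));         // 1: only "ab" is examined
        System.out.println(compare(anchored, utf8("xxabcxx"), 0, 7)); // 1: anchors require a full match
    }
}
```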
The JONI EngineType implementation:
- public int compareTo(byte[] value, int offset, int length) {
- Matcher m = this.pattern.matcher(value);
- return m.search(offset, length, this.pattern.getOptions()) < 0 ? 1 : 0;
- }
Both are easy to follow, so no further explanation is needed.
7. SubstringComparator
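Introduction: checks whether the given substring occurs in value, ignoring case. It returns 0 if the substring is found and 1 otherwise, so it only supports EQUAL and NOT_EQUAL.
Start with a small example: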
- public class SubstringComparatorDemo {
- public static void main(String[] args) {
- String value = "aslfjllkabcxxljsl";
- SubstringComparator sc = new SubstringComparator("abc");
- int i = sc.compareTo(Bytes.toBytes(value), 0, value.length());
- System.out.println(i); // 0
- SubstringComparator sc2 = new SubstringComparator("abd");
- int i2 = sc2.compareTo(Bytes.toBytes(value), 0, value.length());
- System.out.println(i2); // 1
- SubstringComparator sc3 = new SubstringComparator("ABC");
- int i3 = sc3.compareTo(Bytes.toBytes(value), 0, value.length());
- System.out.println(i3); // 0
- }
- }
This implementation is also very simple and needs little explanation:
- public int compareTo(byte[] value, int offset, int length) {
- return Bytes.toString(value, offset, length).toLowerCase().contains(this.substr) ? 0 : 1;
- }
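The compareTo() excerpt lowercases only the value, so the case-insensitive behavior in the demo implies that the stored substring is lowercased as well, presumably in the constructor. The sketch below lowercases both sides explicitly to reproduce the demo results with the JDK only; SubstringCompareSketch is a hypothetical name, and the use of UTF-8 and Locale.ROOT is an assumption made to keep the example deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of a case-insensitive substring check, using only the JDK.
final class SubstringCompareSketch {
    // Returns 0 when substr occurs in the selected bytes (ignoring case), else 1.
    static int compare(String substr, byte[] value, int offset, int length) {
        String needle = substr.toLowerCase(Locale.ROOT);
        String haystack = new String(value, offset, length, StandardCharsets.UTF_8)
                .toLowerCase(Locale.ROOT);
        return haystack.contains(needle) ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] value = "aslfjllkabcxxljsl".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare("abc", value, 0, value.length)); // 0
        System.out.println(compare("ABC", value, 0, value.length)); // 0: case is ignored
        System.out.println(compare("abd", value, 0, value.length)); // 1
    }
}
```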
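That concludes the tour of the seven comparators. Even if the source code does not interest you, do read the small examples to get familiar with each comparator's constructor and output, since they come up often when working with HBase filters. Beyond these seven, you can also write your own custom comparator.
Source: https://www.cnblogs.com/zpb2016/p/12775374.html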