HBase源码分析5—BloomFilter

by yiz96 15. 2 月 2018 hbase, 数据库 1

源码分析的5和6两篇合起来准备一起探讨HBase的底层存储格式，BloomFilter Block也是HFile的一部分。但是BloomFilter这个东西思想上虽然很简单，但是原理上并不简单，所以把BloomFilter这部分单独拎出来讲一下。

BloomFilter介绍

是什么

BloomFilter是一种基于概率的存在检测算法。我们把BloomFilter看做是一个有m个bit位的数组 S[1..m] ，另外我们还拥有 k 个相互独立的哈希函数 H[1..k] 。当向BloomFilter中添加元素 x 时，先计算哈希结果 R=H(x) ，然后将R中的每个结果对应的 S[R[i]] 的位置1。当查询时，同样的计算哈希结果 R=H(x) ，然后查询 S 中对应的位是不是已经置1了，如果全为1，则有可能存在，否则则不存在。

图摘自BloomFilter Servey

参数分析

首先假定几个参数，为了后面分析代码方便，我们这里先告诉大家他们在HBase源码中叫什么名字。

参数名	HBase源码中的变量名	含义
n	maxKeys	插入到BloomFilter的元素个数
m	bitSize	BloomFilter的长度
k	hashCount (nbHash)	哈希函数个数
e	err	误正率（False Positive Rate）

首先很显然，BloomFilter对于判断不存在是完全准确的，但是对于判断存在可能会出错，这是由于插入元素不断增加，哈希碰撞的概率会变大。首先我们估计一下误正率（False Positive Rate） f 。我们假设哈希函数是完全随机的，那么BloomFilter中的某一位在插入 n 个元素后仍然是0的概率是

1/m 是某个哈希函数选中这一位的概率，有 k 个函数和 n 个元素。后面这个是求极限用洛必达法则导出来的（数学捉急了）……

那么，误正率为：

求上式的最小值，可以得到m、n确定下的合适的k的取值

在这个课件中放了一些不同的 m 、 n 、 k 计算出的误正率的对比，也可以看出我们的结论是对的。

另外当m和误正率f固定的情况下，最优的m可以得到

这两个结论是贯穿代码的两个结论，大部分函数都可以由这俩方程变换一下得到。

关于本部分内容可以详细参考以下材料

Network Applications of Bloom Filters: A Survey

CS 6550: Bloom Filters and Hashing

BloomFilter折叠

这部分没看到哪里有讲解，但是在代码中发现了有类似 compactBloom这种函数，只能从论文中找找相关分析。

V1的BloomFilter为什么叫 ByteBloomFilter ？是因为它使用了一个 ByteBuffer 做为BloomFilter的存储数组，每个byte可以存8个标志位（因为1 byte=8 bit）。我们从前一部分知道，m的最佳大小由 f 和 n 影响，所以 m 可能会很大，可以通过一些手段来减少空间占用，所以就出现了折叠（folding）技术。

折叠的方法如上图，就是将多个相同大小的块OR起来成为1个块。 ByteBloomFilter实现的折叠策略很暴力：

找到一个块数 B ，使得 B 是 m 的最大二次幂因数，且折叠后理论容量大于当前容量，就折叠，循环直到不能折叠为止。

可以对照代码看一下：

    public void compactBloom() {
        // see if the actual size is exponentially smaller than expected.
        if (this.keyCount > 0 && this.bloom.hasArray()) {
            int pieces = 1;  // 要被分成多少块进行折叠，初始是一整块
            int newByteSize = (int) this.byteSize;   // byteBuffer的长度
            int newMaxKeys = this.maxKeys;  // 最大容量

            // while exponentially smaller & folding is lossless
            while ((newByteSize & 1) == 0 && newMaxKeys > (this.keyCount << 1)) {
                pieces <<= 1;       // 块数x2，相当于对折
                newByteSize >>= 1;  // buffer长度对折
                newMaxKeys >>= 1;   // 最大容量减半
            }

            // if we should fold these into pieces
            if (pieces > 1) {
                byte[] array = this.bloom.array();
                int start = this.bloom.arrayOffset();
                int end = start + newByteSize;
                int off = end;
                for (int p = 1; p < pieces; ++p) {
                    for (int pos = start; pos < end; ++pos) {
                        array[pos] |= array[off++]; // pos在第一个区间循环，off是往后滑动的。。脑补一下。。
                    }
                }
                // folding done, only use a subset of this array
                this.bloom.rewind();
                this.bloom.limit(newByteSize);
                this.bloom = this.bloom.slice();
                this.byteSize = newByteSize;
                this.maxKeys = newMaxKeys;
            }
        }
    }

public void compactBloom() {

// see if the actual size is exponentially smaller than expected.

if (this.keyCount > 0 && this.bloom.hasArray()) {

int pieces = 1; // 要被分成多少块进行折叠，初始是一整块

int newByteSize = (int) this.byteSize; // byteBuffer的长度

int newMaxKeys = this.maxKeys; // 最大容量

// while exponentially smaller & folding is lossless

while ((newByteSize & 1) == 0 && newMaxKeys > (this.keyCount << 1)) {

pieces <<= 1; // 块数x2，相当于对折

newByteSize >>= 1; // buffer长度对折

newMaxKeys >>= 1; // 最大容量减半

}

// if we should fold these into pieces

if (pieces > 1) {

byte[] array = this.bloom.array();

int start = this.bloom.arrayOffset();

int end = start + newByteSize;

int off = end;

for (int p = 1; p < pieces; ++p) {

for (int pos = start; pos < end; ++pos) {

array[pos] |= array[off++]; // pos在第一个区间循环，off是往后滑动的。。脑补一下。。

}

// folding done, only use a subset of this array

this.bloom.rewind();

this.bloom.limit(newByteSize);

this.bloom = this.bloom.slice();

this.byteSize = newByteSize;

this.maxKeys = newMaxKeys;

}

在这种策略下，直观上某些BloomFilter会被极大程度地压缩，实际受限于理论容量的影响并不会这样。我们已经假设了hash函数是完全随机的，假设当前的buffer被分为了 b 个等大小的块，那么hash结果落在哪个块里也是完全随机的，将 b 个块重叠起来，hash的结果落在哪个位置上还是随机的。

由前面的推导结果可以看到，当f确定下来的时候， m 和 n 是线性相关的， k 也不变，那么在 m 被折叠，且折叠后的 n' 可以容纳当前所有key的情况下， f 理论上也不会比折叠前高。所以不会出现准确度下降的问题。

但是为什么只能按二次幂块进行压缩，暂时没找到相关论文。关于这里我看了一些论文，感觉说的跟HBase源码实现的都不是一回事。没什么其他思路，暂时搁置这里了。

（我觉得可能就是工程方便而已）

本部分内容也有两篇材料可以参考

Compressed Bloom Filters

Folding and Unfolding Bloom Filters – An off-Line Planing and on-Line Optimisation

部分代码解析

代码没啥好讲的了，都比较简单，就是前面的结论变形一下……

CompoundBloomFilter 实质上就是包了一层 ByteBloomFilter ，把东西都放HFile里，按需加载……这里不详细讲了先……

HBase源码分析5—BloomFilter

BloomFilter介绍

是什么

参数分析

BloomFilter折叠

部分代码解析

1 thought on “HBase源码分析5—BloomFilter”

发表回复取消回复

BloomFilter介绍

是什么

参数分析

BloomFilter折叠

部分代码解析

1 thought on “HBase源码分析5—BloomFilter”

发表回复 取消回复

发表回复取消回复