前言：

最近被網絡爬蟲中的去重政策所困擾。使用一些其他的“理想”的去重政策，不過在運作過程中總是會不太聽話。不過當我發現了BloomFilter這個東西的時候，的确，這裡是我目前找到的最靠譜的一種方法。

如果，你說URL去重嘛，有什麼難的。那麼你可以看完下面的一些問題再說這句話。

關于BloomFilter：

Bloom filter 是由 Howard Bloom 在 1970 年提出的二進制向量資料結構，它具有很好的空間和時間效率，被用來檢測一個元素是不是集合中的一個成員。如果檢測結果為是，該元素不一定在集合中；但如果檢測結果為否，該元素一定不在集合中。是以Bloom filter具有100%的召回率。這樣每個檢測請求傳回有“在集合内（可能錯誤）”和“不在集合内（絕對不在集合内）”兩種情況，可見 Bloom filter 是犧牲了正确率以節省空間。

以前的去重政策：

1.想到過的URL去重政策

在資料庫中建立字段的UNIQUE屬性
在資料庫中建立一個唯一的索引，在插入資料之前檢查待插入的資料是否存在
使用Set或HashSet儲存資料，確定唯一
使用Map或是一個定長數組記錄某一個URL是否被通路過

2.以上去重政策存在的問題

（1）對于在資料庫中建立字段的UNIQUE屬性，的确是可以避免一些重複性操作。不過在多次MySQL報錯之後，程式可能會直接崩潰，是以這種方式不可取

（2）如果我們要在每一次插入資料之前都去檢查待插入的資料是否存在，這樣勢必會影響程式的效率

（3）這種方式是我在第一次嘗試的時候使用的，放棄繼續使用的原因是：OOM。當然，這裡并不是程式的記憶體洩露，而程式中真的有這麼多記憶體需要被占用（因為從待通路隊列中解析出來的URL要遠比它本身要多得多）

（4）在前幾篇部落格中，我就有提到使用Map對象來儲存URL的通路資訊。不過，現在我要否定它。因為，在長時間運作之後，Map也是會占用大量的記憶體。隻不過，會比第3種方式要小一些。下面是使用Map<Integer, Integer>去重，在長時間運作中記憶體的使用情況：

網絡爬蟲：URL去重政策之布隆過濾器(BloomFilter)的使用前言：關于BloomFilter：以前的去重政策：BloomFilter的使用：

BloomFilter的使用：

1.一般情況下BloomFilter使用記憶體的情況：

網絡爬蟲：URL去重政策之布隆過濾器(BloomFilter)的使用前言：關于BloomFilter：以前的去重政策：BloomFilter的使用：

2.爬蟲程式中BloomFilter使用記憶體的情況（已運作4小時）：

網絡爬蟲：URL去重政策之布隆過濾器(BloomFilter)的使用前言：關于BloomFilter：以前的去重政策：BloomFilter的使用：

3.程式結構圖

網絡爬蟲：URL去重政策之布隆過濾器(BloomFilter)的使用前言：關于BloomFilter：以前的去重政策：BloomFilter的使用：

4.BloomFilter的一般使用

此處關于BloomFilter的Java代碼部分，參考于：http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

如果你看了上面的文章，相信你已經了解到布隆過濾器的空間複雜度是S(n)=O(n)。關于這一點，相信你已經從上面的記憶體使用情況中了解到了這一點。那麼以下會是一些相關的Java代碼展示。而在查重過程也很有效率，時間複雜度是T(n)=O(1)。

BloomFilter.java

import java.util.BitSet;

public class BloomFilter {
    
    /* BitSet初始配置設定2^24個bit */
    private static final int DEFAULT_SIZE = 1 << 25;
    
    /* 不同哈希函數的種子，一般應取質數 */
    private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
    
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    
    /* 哈希函數對象 */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // 将字元串标記到bits中
    public void add(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // 判斷字元串是否已經被bits标記
    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        
        return ret;
    }

    /* 哈希函數類 */
    public static class SimpleHash {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        // hash函數，采用簡單的權重和hash
        public int hash(String value) {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++) {
                result = seed * result + value.charAt(i);
            }
            return (cap - 1) & result;
        }
    }
}

Test.java

public class Test {

    private final String[] URLS = {
            "http://www.csdn.net/",
            "http://www.baidu.com/",
            "http://www.google.com.hk",
            "http://www.cnblogs.com/",
            "http://www.zhihu.com/",
            "https://www.shiyanlou.com/",
            "http://www.google.com.hk",
            "https://www.shiyanlou.com/",
            "http://www.csdn.net/"
    };
    
    private void testBloomFilter() {
        BloomFilter filter = new BloomFilter();
        for (int i = 0; i < URLS.length; i++) {
            if (filter.contains(URLS[i])) {
                System.out.println("contain: " + URLS[i]);
                continue;
            }
            
            filter.add(URLS[i]);
        }
    }

    public static void main(String[] args) {
        Test t = new Test();
        t.testBloomFilter();
    }
}

5.BloomFilter在爬蟲中過濾重複的URL

public class ParserRunner implements Runnable {

    private SpiderSet mResultSet = null;
    private WebInfoModel mInfoModel = null;
    private int mIndex;
    private final boolean DEBUG = false;
    
    private SpiderBloomFilter mFlagBloomFilter = null;
    
    public ParserRunner(SpiderSet set, WebInfoModel model, int index, SpiderBloomFilter filter) {
        mResultSet = set;
        mInfoModel = model;
        mIndex = index;
        mFlagBloomFilter = filter;
    }
    
    
    @Override
    public void run() {
        long t = System.currentTimeMillis();

        SpiderQueue tmpQueue = new SpiderQueue();
        PythonUtils.fillAddressQueueByPython(tmpQueue, mInfoModel.getAddress(), mInfoModel.getLevel());
        
        WebInfoModel model = null;
        while (!tmpQueue.isQueueEmpty()) {
            model = tmpQueue.poll();
            if (model == null || mFlagBloomFilter.contains(model.getAddress())) {
                continue;
            }
            
            mResultSet.add(model);
            mFlagBloomFilter.add(model.getAddress());
        }
        
        tmpQueue = null;
        model = null;
        
        System.err.println("Thread-" + mIndex + ", UsedTime-" + (System.currentTimeMillis() - t) + ", SetSize = " + mResultSet.size());
        t = 0;
    }

    @SuppressWarnings("unused")
    private void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

如果你看過我之前的部落格，那麼上面的這一段代碼相信你會比較熟悉。

這段代碼的功能是：生産者。從待通路隊列中消費一個model，然後調用Python生産連結的清單Queue，并将生成的清單Queue offer到結果SpiderSet中。

網絡爬蟲：URL去重政策之布隆過濾器(BloomFilter)的使用前言：關于BloomFilter：以前的去重政策：BloomFilter的使用：

前言：

關于BloomFilter：

以前的去重政策：

1.想到過的URL去重政策

2.以上去重政策存在的問題

BloomFilter的使用：

1.一般情況下BloomFilter使用記憶體的情況：

2.爬蟲程式中BloomFilter使用記憶體的情況（已運作4小時）：

3.程式結構圖

4.BloomFilter的一般使用

5.BloomFilter在爬蟲中過濾重複的URL

繼續閱讀

關鍵詞權重的量化方法TF/IDF

seo技巧篇

最新資料：微信及WeChat月活12.5億多，2021年第二季遊戲收入430億元

ubuntu設定全攻略

超詳細robots.txt寫法大全和禁止目錄收錄及指定頁面

正确了解和判斷PR劫持的方法（站長必看）

celery分布式

網絡蜘蛛Spider的邏輯Logic（二）

傳說中比google和百度牛的十佳搜尋類網站

2023爬蟲學習筆記 -- 多線程操作

百度、新浪、Mixi、Apache社群贊助的開源key-value分布式存儲系統[轉載]

Boss直聘Python爬蟲實戰

門戶通專訪月光部落格：第一部落格是如何打造成的

門戶通專訪草根站長九天狼：做站貴在堅持

專家訪談：搜尋開源力量：Lucene技術前景

30天了解30種技術系列---(10)面向Cloud的搜尋引擎 ElasticSearch