boilerpipe (Boilerplate Removal and Fulltext Extraction from HTML Pages): Source Code Analysis

Usage example:

// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you.
String text = ArticleExtractor.INSTANCE.getText(url);
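For completeness, here is that example expanded into a runnable sketch (assuming boilerpipe 1.x and its dependencies on the classpath; the URL is a placeholder):

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute a real news article page.
        URL url = new URL("http://example.com/news/article.html");
        // ArticleExtractor is a singleton; INSTANCE is the only instance.
        String text = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(text);
    }
}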

Let's start the analysis with ArticleExtractor. The class uses the Singleton design pattern: the single instance is obtained through INSTANCE. The actual processing goes through the following steps.

HTML Parser

The HTML parser is based upon CyberNeko 1.9.13. It is called internally from within the extractors.

The parser takes an HTML document and transforms it into a TextDocument, consisting of one or more TextBlocks. It knows about specific HTML elements (SCRIPT, OPTION, etc.) that are ignored automatically.

Each TextBlock stores a portion of text from the HTML document. Initially (after parsing) almost every TextBlock represents a text section from the HTML document, except for a few inline elements that do not separate by definition (for example, '<a>' anchor tags).

The TextBlock objects also store shallow text statistics for the block's content, such as the number of words and the number of words in anchor text.

Extractors

Extractors consist of one or more pipelined filters. They are used to get the content of a web page. Several different extractors exist, ranging from the generic DefaultExtractor to extractors specific to news article extraction (ArticleExtractor).

ArticleExtractor.process() contains this filter pipeline. The design is highly extensible: the whole process is broken into small steps, each implemented separately, and they are assembled like building blocks into a processing chain. To extend or change the process, you simply add or replace one of the blocks.

This also makes multi-language extension very convenient. For example, the filters used here come from the english package:

import de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter;
import de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter;

To extend it to another language such as Korean, you would just add a korean package under filters, implement the corresponding filters there, and change the imports; see the sketch below.
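As an illustration of the idea only: the korean package and KoreanTerminatingBlocksFinder below are hypothetical (they do not exist in boilerpipe), and the two-filter pipeline is deliberately abbreviated.

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ExtractorBase;
import de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion;
// Hypothetical package holding the Korean filter implementations.
import de.l3s.boilerpipe.filters.korean.KoreanTerminatingBlocksFinder;

public final class KoreanArticleExtractor extends ExtractorBase {
    public static final KoreanArticleExtractor INSTANCE = new KoreanArticleExtractor();

    @Override
    public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
        // Same building-block composition as ArticleExtractor,
        // with the English-specific filter swapped for a Korean one.
        return KoreanTerminatingBlocksFinder.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc);
    }
}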

TerminatingBlocksFinder.INSTANCE.process(doc)
        | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
        | NumWordsRulesClassifier.INSTANCE.process(doc)
        | IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
        | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
        | BoilerplateBlockFilter.INSTANCE.process(doc)
        | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
        | KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
        | ExpandTitleToContentFilter.INSTANCE.process(doc);

Let's now walk through each stage of the processing pipeline in detail.

TerminatingBlocksFinder

Finds blocks which are potentially indicating the end of an article text and marks them with {@link DefaultLabels#INDICATES_END_OF_TEXT}. This can be used in conjunction with a downstream {@link IgnoreBlocksAfterContentFilter}. (In other words, IgnoreBlocksAfterContentFilter must run as this filter's downstream.)

The principle is simple: for each block with tb.getNumWords() < 20, test whether it satisfies any of the following conditions:

text.startsWith("Comments")
        || N_COMMENTS.matcher(text).find() // N_COMMENTS = Pattern.compile("(?msi)^[0-9]+ (comments|users responded in)")
        || text.contains("What you think...")
        || text.contains("Add your comment")
        || text.contains("Add comment")
        || text.contains("Reader views")
        || text.contains("Have your say")
        || text.contains("Reader Comments")
        || text.equals("Thanks for your comments - this feedback is now closed")
        || text.startsWith("© Reuters")
        || text.startsWith("Please rate this")

If it does, the block is taken to mark the end of the article and is tagged with tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
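In context, the check runs once per block inside the filter's process() method, roughly as follows (a paraphrase, not a verbatim copy of the source):

public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
    boolean changes = false;
    for (TextBlock tb : doc.getTextBlocks()) {
        if (tb.getNumWords() < 20) {
            final String text = tb.getText().trim();
            if (text.startsWith("Comments") /* || ... the other conditions above ... */) {
                tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
                changes = true;
            }
        }
    }
    return changes;
}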

DocumentTitleMatchClassifier

This one is simple: it uses the content of '<title>' to locate and label the title inside the page. It builds a list of potential titles from the '<title>' content, then matches them against the blocks; a block that matches is labeled DefaultLabels.TITLE.
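Roughly, in code (a sketch; the exact separator characters and the length threshold are my assumptions, not the library's):

// Derive candidate titles from the <title> text; pages often append the
// site name after a separator such as " - " or " | ".
List<String> potentialTitles = new ArrayList<String>();
String title = doc.getTitle().trim();
potentialTitles.add(title);
for (String part : title.split("[|»:-]")) { // separator set is an assumption
    String candidate = part.trim();
    if (candidate.length() > 4) { // too-short fragments are unlikely headlines
        potentialTitles.add(candidate);
    }
}
// Label any block whose text matches one of the candidates.
for (TextBlock tb : doc.getTextBlocks()) {
    if (potentialTitles.contains(tb.getText().trim())) {
        tb.addLabel(DefaultLabels.TITLE);
    }
}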

NumWordsRulesClassifier

Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection Using Shallow Text Features" (WSDM 2010), particularly using the number of words per block and the link density per block.

This module implements a classifier that separates content from not-content; see Section 4.3 of the paper above for how the classifier was built.

The classifier uses the decision-tree algorithm, trained on annotated Google News pages; the trained decision tree is then pruned: "applying reduced-error pruning we were able to simplify the decision tree to only use 6 dimensions (2 features each for current, previous and next block) without a significant loss in accuracy."

Finally, the decision process of the tree is written out as pseudocode. That is the biggest advantage of decision trees: the decision rules are human-readable, so they can be expressed in any programming language.

This module implements Algorithm 2 from the paper, the classifier based on number of words:

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

With the classifier in place, what remains is to classify and label every block.
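Transcribed into plain Java, the rule set reads as follows (a sketch; returns true for content):

// Decision rules of Algorithm 2, using numWords and linkDensity of the
// previous, current and next block (6 features in total).
static boolean isContent(TextBlock prev, TextBlock curr, TextBlock next) {
    if (curr.getLinkDensity() > 0.333333) {
        return false; // boilerplate
    }
    if (prev.getLinkDensity() <= 0.555556) {
        if (curr.getNumWords() <= 16) {
            if (next.getNumWords() <= 15) {
                return prev.getNumWords() > 4;
            }
            return true;
        }
        return true;
    }
    if (curr.getNumWords() <= 40) {
        return next.getNumWords() > 17;
    }
    return true;
}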

IgnoreBlocksAfterContentFilter

Marks all blocks as "non-content" that occur after blocks that have been marked {@link DefaultLabels#INDICATES_END_OF_TEXT}. These marks are ignored unless a minimum number of words in content blocks occurs before this mark (default: 60). This can be used in conjunction with an upstream {@link TerminatingBlocksFinder}.

This module is the downstream of TerminatingBlocksFinder, meaning it must run after it. Very simple: find DefaultLabels#INDICATES_END_OF_TEXT and mark everything after it as boilerplate.

The exception: if the body text so far is shorter than the minimum number of words (default: 60), the mark is ignored and the filter keeps taking text to make up the count.
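A sketch of that logic, assuming the default threshold of 60 (the real filter counts "full-text words", a refinement omitted here):

public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
    boolean changes = false;
    int numWords = 0;
    boolean foundEndOfText = false;
    for (TextBlock block : doc.getTextBlocks()) {
        if (block.isContent()) {
            numWords += block.getNumWords();
        }
        // The end-of-text mark only takes effect once at least 60 words
        // of content have been seen before it.
        if (numWords >= 60 && block.hasLabel(DefaultLabels.INDICATES_END_OF_TEXT)) {
            foundEndOfText = true;
        }
        if (foundEndOfText) {
            block.setIsContent(false); // everything after the mark is boilerplate
            changes = true;
        }
    }
    return changes;
}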

BlockProximityFusion

Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter has already removed some blocks.

This module merges blocks. The merge criterion is that the offset difference between two blocks is at most 2, i.e., at most one block lies between them.

When contentOnly is requested, both blocks must be labeled as content before they are fused.

int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
if (diffBlocks <= maxBlocksDistance)

So where does a block's offset come from? Check the code where blocks are constructed:

BoilerpipeHTMLContentHandler.flushBlock():

TextBlock tb = new TextBlock(textBuffer.toString().trim(), currentContainedTextElements,
        numWords, numLinkedWords, numWordsInWrappedLines, numWrappedLines, offsetBlocks);
offsetBlocks++;

TextBlock constructor:

this.offsetBlocksStart = offsetBlocks;
this.offsetBlocksEnd = offsetBlocks;

So initially block offsets simply increment, and as long as no fusion has happened, offsetBlocksStart and offsetBlocksEnd are equal.

Therefore, as the Javadoc says, the merge criterion of this module is only meaningful after an upstream filter has removed some blocks; with nothing removed, every pair of adjacent blocks satisfies the fusion condition.
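For reference, a simplified sketch of the fusion loop; maxBlocksDistance and contentOnly stand for the constructor parameters, and the exact content checks in the version discussed here differ slightly, which is what the following paragraphs complain about:

TextBlock prevBlock = null;
for (Iterator<TextBlock> it = doc.getTextBlocks().iterator(); it.hasNext();) {
    TextBlock block = it.next();
    if (prevBlock == null) {
        prevBlock = block;
        continue;
    }
    int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
    if (diffBlocks <= maxBlocksDistance
            && (!contentOnly || (prevBlock.isContent() && block.isContent()))) {
        prevBlock.mergeNext(block); // merges text, stats and the offset range
        it.remove();                // the fused block disappears from the list
    } else {
        prevBlock = block;
    }
}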

After reading this code I was puzzled: in the paper, fusion is based on text density, but here it is based only on block offsets, which is a weaker criterion:

"there, adjacent text fragments of similar text density (interpreted as "similar class") are iteratively fused until the blocks' densities (and therefore the text classes) are distinctive enough."

What I understand even less is how ArticleExtractor uses this module:

BlockProximityFusion.MAX_DISTANCE_1.process(doc)

BlockProximityFusion is invoked twice, once upstream and once downstream of BoilerplateBlockFilter (explained in the next section). The call to BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc) I can understand: after the non-content blocks have been removed, the remaining blocks get fused, e.g., two blocks that used to be separated by an advertisement. Still, fusing by offset rather than by text density strikes me as a weakened version of the paper.

But as for the call to BlockProximityFusion.MAX_DISTANCE_1.process(doc), maybe I am missing something, but I really cannot understand why this step is there. The only explanation is that it is meant to fuse some blocks not labeled as content into the content. What is strange is that the fusion here is effectively unconditional (when no blocks have been removed, the offset test filters nothing): as long as the current block is content, it is fused with the previous one. And why test only the current block? Shouldn't prevBlock also have to be content for the fusion? Personally, I find this logic completely unreasonable...

BoilerplateBlockFilter

Removes {@link TextBlock}s which have explicitly been marked as "not content".

Nothing much to say: iterate over every block and remove those not marked as "content".

KeepLargestFulltextBlockFilter

Keeps the largest {@link TextBlock} only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as {@link DefaultLabels#MIGHT_BE_CONTENT}.

Easy to understand: the largest text block is kept as the body text, and the others are marked DefaultLabels#MIGHT_BE_CONTENT.
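The gist in code (a sketch):

// Find the content block with the most words; ties go to the first one
// because the comparison is strictly greater-than.
TextBlock largestBlock = null;
for (TextBlock tb : doc.getTextBlocks()) {
    if (tb.isContent()
            && (largestBlock == null || tb.getNumWords() > largestBlock.getNumWords())) {
        largestBlock = tb;
    }
}
// Demote every other content block.
for (TextBlock tb : doc.getTextBlocks()) {
    if (tb != largestBlock && tb.isContent()) {
        tb.setIsContent(false);
        tb.addLabel(DefaultLabels.MIGHT_BE_CONTENT);
    }
}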

ExpandTitleToContentFilter

Marks all {@link TextBlock}s "content" which are between the headline and the part that has already been marked content, if they are marked {@link DefaultLabels#MIGHT_BE_CONTENT}. This filter is quite specific to the news domain.

The logic: find the block labeled DefaultLabels.TITLE and the block where the content starts, then relabel every MIGHT_BE_CONTENT block between those two as content.
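In code, roughly (a sketch):

List<TextBlock> blocks = doc.getTextBlocks();
int titleIndex = -1;
int contentStart = -1;
for (int i = 0; i < blocks.size(); i++) {
    if (titleIndex == -1 && blocks.get(i).hasLabel(DefaultLabels.TITLE)) {
        titleIndex = i;
    }
    if (contentStart == -1 && blocks.get(i).isContent()) {
        contentStart = i;
    }
}
// Promote the MIGHT_BE_CONTENT blocks between the headline and the content.
if (titleIndex != -1 && contentStart > titleIndex) {
    for (int i = titleIndex + 1; i < contentStart; i++) {
        TextBlock tb = blocks.get(i);
        if (tb.hasLabel(DefaultLabels.MIGHT_BE_CONTENT)) {
            tb.setIsContent(true);
        }
    }
}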

TextDocument.getContent()

The last step is to output the extracted content as text: iterate over every block labeled as content, append its text, and emit the result.
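In effect (getContent() delegates to getText(true, false), i.e., content blocks only; a sketch):

StringBuilder sb = new StringBuilder();
for (TextBlock block : doc.getTextBlocks()) {
    if (block.isContent()) {          // skip everything labeled boilerplate
        sb.append(block.getText());
        sb.append('\n');
    }
}
String content = sb.toString();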

DefaultExtractor

Besides ArticleExtractor (aimed at news), let's also look at the commonly used DefaultExtractor:

SimpleBlockFusionProcessor.INSTANCE.process(doc)
        | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
        | DensityRulesClassifier.INSTANCE.process(doc);

It is comparatively simple, just three steps. The second step is odd: no upstream filter here ever marks anything as content, so that step does nothing.

SimpleBlockFusionProcessor

Merges two subsequent blocks if their text densities are equal.

Iterate over every block; merge two adjacent blocks if their text densities are equal.
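A sketch, using the same iterator pattern as the BlockProximityFusion loop above but with equal text density as the criterion:

TextBlock prevBlock = null;
for (Iterator<TextBlock> it = doc.getTextBlocks().iterator(); it.hasNext();) {
    TextBlock block = it.next();
    if (prevBlock != null && prevBlock.getTextDensity() == block.getTextDensity()) {
        prevBlock.mergeNext(block); // equal densities: fuse into the previous block
        it.remove();
    } else {
        prevBlock = block;
    }
}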

DensityRulesClassifier

Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection Using Shallow Text Features", particularly using text densities and link densities.

Analogous to NumWordsRulesClassifier, this implements Algorithm 1 from the paper, the densitometric classifier:

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_textDensity <= 9
| | | next_textDensity <= 10
| | | | prev_textDensity <= 4: BOILERPLATE
| | | | prev_textDensity > 4: CONTENT
| | | next_textDensity > 10: CONTENT
| | curr_textDensity > 9
| | | next_textDensity = 0: BOILERPLATE
| | | next_textDensity > 0: CONTENT
| prev_linkDensity > 0.555556
| | next_textDensity <= 11: BOILERPLATE
| | next_textDensity > 11: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE
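And transcribed into Java, mirroring the NumWordsRulesClassifier sketch above with textDensity in place of numWords:

static boolean isContent(TextBlock prev, TextBlock curr, TextBlock next) {
    if (curr.getLinkDensity() > 0.333333) {
        return false; // boilerplate
    }
    if (prev.getLinkDensity() <= 0.555556) {
        if (curr.getTextDensity() <= 9) {
            if (next.getTextDensity() <= 10) {
                return prev.getTextDensity() > 4;
            }
            return true;
        }
        return next.getTextDensity() > 0; // zero next density means boilerplate
    }
    return next.getTextDensity() > 11;
}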

If you are interested, you can study the other extractors, or design one that suits your own needs.

This article was reposted from cnblogs; original publication date: 2011-07-05.
