HtmlParser Basics Tutorial


1. Related resources

Other HTML parsers: jsoup, among others. Because HtmlParser has not been updated since 2006, many people now recommend using jsoup instead.

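2. Key steps for using HtmlParser

(1) Create a parser through the Parser class.

(2) Create a filter or a visitor.

(3) Use the parser together with the filter or visitor to retrieve all nodes that match.

(4) Process the content of those nodes.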

3. Creating a parser with the Parser constructors

Parser()

Parser(Lexer lexer)

Parser(Lexer lexer, ParserFeedback fb)

Parser(String resource)

Parser(String resource, ParserFeedback feedback)

Parser(URLConnection connection)

Parser(URLConnection connection, ParserFeedback fb)
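
The URLConnection overloads take a connection that the caller has already created, which leaves room to configure it first. The following is only a sketch of that preparation step using the JDK alone; the class name ConnectionSketch, the helper prepare, the timeout value, and the User-Agent string are all assumptions made for illustration. The hand-off to HtmlParser appears only as a comment so that the block compiles without the library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class; not part of HtmlParser.
class ConnectionSketch {

    // Creates a connection object and applies timeouts and a User-Agent header.
    static URLConnection prepare(String url, int timeoutMillis) throws IOException {
        // openConnection() only builds the object; no network traffic happens yet.
        URLConnection connection = URI.create(url).toURL().openConnection();
        connection.setConnectTimeout(timeoutMillis);
        connection.setReadTimeout(timeoutMillis);
        connection.setRequestProperty("User-Agent", "Mozilla/5.0"); // assumed value
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = prepare("http://www.baidu.com/", 5000);
        // With HtmlParser on the classpath, the next step would be: new Parser(connection)
        System.out.println(connection.getURL());
    }
}
```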

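4. HtmlParser stores the information for each node in Node objects

(1) Methods for navigating between nodes

Node getParent(): returns the parent node

NodeList getChildren(): returns the list of child nodes

Node getFirstChild(): returns the first child node

Node getLastChild(): returns the last child node

Node getPreviousSibling(): returns the previous sibling node

Node getNextSibling(): returns the next sibling node

(2) Methods for retrieving a node's content

String getText(): returns the text of the node; for a tag, this is the text between the angle brackets

String toPlainTextString(): returns the plain-text content

String toHtml(): returns the HTML (the original markup)

String toHtml(boolean verbatim): returns the HTML (the original markup); when verbatim is true, the result stays as close to the original page text as possible

String toString(): returns a string representation of the node

Page getPage(): returns the Page object this node belongs to

int getStartPosition(): returns the node's start position in the HTML page

int getEndPosition(): returns the node's end position in the HTML page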

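5. Using filters to access nodes and their content

(1) Types of filter

As the name suggests, a filter screens the results so that only the content you need is kept.

Every filter implements the NodeFilter interface, which has a single method, boolean accept(Node node), that decides whether a given node falls within the filter's scope.

HtmlParser defines 16 filters in the org.htmlparser.filters package, and they fall into a few groups.

Predicate filters:

TagNameFilter

HasAttributeFilter

HasChildFilter

HasParentFilter

HasSiblingFilter

IsEqualFilter

Logical filters:

AndFilter

NotFilter

OrFilter

XorFilter

Other filters:

NodeClassFilter

StringFilter

LinkStringFilter

LinkRegexFilter

RegexFilter

CssSelectorNodeFilter

Beyond these, you can define your own filters to handle special filtering requirements.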
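
To make the idea of a single accept method and of combining filters more concrete, here is a rough, self-contained sketch. It does not use HtmlParser at all: SimpleNode, SimpleFilter, and FilterSketch are hypothetical stand-ins that only mirror the shape of NodeFilter and of the logical filters listed above, so that the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an HTML node: a tag name plus an optional href value.
class SimpleNode {
    final String tagName;
    final String href; // null when the attribute is absent

    SimpleNode(String tagName, String href) {
        this.tagName = tagName;
        this.href = href;
    }
}

// Same shape as NodeFilter: one accept method.
interface SimpleFilter {
    boolean accept(SimpleNode node);
}

class FilterSketch {
    // Accepts nodes whose tag name matches, ignoring case.
    static SimpleFilter tagName(String name) {
        return node -> node.tagName.equalsIgnoreCase(name);
    }

    // Accepts nodes that carry an href value.
    static SimpleFilter hasHref() {
        return node -> node.href != null;
    }

    // Logical combinators, analogous in spirit to the and/or/not filters.
    static SimpleFilter and(SimpleFilter a, SimpleFilter b) {
        return node -> a.accept(node) && b.accept(node);
    }

    static SimpleFilter or(SimpleFilter a, SimpleFilter b) {
        return node -> a.accept(node) || b.accept(node);
    }

    static SimpleFilter not(SimpleFilter a) {
        return node -> !a.accept(node);
    }

    // Keeps only the nodes the filter accepts, preserving order.
    static List<SimpleNode> extract(List<SimpleNode> nodes, SimpleFilter filter) {
        List<SimpleNode> kept = new ArrayList<>();
        for (SimpleNode node : nodes) {
            if (filter.accept(node)) {
                kept.add(node);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleNode> nodes = Arrays.asList(
                new SimpleNode("a", "http://www.baidu.com/"),
                new SimpleNode("a", null),
                new SimpleNode("frame", null));
        SimpleFilter linkWithHref = and(tagName("a"), hasHref());
        System.out.println(extract(nodes, linkWithHref).size()); // prints 1
    }
}
```

A custom filter in this sketch is just another implementation of the one-method interface, which is the same reason a self-defined filter can be mixed freely with the built-in ones.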

(2) Filter usage example

The following example extracts the links from an HTML document.

package org.ljh.search.html;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Utility class for parsing HTML documents
public class HtmlParserTool {

}

Some notes on the program:

(1) Node#getText() returns the node's text as a String.

org.htmlparser.nodes: the nodes package has the concrete node implementations.

org.htmlparser.tags: the tags package contains specific tags. This makes it possible to check directly whether a node is a particular kind of tag.

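The LinkFilter interface used there is defined as follows:

// The filter defined by this interface decides whether a URL falls within the scope of the current search.
public interface LinkFilter {
    // method declarations are not shown in the source
}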
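
Since the interface body is not shown, the following is only a guess at how such a URL filter might look and be used. The accept(String url) signature, the LinkFilterSketch class, and the select helper are assumptions made for illustration; the sample URLs are taken from the output listed later in this post.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Assumed shape of the interface; the real method signature is not shown in the text.
interface LinkFilter {
    boolean accept(String url);
}

// Hypothetical helper class for the sketch.
class LinkFilterSketch {

    // Collects the URLs the filter accepts, dropping duplicates and keeping input order.
    static Set<String> select(List<String> urls, LinkFilter filter) {
        Set<String> kept = new LinkedHashSet<>();
        for (String url : urls) {
            if (filter.accept(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Keep only URLs that belong to the news site.
        LinkFilter newsOnly = url -> url.startsWith("http://news.baidu.com");
        List<String> urls = Arrays.asList(
                "http://news.baidu.com",
                "http://map.baidu.com",
                "http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=",
                "http://news.baidu.com");
        for (String url : select(urls, newsOnly)) {
            System.out.println(url);
        }
    }
}
```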

The test program is as follows:

import java.util.Iterator;

import org.junit.Test;

public class HtmlParserToolTest {
    // test methods are not shown in the source
}

The output is as follows:

http://www.baidu.com/
http://www.baidu.com/duty/
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
http://music.baidu.com
http://ir.baidu.com
http://www.baidu.com/gaoji/preferences.html
http://news.baidu.com
http://map.baidu.com
http://music.baidu.com/search?fr=ps&key=
http://image.baidu.com
http://zhidao.baidu.com
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=
http://www.baidu.com/more/
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w
http://wenku.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f
http://www.baidu.com/cache/sethelp/index.html
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
http://tieba.baidu.com/f?kw=&fr=wwwt
http://home.baidu.com
https://passport.baidu.com/v2/?reg&regtype=1&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f
http://v.baidu.com
http://e.baidu.com/?refer=888

http://tieba.baidu.com
http://baike.baidu.com
http://wenku.baidu.com/search?word=&lm=0&od=0
http://top.baidu.com
http://map.baidu.com/m?word=&fr=ps01000

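As far as I have found so far, a Node in HtmlParser does not directly give you the text enclosed by a tag. For an element whose content is 123, for example, the 123 cannot be retrieved as such; the node has to be converted to a string and the value cut out of it. The toPlainTextString() method listed in section 4 may be worth trying first, since it is meant to return a node's plain text.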
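
The string-slicing workaround described above might look roughly like this. It is a sketch under assumptions: the markup is a hypothetical span element (the original example element is not shown), the string is assumed to come from something like toHtml(), and InnerTextSketch and innerText are names invented here. It only handles a single element with no nested tags; with nested children it would return the inner markup rather than plain text.

```java
// Hypothetical helper class; not part of HtmlParser.
class InnerTextSketch {

    // Returns the text between the first '>' and the last '<',
    // or an empty string when the input has no such span.
    static String innerText(String html) {
        int start = html.indexOf('>');
        int end = html.lastIndexOf('<');
        if (start < 0 || end <= start) {
            return "";
        }
        return html.substring(start + 1, end);
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the node's HTML string.
        System.out.println(innerText("<span class=\"n\">123</span>")); // prints 123
    }
}
```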
