HtmlParser Basics Tutorial

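Some of the content on this blog is reposted from the web for study purposes. If there is any copyright issue, please leave a comment. Thank you!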

1. Related resources

Other HTML parsers: jsoup and others. Because HtmlParser has not been updated since 2006, many people now recommend using jsoup instead.

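2. Key steps for using HtmlParser

(1) Create a parser through the Parser class

(2) Create a filter or a visitor

(3) Use the parser with the filter or visitor to obtain all nodes that match

(4) Process the content of the nodes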

3. Creating a parser with the Parser constructors

Parser()

Parser(Lexer lexer)

Parser(Lexer lexer, ParserFeedback fb)

Parser(String resource)

Parser(String resource, ParserFeedback feedback)

Parser(URLConnection connection)

Parser(URLConnection connection, ParserFeedback fb)

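4. HtmlParser uses Node objects to store the information of each node

(1) Methods for navigating between nodes

Node getParent(): returns the parent node

NodeList getChildren(): returns the list of child nodes

Node getFirstChild(): returns the first child node

Node getLastChild(): returns the last child node

Node getPreviousSibling(): returns the previous sibling node

Node getNextSibling(): returns the next sibling node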

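(2) Methods for retrieving a Node's content

String getText(): returns the node's text

String toPlainTextString(): returns the plain-text content

String toHtml(): returns the HTML (the raw HTML)

String toHtml(boolean verbatim): returns the HTML (the raw HTML)

String toString(): returns a string representation of the node (intended mainly for debugging, so not necessarily the complete raw HTML)

Page getPage(): returns the Page object this Node belongs to

int getStartPosition(): returns the start position of this Node in the HTML page

int getEndPosition(): returns the end position of this Node in the HTML page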

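5. Using filters to access nodes and their content

(1) Types of filters

As the name suggests, a filter narrows the results down to the content you need.

Every filter implements the NodeFilter interface. This interface has a single method, boolean accept(Node node), which decides whether a given node falls within the scope of the filter.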
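To make the single-method contract concrete, here is a library-free sketch. SketchNode, SketchFilter and collect are hypothetical stand-ins invented for this illustration, not HtmlParser classes, and the real library's traversal may differ in detail. The sketch only shows the idea: the filter gives one verdict per node, and the caller walks the tree and keeps the accepted nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node: a tag name plus child nodes.
final class SketchNode {
    final String tagName;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String tagName, SketchNode... kids) {
        this.tagName = tagName;
        for (SketchNode kid : kids) {
            children.add(kid);
        }
    }
}

// Same shape as the contract described above: one method, one verdict per node.
interface SketchFilter {
    boolean accept(SketchNode node);
}

final class FilterContractSketch {
    // Walk the tree depth-first and keep every node the filter accepts.
    static List<SketchNode> collect(SketchNode root, SketchFilter filter) {
        List<SketchNode> matches = new ArrayList<>();
        walk(root, filter, matches);
        return matches;
    }

    private static void walk(SketchNode node, SketchFilter filter, List<SketchNode> out) {
        if (filter.accept(node)) {
            out.add(node);
        }
        for (SketchNode child : node.children) {
            walk(child, filter, out);
        }
    }

    public static void main(String[] args) {
        SketchNode page = new SketchNode("html",
                new SketchNode("body",
                        new SketchNode("a"),
                        new SketchNode("div", new SketchNode("a"))));
        // A filter that accepts only <a> nodes, written as a lambda.
        SketchFilter links = node -> node.tagName.equals("a");
        System.out.println(collect(page, links).size()); // prints 2
    }
}
```

Because the interface has only one method, a filter can be written as a lambda or as a small named class, whichever reads better.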

HtmlParser defines 16 different filters in the org.htmlparser.filters package. They can be grouped into several categories.

Predicate filters:

TagNameFilter

HasAttributeFilter

HasChildFilter

HasParentFilter

HasSiblingFilter

IsEqualFilter

Logical filters:

AndFilter

NotFilter

OrFilter

XorFilter

Other filters:

NodeClassFilter

StringFilter

LinkStringFilter

LinkRegexFilter

RegexFilter

CssSelectorNodeFilter

Beyond these, you can define your own filters to handle special filtering requirements.
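The way logical filters combine simpler ones can be sketched without the library, using java.util.function.Predicate. MiniTag, tagName, hasHref and xor below are hypothetical names made up for this sketch. They only mirror the roles of filters such as TagNameFilter, HasAttributeFilter and XorFilter, and they are not the HtmlParser API.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for a tag node: a name and an optional "href" attribute.
final class MiniTag {
    final String name;
    final String href; // null when the attribute is absent

    MiniTag(String name, String href) {
        this.name = name;
        this.href = href;
    }
}

final class FilterCombinators {
    // Custom filters written as plain predicates.
    static Predicate<MiniTag> tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    static Predicate<MiniTag> hasHref() {
        return tag -> tag.href != null;
    }

    // Predicate already offers and/or/negate; xor has to be written by hand.
    static <T> Predicate<T> xor(Predicate<T> left, Predicate<T> right) {
        return value -> left.test(value) ^ right.test(value);
    }

    public static void main(String[] args) {
        MiniTag link = new MiniTag("a", "http://www.baidu.com/");
        MiniTag anchorOnly = new MiniTag("a", null);
        MiniTag frame = new MiniTag("frame", null);

        Predicate<MiniTag> realLink = tagName("a").and(hasHref());
        Predicate<MiniTag> linkOrFrame = tagName("a").or(tagName("frame"));
        Predicate<MiniTag> notLink = tagName("a").negate();

        System.out.println(realLink.test(link));        // true
        System.out.println(realLink.test(anchorOnly));  // false
        System.out.println(linkOrFrame.test(frame));    // true
        System.out.println(notLink.test(frame));        // true
        System.out.println(xor(tagName("a"), hasHref()).test(link)); // false
    }
}
```

Keeping each basic filter small and combining them afterwards tends to be easier to read than writing one large custom filter.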

(2) Filter usage example

The following example extracts the links from an HTML file.

package org.ljh.search.html;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Utility class for parsing HTML files
public class HtmlParserTool {

}

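Some notes on the program:

(1) Node#getText() returns the node's text as a String.

org.htmlparser.nodes: The nodes package has the concrete node implementations.

org.htmlparser.tags: The tags package contains specific tags. This makes it possible to check directly whether a node is a particular kind of tag.

The LinkFilter interface used in the program is defined as follows:

// The filter defined by this interface decides whether a URL falls within the scope of the current search.

public interface LinkFilter {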
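One plausible shape for such an interface is sketched below. The method name accept, its String parameter, and the helper keepInScope are assumptions made for illustration; the source only states that the filter decides whether a URL is in scope.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Assumed shape: a single method that gives a yes/no verdict for one URL.
interface LinkFilter {
    boolean accept(String url);
}

final class LinkFilterSketch {
    // Keep only the URLs the filter accepts, dropping duplicates and
    // preserving first-seen order.
    static Set<String> keepInScope(List<String> urls, LinkFilter filter) {
        Set<String> kept = new LinkedHashSet<>();
        for (String url : urls) {
            if (filter.accept(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Only URLs on www.baidu.com count as "in scope" in this example.
        LinkFilter baiduOnly = url -> url.startsWith("http://www.baidu.com");
        List<String> found = Arrays.asList(
                "http://www.baidu.com/duty/",
                "http://news.baidu.com",
                "http://www.baidu.com/more/");
        System.out.println(keepInScope(found, baiduOnly));
        // prints [http://www.baidu.com/duty/, http://www.baidu.com/more/]
    }
}
```

Passing the scope rule in as a filter keeps the link-extraction code independent of any particular site.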

The test program is as follows:

import java.util.Iterator;

import org.junit.Test;

public class HtmlParserToolTest {

The output is as follows:

http://www.baidu.com/
http://www.baidu.com/duty/
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
http://music.baidu.com
http://ir.baidu.com
http://www.baidu.com/gaoji/preferences.html
http://news.baidu.com
http://map.baidu.com
http://music.baidu.com/search?fr=ps&key=
http://image.baidu.com
http://zhidao.baidu.com
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=
http://www.baidu.com/more/
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w
http://wenku.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f
http://www.baidu.com/cache/sethelp/index.html
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
http://tieba.baidu.com/f?kw=&fr=wwwt
http://home.baidu.com
https://passport.baidu.com/v2/?reg&regtype=1&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f
http://v.baidu.com
http://e.baidu.com/?refer=888
http://tieba.baidu.com
http://baike.baidu.com
http://wenku.baidu.com/search?word=&lm=0&od=0
http://top.baidu.com
http://map.baidu.com/m?word=&fr=ps01000

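A Node in HtmlParser does not appear to offer a direct way to get the text inside a tag. For example, when a tag contains the text 123, the 123 cannot be retrieved directly; the node can only be converted to a string and the text cut out of that string. This is what I have found so far.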
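The string-slicing workaround described above could look roughly like this. It is a minimal sketch using only the standard library: the class InnerTextSketch, the method innerText, and the sample markup are hypothetical, and the input string is assumed to come from the node (for example via toHtml()). The toPlainTextString() method listed in section 4 may be worth trying first.

```java
final class InnerTextSketch {
    // Returns the text between the end of the opening tag and the start of the
    // last closing tag, or an empty string if the input does not look like
    // <x>...</x>. Any nested markup is returned unchanged.
    static String innerText(String html) {
        int open = html.indexOf('>');
        int close = html.lastIndexOf("</");
        if (open < 0 || close <= open) {
            return "";
        }
        return html.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a node converted to a string.
        System.out.println(innerText("<span class=\"n\">123</span>")); // prints 123
    }
}
```

This approach is fragile for nested or malformed markup, so it is best kept to simple, flat tags.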
