A First Look at Jericho HTML Parser

Author: SharpStill

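As the most popular of the newer HTML parsing frameworks on SourceForge, Jericho has good reasons for its strength. However, few people in China use it at present, so Chinese tutorials and reference material are scarce online, which makes it hard to learn. So let's look at what the framework offers from a beginner's point of view, using measures such as learning complexity and code volume. It can also be used to build an open-source crawler engine.

In this example we use a shopping site, Taobao, as the parsing target.

Taobao's pages fall into listing pages such as http://list.taobao.com and http://www.taobao.com/go/chn/game (similar to an album) and item pages under http://item.taobao.com (similar to a video), along with many more pages of these kinds. We use Jericho HTML Parser as the page-parsing framework to see how powerful it is.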

The XML for this page-parsing framework is written as follows:

The core class of Jericho HTML Parser is Source. A Source object represents an HTML document, and it can be constructed either from a URL or from a String.

"In certain circumstances you may be able to improve performance by calling the fullSequentialParse() method before calling any tag search methods. See the documentation of the fullSequentialParse() method for details."

This sentence from the documentation says that, in certain circumstances, calling fullSequentialParse() can speed up parsing. The method's own documentation adds: "Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed."

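So if a class parses most or all of the tags in a document, as when we extract every URL or image link from a page, calling this method speeds up extraction. One point deserves attention: the call only helps when it comes immediately after the Source object is constructed. The Tag Search Methods (described in detail in the documentation) are then called right after it.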
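The call order described above can be sketched as follows. This is a minimal illustration rather than code from the original post: the class name LinkExtractor, the method extractLinks, and the sample HTML are hypothetical, and the sketch assumes the Jericho jar is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

// Hypothetical helper, not part of the original post.
public class LinkExtractor {

    /** Returns the href of every <a> element in the given HTML, in document order. */
    public static List<String> extractLinks(String html) {
        Source source = new Source(html);
        // Parse all tags once, before the first tag search method is called.
        source.fullSequentialParse();

        List<String> links = new ArrayList<String>();
        for (Element anchor : source.getAllElements(HTMLElementName.A)) {
            // getAttributeValue returns null when the attribute is absent.
            String href = anchor.getAttributeValue("href");
            if (href != null) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: one anchor with an href, one without.
        String html = "<p><a href=\"http://list.taobao.com\">list</a> <a name=\"top\">top</a></p>";
        System.out.println(extractLinks(html));
    }
}
```

Since every anchor in the page is visited, this is the kind of workload for which the documentation suggests fullSequentialParse(); for a single lookup by id the extra full parse is probably not worth it.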

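We take a Taobao item page as the example.

The page contains the following fields: price, shipping information, region, favorites count, and item type. Let's extract them from this page and see how much programming effort Jericho saves.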

package com.test.html;

import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import com.test.html.bean.ShoppingDetail;

public class HtmlParseTest {

	// Extracts price, region, shipping info, item type and favorites count from an item page.
	public static ShoppingDetail extract(String inputHtml){
		Source source = new Source(inputHtml);
		// The bid form carries price, region and shipping info as <input> values.
		Element form  = source.getElementById("J_FrmBid");
		List<Element> inputArea = form.getAllElements("input");
		String price ="";
		String area ="";
		String transportInfo="";

		for(Element input : inputArea){
			// getAttributeValue returns null for inputs without a name,
			// so compare from the constant side to avoid a NullPointerException.
			String name = input.getAttributeValue("name");
			if("buy_now".equals(name))price = input.getAttributeValue("value");
			if("region".equals(name))area =  input.getAttributeValue("value");
			if("who_pay_ship".equals(name))transportInfo =  input.getAttributeValue("value");
		}
		// Item type and favorites count only appear as plain text in this block.
		Element others  = source.getAllElementsByClass("other clearfix").get(0);
		String otherInfo = others.getContent().getTextExtractor().toString().trim();
		// The string literals below are labels on the page itself and must match its text.
		int startBabyType =otherInfo.indexOf("寶貝類型：");
		int endBabyType=  otherInfo.indexOf("收藏人氣：");
		// +5 skips the five-character label.
		String babyType = otherInfo.substring(startBabyType+5,endBabyType);
		int endStore = otherInfo.indexOf("類似收藏");
		String storeCount = otherInfo.substring(endBabyType+5,endStore-1).trim();
		ShoppingDetail detail = new ShoppingDetail();
		detail.setArea(area);
		detail.setBabyType(babyType);
		detail.setPrice(price);
		detail.setStroreCount(Integer.parseInt(storeCount));
		detail.setTransportInfo(transportInfo);
		return detail;
	}

	public static void main(String[] args) throws Exception {
		// HttpClientUtils is the author's own HTTP helper; its source is not shown in this post.
		String content = HttpClientUtils.getContent("http://item.taobao.com/item.htm?id=3144581940", "UTF-8");
		ShoppingDetail detail = HtmlParseTest.extract(content);
		System.out.println(detail);
	}
}

The output is shown below (the field values are the page's own Chinese text):

[email protected][title=索尼 PSP 4.3寸搖杆遊戲機 8G mp5高清電影+TV輸出+收音,price=235.00,transportInfo=賣家承擔運費,area=廣東深圳,stroreCount=17711,babyType=全新 ]

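The core code for extracting this page is just the one function above. In terms of programming complexity it is considerably simpler than HtmlParser, and the cumbersome Visitor pattern is gone.

For comparison, let's look at how much code HtmlParser needs to parse the same page: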
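The extract function above slices the extracted text with hard-coded offsets (+5 for the label length, -1 before the end label), which breaks if a label changes or is missing. One possible alternative is a small helper that derives the offset from the label itself and returns null when a label is not found. The sketch below uses only the JDK; LabelSlicer and between are hypothetical names, and the sample text uses ASCII stand-ins for the page's Chinese labels.

```java
// Hypothetical helper, not part of the original post.
public class LabelSlicer {

    /**
     * Returns the trimmed text between startLabel and endLabel,
     * or null when either label is missing or they appear out of order.
     */
    public static String between(String text, String startLabel, String endLabel) {
        int start = text.indexOf(startLabel);
        if (start < 0) {
            return null;
        }
        // Derive the offset from the label instead of hard-coding its length.
        int valueStart = start + startLabel.length();
        int end = text.indexOf(endLabel, valueStart);
        if (end < 0) {
            return null;
        }
        return text.substring(valueStart, end).trim();
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from the "other clearfix" block.
        String otherInfo = "Type: new Favorites: 17711 Similar";
        System.out.println(between(otherInfo, "Type:", "Favorites:"));   // prints "new"
        System.out.println(between(otherInfo, "Favorites:", "Similar")); // prints "17711"
    }
}
```

With such a helper, a caller can check for null before calling Integer.parseInt, instead of failing with a StringIndexOutOfBoundsException when the page layout changes.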

@Override
	public boolean execute(Context context) throws Exception {
		// Remove this once Spring IoC can be integrated with Commons Chain
		parserTool = new SingleHtmlParserTool();
		filterUtils = new FilterUtils();
		filterUtils.setTool(new PropertiesTool("/Config/Properties/TaobaoFilter.properties"));
		
		// Remove this section once Spring's wiring mechanism has been worked out
		
		String url = ((String)context.get("url")).toLowerCase();
		
		String seller_Url = "";
		String seller_Nickname = "";
		String seller_Taobao_Id = "";
		
		
		if(url.contains(Url_Pattern)){
			System.out.println("Start extracting data from "+url);
			NodeList forms = parserTool.getContainerInnerTags(url, formFilter);
			Map<String,String> valueMap = new HashMap<String,String>();//key-value map to store
			for(int i = 0;i<forms.size();i++){
				FormTag form = (FormTag)forms.elementAt(i);
				NodeList inputs = form.getFormInputs();
				for(int j = 0;j<inputs.size();j++){
					InputTag input = (InputTag)inputs.elementAt(j);
					String inputName = input.getAttribute("name");
					String inputValue = input.getAttribute("value");
					valueMap.put(inputName, inputValue);
				}
				
				// Values needed in the Context when entering the seller-analysis step
				seller_Taobao_Id = valueMap.get("seller_id");
				seller_Nickname = valueMap.get("seller_nickname");
				seller_Url = "http://rate.taobao.com/user-rate-"+seller_Taobao_Id+".htm";
				// End of values for the seller-analysis step
				
				filterUtils.filterMap(valueMap, filterStoreItemKey,
						merchandiseItemSeparatorKey,
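						DictItem.MerchandiseItemSiteToEngine);// Filter out unrelated inputs (<input name="noise" />) so that only the map to be stored in the database is returned
				valueMap.put("TaobaoSite_Url", url);// Finally add the URL, which cannot be extracted from the page content

				// Store in the database
				IBasicSqlDao dao = BeanMapDaoUsingCommonsBeanutils.getInstance();
				dao.storeMapToDb(valueMap, sqlId, MerchandiseItem.class);
				// End of database storage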
				for(Map.Entry<String, String> entry : valueMap.entrySet()){
					System.out.println(entry.getKey()+":"+entry.getValue());
				}
				System.out.println("---Finished extracting page "+url+"---");
				// Can be removed when refactoring
				valueMap.clear();// Clear the map
			}
			Context sellerContext = new UrlContext(seller_Url);
			sellerContext.put("Seller_Nickname", seller_Nickname);
			sellerContext.put("Seller_Taobao_Id", seller_Taobao_Id);
			EmergencyLinkUrl.addUnVisitUrl(sellerContext, this);
			
			
			return true;
		}
		return false;
	}

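It is clear that, in terms of programming style, Jericho has these major advantages:

No inner classes are needed. Because HtmlParser relies on filters or the Visitor pattern, inner classes are unavoidable there.
It uses generics, which simplifies the code. HtmlParser's architecture does not use generics, probably because it was created earlier, but it makes heavy use of design patterns, which shows its author's deep command of them.
It can extract a page's text directly, with the HTML tags stripped. This is very commonly needed for full-text search.
The Tag Search Methods work much like XPath extraction and can retrieve Element objects nested any number of levels deep.

The advantages observed so far give us reason to believe that Jericho will become a leader in HTML parsing. Here is a good article on Jericho HTML Parser that I recommend: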

http://www.ehelper.com.cn/blog/post/httpclient-jericho.html

After looking into this further, I found that the jsoup HTML parser is better than HtmlParser, and that HtmlUnit is easier to use than HttpClient.
