Introduction to jsoup: Scraping Web Page Data with Java

Please credit the source when reposting: http://blog.csdn.net/allen315410/article/details/40115479

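Overview

jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations. Its main features are:

1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license and can safely be used in commercial projects.

For more about jsoup, visit the official site: http://jsoup.org/
jar download: http://jsoup.org/download
API documentation: http://jsoup.org/apidocs/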

========================================================================================================

Getting Started

1. Parse and traverse an HTML document

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

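(For more details, see "Parse a document from an HTML string".)

The parser makes every attempt to create a clean parse result from the HTML you provide, whether or not the HTML is well-formed. For example, it handles:

unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a bare <td>Table data</td> is wrapped into <table><tr><td>...)
reliably creating the document structure (an html element containing a head and a body, with only appropriate elements inside the head)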

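The object model of a document

A document consists of Elements and TextNodes (plus a few other miscellaneous nodes: see the nodes package tree).
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
An Element contains a list of child nodes and has one parent Element. It also provides a filtered list of its child Elements only.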
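
The difference between the full child node list and the filtered child element list can be sketched as follows. This is a minimal illustration, not part of the original cookbook; the class name NodeModelSketch and its helper methods are hypothetical, and the sample markup is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class for illustration only.
class NodeModelSketch {
    // Labels each direct child node of the first <p> as text or element.
    static String describeChildren(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        StringBuilder out = new StringBuilder();
        if (p == null) {
            return out.toString();
        }
        for (Node child : p.childNodes()) {
            if (child instanceof TextNode) {
                out.append("text:").append(((TextNode) child).text());
            } else if (child instanceof Element) {
                out.append("element:").append(((Element) child).tagName());
            }
            out.append('|');
        }
        return out.toString();
    }

    // Counts child Elements only; text nodes are filtered out by children().
    static int elementChildCount(String html) {
        Element p = Jsoup.parse(html).select("p").first();
        return p == null ? 0 : p.children().size();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>jsoup</b> world</p>";
        System.out.println(describeChildren(html));
        System.out.println(elementChildCount(html));
    }
}
```

childNodes() returns every node, including text, while children() returns only the Elements.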

See also

Extracting data: DOM navigation
Extracting data: Selector syntax

========================================================================================================

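Input

2. Parse a document from an HTML string

Problem

You have HTML in a string (from user input, a file, or a website) and you want to parse it to extract its contents, check that it is well-formed, or modify it. jsoup handles all of these cases.

Solution

Use the static Jsoup.parse(String html) method, or Jsoup.parse(String html, String baseUri).

Example code: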

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

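Description

The parse(String html, String baseUri) method parses the input HTML into a new Document. The baseUri argument is used to resolve relative URLs into absolute URLs, and should be set to the URL the document was fetched from. If that does not apply, use the parse(String html) method instead, as in the example above.

As long as you pass in a non-null string, you get back a sensibly structured document containing (at least) a head and a body element.

Once you have a Document, you can get at the data using the appropriate methods in Document and in its parent classes Element and Node.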
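
The guarantee described above (a head and a body even for incomplete input) can be checked with a small sketch. The class ParseStringSketch and its method names are hypothetical, chosen only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration only.
class ParseStringSketch {
    // True when the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    // Returns the text of the document's title element, or "" if there is none.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    // Returns the combined text of the body.
    static String bodyText(String html) {
        return Jsoup.parse(html).body().text();
    }

    public static void main(String[] args) {
        // Neither input is a complete HTML document.
        System.out.println(hasHeadAndBody(""));
        System.out.println(hasHeadAndBody("<p>Unclosed <p>tags"));
        System.out.println(titleOf("<title>First parse</title><p>Parsed HTML into a doc."));
        System.out.println(bodyText("<p>Unclosed <p>tags"));
    }
}
```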

========================================================================================================

3. Parse a body fragment

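Problem

You have a fragment of body HTML (e.g. a div containing a couple of p tags, rather than a complete HTML document) that you want to parse. The fragment might be a comment submitted by a user, or the body of a page edited in a CMS.

Solution

Use the Jsoup.parseBodyFragment(String html) method.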

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

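Description

The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. The normal Jsoup.parse(String html) method generally gives the same result, but explicitly treating user input as a body fragment ensures that any malformed HTML the user supplies is parsed into the body element.

The Document.body() method returns the document's body element; it is equivalent to doc.getElementsByTag("body").first().

Stay safe

If you let users enter HTML, be careful to avoid cross-site scripting attacks. Use the Whitelist-based cleaner and the clean(String bodyHtml, Whitelist whitelist) method to strip malicious content from user input.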

========================================================================================================

4. Load a Document from a URL

Problem

You need to fetch and parse an HTML document from a website, and find data within it.

Solution

Use the Jsoup.connect(String url) method:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

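Description

The connect(String url) method creates a new Connection, and get() fetches and parses the HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.

The Connection interface is designed for method chaining, so you can build specific requests: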

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

This method only supports web URLs (the http and https protocols). If you need to load from a file, use the parse(File in, String charsetName) method instead.

========================================================================================================

5. Load a Document from a file

Problem

You have an HTML file on your local disk that you want to parse, and then extract data from or modify.

Solution

Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

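Description

The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.

The baseUri parameter is used to resolve relative URLs found in the file. If you do not need it, pass an empty string.

There is also a sister method, parse(File in, String charsetName), which uses the file's location as the baseUri. It is useful when the parsed file belongs to a site stored on the local filesystem and its relative links also point into that filesystem.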
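
As a rough, self-contained sketch of the file-loading call, the example below writes a small HTML snippet to a temporary file and parses it back with an explicit charset and base URI. The class ParseFileSketch, the temporary file name prefix, and the choice to wrap IOException in UncheckedIOException are all assumptions made for this illustration; real code should handle the exception however suits the application.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class for illustration only.
class ParseFileSketch {
    // Writes html to a temporary file, parses it with the given base URI,
    // and returns the absolute URL of the first link ("" if there is none).
    static String firstLinkFromTempFile(String html, String baseUri) {
        File input = null;
        try {
            input = File.createTempFile("jsoup-sketch", ".html");
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            Document doc = Jsoup.parse(input, "UTF-8", baseUri);
            Element link = doc.select("a[href]").first();
            return link == null ? "" : link.attr("abs:href");
        } catch (IOException e) {
            // One possible way to handle the IOException mentioned above.
            throw new UncheckedIOException("could not load the HTML file", e);
        } finally {
            if (input != null) {
                input.delete();
            }
        }
    }

    public static void main(String[] args) {
        String html = "<a href='/download'>Get it</a>";
        System.out.println(firstLinkFromTempFile(html, "http://example.com/"));
    }
}
```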

========================================================================================================

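Extracting data

6. Use DOM methods to navigate a document

Problem

You have an HTML document that you want to extract data from, and you know its general structure.

Solution

After parsing the HTML into a Document, you can work with it using DOM-like methods. Example code: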

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Description

Element objects provide a range of DOM-like methods for finding elements, and for extracting and manipulating their data:

Finding elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements() , firstElementSibling() , lastElementSibling() ; nextElementSibling() , previousElementSibling()
Graph: parent() , children() , child(int index)

Element data

attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id() , className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()

Manipulating HTML and text

append(String html) , prepend(String html)
appendText(String text) , prependText(String text)
appendElement(String tagName) , prependElement(String tagName)
html(String value)
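
A few of the methods listed above can be combined as in the following sketch. The class DomNavigationSketch, its sample markup, and its helper methods are hypothetical and exist only to show the calls working together.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class DomNavigationSketch {
    static final String HTML =
            "<div id='content'><h1 class='title'>News</h1>"
            + "<p><a href='/a'>First</a></p><p><a href='/b'>Second</a></p></div>";

    // Collects the href of every link under the element with the given id.
    static List<String> hrefsUnder(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById(id);
        List<String> hrefs = new ArrayList<>();
        if (content == null) {
            return hrefs;
        }
        for (Element link : content.getElementsByTag("a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Tag name of the element following the first element with the given class.
    static String nextSiblingTag(String html, String className) {
        Element first = Jsoup.parse(html).getElementsByClass(className).first();
        Element next = first == null ? null : first.nextElementSibling();
        return next == null ? "" : next.tagName();
    }

    // Appends a new paragraph to the element and returns the last child's text.
    static String appendParagraph(String html, String id, String text) {
        Element content = Jsoup.parse(html).getElementById(id);
        if (content == null) {
            return "";
        }
        content.appendElement("p").text(text);
        return content.children().last().text();
    }

    public static void main(String[] args) {
        System.out.println(hrefsUnder(HTML, "content"));
        System.out.println(nextSiblingTag(HTML, "title"));
        System.out.println(appendParagraph(HTML, "content", "Third"));
    }
}
```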

========================================================================================================

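7. Use selector syntax to find elements

Problem

You want to find or manipulate elements using a CSS-like or jQuery-like selector syntax.

Solution

Use the Element.select(String selector) and Elements.select(String selector) methods: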

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]");
  // img elements whose src ends with .png

Element masthead = doc.select("div.masthead").first();
  // the first div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class=r

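Description

jsoup elements support a CSS-like (or jQuery-like) selector syntax, which allows very powerful and flexible queries.

The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element, or chain select calls.

select returns an Elements collection, which provides a range of methods for extracting and manipulating the results.

Selector overview

tagname : find elements by tag, e.g. a
ns|tag : find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id : find elements by ID, e.g. #logo
.class : find elements by class name, e.g. .masthead
[attribute] : find elements with the attribute, e.g. [href]
[^attr] : find elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value] : find elements by attribute value, e.g. [width=500]
[attr^=value] , [attr$=value] , [attr*=value] : find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex] : find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
* : matches all elements

Selector combinations

el#id : element with ID, e.g. div#logo
el.class : element with class, e.g. div.masthead
el[attr] : element with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child : find child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child : find child elements that are direct children of parent, e.g. div.content > p finds p elements, and body > * finds all direct children of the body tag
siblingA + siblingB : find the sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX : find sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el : group multiple selectors, finding the unique elements that match any of them, e.g. div.masthead, div.logo

Pseudo selectors

:lt(n) : find elements whose sibling index (i.e. position in the DOM tree relative to the parent) is less than n, e.g. td:lt(3) matches cells in the first three columns
:gt(n) : find elements whose sibling index is greater than n, e.g. div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n) : find elements whose sibling index is equal to n, e.g. form input:eq(1) matches an input inside a form that is the second child of its parent
:has(selector) : find elements that contain elements matching the selector, e.g. div:has(p) matches divs that contain a p element
:not(selector) : find elements that do not match the selector, e.g. div:not(.logo) matches all divs that do not have class=logo
:contains(text) : find elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text) : find elements that directly contain the given text
:matches(regex) : find elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex) : find elements whose own text matches the specified regular expression
Note: the indexed pseudo selectors above are 0-based, i.e. the first element is at index 0, the second at index 1, and so on

See the Selector API reference for more details.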
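
Several of the selectors listed above can be tried against a small in-memory document, as in this sketch. The class SelectorSketch, its sample markup, and the helper methods count and texts are hypothetical and only serve to exercise the selector syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class SelectorSketch {
    static final String HTML =
            "<div class='masthead'><a href='/home'>Home</a></div>"
            + "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div class='logo'><img src='logo.png'><img src='photo.JPG'></div>"
            + "<p>Powered by jsoup</p><p>Other</p>";

    // Number of elements in the sample document matched by the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    // Comma-separated text of every element matched by the selector.
    static String texts(String selector) {
        StringBuilder sb = new StringBuilder();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(e.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));          // cells with sibling index 0, 1, 2
        System.out.println(texts("td:gt(2)"));          // cells with sibling index above 2
        System.out.println(texts("td:eq(1)"));          // the cell at sibling index 1
        System.out.println(count("div:has(img)"));      // divs containing an img
        System.out.println(count("div:not(.logo)"));    // divs without class logo
        System.out.println(count("p:contains(JSOUP)")); // case-insensitive text match
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]")); // regex on attribute
        System.out.println(count("table ~ p"));         // p siblings after the table
    }
}
```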

========================================================================================================

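8. Extract attributes, text, and HTML from elements

Problem

After parsing a document and finding some elements, you want to get at the data inside those elements.

Solution

To get the value of an attribute, use the Node.attr(String key) method
For the text on an element, use the Element.text() method
For the HTML content of an element, use Element.html() , or Node.outerHtml() as appropriate

Example: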

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first(); // the first a element

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Description

The methods above are the core of the element data access methods. There are a few others as well:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)

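Each of these accessor methods has a corresponding setter method to change the data.

See also

The reference documentation for Element and the Elements collection class
Working with URLs
Finding elements with the CSS selector syntax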

========================================================================================================

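9. Working with URLs

Problem

You have an HTML document that contains relative URLs, and you need to resolve them to absolute URLs.

Solution

Make sure you specify a base URI when parsing the document, then use the abs: attribute prefix to get the absolute URL resolved against that base URI. For example: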

Document doc = Jsoup.connect("http://www.open-open.com").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://www.open-open.com/"

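Description

In HTML elements, URLs are often written relative to the document's location: <a href="/download">...</a>. When you use the Node.attr(String key) method to get the href attribute of an a element, it returns the value exactly as specified in the HTML source.

If you want an absolute URL, prefix the attribute name with abs:, as in attr("abs:href"). This returns the URL resolved against the document's base URI, which is why it is important to specify a base URI when parsing the document.

If you do not want to use the abs: prefix, the Node.absUrl(String key) method does the same thing.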
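
The difference between the raw attribute, the abs: prefix, and absUrl can be seen without a network connection by passing a base URI to Jsoup.parse. This is a minimal sketch; the class AbsUrlSketch, its method names, and the sample base URI are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class AbsUrlSketch {
    // Parses the HTML against the base URI and returns the first link.
    private static Element firstLink(String html, String baseUri) {
        return Jsoup.parse(html, baseUri).select("a").first();
    }

    // The href exactly as written in the source.
    static String rawHref(String html, String baseUri) {
        return firstLink(html, baseUri).attr("href");
    }

    // The href resolved against the base URI via the abs: prefix.
    static String absHref(String html, String baseUri) {
        return firstLink(html, baseUri).attr("abs:href");
    }

    // The same resolution via Node.absUrl(String key).
    static String absUrl(String html, String baseUri) {
        return firstLink(html, baseUri).absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/download'>Download</a>";
        String base = "http://example.com/docs/";
        System.out.println(rawHref(html, base));
        System.out.println(absHref(html, base));
        System.out.println(absUrl(html, base));
        // Without a base URI the relative link cannot be resolved.
        System.out.println("[" + absUrl(html, "") + "]");
    }
}
```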

========================================================================================================

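10. Example program: list links

This example program shows how to fetch a page from a URL, extract its links, images, and other pointers, and examine their URLs and text.

Specify the URL to fetch as the program's sole argument.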

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(),link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

Example output

Fetching http://news.ycombinator.com/...

Media: (38)
 * img: <http://ycombinator.com/images/y18.gif> 18x18 ()
 * img: <http://ycombinator.com/images/s.gif> 10x1 ()
 * img: <http://ycombinator.com/images/grayarrow.gif> x ()
 * img: <http://ycombinator.com/images/s.gif> 0x10 ()
 * script: <http://www.co2stats.com/propres.php?s=1138>
 * img: <http://ycombinator.com/images/s.gif> 15x1 ()
 * img: <http://ycombinator.com/images/hnsearch.png> x ()
 * img: <http://ycombinator.com/images/s.gif> 25x1 ()
 * img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)
 
Imports: (2)
 * link <http://ycombinator.com/news.css> (stylesheet)
 * link <http://ycombinator.com/favicon.ico> (shortcut icon)
 
Links: (141)
 * a: <http://ycombinator.com>  ()
 * a: <http://news.ycombinator.com/news>  (Hacker News)
 * a: <http://news.ycombinator.com/newest>  (new)
 * a: <http://news.ycombinator.com/newcomments>  (comments)
 * a: <http://news.ycombinator.com/leaders>  (leaders)
 * a: <http://news.ycombinator.com/jobs>  (jobs)
 * a: <http://news.ycombinator.com/submit>  (submit)
 * a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW>  (login)
 * a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://www.readwriteweb.com/archives/facebook_gets_faster_debuts_homegrown_php_compiler.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter>  (Facebook speeds up PHP)
 * a: <http://news.ycombinator.com/user?id=mcxx>  (mcxx)
 * a: <http://news.ycombinator.com/item?id=1094578>  (9 comments)
 * a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914>  ("Tough. Django produces XHTML.")
 * a: <http://news.ycombinator.com/user?id=andybak>  (andybak)
 * a: <http://news.ycombinator.com/item?id=1094649>  (3 comments)
 * a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce>  (More)
 * a: <http://news.ycombinator.com/lists>  (Lists)
 * a: <http://news.ycombinator.com/rss>  (RSS)
 * a: <http://ycombinator.com/bookmarklet.html>  (Bookmarklet)
 * a: <http://ycombinator.com/newsguidelines.html>  (Guidelines)
 * a: <http://ycombinator.com/newsfaq.html>  (FAQ)
 * a: <http://ycombinator.com/newsnews.html>  (News News)
 * a: <http://news.ycombinator.com/item?id=363>  (Feature Requests)
 * a: <http://ycombinator.com>  (Y Combinator)
 * a: <http://ycombinator.com/w2010.html>  (Apply)
 * a: <http://ycombinator.com/lib.html>  (Library)
 * a: <http://www.webmynd.com/html/hackernews.html>  ()
 * a: <http://mixpanel.com/?from=yc>  ()

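Modifying data

11. Set attribute values

Problem

You have a parsed document and want to update some attribute values in it before saving it to disk or sending it to the front-end page.

Solution

Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).

If you need to modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.

The Elements collection provides bulk attribute and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div: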

doc.select("div.comments a").attr("rel", "nofollow");

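Description

Like the other methods in Element, the attr methods return the current Element (or the Elements collection when working on the result of a select). This makes method chaining convenient. For example: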

doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
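
Both the bulk update and the chained form can be tried on a small fragment, as in the sketch below. The class AttrSketch, its method names, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class AttrSketch {
    // Adds rel="nofollow" to every link in div.comments and counts the tagged links.
    static int nofollowCount(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("div.comments a").attr("rel", "nofollow");
        return doc.select("a[rel=nofollow]").size();
    }

    // Chains attr and addClass, then reports whether both changes took effect.
    static boolean chainedUpdateApplied(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        Element masthead = doc.select("div.masthead").first();
        return masthead != null
                && masthead.attr("title").equals("jsoup")
                && masthead.hasClass("round-box");
    }

    public static void main(String[] args) {
        String html = "<div class='masthead'>Site</div>"
                + "<div class='comments'><a href='/x'>x</a><a href='/y'>y</a></div>"
                + "<a href='/outside'>outside</a>";
        System.out.println(nofollowCount(html));
        System.out.println(chainedUpdateApplied(html));
    }
}
```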

========================================================================================================

12. Set the HTML of an element

Problem

You need to modify the HTML content of an element.

Solution

Use the HTML setter methods in Element:

Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>"); // adds HTML at the start of the div's inner HTML
div.append("<p>Last</p>"); // adds HTML at the end of the div's inner HTML
// now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>

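Description

Element.html(String html) clears any existing inner HTML in an element and replaces it with the parsed HTML.
Element.prepend(String first) and Element.append(String last) add HTML to the start or end of an element's inner HTML, respectively.
Element.wrap(String around) wraps HTML around the outside of an element.

See also

You can also use the Element.prependElement(String tag) and Element.appendElement(String tag) methods to create new elements and insert them into the document as child elements.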

========================================================================================================

13. Set the text content of elements

Problem

You need to modify the text content of an HTML document.

Solution

Use the text setter methods of Element:

Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>

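Description

The text setter methods mirror the HTML setter methods:

Element.text(String text) clears any existing inner HTML in an element and replaces it with the supplied text.
Element.prepend(String first) and Element.append(String last) add text nodes to the start or end of an element's inner HTML, respectively.

Supply the text unencoded: characters such as < and > are treated as literal text, not as HTML.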
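
The literal treatment of markup characters can be seen by setting the same string once with text and once with html. This is a small sketch; the class TextSetterSketch and its methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class TextSetterSketch {
    // Returns a fresh, empty div to work on.
    private static Element emptyDiv() {
        return Jsoup.parseBodyFragment("<div></div>").select("div").first();
    }

    // Inner HTML of the div after the value is set as text (markup is escaped).
    static String innerHtmlAfterText(String value) {
        return emptyDiv().text(value).html();
    }

    // Number of child elements after the value is set as text.
    static int childElementsAfterText(String value) {
        return emptyDiv().text(value).children().size();
    }

    // Number of child elements after the value is set as HTML.
    static int childElementsAfterHtml(String value) {
        return emptyDiv().html(value).children().size();
    }

    public static void main(String[] args) {
        String value = "<b>bold</b>";
        System.out.println(innerHtmlAfterText(value));     // escaped, no b element
        System.out.println(childElementsAfterText(value)); // no child elements
        System.out.println(childElementsAfterHtml(value)); // one b child element
    }
}
```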

========================================================================================================

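Cleaning HTML

14. Sanitize untrusted HTML (to prevent XSS attacks)

Problem

Websites often let users submit comments. Malicious users may put scripts into the comment content; those scripts can break the behavior of the whole page or, worse, steal confidential information. You need to clean this HTML to avoid cross-site scripting (XSS) attacks.

Solution

Use the jsoup HTML Cleaner with a configurable Whitelist. (Recent jsoup releases rename Whitelist to Safelist.)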

String unsafe =
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>

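Description

XSS, short for cross-site scripting (sometimes written CSS), is an attack in which a malicious user injects HTML or script code into a web page. When other users view the page, the injected code runs in their browsers and serves the attacker's purpose. Because XSS is a passive attack that is not always easy to exploit, its danger is often underestimated. A common defense is to accept only plain text from users, but that makes for a poor user experience.

A better solution is to offer a rich-text WYSIWYG editor such as CKEditor or TinyMCE. These output HTML and let the user edit visually. They can validate on the client side, but that is not safe enough: you must also validate on the server and strip harmful HTML to be sure that the HTML entering your site is safe. Otherwise an attacker can bypass the client-side JavaScript validation and inject unsafe HTML directly into your site.

The jsoup whitelist cleaner filters user-supplied HTML on the server side and outputs only safe tags and attributes.

jsoup provides a range of basic Whitelist configurations that meet most requirements; they can be modified if necessary, but do so with care.

The cleaner is useful not only for avoiding XSS attacks, but also for limiting the range of tags a user is allowed to enter.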

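See also

See the XSS cheat sheet for an example of why regular expressions do not work here, and why a safe, whitelist-based, parser-based cleaner is the right choice.
See the Cleaner reference to learn how to get back a Document object instead of a string
See the Whitelist reference to learn how to create a custom whitelist
The nofollow link attribute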
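
A custom configuration might look like the sketch below. It assumes a recent jsoup release, where the class is named Safelist; on older releases the same calls are made on Whitelist. The class CleanerSketch and the particular comment policy (basic formatting plus images over http or https) are hypothetical choices for illustration, not a recommended policy.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration only.
class CleanerSketch {
    // Assumed policy: the basic safelist plus img tags with http(s) sources.
    static Safelist commentPolicy() {
        return Safelist.basic()
                .addTags("img")
                .addAttributes("img", "src", "alt")
                .addProtocols("img", "src", "http", "https");
    }

    // Returns the input with everything outside the policy removed.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, commentPolicy());
    }

    // True when the input already conforms to the policy.
    static boolean isAcceptable(String untrusted) {
        return Jsoup.isValid(untrusted, commentPolicy());
    }

    public static void main(String[] args) {
        String unsafe = "<p onclick='x()'>Hi<script>alert(1)</script>"
                + "<img src='http://example.com/a.png' onerror='x()'>"
                + "<img src='javascript:alert(1)'></p>";
        System.out.println(sanitize(unsafe));
        System.out.println(isAcceptable("<b>ok</b>"));
        System.out.println(isAcceptable("<script>alert(1)</script>"));
    }
}
```

Starting from Safelist.basic() and adding to it keeps the policy narrow; each extra tag, attribute, or protocol widens what user input may contain, so additions deserve the same care the text above recommends.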

========================================================================================================

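That covers the basic features of jsoup. Thanks to its extensible API design, you can build very powerful HTML parsing on top of its selectors. The jsoup project itself is also under active development, so if you work in Java and need to process HTML, it is well worth a try.

The documentation above is adapted from: http://www.open-open.com/jsoup/ If anything is unclear, please visit that site directly.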
