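Table of Contents

Introduction to Jsoup
Dependencies
Parsing a URL
Parsing a String
Parsing a File
Getting Elements with DOM Methods
- 1. Getting an element by id
- 2. Getting elements by tag
- 3. Getting elements by class
- 4. Getting elements by attribute
Getting Data from an Element
Getting Elements with Selectors
- Combining selectors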

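Introduction to Jsoup

Jsoup is an HTML parser for Java. It can parse HTML directly from a URL or from HTML text, and it provides a convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.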
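As a quick illustration of the two lookup styles mentioned above, here is a minimal sketch. The markup, the class name `JsoupIntroSketch`, and the helper method names are assumptions made up for this example; only the Jsoup calls themselves (`Jsoup.parse`, `getElementById`, `select`, `first`, `text`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupIntroSketch {

    // Hypothetical markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo page</title></head>"
          + "<body><p id=\"greeting\" class=\"note\">Hello, <b>Jsoup</b>!</p></body></html>";

    /** DOM-style lookup: find the element by id and return its text ("" if absent). */
    static String textById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id);
        return element == null ? "" : element.text();
    }

    /** Selector-style lookup: find the first match of a CSS query ("" if absent). */
    static String textBySelector(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element element = doc.select(cssQuery).first();
        return element == null ? "" : element.text();
    }

    public static void main(String[] args) {
        System.out.println(textById(HTML, "greeting"));
        System.out.println(textBySelector(HTML, "p.note"));
    }
}
```

Both calls reach the same `<p>` element; which style to use is largely a matter of taste, although selectors tend to scale better once lookups get more specific.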

Dependencies

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

<!-- Utility classes for file operations -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>

<!-- String-handling utilities -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>

<!-- Unit testing -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>

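Parsing a URL

Download the data from a URL, then parse the downloaded data.

@Test
public void testUrl() throws Exception {

    // Parse a URL
    // The first argument is the URL to fetch; the second is the timeout in milliseconds
    Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 1000);

    // Look up the <title> tag and read its content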
    String title = doc.getElementsByTag("title").first().text();

    System.out.println(title);
}
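`Jsoup.parse(URL, int)` is fine for a quick test. When a request needs more control, such as a custom User-Agent, Jsoup also offers `Jsoup.connect`. The sketch below shows that variant; the class name, the helper names, the User-Agent string, and the 5000 ms timeout are all assumptions chosen for illustration, and `fetch` needs network access to run.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchTitleSketch {

    /** Downloads the page and returns the parsed Document (requires network access). */
    static Document fetch(String url, int timeoutMillis) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup-tutorial-sketch)") // hypothetical User-Agent
                .timeout(timeoutMillis)                            // timeout in milliseconds
                .get();
    }

    /** Returns the <title> text; Document.title() yields "" when there is no title. */
    static String titleOf(Document doc) {
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://www.baidu.com", 5000);
        System.out.println(titleOf(doc));
    }
}
```

Using `doc.title()` instead of `getElementsByTag("title").first().text()` also avoids a `NullPointerException` on pages that have no `<title>` element.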

Parsing a String

@Test
public void testString() throws Exception {

    // Use the utility class to read the file into a string
    String content = FileUtils.readFileToString(new File("C:\\Users\\yubin14\\Desktop\\new.html"), "utf8");

    // Parse the string with Jsoup
    Document doc = Jsoup.parse(content);

    String title = doc.getElementsByTag("title").first().text();

    System.out.println(title);
}
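The test above depends on a local file. For experimenting, an inline string works just as well. The following sketch, with made-up markup and hypothetical helper names, shows a null-safe title lookup and `Jsoup.parseBodyFragment`, which is meant for snippets that are not a complete page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringSketch {

    /** Returns the title text, or "" when the HTML has no <title> element. */
    static String safeTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element title = doc.getElementsByTag("title").first(); // null when nothing matches
        return title == null ? "" : title.text();
    }

    /** Parses a body-only snippet and returns its visible text. */
    static String fragmentText(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, for illustration only.
        System.out.println(safeTitle("<html><head><title>Inline page</title></head><body></body></html>"));
        System.out.println(fragmentText("<span>one</span> <span>two</span>"));
    }
}
```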

Parsing a File

@Test
public void testFile() throws Exception {

    // Parse the file directly
    Document doc = Jsoup.parse(new File("C:\\Users\\yubin14\\Desktop\\new.html"), "utf8");

    String title = doc.getElementsByTag("title").first().text();
    System.out.println(title);
}
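The path in the test above is specific to the author's machine. A portable way to try file parsing is to write a temporary file first. This is only a sketch: the class name, the method name, and the sample markup are assumptions, while `Jsoup.parse(File, String)` is the same call used above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileSketch {

    /** Writes the HTML to a temp file, parses it from disk, and returns the title. */
    static String titleFromTempFile(String html) {
        try {
            Path path = Files.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(path, html.getBytes(StandardCharsets.UTF_8));
                // The second argument is the charset used to decode the file.
                Document doc = Jsoup.parse(path.toFile(), "UTF-8");
                return doc.title();
            } finally {
                Files.deleteIfExists(path); // clean up the temp file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup, for illustration only.
        System.out.println(titleFromTempFile(
                "<html><head><title>File demo</title></head><body></body></html>"));
    }
}
```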

Getting Elements with DOM Methods

1. Getting an element by id

// 1. Get an element by id
Element element = doc.getElementById("people");
// Print the text inside the tag
System.out.println(element.text());

2. Getting elements by tag

// 2. Get elements by tag name
Element element = doc.getElementsByTag("span").first();

System.out.println(element.text());

3. Getting elements by class

// 3. Get elements by class name
Element element = doc.getElementsByClass("lione").first();

System.out.println(element.text());

4. Getting elements by attribute

// 4. Get elements by attribute name
Element element1 = doc.getElementsByAttribute("abc").first();
// Get elements by attribute name and attribute value
Element element2 = doc.getElementsByAttributeValue("abc", "123").first();

System.out.println(element1.text());
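The four snippets above assume a `doc` that has already been parsed, and they do not show the page they run against. The sketch below puts them together against a small inline page. The markup is invented for this example (it reuses the `people`, `lione`, and `abc` names from the snippets), and the `textOrEmpty` helper is a hypothetical addition that guards against `null` when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomLookupSketch {

    // Hypothetical page, for illustration only.
    static final Document DOC = Jsoup.parse(
            "<div id=\"people\">Alice and Bob</div>"
          + "<ul><li class=\"lione\">first item</li><li abc=\"123\">second item</li></ul>"
          + "<span>a span</span>");

    /** getElementById and first() return null when nothing matches. */
    static String textOrEmpty(Element element) {
        return element == null ? "" : element.text();
    }

    static String byId(String id) {
        return textOrEmpty(DOC.getElementById(id));
    }

    static String byTag(String tag) {
        return textOrEmpty(DOC.getElementsByTag(tag).first());
    }

    static String byClass(String className) {
        return textOrEmpty(DOC.getElementsByClass(className).first());
    }

    static String byAttribute(String name) {
        return textOrEmpty(DOC.getElementsByAttribute(name).first());
    }

    static String byAttributeValue(String name, String value) {
        return textOrEmpty(DOC.getElementsByAttributeValue(name, value).first());
    }

    public static void main(String[] args) {
        System.out.println(byId("people"));
        System.out.println(byTag("span"));
        System.out.println(byClass("lione"));
        System.out.println(byAttribute("abc"));
        System.out.println(byAttributeValue("abc", "123"));
    }
}
```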

Getting Data from an Element

@Test
public void testData() throws Exception {

    Document doc = Jsoup.parse(new File("C:\\Users\\yubin14\\Desktop\\new.html"), "utf8");

    Element element = doc.getElementsByTag("img").first();

    String str = "";
    // Get the id from the element
    str = element.id();

    // Get the class name(s) from the element
    str = element.className(); // multiple class names come back as one unsplit string
    Set<String> classNames = element.classNames(); // multiple class names are split into a set

    // Get an attribute value by attribute name
    str = element.attr("src");

    // Get all attributes of the element
    Attributes attributes = element.attributes();
    System.out.println(attributes);

    // Get the text content of the element
    str = element.text();


    System.out.println(str);

}
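Because the test above overwrites `str` several times, only the last value is printed. The following sketch reads the same kinds of data from a made-up `<img>` element so each accessor can be inspected on its own. The markup and the base URI `http://example.com/` are assumptions; `absUrl` is an extra accessor not used in the original test, shown here because relative `src` values are common.

```java
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {

    /** Parses a hypothetical page (with a base URI) and returns its first <img>. */
    static Element firstImage() {
        Document doc = Jsoup.parse(
                "<img id=\"logo\" class=\"big round\" src=\"/img/logo.png\" alt=\"Logo\">",
                "http://example.com/"); // base URI used to resolve relative links
        return doc.getElementsByTag("img").first();
    }

    public static void main(String[] args) {
        Element img = firstImage();

        System.out.println(img.id());             // the id attribute
        System.out.println(img.className());      // all class names as one string

        Set<String> classNames = img.classNames(); // class names split into a set
        System.out.println(classNames);

        System.out.println(img.attr("src"));      // attribute value as written in the HTML
        System.out.println(img.absUrl("src"));    // attribute value resolved against the base URI
        System.out.println(img.attributes());     // every attribute on the element
        System.out.println(img.text().isEmpty()); // an <img> has no text content
    }
}
```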

Getting Elements with Selectors

@Test
public void testSelector() throws Exception {
    Document doc = Jsoup.parse(new File("C:\\Users\\yubin14\\Desktop\\new.html"), "utf8");

    // Find elements by tag
    Elements elements = doc.select("span");
    //        for (Element element : elements) {
    //            System.out.println(element.text());
    //        }

    // Find an element by id
    Element element = doc.select("#people").first();
    System.out.println(element.text());

    // Find an element by class
    Element element1 = doc.select(".lione").first();
    System.out.println(element1.text());

    // Find an element by attribute name
    Element element2 = doc.select("[abc]").first();
    System.out.println(element2.text());

    // Find elements by attribute name and attribute value
    Elements elements1 = doc.select("[abc=123]");
    for (Element element3 : elements1) {
        System.out.println(element3.text());
    }
}
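The same selectors can be tried without a local file. This sketch runs each query against a small invented page and collects the text of every match; the class name, the `texts` helper, and the markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Hypothetical page, for illustration only.
    static final String HTML =
            "<div id=\"people\">Alice and Bob</div>"
          + "<ul><li class=\"lione\">first item</li><li abc=\"123\">second item</li></ul>"
          + "<span>span one</span><span>span two</span>";

    /** Returns the text of every element matched by the CSS query, in document order. */
    static List<String> texts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            result.add(element.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(texts(HTML, "span"));      // by tag
        System.out.println(texts(HTML, "#people"));   // by id
        System.out.println(texts(HTML, ".lione"));    // by class
        System.out.println(texts(HTML, "[abc]"));     // by attribute name
        System.out.println(texts(HTML, "[abc=123]")); // by attribute name and value
    }
}
```

Iterating over the `Elements` result, rather than calling `first()`, also sidesteps the `null` that `first()` returns when a query matches nothing.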

Combining selectors

@Test
public void testSelector2() throws Exception {

    Document doc = Jsoup.parse(new File("C:\\Users\\yubin14\\Desktop\\new.html"), "utf8");

    // Tag + id: el#id
    Elements select = doc.select("div#a");
    System.out.println(select.text());

    // Tag + class: el.class
    Element element = doc.select("li.news-meta-item").first();
    System.out.println(element.text());

    // Tag + attribute name: el[attr]
    Element first = doc.select("li[abc]").first();
    System.out.println(first.text());

    // Find descendants of an element: parent child
    Elements elements = doc.select(".s-rank-title div");
    for (Element element1 : elements) {
        System.out.println(element1.text());
    }

    // Find direct children of an element: parent > child
    // ("c" and "a" stand for whatever tags the target page actually uses)
    Elements elements1 = doc.select("c > a");
    for (Element element1 : elements1) {
        System.out.println(element1.text());
    }

    // Find all direct children of an element: parent > *
    // ("parent" is a placeholder; as written it only matches a literal <parent> tag)
    Elements select1 = doc.select("parent > *");
    for (Element element1 : select1) {
        System.out.println(element1.text());
    }

}

Tag, id, class, and attribute selectors can be combined freely: two at a time, or more.
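To make the difference between the combinators concrete, the sketch below counts matches for several combined queries against one small page. The markup, the class name, and the `count` helper are assumptions made for this example and do not come from the page used in the test above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinedSelectorSketch {

    // Hypothetical page: a div holding a list and a paragraph.
    static final String HTML =
            "<div id=\"a\" class=\"box\">"
          + "<ul class=\"menu\">"
          + "<li class=\"item\" abc=\"1\"><a href=\"#\">one</a></li>"
          + "<li class=\"item\">two</li>"
          + "</ul>"
          + "<p>para</p>"
          + "</div>";

    /** Returns how many elements the CSS query matches in the page. */
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div#a"));                  // tag + id
        System.out.println(count("li.item"));                // tag + class
        System.out.println(count("li[abc]"));                // tag + attribute name
        System.out.println(count("div#a li"));               // descendants at any depth
        System.out.println(count("div#a > li"));             // direct children only
        System.out.println(count("div#a > *"));              // every direct child
        System.out.println(count("ul.menu > li.item[abc]")); // several parts combined
    }
}
```

In this page the `<li>` elements sit inside the `<ul>`, so `div#a li` finds them while `div#a > li` does not: the `>` combinator only looks one level down.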

