groovy爬蟲練習之——企業資訊

2021-12-13 10:28:46

擷取資料的方法依然采取了regex正則比對的方法，請求架構采用了java，爬蟲語言是groovy，本地拼接好sql語句，發送到mysql服務端，完成存儲。

代碼如下：

package com.fan
 2
 3import com.fantest.httpclient.FanLibrary
 4import com.fantest.mysql.MySqlTest
 5import com.fantest.utils.Regex
 6import net.sf.json.JSONObject
 7
 8class Company extends FanLibrary {
 9    static void main(String[] args) {
10        for (def i in 1..1060) {
11            getPage(i)
12//                getInfo("/eportal/ui?pageId=307900&t=toDetail&ZSBH=D311056737")
13        }
14        testOver()
15    }
16
17    static getPage(int page) {
18        def url = "http://www.***.gov.cn/eportal/ui?pageId=307900"
19        def params = new JSONObject()
20        params.put("filter_LIKE_QYMC", EMPTY)
21        params.put("filter_LIKE_YYZZZCH", EMPTY)
22        params.put("filter_LIKE_ZSBH", EMPTY)
23        params.put("filter_LIKE_XXDZ", EMPTY)
24        params.put("currentPage", page)
25        params.put("pageSize", 15)
26        params.put("OrderByField", EMPTY)
27        params.put("OrderByDesc", EMPTY)
28        def response = getHttpResponse(getHttpPost(url, params))
29        def s = response.getString("content")
30        def all = Regex.regexAll(s, "<td s.*?浏覽")
31        for (int i = 1; i < all.size(); i++) {
32            def get = all.get(i)
33            def regex = Regex.getRegex(get, "href=\".*?\"").replace("amp;", EMPTY)
34            getInfo(regex)
35            sleep(3)
36        }
37        return response;
38    }
39
40    static getInfo(String url) {
41        try {
42            url = "http://www.***.gov.cn" + url;
43            def response = getHttpResponse(getHttpGet(url))
44            def content = response.getString("content")
45            def all = Regex.regexAll(content, "<td class=\"label\".*?\n.*\n.*\n.*\n.*\n.*")
46            def name = all.get(0).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
47            def adress = all.get(1).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
48            def money = all.get(2).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
49            def sid = all.get(3).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
50            def type = all.get(4).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
51            def man = all.get(5).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
52            def paper = all.get(6).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
53            def level = all.get(7).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
54            def gov = all.get(8).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
55            def time = all.get(9).replaceAll("<.*?>", EMPTY).replaceAll("(\n| )", EMPTY).split("：")[1]
56            def start = time.split("~")[0]
57            def end = time.split("~")[1]
58            String sql = "INSERT INTO company (name,adress,money,sid,type,man,paper,level,gov,start,end) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");"
59            sql = String.format(sql, name, adress, money, sid, type, man, paper, level, gov, start, end)
60            output(sql)
61            MySqlTest.sendWork(sql)
62        }
63        catch (Exception e) {
64            output(e)
65        }
66    }
67}

第一頁的網頁結構如下：

第二頁詳情頁結構如下:

groovy爬蟲練習之——企業資訊

繼續閱讀

10年軟體測試工程師，隻剩下這點感悟了（初級測試工程師必看）

做軟體測試工程師如何突破月薪20k?

測試工程師必備武器

面試題解析：你接口測試是怎麼做的？

軟體測試工程師跳槽：收到面試電話後，我會做的事。

測試學習筆記之軟體測試高頻面試題

sqlServer根據經緯查距離

軟體測試經典面試題（小題彙總）

Ajax發送和擷取json資料到Spring mvc 1.spring mvc後端2.web前段

Effective Java 8:通用程式設計

OOM三種類型

工廠模式-三種類型

【遞歸】高效率求2的n次幂

win10本地scala和spark安裝安裝scala安裝spark

scala (3) Function 和 Method

JSONObject包導入異常 java.lang.NoClassDefFoundErrorweb項目的導入包的問題