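My first simple Python crawler

Parsing data with XPath and storing it with openpyxl

Define the target: the Challenge Cup (挑戰杯) website

Locate the data I want: project name + project summary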

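Starting to parse: at first I used bs4, but the extracted data was too messy and also contained strings I did not need.

Solution: parse with XPath instead.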
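To illustrate why XPath helps here, the following is a minimal sketch run against a made-up HTML fragment, not the real Challenge Cup page. The markup, the class names, and the `SAMPLE_HTML` variable are all assumptions for the demo. It shows that an XPath ending in `text()` returns only the text nodes asked for, with no surrounding tags or unrelated strings.

```python
from lxml import html

etree = html.etree  # reach etree through lxml.html

# Hypothetical fragment standing in for one search result (not the real page).
SAMPLE_HTML = """
<html><body>
  <div class="item">
    <p class="name"><a href="/p/1">Smart cane</a></p>
    <p class="summary">A cane with obstacle detection.</p>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Each expression selects exactly one kind of text node.
names = tree.xpath('//div[@class="item"]/p[@class="name"]/a/text()')
summaries = tree.xpath('//div[@class="item"]/p[@class="summary"]/text()')

print(names)
print(summaries)
```

Each call returns a list of strings, so the results can be passed on without further cleanup.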

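At first I followed the official documentation directly:

from lxml import etree  # etree ended up underlined in red

Then, after looking things up and asking my teacher, I switched to: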

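from lxml import html  # reach etree indirectly through lxml.html
import requests

etree = html.etree  # bind the name etree, which the original snippet used without defining
# search URL taken from the complete version further down
url = 'http://www.tiaozhanbei.net/project/search/?category=&pro_type=&province_id=&keyword=%E6%99%BA%E8%83%BD&small_category='
res = requests.get(url)
tree = res.text  # store the HTML source in tree
tree = etree.HTML(tree)  # parse the HTML with etree
# XPath expressions used to extract the data
xpath_1 = '/html/body/div[2]/div/div[2]/div[2]/div/p/text()'
xpath_2 = '/html/body/div[2]/div/div[2]/div[2]/div/div/p[1]/a/text()'
# evaluate the expressions; printing xpath_1 itself would only show the expression string
print(tree.xpath(xpath_1))
print(tree.xpath(xpath_2))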
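A note on the import workaround above. One likely explanation, which is my assumption and not something I verified, is that the red underline comes from the editor's static analysis: `lxml.etree` is a compiled extension module, and some editors cannot inspect it. The small check below suggests that `html.etree` and `lxml.etree` are the very same module, so either spelling gives the same parser at runtime.

```python
from lxml import html
import lxml.etree

etree = html.etree

# lxml.html imports etree from its parent package, so both names
# refer to one and the same module object.
same_module = etree is lxml.etree
print(same_module)

# The parser behaves identically whichever way it is reached.
tree = etree.HTML('<p>hello</p>')
print(tree.xpath('//p/text()'))
```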

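The parse returned only one record, because each record sits at a different XPath address in the HTML. That means the number inside [ ] above has to change from record to record.

So I added a loop and string formatting to vary the number inside [ ].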

for i in range(2,4):
    xpath_1 = '/html/body/div[2]/div/div[2]/div[{}]/div/p/text()'.format(i)
    xpath_2 = '/html/body/div[2]/div/div[2]/div[{}]/div/div/p[1]/a/text()'.format(i)
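The loop above can be tried offline with a small sketch. The HTML below is invented for the demo (a header div followed by two result divs) and is much flatter than the real page, so the paths are shorter; only the technique of formatting the index into the XPath is the same. The last lines show an alternative I am adding for comparison: leaving the index off selects every matching div in one call.

```python
from lxml import html

etree = html.etree

# Hypothetical page: div[1] is a header, div[2] and div[3] hold results.
SAMPLE_HTML = """
<html><body>
  <div>header</div>
  <div><p>First summary</p></div>
  <div><p>Second summary</p></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Format the position into the XPath, one record per iteration.
results = []
for i in range(2, 4):  # visits div[2] and div[3]
    xpath = '/html/body/div[{}]/p/text()'.format(i)
    results.extend(tree.xpath(xpath))
print(results)

# Alternative: with no index, XPath matches every div that has a <p> child.
all_at_once = tree.xpath('/html/body/div/p/text()')
print(all_at_once)
```

The indexed loop is useful when each record has to stay separate; the index-free form is shorter when a flat list of all matches is enough.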

Of course, once the data has been parsed it still has to be saved, and this is where openpyxl comes in.
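As a minimal sketch of the storage step, the snippet below writes two rows to a workbook, saves it, and reads it back. The row contents and the file name `demo.xlsx` are placeholders made up for the demo, and the file goes into a temporary directory so nothing in the working directory is touched.

```python
import os
import tempfile

import openpyxl

# Placeholder rows; in the crawler these would come from the XPath results.
rows = [
    ['Project name', 'Project summary'],
    ['Smart cane', 'A cane with obstacle detection.'],
]

wb = openpyxl.Workbook()
sheet = wb.active
for row in rows:
    sheet.append(row)  # each list becomes one spreadsheet row

path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
wb.save(path)

# Read the file back to confirm what was written.
loaded = openpyxl.load_workbook(path)
read_back = [list(r) for r in loaded.active.iter_rows(values_only=True)]
print(read_back)
```

`sheet.append` always writes to the next empty row, so calling it in a loop builds the sheet from top to bottom.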

Finally, the complete version:

from lxml import html
import requests
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
etree = html.etree  # reach etree through lxml.html
# first results page for the keyword search
url_1 = 'http://www.tiaozhanbei.net/project/search/?category=&pro_type=&province_id=&keyword=%E6%99%BA%E8%83%BD&small_category='
url_list = []
url_list.append(url_1)
# paginated result URLs
for x in range(30):
    url_2 = 'http://www.tiaozhanbei.net/project/search/?page={}&category=&province_id=&pro_type=&keyword=%E6%99%BA%E8%83%BD&small_category='.format(x)
    url_list.append(url_2)
for url in url_list:
    res = requests.get(url)
    tree = res.text
    tree = etree.HTML(tree)  # parse the HTML source
    lists = []
    for i in range(2, 12):  # the records sit in div[2] through div[11]
        xpath_1 = '/html/body/div[2]/div/div[2]/div[{}]/div/p/text()'.format(i)
        xpath_2 = '/html/body/div[2]/div/div[2]/div[{}]/div/div/p[1]/a/text()'.format(i)
        # print(xpath_2)
        title = tree.xpath(xpath_1)
        content = tree.xpath(xpath_2)
        lists.append(content)
        lists.append(title)
    print(lists)  # print once per page instead of after every record
    for row in lists:  # renamed from `list` so the built-in is not shadowed
        sheet.append(row)  # each XPath result list becomes one row
wb.save('shuju.xlsx')
wb.close()
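The long URLs above can also be assembled from their query parameters, which makes the page number and keyword easier to see. The sketch below uses the standard library's `urlencode`. The helper name `build_page_url` is hypothetical, and the parameter order simply copies the paginated URL in the script. I have not checked which page number the site treats as the first page, so the range used here is only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.tiaozhanbei.net/project/search/'


def build_page_url(page, keyword='智能'):
    """Return the search URL for one results page (hypothetical helper)."""
    params = {
        'page': page,
        'category': '',
        'province_id': '',
        'pro_type': '',
        'keyword': keyword,  # urlencode percent-encodes this as UTF-8
        'small_category': '',
    }
    return BASE_URL + '?' + urlencode(params)


# Build a few page URLs without any network access.
urls = [build_page_url(x) for x in range(3)]
for u in urls:
    print(u)
```

With this approach the keyword stays readable in the source, and the percent-encoding is left to the library.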

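Summary

Purpose: this is the first crawler I wrote entirely on my own. It scrapes the summaries of some inventions from the Challenge Cup site.

Reflection: while writing this crawler I ran into two problems, and I am recording them here as evidence of a learning attitude:

take notes while writing code, because I am still at the stage of having only theory and no practice!!!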
