
python | Understanding text parsing and file read/write through a simple crawler example

Python's requests module makes it easy to fetch the text of a web page from a URL.

Python's re module provides powerful regular expressions for processing text.

Python's file reading and writing is likewise simple and powerful.
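Put together, those three pieces form the whole pipeline of this article: fetch, parse, write. A minimal offline sketch (the sample HTML line and the file name `index_demo.txt` are placeholders; in the live version the string would come from `requests.get(url).text`):

```python
import re

# In the real pipeline this string would come from requests.get(url).text
html = '<a href ="ch1.html">Chapter 1</a>'

# Parse link/title pairs with a two-group regex
links = re.findall(r'<a href ="(.*?)">(.*?)</a>', html)

# Write the result out to a file
with open('index_demo.txt', 'w', encoding='utf-8') as f:
    for href, title in links:
        f.write(href + ' ' + title + '\n')
```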

1 Fetching web page text from a URL with the requests module

1.1 Installing the requests module

Open a command prompt in the directory that contains py.exe:

Start menu → Run (Win+R) → cmd → use the cd command to change to the Scripts folder under the Python installation directory, e.g.:

cd C:\Users\userName\AppData\Local\Programs\Python\Python36-32\Scripts

Then run pip install requests

Alternatively, open the Python installation directory in Explorer, enter the Scripts folder, hold Shift, right-click, and choose "Open command window here" from the context menu.

Or run the following command directly in the cmd window (this uses the Douban PyPI mirror):

pip install requests -i http://pypi.douban.com/simple --trusted-host=pypi.douban.com
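To confirm the install worked without leaving Python, you can check whether the package is importable; a small sketch using the standard-library importlib:

```python
import importlib.util

# find_spec returns a ModuleSpec when the package is importable, else None
spec = importlib.util.find_spec("requests")
print("requests is installed" if spec else "requests is NOT installed")
```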

1.2 Fetching web page text from a URL

import requests
href = "https://www.3zmm.net/files/article/html/98709/98709808/"
html_response = requests.get(href)
#html_response.encoding = 'utf-8'
html = html_response.text
print(html)           

Output: (screenshot omitted)

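The commented-out `html_response.encoding = 'utf-8'` line is there because requests guesses the page encoding from the HTTP headers, and a wrong guess produces mojibake in the chapter text. The effect is easy to reproduce offline:

```python
# What a mis-decoded page looks like: UTF-8 bytes read as Latin-1
raw = '中文'.encode('utf-8')
print(raw.decode('iso-8859-1'))  # mojibake
print(raw.decode('utf-8'))       # the intended text
```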

2 Creating an index file

Create an index file, index.html, containing the links of the pages to be scraped (built by hand or extracted with code).

(An excerpt, for demonstration):

<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110286.html">第1章 出門即是江湖</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110285.html">第2章 麻将出千</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110284.html">第3章 移山卸嶺</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110283.html">第4章 初次試探</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110282.html">第5章 炸金花</a>           

Of course, you could also fetch the table of contents over the network and build the list directly with a regular-expression search. For this demonstration we keep a local index.html: it acts as a snapshot of the site's table of contents and can be edited at any time, which makes the demo more flexible.
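Building the list straight from the page text looks like this. The sketch below runs on the excerpt above so it works offline; swapping the literal for `requests.get(index_url).text` gives the live version:

```python
import re

# Offline stand-in; live version: html = requests.get(index_url).text
html = '''
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110286.html">第1章 出門即是江湖</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110285.html">第2章 麻将出千</a>
'''

# Two capture groups -> a list of (link, title) tuples
index_list = re.findall(r'<a href ="(.*?)">(.*?)</a>', html)
for href, title in index_list:
    print(title, '->', href)
```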

3 Reading index.html and building a list of links and titles

import re
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    print('href: ', link[0])
    print('title: ', link[1], '\n')

Output: (screenshot omitted)

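A quick aside on why indexList holds pairs: when the pattern contains two capture groups, re.findall returns one tuple per match, with one element per group:

```python
import re

# With two capture groups, findall returns a list of (group1, group2) tuples
pairs = re.findall(r'<a href ="(.*?)">(.*?)</a>',
                   '<a href ="a.html">One</a><a href ="b.html">Two</a>')
print(pairs)  # [('a.html', 'One'), ('b.html', 'Two')]
```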

4 Fetching the web text for each link in index.html

Fetch the page text for each link.

import re
import requests
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    print(link[1], '\n\n')
    print(chapter_html, '\n\n')

Output: (screenshot omitted)


5 Text extraction

Extract the main body text from the page source.

import re
import requests
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # pass re.S as a third argument if the content div spans multiple lines
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Output: (screenshot omitted)

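Note that indexing `[0]` on the findall result raises IndexError when the pattern matches nothing (for example, when the div spans several lines and re.S is not set). A defensive variant, shown here on a small inline sample rather than a live page:

```python
import re

html = '<div id="content" class="showtxt">line one<br />line two</div>'

# re.S lets .*? match across newlines; check for an empty result before indexing
matches = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', html, re.S)
if matches:
    print(matches[0])
else:
    print('content div not found')
```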

6 Text cleaning

Replace the unwanted text with an empty string.

import re
import requests
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    junk = '<script>chaptererror();</script><br />  請記住本書首發域名:www.3zmm.net。三掌門手機版閱讀網址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Output: (screenshot omitted)


7 Text processing (find and replace)

import re
import requests
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    junk = '<script>chaptererror();</script><br />  請記住本書首發域名:www.3zmm.net。三掌門手機版閱讀網址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Output: (screenshot omitted)

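The chained replace calls can also be collapsed into a regex substitution with re.sub; a sketch of the `<br /><br />` → paragraph-break rewrite on a small sample string, using the same replacement text as above:

```python
import re

content = 'first para<br /><br />second para<br /><br />third para'

# One re.sub call replaces every double <br /> with a paragraph break
content = re.sub(r'<br /><br />', '</p>\r\n<p>', content)
print(content)
```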

8 Writing each chapter to its own file

import re
import requests

# 1 Read the index file and extract the (link, title) list
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:

    # 2 Fetch the page text for the link
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text

    # 3 Extract the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]

    # 4 Clean the text (delete unwanted fragments)
    junk = '<script>chaptererror();</script><br />  請記住本書首發域名:www.3zmm.net。三掌門手機版閱讀網址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')

    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')

    print(link[1], '\n\n')
    # 6 Persist the data (write to file)
    fb = open('%s.html' % link[1], 'w', encoding='utf-8')  # %s is replaced by link[1]
    fb.write(chapter_content)
    fb.close()
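A slip like `fb.close` without parentheses never actually closes the file; a `with` block sidesteps the problem entirely because the file is closed automatically. A minimal sketch with a hypothetical title and content:

```python
# 'with' closes the file automatically, even if an exception occurs
title = '第1章 出門即是江湖'    # hypothetical example title
content = '<p>chapter body</p>'  # hypothetical cleaned content
with open('%s.html' % title, 'w', encoding='utf-8') as fb:
    fb.write(content)
```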

9 Writing each chapter to its own file, adding some CSS and JS

import re
import requests

# 1 Read the index file and extract the (link, title) list
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:

    # 2 Fetch the page text for the link
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text

    # 3 Extract the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]

    # 4 Clean the text (delete unwanted fragments)
    junk = '<script>chaptererror();</script><br />  請記住本書首發域名:www.3zmm.net。三掌門手機版閱讀網址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')

    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')

    print(link[1], '\n\n')
    # 6 Persist the data (write to file, adding some CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]  # chapter number, e.g. '1' from '第1章 ...'
    fb = open('%s.html' % sn, 'w', encoding='utf-8')  # %s is replaced by sn

    fheader = open('header.html', 'r', encoding="UTF-8")
    fb.write(fheader.read())
    fheader.close()

    fb.write('\n<h4>')
    fb.write(sn)

    cha = link[1].replace(sn, '')   # drop the chapter number from the title
    cha = cha.replace('第章 ', '')   # drop the leftover '第章 ' prefix
    fb.write(' ')
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)

    ffooter = open('footer.html', 'r', encoding="UTF-8")
    fb.write(ffooter.read())
    ffooter.close()
    fb.close()
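The chapter-number/title split is easy to check in isolation (sample title taken from the index excerpt in section 2):

```python
import re

title = '第1章 出門即是江湖'
sn = re.findall(r'第(.*?)章', title)[0]          # the chapter number
cha = title.replace(sn, '').replace('第章 ', '')  # the bare title
print(sn, cha)   # 1 出門即是江湖
```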

Alternatively, the header and footer can be written into the file directly as string literals:

import re
import requests

# 1 Read the index file and extract the (link, title) list
with open("index.html", 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r'<a href ="(.*?)">(.*?)</a>'  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:

    # 2 Fetch the page text for the link
    chapter_response = requests.get(link[0])
    #chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text

    # 3 Extract the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]

    # 4 Clean the text (delete unwanted fragments)
    junk = '<script>chaptererror();</script><br />  請記住本書首發域名:www.3zmm.net。三掌門手機版閱讀網址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')

    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')

    print(link[1], '\n\n')
    # 6 Persist the data (write to file, adding some CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]  # chapter number
    fb = open('%s.html' % sn, 'w', encoding='utf-8')  # %s is replaced by sn

    # 6.1 Write the file header
    #fheader = open('header.html','r',encoding="UTF-8")
    #fb.write(fheader.read())
    #fheader.close()
    headertxt = '''
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
<link ID="CSS" href="../cssjs/css.css" rel="stylesheet" type="text/css" />
<script charset="utf-8" language="JavaScript" type="text/javascript" src="../cssjs/js.js"></script>
<script>docWrite1();</script>
</head>

<body>
<div id="container">
    '''
    fb.write(headertxt)

    # 6.2 Write the file body
    fb.write('\n<h4>')
    fb.write(sn)

    cha = link[1].replace(sn, '')   # drop the chapter number from the title
    cha = cha.replace('第章 ', '')   # drop the leftover '第章 ' prefix
    fb.write(' ')
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)

    # 6.3 Write the file footer
    #ffooter = open('footer.html','r',encoding="UTF-8")
    #fb.write(ffooter.read())
    #ffooter.close()
    footertxt = '''
<div>
<script type=text/javascript>
	docWrite2();
    bootfunc();
    window.onload = myfun;
</script>
</div>

</body>
</html>
    '''
    fb.write(footertxt)
    fb.close()
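With the header kept as a string literal, the empty <title></title> could also be filled in per chapter. A sketch using the standard-library string.Template (the $title placeholder and the cut-down header are additions for illustration, not part of the original header):

```python
from string import Template

# Hypothetical cut-down header with a $title placeholder
header = Template('<head><title>$title</title></head>')
page_head = header.substitute(title='第1章 出門即是江湖')
print(page_head)
```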

-End-