Please credit the source when reposting: https://blog.csdn.net/weixin_45163516
Crawling web pages with the BeautifulSoup module and regular expressions
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(soup.find('h1').get_text(), ' url: ', his[-1])
# find valid urls
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)
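To make the link filter easier to reason about, here is a small sketch that applies the same pattern, "/item/(%.{2})+$", to a few hrefs without touching the network. The first two hrefs appear in this post; the last two are made up for illustration. As far as I know, BeautifulSoup tests a compiled pattern against an attribute value with search(), so search() is used here as well.

```python
import re
from urllib.parse import unquote

# Same pattern as the one passed to find_all in the crawler.
ITEM_LINK = re.compile("/item/(%.{2})+$")

candidates = [
    "/item/%E4%B8%87%E7%BB%B4%E7%BD%91",                   # from this post
    "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711",  # from this post
    "/item/Python",                                        # made up: no percent-encoding
    "/item/%E7%A5%9D%E7%A6%8F#1",                          # made up: trailing fragment
]

# Keep only hrefs that end in one or more percent-encoded groups after /item/.
matches = [href for href in candidates if ITEM_LINK.search(href)]

for href in matches:
    # unquote turns the percent-encoded bytes back into readable text.
    print(href, "->", unquote(href))
# prints: /item/%E4%B8%87%E7%BB%B4%E7%BD%91 -> /item/万维网
```

Only the first href passes. The starting URL itself is rejected because of its trailing "/5162711", so a page that links back to it in that form would not be picked. Note that ".{2}" accepts any two characters after "%"; a stricter variant would be "%[0-9A-Fa-f]{2}".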
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), ' url: ', his[-1])
    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()
Output of the two snippets above:
網絡爬蟲 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E6%8E%92%E5%BA%8F%E7%AE%97%E6%B3%95']
0 網絡爬蟲 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
1 www url: /item/%E4%B8%87%E7%BB%B4%E7%BD%91
2 www url: /item/%E4%B8%87%E7%BB%B4%E7%BD%91
3 千年技術獎 url: /item/%E5%8D%83%E5%B9%B4%E6%8A%80%E6%9C%AF%E5%A5%96
4 京都大學 url: /item/%E4%BA%AC%E9%83%BD%E5%A4%A7%E5%AD%A6
5 京都大學 url: /item/%E4%BA%AC%E9%83%BD%E5%A4%A7%E5%AD%A6
6 義項 url: /item/%E4%B9%89%E9%A1%B9
7 祝福 url: /item/%E7%A5%9D%E7%A6%8F
8 南腔北調集 url: /item/%E5%8D%97%E8%85%94%E5%8C%97%E8%B0%83%E9%9B%86
9 祝福 url: /item/%E7%A5%9D%E7%A6%8F
10 魯迅 url: /item/%E9%B2%81%E8%BF%85
11 魏晉風度及文章與藥及酒之關系 url: /item/%E9%AD%8F%E6%99%8B%E9%A3%8E%E5%BA%A6%E5%8F%8A%E6%96%87%E7%AB%A0%E4%B8%8E%E8%8D%AF%E5%8F%8A%E9%85%92%E4%B9%8B%E5%85%B3%E7%B3%BB
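The crawl stopped with an error after reaching entry 11. The reason is that this article contains no further linked keywords to jump to, which is why the error occurred.
Learned from: 莫煩Python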
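The original traceback is not shown, so the exact exception is unknown. The following is only a sketch of a more defensive version of the walk, not the tutorial's code. It keeps the same idea (append a random link, step back on a dead end) but stops cleanly once the history is empty. The name random_walk and the callable fetch_links are hypothetical; fetch_links(path) stands in for the urlopen + BeautifulSoup + find_all step and returns a list of hrefs.

```python
import random


def random_walk(start, fetch_links, steps=20, rng=random):
    """Walk randomly from page to page and return the visited paths.

    fetch_links(path) is a hypothetical callable that returns the
    candidate hrefs found on that page (an empty list for a dead end).
    """
    history = [start]
    visited = []
    for _ in range(steps):
        if not history:       # nothing left to step back to: stop cleanly
            break
        path = history[-1]
        visited.append(path)
        links = fetch_links(path)
        if links:
            history.append(rng.choice(links))
        else:
            history.pop()     # dead end: go back to the previous page
    return visited


# Tiny in-memory "site" with made-up paths, standing in for real pages.
site = {"/item/a": ["/item/b"], "/item/b": []}
print(random_walk("/item/a", lambda p: site.get(p, []), steps=4))
# prints: ['/item/a', '/item/b', '/item/a', '/item/b']
```

In a real crawler, fetch_links would also be the place to catch network errors from urlopen and to handle a page where soup.find('h1') returns None. Both are possible failure points, although whether either caused the error described above is an assumption.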