
Crawler: randomly crawling Baidu Baike, starting from the "Web crawler" (網絡爬蟲) entry

Please credit the source when reposting: https://blog.csdn.net/weixin_45163516

Crawling web pages with the BeautifulSoup module and regular expressions

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random


base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]


url = base_url + his[-1]

html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(soup.find('h1').get_text(), '    url: ', his[-1])

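# find valid urls: <a target="_blank"> tags whose href is "/item/" followed
# only by % escapes (% plus two characters) up to the end of the string
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

if sub_urls:
    # follow one of the matching links, chosen at random
    his.append(random.choice(sub_urls)['href'])
else:
    # no valid sub link found: go back to the previous page
    his.pop()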
print(his)

his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

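# repeat the same step 20 times, always starting from the last page in the history
for i in range(20):
    url = base_url + his[-1]

    # download and parse the current page, then print its title
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if sub_urls:
        # follow one of the matching links, chosen at random
        his.append(random.choice(sub_urls)['href'])
    else:
        # no valid sub link found: go back to the previous page
        his.pop()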
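
The `href` filter in the code above is easier to follow when tried on a few strings by hand. The following is a minimal sketch, not part of the original post: the helper name `is_entry_link` is made up for illustration, and the third sample (an href with a query string) is a hypothetical value. BeautifulSoup applies a compiled pattern with its `search()` method, so the sketch does the same. `urllib.parse.unquote` is used only to show what the `%XX` escapes stand for.

```python
import re
from urllib.parse import unquote

# Same pattern as in the post: "/item/" followed only by % escapes up to the end
LINK_PATTERN = re.compile("/item/(%.{2})+$")


def is_entry_link(href):
    """Return True if href would pass the filter (hypothetical helper)."""
    return LINK_PATTERN.search(href) is not None


samples = [
    "/item/%E4%B8%87%E7%BB%B4%E7%BD%91",                   # taken from the output in the post
    "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711",  # the start URL, with a numeric id suffix
    "/item/%E9%B2%81%E8%BF%85?fromModule=lemma",           # hypothetical href with a query string
]

for href in samples:
    # print whether the href passes, next to its decoded, readable form
    print(is_entry_link(href), unquote(href))
```

Only the first sample passes. Because of the trailing `$`, anything after the last `%XX` escape (an id suffix or a query string) makes the filter reject the link, so the start URL itself would not be picked up as a sub link.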
           

網絡爬蟲 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711

['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E6%8E%92%E5%BA%8F%E7%AE%97%E6%B3%95']

0 網絡爬蟲 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711

1 www url: /item/%E4%B8%87%E7%BB%B4%E7%BD%91

2 www url: /item/%E4%B8%87%E7%BB%B4%E7%BD%91

3 千年技術獎 url: /item/%E5%8D%83%E5%B9%B4%E6%8A%80%E6%9C%AF%E5%A5%96

4 京都大學 url: /item/%E4%BA%AC%E9%83%BD%E5%A4%A7%E5%AD%A6

5 京都大學 url: /item/%E4%BA%AC%E9%83%BD%E5%A4%A7%E5%AD%A6

6 義項 url: /item/%E4%B9%89%E9%A1%B9

7 祝福 url: /item/%E7%A5%9D%E7%A6%8F

8 南腔北調集 url: /item/%E5%8D%97%E8%85%94%E5%8C%97%E8%B0%83%E9%9B%86

9 祝福 url: /item/%E7%A5%9D%E7%A6%8F

10 魯迅 url: /item/%E9%B2%81%E8%BF%85

11 魏晉風度及文章與藥及酒之關系 url: /item/%E9%AD%8F%E6%99%8B%E9%A3%8E%E5%BA%A6%E5%8F%8A%E6%96%87%E7%AB%A0%E4%B8%8E%E8%8D%AF%E5%8F%8A%E9%85%92%E4%B9%8B%E5%85%B3%E7%B3%BB

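After entry number 11 was crawled, the run failed with an error. The cause appears to be that this article contained no further linked keywords the crawler could jump to.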
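
The post does not show the traceback, so the exact cause is not confirmed. The following is a minimal sketch, assuming the failure comes from one of two unguarded spots in the loop: `soup.find('h1')` returns `None` when a page has no `<h1>`, and `his[-1]` raises `IndexError` once `his.pop()` has emptied the list. The helper names `safe_title` and `advance` are made up for illustration, and the demonstration at the end uses invented link lists so that it runs without network access.

```python
import random

# Start entry used in the post
START = "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"


def safe_title(soup):
    """Return the page's <h1> text, or a placeholder when there is no <h1>.

    Calling .get_text() directly on the result of soup.find() raises
    AttributeError when nothing matched, because find() returned None.
    """
    h1 = soup.find("h1")
    return h1.get_text() if h1 is not None else "(no <h1> found)"


def advance(history, hrefs, rng=random):
    """Choose the next page, update history in place, and return its path.

    - links found: follow one at random
    - no links: step back one page, but never pop the start page,
      so history[-1] always stays valid
    """
    if hrefs:
        history.append(rng.choice(hrefs))
    elif len(history) > 1:
        history.pop()
    return history[-1]


# Offline demonstration with invented link lists
history = [START]
advance(history, ["/item/%E9%B2%81%E8%BF%85"])  # one link found: follow it
advance(history, [])                             # dead end: step back
advance(history, [])                             # already at the start: stay there
print(history)
```

Inside the loop, `safe_title(soup)` could replace `soup.find('h1').get_text()`, and `advance(his, [a['href'] for a in sub_urls])` could replace the `if`/`else` block.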

Learned from: 莫煩Python (Morvan Python)