Python Web Scraping for Beginners: Building a Scraper

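urllib is part of Python's standard library. It contains functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent (https://docs.python.org/3/library/urllib.html).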
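As a minimal sketch of changing the user agent mentioned above, a Request object can carry custom headers. The user-agent string and the helper name build_request are made up for illustration; nothing is sent over the network until the request is passed to urlopen.

```python
from urllib.request import Request

# Hypothetical user-agent string, chosen only for illustration.
USER_AGENT = "my-scraper/0.1 (learning project)"


def build_request(url):
    """Return a Request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://blog.csdn.net/qq_35706045")

# urllib stores header names in capitalized form, so the key is "User-agent".
print(req.get_header("User-agent"))
print(req.full_url)
```

To actually send it, pass the object to urlopen(req) instead of passing the bare URL string.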

If you have used the urllib2 library in Python 2.x, you may notice some differences between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and split into several submodules: urllib.request, urllib.parse, and urllib.error.
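To give a feel for the submodules, here is a small sketch using urllib.parse, which works on URL strings only and needs no network access. The relative link and the query parameters are hypothetical values picked for the example.

```python
from urllib.parse import urlencode, urljoin, urlparse

base = "https://blog.csdn.net/qq_35706045"

# Split a URL into its components.
parts = urlparse(base)
print(parts.scheme)   # https
print(parts.netloc)   # blog.csdn.net
print(parts.path)     # /qq_35706045

# Resolve a link found on a page against that page's URL
# (the path here is a hypothetical example).
absolute = urljoin(base, "/qq_35706045/article/list/2")
print(absolute)

# Encode a dictionary as a query string.
query = urlencode({"q": "python crawler", "page": 2})
print(query)          # q=python+crawler&page=2
```

urllib.request does the fetching and urllib.error defines the exceptions raised when a fetch fails.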

# Import the library; documentation is available on my blog
from urllib.request import urlopen
'''
Using urllib.request

'''
html = urlopen("https://blog.csdn.net/qq_35706045")
print(html.read())

$python pa.py

$python3 pa.py

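# Third-party library, recommended
"""
pip install requests

"""
import requests
'''
response = requests.get('http://www.baidu.com')
print(response.status_code)      # print the status code
print(response.url)              # print the request URL
print(response.headers)          # print the response headers
print(response.cookies)          # print the cookies
print(response.text)             # print the page source as text
print(response.content)          # print the page source as bytes

'''
r = requests.get('https://blog.csdn.net/qq_35706045')
# r.next is the prepared request for the next hop of a redirect chain, or None if there is none
print(r.next)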
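The following sketch shows how requests assembles a request before sending it, which can help when checking what URL and headers a scraper will actually use. It assumes requests is installed; the query parameter and user-agent string are hypothetical, and prepare() does not contact the server.

```python
import requests

# Hypothetical query parameter and user-agent string, for illustration only.
request = requests.Request(
    "GET",
    "https://blog.csdn.net/qq_35706045",
    params={"page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
)

# prepare() builds the final URL and headers without any network traffic.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

A prepared request can then be sent with requests.Session().send(prepared, timeout=10); setting a timeout keeps a scraper from hanging on an unresponsive server.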

The most commonly used object in the BeautifulSoup library is, fittingly, the BeautifulSoup object:

from urllib.request import urlopen
from bs4 import BeautifulSoup
'''
Basic usage of the bs4 library

'''
html = urlopen("https://blog.csdn.net/qq_35706045")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)
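To experiment with tag access without fetching a page, the same object can be built from an HTML string. This is a minimal sketch that assumes beautifulsoup4 is installed; the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented sample markup, so the example runs without network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello, scraper</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

bs_obj = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = bs_obj.h1             # first <h1> tag in the document
print(heading)                  # the tag including its markup
print(heading.get_text())       # only the text inside the tag

# Asking for a tag that does not exist returns None instead of raising.
missing = bs_obj.h2
print(missing)
```

Because a missing tag comes back as None, chaining another attribute onto it (for example bs_obj.h2.a) raises AttributeError, which is worth guarding against in a real scraper.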

The examples here use the newer BeautifulSoup 4 release (also known as BS4).

All installation methods for BeautifulSoup 4 are described at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

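The basic installation method on Linux is:

$ sudo apt-get install python-bs4

(On current Debian and Ubuntu releases, the Python 3 build is packaged as python3-bs4.)

On a Mac, first install pip, Python's package manager:

$ sudo easy_install pip

and then run:

$ pip install beautifulsoup4

To check the installation, start a Python interpreter and import the library:

$ python
> from bs4 import BeautifulSoup

If no error appears, the import succeeded.

There is also an .exe installer for pip on Windows (https://pypi.python.org/pypi/setuptools). Once it is installed, you can easily install and manage packages:

> pip install beautifulsoup4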

A basic scraper with exception handling

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import sys


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("https://blog.csdn.net/qq_35706045")
if title is None:
    print("Title could not be found")
else:
    print(title)
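The scraper above catches HTTPError only. As a hedged sketch of a slightly broader guard, the helper below (fetch is a hypothetical name) also catches URLError, which urllib raises when the server cannot be reached at all. A file:// URL pointing at a path assumed not to exist is used so the failure case can be tried without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server responded, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # The resource could not be reached at all.
        print("URL error:", e.reason)
    return None


# This path is assumed not to exist, so the call reports a URL error.
print(fetch("file:///no/such/dir/page.html"))
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the URLError branch would handle both cases.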

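Because several Python versions may be installed side by side, take care when installing with pip to use the pip that belongs to the Python version you are targeting. This avoids strange errors.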
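One way to keep pip and the interpreter matched is to call pip as a module of a specific interpreter. The sketch below only builds and prints such a command; pip_install_command is a hypothetical helper, and nothing is installed when it runs.

```python
import shlex
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


print("Python version:", ".".join(map(str, sys.version_info[:3])))
print("Command:", shlex.join(pip_install_command("beautifulsoup4")))
```

Running the printed command installs the package for exactly the interpreter shown, whichever pip happens to be first on the PATH.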

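Alternatively, as the book Python網絡資料采集 (Web Scraping with Python) describes in the passage below, you can use virtual environments to work with different Python versions, mainly Python 2 and Python 3.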

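Keeping libraries in virtual environments

If you are responsible for several Python projects at once, want an easy way to bundle a project together with its libraries, or are worried about conflicts between installed libraries, you can install a Python virtual environment to keep everything separated.

When you install a Python library without a virtual environment, you are actually installing it globally.

This usually requires administrator privileges or installing as root, and the library then exists for every user and every project on the machine. Fortunately, creating a virtual environment is simple:

$ virtualenv scrapingEnv

This creates a new environment called scrapingEnv, which you must activate before using it:

$ cd scrapingEnv/
$ source bin/activate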
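For comparison, an environment can also be created from Python itself. Note the assumption: this sketch uses the standard library's venv module rather than the virtualenv command shown above, and create_env is a hypothetical helper. It builds the environment in a temporary directory and skips installing pip so that it runs quickly and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="scrapingEnv"):
    """Create a virtual environment under parent_dir and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False skips bootstrapping pip into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every environment created by venv contains a pyvenv.cfg file.
    config_exists = (env_dir / "pyvenv.cfg").is_file()
    print(config_exists)
```

For everyday use, pass with_pip=True and a permanent directory so that packages can be installed into the environment.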

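Once the environment is activated, its name appears in front of the command-line prompt as a reminder that you are working inside a virtual environment.

Any libraries you install and any programs you run from then on operate within that environment.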
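Besides the prompt, a script can check for itself whether it is running inside a virtual environment. This is a small sketch; in_virtual_environment is a hypothetical helper, and the sys.real_prefix test is included on the assumption that older virtualenv releases set that attribute.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Inside an environment, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the underlying installation.
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print(in_virtual_environment())
print(sys.prefix)
```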

In the newly created scrapingEnv environment, you can install and use BeautifulSoup:

(scrapingEnv)ryan$ pip install beautifulsoup4

(scrapingEnv)ryan$ python
> from bs4 import BeautifulSoup
>

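When you no longer need the libraries in the virtual environment, leave it with the deactivate command:

(scrapingEnv)ryan$ deactivate
ryan$ python
> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'bs4'

Keeping all of a project's libraries in a separate virtual environment also makes it easy to package the whole environment and send it to other people. As long as their Python version matches yours, the packaged code can run directly from the virtual environment without installing any further libraries.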
