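I don't know whether NetEase Cloud Music itself is popular, but its comments certainly are. I have seen plenty of articles about scraping NetEase Cloud Music. Today I tried it myself and found a few small pitfalls, especially for beginners, so here is how I did it and the problems I ran into.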

Finding the songs

The NetEase Cloud Music site is at https://music.163.com/. Open the link and select the hot songs chart.


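I picked the hot songs because they have the most comments, so there is more data to scrape. If you use Chrome, press F12 to open the developer tools and find the request that loads the song list. For other browsers, look up the equivalent tool yourself (Firebug and the like) and install it.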

Then find the URL the list is loaded from (it is the song_url used in the code below).


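Use this link to load the list, but note that the page embeds its content in an iframe, so a plain XPath query against the outer page finds nothing. That is why selenium is needed. There is nothing special about selenium: it drives a real browser to do the crawling, which is a little slower but covers basically every need. What we have to do here is switch into the iframe and read the HTML inside it. For parsing, this post again uses bs (BeautifulSoup); the next one will use XPath.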

Installing selenium

pip install selenium

Installing the webdriver

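Download: http://npm.taobao.org/mirrors/chromedriver/2.43/. Pick the version that matches your installed Chrome, then unzip it to a directory you use regularly. Its path is needed below.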

Getting the soup

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

song_url = "https://music.163.com/discover/toplist?id=3778678"
# Create the driver (driver_path is where chromedriver was unzipped)
driver_path = "/Users/menghaibin/Downloads/chromedriver"
chrome_options = Options()
chrome_options.add_argument('--headless')
drive = webdriver.Chrome(driver_path, chrome_options=chrome_options)
# Request headers
headers = {
    "Host": "music.163.com",
    "Referer": "https://music.163.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}

# Load the page, switch into the iframe, and return its HTML as a soup
def getSoup(url):
    drive.get(url)
    iframe = drive.find_elements_by_id('g_iframe')[0]
    drive.switch_to.frame(iframe)
    return BeautifulSoup(drive.page_source, "lxml")

The main idea is to use the selenium driver to locate the iframe, switch into it, read the HTML inside it, and build the soup from that.
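
getSoup above looks up the iframe with find_elements_by_id, which Selenium 4.3 and later no longer provide. As a rough sketch (not the author's code), the same step can be written against only get, switch_to.frame, switch_to.default_content and page_source, which exist in both Selenium 3 and 4. The name get_frame_html is hypothetical, and the sketch assumes the iframe id is still g_iframe.

```python
def get_frame_html(driver, url, frame_id="g_iframe"):
    """Load url, switch into the iframe, and return the iframe's HTML.

    driver is any Selenium WebDriver instance. frame_id is passed to
    switch_to.frame as a string, which Selenium resolves as an id or name.
    """
    driver.get(url)
    driver.switch_to.frame(frame_id)
    try:
        # While inside the frame, page_source is the frame's document.
        return driver.page_source
    finally:
        # Return to the top-level document so later calls start clean.
        driver.switch_to.default_content()
```

The returned string can be handed to BeautifulSoup exactly as in getSoup, for example BeautifulSoup(get_frame_html(drive, song_url), "lxml").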

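Next comes querying the song list and the song pages. This is similar to the previous post, and getting the song information is no big problem. But when I looked at how other people scrape the comments, they use the comment request that the page itself sends.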


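It is a POST request, and its parameters are encrypted. I searched online and found that an expert had actually worked out the encryption. Impressive. I read through the expert's analysis, which was also impressive, but I could not follow it. So I fell back to the old approach and used selenium's webdriver to drive Chrome and read the information. The result is the same anyway.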

Here is the code.

Full version

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

song_url = "https://music.163.com/discover/toplist?id=3778678"
# comment_url = "http://music.163.com/api/v1/resource/comments/R_SO_4_516997458?limit=20&offset=40"

driver_path = "/Users/menghaibin/Downloads/chromedriver"
chrome_options = Options()
chrome_options.add_argument('--headless')
drive = webdriver.Chrome(driver_path, chrome_options=chrome_options)

headers = {
    "Host": "music.163.com",
    "Referer": "https://music.163.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}


def getSoup(url):
    drive.get(url)
    iframe = drive.find_elements_by_id('g_iframe')[0]
    drive.switch_to.frame(iframe)
    return BeautifulSoup(drive.page_source, "lxml")


# Collect rank, link, name, id, duration and artist for every song in the chart
def getAllSong():
    soup = getSoup(song_url)
    nodes = soup.select(".m-table-rank tbody tr")
    players = []
    for node in nodes:
        rank = node.select_one(".num").get_text()
        song_href = "https://music.163.com" + node.select("td")[1].select_one("a")["href"]
        song_name = node.select("td")[1].select_one("b")["title"]
        song_id = node.select_one(".ply")["data-res-id"]
        song_time = node.select("td")[2].select_one(".u-dur").get_text()
        song_player = node.select("td")[3].select_one("span")["title"]

        song_info = {
            "rank": rank,
            "song_href": song_href,
            "song_name": song_name,
            "song_id": song_id,
            "song_time": song_time,
            "song_player": song_player
        }
        players.append(song_info)
    return players


def getComments(song_href):
    soup = getSoup(song_href)
    comment_nodes = soup.select(".cmmts .itm")
    comments = []
    for node in comment_nodes:
        comment_user = node.select_one(".s-fc7").get_text()
        comment_content = node.select_one(".f-brk").get_text()
        # Split only on the first full-width colon so colons inside the comment are kept
        comment_content_str = str(comment_content).split("：", 1)[1]
        comment_time = node.select_one("div .time").get_text()
        comment_thumb_up = node.select_one("div .rp a").get_text()
        comment_thumb_up_str = str(comment_thumb_up).replace("(", "").replace(")", "").strip()

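        # Keep only popular comments: a count shown with "万" (ten thousand) or above 1000.
        # The site renders these strings in Simplified Chinese.
        if (comment_thumb_up_str.find("万") > 0 or (comment_thumb_up_str != '回复' and
                                                   comment_thumb_up_str != '' and
                                                   int(comment_thumb_up_str) > 1000)):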
            comment = {"user": comment_user,
                       "content": comment_content_str,
                       "time": comment_time,
                       "thumb_up": comment_thumb_up_str}
            comments.append(comment)
    return comments


if __name__ == '__main__':
    list_songs = getAllSong()
    for song in list_songs:
        print(song)
        print(getComments(song["song_href"]))
        print("-" * 50)
    drive.quit()
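
The like-count filter inside getComments can also be pulled out into a small helper, which makes it easier to check on its own. This is only a sketch, not the author's code: parse_like_count and is_popular are hypothetical names, and it assumes the link text looks like "(1234)" or "(1.2万)". It accepts either 万 or 萬 as the ten-thousand suffix and treats anything non-numeric (such as the reply link text) as zero.

```python
def parse_like_count(text):
    """Convert link text such as "(1234)" or "(1.2万)" to an integer."""
    cleaned = text.replace("(", "").replace(")", "").strip()
    if not cleaned:
        return 0
    multiplier = 1
    if cleaned[-1] in ("万", "萬"):
        # A trailing ten-thousand marker scales the number by 10000.
        multiplier = 10000
        cleaned = cleaned[:-1]
    try:
        return int(round(float(cleaned) * multiplier))
    except (ValueError, OverflowError):
        # Non-numeric text (for example a "reply" link) counts as zero likes.
        return 0


def is_popular(text, threshold=1000):
    """Return True when the like count is strictly above threshold."""
    return parse_like_count(text) > threshold
```

With this in place, the condition in getComments would reduce to a single is_popular(comment_thumb_up) call.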

A lazy way to get the comments

However, I came across an expert online who tweaked the comment API a little and, surprisingly, could still fetch the comments. Here it is:

http://music.163.com/api/v1/resource/comments/R_SO_4_516997458?limit=20&offset=40

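It really does work, which is impressive. I did not use this method directly and stuck with the old approach, but this endpoint returns JSON, which is quite convenient.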
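
To make the URL pattern concrete, here is a minimal sketch of building it for any song and page, plus a parser for the JSON it returns. Several things here are assumptions rather than facts from this post: that the number after R_SO_4_ is the song id, and that the response has a top-level "comments" list whose items carry "user.nickname", "content" and "likedCount". The names build_comment_url and extract_comments are hypothetical. No network request is made; fetching the URL is left to whatever HTTP client you prefer.

```python
import json
from urllib.parse import urlencode

# URL pattern taken from the endpoint quoted above.
API_BASE = "http://music.163.com/api/v1/resource/comments/R_SO_4_{song_id}"


def build_comment_url(song_id, limit=20, offset=0):
    """Build the comment URL for one page of one song."""
    query = urlencode({"limit": limit, "offset": offset})
    return API_BASE.format(song_id=song_id) + "?" + query


def extract_comments(payload):
    """Pull user, content and like count out of a JSON response body.

    The field names are assumed, so missing keys simply yield None.
    """
    data = json.loads(payload)
    results = []
    for item in data.get("comments", []):
        results.append({
            "user": item.get("user", {}).get("nickname"),
            "content": item.get("content"),
            "thumb_up": item.get("likedCount"),
        })
    return results
```

Paging then amounts to calling build_comment_url with offset increased by limit each time.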

Summary

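Until now I had only done small, simple scrapes. This was the first time I ran into an iframe, and I learned a lot from it. I have only just started scraping with Python and have not yet found a better tool than selenium for this.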
