Scraping NetEase Cloud Music Comments and Related Information with Python

  • urllib
  • requests
  • Regular expressions
  • Scraping NetEase Cloud Music comments and related information

Understanding urllib

Reference:

https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001432688314740a0aed473a39f47b09c8c7274c9ab6aee000/

Understanding requests

Reference:

http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

Regular expressions

Reference:

http://www.runoob.com/python/python-reg-expressions.html

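Scraping NetEase Cloud Music comments and related information

1. Analyze the NetEase Cloud Music page

2. Obtain the encrypted parameters params and encSecKey

3. Scrape NetEase Cloud Music comments and related information

1. Analyzing the NetEase Cloud Music page

Reference: https://blog.csdn.net/fengxinlinux/article/details/77950209

To scrape NetEase Cloud Music comments, first open the page of the song whose comments you want in a browser, for example: https://music.163.com/#/song?id=862102137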

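[Figure: screenshot of the song page]

Then, in Chrome (other browsers work similarly), press F12 to open the developer tools, press F5 to reload, and follow the arrows in the figure below in order.

[Figure: screenshot of the developer tools with arrows marking the steps]

The figure below shows the page information.

[Figure: screenshot of the page information]

The figure below shows the hot comment content. Fetch the page content and then parse the content field; the parsing steps follow below.

[Figure: screenshot of the hot comment content]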

2. Obtaining the encrypted parameters params and encSecKey

Reference: https://www.zhihu.com/question/36081767
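
The script in part 3 (3) builds params by running a small request string through AES in CBC mode twice and base64-encoding each result. CBC only accepts input whose length is a multiple of the 16-byte block size, so the script pads the text first, appending N characters of value N. The sketch below isolates that padding step in pure Python so it can be tried without installing a crypto library. The helper names pkcs7_pad and pkcs7_unpad are illustrative and do not appear in the original script.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes, each of value N, so the length is a multiple of block_size."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes) -> bytes:
    """Remove the padding added by pkcs7_pad."""
    pad_len = padded[-1]
    return padded[:-pad_len]


# The first-page request string used by the script in part 3 (3).
sample = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'.encode("utf-8")
padded = pkcs7_pad(sample)

print(len(padded) % BLOCK_SIZE)       # 0: ready for AES-CBC
print(pkcs7_unpad(padded) == sample)  # True: padding is reversible
```

Note that input which is already a multiple of 16 bytes still gets a full extra block of padding; that is what makes the padding unambiguous to remove.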

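3. Scraping NetEase Cloud Music comments and related information (the code contains some redundancy)

(1) Hiding your identity with a User Agent and proxy IPs: why set a User Agent

Reference: https://blog.csdn.net/c406495762/article/details/60137956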

agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

import random

# When scraping comments for many songs, pick a random User Agent each time
header = {'User-Agent': ''.join(random.sample(agents, 1))}
# random.sample() returns a list; ''.join() turns that list into a string
print(header)
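
A possible variation on the snippet above, offered as a sketch rather than the author's method: random.choice returns a single element directly, so no join is needed. The function name build_headers and the shortened USER_AGENTS list are illustrative assumptions.

```python
import random

# A shortened stand-in for the agents list above (two entries copied from it).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_headers(agents=USER_AGENTS):
    """Return request headers with one randomly chosen User-Agent string."""
    return {"User-Agent": random.choice(agents)}


print(build_headers())
```

Calling build_headers() once per request gives each request its own User Agent, whereas a header built once at import time stays the same for the whole run.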
           

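(2) Scraping the hot comments of a given song

Note: inspecting the page shows that hot comments appear only on the first page of each song's comments, and there are 15 of them.

Reference: https://blog.csdn.net/fengxinlinux/article/details/77950209

Code note: the url and data values in the code are copied from the parts circled in the figures above.
# -*-coding:utf-8-*-

"""
   Scrape the 15 hot comments of a given NetEase Cloud Music song.
   June 26, 2018

"""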

import urllib.request
import urllib.error
import urllib.parse
import json


# Fetch the hot comments from the given NetEase Cloud Music url
def get_hotComments():
    url = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_862102137?csrf_token='   # comment url
    header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    # Form data for the POST request
    data = {'params':'LPkOcWb/uz2Nj6xw+RFhGJ1PkFi4+lh4agK+1jRGmjMAiOcJ5RHxQBbZa+aME54AUdi21JkqLu/yeHjjIaLQJ4wzqiuzrzYUKciRCqmCDX9ziKoktv5mgvvlA5+A9a7kUF5pabudBaWjsut+9T5kfHQBv75fIcDRt/Anyb8FnG/Ro6R8IStD0/JknFvH5+3S',
            'encSecKey':'5627cc7941cf4cbd59b13668efe38a622ed0889d33cdcf603d18b025eb34ac434c882ed5ad16ca06e88e40a8b91de455483d0b88b6b460ca146b163e67ebab688b2feb4f22369db85a926744bad9114d3cddd00ca6255d7cdcad6cf7b9300a6fdf49adf983087cd830131fabbac39ec4a526432958309cf92c0b5a6bc177078b'}
    postdata = urllib.parse.urlencode(data).encode('utf8')  # URL-encode the form and convert it to bytes
    request = urllib.request.Request(url, headers=header, data=postdata)
    response = urllib.request.urlopen(request).read().decode('utf8')
    json_dict = json.loads(response)   # parse the JSON response
    hot_comment = json_dict['hotComments']  # the hot comments in the JSON

    num = 1
    for item in hot_comment:
        print('Comment %d: ' % num + item['content'])
        num += 1

if __name__ == '__main__':
    get_hotComments()
           

The output is shown in the figure below. All 15 comments are returned; the screenshot has room for only 6.

[Figure: screenshot of the program output]

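(3) Scraping comment data for each of the 199 songs on the NetEase Cloud Music hot songs chart

Problem analysis: we first need to get the page for each song.

References:

Getting song names and song ids: https://blog.csdn.net/fengxinlinux/article/details/77950209

Getting the comments from all pages of a song: https://www.cnblogs.com/weixuqin/p/8905867.html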
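
As a minimal sketch of the name and id extraction, the example below runs the same kind of regular expression against a hand-written HTML fragment instead of the live page. It assumes the chart page carries a hidden list of the form ul class="f-hide" whose links look like /song?id=..., which is what the script below expects. SAMPLE_HTML and extract_songs are made up for illustration.

```python
import re

# Hand-written stand-in for the hidden song list on the chart page (assumed shape).
SAMPLE_HTML = (
    '<ul class="f-hide">'
    '<li><a href="/song?id=1001">First Song</a></li>'
    '<li><a href="/song?id=1002">Second Song</a></li>'
    '</ul>'
)

# One pattern with two groups keeps each id paired with its own name.
SONG_LINK = re.compile(r'<li><a href="/song\?id=(\d+)">(.*?)</a></li>')


def extract_songs(html):
    """Return a list of (song_id, song_name) tuples found in the hidden list."""
    match = re.search(r'<ul class="f-hide">(.*?)</ul>', html, re.S)
    if match is None:
        return []  # layout changed or the list is missing
    return SONG_LINK.findall(match.group(1))


songs = extract_songs(SAMPLE_HTML)
for song_id, name in songs:
    print(song_id, name)
```

Capturing the id and the name in one pass avoids the two result lists drifting out of step if one pattern misses an entry.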

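Code note 1: if the line from Crypto.Cipher import AES fails with "No module named Crypto.Cipher", see this article:

https://blog.csdn.net/qiang12qiang12/article/details/80805016

When a song has fewer comment pages than the requested number, the code can either skip the empty pages or break.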
# -*- coding:utf-8 -*-

"""
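    Scrape the latest comments for the NetEase Cloud Music hot songs chart:
    all comments on a given number of pages, e.g. the first 10 pages.
    June 26, 2018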

"""

import os
import re
import random
import urllib.request
import urllib.error
import urllib.parse
from Crypto.Cipher import AES
import base64
import requests
import json
import time


agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

headers = {
    'Host':'music.163.com',
    'Origin':'https://music.163.com',
    'Referer':'https://music.163.com/song?id=28793052',
    'User-Agent':''.join(random.sample(agents, 1))
}

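# Apart from the first parameter, the parameters are fixed and can be reused as-is
# offset is (page number - 1) * 20; total is true for the first page and false for the rest
# First parameter
# first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
# Second parameter
second_param = "010001"
# Third parameter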
third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
# Fourth parameter
forth_param = "0CoJUm6Qyw8W8jud"


# Build the encrypted params value
def get_params(page):  # page is the page number, starting at 1
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'  # 16-character key that pairs with the fixed encSecKey below
    if(page == 1):  # first page
        first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
        h_encText = AES_encrypt(first_param, first_key, iv)
    else:
        offset = str((page-1)*20)
        first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset,'false')
        h_encText = AES_encrypt(first_param, first_key, iv)
    h_encText = AES_encrypt(h_encText, second_key, iv)  # second encryption pass
    return h_encText


# Get encSecKey
def get_encSecKey():
    encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"
    return encSecKey


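# AES encryption (CBC mode), returning base64 text
def AES_encrypt(text, key, iv):
    data = text.encode('utf-8')  # the cipher works on bytes, not str
    pad = 16 - len(data) % 16  # pad up to a multiple of the 16-byte AES block
    data = data + bytes([pad]) * pad
    encryptor = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv.encode('utf-8'))
    encrypt_text = encryptor.encrypt(data)
    encrypt_text = base64.b64encode(encrypt_text)
    encrypt_text = str(encrypt_text, encoding="utf-8")  # required: convert the base64 bytes back to str
    return encrypt_text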


# Fetch the comment JSON data
def get_json(url, params, encSecKey):
    data = {
         "params": params,
         "encSecKey": encSecKey
    }
    response = requests.post(url, headers=headers, data=data)
    return response.content.decode('utf-8')  # decode the response body


# Get the names and ids of all songs on the hot songs chart
def get_all_hotSong():
    url = 'http://music.163.com/discover/toplist?id=3778678'    # NetEase Cloud Music hot songs chart url
    header = {'User-Agent': ''.join(random.sample(agents, 1))}  # random.sample() returns a list; ''.join() turns it into a string
    request = urllib.request.Request(url=url, headers=header)
    html = urllib.request.urlopen(request).read().decode('utf8')   # fetch the page and decode it
    html = str(html)     # make sure it is a str
    # print(html)
    pat1 = r'<ul class="f-hide"><li><a href="/song\?id=\d*?">.*</a></li></ul>'  # first pass: the hidden song list
    result = re.compile(pat1).findall(html)     # filter with the regular expression
    # print(result)
    result = result[0]     # findall returns a list; take the first match

    pat2 = r'<li><a href="/song\?id=\d*?">(.*?)</a></li>'  # regular expression for song names
    pat3 = r'<li><a href="/song\?id=(\d*?)">.*?</a></li>'   # regular expression for song ids
    hot_song_name = re.compile(pat2).findall(result)     # names of all hot songs
    hot_song_id = re.compile(pat3).findall(result)     # ids of all hot songs
    # print(hot_song_name)
    # print(hot_song_id)

    return hot_song_name, hot_song_id


# Scrape the first `page` pages of comments for one song
def get_all_comments(hot_song_id, page, hot_song_name, hot_song_order):  # hot_song_order adds a number to the file name
    all_comments_list = []  # holds all comments
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_' + hot_song_id + '?csrf_token='   # comment url

    comment_dir = os.path.join(os.getcwd(), 'Comments')
    if not os.path.exists(comment_dir):     # create the output folder if it does not exist yet
        os.makedirs(comment_dir)

    num = 0
    f = open(os.path.join(comment_dir, str(hot_song_order) + ' ' + hot_song_name + '.txt'), 'w', encoding='utf-8')
    # The space separates the order number from a song name that may itself start with digits;
    # 'w' overwrites the file, use 'a' to append instead

    for i in range(page):  # fetch page by page
        # print(url, i)
        params = get_params(i+1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        # print(json_text)

        json_dict = json.loads(json_text)
        for item in json_dict['comments']:
            comment = item['content']  # comment text
            num += 1
            f.write(str(num) + '.' + comment + '\n')
            comment_info = str(comment)
            all_comments_list.append(comment_info)
        print('Song %d: page %d scraped!' % (hot_song_order, i+1))
        # time.sleep(random.choice(range(1, 3)))   # if scraping too fast, sleep to slow down and reduce server load
    f.close()
    # print(all_comments_list)
    # print(len(all_comments_list))
    return all_comments_list


if __name__ == '__main__':
    start_time = time.time()  # start time

    hot_song_name, hot_song_id = get_all_hotSong()

    num = 0
    while num < len(hot_song_name):    # save comments for every song on the hot songs chart
        print('Scraping comments for song %d...' % (num+1))
        # Popular songs have many comments, so scrape only the latest 10 pages of each
        get_all_comments(hot_song_id[num], 10, hot_song_name[num], num+1)
        print('Comments for song %d scraped successfully' % (num+1))
        num += 1

    end_time = time.time()  # end time
    print('Elapsed time: %f seconds.' % (end_time - start_time))
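
The paging rule stated in the script's comments (offset is (page number - 1) * 20, total is true only on the first page) can be checked on its own, before any encryption is involved. The sketch below builds the plaintext request string for a given page; build_first_param and PAGE_SIZE are illustrative names, not part of the original script.

```python
PAGE_SIZE = 20  # comments per page, matching limit:"20" in the script


def build_first_param(page, page_size=PAGE_SIZE):
    """Return the plaintext request string for a 1-based comment page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * page_size
    total = "true" if page == 1 else "false"
    return '{rid:"", offset:"%d", total:"%s", limit:"%d", csrf_token:""}' % (
        offset, total, page_size)


for page in (1, 2, 3):
    print(page, build_first_param(page))
```

Printing the plaintext for a few pages is a quick way to confirm the offsets (0, 20, 40) before passing the string on to the encryption step.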
           
Code note 2: when a song has fewer comment pages than the requested number, the code can either break or keep requesting empty pages. The break version is below (swap it in for the function above; note that it takes the comment url directly):
def get_all_comments(url, page):
    all_comments_list = []  # holds all comments
    num = 0
    f = open('TTTTTTTTTTTest.txt', 'a', encoding='utf-8')  # output file, opened in append mode
    for i in range(page):  # fetch page by page
        print(url, i)
        params = get_params(i+1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        # print(json_text)

        json_dict = json.loads(json_text)
        onePageComments = []  # used to detect the last page: if a page has 0 comments, stop scraping
        for item in json_dict['comments']:
            comment = item['content']  # comment text
            onePageComments.append(comment)
            num += 1
            f.write(str(num) + '.' + comment + '\n')
            comment_info = str(comment)
            all_comments_list.append(comment_info)
        if len(onePageComments) == 0:
            break
        print("Comments on this page:", len(onePageComments))
        print('Page %d scraped!' % (i+1))
        # time.sleep(random.choice(range(1, 3)))   # if scraping too fast, sleep to slow down and reduce server load
    f.close()
    return all_comments_list
           
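For example, if the page count is set to 50 but the song has only 19 pages of comments, scraping stops once those 19 pages are done.

[Figure: screenshot of the run stopping after page 19]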
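
The stop-on-empty-page logic can be tried without touching the network. In the sketch below, fake_fetch_page is a hypothetical stand-in for the real request and simply returns three pages of comments followed by empty pages; collect_comments is likewise an illustrative name.

```python
def collect_comments(fetch_page, max_pages):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page."""
    all_comments = []
    for page in range(1, max_pages + 1):
        comments = fetch_page(page)
        if not comments:  # an empty page means there is nothing left to scrape
            break
        all_comments.extend(comments)
    return all_comments


def fake_fetch_page(page):
    """Hypothetical stand-in for the network call: 3 pages of 2 comments each."""
    if page > 3:
        return []
    return ["comment %d-%d" % (page, n) for n in (1, 2)]


result = collect_comments(fake_fetch_page, max_pages=50)
print(len(result))  # 6
```

Passing the fetch function in as an argument keeps the loop testable: the same collect_comments would work with a function that wraps the real get_params and get_json calls.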

(4) Scraping other data for each song

This includes the total number of comments and each commenter's nickname, avatar, ID, and so on; parse whatever you need from the response.
def get_all_comments(url, page):
    all_comments_list = []  # holds all comments
    num = 0
    os.makedirs('headPortrait', exist_ok=True)  # make sure the avatar folder exists
    f = open('Test.txt', 'w', encoding='utf-8')  # output file
    for i in range(page):  # fetch page by page
        print(url, i)
        params = get_params(i+1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        # print(json_text)

        json_dict = json.loads(json_text)
        print("Total comments:", json_dict["total"])  # total number of comments

        for item in json_dict['hotComments']:
            comment = item['content']  # comment text
            print("User info:", item['user'])
            # Commenter's nickname, avatar and ID
            nickname = item['user']['nickname']
            avatarUrl = item['user']['avatarUrl']
            userId = item['user']['userId']

            # Download the avatar from the scraped link to a local file
            urllib.request.urlretrieve(avatarUrl, os.path.join('headPortrait', str(userId) + '.jpg'))

            num += 1
            f.write(str(num) + '.' + comment + '\n')
            comment_info = str(comment)
            all_comments_list.append(comment_info)
        print('Page %d scraped!' % (i+1))

        # time.sleep(random.choice(range(1, 3)))   # if scraping too fast, sleep to slow down and reduce server load
    f.close()

    return all_comments_list
           
For example, fetching the user information on the first page.

[Figure: screenshot of the first page of user information]
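
To experiment with the parsing step offline, the sketch below pulls the same fields (content, nickname, userId, avatarUrl, total) out of a small hand-written JSON string. SAMPLE_RESPONSE is invented test data shaped like the fields the function above reads, not a real API response, and parse_comments is an illustrative helper.

```python
import json

# Invented sample shaped like the fields read by the code above.
SAMPLE_RESPONSE = json.dumps({
    "total": 2,
    "hotComments": [
        {"content": "Great song",
         "user": {"nickname": "alice", "userId": 1,
                  "avatarUrl": "http://example.com/a.jpg"}},
        {"content": "On repeat",
         "user": {"nickname": "bob", "userId": 2,
                  "avatarUrl": "http://example.com/b.jpg"}},
    ],
})


def parse_comments(json_text, key="hotComments"):
    """Return (total, records), where each record is a flat dict for one comment."""
    data = json.loads(json_text)
    records = []
    for item in data.get(key, []):  # a missing key yields no records
        user = item.get("user", {})
        records.append({
            "content": item.get("content", ""),
            "nickname": user.get("nickname"),
            "userId": user.get("userId"),
            "avatarUrl": user.get("avatarUrl"),
        })
    return data.get("total"), records


total, records = parse_comments(SAMPLE_RESPONSE)
print(total, [r["nickname"] for r in records])
```

Using dict.get with defaults means a page that lacks the requested key produces an empty list instead of a KeyError; pass key="comments" to read regular comments the same way.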
