
Comprehensive Web Scraping Exercise

Instructor: MissDu

I. Saving the scraped content to a MySQL database

import pandas as pd
import pymysql
from sqlalchemy import create_engine

conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
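The same `to_sql` flow can be sketched end to end against an in-memory SQLite engine, so it runs without a MySQL server; the `allnews` rows below are made-up stand-ins for the scraped data, and for production you would swap in the `mysql+pymysql` URL above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for MySQL so the sketch is self-contained
engine = create_engine("sqlite://")

allnews = [  # stand-in for the scraped news list
    {"title": "news A", "clicks": 120},
    {"title": "news B", "clicks": 85},
]
df = pd.DataFrame(allnews)
df.to_sql(name="news", con=engine, if_exists="append", index=False)

# Read the table back to confirm both rows were written
out = pd.read_sql("SELECT title, clicks FROM news", con=engine)
print(len(out))  # 2
```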

II. Comprehensive scraping project

  1. Pick a trending topic or a subject you are interested in.
  2. Decide on the target and scope of the crawl.
  3. Understand the target's restrictions and constraints.
  4. Crawl the relevant content.
  5. Perform data analysis and text analysis.
  6. Write an article that includes an explanation, the technical highlights, the data, a graphical presentation of the data analysis with commentary, and a graphical presentation of the text analysis with commentary.
  7. Publish the article publicly.

References:

32 Python crawler projects

Who exactly is opposing 996?

Python and Java pay the most, C# the least!

What is the mindset of people rating "The Wandering Earth" one star?

Are the bullet-screen comments on "All Is Well" more entertaining than the plot?

I crawled my own WeChat friends, and it turns out this is who they are…

Spring Festival population-migration big-data report!

Consumption-trend data in the run-up to Qixi

I crawled the bra purchase records on Tmall and made some blush-worthy discoveries...

Six million characters of lyrics analyzed with Python: what Chinese rappers are actually singing about

After analyzing 420,000 characters of lyrics, I finally figured out what folk singers sing about

The true faces of the twelve zodiac signs

What were the relationships among Tang-dynasty poets really like?

The ranking of Chinese surnames

III. Scraping precautions

1. Set a reasonable interval between requests: it avoids putting pressure on the target site's operations staff, and it keeps your program from being forcibly cut off.

import time
import random

time.sleep(random.random() * 3)

2. Set a sensible user-agent so your requests impersonate a real browser when fetching content.

  1. First, open your browser and navigate to about:version.
  2. Copy the "User Agent" string shown there.
  3. Collect the user-agent strings of a few commonly used browsers into a list.
  4. Then import random and pick one user-agent at random.
  5. Define the request-header dict: headers = {'User-Agent': ...}
  6. When calling requests.get, pass the headers carrying your custom User-Agent.
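The steps above can be sketched as follows; the two user-agent strings in the pool are illustrative examples only:

```python
import random

# A small pool of common browser user-agent strings (illustrative examples)
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) '
    'Gecko/20100101 Firefox/4.0.1',
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(UA_POOL)}

headers = random_headers()
# requests.get(url, headers=headers)  # attach the headers to each request
```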

3. When login is required

When calling requests.get, pass headers that carry your Cookie:

headers = {'User-Agent': '',
           'Cookie': ''}

4. Use proxy IPs

Rotate the IP address you request from, so the crawl can keep running efficiently without being blocked.

headers = {
    "User-Agent": "",
}

proxies = {
    "http": "",
    "https": "",
}

response = requests.get(url, headers=headers, proxies=proxies)
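Rotation itself can be sketched by choosing a proxy at random per request; the addresses below are placeholders, not working proxies:

```python
import random

# Placeholder proxy pool -- substitute real proxy addresses here
PROXY_POOL = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

proxies = random.choice(PROXY_POOL)  # a potentially different exit IP each request
# response = requests.get(url, headers=headers, proxies=proxies)
```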

Fetching video information from the BILIBILI daily rankings

Fetch the Bilibili site-wide daily ranking, then extract each video's tags and comments.

Fetching the comments

API: http://api.bilibili.cn/feedback

Parameters

aid       true    int     AV number
page                      page number
pagesize  false           records returned per page; at most 300, default 10
ver                       API version; the latest is 3
order             string  sort order; defaults to reverse-chronological. Options: good = by like count, hot = by hot replies
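The parameters above can be assembled into a query URL without sending anything, via `requests.Request(...).prepare()`; the values here simply echo the table and the example AV number used later:

```python
from requests import Request

# Assemble the documented parameters into a feedback-API URL (not sent)
params = {"aid": 50164983, "page": 1, "pagesize": 20, "ver": 3, "order": "hot"}
prepared = Request("GET", "http://api.bilibili.cn/feedback", params=params).prepare()
print(prepared.url)
```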

ver1

Field        Type    Description
mid                  member ID
lv                   floor number
fbid                 comment ID
msg                  comment text
ad_check             status (0: normal, 1: hidden by uploader, 2: deleted by moderator, 3: deleted after being reported)
face                 commenter's avatar
rank                 commenter's display badge
nick                 commenter's nickname
totalResult          total number of comments
pages                total number of pages

reply

good                 number of likes?
isgood               whether already liked?
device               unknown
create               UNIX time the comment was created
create_at    String  human-readable creation time (2016-01-20 15:52)
reply_count          number of replies
level_info   list    the user's level info?
sex                  the user's gender

Example: AV number = 50164983:

http://api.bilibili.com/x/reply?type=1&oid=50164983&pn=1&nohot=1&sort=0

(screenshot)

The response:

(screenshot)

Now look at the Preview tab:

(screenshot)

We can see size = 20 and count = 2824; the username sits under [member][uname] and the reply text under [content][message].

...and so on.
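Pulling the username and message out of that structure can be sketched against a hand-written sample payload (the values below are made up, only the field layout matches the API description):

```python
# Hand-written sample mimicking the reply-API JSON described above
sample = {
    "data": {
        "page": {"size": 20, "count": 2824},
        "replies": [
            {"member": {"uname": "user_a"}, "content": {"message": "nice video"}},
            {"member": {"uname": "user_b"}, "content": {"message": "first!"}},
        ],
    }
}

infolist = [
    {"username": r["member"]["uname"], "text": r["content"]["message"]}
    for r in sample["data"]["replies"]
]
print(infolist[0])  # {'username': 'user_a', 'text': 'nice video'}
```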

Tag analysis

(screenshot)

We can see that Bilibili users mostly prefer lifestyle- and kichiku-type videos.

Comment data for the top-ranked video (PS: I have nothing against Mr. Cai)

(screenshot)

Here it is in the database:

(screenshot)

The rankings:

(screenshot)
(screenshot)

The code follows.

Headers

UA = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile ',
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
    'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
    'Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0',
    'Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) ',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
    'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999',
]

from random import choice  # used below to pick a random user-agent from UA

headers = {
        'Referer': 'https://www.bilibili.com/v/douga/mad/?spm_id_from=333.334.b_7072696d6172795f6d656e75.3',
        'User-Agent': choice(UA)
    }

herder={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
'Host': 'www.bilibili.com',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7',
"Cookie": "fts=1519723607; pgv_pvi=9921057792; im_notify_type_5172000=0; LIVE_BUVID=d9102c76da863db3e7c92490dc7c1458; LIVE_BUVID__ckMd5=300ca52bca0020e2; im_local_unread_5172000=0; buvid3=633B41F7-7489-4AFF-A338-C6B691D748BF163029infoc; CURRENT_FNVAL=16; _uuid=154F2A25-2995-7B95-9278-CEB7B98119CB36766infoc; UM_distinctid=16797b478ab161-09c84bb5055ad7-b79183d-144000-16797b478ac59c; stardustvideo=-1; sid=iv38z60z; CURRENT_QUALITY=32; DedeUserID=5172000; DedeUserID__ckMd5=177188bf6c38a514; SESSDATA=7901bc88%2C1557721353%2Ccca68741; bili_jct=7b58735b2fbf739a2a7ca05ffb0aa722; rpdid=|(J~R)uJlkYl0J'ullYJluJYY; bp_t_offset_5172000=247013898595062047; _dfcaptcha=cf9b64400c2062d1a78de2019210c7fb",
}      

Comments

def getAllCommentList(id):
    # First request: read the total comment count to work out the page count
    url = "http://api.bilibili.com/x/reply?type=1&oid=" + str(id) + "&pn=1&nohot=1&sort=0"
    r = requests.get(url)
    json_text = json.loads(r.text)
    commentsNum = json_text["data"]["page"]["count"]
    page = commentsNum // 20 + 1                # 20 comments per page
    for n in range(1, page + 1):                # include the final partial page
        url = "https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=" + str(n) + "&type=1&oid=" + str(id) + "&sort=1&nohot=1"
        req = requests.get(url)
        json_text_list = json.loads(req.text)
        replies = json_text_list["data"]["replies"]
        if not replies:                         # ran past the last page
            break
        for i in replies:
            info = {}
            info['username'] = i['member']['uname']
            info['text'] = i['content']['message']
            infolist.append(info)               # infolist is a module-level list


def saveTxt(filename, filecontent):
    df = pd.DataFrame(filecontent)
    df.to_csv(filename + '.csv')
    print('Comments for video ' + filename + ' saved')
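The page arithmetic in getAllCommentList assumes 20 comments per page; with the count of 2824 seen earlier, the number of pages to request is the ceiling of count / 20, which can be checked directly:

```python
count = 2824      # total comments reported by the API
pagesize = 20     # comments returned per page

pages = -(-count // pagesize)  # ceiling division without importing math
print(pages)  # 142
```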

Tags

def gettag(id):
    ranksss = {}
    url = "https://www.bilibili.com/video/av" + str(id)
    tag = requests.get(url, headers=herder)
    tag.encoding = 'utf-8'
    tagsoup = BeautifulSoup(tag.text, 'html.parser')
    for ii in tagsoup.select('.tm-info'):
        ranksss['tag1'] = ii.select('.crumb')[1].text.replace('>', '')  # main category
        ranksss['tag2'] = ii.select('.crumb')[2].text                   # sub-category
    return ranksss

Main scraping loop

for ii in soup.select('.rank-list'):
    for ifo in ii.select('.rank-item'):
        ranks = {}
        rankUrl = ifo.select('.title')[0]['href']
        ranktitle = ifo.select('.title')[0].text
        ranknum = ifo.select('.data-box')[0].text        # play count
        rankdanmus = ifo.select('.data-box')[1].text     # danmaku count
        rankmaker = ifo.select('.data-box')[2].text      # uploader
        rankfie = ifo.select('.pts')[0].text.replace('綜合得分', '')  # overall score
        id = re.findall('(\d{7,8})', rankUrl)[-1]        # extract the AV number
        ranks = gettag(str(id))
        ranks['up'] = rankmaker
        ranks['title'] = ranktitle
        print(ranks['tag1'])
        ranks['url'] = rankUrl
        ranks['Play volume'] = ranknum
        ranks['Barrage'] = rankdanmus
        ranks['overall ratings'] = rankfie
        ranklist.append(ranks)
        with open('tag.txt', "a", encoding='utf-8') as txt:
            txt.write(ranks['tag1'] + ranks['tag2'])
        infolist.clear()
        getAllCommentList(id)        # fetch the comments for this AV number
        saveTxt(ranktitle, infolist)

Word cloud

from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import jieba


mask_png = plt.imread("fate.jpeg")
cloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simhei.ttf",  # the bundled font lacks CJK glyphs; use SimHei on Windows
    background_color="white",   # background colour
    max_words=500,              # maximum number of words displayed
    max_font_size=150,          # largest font size
    random_state=50,
    mask=mask_png,              # shape the cloud with the mask image
    width=1000, height=860, margin=2,)

def stopWordsList():
    # one stop word per line in csw.txt
    stopwords = [line.strip() for line in open('csw.txt', encoding='UTF-8').readlines()]
    return stopwords

txt = open(r'C:\Users\Ltp\Downloads\bd\tag.txt', 'r', encoding='utf-8').read()
stopWords = stopWordsList()
for exc in stopWords:
    txt = txt.replace(exc, '')
wordList = jieba.lcut(txt)
wordDict = {}
for word in wordList:
    if word not in stopWords and len(word) > 1:  # skip stop words and single characters
        wordDict[word] = wordDict.get(word, 0) + 1
wordCloudLS = list(wordDict.items())
wordCloudLS.sort(key=lambda x: x[1], reverse=True)
for i in range(35):                              # print the 35 most frequent words
    print(wordCloudLS[i])
wcP = " ".join(wordList)
mywc = cloud.generate(wcP)
plt.imshow(mywc)
plt.axis("off")
plt.show()