
Comprehensive Web Scraping Exercise

Instructor: MissDu

I. Saving the scraped content to a MySQL database

import pandas as pd
import pymysql
from sqlalchemy import create_engine

conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
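The same `to_sql` flow can be sketched end to end against an in-memory SQLite engine, so it runs without a MySQL server; the `allnews` rows below are made-up stand-ins for the scraped data, and for production you would swap in the `mysql+pymysql` URL above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for MySQL so the sketch is self-contained
engine = create_engine("sqlite://")

allnews = [  # stand-in for the scraped news list
    {"title": "news A", "clicks": 120},
    {"title": "news B", "clicks": 85},
]
df = pd.DataFrame(allnews)
df.to_sql(name="news", con=engine, if_exists="append", index=False)

# Read the table back to confirm both rows were written
out = pd.read_sql("SELECT title, clicks FROM news", con=engine)
print(len(out))  # 2
```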

II. Comprehensive scraping project

  1. Pick a trending topic or a subject you are interested in.
  2. Decide on the target and scope of the crawl.
  3. Understand the target's restrictions and constraints.
  4. Crawl the relevant content.
  5. Perform data analysis and text analysis.
  6. Write an article that includes an explanation, the technical highlights, the data, a graphical presentation of the data analysis with commentary, and a graphical presentation of the text analysis with commentary.
  7. Publish the article publicly.

References:

32 Python crawler projects

Who exactly is opposing 996?

Python and Java pay the most, C# the least!

What is the mindset of people rating "The Wandering Earth" one star?

Are the bullet-screen comments on "All Is Well" more entertaining than the plot?

I crawled my own WeChat friends, and it turns out this is who they are…

Spring Festival population-migration big-data report!

Consumption-trend data in the run-up to Qixi

I crawled the bra purchase records on Tmall and made some blush-worthy discoveries...

Six million characters of lyrics analyzed with Python: what Chinese rappers are actually singing about

After analyzing 420,000 characters of lyrics, I finally figured out what folk singers sing about

The true faces of the twelve zodiac signs

What were the relationships among Tang-dynasty poets really like?

The ranking of Chinese surnames

III. Scraping precautions

1. Set a reasonable interval between requests: it avoids putting pressure on the target site's operations staff, and it keeps your program from being forcibly cut off.

import time
import random

time.sleep(random.random() * 3)

2. Set a sensible user-agent so your requests impersonate a real browser when fetching content.

  1. First, open your browser and navigate to about:version.
  2. Copy the "User Agent" string shown there.
  3. Collect the user-agent strings of a few commonly used browsers into a list.
  4. Then import random and pick one user-agent at random.
  5. Define the request-header dict: headers = {'User-Agent': ...}
  6. When calling requests.get, pass the headers carrying your custom User-Agent.
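The steps above can be sketched as follows; the two user-agent strings in the pool are illustrative examples only:

```python
import random

# A small pool of common browser user-agent strings (illustrative examples)
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) '
    'Gecko/20100101 Firefox/4.0.1',
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(UA_POOL)}

headers = random_headers()
# requests.get(url, headers=headers)  # attach the headers to each request
```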

3. When login is required

When calling requests.get, pass headers that carry your Cookie:

headers = {'User-Agent': '',
           'Cookie': ''}

4. Use proxy IPs

Rotate the IP address you request from, so the crawl can keep running efficiently without being blocked.

headers = {
    "User-Agent": "",
}

proxies = {
    "http": "",
    "https": "",
}

response = requests.get(url, headers=headers, proxies=proxies)
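Rotation itself can be sketched by choosing a proxy at random per request; the addresses below are placeholders, not working proxies:

```python
import random

# Placeholder proxy pool -- substitute real proxy addresses here
PROXY_POOL = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

proxies = random.choice(PROXY_POOL)  # a potentially different exit IP each request
# response = requests.get(url, headers=headers, proxies=proxies)
```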

Fetching video information from the BILIBILI daily rankings

Fetch the Bilibili site-wide daily ranking, then extract each video's tags and comments.

Fetching the comments

API: http://api.bilibili.cn/feedback

Parameters

aid       true    int     AV number
page                      page number
pagesize  false           records returned per page; at most 300, default 10
ver                       API version; the latest is 3
order             string  sort order; defaults to reverse-chronological. Options: good = by like count, hot = by hot replies
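The parameters above can be assembled into a query URL without sending anything, via `requests.Request(...).prepare()`; the values here simply echo the table and the example AV number used later:

```python
from requests import Request

# Assemble the documented parameters into a feedback-API URL (not sent)
params = {"aid": 50164983, "page": 1, "pagesize": 20, "ver": 3, "order": "hot"}
prepared = Request("GET", "http://api.bilibili.cn/feedback", params=params).prepare()
print(prepared.url)
```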

ver1

Field        Type    Description
mid                  member ID
lv                   floor number
fbid                 comment ID
msg                  comment text
ad_check             status (0: normal, 1: hidden by uploader, 2: deleted by moderator, 3: deleted after being reported)
face                 commenter's avatar
rank                 commenter's display badge
nick                 commenter's nickname
totalResult          total number of comments
pages                total number of pages

reply

good                 number of likes?
isgood               whether already liked?
device               unknown
create               UNIX time the comment was created
create_at    String  human-readable creation time (2016-01-20 15:52)
reply_count          number of replies
level_info   list    the user's level info?
sex                  the user's gender

Example: AV number = 50164983:

http://api.bilibili.com/x/reply?type=1&oid=50164983&pn=1&nohot=1&sort=0

(screenshot)

The response:

(screenshot)

Now look at the Preview tab:

(screenshot)

We can see size = 20 and count = 2824; the username sits under [member][uname] and the reply text under [content][message].

...and so on.
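Pulling the username and message out of that structure can be sketched against a hand-written sample payload (the values below are made up, only the field layout matches the API description):

```python
# Hand-written sample mimicking the reply-API JSON described above
sample = {
    "data": {
        "page": {"size": 20, "count": 2824},
        "replies": [
            {"member": {"uname": "user_a"}, "content": {"message": "nice video"}},
            {"member": {"uname": "user_b"}, "content": {"message": "first!"}},
        ],
    }
}

infolist = [
    {"username": r["member"]["uname"], "text": r["content"]["message"]}
    for r in sample["data"]["replies"]
]
print(infolist[0])  # {'username': 'user_a', 'text': 'nice video'}
```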

Tag analysis

(screenshot)

We can see that Bilibili users mostly prefer lifestyle- and kichiku-type videos.

Comment data for the top-ranked video (PS: I have nothing against Mr. Cai)

(screenshot)

Here it is in the database:

(screenshot)

The rankings:

(screenshot)
(screenshot)

The code follows.

Headers

UA = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile ',
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
    'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
    'Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0',
    'Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) ',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
    'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999',
]

from random import choice  # used below to pick a random user-agent from UA

headers = {
        'Referer': 'https://www.bilibili.com/v/douga/mad/?spm_id_from=333.334.b_7072696d6172795f6d656e75.3',
        'User-Agent': choice(UA)
    }

herder={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
'Host': 'www.bilibili.com',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7',
"Cookie": "fts=1519723607; pgv_pvi=9921057792; im_notify_type_5172000=0; LIVE_BUVID=d9102c76da863db3e7c92490dc7c1458; LIVE_BUVID__ckMd5=300ca52bca0020e2; im_local_unread_5172000=0; buvid3=633B41F7-7489-4AFF-A338-C6B691D748BF163029infoc; CURRENT_FNVAL=16; _uuid=154F2A25-2995-7B95-9278-CEB7B98119CB36766infoc; UM_distinctid=16797b478ab161-09c84bb5055ad7-b79183d-144000-16797b478ac59c; stardustvideo=-1; sid=iv38z60z; CURRENT_QUALITY=32; DedeUserID=5172000; DedeUserID__ckMd5=177188bf6c38a514; SESSDATA=7901bc88%2C1557721353%2Ccca68741; bili_jct=7b58735b2fbf739a2a7ca05ffb0aa722; rpdid=|(J~R)uJlkYl0J'ullYJluJYY; bp_t_offset_5172000=247013898595062047; _dfcaptcha=cf9b64400c2062d1a78de2019210c7fb",
}      

Comments

def getAllCommentList(id):
    # First request: read the total comment count to work out the page count
    url = "http://api.bilibili.com/x/reply?type=1&oid=" + str(id) + "&pn=1&nohot=1&sort=0"
    r = requests.get(url)
    json_text = json.loads(r.text)
    commentsNum = json_text["data"]["page"]["count"]
    page = commentsNum // 20 + 1                # 20 comments per page
    for n in range(1, page + 1):                # include the final partial page
        url = "https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=" + str(n) + "&type=1&oid=" + str(id) + "&sort=1&nohot=1"
        req = requests.get(url)
        json_text_list = json.loads(req.text)
        replies = json_text_list["data"]["replies"]
        if not replies:                         # ran past the last page
            break
        for i in replies:
            info = {}
            info['username'] = i['member']['uname']
            info['text'] = i['content']['message']
            infolist.append(info)               # infolist is a module-level list


def saveTxt(filename, filecontent):
    df = pd.DataFrame(filecontent)
    df.to_csv(filename + '.csv')
    print('Comments for video ' + filename + ' saved')
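The page arithmetic in getAllCommentList assumes 20 comments per page; with the count of 2824 seen earlier, the number of pages to request is the ceiling of count / 20, which can be checked directly:

```python
count = 2824      # total comments reported by the API
pagesize = 20     # comments returned per page

pages = -(-count // pagesize)  # ceiling division without importing math
print(pages)  # 142
```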

Tags

def gettag(id):
    ranksss = {}
    url = "https://www.bilibili.com/video/av" + str(id)
    tag = requests.get(url, headers=herder)
    tag.encoding = 'utf-8'
    tagsoup = BeautifulSoup(tag.text, 'html.parser')
    for ii in tagsoup.select('.tm-info'):
        ranksss['tag1'] = ii.select('.crumb')[1].text.replace('>', '')  # main category
        ranksss['tag2'] = ii.select('.crumb')[2].text                   # sub-category
    return ranksss

Main scraping loop

for ii in soup.select('.rank-list'):
    for ifo in ii.select('.rank-item'):
        ranks = {}
        rankUrl = ifo.select('.title')[0]['href']
        ranktitle = ifo.select('.title')[0].text
        ranknum = ifo.select('.data-box')[0].text        # play count
        rankdanmus = ifo.select('.data-box')[1].text     # danmaku count
        rankmaker = ifo.select('.data-box')[2].text      # uploader
        rankfie = ifo.select('.pts')[0].text.replace('綜合得分', '')  # overall score
        id = re.findall('(\d{7,8})', rankUrl)[-1]        # extract the AV number
        ranks = gettag(str(id))
        ranks['up'] = rankmaker
        ranks['title'] = ranktitle
        print(ranks['tag1'])
        ranks['url'] = rankUrl
        ranks['Play volume'] = ranknum
        ranks['Barrage'] = rankdanmus
        ranks['overall ratings'] = rankfie
        ranklist.append(ranks)
        with open('tag.txt', "a", encoding='utf-8') as txt:
            txt.write(ranks['tag1'] + ranks['tag2'])
        infolist.clear()
        getAllCommentList(id)        # fetch the comments for this AV number
        saveTxt(ranktitle, infolist)

Word cloud

from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import jieba


mask_png = plt.imread("fate.jpeg")
cloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simhei.ttf",  # the bundled font lacks CJK glyphs; use SimHei on Windows
    background_color="white",   # background colour
    max_words=500,              # maximum number of words displayed
    max_font_size=150,          # largest font size
    random_state=50,
    mask=mask_png,              # shape the cloud with the mask image
    width=1000, height=860, margin=2,)

def stopWordsList():
    # one stop word per line in csw.txt
    stopwords = [line.strip() for line in open('csw.txt', encoding='UTF-8').readlines()]
    return stopwords

txt = open(r'C:\Users\Ltp\Downloads\bd\tag.txt', 'r', encoding='utf-8').read()
stopWords = stopWordsList()
for exc in stopWords:
    txt = txt.replace(exc, '')
wordList = jieba.lcut(txt)
wordDict = {}
for word in wordList:
    if word not in stopWords and len(word) > 1:  # skip stop words and single characters
        wordDict[word] = wordDict.get(word, 0) + 1
wordCloudLS = list(wordDict.items())
wordCloudLS.sort(key=lambda x: x[1], reverse=True)
for i in range(35):                              # print the 35 most frequent words
    print(wordCloudLS[i])
wcP = " ".join(wordList)
mywc = cloud.generate(wcP)
plt.imshow(mywc)
plt.axis("off")
plt.show()