Instructor: MissDu — assignment submission
I. Saving the scraped content to a MySQL database
- import pandas as pd
- import pymysql
- from sqlalchemy import create_engine
- conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
- engine = create_engine(conInfo)  # charset is already set in the URL; the encoding kwarg was removed in SQLAlchemy 2.x
- df = pd.DataFrame(allnews)
- df.to_sql(name='news', con=engine, if_exists='append', index=False)
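The same `to_sql` round-trip can be tried without a MySQL server by pointing SQLAlchemy at an in-memory SQLite database (a sketch; the `news` table and its columns here are made-up stand-ins for the scraped data):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL connection string
engine = create_engine("sqlite://")

# A made-up row shaped like one scraped news item
df = pd.DataFrame([{"title": "demo", "clicks": 3}])
df.to_sql(name="news", con=engine, if_exists="append", index=False)

# Read the table back to confirm the write
back = pd.read_sql("news", con=engine)
print(back.shape)
```

Swapping `sqlite://` for the real `mysql+pymysql://...` URL is the only change needed once a MySQL server is available.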
II. Comprehensive crawler project
- Pick a trending topic or a subject you are interested in.
- Choose the crawl target and its scope.
- Understand the target's restrictions and constraints.
- Crawl the relevant content.
- Perform data analysis and text analysis.
- Write it up as an article with an introduction, technical highlights, the data itself, charted data-analysis results with commentary, and charted text-analysis results with commentary.
- Publish the article publicly.
References:
32 Python crawler projects
Who is opposing 996?
Python and Java pay the most, C# the least!
What is the mindset of people who rate The Wandering Earth one star?
All Is Well danmaku data: more exciting than the plot?
I crawled my own WeChat friends, and it turns out this is who they are...
Spring Festival population-migration big-data report!
Consumption-trend data ahead of the Qixi Festival
I crawled bra purchase records on Tmall and made some blush-worthy discoveries...
A 6-million-character lyric analysis in Python: what are Chinese rappers singing about?
After analyzing 420,000 characters of lyrics, I finally figured out what folk singers sing about
The true face of the twelve zodiac signs
What were the relationships among Tang-dynasty poets really like?
China surname rankings
III. Crawler precautions
1. Set a reasonable crawl interval: it avoids putting pressure on the target site's operations staff, and it makes your program less likely to be blocked.
- import time
- import random
- time.sleep(random.random()*3)  # sleep a random 0 to 3 seconds between requests
2. Set a reasonable user-agent so your requests look like they come from a real browser.
- First open your browser and visit about:version to see its user-agent string.
- Collect the user-agents of several common browsers into a list.
- Then import random and pick one user-agent at random.
- Define the request-headers dict: headers = {'User-Agent': ua}
- Pass these headers, carrying the custom User-Agent, when calling requests.get.
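The steps above can be sketched as follows (the two user-agent strings are just sample entries for the pool):

```python
import random

# A small pool of user-agent strings (sample entries; extend as needed)
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
]

# Pick a fresh user-agent for each request
headers = {"User-Agent": random.choice(UA_POOL)}
print(headers["User-Agent"])
# requests.get(url, headers=headers) would then carry this header
```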
3. When login is required
Pass headers carrying your Cookie when calling requests.get:
headers = {'User-Agent': '',   # fill in a browser user-agent
           'Cookie': ''}       # fill in the cookie copied from a logged-in session
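With requests, the cookie can also be managed through a Session instead of a raw header string (a sketch; the cookie names and values below are placeholders for what you would copy from a logged-in browser):

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder UA
# Placeholder cookie values copied from a logged-in browser session
session.cookies.update({"SESSDATA": "xxx", "DedeUserID": "123"})

# Every session.get(...) now sends these cookies automatically;
# this just shows what the resulting Cookie header would contain.
cookie_header = "; ".join(f"{k}={v}" for k, v in session.cookies.items())
print(cookie_header)
```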
4. Use proxy IPs
Rotating IP addresses lets you keep scraping efficiently without being blocked.
headers = {
    "User-Agent": "",
}
proxies = {
    "http": "",
    "https": "",
}
response = requests.get(url, headers=headers, proxies=proxies)
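Beyond a single proxies dict, one way to rotate through a proxy pool is to retry with the next proxy whenever a request fails. This is a sketch: the `fetch_with_proxies` helper and the proxy addresses are made up for illustration, and the `get` callable is injectable so the retry logic can be exercised without a network.

```python
import requests

def fetch_with_proxies(url, proxy_pool, get=requests.get):
    """Try each proxy in turn; return the first response that succeeds."""
    last_err = None
    for proxies in proxy_pool:
        try:
            return get(url, proxies=proxies, timeout=10)
        except requests.RequestException as err:
            last_err = err  # remember the failure, move on to the next proxy
    raise last_err

# Placeholder proxy addresses
POOL = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

# Offline demo: a fake `get` whose first call fails, so the helper
# falls through to the second proxy.
attempts = []
def fake_get(url, proxies=None, timeout=None):
    attempts.append(proxies)
    if len(attempts) == 1:
        raise requests.RequestException("proxy down")
    return "ok"

result = fetch_with_proxies("http://example.com", POOL, get=fake_get)
print(result, len(attempts))
```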
BILIBILI daily ranking: fetching video info
Fetch the bilibili daily all-site ranking and extract each video's tags and comments.
Fetching comments
API: http://api.bilibili.cn/feedback
Parameters:

name | required | type | description
---|---|---|---
aid | true | int | AV number
page | | | page number
pagesize | false | | records returned per page; at most 300, default 10
ver | | | API version; the latest is 3
order | | string | sort order; defaults to post time descending. Options: good (by number of likes), hot (by hot replies)
ver1 response fields:

field | type | description
---|---|---
mid | | member ID
lv | | floor number
fbid | | comment ID
msg | | comment text
ad_check | | status (0: normal, 1: hidden by uploader, 2: deleted by admin, 3: deleted after report)
face | | poster's avatar
rank | | poster's display badge
nick | | poster's nickname
totalResult | | total number of comments
pages | | total number of pages
reply object fields:

field | type | description
---|---|---
good | | number of likes?
isgood | | whether already liked?
device | | unknown
create | | UNIX time the comment was created
create_at | String | human-readable creation time (2016-01-20 15:52)
reply_count | | number of replies
level_info | list | user's level info?
sex | | user's gender
Example, AV number = 50164983:
http://api.bilibili.com/x/reply?type=1&oid=50164983&pn=1&nohot=1&sort=0

Response:
Looking at the Preview tab, we can see size = 20 and count = 2824; usernames sit under [member][uname] and the reply text under [content][message].
...and so on for every page.
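Parsing that response shape can be sketched offline with a trimmed sample payload (the reply values below are made up; the field paths match the Preview described above):

```python
import json

# Trimmed sample mimicking the /x/reply response shape
sample = json.loads("""
{
  "data": {
    "page": {"size": 20, "count": 2824},
    "replies": [
      {"member": {"uname": "someone"}, "content": {"message": "nice video"}}
    ]
  }
}
""")

count = sample["data"]["page"]["count"]
comments = [(r["member"]["uname"], r["content"]["message"])
            for r in sample["data"]["replies"]]
print(count, comments)
```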
Tag analysis
You can see that bilibili users mostly watch lifestyle (生活) and kichiku (鬼畜) videos.
Comment data for the top-ranked video (PS: I have nothing against Mr. Cai)
This is the data in the database:
The ranking:
The code follows:
Header setup
UA = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile ',
'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
'Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0',
'Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) ',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999',
]
from random import choice  # choice(UA) below needs this import

headers = {
    'Referer': 'https://www.bilibili.com/v/douga/mad/?spm_id_from=333.334.b_7072696d6172795f6d656e75.3',
    'User-Agent': choice(UA)  # pick a random user-agent from the pool above
}
herder = {  # note: the name is a typo for "header"; kept as-is because gettag() below uses it
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Host': 'www.bilibili.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7',
    "Cookie": "fts=1519723607; pgv_pvi=9921057792; im_notify_type_5172000=0; LIVE_BUVID=d9102c76da863db3e7c92490dc7c1458; LIVE_BUVID__ckMd5=300ca52bca0020e2; im_local_unread_5172000=0; buvid3=633B41F7-7489-4AFF-A338-C6B691D748BF163029infoc; CURRENT_FNVAL=16; _uuid=154F2A25-2995-7B95-9278-CEB7B98119CB36766infoc; UM_distinctid=16797b478ab161-09c84bb5055ad7-b79183d-144000-16797b478ac59c; stardustvideo=-1; sid=iv38z60z; CURRENT_QUALITY=32; DedeUserID=5172000; DedeUserID__ckMd5=177188bf6c38a514; SESSDATA=7901bc88%2C1557721353%2Ccca68741; bili_jct=7b58735b2fbf739a2a7ca05ffb0aa722; rpdid=|(J~R)uJlkYl0J'ullYJluJYY; bp_t_offset_5172000=247013898595062047; _dfcaptcha=cf9b64400c2062d1a78de2019210c7fb",
}
Comments
def getAllCommentList(id):
    # The first request is only used to read the total comment count
    url = "http://api.bilibili.com/x/reply?type=1&oid=" + str(id) + "&pn=1&nohot=1&sort=0"
    r = requests.get(url)
    json_text = json.loads(r.text)
    commentsNum = json_text["data"]["page"]["count"]
    page = commentsNum // 20 + 1  # 20 replies per page
    for n in range(1, page + 1):  # +1 so the last page is not skipped
        url = "https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=" + str(n) + "&type=1&oid=" + str(id) + "&sort=1&nohot=1"
        req = requests.get(url)
        json_text_list = json.loads(req.text)
        for i in json_text_list["data"]["replies"] or []:  # "replies" can be null past the last page
            info = {}
            info['username'] = i['member']['uname']
            info['text'] = i['content']['message']
            infolist.append(info)

def saveTxt(filename, filecontent):
    df = pd.DataFrame(filecontent)
    df.to_csv(filename + '.csv')
    print('Comments for video ' + filename + ' have been saved')
Tags
def gettag(id):
    ranksss = {}
    url = "https://www.bilibili.com/video/av" + str(id)
    tag = requests.get(url, headers=herder)
    tag.encoding = 'utf-8'
    tagsoup = BeautifulSoup(tag.text, 'html.parser')
    for ii in tagsoup.select('.tm-info'):
        tag1 = ii.select('.crumb')[1].text.replace('>', '')
        tag2 = ii.select('.crumb')[2].text
        ranksss['tag1'] = tag1
        ranksss['tag2'] = tag2
    return ranksss
Main ranking info
for ii in soup.select('.rank-list'):  # soup: the parsed ranking page, fetched earlier
    for ifo in ii.select('.rank-item'):
        ranks = {}
        rankUrl = ifo.select('.title')[0]['href']
        ranktitle = ifo.select('.title')[0].text
        ranknum = ifo.select('.data-box')[0].text     # play count
        rankdanmus = ifo.select('.data-box')[1].text  # danmaku count
        rankmaker = ifo.select('.data-box')[2].text   # uploader
        rankfie = ifo.select('.pts')[0].text.replace('綜合得分', '')  # overall score
        id = re.findall(r'(\d{7,8})', rankUrl)[-1]  # extract the AV number
        ranks = gettag(str(id))
        ranks['up'] = rankmaker
        ranks['title'] = ranktitle
        print(ranks['tag1'])
        ranks['url'] = rankUrl
        ranks['Play volume'] = ranknum
        ranks['Barrage'] = rankdanmus       # fixed: this is the danmaku count (was assigned the score)
        ranks['overall ratings'] = rankfie  # fixed: this is the overall score (was assigned the danmaku count)
        ranklist.append(ranks)
        with open('tag.txt', "a", encoding='utf-8') as txt:
            txt.write(ranks['tag1'] + ranks['tag2'])
        infolist.clear()
        getAllCommentList(id)  # fetch the comments for this AV number
        saveTxt(ranktitle, infolist)
Word cloud
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import jieba

mask_png = plt.imread("fate.jpeg")
cloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simhei.ttf",  # the bundled font lacks CJK glyphs; use SimHei on Windows
    background_color="white",  # background color
    max_words=500,             # maximum number of words shown
    max_font_size=150,         # maximum font size
    random_state=50,
    mask=mask_png,
    width=1000, height=860, margin=2,
)

def stopWordsList():
    stopwords = [line.strip() for line in open('csw.txt', encoding='UTF-8').readlines()]
    return stopwords

txt = open(r'C:\Users\Ltp\Downloads\bd\tag.txt', 'r', encoding='utf-8').read()
stopWords = stopWordsList()
for exc in stopWords:
    txt = txt.replace(exc, '')
wordList = jieba.lcut(txt)
wordDict = {}
for word in wordList:
    if word not in stopWords and len(word) > 1:  # skip stop words and single characters
        wordDict[word] = wordDict.get(word, 0) + 1
wordCloudLS = list(wordDict.items())
wordCloudLS.sort(key=lambda x: x[1], reverse=True)
for i in range(min(35, len(wordCloudLS))):  # print the top words
    print(wordCloudLS[i])
wcP = " ".join(wordList)
mywc = cloud.generate(wcP)
plt.imshow(mywc)
plt.axis("off")
plt.show()
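The manual wordDict counting above can also be written with collections.Counter, which yields the same (word, frequency) pairs (a sketch on a tiny made-up token list standing in for the jieba.lcut output):

```python
from collections import Counter

# Made-up token list standing in for jieba.lcut output
word_list = ["鬼畜", "生活", "鬼畜", "的", "鬼畜", "生活"]

# Same filtering as above: drop single-character tokens, then count
freq = Counter(w for w in word_list if len(w) > 1)
print(freq.most_common(2))
```

`freq` is already a dict, so it could be fed straight into `WordCloud.generate_from_frequencies` instead of re-joining the token list into a string.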