Midterm Web Crawler Comprehensive Assignment

Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/3075

I. Getting to Know the Site

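今日熱榜 (tophub.today) is a news aggregation site. It gathers trending news in one place, so current stories and industry news can be found and read quickly.

Features of 今日熱榜:

1. The news you are looking for can be found on the site.

2. News is available promptly, so you can keep up with the latest trending topics.

3. It gathers the latest news from across the web every day.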

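II. Analyzing the Site

The site is divided into modules such as General, Technology, and Community.

Use the browser's inspect tool to find the link to the V2EX 今日熱議 (Today's Hot Topics) node under the Community module.

Click the link to open the V2EX 今日熱議 node page.

Crawler requirement: crawl the titles on every node page.

Analysis: use the browser's inspect tool to find what the markup of the titles has in common.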
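The shared pattern is the one the crawler code relies on with the selector 'tbody > tr > .al > a': each title is a link inside a table cell with the class al. The sketch below illustrates that pattern on a hypothetical HTML fragment (the real page markup may differ). It uses the standard library's html.parser instead of BeautifulSoup so that it has no third-party dependency, and TitleParser is a name made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating one node page; the real page may differ.
SAMPLE_HTML = """
<table><tbody>
  <tr><td>1.</td><td class="al"><a href="/l?e=1">First title</a></td></tr>
  <tr><td>2.</td><td class="al"><a href="/l?e=2">Second title</a></td></tr>
</tbody></table>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <a> nested inside a <td class="al">."""

    def __init__(self):
        super().__init__()
        self.in_al_cell = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'td' and 'al' in classes:
            self.in_al_cell = True
        elif tag == 'a' and self.in_al_cell:
            self.in_link = True
            self.titles.append('')  # start a new title

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_al_cell = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
print(titles)
```

Only the two link texts are collected; the cells holding the row numbers are skipped because they lack the al class.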

III. Code

Get the links to the node pages of each module:

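import random
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent; the post does not show its imports


def get_model_url():
    ua = UserAgent()
    headers = {'User-Agent': ua.random}  # random User-Agent string
    tag_set = set()  # a set drops duplicate node links

    url = 'https://tophub.today/c/{}?p={}'
    subjects = ['news', 'tech', 'community']

    for sub in subjects:
        for page in range(1, 10):  # pages 1 to 9 of each module
            res = requests.get(url.format(sub, page), headers=headers)
            res.encoding = 'utf-8'
            soup = BeautifulSoup(res.text, 'lxml')
            for a in soup.select('.cc-cd-is > a'):
                tag_set.add(a['href'])
            time.sleep(random.random() * 2)  # pause up to 2 seconds between requests

    return tag_set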

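Crawl the titles of all node pages in each module, collect them into a list, write the list to a text file, and join it into a single string: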

def get_model_reci(tags_set):
    print(tags_set)
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    txt_list = []
    time_fmt = '%m-%d  %H:%M:%S'  # strftime uses the current local time by default

    print('before get_model_reci ' + time.strftime(time_fmt))

    url1 = 'https://tophub.today{}'
    for ci in tags_set:
        res = requests.get(url1.format(ci), headers=headers)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'lxml')
        for a in soup.select('tbody > tr > .al > a'):
            txt_list.append(a.text.strip() + '\n')  # one title per line
        time.sleep(random.random() * 2)

    print('after get_model_reci ' + time.strftime(time_fmt))
    # the with block closes the file even if writing fails
    with open('title_txt', 'w', encoding='utf-8') as f:
        f.writelines(txt_list)
    print('after writing file ' + time.strftime(time_fmt))

    return ''.join(txt_list)

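The crawl produced 39,000 entries, which were written to the title_txt file.

Segment the crawled titles into words, count the word frequencies, sort them, and return a list of the 100 most frequent words: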

import jieba


def get_reci(txt):
    # Append every title saved in title_txt to the text passed in
    with open('title_txt', 'r', encoding='utf-8') as f:
        for line in f:
            txt += line.strip()
    print(len(txt))
    cut_words = jieba.cut(txt)  # word segmentation

    # stop_words is not defined in this post; it is assumed to be a
    # collection of stop words defined elsewhere in the script
    reci_dict = {}  # word -> frequency
    for word in cut_words:
        if word not in stop_words and len(word) > 1:
            reci_dict[word] = reci_dict.get(word, 0) + 1

    dic = sorted(reci_dict.items(), key=lambda item: item[1], reverse=True)

    top_words = []
    for word, count in dic[:100]:  # slicing avoids an IndexError with fewer than 100 words
        top_words.append(word)
        print(word, " : ", count)

    return top_words
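The filter-and-count step can also be written with collections.Counter from the standard library. The following is a minimal sketch, not the code used for the assignment: the token list and the stop_words set are hypothetical stand-ins for the output of jieba.cut and for the real stop word list.

```python
from collections import Counter

# Hypothetical, already segmented tokens and stop words for illustration
tokens = ['華為', '手機', '的', '華為', '釋出', '手機', '華為', '了']
stop_words = {'釋出'}

# Keep words longer than one character that are not stop words, then count them
counter = Counter(
    word for word in tokens
    if word not in stop_words and len(word) > 1
)

top_words = counter.most_common(2)  # the 2 most frequent words with their counts
print(top_words)  # [('華為', 3), ('手機', 2)]
```

most_common(n) returns at most n pairs, so it is also safe when fewer than n distinct words remain after filtering.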

Generate the word cloud and save the word frequencies to reci.csv.
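The code for this step is not shown in the post. The sketch below covers only the CSV part and uses the standard csv module; the word cloud itself is left out. The function name save_reci_csv, the header row, and the sample pairs are assumptions made for this example.

```python
import csv


def save_reci_csv(freq_pairs, path='reci.csv'):
    """Write (word, count) pairs to a CSV file with a header row."""
    # utf-8-sig adds a BOM, which helps spreadsheet software detect
    # the encoding of Chinese text
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['word', 'count'])
        writer.writerows(freq_pairs)


# Hypothetical frequencies for illustration
save_reci_csv([('華為', 3), ('手機', 2)])
```

The sorted list of (word, count) pairs built inside get_reci has the shape this function expects.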

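Pick any of the 100 most frequent words, imitate the GET request of 今日熱榜's search to get the title, source, time, and link of the news related to each hot word, and write them to news.txt: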

from urllib.parse import quote


def get_info_from_search(search_tag):
    # Example of an observed search URL: https://tophub.today/search?q=%E5%88%98%E5%BC%BA
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    txt_list = {}  # hot word -> flat list of title, link, source, time

    url1 = 'https://tophub.today/search?q={}'

    for t in search_tag:
        print(t)
        url = quote(url1.format(t), safe=";/?:@&=+$,", encoding="utf-8")  # percent-encode the Chinese keyword
        print(url)
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'lxml')

        links = soup.select('tbody > tr > .al > a')
        title = [a.text for a in links]
        href = [a['href'] for a in links]
        author = [td.text for td in soup.select('tbody > tr > td:nth-of-type(4)')]
        p_time = [td.text for td in soup.select('tbody > tr > td:nth-of-type(5)')]
        time.sleep(random.random() * 2)

        tmp = []
        # zip stops at the shortest list, so a row with a missing cell cannot raise IndexError
        for item in zip(title, href, author, p_time):
            tmp.extend(item)

        txt_list[t] = tmp
    return txt_list

Observe how the search passes the keyword in a GET request.
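The keyword travels in the q query parameter, percent-encoded as UTF-8. As a small illustration with the standard library, the keyword 刘强 reproduces the query string of the search URL https://tophub.today/search?q=%E5%88%98%E5%BC%BA:

```python
from urllib.parse import quote, urlencode

keyword = '刘强'

# Percent-encode the keyword as UTF-8 bytes
encoded = quote(keyword)
print(encoded)  # %E5%88%98%E5%BC%BA

# urlencode builds the whole query string from a dict of parameters
search_url = 'https://tophub.today/search?' + urlencode({'q': keyword})
print(search_url)  # https://tophub.today/search?q=%E5%88%98%E5%BC%BA
```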

The titles, sources, times, and links of the news related to the hot words are obtained and written to news.txt.
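The code that writes news.txt is not shown in the post. The following is a minimal sketch, assuming the dictionary returned by get_info_from_search, where each hot word maps to a flat list that repeats title, link, source, time. The function name write_news, the tab-separated layout, and the sample data are assumptions made for this example.

```python
def write_news(txt_list, path='news.txt'):
    """Write one tab-separated line per news item, grouped under each hot word."""
    with open(path, 'w', encoding='utf-8') as f:
        for tag, fields in txt_list.items():
            f.write(tag + '\n')
            # fields is a flat list: title, link, source, time, title, link, ...
            # an incomplete trailing group is ignored
            for i in range(0, len(fields) - 3, 4):
                title, href, author, p_time = fields[i:i + 4]
                f.write('\t'.join([title, href, author, p_time]) + '\n')


# Hypothetical search result for illustration
sample = {'華為': ['Title A', '/l?e=1', 'V2EX', '2019-04-20']}
write_news(sample)
```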