Crawling All Campus News

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2941

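Requirements:

1. Get the news details from a news URL

2. Get the news URLs from a list page URL

3. Generate the URLs of all list pages and fetch all the news

4. Set a reasonable crawl interval

5. Do some simple data processing with pandas and save the result as CSV and SQL files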
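
Requirements 3 and 4 can be sketched independently of the parsing code: first generate the list-page URLs, then walk them with a random pause between requests. This is a minimal sketch, not the assignment's reference solution. The names `list_page_urls` and `polite_iter` are hypothetical, and the URL pattern is assumed to be the one the source code uses for numbered list pages.

```python
import random
import time

# Assumption: numbered list pages follow this pattern, as in the source code.
LIST_URL_PATTERN = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'


def list_page_urls(first, count):
    """Return the URLs of `count` consecutive list pages starting at page `first`."""
    return [LIST_URL_PATTERN.format(i) for i in range(first, first + count)]


def polite_iter(urls, max_delay=3.0, sleep=time.sleep):
    """Yield each URL, pausing a random 0..max_delay seconds between URLs."""
    for n, url in enumerate(urls):
        if n > 0:  # no pause is needed before the very first request
            sleep(random.random() * max_delay)
        yield url
```

In a real crawl, each URL yielded by `polite_iter(list_page_urls(97, 10))` would be passed to the list-page parser. Taking `sleep` as a parameter is a design choice that lets the pause be swapped out when testing.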

Source code:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas as pd
import time
import random
import sqlite3

newsUrl = 'http://news.gzcc.cn/html/2005/xiaoyuanxinwen_0710/4.html'
listUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'


def click(url):
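    # the last run of digits in the URL is the news id (raw string avoids an invalid-escape warning)
    newsId = re.findall(r'(\d{1,5})', url)[-1]
    clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)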
    resClick = requests.get(clickUrl)
    newsClick = int(resClick.text.split('.html')[-1].lstrip("('").rstrip("');"))
    return newsClick


def newsdt(showinfo):
    newsDate = showinfo.split()[0].split(':')[1]
    newsTime = showinfo.split()[1]
    newsDT = newsDate + ' ' + newsTime
    dt = datetime.strptime(newsDT, '%Y-%m-%d %H:%M:%S')
    return dt


def anews(url):  # get the news details from a news URL; returns a dict
    newsDetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsDetail['newsTitle'] = soup.select('.show-title')[0].text
    showinfo = soup.select('.show-info')[0].text
    newsDetail['newsDT'] = newsdt(showinfo)
    newsDetail['newsClick'] = click(url)  # use this article's URL, not the global newsUrl
    return newsDetail


def alist(url):  # get the news URLs from a list page URL; returns a list of news dicts
    res = requests.get(url)  # fetch the page that was passed in, not the global listUrl
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsList = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newsUrl = news.select('a')[0]['href']
            newsDesc = news.select('.news-list-description')[0].text
            newsDict = anews(newsUrl)
            newsDict['description'] = newsDesc
            newsList.append(newsDict)
    return newsList


alist(listUrl)  # quick check against the first list page

res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        newsUrl = news.select('a')[0]['href']
        print(anews(newsUrl))

allnews = []
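# crawl 10 list pages, starting from the page that matches the last digits of the student ID
for i in range(97, 107):
    listUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    allnews.extend(alist(listUrl))
    time.sleep(random.random() * 3)  # wait a random 0-3 seconds between list pages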

print("allnewsLength={}".format(len(allnews)))
print(allnews)

s1 = pd.Series([100, 23, 'bugingcode'])
print(s1)
print(pd.Series(anews(newsUrl)))  # a single news dict shown as a Series
newsdf = pd.DataFrame(allnews)
for i in range(5):
    print(i)
    time.sleep(random.random() * 3)  # set the crawl interval: pause a random 0-3 seconds
    print(newsdf)

# save as CSV; the encoding is set to utf_8_sig to avoid garbled characters
newsdf.to_csv(r'D:\py_file\gzcc.csv', encoding='utf_8_sig')

# save to an SQLite database file
with sqlite3.connect(r'D:\py_file\gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnewsdb', db)
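
The string handling in `click` and `newsdt` is the most fragile part of the crawler, and it can be checked without network access. The sketch below swaps the split/strip chains for regular expressions, which is a different technique from the one in the source. The helper names and both sample strings are assumptions: the samples only mimic the formats the original slicing code appears to expect and were not captured from the live site.

```python
import re
from datetime import datetime

# Assumed sample inputs (illustrative only, not real responses).
SAMPLE_COUNT_RESPONSE = (
    "$('#todaydowns').html('5');$('#weekdowns').html('12');$('#hits').html('123');"
)
SAMPLE_SHOW_INFO = 'label:2019-04-01 11:57:00 other:fields'


def parse_click_count(text):
    """Return the number inside the last .html('N') call of the response text."""
    matches = re.findall(r"\.html\('(\d+)'\)", text)
    if not matches:
        raise ValueError('no click count found in response')
    return int(matches[-1])


def parse_news_id(url):
    """Return the numeric file name of a news URL, e.g. '.../4.html' gives '4'."""
    match = re.search(r'/(\d+)\.html$', url)
    if match is None:
        raise ValueError('no news id in URL: {}'.format(url))
    return match.group(1)


def parse_publish_time(showinfo):
    """Extract the 'YYYY-MM-DD HH:MM:SS' timestamp from the show-info text."""
    match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', showinfo)
    if match is None:
        raise ValueError('no timestamp in show-info text')
    return datetime.strptime(match.group(0), '%Y-%m-%d %H:%M:%S')
```

Raising `ValueError` on a missing match makes a changed page layout fail loudly instead of producing an `IndexError` deep inside a chained expression.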

Results:

1. News details:

2. News list:

3. Saved as a CSV file:

4. Saved as an SQL file
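
The saved output can be read back and processed with pandas to confirm that both formats round-trip. The following is a minimal sketch under stated assumptions: the three rows are hypothetical sample data that reuse the crawler's dictionary keys, and a temporary directory and an in-memory SQLite database stand in for the `D:\py_file` paths.

```python
import os
import sqlite3
import tempfile
from datetime import datetime

import pandas as pd

# Hypothetical sample rows with the same keys the crawler produces.
sample_news = [
    {'newsTitle': 'Sample title A', 'newsDT': datetime(2019, 4, 1, 11, 57, 0),
     'newsClick': 120, 'description': 'Sample description A'},
    {'newsTitle': 'Sample title B', 'newsDT': datetime(2019, 3, 28, 9, 30, 0),
     'newsClick': 45, 'description': 'Sample description B'},
    {'newsTitle': 'Sample title C', 'newsDT': datetime(2019, 4, 3, 16, 5, 0),
     'newsClick': 300, 'description': 'Sample description C'},
]
newsdf = pd.DataFrame(sample_news)

# Simple processing: most-read items first, and items published from a given date on.
top_news = newsdf.sort_values('newsClick', ascending=False)
recent = newsdf[newsdf['newsDT'] >= datetime(2019, 4, 1)]

# Round trip through CSV, using the same utf_8_sig encoding as the crawler.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'gzcc.csv')
    newsdf.to_csv(csv_path, encoding='utf_8_sig', index=False)
    from_csv = pd.read_csv(csv_path, encoding='utf_8_sig', parse_dates=['newsDT'])

# Round trip through SQLite, filtering with SQL on the way back.
db = sqlite3.connect(':memory:')
newsdf.to_sql('gzccnewsdb', db, index=False, if_exists='replace')
from_sql = pd.read_sql_query('SELECT * FROM gzccnewsdb WHERE newsClick > 100', db)
db.close()
```

Passing `if_exists='replace'` lets the script be rerun without `to_sql` raising an error because the table already exists.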