Web Crawler Final Project

I. Saving the crawled content to a MySQL database
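No listing is given for the database step. The following is a minimal sketch of one way to persist the crawled records, assuming each record is a dictionary with the keys `time`, `source`, `name`, and `content` built later in this post. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a database server; the table name `news` and the helper `save_news` are hypothetical and not from the original project.

```python
import sqlite3

def save_news(conn, records):
    """Insert crawled news dicts into a table named `news` (hypothetical name)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "time TEXT, source TEXT, name TEXT, content TEXT)"
    )
    rows = [
        (str(r["time"]), r["source"], r["name"], r["content"])
        for r in records
    ]
    # Parameterized query: the driver takes care of quoting the values.
    conn.executemany("INSERT INTO news VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# In-memory database and a made-up record, for illustration only.
conn = sqlite3.connect(":memory:")
sample = [{"time": "none", "source": "none", "name": "none", "content": "example text"}]
inserted = save_news(conn, sample)
stored = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
```

With a MySQL driver such as PyMySQL the same DB-API pattern should carry over, with `%s` placeholders in place of `?` and a real connection instead of the in-memory one.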

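II. Web crawler final project

Crawling the Autohome (汽车之家) website:

1. Topic: crawl the news articles on Autohome, analyze the words they contain, and generate a word cloud

URL: https://www.autohome.com.cn/news/?p=s#liststart

2. Implementation steps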

Get the time, source, name, and content from the page at Url

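import requests
from bs4 import BeautifulSoup
from datetime import datetime

# writeContent is called below but not shown in the original post.
# Assumption: it appends the article text to content.txt, which the
# word-segmentation step reads later.
def writeContent(content):
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content + '\n')

def getNewsDetail(Url):
    res = requests.get(Url)
    res.encoding = 'gb2312'
    Ssoup = BeautifulSoup(res.text, 'html.parser')
    news = {}
    # Publication time; 'none' when the page has no .time element
    if len(Ssoup.select('.time')) > 0:
        time = Ssoup.select('.time')[0].text.rstrip(' ').lstrip('\r\n')
        dt = datetime.strptime(time, '%Y年%m月%d日 %H:%M')
    else:
        dt = 'none'
    # Source; lstrip removes the leading "来源：" label characters
    if len(Ssoup.select('.source')) > 0:
        source = Ssoup.select('.source')[0].text.lstrip("来源：")
    else:
        source = 'none'
    # Author name
    if len(Ssoup.select('.name')) > 0:
        name = Ssoup.select('.name')[0].text.lstrip('\n').rstrip('\n')
    else:
        name = 'none'
    # Article body
    if len(Ssoup.select('.details')) > 0:
        content = Ssoup.select('.details')[0].text.strip()
    else:
        content = 'none'
    news['time'] = dt
    news['source'] = source
    news['name'] = name
    news['content'] = content
    writeContent(news['content'])
    print(dt, source, name, Url)
    return news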

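Because this is a single function, it has to work for every article page rather than just one. It therefore checks whether the time, name, source, and content are present, and assigns 'none' to any field the page lacks.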
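The four existence checks above follow the same pattern, so they could be folded into one small helper. This is only a sketch: `first_text` is a hypothetical name, and `node` is assumed to be a BeautifulSoup object (anything with a `select` method that returns elements with a `text` attribute would work).

```python
def first_text(node, selector, default="none"):
    """Return the stripped text of the first element matching `selector`.

    Falls back to `default` when the page has no such element, which
    mirrors the 'none' placeholder used in getNewsDetail.
    """
    matches = node.select(selector)  # assumed: BeautifulSoup-style select()
    if not matches:
        return default
    return matches[0].text.strip()
```

With it, a field lookup becomes a single line such as `name = first_text(Ssoup, '.name')`. The time field would still need its own branch, because it is parsed with `datetime.strptime` only when present.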

Get the news items listed on one page

def getListPage(pageUrl):
    res = requests.get(pageUrl)
    res.encoding = 'gb2312'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    # Each .article-wrapper holds a list of <li> items, one per article
    for a in soup.select(".article-wrapper"):
        for b in a.select('li'):
            # Skip <li> items that carry no link
            if len(b.select("a")) > 0:
                newsUrl = 'http:' + b.select("a")[0].attrs['href']
                newslist.append(getNewsDetail(newsUrl))
    return newslist

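Because the list page contains many li elements, the code first iterates over the li elements and then looks up the a element inside each one.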
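The listing builds each article URL by prefixing 'http:' to the href, which only works if every link is protocol-relative (starting with `//`). As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves protocol-relative, site-relative, and absolute links against the list page URL. The helper name `absolute_url` and the sample hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "https://www.autohome.com.cn/news/"

def absolute_url(href, base=BASE):
    """Resolve an href found on the list page into a full URL."""
    return urljoin(base, href)

# Made-up hrefs covering the three common link forms.
examples = [
    absolute_url("//www.autohome.com.cn/news/201812/1.html"),  # protocol-relative
    absolute_url("/news/2/"),                                  # site-relative
    absolute_url("https://example.com/a.html"),                # already absolute
]
```

Note that the scheme is taken from the base URL, so protocol-relative links resolve to https here rather than the http used in the listing.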

Get the total number of pages

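def getPageN():
    res = requests.get(pageUrl)
    res.encoding = 'gb2312'
    soup = BeautifulSoup(res.text, 'html.parser')
    n = 1
    for a in soup.select('.page'):
        # The index of the pager link was lost from the original listing.
        # Assumption: take the largest numeric link text as the page count.
        numbers = [int(x.text) for x in a.select('a') if x.text.strip().isdigit()]
        if numbers:
            n = max(numbers)
    return n

pageUrl = 'https://www.autohome.com.cn/news/'
newstotal = []
# Page 1 is the news home page itself
newstotal.extend(getListPage(pageUrl))
n = getPageN()
# The start of the range was lost from the original listing; 2 is assumed
# because page 1 has already been fetched above.
for i in range(2, n):
    listPageUrl = 'https://www.autohome.com.cn/news/{}/#liststart'.format(i)
    newstotal.extend(getListPage(listPageUrl))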

# df = pandas.DataFrame(newstotal)
# import openpyxl
# df.to_excel('work.xlsx')

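Because the news site has so many pages, connection errors occurred during crawling. The rest of this post therefore uses only the data up to page 161, which is roughly all of the news from 2018.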
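One common way to get past transient connection errors is to retry a failed request a few times with a pause in between. The sketch below is not part of the original crawler; `with_retries` and its parameters are hypothetical, and it is written against a generic callable so that it does not depend on any particular HTTP library.

```python
import time

def with_retries(fetch, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call `fetch()` up to `attempts` times (attempts >= 1).

    Sleeps between failed tries and re-raises the last error if every
    attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                # Linear backoff: delay, 2 * delay, 3 * delay, ...
                time.sleep(delay * attempt)
    raise last_error
```

In the crawler this could wrap the request, for example `with_retries(lambda: requests.get(listPageUrl, timeout=10), exceptions=(requests.exceptions.RequestException,))`. Whether this would have been enough to crawl past page 161 is not known.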

Segment the crawled text and count word frequencies

import jieba

f = open('content.txt', 'r', encoding='utf-8')
story = f.read()
f.close()
# Punctuation, digits, and Latin letters to strip before segmentation
sep = '''，。‘’“”：；（）！？、《》 . < > / - 0 1 2 3 4 5 6 7 8 9
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z'''
# ASCII space and full-width space are not words
exclude = {' ', '\u3000'}
for c in sep:
    story = story.replace(c, '')
tem = list(jieba.cut(story))
wordDict = {}
words = list(set(tem) - exclude)
# Count how often each distinct word occurs in the text
for w in range(len(words)):
    wordDict[words[w]] = story.count(str(words[w]))
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
f = open('news.txt', 'a', encoding="utf-8")
# Write the 150 most frequent words, one per line
for i in range(min(150, len(dictList))):
    f.write(dictList[i][0] + '\n')
f.close()

This reads the content.txt file produced by the crawl, segments the text with the jieba library, finds the 150 most frequent words, and saves them to news.txt.
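The listing counts each word with `story.count`, which counts substring occurrences in the whole text, so a short word is also counted wherever it appears inside a longer one. A possible alternative, sketched here with the standard library's `collections.Counter`, is to count the segmented tokens directly. The function name `top_words` and the sample tokens are hypothetical.

```python
from collections import Counter

def top_words(tokens, exclude=frozenset(), n=150):
    """Return the `n` most frequent tokens as (word, count) pairs.

    Whitespace-only tokens and anything in `exclude` are skipped.
    """
    counts = Counter(t for t in tokens if t.strip() and t not in exclude)
    return counts.most_common(n)

# Made-up tokens standing in for the output of a word segmenter.
tokens = ["car", "news", "car", " ", ",", "news", "car", "engine"]
result = top_words(tokens, exclude={","}, n=2)
```

In the script above, `tokens` would be the output of `jieba.cut(story)`, and the returned counts could be written to news.txt alongside the words.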

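import wordcloud
from PIL import Image, ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import jieba

f = open("news.txt", "r", encoding='utf-8')
str1 = f.read()
f.close()
stringList = list(jieba.cut(str1))
# Punctuation and whitespace to drop; '\n' is included because news.txt
# stores one word per line
delset = {"，", "。", "：", "“", "”", "？", " ", "；", "！", "、", "\n"}
stringset = set(stringList) - delset
countdict = {}
for i in stringset:
    countdict[i] = stringList.count(i)
# Raw strings keep the backslashes in the Windows paths from being read as escapes
image = Image.open(r'G:\Work\Python1\789.jpg')
graph = np.array(image)
font = r'C:\Windows\Fonts\simhei.TTF'
# The max_words value was lost from the original listing; 150 is assumed
# here to match the 150 words stored in news.txt.
wc = WordCloud(font_path=font, background_color='White', max_words=150, mask=graph)
wc.generate_from_frequencies(countdict)
image_color = ImageColorGenerator(graph)
plt.imshow(wc)
plt.axis("off")
plt.show()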

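This reads the top 150 words stored in news.txt and generates the word cloud.

3. Results

4. Reflections and conclusions

Crawling data has real practical value: from the Autohome news site one can extract the most common terms in car news and learn more about cars.

This crawler project deepened my understanding of the steps involved in crawling data and how to apply them, which should be useful in future work and daily life.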