Understanding How Web Crawlers Work

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2881

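1. Briefly explain how a crawler works

Simply put, the Internet is like a huge spider web, with data stored at its nodes. A crawler is like a spider that moves along the web and grabs the data it needs. More precisely, a crawler is a program that sends requests to a website, obtains the resources, then analyzes them and extracts the useful data.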
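The request, analyze, extract, and follow-links cycle described above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: FAKE_WEB, LinkAndTitleParser, and crawl are hypothetical names, and the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "/index": '<a href="/news/1">one</a> <a href="/news/2">two</a>',
    "/news/1": '<h1>First story</h1> <a href="/index">home</a>',
    "/news/2": '<h1>Second story</h1> <a href="/news/1">prev</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Collects the href of every <a> tag and the text of every <h1> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data.strip())


def crawl(start, fetch):
    """Breadth-first crawl starting at `start`; `fetch` maps a URL to HTML text."""
    seen = {start}
    queue = deque([start])
    titles = []
    while queue:
        url = queue.popleft()
        page = fetch(url)                # 1. send the request, get the resource
        parser = LinkAndTitleParser()
        parser.feed(page)                # 2. analyze the response
        titles.extend(parser.titles)     # 3. keep the useful data
        for link in parser.links:        # 4. move on to neighbouring nodes
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


print(crawl("/index", lambda u: FAKE_WEB.get(u, "")))
```

The `seen` set is what keeps the spider from walking in circles when pages link back to each other.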

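2. Understand the crawler development process

1). Briefly explain how a browser works;

When you enter a URL in the browser, the browser sends a request to the server. The server processes the request and returns a response, and the browser then renders the content for the user based on that response.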
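To make the request/response exchange concrete, here is a small sketch of what travels over the wire. The helper names build_get_request and parse_response are hypothetical, and SAMPLE_RESPONSE is hand-written rather than captured from a real server; a real browser sends many more headers.

```python
def build_get_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",   # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# Hand-written sample response (an assumption for illustration)
SAMPLE_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body><h1>Hello</h1></body></html>"
)

print(build_get_request("news.gzcc.cn", "/"))
status, headers, body = parse_response(SAMPLE_RESPONSE)
print(status, headers["content-type"])
print(body)
```

A crawler performs the same exchange as the browser; it simply skips the rendering step and reads the body directly.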

2). Use the requests library to fetch website data;

requests.get(url) fetches the HTML of the campus news home page
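A minimal sketch of this step follows. The helper name fetch_html and the 10-second timeout are assumptions, and the UTF-8 encoding is assumed because the code later in this post sets it as well. The home page URL is not given in this post, so the commented usage borrows the article URL quoted further down.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()    # raise an exception on 4xx/5xx responses
    response.encoding = "utf-8"    # assumption: the site serves UTF-8
    return response.text


# Usage (needs network access):
# page = fetch_html("http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html")
# print(page[:200])
```

Setting a timeout and calling raise_for_status() are optional, but they keep a stalled or failed request from passing silently into the parsing step.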

3). Understand web pages

Write a simple HTML file that contains several tags, classes, and ids
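One possible sample is shown here as a Python string so it can be handed to a parser directly. The tag names, classes, and ids are made up for illustration, and the Inventory class is a hypothetical helper that lists what the page contains using only the standard library.

```python
from html.parser import HTMLParser

# Illustrative page; every class and id here is invented for this sketch
html_sample = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="title">Campus News</h1>
    <p class="intro">Latest stories</p>
    <a class="link" href="/news/1">First story</a>
    <a class="link" href="/news/2">Second story</a>
    <div id="footer" class="note">Contact us</div>
  </body>
</html>
"""


class Inventory(HTMLParser):
    """Records every tag name, class, and id seen in the document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.classes = set()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        attrs = dict(attrs)
        if attrs.get("class"):
            self.classes.update(attrs["class"].split())
        if attrs.get("id"):
            self.ids.add(attrs["id"])


inventory = Inventory()
inventory.feed(html_sample)
print(sorted(set(inventory.tags)))
print(sorted(inventory.classes))
print(sorted(inventory.ids))
```

Tags describe what kind of element something is, classes group similar elements, and ids name a single element; these are the three handles a crawler uses to locate data.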

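4). Parse the page with Beautiful Soup;

BeautifulSoup(html_sample,'html.parser') parses the HTML file above into a DOM tree

Use select (CSS selectors) to locate data:

Find HTML elements with a specific tag

Find HTML elements with a specific class name

Find HTML elements with a specific id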
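The three kinds of lookup can be sketched as follows. The contents of html_sample are made up for illustration; the selector syntax is standard CSS: a bare name matches a tag, a leading dot matches a class, and a leading # matches an id.

```python
from bs4 import BeautifulSoup

# Illustrative page; the classes and ids are invented for this sketch
html_sample = """
<html>
  <body>
    <h1 id="title">Campus News</h1>
    <p class="intro">Latest stories</p>
    <a class="link" href="/news/1">First story</a>
    <a class="link" href="/news/2">Second story</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

links = soup.select("a")         # by tag name
intro = soup.select(".intro")    # by class name
title = soup.select("#title")    # by id

print([a.text for a in links])
print([a["href"] for a in links])
print(intro[0].text)
print(title[0].text)
```

select always returns a list, even for an id that matches a single element, which is why the results are indexed with [0] before reading .text.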

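3. Extract the title, publication time, publishing unit, author, click count, content, and other information from one campus news article

For example, url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'

The publication time must be a datetime, the click count must be numeric, and all other fields must be strings.

Code: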
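The type requirements can be illustrated in isolation before touching the network. The helper names and both sample strings below are assumptions: the time format mirrors the one passed to strptime later in this post, and the JavaScript snippet imitates the kind of response from which the number after the last .html( call is taken.

```python
import re
from datetime import datetime


def parse_publish_time(text):
    """Turn 'YYYY-MM-DD HH:MM:SS' into a datetime object."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def parse_click_count(js_text):
    """Return the number inside the last .html('N') call as an int."""
    numbers = re.findall(r"\.html\('(\d+)'\)", js_text)
    if not numbers:
        raise ValueError("no click count found")
    return int(numbers[-1])


# Made-up sample values in the assumed formats
sample_time = "2019-03-20 11:29:00"
sample_js = "$('#todaydowns').html('3');$('#hits').html('245');"

published = parse_publish_time(sample_time)
clicks = parse_click_count(sample_js)
print(type(published), published)
print(type(clicks), clicks)
```

Converting early means later code can compare dates or add up click counts without re-parsing strings.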

import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

def html(url):
    # Download the page and parse it into a BeautifulSoup tree
    response=requests.get(url=url)
    response.encoding='utf-8'
    soup=bs(response.text,'html.parser')
    return soup

url="http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0328/11080.html"
# Click-count endpoint; its id parameter is the news id that appears in url
url2='http://oa.gzcc.cn/api.php?op=count&id=11080&modelid=80'

# Fetch and parse the news page once, then reuse the result
soup=html(url)
# Whitespace-separated fields of the info line
info=soup.select('div .show-info')[0].text.split()

# Title
title=soup.select('div .show-title')[0].text
print("Title: "+title)
# Publication time
time1=info[0].split(':')[1]
time2=info[1]
Time=time1+ ' ' +time2
print("Published: "+Time)
# Publishing unit
comFrom=info[4].split('：')[1]
print("Publishing unit: "+comFrom)
# Author
write=info[2].split('：')[1]
print("Author: "+write)
# Click count: keep the text after the last 'html', then strip the punctuation around the number
count=html(url2).text.split()[0].split('html')[-1]
ss="()';"
for i in ss:
    count=count.replace(i,'')
co=int(count)
print("Clicks:",co)
# Content: replace each full stop with a line break
cont=soup.select('div .show-content')[0].text.replace('。','\n')
print("Content:")
print(cont)

# Convert the string to a datetime
now=datetime.strptime(Time,'%Y-%m-%d %H:%M:%S')
print(type(now))
# Convert a datetime to a string
now1=datetime.now()
now1=now1.strftime('%Y{y}-%m{m}-%d{d} %H{H}:%M{M}:%S{S}').format(y='年',m='月',d='日',H='时',M='分',S='秒')
print(now1)

Output: