Fetching All the Information for a News Article

2019-04-01 21:39:00

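Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2894

Given the link to a news article, newsUrl, fetch all of the article's information:

Title, author, publishing unit, reviewer, source

Publication time: convert it to the datetime type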
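One way to do the datetime conversion is with `datetime.strptime`. The sketch below assumes the scraped string looks like `2019-03-31 10:20`, possibly padded with spaces or newlines; both the format and the helper name `parse_publish_time` are assumptions, not something the assignment specifies.

```python
from datetime import datetime

# Assumed layout of the scraped timestamp; adjust it to match the real page.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def parse_publish_time(raw):
    """Convert a scraped time string to a datetime (hypothetical helper)."""
    # strip() removes the surrounding spaces and newlines left over from the HTML.
    return datetime.strptime(raw.strip(), TIME_FORMAT)


if __name__ == "__main__":
    print(parse_publish_time("\n  2019-03-31 10:20  "))
```

If the string does not match the format, `strptime` raises `ValueError`, which is usually preferable to silently keeping an unparsed string.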

Click count:

newsUrl
newsId (use a regular expression, re)
clickUrl (str.format(newsId))
requests.get(clickUrl)
newClick (use string processing or a regular expression)
int()
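The click-count steps above can be sketched as follows. The endpoint template `CLICK_URL_TEMPLATE`, the sample response text, and all function names are hypothetical, since the real click-count URL and its response format are not given here. The network call is passed in as a `fetch` callable so the logic can be tried offline; with requests it could be `lambda u: requests.get(u).text`.

```python
import re

# Hypothetical click-count endpoint; the real URL template is an assumption.
CLICK_URL_TEMPLATE = "https://example.com/api/count?id={}"


def extract_news_id(news_url):
    """Take the last run of digits in the URL as the news ID."""
    return re.findall(r"\d+", news_url)[-1]


def build_click_url(news_id):
    """Fill the news ID into the URL template with str.format."""
    return CLICK_URL_TEMPLATE.format(news_id)


def parse_click_count(response_text):
    """Return the first integer in the response (assumes it is the count)."""
    match = re.search(r"\d+", response_text)
    if match is None:
        raise ValueError("no click count found in response")
    return int(match.group())


def get_click_count(news_url, fetch):
    """Chain the steps: newsUrl -> newsId -> clickUrl -> response -> int."""
    click_url = build_click_url(extract_news_id(news_url))
    return parse_click_count(fetch(click_url))


if __name__ == "__main__":
    # Made-up response body standing in for the real endpoint's reply.
    fake_fetch = lambda url: "$('#hits').html('4238');"
    url = "https://www.thepaper.cn/newsDetail_forward_3231590"
    print(get_click_count(url, fake_fetch))
```

Taking the first integer is only safe when the response contains no other digits before the count; a stricter pattern tied to the actual response format would be more robust.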

Wrap the whole process in a simple, clear function.

Then try crawling a web page that interests you.

import requests
import re
from bs4 import BeautifulSoup

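# Fetch the HTML page
def getHtml(url):
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raise on HTTP errors instead of ignoring the status code
    r.encoding = r.apparent_encoding  # guess the encoding from the response body
    return r.text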

# Extract the news details
def newsInfo(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select(".news_title")  # news title
    oneInfo = soup.select(".news_about")
    time = re.findall("</p>.*<p>(.*?)<span>", str(oneInfo[0]), re.S)  # publication time
    source = re.findall("來源:(.*?)</span>", str(oneInfo[0]), re.S)  # news source
    twoInfo = soup.select(".news_txt")
    writer = re.findall("</div>文：(.*?)<br/>", str(twoInfo[0]), re.S)  # author
    news = twoInfo[0].text  # article body
    return title, time, source, writer, news

# Extract the news ID
def newsid(url):
    # Raw string avoids an invalid-escape warning for \d; take the last 7-digit run.
    newsID = re.findall(r'(\d{7})', url)[-1]
    return newsID

# Main routine
def main():
    url = "https://www.thepaper.cn/newsDetail_forward_3231590"
    html = getHtml(url)
    newsID = newsid(url)
    title, time, source, writer, news = newsInfo(html)
    print("News ID: " + newsID)
    print("Title: " + title[0].text.strip())
    print("Published: " + time[0].strip())
    print("Source: " + source[0].strip())
    print("Author: " + writer[0].strip())
    print("Content: " + news.strip())

if __name__ == "__main__":
    main()
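To see what the publication-time and source patterns capture without hitting the network, they can be run against a small hand-written fragment. The markup below is made up to imitate the assumed shape of the `.news_about` block; the real page may differ.

```python
import re

# Made-up fragment imitating the assumed structure of the ".news_about" block.
sample = (
    '<div class="news_about">'
    "<p>Reporter Name</p>"
    "<p>\n 2019-03-31 10:20 <span>來源:Example Source</span></p>"
    "</div>"
)

# re.S lets "." match newlines, so the patterns can span line breaks.
pub_time = re.findall("</p>.*<p>(.*?)<span>", sample, re.S)
source = re.findall("來源:(.*?)</span>", sample, re.S)

print(pub_time[0].strip())
print(source[0].strip())
```

Because the patterns depend on the exact tag order, they are brittle: any change to the page layout can make `findall` return an empty list, so checking the result before indexing `[0]` is a sensible precaution.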
