Python bs4 module: [Python summary] the bs4 module

2023-08-03 10:13:02

1. Importing the module:

from bs4 import BeautifulSoup

2. The urllib module and BeautifulSoup()

import urllib.request  # import the urllib.request module

url = 'http://www.baidu.com'

resp = urllib.request.urlopen(url)  # request the page

html = resp.read()  # read the response body (bytes)

soup = BeautifulSoup(html, 'lxml')  # parse the HTML, using lxml as the parser

BeautifulSoup() syntax: the first argument is the markup to parse, and the second argument names the parser to use.
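To make the two arguments concrete, here is a minimal sketch that parses an inline HTML string instead of a live page, so it runs without network access. The sample markup and the variable names are made up for illustration, and it assumes the built-in html.parser so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; any HTML string or bytes object can be passed.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# First argument: the markup to parse. Second argument: the parser name.
demo_soup = BeautifulSoup(sample_html, "html.parser")

title_text = demo_soup.title.string
paragraph_text = demo_soup.p.string
print(title_text)
print(paragraph_text)
```

The bytes returned by resp.read() can be passed in the same way as the string used here.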


1. Python standard library parser

BeautifulSoup(html, "html.parser")

2. lxml HTML parser

BeautifulSoup(html, "lxml")

3. lxml XML parser

BeautifulSoup(html, ["lxml", "xml"])

BeautifulSoup(html, "xml")

4. html5lib parser

BeautifulSoup(html, "html5lib")
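Only html.parser ships with Python; lxml and html5lib are separate installs, and asking for a parser that is missing raises bs4.FeatureNotFound. The sketch below shows one possible way to fall back between parsers. The helper name make_soup and the order of preference are assumptions made for this example, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")

fallback_soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used, fallback_soup.p.string)
```

Different parsers can build slightly different trees from invalid markup, so pinning one parser explicitly is usually the safer choice once the environment is known.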

Re-indent the document by nesting level:

soup.prettify()

Print the re-indented document:

print(soup.prettify())
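As a small illustration, prettify() returns a string with one tag or text node per line, indented by depth; it does not print anything by itself. The markup and variable names below are invented for this sketch, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup to be re-indented.
compact_html = "<div><p>Hello <b>world</b></p></div>"
pretty_soup = BeautifulSoup(compact_html, "html.parser")

# prettify() returns a str; printing is a separate step.
pretty_text = pretty_soup.prettify()
print(pretty_text)
```

Because prettify() adds whitespace, it is best used for inspecting a document, not for producing the HTML that will be served.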

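3. Selecting elements in the HTML (note: in the code below, soup = BeautifulSoup(html, 'lxml'), so the parser is lxml)

Overall, much of the API is rarely needed; the parts that are useful in practice are roughly the following:

Find all matches:

soup.select('tag name')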

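Display a tag's content:

# String

soup.title.string

# .string also returns comment text. A comment has a different type() (Comment) from ordinary text, so comment text has to be filtered out.

# Comment

soup.a.string

for item in soup.body.contents:

    print(item.name)

Note that soup.title and soup.a return only the first matching tag, so this displays one element at a time.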
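The comment caveat above can be sketched as follows. The markup and the helper visible_string are hypothetical and exist only for this example; Comment is the real class that bs4 uses for comment nodes.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the <a> tag contains only a comment.
comment_html = (
    "<html><head><title>Demo</title></head>"
    '<body><a href="#"><!-- hidden note --></a><p>Visible text</p></body></html>'
)
comment_soup = BeautifulSoup(comment_html, "html.parser")

def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is a comment."""
    text = tag.string
    if isinstance(text, Comment):
        return None
    return text

title_string = visible_string(comment_soup.title)  # ordinary text is kept
link_string = visible_string(comment_soup.a)       # comment text is dropped
raw_link_string = comment_soup.a.string            # the unfiltered Comment object

# Names of the direct children of <body>.
child_names = [item.name for item in comment_soup.body.contents]
print(title_string, link_string, child_names)
```

Checking with isinstance is enough here because Comment is a subclass of the ordinary string type that .string returns.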

Find by CSS selector:

soup.select('.sister')  # look up by class; the name after the dot is the class name

soup.select('#link1')  # look up by ID; the name after the # is the ID

Find by hierarchy:

soup.select('head > title')  # parent > child lookup; this finds the title tag directly inside head
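The selector forms above can be tried together on a small document. This is a sketch under assumptions: the markup is an invented sample that reuses the sister class and link1 ID from the text, the example.com URLs are placeholders, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with three links sharing one class.
sisters_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
select_soup = BeautifulSoup(sisters_html, "html.parser")

by_tag = select_soup.select("a")               # every <a> tag
by_class = select_soup.select(".sister")       # every tag with class="sister"
by_id = select_soup.select("#link1")           # the tag with id="link1"
by_child = select_soup.select("head > title")  # <title> directly inside <head>

# Attribute-style access returns only the first match, unlike select().
first_only = select_soup.a

print([a.get_text() for a in by_class])
print(by_id[0]["href"], by_child[0].string, first_only.get_text())
```

select() always returns a list, even for an ID lookup, so a single result still has to be indexed with [0].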
