Python bs4 module: [Python summary] the bs4 module

2023-08-03 10:13:02

I. Importing the module:

from bs4 import BeautifulSoup

II. The urllib module and BeautifulSoup()

import urllib.request  # import the urllib.request module

url = 'http://www.baidu.com'

resp = urllib.request.urlopen(url)  # fetch the page

html = resp.read()  # read the response body (returned as bytes)

soup = BeautifulSoup(html, 'lxml')  # parse the HTML, using lxml as the parser

BeautifulSoup() syntax: the first argument is the markup to parse, and the second argument specifies the parser.
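As a minimal sketch of this two-argument call, the snippet below parses an inline HTML string instead of a downloaded page, so it runs without network access. The sample markup and the variable names are made up for illustration, and the built-in html.parser is used here (rather than lxml) only so that no extra library is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# First argument: the markup to parse; second argument: the parser name
demo_soup = BeautifulSoup(sample_html, "html.parser")

title_text = demo_soup.title.string  # text of the first <title> tag
print(title_text)
```

The same call also accepts the bytes returned by resp.read(), which is why the downloaded page can be passed in directly.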

1. Python standard library parser syntax

BeautifulSoup(html, "html.parser")

2. lxml HTML parser syntax

BeautifulSoup(html, "lxml")

3. lxml XML parser syntax

BeautifulSoup(html, ["lxml", "xml"])

BeautifulSoup(html, "xml")

4. html5lib parser syntax

BeautifulSoup(html, "html5lib")
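Since lxml and html5lib are separate packages that may not be installed, one possible pattern is to try the preferred parser and fall back to the built-in one. The helper name make_soup and the sample markup below are hypothetical; this is only a sketch of the idea, not a recommended standard.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser library is missing
        return BeautifulSoup(markup, "html.parser")

fallback_soup = make_soup("<p>parser demo</p>")
print(fallback_soup.p.string)
```

Different parsers can build slightly different trees from broken markup, so a silent fallback like this is best kept to well-formed input.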

Re-indent the markup according to its nesting levels:

soup.prettify()

Print the re-indented result:

print(soup.prettify())
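To see what prettify() does, the sketch below applies it to a one-line fragment. The markup and variable names are invented for the example, and html.parser is assumed so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup with nested tags
compact_html = "<div><p>one</p><p>two</p></div>"
pretty_soup = BeautifulSoup(compact_html, "html.parser")

# prettify() returns a new string; the soup object itself is not changed
pretty_text = pretty_soup.prettify()
print(pretty_text)
```

Each tag and each piece of text ends up on its own line, indented by its depth in the tree.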

III. Selecting elements in the HTML (note: in the code below, soup = BeautifulSoup(html, 'lxml'), i.e. the parser is lxml)

Overall, much of the API is rarely needed; the parts that are actually useful are roughly the following:

Find (all matches):

soup.select('tag_name')  # returns a list of every matching tag

Show a tag's content:

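# String: the text inside a tag

soup.title.string

# soup.title.string also picks up comments. A comment has a different type() from other strings, so comment text has to be excluded by checking the type.

# Comment

soup.a.string

# print the tag name of each direct child of <body>
for item in soup.body.contents:
    print(item.name)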

Note that attribute access such as soup.title or soup.a returns only one tag at a time: the first match.
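One way to exclude comment text is to check whether the value returned by .string is a Comment instance. The helper visible_string and the sample markup are hypothetical names introduced for this sketch, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one tag holds only a comment, the other holds real text
comment_html = "<p><!-- hidden note --></p><b>visible text</b>"
comment_soup = BeautifulSoup(comment_html, "html.parser")

def visible_string(tag):
    """Return tag.string as a str, or None if it is missing or a Comment."""
    text = tag.string
    if text is None or isinstance(text, Comment):
        return None
    return str(text)

p_text = visible_string(comment_soup.p)  # the <p> contains only a comment
b_text = visible_string(comment_soup.b)  # the <b> contains ordinary text
print(p_text, b_text)
```

Comment is a subclass of the ordinary string type used by bs4, which is why .string returns it without complaint and an explicit type check is needed.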

Find by CSS selector:

soup.select('.sister')  # query by class: the name after the dot is the class name

soup.select('#link1')  # query by ID: the name after the # is the ID

Hierarchical lookup:

soup.select('head > title')  # hierarchical query: finds the <title> that is a direct child of <head>
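The three kinds of selector can be tried together on a small page. The markup below is a made-up sample that reuses the sister class and link1 ID from the selectors above; the example.com URLs and variable names are placeholders, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two links sharing a class and carrying distinct IDs
sisters_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</body></html>
"""
css_soup = BeautifulSoup(sisters_html, "html.parser")

by_class = css_soup.select(".sister")       # every tag with class "sister"
by_id = css_soup.select("#link1")           # the tag whose id is "link1"
by_child = css_soup.select("head > title")  # <title> directly inside <head>

print([a.get_text() for a in by_class])
print(by_id[0]["href"])
print(by_child[0].get_text())
```

select() always returns a list, even for an ID query, so a single result still has to be taken out with an index such as [0].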

*********************************End*********************************