Python Crawler urllib Notes (4): Scraping Baidu Tieba with BeautifulSoup

2023-08-05 14:45:31

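BeautifulSoup is a third-party library for parsing web page content. It is used here in place of regular expressions (official documentation is available in Chinese).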
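To make the idea concrete before touching a live page, here is a minimal sketch that runs BeautifulSoup against an inline HTML string. The markup in `SAMPLE_HTML` and the example.com URLs are made up for illustration; only the `BDE_Image` class name comes from the Tieba page discussed in this post. It assumes Python 3 with `beautifulsoup4` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not copied from a real Tieba page.
SAMPLE_HTML = """
<div class="post">
  <img class="BDE_Image" src="http://example.com/a.jpg" width="507">
  <img class="avatar" src="http://example.com/avatar.png">
  <img class="BDE_Image" src="http://example.com/b.jpg">
</div>
"""

# Use the parser that ships with Python so no extra package is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ carries a trailing underscore because class is a Python keyword.
image_urls = [img["src"] for img in soup.find_all("img", class_="BDE_Image")]
print(image_urls)
```

The avatar image is skipped because its class does not match, which is the kind of filtering that would otherwise need a hand-written regular expression.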

Extract the images posted by the thread starter from a Baidu Tieba page.

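# -*- coding: utf-8 -*-
#
# BeautifulSoup: a third-party library for parsing web page content, used here
# in place of regular expressions (official documentation is available in Chinese).
# pip install beautifulsoup4

from urllib.request import urlopen  # Python 3; Python 2 spelled this urllib.urlopen

from bs4 import BeautifulSoup


def get_content(url):
    """Download the page at url and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def get_imgs(info):
    """
    Return the src URLs of the images embedded in the posts. Each tag looks like:
    <img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=4a711e3af1246b607b0eb27cdbf91a35/9c019245d688d43f73ecd19b7a1ed21b0ef43b10.jpg"
    size="15633" height="900" width="507">
    """
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(info, 'html.parser')
    # Find every img tag with the CSS class BDE_Image. The keyword is class_
    # (trailing underscore) to avoid clashing with the Python keyword class.
    all_img = soup.find_all('img', class_='BDE_Image')
    # for i, img in enumerate(all_img):
    #     # Print the src address of each tag
    #     print(img['src'])
    #     # Download the file (needs: from urllib.request import urlretrieve)
    #     urlretrieve(img['src'], 'F:\\data\\pachong\\pic2\\%s.jpg' % i)
    # Return all addresses as a list
    return [img['src'] for img in all_img]


info = get_content("http://tieba.baidu.com/p/4364768066")
print(get_imgs(info))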
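The commented-out loop inside `get_imgs` hints at a download step. Below is one possible sketch of that step as separate functions, written for Python 3 and using only the standard library. The names `build_targets` and `download_all` are hypothetical helpers, not part of the original script, and the numbered `.jpg` filenames simply follow the naming used in the commented-out loop.

```python
import os
from urllib.request import urlretrieve


def build_targets(urls, dest_dir):
    """Pair each URL with a numbered .jpg path inside dest_dir."""
    return [(url, os.path.join(dest_dir, "%d.jpg" % i))
            for i, url in enumerate(urls)]


def download_all(urls, dest_dir):
    """Download every URL into dest_dir and return the saved file paths."""
    os.makedirs(dest_dir, exist_ok=True)  # create the folder if it is missing
    saved = []
    for url, path in build_targets(urls, dest_dir):
        urlretrieve(url, path)  # performs the actual network request
        saved.append(path)
    return saved
```

With the script above, a call such as `download_all(get_imgs(info), 'pic2')` would save the images as `pic2/0.jpg`, `pic2/1.jpg`, and so on. Keeping the path planning in `build_targets` separate from the network call makes the naming easy to check without downloading anything.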
