Web crawler #notes #crawler #Python

Web crawler

2013-11-12 22:19:30


Goal: download all images from Tieba posts and image galleries.

1. Fetch the page source

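Module: import urllib

Function:

def gethtml(url):
    # Python 2 API; in Python 3 this call is urllib.request.urlopen
    page = urllib.urlopen(url)
    # read() returns the raw page source
    html = page.read()
    return html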
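For readers on Python 3, where urllib.urlopen no longer exists, the same step might look like the sketch below. The name get_html and the encoding parameter are illustrative assumptions, not part of the original note; a page that is not UTF-8 would need a different encoding value.

```python
import urllib.request


def get_html(url, encoding="utf-8"):
    """Return the body of url as text (hypothetical helper, Python 3)."""
    # urlopen returns a response object that can be used as a context manager
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # bytes in Python 3
    # Decode so that the result can be searched with a str regex later
    return raw.decode(encoding, errors="replace")
```

Because urlopen also accepts data: URLs, the function can be tried without network access, for example get_html("data:text/plain,hello").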

2. Use a regular expression to extract the needed information

Target: the src="http..." attribute that holds each image address.
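To see what such a pattern captures, here is a small sketch run against a made-up HTML fragment. The sample markup, the example.com addresses, and the name find_images are assumptions for illustration; the pattern itself is the one used in the example that follows.

```python
import re

# Hypothetical fragment; each tag sits on its own line
sample = (
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">\n'
    '<img src="http://example.com/logo.png">\n'
    '<img src="http://example.com/b.jpg" pic_ext="jpeg">\n'
)

# Capture the address inside src="..." when it ends in .jpg and is
# followed by a pic_ext attribute
IMG_RE = re.compile(r'src="(.*?\.jpg)" pic_ext')


def find_images(html):
    """Return every captured .jpg address, in document order."""
    return IMG_RE.findall(html)
```

On this sample, find_images(sample) returns the two .jpg addresses and skips the .png. Note that `.` does not match a newline, so a match cannot run from one line into the next; if several tags share a line, a stricter capture such as `[^"]+\.jpg` may be a safer choice.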

Example:

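#!/usr/bin/python
# Python 2 script: urllib.urlopen and urllib.urlretrieve live in
# urllib.request in Python 3.
import re
import sys
import urllib

# Fetch the HTML source of url
def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

# Download every .jpg found in html, saved as 0.jpg, 1.jpg, ...
def getImg(html):
    reg = r'src="(.*?\.jpg)" pic_ext'
    # Compile the regex to speed up matching
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1

if __name__ == '__main__':
    # The note leaves this body out; reading the page URL from the
    # command line is an assumption.
    getImg(getHtml(sys.argv[1]))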
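The download loop can also be sketched for Python 3, where urlretrieve is found in urllib.request. The name save_images and the dest_dir parameter are assumptions added for illustration; the numbering scheme (0.jpg, 1.jpg, ...) follows the example above.

```python
import os
import urllib.request


def save_images(img_urls, dest_dir="."):
    """Download each URL into dest_dir as 0.jpg, 1.jpg, ... (hypothetical helper).

    Returns the list of file paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    # enumerate replaces the manual x = 0 / x += 1 counter
    for x, img_url in enumerate(img_urls):
        target = os.path.join(dest_dir, "%s.jpg" % x)
        urllib.request.urlretrieve(img_url, target)
        saved.append(target)
    return saved
```

Returning the written paths makes it easy to check afterwards which files were actually saved.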
