Web Crawler in Python

2014-12-06 15:52:43

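I had some free time this weekend, so I wrote a small web crawler. First, what it does: it is a small program for grabbing articles, blog posts and the like from web pages. Start by finding the articles you want, for example Han Han's Sina blog. Open his article list and note its URL, for example http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html. Every article on that page has its own link, so all we need to do is follow each link and copy the article into a file on your own computer. And that's the articles crawled, haha. Enough talk, here is the code.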

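# Python 2 (uses urllib.urlopen and the print statement)
import urllib
import time

url = [''] * 50
j = 0

# Article list page
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()

i = 0
title = con.find(r'<a title=')       # position of the first <a title=
href = con.find(r'href=', title)     # position of the first href= after that <a title=
html = con.find(r'.html', href)      # position of the first .html after that href=

# The list page holds roughly 50 articles
while title != -1 and href != -1 and html != -1 and i < 50:
    url[i] = con[href + 6:html + 5]  # skip href= and the opening quote, keep through .html
    print url[i]
    title = con.find(r'<a title=', html)  # move on to the next article
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1

while j < i:  # only visit the links that were actually found
    content = urllib.urlopen(url[j]).read()  # read the page behind each link
    #print content
    filename = url[j][-26:]  # last 26 characters of the link
    with open(filename, 'w+') as f:
        f.write(content)  # write the content to a file of your own choosing
    print 'downloading', url[j]
    j = j + 1
    time.sleep(1)  # pause for one second between requests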
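The link scan above can be tried without touching the network. What follows is only a minimal sketch in Python 3, not the code from this post: the function names `extract_article_links` and `filename_for` and the sample markup are made up for illustration. It keeps the same `find`-based scan, and takes the file name from the last path segment of the link instead of a fixed 26-character slice, on the assumption that the slice was meant to capture that part of the URL.

```python
def extract_article_links(page, limit=50):
    """Collect up to `limit` article links with the same str.find scan."""
    links = []
    title = page.find('<a title=')
    while title != -1 and len(links) < limit:
        href = page.find('href=', title)
        if href == -1:
            break
        html = page.find('.html', href)
        if html == -1:
            break
        # Skip 'href=' plus the opening quote; keep everything through '.html'.
        links.append(page[href + 6:html + 5])
        title = page.find('<a title=', html)
    return links


def filename_for(link):
    """Hypothetical helper: use the last path segment as the file name."""
    return link.rsplit('/', 1)[-1]


# Made-up sample markup, shaped the way the scan expects the list page to look.
sample = (
    '<a title="First" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_aaa.html">First</a>\n'
    '<a title="Second" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_bbb.html">Second</a>\n'
)

links = extract_article_links(sample)
names = [filename_for(link) for link in links]
print(links)
print(names)
```

Returning a list, rather than filling a fixed 50-slot array, means the download loop can simply iterate over whatever was found, however many links the page turns out to contain.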
