
[Study Notes] Python3 Web Scraping

Case 1: A Python3 proxy-pool crawler

1. Find a public proxy IP site

For example, the Xici (xicidaili) free proxy list.

2. Write XPath expressions to pick out the fields (a quick test sketch follows the list)

//tr/td[2]/text()

//tr/td[3]/text()

//tr/td[6]/text()
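
Before dropping these into the spider, you can sanity-check them with lxml directly. A minimal sketch, assuming the xicidaili list page still keeps IP, port and type in table columns 2, 3 and 6:

import requests
from lxml import etree

# Fetch one list page and apply the three XPaths above.
html = requests.get('https://www.xicidaili.com/nn/1',
                    headers={'User-Agent': 'Mozilla/4.0'}).text
page = etree.HTML(html)
ips = page.xpath('//tr/td[2]/text()')
ports = page.xpath('//tr/td[3]/text()')
types = page.xpath('//tr/td[6]/text()')
print(list(zip(ips, ports, types))[:5])   # first few (ip, port, HTTP/HTTPS) tuples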

3. Write the code

import requests
import os, time, random
from fake_useragent import UserAgent
from lxml import etree


class ProxySpider(object):

    def __init__(self):
        self.baseurl = 'https://www.xicidaili.com/nn/{}'
        # Column 2 = IP, column 3 = port, column 6 = HTTP/HTTPS
        self.xpathip = '//tr/td[2]/text()'
        self.xpathport = '//tr/td[3]/text()'
        self.xpathhttps = '//tr/td[6]/text()'
        self.ua = UserAgent()

    def request_html(self, url):
        # Fetch the proxy-list page itself (no proxy needed here)
        try:
            header = {'User-Agent': 'Mozilla/4.0'}
            html = requests.get(url=url, headers=header).text
            return html
        except Exception as e:
            print(e)
            return 'error'

    def proxy_request_html(self, url, ip, isHttps):
        # Try a test request through the candidate proxy
        time.sleep(random.randint(1, 2))
        if isHttps is True:
            proxy = {'https': ip}
        else:
            proxy = {'http': ip}
        try:
            header = {'User-Agent': self.ua.random}
            requests.get(url=url, headers=header, proxies=proxy, timeout=8)
            return True
        except Exception as e:
            print(ip, e)
            return False

    def get_html(self, url):
        print(url)
        html = self.request_html(url)
        self.parse_html(html)

    def parse_html(self, html):
        xpathobj = etree.HTML(html)
        item_ip = xpathobj.xpath(self.xpathip)
        item_port = xpathobj.xpath(self.xpathport)
        item_http = xpathobj.xpath(self.xpathhttps)
        for li in range(0, len(item_ip), 1):
            test_ip = item_ip[li] + ":" + item_port[li]
            print('checking proxy', test_ip)
            if item_http[li] == "HTTPS":
                self.test_proxy(test_ip, True)
            elif item_http[li] == "HTTP":
                self.test_proxy(test_ip, False)

    def test_proxy(self, proxy_address, isHttps):
        # Only keep proxies that can actually reach the test URL
        ret = self.proxy_request_html('https://www.baidu.com/', proxy_address, isHttps)
        if ret is True:
            with open('proxy.log', 'a+') as f:
                f.write(proxy_address + '\n')
                print('proxy node usable:', proxy_address)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)


if __name__ == '__main__':
    spider = ProxySpider()
    spider.run()

4. Results

(screenshot of the output)

As the log shows, the HTTP proxies are mostly usable, while the HTTPS ones mostly are not.
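
To actually use the collected proxies later, read proxy.log back and hand one entry to requests. A minimal sketch, assuming the file contains one ip:port per line as written by test_proxy, and using httpbin.org only as a neutral test URL:

import random
import requests

# Pick a random verified proxy and route a request through it.
with open('proxy.log') as f:
    proxies = [line.strip() for line in f if line.strip()]

ip_port = random.choice(proxies)
proxy = {'http': ip_port, 'https': ip_port}
resp = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=8)
print(resp.text)   # should report the proxy's IP, not yours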

Case 2: Python3 crawler - Baidu Tieba - XPath

1. Use the XPath plugin to pick out the elements

Right-click on the text or image you want to extract and choose the XPath option, then open F12 (DevTools) and tweak the expression until it matches.


2. Write the code

Response Content (reference notes excerpted from the Requests documentation)

We can read the content of the server’s response. Consider the GitHub timeline again:

import requests

r = requests.get('https://api.github.com/events')

r.text

'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

r.encoding

'utf-8'

r.encoding = 'ISO-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
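
A minimal sketch of that workflow; the charset regex below is only an illustration, since real pages can declare the encoding in several different ways:

import re
import requests

r = requests.get('https://example.com/')
# Look for a charset declaration in the raw bytes, then tell Requests about it.
m = re.search(rb'charset=["\']?([\w-]+)', r.content)
if m:
    r.encoding = m.group(1).decode()
print(r.encoding)
print(r.text[:200])   # now decoded with the declared encoding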

Requests will also use custom encodings in the event that you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.

Binary Response Content

You can also access the response body as bytes, for non-text requests:

r.content

b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

For example, to create an image from binary data returned by a request, you can use the following code:

from PIL import Image

from io import BytesIO

i = Image.open(BytesIO(r.content))

JSON Response Content

There’s also a builtin JSON decoder, in case you’re dealing with JSON data:

import requests

r = requests.get('https://api.github.com/events')

r.json()

[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

In case the JSON decoding fails, r.json() raises an exception. For example, if the response gets a 204 (No Content), or if the response contains invalid JSON, attempting r.json() raises ValueError: No JSON object could be decoded.

It should be noted that the success of the call to r.json() does not indicate the success of the response. Some servers may return a JSON object in a failed response (e.g. error details with HTTP 500). Such JSON will be decoded and returned. To check that a request is successful, use r.raise_for_status() or check that r.status_code is what you expect.
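
A minimal sketch of that check:

import requests

r = requests.get('https://api.github.com/events')
r.raise_for_status()              # raises requests.HTTPError for 4xx/5xx responses
if r.status_code == 200:          # or check the code explicitly before parsing
    events = r.json()
    print(len(events), 'events')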

Raw Response Content

In the rare case that you’d like to get the raw socket response from the server, you can access r.raw. If you want to do this, make sure you set stream=True in your initial request. Once you do, you can do this:

r = requests.get('https://api.github.com/events', stream=True)

r.raw

r.raw.read(10)

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should use a pattern like this to save what is being streamed to a file:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.
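
Putting the two pieces together (stream=True on the request plus iter_content when writing), a minimal sketch with a placeholder URL and filename:

import requests

url = 'https://example.com/some/large/file.zip'   # placeholder URL
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('file.zip', 'wb') as fd:
        for chunk in r.iter_content(chunk_size=8192):
            fd.write(chunk)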

from lxml import etree
import re, time, os, random
import requests
from urllib import parse
from fake_useragent import UserAgent


class BaiduTiebaSpider(object):

    def __init__(self):
        self.baseurl = r'http://tieba.baidu.com/f?kw={}&pn={}'
        self.title_baseurl = r'https://tieba.baidu.com{}'
        # Post links on the list page, images and videos inside a post
        self.picXpath = r'//cc//img[@class="BDE_Image"]/@src'
        self.titleurlXpath = r'//li//a[@class="j_th_tit "]/@href'
        self.videoXpath = r'//div[@class="video_src_wrap_main"]/video/@src'
        self.ua = UserAgent()
        self.savePath = r'/home/user/work/spider/baidu/BaiduTieba/'

    def get_html(self, url):
        # A random User-Agent gets blocked here; the fixed IE11 UA works
        # header = {'User-Agent': self.ua.random}
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
        res = requests.get(url=url, headers=header).content
        return res

    def parse_html(self, html):
        parse_obj = etree.HTML(html)
        titlelink_list = parse_obj.xpath(self.titleurlXpath)
        for li in titlelink_list:
            titleurl = self.title_baseurl.format(li)
            print(titleurl)
            self.save_html(titleurl)
            time.sleep(random.randint(2, 3))

    def save_html(self, url):
        html = self.get_html(url)
        parse_obj = etree.HTML(html)
        piclinks = parse_obj.xpath(self.picXpath)
        for pics in piclinks:
            self.save_img(pics, self.savePath + pics[-10:])
        videolinks = parse_obj.xpath(self.videoXpath)
        for videos in videolinks:
            self.save_img(videos, self.savePath + videos[-10:])

    def save_img(self, imgurl, filename):
        img = self.get_html(imgurl)
        with open(filename, 'wb') as f:
            f.write(img)
            print(filename, 'download success')

    def run(self):
        name = input('Enter the Tieba name to search> ')
        start = input('Start Page> ')
        end = input('End Page> ')
        # start/end are read but only the first page (pn=0) is fetched in this version
        mainurl = self.baseurl.format(parse.quote(name), 0)
        print(mainurl)
        pagehtml = self.get_html(mainurl)
        self.parse_html(pagehtml)


if __name__ == '__main__':
    spider = BaiduTiebaSpider()
    spider.run()

Note the request header: the random User-Agent is commented out because a fixed one was needed here.

3. Results

(screenshot of the output)

Case 3: Python3 crawler - Lianjia second-hand housing - XPath

XPath is a language for finding information in XML documents; it can be used to navigate elements and attributes. It is a major component of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions, so an understanding of XPath is fundamental to many advanced XML applications.
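
In Python the same expressions are evaluated with lxml; a tiny self-contained example:

from lxml import etree

html = etree.HTML('<ul><li><a href="/a">first</a></li><li><a href="/b">second</a></li></ul>')
print(html.xpath('//li/a/text()'))   # ['first', 'second']
print(html.xpath('//li/a/@href'))    # ['/a', '/b']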

The XPath plugin is available in the Chrome Web Store (and in the 360 browser extension center). Open a web page and press F12; the plugin panel appears as the last tab in DevTools.


3. XPath filtering


Test the expression here; once it works, move it into the code.

4. Python3 code

import requests
import random, re, time
from fake_useragent import UserAgent
from lxml import etree


class LianJiaSpider(object):

    def __init__(self):
        self.baseurl = 'https://sz.lianjia.com/ershoufang/pg{}/'
        self.ua = UserAgent()

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        html = requests.get(url, headers=header, timeout=5).text
        # html.encoding = 'utf-8'
        self.parse_html(html)

    def parse_html(self, html):
        parse = etree.HTML(html)
        # One <li> per listing on the result page
        li_list = parse.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
        item = {}
        for i in li_list:
            item['name'] = i.xpath('.//a[@data-el="region"]/text()')[0]
            # houseInfo is one "|"-separated string: model | area | direction | ...
            info_list = i.xpath('.//div[@class="houseInfo"]/text()')[0].split('|')
            item['model'] = info_list[0].strip()
            item['area'] = info_list[1].strip()
            item['direction'] = info_list[2].strip()
            item['perfect'] = info_list[3].strip()
            item['floor'] = info_list[4].strip()
            item['age'] = info_list[5].strip()
            item['address'] = i.xpath('.//div[@class="positionInfo"]/a/text()')[0].strip()
            item['total'] = i.xpath('.//div[@class="totalPrice"]/span/text()')[0].strip()
            item['unit'] = i.xpath('.//div[@class="unitPrice"]/span/text()')[0].strip()[2:-4]
            print(item)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)


if __name__ == '__main__':
    spider = LianJiaSpider()
    spider.run()

5. Results

(screenshot of the output)

Poor, poor, poor. Can't afford one, not even second-hand.

Case 4: Python3 crawler - Baidu Images

import requests
import re, time, random, os
from urllib import parse
from fake_useragent import UserAgent


class BaiduImgSpider(object):

    def __init__(self):
        self.baseurl = 'https://image.baidu.com/search/index?tn=baiduimage&word={}'
        self.count = 1
        self.ua = UserAgent()
        self.savepath = '/home/user/work/spider/day03/'
        # Thumbnail URLs are embedded in the page as JSON fragments
        self.re_str = r'{"thumbURL":"(.*?)","replaceUrl":'

    def get_html(self, name, orgname):
        header = {'User-Agent': self.ua.random}
        url = self.baseurl.format(name)
        html = requests.get(url=url, headers=header).text
        pattern = re.compile(self.re_str, re.S)
        img_list = pattern.findall(html)
        path = self.savepath + orgname
        if not os.path.exists(path):
            os.mkdir(path)
        for img_link in img_list:
            print(img_link)
            self.save_img(img_link, path)
            time.sleep(random.randint(1, 2))

    def save_img(self, url, path):
        header = {'User-Agent': self.ua.random}
        html = requests.get(url=url, headers=header).content
        filename = path + "/" + str(self.count) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(html)
            print('downloaded', filename)
        self.count += 1

    def run(self):
        search_name = input('Enter the keyword to search> ')
        word = parse.quote(search_name)
        self.get_html(word, search_name)


if __name__ == '__main__':
    spider = BaiduImgSpider()
    spider.run()

Straight to the code this time; it is very simple.
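
The only real trick is the non-greedy thumbURL regex; on a made-up fragment shaped like Baidu's embedded JSON it behaves like this:

import re

# Made-up sample in the same shape as the page's embedded JSON.
sample = ('{"thumbURL":"https://img0.example.com/u=1.jpg","replaceUrl":[]} '
          '{"thumbURL":"https://img1.example.com/u=2.jpg","replaceUrl":[]}')
pattern = re.compile(r'{"thumbURL":"(.*?)","replaceUrl":', re.S)
print(pattern.findall(sample))
# ['https://img0.example.com/u=1.jpg', 'https://img1.example.com/u=2.jpg']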

(screenshot of the output)

Case 5: Python3 crawler - Dytt (Movie Heaven)

from urllib import request
import re, time, random
from fake_useragent import UserAgent


class DyTTSpider(object):

    def __init__(self):
        self.base_url = 'https://www.dytt8.net'
        self.url_one = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
        self.ua = UserAgent()

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        req = request.Request(url, headers=header)
        res = request.urlopen(req)
        # The site is encoded as gb2312; ignore undecodable bytes
        ret = res.read().decode('gb2312', 'ignore')
        return ret

    def re_html(self, html, restr):
        patent = re.compile(restr, re.S)
        ret = patent.findall(html)
        return ret

    def parse_html(self, one_url):
        html_ret = self.get_html(one_url)
        # NOTE: the original regex was lost when these notes were exported
        # (it contained HTML tags); this is a plausible reconstruction that
        # grabs the detail-page links from the list page.
        re_str = r'<table width="100%".*?<a href="(.*?)" class="ulink">'
        ret_list = self.re_html(html_ret, re_str)
        for link in ret_list:
            print(link)
            self.parse_second(self.base_url + link)
            time.sleep(random.randint(2, 3))

    def parse_second(self, second_html):
        item = {}
        html_ret = self.get_html(second_html)
        # NOTE: reconstructed as well - two capture groups, movie title and
        # download link, taken from the detail page.
        re_str = r'<div class="title_all"><h1><font color=#07519a>(.*?)</font></h1>.*?<a href="(.*?)">'
        two_list = self.re_html(html_ret, re_str)
        if two_list:
            item['name'] = two_list[0][0].strip()
            item['dlink'] = two_list[0][1].strip()
            print(item)

    def run(self):
        geturl = self.url_one.format(1)
        self.parse_html(geturl)


if __name__ == '__main__':
    dy = DyTTSpider()
    dy.run()

Case 6: Python3 crawler - Youdao Translate

import requests
import random, time
from hashlib import md5
from fake_useragent import UserAgent

'''
The JavaScript on fanyi.youdao.com builds the extra form fields like this:

var t = n.md5(navigator.appVersion)
  , r = "" + (new Date).getTime()
  , i = r + parseInt(10 * Math.random(), 10);
return {
    ts: r,
    bv: t,
    salt: i,
    sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")
}
'''


class FanyiSpider(object):

    def __init__(self):
        self.baseurl = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.ua = UserAgent()

    def make_formdata_string(self, word):
        # Rebuild ts / salt / sign in Python, mirroring the JavaScript above
        formdata = {
            "i": "",
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": "",
            "sign": "",
            "ts": "",
            "bv": "37074a7035f34bfbf10d32bb8587564a",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        s = md5()
        formdata['i'] = word
        formdata['ts'] = str(int(time.time() * 1000))
        # formdata['bv'] = s.hexdigest()
        formdata['salt'] = formdata['ts'] + str(random.randint(0, 9))
        signstring = "fanyideskweb" + word + formdata['salt'] + "Nw(nmmbP%A-r6U3EUn]Aj"
        s.update(signstring.encode())
        formdata['sign'] = s.hexdigest()
        return formdata

    def make_header(self):
        # The original notes lost this part; the endpoint normally refuses
        # requests that do not look like they come from the web page, so a
        # Referer and a Cookie are assumed here.
        return {
            'User-Agent': self.ua.random,
            'Referer': 'http://fanyi.youdao.com/',
            'Cookie': 'OUTFOX_SEARCH_USER_ID=-1@10.0.0.1;',
        }

    def get_html(self, word):
        # POST the signed form data and print the JSON reply
        formdata = self.make_formdata_string(word)
        ret = requests.post(url=self.baseurl, data=formdata, headers=self.make_header())
        print(ret.text)

    def run(self):
        word = input('Enter the text to translate> ')
        self.get_html(word)


if __name__ == '__main__':
    spider = FanyiSpider()
    spider.run()

(screenshot of the output)

Case 8: Douban movies

This one is fairly simple: everything is visible in F12 (DevTools), and the site returns JSON.

import requests
from fake_useragent import UserAgent
import random, time
import json
from urllib import parse


class DoubanMovieSpider(object):

    def __init__(self):
        # search_tags returns the tag list, search_subjects returns the movies
        self.baseurl = 'https://movie.douban.com/j/search_tags?type=movie&source='
        self.detilurl = 'https://movie.douban.com/j/search_subjects?type=movie&tag={}&sort=recommend&page_limit=20&page_start={}'
        self.ua = UserAgent()
        self.douban_movie_types = []

    def make_header(self):
        data = {'User-Agent': self.ua.random}
        return data

    def get_html(self, url):
        ret = requests.get(url=url, headers=self.make_header()).text
        return ret

    def run(self):
        movie_type_json = self.get_html(self.baseurl)
        move_type = json.loads(movie_type_json)
        for item in move_type['tags']:
            self.douban_movie_types.append(item)
            print(item, sep=None)
        want_type = input('Enter the category you are interested in> ')
        startpage = input('Enter the start page> ')
        endpage = input('Enter the end page> ')
        if want_type in self.douban_movie_types:
            # for i in range(startpage, endpage, 20):
            insterurl = self.detilurl.format(parse.quote(want_type), 0)
            movie_infos_json = self.get_html(insterurl)
            movie_infos = json.loads(movie_infos_json)
            for info in movie_infos['subjects']:
                print(info['title'], info['rate'], info['url'])


if __name__ == '__main__':
    spider = DoubanMovieSpider()
    spider.run()

(screenshot of the output)

Case 9: Xiaomi app store

Get the app categories and the total page count, then fetch the pages with multiple threads (a minimal sketch of the worker pattern follows, then the full code).
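
The multithreading part is the standard Queue plus worker-thread pattern. A minimal sketch of just that pattern, with placeholder page names instead of the real Xiaomi URLs:

from queue import Queue
from threading import Thread

q = Queue()
for n in range(1, 11):               # pretend these are page URLs
    q.put('page-{}'.format(n))

def worker():
    # Each thread keeps pulling work until the queue is empty.
    while not q.empty():
        url = q.get()
        print('fetching', url)       # a real worker would download and parse here
        q.task_done()

threads = [Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()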

import requests
import json
from threading import Thread
from queue import Queue
import time, random
from fake_useragent import UserAgent
from lxml import etree


class AppShopSpider(object):

    def __init__(self):
        self.mainurl = 'http://app.mi.com/'
        self.baseurl = 'http://app.mi.com/categotyAllListApi?page={}&categoryId={}&pageSize=30'
        self.xpathtype = '//ul[@class="category-list"]/li/a'
        self.xpathpagenum = '//div[@class="pages"]/a[6]/text()'
        self.q = Queue()
        self.ua = UserAgent()
        self.type_code = {}

    def make_url(self, categoryid, startpage, endpage):
        # Producer: push every page URL of the chosen category onto the queue
        for page in range(startpage, endpage, 1):
            url = self.baseurl.format(page, categoryid)
            self.q.put(url)

    def get_url(self):
        # Worker: keep pulling URLs until the queue is empty
        while not self.q.empty():
            url = self.q.get()
            print(url)
            self.parse_html(self.get_html(url))

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        ret = requests.get(url=url, headers=header).text
        return ret

    def parse_html(self, html):
        jsonstr = json.loads(html)
        for item in jsonstr['data']:
            print(item['displayName'])

    def get_typecode(self):
        # Scrape the category name -> categoryId mapping from the home page
        html = self.get_html(self.mainurl)
        xpathobj = etree.HTML(html)
        ret = xpathobj.xpath(self.xpathtype)
        for item in ret:
            apptype = item.xpath('./text()')[0]
            appcode = item.xpath('./@href')[0].split('/')[-1]
            self.type_code[apptype] = appcode
            print(apptype)

    def get_typepage(self, keyword):
        for key, value in self.type_code.items():
            if key == keyword:
                url = self.baseurl.format(0, value)
                html = self.get_html(url)
                jsonstr = json.loads(html)
                pagenum = int(int(jsonstr['count']) / 30 + 1)
                print('total pages>', pagenum)

    def run(self):
        self.get_typecode()
        instert_type = input('Enter the category you want> ')
        self.get_typepage(instert_type)
        strtpage = int(input('Enter the start page> '))
        endpage = int(input('Enter the end page> '))
        self.make_url(self.type_code[instert_type], strtpage, endpage)
        thread_list = []
        for i in range(5):
            # Pass the method itself, not its result, so each thread runs the worker
            t = Thread(target=self.get_url)
            thread_list.append(t)
            t.start()
        for t in thread_list:
            t.join()


if __name__ == '__main__':
    starttime = time.time()
    spider = AppShopSpider()
    spider.run()
    endtime = time.time()
    print('elapsed> %.2f' % (endtime - starttime))

(screenshot of the output)

Case 10: JD (Jingdong) products

Use Selenium to drive the Chrome browser and scrape the data; it is slow, but simple (a minimal setup sketch follows).

2020-04-19
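
The minimal Selenium setup looks like this; it assumes a chromedriver matching your Chrome version is on PATH, and the headless flag is optional:

from selenium import webdriver

options = webdriver.ChromeOptions()
# options.add_argument('--headless')          # uncomment to run without a window
browser = webdriver.Chrome(options=options)   # chromedriver must be on PATH
browser.get('https://www.jd.com/')
print(browser.title)
browser.quit()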

from selenium import webdriver
import time


class JDSpider(object):

    def __init__(self):
        self.baseurl = 'https://www.jd.com/'
        self.searchinput_xpath = '//*[@id="key"]'
        self.searchButton_xpath = '//*[@id="search"]/div/div[2]/button'
        self.browser = None
        self.detail_xpath = '//*[@id="J_goodsList"]/ul/li'
        self.nextpage_xpath = '//*[@id="J_bottomPage"]/span[1]/a[9]'
        self.sum = 0

    def get_html(self, url, word):
        # Open the home page, type the keyword and click search
        self.browser.get(url)
        search_input = self.browser.find_element_by_xpath(self.searchinput_xpath)
        search_input.send_keys(word)
        time.sleep(3)
        self.send_click(self.searchButton_xpath)
        self.parse_html()

    def send_click(self, xpathstr):
        button = self.browser.find_element_by_xpath(xpathstr)
        button.click()
        time.sleep(3)

    def scrollend(self):
        # Scroll to the bottom so the lazy-loaded items are rendered
        js = "var q=document.documentElement.scrollTop=100000"
        self.browser.execute_script(js)
        time.sleep(3)

    def parse_html(self):
        self.scrollend()
        item = {}
        li_list = self.browser.find_elements_by_xpath(self.detail_xpath)
        for li in li_list:
            item['price'] = li.find_element_by_xpath('.//div[@class="p-price"]').text.strip()
            item['title'] = li.find_element_by_xpath('.//div[@class="p-name p-name-type-2"]/a/em').text.strip()
            item['commit'] = li.find_element_by_xpath('.//div[@class="p-commit"]/strong').text.strip()
            print(item)
            self.sum += 1

    def run(self):
        name = input('Enter the keyword to search> ')
        self.browser = webdriver.Chrome()
        self.get_html(self.baseurl, name)
        while True:
            # Keep clicking "next page" until the button is disabled
            if self.browser.page_source.find('pn-next disable') == -1:
                self.browser.find_element_by_class_name('pn-next').click()
                time.sleep(3)
                self.parse_html()
            else:
                break


if __name__ == '__main__':
    spider = JDSpider()
    spider.run()
    print('Total items:', spider.sum)

(screenshot of the output)