
Web crawlers: crawling second-hand housing listings

Author: Xiaobai, a Java developer

First of all, I believe many friends have had the idea of writing a crawler in Python right after learning the language. So did I — Xiaobai, who has only just learned Python (called Xiaobai because I really am a complete beginner). Here, Xiaobai and Xiaobai's partners share, as Python newbies, the process of using a Python crawler to scrape listing information from a second-hand housing site. If any experts pass by, please point out what needs improvement; Xiaobai will be grateful.

Before writing a crawler, we should first figure out what a crawler actually does. Since Xiaobai is short on words here, let me borrow Baidu's answer to introduce it:

A web crawler, also known as a web spider or web robot (and, in the FOAF community, more often called a web chaser), is a program or script that automatically scrapes information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

After roughly understanding crawlers, we can start preparing to write one. We have already settled on Python as the language; next we need to pick a development platform and environment. Xiaobai uses Windows (while researching beforehand, I saw some experts say Linux is more comfortable than Windows for development, so you can also try Linux). The development environment is PyCharm 2021 Community Edition (I initially downloaded the Professional Edition, but found I had to enter an activation code every time I opened it, and the Community Edition is enough for ordinary people like us, so I re-downloaded the Community Edition).

The next step is actually writing the crawler. In Xiaobai's view, writing a crawler is mostly a matter of applying various libraries. The general idea is: crawl data - parse data - store data. The first step is to install the libraries we need; there are plenty of installation guides online, so I won't go into detail, but a quick note on which ones need installing follows below. After that, each library is brought in with an import statement.
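Of the imports below, only fake_useragent and pymysql are third-party packages; everything else ships with Python's standard library. Assuming pip is available on your PATH, installation looks something like this:

pip install fake-useragent pymysql

With those installed, here are the imports used in the code: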

# -*- coding: utf-8 -*-
from urllib import request
import re
import time
import threading
import random
import pymysql
from hashlib import md5
#from ua_info import ua_list
from fake_useragent import UserAgent
import sys

class study:

    def __init__(self):
        self.url='https://bj.lianjia.com/ershoufang/pg{}/'

    # 1. Request function: fetch a page with a random User-Agent
    def get_html(self, url):
        ua=UserAgent()
        #print(ua.chrome)
        headers = {'User-Agent': ua.chrome}
        req = request.Request(url=url, headers=headers)
        res = request.urlopen(req)
        # This site uses UTF-8 encoding
        html = res.read().decode('utf8', 'ignore')
        return html

    # 2. Regex parsing function
    def re_func(self,re_bds,html):
        pattern = re.compile(re_bds,re.S)
        r_list = pattern.findall(html)

        return r_list
    # Parse a listing page with regular expressions
    def parse_html(self,one_url):
        # Call the request function to fetch the first-level page
        one_html = self.get_html(one_url)
        #print(one_html)
        #re_bds = '<a class="" .*?data-el="ershoufang".*?>(.*?)</a>'
        #re_bds='<div class="info clear">.*?</div>'
        #re_bds='<div class="title"><a.*?data-el="ershoufang".*?>(.*?)</a>.*?<span class="goodhouse_tag tagBlock">(.*?)</span></div>'
        re_bds='<div class="info clear"><div class="title"><a.*?data-el="ershoufang".*?>(.*?)</a>.*?<span class="goodhouse_tag tagBlock">(.*?)</span></div><div class="flood"><div class="positionInfo"><span class="positionIcon"></span><a .*?data-el="region">(.*?)</a>   -  <a href="https://bj.lianjia.com/ershoufang/(.*?)/" target="_blank">(.*?)</a> </div></div><div class="address"><div class="houseInfo"><span class="houseIcon"></span>(.*?)</div></div><div class="followInfo"><span class="starIcon"></span>(.*?)</div><div class="tag">(.*?)</div><div class="priceInfo"><div class="totalPrice totalPrice2"><i> </i><span class="">(.*?)</span><i>万</i></div><div class="unitPrice".*?><span>(.*?)</span></div></div></div>'
        link_list = self.re_func(re_bds,one_html)
        #print(link_list)
        for link in link_list:
            print(link)
            print(100 * '*')
    # For pages where the maximum page number is available in the pager, it can be parsed out
    def get_max_page(self,one_url):
        one_html = self.get_html(one_url)
        #print(one_html)
        re_bds = '<a href="/ershoufang/pg(.*?)" data-page="(.*?)">(.*?)</a>'
        link_list = self.re_func(re_bds, one_html)
        #print(link_list)
        # The last pager link carries the highest page number
        return link_list[-1][1]

    def run(self):
        # Page numbers on the site start at 1, so crawl pg1 through pg100
        for i in range(1, 101):
            url = self.url.format(i)
            # Fetch the current data for this page
            self.parse_html(url)
        # Print the local time of this crawl
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        # Set a timer so the whole crawl repeats every 6 hours
        timer = threading.Timer(6 * 60 * 60, self.run)
        # Start the timer thread
        timer.start()
        #self.parse_html('https://bj.lianjia.com/ershoufang/pg1/')

if __name__ == '__main__':
    spider = study()
    spider.run()    
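To make the parsing step a bit more concrete, here is a tiny, self-contained demo of what re_func does, run against a made-up HTML snippet (the snippet and values are invented purely for illustration):

import re

# A minimal stand-alone version of re_func from the class above
def re_func(re_bds, html):
    pattern = re.compile(re_bds, re.S)  # re.S lets '.' also match newlines
    return pattern.findall(html)

html = '<div class="title"><a data-el="ershoufang">Sunny two-bedroom</a></div>'
print(re_func('<a data-el="ershoufang">(.*?)</a>', html))
# Output: ['Sunny two-bedroom']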

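One thing worth noting: run() hard-codes 100 pages, even though get_max_page can read the real page count from the pager links. A hypothetical variant that reuses the class above (a sketch only — it assumes get_max_page successfully matches the pager and returns a numeric string):

# Derive the page count from the pager instead of hard-coding 100
spider = study()
max_page = int(spider.get_max_page(spider.url.format(1)))
for i in range(1, max_page + 1):
    spider.parse_html(spider.url.format(i))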
           
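Finally, the third step of the plan — storing the data — is not shown in the code above, even though pymysql and md5 are already imported for it. Below is a minimal sketch of what that step might look like; the database housedb, the table listings, and the credentials are all made-up placeholder names, and the table is assumed to have a UNIQUE index on dedup_key:

import pymysql
from hashlib import md5

def save_listing(link):
    # 'link' is one tuple produced by the big regex in parse_html:
    # index 0 is the title, index 2 the region, index 8 the total price
    dedup_key = md5(str(link).encode('utf8')).hexdigest()  # simple dedup key
    db = pymysql.connect(host='localhost', user='root', password='123456',
                         database='housedb', charset='utf8mb4')
    try:
        with db.cursor() as cursor:
            # INSERT IGNORE skips rows whose dedup_key already exists
            sql = ('INSERT IGNORE INTO listings (dedup_key, title, region, price) '
                   'VALUES (%s, %s, %s, %s)')
            cursor.execute(sql, (dedup_key, link[0], link[2], link[8]))
        db.commit()
    finally:
        db.close()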