
(1) A Simple Multithreaded Crawler

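Recently I was given a data-statistics assignment, so I decided to crawl housing prices in Chongqing from Anjuke (安居客). Since I also wanted to practice multithreading, I ended up crawling the data for every city. The complete code is on GitHub.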

Looking at the page structure

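First, look at how the site is organized. Anjuke is organized by city, and each city is divided into districts, so I planned to crawl the data one district at a time. Start with the structure of the page that lists all cities: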

<!-- The data we need is in the <a> tags below -->
<html>
    <body>
        <div class="content">
            <div class="city-itm"> 
                <div class="letter_city">
                    <ul>
                        <li>
                            <div class="city_list">
                                <a href="" class="hot"></a>
                                <a href="" class="hot"></a>
                            </div>
                       </li>
                       <li>
                            <div class="city_list">
                                <a href="" class="hot"></a>
                                <a href="" class="hot"></a>
                            </div>
                       </li>
                    </ul>
                </div>
            </div>
        </div>
    </body>
</html>
           

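Open any city and find all of its districts. What we want are the links under "find homes by district" (按地区找房) in the new-homes section of the page. Note that some cities have no new homes; for those, the page shows only popular areas (热门板块). Open a city that does have districts and look at its structure: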

<!-- The data we need is in the <a> tags under "by district" -->
<!-- For cities with no new-home districts, the content under class="clearfix" is different -->
<html>
    <body>
        <div id="content">
            <div class="left-cont fl">
                <div class="buy-house tab-contents" id="content_Rd1">
                    <div class="clearfix">
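                        <!-- Second-hand home list -->
                        <div class="details float_l"></div>
                        <!-- Second list (the source labels this "second-hand" again; it holds the new-home district links used below) -->
                        <div class="details float_l">
                            <!-- By district -->
                            <div class="areas">
                                <a> </a>
                            </div>
                            <!-- By price -->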
                            <div class="prices"></div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
           

Next, open a district and look at its structure:

<!-- If this is the first page and there are more pages, the links to those pages must be collected too -->
<html>
    <body>
        <div id="container">
            <div class="list-contents">
                <div class="list-results">
                    <div class="key-list">
                        <!-- List of developments -->
                        <div class="item-mod"></div>
                    </div>
                    <!-- Pagination links -->
                    <div class="list-page">
                        <div class="pagination">
                            <a></a>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
           

Finally, the structure of a single development. This is the same page as before, but now we look inside the div with class="item-mod".

<div class="item-mod" data-link="" data-soj=""  rel="">
    <a class="pic"><img></a>
    <div class="info">
        <a class="lp-name">
            <h3><span class="items-name">融创白象街</span></h3>
        </a>
        <a class="address">
            <span>[&nbsp;渝中&nbsp;解放碑&nbsp;]&nbsp;凯旋路16号</span>
        </a>
        <a class="tags-wrap">
            <div class="tag-panel">
                <!-- The two tags are not always both present, but class="status-icon wuyetp" always is -->
                <i class="status-icon onsale">在售</i>
                <i class="status-icon wuyetp">住宅</i>
            </div>
        </a>
    </div>
    <a class="favor-pos">
        <p class="price">最低<span>1000</span>万元/套起</p>
    </a>
</div>
           

Writing the code

First, write a method that simulates an HTTP request:

import java.io.IOException;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpUtils {
    public static String CreatHttpGet(String url) {
        String result = null;
        // try-with-resources closes the client even if the request fails
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            HttpGet httpget = new HttpGet(url);

            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2)");
            httpget.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
            httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");

            HttpResponse response = httpclient.execute(httpget);

            // The default client already follows GET redirects; this branch is a fallback
            int statusCode = response.getStatusLine().getStatusCode();
            if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY) || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
                    || (statusCode == HttpStatus.SC_SEE_OTHER) || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
                Header location = response.getLastHeader("Location");
                if (location != null) {
                    // release the first response before reusing the client
                    EntityUtils.consume(response.getEntity());
                    response = httpclient.execute(new HttpGet(location.getValue()));
                }
            }

            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Read the body into a byte array first, then decode it with the declared charset
                byte[] bytes = EntityUtils.toByteArray(entity);
                String charSet = EntityUtils.getContentCharSet(entity);
                if (charSet == null) {
                    // no charset in the Content-Type header: fall back to UTF-8
                    charSet = "UTF-8";
                }
                result = new String(bytes, charSet);
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }
}
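
The charset handling in the request helper is easy to get wrong, so here is a minimal, dependency-free sketch of the same idea: take the charset from a Content-Type value and fall back to UTF-8 when it is missing or unknown. The class name `CharsetResolver` and the UTF-8 default are assumptions for illustration and are not part of the original project.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetResolver {
    /** Returns the charset named in a Content-Type value, or UTF-8 if it is absent or unsupported. */
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // case-insensitive check for a "charset=" parameter
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // illegal or unsupported charset name: fall through to the default
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

A caller would decode the response bytes with `new String(bytes, CharsetResolver.fromContentType(headerValue))`, which never fails on a missing header.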
           

Next, write the methods that parse the pages

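Pattern (regular expressions) can be used to parse pages and extract data, and that is what I used at first. It worked on the short strings I mocked up for testing, but failed on the real pages. After searching on Baidu I switched to Jsoup, which makes parsing pages very convenient.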
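
The post does not say exactly why the Pattern approach broke. One common cause, shown here purely as an assumed illustration, is that a regex written against simplified markup depends on details such as attribute order. The class name `RegexLinkDemo` and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkDemo {
    // Expects href to be immediately followed by class="hot", as in the simplified markup
    private static final Pattern HOT_LINK =
            Pattern.compile("<a href=\"([^\"]*)\" class=\"hot\">");

    /** Returns the href of every anchor that matches the rigid pattern above. */
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HOT_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

With this pattern, `<a href="/cq" class="hot">` is matched, but the equivalent `<a class="hot" href="/cq">` is silently skipped. An HTML parser such as Jsoup works on the element tree rather than the raw text, so it is not sensitive to this kind of variation.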

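With the parsing approach settled, I decided on the following thread layout. One thread fetches all the cities. A thread pool fetches each city's districts: it takes links from the city list, fetches the pages, and parses out the district links. A second thread pool fetches developments: it takes links from the district list and parses out that district's developments. Writing to the database uses a third thread pool: as developments are collected for a district, a task is added to that pool to insert them.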
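
The hand-off between stages described above can be sketched with a `BlockingQueue` and a sentinel value. This is a swapped-in technique, not the author's: the post's code shares a LinkedList guarded by `wait()`/`notifyAll()` plus a boolean "finished" flag. The class name `QueueHandoff`, the sentinel, and the `#visited` suffix are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueHandoff {
    // Sentinel marking the end of the stream; compared by identity, so no real URL can equal it
    private static final String DONE = new String("DONE");

    /** Passes every URL from a producer thread to a consumer thread and returns what the consumer saw. */
    static List<String> run(List<String> cityUrls) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> seen = new ArrayList<>();

        Thread producer = new Thread(() -> {
            for (String url : cityUrls) {
                queue.add(url);
            }
            queue.add(DONE); // tell the consumer nothing more is coming
        });
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String url = queue.take(); // blocks until an item is available
                    if (url == DONE) {
                        break;
                    }
                    seen.add(url + "#visited"); // placeholder for "fetch and parse"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join(); // join makes the consumer's writes to 'seen' visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

Because `take()` blocks and the sentinel always arrives last, the consumer cannot miss the "finished" signal, which is the failure mode a separate flag plus `wait()` has to guard against by hand.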

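At first I simply started 5 threads, each of which kept taking entries from the linked list until it was empty and then exited. That was not the behavior I wanted, so I changed it to submit one task to the thread pool for each link obtained. The main code is shown below; the complete code is on GitHub.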
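
The "one task per link" approach can be reduced to the following sketch. It uses `awaitTermination` to wait for the pool, which is a substitution for the sleep-and-poll loop on `isTerminated()` used in the listings. The class name `TaskPerLink`, the counter, and the one-minute timeout are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class TaskPerLink {
    /** Submits one task per link to a fixed pool and returns how many tasks completed. */
    static int processAll(List<String> links, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();

        for (String link : links) {
            pool.execute(() -> {
                // placeholder for "fetch this link and parse the page"
                processed.incrementAndGet();
            });
        }

        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            // block until every submitted task has finished, or give up after the timeout
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

The pool size caps how many pages are fetched at once, while the number of tasks simply follows the number of links, so no worker has to decide for itself when the list is "really" empty.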

Searching for cities
public class CitySearchThread extends Thread {
    public void run() {
        String result = null;
        result = HttpUtils.CreatHttpGet("https://www.anjuke.com/sy-city.html");
        if (result != null) {
            try {
                Document doc = Jsoup.parse(result);
                Element cityItmElement = doc.getElementsByClass("city-itm").first();
                Element lettercityElement = cityItmElement.getElementsByClass("letter_city").first();
                Element ulElement = lettercityElement.getElementsByTag("ul").first();
                Elements liElements = ulElement.getElementsByTag("li");
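                // Loop bounds were lost in the source; starting at 0 and skipping the last <li> (size() - 1) are assumptions
                for (int i = 0; i < liElements.size() - 1; i++) {
                    Element cityListElement = liElements.get(i).getElementsByClass("city_list").first();
                    Elements cityElements = cityListElement.getElementsByTag("a");
                    for (int j = 0; j < cityElements.size(); j++) {
                        Element cityElement = cityElements.get(j);
                        String url = cityElement.absUrl("href");
                        synchronized (DataList.cityList) {
                            DataList.cityList.add(url);
                            // Value lost in the source; 1 is assumed (wake the consumer when the list becomes non-empty)
                            if (DataList.cityList.size() == 1) {
                                DataList.cityList.notifyAll();
                            }
                        }
                    }
                }
                System.out.println("City search finished");
            }catch (Exception e) {
                System.out.println("The page request was blocked");
            }finally {
                // Mark the city search as finished and wake any consumer still waiting on the list
                synchronized (DataList.cityList) {
                    DataList.cityFlag = false;
                    DataList.cityList.notifyAll();
                }
            }
        }else {
            // Mark the city search as finished and wake any consumer still waiting on the list
            synchronized (DataList.cityList) {
                DataList.cityFlag = false;
                DataList.cityList.notifyAll();
            }
            System.out.println("City search finished");
        }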
    }
}
           
Searching for districts
public class DistrictSearchThread extends Thread {
    public void run() {
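        // Pool size was lost in the source; 5 is assumed from the thread count mentioned in the text
        ExecutorService pool = Executors.newFixedThreadPool(5);
        boolean flag = true;
        while (flag) {
            String url;
            synchronized (DataList.cityList) {
                // pollFirst() returns null on an empty list instead of throwing
                url = DataList.cityList.pollFirst();
                if (url == null) {
                    if (DataList.cityFlag == false) {
                        // the city search is finished and the list is drained
                        flag = false;
                    } else {
                        try {
                            // Wait for more links. The producer must call notifyAll() when it
                            // clears cityFlag, otherwise this wait is never woken up.
                            DataList.cityList.wait();
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
            // only submit a task when a link was actually taken
            if (url != null) {
                pool.execute(new SearchUtil(url));
            }
        }
        pool.shutdown();
        while (true) {
            if (pool.isTerminated()) {
                // Mark the district search as finished and wake the consumer of districtList
                synchronized (DataList.districtList) {
                    DataList.districtFlag = false;
                    DataList.districtList.notifyAll();
                }
                System.out.println("District search finished");
                break;
            }
            try {
                Thread.sleep(1000); // polling interval lost in the source; 1000 ms assumed
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }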

    }

    class SearchUtil extends Thread {
        private String url;

        public SearchUtil(String url) {
            super();
            this.url = url;
        }

        @Override
        public void run() {
            String result = null;
            // url is null when the caller was just woken up with nothing to fetch
            if (url != null) {
                result = HttpUtils.CreatHttpGet(url);
                if (result != null) {
                    Document doc = Jsoup.parse(result);
                    try {
                        Element contentElement = doc.getElementById("content_Rd1");
                        // Index lost in the source; 1 is assumed, since the district links sit in the second "details float_l" block
                        Element detailsfloat_lElement = contentElement.getElementsByClass("details float_l").get(1);
                        Element areasElement = detailsfloat_lElement.getElementsByClass("areas").first();
                        Elements aElements = areasElement.getElementsByTag("a");
                        for (int i = 0; i < aElements.size(); i++) {
                            Element cityElement = aElements.get(i);
                            String districtUrl = cityElement.absUrl("href");
                            synchronized (DataList.districtList) {
                                DataList.districtList.add(districtUrl);
                                // Value lost in the source; 1 is assumed (wake the consumer when the list becomes non-empty)
                                if (DataList.districtList.size() == 1) {
                                    DataList.districtList.notifyAll();
                                }
                            }
                        }
                    } catch (Exception e) {
                         // cities without new-home districts have a different layout and end up here
                         System.out.println(Thread.currentThread().getName() + "===== no district data for: " + url);
                    }
                }
            }
        }
    }
}
           
Fetching and saving developments
public class HouseSearchThread extends Thread {
    // Pool sizes were lost in the source; 5 is assumed from the thread count mentioned in the text
    ExecutorService pool = Executors.newFixedThreadPool(5);  // fetches and parses district pages
    ExecutorService pool1 = Executors.newFixedThreadPool(5); // writes developments to the database
    @Override
    public void run() {
        boolean flag = true;
        while (flag) {
            String url;
            synchronized (DataList.districtList) {
                // pollFirst() returns null on an empty list instead of throwing
                url = DataList.districtList.pollFirst();
                if (url == null) {
                    if (DataList.districtFlag == false) {
                        // the district search is finished and the list is drained
                        flag = false;
                    } else {
                        try {
                            DataList.districtList.wait();
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
            // only submit a task when a link was actually taken
            if (url != null) {
                pool.execute(new SearchUtil(url));
            }
        }
        pool.shutdown();
        while (true) {
            if (pool.isTerminated()) {
                System.out.println("Development search finished");
                break;
            }
            try {
                Thread.sleep(1000); // polling interval lost in the source; 1000 ms assumed
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        // all search tasks are done, so no more insert tasks will be submitted
        pool1.shutdown();
        while (true) {
            if (pool1.isTerminated()) {
                System.out.println("Development insertion finished");
                break;
            }
            try {
                Thread.sleep(1000); // polling interval lost in the source; 1000 ms assumed
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println("Done");
    }
    class SearchUtil extends Thread {
        private String url;

        public SearchUtil(String url) {
            this.url = url;
        }
        @Override
        public void run() {
            String baseUrl = url;
            String result = null;

            if (url != null) {
                result = HttpUtils.CreatHttpGet(url);
                if (result != null) {
                    LinkedList<Elements> resultList = new LinkedList<Elements>();
                    Map<String, Object> map = ExcisionUtil.roughExcisionAPage(result);// parse the district's first page: listings plus page count
                    Integer page = (Integer) map.get("page");
                    resultList.add((Elements) map.get("list"));
                    // fetch the remaining pages of this district
                    if (page != null) {
                        // Lower bound lost in the source; 1 is assumed, since page 1 was already parsed above
                        for (; page > 1; page--) {
                            String nextUrl = baseUrl + "p" + page + "/";
                            result = HttpUtils.CreatHttpGet(nextUrl);
                            resultList.add(ExcisionUtil.roughExcisionNPage(result));
                        }
                    }
                    // parse each page and queue its developments for insertion
                    Iterator<Elements> it = resultList.iterator();
                    while (it.hasNext()) {
                        Elements ele = it.next();
                        if (ele != null) {
                            LinkedList<House> houseList = ExcisionUtil.exactExcision(ele);
                            pool1.execute(new AddUtil(houseList));
                        }
                    }
                }
            }
        }
    }

    class AddUtil extends Thread{
        private LinkedList<House> houseList;
        public AddUtil(LinkedList<House> houseList) {
            this.houseList = houseList;
        }
        @Override
        public void run() {
            // try-with-resources closes the session even if an insert throws
            try (SqlSession sqlSession = MyBatisUtil.getSqlSession(true)) {
                HouseMapper mapper = sqlSession.getMapper(HouseMapper.class);
                for (House house : houseList) {
                    mapper.addHouse(house);
                }
            }
        }
    }
}
           
Parsing pages to extract developments
public class ExcisionUtil {
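    /**
     * Rough cut of a first page: extracts the elements that hold developments, plus the page count.
     * @param result HTML of the first page of a district
     * @return a map with "list" (the development elements) and "page" (the page count)
     */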
    public static Map<String,Object> roughExcisionAPage(String result){
        HashMap<String, Object> map = new HashMap<String, Object>();
        if(result != null) {
            Document doc = Jsoup.parse(result);
            try {
                Element elementRoot = doc.getElementById("container");
                Element listElement = elementRoot.getElementsByClass("list-contents").first();
                //Element element2 = elements.first();
                Element keyListElement = listElement.getElementsByClass("key-list").first();
                Elements houseElements = keyListElement.getElementsByClass("item-mod");
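                // store the listings first so they are kept even if the pagination block is missing
                map.put("list", houseElements);
                Element listPageElement = listElement.getElementsByClass("list-page").first();
                Elements pageElement = listPageElement.getElementsByTag("a");
                // Offset lost in the source; 1 is assumed (the current page is not rendered as a link)
                map.put("page", pageElement.size()+1);
            }catch(Exception e) {
                // missing container or pagination: return whatever was collected so far
            }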
        }
        return map;
    }
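    /**
     * Rough cut of a later page: extracts the development elements only, without the page count.
     * @param result HTML of a page after the first
     * @return the development elements, or null if none were found
     */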
    public static Elements roughExcisionNPage(String result){
        Elements houseElements = null;
        if(result != null) {
            Document doc = Jsoup.parse(result);
            try {
                Element elementRoot = doc.getElementById("container");
                Element listElement = elementRoot.getElementsByClass("list-contents").first();
                Element keyListElement = listElement.getElementsByClass("key-list").first();
                houseElements = keyListElement.getElementsByClass("item-mod");
            }catch(Exception e) {

            }
        }
        return houseElements;
    }
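    /**
     * Extracts the details of each development.
     * @param elements the "item-mod" elements of one page
     * @return the parsed developments; entries without a name are skipped
     */
    public static LinkedList<House> exactExcision(Elements elements){
        LinkedList<House> resultList = new LinkedList<House>();
        for(int i = 0;i<elements.size();i++) {
            String name = null;
            String address = null;
            String state = null;
            String describe = null;
            String price = null;
            Element element = elements.get(i);
            // Note: the HTML outline earlier shows class="info"; the name used here must match the live page
            Element infoElement = element.getElementsByClass("infos").first();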

            try {
                Element nameElement = infoElement.getElementsByClass("lp-name").first();
                Element h3Element = nameElement.getElementsByTag("h3").first().getAllElements().first();
                name = h3Element.text();
            }catch(Exception e) {
            }

            try {
                Element addressElement = infoElement.getElementsByClass("address").first();
                Element spanElement = addressElement.getElementsByTag("span").first();
                address = spanElement.text();
            }catch(Exception e) {
            }

            Element stateElement = null;
            Element describeElement = null;
            try {
                Element tagswrapElement = infoElement.getElementsByClass("tags-wrap").first();
                Element tagpanelElement = tagswrapElement.getElementsByClass("tag-panel").first();
                Elements sdElements = tagpanelElement.getElementsByTag("i");
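                // one tag: only the property type is present
                if(sdElements.size() == 1) {
                    describeElement = sdElements.first();
                }
                // two tags: sale status first, then property type
                if(sdElements.size() == 2) {
                    stateElement = sdElements.first();
                    describeElement = sdElements.get(1);
                }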
                if(stateElement != null) {
                    state = stateElement.text();
                }
                if(describeElement != null) {
                    describe = describeElement.text();
                }
            }catch(Exception e) {
            }

            try {
                Element favorposElement = element.getElementsByClass("favor-pos").first();
                Element pElement = favorposElement.getElementsByTag("p").first();
                price = pElement.text();
            }catch(Exception e) {
            }

            if(name != null) {
                House house = new House(name, address, state, describe, price);
                resultList.add(house);
            }   
        }
        return resultList;
    }
}
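
The paging step in the search task builds each later page's address by appending `p<N>/` to the district URL. As a small self-contained sketch of that rule (the class name `PageUrls` and the example.com address are hypothetical, and the base URL is assumed to end with a slash):

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    /** Builds the URLs of pages 2..pageCount for a district whose first page is baseUrl. */
    static List<String> remainingPages(String baseUrl, int pageCount) {
        List<String> urls = new ArrayList<>();
        // page 1 is baseUrl itself, so start at 2
        for (int page = 2; page <= pageCount; page++) {
            urls.add(baseUrl + "p" + page + "/");
        }
        return urls;
    }
}
```

For instance, `PageUrls.remainingPages("https://example.com/district/", 3)` yields the `p2/` and `p3/` addresses, and a page count of 1 yields an empty list, so single-page districts trigger no extra requests.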
           

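I ran into a number of problems while writing this. Many aspects of multithreading had been vague to me, and working through them improved both my technique and my understanding. I will write up the details in a later post.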
