天天看點

Java or Python for crawlers: is Python or Java the better choice for writing a web crawler?

eechen

2016/07/11 14:17

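You could also take a look at a few DOM manipulation libraries for PHP: Simple-HTML-DOM, phpQuery, Ganon

For example, grabbing the titles and publication dates of the news items on the official PHP home page takes only a few lines: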

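<?php
// Load the Simple-HTML-DOM library (simple_html_dom.php must sit next to this script).
require dirname(__FILE__).'/simple_html_dom.php';

// Fetch and parse the PHP home page.
$html = file_get_html('http://cn2.php.net');

$news = array();
// Each news item is matched by the selector article.newsentry.
foreach($html->find('article.newsentry') as $article) {
    $item = array(); // start from a fresh entry on every iteration
    $item['time'] = trim($article->find('time', 0)->plaintext);
    $item['title'] = trim($article->find('h2.newstitle', 0)->plaintext);
    //$item['content'] = trim($article->find('div.newscontent', 0)->plaintext);
    $news[] = $item;
}

var_export($news);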

// Output
array (
  0 =>
  array (
    'time' => '07 Jul 2016',
    'title' => 'PHP 7.1.0 Alpha 3 Released',
  ),
  1 =>
  array (
    'time' => '27 Jun 2016',
    'title' => 'PHP 7.1.0 Alpha 2 Released',
  ),
  2 =>
  array (
    'time' => '23 Jun 2016',
    'title' => 'PHP 5.5.37 is released',
  ),
  3 =>
  array (
    'time' => '23 Jun 2016',
    'title' => 'PHP 5.6.23 is released',
  ),
  4 =>
  array (
    'time' => '23 Jun 2016',
    'title' => 'PHP 7.0.8 Released',
  ),
  5 =>
  array (
    'time' => '09 Jun 2016',
    'title' => 'PHP 7.1.0 Alpha 1 Released',
  ),
)
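
If you would rather not pull in a third-party library, roughly the same extraction can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is only a minimal sketch, assuming PHP 7 or later with the dom extension enabled. The $sample markup and the extractNews() helper are made up for illustration: they mimic the structure implied by the selectors above, and the real page may well differ.

```php
<?php
// Hypothetical sample markup shaped like the selectors used above;
// it is NOT a copy of the real php.net page.
$sample = <<<HTML
<html><body>
<article class="newsentry">
  <header><time datetime="2016-07-07">07 Jul 2016</time>
  <h2 class="newstitle"><a href="#">PHP 7.1.0 Alpha 3 Released</a></h2></header>
</article>
<article class="newsentry">
  <header><time datetime="2016-06-27">27 Jun 2016</time>
  <h2 class="newstitle"><a href="#">PHP 7.1.0 Alpha 2 Released</a></h2></header>
</article>
</body></html>
HTML;

// Illustrative helper (not part of any library): returns a list of
// array('time' => ..., 'title' => ...) entries found in the markup.
function extractNews(string $markup): array
{
    $doc = new DOMDocument();
    // libxml reports HTML5 tags such as <article> as errors; buffer them
    // internally instead of emitting PHP warnings, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector article.newsentry.
    $articles = $xpath->query(
        "//article[contains(concat(' ', normalize-space(@class), ' '), ' newsentry ')]"
    );

    $news = array();
    foreach ($articles as $article) {
        $time = $xpath->query('.//time', $article)->item(0);
        $title = $xpath->query(
            ".//h2[contains(concat(' ', normalize-space(@class), ' '), ' newstitle ')]",
            $article
        )->item(0);
        if ($time === null || $title === null) {
            continue; // skip entries that lack either field
        }
        $news[] = array(
            'time' => trim($time->textContent),
            'title' => trim($title->textContent),
        );
    }
    return $news;
}

$news = extractNews($sample);
var_export($news);
```

To run it against a live page, you would pass the downloaded HTML string to extractNews() instead of $sample. The XPath is more verbose than a CSS selector, which is the main convenience that Simple-HTML-DOM and similar libraries add.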