
Implementing a Web Crawler with Java and MySQL

This article is reposted from http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/

A web crawler, also known as a web spider (some projects call it a "walker"), is defined by Wikipedia as "a program that systematically browses the Internet for the purpose of indexing." There are many open-source crawler projects, the best known of which are Heritrix and Apache Nutch.

Sometimes you need to collect information from the web. When that information is easy to fetch in a uniform way but tedious to gather by hand — for example, counting how many posts a site publishes each month and which tags it uses, collecting a corpus for a natural language processing project, or gathering images for a pattern recognition project — a crawler is the right tool for the job. A crawler is also an indispensable component of any search engine.

Many crawlers are written in Python, Java, or C#. The one presented here is a Java version. To save time and space, the program is restricted to scanning pages under this blog's address (that is, http://johnhany.net/ but excluding anything under http://johnhany.net/wp-content/) and collects, from the URLs it finds, all tags used on the blog. With a small modification — removing the restriction in the code — it can be used to crawl the whole web; with a small change to the output format, it can also serve as a tool for generating a blog sitemap.

The code can also be downloaded here: johnhany/WPCrawler.

Requirements

My development environment is Windows 7 + Eclipse.

XAMPP is used to provide the port through which the MySQL database is accessed.

Three open-source Java libraries are also needed:

Apache HttpComponents 4.3 — provides the HTTP layer used to send HTTP requests to target URLs and fetch page content;

HTML Parser 2.0 — parses pages and extracts links from DOM nodes;

MySQL Connector/J 5.1.27 — connects the Java program to MySQL so the database can be manipulated from Java code.

Code

The code lives in three files — crawler.java, httpGet.java, and parsePage.java — all in the package net.johnhany.wpcrawler.

crawler.java


package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;

            System.out.println("Done.");
            System.out.println(count);
        }
    }
}

httpGet.java


package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };

            String responseBody = httpclient.execute(httpget, responseHandler);
            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}

parsePage.java


package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

How the program works

The "web" is, as its name suggests, a mesh: between any two nodes there may be a path. From a graph-theory point of view, a crawler's scan of the Internet is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; the relevant background is the same as for tree traversal (the original post links two references here). My program uses breadth-first traversal.
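To make the breadth-first idea concrete, here is a minimal in-memory sketch. It is not part of the original program, which uses the MySQL record table as its queue instead, and fetchLinks() is a hypothetical helper standing in for the download-and-parse step performed by httpGet and parsePage.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    static void crawl(String seed, int limit) {
        Queue<String> frontier = new ArrayDeque<String>(); // pages waiting to be visited
        Set<String> seen = new HashSet<String>();          // pages already discovered
        frontier.add(seed);
        seen.add(seed);
        int visited = 0;
        while (!frontier.isEmpty() && visited < limit) {
            String url = frontier.poll();                  // oldest entry first = breadth-first
            visited++;
            for (String next : fetchLinks(url)) {          // hypothetical download-and-parse step
                if (seen.add(next)) {                      // enqueue only links not seen before
                    frontier.add(next);
                }
            }
        }
    }

    static List<String> fetchLinks(String url) {
        // placeholder; in the real program this role is played by
        // httpGet.getByString() followed by parsePage.parseFromString()
        return java.util.Collections.<String>emptyList();
    }
}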

The program starts from main() in crawler.java.


Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");

First, DriverManager is called to connect to the MySQL service. The default XAMPP MySQL port 3306 is used; the port value can be seen on the XAMPP control panel.


Once Apache and MySQL are both running, open "http://localhost/phpmyadmin/" in your browser to inspect the database. After the program finishes, you can check there whether everything ran correctly.


sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

After the connection is established, a database named "crawler" is created with two tables. The first, "record", has the fields "recordID", "URL", and "crawled", holding the record number, the link address, and whether the address has been crawled. The second, "tags", has the fields "tagnum" and "tagname", holding the tag number and the tag name.


while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);

        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}

Next, a while loop processes every address in the record table in turn. Each iteration passes the address url to httpGet.getByString(), then sets crawled to true for that row to mark it as processed, then looks for the next address whose crawled flag is still false, and continues until the end of the table is reached.

One detail worth noting: executeQuery() returns a ResultSet rs, which holds all rows returned by the SQL query plus a cursor. The cursor initially points before the first row; you must call rs.next() once to move it onto the first result, which returns true. Each subsequent rs.next() advances the cursor to the next row and returns true, until no results remain, at which point rs.next() returns false.
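As a small illustration of this cursor behaviour (a sketch only, assuming an open Connection conn and the record table created above):

Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 0");
// the cursor starts before the first row; each rs.next() advances it and
// returns false once the last row has been consumed
while (rs.next()) {
    System.out.println(rs.getString("URL"));
}
rs.close();
stmt.close();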

Another detail: executeUpdate() is used for creating the database and tables and for INSERT and UPDATE statements, while executeQuery() is used for SELECT. executeQuery() always returns a ResultSet; executeUpdate() returns the number of rows affected by the statement.
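A short sketch of the difference, again assuming an open Connection conn and the crawler schema above:

Statement stmt = conn.createStatement();

// executeUpdate(): DDL, INSERT and UPDATE; returns the number of rows affected (0 for DDL)
int affected = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");
System.out.println(affected + " row(s) updated");

// executeQuery(): SELECT only; always returns a ResultSet
ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM record WHERE crawled = 0");
if (rs.next()) {
    System.out.println(rs.getLong(1) + " pages still pending");
}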

The getByString() method in httpGet.java is responsible for sending a request to the given URL and downloading the page content.


HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};

String responseBody = httpclient.execute(httpget, responseHandler);

This code is taken from the sample shipped with the HttpClient component of HttpComponents and can be used as-is in many situations. It yields a string, responseBody, that holds the entire contents of the page.

Next, responseBody is passed to parseFromString() in parsePage.java to extract the links.


Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);

        if (node instanceof LinkTag) {

In an HTML document, links usually sit in the href attribute of <a> tags, so an attribute filter is created. The NodeList holds the document's DOM nodes that match this filter; by processing each node in a for loop and picking out the link tags, all links on the page can be extracted.

Then nextlink.startsWith() is used for further filtering: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.


sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();

The record table is checked to see whether the link already exists. If it does (rs.next() returns true), nothing is done; if it does not (rs.next() returns false), the address is inserted into the table with crawled set to false. Because recordID was declared AUTO_INCREMENT earlier, Statement.RETURN_GENERATED_KEYS is used so the assigned number can be obtained.
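If you actually need the freshly assigned recordID, it can be read back through getGeneratedKeys(). The following is only a sketch, which also uses a parameter placeholder instead of string concatenation; the original code requests the generated keys but never reads them.

PreparedStatement ins = conn.prepareStatement(
        "INSERT INTO record (URL, crawled) VALUES (?, 0)",
        Statement.RETURN_GENERATED_KEYS);
ins.setString(1, nextlink);            // placeholder avoids quoting problems in the URL
ins.executeUpdate();
ResultSet keys = ins.getGeneratedKeys();
if (keys.next()) {
    int recordID = keys.getInt(1);     // the AUTO_INCREMENT value MySQL just assigned
    System.out.println("stored as record #" + recordID);
}
keys.close();
ins.close();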


nextlink = nextlink.substring(mainurl.length());

if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}

The leading "http://johnhany.net/" is stripped from the link to speed up the string comparisons. If the remainder starts with "tag/", the characters after it form a tag name; that name is extracted, decoded as UTF-8 so Chinese characters display correctly, and stored in the tags table. In the same way, you could test for "article/", "author/", or "2013/11/" to classify other kinds of links.
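For example (a hypothetical tag value, just to illustrate the decoding step):

// a link such as http://johnhany.net/tag/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/
// leaves this percent-encoded remainder after stripping "tag/" and the trailing slash
String encoded = "%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0";
String tagname = java.net.URLDecoder.decode(encoded, "UTF-8");
System.out.println(tagname);   // prints the Chinese tag name: 机器学习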

Results

Below are two database screenshots showing part of the program's results:

(screenshots: the record and tags tables of the crawler database)