
Implementing a Web Crawler with Java and MySQL

This article is reposted from http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/

A web crawler, also known as a web spider (some projects call it a "walker"), is defined by Wikipedia as "a program that systematically browses the Internet for the purpose of indexing." There are many open-source crawler projects, the best known of which are Heritrix and Apache Nutch.

Sometimes you need to collect information from the web. When that information is easy to fetch in a uniform way but tedious to gather by hand — for example, counting how many posts a site publishes each month and which tags it uses, collecting a corpus for a natural language processing project, or gathering images for a pattern recognition project — a crawler is the right tool for the job. A crawler is also an indispensable component of any search engine.

Many crawlers are written in Python, Java, or C#. The one presented here is a Java version. To save time and space, the program is restricted to scanning pages under this blog's address (that is, http://johnhany.net/ but excluding anything under http://johnhany.net/wp-content/) and collects, from the URLs it finds, all tags used on the blog. With a small modification — removing the restriction in the code — it can be used to crawl the whole web; with a small change to the output format, it can also serve as a tool for generating a blog sitemap.

The code can also be downloaded here: johnhany/WPCrawler.

Requirements

My development environment is Windows 7 + Eclipse.

XAMPP is used to provide the port through which the MySQL database is accessed.

Three open-source Java libraries are also needed:

Apache HttpComponents 4.3 — provides the HTTP layer used to send HTTP requests to target URLs and fetch page content;

HTML Parser 2.0 — parses pages and extracts links from DOM nodes;

MySQL Connector/J 5.1.27 — connects the Java program to MySQL so the database can be manipulated from Java code.

Code

The code lives in three files — crawler.java, httpGet.java, and parsePage.java — all in the package net.johnhany.wpcrawler.

crawler.java


package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;

            System.out.println("Done.");
            System.out.println(count);
        }
    }
}

httpGet.java


package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };

            String responseBody = httpclient.execute(httpget, responseHandler);
            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}

parsePage.java


package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

How the program works

The "web" is, as its name suggests, a mesh: between any two nodes there may be a path. From a graph-theory point of view, a crawler's scan of the Internet is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; the relevant background is the same as for tree traversal (the original post links two references here). My program uses breadth-first traversal.
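To make the breadth-first idea concrete, here is a minimal in-memory sketch. It is not part of the original program, which uses the MySQL record table as its queue instead, and fetchLinks() is a hypothetical helper standing in for the download-and-parse step performed by httpGet and parsePage.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    static void crawl(String seed, int limit) {
        Queue<String> frontier = new ArrayDeque<String>(); // pages waiting to be visited
        Set<String> seen = new HashSet<String>();          // pages already discovered
        frontier.add(seed);
        seen.add(seed);
        int visited = 0;
        while (!frontier.isEmpty() && visited < limit) {
            String url = frontier.poll();                  // oldest entry first = breadth-first
            visited++;
            for (String next : fetchLinks(url)) {          // hypothetical download-and-parse step
                if (seen.add(next)) {                      // enqueue only links not seen before
                    frontier.add(next);
                }
            }
        }
    }

    static List<String> fetchLinks(String url) {
        // placeholder; in the real program this role is played by
        // httpGet.getByString() followed by parsePage.parseFromString()
        return java.util.Collections.<String>emptyList();
    }
}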

The program starts from main() in crawler.java.


Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");

First, DriverManager is called to connect to the MySQL service. The default XAMPP MySQL port 3306 is used; the port value can be seen on the XAMPP control panel.


Once Apache and MySQL are both running, open "http://localhost/phpmyadmin/" in your browser to inspect the database. After the program finishes, you can check there whether everything ran correctly.


sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

After the connection is established, a database named "crawler" is created with two tables. The first, "record", has the fields "recordID", "URL", and "crawled", holding the record number, the link address, and whether the address has been crawled. The second, "tags", has the fields "tagnum" and "tagname", holding the tag number and the tag name.


while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);

        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}

Next, a while loop processes every address in the record table in turn. Each iteration passes the address url to httpGet.getByString(), then sets crawled to true for that row to mark it as processed, then looks for the next address whose crawled flag is still false, and continues until the end of the table is reached.

One detail worth noting: executeQuery() returns a ResultSet rs, which holds all rows returned by the SQL query plus a cursor. The cursor initially points before the first row; you must call rs.next() once to move it onto the first result, which returns true. Each subsequent rs.next() advances the cursor to the next row and returns true, until no results remain, at which point rs.next() returns false.
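As a small illustration of this cursor behaviour (a sketch only, assuming an open Connection conn and the record table created above):

Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 0");
// the cursor starts before the first row; each rs.next() advances it and
// returns false once the last row has been consumed
while (rs.next()) {
    System.out.println(rs.getString("URL"));
}
rs.close();
stmt.close();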

Another detail: executeUpdate() is used for creating the database and tables and for INSERT and UPDATE statements, while executeQuery() is used for SELECT. executeQuery() always returns a ResultSet; executeUpdate() returns the number of rows affected by the statement.
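A short sketch of the difference, again assuming an open Connection conn and the crawler schema above:

Statement stmt = conn.createStatement();

// executeUpdate(): DDL, INSERT and UPDATE; returns the number of rows affected (0 for DDL)
int affected = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");
System.out.println(affected + " row(s) updated");

// executeQuery(): SELECT only; always returns a ResultSet
ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM record WHERE crawled = 0");
if (rs.next()) {
    System.out.println(rs.getLong(1) + " pages still pending");
}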

The getByString() method in httpGet.java is responsible for sending a request to the given URL and downloading the page content.


HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};

String responseBody = httpclient.execute(httpget, responseHandler);

This code is taken from the sample shipped with the HttpClient component of HttpComponents and can be used as-is in many situations. It yields a string, responseBody, that holds the entire contents of the page.

Next, responseBody is passed to parseFromString() in parsePage.java to extract the links.


Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);

        if (node instanceof LinkTag) {

In an HTML document, links usually sit in the href attribute of <a> tags, so an attribute filter is created. The NodeList holds the document's DOM nodes that match this filter; by processing each node in a for loop and picking out the link tags, all links on the page can be extracted.

Then nextlink.startsWith() is used for further filtering: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.


sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();

The record table is checked to see whether the link already exists. If it does (rs.next() returns true), nothing is done; if it does not (rs.next() returns false), the address is inserted into the table with crawled set to false. Because recordID was declared AUTO_INCREMENT earlier, Statement.RETURN_GENERATED_KEYS is used so the assigned number can be obtained.
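If you actually need the freshly assigned recordID, it can be read back through getGeneratedKeys(). The following is only a sketch, which also uses a parameter placeholder instead of string concatenation; the original code requests the generated keys but never reads them.

PreparedStatement ins = conn.prepareStatement(
        "INSERT INTO record (URL, crawled) VALUES (?, 0)",
        Statement.RETURN_GENERATED_KEYS);
ins.setString(1, nextlink);            // placeholder avoids quoting problems in the URL
ins.executeUpdate();
ResultSet keys = ins.getGeneratedKeys();
if (keys.next()) {
    int recordID = keys.getInt(1);     // the AUTO_INCREMENT value MySQL just assigned
    System.out.println("stored as record #" + recordID);
}
keys.close();
ins.close();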


nextlink = nextlink.substring(mainurl.length());

if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}

The leading "http://johnhany.net/" is stripped from the link to speed up the string comparisons. If the remainder starts with "tag/", the characters after it form a tag name; that name is extracted, decoded as UTF-8 so Chinese characters display correctly, and stored in the tags table. In the same way, you could test for "article/", "author/", or "2013/11/" to classify other kinds of links.
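For example (a hypothetical tag value, just to illustrate the decoding step):

// a link such as http://johnhany.net/tag/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/
// leaves this percent-encoded remainder after stripping "tag/" and the trailing slash
String encoded = "%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0";
String tagname = java.net.URLDecoder.decode(encoded, "UTF-8");
System.out.println(tagname);   // prints the Chinese tag name: 机器学习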

Results

Below are two database screenshots showing part of the program's results:

(screenshots: the record and tags tables of the crawler database)