Parsing DBLP Data

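[New-2016-3-8]: Google Code has shut down, so the latest version of the code has moved to GitHub: https://github.com/kite1988/dblp-parser

Running the code requires the JVM arguments -Xmx1G -DentityExpansionLimit=2500000. If you run into any problems, reply to this post or send me an email.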

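The paper I am following uses DBLP as its experimental data set, so I plan to use DBLP data for my own paper as well. I could not find an example online that parses dblp.xml and stores the result in a database, so I had to write one myself. The official DBLP site provides a simple SAX-based parsing example (http://dblp.uni-trier.de/db/about/simpleparser/), and with that as a starting point I wrote my own XML parser.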

I. The dblp.xml file format

dblp.xml uses 35 kinds of element in total, including:

<series> <sub> (inproceedings:title) <www> <dblp> <booktitle>

<sup> (inproceedings:title) <publisher> <journal> <author> <chapter>

<i> (inproceedings:title) <cite> <editor> <ee> <school> <article>

<tt> (inproceedings:title) <address> <cdrom> <book> <month>

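Pay special attention to sub, sup, i, and tt when parsing: their parent node is title. If the parser does not handle them explicitly, the title value may come out wrong, for example truncated at the first nested tag.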
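The title problem above can be handled in a SAX handler by collecting text from the moment <title> opens until it closes, regardless of any nested tags in between. The following is a minimal sketch of that idea, not the code from the repository; the class name TitleHandler and the helper parseTitles are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the full text of every <title>,
// including text inside nested <sub>, <sup>, <i>, and <tt>.
public class TitleHandler extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0);
        }
        // Nested sub/sup/i/tt are deliberately ignored here:
        // inTitle stays true, so their text is still collected.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only the closing </title> ends collection, not </sub> and friends.
        if ("title".equals(qName)) {
            titles.add(buffer.toString());
            inTitle = false;
        }
    }

    public List<String> getTitles() {
        return titles;
    }

    // Convenience helper (an assumption of this sketch) for parsing a small XML string.
    public static List<String> parseTitles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getTitles();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dblp><inproceedings><title>On H<sub>2</sub>O and "
                + "<i>x</i><sup>2</sup></title></inproceedings></dblp>";
        System.out.println(parseTitles(xml)); // prints [On H2O and x2]
    }
}
```

A handler that only remembers "the current element name" would stop appending as soon as <sub> opens, which is the kind of broken title described above. Whether the markup should be dropped, as here, or preserved in some form depends on how the titles are used later.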

II. Suggested database columns

The column types and lengths below can serve as a reference when creating the table:

`id` int(8) NOT NULL auto_increment COMMENT 'The internal key in the database',
`key` varchar(150) NOT NULL default '' COMMENT 'The key in the xml file',
`mdate` date NOT NULL COMMENT 'The last modification date of the entry',
`title` longtext NOT NULL COMMENT 'Title of the publication',
-- `source` varchar(150) default NULL COMMENT 'Name to the publication source, i.e. Conference, Journal, etc.; for collections, the booktitle is stored here',
-- `source_id` varchar(50) default NULL COMMENT 'Reference to the publication source (first part of the dblp_key)',
-- `type` varchar(20) NOT NULL default '' COMMENT 'Type of publication, i.e. article, proceedings, etc.',
`booktitle` varchar(150) default NULL COMMENT 'Name of incollection',
`pages` varchar(100) default NULL COMMENT 'Pages in the source, i.e. for example the journal',
`year` int(4) unsigned NOT NULL default '0' COMMENT 'The year of the publication',
`address` varchar(100) default NULL COMMENT 'Address of conference (in proceedings)',
`journal` varchar(150) default NULL COMMENT 'Name of journal where article is published',
`volume` varchar(50) default NULL COMMENT 'Volume of the source where the publication was published',
`number` varchar(20) default NULL COMMENT 'Number of the source where the publication was published',
`month` varchar(30) default NULL COMMENT 'Month(s) when the publication was published',
`url` varchar(150) default NULL COMMENT 'DBLP-internal URL (starting with db/...) where a web-page for that publication can be found on DBLP',
`ee` varchar(200) default NULL COMMENT 'external URL to the electronic edition of the publication',
`cdrom` varchar(200) default NULL COMMENT 'external Path to the PDF version of the electronic edition of the publication',
`publisher` varchar(250) default NULL COMMENT 'Name of the publisher of the publication; school for theses; affiliation for homepages',
`note` varchar(100) default NULL COMMENT 'Note of the inproceeding',
`crossref` varchar(50) default NULL COMMENT 'dblpkey crossreference to one other publication (book, proceeding, in the dblp_collections table), in which this publication was published',
`isbn` varchar(25) default NULL COMMENT 'ISBN number of the collection',
`series` varchar(100) default NULL COMMENT 'Reference to the publication series (books and proceedings only)',
`school` varchar(100) default NULL COMMENT 'School of the author',
`chapter` varchar(10) default NULL COMMENT 'Chapter in incollection'

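III. Pitfalls in the dblp.xml data

1. The key is not unique.

For example, the key of an inproceedings entry has the form conf/<conference name>/<author + year>. The author part is abbreviated (it appears to be the surname), so identical keys can occur when people with similar names publish at the same conference in the same year.

I originally planned to use the key directly as the primary key in the database. In the end I had to give that up and add an auto-increment column as the primary key.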

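2. cite

<cite label="PBR">...</cite> is baffling; to this day I have not worked out what it is supposed to mean.

<cite>...</cite> is even more obscure, and I do not know what purpose it serves here.

I discarded this kind of data when parsing.

In addition, a single <inproceedings> can contain completely identical <cite> </cite> elements. Speechless once more, I had to add an auto-increment column as the primary key of the paper-citation table as well.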
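Before settling on a primary key, it can help to check whether keys really collide in the particular dump being loaded (pitfall 1 above). The sketch below counts how often each key is seen and reports the repeated ones. KeyCounter is a made-up helper, and the keys in the usage example are invented for illustration, not taken from dblp.xml.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: feed it every key attribute seen while parsing,
// then ask which keys appeared more than once.
public class KeyCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    public void add(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    // Returns the keys seen at least twice, sorted for stable output.
    public List<String> duplicates() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        KeyCounter counter = new KeyCounter();
        // Invented keys in the conf/<conference>/<author + year> shape.
        counter.add("conf/demo/Smith99");
        counter.add("conf/demo/Lee99");
        counter.add("conf/demo/Smith99");
        System.out.println(counter.duplicates()); // prints [conf/demo/Smith99]
    }
}
```

In a SAX handler, add would typically be called from startElement with the value of the key attribute. If duplicates() comes back empty for a given dump, a unique index on the key column alongside the auto-increment id may be worth considering.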

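IV. Configuring Eclipse for parsing

dblp.xml is large: the smallest version I have found is 130 MB (2002-10), and the latest is about 676 MB. The official DBLP example program recommends parsing with Xerces. JDK 1.6 is said to have a bug that prevents it from parsing large XML files, while JDK 1.5 needs extra parameters: java -mx900M -DentityExpansionLimit=2500000. That said, I have successfully parsed the 676 MB file under JDK 1.6 with those parameters set.

Later I moved my development environment to Eclipse. Eclipse comes with Xerces, but the Java parameters still have to be set. Otherwise the parser fails with an error saying the number of entity expansions has exceeded the limit (64,000 is the JDK default). Configure it as follows: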

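1. Select the class that contains main, right-click it, and choose "Run As" -> "Open Run Dialog".

2. Select the "Arguments" tab at the upper right, then enter "-mx900M -DentityExpansionLimit=2500000" in the "VM arguments" box below.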
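Because a forgotten VM argument only shows up as a parser error partway through a very large file, it can be convenient to check the setting at startup. The sketch below is one possible pre-flight check, assuming the limit is passed as the entityExpansionLimit system property exactly as in the steps above; ParserPreflight and limitAtLeast are made-up names.

```java
// Hypothetical start-up check: fail fast when the JVM was launched
// without a large enough -DentityExpansionLimit value.
public class ParserPreflight {
    static final String PROPERTY = "entityExpansionLimit";

    // True when the system property is set to a number >= required.
    public static boolean limitAtLeast(long required) {
        String value = System.getProperty(PROPERTY);
        if (value == null) {
            return false;
        }
        try {
            return Long.parseLong(value.trim()) >= required;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!limitAtLeast(2_500_000L)) {
            System.err.println("Start the JVM with -DentityExpansionLimit=2500000");
            System.exit(1);
        }
        // Report the heap ceiling too, since the post also raises it.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MB): " + maxHeapMb);
    }
}
```

Calling limitAtLeast at the top of the real main method gives an immediate, readable message instead of a late SAX exception.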

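V. Writing to the database

At first I worried that the data set was so large that writing it to the database would be slow, so I looked up some speed-up techniques online, such as batch processing. It turned out that the write speed was reasonably fast.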

// Insert one parsed record; the connection is in its default auto-commit mode.
PreparedStatement stmt = conn.prepareStatement(
        "insert into temp_inproc(title,year,conference,id) values(?,?,?,?)");
stmt.setString(1, paper.getTitle());
stmt.setInt(2, paper.getYear());
stmt.setString(3, paper.getConference());
stmt.setString(4, paper.getKey()); // the id column receives the dblp key
stmt.execute();
stmt.close();

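No optimization was done: after reading one element such as inproceedings and assembling its information, I write it to the database once, using the default auto-commit mode. With all the System.out.println calls commented out, the speed is tolerable. In total, more than 820,000 records (50.7 MB of data) were written in 191.334 s.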
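For comparison, the batch-processing approach mentioned above could look roughly like the sketch below. It is not what was used for the timing quoted here and has not been benchmarked against it. It reuses the temp_inproc insert statement; BatchInsert, the nested Paper class, and the batchSize parameter are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical batched variant of the single-row insert.
public class BatchInsert {

    // Stand-in for the post's paper object (an assumption of this sketch).
    public static class Paper {
        final String title;
        final int year;
        final String conference;
        final String key;

        public Paper(String title, int year, String conference, String key) {
            this.title = title;
            this.year = year;
            this.conference = conference;
            this.key = key;
        }
    }

    // Inserts all papers, sending batchSize rows per round trip and
    // committing after each batch. Returns the number of batches sent.
    public static int insertAll(Connection conn, List<Paper> papers, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        String sql = "insert into temp_inproc(title,year,conference,id) values(?,?,?,?)";
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit per batch instead of per row
        int pending = 0;
        int batches = 0;
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Paper paper : papers) {
                stmt.setString(1, paper.title);
                stmt.setInt(2, paper.year);
                stmt.setString(3, paper.conference);
                stmt.setString(4, paper.key);
                stmt.addBatch();
                pending++;
                if (pending == batchSize) {
                    stmt.executeBatch();
                    conn.commit();
                    pending = 0;
                    batches++;
                }
            }
            if (pending > 0) { // flush the final partial batch
                stmt.executeBatch();
                conn.commit();
                batches++;
            }
        } catch (SQLException e) {
            conn.rollback(); // discard the uncommitted batch
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

The statement is prepared once and reused, and each commit covers a whole batch rather than a single row. How much this helps depends on the database and the JDBC driver.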
