Web Data Mining (5): Data Extraction with Aperture (1)

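While searching online I found that the Aperture framework can extract data from file systems. Aperture is described as follows:

Aperture is a Java framework for crawling and indexing information sources such as file systems, websites, and IMAP and Outlook mailboxes, as well as the files stored in them (documents, images), and for extracting their full-text content and metadata. It currently supports the following file formats: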

Plain text
HTML, XHTML
XML
PDF (Portable Document Format)
RTF (Rich Text Format)
Microsoft Office: Word, Excel, PowerPoint, Visio, Publisher
Microsoft Works
OpenOffice 1.x: Writer, Calc, Impress, Draw
StarOffice 6.x - 7.x+: Writer, Calc, Impress, Draw
OpenDocument (OpenOffice 2.x, StarOffice 8.x)
Corel WordPerfect, Quattro, Presentations
Emails (.eml files)

The current Aperture version is 1.6.0. Its wiki is at http://aperture.wiki.sourceforge.net/

SVN repository: https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/

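a First install svn: sudo apt-get install subversion

b Create an aperture directory under the workspace directory

c Change into that directory: cd workspace/aperture/

d Run svn co https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/

e Change into the checked-out trunk directory and run mvn eclipse:eclipse

f Open Eclipse and import the project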

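What developers care about most is how to use the Aperture API to extract the content of a file. Below I write a small demo as a Maven-managed project.

First create a Maven project in Eclipse and add the Aperture dependencies to pom.xml. My configuration is as follows: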

    <repositories>
        <repository>
            <id>aperture-repo</id>
            <url>http://aperture.sourceforge.net/maven/</url>
            <name>Aperture Maven Repository</name>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.semanticdesktop.aperture</groupId>
            <artifactId>aperture-core</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.semanticdesktop.aperture</groupId>
            <artifactId>aperture-runtime-optional</artifactId>
            <version>1.6.0</version>
            <type>pom</type>
        </dependency>
    </dependencies>

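Note that the extra Maven repository has to be configured here, because these artifacts are not available in Maven Central. If the repository above cannot be reached, you also need to configure a proxy server (in ${m2_home}/conf/settings.xml). Once the repository resolves, Maven downloads the dependency jars automatically.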
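For the proxy case mentioned above, a minimal sketch of the relevant settings.xml entry might look like the following. The id, host, and port are placeholder values, not taken from the original setup; substitute the details of your own proxy.

```xml
<settings>
    <proxies>
        <proxy>
            <!-- placeholder id; any unique name works -->
            <id>example-proxy</id>
            <active>true</active>
            <protocol>http</protocol>
            <!-- placeholder host and port; replace with your proxy -->
            <host>proxy.example.com</host>
            <port>8080</port>
        </proxy>
    </proxies>
</settings>
```

If the proxy requires authentication, the same proxy element also accepts username and password children.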

Next, create a Crawler class that fetches the text content of a web page:

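// Package names below are those I believe Aperture 1.6.0 uses;
// let the IDE re-resolve them if your version differs.
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.accessor.FileDataObject;
import org.semanticdesktop.aperture.accessor.http.HttpAccessor;
import org.semanticdesktop.aperture.extractor.html.HtmlExtractor;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl;
import org.semanticdesktop.aperture.vocabulary.NIE;

public class Crawler {

    public static void main(String[] args) throws Exception {
        Crawler crawler = new Crawler();
        System.out.println(crawler.extract("http://news.sina.com.cn/s/2013-06-07/044127337162.shtml"));
    }

    /**
     * Fetches the given URL and returns its plain-text content,
     * or null if the result is not a FileDataObject.
     */
    public String extract(String url) throws Exception {
        // Fetch the resource; the factory supplies the container that holds its metadata.
        DataObject dao = new HttpAccessor().getDataObject(url, null, null, new RDFContainerFactoryImpl());
        if (dao instanceof FileDataObject) {
            FileDataObject fdo = (FileDataObject) dao;
            // Charset and MIME type are passed as null, leaving both to the extractor.
            // The extracted text is written into the object's metadata container.
            new HtmlExtractor().extract(fdo.getID(), fdo.getContent(), null, null, fdo.getMetadata());
            return fdo.getMetadata().getString(NIE.plainTextContent);
        } else {
            return null;
        }
    }

}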

Run the class's main method and the plain-text content of the given URL is printed to the console.
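To get a feel for what "plain-text content" means here without putting Aperture on the classpath, a rough stdlib-only stand-in for the extraction step could look like the sketch below. This is not how HtmlExtractor works internally; it is a regex approximation for quick sanity checks, and the class name PlainTextSketch and method stripTags are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

public class PlainTextSketch {

    // Hypothetical helper, not part of Aperture: a regex-based approximation only.
    // Matches whole <script>...</script> and <style>...</style> blocks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes script/style blocks and tags, then collapses whitespace. */
    public static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Title</h1><script>var x = 1;</script>"
                + "<p>Hello world</p></body></html>";
        System.out.println(stripTags(html)); // prints "Title Hello world"
    }
}
```

Unlike this sketch, a real extractor also deals with character entities, encodings, and malformed markup, which is the reason to use the library for actual pages.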

---------------------------------------------------------------------------

This Web Data Mining series is my own original work.
