HBase Java API Explained

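HBase is the database of Hadoop. It provides random, real-time read/write access to big data, and it is an open-source, distributed, multi-versioned, column-oriented store.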

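Before getting to the API, here is a brief look at the overall architecture of HBase (shown as a figure in the original post).

The HBase master is the server that manages all HRegion servers. It does not store any table data itself. A logical HBase table may be split into multiple HRegions, which are stored across the cluster of HRegion servers; what the master keeps is the mapping from data to HRegion servers.

A machine runs only one HRegion server, and every data operation is recorded in the HLog. On a read, the HRegion first checks the HMemcache cache and falls back to the HStore only if the data is not there. Each column family has its own HStore, and each HStore holds many HStoreFile files; these files have an indexed, B-tree-like structure that makes reads fast.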

Next, the physical view of HBase data:

row key    timestamp    column family
                        uri                           parser
                        url=http://www.taobao.com     title=天天特價
                        host=taobao.com
                        url=http://www.alibaba.com    content=每天…
                        host=alibaba.com

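- row key: the primary key of the table. Records in a table are sorted by row key.
- timestamp: the timestamp attached to each data operation; it can be treated as the version number of the data.
- column family: horizontally, a table is made up of one or more column families, and a column family can contain any number of columns. In other words, column families can grow dynamically: neither the number nor the types of columns have to be defined in advance. All columns are stored as binary data, so users must do type conversion themselves.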
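To make the three terms above concrete, here is a minimal sketch of the logical view as nested sorted maps. It uses only the Java standard library and is not how HBase stores data internally; the class name LogicalViewSketch, the "family:qualifier" column notation, and the sample row keys are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model (not HBase code): row key -> column -> timestamp -> value. */
final class LogicalViewSketch {
    // Rows are kept sorted by row key; String ordering stands in for byte ordering here.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>();

    /** Stores one version of a cell; column is written as "family:qualifier". */
    void put(String rowKey, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, byte[]>>())
            // Versions are ordered newest first, so the first entry is the latest one.
            .computeIfAbsent(column, k -> new TreeMap<Long, byte[]>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    byte[] getLatest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, byte[]> versions = row.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    /** Row keys in sorted order. */
    Iterable<String> rowKeys() {
        return rows.keySet();
    }

    public static void main(String[] args) {
        LogicalViewSketch t = new LogicalViewSketch();
        t.put("row2", "uri:host", 1L, "alibaba.com".getBytes(StandardCharsets.UTF_8));
        t.put("row1", "uri:host", 1L, "taobao.com".getBytes(StandardCharsets.UTF_8));
        t.put("row1", "uri:host", 2L, "www.taobao.com".getBytes(StandardCharsets.UTF_8));
        // Values are raw bytes, so the caller converts them back to a String.
        System.out.println(new String(t.getLatest("row1", "uri:host"), StandardCharsets.UTF_8));
        System.out.println(t.rowKeys()); // prints [row1, row2]
    }
}
```

The first line printed is www.taobao.com, the value with the highest timestamp, and the row keys come back sorted regardless of insertion order.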

With HBase's architecture and data view covered, let's look at how to operate on HBase data from Java.

First, the API itself.

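HBaseConfiguration is an object that every HBase client uses; it represents the HBase configuration. It has two constructors:

    public HBaseConfiguration()
    public HBaseConfiguration(final Configuration c)

The default constructor tries to read the configuration from hbase-default.xml and hbase-site.xml. If neither file is on the classpath, you have to set the configuration yourself. (Later HBase releases deprecate these constructors in favor of the static factory HBaseConfiguration.create().)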

    // Set the ZooKeeper connection details by hand when no hbase-site.xml is available.
    Configuration hbaseConf = new Configuration();
    hbaseConf.set("hbase.zookeeper.quorum", "zkserver");
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181");
    HBaseConfiguration cfg = new HBaseConfiguration(hbaseConf);

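Creating a table

Tables are created through the HBaseAdmin object, which handles table meta information. HBaseAdmin provides the createTable method:

    public void createTable(HTableDescriptor desc)

HTableDescriptor represents the schema of a table. Its more useful methods include:

setMaxFileSize: sets the maximum region size.
setMemStoreFlushSize: sets the MemStore size at which its contents are flushed to a file on HDFS.

Column families are added with the addFamily method:

    public void addFamily(final HColumnDescriptor family)

HColumnDescriptor represents the schema of a column family. Its commonly used methods include:

setTimeToLive: sets the maximum TTL, in seconds (not milliseconds). Expired data is deleted automatically.
setInMemory: sets whether the family is kept in memory. Useful for small tables, where it can improve efficiency. Off by default.
setBloomfilter: sets whether a Bloom filter is used, which can speed up random lookups. Off by default.
setCompressionType: sets the compression type. No compression by default.
setMaxVersions: sets the maximum number of versions kept. The default is 3 in the API generation described here (HBase 0.96 and later default to 1).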

A simple example that creates a table with four column families:

    HBaseAdmin hAdmin = new HBaseAdmin(hbaseConfig);
    HTableDescriptor t = new HTableDescriptor(tableName);
    t.addFamily(new HColumnDescriptor("f1"));
    t.addFamily(new HColumnDescriptor("f2"));
    t.addFamily(new HColumnDescriptor("f3"));
    t.addFamily(new HColumnDescriptor("f4"));
    hAdmin.createTable(t);

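Deleting a table

Tables are also deleted through HBaseAdmin. A table must be disabled before it can be deleted. This is a very time-consuming operation, so deleting tables frequently is not recommended.

disableTable and deleteTable disable and delete a table, respectively.

Example:

    if (hAdmin.tableExists(tableName)) {
        hAdmin.disableTable(tableName);
        hAdmin.deleteTable(tableName);
    }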

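Querying data

Queries come in two kinds: single-row random reads and batch reads.

A single-row query looks up one row in a table by its row key. HTable provides the get method for this.

A batch query reads a specified range of row keys. HTable provides the getScanner method for this.

    public Result get(final Get get)
    public ResultScanner getScanner(final Scan scan)

The Get object holds the information a get query needs. It has two constructors:

    public Get(byte[] row)
    public Get(byte[] row, RowLock rowLock)

RowLock guarantees the atomicity of reads and writes. You can pass in an existing RowLock; otherwise HBase creates a new one automatically.

Scan has a default constructor, and that is the one normally used.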

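Commonly used methods shared by Get and Scan:

addFamily/addColumn: selects the families or columns to return. If neither addFamily nor addColumn is called, all columns are returned.
setMaxVersions: sets the maximum number of versions to return. Called with no argument, it returns all versions. If it is never called, only the latest version is returned.
setTimeRange: sets the minimum and maximum timestamps (minimum inclusive, maximum exclusive). Only cells inside this range are returned.
setTimeStamp: selects one specific timestamp.
setFilter: sets a filter that screens out unwanted data.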
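How setMaxVersions and setTimeRange interact can be sketched with a plain sorted map of timestamp to value. This is only a stand-in built on the Java standard library, not HBase code: the class VersionSelectionSketch and its select method are hypothetical names, and the half-open range (minimum inclusive, maximum exclusive) is the assumption being illustrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy version selection for one cell (not HBase code). */
final class VersionSelectionSketch {
    /** Returns up to maxVersions values, newest first, with minStamp <= timestamp < maxStamp. */
    static List<String> select(NavigableMap<Long, String> versions,
                               long minStamp, long maxStamp, int maxVersions) {
        List<String> out = new ArrayList<String>();
        // Restrict to the time range first, then walk from the newest version down.
        for (String v : versions.subMap(minStamp, true, maxStamp, false).descendingMap().values()) {
            if (out.size() == maxVersions) break;
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> versions = new TreeMap<Long, String>();
        versions.put(10L, "v10");
        versions.put(20L, "v20");
        versions.put(30L, "v30");
        // Like never calling setMaxVersions: only the latest version.
        System.out.println(select(versions, 0L, Long.MAX_VALUE, 1));                 // prints [v30]
        // Like setMaxVersions() with no argument: every version.
        System.out.println(select(versions, 0L, Long.MAX_VALUE, Integer.MAX_VALUE)); // prints [v30, v20, v10]
        // A time range of [10, 30) leaves out the cell stamped 30.
        System.out.println(select(versions, 10L, 30L, Integer.MAX_VALUE));           // prints [v20, v10]
    }
}
```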

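Scan-specific methods:

setStartRow: sets the row to start from (inclusive). If it is not called, the scan starts at the beginning of the table.
setStopRow: sets the row to stop at (exclusive).
setBatch: limits the number of cells returned in each Result. This prevents OutOfMemory errors when a single row holds too much data.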
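The start/stop semantics are easy to get wrong by one row, so here is a small sketch that mimics them with a TreeMap. It is a standard-library stand-in, not the HBase client: ScanRangeSketch and scan are made-up names, and a null bound is this sketch's own way of saying "not set".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy illustration of scan bounds: start row inclusive, stop row exclusive. */
final class ScanRangeSketch {
    /** Returns the row keys in [startRow, stopRow); a null bound means "unbounded". */
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        // An inverted range selects nothing.
        if (startRow != null && stopRow != null && startRow.compareTo(stopRow) > 0) {
            return new ArrayList<String>();
        }
        NavigableMap<String, String> view = table;
        if (startRow != null) view = view.tailMap(startRow, true);  // include the start row
        if (stopRow != null) view = view.headMap(stopRow, false);   // exclude the stop row
        return new ArrayList<String>(view.keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<String, String>();
        for (String k : new String[] {"row1", "row2", "row3", "row4"}) {
            table.put(k, "v");
        }
        System.out.println(scan(table, "row2", "row4")); // prints [row2, row3]
        System.out.println(scan(table, null, "row3"));   // prints [row1, row2]
    }
}
```

To include the last row in a real scan, a common approach is to pass a stop row that sorts just after it.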

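ResultScanner is a container of Result objects. Each call to its next method returns a Result.

    public Result next() throws IOException;
    public Result[] next(int nbRows) throws IOException;

Result represents the data of one row. Commonly used methods:

getRow: returns the row key.
raw: returns the array of all KeyValue objects.
getValue: returns the value of a cell, looked up by column.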

    Scan s = new Scan();
    s.setMaxVersions();                      // fetch every version of each cell
    ResultScanner ss = table.getScanner(s);
    for (Result r : ss) {
        System.out.println(new String(r.getRow()));
        for (KeyValue kv : r.raw()) {
            System.out.println(new String(kv.getColumn()));
        }
    }
    ss.close();                              // release the scanner when done

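Inserting data

HTable inserts data through the put method.

    public void put(final Put put) throws IOException
    public void put(final List<Put> puts) throws IOException

Pass a single Put object for a single-row insert, or a List of Put objects for a batch insert.

Put has three constructors:

    public Put(byte[] row)
    public Put(byte[] row, RowLock rowLock)
    public Put(Put putToCopy)

Commonly used Put methods:

add: adds a cell.
setTimeStamp: sets the default timestamp for all cells. A cell added without its own timestamp uses this value. If it is never called, HBase uses the current time as the timestamp of such cells.
setWriteToWAL: WAL stands for write-ahead log. This controls whether HBase writes to the log before applying the insert. It is on by default. Turning it off improves performance, but data may be lost if the system fails (that is, if the region server handling the insert goes down).

HTable also has two methods that affect insert performance:

setAutoFlush: controls whether each put call is submitted to the HBase server immediately. The default is true, so every call is submitted. With single-row inserts this means more I/O and therefore lower performance.
setWriteBufferSize: the write buffer size applies when autoFlush is false. The default is 2 MB: once the buffered data exceeds 2 MB, it is submitted to the server automatically.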
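The effect of these two settings can be sketched with a toy buffer that counts simulated server calls. This is an assumption-laden model written against the Java standard library only: WriteBufferSketch, roundTrips, and the "one flush equals one round trip" accounting are invented for illustration and do not reflect how the real client batches RPCs per region server.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy client-side write buffer (not HBase code): puts accumulate locally and go out in batches. */
final class WriteBufferSketch {
    private final boolean autoFlush;
    private final long writeBufferSize;                 // flush threshold in bytes
    private final List<byte[]> buffer = new ArrayList<byte[]>();
    private long bufferedBytes = 0;
    private int roundTrips = 0;                         // number of simulated server calls

    WriteBufferSketch(boolean autoFlush, long writeBufferSize) {
        this.autoFlush = autoFlush;
        this.writeBufferSize = writeBufferSize;
    }

    void put(byte[] payload) {
        buffer.add(payload);
        bufferedBytes += payload.length;
        // With autoFlush on, every put goes out at once; otherwise only when the buffer overflows.
        if (autoFlush || bufferedBytes > writeBufferSize) {
            flushCommits();
        }
    }

    void flushCommits() {
        if (buffer.isEmpty()) return;
        roundTrips++;                                   // one simulated call carries the whole buffer
        buffer.clear();
        bufferedBytes = 0;
    }

    int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        byte[] row = new byte[1024];
        WriteBufferSketch eager = new WriteBufferSketch(true, 2L * 1024 * 1024);
        WriteBufferSketch buffered = new WriteBufferSketch(false, 2L * 1024 * 1024);
        for (int i = 0; i < 10000; i++) {
            eager.put(row);
            buffered.put(row);
        }
        buffered.flushCommits();                        // push whatever is still in the buffer
        System.out.println(eager.roundTrips());         // prints 10000
        System.out.println(buffered.roundTrips());      // prints 5
    }
}
```

The final explicit flush matters: with autoFlush off, anything still sitting in the buffer is not sent until the buffer overflows or the caller flushes.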

    HTable table = new HTable(hbaseConfig, tableName);
    table.setAutoFlush(autoFlush);
    List<Put> lp = new ArrayList<Put>();
    int count = 10000;
    byte[] buffer = new byte[1024];
    Random r = new Random();
    for (int i = 1; i <= count; ++i) {
        Put p = new Put(String.format("row%09d", i).getBytes());
        r.nextBytes(buffer);
        // A null qualifier stores the value under the bare family name.
        p.add("f1".getBytes(), null, buffer);
        p.add("f2".getBytes(), null, buffer);
        p.add("f3".getBytes(), null, buffer);
        p.add("f4".getBytes(), null, buffer);
        p.setWriteToWAL(wal);
        lp.add(p);
        if (i % 1000 == 0) {       // submit in batches of 1000 rows
            table.put(lp);
            lp.clear();
        }
    }
    table.flushCommits();          // send anything still buffered when autoFlush is false
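One detail in the insert example is worth spelling out: the row keys are built with String.format("row%09d", i). Because tables are sorted by row key and keys compare as bytes, zero-padding is what keeps numeric order and key order in agreement. The sketch below shows the difference with plain strings; RowKeyPaddingSketch is a made-up name, and Locale.ROOT is added here only so the digits are always ASCII.

```java
import java.util.Locale;
import java.util.TreeSet;

/** Shows why zero-padded numeric row keys sort in numeric order. */
final class RowKeyPaddingSketch {
    /** Builds a key such as row000000042; Locale.ROOT keeps the digits ASCII. */
    static String paddedKey(int i) {
        return String.format(Locale.ROOT, "row%09d", i);
    }

    public static void main(String[] args) {
        TreeSet<String> padded = new TreeSet<String>();
        TreeSet<String> unpadded = new TreeSet<String>();
        for (int i : new int[] {2, 10, 1}) {
            padded.add(paddedKey(i));
            unpadded.add("row" + i);
        }
        System.out.println(padded);   // prints [row000000001, row000000002, row000000010]
        System.out.println(unpadded); // prints [row1, row10, row2]
    }
}
```

Without padding, row10 sorts before row2, so a range scan over "rows 1 to 9" would pick up unrelated keys.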

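Deleting data

HTable deletes data through the delete method.

    public void delete(final Delete delete)

Delete has the following constructors:

    public Delete(byte[] row)
    public Delete(byte[] row, long timestamp, RowLock rowLock)
    public Delete(final Delete d)

Commonly used Delete methods:

deleteFamily/deleteColumns: selects the family or column whose data is to be deleted. If no such method is called, the whole row is deleted.

Note: if a cell's timestamp is later than the current time, that cell is not deleted and can still be read.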
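The note about future timestamps follows from treating a delete as a timestamped marker that hides cells stamped at or before it. Below is a deliberately simplified model of that idea for a single cell, using only the Java standard library; DeleteMarkerSketch and its methods are hypothetical, and real HBase tracks several kinds of markers per family and column.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model (not HBase code): a delete is a marker that hides cells stamped at or before it. */
final class DeleteMarkerSketch {
    private final NavigableMap<Long, String> versions = new TreeMap<Long, String>(); // timestamp -> value
    private long deleteMarker = Long.MIN_VALUE;          // no delete issued yet

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Records a delete as of the given timestamp (normally the current time). */
    void delete(long timestamp) {
        deleteMarker = Math.max(deleteMarker, timestamp);
    }

    /** Newest value not hidden by the marker, or null if nothing is visible. */
    String get() {
        Map.Entry<Long, String> newest = versions.lastEntry();
        return (newest != null && newest.getKey() > deleteMarker) ? newest.getValue() : null;
    }

    public static void main(String[] args) {
        long now = 1000L;

        DeleteMarkerSketch past = new DeleteMarkerSketch();
        past.put(now - 10, "old");
        past.delete(now);
        System.out.println(past.get());    // prints null: the cell is hidden

        DeleteMarkerSketch future = new DeleteMarkerSketch();
        future.put(now + 10, "from the future");
        future.delete(now);
        System.out.println(future.get());  // prints from the future: the marker is older than the cell
    }
}
```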

    HTable table = new HTable(hbaseConfig, "mytest");
    Delete d = new Delete("row1".getBytes());
    table.delete(d);

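Splitting a table

HBaseAdmin provides the split method for splitting a table.

    public void split(final String tableNameOrRegionName)

Given a table name, it splits every region of the table; given a region name, it splits only that region.

Because split is asynchronous, the number of regions cannot be controlled precisely.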

    public void split(String tableName, int number, int timeout) throws Exception {
        // GlobalConf is the author's own holder for connection settings.
        Configuration hbaseConf = new Configuration();
        hbaseConf.set("hbase.zookeeper.quorum", GlobalConf.ZOOKEEPER_QUORUM);
        hbaseConf.set("hbase.zookeeper.property.clientPort", GlobalConf.ZOOKEEPER_PORT);
        HBaseConfiguration cfg = new HBaseConfiguration(hbaseConf);
        HBaseAdmin hAdmin = new HBaseAdmin(cfg);
        HTable hTable = new HTable(cfg, tableName);
        int oldSize = 0;
        long t = System.currentTimeMillis();
        while (true) {
            int size = hTable.getRegionsInfo().size();
            logger.info("the region number=" + size);
            if (size >= number) break;                  // enough regions: done
            if (size != oldSize) {
                // The region count changed since the last check, so request another split.
                hAdmin.split(hTable.getTableName());
                oldSize = size;
            } else if (System.currentTimeMillis() - t > timeout) {
                break;                                  // no progress within the timeout: give up
            }
            Thread.sleep(1000 * 10);                    // wait before polling again
        }
    }
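Stripped of the HBase calls, the split helper is a poll-until-target-or-timeout loop. The sketch below isolates that pattern so it can be run on its own. PollUntilSketch and waitFor are made-up names, the doubling counter merely stands in for an asynchronous split, and unlike the helper above this version uses a fixed deadline.

```java
import java.util.function.IntSupplier;

/** Generic "poll until the target is reached or the timeout expires" loop. */
final class PollUntilSketch {
    /** Polls current until it returns at least target or timeoutMillis elapses; returns the last value seen. */
    static int waitFor(IntSupplier current, int target, long timeoutMillis, long sleepMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int value = current.getAsInt();
        while (value < target && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(sleepMillis);              // wait before looking again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();     // preserve the interrupt and stop polling
                break;
            }
            value = current.getAsInt();
        }
        return value;
    }

    public static void main(String[] args) {
        int[] regions = {1};
        // Hypothetical stand-in for an asynchronous split: the count doubles on every poll.
        int reached = waitFor(() -> regions[0] *= 2, 8, 1000L, 10L);
        System.out.println(reached); // prints 8
    }
}
```

Returning the last observed value lets the caller tell a reached target apart from a timeout.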
