About HBase Secondary Indexes

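HBase is a column-oriented database in which each row has a single primary key, the RowKey, so data cannot be retrieved directly by the value of a given column. A query has to look rows up by RowKey and then inspect the column of interest, which is inefficient. In practice we often need to search by a specific column, or by a combination of several columns, and this is where the need for HBase secondary indexes comes from.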
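
To make the motivation concrete, here is a minimal Python sketch that contrasts a full scan on a column value with a lookup through a secondary index. Plain dictionaries stand in for the HBase table, and the sample rows are made up for illustration; this shows the idea only, not how HBase or Solr stores data.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HBase table: rowkey -> {column: value}.
table = {
    "row1": {"data:name": "u1", "data:psd": "123"},
    "row2": {"data:name": "u2", "data:psd": "123"},
    "row3": {"data:name": "u2", "data:psd": "456"},
}


def scan_by_column(table, column, value):
    """Without an index: visit every row and inspect the column."""
    return sorted(rk for rk, cells in table.items() if cells.get(column) == value)


def build_index(table, column):
    """Secondary index: column value -> set of rowkeys holding that value."""
    index = defaultdict(set)
    for rowkey, cells in table.items():
        if column in cells:
            index[cells[column]].add(rowkey)
    return index


name_index = build_index(table, "data:name")

# Both approaches return the same rowkeys; the index avoids touching every row.
print(scan_by_column(table, "data:name", "u2"))
print(sorted(name_index["u2"]))
```

The index has to be kept in step with every write to the table, which is the consistency cost that the approaches below handle in different ways.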

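Ways to build a secondary index: table index, column index, full-text index

A table index stores the index data in a separate table and uses an HBase Coprocessor to generate and access the index data.
A column index stores the index data in the same Region as the source data, defined as a separate column family; it also uses a Coprocessor to generate and access the index data. With a table index, consistency between the source table and the index table is hard to guarantee, and accessing two different tables increases I/O overhead and the number of remote calls. With a column index, the data volume of a single table grows sharply, and Split or Merge operations on multiple column families in the same Region may cause data loss or inconsistency.
Full-text index: implemented by the Lily HBase Indexer service in CDH5, which stores the HBase index data in SolrCloud. Indexing and searching do not affect the stability of HBase or its write throughput, because they are completely separate from HBase and asynchronous. Running Lily HBase Indexer on CDH5 requires the HBase, SolrCloud, and ZooKeeper services.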

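Solr-Based HBase Secondary Indexes: About HBase Secondary Indexes, About the Key-Value Indexer Component, Indexing with the Lily HBase Batch Indexer, Steps for Indexing Columns of a Table in an HBase Cluster, Problems Encountered (Q&A)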

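About the Key-Value Indexer Component

CDH official documentation

hbase-indexer official wiki

Reference blog: Email Indexing Using Cloudera Search and HBase

Reference blog: A First Look at Cloudera Search Solr

Reference blog: An HBase Secondary Index Scheme Based on UDH Search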

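The Key-Value Indexer in CDH 5.4 uses the Lily HBase NRT Indexer service. Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and has been open-sourced.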

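Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn these operations into a stream of events, which it uses to keep the index data written to Solr consistent with HBase. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results contain the user-defined columnfamily:qualifier fields, so an application can go straight to the corresponding HBase column data. Indexing and searching do not affect the stability of HBase or its write throughput, because the indexing and search processes are completely separate from HBase and asynchronous.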
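
The event flow described above can be pictured with a small Python simulation. This is a conceptual sketch only: a dictionary stands in for the Solr collection, and the (kind, rowkey, cells) tuples are a made-up event format, not the HBase replication protocol or the Indexer's actual code.

```python
# Conceptual simulation only. A dict stands in for the Solr collection, and the
# (kind, rowkey, cells) tuples are a made-up event format for illustration.
solr_docs = {}


def apply_event(event):
    """Apply one mutation event to the simulated index."""
    kind, rowkey, cells = event
    if kind == "put":
        doc = solr_docs.setdefault(rowkey, {"id": rowkey})
        for column, value in cells.items():
            # Store 'family:qualifier' under a Solr-friendly field name.
            doc[column.replace(":", "_")] = value
    elif kind == "delete":
        # Deleting the row removes its document from the index.
        solr_docs.pop(rowkey, None)


events = [
    ("put", "row1", {"data:name": "u1"}),
    ("put", "row1", {"data:name": "u2"}),  # an update overwrites the indexed value
    ("put", "row2", {"data:name": "u3"}),
    ("delete", "row2", {}),
]

# Events are replayed in order, after the writes have already completed in HBase.
for event in events:
    apply_event(event)

print(solr_docs)
```

Because the events are applied after the HBase write has returned, the index lags slightly behind the table, which is why the service is described as near real time.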

Running Lily HBase Indexer on CDH5 requires the HBase, SolrCloud, and ZooKeeper services.

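Indexing with the Lily HBase Batch Indexer

With Cloudera Search, you can batch-index HBase tables using MapReduce jobs. Batch indexing does not require any of the following:

HBase replication
The Lily HBase Indexer service
Registering a Lily HBase Indexer configuration with the Lily HBase Indexer service

The indexer supports flexible, custom, application-specific rules for extracting, transforming, and loading HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase, so applications can use the search result set to directly access the matching raw HBase cells.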

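Steps for indexing columns of a table in an HBase cluster:

Tutorial

Populate the HBase table
Create the corresponding SolrCloud collection
Create the Lily HBase Indexer configuration
Create the Morphline configuration file
Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service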

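Populating the HBase table

After configuring and starting the system, create an HBase table and add rows to it.

For each new table, set REPLICATION_SCOPE on every column family that needs to be indexed by issuing commands of the following form: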

$ hbase shell

# Test data: set REPLICATION_SCOPE on the column family.
# The first two commands remove an existing 'User' table; skip them if the table does not exist.
disable 'User'
drop 'User'
create 'User', {NAME => 'data', REPLICATION_SCOPE => 1}

# Add a new column family
disable 'User'
alter 'User', {NAME => 'detail', REPLICATION_SCOPE => 1}
enable 'User'

# Modify an existing column family
disable 'User'
alter 'User', {NAME => 'data', REPLICATION_SCOPE => 1}
enable 'User'

# Insert test data
put 'User','row1','data:name','u1'
put 'User','row1','data:psd','123'

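Creating the corresponding SolrCloud collection

A SolrCloud collection used for HBase indexing must have a Solr schema that accommodates the types of the HBase column families and qualifiers to be indexed. To begin, consider adding a catch-all data field to the default schema. Once you have settled on a schema, create the SolrCloud collection with commands of the following form:

Example configuration for the User table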

# Generate the instance configuration directory:
solrctl instancedir --generate $HOME/hbase-indexer/User

Edit the schema so that it includes the following:

vim $HOME/hbase-indexer/User/conf/schema.xml

<!-- Bound to the rowkey -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

<field name="name" type="string" indexed="true" stored="true"/>
<field name="psd" type="string" indexed="true" stored="true"/>
<field name="address" type="string" indexed="true" stored="true"/>
<field name="photo" type="string" indexed="true" stored="true"/>

# Create the collection instance directory and upload the configuration files to ZooKeeper:
solrctl instancedir --create User $HOME/hbase-indexer/User
# Once uploaded, other nodes can download the configuration from ZooKeeper. Next, create the collection:
solrctl collection --create User

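Note

In schema.xml, the uniqueKey must correspond to the HBase rowkey. The rowkey is represented by the 'id' field by default, so the schema must define the id field that uniqueKey refers to.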
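
A quick way to check this rule before uploading the configuration is to parse the schema and confirm that the uniqueKey names a declared field. The Python sketch below does that on a trimmed, hypothetical schema string; a real schema.xml contains many more elements, and the helper name check_unique_key is just for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical schema.xml content used only for illustration.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="name" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def check_unique_key(schema_xml):
    """Return (uniqueKey value, whether a <field> with that name is declared)."""
    root = ET.fromstring(schema_xml)
    unique_key = root.findtext("uniqueKey")
    field_names = {field.get("name") for field in root.iter("field")}
    return unique_key, unique_key in field_names


unique_key, declared = check_unique_key(SCHEMA_XML)
print(unique_key, declared)
```

To check a file on disk, read its contents and pass the text to the same function.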

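Creating the Lily HBase Indexer configuration

Indexer-configuration official reference

Create a morphline-hbase-mapper.xml file with the content shown below. The original setup places it under the HBase-Solr installation directory, /usr/lib/hbase-solr/; the command here uses the collection's conf directory, which is the path passed to hbase-indexer add-indexer at registration time.

$ vim $HOME/hbase-indexer/User/conf/morphline-hbase-mapper.xml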

<?xml version="1.0"?>
<!-- table: name of the HBase table to index -->
<!-- mapper: class that loads and runs the specified Morphline configuration; always MorphlineResultToSolrMapper -->
<indexer table="User" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

  <!-- The name attribute marks this parameter as the morphlineFile setting -->
  <!-- value: path to morphlines.conf. An absolute or relative path refers to the local filesystem; if morphlines.conf is managed by Cloudera Manager, use the plain value morphlines.conf -->
  <param name="morphlineFile" value="morphlines.conf"/>

  <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf -->
  <param name="morphlineId" value="userMap"/>
</indexer>

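Note: when the file is specified by an absolute or relative path, it must also exist at that path on the other machines in the cluster. If the file is managed through Cloudera Manager, you only need to edit it in CM, and CM distributes it to the cluster automatically. The configuration file supports many other parameters as well; see the indexer configuration reference for further reading.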
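
Since each indexed table needs its own indexer configuration, it can be convenient to generate the XML rather than copy it by hand. The following Python sketch builds the same structure as the file above; the function name build_indexer_xml is made up for this example, and the output omits the XML declaration and comments.

```python
import xml.etree.ElementTree as ET

MAPPER_CLASS = "com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"


def build_indexer_xml(table, morphline_file, morphline_id):
    """Build an <indexer> element like the one shown above and return it as text."""
    indexer = ET.Element("indexer", {"table": table, "mapper": MAPPER_CLASS})
    ET.SubElement(indexer, "param", {"name": "morphlineFile", "value": morphline_file})
    ET.SubElement(indexer, "param", {"name": "morphlineId", "value": morphline_id})
    return ET.tostring(indexer, encoding="unicode")


xml_text = build_indexer_xml("User", "morphlines.conf", "userMap")
print(xml_text)
```

The returned string can be written to morphline-hbase-mapper.xml for the table in question.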

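Creating the Morphline configuration file

Morphlines is an open-source framework that reduces the time needed to build Hadoop ETL data flows. It can replace the traditional approach of extracting, transforming, and loading data with MapReduce, and it provides a set of command tools.

For details, see: http://kitesdk.org/docs/0.13.0/kite-morphlines/morphlinesReferenceGuide.html.

For HBase, it provides the extractHBaseCells command to read HBase column data. We use Cloudera Manager to manage the morphlines.conf file. Besides the benefit mentioned above, this has another advantage: when index columns need to be added, the local-path approach requires re-registering the Lily HBase Indexer configuration file, whereas with CM you only need to edit morphlines.conf and restart the Key-Value Store Indexer service.

To do this, open the Key-Value Store Indexer panel -> Configuration -> Service-Wide -> Morphlines -> Morphlines File, and add the following configuration to that option:

Note: each collection has its own morphline-hbase-mapper.xml.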


$ vim $HOME/morphlines.conf

SOLR_LOCATOR : {
  # Name of solr collection
  collection : User

  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}

morphlines : [
  {
    id : userMap
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:name"
              outputField : "data_name"
              type : string
              source : value
            },
            {
              inputColumn : "data:psd"
              outputField : "data_psd"
              type : string
              source : value
            },
            {
              inputColumn : "data:address"
              outputField : "data_address"
              type : string
              source : value
            },
            {
              inputColumn : "data:photo"
              outputField : "data_photo"
              type : string
              source : value
            }
          ]
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
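
To see what the mappings in this morphlines.conf amount to, here is a simplified Python imitation of the extractHBaseCells step. It is a sketch under assumptions: it only handles exact family:qualifier matches with string values, it assumes the rowkey ends up in the id field, and it ignores features of the real command such as wildcard columns and type conversion.

```python
# Mirrors the inputColumn/outputField pairs from the morphlines.conf above.
MAPPINGS = [
    {"inputColumn": "data:name", "outputField": "data_name"},
    {"inputColumn": "data:psd", "outputField": "data_psd"},
    {"inputColumn": "data:address", "outputField": "data_address"},
    {"inputColumn": "data:photo", "outputField": "data_photo"},
]


def extract_cells(rowkey, cells, mappings=MAPPINGS):
    """Turn one HBase row (column -> value) into a Solr-style document dict."""
    doc = {"id": rowkey}  # assumption: the rowkey is stored in the 'id' field
    for mapping in mappings:
        column = mapping["inputColumn"]
        if column in cells:
            doc[mapping["outputField"]] = cells[column]
    return doc


# Columns without a mapping (here detail:address) do not reach the document.
row2 = {"data:name": "u2", "data:psd": "123", "detail:address": "address2"}
print(extract_cells("row2", row2))
```

Note that every outputField produced this way has to exist in the Solr schema, either as a declared field or through a matching dynamicField.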

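Registering the Lily HBase Indexer configuration with the Lily HBase Indexer Service

Once the content of the Lily HBase Indexer configuration XML file is satisfactory, register it with the Lily HBase Indexer Service. This is done by uploading the configuration XML file to ZooKeeper for a given SolrCloud collection. For example: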

hbase-indexer add-indexer \
  --name userIndexer \
  --indexer-conf $HOME/hbase-indexer/User/conf/morphline-hbase-mapper.xml \
  --connection-param solr.zk=server1:2181/solr \
  --connection-param solr.collection=User \
  --zookeeper server1:2181
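
When several tables have to be registered, the same command can be assembled from a script. The Python sketch below only builds the argument list, using the flags that appear in the command above; the configuration path is a hypothetical placeholder, and nothing is executed unless the list is passed to subprocess.run.

```python
import shlex


def add_indexer_command(name, indexer_conf, solr_zk, collection, zookeeper):
    """Return the argv list for 'hbase-indexer add-indexer' with the flags used above."""
    return [
        "hbase-indexer", "add-indexer",
        "--name", name,
        "--indexer-conf", indexer_conf,
        "--connection-param", f"solr.zk={solr_zk}",
        "--connection-param", f"solr.collection={collection}",
        "--zookeeper", zookeeper,
    ]


cmd = add_indexer_command(
    name="userIndexer",
    # Hypothetical absolute path standing in for $HOME/hbase-indexer/User/conf/...
    indexer_conf="/home/user/hbase-indexer/User/conf/morphline-hbase-mapper.xml",
    solr_zk="server1:2181/solr",
    collection="User",
    zookeeper="server1:2181",
)

# Print the command for review; run it with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```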

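Verifying that the indexer was created

Run the following command and check that the indexer is listed:

$ hbase-indexer list-indexers

For more help, use the following commands: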

hbase-indexer add-indexer --help
hbase-indexer list-indexers --help
hbase-indexer update-indexer --help
hbase-indexer delete-indexer --help

Testing whether Solr has built the index

While writing data, watch the log in the Solr web UI console to see whether it updates.

put 'User','row1','data','u1'

put 'User','row1','data:name','u2'

put 'User','row2','data:name','u2'
put 'User','row2','data:psd','123'
put 'User','row2','data:address','address2'
put 'User','row2','data:photo','photo2'

put 'User','row2','data:name','u2'
put 'User','row2','data:psd','123'
put 'User','row2','detail:address','address2'
put 'User','row2','detail:photo','photo2'

put 'User','row3','data:name','u2'
put 'User','row3','data:psd','123'
put 'User','row3','detail:address','江蘇省南京市'
put 'User','row3','detail:photo','photo3'


After a few days of fiddling, this finally works. The next step is to use the index to run combined multi-column queries against HBase.

Additional commands

# solrctl
solrctl instancedir --list
solrctl collection --list

# Update the collection configuration
solrctl instancedir --update User $HOME/hbase-indexer/User
solrctl collection --reload User

# Delete the instancedir
solrctl instancedir --delete User
# Delete the collection
solrctl collection --delete User
# Delete all documents in the collection
solrctl collection --deletedocs User
# Delete the User configuration directory
rm -rf $HOME/hbase-indexer/User

# hbase-indexer
# If morphline-hbase-mapper.xml has been modified, update the indexer
hbase-indexer update-indexer -n userIndexer

# Delete the indexer
hbase-indexer delete-indexer -n userIndexer

Problems Encountered (Q&A)

Lily HBase Indexer Service registration error

Detailed log

[WARN ][08:56:49,677][.com:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

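Fix: in the --zookeeper hbase-cluster-zookeeper:2181 option taken from the help documentation, replace hbase-cluster-zookeeper with the host name of the ZooKeeper server.

schema.xml and morphlines.conf configuration problem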

ERROR  org.apache.solr.common.SolrException: ERROR: [doc=row3] unknown field 'data'
     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)

Solution:

Thanks for the response. In the meantime I got a solution which is fine for me using: But type="ignored" is a good hint once I want to get rid of the fields I do not need, thanks.

Add the following to the schema:

<dynamicField name="*" type="string" indexed="true" stored="true" />
<field name="data" type="string" indexed="true" stored="true" multiValued="true"/>
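
The "unknown field" error appears when a document carries a field name that the schema neither declares nor covers with a dynamicField pattern. The Python sketch below reproduces that check offline so that mismatches can be spotted before indexing. The field lists are illustrative, taken from the schema and morphline examples in this post, and fnmatch is used here as a stand-in for Solr's own dynamic field matching.

```python
from fnmatch import fnmatchcase


def unknown_fields(doc_fields, schema_fields, dynamic_patterns=()):
    """Return the document fields that no declared or dynamic field would accept."""
    return [
        name
        for name in doc_fields
        if name not in schema_fields
        and not any(fnmatchcase(name, pattern) for pattern in dynamic_patterns)
    ]


# Fields declared in the example schema.xml.
schema_fields = {"id", "name", "psd", "address", "photo"}
# Fields a document might carry: 'data' from the error message, 'data_name' from the morphline.
doc_fields = ["id", "data", "data_name"]

print(unknown_fields(doc_fields, schema_fields))          # rejected fields
print(unknown_fields(doc_fields, schema_fields, ["*"]))   # with the catch-all dynamicField
```

A catch-all pattern accepts everything, which is convenient for testing; declaring the exact output fields in the schema keeps the index stricter.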

After modifying schema.xml, run the following commands to update the configuration:

solrctl instancedir --update hbase-collection-user $HOME/hbase-collection-user

solrctl collection --reload hbase-collection-user

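Modifying a collection

After a collection has been created, if you need to modify schema.xml to reconfigure the indexed fields, proceed as follows:

1. If you are changing existing fields in schema.xml and index data has already been inserted into Solr, the indexed data set must be cleared first. This can be done through the Solr API.
2. If you are only adding new indexed fields to the existing schema.xml, skip step 1 and run the following directly: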

solrctl instancedir --update solrtest $HOME/solrtest
solrctl collection --reload solrtest
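
For the clearing step, one option is a delete-by-query request against Solr's update handler. The Python sketch below builds such a request without sending it; the Solr address is a placeholder, and the collection name solrtest follows the commands above. The solrctl collection --deletedocs command listed under the additional commands is an alternative that needs no HTTP call.

```python
import json
from urllib.request import Request


def build_delete_all_request(solr_base, collection):
    """Build a POST request that deletes every document in the collection and commits."""
    url = f"{solr_base}/{collection}/update?commit=true"
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder Solr address; adjust host and port for the actual cluster.
req = build_delete_all_request("http://server1:8983/solr", "solrtest")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it for real: urllib.request.urlopen(req)
```

Deleting by the query *:* removes all documents, so it is worth double-checking the collection name before sending the request.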

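Configuring schema.xml and morphlines.conf for multiple HBase tables

Solution:

email-schema example

Q: Does every HBase table need its own morphlines.conf and morphline-hbase-mapper.xml?

A: Each HBase table corresponds to one Solr collection. Each index has one Lily HBase Indexer configuration file, morphline-hbase-mapper.xml, and one Morphline configuration in morphlines.conf. The morphlines.conf file can be managed from the Key-Value Store Indexer console in CDH, with the individual morphlines distinguished by id.

Official explanation: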

Creating a Lily HBase Indexer configuration

Individual Lily HBase Indexers are configured using the hbase-indexer command line utility. Typically, there is one Lily HBase Indexer configuration for each HBase table, but there can be as many Lily HBase Indexer configurations as there are tables and column families and corresponding collections in the SolrCloud. Each Lily HBase Indexer configuration is defined in an XML file such as morphline-hbase-mapper.xml.

Indexing existing data in an HBase table

This requires the batch indexing feature of Lily HBase Indexer.

sudo hadoop --config /etc/hadoop/conf \
  jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.4.4-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-zk master:2181 \
  --collection hbase-collection-user \
  --hbase-indexer-name userIndexer \
  --hbase-indexer-file $HOME/hbase-collection-user/conf/morphline-hbase-mapper.xml \
  --go-live

Error log

Caused by: java.io.IOException Can not find resource solrconfig.xml in classpath or /root/file:/tmp/hadoop-root/mapred/local/1441858645500/6a1a458e-35e2-4f66-82df-02795ba44e2c.solr.zip/collection1/conf
