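1. Known Issues and Shortcomings
2. Solution Approach
3. Code
3.1 Reading the Config File
3.2 Wrapping SolrServer Creation
3.3 Writing the Code That Submits Data to Solr
3.4 Intercepting HBase Put and Delete Operations
4. Usage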

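1. Known Issues and Shortcomings

In the previous version we used an HBase coprocessor to sync HBase secondary-index data to Solr, but several shortcomings remained:

The Solr Collection to write to is hard-coded, and there is only one. What if we want to sync different fields of a single table into different Solr Collections?
All configuration is currently hard-coded. Can it be moved into an external configuration file?
Previously every sync job required compiling and running its own JAR file. Can one generic piece of code handle all of them?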

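2. Solution Approach

We address the three problems above one by one.

A table usually maps to several Solr Collections, each with its own set of columns. We can model this as Map[table name -> List[(Collection1, List[Columns]), (Collection2, List[Columns]), ...]] and look up all Collections and Columns by table name.
Read an external configuration file with Typesafe Config so that every setting is configurable.
Every data change is either a Put or a Delete. Once we intercept an operation, we determine the current table name and then write to the corresponding SolrServer using the Collection and Column mapping from the first point. Inside a coprocessor the table name is obtained with e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString(), where e is the ObserverContext.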
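The table-to-collection mapping described in the first point can be sketched with plain Java collections. This is only an illustration, not the article's own class: `MappingLookupSketch`, `CollectionMapping`, and the second collection name `bqjr_names` are hypothetical, while the table, first collection, and column names are borrowed from the sample configuration in the usage section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MappingLookupSketch {
    /** Hypothetical value type: one Solr collection plus the HBase columns indexed into it. */
    static final class CollectionMapping {
        final String collection;
        final List<String> columns; // each entry has the form "family:qualifier"

        CollectionMapping(String collection, List<String> columns) {
            this.collection = collection;
            this.columns = columns;
        }
    }

    /** Table name -> every (collection, columns) pair configured for that table. */
    static final Map<String, List<CollectionMapping>> MAPPINGS = new HashMap<>();

    static {
        List<CollectionMapping> forTable = new ArrayList<>();
        forTable.add(new CollectionMapping("bqjr", Arrays.asList("cf1:test_age", "cf1:test_name")));
        // Hypothetical second collection that indexes only one of the columns
        forTable.add(new CollectionMapping("bqjr_names", Arrays.asList("cf1:test_name")));
        MAPPINGS.put("HBASE_OBSERVER_TEST", forTable);
    }

    /** Returns the mappings for a table, or an empty list when the table is not configured. */
    static List<CollectionMapping> lookup(String table) {
        List<CollectionMapping> found = MAPPINGS.get(table);
        return found != null ? found : Collections.<CollectionMapping>emptyList();
    }

    public static void main(String[] args) {
        for (CollectionMapping m : lookup("HBASE_OBSERVER_TEST")) {
            System.out.println(m.collection + " <- " + m.columns);
        }
    }
}
```

Returning an empty list for unknown tables keeps the caller's loop free of null checks, which matters when the same observer class is attached to several tables.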

3. Code

3.1 Reading the Config File

Use the Typesafe config library to read the morphlines.conf file and convert its content into a Map<String,List<HBaseIndexerMappin>>. The code is as follows:

import com.typesafe.config.Config;

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// SourceConfig and HBaseIndexerMappin are project classes that are not listed in this article.
public class ConfigManager {
    private static SourceConfig sourceConfig = new SourceConfig();
    public static Config config;

    static {
        sourceConfig.setConfigFiles("morphlines.conf");
        config = sourceConfig.getConfig();
    }

    // Builds the mapping: HBase table name -> list of (Solr collection, columns)
    public static Map<String, List<HBaseIndexerMappin>> getHBaseIndexerMappin() {
        Map<String, List<HBaseIndexerMappin>> mappin = new HashMap<String, List<HBaseIndexerMappin>>();
        Config mappinConf = config.getConfig("Mappin");
        List<String> tables = mappinConf.getStringList("HBaseTables");
        for (String table : tables) {
            List<? extends Config> confList = mappinConf.getConfigList(table);
            List<HBaseIndexerMappin> maps = new LinkedList<HBaseIndexerMappin>();
            for (Config tmp : confList) {
                HBaseIndexerMappin map = new HBaseIndexerMappin();
                map.solrConnetion = tmp.getString("SolrCollection");
                map.columns = tmp.getStringList("Columns");
                maps.add(map);
            }
            mappin.put(table, maps);
        }
        return mappin;
    }
}

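3.2 Wrapping SolrServer Creation

In my environment Solr and HBase share the same ZooKeeper ensemble, so we can reuse HBase's ZooKeeper settings. A coprocessor runs inside the HBase process, so it can read the current ZooKeeper hosts and port from the HBase Configuration and derive the Solr address from them.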

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class SolrServerManager implements LogManager {
    static Configuration conf = HBaseConfiguration.create();
    // The second argument of conf.get is the fallback used when the key is missing
    public static String ZKHost = conf.get("hbase.zookeeper.quorum", "bqdpm1,bqdpm2,bqdps2");
    public static String ZKPort = conf.get("hbase.zookeeper.property.clientPort", "2181");
    // ZooKeeper connection string with the Solr chroot, e.g. host1,host2,host3:2181/solr
    public static String SolrUrl = ZKHost + ":" + ZKPort + "/" + "solr";
    public static int zkClientTimeout = 1800000;  // ZooKeeper session (heartbeat) timeout, in ms
    public static int zkConnectTimeout = 1800000; // ZooKeeper connect timeout, in ms

    public static CloudSolrServer create(String defaultCollection) {
        log.info("Create CloudSolrServer. The collection is " + defaultCollection);
        CloudSolrServer solrServer = new CloudSolrServer(SolrUrl);
        solrServer.setDefaultCollection(defaultCollection);
        solrServer.setZkClientTimeout(zkClientTimeout);
        solrServer.setZkConnectTimeout(zkConnectTimeout);
        return solrServer;
    }
}
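One caveat about the string concatenation above: when hbase.zookeeper.quorum lists several hosts, appending the port once attaches it only to the last host. As far as I know ZooKeeper falls back to its default client port 2181 for hosts without an explicit port, so the result is only right when the cluster uses that default. The helper below is a minimal sketch of a more defensive way to build the string; `ZkConnectStringSketch` and `buildZkConnectString` are hypothetical names, not part of the original project.

```java
public class ZkConnectStringSketch {
    /**
     * Builds "host1:port,host2:port/chroot" from a comma-separated quorum.
     * Hosts that already carry a port are left unchanged.
     */
    static String buildZkConnectString(String quorum, String clientPort, String chroot) {
        StringBuilder sb = new StringBuilder();
        for (String host : quorum.split(",")) {
            String trimmed = host.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore stray commas
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(trimmed);
            if (!trimmed.contains(":")) {
                sb.append(':').append(clientPort);
            }
        }
        return sb.append(chroot).toString();
    }

    public static void main(String[] args) {
        System.out.println(buildZkConnectString("bqdpm1,bqdpm2,bqdps2", "2181", "/solr"));
        // prints bqdpm1:2181,bqdpm2:2181,bqdps2:2181/solr
    }
}
```

With such a helper, SolrUrl could be assigned from buildZkConnectString(ZKHost, ZKPort, "/solr") instead of the plain concatenation.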

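3.3 Writing the Code That Submits Data to Solr

Ideally we would push data to Solr continuously, but in practice writes are bursty and may cluster around a few times of day. We therefore need to commit automatically once a certain volume has accumulated under high load, and to commit at fixed intervals under low load. Only with both mechanisms in place is data written promptly.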

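import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TimerTask;
import java.util.concurrent.Semaphore;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrCommitTimer extends TimerTask implements LogManager {
    // Collection name -> cached update (insert) operations
    public Map<String, List<SolrInputDocument>> putCache = new HashMap<String, List<SolrInputDocument>>();
    // Collection name -> cached delete operations
    public Map<String, List<String>> deleteCache = new HashMap<String, List<String>>();
    // Collection name -> SolrServer
    Map<String, CloudSolrServer> solrServers = new HashMap<String, CloudSolrServer>();
    int maxCache = ConfigManager.config.getInt("MaxCommitSize");
    // At any time only one thread may commit to the index and clear the caches
    final static Semaphore semp = new Semaphore(1);

    // Register a Collection together with its SolrServer
    public void addCollecttion(String collection, CloudSolrServer server) {
        this.solrServers.put(collection, server);
    }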

    // Add (update) a single document in Solr
    public UpdateResponse put(CloudSolrServer server, SolrInputDocument doc) throws IOException, SolrServerException {
        server.add(doc);
        return server.commit(false, false);
    }

    // Add (update) a batch of documents in Solr
    public UpdateResponse put(CloudSolrServer server, List<SolrInputDocument> docs) throws IOException, SolrServerException {
        server.add(docs);
        return server.commit(false, false);
    }

    // Delete a single Solr document by ID
    public UpdateResponse delete(CloudSolrServer server, String rowkey) throws IOException, SolrServerException {
        server.deleteById(rowkey);
        return server.commit(false, false);
    }

    // Delete a batch of Solr documents by ID
    public UpdateResponse delete(CloudSolrServer server, List<String> rowkeys) throws IOException, SolrServerException {
        server.deleteById(rowkeys);
        return server.commit(false, false);
    }

    // Add a document to the cache and flush the batch once it reaches maxCache
    public void addPutDocToCache(String collection, SolrInputDocument doc) throws IOException, SolrServerException, InterruptedException {
        semp.acquire();
        try {
            log.debug("addPutDocToCache: collection=" + collection + ", data=" + doc.toString());
            List<SolrInputDocument> cache = putCache.get(collection);
            if (cache == null) {
                cache = new LinkedList<SolrInputDocument>();
                putCache.put(collection, cache);
            }
            cache.add(doc);
            if (cache.size() >= maxCache) {
                try {
                    this.put(solrServers.get(collection), cache);
                } finally {
                    cache.clear(); // as in the original, the batch is dropped even if the commit fails
                }
            }
        } finally {
            semp.release(); // always release the semaphore, otherwise a failed commit blocks all later writes
        }
    }

    // Add a delete operation to the cache and flush the batch once it reaches maxCache
    public void addDeleteIdCache(String collection, String rowkey) throws IOException, SolrServerException, InterruptedException {
        semp.acquire();
        try {
            log.debug("addDeleteIdCache: collection=" + collection + ", rowkey=" + rowkey);
            List<String> cache = deleteCache.get(collection);
            if (cache == null) {
                cache = new LinkedList<String>();
                deleteCache.put(collection, cache);
            }
            cache.add(rowkey);
            if (cache.size() >= maxCache) {
                try {
                    this.delete(solrServers.get(collection), cache);
                } finally {
                    cache.clear(); // clear the delete cache (the original cleared putCache here by mistake)
                }
            }
        } finally {
            semp.release(); // always release the semaphore
        }
    }

    // Timer path: flush whatever has accumulated, even if the batch is not full
    @Override
    public void run() {
        try {
            semp.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return; // the permit was never acquired, so it must not be released
        }
        try {
            log.debug("Start flushing caches ...");
            for (String collection : solrServers.keySet()) {
                if (putCache.containsKey(collection) && !putCache.get(collection).isEmpty()) {
                    this.put(solrServers.get(collection), putCache.get(collection));
                    putCache.get(collection).clear();
                }
                if (deleteCache.containsKey(collection) && !deleteCache.get(collection).isEmpty()) {
                    this.delete(solrServers.get(collection), deleteCache.get(collection));
                    deleteCache.get(collection).clear();
                }
            }
        } catch (Exception e) {
            log.error("Commit caches to Solr failed: " + e.getMessage(), e);
        } finally {
            semp.release(); // release the semaphore
        }
    }
}
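Stripped of the Solr specifics, the two flush triggers described at the start of this section (batch size and elapsed time) reduce to a small pattern. The class below is a simplified sketch of that pattern and not a replacement for SolrCommitTimer: `BatchBufferSketch` and its `Sink` interface are hypothetical, and the sink simply collects batches in memory where the real code would call Solr.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchBufferSketch<T> {
    /** Hypothetical sink; in the article's setting it would wrap a Solr add-and-commit call. */
    public interface Sink<T> {
        void write(List<T> batch);
    }

    private final int maxSize;
    private final Sink<T> sink;
    private final List<T> buffer = new ArrayList<>();

    public BatchBufferSketch(int maxSize, Sink<T> sink) {
        this.maxSize = maxSize;
        this.sink = sink;
    }

    /** Size-based path: buffer the item and flush once the batch is full. */
    public synchronized void add(T item) {
        buffer.add(item);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    /** Time-based path: call this from a periodic task to drain a partially filled batch. */
    public synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer); // copy so the sink never sees later mutations
        buffer.clear();
        sink.write(batch);
    }

    public static void main(String[] args) {
        final List<List<String>> written = new ArrayList<>();
        BatchBufferSketch<String> cache = new BatchBufferSketch<>(2, batch -> written.add(batch));
        cache.add("row1");
        cache.add("row2"); // reaches maxSize and is flushed immediately
        cache.add("row3"); // stays buffered until the next timed flush
        cache.flush();     // what the timer task would do
        System.out.println(written); // prints [[row1, row2], [row3]]
    }
}
```

Using synchronized methods instead of a semaphore means the lock is released automatically when a write throws. Like the original, this sketch drops a batch whose write fails; retrying would need extra bookkeeping.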

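3.4 Intercepting HBase Put and Delete Operations

Intercept the operation in each postPut and postDelete hook and record the table name, column names, and values. This information is then grouped by table name and Collection name and written to the cache.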

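import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Timer;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HBaseIndexerToSolrObserver extends BaseRegionObserver implements LogManager {

    Map<String, List<HBaseIndexerMappin>> mappins = ConfigManager.getHBaseIndexerMappin();

    Timer timer = new Timer();
    int maxCommitTime = ConfigManager.config.getInt("MaxCommitTime"); // maximum commit interval, in seconds
    SolrCommitTimer solrCommit = new SolrCommitTimer();

    public HBaseIndexerToSolrObserver() {
        log.info("Initialization HBaseIndexerToSolrObserver ...");
        for (Map.Entry<String, List<HBaseIndexerMappin>> entry : mappins.entrySet()) {
            List<HBaseIndexerMappin> solrmappin = entry.getValue();
            for (HBaseIndexerMappin map : solrmappin) {
                String collection = map.solrConnetion; // Collection name
                log.info("Create Solr Server connection. The collection is " + collection);
                CloudSolrServer solrserver = SolrServerManager.create(collection); // one SolrServer connection per Collection
                solrCommit.addCollecttion(collection, solrserver);
            }
        }
        // First flush after 10 seconds, then every maxCommitTime seconds
        timer.schedule(solrCommit, 10 * 1000L, maxCommitTime * 1000L);
    }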

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        String table = e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString(); // table name
        String rowkey = Bytes.toString(put.getRow()); // row key
        List<HBaseIndexerMappin> mappin = mappins.get(table);
        if (mappin == null) {
            return; // this table has no mapping in morphlines.conf
        }
        for (HBaseIndexerMappin mapp : mappin) {
            // One document per Collection, so fields and "id" are not shared between Collections
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rowkey);
            for (String column : mapp.columns) {
                String[] tmp = column.split(":");
                byte[] cf = Bytes.toBytes(tmp[0]);
                byte[] cq = Bytes.toBytes(tmp[1]);
                if (put.has(cf, cq)) {
                    Cell cell = put.get(cf, cq).get(0); // value of the specified column
                    Map<String, String> operation = new HashMap<String, String>();
                    operation.put("set", Bytes.toString(CellUtil.cloneValue(cell)));
                    doc.setField(tmp[1], operation); // write the HBase secondary index to Solr as an atomic update
                }
            }
            try {
                solrCommit.addPutDocToCache(mapp.solrConnetion, doc); // add the doc to the cache
            } catch (SolrServerException e1) {
                log.error("Failed to cache put for collection " + mapp.solrConnetion, e1);
            } catch (InterruptedException e1) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }
    }

    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e,
                           Delete delete,
                           WALEdit edit,
                           Durability durability) throws IOException {
        String table = e.getEnvironment().getRegion().getTableDesc().getTableName().getNameAsString();
        String rowkey = Bytes.toString(delete.getRow());
        List<HBaseIndexerMappin> mappin = mappins.get(table);
        if (mappin == null) {
            return; // this table has no mapping in morphlines.conf
        }
        for (HBaseIndexerMappin mapp : mappin) {
            try {
                solrCommit.addDeleteIdCache(mapp.solrConnetion, rowkey); // add the delete operation to the cache
            } catch (SolrServerException e1) {
                log.error("Failed to cache delete for collection " + mapp.solrConnetion, e1);
            } catch (InterruptedException e1) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }
    }

}
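The two pieces of per-column logic in postPut, splitting a "family:qualifier" entry and wrapping the value in a {"set": value} modifier for an atomic update, can be tried out without an HBase or Solr cluster. The sketch below uses plain maps as a stand-in for SolrInputDocument; `AtomicUpdateSketch`, `parseColumn`, and `buildUpdate` are hypothetical names introduced only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicUpdateSketch {
    /** Splits "family:qualifier" at the first colon, rejecting malformed entries early. */
    static String[] parseColumn(String spec) {
        int idx = spec.indexOf(':');
        if (idx <= 0 || idx == spec.length() - 1) {
            throw new IllegalArgumentException("Expected <family:qualifier> but got: " + spec);
        }
        return new String[] { spec.substring(0, idx), spec.substring(idx + 1) };
    }

    /** Builds the field map for one row: "id" plus a {"set": value} modifier per indexed column. */
    static Map<String, Object> buildUpdate(String rowkey, Map<String, String> valuesByQualifier) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", rowkey);
        for (Map.Entry<String, String> e : valuesByQualifier.entrySet()) {
            doc.put(e.getKey(), Collections.singletonMap("set", e.getValue()));
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] parts = parseColumn("cf1:test_age");
        Map<String, String> values = new LinkedHashMap<>();
        values.put(parts[1], "30");
        System.out.println(buildUpdate("row-001", values));
        // prints {id=row-001, test_age={set=30}}
    }
}
```

Validating the column entry up front turns a misconfigured morphlines.conf line into a clear error message rather than an ArrayIndexOutOfBoundsException inside the coprocessor hook.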

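4. Usage

First add the morphlines.conf file. It holds the names of the HBase tables to sync, the corresponding Solr Collection names, the columns to sync, the commit interval, and the maximum batch size. The configuration looks like this: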

# Maximum commit interval (unit: seconds)
MaxCommitTime = 30
# Maximum batch size
MaxCommitSize = 10000

Mappin {
  HBaseTables: ["HBASE_OBSERVER_TEST"] # HBase tables to sync
  "HBASE_OBSERVER_TEST": [
    {
      SolrCollection: "bqjr" # Solr Collection name
      Columns: [
        "cf1:test_age",   # columns to sync, in the form <column family:qualifier>
        "cf1:test_name"
      ]
    },
  ]
}
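To illustrate the one-table-to-many-Collections case that motivated this version, and the addition of a further table, the fragment below sketches how the same layout could be extended. The Collection names `bqjr_names` and `another_collection`, the table `ANOTHER_TABLE`, and the column `cf1:some_column` are hypothetical; only the key layout follows the configuration above.

```hocon
Mappin {
  HBaseTables: ["HBASE_OBSERVER_TEST", "ANOTHER_TABLE"]
  "HBASE_OBSERVER_TEST": [
    { SolrCollection: "bqjr", Columns: ["cf1:test_age", "cf1:test_name"] },
    { SolrCollection: "bqjr_names", Columns: ["cf1:test_name"] }  # hypothetical second Collection
  ]
  "ANOTHER_TABLE": [                                              # hypothetical second table
    { SolrCollection: "another_collection", Columns: ["cf1:some_column"] }
  ]
}
```

Every table listed in HBaseTables needs a matching key, because ConfigManager looks up each table name with getConfigList.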

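By default the configuration file lives in /etc/hbase/conf/ on every node. To use a different location, change the configHome path in the com.bqjr.bigdata.HBaseObserver.comm.config.SourceConfig class.

Then package the code, upload it to HDFS, and attach the coprocessor to the target table.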

# Disable the table first
disable 'HBASE_OBSERVER_TEST'
# Attach the coprocessor to the table. The value has the form: jar path|class name|priority (SYSTEM or USER level)|arguments; the last two are left empty here
alter 'HBASE_OBSERVER_TEST','coprocessor'=>'hdfs://hostname:8020/ext_lib/HBaseObserver-1.0.0.jar|com.bqjr.bigdata.HBaseObserver.server.HBaseIndexerToSolrObserver||'
# Enable the table again
enable 'HBASE_OBSERVER_TEST'
# Remove a coprocessor; the number after "coprocessor$" matches the ID shown by desc
alter 'HBASE_OBSERVER_TEST',METHOD=>'table_att_unset',NAME => 'coprocessor$1'

To sync an additional table to Solr, just update morphlines.conf, distribute it to every node, and attach the coprocessor to the new HBase table. No code changes are needed.
