Enterprise Search Engine Development: The Connector (Part 25)
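This installment looks at how the connector interacts with a connector instance, focusing on how the connector pulls data from the instance. (Earlier articles covered the HTTP-based XML exchange with the connector. The connector configures its instances through configuration files; the file handling itself, in the com.google.enterprise.connector.persist.FileStore class, has not yet been analyzed in detail.)
The database connector instance serves as the example. A database connector works with the database through MyBatis (a SQL mapping framework). The connector instance form submitted from the front end ends up in a configuration file (file storage is the default; database storage is also possible). At startup the connector loads that configuration file and maps it onto the context object of the database connector instance, much like deserialization.
The context class, DBContext, holds the configuration values and the data access client object as fields. Its init() method also sets the context object as a property of the data access client: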
private final MimeTypeDetector mimeTypeDetector = new MimeTypeDetector();
private DBClient client;
private String connectionUrl;
private String connectorName;
private String login;
private String password;
private String sqlQuery;
private String authZQuery;
private String googleConnectorWorkDir;
private String primaryKeys;
private String xslt;
private String driverClassName;
private String documentURLField;
private String documentIdField;
private String baseURL;
private String lobField;
private String fetchURLField;
private String lastModifiedDate;
private String extMetadataType;
private int numberOfRows = 500;
private Integer minValue = -1;
private boolean publicFeed = true;
private boolean parameterizedQueryFlag = false;
private Boolean nullsSortLow = null;
private Collator collator;

public DBContext() {
}

public void init() throws DBException {
client.setDBContext(this);

// If the NULL value sort behaviour has not been explicitly overriden
// in the configuration, fetch it from the DatabaseMetadata.
if (nullsSortLow == null) {
nullsSortLow = client.nullsAreSortedLow();
if (nullsSortLow == null) {
throw new DBException("nullsSortLowFlag must be set in configuration.");
}
}
}
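The mutual wiring performed by init() can be illustrated with a minimal sketch. ClientSketch and ContextSketch are hypothetical stand-ins, not the connector's classes, and the Boolean passed to the client's constructor only simulates what the real client would read from the database metadata.

```java
// Hypothetical stand-in for DBClient: keeps a back-reference to the context.
class ClientSketch {
  private ContextSketch context;
  private final Boolean nullsSortLow; // simulated metadata answer; null = unknown

  ClientSketch(Boolean nullsSortLow) {
    this.nullsSortLow = nullsSortLow;
  }

  void setContext(ContextSketch context) {
    this.context = context;
  }

  ContextSketch getContext() {
    return context;
  }

  Boolean nullsAreSortedLow() {
    return nullsSortLow;
  }
}

// Hypothetical stand-in for DBContext: holds configuration plus the client.
class ContextSketch {
  private ClientSketch client;
  private Boolean nullsSortLow; // null means "not set in configuration"

  void setClient(ClientSketch client) {
    this.client = client;
  }

  Boolean getNullsSortLow() {
    return nullsSortLow;
  }

  void init() {
    // Hand the context to the client so both objects reference each other.
    client.setContext(this);
    // Fall back to the client's answer only when configuration is silent.
    if (nullsSortLow == null) {
      nullsSortLow = client.nullsAreSortedLow();
      if (nullsSortLow == null) {
        throw new IllegalStateException("nullsSortLow must be set in configuration.");
      }
    }
  }
}
```

After init() returns, the client can reach every configuration value through the context, which is why the real setDBContext can build its MyBatis configuration at that point.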
DBClient wraps the data reading methods. It calls the MyBatis API to load the configuration and create the SqlSession objects used for data access:
private boolean hasCustomCollationQuery = false;
protected DBContext dbContext;
protected SqlSessionFactory sqlSessionFactory;
protected DatabaseType databaseType;

static {
org.apache.ibatis.logging.LogFactory.useJdkLogging();
}

public DBClient() {
}

public void setDBContext(DBContext dbContext) throws DBException {
this.dbContext = dbContext;
generateSqlMap();
this.sqlSessionFactory = getSqlSessionFactory(generateMyBatisConfig());
LOG.info("DBClient for database " + getDatabaseInfo() + " is instantiated");
this.databaseType = getDatabaseType();
}

/**
* Constructor used for testing purpose. DBClient initialized with sqlMap
* having crawl query without CDATA section.
*/
@VisibleForTesting
DBClient(DBContext dbContext) throws DBException {
this.dbContext = dbContext;
this.sqlSessionFactory = getSqlSessionFactory(generateMyBatisConfig());
this.databaseType = getDatabaseType();
}

private SqlSessionFactory getSqlSessionFactory(String config) {
try {
SqlSessionFactoryBuilder builder = new SqlSessionFactoryBuilder();
return builder.build(new StringReader(config));
} catch (RuntimeException e) {
throw new RuntimeException("XML is not well formed", e);
}
}

/**
* @return a SqlSession
*/
@VisibleForTesting
SqlSession getSqlSession()
throws SnapshotRepositoryRuntimeException {
try {
return sqlSessionFactory.openSession();
} catch (RuntimeException e) {
Throwable cause = (e.getCause() != null &&
e.getCause() instanceof SQLException) ? e.getCause() : e;
LOG.log(Level.WARNING, "Unable to connect to the database.", cause);
throw new SnapshotRepositoryRuntimeException(
"Unable to connect to the database.", cause);
}
}
The data reading method itself is as follows:
/**
* @param skipRows number of rows to skip in the database.
* @param maxRows max number of rows to return.
* @return rows - subset of the result of executing the SQL query. E.g.,
* result table with columns id and lastName and two rows will be
* returned as
*
* <pre>
* [{id=1, lastName=last_01}, {id=2, lastName=last_02}]
* </pre>
* @throws DBException
*/
public List<Map<String, Object>> executePartialQuery(int skipRows, int maxRows)
throws SnapshotRepositoryRuntimeException {
// TODO(meghna): Think about a better way to scroll through the result set.
List<Map<String, Object>> rows;
LOG.info("Executing partial query with skipRows = " + skipRows + " and "
+ "maxRows = " + maxRows);
SqlSession session = getSqlSession();
try {
rows = session.selectList("IbatisDBClient.getAll", null,
new RowBounds(skipRows, maxRows));
LOG.info("Sucessfully executed partial query with skipRows = "
+ skipRows + " and maxRows = " + maxRows);
} catch (RuntimeException e) {
checkDBConnection(session, e);
rows = new ArrayList<Map<String, Object>>();
} finally {
session.close();
}
LOG.info("Number of rows returned " + rows.size());
return rows;
}
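executePartialQuery passes new RowBounds(skipRows, maxRows) to selectList, meaning "skip skipRows rows, then return at most maxRows". A minimal sketch of that windowing over an in-memory list follows; RowWindowSketch and window are hypothetical names, and the list only stands in for the full query result. As far as I know, MyBatis by default applies RowBounds while reading the result set rather than by rewriting the SQL, which may be what the TODO in the method alludes to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating offset/limit semantics; not part of MyBatis.
class RowWindowSketch {
  /** Skips skipRows elements of allRows, then returns at most maxRows elements. */
  static <T> List<T> window(List<T> allRows, int skipRows, int maxRows) {
    // Clamp both bounds so out-of-range arguments yield an empty window.
    int from = Math.min(Math.max(skipRows, 0), allRows.size());
    int to = (int) Math.min((long) from + Math.max(maxRows, 0), allRows.size());
    return new ArrayList<>(allRows.subList(from, to));
  }
}
```

An empty window is how the caller learns that the table has been read to the end.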
RepositoryHandler reads the data through its references to DBContext dbContext and DBClient dbClient.
It also wraps an inner class, PartialQueryStrategy, that controls the row offset:
/**
* This is the implementation that is called by default
* @author Administrator
*
*/
private class PartialQueryStrategy implements QueryStrategy {
private int skipRows = 0;

@Override
public List<Map<String, Object>> executeQuery() {
return dbClient.executePartialQuery(skipRows,
dbContext.getNumberOfRows());
}

@Override
public void resetCursor() {
skipRows = 0;
}

@Override
public void updateCursor(List<Map<String, Object>> rows) {
skipRows += rows.size();
}

@Override
public void logComplete() {
LOG.info("Total " + skipRows
+ " records are crawled during this crawl cycle");
}
}
The executeQueryAndAddDocs() method then calls the instance of that inner class:
/**
* After a restart, data retrieval always starts over from the beginning; no batch information is recorded.
* Function for fetching database rows and providing a collection of
* snapshots.
*/
public List<DocumentSnapshot> executeQueryAndAddDocs()
throws SnapshotRepositoryRuntimeException {
List<Map<String, Object>> rows = null;

try {
rows = queryStrategy.executeQuery();
} catch (SnapshotRepositoryRuntimeException e) {
LOG.info("Repository Unreachable. Resetting DB cursor to "
+ "start traversal from begining after recovery.");
queryStrategy.resetCursor();
LOG.warning("Unable to connect to the database\n" + e.toString());
throw new SnapshotRepositoryRuntimeException(
"Unable to connect to the database.", e);
}
if (rows.size() == 0) {
queryStrategy.logComplete();
LOG.info("Crawl cycle of database is complete. Resetting DB cursor to "
+ "start traversal from begining");
queryStrategy.resetCursor();
} else {
queryStrategy.updateCursor(rows);
}

if (traversalContext == null) {
LOG.info("Setting Traversal Context");
traversalContext = traversalContextManager.getTraversalContext();
JsonDocument.setTraversalContext(traversalContext);
}

return getDocList(rows);
}
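The cursor handling in executeQueryAndAddDocs() can be traced with a small simulation. This is only a sketch: CursorSketch is a hypothetical class, an in-memory list stands in for the database table, and batchSize plays the role of dbContext.getNumberOfRows().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the skipRows cursor; not the connector's code.
class CursorSketch {
  private final List<String> table; // stands in for the database table
  private final int batchSize;      // plays the role of getNumberOfRows()
  private int skipRows = 0;

  CursorSketch(List<String> table, int batchSize) {
    this.table = table;
    this.batchSize = batchSize;
  }

  /** Returns the next batch; an empty batch ends the cycle and resets the cursor. */
  List<String> nextBatch() {
    int from = Math.min(skipRows, table.size());
    int to = Math.min(from + batchSize, table.size());
    List<String> rows = new ArrayList<>(table.subList(from, to));
    if (rows.isEmpty()) {
      skipRows = 0;            // crawl cycle complete: start over next time
    } else {
      skipRows += rows.size(); // advance past the rows just returned
    }
    return rows;
  }

  int getSkipRows() {
    return skipRows;
  }
}
```

With three rows and a batch size of two, successive calls return two rows, one row, then an empty batch, after which the next call starts again from the first row. The cursor lives only in memory, which is consistent with the note that a restart begins from the start.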
The getDocList(rows) method wraps the data records into a List<DocumentSnapshot>:
/**
* Wraps the rows into a List<DocumentSnapshot>
* @param rows
* @return
*/
private List<DocumentSnapshot> getDocList(List<Map<String, Object>> rows) {
LOG.log(Level.FINE, "Building document snapshots for {0} rows.",
rows.size());
List<DocumentSnapshot> docList = Lists.newArrayList();
for (Map<String, Object> row : rows) {
try {
DocumentSnapshot snapshot = docBuilder.getDocumentSnapshot(row);
if (snapshot != null) {
if (LOG.isLoggable(Level.FINER)) {
LOG.finer("DBSnapshotRepository returns document with docID "
+ snapshot.getDocumentId());
}
docList.add(snapshot);
}
} catch (DBException e) {
// See the similar log message in DBSnapshot.getDocumentHandle.
LOG.log(Level.WARNING, "Cannot convert database record to snapshot "
+ "for record " + row, e);
}
}
LOG.info(docList.size() + " document(s) to be fed to GSA");
return docList;
}
RepositoryHandlerIterator wraps the RepositoryHandler one level further and implements an iterator over the data:
/**
* Iterates over the collections of {@link DocumentSnapshot} objects
* produced by a {@code RepositoryHandler}.
*/
public class RepositoryHandlerIterator
extends AbstractIterator<DocumentSnapshot> {
private final RepositoryHandler repositoryHandler;
private Iterator<DocumentSnapshot> current;

/**
* @param repositoryHandler RepositoryHandler object for fetching DB rows in
* DocumentSnapshot form.
*/
public RepositoryHandlerIterator(RepositoryHandler repositoryHandler) {
this.repositoryHandler = repositoryHandler;
this.current = Iterators.emptyIterator();
}

@Override
protected DocumentSnapshot computeNext() {
if (current.hasNext()) {
return current.next();
} else {
current = repositoryHandler.executeQueryAndAddDocs().iterator();
if (current.hasNext()) {
return current.next();
} else {
return endOfData();
}
}
}
}
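The same batch-flattening idea can be written against the plain java.util.Iterator interface, without Guava's AbstractIterator. The sketch below is an assumption-laden illustration: BatchIteratorSketch is a hypothetical class, and the Supplier stands in for repositoryHandler.executeQueryAndAddDocs().

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical iterator that flattens successive batches into one stream.
class BatchIteratorSketch<T> implements Iterator<T> {
  private final Supplier<List<T>> batchSource; // stands in for the handler call
  private Iterator<T> current = Collections.emptyIterator();
  private boolean done = false;

  BatchIteratorSketch(Supplier<List<T>> batchSource) {
    this.batchSource = batchSource;
  }

  @Override
  public boolean hasNext() {
    if (done) {
      return false;
    }
    if (current.hasNext()) {
      return true;
    }
    // Current batch is exhausted: fetch exactly one more batch.
    current = batchSource.get().iterator();
    if (!current.hasNext()) {
      done = true; // an empty batch marks the end of this traversal
      return false;
    }
    return true;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
```

As in the original, the first empty batch ends the iteration for good; a fresh traversal needs a new iterator.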
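Finally, the iterator is handed to the DBSnapshotRepository, which implements the connector's SnapshotRepository interface and so adapts the database to the interface the connector expects (the adapter pattern):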
/**
* An iterable over the database rows. The main building block for
* interacting with the diffing package.
*/
public class DBSnapshotRepository
implements SnapshotRepository<DocumentSnapshot> {
private final RepositoryHandler repositoryHandler;

public DBSnapshotRepository(RepositoryHandler repositoryHandler) {
this.repositoryHandler = repositoryHandler;
}

@Override
public Iterator<DocumentSnapshot> iterator()
throws SnapshotRepositoryRuntimeException {
return new RepositoryHandlerIterator(repositoryHandler);
}

@Override
public String getName() {
return DBSnapshotRepository.class.getName();
}
}
---------------------------------------------------------------------------
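This series, Enterprise Search Engine Development: The Connector, is my original work.
Please credit the source when reposting: cnblogs (博客园), 刺猬的温驯
Email: chenying998179@163#com (replace # with .)
Link to this article: http://www.cnblogs.com/chenying99/p/3789054.html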