Installing and Using Coreseek, the Chinese-Language Build of the Sphinx Search Engine (Linux)

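Anyone who has run LIKE queries against a MySQL table with millions of rows knows how painful that is. Coreseek is a full-text search engine aimed at Chinese-language search and built on top of Sphinx. Besides MySQL, it supports Python, XML, MSSQL, and ODBC data sources, and it ships API bindings for many languages, including PHP, C#, Java, and Python. Among Chinese full-text search engines I have found very little that rivals it (though I may be biased). In a test with tens of millions of rows, a full-text query against a built Coreseek index took no more than 0.5 s, which is a very respectable speed.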

1. Install the required build tools

Install these tools before building Coreseek. Installing with yum requires that the machine already has network access.

yum install make gcc gcc-c++ libtool autoconf automake imake mysql-devel libxml2-devel expat-devel
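Before moving on, it can help to confirm that the build tools actually landed on the PATH. The helper below is a minimal sketch, not part of Coreseek: the function name missing_tools is made up for illustration, and the list of commands is my assumption about what the packages above provide (for example, mysql_config from mysql-devel).

```shell
#!/bin/sh
# Print every command from the argument list that is NOT found on PATH.
# (missing_tools is a hypothetical helper name, not a Coreseek tool.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names provided by the packages installed above.
missing_tools make gcc g++ libtool autoconf automake mysql_config
```

If the last line prints nothing, every listed command was found.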

2. Download, compile, and install Coreseek

$ wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz

$ tar xzvf coreseek-3.2.14.tar.gz

$ cd coreseek-3.2.14

## Install the mmseg Chinese word segmenter

$ cd mmseg-3.2.14

$ ./bootstrap    # Warnings in the output can be ignored; any error must be resolved

$ ./configure --prefix=/usr/local/mmseg3

$ make && make install

$ cd ..

## Install Coreseek

$ cd csft-3.2.14

$ sh buildconf.sh    # Warnings in the output can be ignored; any error must be resolved

$ ./configure --prefix=/usr/local/coreseek  --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql    ## If configure reports a MySQL problem, see the MySQL data source installation notes. The path after --prefix must match your own install location.

$ make && make install

$ cd ..
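A quick sanity check after both builds is to confirm that the expected executables exist under the two prefixes. This is only a sketch: check_binaries is a hypothetical helper, and the binary names (mmseg, indexer, searchd) are what I would expect these builds to install, so adjust them if your tree differs.

```shell
#!/bin/sh
# Print the path of each expected executable that is missing under a prefix.
# Usage: check_binaries PREFIX NAME...   (hypothetical helper)
check_binaries() {
    prefix=$1
    shift
    for name in "$@"; do
        [ -x "$prefix/bin/$name" ] || printf '%s\n' "$prefix/bin/$name"
    done
}

# Prefixes are the ones passed to ./configure above.
check_binaries /usr/local/mmseg3 mmseg
check_binaries /usr/local/coreseek indexer searchd
```

No output means all of the listed files were found and are executable.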

3. Configure the MySQL data source

cp /usr/local/coreseek/etc/sphinx.conf.dist /usr/local/coreseek/etc/sphinx.conf

vi /usr/local/coreseek/etc/sphinx.conf

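Create a counter table (sph_counter) in the database. It records the largest id covered by the last full index.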
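The table definition is not shown here, so the following is a sketch of what it could look like. The column names counter_id and max_doc_id come from the queries in sphinx.conf; the INT types and the primary key are my assumptions (REPLACE INTO needs a unique key on counter_id to overwrite the row). The script only writes the SQL to a file so you can review it first.

```shell
#!/bin/sh
# Write the counter-table DDL to a file for review before loading it.
# Column types are assumed; the column names match the sphinx.conf queries.
cat > sph_counter.sql <<'SQL'
CREATE TABLE IF NOT EXISTS sph_counter (
    counter_id INT NOT NULL PRIMARY KEY,  -- one row per indexed table
    max_doc_id INT NOT NULL               -- largest id seen by the last full index
);
SQL

# To load it into the database named in sql_db (prompts for the password):
#   mysql -u root -p test < sph_counter.sql
```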

Configure sphinx.conf as follows:

source news

{

    type          = mysql

    sql_host      = localhost

    sql_user      = root

    sql_pass      =123456

    sql_db        = test

    sql_port      =3306

    sql_sock      =/tmp/mysql.sock

    sql_query_pre = SET NAMES utf8

    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM news

    sql_query     = SELECT id, contents, intro, title FROM news WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

}

# Set up the incremental index. With a small data set you can skip it and simply rebuild the index on a schedule.

source increment : news

{

    sql_query_pre = SET NAMES utf8

    sql_query     = SELECT id, contents, intro, title FROM news WHERE id >( SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

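    # This is the SQL for the incremental source. It matches the query above except for the WHERE condition, which selects ids greater than those covered by the last full rebuild, i.e. the newly added rows.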

}

index news

{

    source           = news

    path             =/usr/local/coreseek/var/data/news

    docinfo          =extern

    mlock            =0

    morphology       = none

    charset_dictpath =/usr/local/mmseg3/etc/

    charset_type     = zh_cn.utf-8

}

index increment : news

{

    source           = increment

    path             =/usr/local/coreseek/var/data/increment

    charset_dictpath =/usr/local/mmseg3/etc/

    charset_type     = zh_cn.utf-8

}

indexer

{

    mem_limit    =128M

}

searchd

{

    log               =/usr/local/coreseek/var/log/searchd.log

    read_timeout      =5

    client_timeout    =300

    max_children      =30

    pid_file          =/usr/local/coreseek/var/log/searchd.pid

    max_matches       =1000

    seamless_rotate   =1

    preopen_indexes   =0

    unlink_old        =1

    mva_updates_pool  =1M

    max_packet_size   =8M

    max_filter_values =4096

}

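Note: to configure more data sources, add further source and index blocks to the configuration file. They can follow the pattern of the incremental blocks, but without the trailing ": news". In the sph_counter counter table, use a different id for each data table.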
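As a sketch of what one extra source and index pair could look like, the helper below prints a block for a second table that uses its own counter id. Everything specific here is hypothetical: the function name emit_source_and_index, the table name products, and the title column. The connection settings simply repeat the ones used for news.

```shell
#!/bin/sh
# Print a source/index pair for one more table.
# Usage: emit_source_and_index NAME TABLE COUNTER_ID   (hypothetical helper)
emit_source_and_index() {
    name=$1; table=$2; counter_id=$3
    cat <<EOF
source $name
{
    type          = mysql
    sql_host      = localhost
    sql_user      = root
    sql_pass      = 123456
    sql_db        = test
    sql_port      = 3306
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT $counter_id, MAX(id) FROM $table
    sql_query     = SELECT id, title FROM $table WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=$counter_id)
}

index $name
{
    source           = $name
    path             = /usr/local/coreseek/var/data/$name
    docinfo          = extern
    charset_dictpath = /usr/local/mmseg3/etc/
    charset_type     = zh_cn.utf-8
}
EOF
}

# Example: a made-up "products" table tracked under counter id 2.
emit_source_and_index products products 2
```

Review the printed block and paste it into sphinx.conf by hand rather than appending it blindly.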

4. Index-building commands

Build the indexes

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all --rotate

Note: this adds a row to the sph_counter table holding the largest id in your content table. To build the index for a single source, run /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf news --rotate (this builds only the news index).

Start the background daemon

/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/sphinx.conf

Note: at this point, searches against the MySQL data source already return data.
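To confirm the daemon really is up, one option is to look at the pid file named by pid_file in sphinx.conf. The function below is a minimal sketch (searchd_running is a made-up name). It relies on kill -0, which only succeeds if you are allowed to signal the process, so run it as the same user that started searchd, or as root.

```shell
#!/bin/sh
# Return 0 if the process recorded in the pid file is alive, non-zero otherwise.
# (searchd_running is a hypothetical helper, not shipped with Coreseek.)
searchd_running() {
    pid_file=$1
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    # kill -0 sends no signal; it only tests that the process exists.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# The path matches the pid_file setting in the searchd section above.
if searchd_running /usr/local/coreseek/var/log/searchd.pid; then
    echo "searchd is running"
else
    echo "searchd is not running"
fi
```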

Incremental index

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf increment --rotate

Note: replace increment with the name of your own incremental index.

Merge the indexes

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --merge news increment --rotate

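Note: after the merge, the news index can find all of the data, but the maximum id stored in sph_counter has not changed, so all indexes still need to be rebuilt from scratch periodically.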

Rebuild all indexes to keep the data complete

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all --rotate
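The three indexer invocations above differ only in their arguments, so they can be gathered behind one small wrapper. This is a sketch under the paths used in this article: reindex, INDEXER, and CONF are names I made up, and the index names news and increment come from the sphinx.conf shown earlier.

```shell
#!/bin/sh
# Paths as used in this article; override via the environment if yours differ.
INDEXER=${INDEXER:-/usr/local/coreseek/bin/indexer}
CONF=${CONF:-/usr/local/coreseek/etc/sphinx.conf}

# Run one indexing step by name: delta, merge, or full.
# (reindex is a hypothetical wrapper, not part of Coreseek.)
reindex() {
    case $1 in
        delta) "$INDEXER" -c "$CONF" increment --rotate ;;
        merge) "$INDEXER" -c "$CONF" --merge news increment --rotate ;;
        full)  "$INDEXER" -c "$CONF" --all --rotate ;;
        *)     echo "usage: reindex delta|merge|full" >&2; return 2 ;;
    esac
}

# Example: reindex delta && reindex merge
```

Keeping the commands in one place means a changed path or index name only has to be edited once.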

5. Schedule cron jobs to update the incremental index and rebuild the indexes

*/1 * * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf increment --rotate

*/5 * * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --merge news increment --rotate

30 1 * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all --rotate

Note: this schedule runs the incremental index every minute, merges the indexes every five minutes, and rebuilds all indexes every day at 01:30.
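With jobs firing every minute and every five minutes, two indexer runs can end up overlapping. One common way to guard against that is a lock directory, since mkdir either creates the directory or fails atomically. The sketch below is my own addition rather than anything Coreseek provides: with_lock, LOCK_DIR, and the exit code 75 are arbitrary choices.

```shell
#!/bin/sh
# Lock location is an assumption; any directory path writable by cron works.
LOCK_DIR=${LOCK_DIR:-/tmp/coreseek-indexer.lock}

# Run the given command only if no other run currently holds the lock.
# (with_lock is a hypothetical helper.)
with_lock() {
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        status=0
        "$@" || status=$?
        rmdir "$LOCK_DIR"
        return "$status"
    fi
    echo "another indexer run is in progress, skipping" >&2
    return 75
}

# Example:
#   with_lock /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf increment --rotate
```

If a run is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand.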

6. Using Coreseek from PHP

Copy the api/sphinxapi.php file into your project

require ( "sphinxapi.php" );

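// Here the API file is simply copied into the project directory. Alternatively, you can install the sphinx extension for PHP; installing it for PHP 7.0 is covered in detail below.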

7. Installing the sphinx extension for PHP

[Step 1] Install the libsphinxclient dependency

# cd /usr/local/coreseek-3.2.14/csft-3.2.14/api/libsphinxclient

# ./configure --prefix=/usr/local/sphinxclient

configure: creating ./config.status

config.status: creating Makefile

config.status: error: cannot find input file:Makefile.in   # configure fails with this error

// Fixing the configure error

The build stopped with the error config.status: error: cannot find input file: src/Makefile.in. After running the commands below, the build went through on the next attempt:

# aclocal

# libtoolize --force

# automake --add-missing

# autoconf

# autoheader

# make clean   

// Run configure again and build

# ./configure --prefix=/usr/local/sphinxclient

# make && make install

[Step 2] Install the sphinx PHP extension

# wget -O sphinx-9a3d08c.tar.gz 'http://git.php.net/?p=pecl/search_engine/sphinx.git;a=snapshot;h=9a3d08c67af0cad216aa0d38d39be71362667738;sf=tgz'

# tar zxvf sphinx-9a3d08c.tar.gz

# cd sphinx-9a3d08c

# /usr/local/php/bin/phpize

# ./configure --with-php-config=/usr/local/php/bin/php-config --with-sphinx=/usr/local/sphinxclient

# make && make install

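Note: the php-config path must match the one on your own server. Once php.ini has been edited as described next, run php -m to list the loaded modules; if sphinx appears in the list, the installation succeeded. On PHP 7, make sure to install a version of the extension that supports PHP 7.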

After the installation finishes, edit php.ini and append the following line:

extension = sphinx.so

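Note: check the output of phpinfo(); to confirm that the extension is loaded.

Then restart php-fpm. Run php -m; if the sphinx extension is listed, the installation succeeded.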

/etc/init.d/php-fpm restart
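The php -m check can also be scripted. The function below is a minimal sketch: php_has_module and PHP_BIN are names I made up, and it matches a whole line of php -m output, ignoring case. Keep in mind that the command-line php may read a different php.ini than php-fpm does, so phpinfo() served through php-fpm remains the more direct check.

```shell
#!/bin/sh
# PHP binary to query; point this at /usr/local/php/bin/php if php is not on PATH.
PHP_BIN=${PHP_BIN:-php}

# Return 0 if the named module appears as a full line in `php -m` output.
# (php_has_module is a hypothetical helper.)
php_has_module() {
    "$PHP_BIN" -m 2>/dev/null | grep -qix "$1"
}

if php_has_module sphinx; then
    echo "sphinx extension is loaded"
else
    echo "sphinx extension is NOT loaded"
fi
```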