How to generate a new dictionary file for mmseg

0. Usage of mmseg-node, as mentioned on GitHub:
var mmseg = require("mmseg");
var q = mmseg.open('/usr/local/etc/');
console.log(q.segmentSync("我是中文分词"));

#"/usr/local/etc" is the directory that holds mmseg's dictionary; it must contain the dictionary file "uni.lib"
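
Since mmseg.open() is given a directory rather than a file, it can help to confirm that uni.lib is really there before opening it. The following is a minimal sketch, not part of mmseg-node; the helper name dict_ready is hypothetical, and the path is the one from the example above.

```shell
#!/bin/sh
# dict_ready is a hypothetical helper: it succeeds only if the given
# directory contains a regular file named uni.lib.
dict_ready() {
    # ${1%/} drops one trailing slash, so "dir" and "dir/" both work.
    [ -f "${1%/}/uni.lib" ]
}

if dict_ready /usr/local/etc/; then
    echo "uni.lib found"
else
    echo "uni.lib missing; generate it first"
fi
```

If the check fails, the steps below describe how to build uni.lib.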

1. So we need to generate a dictionary file. Before doing that, install coreseek; see http://www.coreseek.cn/products-install/install_on_bsd_linux/
Before installing, read the README in the source package. Versions 4.0/4.1 follow the same steps as 3.2; if you run into problems, see the detailed installation instructions.

##Download coreseek (3.2.14, 4.0.1, or 4.1)
$ wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz
#or http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.0.1-beta.tar.gz
#or http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
$ tar xzvf coreseek-3.2.14.tar.gz #or coreseek-4.0.1-beta.tar.gz or coreseek-4.1-beta.tar.gz
$ cd coreseek-3.2.14 #or coreseek-4.0.1-beta or coreseek-4.1-beta

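##Prerequisite: install the OS base development libraries and the MySQL dependency libraries first, to support the MySQL and XML data sources
##Install mmseg
$ cd mmseg-3.2.14
$ ./bootstrap #warnings in the output can be ignored; any error must be resolved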
$ ./configure --prefix=/usr/local/mmseg3
$ make && make install
$ cd ..

##Install coreseek
$ cd csft-3.2.14 #or csft-4.0.1 or csft-4.1
$ sh buildconf.sh #warnings in the output can be ignored; any error must be resolved
$ ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql ##if configure reports a MySQL problem, see the MySQL data source installation instructions
##on Debian 5 or Ubuntu 9/10, install the MySQL dependencies with:
$ apt-get install mysql-client libmysqlclient15-dev libxml2-dev libexpat1-dev

$ make && make install
$ cd ..

##Test mmseg segmentation and coreseek search (set the locale to zh_CN.UTF-8 beforehand so that Chinese displays correctly)
$ cd testpack
$ cat var/test/test.xml #Chinese text should display correctly here
$ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml #the content of test.xml is segmented using the default vocabulary, which comes from the dictionary file "/usr/local/mmseg3/etc/uni.lib"
$ /usr/local/coreseek/bin/indexer -c etc/csft.conf --all #rebuild the index

2. Generate a new dictionary:
#write the new vocabulary in word_new_input.txt, one word per line, then cd to the directory that holds word_new_input.txt
#for example (without the # at the beginning of each line):
#雅阁
#马自达

#now cd to your new vocabulary directory:
$ cd ~/projects/mmseg-3.2.14/new2
$ awk '{print $1"\t""1""\nx:1"}' word_new_input.txt > word_new_gen.txt #convert each word to a unigram entry
$ cat ../data/unigram.txt >> word_new_gen.txt #append the stock vocabulary so that existing words are kept
$ /usr/local/mmseg3/bin/mmseg -u word_new_gen.txt #generates the file word_new_gen.txt.lib
$ mv word_new_gen.txt.lib uni.lib #rename
#$ cp -r /usr/local/mmseg3/etc ~/ #optional: back up the old dictionary files first
$ sudo cp uni.lib /usr/local/mmseg3/etc/ #replace the dictionary file with the new one
##now rebuild the index with the config in your coreseek-3.2.14/testpack directory
$ /usr/local/coreseek/bin/indexer -c ~/projects/coreseek-3.2.14/testpack/etc/csft.conf --all #rebuild the index
#the command above prints output like the following:
Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file 'etc/csft.conf'...
indexing index 'xml'...
collected 3 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 3 docs, 7585 bytes
total 0.010 sec, 746334 bytes/sec, 295.18 docs/sec
total 2 reads, 0.000 sec, 4.2 kb/call avg, 0.0 msec/call avg
total 7 writes, 0.000 sec, 3.1 kb/call avg, 0.0 msec/call avg

#the new dictionary is now stored in /usr/local/mmseg3/etc/
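
The conversion and merge from step 2 can also be wrapped in two small shell functions. This is only a sketch: the names to_unigram and merge_dict are hypothetical, and the entry format ("word<TAB>1" followed by "x:1") is simply what the awk command in step 2 produces. Writing to a separate output file avoids reading and truncating the same file in one command.

```shell
#!/bin/sh
# Sketch only: to_unigram and merge_dict are hypothetical helper names.

# Turn a word list (one word per line) into the two-line entry format
# produced by the awk command in step 2: "<word>\t1" then "x:1".
to_unigram() {
    # NF skips blank lines; $1 is the first whitespace-delimited field.
    awk 'NF { printf "%s\t1\nx:1\n", $1 }' "$@"
}

# Write the new entries followed by the stock entries to OUTPUT.
# OUTPUT must be a different file from both inputs.
# usage: merge_dict NEW_WORDS STOCK_UNIGRAM OUTPUT
merge_dict() {
    { to_unigram "$1"; cat "$2"; } > "$3"
}

# Demo on the two sample words from step 2.
printf '雅阁\n马自达\n' | to_unigram
```

For example, merge_dict word_new_input.txt ../data/unigram.txt word_new_gen.txt would produce the file that step 2 then passes to mmseg -u.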
3. Test the new dictionary:
3.1 The file "var/test/newtest.txt" contains sentences that use the new vocabulary:
$ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/newtest.txt
雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x
马自达/x 的/x 重量/x 是/x 多少/x ?/x
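
To check this output in a script rather than by eye, the "/x" tags can be stripped and each token compared against the expected word. A minimal sketch, assuming the output format shown above (space-separated tokens, each ending in "/x"); the helper names tokens and has_token are hypothetical.

```shell
#!/bin/sh
# Sketch only: tokens and has_token are hypothetical helper names.

# Print one token per line, with the trailing "/x" tag removed.
tokens() {
    tr ' ' '\n' | sed -e 's#/x$##' -e '/^$/d'
}

# Succeed only if the word appears as a single, whole token.
has_token() {
    tokens | grep -qxF -- "$1"
}

# Demo on the first output line from 3.1.
printf '雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x\n' | tokens
```

Piping the real mmseg output into has_token 雅阁 would then succeed only if the new word was kept in one piece.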
3.2 Or you can test it from CoffeeScript:

david@Wade:~/node/node$ coffee
coffee> mmseg=require('mmseg')
{ open: [Function],
clean: [Function],
uniq: [Function] }
coffee> q= mmseg.open( '/usr/local/mmseg3/etc/')
{}
coffee> console.log q.segmentSync('我喜欢开雅阁')
[ '我', '喜欢', '开', '雅阁' ]
undefined
coffee> console.log q.segmentSync('我喜欢开丰田') #丰田 is NOT in the new dictionary
[ '我', '喜欢', '开', '丰', '田' ]
undefined
coffee> console.log q.segmentSync '我喜欢开马自达'
[ '我', '喜欢', '开', '马自达' ]
