1.6.7 Detecting Languages During Indexing
1. Detecting Languages During Indexing
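At index time, Solr can use the langid UpdateRequestProcessor to identify languages and then map text to language-specific fields. Solr supports two implementations of this feature:
- Tika's language detection feature: http://tika.apache.org/0.10/detection.html
- LangDetect language detection: http://code.google.com/p/language-detection/
A comparison of the two is available at http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, LangDetect supports more languages and performs better.
For more information about the langid UpdateRequestProcessor, see http://wiki.apache.org/solr/LanguageDetection.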
1.1 Configuring Language Detection
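You configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the same parameters. At a minimum, you must specify the fields to use for language identification and a field to hold the resulting language code.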
1.2 Configuring Tika Language Detection
Here is a minimal Tika langid UpdateRequestProcessor configuration for solrconfig.xml:
<processor
class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
1.3 Configuring LangDetect Language Detection
Here is a minimal LangDetect langid configuration for solrconfig.xml:
<processor
class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
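The minimal configurations above show only the processor element. As a sketch of where such a processor sits, the fragment below wraps the LangDetect variant in an update request processor chain. The chain name `langid` is an illustrative choice, not something this section prescribes, and the two trailing processors are the standard logging and execution steps that a custom chain normally ends with.

```xml
<!-- Sketch only: the chain name "langid" is an assumption made for illustration. -->
<updateRequestProcessorChain name="langid">
  <processor
      class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Fields whose content is used to detect the language. -->
      <str name="langid.fl">title,subject,text,keywords</str>
      <!-- Field that receives the detected language code. -->
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <!-- Log the update, then actually execute it. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With a chain like this in place, an update request would typically select it with the `update.chain=langid` request parameter, or the chain can be set as a default on the update handler.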
1.4 langid Parameters
As mentioned above, both implementations of the langid UpdateRequestProcessor take the same parameters:
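| Parameter | Type | Default | Required | Description |
| langid | Boolean | true | no | Enables and disables language detection. |
| langid.fl | string | none | yes | A comma- or space-delimited list of fields to be processed by langid. |
| langid.langField | string | none | yes | Specifies the field for the returned language code. |
| langid.langsField | multivalued string | none | no | Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language is added to this field. |
| langid.overwrite | Boolean | false | no | Specifies whether the content of the langField and langsField fields is overwritten if they already contain values. |
| langid.lcmap | string | none | no | A space-separated list specifying colon-delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code, by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects the values put into the langField and langsField fields. |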
| langid.threshold | float | 0.5 | no | Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results. |
| langid.whitelist | string | none | no | Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema. |
| langid.map | Boolean | false | no | Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl. |
| langid.map.fl | string | none | no | A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl. |
| langid.map.keepOrig | Boolean | false | no | If true, Solr will copy the field during the field name mapping process, leaving the original field in place. |
| langid.map.individual | Boolean | false | no | If true, Solr will detect and map languages for each field individually. |
| langid.map.individual.fl | string | none | no | A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl. |
| langid.fallbackFields | string | none | no | If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback. |
| langid.fallback | string | none | no | Specifies a language code to use if no language is detected or specified in langid.fallbackFields. |
| langid.map.lcmap | string | determined by langid.lcmap | no | A space-separated list specifying colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en, by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. |
| langid.map.pattern | Java regular expression | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter. |
| langid.map.replace | Java replace | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter. |
| langid.enforceSchema | Boolean | true | no | If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain. |
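To show how several of these parameters combine, here is a hedged sketch of a LangDetect processor that also maps field names. Everything beyond the parameter names is an assumption made for illustration: the field names, the language codes, the choice of 0.8 as threshold, the comma-separated form of the whitelist value, and the use of `<str>` elements for Boolean and float values.

```xml
<!-- Sketch only: values below are illustrative assumptions, not recommended settings. -->
<processor
    class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <!-- Detect the language from these fields and store the code in language_s. -->
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <!-- Require a fairly confident detection, suited to longer text fields. -->
    <str name="langid.threshold">0.8</str>
    <!-- Accept only languages the schema has fields for (delimiter assumed to be a comma). -->
    <str name="langid.whitelist">en,ja,zh,ko</str>
    <!-- Language code to use when nothing acceptable is detected. -->
    <str name="langid.fallback">en</str>
    <!-- Rename the fields in langid.fl to <field>_<language>, for example title_en. -->
    <str name="langid.map">true</str>
    <!-- Share one suffix across Chinese, Japanese, and Korean, for example title_cjk. -->
    <str name="langid.map.lcmap">ja:cjk zh:cjk ko:cjk</str>
  </lst>
</processor>
```

Under these assumptions, a document detected as Japanese would have its title and text content indexed into title_cjk and text_cjk, so the schema needs matching fields or dynamic field rules for each suffix that can be produced.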