Document retrieval with the Vector Space Model

Structure of a query (topic) in the XML file:

<topic>

<number>CIRB010TopicZH006</number>

<title>科索沃難民潮</title>

<question>

查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。

</question>

<narrative>

相關文件內容包含科省難民湧入的地點、人數。受安置的狀況，難民潮引發的問題，参與救援之國家與國際組織，其援助策略與行動內容之報導。

</narrative>

<concepts>

科省、柯省、科索沃、柯索伏、難民、難民潮、難民營、援助、收容、救援、醫療、人道、避難、馬其頓、土耳其、外交部、國際、聯合國、紅十字會、阿爾巴尼亞裔難民。

</concepts>

</topic>
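As a small illustration of this structure, the following Python 3 sketch reads one topic with xml.dom.minidom and derives the query id from the number field. The `<xml>` root element and the helper name `topic_fields` are assumptions made for the example, and the topic is shortened from the sample above.

```python
from xml.dom.minidom import parseString

# Shortened sample; the <xml> root element is an assumption for illustration.
SAMPLE = u"""<xml>
<topic>
<number>CIRB010TopicZH006</number>
<title>科索沃難民潮</title>
<question>
查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。
</question>
</topic>
</xml>"""


def topic_fields(topic, tags):
    """Return {tag: stripped text} for the given child tags of one <topic> (hypothetical helper)."""
    fields = {}
    for tag in tags:
        node = topic.getElementsByTagName(tag)[0]
        fields[tag] = node.childNodes[0].data.strip()
    return fields


doc = parseString(SAMPLE.encode("utf-8"))
topics = doc.documentElement.getElementsByTagName("topic")
parsed = [topic_fields(t, ["number", "title", "question"]) for t in topics]

# The query id is the part of <number> after "ZH".
query_id = parsed[0]["number"].split("ZH")[1]
```

Stripping the text is useful because multi-line fields such as question and narrative begin and end with a newline.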

Format of the document list (file-list):

CIRB010/cdn/loc/CDN_LOC_0001457

CIRB010/cdn/loc/CDN_LOC_0000294

CIRB010/cdn/loc/CDN_LOC_0000120

CIRB010/cdn/loc/CDN_LOC_0000661

CIRB010/cdn/loc/CDN_LOC_0001347

CIRB010/cdn/loc/CDN_LOC_0000439

Format of the vocabulary (vocab.all); for Chinese text, each line holds a single character:

utf8

Copper

version

EGCG

432Kbps

RESERVECHARDONNAY

TommyHolloway

platts

Celeron266MHz

VOLKSWAGEN

INDEX

SmarTone

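Format of the inverted index (inverted-file). Each entry starts with a header line:

vocab line number of term 1, vocab line number of term 2 (-1 means a single term, in which case only term 1 is used), number of documents containing the entry

The header is followed by one line per document:

line number of the document in file-list, number of occurrences in that document

The implementation below handles only single characters (unigrams).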
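To make the layout concrete, here is a minimal Python 3 sketch that parses lines in this format into a postings dictionary and skips two-term entries. The sample lines and the function name `parse_inverted_file` are invented for illustration and are not taken from the real data set.

```python
def parse_inverted_file(lines):
    """Parse inverted-file lines into {term_id: [(doc_id, count), ...]}, unigrams only."""
    postings = {}
    i = 0
    while i < len(lines):
        # Header line: term 1, term 2 (-1 for a single term), number of documents.
        term1, term2, n_docs = (int(x) for x in lines[i].split())
        if term2 == -1:
            postings[term1] = [
                tuple(int(x) for x in lines[i + j].split())
                for j in range(1, n_docs + 1)
            ]
        # Jump over the header and its posting lines, whether or not they were kept.
        i += n_docs + 1
    return postings


# Hypothetical sample: two unigram entries and one two-term entry.
sample = [
    "0 -1 2",  # term 0 occurs in 2 documents
    "3 1",     # document 3, once
    "7 4",     # document 7, four times
    "0 5 1",   # pair (0, 5): skipped
    "3 2",
    "1 -1 1",  # term 1 occurs in 1 document
    "7 2",
]

postings = parse_inverted_file(sample)
# Document frequency per term, which is what the IDF computation needs.
doc_freq = {term: len(plist) for term, plist in postings.items()}
```

Keeping the postings in a dictionary avoids allocating a dense documents-by-vocabulary matrix before the sparse matrix is built.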

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

import getopt

from xml.dom.minidom import parse

import xml.dom.minidom

import scipy.sparse as sp

from numpy import *

from math import log

from sklearn.preprocessing import normalize

#deal with the argv

def main(argv):

	ifFeedback=False

	try:

		opts,args=getopt.getopt(argv,'ri:o:m:d:',[])

	except getopt.GetoptError:

		# invalid options: report and stop, since opts would be undefined below

		print('wrong input')

		sys.exit(2)

	for opt,arg in opts:

		if opt=='-r' and ifFeedback==False:

			ifFeedback=True

		elif opt=='-i':

			queryFile=arg

		elif opt=='-o':

			rankedList=arg

		elif opt=='-m':

			modelDir=arg

		elif opt=='-d':

			NTCIRDir=arg

		else:

			pass

	return ifFeedback,queryFile,rankedList,modelDir,NTCIRDir

#if __name__=='__main__' :

#get the path in the arguments

ifFeedback,queryFile,rankedList,modelDir,NTCIRDir=main(sys.argv[1:])

#print ifFeedback,queryFile,rankedList,modelDir,NTCIRDir

#get the file path in the model-dir

vocab=modelDir+'/vocab.all'

fileList=modelDir+'/file-list'

invList=modelDir+'/inverted-file'

#read

pf=open(vocab,'r')

vocab=pf.read()

pf.close()

pf=open(fileList,'r')

fileList=pf.read()

pf.close()

pf=open(invList,'r')

invList=pf.read()

pf.close()

#splitlines

vocab=vocab.splitlines()

fileList=fileList.splitlines()

invList=invList.splitlines()

# vocab dict

vocabDict={}

k=0

while k <len(vocab):

	vocabDict[vocab[k]]=k

	k+=1

#get the TF and IDF matrix

#dimension:

#tfMatrix=sp.csr_matrix(len(fileList),len(vocab))

IDFVector=zeros(len(vocab))

totalDocs=len(fileList)

count=0

tempMatrix=zeros((len(fileList),len(vocab)))

while count<len(invList):

	postings=invList[count]

	post=postings.split(' ')

	k=1

	#just deal with the single word

	if(len(post)>2 and post[1]=='-1'):

		IDFVector[int(post[0])]=int(post[2])

		while k<=int(post[2]):

			line=invList[count+k].split(' ')

			tempMatrix[int(line[0])][int(post[0])]=int(line[1])

			k+=1

	count+=k

tfMatrix=sp.csr_matrix(tempMatrix)

#BM25-style term-frequency weighting: tf*(k+1)/(tf+k*(1-b+b*doclen/avglen))

doclens=tfMatrix.sum(1)

avglen=doclens.mean()

k=7

b=0.7

#

tp1=tfMatrix*(k+1)

tp2=k*(1-b+b*doclens/avglen)

tfMatrix.data+=array(tp2[tfMatrix.tocoo().row]).reshape(len(tfMatrix.data))

tfMatrix.data=tp1.data/tfMatrix.data

#calculate the idf

k=0

while k<len(vocab):

	if IDFVector[k]!=0:

		IDFVector[k]=log(float(totalDocs)/IDFVector[k])

	k+=1

#tf-idf

tfMatrix.data*=IDFVector[tfMatrix.indices]

#row normalization for tf-idf matrix

normalize(tfMatrix,norm='l2',axis=1,copy=False)

#deal with the query

doc=xml.dom.minidom.parse(queryFile)

root=doc.documentElement

topics=root.getElementsByTagName('topic')

rankList=''

for topic in topics:

	#query vector

	qVector=zeros(len(vocab))

	number=topic.getElementsByTagName('number')[0].childNodes[0].data

	title=topic.getElementsByTagName('title')[0].childNodes[0].data

	question=topic.getElementsByTagName('question')[0].childNodes[0].data

	narrative=topic.getElementsByTagName('narrative')[0].childNodes[0].data

	concepts=topic.getElementsByTagName('concepts')[0].childNodes[0].data

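	#merge narrative, question and concepts into one query text

	narrative+=question+concepts

	for w in narrative:

		#vocab.all was read as UTF-8 bytes, so encode each character before the lookup

		if w.encode('utf8') in vocabDict:

			qVector[vocabDict[w.encode('utf8')]]+=1

	for w in title:

		if w.encode('utf8') in vocabDict:

			qVector[vocabDict[w.encode('utf8')]]+=1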

	#L2-normalize the query; normalize() expects a 2-D array and returns the result

	qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]

	#similarity compute:

	#a sparse matrix

	sim=tfMatrix*(sp.csr_matrix(qVector).transpose())

	sim=sim.toarray()

	k=0

	simCount=[]

	while k<len(fileList):

		#sim is an N x 1 array; store the scalar score so the tuples sort cleanly

		tup=(sim[k][0],k)

		simCount.append(tup)

		k+=1

	#sort

	simCount.sort(reverse=True)

	simCount=simCount[:100]

	if ifFeedback:

		topk=[]

		for score,k in simCount[:20]:

			topk.append(k)

		#pseudo relevance feedback: average the vectors of the top-ranked documents

		d=tfMatrix[topk,:].sum(0)/float(len(topk))

		qVector+=array(0.8*d).reshape(len(qVector))

	#.....

	qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]

	#similarity compute:

	#a sparse matrix

	sim=tfMatrix*(sp.csr_matrix(qVector).transpose())

	sim=sim.toarray()

	k=0

	simCount=[]

	while k<len(fileList):

		tup=(sim[k][0],k)

		simCount.append(tup)

		k+=1

	#sort

	simCount.sort(reverse=True)

	simCount=simCount[:100]

	#.....

	num=number.split('ZH')

	num=num[1]

	for score,docIndex in simCount:

		name=fileList[docIndex]

		name=name.split('/')

		#e.g. CIRB010/cdn/loc/CDN_LOC_0001457 -> cdn_loc_0001457

		name=name[3].lower()

		rank=num+' '+name

		rankList+=rank+'\n'

pf=open(rankedList,'w')

pf.write(rankList)

pf.close()
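The scoring steps of the script (BM25-style term weighting, IDF, L2 normalization, cosine ranking, and optional feedback from the top-ranked documents) can be tried on a toy example with the Python 3 sketch below. It is not the script itself: it uses small dense NumPy arrays instead of sparse matrices, the function names `l2_rows`, `bm25_tfidf` and `rank` are hypothetical, and the count matrix is invented. Only the constants k=7, b=0.7 and the feedback weight 0.8 come from the code above.

```python
import numpy as np


def l2_rows(m):
    """Scale each row to unit L2 length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


def bm25_tfidf(counts, k=7.0, b=0.7):
    """Turn a docs x terms count matrix into L2-normalized BM25-style tf-idf rows."""
    counts = np.asarray(counts, dtype=float)
    doc_len = counts.sum(axis=1, keepdims=True)
    # Length normalization; assumes at least one non-empty document.
    length_norm = k * (1.0 - b + b * doc_len / doc_len.mean())
    tf = counts * (k + 1.0) / (counts + length_norm)  # zero counts stay zero
    df = (counts > 0).sum(axis=0)
    idf = np.zeros(counts.shape[1])
    idf[df > 0] = np.log(counts.shape[0] / df[df > 0])
    return l2_rows(tf * idf)


def rank(doc_matrix, query, feedback_docs=0, beta=0.8):
    """Rank documents by cosine similarity, optionally with pseudo relevance feedback."""
    q = l2_rows(np.atleast_2d(np.asarray(query, dtype=float)))[0]
    scores = doc_matrix @ q
    if feedback_docs > 0:
        # Add the mean vector of the current top documents to the query, then rescore.
        top = np.argsort(-scores, kind="stable")[:feedback_docs]
        q = l2_rows(np.atleast_2d(q + beta * doc_matrix[top].mean(axis=0)))[0]
        scores = doc_matrix @ q
    order = np.argsort(-scores, kind="stable")
    return order, scores


# Invented toy data: 3 documents, 3 vocabulary entries.
counts = [[2, 0, 1],
          [0, 3, 0],
          [1, 1, 0]]
docs = bm25_tfidf(counts)
query = [1, 0, 1]  # the query contains terms 0 and 2

order, scores = rank(docs, query)
order_fb, scores_fb = rank(docs, query, feedback_docs=1)
```

Because both the document rows and the query are unit length, the dot product is the cosine similarity, so no further normalization is needed when sorting.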
