Document Retrieval with the Vector Space Model

Structure of a query (topic) in the XML query file:

<topic>

<number>CIRB010TopicZH006</number>

<title>科索沃難民潮</title>

<question>

查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。

</question>

<narrative>

相關文件內容包含科省難民湧入的地點、人數。受安置的狀況，難民潮引發的問題，参與救援之國家與國際組織，其援助策略與行動內容之報導。

</narrative>

<concepts>

科省、柯省、科索沃、柯索伏、難民、難民潮、難民營、援助、收容、救援、醫療、人道、避難、馬其頓、土耳其、外交部、國際、聯合國、紅十字會、阿爾巴尼亞裔難民。

</concepts>

</topic>
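As a quick illustration of this structure, here is a minimal sketch (Python 3, standard library only) that pulls the five fields out of a topic. The helper name `parse_topics`, the enclosing root element `<xml>`, and the shortened `narrative` and `concepts` texts are assumptions made for the example; the source shows only a single `<topic>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the topic shown above, shortened and wrapped in an
# assumed root element so that it forms a complete XML document.
SAMPLE = """<xml>
<topic>
<number>CIRB010TopicZH006</number>
<title>科索沃難民潮</title>
<question>
查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。
</question>
<narrative>
相關文件內容包含科省難民湧入的地點、人數。
</narrative>
<concepts>
科省、柯省、科索沃、難民、難民潮。
</concepts>
</topic>
</xml>"""

FIELDS = ("number", "title", "question", "narrative", "concepts")


def parse_topics(xml_text):
    """Return one dict per <topic>, with surrounding whitespace stripped."""
    root = ET.fromstring(xml_text)
    return [
        {field: (topic.findtext(field) or "").strip() for field in FIELDS}
        for topic in root.iter("topic")
    ]


topics = parse_topics(SAMPLE)
print(topics[0]["number"], topics[0]["title"])
```

Stripping whitespace matters here because the text of `question`, `narrative` and `concepts` is surrounded by line breaks in the file.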

Format of the document list (file-list), one document path per line:

CIRB010/cdn/loc/CDN_LOC_0001457

CIRB010/cdn/loc/CDN_LOC_0000294

CIRB010/cdn/loc/CDN_LOC_0000120

CIRB010/cdn/loc/CDN_LOC_0000661

CIRB010/cdn/loc/CDN_LOC_0001347

CIRB010/cdn/loc/CDN_LOC_0000439

Format of the vocabulary (vocab.all), one term per line; for Chinese, each line holds a single character:

utf8

Copper

version

EGCG

432Kbps

RESERVECHARDONNAY

TommyHolloway

platts

Celeron266MHz

VOLKSWAGEN

INDEX

SmarTone
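To make the role of the vocabulary concrete, the sketch below (Python 3) maps each term to its 0-based line number, which is how the script later in this post indexes terms, and counts the vocabulary characters in a piece of query text. The miniature vocabulary and the helper name `char_counts` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical miniature vocab.all: the first line is the encoding marker,
# and Chinese terms are single characters.
vocab_lines = ["utf8", "Copper", "難", "民", "潮"]

# Map each term to its 0-based line number, which serves as its term id.
vocab_index = {term: i for i, term in enumerate(vocab_lines)}


def char_counts(text, index):
    """Count, per term id, the characters of `text` found in the vocabulary."""
    return Counter(index[ch] for ch in text if ch in index)


counts = char_counts("難民潮與難民營", vocab_index)
print(dict(counts))
```

Characters that are not in the vocabulary (here 與 and 營) are simply ignored, so the query vector only has entries for known terms.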

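Format of the inverted index (inverted-file). Each entry starts with a header line:

vocab line number of term 1, vocab line number of term 2 (-1 means a single term; only term 1 is considered), number of documents containing the term

The header is followed by one line per document:

line number of the document in file-list, number of occurrences of the term

The implementation below considers single characters (unigrams) only.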
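The entry layout described above can be read with a single pass over the lines. The following is a minimal sketch in plain Python 3, assuming whitespace-separated integer fields; the function name `load_unigrams` and the sample lines are hypothetical, and it stores postings in nested dicts rather than in a matrix.

```python
def load_unigrams(lines):
    """Parse inverted-file lines into {term_id: {doc_id: count}}, unigrams only."""
    postings = {}
    i = 0
    while i < len(lines):
        term1, term2, n_docs = (int(x) for x in lines[i].split())
        if term2 == -1:  # unigram entry: keep its postings
            postings[term1] = {}
            for line in lines[i + 1:i + 1 + n_docs]:
                doc_id, count = (int(x) for x in line.split())
                postings[term1][doc_id] = count
        i += 1 + n_docs  # skip the header and its posting lines
    return postings


# Hypothetical sample: term 1 (unigram), the bigram (1, 2), term 3 (unigram).
sample = ["1 -1 2", "0 3", "2 1", "1 2 1", "0 1", "3 -1 1", "1 5"]
postings = load_unigrams(sample)
print(postings)
```

The document frequency of a term is then `len(postings[term_id])`, which is the same number as the third field of its header line.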

#!/usr/bin/python
# -*- coding: utf-8 -*-
# NOTE: this script targets Python 2 (the vocabulary keys are UTF-8 byte strings)

import sys
import getopt
import xml.dom.minidom
import scipy.sparse as sp
from numpy import array, zeros
from math import log
from sklearn.preprocessing import normalize

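# Parse the command-line options:
# -r enable relevance feedback, -i query file, -o output ranked list,
# -m model directory, -d NTCIR document directory
def main(argv):
	ifFeedback=False
	queryFile=rankedList=modelDir=NTCIRDir=None
	try:
		opts,args=getopt.getopt(argv,'ri:o:m:d:',[])
	except getopt.GetoptError:
		# invalid options: report and stop, since opts is undefined here
		print('wrong input')
		sys.exit(2)
	for opt,arg in opts:
		if opt=='-r':
			ifFeedback=True
		elif opt=='-i':
			queryFile=arg
		elif opt=='-o':
			rankedList=arg
		elif opt=='-m':
			modelDir=arg
		elif opt=='-d':
			NTCIRDir=arg
	return ifFeedback,queryFile,rankedList,modelDir,NTCIRDir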

#if __name__=='__main__' :

#get the path in the arguments

ifFeedback,queryFile,rankedList,modelDir,NTCIRDir=main(sys.argv[1:])

#print ifFeedback,queryFile,rankedList,modelDir,NTCIRDir

#get the file path in the model-dir

vocab=modelDir+'/vocab.all'

fileList=modelDir+'/file-list'

invList=modelDir+'/inverted-file'

# read the three model files and split them into lines
with open(vocab,'r') as pf:
	vocab=pf.read().splitlines()
with open(fileList,'r') as pf:
	fileList=pf.read().splitlines()
with open(invList,'r') as pf:
	invList=pf.read().splitlines()

# vocab dict: term -> line number in vocab.all (0-based)
vocabDict={}
for k,w in enumerate(vocab):
	vocabDict[w]=k

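# Build the raw term-frequency matrix (documents x terms) and the
# document-frequency vector from the inverted file.
# IDFVector holds document frequencies for now; it is converted to IDF later.
IDFVector=zeros(len(vocab))
totalDocs=len(fileList)
count=0
# NOTE: this is a dense matrix, so memory grows with len(fileList)*len(vocab)
tempMatrix=zeros((len(fileList),len(vocab)))
while count<len(invList):
	post=invList[count].split(' ')
	k=1
	# handle unigram entries only (the second term number is -1)
	if len(post)>2 and post[1]=='-1':
		IDFVector[int(post[0])]=int(post[2])
		# the next post[2] lines are "document-line count" postings for this term
		while k<=int(post[2]):
			line=invList[count+k].split(' ')
			tempMatrix[int(line[0])][int(post[0])]=int(line[1])
			k+=1
	# skip the header and the postings consumed; any other line is skipped one at a time
	count+=k
tfMatrix=sp.csr_matrix(tempMatrix)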

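# BM25-style term-frequency weighting:
# tf*(k+1)/(tf+k*(1-b+b*doclen/avglen))
doclens=tfMatrix.sum(1)
avglen=doclens.mean()
k=7
b=0.7
# numerator: tf*(k+1)
tp1=tfMatrix*(k+1)
# per-document length normalization term: k*(1-b+b*doclen/avglen)
tp2=k*(1-b+b*doclens/avglen)
# denominator: add each document's term to its nonzero entries
tfMatrix.data+=array(tp2[tfMatrix.tocoo().row]).reshape(len(tfMatrix.data))
tfMatrix.data=tp1.data/tfMatrix.data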

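# convert document frequencies to IDF: log(N/df); terms that never occur keep 0
for k in range(len(vocab)):
	if IDFVector[k]!=0:
		IDFVector[k]=log(float(totalDocs)/IDFVector[k])
# tf-idf: scale every nonzero entry by the IDF of its term (column)
tfMatrix.data*=IDFVector[tfMatrix.indices]
# L2-normalize each document row so that a dot product with a
# normalized query is a cosine similarity
tfMatrix=normalize(tfMatrix,norm='l2',axis=1)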

#deal with the query

doc=xml.dom.minidom.parse(queryFile)

root=doc.documentElement

topics=root.getElementsByTagName('topic')

rankList=''

for topic in topics:
	# query vector: raw counts of the vocabulary characters in the topic
	qVector=zeros(len(vocab))
	number=topic.getElementsByTagName('number')[0].childNodes[0].data
	title=topic.getElementsByTagName('title')[0].childNodes[0].data
	question=topic.getElementsByTagName('question')[0].childNodes[0].data
	narrative=topic.getElementsByTagName('narrative')[0].childNodes[0].data
	concepts=topic.getElementsByTagName('concepts')[0].childNodes[0].data
	narrative+=question+concepts
	# every character of the four text fields is treated as one unigram term
	for w in narrative+title:
		key=w.encode('utf8')
		if key in vocabDict:
			qVector[vocabDict[key]]+=1

	# L2-normalize the query; normalize() expects a 2-D array and returns the result
	qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]
	# cosine similarity of every document with the query (sparse product)
	sim=(tfMatrix*sp.csr_matrix(qVector).transpose()).toarray().ravel()
	# keep the 100 best (score, document index) pairs, best first
	simCount=sorted(zip(sim,range(len(fileList))),reverse=True)[:100]
	if ifFeedback:
		# pseudo-relevance feedback: add 0.8 times the mean vector of the
		# 20 top-ranked documents to the query, then rank again
		topk=[k for score,k in simCount[:20]]
		d=tfMatrix[topk,:].sum(0)/len(topk)
		qVector+=array(0.8*d).reshape(len(qVector))
		qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]
		sim=(tfMatrix*sp.csr_matrix(qVector).transpose()).toarray().ravel()
		simCount=sorted(zip(sim,range(len(fileList))),reverse=True)[:100]

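	# topic number: the part after 'ZH', e.g. '006' from 'CIRB010TopicZH006'
	num=number.split('ZH')[1]
	for score,k in simCount:
		# document id: fourth path component (the file name) in lower case
		name=fileList[k].split('/')[3].lower()
		rankList+=num+' '+name+'\n'
# write one "topic-number document-id" line per retrieved document
with open(rankedList,'w') as pf:
	pf.write(rankList)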
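The weighting and ranking steps of the script can also be followed on a toy collection. The sketch below is plain Python 3 with dicts instead of SciPy sparse matrices, so it is an illustration of the same formulas (k=7, b=0.7, IDF as log(N/df), L2 normalization, raw-count query) rather than the script itself. The function names `bm25_tfidf` and `rank` and the three toy documents are assumptions.

```python
from math import log, sqrt


def bm25_tfidf(doc_counts, k=7.0, b=0.7):
    """Weight raw counts by BM25-style tf times log(N/df), then L2-normalize."""
    n = len(doc_counts)
    avglen = sum(sum(d.values()) for d in doc_counts) / n
    # document frequency of every term
    df = {}
    for d in doc_counts:
        for t in d:
            df[t] = df.get(t, 0) + 1
    vectors = []
    for d in doc_counts:
        # length normalization term: k*(1-b+b*doclen/avglen)
        norm_len = k * (1 - b + b * sum(d.values()) / avglen)
        w = {t: tf * (k + 1) / (tf + norm_len) * log(n / df[t])
             for t, tf in d.items()}
        length = sqrt(sum(x * x for x in w.values())) or 1.0
        vectors.append({t: x / length for t, x in w.items()})
    return vectors


def rank(query_counts, vectors):
    """Return document indices sorted by cosine similarity, best first."""
    qlen = sqrt(sum(c * c for c in query_counts.values())) or 1.0
    scores = [sum(v.get(t, 0.0) * c / qlen for t, c in query_counts.items())
              for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy collection: character counts of three documents.
docs = [{"難": 2, "民": 2}, {"援": 1, "助": 1}, {"難": 1, "助": 3}]
print(rank({"難": 1, "民": 1}, bm25_tfidf(docs)))  # → [0, 2, 1]
```

Document 0 contains both query characters and ranks first; document 1 shares no character with the query, scores 0 and ranks last.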
