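I. Before You Read

 This article builds on:

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 1

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 2

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 3

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 4

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 5

 Please read those five articles first.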

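II. Feature Construction

 In the first five steps we stored the required data in structured form. This part uses that data to build the features.

 1) Generating the training data

dr = DataRewriter()
print('Generating training data...\n')
dr.rewriteData(train=True, start=2, header=True)

  We start with the rewriteData method of the DataRewriter class. It combines the user-based collaborative-filtering score, the item-based collaborative-filtering scores, and the popularity and influence measures into one feature row per record, producing a new training file for the classifier.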

    def rewriteData(self, start=1, train=True, header=True):
        """
        Combine the user-based and item-based collaborative-filtering scores
        with the popularity and influence measures into one feature row,
        producing the new data file used by the classifier.
        """
        fn = 'train.csv' if train else 'test.csv'
        fin = open(fn)
        fout = open('data_' + fn, 'w')
        # write the output header
        if header:
            ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
            if train:
                ocolnames.append('interested')
                ocolnames.append('not_interested')
            fout.write(','.join(ocolnames) + '\n')

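  fn: either train.csv or test.csv

  fout: the output file we write to, data_train.csv or data_test.csv

  ocolnames: the feature names; for train.csv the labels interested and not_interested are appended as well

  The code is explained below using train.csv as the example. Its columns are user, event, invited, timestamp, interested, not_interested.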

  

    def rewriteData(self, start=1, train=True, header=True):
        """
        Combine the user-based and item-based collaborative-filtering scores
        with the popularity and influence measures into one feature row,
        producing the new data file used by the classifier.
        """
        fn = 'train.csv' if train else 'test.csv'
        fin = open(fn)
        fout = open('data_' + fn, 'w')
        # write the output header
        if header:
            ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
            if train:
                ocolnames.append('interested')
                ocolnames.append('not_interested')
            fout.write(','.join(ocolnames) + '\n')
        ln = 0
        for line in fin:
            ln += 1
            # skip everything before line `start` (start=2 skips the CSV header)
            if ln < start:
                continue
            cols = line.strip().split(',')
            # user,event,invited,timestamp,interested,not_interested
            userId = cols[0]
            eventId = cols[1]
            invited = cols[2]
            # progress report every 500 lines
            if ln % 500 == 0:
                print("%s : %d (userId, eventId) = (%s, %s)" % (fn, ln, userId, eventId))
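
The line-skipping and column-splitting logic above can be tried in isolation. The following is a minimal sketch, not the author's code: the helper name iter_rows and the sample rows are made up for illustration, and only the column layout follows the comment in the listing.

```python
import io


def iter_rows(lines, start=1):
    """Yield (line_number, columns) for each line whose 1-based number is >= start."""
    for ln, line in enumerate(lines, 1):
        if ln < start:
            continue
        yield ln, line.strip().split(',')


# Made-up sample in the train.csv column layout (values are illustrative only)
sample = io.StringIO(
    "user,event,invited,timestamp,interested,not_interested\n"
    "u1,e1,0,2012-10-02 15:53:05,0,0\n"
    "u2,e2,1,2012-10-02 15:53:05,1,0\n"
)

# start=2 skips the header line, just like rewriteData(start=2)
rows = list(iter_rows(sample, start=2))
for ln, cols in rows:
    user_id, event_id, invited = cols[:3]
    print(ln, user_id, event_id, invited)
```

With start=1 the header line would be treated as data, which is why the calls in this article pass start=2.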

  

  a) Read train.csv or test.csv line by line, split each line on commas, and take userId, eventId and invited (the first three columns). Then call self.userReco(userId, eventId) to compute user_reco:

# Feature construction

# Python 2's cPickle is gone in Python 3; the standard pickle module
# uses the C implementation (_pickle) automatically
import pickle
import scipy.io as sio


class DataRewriter:
    def __init__(self):
        # load the data produced by the earlier steps
        self.userIndex = pickle.load(open('PE_userIndex.pkl', 'rb'))
        self.eventIndex = pickle.load(open('PE_eventIndex.pkl', 'rb'))
        self.userEventScores = sio.mmread('PE_userEventScores').todense()
        self.userSimMatrix = sio.mmread('US_userSimMatrix').todense()
        self.eventPropSim = sio.mmread('EV_eventPropSim').todense()
        self.eventContSim = sio.mmread('EV_eventContSim').todense()
        self.numFriends = sio.mmread('UF_numFriends')
        self.userFriends = sio.mmread('UF_userFriends').todense()
        self.eventPopularity = sio.mmread('EA_eventPopularity').todense()

    def userReco(self, userId, eventId):
        """
        Recommendation score of an event from user-based collaborative filtering.
        Pseudocode of the idea:
        for item i
            for every other user v that has a preference for i
                compute similarity s between u and v
                incorporate v's preference for i weighted by s into running average
        return top items ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        vs = self.userEventScores[:, j]   # all users' scores for event j (column vector)
        sims = self.userSimMatrix[i, :]   # similarity of every user to user i (row vector)
        prod = sims * vs                  # np.matrix product: 1 x 1 result
        try:
            # subtract user i's own score so it does not count toward the recommendation
            return prod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            return 0

  For example, when processing the first data row of train.csv, userId = 3044012 and eventId = 1918771225:

# Python 2's cPickle is gone in Python 3; the standard pickle module
# uses the C implementation (_pickle) automatically
import pickle
import scipy.io as sio

userIndex = pickle.load(open('PE_userIndex.pkl', 'rb'))
eventIndex = pickle.load(open('PE_eventIndex.pkl', 'rb'))
userEventScores = sio.mmread('PE_userEventScores').todense()
userSimMatrix = sio.mmread('US_userSimMatrix').todense()

userId = '3044012'
eventId = '1918771225'
i = userIndex[userId]
j = eventIndex[eventId]

print('The first line in train.csv: userIndex of (userId = %s) is (i = %d) ' % (userId, i))
print('The first line in train.csv: eventIndex of (eventId = %s) is (j = %d) ' % (eventId, j))

# interest scores of all users for event j, i.e. the (j+1)-th column of userEventScores
vs = userEventScores[:, j]
# the (i+1)-th row of userSimMatrix, i.e. the similarity of every user to this user
sims = userSimMatrix[i, :]
prod = sims * vs
try:
    print(prod[0, 0] - userEventScores[i, j])
except IndexError:
    print(0)

  Running this prints the two indices and the resulting score, which is the user_reco feature for this (user, event) pair.
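
The matrix product behind user_reco is easier to follow on a tiny example. This is a minimal sketch under stated assumptions: the 3-user, 2-event matrices below are invented, plain NumPy arrays replace the np.matrix objects returned by todense(), and the similarity matrix is assumed to have 1.0 on its diagonal so that subtracting the user's own score removes exactly their own contribution.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are events
scores = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Hypothetical symmetric user-user similarity, 1.0 on the diagonal (assumption)
sim = np.array([[1.0, 0.2, 0.5],
                [0.2, 1.0, 0.4],
                [0.5, 0.4, 1.0]])


def user_reco(i, j):
    """Similarity-weighted sum of all users' scores for event j, minus user i's own score."""
    return float(sim[i, :] @ scores[:, j] - scores[i, j])


# User 0 has no score for event 1, so the value comes entirely from similar users
print(user_reco(0, 1))
```

Here user 0's score for event 1 comes from users 1 and 2, weighted by how similar they are to user 0.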

  b) Computing evt_p_reco and evt_c_reco

   The process mirrors userReco() above; refer to the structure of eventPropSim and eventContSim.

    def eventReco(self, userId, eventId):
        """
        Recommendation score of an event from item-based collaborative filtering.
        Pseudocode of the idea:
        for item i:
            for every item j that u has a preference for
                compute similarity s between i and j
                add u's preference for j weighted by s to a running average
        return top items, ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        js = self.userEventScores[i, :]   # user i's interest score for every event
        psim = self.eventPropSim[:, j]    # property-based similarity of every event to event j
        csim = self.eventContSim[:, j]    # content-based similarity of every event to event j
        pprod = js * psim
        cprod = js * csim
        pscore = 0
        cscore = 0
        try:
            pscore = pprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        try:
            cscore = cprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        return pscore, cscore
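
The item-based score has the same shape as the user-based one, with the roles of rows and columns swapped. This is a minimal sketch on invented data: one similarity matrix stands in for either eventPropSim or eventContSim, plain NumPy arrays are used instead of np.matrix, and a diagonal of 1.0 is an assumption.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are events
scores = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])

# Hypothetical event-event similarity, 1.0 on the diagonal (assumption)
event_sim = np.array([[1.0, 0.3, 0.6],
                      [0.3, 1.0, 0.1],
                      [0.6, 0.1, 1.0]])


def event_reco(i, j, sim):
    """User i's scores for all events, weighted by each event's similarity to event j,
    minus user i's own score for event j."""
    return float(scores[i, :] @ sim[:, j] - scores[i, j])


# User 0 liked events 0 and 2; how strongly does that recommend event 1?
print(event_reco(0, 1, event_sim))
```

Calling the function once per similarity matrix gives the pair (evt_p_reco, evt_c_reco).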

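  c) Computing user_pop: call self.userPop()

   This needs each user's friend count (already expressed as a proportion):

    def userPop(self, userId):
        """
        Infer how sociable a user is from their number of friends.
        The idea: a user with many friends may be more inclined
        to attend social events.
        """
        if userId in self.userIndex:
            i = self.userIndex[userId]
            try:
                return self.numFriends[0, i]
            except IndexError:
                return 0
        else:
            return 0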
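
To make "friend count expressed as a proportion" concrete, here is a minimal sketch. It assumes the proportion is each user's friend count divided by the sum of all counts; the earlier step may normalize differently, and the friend lists and names below are invented.

```python
# Hypothetical friend lists keyed by user id (not taken from the dataset)
friends = {'u1': ['u2', 'u3'], 'u2': ['u1'], 'u3': ['u1']}
user_index = {'u1': 0, 'u2': 1, 'u3': 2}

# Raw friend counts, stored at each user's index
counts = [0] * len(user_index)
for user, friend_list in friends.items():
    counts[user_index[user]] = len(friend_list)

# Assumed normalization: share of the total friend count
total = sum(counts)
num_friends = [c / total for c in counts]


def user_pop(user_id):
    """Return the normalized friend count, or 0 for an unknown user."""
    i = user_index.get(user_id)
    return num_friends[i] if i is not None else 0


print(user_pop('u1'), user_pop('unknown'))
```

As in userPop, an unknown user id falls back to 0 instead of raising an error.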

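  d) Computing frnd_infl: call self.friendInfluence(). This measures the influence of friends on the user, that is, how many of the user's friends are keen on attending social events.

   This needs the variable self.userFriends

    def friendInfluence(self, userId):
        """
        Influence of friends on the user.
        The idea: look at how many of the user's friends are keen on
        attending social events. If the user's circle is active at events,
        that may influence the user as well.
        """
        nusers = np.shape(self.userFriends)[1]
        i = self.userIndex[userId]
        # The original code used sum(axis=0), and the author's comment asked
        # whether that was a mistake. It was: on a single 1 x nusers row,
        # axis=0 leaves the row unchanged, so [0, 0] returned only the first
        # entry. axis=1 sums over all of the user's friends.
        return (self.userFriends[i, :].sum(axis=1) / nusers)[0, 0]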
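
The effect of the axis argument on a single np.matrix row can be checked on a toy matrix. This is a minimal sketch with invented values. np.matrix is used only because todense() returns that type.

```python
import numpy as np

# Hypothetical 3-user friend matrix (values invented for illustration)
user_friends = np.matrix([[0.0, 2.0, 4.0],
                          [2.0, 0.0, 0.0],
                          [4.0, 0.0, 0.0]])
nusers = user_friends.shape[1]

row = user_friends[0, :]      # a 1 x 3 matrix, not a 1-D array

col_sums = row.sum(axis=0)    # sums down each column: still 1 x 3, values unchanged
row_sum = row.sum(axis=1)     # sums across the row: 1 x 1, the total over all friends

print(col_sums.shape, (col_sums / nusers)[0, 0])
print(row_sum.shape, (row_sum / nusers)[0, 0])
```

With axis=0 the [0, 0] lookup returns only the first column's entry divided by nusers; with axis=1 it returns the sum of the whole row divided by nusers.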

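  e) Computing evt_pop: call self.eventPop(). This is the popularity of an event, determined mainly by the number of attendees.

   This needs the variable self.eventPopularity

    def eventPop(self, eventId):
        """
        Popularity of the event itself,
        determined mainly by the number of attendees.
        """
        i = self.eventIndex[eventId]
        return self.eventPopularity[i, 0]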

  

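  f) Finally, write the information for this row to the output file

   Each row contains [invited, user_reco, evt_p_reco, evt_c_reco, user_pop, frnd_infl, evt_pop]. When reading train.csv, the labels interested and not_interested are appended as well.

# after processing one input line, write the resulting row
fout.write(','.join(map(str, ocols)) + '\n')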
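
How a feature row becomes one CSV line can be sketched as follows. The helper format_row and the feature values are hypothetical. Only the column order follows the article.

```python
import io


def format_row(invited, features, labels=None):
    """Join invited, the six feature values and the optional labels into one CSV line."""
    ocols = [invited] + list(features)
    if labels is not None:          # only training rows carry labels
        ocols.extend(labels)
    return ','.join(map(str, ocols)) + '\n'


out = io.StringIO()                 # stands in for the data_train.csv file handle
# order: user_reco, evt_p_reco, evt_c_reco, user_pop, frnd_infl, evt_pop
out.write(format_row('0', [0.5, 0.1, 0.2, 0.01, 0.0, 0.3], labels=('1', '0')))
out.write(format_row('1', [0.0, 0.0, 0.0, 0.02, 0.0, 0.1]))
print(out.getvalue(), end='')
```

A training row has nine fields and a test row seven, matching the header built from ocolnames.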

  g) Complete feature-construction code

# Feature construction

# Python 2's cPickle is gone in Python 3; the standard pickle module
# uses the C implementation (_pickle) automatically
import pickle
import scipy.io as sio
import numpy as np


class DataRewriter:
    def __init__(self):
        # load the data produced by the earlier steps
        self.userIndex = pickle.load(open('PE_userIndex.pkl', 'rb'))
        self.eventIndex = pickle.load(open('PE_eventIndex.pkl', 'rb'))
        self.userEventScores = sio.mmread('PE_userEventScores').todense()
        self.userSimMatrix = sio.mmread('US_userSimMatrix').todense()
        self.eventPropSim = sio.mmread('EV_eventPropSim').todense()
        self.eventContSim = sio.mmread('EV_eventContSim').todense()
        self.numFriends = sio.mmread('UF_numFriends')
        self.userFriends = sio.mmread('UF_userFriends').todense()
        self.eventPopularity = sio.mmread('EA_eventPopularity').todense()

    def userReco(self, userId, eventId):
        """
        Recommendation score of an event from user-based collaborative filtering.
        Pseudocode of the idea:
        for item i
            for every other user v that has a preference for i
                compute similarity s between u and v
                incorporate v's preference for i weighted by s into running average
        return top items ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        vs = self.userEventScores[:, j]
        sims = self.userSimMatrix[i, :]
        prod = sims * vs
        try:
            return prod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            return 0

    def eventReco(self, userId, eventId):
        """
        Recommendation score of an event from item-based collaborative filtering.
        Pseudocode of the idea:
        for item i:
            for every item j that u has a preference for
                compute similarity s between i and j
                add u's preference for j weighted by s to a running average
        return top items, ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        js = self.userEventScores[i, :]
        psim = self.eventPropSim[:, j]
        csim = self.eventContSim[:, j]
        pprod = js * psim
        cprod = js * csim
        pscore = 0
        cscore = 0
        try:
            pscore = pprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        try:
            cscore = cprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        return pscore, cscore

    def userPop(self, userId):
        """
        Infer how sociable a user is from their number of friends.
        The idea: a user with many friends may be more inclined
        to attend social events.
        """
        if userId in self.userIndex:
            i = self.userIndex[userId]
            try:
                return self.numFriends[0, i]
            except IndexError:
                return 0
        else:
            return 0

    def friendInfluence(self, userId):
        """
        Influence of friends on the user.
        The idea: look at how many of the user's friends are keen on
        attending social events. If the user's circle is active at events,
        that may influence the user as well.
        """
        nusers = np.shape(self.userFriends)[1]
        i = self.userIndex[userId]
        # The original code used sum(axis=0), and the author's comment asked
        # whether that was a mistake. It was: on a single 1 x nusers row,
        # axis=0 leaves the row unchanged, so [0, 0] returned only the first
        # entry. axis=1 sums over all of the user's friends.
        return (self.userFriends[i, :].sum(axis=1) / nusers)[0, 0]

    def eventPop(self, eventId):
        """
        Popularity of the event itself,
        determined mainly by the number of attendees.
        """
        i = self.eventIndex[eventId]
        return self.eventPopularity[i, 0]

    def rewriteData(self, start=1, train=True, header=True):
        """
        Combine the user-based and item-based collaborative-filtering scores
        with the popularity and influence measures into one feature row,
        producing the new data file used by the classifier.
        """
        fn = 'train.csv' if train else 'test.csv'
        fin = open(fn)
        fout = open('data_' + fn, 'w')
        # write the output header
        if header:
            ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
            if train:
                ocolnames.append('interested')
                ocolnames.append('not_interested')
            fout.write(','.join(ocolnames) + '\n')
        ln = 0
        for line in fin:
            ln += 1
            if ln < start:
                continue
            cols = line.strip().split(',')
            # user,event,invited,timestamp,interested,not_interested
            userId = cols[0]
            eventId = cols[1]
            invited = cols[2]
            if ln % 500 == 0:
                print("%s : %d (userId, eventId) = (%s, %s)" % (fn, ln, userId, eventId))
            user_reco = self.userReco(userId, eventId)
            evt_p_reco, evt_c_reco = self.eventReco(userId, eventId)
            user_pop = self.userPop(userId)
            frnd_infl = self.friendInfluence(userId)
            evt_pop = self.eventPop(eventId)
            ocols = [invited, user_reco, evt_p_reco, evt_c_reco, user_pop, frnd_infl, evt_pop]
            if train:
                ocols.append(cols[4])  # interested
                ocols.append(cols[5])  # not_interested
            fout.write(','.join(map(str, ocols)) + '\n')
        fin.close()
        fout.close()

    def rewriteTrainingSet(self):
        # The original passed True positionally, which bound it to `start`
        # instead of `train`; start=2 skips the CSV header as in the calls below.
        self.rewriteData(start=2, train=True)

    def rewriteTestSet(self):
        self.rewriteData(start=2, train=False)


dr = DataRewriter()
print('Generating training data...\n')
dr.rewriteData(train=True, start=2, header=True)

print('Generating test data...\n')
dr.rewriteData(train=False, start=2, header=True)
print('done')

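 2) Generating the test data: the process is the same as for the training data

 This completes Step 6. If anything is unclear, please leave a comment.

 With the features built, there are many ways to train a model and make predictions.

 Continue with Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 7.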
