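I. Before You Read

 This article builds on:

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 1

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 2

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 3

  Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 4

 Please read those four articles first.

II. Activity / Event Popularity Data

 This step uses event_attendees.csv.gz, so let's look at that file first: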

import pandas as pd
df_events_attendees = pd.read_csv('event_attendees.csv.gz', compression='gzip')
df_events_attendees.head()

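 Sample output (this file records attendance information for each event):

 1) Variables

  nevents: the total number of distinct events in train.csv and test.csv, here 13418

  self.eventPopularity: a sparse matrix of shape (nevents, 1). Each entry is the event's yes count minus its no count. The file is processed row by row: the event's index is looked up, the number of space-separated IDs in the no column is subtracted from the number in the yes column, and the resulting column is then L1-normalized.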
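 The computation described above can be sketched on a few made-up rows. This is only an illustration under stated assumptions: the rows, the IDs, and the event_index mapping are hypothetical, the yes field is taken to sit at position 1 and the no field at position 4 (the indices the full code uses), and the L1 normalization is written out by hand in NumPy instead of calling sklearn's normalize.

```python
import numpy as np

# Made-up rows: (event, yes, other, other, no). Only positions 1 and 4 are used.
rows = [
    ("e1", "u1 u2 u3", "u4", "u5 u6", "u7"),
    ("e2", "u1", "u2", "u3", "u4 u5"),
    ("e3", "u6 u7", "u8", "u9", "u1 u2"),
]
event_index = {"e1": 0, "e2": 1, "e3": 2}  # hypothetical event -> row index

raw = np.zeros(len(event_index))
for event, yes, _other1, _other2, no in rows:
    if event in event_index:
        # yes count minus no count, both counted as space-separated IDs
        raw[event_index[event]] = len(yes.split(" ")) - len(no.split(" "))

# L1 normalization of the single column: divide by the sum of absolute values
l1 = np.abs(raw).sum()
popularity = raw / l1 if l1 > 0 else raw
print(popularity)
```

 One detail worth keeping in mind: in Python, "".split(" ") returns [""], so an empty yes or no field is counted as one entry by this scheme.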

import pandas as pd
import scipy.io as sio
eventPopularity = sio.mmread('EA_eventPopularity').todense()
pd.DataFrame(eventPopularity)

  Sample output:

  Complete code for step 5:

from collections import defaultdict
import locale, pycountry
import scipy.sparse as ss
import scipy.io as sio
import itertools
#import cPickle
# In Python 3, cPickle has been replaced by _pickle
import _pickle as cPickle
import scipy.spatial.distance as ssd
import datetime
from sklearn.preprocessing import normalize
import gzip
import numpy as np
import hashlib

# Process the user/event association data
class ProgramEntities:
    """
    We only care about the users and events that appear in train and test,
    so we focus on that association data.
    By count, train and test contain 3391 users and 13418 events in total.
    """
    def __init__(self):
        # Collect the distinct users and events in train.csv and test.csv
        uniqueUsers = set()  # all users: 3391
        uniqueEvents = set()  # all events: 13418
        eventsForUser = defaultdict(set)  # user -> events associated with that user
        usersForEvent = defaultdict(set)  # event -> users who clicked on it
        for filename in ['train.csv', 'test.csv']:
            f = open(filename)
            f.readline()  # skip the header row
            for line in f:
                cols = line.strip().split(',')
                uniqueUsers.add(cols[0])
                uniqueEvents.add(cols[1])
                eventsForUser[cols[0]].add(cols[1])
                usersForEvent[cols[1]].add(cols[0])
            f.close()

        self.userEventScores = ss.dok_matrix((len(uniqueUsers), len(uniqueEvents)))
        self.userIndex = dict()
        self.eventIndex = dict()
        for i, u in enumerate(uniqueUsers):
            self.userIndex[u] = i
        for i, e in enumerate(uniqueEvents):
            self.eventIndex[e] = i

        ftrain = open('train.csv')
        ftrain.readline()
        for line in ftrain:
            cols = line.strip().split(',')
            i = self.userIndex[cols[0]]
            j = self.eventIndex[cols[1]]
            self.userEventScores[i, j] = int(cols[4]) - int(cols[5])
        ftrain.close()
        sio.mmwrite('PE_userEventScores', self.userEventScores)

        # To avoid unnecessary computation, find all associated users and associated events.
        # Associated users: user pairs with activity on at least one common event
        # Associated events: event pairs with activity from at least one common user
        self.uniqueUserPairs = set()
        self.uniqueEventPairs = set()
        for event in uniqueEvents:
            users = usersForEvent[event]
            if len(users) > 2:
                self.uniqueUserPairs.update(itertools.combinations(users, 2))
        for user in uniqueUsers:
            events = eventsForUser[user]
            if len(events) > 2:
                self.uniqueEventPairs.update(itertools.combinations(events, 2))
        #print(self.userIndex)
        with open('PE_userIndex.pkl', 'wb') as f:
            cPickle.dump(self.userIndex, f)
        with open('PE_eventIndex.pkl', 'wb') as f:
            cPickle.dump(self.eventIndex, f)

# Data cleaning class
class DataCleaner:
    def __init__(self):
        # Helpers that convert strings to numeric values
        # Load locales
        self.localeIdMap = defaultdict(int)
        for i, l in enumerate(locale.locale_alias.keys()):
            self.localeIdMap[l] = i + 1

        # Load countries
        self.countryIdMap = defaultdict(int)
        ctryIdx = defaultdict(int)
        for i, c in enumerate(pycountry.countries):
            self.countryIdMap[c.name.lower()] = i + 1
            # Note: pycountry names the US 'United States', so this test may never match; kept as in the original
            if c.name.lower() == 'usa':
                ctryIdx['US'] = i
            if c.name.lower() == 'canada':
                ctryIdx['CA'] = i

        for cc in ctryIdx.keys():
            for s in pycountry.subdivisions.get(country_code=cc):
                self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1

        self.genderIdMap = defaultdict(int, {'male': 1, 'female': 2})

    # Locale ID
    def getLocaleId(self, locstr):
        # localeIdMap is a defaultdict(int), so a missing key locstr.lower() returns the default int 0
        return self.localeIdMap[locstr.lower()]

    # Birth year
    def getBirthYearInt(self, birthYear):
        try:
            return 0 if birthYear == 'None' else int(birthYear)
        except (ValueError, TypeError):
            return 0

    # Gender
    def getGenderId(self, genderStr):
        return self.genderIdMap[genderStr]

    # joinedAt
    def getJoinedYearMonth(self, dateString):
        dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")
        return "".join([str(dttm.year), str(dttm.month)])

    # Location
    def getCountryId(self, location):
        # The + 2 offset implies a two-character separator, so locations are
        # assumed to look like 'City  Country' (two spaces before the country)
        if isinstance(location, str) and len(location.strip()) > 0 and location.rfind('  ') > -1:
            return self.countryIdMap[location[location.rindex('  ') + 2:].lower()]
        else:
            return 0

    # Timezone
    def getTimezoneInt(self, timezone):
        try:
            return int(timezone)
        except (ValueError, TypeError):
            return 0

    def getFeatureHash(self, value):
        if len(value.strip()) == 0:
            return -1
        else:
            # return int(hashlib.sha224(value).hexdigest()[0:4], 16) fails on Python 3 with:
            # TypeError: Unicode-objects must be encoded before hashing
            return int(hashlib.sha224(value.encode('utf-8')).hexdigest()[0:4], 16)  # Python 3 requires encoding first

    def getFloatValue(self, value):
        if len(value.strip()) == 0:
            return 0.0
        else:
            return float(value)

# User-user similarity matrix
class Users:
    """
    Build the user/user similarity matrix
    """
    # ssd.correlation(u, v) computes the correlation distance between vectors u and v
    # (1 minus the Pearson correlation coefficient)
    def __init__(self, programEntities, sim=ssd.correlation):
        cleaner = DataCleaner()
        nusers = len(programEntities.userIndex.keys())  # 3391
        #print(nusers)
        fin = open('users.csv')
        colnames = fin.readline().strip().split(',')  # 7 feature columns
        self.userMatrix = ss.dok_matrix((nusers, len(colnames) - 1))  # build the sparse matrix
        for line in fin:
            cols = line.strip().split(',')
            # "Only consider users that appear in train.csv": this is the original author's
            # comment, but I don't quite understand it, since userIndex contains every user
            # from both train and test.
            if cols[0] in programEntities.userIndex:
                i = programEntities.userIndex[cols[0]]  # index of this user
                self.userMatrix[i, 0] = cleaner.getLocaleId(cols[1])  # locale
                self.userMatrix[i, 1] = cleaner.getBirthYearInt(cols[2])  # birthyear, missing values become 0
                self.userMatrix[i, 2] = cleaner.getGenderId(cols[3])  # gender
                self.userMatrix[i, 3] = cleaner.getJoinedYearMonth(cols[4])  # joinedAt
                self.userMatrix[i, 4] = cleaner.getCountryId(cols[5])  # location
                self.userMatrix[i, 5] = cleaner.getTimezoneInt(cols[6])  # timezone
        fin.close()

        # Normalize the matrix
        self.userMatrix = normalize(self.userMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('US_userMatrix', self.userMatrix)

        # Compute the user similarity matrix; it is used later
        self.userSimMatrix = ss.dok_matrix((nusers, nusers))  # (3391, 3391)
        for i in range(0, nusers):
            self.userSimMatrix[i, i] = 1.0

        for u1, u2 in programEntities.uniqueUserPairs:
            i = programEntities.userIndex[u1]
            j = programEntities.userIndex[u2]
            if (i, j) not in self.userSimMatrix:
                #print(self.userMatrix.getrow(i).todense()) e.g. [[0.00028123,0.00029847,0.00043592,0.00035208,0,0.00032346]]
                #print(self.userMatrix.getrow(j).todense()) e.g. [[0.00028123,0.00029742,0.00043592,0.00035208,0,-0.00032346]]
                usim = sim(self.userMatrix.getrow(i).todense(), self.userMatrix.getrow(j).todense())
                self.userSimMatrix[i, j] = usim
                self.userSimMatrix[j, i] = usim
        sio.mmwrite('US_userSimMatrix', self.userSimMatrix)

# Mining users' social relationships
class UserFriends:
    """
    Find each user's friends. The idea is very simple:
    1) If you have more friends, you are probably outgoing and more likely to attend events
    2) If your friends attend an event, you may well go along too
    """
    def __init__(self, programEntities):
        nusers = len(programEntities.userIndex.keys())  # 3391
        self.numFriends = np.zeros((nusers))  # array([0., 0., 0., ..., 0., 0., 0.]), number of friends per user
        self.userFriends = ss.dok_matrix((nusers, nusers))
        fin = gzip.open('user_friends.csv.gz')
        print('Header In User_friends.csv.gz:', fin.readline())
        ln = 0
        # Read user_friends.csv.gz line by line.
        # Check whether the user in the first column is in userIndex; only those users matter to us.
        # Get that user's index and number of friends.
        # For each friend who is also in userIndex, get the friend's index, then look up
        # the friend's response to every event in userEventScores.
        # score is that friend's average score over all events.
        # The userFriends matrix records the score between a user and each friend.
        # Example: user 851286067:1750 appears in test.csv and has 2151 friends in user_friends.csv.gz,
        # so that user's share of friends is 2151 / sumNumFriends = 2151 / 3731377 = 0.0005764627910822198
        for line in fin:
            if ln % 200 == 0:
                print('Loading line:', ln)
            cols = line.decode().strip().split(',')
            user = cols[0]
            if user in programEntities.userIndex:
                friends = cols[1].split(' ')  # this user's friend list
                i = programEntities.userIndex[user]
                self.numFriends[i] = len(friends)
                for friend in friends:
                    if friend in programEntities.userIndex:
                        j = programEntities.userIndex[friend]
                        # the objective of this score is to infer the degree to
                        # and direction in which this friend will influence the
                        # user's decision, so we sum the user/event score for
                        # this user across all training events
                        eventsForUser = programEntities.userEventScores.getrow(j).todense()  # the friend's response to each event: 0, 1, or -1
                        #print(eventsForUser.sum(), np.shape(eventsForUser)[1])
                        # score is the friend's average score over the 13418 events
                        score = eventsForUser.sum() / np.shape(eventsForUser)[1]  # np.shape(eventsForUser)[1] = 13418
                        #print(score)
                        self.userFriends[i, j] += score
                        self.userFriends[j, i] += score
            ln += 1
        fin.close()
        # Normalize the array
        sumNumFriends = self.numFriends.sum(axis=0)  # total friend count over all users
        #print(sumNumFriends)
        self.numFriends = self.numFriends / sumNumFriends  # each user's share of the total friend count
        sio.mmwrite('UF_numFriends', np.matrix(self.numFriends))
        self.userFriends = normalize(self.userFriends, norm='l1', axis=0, copy=False)
        sio.mmwrite('UF_userFriends', self.userFriends)

# Build event-event similarity data
class Events:
    """
    Build event-event similarity. Note that there are two kinds of similarity here:
    1) similarity computed from user-event behavior, as in collaborative filtering
    2) similarity computed from the content of the events themselves (event information)
    """
    def __init__(self, programEntities, psim=ssd.correlation, csim=ssd.cosine):
        cleaner = DataCleaner()
        fin = gzip.open('events.csv.gz')
        fin.readline()  # skip header
        nevents = len(programEntities.eventIndex)
        print(nevents)  # 13418
        self.eventPropMatrix = ss.dok_matrix((nevents, 7))
        self.eventContMatrix = ss.dok_matrix((nevents, 100))
        ln = 0
        for line in fin:
            #if ln > 10:
                #break
            cols = line.decode().strip().split(',')
            eventId = cols[0]
            if eventId in programEntities.eventIndex:
                i = programEntities.eventIndex[eventId]
                self.eventPropMatrix[i, 0] = cleaner.getJoinedYearMonth(cols[2])  # start_time
                self.eventPropMatrix[i, 1] = cleaner.getFeatureHash(cols[3])  # city
                self.eventPropMatrix[i, 2] = cleaner.getFeatureHash(cols[4])  # state
                self.eventPropMatrix[i, 3] = cleaner.getFeatureHash(cols[5])  # zip
                self.eventPropMatrix[i, 4] = cleaner.getFeatureHash(cols[6])  # country
                self.eventPropMatrix[i, 5] = cleaner.getFloatValue(cols[7])  # lat
                self.eventPropMatrix[i, 6] = cleaner.getFloatValue(cols[8])  # lon
                for j in range(9, 109):
                    self.eventContMatrix[i, j-9] = cols[j]
                ln += 1
        fin.close()

        self.eventPropMatrix = normalize(self.eventPropMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('EV_eventPropMatrix', self.eventPropMatrix)
        self.eventContMatrix = normalize(self.eventContMatrix, norm='l1', axis=0, copy=False)
        sio.mmwrite('EV_eventContMatrix', self.eventContMatrix)

        # calculate similarity between event pairs based on the two matrices
        self.eventPropSim = ss.dok_matrix((nevents, nevents))
        self.eventContSim = ss.dok_matrix((nevents, nevents))
        for e1, e2 in programEntities.uniqueEventPairs:
            i = programEntities.eventIndex[e1]
            j = programEntities.eventIndex[e2]
            if not ((i, j) in self.eventPropSim):
                epsim = psim(self.eventPropMatrix.getrow(i).todense(), self.eventPropMatrix.getrow(j).todense())
                if np.isnan(epsim):
                    epsim = 0
                self.eventPropSim[i, j] = epsim
                self.eventPropSim[j, i] = epsim

            if not ((i, j) in self.eventContSim):
                # If either of the two vectors is all zeros, the result is nan. For example:
                #   import numpy as np
                #   a = np.array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])
                #   b = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
                #   from scipy.spatial.distance import cosine
                #   temp = cosine(a, b)
                # produces the following:
                #   Warning (from warnings module):
                #   File "D:\Python35\lib\site-packages\scipy\spatial\distance.py", line 644
                #   dist = 1.0 - uv / np.sqrt(uu * vv)
                #   RuntimeWarning: invalid value encountered in double_scalars
                ecsim = csim(self.eventContMatrix.getrow(i).todense(), self.eventContMatrix.getrow(j).todense())
                if np.isnan(ecsim):
                    ecsim = 0
                self.eventContSim[i, j] = ecsim
                self.eventContSim[j, i] = ecsim

        sio.mmwrite('EV_eventPropSim', self.eventPropSim)
        sio.mmwrite('EV_eventContSim', self.eventContSim)

class EventAttendees:
    """
    Count how many people will and will not attend each event, as groundwork for event popularity
    """
    def __init__(self, programEntities):
        nevents = len(programEntities.eventIndex)  # 13418
        self.eventPopularity = ss.dok_matrix((nevents, 1))
        f = gzip.open('event_attendees.csv.gz')
        f.readline()  # skip header
        for line in f:
            cols = line.decode().strip().split(',')
            eventId = cols[0]
            if eventId in programEntities.eventIndex:
                i = programEntities.eventIndex[eventId]
                # yes count minus no count, i.e. attendees minus non-attendees
                self.eventPopularity[i, 0] = len(cols[1].split(' ')) - len(cols[4].split(' '))
        f.close()

        self.eventPopularity = normalize(self.eventPopularity, norm='l1', axis=0, copy=False)
        sio.mmwrite('EA_eventPopularity', self.eventPopularity)

def data_prepare():
    """
    Compute and generate all the data, stored as matrices or in other forms
    so that later steps can extract features and build models
    """
    print('Step 1: collecting user and event statistics...')
    pe = ProgramEntities()
    print('Step 1 done...\n')

    print('Step 2: computing user similarity and storing it as matrices...')
    Users(pe)
    print('Step 2 done...\n')

    print('Step 3: computing user social relationship data and storing it...')
    UserFriends(pe)
    print('Step 3 done...\n')

    print('Step 4: computing event similarity and storing it as matrices...')
    Events(pe)
    print('Step 4 done...\n')

    print('Step 5: computing event popularity...')
    EventAttendees(pe)
    print('Step 5 done...\n')

# Run the data preparation
data_prepare()
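
 The Events class above replaces a nan similarity with 0 when one of the two vectors is all zeros. The same guard can be sketched in plain NumPy. Note that safe_cosine_distance is a hypothetical helper written for illustration, not a SciPy function and not the code the article uses; it checks the denominator up front instead of testing for nan afterwards.

```python
import numpy as np

def safe_cosine_distance(u, v):
    """Cosine distance 1 - u.v / (|u| * |v|); returns 0.0 where it is undefined."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # At least one vector is all zeros: the ratio would be 0/0,
        # so fall back to 0 as the article's code does for nan
        return 0.0
    return 1.0 - float(np.dot(u, v)) / denom

a = np.array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])
b = np.zeros(10)
print(safe_cosine_distance(a, b))
print(safe_cosine_distance(a, a))
```

 Checking the denominator first also avoids the RuntimeWarning quoted in the comment.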

 

 This completes the preprocessing of the data and the saving of the results.
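
 As a rough sketch of how a later step might read these saved results back, the round trip below writes a tiny popularity matrix and an event index to a temporary directory and loads them again. Everything in it is illustrative: the three-event data, the explicit .mtx file name, and the popularity_of helper are assumptions, and the standard pickle module stands in for _pickle.

```python
import os
import pickle
import tempfile

import scipy.io as sio
import scipy.sparse as ss

workdir = tempfile.mkdtemp()
mtx_path = os.path.join(workdir, "EA_eventPopularity.mtx")
pkl_path = os.path.join(workdir, "PE_eventIndex.pkl")

# Write a toy (nevents, 1) popularity matrix and a toy event -> row index
toy_popularity = ss.dok_matrix((3, 1))
toy_popularity[0, 0] = 0.5
toy_popularity[2, 0] = -0.5
sio.mmwrite(mtx_path, toy_popularity)
with open(pkl_path, "wb") as f:
    pickle.dump({"e1": 0, "e2": 1, "e3": 2}, f)

# Read both back; tocsr() allows indexing individual entries
event_popularity = sio.mmread(mtx_path).tocsr()
with open(pkl_path, "rb") as f:
    event_index = pickle.load(f)

def popularity_of(event_id):
    """Look up the normalized popularity of one event (hypothetical helper)."""
    return event_popularity[event_index[event_id], 0]

print(popularity_of("e1"), popularity_of("e2"), popularity_of("e3"))
```

 Keeping the index dictionary next to the matrix matters because the matrix rows are only meaningful through the event-to-row mapping that was used when it was built.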

 Next, we turn to feature construction: Event Recommendation Engine Challenge Step-by-Step Walkthrough: Step 6
