Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 2

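I. Before You Start

This article builds on Step 1 of the Event Recommendation Engine Challenge step-by-step walkthrough; please read that article first.

II. Computing User Similarity

Step 2: compute user similarity information

This step uses users.csv, so let's first look at its contents (the first 10 rows):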

import pandas as pd

df_users = pd.read_csv('users.csv')

df_users.head(10)

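The columns include locale and location (country/region) information, stored as strings.

We use the locale and pycountry modules to convert these strings to numbers.

First, the locale.locale_alias dictionary: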

import locale

locale.locale_alias

Next, let's see how each column of users.csv is processed and converted to a numeric value.

1) Processing the locale column

import locale

from collections import defaultdict

localeIdMap = defaultdict(int)

for i, l in enumerate(locale.locale_alias.keys()):

    localeIdMap[l] = i + 1

for each in localeIdMap:

    print(each, '\t', localeIdMap[each])

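This prints every locale key with its ID. Passing a locale string to localeIdMap therefore converts it to a number. If the string is not among the keys of localeIdMap, the lookup returns 0, which is exactly what defaultdict(int) is for.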

print(localeIdMap['en_GB'.lower()])

print(localeIdMap['en_US'.lower()])

print(localeIdMap['id_ID'.lower()])

print(localeIdMap['ka_GE'.lower()])

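The exact IDs printed depend on the Python version, because they follow the iteration order of locale.locale_alias.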
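One caveat with defaultdict(int): looking up a missing key with square brackets also inserts that key with the value 0, so the map grows as unknown locales are queried. The following is a minimal sketch of that behavior and of .get(key, 0) as a read-only alternative. The two-entry locale_id_map is a made-up stand-in for illustration, not the real locale.locale_alias data.

```python
from collections import defaultdict

# Hypothetical two-entry map standing in for the full localeIdMap.
locale_id_map = defaultdict(int, {"en_us": 1, "id_id": 2})

size_before = len(locale_id_map)

# Square-bracket lookup of an unknown key returns 0 AND stores the key.
missing_via_index = locale_id_map["xx_yy"]
size_after_index = len(locale_id_map)

# .get() returns the fallback without modifying the map.
missing_via_get = locale_id_map.get("zz_zz", 0)
size_after_get = len(locale_id_map)

print(missing_via_index, size_before, size_after_index)
print(missing_via_get, size_after_get)
```

For a one-off conversion of a few thousand users the extra keys are harmless, but .get() keeps the map unchanged if it is reused or saved later.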

2) Processing the birthyear column

This column is simple: if a value is present, convert it to an integer; otherwise fill it with 0.

def getBirthYearInt(birthYear):

    try:

        return 0 if birthYear=="None" else int(birthYear)

    except (TypeError, ValueError):

        return 0

print(getBirthYearInt(1992))

print(getBirthYearInt(None))

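Output: 1992 and 0 (int(None) raises a TypeError, which the except branch turns into 0).

3) Processing the gender column: male becomes 1, female becomes 2, and missing values become 0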

from collections import defaultdict

genderIdMap = defaultdict(int, {'male':1, 'female':2})

print(genderIdMap['male'])

print(genderIdMap['female'])

print(genderIdMap[None])

Output: 1, 2 and 0.

4) Processing the joinedAt column

The values in this column share a common pattern:

import pandas as pd

df_users = pd.read_csv('users.csv')

df_users['joinedAt']

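Each value is either None or a timestamp string with a T in the middle and a Z at the end. Based on this pattern, we use the datetime module to extract the date information: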

import datetime

def getJoinedYearMonth(dateString):

    try:

        dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")

        return "".join( [str(dttm.year), str(dttm.month)] )

    except (TypeError, ValueError):

        return 0

df_users['joinedAt'].map(getJoinedYearMonth)

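Note the case where dateString is None: the value must match the format above for parsing to succeed, which is why the call is wrapped in try. The 0 returned from the except branch is only a placeholder; it is not necessarily an appropriate fill value.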
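Because str(dttm.month) is not zero-padded, January 2012 becomes "20121" while October 2012 becomes "201210", so the resulting values do not sort chronologically. A possible variant, sketched below under the assumption that a sortable integer is wanted, uses strftime("%Y%m") and catches only the exceptions that strptime raises for missing or malformed input. The name joined_year_month is hypothetical and is not part of the original code.

```python
import datetime

JOINED_AT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

def joined_year_month(date_string):
    """Return the join date as an integer YYYYMM, or 0 if it cannot be parsed."""
    try:
        dttm = datetime.datetime.strptime(date_string, JOINED_AT_FORMAT)
    except (TypeError, ValueError):
        # TypeError: the value is not a string (e.g. None).
        # ValueError: the string does not match the format (e.g. "None").
        return 0
    return int(dttm.strftime("%Y%m"))

print(joined_year_month("2012-01-15T08:30:00.000Z"))
print(joined_year_month("2012-10-02T06:40:55.524Z"))
print(joined_year_month(None))
```

Switching to this variant would change the numbers fed into the similarity computation, so it is offered as an option rather than a drop-in replacement.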

5) Processing the location column

Let's look at the location column in users.csv (first 20 rows):

df_users['location'].head(20)

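Non-empty values consist of a place name, two spaces, and then a country or region, as in 'Jogjakarta  Indonesia'.

We use the pycountry module to convert this column to a number. pycountry.countries is iterable: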

import pycountry

for i, c in enumerate(pycountry.countries):

    print(i, c)

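Each item is a country record with attributes such as name and alpha_2. We map each lowercase country name to a numeric ID: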

import pycountry

from collections import defaultdict

countryIdMap = defaultdict(int)

for i, c in enumerate(pycountry.countries):

    countryIdMap[c.name.lower()] = i + 1

for each in countryIdMap:

    print(each, '\t', countryIdMap[each])

This prints each lowercase country name with its ID.

Now let's see how the location column is converted to a number:

import pycountry

from collections import defaultdict

countryIdMap = defaultdict(int)

for i, c in enumerate(pycountry.countries):

    countryIdMap[c.name.lower()] = i + 1

def getCountryId(location):

    if (isinstance( location, str)) and len(location.strip()) > 0 and location.rfind('  ') > -1:

        return countryIdMap[ location[location.rindex('  ') + 2: ].lower() ]

    else:

        return 0

print(getCountryId('San Dimas  California'))

print(getCountryId('Jogjakarta  Indonesia'))

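The first call prints 0, because 'california' is not a country name in this map; the second prints the ID assigned to Indonesia.

In other words: if location is a non-empty string and searching from the right for two consecutive spaces returns an index greater than -1 (so the double space exists), we take the text after that double space and convert it to its numeric ID.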
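The same extraction can also be written with str.rsplit, which splits once from the right on the two-space separator. This is only a sketch: country_ids below is a small hand-made dictionary used for illustration, not the pycountry-based countryIdMap, and country_id_from_location is a hypothetical name. Unlike the original, it also strips stray whitespace from the extracted name.

```python
# Hypothetical, hand-made ID table (the article builds the real one from pycountry).
country_ids = {"indonesia": 1, "united states": 2}

def country_id_from_location(location, id_map):
    """Look up the text after the last double space; return 0 if not found."""
    if not isinstance(location, str) or "  " not in location:
        return 0
    # rsplit(sep, 1) splits once, starting from the right-hand side.
    _, tail = location.rsplit("  ", 1)
    return id_map.get(tail.strip().lower(), 0)

print(country_id_from_location("Jogjakarta  Indonesia", country_ids))
print(country_id_from_location("San Dimas  California", country_ids))
print(country_id_from_location(None, country_ids))
```

Using id_map.get(..., 0) instead of a defaultdict lookup keeps unknown place names from being added to the table.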

6) Processing the timezone column: this one is simple; convert the value to int if it is present, otherwise fill it with 0

def getTimezoneInt(timezone):

    try:

        return int(timezone)

    except (TypeError, ValueError):

        return 0

print(getTimezoneInt(-240))#-240

print(getTimezoneInt(240))

print(getTimezoneInt(None))
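getBirthYearInt and getTimezoneInt follow the same pattern: try int(), and fall back to 0. A minimal sketch of a shared helper, under the hypothetical name to_int_or_zero, that catches only the two exceptions int() raises for the inputs seen here:

```python
def to_int_or_zero(value):
    """Convert value to int; return 0 for None, "None", or other non-numeric input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings such as "None" or "".
        return 0

print(to_int_or_zero("1992"))
print(to_int_or_zero(-240))
print(to_int_or_zero(None))
print(to_int_or_zero("None"))
```

Catching specific exceptions instead of using a bare except means an unrelated error, such as a KeyboardInterrupt, is not silently converted into 0.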

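7) Normalizing the columns processed in 1) through 6)

self.userMatrix is normalized with sklearn.preprocessing.normalize(); once the columns are on a comparable scale, it is easier to compute the similarity between two users.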
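With norm='l1' and axis=0, the arguments used in the complete code further down, normalize() divides every column by the sum of the absolute values in that column. Below is a pure-Python sketch of that arithmetic on a made-up 3x2 matrix. It is for illustration only; the real code calls scikit-learn on a sparse matrix, and l1_normalize_columns is a hypothetical name.

```python
def l1_normalize_columns(rows):
    """Divide each column by the sum of absolute values in that column."""
    n_cols = len(rows[0])
    col_norms = [sum(abs(row[j]) for row in rows) for j in range(n_cols)]
    return [
        # An all-zero column is left as zeros instead of dividing by zero.
        [row[j] / col_norms[j] if col_norms[j] else 0.0 for j in range(n_cols)]
        for row in rows
    ]

# Made-up example: column 0 holds birth years, column 1 holds timezone offsets.
matrix = [[1990, -240], [1985, 480], [0, 240]]
normalized = l1_normalize_columns(matrix)
for row in normalized:
    print(row)
```

After this step the absolute values in each column sum to 1, which explains why the normalized entries for 3391 users are on the order of 0.0003.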

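Only the uniqueUserPairs from Step 1 of the walkthrough are computed here; those users are linked because they are associated with the same event.

Similarity is computed with scipy.spatial.distance.correlation(u, v), which returns the correlation distance between vectors u and v, that is, 1 minus the Pearson correlation coefficient (centered cosine). Note that this is a distance, not the coefficient itself: 0 means perfectly correlated and 2 means perfectly anti-correlated.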
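To make the definition concrete, here is a pure-Python sketch of the correlation distance formula. It assumes neither vector is constant (a constant vector has zero variance and would cause a division by zero), and it is an illustration of the formula rather than SciPy's implementation; correlation_distance is a hypothetical name.

```python
import math

def correlation_distance(u, v):
    """Return 1 - Pearson correlation of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    # Center both vectors, then take the cosine of the centered vectors.
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    dot = sum(a * b for a, b in zip(du, dv))
    norm_u = math.sqrt(sum(a * a for a in du))
    norm_v = math.sqrt(sum(b * b for b in dv))
    return 1.0 - dot / (norm_u * norm_v)

# Perfectly correlated vectors give a distance near 0.
same = correlation_distance([1, 2, 3, 4], [2, 4, 6, 8])
# Perfectly anti-correlated vectors give a distance near 2.
opposite = correlation_distance([1, 2, 3, 4], [8, 6, 4, 2])
print(same, opposite)
```

If a similarity score is wanted rather than a distance, 1 minus this value recovers the Pearson coefficient.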

8) Complete code for Step 2

from collections import defaultdict

import locale, pycountry

import scipy.sparse as ss

import scipy.io as sio

import itertools

#import cPickle

#In Python 3, cPickle was merged into pickle, which uses the C implementation automatically

import pickle as cPickle

import scipy.spatial.distance as ssd

import datetime

from sklearn.preprocessing import normalize

class ProgramEntities:

    """

    We only care about the users and events that appear in train and test, so we focus on that related data.

    By count, train and test contain 3391 users and 13418 events in total.

    """

    def __init__(self):

        #collect the distinct users and events in train.csv and test.csv

        uniqueUsers = set()#all distinct users: 3391

        uniqueEvents = set()#all distinct events: 13418

        eventsForUser = defaultdict(set)#maps each user to the set of events for that user

        usersForEvent = defaultdict(set)#maps each event to the set of users who acted on it

        for filename in ['train.csv', 'test.csv']:

            f = open(filename)

            f.readline()#skip the header row

            for line in f:

                cols = line.strip().split(',')

                uniqueUsers.add( cols[0] )

                uniqueEvents.add( cols[1] )

                eventsForUser[cols[0]].add( cols[1] )

                usersForEvent[cols[1]].add( cols[0] )

            f.close()

        self.userEventScores = ss.dok_matrix( ( len(uniqueUsers), len(uniqueEvents) ) )

        self.userIndex = dict()

        self.eventIndex = dict()

        for i, u in enumerate(uniqueUsers):

            self.userIndex[u] = i

        for i, e in enumerate(uniqueEvents):

            self.eventIndex[e] = i

        ftrain = open('train.csv')

        ftrain.readline()

        for line in ftrain:

            cols = line.strip().split(',')

            i = self.userIndex[ cols[0] ]

            j = self.eventIndex[ cols[1] ]

            self.userEventScores[i, j] = int( cols[4] ) - int( cols[5] )

        ftrain.close()

        sio.mmwrite('PE_userEventScores', self.userEventScores)

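        #to avoid unnecessary computation, find all related users and all related events

        #related users are user pairs that acted on at least one common event

        #related events are event pairs that at least one common user acted on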

        self.uniqueUserPairs = set()

        self.uniqueEventPairs = set()

        for event in uniqueEvents:

            users = usersForEvent[event]

            if len(users) > 2:

                self.uniqueUserPairs.update( itertools.combinations(users, 2) )

        for user in uniqueUsers:

            events = eventsForUser[user]

            if len(events) > 2:

                self.uniqueEventPairs.update( itertools.combinations(events, 2) )

        #print(self.userIndex)

        cPickle.dump( self.userIndex, open('PE_userIndex.pkl', 'wb'))

        cPickle.dump( self.eventIndex, open('PE_eventIndex.pkl', 'wb') )

#Data cleaning class

class DataCleaner:

    def __init__(self):

        #helpers that convert strings to numbers

        #load locales

        self.localeIdMap = defaultdict(int)

        for i, l in enumerate(locale.locale_alias.keys()):

            self.localeIdMap[l] = i + 1

        #load countries

        self.countryIdMap = defaultdict(int)

        ctryIdx = defaultdict(int)

        for i, c in enumerate(pycountry.countries):

            self.countryIdMap[c.name.lower()] = i + 1

            if c.alpha_2 == 'US':#pycountry names this country 'United States', so match on the ISO code instead of 'usa'

                ctryIdx['US'] = i

            if c.alpha_2 == 'CA':

                ctryIdx['CA'] = i

        for cc in ctryIdx.keys():

            for s in pycountry.subdivisions.get(country_code=cc):

                self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1

        self.genderIdMap = defaultdict(int, {'male':1, 'female':2})

    #locale ID

    def getLocaleId(self, locstr):

        #localeIdMap is a defaultdict(int), so a missing locstr.lower() key returns the default 0

        return self.localeIdMap[ locstr.lower() ]

    #birthyear

    def getBirthYearInt(self, birthYear):

        try:

            return 0 if birthYear == 'None' else int(birthYear)

        except (TypeError, ValueError):

            return 0

    #gender

    def getGenderId(self, genderStr):

        return self.genderIdMap[genderStr]

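    #joinedAt: returns 0 when the value is missing or does not match the format

    def getJoinedYearMonth(self, dateString):

        try:

            dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")

            return "".join( [str(dttm.year), str(dttm.month) ] )

        except (TypeError, ValueError):

            return 0

    #location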

    def getCountryId(self, location):

        if (isinstance( location, str)) and len(location.strip()) > 0 and location.rfind('  ') > -1:

            return self.countryIdMap[ location[location.rindex('  ') + 2: ].lower() ]

        else:

            return 0

    #timezone

    def getTimezoneInt(self, timezone):

        try:

            return int(timezone)

        except (TypeError, ValueError):

            return 0

class Users:

    """

    Build the user/user similarity matrix

    """

    def __init__(self, programEntities, sim=ssd.correlation):#ssd.correlation(u, v) returns the correlation distance between u and v, i.e. 1 - Pearson correlation

        cleaner = DataCleaner()

        nusers = len(programEntities.userIndex.keys())#3391

        #print(nusers)

        fin = open('users.csv')

        colnames = fin.readline().strip().split(',') #7 columns

        self.userMatrix = ss.dok_matrix( (nusers, len(colnames)-1 ) )#sparse matrix: one row per user, 6 feature columns

        for line in fin:

            cols = line.strip().split(',')

            #The original author's comment says "only consider users that appear in train.csv". I do not fully follow this,

            #because userIndex contains every user from both train and test.

            if cols[0] in programEntities.userIndex:

                i = programEntities.userIndex[ cols[0] ]#row index for this user

                self.userMatrix[i, 0] = cleaner.getLocaleId( cols[1] )#locale

                self.userMatrix[i, 1] = cleaner.getBirthYearInt( cols[2] )#birthyear, 0 when missing

                self.userMatrix[i, 2] = cleaner.getGenderId( cols[3] )#gender

                self.userMatrix[i, 3] = cleaner.getJoinedYearMonth( cols[4] )#joinedAt

                self.userMatrix[i, 4] = cleaner.getCountryId( cols[5] )#location

                self.userMatrix[i, 5] = cleaner.getTimezoneInt( cols[6] )#timezone

        fin.close()

        #normalize the matrix (L1 norm per column)

        self.userMatrix = normalize(self.userMatrix, norm='l1', axis=0, copy=False)

        sio.mmwrite('US_userMatrix', self.userMatrix)

        #compute the user similarity matrix, which later steps use

        self.userSimMatrix = ss.dok_matrix( (nusers, nusers) )#(3391,3391)

        for i in range(0, nusers):

            self.userSimMatrix[i, i] = 1.0

        for u1, u2 in programEntities.uniqueUserPairs:

            i = programEntities.userIndex[u1]

            j = programEntities.userIndex[u2]

            if (i, j) not in self.userSimMatrix:

                #print(self.userMatrix.getrow(i).todense()) e.g. [[0.00028123,0.00029847,0.00043592,0.00035208,0,0.00032346]]

                #print(self.userMatrix.getrow(j).todense()) e.g. [[0.00028123,0.00029742,0.00043592,0.00035208,0,-0.00032346]]

                #recent SciPy versions require 1-D inputs here, so flatten each row

                usim = sim(self.userMatrix.getrow(i).toarray().ravel(), self.userMatrix.getrow(j).toarray().ravel())

                self.userSimMatrix[i, j] = usim

                self.userSimMatrix[j, i] = usim

        sio.mmwrite('US_userSimMatrix', self.userSimMatrix)  

print('Step 1: collecting user and event statistics...')

pe = ProgramEntities()

print('Step 1 done...\n')

print('Step 2: computing user similarity and storing it as matrices...')

Users(pe)

print('Step 2 done...\n')

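A brief description of the variables used in this step:

self.userMatrix: sparse user matrix of shape (3391, 6). 3391 is the number of users that appear in train and test; the 6 columns are the 6 user attributes that follow the user ID in users.csv, already converted to numbers and normalized.

self.userSimMatrix: sparse user similarity matrix of shape (3391, 3391). Only the uniqueUserPairs from Step 1 are computed, because we only need the similarity between users who responded to the same event; uniqueUserPairs holds exactly those pairs.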

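import pandas as pd, scipy.io as sio

round(pd.DataFrame(sio.mmread('US_userSimMatrix.mtx').toarray()))#load the matrix that mmwrite() saved and view it as a dense table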

The result is userSimMatrix, the sparse user similarity matrix, which holds the similarity between users who responded to the same event.
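As a recap of where those user pairs come from, the sketch below builds them from a handful of made-up (user, event) interactions with itertools.combinations. It differs from the complete code in two labeled ways: it sorts the users of each event so that a pair can only appear in one order, and it considers every event (combinations simply yields nothing for an event with a single user), whereas the complete code only processes events with len(users) > 2.

```python
import itertools
from collections import defaultdict

# Made-up (user, event) interactions for illustration.
interactions = [("u1", "e1"), ("u2", "e1"), ("u3", "e1"), ("u3", "e2"), ("u4", "e3")]

users_for_event = defaultdict(set)
for user, event in interactions:
    users_for_event[event].add(user)

unique_user_pairs = set()
for users in users_for_event.values():
    # Sorting gives each pair a canonical order, so (a, b) and (b, a) cannot both appear.
    unique_user_pairs.update(itertools.combinations(sorted(users), 2))

print(sorted(unique_user_pairs))
```

Only e1 has more than one user here, so the three pairs all come from that event; u4 never shares an event and gets no similarity entries beyond the diagonal.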

That completes Step 2. If anything is unclear, please leave a comment.

We continue with Step 3 of the Event Recommendation Engine Challenge step-by-step walkthrough.
