Cohort Analysis是将某一个时期内的用户划分为一个cohort，并将多个cohort进行时间上的某个属性的比较的一种分析方法。Cohort Analysis在有些场景下非常有用。比如一个网站或App，在某个连续的4周里陆续更新或新增了一个功能或设计，想要知道这些功能和设计上的改动对用户的影响，就可以将每周的新注册作为一个cohort，观察这4个cohort在接下来的一段实际里的行为数据，就可以很清楚地观察到4个改动的影响。

最近要做Cohort Analysis，数据都在数据库里，就直接想用Python接数据库直接分析了。Google到一篇讲得很清楚的文章，学着用Python实现一遍。示例数据从这里下载。

1. 读取数据

据说 from ... import ... 的方式在性能上有弊端，一般都推荐直接 import。示例数据是典型的购物数据，按客户的第一次消费时间将客户分为不同的cohort。

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_excel('relay-foods.xlsx', sheet='Purchase Data')

df.head()

	OrderId	OrderDate	UserId	TotalCharges	CommonId	PupId	PickupDate
0	262	2009-01-11	47	50.67	TRQKD	2	2009-01-12
1	278	2009-01-20	47	26.60	4HH2S	3	2009-01-20
2	294	2009-02-03	47	38.71	3TRDC	2	2009-02-04
3	301	2009-02-06	47	53.38	NGAZJ	2	2009-02-09
4	302	2009-02-06	47	14.28	FFYHD	2	2009-02-09

2. 确定 `OrderDate` 的月份，根据 `OrderDate` 分群

分别创建两个字段 OrderPeriod 和 CohortGroup。OrderPeriod是购买日期的月份，CohortGroup是根据购买日期对 UserId 的分群。

df['OrderPeriod'] = df.OrderDate.map(lambda x: x.strftime('%Y-%m'))

df = df.assign(OrderPeriod = df.OrderDate.map(lambda x: x.strftime('%Y-%m'))) \

       .set_index('UserId')

# goupby(level=0): level 是index的level， 对于multiIndex， 可用level=0或1指定根据那个index来group

df = df.assign(CohortGroup = df.groupby(level=0).OrderDate.min().apply(lambda x: x.strftime('%Y-%m'))) \

       .reset_index()

df.head()

	UserId	OrderId	OrderDate	TotalCharges	CommonId	PupId	PickupDate	OrderPeriod	CohortGroup
0	47	262	2009-01-11	50.67	TRQKD	2	2009-01-12	2009-01	2009-01
1	47	278	2009-01-20	26.60	4HH2S	3	2009-01-20	2009-01	2009-01
2	47	294	2009-02-03	38.71	3TRDC	2	2009-02-04	2009-02	2009-01
3	47	301	2009-02-06	53.38	NGAZJ	2	2009-02-09	2009-02	2009-01
4	47	302	2009-02-06	14.28	FFYHD	2	2009-02-09	2009-02	2009-01

3. 计算每个CohortGroup在各个OrderPeriod的用户量

# pd.Series.nunique --> Return number of unique elements in the object.

cohorts = df.groupby(['CohortGroup', 'OrderPeriod']) \

            .agg({'UserId': pd.Series.nunique,

                 'OrderId': pd.Series.nunique,

                 'TotalCharges': 'sum'})

cohorts.rename(columns={'UserId': 'TotalUsers', 'OrderId': 'TotalOrders'}, inplace=True)

cohorts.head()

		TotalCharges	TotalOrders	TotalUsers
CohortGroup	OrderPeriod
2009-01	2009-01	1850.255	30	22
	2009-02	1351.065	25	8
	2009-03	1357.360	26	10
	2009-04	1604.500	28	9
	2009-05	1575.625	26	10

4. 标记每个CohortGroup的Cohort时期

比如，对于2009-01 cohort，其第一个时期是 2009-01，第二到第五个时期为 2009-02，...，2009-05。这里需要将每个CohortGroup的OrderPeriod对应到其第1，2，...个时期。

def cohort_period(df):

    df['CohortPeriod'] = np.arange(len(df)) + 1

    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)

cohorts.head()

		TotalCharges	TotalOrders	TotalUsers	CohortPeriod
CohortGroup	OrderPeriod
2009-01	2009-01	1850.255	30	22	1
	2009-02	1351.065	25	8	2
	2009-03	1357.360	26	10	3
	2009-04	1604.500	28	9	4
	2009-05	1575.625	26	10	5

上面 level=0 实际上就是对 group by CohortGroup，然后对每个group结果 apply cohort_period。groupby 后的结果是这样的：

[(k, v) for k, v in cohorts.head(5).groupby(level=0)]

[('2009-01',

                           TotalCharges  TotalOrders  TotalUsers  CohortPeriod

  CohortGroup OrderPeriod

  2009-01     2009-01          1850.255           30          22             1

              2009-02          1351.065           25           8             2

              2009-03          1357.360           26          10             3

              2009-04          1604.500           28           9             4

              2009-05          1575.625           26          10             5)]

5. 确定分群后的结果正确

x = df[(df.CohortGroup=='2009-01') & (df.OrderPeriod=='2009-01')]

y = cohorts.ix[('2009-01', '2009-01')]

assert(x.UserId.nunique()==y.TotalUsers)

assert(x.OrderId.nunique()==y.TotalOrders)

assert(x.TotalCharges.sum()==y.TotalCharges)

x = df[(df.CohortGroup=='2009-03') & (df.OrderPeriod=='2009-05')]

y = cohorts.ix[('2009-03', '2009-05')]

assert(x.UserId.nunique()==y.TotalUsers)

assert(x.OrderId.nunique()==y.TotalOrders)

assert(x.TotalCharges.sum()==y.TotalCharges)

每个CohortGroup的留存

6. 计算每个CohortGroup在第一个CohortPeriod的用户数量

cohorts = cohorts.reset_index() \

                 .set_index(['CohortGroup', 'CohortPeriod'])

cohort_group_size = cohorts.TotalUsers.groupby(level=0).first()

cohort_group_size

CohortGroup

2009-01     22

2009-02     15

2009-03     13

2009-04     39

2009-05     50

2009-06     32

2009-07     50

2009-08     31

2009-09     37

2009-10     54

2009-11    130

2009-12     65

2010-01     95

2010-02    100

2010-03     24

Name: TotalUsers, dtype: int64

7. 计算每个CohortPeriod的留存率

user_retention = cohorts.TotalUsers.unstack(0).divide(cohort_group_size, axis=1)

user_retention.head()

CohortGroup	2009-01	2009-02	2009-03	2009-04	2009-05	2009-06	2009-07	2009-08	2009-09	2009-10	2009-11	2009-12	2010-01	2010-02	2010-03
CohortPeriod
1	1.000000	1.000000	1.000000	1.000000	1.00	1.00000	1.00	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.00	1.0
2	0.363636	0.200000	0.307692	0.333333	0.26	0.46875	0.46	0.354839	0.405405	0.314815	0.246154	0.261538	0.526316	0.19	NaN
3	0.454545	0.333333	0.384615	0.256410	0.24	0.28125	0.26	0.290323	0.378378	0.222222	0.200000	0.276923	0.273684	NaN	NaN
4	0.409091	0.066667	0.307692	0.333333	0.10	0.18750	0.20	0.225806	0.216216	0.240741	0.223077	0.107692	NaN	NaN	NaN
5	0.454545	0.266667	0.076923	0.153846	0.08	0.21875	0.22	0.193548	0.351351	0.240741	0.100000	NaN	NaN	NaN	NaN

留存率曲线

user_retention[['2009-01', '2009-05', '2009-08']] \

  .plot(figsize=(11, 6), color=['#4285f4', '#EA4335', '#A60628'])

plt.title("Cohorts: User Retention")

plt.xticks(np.arange(1, len(user_retention)+1, 1))

plt.xlim(1, len(user_retention))

plt.ylabel('% of Cohort Purchasing', fontsize=16)

留存率热力图

import seaborn as sns

sns.set(style='white')

plt.figure(figsize=(16, 8))

plt.title('Cohorts: User Retetion', fontsize=14)

sns.heatmap(user_retention.T,

            mask=user_retention.T.isnull(),

            annot=True, fmt='.0%')

Cohort Analysis Using Python的更多相关文章

Cohort Analysis and LifeCycle Grids mixed segmentation with R（转）
This is the third post about LifeCycle Grids. You can find the first post about the sense of LifeCyc ...
Windows学习"Network Analysis in Python"
原代码仓库的地址为 Network Analysis in Python. 主要按照里面的README.md 进行操作,全部仓库有100MB以上.考虑到数据比较大,再加上我对原笔记文件有修改,建议从我 ...
同期群分析（Cohort Analysis）
什么是同期群? 将相同时间段内具有共同行为特征的用户划分为同一个群体,其被称为同期群.“共同行为特征”是指在某个时间段内的行为相似.最常见的是按不同时间的新增用户来划分,然后分析留存率.当然也可以按其 ...
Applied Social Network Analysis in Python 相关笔记3
如果是option2的话,答案选A. 这里节点s,从左边的选择,节点t从右边选择. 这里计算还是用以前的值,不用更新过的值.
Applied Social Network Analysis in Python 相关笔记
《利用Python进行数据分析： Python for Data Analysis 》学习随笔
NoteBook of <Data Analysis with Python> 3.IPython基础 Tab自动补齐变量名变量方法路径解释 ?解释, ??显示函数源码 ?搜索命名 ...
Python for Data Analysis
Data Analysis with Python ch02 一些有趣的数据分析结果 Male描述的是美国新生儿男孩纸的名字的最后一个字母的分布 Female描述的是美国新生儿女孩纸的名字的最后一个字 ...
[Python] 机器学习库资料汇总
声明:以下内容转载自平行宇宙. Python在科学计算领域,有两个重要的扩展模块:Numpy和Scipy.其中Numpy是一个用python实现的科学计算包.包括: 一个强大的N维数组对象Array: ...
【转帖】Python在大数据分析及机器学习中的兵器谱
Flask:Python系的轻量级Web框架. 1. 网页爬虫工具集 Scrapy 推荐大牛pluskid早年的一篇文章:<Scrapy 轻松定制网络爬虫> Beautiful Soup ...

随机推荐

单点登录SSO的实现原理
单点登录SSO(Single Sign On)说得简单点就是在一个多系统共存的环境下,用户在一处登录后,就不用在其他系统中登录,也就是用户的一次登录能得到其他所有系统的信任.单点登录在大型网站里使用得 ...
python 基础干货 01
赋值的实现 a = 'abc' 1. 在内存中创建了'abc'字符串; 2. 在内存中创建了一个名为 a 的变量, 并把它指向刚才创建的'abc', 也就是a中保存着字符串的地址. b = a, 创建 ...
PAT 1069. The Black Hole of Numbers (20)
For any 4-digit integer except the ones with all the digits being the same, if we sort the digits in ...
《Effective C++》学习笔记条款13 以对象管理资源
条款 13 :以对象管理资源例: voidf() { Investment *pInv = createInvestment(); ... ...
Araxis Merge Professional v2014.4565 特别版 | 文件比较合并
http://www.ttrar.com/html/AraxisMerge.html Araxis Merge 是一个可视化的文件比较.合并和同步的软件,能够方便的被软件工程师和 web 站点开发者使 ...
项目FAQ
报错: Conversion from String Literal to Char* is deprecated http://stackoverflow.com/questions/1369030 ...
NDK开发之ndk-build命令详解
毫无疑问,通过执行ndk-build脚本启动android ndk构建系统. 默认情况下,ndk-build脚本在工程的主目录中执行,如: 我们可以用使用-C参数改变上述行为,-C指定工程的目录,这样 ...
VC++判断是否连网
在开发中,需要判断是否有网络连接,于是写了个函数,实现代码如下: //判断是否有网络连接 static BOOL DoHaveInternetConnection() { BOOL bRet = FA ...
Oracle中的job（转载）
我们在项目开发中,常常会有一些复杂的业务逻辑.使用oracle的存储过程,可以大大减少java程序代码的编写工作量,而且存储过程执行在数据库上,这样可以利用oracle的良好性能支持,极大地提高程序执 ...
php读取操作大文件
在php中,对于文件的读取时,最快捷的方式莫过于使用一些诸如file.file_get_contents之类的函数,简简单单的几行代码就能很漂亮的完成我们所需要的功能.但当所操作的文件是一个比较大的 ...

Cohort Analysis Using Python