Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang
Recruit Ponpare is Japan's leading joint coupon site, offering huge discounts on everything from hot yoga, to gourmet sushi, to a summer concert bonanza. The Recruit Coupon Purchase Prediction challenge asked the community to predict which coupons a customer would buy in a given period of time using past purchase and browsing behavior.

Halla Yang finished 2nd ahead of 1,191 other data scientists. His experience working with time series data helped him use unsupervised methods effectively in conjunction with gradient boosting. In this blog, Halla walks through his approach and shares key visualizations that helped him better understand and work with the dataset.
The Basics
What was your background prior to entering this challenge?
I've worked for almost a decade in finance as a quantitative researcher and portfolio manager. I've also competed in several Kaggle contests, placing first in the Pfizer Volume Prediction Masters competition, sixth in the Merck Molecular Activity Challenge, and ninth in the Diabetic Retinopathy Detection competition.
Halla Yang's profile on Kaggle
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Predicting prices for thousands of stocks and predicting purchases by thousands of Japanese internet users are loosely similar problems. You can forecast stock returns by looking at time series data such as past returns and cross-sectional data such as industry averages. You can forecast coupon purchases by looking at time series features based on past purchases and cross-sectional features based on peer group averages.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
For each (user, coupon) pair, I calculated the probability that the user would purchase that coupon during the test period using a gradient boosting classifier. I sorted the coupons for each user by probability, composing the ten highest probability coupons into my submission.
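As a rough illustration of the ranking step, the sketch below keeps the highest-probability coupons for each user. The function name `top_coupons`, the tie-breaking rule, and the sample IDs and scores are assumptions made for the example, not details from the original solution.

```python
from collections import defaultdict

def top_coupons(scored_pairs, k=10):
    """Return {user: [coupon, ...]} holding each user's k highest-probability coupons."""
    by_user = defaultdict(list)
    for user, coupon, prob in scored_pairs:
        by_user[user].append((prob, coupon))
    ranked = {}
    for user, items in by_user.items():
        # Sort by descending probability; break ties by coupon id for determinism.
        items.sort(key=lambda pc: (-pc[0], pc[1]))
        ranked[user] = [coupon for _, coupon in items[:k]]
    return ranked

# Hypothetical classifier scores for (user, coupon) pairs.
scores = [
    ("u1", "c1", 0.02), ("u1", "c2", 0.30), ("u1", "c3", 0.11),
    ("u2", "c1", 0.05), ("u2", "c3", 0.01),
]
submission = top_coupons(scores, k=2)
```

With `k=10` this matches the ten-coupon submission described above; `k=2` is used here only to keep the sample data small.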
To train my classifier, I constructed training data for 24 "train periods" that simulated the test period. Train period 1 is the week from 2012-01-08 through 2012-01-14, and includes all coupons with a DISPFROM date - the date on which they're supposed to be first displayed - in that week. Train period 2 is the week from 2012-01-15 through 2012-01-21, and includes all coupons with a DISPFROM date in that week. Train period 24 is the week from 2012-06-17 through 2012-06-23, and includes all coupons with a DISPFROM date in that week.
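The weekly windows can be generated mechanically from the first start date. This is a minimal sketch using only the dates quoted above; the helper names `train_periods` and `period_for` are hypothetical.

```python
from datetime import date, timedelta

def train_periods(first_start=date(2012, 1, 8), n_periods=24):
    """Return (period_number, start, end) tuples for consecutive inclusive 7-day windows."""
    periods = []
    for i in range(n_periods):
        start = first_start + timedelta(weeks=i)
        end = start + timedelta(days=6)
        periods.append((i + 1, start, end))
    return periods

def period_for(dispfrom, periods):
    """Return the number of the period whose week contains the DISPFROM date, else None."""
    for number, start, end in periods:
        if start <= dispfrom <= end:
            return number
    return None

periods = train_periods()
```

Each coupon is then assigned to the single train period that contains its DISPFROM date, and coupons outside all 24 windows are left out of training.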
For each of these training periods, I built a set of features for each (user, relevant coupon) pair. This set of features includes user-specific data, e.g. gender, days on site, and age; coupon-specific data, e.g. catalog price, genre, and price rate; as well as user-coupon interaction data, e.g. how often has the user viewed coupons of the same genre. The target for each observation is set to 1 if the user purchased that coupon during the training week, and 0 otherwise.
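To make the shape of one observation concrete, here is a small sketch that assembles a feature row for a single (user, coupon) pair. The dictionary keys, the `build_row` helper, and the sample values are illustrative assumptions; the real feature set was larger than what is shown.

```python
def build_row(user, coupon, view_log, purchases):
    """Build one training row for a (user, coupon) pair.

    view_log:  list of (user_id, genre) views recorded before the training week.
    purchases: set of (user_id, coupon_id) pairs bought during the training week.
    """
    same_genre_views = sum(
        1 for uid, genre in view_log
        if uid == user["id"] and genre == coupon["genre"]
    )
    return {
        # User-specific features
        "gender": user["gender"],
        "age": user["age"],
        "days_on_site": user["days_on_site"],
        # Coupon-specific features
        "catalog_price": coupon["catalog_price"],
        "price_rate": coupon["price_rate"],
        "genre": coupon["genre"],
        # User-coupon interaction feature
        "same_genre_views": same_genre_views,
        # Target: 1 if the user bought this coupon during the training week
        "target": int((user["id"], coupon["id"]) in purchases),
    }

# Hypothetical sample data.
user = {"id": "u1", "gender": "f", "age": 34, "days_on_site": 210}
coupon = {"id": "c9", "genre": "spa", "catalog_price": 5000, "price_rate": 60}
views = [("u1", "spa"), ("u1", "food"), ("u1", "spa"), ("u2", "spa")]
row = build_row(user, coupon, views, purchases={("u1", "c9")})
```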
To calibrate the parameters of my model, I first trained a model on the first twenty-three weeks of data, and estimated my log loss and confusion matrix on the twenty-fourth week. I then trained a model on the full twenty-four weeks of data to generate my competition submission.
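The validation scheme can be sketched as a time-based split plus two metrics. The code below is a stdlib-only illustration, assuming each row carries a `period` field and that a 0.5 threshold is used for the confusion matrix; neither detail is stated in the interview.

```python
import math

def split_by_period(rows, holdout_period=24):
    """Train on all periods before the holdout week; validate on the holdout week."""
    train = [r for r in rows if r["period"] < holdout_period]
    valid = [r for r in rows if r["period"] == holdout_period]
    return train, valid

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(y_true, y_prob):
        predicted = int(p >= threshold)
        if y == 1:
            counts["tp" if predicted else "fn"] += 1
        else:
            counts["fp" if predicted else "tn"] += 1
    return counts

# Hypothetical rows: one per training week, just to show the split sizes.
rows = [{"period": p} for p in range(1, 25)]
train, valid = split_by_period(rows)
```

Splitting by week rather than at random keeps the validation week strictly later than the training weeks, which mirrors how the test period follows the training data.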
The only supervised learning method I used was gradient boosting, as implemented in the excellent xgboost package. I cycled through other algorithms at the start of my analysis to get a feel for their relative performance - logistic regressions, random forests, SVMs, as well as deep neural networks - but found that gradient boosting was the single best classifier for my approach.
What was your most important insight into the data?
First, many test set and training set coupons were viewed prior to their DISPFROM, the date on which they're supposed to be first displayed, and so one could use direct views as a forecasting variable. The violin plot below shows the distribution of first view times relative to DISPFROM. A negative x-value indicates the coupon was viewed prior to its DISPFROM. Over a quarter of coupons are first viewed more than twelve hours before their DISPFROM, and five percent of coupons are first viewed more than ninety hours before their DISPFROM.

Simply counting the number of times a user has viewed a test set coupon is tremendously helpful in forecasting test set purchases. As shown in the left panel of the figure below, users have a 2.5% chance of buying a coupon if they've viewed it exactly once prior to its DISPFROM, but that probability rises to 32% if they've viewed the coupon four or more times.
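A minimal sketch of this feature, under assumed data layouts: count each user's views of a coupon that happened before its DISPFROM, then tabulate the empirical purchase rate per view-count bucket, pooling counts of four or more. The function names and the toy data are hypothetical and do not reproduce the figures quoted above.

```python
from collections import Counter, defaultdict
from datetime import datetime

def early_view_counts(view_log, dispfrom):
    """Count views per (user, coupon) that occurred before the coupon's DISPFROM.

    view_log: iterable of (user_id, coupon_id, viewed_at)
    dispfrom: {coupon_id: datetime of first scheduled display}
    """
    counts = Counter()
    for user_id, coupon_id, viewed_at in view_log:
        if viewed_at < dispfrom[coupon_id]:
            counts[(user_id, coupon_id)] += 1
    return counts

def purchase_rate_by_views(pairs, cap=4):
    """pairs: iterable of (early_view_count, purchased_flag); counts >= cap are pooled."""
    seen, bought = defaultdict(int), defaultdict(int)
    for n_views, purchased in pairs:
        bucket = min(n_views, cap)
        seen[bucket] += 1
        bought[bucket] += purchased
    return {b: bought[b] / seen[b] for b in sorted(seen)}

# Hypothetical data.
dispfrom = {"c1": datetime(2012, 6, 24, 12, 0)}
view_log = [
    ("u1", "c1", datetime(2012, 6, 23, 9, 0)),   # before DISPFROM
    ("u1", "c1", datetime(2012, 6, 24, 8, 0)),   # before DISPFROM
    ("u1", "c1", datetime(2012, 6, 25, 8, 0)),   # after DISPFROM, ignored
    ("u2", "c1", datetime(2012, 6, 24, 13, 0)),  # after DISPFROM, ignored
]
counts = early_view_counts(view_log, dispfrom)
rates = purchase_rate_by_views([(0, 0), (0, 0), (1, 0), (1, 1), (4, 1), (7, 1), (5, 0)])
```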

Second, users tend to buy the same coupons over and over. As shown in the middle panel of the above figure, a user who has purchased a coupon with a given prefecture, genre, and catalog price four or more times has a 38% chance of buying a matched coupon again in the next week if it is offered for sale.
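One way to express this repeat-purchase feature is to key each coupon by its prefecture, genre, and catalog price, and count how often a user has already bought coupons with the same key. The sketch below assumes simple dictionaries for coupons; `coupon_key` and `matched_purchase_count` are hypothetical names.

```python
from collections import Counter

def coupon_key(coupon):
    """Coupons with the same prefecture, genre, and catalog price count as matched."""
    return (coupon["prefecture"], coupon["genre"], coupon["catalog_price"])

def matched_purchase_count(user_id, candidate, past_purchases):
    """Number of the user's past purchases whose key matches the candidate coupon.

    past_purchases: list of (user_id, coupon_dict) made before the prediction week.
    """
    counts = Counter(coupon_key(c) for uid, c in past_purchases if uid == user_id)
    return counts[coupon_key(candidate)]

# Hypothetical data.
lunch = {"prefecture": "Tokyo", "genre": "food", "catalog_price": 1000}
spa = {"prefecture": "Tokyo", "genre": "spa", "catalog_price": 5000}
history = [("u1", lunch), ("u1", lunch), ("u1", spa), ("u2", lunch)]
n_repeat = matched_purchase_count("u1", dict(lunch), history)
```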
Third, peer group averages can help forecast the behavior of users with little or no history. The right panel of the above figure shows that a user's probability of buying a coupon increases from less than 0.1% to above 0.6% if more than ten percent of age, sex, and geography-matched peers have bought a coupon with the same characteristics.
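A peer-group feature of this kind might be computed as the share of matched peers who bought a coupon with the same characteristics. In the sketch below, matching on ten-year age bands and excluding the user from their own peer group are assumptions of this example; the interview does not say how peers were matched.

```python
def peer_group(user, age_band=10):
    """Group users by age band, sex, and prefecture (the banding is an assumption)."""
    return (user["age"] // age_band, user["sex"], user["prefecture"])

def peer_purchase_share(user_id, key, users, purchases):
    """Share of the user's matched peers who bought a coupon with the given key.

    users:     {user_id: {"age": int, "sex": str, "prefecture": str}}
    purchases: set of (user_id, coupon_key) pairs from past data
    """
    group = peer_group(users[user_id])
    peers = [uid for uid, u in users.items()
             if uid != user_id and peer_group(u) == group]
    if not peers:
        return 0.0  # No peers: fall back to zero rather than dividing by zero.
    buyers = sum(1 for uid in peers if (uid, key) in purchases)
    return buyers / len(peers)

# Hypothetical data.
users = {
    "u1": {"age": 34, "sex": "f", "prefecture": "Tokyo"},
    "u2": {"age": 31, "sex": "f", "prefecture": "Tokyo"},
    "u3": {"age": 38, "sex": "f", "prefecture": "Tokyo"},
    "u4": {"age": 35, "sex": "m", "prefecture": "Tokyo"},
    "u5": {"age": 52, "sex": "f", "prefecture": "Osaka"},
}
key = ("Tokyo", "food", 1000)
share = peer_purchase_share("u1", key, users, {("u2", key), ("u4", key)})
```

Because the feature depends only on the peer group, it is available even for a user with no purchase or browsing history of their own.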
Fourth, it's important to consider the geographic coverage of each coupon. To be specific, a coupon is relevant for the multiple prefectures listed in coupon_area_train.csv, not just the single prefecture listed for that coupon in coupon_list_train.csv. In the kernel density plots below, I show the purchase intensity for users based in four prefectures: Tokyo, Kanagawa, Osaka, and Aichi, using the geographic data in coupon_list_train.csv. The purchases for Osaka and Aichi users appear strongly bimodal, with an unusually large number of purchases occurring in the Tokyo region.

On the other hand, if we look at all the prefectures that map to a given coupon, we find that Osaka users purchased Tokyo coupons not because they planned to travel to Tokyo, but because these coupons were also local to Osaka. If we plot the geographic intensity of "nearest-to-user" prefecture rather than a coupon's primary listing prefecture, we see much more localized purchase behavior.
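The "nearest-to-user" prefecture can be sketched as picking, among all prefectures a coupon covers, the one closest to the user's home prefecture. The great-circle distance, the helper names, and the rounded coordinates below are assumptions supplied for illustration only.

```python
import math

# Approximate (latitude, longitude) values, for illustration only.
COORDS = {
    "Tokyo": (35.69, 139.69),
    "Kanagawa": (35.45, 139.64),
    "Aichi": (35.18, 136.91),
    "Osaka": (34.69, 135.50),
}

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_prefecture(user_pref, coupon_prefs, coords=COORDS):
    """Among the prefectures a coupon covers, return the one closest to the user."""
    return min(coupon_prefs, key=lambda p: haversine_km(coords[user_pref], coords[p]))

# A coupon whose primary listing is Tokyo but which is also valid in Osaka.
nearest = nearest_prefecture("Osaka", ["Tokyo", "Osaka"])
```

For an Osaka user this returns Osaka rather than the coupon's primary Tokyo listing, which is the effect described above.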

Words of Wisdom
Do you have any advice for those just getting started in data science?
Focus on understanding the problem. Without understanding the problem, it's impossible to develop a solution.
Start with simple approaches and models. A fast development cycle is key to testing out ideas and learning what works. Don't start building computationally expensive ensembles until you have iterated through most of your best ideas.
Bio
Halla Yang has worked as a quantitative researcher, portfolio manager and trader at Goldman Sachs Asset Management, Jump Trading and Arrowstreet Capital. He holds a Ph.D. in Business Economics from Harvard, and a B.A. in Physics, summa cum laude, also from Harvard. He is about to start a new position as data scientist at a management consulting firm.