CrowdFlower Winner's Interview: 1st place, Chenglong Chen
CrowdFlower Winner's Interview: 1st place, Chenglong Chen
The Crowdflower Search Results Relevance competition asked Kagglers to evaluate the accuracy of e-commerce search engines on a scale of 1-4 using a dataset of queries & results. Chenglong Chen finished ahead of 1,423 other data scientists to take first place. He shares his approach with us from his home in Guangzhou, Guangdong, China. (To compare winning methodologies, you can read a write-up from the third place team here.)
The competition ran from May 11-July 6, 2015.
The Basics
What was your background prior to entering this challenge?
I was a Ph.D. student in Sun Yat-sen University, Guangzhou, China, and my research mainly focused on passive digital image forensics. I have applied various machine learning methods, e.g., SVM and deep learning, to detect whether a digital image has been edited/doctored, or how much has the image under investigation been resized/rotated.
Chenglong's profile on Kaggle
I am very interested in machine learning and have read quite a lot of related papers. I also love to compete on Kaggle to test out what I have learnt and also to improve my coding skill. Kaggle is a great place for data scientists, and it offers real world problems and data from various domains.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have a background of image proecssing and have limited knowledge about NLP except BOW/TF-IDF kinda of things. During the competition, I frequently refered to the book Python Text Processing with NLTK 2.0 Cookbook or Google for how to clean text or create features from text.
I did read the paper about ensemble selection (which is the ensembling method I used in this competition) a long time ago, but I haven't have the opportunity to try it out myself in real word problem. I previously only tried simple (weighted) averaging or majority voting. This is the first time I got so serious about the model ensembling part.
How did you get started competing on Kaggle?
It dates back a year and a half ago. At that time, I was taking Prof. Hsuan-Tien Lin's Machine Learning Foundations course on Coursera. He encouraged us to compete on Kaggle to apply what we have learnt to real world problems. From then on, I have occasionally participated in competitions I find interesting. And to be honest, most of my programming skills about Python and R are learnt during Kaggling.
What made you decide to enter this competition?
After I passed my Ph.D. dissertation defense early in May, I have had some spare time before starting my job at an Internet company. I decided that I should learn something new and mostly get prepared for my job. Since my job will be about advertising and mostly NLP related, I thought this challenge would be a great opportunity to familiarize myself with some basic or advanced NLP concepts. This is the main reason that drove me to enter.
Another reason was that this dataset is not very large, which is ideal for practicing ensemble skills. While I have read papers about ensembling methods, I haven't got very serious about ensembling in previous competitions. Usually, I would try very simple (weighted) averaging. I thought this is a good chance to try some of the methods I have read, e.g., stacking generalization and ensemble selection.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
The documentation and code for my approach are available here. Below is a high level overview of my method.
Figure 1. Flowchart of my method
For preprocessing, I mainly performed HTML tags dropping, word replacement, and stemming. For a supervised learning method, I used ensemble selection to generate an ensemble from a model library. The model library was built with models trained using various algorithms, various parameter settings, and various feature sets. I have usedHyperopt (usually used in parameter tuning) to choose parameter setting from a pre-defined parameter space for training different models.
I have tried various objectives, e.g., MSE, softmax, and pairwise ranking. MSE turned out to be the best with an appropriate decoding method. The following is the decoding method I used for MSE (i.e., regression):
- Calculate the pdf/cdf of each median relevance level, 1 is about 7.6%, 1 + 2 is about 22%, 1 + 2 + 3 is about 40%, and 1 + 2 + 3 + 4 is 100%.
- Rank the raw prediction in an ascending order.
- Set the first 7.6% to 1, 7.6% - 22% to 2, 22% - 40% to 3, and the rest to 4.
In CV, the pdf/cdf is calculated using the training fold only, and in the final model training, it is computed using the whole training data.
Figure 2 shows some histograms from my reproduced best single model for one run of CV (only one validation fold is used). Specifically, I plotted histograms of 1) raw prediction, 2) rounding decoding, 3) ceiling decoding, and 4) the above cdf decoding, grouped by the true relevance. It's most obvious that both rounding and ceiling decoding methods have difficulty in predicting relevance 4.
Figure 2. Histograms of raw prediction and predictions using various decoding methods grouped by true relevance. (The code generated this figure is available here.)
Following are the kappa scores for each decoding method (using all 3 runs and 3 folds CV). The above cdf decoding method exhibits the best performance among the three methods we considered.
| Method | CV Mean | CV Std |
| Rounding | 0.404277 | 0.005069 |
| Ceiling | 0.513138 | 0.006485 |
| CDF | 0.681876 | 0.005259 |
What was your most important insight into the data?
I have found that the most important features for predicting the search results relevance is the correlation or distance between query and product title/description. In my solution, I have features like interset word counting features, Jaccard coefficients, Dice distance, and cooccurencen word TF-IDF features, etc. Also, it’s important to perform some word replacements/alignments, e.g., spelling correction and synonym replacement, to align those words with the same or similar meaning.
While I didn't have much time exploring word embedding methods, they are very promising for this problem. During the competition, I came across a paper entitled “From word embeddings to document distances”. The authors of this paper used Word Mover’s Distance (WMD) metric together with word2vec embeddings to measure the distance between text documents. This metric is shown to have superior performance than BOW and TF-IDF features.
Were you surprised by any of your findings?
I have tried optimizing kappa directly uisng XGBoost (see below), but it performed a bit worse than plain regression. This might have something to do with the hessian, which I couldn't get to work unless I used some scaling and change it to its absolute value (see here).
Which tools did you use?
I used Python for this competition. For feature engineering part, I heavily relied on pandas and Numpy for data manipulation, TfidfVectorizer and SVD in Sklearn for extracting text features. For model training part, I mostly used XGBoost, Sklearn, keras and rgf.
I would like to say a few more words about XGBoost, which I have been using very often. It is great, accurate, fast and easy of use. Most importantly, it supports customized objective. To use this functionality, you have to provide the gradient and hessian of your objective. This is quite helpful in my case. During the competition, I have tried to optimize quadratic weighted kappa directly using XGBoost. Also, I have implemented two ordinal regression algorithms within XGBoost framework (both by specifying the customized objective.) These models contribute to the final winning submission too.
How did you spend your time on this competition?
Where I spent my time on the competition changed during the competition.
- In the early stage, I mostly focused on data preprocessing. I have spent quite a lot of time on researching and coding down the methods to perform text cleaning. I have to mention that quite a lot of effort was spent on exploring the data (e.g., figuring out misspellings and synonyms etc.)
- Then, I spent most of my time on feature extraction and trying to figure out what features would be useful for this task. The time was split pretty equally between researching and coding.
- In the same period, I decided to build a model using ensemble selection and realized my implementation was not flexible enough to that goal. So, I spent most of the time refactoring my implementation.
- After that, most of my time was spent on coding down the training and prediction parts of various models. I didn't spend much time on tuning each model's performance. I utilized Hyperopt for parameter tuning and model library building.
- With the pipeline for ensemble selection being built, most of my time was spent on figuring out new features and exploring the provided data.
In short, I would say I have done a lot of researching and coding during this competition.
What was the run time for both training and prediction of your winning solution?
Since the dataset is kinda of a small size and kappa is not very stable, I utilized bagged ensemble selection from a model library containing hundreds or thousands of models to combat overfitting and stabilize my results. I don't have an exact number of the hours or days, but it should take quite a large amount of time to train and make prediction. Furthermore, this also depends on the trade-off between the size of the model library (computation burden) and the performance.
That being said, you should be able to train the best single model (i.e., XGBoost with linear booster) in a few hours. It will give you a model of kappa score about 0.708 (Private LB), which should be enough for a top15 place. For this model, feature extraction occupied most of the time. The training part (using the best parameters I have found) should be a few minutes using multi-threads (e.g., 8).
Words of Wisdom
What have you taken away from this competition?
- Ensembling of a bunch of diverse models helps a lot. Figure 3 shows the CV mean, Public LB, and Private LB scores of my 35 best Public LB submissions generated using ensemble selection. As time went by, I have trained more and more different models, which turned out to be helpful for ensemble selection in both CV and Private LB.
- Do not ever underestimate the power of linear models. They can be much better than tree-based models or SVR with RBF/poly kernels when using raw TF-IDF features. They can be even better if you introduce appropriate nonlinearities.
- Hyperopt is very useful for parameter tuning, and can be used to build model library for ensemble selection.
- Keep your implementation flexible and scaleable. I was lucky to refactor my implementation early on. This allowed me to add new models to the model library very easily.
Figure 3. CV mean, Public LB, and Private LB scores of my 35 best Public LB submissions. One standard deviation of the CV score is plotted via error bar. (The code generated this figure is available here.)
Do you have any advice for those just getting started in data science?
- Use things like Google to find a few relevant research papers. Especially if you are not a domain expert.
- Read the winning solutions for previous competitions. They contain lots of insights and tricks, which are quite inspired and useful.
- Practice makes perfect. Choose one competition that you are interested in on Kaggle and start Kaggling today (and every day)!
Bio
Chenglong Chen
is a recent graduate from Sun Yat-sen University (SYSU), Guangzhou, China, where he received a B.S. degree in Physics in 2010 and recently got a Ph.D. degree in Communication and Information Systems. As a Ph.D. student, his research interests included image processing, multimedia security, pattern recognition, and in particular digital image forensics. He will be starting his job career atTencent this August, working on advertising. Chenglong can be reached at: c.chenglong@gmail.com
CrowdFlower Winner's Interview: 1st place, Chenglong Chen的更多相关文章
- How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo
How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo An early insight into the importa ...
- Facebook IV Winner's Interview: 1st place, Peter Best (aka fakeplastictrees)
Facebook IV Winner's Interview: 1st place, Peter Best (aka fakeplastictrees) Peter Best (aka fakepla ...
- Diabetic Retinopathy Winner's Interview: 1st place, Ben Graham
Diabetic Retinopathy Winner's Interview: 1st place, Ben Graham Ben Graham finished at the top of the ...
- Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang
Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang Recruit Ponpare is Japan's leading ...
- Otto Product Classification Winner's Interview: 2nd place, Alexander Guschin ¯\_(ツ)_/¯
Otto Product Classification Winner's Interview: 2nd place, Alexander Guschin ¯\_(ツ)_/¯ The Otto Grou ...
- Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang
Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang The hugely popular Liberty Mut ...
- ICDM Winner's Interview: 3rd place, Roberto Diaz
ICDM Winner's Interview: 3rd place, Roberto Diaz This summer, the ICDM 2015 conference sponsored a c ...
- 如何在 Kaggle 首战中进入前 10%
原文:https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/ Introduction Kaggle 是目前最 ...
- 【转载】如何在 Kaggle 首战中进入前 10%
本文转载自如何在 Kaggle 首战中进入前 10% 转载仅出于个人学习收藏,侵删 Introduction 本文采用署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议进行许可.著作权由章 ...
随机推荐
- delphi 类型转化
1.typecasting类型强制转化 var B : Boolean; Begin B := Boolean(1); End; 对于对象和接口,采用as操作符进行转化,但要先进行兼容性判断. 2.P ...
- Java 第二天
1.不带访问修饰符号的类成员变量默认是friendly(可以同一个包中访问) 2.protected的类成员可以在同一个包及其子类中访问 3.方法中可以定义内部类(如下面的代码),该类只能访问方法中的 ...
- 在mac上安装pydev for eclipse时,在eclipse的Preferences中无法显示出来的解决方法
参考http://pydev.org/manual_101_install.html 中的说明,该插件依赖java7,在我安装eclipse之前并没有安装jdk,打开eclipse之后,自动安装了一个 ...
- PIL不能关闭文件的解决方案
今天写了一个能指定图片尺寸,以及比例 来搜索分类图片的Python脚本.为了读取多个格式的文件的头,采用了Python PIL库. im = PIL.Image.open(imPath) if im的 ...
- 苹果内付费 IAP
创建app内购买项目 消耗型项目:对于消耗型App内购买项目,用户每次下载时都必须进行购买.一次性服务通常属于消耗型项目,例如钓鱼App 中的鱼饵. 非消耗型项目:对于非消耗型App内购买项目,用户仅 ...
- C#巧用Excel模版变成把Table打印出来
将一个做好的Excel模版,通过程序填上数据然后打印出来这个需求有两种方法一种是通过代码打开Excel模版然后填入数据然后再打印. 第二种方法就是我将要介绍的 1.将Excel设置好格式另存为HTML ...
- JavaScript高级程序设计之函数性能
setTimeout 比 setInterval 性能更好 // 取代setInterval setTimeout(function self () { // code goes here setTi ...
- 通过FileWatcher,监听通过web上传的图片,并进行压缩
需求是这样的,通过web传输过来的图片,无论是JS上传,还是其他的上传方式,都需要生成2张缩略图,分别是用于商品列表的小图small,和用于分享的小图share.基于不同上传方式的不同需求,使用exe ...
- 联想Z470安装10.11懒人版成功!!特此分享!!
折腾黑苹果也断断续续好几个月了,在远景也爬了好多贴,遇到问题基本上靠自己解决,自己组的台式机已基本完美,大学期间买的联想Z470现在是“食之无味,弃之可惜”,想想也来试试装个黑苹果玩玩,之前装过10. ...
- 条款11:在operator=中处理“自我赋值”
什么是自我赋值,就是 v = v 这种类型的语句,也许很多人都会说鄙视这种写法,但是如下的写法会不会出现呢? 比如:a[i] = a[j]; // 不巧的是i可能和j相等 *px = *py ...