Otto Product Classification Winner's Interview: 2nd place, Alexander Guschin ¯\_(ツ)_/¯

The Otto Group Product Classification Challenge made Kaggle history as our most popular competition ever. Alexander Guschin finished in 2nd place ahead of 3,845 other data scientists. In this blog, Alexander shares his stacking-centered approach and explains why you should never underestimate the nearest neighbours algorithm.

3,848 players on 3,514 teams competed to classify items across Otto Group's product lines

The Basics

What was your background prior to entering this challenge?

I have some theoretical understanding of machine learning thanks to my base institute (Moscow Institute of Physics and Technology) and our professor Konstantin Vorontsov, one of the top Russian machine learning specialists. As for practical problems, another great Russian data scientist who was once Top-1 on Kaggle, Alexander D’yakonov, used to teach a course on practical machine learning every autumn, which gave me a very good basis. Kagglers may know this course as PZAD.

Alexander's profile on Kaggle

How did you get started competing on Kaggle?

I got started in the autumn of 2014 in “Forest Cover Type Prediction”. At that time I had no experience in solving machine learning problems. I found excellent benchmarks in “Titanic: Machine Learning from Disaster” which helped me a lot. After that I understood that machine learning is extremely interesting to me and just tried to participate in every competition I could.

What made you decide to enter this competition?

I wanted to check some ideas for my bachelor's thesis. I liked that the Otto competition has a quite reliable dataset: you can check everything on cross-validation, and changes on CV were close enough to those on the leaderboard. Also, the spirit of the competition is quite appropriate for checking ideas.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

My solution’s stacking schema

The main idea of my solution is stacking. Stacking lets you use different methods’ predictions of Y (or of the class labels, for multiclass problems) as “metafeatures”. To obtain a metafeature for train, you split your data into K folds and train K models, each on K-1 folds, predicting the one fold that was held out. To obtain the metafeature for test, you can average the predictions of these K models or make a single prediction with a model trained on all of the train data. After that you train a metaclassifier on the features and metafeatures, and average the predictions if you have several metaclassifiers.
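The fold procedure described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the helper name `oof_metafeatures`, the synthetic data, the choice of KNN with 16 neighbours and the 5 folds are all assumptions, and it presumes labels are coded 0..C-1 with every class present in each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def oof_metafeatures(model_factory, X_train, y_train, X_test, n_splits=5, seed=0):
    """Out-of-fold class probabilities for train, fold-averaged ones for test.

    Hypothetical helper; assumes labels are 0..C-1 and each fold sees all classes.
    """
    n_classes = len(np.unique(y_train))
    train_meta = np.zeros((X_train.shape[0], n_classes))
    test_meta = np.zeros((X_test.shape[0], n_classes))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, holdout_idx in folds.split(X_train, y_train):
        model = model_factory()
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        train_meta[holdout_idx] = model.predict_proba(X_train[holdout_idx])
        # Test metafeature: average of the K fold models' predictions.
        test_meta += model.predict_proba(X_test) / n_splits
    return train_meta, test_meta


# Synthetic stand-in data, not the competition dataset.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, y_train, X_test = X[:450], y[:450], X[450:]

train_meta, test_meta = oof_metafeatures(
    lambda: KNeighborsClassifier(n_neighbors=16), X_train, y_train, X_test)

# The metaclassifier is then trained on features and metafeatures together.
X_train_level2 = np.hstack([X_train, train_meta])
X_test_level2 = np.hstack([X_test, test_meta])
```

Because every row of `train_meta` comes from a model that was not fitted on that row, the metaclassifier does not see predictions that leak the training labels.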

At the beginning of the competition I found it useful to split the data into two groups: (1) train & test, (2) TF-IDF(train) & TF-IDF(test). Many parts of my solution use these two groups in parallel.
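One way to build the second group, assuming the raw features are non-negative counts, is sklearn's `TfidfTransformer`. The interview does not say how the transform was fitted or which settings were used, so fitting on train only and keeping the default parameters are assumptions here, and the count matrices are random stand-ins.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the count-valued train and test features.
train_counts = rng.poisson(1.0, size=(8, 5))
test_counts = rng.poisson(1.0, size=(4, 5))

# Defaults: smoothed IDF and L2 row normalisation.
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_counts).toarray()
# Reuse the IDF weights learned on train when transforming test.
test_tfidf = tfidf.transform(test_counts).toarray()
```

Models can then be fitted once on `train_counts` and once on `train_tfidf`, giving two views of the same rows.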

As for supervised methods, I found that XGBoost and neural networks both give good results on this data, so I decided to use them as metaclassifiers in my ensemble.

KNN, on the other hand, usually gives predictions that are very different from those of decision trees or neural networks, so I included KNN predictions on the first level of the ensemble as metafeatures. Random forest and XGBoost also turned out to be useful as metafeatures.
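As a rough illustration of first-level KNN metafeatures, sklearn's `cross_val_predict` can produce out-of-fold probabilities for several neighbourhood sizes in a few lines. The values of k, the 5 folds and the synthetic data below are assumptions for the sketch; the interview does not list the settings that were actually used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the competition dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

neighbour_counts = [2, 8, 32]  # hypothetical choices of k
# One block of out-of-fold class probabilities per value of k, side by side.
knn_meta = np.hstack([
    cross_val_predict(KNeighborsClassifier(n_neighbors=k), X, y,
                      cv=5, method="predict_proba")
    for k in neighbour_counts
])
```

Each value of k contributes one column per class, so `knn_meta` here has 3 x 3 = 9 columns that can be appended to the original features.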

What was your most important insight into the data?

Probably the main insight was that KNN is capable of making very good metafeatures. Never underestimate the nearest neighbours algorithm.

It was also very important to combine NN and XGB predictions on the second level. While my final second-level NN and XGB each scored around .391 on the private LB, their combination achieved .386, which is a very significant improvement. Bagging on the second level helped a lot too.
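A simple way to combine two sets of class probabilities is a weighted arithmetic mean, scored with multiclass logloss. The interview does not specify the combination rule or weights, so the equal weighting and the toy probabilities below are assumptions made only for illustration.

```python
import numpy as np


def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))


def blend(proba_a, proba_b, weight_a=0.5):
    """Weighted arithmetic mean of two probability matrices."""
    return weight_a * proba_a + (1.0 - weight_a) * proba_b


# Toy, made-up predictions for four rows and three classes.
y_true = np.array([0, 1, 2, 1])
proba_nn = np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1],
                     [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]])
proba_xgb = np.array([[0.4, 0.5, 0.1], [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6], [0.3, 0.5, 0.2]])

proba_blend = blend(proba_nn, proba_xgb)
loss_nn = multiclass_logloss(y_true, proba_nn)
loss_xgb = multiclass_logloss(y_true, proba_xgb)
loss_blend = multiclass_logloss(y_true, proba_blend)
```

Since the negative logarithm is convex, the logloss of an equal-weight blend is never worse than the average of the two individual loglosses. When the models make different mistakes, it can also beat the better of the two.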

TSNE in 2 dimensions

Besides this, the TSNE embedding in 2 dimensions looks very interesting. We can see on the plot that there are some examples which will most likely be misclassified by our algorithm. This means that it won’t be easy to find a way to post-process our predictions to improve logloss.

Also, it seemed interesting that some classes were more closely related than others, for example class 1 and class 2. It is worth trying to distinguish these classes specifically.

Final model’s predictions for holdout

Were you surprised by any of your findings?

Unfortunately, it appears that you won’t necessarily improve your model by making your metafeatures better. When it comes to ensembling, all you can count on is your understanding of the algorithms (basically, the more diverse metafeatures you have, the better) and the effort to try as many metafeatures as possible.

The more diverse metafeatures you have, the better. Metafeature by Extratrees vs metafeature by Neural Network.

Which tools did you use?

I only used sklearn, xgboost and lasagne. These are excellent machine learning libraries and I would recommend them to anyone who is starting to compete on Kaggle. In my experience, they are sufficient to try different methods and achieve great results in most Kaggle competitions.

Words of Wisdom

Do you have any advice for those just getting started in data science?

I think the most useful advice here is not to get stuck fine-tuning parameters or using the same approaches in every competition. Read through the forums and understand the winning solutions of past competitions; all of this will give you a significant boost whatever your level is. In other words, my point is that reading past solutions is as important as solving competitions.

Also, when you are first starting to work on machine learning problems you can make some nasty mistakes which will cost you a lot of time and effort. So it is great if you can work in a team with someone and ask them to check your code or try the same methods on their own. Besides, always compare your performance with that of people on the forums. When you see that your algorithm performs much worse than what people report on the forum, go and check the benchmarks for this and other recent competitions and try to figure out the mistake.

Bio

Alexander Guschin is a 4th-year student at the Moscow Institute of Physics and Technology. Currently, Alexander is finishing his bachelor's thesis on ensembling methods.
