Facebook IV Winner's Interview: 1st place, Peter Best (aka fakeplastictrees)

Peter Best (aka fakeplastictrees) took 1st place in Human or Robot?, our fourth Facebook recruiting competition. Finishing ahead of 984 other data scientists, Peter ignored early results from the public leaderboard and stuck to his own methodology (which involved removing select bots from the training set). In this blog, he shares what led to this winning approach and how the competition helped him grow as a data scientist.

The Basics

What was your background prior to entering this challenge?

After studying chemistry for my first degree, I worked for twenty years in quantitative asset management culminating in being a partner in my own firm. After this, I went back to university to obtain a second degree, in maths this time. Recently, I have been looking for the right project to commit to. My programming experience extends all the way back to a ZX81. Whilst I would never describe myself as a professional programmer, I seem to be able to program.

Kaggle profile for Peter Best (aka fakeplastictrees)

How did you get started competing on Kaggle?

With all my experience, I decided my area of greatest interest lay in analysing complex data. I also quickly realised that my coding skills were from the previous century and thus I opted to learn python. Perhaps the best way to learn a programming language is to actually do something in it and this led me to Kaggle.

Peter's top competition finishes

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Not really. Financial markets are somewhat like auctions and the people who work in financial markets are somewhat like bots (occasionally!), but I don’t think I used any specialist knowledge in this competition. Many examples of data analysis have similarities despite relating to completely different fields.

What made you decide to enter this competition?

The key to this competition looked like it was going to be feature engineering, something I regard as a strength. In addition, the size of the data set was small enough to manage easily on my Mac with short run times.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

Most features were composed by manipulating the data in the bids database then transferring some aggregate measure into a smaller database uniquely indexed by bidder. This smaller database was then used as input to a boosted trees model.

What was your most important insight into the data?

As soon as I started doing some cross-validation on an early version of my model, I found that the statistical error on the estimate of the Area Under the ROC Curve (the evaluation metric used in the competition) was going to be quite high. This led me to undertake extensive resampling in my cross-validation to ensure my estimate of the AUC was accurate enough to base decisions upon. In addition, it suggested that the public leaderboard score was likely to be a poor guide to the eventual private leaderboard score and it was going to be much better to trust my own cross-validation. As a consequence of this I only actually submitted three entries into the competition.

Were you surprised by any of your findings?

One of the most straightforward features to envisage was simply the number of bids made by each bidder. Ex ante, one could suppose that a bot could place far more bids than the average human. When a histogram of the number of bids placed by each bot was considered, however, there was a striking anomaly. Five bots were identified as having only one bid each in the data set, which looks unusual in the distribution, as shown.

Having only one bid per bot suggests that either there are some links with other bots in the data or that they have been labelled as bots using data which we don’t know. Investigation yielded no obvious links to other bots and the latter point implies that the data from these bots won’t generalise well to a different data set. Thus, the algorithms that I used may be negatively affected by trying to learn from these bot examples. I decided then to create a solution that removed these five observations from the training data set and submitted this as one of my two selected entries. It was this entry that won the competition.

Also, note that this picture was on the home page of the competition. There are five bots! Was it a clue?

Which tools did you use?

I did everything in python, using the excellent Scikit-learn package for the boosted trees algorithm.

How did you spend your time on this competition?

I tend to flit to and fro between the two [feature engineering & machine learning] as I’m going along. I tend to start with some feature engineering, then perform some gross parameter tuning for the machine learning algorithm, then return to some more feature engineering and so on. In this competition the feature engineering took up a lot more time than the machine learning. I spent quite a lot of time lovingly hand-crafting features that didn’t improve the model in cross-validation, but this is normal for this sort of competition. Just because you think a feature will be useful doesn’t mean it will be.

What was the run time for both training and prediction of your winning solution?

Just a few minutes, though each cross-validation run took over an hour.

Words of Wisdom

What have you taken away from this competition?

Enhanced self-belief in persevering with what you know to be the correct methodology and not getting distracted by things that don’t really matter (in this competition, that meant the public leaderboard!).

Also, a realisation that even though I think my methods gave me the best chance of doing well, actually winning the competition required the cards to fall my way this time.

Do you have any advice for those just getting started in data science?

Firstly, look through the Kaggle website, especially the forums, to pick up a vast amount of good advice on how to improve your data science.

Mostly though, my advice is to just have a go, make mistakes and learn from these.

Bio

Peter Best is a graduate in both chemistry and mathematics. He worked for many years in the quantitative asset management industry where he performed extensive data analysis. He is currently competing in Kaggle competitions and trying to decide on his next project.

Facebook IV Winner's Interview: 1st place, Peter Best (aka fakeplastictrees)的更多相关文章

  1. CrowdFlower Winner's Interview: 1st place, Chenglong Chen

    CrowdFlower Winner's Interview: 1st place, Chenglong Chen The Crowdflower Search Results Relevance c ...

  2. How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo

    How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo An early insight into the importa ...

  3. Diabetic Retinopathy Winner's Interview: 1st place, Ben Graham

    Diabetic Retinopathy Winner's Interview: 1st place, Ben Graham Ben Graham finished at the top of the ...

  4. ICDM Winner's Interview: 3rd place, Roberto Diaz

    ICDM Winner's Interview: 3rd place, Roberto Diaz This summer, the ICDM 2015 conference sponsored a c ...

  5. Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang

    Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang Recruit Ponpare is Japan's leading ...

  6. Otto Product Classification Winner's Interview: 2nd place, Alexander Guschin ¯\_(ツ)_/¯

    Otto Product Classification Winner's Interview: 2nd place, Alexander Guschin ¯\_(ツ)_/¯ The Otto Grou ...

  7. Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang

    Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang The hugely popular Liberty Mut ...

  8. 如何在 Kaggle 首战中进入前 10%

    原文:https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/ Introduction Kaggle 是目前最 ...

  9. 【转载】如何在 Kaggle 首战中进入前 10%

    本文转载自如何在 Kaggle 首战中进入前 10% 转载仅出于个人学习收藏,侵删 Introduction 本文采用署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议进行许可.著作权由章 ...

随机推荐

  1. ColorComboBox

    using System; using System.Collections.Generic; using System.ComponentModel; using System.Drawing; u ...

  2. MongoDB 安装与启动

    一.MongoDB简单介绍 MongoDB是一个高性能,开源.无模式的文档型数据库.是当前NoSql数据库中比較热门的一种.它在很多场景下可用于替代传统的关系型数据库或键/值存储方式. Mongo使用 ...

  3. Java基础知识强化之集合框架笔记28:ArrayList集合练习之去除ArrayList集合中的重复字符串元素(升级)

    1. 需求:ArrayList去除集合中字符串的重复值(字符串的内容相同)     要求:不能创建新的集合,就在以前的集合上做. 2. 代码示例之 去除集合中重复元素,不创建新的集合: package ...

  4. HTML5移动开发中的input输入框类型

    HTML5规范引入了许多新的input输入框类型 在HTML5移动开发中,通过这些新的输入框类型来显示定制后的键盘布局,用户体验更好,更容易填写各种表单 本文中,实测手机为肾4S与米4 数字类型num ...

  5. Android开发文摘集合1

    作者:张明云 原标题:Android 开发中,有哪些坑需要注意? 作者github主页:zmywly8866.github.io/ 在Android library中不能使用switch-case语句 ...

  6. 在企业级开发中使用Try...Catch...会影响效率吗?

    感谢神啊.上帝及老天爷让我失眠,才能够有了本篇文章. 记得不久之前,公司一同事曾经说过:“如果是Winform开发,由于程序是在本地,使用try...catch不会有太大性能问题,可是如果是在web服 ...

  7. ResultSetMetaData rsmd = rs.getMetaData()是什么意思?

    ResultSetMetaData rsmt=rs.getMetaData(); 得到结果集(rs)的结构,比如字段数.字段名等.使用rs.getMetaData().getTableName(1)) ...

  8. Ubuntu Server 14.04 下root无法ssh登陆

    今天安装了Ubuntu Server 14.04   在终端配置了root密码后,使用SecureCRT和putty竟然不能ssh登陆,SecureCRT一直提示密码不对,但是可以肯定输入的密码100 ...

  9. .net RAW(16)与GUID互相转换

    .net 1.raw转guidnew guid(byte[] id);2.guid转rawGuid result;string ids = BitConverter.ToString(result.T ...

  10. C# 线程数

    理论上,一个进程可用虚拟空间是2G,默认情况下,线程的栈的大小是1MB,所以理论上最多只能创建2048个线程,但是一般不会到这么大,因为主线程要占内存,可能还要多点.如果要创建多于2048的话,必须修 ...