How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo
An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog, he gets into details on his approach and shares key visualizations (with code!) from his analysis.

351 players on 321 teams built models to predict probabilistic distributions of hourly rainfall
The Basics
What was your background prior to entering this challenge?
My background is primarily in cognitive science and biology, but I dabbled in many different areas while in school. My particular interests are in human learning and behavior and how we can use human activity traces to learn to shape future actions.

Devin's profile on Kaggle
My interest in gaming as a means of teaching, along with my competitive nature, has made Kaggle a great fit for my learning style. I started competing seriously on Kaggle in October 2014. I did not have much experience with programming or applied machine learning, and thought entering a competition would provide a structured introduction. Once I started competing I found I had a difficult time stopping.
What made you decide to enter this competition?
I thought there was a decent chance I could get into the top five in the competition and this drove me to enter. After finishing the BCI competition I had to decide between the Otto group product challenge and this one. I chose How Much Did It Rain? because the dataset was difficult to process and it wasn't obvious how to approach the problem. These factors favored my skills. I didn't feel I could compete in Otto, where the determining factor was primarily going to be ensembling skill.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
Most of the preprocessing was just feature generation. Like most other competitors, I used descriptive statistics and counts of the different error codes. These made up the bulk of my features and turned out to be enough to get first place. We were given QC'd reflectivity data, but instead of using this information to limit the data used in feature generation, I included it as a feature and let the learning algorithm (gradient boosted decision trees) use it as needed.
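As a rough sketch of this kind of feature generation (the file name, the error-code values, and the assumption that each Reflectivity cell holds a space-delimited sequence of scans are illustrative, not the exact pipeline):

```python
import numpy as np
import pandas as pd

# Assumed error codes, used as count features rather than as a filter.
ERROR_CODES = (-99900.0, -99901.0, -99903.0)

def parse_scans(cell):
    """Turn a cell like '24.5 26.0 -99903.0' into a float array of radar scans."""
    return np.array([float(v) for v in str(cell).split()])

def row_features(cell):
    scans = parse_scans(cell)
    valid = scans[~np.isin(scans, ERROR_CODES)]
    feats = {
        'n_scans': len(scans),
        'mean': valid.mean() if len(valid) else np.nan,
        'std': valid.std() if len(valid) else np.nan,
        'max': valid.max() if len(valid) else np.nan,
    }
    # Counts of each error code become features in their own right.
    for code in ERROR_CODES:
        feats['count_%d' % int(code)] = int((scans == code).sum())
    return feats

train = pd.read_csv('train_2013.csv')  # hypothetical file name
reflectivity_feats = pd.DataFrame(
    [row_features(c) for c in train['Reflectivity']]
).add_prefix('Reflectivity_')
```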
The most important decision with regard to supervised learning was how to model the output probability distribution. I decided to transform the problem into a multi-class classification problem with soft output. Since there was not enough data to perform classification with the full 70 classes, the problem had to be reduced further. It turned out there were many different ways that people solved this, and I highly recommend reading the end-of-competition thread for some other approaches.
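For concreteness, a minimal sketch of reducing the label to a handful of classes by binning Expected; the bin edges and file name here are placeholders rather than the winning class definitions:

```python
import numpy as np
import pandas as pd

train = pd.read_csv('train_2013.csv')  # hypothetical file name

# Bin the hourly rain amount into a small number of classes; everything
# above 69 mm is pooled into a single heavy-rain class.
bins = [-np.inf, 0.01, 1, 3, 10, 69, np.inf]
train['rain_class'] = pd.cut(train['Expected'], bins=bins, labels=False)
print(train['rain_class'].value_counts().sort_index())
```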

See the code on scripts
I ended up using a simple method in which basic component probability distributions were combined using the output of a classification algorithm. For classes that had enough data, a step function was used as the CDF. When there was less data, several labels were combined and replaced by a single class; in that case an estimate of the empirical distribution for the class was used as the component CDF. This method worked well and I used it for most of the competition. I did try regression and classification on just the data from the minority classes, but it never performed quite as well as simply using the empirical distribution.
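A minimal sketch of that mixture, assuming an illustrative class layout (a few single-value classes plus one merged minority class) and placeholder classifier output:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.arange(70)  # the submission asks for P(rain <= t) for t = 0..69 mm

def step_cdf(value):
    """Step-function CDF for a class modeled as a single rain amount."""
    return (thresholds >= value).astype(float)

def empirical_cdf(labels):
    """Empirical CDF of the labels pooled into a merged minority class."""
    labels = np.asarray(labels)
    return np.array([(labels <= t).mean() for t in thresholds])

# Placeholder pooled labels for the merged heavy-rain class.
heavy_rain_labels = rng.uniform(4, 69, size=500)

component_cdfs = np.vstack([
    step_cdf(0.0), step_cdf(0.25), step_cdf(1.0), step_cdf(3.0),
    empirical_cdf(heavy_rain_labels),
])

# Placeholder soft output from the classifier: one probability vector per row.
class_probs = rng.dirichlet(np.ones(5), size=10)

predicted_cdfs = class_probs @ component_cdfs  # (n_rows, 70) mixture CDF per row
```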
What was your most important insight into the data?
Early in the competition I discovered that it was helpful to split the data based on the number of radar scans in each row. Each row has data spanning the hour previous to the rain gauge reading; in some cases there was only one radar scan, in others there were more than 50. There are over one hundred thousand rows in the training set with more than 17 radar scans. For this data I wanted to create features that take into account the changing of weather conditions over time, and in doing so I realized it was not possible to make these features for the rows that had only 1 or 2 radar scans. This was the initial reason for splitting the dataset. When I started looking for places to split it, I found that there was also a strong positive correlation between the number of radar scans and the average rain amount. In the rows with a single scan, 95% of the labels were 0mm of rain, while in the subset with 17 or more scans only 48% were 0mm. Interestingly, for the data with few radar scans many of the most important features were the counts of the error codes.

See the code on scripts
In contrast, the most important features in the data with many scans were derived from Reflectivity and HybridScan, which have a physical relationship to rain amount. Splitting the data allowed me to use many more features for the higher-scan data, which gave a large boost to the score. Over 65% of the error came from the data with more than 7 scans. The data with few scans contributed a very small amount to the final score, so I was able to spend less time modeling those subsets.
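The split itself is easy to reproduce; the sketch below counts the scans per row and tabulates the share of 0mm labels per bucket. The TimeToEnd column name and the exact bin edges are assumptions here:

```python
import numpy as np
import pandas as pd

train = pd.read_csv('train_2013.csv')  # hypothetical file name

# The number of radar scans in a row equals the number of space-delimited
# values in any of the per-scan columns; TimeToEnd is used here.
n_scans = train['TimeToEnd'].astype(str).str.split().str.len()

scan_bins = pd.cut(n_scans, bins=[0, 1, 2, 7, 17, np.inf],
                   labels=['1', '2', '3-7', '8-17', '18+'])
summary = train.groupby(scan_bins)['Expected'].agg(
    ['size', lambda s: (s == 0).mean(), 'mean'])
summary.columns = ['rows', 'frac_zero_mm', 'mean_rain_mm']
print(summary)  # the share of 0 mm labels falls sharply as the scan count grows
```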
Were you surprised by any of your findings?
The most mysterious aspect of the competition was the 5000 rows in the training data that had an Expected rain amount over 70mm. The requirements of the competition only asked us to model up to 69mm of rain in an hour, but the evaluation metric punished large errors so severely that I felt compelled to figure out how to predict these large values. A quick calculation showed that, of the 1.1 million rows in the training set, these 5000 large values, if mis-predicted, would account for half of my error.
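A rough version of that back-of-the-envelope calculation, assuming the metric is the per-row CRPS over the 70 thresholds, a typical step prediction at 1mm, and a ballpark average error for ordinary rows:

```python
import numpy as np

thresholds = np.arange(70)

def crps_row(pred_cdf, y):
    """Per-row CRPS: mean squared gap between the predicted CDF and the step at y."""
    heaviside = (thresholds >= y).astype(float)
    return ((pred_cdf - heaviside) ** 2).mean()

pred = (thresholds >= 1.0).astype(float)  # a typical light-rain prediction

err_typical = crps_row(pred, 1.0)    # 0.0: the prediction matches a 1 mm label
err_outlier = crps_row(pred, 100.0)  # ~0.99: close to the per-row maximum of 1.0

n_rows, n_outliers = 1_100_000, 5_000
typical_row_error = 0.004  # assumed average CRPS of an ordinary, well-predicted row
share = (n_outliers * err_outlier) / (
    n_outliers * err_outlier + (n_rows - n_outliers) * typical_row_error)
print('rough share of error from the large labels: %.2f' % share)  # about half
```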
It turned out that many of the samples with labels above 70mm did not have reflectivity values indicating heavy rain. I was still able to improve my local validation score by treating the large rain-amount samples as their own class and using an all-zero CDF when generating the final prediction. Unfortunately, this also worsened my public leaderboard score by a large amount.

See the code on scripts
Through leaderboard feedback I was able to determine that there were differences in the distribution of these large values in the 2013 training set and the 2014 test set. Removing the rows with large values from the training set turned out to be the best course of action.
My hypothesis about the large values is that they were generated by specific rain gauges, which the learning algorithm was able to detect using features based on DistanceToRadar and the -99903 error code. The -99903 error code can correspond to physical blockage of a radar beam by mountains or other objects. Both of these features can help identify specific rain gauges, which would lead to overfitting the training set if the malfunction was fixed before the start of 2014. As I don't have access to the 2014 labels, this will remain speculation for now.
Which tools did you use?
I used Python for this competition, relying heavily on pandas for data exploration and direct NumPy implementations when I needed things to be fast. This was my first competition using XGBoost, and I was very pleased with its ease of use and speed.
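For reference, a minimal XGBoost setup of the kind used here: multi-class training with soft probability output that could feed the CDF mixture above. The parameters and placeholder data are illustrative, not the winning configuration:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # placeholder feature matrix
y = rng.integers(0, 5, size=1000)  # placeholder class labels (5 classes)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softprob',  # per-class probabilities, not hard labels
    'num_class': 5,
    'max_depth': 6,
    'eta': 0.1,
    'eval_metric': 'mlogloss',
}
booster = xgb.train(params, dtrain, num_boost_round=200)
class_probs = booster.predict(xgb.DMatrix(X))  # shape (n_rows, 5)
```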
How did you spend your time on this competition?
I probably spent 50% of my time coding, and then refactoring when I realized my implementation was not flexible enough to incorporate my new ideas. I also tried several crazy things that required substantial programming time that I didn't end up using.
The other 50% was split pretty equally between feature engineering, data exploration, and tweaking my classification framework.
Words of Wisdom
What have you taken away from this competition?
I spent many hours coding and refactoring in this competition. Since I had to do nearly the same thing on five different datasets, having to manually code everything made it difficult to try new ideas. Having a flexible framework for trying out many ideas is critical, and this is one of the things I spent time learning how to do in this competition. The effort has already paid off in other competitions.
With only one submission a day, it was important to try things out in a systematic way. What worked best was changing one aspect of my method and seeing whether it improved my score. I needed to keep records of everything I did, or it was possible to waste time redoing things I had already tried. Having the discipline to stay on track and not try too many things at once is critical for doing well, and this competition put me to the test on this.
Do you have any advice for those just getting started competing on Kaggle?
Read the Kaggle blog post profiling KazAnova for a great high-level perspective on competing. I read it about two weeks before the end of the competition, and I started saving my models and predictions and automating more of my process, which allowed for some late improvements.
Other than this, I think it's very helpful to read the forums and follow up on hints given by those at the top of the leaderboard. Very often people will give small hints, and I have gotten in the habit of following up on even the smallest clues. This has taught me many new things and helped me find critical insights into problems.
Just for Fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
With the introduction of Kaggle Scripts, it seems it will now be possible to have solution code evaluated remotely instead of requiring competitors to submit a CSV submission file. I think this functionality opens up the possibility of solving new types of problems that were not feasible in the past.
With this in mind, I would like to run a problem that favors reinforcement-learning-based solutions. As a simple example, we could teach an agent to explore mazes. The training set would consist of several different mazes (perhaps it would be better to have competitors generate their own training data), and the test set could be another set of unseen mazes hosted on Kaggle. All the training code would be required to run directly on Scripts, making the transition to an evaluation server easy. I don't think this type of problem would have worked without Scripts, and I think it would be fun to see whether agent-learning problems can be turned into Kaggle competitions.
Another possibility with remote execution of solutions would be a Rock Paper Scissors programming tournament. There are already some RPS tournaments available online. Perhaps hosting a variant as a knowledge competition would be possible as these types of competitions are really fun.
What is your dream job?
Ideally I would like to work with neural and behavioural data to help improve human performance and alleviate problems related to mental illness. There are many very challenging problems in this area. Unfortunately most of the current classification frameworks for mental illness are deeply flawed. My dream job would allow for the application of diverse descriptions, methods, and sensors, without the need to push a product out immediately.
My sense is that the amount of theoretical upheaval needed is holding back research in academia, and the ineffectiveness of most current techniques is hampering the development of new businesses (plus the legal issues of the health industry). I would be interested in any project that is making progress through this mire.