Kaggle竞赛顶尖选手经验汇总
What is your first plan of action when working on a new competition?
理解竞赛,数据,评价标准。
建立交叉验证集。
制定、更新计划。
检索类似竞赛和相关论文。
What does your iteration cycle look like?
Sacrifice a couple of submissions in the beginning of the contest to understand the importance of the different algorithms -- save energy for last 100 meters.
Do the following process for multiple models
- Select a model and do a recursive loop with the following steps:
- Transform data (scaling, log(x+1) values, treat missing values, PCA or none)
- Optimize hyper parameters of the model
- Do feature engineering for that model (as in generate new features)
- Do features' selection for that model (as in reducing them)
- Redo previous steps as optimum parameters are likely to have changed slightly
- Save hold-out predictions to be used later (meta-modelling)
- Check consistency of CV scores with leaderboard. If problematic, re-assess cross-validation process and re-do steps
Create partnerships. Ideally you look for people that are likely to have taken different approaches than you have. Historically (in contrast) I was looking for friends; people I can learn from and people I can have fun with - not so much winning.
Find a good way to ensemble
What does your iteration cycle look like?
It depends on the competition and I usually go through a few stages.
At the beginning I focus on data exploration and try some basic approaches so I iterate pretty quickly.
Once the obvious ideas are exhausted I usually slow down and do some research into the domain -- reading papers, forum post, etc. If I get an idea I would then implement it and submit it to the public LB.
My iteration cycle usually is short -- I rarely work on feature engineering that requires more than a few hours of coding for a particular feature.
My personal experience is that very complicated features usually do not work well -- possibly because of my buggy code.
What does your iteration cycle look like?
Read the overview and data description of the competition carefully
Find similar Kaggle competitions. As a relatively new comer, I have collected and done a basic analysis of all Kaggle competitions.
Read solutions of similar competitions.
Read papers to make sure I don’t miss any progress in the field.
Analyze the data and build a stable CV.
Data pre-processing, feature engineering, model training.
Result analysis such as prediction distribution, error analysis, hard examples.
Elaborate models or design a new model based on the analysis.
Based on data analysis and result analysis, design models to add diversities or solve hard samples.
Ensemble.
Return to a former step if necessary.
What does your iteration cycle look like?
I always prepare the dataset and apply feature engineering as much as I can, then I choose a training algorithm and optimize hyperparameters based on a cross validation score. If a model is good and stable I save the trainset and testset predictions. Then I start all over again using another training algorithm or model. When I have a handful of good model predictions, I start ensembling at the second level of training.
What does your iteration cycle look like?
Understand the dataset. At least enough to build a consistent validation set.
Build a consistent validation set and test its relationship with the leaderboard score.
Build a very simple model.
Look for approaches used in similar competitions in the past.
Start feature engineering, step by step to create a strong model.
Think about ensembling, be it by creating alternate versions of the feature set or using different modeling techniques (xgb, rf, linear regression, neural nets, factorization machines, etc).
What are your favorite machine learning algorithms?
ridge regression, resnet-50, GBT, XGB
What is your approach to hyper-tuning parameters?
用网格搜索。
基于交叉验证集。
查看类似竞赛,相关论文中类似问题下的设置。
对数据和算法的理解和经验。
观察调参前后的输出分布,受影响样本等。
In a few words, what wins competitions?
好的验证集,好的模型和特征,模型融合,从别的竞赛和论文中学习,遵守计划。
Kaggle竞赛顶尖选手经验汇总的更多相关文章
- 《Python机器学习及实践:从零开始通往Kaggle竞赛之路》
<Python 机器学习及实践–从零开始通往kaggle竞赛之路>很基础 主要介绍了Scikit-learn,顺带介绍了pandas.numpy.matplotlib.scipy. 本书代 ...
- 如何使用Python在Kaggle竞赛中成为Top15
如何使用Python在Kaggle竞赛中成为Top15 Kaggle比赛是一个学习数据科学和投资时间的非常的方式,我自己通过Kaggle学习到了很多数据科学的概念和思想,在我学习编程之后的几个月就开始 ...
- 初窥Kaggle竞赛
初窥Kaggle竞赛 原文地址: https://www.dataquest.io/mission/74/getting-started-with-kaggle 1: Kaggle竞赛 我们接下来将要 ...
- 用python参加Kaggle的些许经验总结(收藏)
Step1: Exploratory Data Analysis EDA,也就是对数据进行探索性的分析,一般就用到pandas和matplotlib就够了.EDA一般包括: 每个feature的意义, ...
- 《机器学习及实践--从零开始通往Kaggle竞赛之路》
<机器学习及实践--从零开始通往Kaggle竞赛之路> 在开始说之前一个很重要的Tip:电脑至少要求是64位的,这是我的痛. 断断续续花了个把月的时间把这本书过了一遍.这是一本非常适合基于 ...
- 由Kaggle竞赛wiki文章流量预测引发的pandas内存优化过程分享
pandas内存优化分享 缘由 最近在做Kaggle上的wiki文章流量预测项目,这里由于个人电脑配置问题,我一直都是用的Kaggle的kernel,但是我们知道kernel的内存限制是16G,如下: ...
- kaggle竞赛分享:NFL大数据碗(上篇)
kaggle竞赛分享:NFL大数据碗 - 上 竞赛简介 一年一度的NFL大数据碗,今年的预测目标是通过两队球员的静态数据,预测该次进攻推进的码数,并转换为该概率分布: 竞赛链接 https://www ...
- Kaggle竞赛入门:决策树算法的Python实现
本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...
- Kaggle竞赛入门(二):如何验证机器学习模型
本文翻译自kaggle learn,也就是kaggle官方最快入门kaggle竞赛的教程,强调python编程实践和数学思想(而没有涉及数学细节),笔者在不影响算法和程序理解的基础上删除了一些不必要的 ...
随机推荐
- JavaScript对接百度地图api实现地图标点功能
粗略的做了个地图标点功能 首先,去百度注册开发者账号,然后进入到百度地图开放平台 进入到控制台, 创建浏览器端应用,给个安全域名 然后去开发者文档JavaScript里面找地图展示文档,直接怼上去就行 ...
- poj 2942 求点双联通+二分图判断奇偶环+交叉染色法判断二分图
http://blog.csdn.net/lyy289065406/article/details/6756821 http://www.cnblogs.com/wuyiqi/archive/2011 ...
- 日志输出最不重要的就是控制台输出,控制台输出就是system.out而已
1.日志输出最不重要的就是控制台输出,控制台输出就是system.out而已 2.所以日志输出时候会存在一个Bug就是:stdout要配置在日志输出的最前面,因为stdout控制台输出,最不重要,如果 ...
- windows下本地安装oracle忘记密码,账号被锁咋办
忘记密码咋办: 进入cmd,输入set ORACLE_SID=ymxg (ORACLE_SID的值为你想登录的oracle实例的SID) 然后输入:sqlplus / as sysdba 最后输入: ...
- HDU多校赛第9场 HDU 4965Fast Matrix Calculation【矩阵运算+数学小知识】
难度上.,,确实...不算难 问题是有个矩阵运算的优化 题目是说给个N*K的矩阵A给个K*N的矩阵B(1<=N<=1000 && 1=<K<=6),先把他们乘起 ...
- oracle批量更新
oracle批量更新 学习了:http://blog.csdn.net/zkcharge/article/details/50855755 statement.addBatch(); statemen ...
- 《Android源代码设计模式解析与实战》读书笔记(八)
第八章.状态模式 1.定义 状态模式中的行为是由状态来决定,不同的状态下有不同的行为.当一个对象的内在状态改变时同意改变其行为,这个对象看起来像是改变了其类. 2.使用场景 1.一个对象的行为取决于它 ...
- 软件project师周兆熊给IT学子的倾情奉献
[来信] 贺老师: 你好,我是中兴通讯的一名软件开发project师,名叫周兆熊. 近期看了您的新书<逆袭大学:传给IT学子的正能量>,感觉你真心为当代学子答疑解惑.非常值得敬佩! 从上大 ...
- ural 1005 Stone Pile
这是道01背包题 ,尽管背包不会 ,可是还是看出来了,递推公式写啊写没写出来,后来一同学说是dfs.这我就開始了A了, 题意是给你n个重量是Wn的石头 让你放到两个包里面.看他们两个差值最小, ...
- SVNserver搭建和使用(二)
上一篇介绍了VisualSVN Server和TortoiseSVN的下载,安装,汉化.这篇介绍一下怎样使用VisualSVN Server建立版本号库,以及TortoiseSVN的使用. 首先打开V ...