Taxi Trip Time Winners' Interview: 3rd place, BlueTaxi

This spring, Kaggle hosted two competitions with the ECML PKDD conference in Porto, Portugal. The competitions shared a dataset but focused on different problems. Taxi Trajectory asked participants to predict where a taxi would drop off a customer given partial information on their journey, while Taxi Trip Time's goal was to predict the amount of time a journey would take given the same dataset.

418 players on 345 teams competed to predict the time a taxi journey would take.

Team BlueTaxi finished 3rd in Taxi Trip Time and 7th in Taxi Trajectory. This blog outlines how their team of data scientists from five different countries came together, and their winning approach to the Taxi Trip Time competition.

The Basics

The BlueTaxi Team

The BlueTaxi team is very multicultural: we are Ernesto (El Salvador), Lam (Vietnam), Alessandra (Italy), Bei (China), and Yiannis (Greece). We had great fun participating and winning third place in this ECML/PKDD Discovery Challenge, which was organized as the Kaggle Taxi Trip Time competition. In this post we would like to share with you how we did it.

What made you decide to enter this competition?

Ernesto: The Discovery Challenges, organized annually by ECML/PKDD, are always very interesting, and this year was no exception, especially with this one organized on top of Kaggle. After a short chat with Lam, we decided to go for it and enter as the BlueTaxi team for both tracks of the challenge.

I took the lead for the trip time prediction and Lam led the destination prediction.

We invited Alessandra, Bei, and Yiannis to consolidate the final team of five.

Lam: I decided to enter because the competition was hosted by Kaggle, and because the ECML/PKDD 2015 conference will be a great opportunity for researchers to exchange ideas and experience during the workshop session held in conjunction with the main conference in Porto next September.

Alessandra: Lam and Ernesto presented the challenge to me and I thought it was very interesting so I decided to join.

Bei: For me that was also the case.

Yiannis: The problem of predicting the destination and trip time for a taxi route seemed very challenging and that's why I decided to participate and join the BlueTaxi team.

What was your background prior to entering this challenge?

Lam: I did my PhD in pattern mining for data streams at Technische Universiteit Eindhoven (TU/e) and joined the IBM Research Lab in Ireland about a year and a half ago. My research interests include mining big and fast data on big data platforms, with applications in telcos and transportation under the smarter city project.

Alessandra: My background is in transportation analytics. Specifically, I work on the estimation and prediction of traffic and on urban traffic control systems. I hold a PhD in Information Technology from Politecnico di Milano, and I am currently a Research Scientist at IBM Research - Ireland (also known as the IBM Smarter Cities Technology Centre in Dublin, Ireland).

Bei: I am a statistician with primary interests in time series analysis, forecasting, resampling/subsampling methods for dependent data and financial econometrics. My recent work focuses on statistical methods in urban applications. I received my PhD in Statistics from the University of Waterloo, Canada. I joined IBM Research, Ireland, in late 2012.

Yiannis: I am a Research Software Engineer at the Smarter Cities Technology Centre, IBM Research - Ireland. I hold a Master's degree in Computer Science from Athens University of Economics and Business. Lately, I have been working with spatio-temporal data on various projects focusing on data curation, efficient storage, and analysis. I also have experience visualising similar types of data to help data scientists gain insights.

Ernesto: I hold a PhD in Computer Science from the L3S Research Center at the University of Hannover, Germany. My background is in supervised machine learning applied to Web Science, Social Media Analytics, and Recommender Systems. I joined IBM Research, Ireland, in early 2014.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Lam: In the IBM Research lab I have been working on several projects with similar data, for example GPS traces from buses used to predict bus arrival times at bus stops. See our related paper on this topic (Flexible Sliding Window for Kernel Regression Based Bus Arrival Time Prediction) in the industry track at ECML/PKDD 2015.

Ernesto: I did not have any particular domain knowledge in transportation systems, but my experience in machine learning, data science, and analytics was of course valuable for the competition.

Alessandra: Yes, my knowledge in the transportation field helped me in the challenge.

Yiannis: My experience with spatio-temporal data helped me in the competition.

Bei: I had some experience analyzing transportation data, which helped me in this challenge.

How did you get started competing on Kaggle?

Ernesto: I joined Kaggle a couple of years ago during my PhD and entered some competitions before. The datasets available on Kaggle are usually very interesting, and some of them were very useful for my research. But to be honest, I never got good traction in a competition until this time.

Lam: I joined Kaggle four years ago, also as a PhD student, but I hadn't tried to compete since then.

Bei: I joined Kaggle in 2012, and this was my second competition since then.

Alessandra and Yiannis: For us, this was the first time we had entered a Kaggle competition.

Let's get technical

What preprocessing and supervised learning methods did you use?

First, we created our local training set by selecting the same cut-off times as the five snapshots in the test set (matching the day of the week as well). We also observed that the 14th of August is the day before a big holiday in Portugal and the 21st of December is the last Sunday before Christmas, both very particular days.
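As a rough illustration of how such a local training set could be built, the sketch below cuts a completed training trip at a snapshot time and keeps only the part observed so far. It assumes one GPS update every 15 seconds and a trip stored as a start timestamp plus a list of points; the name `snapshot_trip` and the constant `GPS_INTERVAL_S` are hypothetical and not taken from the team's code.

```python
from typing import List, Optional, Tuple

GPS_INTERVAL_S = 15  # assumed spacing between consecutive GPS updates, in seconds

Point = Tuple[float, float]


def snapshot_trip(
    start_ts: int, polyline: List[Point], cutoff_ts: int
) -> Optional[Tuple[List[Point], int]]:
    """Cut a completed trip at cutoff_ts.

    Returns (points observed up to the cut-off, seconds remaining after it),
    or None if the trip was not in progress at cutoff_ts.
    """
    if len(polyline) < 2:
        return None
    end_ts = start_ts + (len(polyline) - 1) * GPS_INTERVAL_S
    if not (start_ts <= cutoff_ts < end_ts):
        return None
    # Number of GPS updates received at or before the cut-off.
    n_seen = (cutoff_ts - start_ts) // GPS_INTERVAL_S + 1
    return polyline[:n_seen], end_ts - cutoff_ts
```

Applying this to every training trip at each of the five snapshot times would give partial trips that look like the test set, with the remaining time available as a label.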

We created a bunch of features as follows:

1. Features from 10-NN. For every test trip we find its 10 nearest neighbours with respect to Euclidean distance and use the durations of those trips as predictors.

2. Features from kernel regression. Similar to 10-NN, kernel regression was used to predict the duration. Kernel regression is a smooth version of kNN, and these features yielded very good results.

3. Some features from the partial trips: travelled distance, number of GPS updates, last GPS coordinates, and average speed and acceleration at different parts of the trip.
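To make the relation between the first two feature families concrete, here is a minimal sketch. It assumes every trip has already been turned into a fixed-length numeric vector, and it uses a Gaussian kernel in a Nadaraya-Watson estimator; the vector representation, the kernel choice, the `bandwidth` parameter, and the function names are all assumptions, since the interview does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_durations(query, train_vecs, train_durations, k=10):
    """Durations of the k training trips closest to the query trip."""
    order = sorted(
        range(len(train_vecs)), key=lambda i: euclidean(query, train_vecs[i])
    )
    return [train_durations[i] for i in order[:k]]


def kernel_regression(query, train_vecs, train_durations, bandwidth=1.0):
    """Gaussian-weighted average of training durations (Nadaraya-Watson)."""
    weights = [
        math.exp(-euclidean(query, v) ** 2 / (2 * bandwidth ** 2))
        for v in train_vecs
    ]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed; fall back to the plain mean.
        return sum(train_durations) / len(train_durations)
    return sum(w * d for w, d in zip(weights, train_durations)) / total
```

Where kNN gives each of the k neighbours the same say and ignores everything else, the kernel estimate weights every training trip by its distance, which is the sense in which it is a smooth version of kNN.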

When matching a test trip with the training trips, we matched on only the last 100, 200, 300, 400, 500, or 1000 meters of the trip, as well as on the full trip, because the later part of a trip is more important in some cases. Our model shows that the last 500 meters of the trip are very important.
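One way to extract "the last N meters" of a partial trip is to walk backwards along the trajectory until enough path length has been covered. The sketch below does this with the haversine distance; it assumes points are stored as (longitude, latitude) pairs in degrees, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, p)
    lon2, lat2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def last_meters(polyline, meters):
    """Shortest suffix of the trajectory covering at least `meters` of path.

    Returns the whole trajectory if it is shorter than `meters`.
    """
    if not polyline:
        return []
    covered = 0.0
    i = len(polyline) - 1
    while i > 0 and covered < meters:
        covered += haversine_m(polyline[i - 1], polyline[i])
        i -= 1
    return polyline[i:]
```

Calling `last_meters` with 100, 200, 300, 400, 500, and 1000 on the same trip would give the set of suffixes that are then matched against the training trips.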

Two trips with different starting points but the same destination (Porto Airport). The later parts of the trajectories are very close to each other. Therefore, via trip matching, we can guess the destination of one trip if we know the destinations of trips with a similar route.

We also considered contextual matching (matching only trips with the same taxi id, the same day of the week, the same call id, etc.) because we observed different destination distributions for these contexts. The kernel regression on the taxi id context produced the best results.

When modelling, we did not predict the duration of the whole trip but instead predicted the additional (delta) travel time after the cut-off timestamp. Since the evaluation metric was RMSLE, we log-transformed the target labels, i.e., we used the log of the additional delta time.
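A minimal sketch of the target transform and the metric follows. RMSLE is computed here with the usual log(x + 1) form; using `log1p` for the training target as well, so that a remaining time of zero stays finite, is an assumption rather than something stated in the interview.

```python
import math


def to_log_target(remaining_s):
    """Map the remaining travel time (seconds) to the log-space training target."""
    return math.log1p(remaining_s)


def from_log_target(y):
    """Invert to_log_target: map a log-space prediction back to seconds."""
    return math.expm1(y)


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two lists of durations."""
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

Training on a log-space target means that an ordinary squared-error learner minimizes squared log error on the remaining time, which is close in spirit to the competition metric.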

Outlier handling: we found that trips with missing values (identified at speed limits of 160, 140, and 100 km/h) are more difficult to predict. We tried to recover this information on the test set by looking at the gap between the cut-off timestamp and the start timestamp. Unfortunately, this information is not reliable, so we decided to remove outliers based on the number of GPS updates, using a threshold of 3.5 on the absolute deviation from the median.
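The interview does not spell out the exact rule, but a threshold of 3.5 on the deviation from the median is commonly paired with the modified z-score, which scales each deviation by the median absolute deviation (MAD). The sketch below assumes that interpretation; the 0.6745 constant belongs to the modified z-score, and the function name is hypothetical.

```python
import statistics


def mad_inliers(values, threshold=3.5):
    """Keep values whose modified z-score is within the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to scale by; treat every value as an inlier.
        return values
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]
```

Applied to the number of GPS updates per trip, this would drop trips that are extremely long or short relative to the bulk of the data while leaving the median-centred majority untouched.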

The test dataset for this competition was very small (320 instances), which makes it very easy to overfit. Our final solution was a robust ensemble of several models: Random Forests, Gradient Boosted Trees, and Extremely Randomized Trees.
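The interview does not say how the individual models were combined. One simple option, assumed here purely for illustration, is to average the log-space predictions of the models and transform back to seconds; the function name is hypothetical, and the per-model predictions are taken as given.

```python
import math


def blend_log_predictions(per_model_log_preds):
    """Average log-space predictions across models, then map back to seconds.

    per_model_log_preds: one list of log1p-scale predictions per model,
    all of the same length (one entry per trip).
    """
    n_models = len(per_model_log_preds)
    n_trips = len(per_model_log_preds[0])
    blended = []
    for j in range(n_trips):
        mean_log = sum(model[j] for model in per_model_log_preds) / n_models
        blended.append(math.expm1(mean_log))
    return blended
```

Averaging several tree ensembles in this way tends to reduce variance, which matters when the test set is as small as 320 trips.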

BlueTaxi: the overall winning approach
