Comprehensive learning path – Data Science in Python

Journey from a Python noob to a Kaggler on Python

So, you want to become a data scientist, or maybe you are already one and want to expand your tool repository. You have landed at the right place. The aim of this page is to provide a comprehensive learning path for people who are new to Python for data analysis, covering the steps you need to learn along the way. If you already have some background, or don’t need all the components, feel free to adapt the path to your own needs and let us know what you changed.

Step 0: Warming up

Before starting your journey, the first question to answer is:

Why use Python?

or

How would Python be useful?

Watch the first 30 minutes of this talk from Jeremy, Founder of DataRobot at PyCon 2014, Ukraine to get an idea of how useful Python could be.

Step 1: Setting up your machine

Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download Anaconda from Continuum.io (or get it directly from http://www.continuum.io/downloads). It comes packaged with most of the things you will ever need. The major downside of taking this route is that you will need to wait for Continuum to update their packages, even when an update to the underlying libraries is already available. If you are a beginner, that should hardly matter.

If you face any challenges in installing, you can find more detailed instructions for various OS here

Step 2: Learn the basics of Python language

You should start by understanding the basics of the language, its libraries and its data structures. The Python track from Codecademy is one of the best places to start your journey. By the end of this course, you should be comfortable writing small scripts in Python and should also understand classes and objects.

Specifically learn: Lists, Tuples, Dictionaries, List comprehensions, Dictionary comprehensions
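As a quick taste of the structures listed above, here is a minimal sketch. The variable names and values are made up purely for illustration and do not come from any of the courses mentioned.

```python
# A list is an ordered, mutable sequence.
scores = [72, 85, 90, 64]
scores.append(88)

# A tuple is an ordered, immutable sequence, handy for small fixed records.
point = (3, 4)

# A dictionary maps keys to values.
ages = {"alice": 31, "bob": 27}
ages["carol"] = 45

# List comprehension: build a new list by filtering an existing one.
passed = [s for s in scores if s >= 70]

# Dictionary comprehension: build a dictionary in a single expression.
name_lengths = {name: len(name) for name in ages}

print(passed)
print(name_lengths)
```

Comprehensions like these turn up constantly in data cleaning code, so it is worth getting comfortable with them early.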

Assignment: Solve the Python tutorial questions on HackerRank. These should get your brain thinking about Python scripting.

Alternate resources: If interactive coding is not your style of learning, you can also look at The Google Class for Python. It is a two-day class series and also covers some of the parts discussed later.

Step 3: Learn Regular Expressions in Python

You will need to use regular expressions a lot for data cleansing, especially if you are working on text data. The best way to learn regular expressions is to go through the Google class and keep this cheat sheet handy.
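To give a feel for what this looks like in practice, here is a small sketch using Python's built-in re module. The sample string and the simplified e-mail pattern are assumptions made up for illustration; real-world patterns usually need more care.

```python
import re

# A messy, made-up line of text such as you might find in raw data.
raw = "  Contact:   John.Doe@Example.com , phone 555-123-4567   (work)  "

# Collapse runs of whitespace into single spaces and trim both ends.
clean = re.sub(r"\s+", " ", raw).strip()

# Search for something that looks like an e-mail address (simplified pattern).
email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", clean)

# Find every run of digits in the text.
digits = re.findall(r"\d+", clean)

print(clean)
print(email.group(0).lower() if email else None)
print(digits)
```

re.sub, re.search and re.findall cover a large share of everyday cleaning tasks.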

Assignment: Do the baby names exercise

If you still need more practice, follow this tutorial for text cleaning. It will challenge you on various steps involved in data wrangling.

Step 4: Learn Scientific libraries in Python – NumPy, SciPy, Matplotlib and Pandas

This is where the fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.

  • Practice the NumPy tutorial thoroughly, especially NumPy arrays. This will form a good foundation for things to come.
  • Next, look at the SciPy tutorials. Go through the introduction and the basics, and do the remaining ones based on your needs.
  • If you guessed Matplotlib tutorials next, you are wrong! They are too comprehensive for our needs here. Instead, look at this IPython notebook up to line 68 (i.e. up to animations).
  • Finally, let us look at Pandas. Pandas provides DataFrame functionality (like R) for Python. This is also where you should spend a good amount of time practicing. Pandas will become your most effective tool for all mid-size data analysis. Start with a short introduction, 10 minutes to pandas. Then move on to a more detailed tutorial on pandas.
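Before diving into the tutorials, a tiny sketch may help show what these libraries feel like. It assumes NumPy and pandas are installed (both ship with Anaconda); the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applies to the whole array at once (vectorisation).
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
heights_m = heights_cm / 100
print(heights_m.mean())

# pandas: a DataFrame holds labelled columns, much like a data frame in R.
df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "sales": [10, 20, 30, 50],
})

# Keep only the rows that satisfy a condition.
big = df[df["sales"] >= 20]

# Aggregate a column by group.
totals = df.groupby("city")["sales"].sum()

print(big)
print(totals)
```

Boolean filtering and groupby are two of the operations you will probably use most often in day-to-day analysis.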

You can also look at Exploratory Data Analysis with Pandas and Data munging with Pandas

Additional Resources:

  • If you need a book on Pandas and NumPy, try “Python for Data Analysis” by Wes McKinney.
  • There are a lot of tutorials as part of Pandas documentation. You can have a look at them here

Assignment: Solve this assignment from the CS109 course at Harvard.

Step 5: Effective Data Visualization

Go through this lecture from CS109. You can ignore the initial 2 minutes, but what follows after that is awesome! Follow this lecture up with this assignment.

Step 6: Learn Scikit-learn and Machine Learning

Now we come to the meat of this entire process. Scikit-learn is the most useful library in Python for machine learning. Here is a brief overview of the library. Go through lecture 10 to lecture 18 of the CS109 course from Harvard. You will go through an overview of machine learning, supervised learning algorithms like regression, decision trees and ensemble modeling, and unsupervised learning algorithms like clustering. Follow individual lectures with the assignments from those lectures.
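To make the workflow concrete, here is a minimal sketch of a supervised learning run with scikit-learn. It assumes scikit-learn is installed and uses the small iris dataset bundled with the library. The choice of a decision tree, the 70/30 split and the depth limit are illustrative assumptions, not recommendations from the course.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small bundled dataset: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit a shallow decision tree on the training rows only.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Predict on the held-out rows and measure accuracy.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow this same fit and predict pattern, so swapping in another algorithm usually means changing only the line that creates the model.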

Additional Resources:

Assignment: Try out this challenge on Kaggle

Step 7: Practice, practice and Practice

Congratulations, you made it!

You now have all the technical skills you need. From here it is a matter of practice, and what better place to practice than competing with fellow Data Scientists on Kaggle? Go, dive into one of the live competitions currently running on Kaggle and give everything you have learnt a try!

Step 8: Deep Learning

Now that you have learnt most of the machine learning techniques, it is time to give Deep Learning a shot. There is a good chance that you already know what Deep Learning is, but if you still need a brief intro, here it is.

I am new to deep learning myself, so please take these suggestions with a pinch of salt. The most comprehensive resource is deeplearning.net. You will find everything here – lectures, datasets, challenges, tutorials. You can also give the course from Geoff Hinton a try to understand the basics of Neural Networks.

P.S. In case you need to use Big Data libraries, give Pydoop and PyMongo a try. They are not included here, as the Big Data learning path is an entire topic in itself.


LeaRning Path on R – Step by Step Guide to Learn Data Science on R

One of the common problems people face in learning R is the lack of a structured path. They don’t know where to start, how to proceed or which track to choose. Although there is an abundance of good free resources available on the Internet, this can be overwhelming as well as confusing.

After digging through endless resources and archives, here is a comprehensive Learning Path on R to help you learn R from scratch. This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!

Step 0: Warming up

Before starting your journey, the first question to answer is: Why use R? or How would R be useful?

Watch this 90-second video from Revolution Analytics to get an idea of how useful R could be. Incidentally, Revolution Analytics just got acquired by Microsoft.

Step 1: Setting up your machine

Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download the basic version of R and detailed installation instructions from CRAN (Comprehensive R Archive Network).

You can then install various other packages. There are 9000 packages in R, so this can get confusing. Accordingly, we will guide you to install just the basic R packages first. Here is a link to CRAN Views, which will help you understand how packages are grouped. You can then select the subtype of packages that you are interested in.

How to install a package: http://www.r-bloggers.com/installing-r-packages/

Some important packages to learn about: http://blog.yhathq.com/posts/10-R-packages-I-wish-I-knew-about-earlier.html

You should install these three GUIs with all dependent packages.

  • Rattle for Data Mining [Link] or install.packages("rattle", dep=c("Suggests"))
  • R Commander for Basic Statistics [Link] or install.packages("Rcmdr")
  • Deducer (with JGR) for Data Visualization [Link]

You should also install RStudio. It makes R coding much easier and faster, as it allows you to type multiple lines of code, handle plots, install and maintain packages and navigate your programming environment much more productively.

Assignment:

  1. Install R and RStudio.
  2. Install the packages Rcmdr, rattle, and Deducer. Install all suggested packages or dependencies, including the GUIs.
  3. Load these packages using the library command and open the GUIs one by one.

Step 2: Learn the basics of the R language

You should start by understanding the basics of the language, its libraries and its data structures. The R track from DataCamp is one of the best places to start your journey. In particular, see the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. By the end of this course, you should be comfortable writing small scripts in R and should also understand basic data analysis. Alternatively, you can try Code School for R at http://tryr.codeschool.com/

If you want to learn R offline in your own time, you can use the interactive package swirl from http://swirlstats.com

Specifically learn: read.table, data frames, table, summary, describe, loading and installing packages, data visualization using plot command

Assignment:

  1. Sign up at http://r-bloggers.com for the daily newsletter about the R project.
  2. Create a github account at http://github.com
  3. Learn to troubleshoot package installation above by googling for help.
  4. Install package swirl and learn R programming (see above)
  5. Learn from http://datacamp.com

Alternate resources: If interactive coding is not your style of learning, you can also look at The Two Minute Tutorials on R at http://www.twotorials.com/ . It is a video series and also covers some of the parts discussed here. You can also read a comprehensive blog post titled 50 functions to help you clear a job interview in R here.

Step 3: Learn Data Manipulation

You will need data manipulation skills a lot for data cleansing, especially if you are working on text data. The best way to learn is to go through the text manipulation and numerical manipulation exercises. You can learn about connecting to databases through the RODBC package and running SQL queries against data frames through the sqldf package.

Assignment:

  1. Read about the split, apply, combine approach for data analysis from the Journal of Statistical Software.
  2. Try learning about the tidy data approach for data analysis.
  3. Learn to connect to an RDBMS, for example a MySQL database, through R.
  4. You really should do a data quality exercise.
  5. Bored with analyzing numbers alone? Try sports analytics with a cricket analysis using R.
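The split, apply, combine idea from the first assignment is language-independent. As a rough sketch, here it is in plain Python (the language used in the first half of this page) rather than with R's plyr; the data and names are invented for illustration.

```python
from collections import defaultdict

# Made-up records: (role, value) pairs.
rows = [("bat", 40), ("bowl", 3), ("bat", 75), ("bowl", 5), ("bat", 20)]

# Split: group the values by their key.
groups = defaultdict(list)
for role, value in rows:
    groups[role].append(value)

# Apply: compute a summary (here, the mean) within each group.
# Combine: collect the per-group results into a single dictionary.
means = {role: sum(values) / len(values) for role, values in groups.items()}

print(means)
```

In R, packages such as plyr wrap these three stages into a single function call.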

If you still need more practice, you can sign up for a $25/month subscription at DataCamp that gives you access to all tutorials. Please also go through the slides for plyr here.

Step 4: Learn specific packages in R – data.table and dplyr

This is where the fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.

  • Practice the data.table tutorial thoroughly here. Print and study the cheat sheet for data.table.
  • Next, you can have a look at the dplyr tutorial here.
  • For text mining, start with creating a word cloud in R and then learn through this series of tutorials: Part 1 and Part 2.
  • For social network analysis, read through these pages.
  • Do sentiment analysis using Twitter data – check out this and this analysis.
  • For optimization through R, read here and here.

Additional Resources:

Step 5: Effective Data Visualization through ggplot2

  1. Read about Edward Tufte and his principles on how to make (and not make) data visualizations here. Especially read about data-ink, lie factor and data density.
  2. Read about the common pitfalls of dashboard design by Stephen Few.
  3. To learn the grammar of graphics and a practical way to apply it in R, go through this link from Dr Hadley Wickham, creator of ggplot2 and one of the most brilliant R package creators in the world today. You can download the data and slides as well.
  4. Are you interested in visualizing data for spatial analysis? Go through the amazing ggmap package.
  5. Interested in making animations through R? Look through these examples. Animate package will help you here.
  6. Slidify will help supercharge your graphics with HTML5.

Step 6: Learn Data Mining and Machine Learning

Now we come to the most valuable skill for a data scientist: data mining and machine learning. You can see a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/ . The rattle package really helps here with its easy-to-use Graphical User Interface (GUI). You can also read a free, open-source, easy-to-understand book at http://togaware.com/datamining/survivor/index.html

You will go through an overview of algorithms like regression, decision trees, ensemble modeling and clustering. You can also see the various machine learning options available in R in the relevant CRAN view here.

Additional Resources:

  • If there is one book on data mining using R you want, it is the one on Rattle.
  • You can learn about time series forecasting from this booklet – A Little Book for Time Series in R.
  • Some machine learning in R is here. You can enroll in a free course here.

Step 7: Practice, practice and Practice

Congratulations, you made it!

You now have all the technical skills you need.

  1. From here it is a matter of practice, and what better place to practice than competing with fellow Data Scientists on Kaggle? This practice contest will help you get started: https://www.kaggle.com/c/titanic-gettingStarted
  2. Read about a more advanced Kaggle analysis here: http://0xdata.com/blog/2014/09/r-h2o-domino/
  3. Stay in touch with what your fellow R coders are doing by subscribing to http://www.r-bloggers.com/
  4. Interact with them on Twitter using the #rstats hashtag.
  5. Stuck somewhere? This website is great for learning R quickly as it gives you just the right amount of information.

Step 8: Advanced Topics

Now that you have learnt most of data analytics using R, it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.

  1. For using R with Hadoop, see this tutorial on using RHadoop.
  2. A tutorial on using R with MongoDB.
  3. Another nice tutorial on Big Data analysis using R in the NoSQL era.
  4. You can make interactive web applications using Shiny from RStudio.
  5. Interested in learning how R and Python syntax relate? Read through this guide.

P.S. In case you need to use Big Data a lot, please also have a look at the RevoScaleR package from Revolution Analytics. It is commercial, but academic usage is free. An example project is given here.


Notes

Business Analyst using SAS
LeaRning Data Science on R – step by step guide
Data Science in Python – from a python noob to a Kaggler
Data Visualization with QlikView – from starter to a Luminary
Machine Learning with Weka
