Are you interested in taking a course with us? Learn about our programs or contact us at hello@zipfianacademy.com.

There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant. There are far fewer resources about the steps to take to obtain the skills necessary to practice this elusive discipline. Here we provide a collection of freely accessible materials and content to jumpstart your understanding of the theory and tools of Data Science.

At Zipfian Academy, we believe that people learn at different paces and in different ways. If you prefer a more structured and intentional learning environment, we run a 12-week immersive bootcamp that trains people to become data scientists through hands-on projects and real-world applications.

We would love to hear your opinions on what qualities make great data scientists, what a data science curriculum should cover, and what skills are most valuable for data scientists to know.

While the information contained in these resources is a great guide and reference, the best way to become a data scientist is to make, create, and share!

Environment

While the emerging field of data science is not tied to any specific tools, there are certain languages and frameworks that have become the bread and butter for those working in the field. We recommend Python as the programming language of choice for aspiring data scientists due to its general purpose applicability, a gentle (or firm) learning curve, and — perhaps the most compelling reason — the rich ecosystem of resources and libraries actively used by the scientific community.

Development

When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.

Statistics

It is often said that a data scientist is someone who is better at software engineering than any statistician and better at statistics than any software engineer. Statistical inference underpins much of the theory behind data analysis, and a solid foundation in statistical methods and probability serves as a stepping stone into the world of data science.

Courses

While R is the de facto standard for performing statistical analysis, it has a steep learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, matplotlib, and pandas.
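
As a rough illustration of the kind of exercise these courses contain, the sketch below computes sample means and a Welch two-sample t statistic in Python. It deliberately uses only the standard-library `statistics` module so that it runs without installing anything; the measurements and the `welch_t` helper are invented for illustration and do not come from any of the courses.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (n - 1 denominator).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


# Made-up measurements for two groups (illustrative only).
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [5.6, 5.4, 5.5, 5.7, 5.3]

print("control mean:", statistics.mean(control))
print("treated mean:", statistics.mean(treated))
print("Welch t:", round(welch_t(treated, control), 3))
```

Once the arithmetic is clear, the same test is available as `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also reports a p-value, and it corresponds to what R's `t.test` does by default.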

Books

Well written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

Machine Learning/Algorithms

A solid base of computer science and algorithms is essential for an aspiring data scientist. Luckily there is a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.

Courses

Books

Data ingestion and cleaning

One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.

Courses

Tutorials

  • Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats.
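
The three steps named above (accounting for erroneous values, transforming the data, and sampling) can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the raw strings, the `clean_ages` helper, and the plausibility range of 1 to 119 are all hypothetical choices made for illustration, not part of the tutorial.

```python
import random


def clean_ages(raw_values):
    """Parse raw age strings, dropping missing or implausible values."""
    cleaned = []
    for raw in raw_values:
        text = raw.strip()          # remove stray whitespace
        if not text.isdigit():      # rejects "", "N/A", and "-3"
            continue
        age = int(text)             # transform the string into an integer
        if 0 < age < 120:           # assumed plausibility range
            cleaned.append(age)
    return cleaned


raw = [" 34", "N/A", "29 ", "", "-3", "250", "41", "57"]
ages = clean_ages(raw)

# Draw a reproducible random sample of the cleaned values.
rng = random.Random(0)
sample = rng.sample(ages, k=2)

print("cleaned:", ages)
print("sample:", sample)
```

Seeding the random generator keeps the sample reproducible, which matters when an analysis needs to be rerun and compared.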

Tools

  • OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.

  • DataWrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.

  • sed: “The ultimate stream editor” — used to process files with regular expressions often used for substitution.

  • awk: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.
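
For readers following the recommendation to stay in Python, the two command-line idioms above have close equivalents in the standard library. The sketch below shows a sed-style regular-expression substitution with `re.sub` and an awk-style pass over whitespace-separated columns; the log lines and their column layout are made up for illustration.

```python
import re

# Hypothetical log lines: date, user, status code, duration.
log_lines = [
    "2013-04-01  alice   200  12.5",
    "2013-04-01  bob     404  3.0",
    "2013-04-02  alice   200  7.5",
]

# sed-style substitution: rewrite ISO dates as MM/DD/YYYY.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
reformatted = [date_pattern.sub(r"\2/\3/\1", line) for line in log_lines]

# awk-style column processing: sum the fourth field for rows whose
# third field is "200", splitting on runs of whitespace.
total = 0.0
for line in log_lines:
    fields = line.split()
    if fields[2] == "200":
        total += float(fields[3])

print(reformatted[0])
print("total duration of successful requests:", total)
```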

Visualization

The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and although it is one of the most qualitative aspects of data science, its methods and tools are well documented.

Courses

Books

Tutorials

Tools

  • D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).

  • Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher-level abstraction than D3 for creating canvas- or SVG-based graphics.

  • Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.

  • modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).

  • Chart.js: Very simple (only six charts) HTML5 canvas-based plotting library with beautiful styling and animation.

Computing at Scale

When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To cope with the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large-scale batch processing since the release of Apache Hadoop, the open-source MapReduce framework, in 2006.
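
To make the programming model concrete, here is a single-process sketch of the classic word-count example in Python. It only imitates the map, shuffle, and reduce phases in memory; it is not Hadoop code, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical labels chosen for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["the quick brown fox", "the lazy dog", "The fox"]

# Each document could be mapped on a different machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]
# Each key could likewise be reduced independently of the others.
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())

print(counts)
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both phases can be spread across many machines, which is what lets the model scale.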

Courses

Books

Putting it all together

Data Science is an inherently multidisciplinary field that demands a myriad of skills from its practitioners. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for people with these abilities grows, universities and private companies are creating custom classes.

Courses

Books

Tutorials

Conclusion

This just scratches the surface of the infinitely deep field of Data Science, and we encourage everyone to go out and try their hand at some science! We would love for you to join the conversation over @zipfianacademy and let us know if you want to learn more about any of these topics.

Blogs

  • Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.

  • Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.

  • Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.

  • grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.

  • Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources).

  • no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.

Resources

If you made it this far, you should check out our 12-week intensive program. Apply here!

Source: http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-scienc
