11 Facts about Data Science that you must know
Statistics, Machine Learning, Data Science, or Analytics: whatever you call it, this discipline has been on the rise over the last quarter of a century, primarily owing to growing data collection abilities and an exponential increase in computational power. The field draws from a pool of engineers, mathematicians, computer scientists, and statisticians, and increasingly demands a multi-faceted approach for successful execution. In fact, no branch of engineering, science, or business in any industry is far from the touch of analytics. Perhaps you, too, are interested in becoming, or already are, a data scientist.
However, as one journeys through a career in analytics, some truths become evident over time. And while none of them is earth-shattering, they often surprise novices in the field. So it’s worthwhile to know these 11 facts of data science.
1. Data is never clean

Analytics without real data is a mere collection of hypotheses and theories. Data helps test them and find the one that suits the end use at hand. However, in the real world data is never clean. Even in organizations that have had well-established data science centers for decades, data isn’t clean. Apart from missing or wrong values, one of the biggest problems is joining multiple datasets into a coherent whole. The join key may not be consistent, or the granularity or format may not be suitable. And it’s not intentional. Enterprise data stores are designed for, and tightly integrated with, the front-end software and the users who generate the data, and they are often created independently of one another. The data scientist enters the scene quite late, and is often just a “taker” of data as-is rather than a part of its design.
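To make the join-key problem concrete, here is a minimal sketch in plain Python. The two record sets, their field names (`customer_id`, `cust`), and the normalization rule are all hypothetical, invented only to show how IDs that look different can refer to the same entity.

```python
# Hypothetical records from two systems that were built independently.
crm_rows = [
    {"customer_id": "C-001", "name": "Asha"},
    {"customer_id": "c002 ", "name": "Ben"},
    {"customer_id": "C-003", "name": "Chen"},
]
billing_rows = [
    {"cust": "c001", "amount": 120.0},
    {"cust": "C-002", "amount": 75.5},
    {"cust": "C-004", "amount": 10.0},
]


def normalize_key(raw):
    """Reduce an ID to an assumed canonical form: trimmed, upper-case, no hyphens."""
    return raw.strip().upper().replace("-", "")


def inner_join(left, right, left_key, right_key, normalize=None):
    """Inner-join two lists of dicts on a key, optionally normalizing it first."""
    norm = normalize or (lambda value: value)
    # Index the right-hand rows by their (possibly normalized) key.
    index = {}
    for row in right:
        index.setdefault(norm(row[right_key]), []).append(row)
    joined = []
    for row in left:
        for match in index.get(norm(row[left_key]), []):
            joined.append({**row, **match})
    return joined


naive = inner_join(crm_rows, billing_rows, "customer_id", "cust")
cleaned = inner_join(crm_rows, billing_rows, "customer_id", "cust", normalize_key)

print("matches on raw keys:       ", len(naive))
print("matches on normalized keys:", len(cleaned))
```

In practice the right normalization rule depends on how each source system writes its IDs, which is exactly the kind of thing you only learn by looking at the data.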
2. You will spend most of your time cleaning and preparing data

A corollary to the above is that a large part of your time will be spent just cleaning and processing data for model consumption. This usually annoys people new to the industry. With a brilliant mind bursting with sophisticated machine learning methods, spending three-fourths of your time on data wrangling seems a waste of talent and time. Often this leads to dissatisfaction and lack of attention, and the resulting errors can come back to bite even the fanciest of algorithms. If you cannot do this with equanimity and keep your focus on the big picture, then perhaps you should aim for research in statistics rather than a career in data science.
3. There is no fully automated data science. You need to get your hands dirty
Since data is not clean and requires a good deal of processing, there is no ready set of scripts to run or buttons to push to develop an analytic model. Every dataset and problem is different. There is no substitute for exploring data, testing models, and validating against business sense and domain experts. Depending on the problem and your prior experience, you may dirty your hands less, but dirty them you will. The only exception is if you get data in a specific format and do the same thing over and over, but that already sounds boring, doesn’t it?
4. 95% of the tasks do not require deep learning
95% is obviously a made-up number, but the idea is that most real-life problems don’t require advanced analytic capabilities. Solving real-world problems involves a lot more understanding of the real world, the problem domain, the decision makers, and the end users than understanding of the latest and greatest discovery in statistics. What moves the needle, and moves it quickly, is much more valuable than what is rigorous and pure. Often the simplest models, like linear regression, logistic regression, and k-means clustering, work wonders as long as the problem is well formulated. Even for complex problems, simple models can provide large gains that complex models can only improve on marginally. That is not to say that complicated models have no place. In fact, depending on the money riding on it, a 0.1% increase in prediction accuracy may be worth millions of dollars.
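As a rough illustration of how far a simple model can go, the sketch below fits a one-variable linear regression by ordinary least squares and compares it with the crudest possible baseline, predicting the mean every time. The five data points are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(ys, predictions):
    """Total squared prediction error; lower is better."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


# Made-up data: x could be ad spend, y could be sales.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)

# Baseline: always predict the average of y.
mean_y = sum(ys) / len(ys)
sse_baseline = sum_squared_error(ys, [mean_y] * len(ys))

# Simple model: predict from the fitted line.
sse_line = sum_squared_error(ys, [slope * x + intercept for x in xs])

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"baseline SSE={sse_baseline:.3f}, linear model SSE={sse_line:.3f}")
```

Once a line removes most of the error, whatever a more complex model adds can only come out of the small remainder, which is the sense in which complex models "improve marginally".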
5. Big Data is just a tool

With the hype around Big Data getting louder every day, I won’t blame you for being enamored of the idea. However, the key thing to remember is that Big Data is just a collection of tools for working with large volumes of data in reasonable time and on commodity-grade computer hardware. The underlying analytic problem design, modeling best practices, and the scrutinizing eyes of an astute analyst can’t be replaced by Big Data. That is not to say that competency in Big Data techniques isn’t handy. It is, all the more so since the world is moving towards Big Data and there may not be any “Small” Data left in a couple of years. But tools will come and go; only your machine learning experience will persist. Big Data is analogous to giving a policeman an AK-47 rather than a flintlock carbine. Sure, a better tool is preferable to an inferior one, but being trained in policing is more important than the rifle.
6. You should embrace the Bayesian approach
Data science is a sequence of hypothesis tests. You must have a going-in belief that you want to prove right or wrong based on observations from data. The stronger your going-in belief, the more counter-evidence you need to prove it wrong. That, in essence, is the Bayesian approach. But while proving your hypothesis right through data is important, proving the alternative hypotheses wrong is equally important. Try this fun puzzle from the New York Times to figure out how to think Bayesian.
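The claim that a stronger prior needs more counter-evidence follows directly from Bayes’ rule, and a small sketch shows it. The likelihoods used here (an observation that is seen 30% of the time if the hypothesis is true and 70% of the time if it is false) are arbitrary assumptions chosen for illustration.

```python
def update(prior, p_obs_if_true, p_obs_if_false):
    """Bayes' rule: probability the hypothesis is true after one observation."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1 - prior) * p_obs_if_false)


def observations_to_doubt(prior, p_obs_if_true=0.3, p_obs_if_false=0.7,
                          threshold=0.5):
    """Count the contradicting observations needed to push belief below threshold."""
    belief = prior
    count = 0
    while belief >= threshold:
        belief = update(belief, p_obs_if_true, p_obs_if_false)
        count += 1
    return count


weak_prior_steps = observations_to_doubt(0.60)
strong_prior_steps = observations_to_doubt(0.99)

print("contradicting observations needed, prior 0.60:", weak_prior_steps)
print("contradicting observations needed, prior 0.99:", strong_prior_steps)
```

Each contradicting observation multiplies the odds in favor of the hypothesis by the same factor (here 0.3 / 0.7), so a belief that starts with higher odds simply takes more steps to fall below even.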
The alternative to Bayesian thinking is to let your data tell you stories. This can be problematic because, sliced and diced some way, data will always tell a story. But without an a priori belief, the story may not be true in reality. This is often a case of hindsight bias and poor research (and often a staple of motivational and self-help books). If you want to find differences between two groups (successful businesses versus unsuccessful ones, athletes versus slobs, rich versus poor), you can always find some. There are so many human characteristics, hundreds of thousands of them, that some will come out different just by chance. That doesn’t mean those characteristics made one group different from the other. On the other hand, if you have a reasonable hypothesis about what could be causing the difference, you can verify whether you are right or not. In the end, either you explain the model’s results based on your understanding, or you modify your understanding. There is no point in saying that length of nose hair is predictive of a person’s income at fifty just because the model says so.
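The “some will come out different just by chance” effect is easy to simulate. In the sketch below, two groups are drawn from exactly the same distribution for each of 1,000 made-up traits, so any “significant” difference is spurious by construction. The group size, the number of traits, and the use of a simple two-sample z statistic with a 1.96 cutoff are assumptions made for the illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N_TRAITS = 1000   # hypothetical number of characteristics compared
GROUP_SIZE = 50   # hypothetical number of people per group


def z_score(group_a, group_b):
    """Two-sample z statistic for the difference in group means."""
    standard_error = (
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    ) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


spurious = 0
for _ in range(N_TRAITS):
    # Both groups come from the same distribution: there is no real difference.
    group_a = [random.gauss(0, 1) for _ in range(GROUP_SIZE)]
    group_b = [random.gauss(0, 1) for _ in range(GROUP_SIZE)]
    if abs(z_score(group_a, group_b)) > 1.96:  # roughly the usual 5% cutoff
        spurious += 1

print(f"{spurious} of {N_TRAITS} traits look 'significantly different' by chance")
```

With a 5% cutoff, roughly one trait in twenty is expected to clear the bar even though nothing real separates the groups, which is why a hypothesis stated before looking at the data matters.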
7. No one cares how you did it

The consumers of data science models are decision makers and executives, and they want a workable and useful model. While it’s tempting for data scientists to explain the technical expertise behind the model and show off the analytic rigor, this is often counter-productive. Your audience cares about the outcome and the end use, not about the decision engine you have put together. In fact, complicated explanations of the mathematics behind the model are a sure way to bore your users and intimidate them out of using it. Save the technical discussions for your data science peers.
8. Academia and business are two different worlds

This applies to almost all disciplines, and analytics is no exception. The focus in academia is on discovering new methods and proving new theorems. The focus in business is on solving a problem and making money. It doesn’t matter whether the analytics behind the solution is fancy or not, and no one cares about that anyway. Speed is often more important than accuracy. Every business analytics solution should solve a real-life problem and contribute, directly or indirectly, to the bottom line.
9. Presentation is key

Since the end user and decision maker is often a non-mathematical person, selling an analytic solution isn’t different from any other sale. You can sell on quality, meaning analytic accuracy, but you can also sell on emotions, aesthetics, story, human angle, and money. Being able to explain your method in simple terms and align it with end users’ interests is an art that every data scientist who wants to make a significant non-theoretical mark on the world must master. At least for a while, that means storytelling through PowerPoint should remain a key weapon in your arsenal.
10. All models are wrong, but some are useful
Models, by definition, model some ‘truth’ in the world. Since the world is infinitely complex (think quantum mechanics!), models are approximations of reality. Some models are more wrong than others, but all are wrong. However, they can be, and often are, useful, since they are better than the alternative of no model and no prediction. Realizing what we are aiming for and what we are competing against can be important in shaping our analytic design process, and in checking our egos.
11. Just because an analytic model is great doesn’t mean it will see the light of day
As fun as data science is, there is more to the world than your analytical model. If you see about a third or more of your work getting implemented or used, consider yourself lucky. Regardless of analytic merit, analytic projects get shelved for various reasons all the time: the data changed, the problem changed, no one is interested in the solution, implementation is too expensive, the benefit isn’t worth the cost, someone else did it first, or the solution is too advanced for its time. Keep calm and carry on.
I realize that perhaps there are more than 11, and perhaps some of these could be clubbed together. The point is not the count, but the importance of internalizing these realities of the industry we want to be part of. Different companies and industries might sit at different points on the spectrum for each of these facts, but knowing and understanding these ‘facts’ collectively will make one a more satisfied, broad-minded, and better data scientist.
(Did I miss any fundamental fact of the world of data science? Share in the comments below.)
Most facts are picked from Reddit.com