Seven Python Tools All Data Scientists Should Know How to Use
If you’re an aspiring data scientist, you’re inquisitive – always exploring, learning, and asking questions. Online tutorials and videos can help you prepare for your first role, but the best way to ensure that you’re ready to be a data scientist is to become fluent in the tools people use in the industry.
I asked our data science faculty to put together seven Python tools that they think all data scientists should know how to use. The Galvanize Data Science and GalvanizeU programs both focus on making sure students spend ample time immersed in these technologies. Investing the time to gain a deep understanding of these tools will give you a major advantage when you apply for your first job. Check them out below:
IPython
IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython provides the following features:
- Powerful interactive shells (terminal and Qt-based)
- A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media
- Support for interactive data visualization and use of GUI toolkits
- Flexible, embeddable interpreters to load into one’s own projects
- Easy to use, high performance tools for parallel computing
Contributed by Nir Kaldero, Director of Science, Head of Galvanize Experts
GraphLab Create
GraphLab Create is a Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products.
Here are a few of the features of GraphLab Create:
- Ability to analyze terabyte scale data at interactive speeds, on your desktop
- A single platform for tabular data, graphs, text, and images
- State of the art machine learning algorithms including deep learning, boosted trees, and factorization machines
- Run the same code on your laptop or in a distributed system, using a Hadoop YARN or EC2 cluster
- Focus on tasks or machine learning with the flexible API
- Easily deploy data products in the cloud using Predictive Services
- Visualize data for exploration and production monitoring
Contributed by Benjamin Skrainka, Lead Data Science Instructor at Galvanize
Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate. pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first class statistical modeling environment, but we are well on our way toward that goal.
Contributed by Nir Kaldero, Director of Science, Head of Galvanize Experts
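As a rough illustration of the munging-then-analysis workflow described above, here is a minimal sketch. The sales records, column names, and values are invented purely for this example and do not come from the article.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration only.
raw = pd.DataFrame({
    "region": ["west", "east", "west", "east", "west"],
    "units": [10, 4, None, 6, 8],
    "price": [2.5, 3.0, 2.5, 3.0, 2.5],
})

# Munging: drop the row with a missing unit count, then derive revenue.
clean = raw.dropna(subset=["units"]).copy()
clean["revenue"] = clean["units"] * clean["price"]

# Analysis: total revenue per region.
revenue_by_region = clean.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

For modeling beyond this kind of summary, the cleaned DataFrame would typically be handed to statsmodels or scikit-learn, as noted above.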
PuLP
Linear programming is a type of optimisation in which a linear objective function is maximised or minimised subject to linear constraints. PuLP is a linear programming modeler written in Python. PuLP can generate LP files and call highly optimized solvers, such as GLPK, COIN CLP/CBC, CPLEX, and GUROBI, to solve these linear problems.
Contributed by Isaac Laughlin, Data Science Instructor at Galvanize
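To make the idea of an objective and constraints concrete, here is a small sketch of a toy problem modeled with PuLP. The problem itself (variable names, coefficients, and constraint labels) is made up for illustration, and the sketch assumes the CBC solver that ships with recent PuLP releases is available.

```python
from pulp import LpMaximize, LpProblem, LpStatus, LpVariable, PULP_CBC_CMD, value

# Toy problem invented for illustration: maximise 3x + 2y.
prob = LpProblem("toy_production_plan", LpMaximize)

# Decision variables with simple bounds (0 <= x <= 3, y >= 0).
x = LpVariable("x", lowBound=0, upBound=3)
y = LpVariable("y", lowBound=0)

# The first expression added to the problem is the objective.
prob += 3 * x + 2 * y, "total_profit"

# Subsequent inequalities are constraints.
prob += x + y <= 4, "labour_hours"
prob += x + 3 * y <= 6, "raw_material"

# Solve with the bundled CBC solver, suppressing solver log output.
prob.solve(PULP_CBC_CMD(msg=False))

status = LpStatus[prob.status]
best_profit = value(prob.objective)
print(status, value(x), value(y), best_profit)
```

Swapping in another solver from the list above should only require changing the argument passed to `solve`, provided that solver is installed.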
Matplotlib
matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.
matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with just a few lines of code.
For simple plotting, the pyplot interface provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
Contributed by Mike Tamir, Chief Science Officer at Galvanize
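The "few lines of code" claim and the object-oriented interface can be sketched as follows. The data is synthetic, the output file name is a placeholder chosen for this example, and the `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration.
x = np.linspace(0, 2 * np.pi, 200)
samples = np.random.default_rng(0).normal(size=1000)

# Object-oriented interface: one figure holding two axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with explicit control over line style and labels.
ax_line.plot(x, np.sin(x), linestyle="--", label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("value")
ax_line.legend()

# Histogram of the random samples.
ax_hist.hist(samples, bins=30)
ax_hist.set_title("Histogram")

# Save to a temporary location (placeholder file name).
fig.tight_layout()
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")
fig.savefig(out_path, dpi=150)
```

In an IPython session you would more likely call `plt.show()` or rely on inline display instead of saving to a file.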
Scikit-Learn
Scikit-Learn is a simple and efficient tool for data mining and data analysis. What is so great about it is that it’s accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. Scikit-Learn is also open source and commercially usable under the BSD license. Scikit-Learn has the following features:
- Classification – Identifying which category an object belongs to
- Regression – Predicting a continuous-valued attribute associated with an object
- Clustering – Automatic grouping of similar objects into sets
- Dimensionality Reduction – Reducing the number of random variables to consider
- Model Selection – Comparing, validating and choosing parameters and models
- Preprocessing – Feature extraction and normalization
Contributed by Isaac Laughlin, Data Science Instructor at Galvanize
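Several of the features listed above (preprocessing, classification, and model selection) can be combined in a few lines. This is only a sketch: the choice of the bundled iris dataset, logistic regression, and the split parameters are assumptions made for the example, not recommendations from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (feature scaling) followed by a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross-validation on the training split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training split and score on the held-out data.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Because every estimator shares the same `fit` and `score` interface, a different classifier can be swapped into the pipeline without changing the rest of the code.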
Spark
Spark consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
Contributed by Benjamin Skrainka, Lead Data Science Instructor at Galvanize
Still hungry for more data science? Enter our data science giveaway for a chance to win tickets to awesome conferences like PyData Seattle and the Data Science Summit, or get discounts on Python resources like Effective Python and Data Science from Scratch.