Python Tools for Machine Learning
Python is one of the best programming languages out there, with extensive coverage in scientific computing: computer vision, artificial intelligence, mathematics and astronomy, to name a few. Unsurprisingly, this holds true for machine learning as well.
Of course, it has some disadvantages too, one of which is that the tools and libraries for Python are scattered. If you are a Unix-minded person, this works quite conveniently, as every tool does one thing and does it well. However, it also requires you to know the different libraries and tools, including their advantages and disadvantages, to make sound decisions for the systems you are building. Tools by themselves do not make a system or product better, but with the right tools we can work much more efficiently and be more productive. Therefore, knowing the right tools for your work domain is crucially important.
This post aims to list and describe the most useful machine learning tools and libraries that are available for Python. To make this list, we did not require the library to be written in Python; it was sufficient for it to have a Python interface. We also have a small section on Deep Learning at the end as it has received a fair amount of attention recently.
We do not aim to list all the machine learning libraries available in Python (the Python package index returns 139 results for “machine learning”) but rather the ones that we found useful and well-maintained to the best of our knowledge. Moreover, although some modules could be used for various machine learning tasks, we only included libraries whose main focus is machine learning. For example, although Scipy has some clustering algorithms, the main focus of this module is not machine learning but rather being a comprehensive set of tools for scientific computing. Therefore, we excluded libraries like Scipy from our list (though we use it too!).
Another thing worth mentioning is that we also evaluated each library based on how it integrates with other scientific computing libraries, because machine learning (either supervised or unsupervised) is part of a data processing system. If the library that you are using does not fit with the rest of your data processing system, you may find yourself spending a tremendous amount of time creating intermediate layers between different libraries. It is important to have a great library in your toolset, but it is also important for that library to integrate well with other libraries.
If you are proficient in another language but want to use Python packages, we also briefly go into how you could integrate with Python to use the libraries listed in the post.
Scikit-Learn
Scikit-learn is our machine learning tool of choice at CB Insights. We use it for classification, feature selection, feature extraction and clustering. What we like most about it is that it has a consistent API which is easy to use while also providing a lot of evaluation, diagnostic and cross-validation methods out of the box (sound familiar? Python has a batteries-included approach as well). The icing on the cake is that it uses Scipy data structures under the hood and fits quite well with the rest of scientific computing in Python: the Scipy, Numpy, Pandas and Matplotlib packages. Therefore, if you want to visualize the performance of your classifiers (say, using a precision-recall graph or a Receiver Operating Characteristic (ROC) curve), you can do so quickly with the help of Matplotlib. Considering how much time is spent on cleaning and structuring the data, this tight integration with other scientific computing packages makes the library very convenient to use.
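As a rough illustration of the consistent API and the built-in evaluation tools, the sketch below cross-validates a classifier and computes the points of a ROC curve. It assumes a recent Scikit-learn release in which the cross-validation helpers live in `sklearn.model_selection`. The synthetic dataset and the choice of logistic regression are our own assumptions for the example, not something specific to our setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (a stand-in for a real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Any estimator could be swapped in here; they all share fit / predict.
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training split.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# Fit on the training split and score the held-out split.
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)[:, 1]

# Points of the ROC curve; fpr and tpr are plain Numpy arrays that
# could be passed straight to Matplotlib's plot() function.
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
roc_auc = auc(fpr, tpr)

print("5-fold CV accuracy: %.3f" % cv_scores.mean())
print("Test ROC AUC: %.3f" % roc_auc)
```

Because `roc_curve` returns ordinary arrays, plotting the curve is a single Matplotlib call on `fpr` and `tpr`.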
It also has limited Natural Language Processing feature extraction capabilities, such as bag of words, TF-IDF and preprocessing (stop words, custom preprocessing, analyzers). In addition, if you want to quickly run different benchmarks on toy datasets, it has a datasets module which provides common and useful datasets. You could also build toy datasets from these for your own purposes, to see whether your model performs well before applying it to a real-world dataset. For parameter optimization and tuning, it provides grid search and random search. None of these features could have been accomplished without great community support and good maintenance. We look forward to its first stable release.
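The features mentioned above can be sketched in a few lines. The example below is only illustrative: the three-sentence corpus is made up, and the parameter grid for the support vector classifier is an arbitrary choice. It again assumes a recent Scikit-learn release (`sklearn.model_selection`).

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# --- Text feature extraction: TF-IDF with English stop words removed ---
# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "machine learning with python",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # Scipy sparse matrix
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# --- Grid search on a bundled toy dataset ---
iris = load_iris()
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # arbitrary grid
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```

Random search follows the same pattern through `RandomizedSearchCV`, which samples parameter settings instead of trying every combination.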
Statsmodels
Statsmodels is another great library which focuses on statistical models and is used mainly for predictive and exploratory analysis. If you want to fit linear models, do statistical analysis, and maybe a bit of predictive modeling, then Statsmodels is a great fit. The statistical tests it provides are quite comprehensive and cover validation tasks for most cases. If you are an R or S user, it also accepts R syntax for some of its statistical models. It accepts Numpy arrays as well as Pandas data frames for its models, making the creation of intermediate data structures a thing of the past!
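To give a feel for the R-style syntax, here is a minimal sketch that fits an ordinary least squares model directly on a Pandas data frame through the `statsmodels.formula.api` module. The data is synthetic, and the column names `x` and `y` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 2 + 3x plus Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
df = pd.DataFrame(
    {"x": x, "y": 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)}
)

# R-style formula: regress y on x; an intercept is added automatically.
results = smf.ols("y ~ x", data=df).fit()

print(results.params)    # estimated intercept and slope
print(results.rsquared)  # goodness of fit
```

Calling `results.summary()` prints a table of coefficients, standard errors and test statistics, similar in spirit to the output of `summary(lm(...))` in R.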
PyMC
PyMC is the tool of choice for Bayesians. It includes Bayesian models, statistical distributions and diagnostic tools for the convergence of models. It includes some hierarchical models as well. If you want to do Bayesian Analysis, you should check it out.
Shogun
Shogun is a machine learning toolbox written in C++ with a focus on Support Vector Machines (SVM). It is actively developed and maintained and provides a Python interface, which is mostly well documented. However, we’ve found its API hard to use compared to Scikit-learn, and it does not provide many diagnostic or evaluation algorithms out of the box. Its speed, on the other hand, is a great advantage.
Gensim
Gensim is defined as “topic modeling for humans”. As its homepage describes, its main focus is Latent Dirichlet Allocation (LDA) and its variants. Unlike other packages, it has support for Natural Language Processing, which makes it easier to combine an NLP pipeline with other machine learning algorithms. If your domain is NLP and you want to do clustering and basic classification, you may want to check it out. Recently, they also added word2vec, a neural-network-based text representation from Google, to their API. This library is written purely in Python.
Orange
Orange is the only library with a Graphical User Interface (GUI) among the libraries listed in this post. It is also quite comprehensive in terms of classification, clustering and feature selection methods, and has some cross-validation methods. It is better than Scikit-learn in some aspects (classification methods, some preprocessing capabilities), but it does not fit into the rest of the scientific computing ecosystem (Numpy, Scipy, Matplotlib, Pandas) as nicely as Scikit-learn does.
Having a GUI is an important advantage over other libraries, however. You can visualize cross-validation results, models and feature selection methods (for some of these capabilities you need to install Graphviz separately). Orange has its own data structures for most of its algorithms, so you need to wrap your data into Orange-compatible data structures, which makes the learning curve steeper.
PyMVPA
PyMVPA is another statistical learning library which is similar to Scikit-learn in terms of its API. It has cross-validation and diagnostic tools as well, but it is not as comprehensive as Scikit-learn.
Deep Learning
Even though deep learning is a subfield of machine learning, we created a separate section for it, as it has received tremendous attention recently, with various acqui-hires by Google and Facebook.
Theano
Theano is the most mature deep learning library. It provides nice data structures (tensors) to represent the layers of neural networks, and they are efficient for linear algebra, similar to Numpy arrays. One caution is that its API may not be very intuitive, which steepens the learning curve for users. Many libraries build on top of Theano and exploit its data structures. It also supports GPU programming out of the box.
PyLearn2
PyLearn2 is a library built on top of Theano which brings modularity and configurability to it: you create your neural network through configuration files, which makes it easier to experiment with different parameters. Arguably, it provides more modularity by moving the parameters and properties of the neural network into the configuration file.
Decaf
Decaf is a recently released deep learning library from UC Berkeley with state-of-the-art neural network implementations that have been tested on the ImageNet classification competition.
Nolearn
If you want to use the excellent Scikit-learn API for deep learning as well, Nolearn makes life easier for you. It is a wrapper on top of Decaf that is (mostly) compatible with Scikit-learn, which makes Decaf even more awesome.
OverFeat
OverFeat is a recent winner of Dogs vs Cats (a Kaggle competition). It is written in C++ but comes with a Python wrapper as well (along with Matlab and Lua wrappers). It uses the GPU through the Torch library, so it is quite fast. It also won the detection and localization competition in ImageNet. If your main domain is computer vision, you may want to check it out.
Hebel
Hebel is another neural network library that comes with GPU support out of the box. You can define the properties of your neural networks through YAML files (similar to PyLearn2), which provides a nice way to separate your neural network from the code and to run your models quickly. Since it has been developed only recently, its documentation is lacking in both depth and breadth. It is also limited in terms of neural network models, as it only has one type (feed-forward). However, it is written in pure Python and promises to be a nice library, as it has a lot of utility functions such as schedulers and monitors, which we did not see in any other library.
Neurolab
Neurolab is another neural network library with a nice API (similar to Matlab’s API, if you are familiar with it). Unlike other libraries, it has different variants of Recurrent Neural Network (RNN) implementations. If you want to use RNNs, this library might be one of the best choices thanks to its simple API.
Integration with other languages
You do not know any Python but are great in another language? Do not despair! One of the strengths of Python (among many others) is that it is a perfect glue language: you can use the programming language of your choice with these libraries by accessing them through Python. The following packages for the respective programming languages can be used to combine Python with other programming languages:
- R -> RPython
- Matlab -> matpython
- Java -> Jython
- Lua -> Lunatic Python
- Julia -> PyCall.jl
Inactive Libraries
These are the libraries that have not released any updates for more than a year. We are listing them because some readers may find them useful, but it is unlikely that these libraries will receive bug fixes, let alone enhancements, in the future:
If we are missing one of your favorite packages in Python for machine learning, feel free to let us know in the comments. We will gladly add that library to our blog post as well.