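Getting started with Python & Machine Learning
(Reader’s note: this is an introductory guide to machine learning. The author outlines the pros and cons of using Python to get started with machine learning, and which Python packages to start with.)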
Machine learning is eating the world right now. Everyone and their mother are learning about machine learning models, classification, neural networks, and Andrew Ng. You’ve decided you want to be a part of it, but where to start?
In this article we’ll cover some important characteristics of Python and why it’s great for machine learning. We’ll also cover some of the most important libraries it has for ML, and if it piques your interest, some places where you can learn more.
Why is Python used for machine learning?
Python is a great choice for machine learning for several reasons. First and foremost, it’s a simple language on the surface; even if you’re not familiar with Python, getting up to speed is quick if you’ve used any other mainstream programming language. Second, Python has a great community, which results in good documentation and friendly, comprehensive answers on Stack Overflow (fundamental!). Third, also stemming from the great community, there are plenty of useful libraries for Python (both as “batteries included” and third party), which solve almost any problem you can have (including machine learning).
But I heard Python is slow!
Yeah, and it’s true. Python isn’t the fastest language out there: all those handy abstractions come at a cost.
But here’s the trick: libraries can and do offload the expensive calculations to the much more performant (but harder to use) C and C++. For instance, there’s NumPy, which is a library for numerical computation. It’s written in C, and it’s fast. Practically every library out there that involves intensive calculations uses it — almost all the libraries listed next use it in some form. So if you read NumPy, think fast.
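Therefore, for numerically heavy work you can make your scripts run nearly as fast as if you had written them in a lower-level language, as long as the expensive parts happen inside such libraries rather than in Python loops. So for most machine learning tasks, speed is not something to worry about.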
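To make the speed argument concrete, here is a minimal sketch that computes the same sum of squares twice: once with a plain Python loop and once with a single NumPy call. The array size and variable names are arbitrary choices for illustration, and the timings will vary from machine to machine, so treat the printed numbers as indicative only.

```python
import time

import numpy as np

# An arbitrary array size, chosen only for illustration.
n = 200_000
values = np.arange(n, dtype=np.float64)

# Pure-Python loop: every multiplication and addition goes through the interpreter.
start = time.perf_counter()
loop_total = 0.0
for v in values.tolist():
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Vectorized version: the same sum of squares runs inside NumPy's compiled code.
start = time.perf_counter()
numpy_total = float(np.dot(values, values))
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```

Both versions produce the same result; the difference is only in where the arithmetic runs.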
Python libraries to check out
Scikit-learn
Are you starting out in machine learning? Want something that covers everything from feature engineering to training and testing a model? Look no further than scikit-learn! This fantastic piece of free software provides most of the tools you need for machine learning and data mining. It’s the de facto standard library for machine learning in Python, recommended for most of the ‘old’ ML algorithms.
This library does both classification and regression, supporting most of the well-known algorithms (support vector machines, random forest, naive Bayes, and so on). It’s built in such a way that switching algorithms is simple, so experimentation is easy. These ‘older’ algorithms are surprisingly resilient and work very well in a lot of cases.
But that’s not all! Scikit-learn also does dimensionality reduction, clustering, you name it. It’s also blazingly fast since it runs on NumPy and SciPy (meaning that all the heavy number crunching runs in C instead of Python).
Check out some examples to see everything this library is capable of, and the tutorials if you want to learn how it works.
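As a rough illustration of how easy it is to switch algorithms, the sketch below trains the three classifiers mentioned above on scikit-learn’s bundled iris dataset and prints each one’s accuracy on a held-out test set. The dataset, the split ratio, and the hyperparameters are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load a small bundled dataset and hold out a quarter of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every estimator exposes the same fit/score interface, so swapping is one line.
models = {
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train on the training split
    scores[name] = model.score(X_test, y_test)   # accuracy on the held-out split
    print(f"{name}: {scores[name]:.2f}")
```

Because the loop only relies on `fit` and `score`, adding another classifier to the comparison means adding one entry to the dictionary.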
NLTK
While not a machine learning library per se, NLTK is a must when working with natural language processing (NLP). It comes with a bundle of datasets and other lexical resources (useful for training models) in addition to libraries for working with text — for functions such as classification, tokenization, stemming, tagging, parsing and more.
The usefulness of having all of this stuff neatly packaged can’t be overstated. So if you are interested in NLP, check out some tutorials!
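For a first taste, here is a minimal sketch of two of the functions mentioned above, tokenization and stemming. It deliberately uses `wordpunct_tokenize` and `PorterStemmer`, which need no extra corpus downloads; other NLTK features such as tagging or parsing usually require fetching data first. The sample sentence is made up for the example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# A made-up sentence used only for illustration.
text = "The cats were running quickly."

# Split the text into word and punctuation tokens with a regex-based tokenizer.
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem with the Porter stemming algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```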
Theano
Used widely in research and academia, Theano is the grandfather of all deep learning frameworks. Written in Python, it’s tightly integrated with NumPy. Theano allows you to create neural networks, which are represented as mathematical expressions with multi-dimensional arrays. Theano handles this for you so you don’t have to worry about the actual implementation of the math involved.
It supports offloading calculations to the much faster GPU. Every framework supports this today, but that wasn’t the case when Theano introduced it. The library is very mature at this point and supports a very wide range of operations, which is a great plus when comparing it with other similar libraries.
The biggest complaint out there is that the API may be unwieldy for some, making the library hard to use for beginners. However, there are wrappers that ease the pain and make working with Theano simple, such as Keras, Blocks and Lasagne.
Interested in learning about Theano? Check out this Jupyter Notebook tutorial.
TensorFlow
The Google Brain team created TensorFlow for internal use in machine learning applications, and open sourced it in late 2015. They wanted something that could replace their older, closed source machine learning framework, DistBelief, which they said wasn’t flexible enough and too tightly coupled to their infrastructure to be shared with other researchers around the world.
And so TensorFlow was created. It learned from the mistakes of the past, and many consider it an improvement over Theano, claiming more flexibility and a more intuitive API. It can be used not only for research but also in production environments, supporting huge clusters of GPUs for training. While it doesn’t support as wide a range of operations as Theano, it has better computational graph visualizations.
TensorFlow is very popular nowadays. In fact, if you’ve heard about a single library on this list, it’s probably this one: not a day goes by without a new blog post or paper mentioning TensorFlow being published. This popularity translates into a lot of new users and a lot of tutorials, making it very welcoming to beginners.
Keras
Keras is a fantastic library that provides a high-level API for neural networks and is capable of running on top of either Theano or TensorFlow. It makes harnessing the full power of these complex pieces of software much easier than using them directly. It’s very user-friendly, putting user experience as a top priority. It manages this with simple APIs and excellent feedback on errors.
It’s also modular, meaning that different modules (neural layers, cost functions, and so on) can be plugged together with few restrictions. This also makes it very easy to extend, since it’s simple to add new modules and connect them with the existing ones.
Some people have called Keras so good that it is effectively cheating in machine learning. So if you’re starting out with deep learning, go through the examples and documentation to get a feel for what you can do with it. And if you want to learn, start out with this tutorial and see where you can go from there.
Two similar alternatives are Lasagne and Blocks, but they only run on Theano. So if you tried Keras and are unhappy with it, maybe try out one of these alternatives to see if they work out for you.
PyTorch
Another popular deep learning framework is Torch, which is written in Lua. Facebook open-sourced a Python implementation of Torch called PyTorch, which allows you to conveniently use the same low-level libraries that Torch uses, but from Python instead of Lua.
One of the biggest differences between Theano/TensorFlow and PyTorch is that the former use symbolic computation while the latter doesn’t, which makes PyTorch much easier to debug. Symbolic computation means that when you code an operation (say, ‘x + y’), it isn’t computed when that line is interpreted. Before being executed, it has to be compiled (translated to CUDA or C). This makes debugging harder in Theano/TensorFlow, since an error is much harder to associate with the line of code that caused it. Of course, doing things this way has its advantages, but debugging isn’t one of them.
If you want to start out with PyTorch, the official tutorials are very friendly to beginners but get to advanced topics as well.
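The difference can be sketched in plain Python, without any deep learning library. The toy `Var` and `Node` classes below are hypothetical helpers invented for this illustration; they are not how Theano, TensorFlow, or PyTorch are implemented. They only show why an error in a deferred graph surfaces far from the line that defined the operation.

```python
class Var:
    """A named placeholder whose value is supplied only at run time."""

    def __init__(self, name):
        self.name = name

    def run(self, env):
        return env[self.name]


class Node:
    """A deferred operation: creating it records the work but computes nothing."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def run(self, env):
        # The actual computation happens here, long after the graph was built.
        return self.op(self.left.run(env), self.right.run(env))


def divide(a, b):
    return a / b


# Symbolic style: this line only builds a graph, so it cannot fail yet.
graph = Node(divide, Var("x"), Var("y"))

# The error appears when the graph is executed, away from the defining line.
try:
    graph.run({"x": 1.0, "y": 0.0})
    symbolic_error = None
except ZeroDivisionError as exc:
    symbolic_error = exc

# Eager style: the division runs on the very line where it is written,
# so a failure points straight at that line.
try:
    eager_result = 1.0 / 0.0
    eager_error = None
except ZeroDivisionError as exc:
    eager_error = exc

print("deferred graph result:", graph.run({"x": 6.0, "y": 3.0}))
print("deferred error raised at run time:", repr(symbolic_error))
```

In the deferred version the traceback leads into `Node.run`, not to the line that built `graph`, which is the debugging cost described above.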
First steps in machine learning?
Alright, you’ve presented me with a lot of alternatives for machine learning libraries in Python. What should I choose? How do I compare these things? Where do I start?
Our Ape Advice™ for beginners is to try not to get bogged down by details. If you’ve never done anything machine learning related, try out scikit-learn. You’ll get an idea of how the cycle of tagging, training and testing works and how a model is developed.
Now, if you want to try out deep learning, start out with Keras — which is widely agreed to be the easiest framework — and see where that takes you. After you have more experience, you will start to see what it is that you actually want from the framework: greater speed, a different API, or maybe something else, and you’ll be able to make a more informed decision.
And even then, there is an endless supply of articles out there comparing Theano, Torch, and TensorFlow. There’s no real way to tell which one is the best. It’s important to take into account that all of them have wide support and are improving constantly, which makes comparisons harder. A six-month-old benchmark may be outdated, and a year-old claim that framework X doesn’t support operation Y may no longer be valid.
Finally, if you’re interested in doing machine learning specifically applied to NLP, why not check out MonkeyLearn! Our platform provides a unique UX that makes it super easy to build, train and improve NLP models. You can either use pre-trained models for common use cases (like sentiment analysis, topic detection or keyword extraction) or train custom algorithms using your particular data. Also, you don’t have to worry about the underlying infrastructure or deploying your models; our scalable cloud does this for you. You can start for free and integrate right away with our beautiful API.
Want to learn more?
There are plenty of online resources out there to learn about machine learning! Here are a few:
- A comprehensive guide of a machine learning project on a Jupyter Notebook, if you want to see what some code looks like.
- Our Gentle Guide to Machine Learning, if you want to read more about the concepts of machine learning.
- Andrew Ng’s Stanford CS229 on Coursera, if you’re ready to get serious about this machine learning thing. If you are looking for a course on practical deep learning, check out the one at fast.ai.
Final words
So that was a brief intro to machine learning in Python and some of its libraries. The important part is not getting bogged down by details and just trying stuff out. Follow your curiosity, and don’t be afraid to experiment.
Know about a Python library that was left out? Share it in the comments below!