scikit-learn：6. Strategies to scale computationally: bigger data

Reference: http://scikit-learn.org/stable/modules/scaling_strategies.html

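When the number of examples, features, or both is very large, traditional approaches run into two problems: memory and efficiency. The remedy is out-of-core (or "external memory") learning, that is, learning from data that does not fit in main memory.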

An out-of-core system is built from three components:

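1. Streaming instances:

Put simply, instances arrive one at a time (or in small batches), for example from a reader over files on disk, a database, or a network stream. The implementation details are outside the scope of the scikit-learn documentation.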
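
Since the scikit-learn documentation leaves the streaming part open, here is a minimal sketch of one way to do it with a plain Python generator. The helper name `iter_minibatches` and the in-memory CSV are assumptions made for illustration; a real reader would open a file on disk or query a database instead.

```python
import csv
import io
from itertools import islice


def iter_minibatches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A small in-memory CSV stands in for a file too large to load at once.
fake_file = io.StringIO("x,y\n1,2\n3,4\n5,6\n7,8\n9,10\n")
reader = csv.DictReader(fake_file)

# Only one mini-batch is held in memory at any time.
batch_sizes = [len(batch) for batch in iter_minibatches(reader, batch_size=2)]
print(batch_sizes)  # five data rows in batches of two -> [2, 2, 1]
```

Because the generator pulls rows lazily, memory use is bounded by `batch_size` rather than by the size of the source.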

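2. Extracting features:

Put simply, use one of the different feature extraction methods (translated article: http://blog.csdn.net/mmc2015/article/details/46992105) to pull the useful features out of a large dataset, which reduces memory use and improves efficiency. Not covered in detail here.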
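
As an illustration of feature extraction that suits out-of-core work, the sketch below uses `HashingVectorizer`. It is stateless (it needs no `fit` pass over the full dataset), so each mini-batch can be transformed independently. Choosing this particular extractor, the value of `n_features`, and the sample sentences are assumptions for the example, not something the notes above prescribe.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary is learned, so no pass over the full corpus is needed.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)

# Two hypothetical mini-batches arriving at different times.
batch_1 = ["out of core learning", "streaming instances"]
batch_2 = ["incremental learning with partial_fit"]

X1 = vectorizer.transform(batch_1)
X2 = vectorizer.transform(batch_2)

# Both batches map into the same fixed-width sparse feature space.
print(X1.shape, X2.shape)  # (2, 1024) (1, 1024)
```

The fixed output width is what allows the same downstream model to consume every batch, whatever words it contains.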

3. Incremental learning:

All estimators implementing the partial_fit API are candidates.

The ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning, as it guarantees that at any given time there will be only a small number of instances in main memory.

Estimators that implement the partial_fit API and can therefore learn incrementally include:

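Classification
- sklearn.naive_bayes.MultinomialNB
- sklearn.naive_bayes.BernoulliNB
- sklearn.linear_model.Perceptron
- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.PassiveAggressiveClassifier
Regression
- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.PassiveAggressiveRegressor
Clustering
- sklearn.cluster.MiniBatchKMeans
Decomposition / feature extraction
- sklearn.decomposition.MiniBatchDictionaryLearning
- sklearn.decomposition.IncrementalPCA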
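
To make the partial_fit idea concrete, here is a minimal sketch that trains `SGDRegressor` one mini-batch at a time. The synthetic data, the coefficient vector `true_coef`, the batch size, and the number of batches are all assumptions chosen for illustration; in practice each batch would come from a stream rather than a random generator.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical ground truth

model = SGDRegressor(random_state=0)

# Each iteration simulates one mini-batch arriving from a stream.
for _ in range(200):
    X_batch = rng.normal(size=(50, 3))
    y_batch = X_batch @ true_coef
    # partial_fit updates the existing weights instead of refitting from scratch.
    model.partial_fit(X_batch, y_batch)

print(model.coef_)  # should end up close to true_coef
```

Only one batch of 50 rows exists in memory at a time, yet the model has effectively seen 10,000 samples by the end of the loop.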

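Note: for classification, an incremental learner may not know the full set of classes in advance, so on the first call to partial_fit you should pass the classes= parameter explicitly and list every possible class.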
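
A small sketch of why classes= matters: the first mini-batch below contains only two of the three labels, and the third label shows up later. The toy arrays and the choice of `SGDClassifier` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.array([0, 1, 2])  # every label that can ever appear
clf = SGDClassifier(random_state=0)

# The first batch happens to contain only classes 0 and 1.
X_first = np.array([[0.0, 0.0], [1.0, 1.0]])
y_first = np.array([0, 1])
clf.partial_fit(X_first, y_first, classes=all_classes)

# A later batch introduces class 2; classes= can be omitted after the first call.
X_later = np.array([[5.0, 5.0], [0.1, 0.0]])
y_later = np.array([2, 0])
clf.partial_fit(X_later, y_later)

print(clf.classes_)  # [0 1 2]
```

Without classes= on the first call, the estimator has no way to allocate weights for labels it has not yet seen.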

4. Examples:

An example of out-of-core classification of text documents. Working through it makes the material above easier to understand.
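
The following is not the official scikit-learn example, only a toy sketch of how the three pieces (streaming, feature extraction, incremental learning) might fit together for text. The document stream, the labels, and the helper names `stream_documents` and `minibatches` are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def stream_documents():
    """Hypothetical stream of (text, label) pairs; a real one would read from disk."""
    docs = [
        ("the team won the football match", "sport"),
        ("stocks fell as the market closed", "finance"),
        ("the striker scored a late goal", "sport"),
        ("the bank raised interest rates", "finance"),
    ]
    for _ in range(25):
        yield from docs


def minibatches(stream, size):
    """Group a stream into lists of at most `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


vectorizer = HashingVectorizer(n_features=2**12)  # stateless feature extraction
clf = SGDClassifier(random_state=0)               # supports partial_fit
classes = ["finance", "sport"]

n_batches = 0
for batch in minibatches(stream_documents(), size=10):
    texts, labels = zip(*batch)
    # Vectorize and learn from this batch only; earlier batches are not kept.
    clf.partial_fit(vectorizer.transform(texts), list(labels), classes=classes)
    n_batches += 1

prediction = clf.predict(vectorizer.transform(["a late goal in the football match"]))
print(n_batches, prediction)
```

The same loop structure carries over to a real corpus: only the body of `stream_documents` needs to change.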
