Spark 2.0 Pipelines

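MLlib standardizes the APIs of its machine learning algorithms so that they are easier to combine into a single pipeline, or workflow. The pipeline idea is mostly inspired by the scikit-learn library.
The ML API uses the DataFrame from Spark SQL as its machine learning dataset. Different columns of a DataFrame can store text, feature vectors, true labels, and predictions.

Transformer: A Transformer is an algorithm that converts one DataFrame into another DataFrame, for example a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that is fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator: it is fit on a training DataFrame and produces a trained model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to carry out a specific machine learning workflow.
Parameter: All Transformers and Estimators share a common API for setting their own parameters.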

MLlib standardizes APIs for machine learning algorithms to make it
easier to combine multiple algorithms into a single pipeline, or
workflow. This section covers the key concepts introduced by the
Pipelines API, where the pipeline concept is mostly inspired by the
scikit-learn project.

DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset. Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data, and the API adopts the DataFrame in order to support them. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a
regular RDD. See the Spark SQL programming guide for examples.

Columns in a DataFrame are named. Pipeline examples typically use names such as "text", "features", and "label".

Transformer: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame. E.g., an ML model is a Transformer
which transforms a DataFrame with features into a DataFrame with
predictions.

Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm is an
Estimator which trains on a DataFrame and produces a model.
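The relationship between the two abstractions can be sketched without Spark at all. The following is a toy model in plain Python, not Spark's API: a "DataFrame" is assumed to be a list of dicts, and the class names MeanThresholdEstimator and MeanThresholdModel are hypothetical. Only the method names fit and transform mirror the ones Spark uses.

```python
# Toy, Spark-free model of the Estimator / Transformer contract.
# Assumption: a "DataFrame" is a list of dicts, one dict per row.


class MeanThresholdModel:
    """Transformer: maps rows with 'features' to rows with 'prediction'."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        # Build new rows; the input rows are left unchanged.
        return [
            {**row, "prediction": 1.0 if row["features"] > self.threshold else 0.0}
            for row in rows
        ]


class MeanThresholdEstimator:
    """Estimator: fit() learns from a dataset and returns a Transformer."""

    def fit(self, rows):
        values = [row["features"] for row in rows]
        # The "learned" state is simply the mean of the training features.
        return MeanThresholdModel(threshold=sum(values) / len(values))


training = [{"features": 1.0}, {"features": 2.0}, {"features": 6.0}]
model = MeanThresholdEstimator().fit(training)  # Estimator -> Transformer
predictions = model.transform([{"features": 0.5}, {"features": 4.0}])
print(predictions)
```

The point of the split is that training state lives only in the object returned by fit, so the fitted model can be applied to any later dataset with the same columns.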

Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.

Parameter: All Transformers and Estimators share a common API for
specifying parameters.
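To illustrate how chaining works, here is a minimal, Spark-free sketch in plain Python. It is a toy under stated assumptions, not Spark's implementation: rows are plain dicts, and every class here (including the keyword-based "learner") is hypothetical. It shows the general idea that fitting a pipeline turns each Estimator stage into a Transformer, and that stages take their column names as parameters.

```python
# Toy, Spark-free sketch of a Pipeline. Assumption: rows are plain dicts.


class Tokenizer:
    """Transformer stage: splits an input column into a list of words."""

    def __init__(self, input_col="text", output_col="words"):
        # Column names are the stage's parameters.
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col].lower().split()} for r in rows]


class KeywordModel:
    """Transformer produced by KeywordEstimator."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if self.keywords & set(r["words"]) else 0.0}
            for r in rows
        ]


class KeywordEstimator:
    """Estimator stage: learns words that occur only in positive rows."""

    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1.0 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0.0 for w in r["words"]}
        return KeywordModel(pos - neg)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class Pipeline:
    """Unfitted pipeline: an ordered list of Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the output to the next stage
        return PipelineModel(fitted)


training = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "hadoop is slow", "label": 0.0},
]
pipeline_model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
result = pipeline_model.transform([{"text": "I like Spark"}, {"text": "hadoop mapreduce"}])
print([r["prediction"] for r in result])
```

Because the fitted pipeline is itself a Transformer, the same preprocessing is guaranteed to run on new data as on the training data.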
