spark pipeline 例子

"""

Pipeline Example.

"""

# $example on$

from pyspark.ml import Pipeline

from pyspark.ml.classification import LogisticRegression

from pyspark.ml.feature import HashingTF, Tokenizer

# $example off$

from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession\

        .builder\

        .appName("PipelineExample")\

        .getOrCreate()

    # $example on$

    # Prepare training documents from a list of (id, text, label) tuples.

    training = spark.createDataFrame([

        (0, "a b c d e spark", 1.0),

        (1, "b d", 0.0),

        (2, "spark f g h", 1.0),

        (3, "hadoop mapreduce", 0.0)

    ], ["id", "text", "label"])

    # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.

    tokenizer = Tokenizer(inputCol="text", outputCol="words")

    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

    lr = LogisticRegression(maxIter=10, regParam=0.001)

    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # Fit the pipeline to training documents.

    model = pipeline.fit(training)

    # Prepare test documents, which are unlabeled (id, text) tuples.

    test = spark.createDataFrame([

        (4, "spark i j k"),

        (5, "l m n"),

        (6, "spark hadoop spark"),

        (7, "apache hadoop")

    ], ["id", "text"])

    # Make predictions on test documents and print columns of interest.

    prediction = model.transform(test)

    selected = prediction.select("id", "text", "probability", "prediction")

    for row in selected.collect():

        rid, text, prob, prediction = row

        print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

    # $example off$

    spark.stop()

"""

Decision Tree Classification Example.

"""

from __future__ import print_function

# $example on$

from pyspark.ml import Pipeline

from pyspark.ml.classification import DecisionTreeClassifier

from pyspark.ml.feature import StringIndexer, VectorIndexer

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# $example off$

from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession\

        .builder\

        .appName("DecisionTreeClassificationExample")\

        .getOrCreate()

    # $example on$

    # Load the data stored in LIBSVM format as a DataFrame.

    data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    # Index labels, adding metadata to the label column.

    # Fit on whole dataset to include all labels in index.

    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

    # Automatically identify categorical features, and index them.

    # We specify maxCategories so features with > 4 distinct values are treated as continuous.

    featureIndexer =\

        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    # Split the data into training and test sets (30% held out for testing)

    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a DecisionTree model.

    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

    # Chain indexers and tree in a Pipeline

    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

    # Train model.  This also runs the indexers.

    model = pipeline.fit(trainingData)

    # Make predictions.

    predictions = model.transform(testData)

    # Select example rows to display.

    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Select (prediction, true label) and compute test error

    evaluator = MulticlassClassificationEvaluator(

        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

    accuracy = evaluator.evaluate(predictions)

    print("Test Error = %g " % (1.0 - accuracy))

    treeModel = model.stages[2]

    # summary only

    print(treeModel)

    # $example off$

    spark.stop()

管道里的主要概念

MLlib提供标准的接口来使联合多个算法到单个的管道或者工作流，管道的概念源于scikit-learn项目。

1.数据框：机器学习接口使用来自Spark SQL的数据框形式数据作为数据集，它可以处理多种数据类型。比如，一个数据框可以有不同的列存储文本、特征向量、标签值和预测值。

2.转换器：转换器是将一个数据框变为另一个数据框的算法。比如，一个机器学习模型就是一个转换器，它将带有特征数据框转为预测值数据框。

3.估计器：估计器是拟合一个数据框来产生转换器的算法。比如，一个机器学习算法就是一个估计器，它训练一个数据框产生一个模型。

4.管道：一个管道串起多个转换器和估计器，明确一个机器学习工作流。

5.参数：管道中的所有转换器和估计器使用共同的接口来指定参数。

工作原理

管道由一系列有顺序的阶段指定，每个状态时转换器或估计器。每个状态的运行是有顺序的，输入的数据框通过每个阶段进行改变。在转换器阶段，transform()方法被调用于数据框上。对于估计器阶段，fit()方法被调用来产生一个转换器，然后该转换器的transform()方法被调用在数据框上。

下面的图说明简单的文档处理工作流的运行。

spark pipeline 例子的更多相关文章

spark JavaDirectKafkaWordCount 例子分析
spark JavaDirectKafkaWordCount 例子分析: 1. KafkaUtils.createDirectStream( jssc, String.class, String.c ...
Spark Pipeline官方文档
ML Pipelines(译文) 官方文档链接:https://spark.apache.org/docs/latest/ml-pipeline.html 概述在这一部分,我们将要介绍ML Pipe ...
Spark SQL例子
综合案例分析现有数据集 department.json与employee.json,以部门名称和员工性别为粒度,试计算每个部门分性别平均年龄与平均薪资. department.json如下: {&q ...
Spark Pipeline
一个简单的Pipeline,用作estimator.Pipeline由有序列的stages组成,每个stage是一个Estimator或者一个Transformer. 当Pipeline调用fit,s ...
Spark Streaming 例子
NetworkWordCount.scala /* * Licensed to the Apache Software Foundation (ASF) under one or more * con ...
看到了一个pipeline例子，
pipeline { agent any options { timestamps() } parameters { string(name: 'GIT_BRANCH', defaultValue: ...
spark执行例子eclipse maven打包jar
首先在eclipse Java EE中新建一个Maven project具体选项如下点击Finish创建成功,接下来把默认的jdk1.5改成jdk1.8 然后编辑pom.xml加入spark-cor ...
spark scala 例子
object ScalaApp { def main(args: Array[String]): Unit = { var conf = new SparkConf() conf.setMaster( ...
Spark.ML之PipeLine学习笔记
地址: http://spark.apache.org/docs/2.0.0/ml-pipeline.html Spark PipeLine 是基于DataFrames的高层的API,可以方便用户 ...

随机推荐

server.htaccess 具体解释以及 .htaccess 參数说明
.htaccess文件(或者"分布式配置文件")提供了针对文件夹改变配置的方法. 即.在一个特定的文档文件夹中放置一个包括一个或多个指令的文件, 以作用于此文件夹及其所有子文件夹. ...
怎样获取ios设备的唯一标识
非常多地方都会须要用到唯一标志. 比方: 1. 我们相用一个设备的唯一标志当作用户id,特别是网络游戏,这样就能够省去注冊的麻烦. 2. 想把app相关的文件加密,密钥哪里来的?有些人可能会说hard ...
ubuntu 休眠之后蓝牙鼠标无效果。
ubuntu链接蓝牙鼠标之后.左上角蓝牙标志左下角应该有一个锁的标志. 可是休眠之后,蓝牙鼠标失效,锁没有了,点击按键,出来锁之后,立即消失. 运行两次例如以下命令能够解决: sudo hciconf ...
bzoj3436: 小K的农场（差分约束）
3436: 小K的农场题目:传送门题解: 查分基础: t==1 a>=b+c t==2 b>=a-c t==3 a>=b+0 b>=a+0 跑最长路一A 代码: #i ...
Oracle 10g RAC (linux) ASM 共享存储的管理详解
---------ASM 的管理(共享磁盘的管理)1.以 instance 的方式管理 ASM,启动 database 之前必须先启动 ASM instance,ASM instance 启动后,挂载 ...
Linux就该这么学 20181002(第二章基础命令)
参考链接https://www.linuxprobe.com/ 忘记密码操作启动页面默认按e 在linux16行后空格 rd.break ctrl + x mount -o remount,rw ...
Ubuntu16.04+OpenCV3.2.0+Opencv_Contrib3.2.0安装
为了学习slam,在ubuntu16.04系统上安装opencv3.2.0以及对应的opencv_contrib3.2.0 安装过程下载 Github上下载有的时候比较慢,我这里分享了OpenCV3 ...
iOS开发-Tools-CocoaPods
CocoaPods是什么? 当你开发iOS应用时,会经常使用到很多第三方开源类库,比如JSONKit,AFNetWorking等等.可能某个类库又用到其他类库,所以要使用它,必须得另外下载其他类库,而 ...
react-native之文件上传下载
目录文件上传 1.文件选择 2.文件上传 1.FormData对象包装 2.上传示例文件下载最近react-native项目上需要做文件上传下载的功能,由于才接触react-native不久,好 ...
【jQuery04】折叠树
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/ ...

spark pipeline 例子

管道里的主要概念

spark pipeline 例子的更多相关文章

随机推荐

热门专题