Spark Pipeline examples

"""
Pipeline Example.
"""
# $example on$
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PipelineExample")\
        .getOrCreate()

    # $example on$
    # Prepare training documents from a list of (id, text, label) tuples.
    training = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 0.0)
    ], ["id", "text", "label"])

    # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.001)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # Fit the pipeline to training documents.
    model = pipeline.fit(training)

    # Prepare test documents, which are unlabeled (id, text) tuples.
    test = spark.createDataFrame([
        (4, "spark i j k"),
        (5, "l m n"),
        (6, "spark hadoop spark"),
        (7, "apache hadoop")
    ], ["id", "text"])

    # Make predictions on test documents and print columns of interest.
    prediction = model.transform(test)
    selected = prediction.select("id", "text", "probability", "prediction")
    for row in selected.collect():
        rid, text, prob, prediction = row
        print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))
    # $example off$

    spark.stop()

"""
Decision Tree Classification Example.
"""
from __future__ import print_function

# $example on$
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DecisionTreeClassificationExample")\
        .getOrCreate()

    # $example on$
    # Load the data stored in LIBSVM format as a DataFrame.
    data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    # Index labels, adding metadata to the label column.
    # Fit on whole dataset to include all labels in index.
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

    # Automatically identify categorical features, and index them.
    # We specify maxCategories so features with > 4 distinct values are treated as continuous.
    featureIndexer =\
        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a DecisionTree model.
    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

    # Chain indexers and tree in a Pipeline
    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

    # Train model.  This also runs the indexers.
    model = pipeline.fit(trainingData)

    # Make predictions.
    predictions = model.transform(testData)

    # Select example rows to display.
    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Select (prediction, true label) and compute test error
    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g " % (1.0 - accuracy))

    treeModel = model.stages[2]
    # summary only
    print(treeModel)
    # $example off$

    spark.stop()

Main concepts in Pipelines

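MLlib provides standard interfaces that make it possible to combine multiple algorithms into a single pipeline, or workflow. The pipeline concept is inspired by the scikit-learn project.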

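1. DataFrame: The ML API uses DataFrames from Spark SQL as its dataset format, and a DataFrame can hold a variety of data types. For example, a DataFrame can have different columns storing text, feature vectors, labels, and predictions.

2. Transformer: A Transformer is an algorithm that turns one DataFrame into another. For example, an ML model is a Transformer: it turns a DataFrame with features into a DataFrame with predictions.

3. Estimator: An Estimator is an algorithm that is fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator: it trains on a DataFrame and produces a model.

4. Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

5. Parameter: All Transformers and Estimators in a Pipeline share a common interface for specifying parameters.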
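To make the Transformer/Estimator distinction concrete, here is a minimal plain-Python sketch. It does not use Spark at all: a list of dicts stands in for a DataFrame, and the classes `ToyTokenizer`, `ToyMeanLength`, and `ToyMeanLengthModel` are hypothetical names invented for this illustration. The "learning" step (flagging texts longer than the mean training length) is also an assumption chosen for simplicity, not something Spark does.

```python
# Toy stand-ins for illustration only; these are NOT Spark classes.
# A "DataFrame" here is just a list of dicts, one dict per row.


class ToyTokenizer:
    """Transformer: turns one dataset into another by adding a column."""

    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transform(self, rows):
        # Build new rows; the input rows are left unchanged.
        return [dict(r, **{self.outputCol: r[self.inputCol].lower().split()})
                for r in rows]


class ToyMeanLengthModel:
    """Transformer produced by fitting: holds the learned mean length."""

    def __init__(self, inputCol, outputCol, mean):
        self.inputCol = inputCol
        self.outputCol = outputCol
        self.mean = mean

    def transform(self, rows):
        # Predict 1.0 when a row has more tokens than the learned mean.
        return [dict(r, **{self.outputCol: float(len(r[self.inputCol]) > self.mean)})
                for r in rows]


class ToyMeanLength:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def fit(self, rows):
        mean = sum(len(r[self.inputCol]) for r in rows) / len(rows)
        return ToyMeanLengthModel(self.inputCol, self.outputCol, mean)


training = [
    {"id": 0, "text": "a b c d e spark"},
    {"id": 1, "text": "b d"},
    {"id": 2, "text": "spark f g h"},
    {"id": 3, "text": "hadoop mapreduce"},
]

tokenized = ToyTokenizer("text", "words").transform(training)   # Transformer
model = ToyMeanLength("words", "prediction").fit(tokenized)     # Estimator -> model
predicted = model.transform(tokenized)                          # model is a Transformer

for row in predicted:
    print(row["id"], row["words"], row["prediction"])
```

The point of the sketch is the shape of the interfaces: a Transformer only needs `transform()`, while an Estimator needs `fit()` and hands back something that has `transform()`.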

How it works

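A Pipeline is specified as an ordered sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each one. For a Transformer stage, the transform() method is called on the DataFrame. For an Estimator stage, the fit() method is called to produce a Transformer, and that Transformer's transform() method is then called on the DataFrame.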
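The stage-by-stage behaviour described above can be sketched in a few lines of plain Python. This is a toy model that follows the description literally, not Spark's implementation: the data is a list of numbers rather than a DataFrame, and `AddOne`, `Center`, `CenterModel`, `fit_pipeline`, and `transform_pipeline` are hypothetical names made up for this example.

```python
# Toy sketch of the stage loop; NOT Spark's implementation.


class AddOne:
    """Transformer stage: only has transform()."""

    def transform(self, data):
        return [x + 1 for x in data]


class CenterModel:
    """Transformer returned by Center.fit(); subtracts the learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class Center:
    """Estimator stage: fit() learns the mean and returns a Transformer."""

    def fit(self, data):
        return CenterModel(sum(data) / len(data))


def fit_pipeline(stages, data):
    """Run the stages in order and return the list of fitted Transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator stage: fit on the current data to get a Transformer.
            stage = stage.fit(data)
        # Either way we now hold a Transformer; apply it before the next stage.
        data = stage.transform(data)
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, data):
    """Apply every fitted Transformer in order to new data."""
    for transformer in fitted:
        data = transformer.transform(data)
    return data


fitted = fit_pipeline([AddOne(), Center()], [1.0, 2.0, 3.0])
result = transform_pipeline(fitted, [10.0])
print(fitted[1].mean, result)
```

Note that `Center` is fit on the output of `AddOne`, not on the raw input, which is why the order of the stages matters.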

The figure below illustrates this for a simple document-processing workflow.
