Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API.

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils # Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3]) # Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32) # Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString()) # Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.

class pyspark.mllib.tree.DecisionTree[source]

Learning algorithm for a decision tree model for classification or regression.

New in version 1.1.0.

classmethod trainClassifier(datanumClassescategoricalFeaturesInfoimpurity='gini'maxDepth=5maxBins=32minInstancesPerNode=1minInfoGain=0.0)[source]

Train a decision tree model for classification.

Parameters:
  • data – Training data: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
  • numClasses – Number of classes for classification.
  • categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
  • impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
  • maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 5)
  • maxBins – Number of bins used for finding splits at each node. (default: 32)
  • minInstancesPerNode – Minimum number of instances required at child nodes to create the parent split. (default: 1)
  • minInfoGain – Minimum info gain required to create a split. (default: 0.0)
Returns:

DecisionTreeModel.

Example usage:

>>> from numpy import array
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import DecisionTree
>>>
>>> data = [
... LabeledPoint(0.0, [0.0]),
... LabeledPoint(1.0, [1.0]),
... LabeledPoint(1.0, [2.0]),
... LabeledPoint(1.0, [3.0])
... ]
>>> model = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
>>> print(model)
DecisionTreeModel classifier of depth 1 with 3 nodes
>>> print(model.toDebugString())
DecisionTreeModel classifier of depth 1 with 3 nodes
If (feature 0 <= 0.0)
Predict: 0.0
Else (feature 0 > 0.0)
Predict: 1.0 >>> model.predict(array([1.0]))
1.0
>>> model.predict(array([0.0]))
0.0
>>> rdd = sc.parallelize([[1.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]

摘自:https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree

python spark 决策树 入门demo的更多相关文章

  1. Spark快速入门 - Spark 1.6.0

    Spark快速入门 - Spark 1.6.0 转载请注明出处:http://www.cnblogs.com/BYRans/ 快速入门(Quick Start) 本文简单介绍了Spark的使用方式.首 ...

  2. Spark快速入门

    Spark 快速入门   本教程快速介绍了Spark的使用. 首先我们介绍了通过Spark 交互式shell调用API( Python或者scala代码),然后演示如何使用Java, Scala或者P ...

  3. spark streaming 入门例子

    spark streaming 入门例子: spark shell import org.apache.spark._ import org.apache.spark.streaming._ sc.g ...

  4. 转-Python自然语言处理入门

      Python自然语言处理入门 原文链接:http://python.jobbole.com/85094/ 分享到:20 本文由 伯乐在线 - Ree Ray 翻译,renlytime 校稿.未经许 ...

  5. Spark高速入门指南(Quick Start Spark)

    版权声明:本博客已经不再更新.请移步到Hadoop技术博客:https://www.iteblog.com https://blog.csdn.net/w397090770/article/detai ...

  6. [转] Spark快速入门指南 – Spark安装与基础使用

    [From] https://blog.csdn.net/w405722907/article/details/77943331 Spark快速入门指南 – Spark安装与基础使用 2017年09月 ...

  7. storm入门demo

    一.storm入门demo的介绍 storm的入门helloworld有2种方式,一种是本地的,另一种是远程. 本地实现: 本地写好demo之后,不用搭建storm集群,下载storm的相关jar包即 ...

  8. Python学习--01入门

    Python学习--01入门 Python是一种解释型.面向对象.动态数据类型的高级程序设计语言.和PHP一样,它是后端开发语言. 如果有C语言.PHP语言.JAVA语言等其中一种语言的基础,学习Py ...

  9. Python简单爬虫入门三

    我们继续研究BeautifulSoup分类打印输出 Python简单爬虫入门一 Python简单爬虫入门二 前两部主要讲述我们如何用BeautifulSoup怎去抓取网页信息以及获取相应的图片标题等信 ...

随机推荐

  1. elasticsearch模板 template

    https://elasticsearch.cn/article/335 elasticsearch模板 template 可以考虑的学习点: mapping的 _default_类型 动态模板:dy ...

  2. QS之force(3)

    Example in project - First force signals in certain time and then noforce signals after some time. # ...

  3. 【sqli-labs】 less32 GET- Bypass custom filter adding slashes to dangrous chars (GET型转义了'/"字符的宽字节注入)

    转义函数,针对以下字符,这样就无法闭合引号,导致无法注入 ' --> \' " --> \" \ --> \\ 但是,当MySQL的客户端字符集为gbk时,就可能 ...

  4. AndroidStudio 内存泄漏的分析过程

    前言部分这次泄漏是自己代码写的太随意引起的,讲道理,代码写的太为所欲为了,导致有些问题根本就很难发现. 泄漏产生的原因,由于activity未被回收导致.这里给我们提出的一个警示,在使用上下文的时候, ...

  5. Ubuntu下解压(unzip)乱码的解决方法

    在Windows上压缩的文件,是以Windows系统默认编码中文来压缩文件.由于zip文件中没有声明其编码,所以linux上的unzip一般以默认编码解压,中文文件名会出现乱码. 通过unzip -- ...

  6. PAT_A1144#The Missing Number

    Source: PAT A1144 The Missing Number (20 分) Description: Given N integers, you are supposed to find ...

  7. vue 导航菜单默认子路由

    export default new Router({ routes: [ { path: '/', name: 'index', component: index, children: [ { pa ...

  8. C语言开发框架、printf(day02)

    C语言里包含以.c作为扩展名的文件,这种 文件叫源文件.C语言程序的绝大部分内容 应该记录在源文件里. C语言里还包括以.h作为扩展名的文件,这种 文件叫头文件. C语言程序里可以直接使用数字和加减乘 ...

  9. 继续聊WPF——为ListView的行设置样式

    <Window x:Class="Wpf_GridHeaderStyle_sample.Window1" xmlns="http://schemas.microso ...

  10. PAT 1090. Highest Price in Supply Chain

    A supply chain is a network of retailers(零售商), distributors(经销商), and suppliers(供应商)-- everyone invo ...