Python Spark decision tree: a getting-started demo

Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API.

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
# `sc` is an existing SparkContext (for example, the one the pyspark shell provides).
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into each (label, prediction) pair; tuple-unpacking lambdas only work in Python 2.
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.

class pyspark.mllib.tree.DecisionTree

Learning algorithm for a decision tree model for classification or regression.

New in version 1.1.0.

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity='gini', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0)

Train a decision tree model for classification.

Parameters:

data – Training data: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
numClasses – Number of classes for classification.
categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 5)
maxBins – Number of bins used for finding splits at each node. (default: 32)
minInstancesPerNode – Minimum number of instances required at child nodes to create the parent split. (default: 1)
minInfoGain – Minimum info gain required to create a split. (default: 0.0)
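To make the impurity and minInfoGain parameters more concrete, here is a rough sketch of how Gini impurity, entropy, and the information gain of a split can be computed in plain Python. The function names are illustrative, the base-2 logarithm for entropy is an assumption of this sketch, and none of this is Spark's own implementation.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the share of each distinct class in a list of labels."""
    total = float(len(labels))
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits (base-2 logarithm, an assumption of this sketch)."""
    return -sum(p * math.log(p, 2) for p in class_proportions(labels))


def info_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with labels 0, 1, 1, 1 split into [0] and [1, 1, 1].
parent = [0, 1, 1, 1]
left, right = [0], [1, 1, 1]
gini_gain = info_gain(parent, left, right, impurity=gini)
entropy_gain = info_gain(parent, left, right, impurity=entropy)
print(gini_gain, entropy_gain)
```

Both children here are pure, so the gain equals the parent's impurity under either criterion. A split whose gain falls below minInfoGain would not be created.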

Returns:

DecisionTreeModel.
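The maxDepth description above (depth 0 means 1 leaf, depth 1 means 1 internal node plus 2 leaves) follows from the trees being binary. A small sketch of the resulting upper bounds, with helper names invented for illustration:

```python
def max_leaves(max_depth):
    """Upper bound on leaf nodes in a binary tree of the given depth."""
    return 2 ** max_depth


def max_nodes(max_depth):
    """Upper bound on total nodes (internal plus leaf) for the given depth."""
    return 2 ** (max_depth + 1) - 1


for depth in (0, 1, 5):
    print(depth, max_nodes(depth), max_leaves(depth))
```

These are upper bounds only; a trained tree can be smaller when splits are rejected by minInstancesPerNode or minInfoGain, or when nodes become pure early.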

Example usage:

>>> from numpy import array
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import DecisionTree
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(1.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>> model = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
>>> print(model)
DecisionTreeModel classifier of depth 1 with 3 nodes
>>> print(model.toDebugString())
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 0 <= 0.0)
   Predict: 0.0
  Else (feature 0 > 0.0)
   Predict: 1.0
>>> model.predict(array([1.0]))
1.0
>>> model.predict(array([0.0]))
0.0
>>> rdd = sc.parallelize([[1.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
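To see why the toy model above splits at feature 0 <= 0.0, the sketch below scores every candidate threshold on the same four points by Gini-based information gain. It is a simplified illustration in plain Python: it tries each observed feature value as a threshold, whereas Spark picks candidates from binned values (see maxBins), so the threshold Spark reports may differ between versions. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = float(len(labels))
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(points):
    """points is a list of (label, feature_value); return (threshold, gain)."""
    n = float(len(points))
    parent = gini([label for label, _ in points])
    best = (None, 0.0)
    for threshold in sorted(set(x for _, x in points)):
        left = [label for label, x in points if x <= threshold]
        right = [label for label, x in points if x > threshold]
        if not left or not right:
            continue  # a split must send at least one point to each side
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best


# The same (label, feature) pairs as the LabeledPoint list above.
toy = [(0.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
threshold, gain = best_threshold(toy)
print(threshold, gain)
```

Splitting at 0.0 separates the single class-0 point from the three class-1 points, leaving both children pure, which is why a depth-1 tree with 3 nodes is enough.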

Excerpted from: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree
