python spark 决策树 入门demo
Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API.
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils # Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3]) # Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32) # Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString()) # Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
class pyspark.mllib.tree.DecisionTree[source]
Learning algorithm for a decision tree model for classification or regression.
New in version 1.1.0.
- classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity='gini', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0)[source]
-
Train a decision tree model for classification.
Parameters: - data – Training data: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
- numClasses – Number of classes for classification.
- categoricalFeaturesInfo – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
- impurity – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
- maxDepth – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 5)
- maxBins – Number of bins used for finding splits at each node. (default: 32)
- minInstancesPerNode – Minimum number of instances required at child nodes to create the parent split. (default: 1)
- minInfoGain – Minimum info gain required to create a split. (default: 0.0)
Returns: DecisionTreeModel.
Example usage:
>>> from numpy import array
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import DecisionTree
>>>
>>> data = [
... LabeledPoint(0.0, [0.0]),
... LabeledPoint(1.0, [1.0]),
... LabeledPoint(1.0, [2.0]),
... LabeledPoint(1.0, [3.0])
... ]
>>> model = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
>>> print(model)
DecisionTreeModel classifier of depth 1 with 3 nodes>>> print(model.toDebugString())
DecisionTreeModel classifier of depth 1 with 3 nodes
If (feature 0 <= 0.0)
Predict: 0.0
Else (feature 0 > 0.0)
Predict: 1.0 >>> model.predict(array([1.0]))
1.0
>>> model.predict(array([0.0]))
0.0
>>> rdd = sc.parallelize([[1.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]-
摘自:https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree
python spark 决策树 入门demo的更多相关文章
- Spark快速入门 - Spark 1.6.0
Spark快速入门 - Spark 1.6.0 转载请注明出处:http://www.cnblogs.com/BYRans/ 快速入门(Quick Start) 本文简单介绍了Spark的使用方式.首 ...
- Spark快速入门
Spark 快速入门 本教程快速介绍了Spark的使用. 首先我们介绍了通过Spark 交互式shell调用API( Python或者scala代码),然后演示如何使用Java, Scala或者P ...
- spark streaming 入门例子
spark streaming 入门例子: spark shell import org.apache.spark._ import org.apache.spark.streaming._ sc.g ...
- 转-Python自然语言处理入门
Python自然语言处理入门 原文链接:http://python.jobbole.com/85094/ 分享到:20 本文由 伯乐在线 - Ree Ray 翻译,renlytime 校稿.未经许 ...
- Spark高速入门指南(Quick Start Spark)
版权声明:本博客已经不再更新.请移步到Hadoop技术博客:https://www.iteblog.com https://blog.csdn.net/w397090770/article/detai ...
- [转] Spark快速入门指南 – Spark安装与基础使用
[From] https://blog.csdn.net/w405722907/article/details/77943331 Spark快速入门指南 – Spark安装与基础使用 2017年09月 ...
- storm入门demo
一.storm入门demo的介绍 storm的入门helloworld有2种方式,一种是本地的,另一种是远程. 本地实现: 本地写好demo之后,不用搭建storm集群,下载storm的相关jar包即 ...
- Python学习--01入门
Python学习--01入门 Python是一种解释型.面向对象.动态数据类型的高级程序设计语言.和PHP一样,它是后端开发语言. 如果有C语言.PHP语言.JAVA语言等其中一种语言的基础,学习Py ...
- Python简单爬虫入门三
我们继续研究BeautifulSoup分类打印输出 Python简单爬虫入门一 Python简单爬虫入门二 前两部主要讲述我们如何用BeautifulSoup怎去抓取网页信息以及获取相应的图片标题等信 ...
随机推荐
- 利用POPAnimatableProperty属性来实现动画倒计时
POPAnimatableProperty *prop = [POPAnimatableProperty propertyWithName:@"countdown" initial ...
- 基于mybatis向oracle中插入数据的性能对比
数据库表结构: 逐条插入sql语句: <insert id="insert" parameterType="com.Structure"> INSE ...
- Maven项目pom.xml配置详解
maven项目pom.xml文件配置详解,需要时可以用作参考: <project xmlns="http://maven.apache.org/POM/4.0.0" xmln ...
- Leetcode0100--Same Tree 相同树
[转载请注明]http://www.cnblogs.com/igoslly/p/8707664.html 来看一下题目: Given two binary trees, write a functio ...
- 【转载】linux环境下大数据网站搬家
这里说的大数据是指你的网站数据库大小至少超过了500M,当然只有50M的网站也同样可以用这样的方法来轻松安全的实现网站搬家,前提是你使用的是linux环境下的VPS或者独立服务器. 我们假设你的网站域 ...
- 【sqli-labs】 less41 GET -Blind based -Intiger -Stacked(GET型基于盲注的堆叠查询整型注入)
整型的不用闭合引号 http://192.168.136.128/sqli-labs-master/Less-41/?id=1;insert into users(id,username,passwo ...
- 初识cocos creator的一些问题
本文的cocos creator版本为v1.9.01.color赋值cc.Label组件并没有颜色相关的属性,但是Node有color的属性. //如果4个参数,在ios下有问题let rgb = [ ...
- Markdown 常用语法总结
注意:Markdown使用#.+.*等符号来标记,符号后面必须跟上至少跟上 1个空格才有效! Markdown的常用语法 标题 Markdown标题支持两种形式. 1.用#标记 在标题开头加上1~6个 ...
- SP1825 FTOUR2 - Free tour II 点分治+启发式合并+未调完
题意翻译 给定一棵n个点的树,树上有m个黑点,求出一条路径,使得这条路径经过的黑点数小于等于k,且路径长度最大 Code: #include <bits/stdc++.h> using n ...
- jsp 多条件组合查询
web层: public String query(HttpServletRequest request, HttpServletResponse response) throws ServletEx ...