随机森林算法demo python spark

关键参数

最重要的，常常需要调试以提高算法效果的有两个参数：numTrees，maxDepth。

numTrees（决策树的个数）：增加决策树的个数会降低预测结果的方差，这样在测试时会有更高的accuracy。训练时间大致与numTrees呈线性增长关系。
maxDepth：是指森林中每一棵决策树最大可能depth，在决策树中提到了这个参数。更深的一棵树意味模型预测更有力，但同时训练时间更长，也更倾向于过拟合。但是值得注意的是，随机森林算法和单一决策树算法对这个参数的要求是不一样的。随机森林由于是多个的决策树预测结果的投票或平均而降低而预测结果的方差，因此相对于单一决策树而言，不容易出现过拟合的情况。所以随机森林可以选择比决策树模型中更大的maxDepth。
甚至有的文献说，随机森林的每棵决策树都最大可能地进行生长而不进行剪枝。但是不管怎样，还是建议对maxDepth参数进行一定的实验，看看是否可以提高预测的效果。
另外还有两个参数，subsamplingRate，featureSubsetStrategy一般不需要调试，但是这两个参数也可以重新设置以加快训练，但是值得注意的是可能会影响模型的预测效果（如果需要调试的仔细读下面英文吧）。

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
（1）numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
（2）maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
（3）subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
（4）featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

"""

Random Forest Classification Example.

"""

from __future__ import print_function

from pyspark import SparkContext

# $example on$

from pyspark.mllib.tree import RandomForest, RandomForestModel

from pyspark.mllib.util import MLUtils

# $example off$

if __name__ == "__main__":

    sc = SparkContext(appName="PythonRandomForestClassificationExample")

    # $example on$

    # Load and parse the data file into an RDD of LabeledPoint.

    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

    # Split the data into training and test sets (30% held out for testing)

    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.

    #  Empty categoricalFeaturesInfo indicates all features are continuous.

    #  Note: Use larger numTrees in practice.

    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.

    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},

                                         numTrees=3, featureSubsetStrategy="auto",

                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error

    predictions = model.predict(testData.map(lambda x: x.features))

    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

    testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

    print('Test Error = ' + str(testErr))

    print('Learned classification forest model:')

    print(model.toDebugString())

    # Save and load model

    model.save(sc, "target/tmp/myRandomForestClassificationModel")

    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

    # $example off$

模型样子：

TreeEnsembleModel classifier with 3 trees

  Tree 0:

    If (feature 511 <= 0.0)

     If (feature 434 <= 0.0)

      Predict: 0.0

     Else (feature 434 > 0.0)

      Predict: 1.0

    Else (feature 511 > 0.0)

     Predict: 0.0

  Tree 1:

    If (feature 490 <= 31.0)

     Predict: 0.0

    Else (feature 490 > 31.0)

     Predict: 1.0

  Tree 2:

    If (feature 302 <= 0.0)

     If (feature 461 <= 0.0)

      If (feature 208 <= 107.0)

       Predict: 1.0

      Else (feature 208 > 107.0)

       Predict: 0.0

     Else (feature 461 > 0.0)

      Predict: 1.0

    Else (feature 302 > 0.0)

     Predict: 0.0

随机森林算法demo python spark的更多相关文章

Spark mllib 随机森林算法的简单应用（附代码）
此前用自己实现的随机森林算法,应用在titanic生还者预测的数据集上.事实上,有很多开源的算法包供我们使用.无论是本地的机器学习算法包sklearn 还是分布式的spark mllib,都是非常不错 ...
H2O中的随机森林算法介绍及其项目实战（python实现）
H2O中的随机森林算法介绍及其项目实战(python实现) 包的引入:from h2o.estimators.random_forest import H2ORandomForestEstimator ...
用Python实现随机森林算法，深度学习
用Python实现随机森林算法,深度学习拥有高方差使得决策树(secision tress)在处理特定训练数据集时其结果显得相对脆弱.bagging(bootstrap aggregating 的缩 ...
spark 随机森林算法案例实战
随机森林算法由多个决策树构成的森林,算法分类结果由这些决策树投票得到,决策树在生成的过程当中分别在行方向和列方向上添加随机过程,行方向上构建决策树时采用放回抽样(bootstraping)得到训练数 ...
Python机器学习笔记——随机森林算法
随机森林算法的理论知识随机森林是一种有监督学习算法,是以决策树为基学习器的集成学习算法.随机森林非常简单,易于实现,计算开销也很小,但是它在分类和回归上表现出非常惊人的性能,因此,随机森林被誉为“代 ...
随机森林算法OOB_SCORE最佳特征选择
RandomForest算法(有监督学习),可以根据输入数据,选择最佳特征组合,减少特征冗余:原理:由于随机决策树生成过程采用的Boostrap,所以在一棵树的生成过程并不会使用所有的样本,未使用的样 ...
Bagging与随机森林算法原理小结
在集成学习原理小结中,我们讲到了集成学习有两个流派,一个是boosting派系,它的特点是各个弱学习器之间有依赖关系.另一种是bagging流派,它的特点是各个弱学习器之间没有依赖关系,可以并行拟合. ...
R语言︱决策树族——随机森林算法
每每以为攀得众山小,可.每每又切实来到起点,大牛们,缓缓脚步来俺笔记葩分享一下吧,please~ --------------------------- 笔者寄语:有一篇<有监督学习选择深度学习 ...
R语言︱机器学习模型评估方案（以随机森林算法为例）
笔者寄语:本文中大多内容来自<数据挖掘之道>,本文为读书笔记.在刚刚接触机器学习的时候,觉得在监督学习之后,做一个混淆矩阵就已经足够,但是完整的机器学习解决方案并不会如此草率.需要完整的评 ...

随机推荐

guice基本学习，guice的学习资料(十)
这个是我前面几篇的参考. guice的学习资料下载:http://pan.baidu.com/s/1bDEPem 路途遥远,但是人确在走.不忘初心,方得始终.
linux install PyMsql
# 安装pip curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py python get-pip.py # 安装PyMysql pip in ...
使用纯 CSS 实现 Google Photos 照片列表布局
文章太长,因为介绍了如何一步一步进化到最后接近完美的效果的,不想读的同学可以直接跳到最后一个大标题之后看代码.demo及原理就好,或者也可以直接看下面这个链接的源代码. 不过还是建议顺序读下去,因为后 ...
TOF相机基本知识
TOF是Time of flight的简写,直译为飞行时间的意思.所谓飞行时间法3D成像,是通过给目标连续发送光脉冲,然后利用传感器接收从物体返回的光,通过探测光脉冲的飞行时间来得到目标物的距离.TO ...
使用meta实现页面的定时刷新或跳转
<meta http-equiv="refresh" content="5"> 这个表示当前页面每5秒钟刷一下,刷一下~ <meta http ...
APICloud开发小技巧（一）
apicloud开发文档中,前端开发框架指的就是,类似jq\js的语法: https://docs.apicloud.com/Front-end-Framework/framework-dev-gui ...
java 常用API 时间
package com.orcal.demc01; import java.text.ParseException; import java.text.SimpleDateFormat; import ...
漫谈 Google 的 Native Client(NaCl) 技术（二）---- 技术篇（兼谈 LLVM）
转自:http://hzx5.blog.163.com/blog/static/40744388201172531637729/ 漫谈 Google 的 Native Client(NaCl) 技术( ...
Vue.js 渲染简写样式存在的问题
引出问题首先我们来这么一个问题, 这里是完整的 jsfiddle demo or codepen demo 给一个元素绑定两个边框样式, 右侧和底部都为1px的红色边框 styleA: { bord ...
Ubuntu双系统后时间不对解决方案
先在ubuntu下更新一下时间,确保时间无误 sudo apt install ntpdate sudo ntpdate time.windows.com 然后将时间更新到硬件上 sudo hwclo ...

随机森林算法demo python spark

关键参数

随机森林算法demo python spark的更多相关文章

随机推荐

热门专题