pyspark RandomForestRegressor 随机森林回归

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

"""

Created on Fri Jun  8 09:27:08 2018

@author: luogan

"""

from pyspark.ml import Pipeline

from pyspark.ml.regression import RandomForestRegressor

from pyspark.ml.feature import VectorIndexer

from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.sql import SparkSession

spark= SparkSession\

                .builder \

                .appName("dataFrame") \

                .getOrCreate()

# Load and parse the data file, converting it to a DataFrame.

data = spark.read.format("libsvm").load("/home/luogan/lg/softinstall/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.

# Set maxCategories so features with > 4 distinct values are treated as continuous.

featureIndexer =\

    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)

(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.

rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline

pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.

model = pipeline.fit(trainingData)

# Make predictions.

predictions = model.transform(testData)

# Select example rows to display.

predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error

evaluator = RegressionEvaluator(

    labelCol="label", predictionCol="prediction", metricName="rmse")

rmse = evaluator.evaluate(predictions)

print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]

print(rfModel)  # summary only

　结果：

+----------+-----+--------------------+

|prediction|label|            features|

+----------+-----+--------------------+

|       0.0|  0.0|(692,[95,96,97,12...|

|       0.3|  0.0|(692,[100,101,102...|

|       0.0|  0.0|(692,[123,124,125...|

|      0.05|  0.0|(692,[124,125,126...|

|       0.0|  0.0|(692,[124,125,126...|

+----------+-----+--------------------+

only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.127949

RandomForestRegressionModel (uid=RandomForestRegressor_4acc9ab165e4f84f7169) with 20 trees

原文：https://blog.csdn.net/luoganttcc/article/details/80618336

PySpark 分类模型训练参考：

https://blog.csdn.net/u013719780/article/details/51792097

pyspark RandomForestRegressor 随机森林回归的更多相关文章

机器学习之路：python 集成回归模型随机森林回归RandomForestRegressor 极端随机森林回归ExtraTreesRegressor GradientBoostingRegressor回归预测波士顿房价
python3 学习机器学习api 使用了三种集成回归模型 git: https://github.com/linyi0604/MachineLearning 代码: from sklearn.dat ...
机器学习实战基础（三十八）：随机森林（五）RandomForestRegressor 之用随机森林回归填补缺失值
简介我们从现实中收集的数据,几乎不可能是完美无缺的,往往都会有一些缺失值.面对缺失值,很多人选择的方式是直接将含有缺失值的样本删除,这是一种有效的方法,但是有时候填补缺失值会比直接丢弃样本效果更好, ...
MATLAB随机森林回归模型
MATLAB随机森林回归模型: 调用matlab自带的TreeBagger.m T=textread('E:\datasets-orreview\discretized-regression\10bi ...
机器学习实战基础（三十七）：随机森林（四）之 RandomForestRegressor 重要参数，属性与接口
RandomForestRegressor class sklearn.ensemble.RandomForestRegressor (n_estimators=’warn’, criterion=’ ...
Python机器学习笔记——随机森林算法
随机森林算法的理论知识随机森林是一种有监督学习算法,是以决策树为基学习器的集成学习算法.随机森林非常简单,易于实现,计算开销也很小,但是它在分类和回归上表现出非常惊人的性能,因此,随机森林被誉为“代 ...
随机森林random forest及python实现
引言想通过随机森林来获取数据的主要特征 1.理论根据个体学习器的生成方式,目前的集成学习方法大致可分为两大类,即个体学习器之间存在强依赖关系,必须串行生成的序列化方法,以及个体学习器间不存在强依赖关系 ...
100天搞定机器学习|Day56 随机森林工作原理及调参实战（信用卡欺诈预测）
本文是对100天搞定机器学习|Day33-34 随机森林的补充前文对随机森林的概念.工作原理.使用方法做了简单介绍,并提供了分类和回归的实例. 本期我们重点讲一下: 1.集成学习.Bagging和随 ...
RandomForest 随机森林算法与模型参数的调优
公号:码农充电站pro 主页:https://codeshellme.github.io 本篇文章来介绍随机森林(RandomForest)算法. 1,集成算法之 bagging 算法在前边的文章& ...
[Python] 波士顿房价的7种模型(线性拟合、二次多项式、Ridge、Lasso、SVM、决策树、随机森林)的训练效果对比
目录 1. 载入数据列解释Columns: 2. 数据分析 2.1 预处理 2.2 可视化 3. 训练模型 3.1 线性拟合 3.2 多项式回归(二次) 3.3 脊回归(Ridge Regressi ...

随机推荐

Windows 下 Git 安装与初始配置
官方下载地址:https://git-scm.com/download/win,我下载的最新版是 Git-2.15.1.2-64-bit.exe . Windows 下安装步骤 1.相关信息,直接“ ...
PLT redirection through shared object injection into a running process
PLT redirection through shared object injection into a running process
mybatis+spring配置
可参考:http://www.javacodegeeks.com/2014/02/building-java-web-application-using-mybatis-with-spring.htm ...
Spring Boot 之 RESTfull API简单项目的快速搭建（二）
1.打包 -- Maven build 2.问题 [WARNING] The requested profile "pom.xml" could not be activated ...
logrotate: 管理日志文件
Erik Troan提供了一种优秀的工具logrotate,它实现了多种多样的日志管理策略,而且在我们举例的所有发行版本上都是标准应用. logrotate的配置文件由一系列规范组成,它们说明了要管理 ...
JSFL 获取当前脚本路径，执行其他脚本
Application.jsfl为程序入口,导入其他jsfl [Common.jsfl] function trace() { fl.trace(Array.prototype.join.call(a ...
Linux上实现Windows的SQLPlus保存SQL历史记录功能
在Windows操作系统上,当在DOS命令窗口中运行SQL*Plus的时候,可以使用向上,向下键来跳回之前已经执行过的SQL语句.你可以根据需要修改他们,然后按Enter键重新提交执行. 然而,当在L ...
Android中怎样做到自己定义的广播仅仅能有指定的app接收
今天没吊事.又去面试了,详细哪家公司就不说了,由于我在之前的blog中注明了那些家公司的名字,结果人家给我私信说我泄露他们的题目.好吧,我错了... 事实上当我们已经在工作的时候.我们能够在空暇的时间 ...
【转】Jenkins怎么启动和停止服务
笔者没有把Jenkins配置到tomcat中,每次都是用命令行来启动Jenkins.但是遇到一个问题:Jenkins一直是开着的,想关闭也关闭不了.百度了一些资料,均不靠谱(必须吐槽一下百度).于是进 ...
Android控件进阶-自定义流式布局和热门标签控件
技术:Android+java 概述在日常的app使用中,我们会在android 的app中看见热门标签等自动换行的流式布局,今天,我们就来看看如何自定义一个类似热门标签那样的流式布局吧,类 ...

pyspark RandomForestRegressor 随机森林回归

pyspark RandomForestRegressor 随机森林回归的更多相关文章

随机推荐

热门专题