Spark 2: One-Hot Encoding, Standardization, PCA, and Clustering
1. Import packages
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.DataFrameReader
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrameStatFunctions
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.clustering.KMeans
2. Load the data
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val data: DataFrame = spark.read.format("csv").option("header", true).load("hdfs://ns1/datafile/wangxiao/Affairs.csv")
data: org.apache.spark.sql.DataFrame = [affairs: string, gender: string ... 7 more fields]
data.cache
res0: data.type = [affairs: string, gender: string ... 7 more fields]
data.limit(10).show()
+-------+------+---+------------+--------+-------------+---------+----------+------+
|affairs|gender|age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+---+------------+--------+-------------+---------+----------+------+
| 0| male| 37| 10| no| 3| 18| 7| 4|
| 0|female| 27| 4| no| 4| 14| 6| 4|
| 0|female| 32| 15| yes| 1| 12| 1| 4|
| 0| male| 57| 15| yes| 5| 18| 6| 5|
| 0| male| 22| 0.75| no| 2| 17| 6| 3|
| 0|female| 32| 1.5| no| 2| 17| 5| 5|
| 0|female| 22| 0.75| no| 2| 12| 1| 3|
| 0| male| 57| 15| yes| 2| 14| 4| 4|
| 0|female| 32| 15| yes| 4| 16| 1| 2|
| 0| male| 22| 1.5| no| 4| 14| 4| 5|
+-------+------+---+------------+--------+-------------+---------+----------+------+
// Cast the column types, keeping the Double and String fields separate
val data1 = data.select(
  data("affairs").cast("Double"),
  data("age").cast("Double"),
  data("yearsmarried").cast("Double"),
  data("religiousness").cast("Double"),
  data("education").cast("Double"),
  data("occupation").cast("Double"),
  data("rating").cast("Double"),
  data("gender").cast("String"),
  data("children").cast("String"))
data1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 7 more fields]
data1.printSchema()
root
|-- affairs: double (nullable = true)
|-- age: double (nullable = true)
|-- yearsmarried: double (nullable = true)
|-- religiousness: double (nullable = true)
|-- education: double (nullable = true)
|-- occupation: double (nullable = true)
|-- rating: double (nullable = true)
|-- gender: string (nullable = true)
|-- children: string (nullable = true)
data1.limit(10).show
+-------+----+------------+-------------+---------+----------+------+------+--------+
|affairs| age|yearsmarried|religiousness|education|occupation|rating|gender|children|
+-------+----+------------+-------------+---------+----------+------+------+--------+
| 0.0|37.0| 10.0| 3.0| 18.0| 7.0| 4.0| male| no|
| 0.0|27.0| 4.0| 4.0| 14.0| 6.0| 4.0|female| no|
| 0.0|32.0| 15.0| 1.0| 12.0| 1.0| 4.0|female| yes|
| 0.0|57.0| 15.0| 5.0| 18.0| 6.0| 5.0| male| yes|
| 0.0|22.0| 0.75| 2.0| 17.0| 6.0| 3.0| male| no|
| 0.0|32.0| 1.5| 2.0| 17.0| 5.0| 5.0|female| no|
| 0.0|22.0| 0.75| 2.0| 12.0| 1.0| 3.0|female| no|
| 0.0|57.0| 15.0| 2.0| 14.0| 4.0| 4.0| male| yes|
| 0.0|32.0| 15.0| 4.0| 16.0| 1.0| 2.0|female| yes|
| 0.0|22.0| 1.5| 4.0| 14.0| 4.0| 5.0| male| no|
+-------+----+------------+-------------+---------+----------+------+------+--------+
val dataDF = data1
dataDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 7 more fields]
dataDF.cache()
res4: dataDF.type = [affairs: double, age: double ... 7 more fields]
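Casting every column by hand works, but the CSV reader can also infer numeric types at load time. A minimal alternative sketch (note that inferSchema may pick int rather than double for whole-number columns, so the explicit casts above remain the safer route):
// Alternative load: let Spark infer the column types from the data.
val inferred = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("hdfs://ns1/datafile/wangxiao/Affairs.csv")
inferred.printSchema()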
3. Convert strings to numeric indexes, then one-hot encode; note that setDropLast is set to false
Convert strings to numeric indexes
val indexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIndex").fit(dataDF)
indexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_27dba613193a
val indexed = indexer.transform(dataDF)
indexed: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 8 more fields]
// One-hot encode; note that setDropLast is set to false
val encoder = new OneHotEncoder().setInputCol("genderIndex").setOutputCol("genderVec").setDropLast(false)
encoder: org.apache.spark.ml.feature.OneHotEncoder = oneHot_155a53de3aef
val encoded = encoder.transform(indexed)
encoded: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 9 more fields]
encoded.show()
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+
|affairs| age|yearsmarried|religiousness|education|occupation|rating|gender|children|genderIndex| genderVec|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+
| 0.0|37.0| 10.0| 3.0| 18.0| 7.0| 4.0| male| no| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 6.0| 4.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|32.0| 15.0| 1.0| 12.0| 1.0| 4.0|female| yes| 0.0|(2,[0],[1.0])|
| 0.0|57.0| 15.0| 5.0| 18.0| 6.0| 5.0| male| yes| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 0.75| 2.0| 17.0| 6.0| 3.0| male| no| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 1.5| 2.0| 17.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 2.0| 12.0| 1.0| 3.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|57.0| 15.0| 2.0| 14.0| 4.0| 4.0| male| yes| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 15.0| 4.0| 16.0| 1.0| 2.0|female| yes| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 4.0| 14.0| 4.0| 5.0| male| no| 1.0|(2,[1],[1.0])|
| 0.0|37.0| 15.0| 2.0| 20.0| 7.0| 2.0| male| yes| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 18.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])|
| 0.0|47.0| 15.0| 5.0| 17.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 1.5| 2.0| 17.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|37.0| 15.0| 1.0| 17.0| 5.0| 5.0|female| yes| 0.0|(2,[0],[1.0])|
| 0.0|37.0| 15.0| 2.0| 18.0| 4.0| 3.0|female| yes| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 3.0| 16.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 2.0| 16.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])|
| 0.0|27.0| 10.0| 2.0| 14.0| 1.0| 5.0|female| yes| 0.0|(2,[0],[1.0])|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+
only showing top 20 rows
val indexer1 = new StringIndexer().setInputCol("children").setOutputCol("childrenIndex").fit(encoded)
indexer1: org.apache.spark.ml.feature.StringIndexerModel = strIdx_55db099c07b7
val indexed1 = indexer1.transform(encoded)
indexed1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 10 more fields]
val encoder1 = new OneHotEncoder().setInputCol("childrenIndex").setOutputCol("childrenVec").setDropLast(false)
val encoded1 = encoder1.transform(indexed1)
encoded1: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 11 more fields]
encoded1.show()
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
|affairs| age|yearsmarried|religiousness|education|occupation|rating|gender|children|genderIndex| genderVec|childrenIndex| childrenVec|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
| 0.0|37.0| 10.0| 3.0| 18.0| 7.0| 4.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 6.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 15.0| 1.0| 12.0| 1.0| 4.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|57.0| 15.0| 5.0| 18.0| 6.0| 5.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 2.0| 17.0| 6.0| 3.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 1.5| 2.0| 17.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 0.75| 2.0| 12.0| 1.0| 3.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|57.0| 15.0| 2.0| 14.0| 4.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|32.0| 15.0| 4.0| 16.0| 1.0| 2.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 4.0| 14.0| 4.0| 5.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|37.0| 15.0| 2.0| 20.0| 7.0| 2.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|27.0| 4.0| 4.0| 18.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|47.0| 15.0| 5.0| 17.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 2.0| 17.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|37.0| 15.0| 1.0| 17.0| 5.0| 5.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|37.0| 15.0| 2.0| 18.0| 4.0| 3.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 3.0| 16.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 1.5| 2.0| 16.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 10.0| 2.0| 14.0| 1.0| 5.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
only showing top 20 rows
val encodeDF: DataFrame = encoded1
encodeDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 11 more fields]
encodeDF.show()
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
|affairs| age|yearsmarried|religiousness|education|occupation|rating|gender|children|genderIndex| genderVec|childrenIndex| childrenVec|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
| 0.0|37.0| 10.0| 3.0| 18.0| 7.0| 4.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 6.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 15.0| 1.0| 12.0| 1.0| 4.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|57.0| 15.0| 5.0| 18.0| 6.0| 5.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 2.0| 17.0| 6.0| 3.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|32.0| 1.5| 2.0| 17.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 0.75| 2.0| 12.0| 1.0| 3.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|57.0| 15.0| 2.0| 14.0| 4.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|32.0| 15.0| 4.0| 16.0| 1.0| 2.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 4.0| 14.0| 4.0| 5.0| male| no| 1.0|(2,[1],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|37.0| 15.0| 2.0| 20.0| 7.0| 2.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|27.0| 4.0| 4.0| 18.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|47.0| 15.0| 5.0| 17.0| 6.0| 4.0| male| yes| 1.0|(2,[1],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 1.5| 2.0| 17.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 4.0| 4.0| 14.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|37.0| 15.0| 1.0| 17.0| 5.0| 5.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|37.0| 15.0| 2.0| 18.0| 4.0| 3.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
| 0.0|22.0| 0.75| 3.0| 16.0| 5.0| 4.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|22.0| 1.5| 2.0| 16.0| 5.0| 5.0|female| no| 0.0|(2,[0],[1.0])| 1.0|(2,[1],[1.0])|
| 0.0|27.0| 10.0| 2.0| 14.0| 1.0| 5.0|female| yes| 0.0|(2,[0],[1.0])| 0.0|(2,[0],[1.0])|
+-------+----+------------+-------------+---------+----------+------+------+--------+-----------+-------------+-------------+-------------+
only showing top 20 rows
encodeDF.printSchema()
root
|-- affairs: double (nullable = true)
|-- age: double (nullable = true)
|-- yearsmarried: double (nullable = true)
|-- religiousness: double (nullable = true)
|-- education: double (nullable = true)
|-- occupation: double (nullable = true)
|-- rating: double (nullable = true)
|-- gender: string (nullable = true)
|-- children: string (nullable = true)
|-- genderIndex: double (nullable = true)
|-- genderVec: vector (nullable = true)
|-- childrenIndex: double (nullable = true)
|-- childrenVec: vector (nullable = true)
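The two indexer/encoder pairs above can also be chained into a single ML Pipeline, so the fitted stages stay together and can be reapplied to new data in one call. A sketch using the same column names (an equivalent formulation, not how the original session was run):
import org.apache.spark.ml.Pipeline
val genderIndexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIndex")
val genderEncoder = new OneHotEncoder().setInputCol("genderIndex").setOutputCol("genderVec").setDropLast(false)
val childrenIndexer = new StringIndexer().setInputCol("children").setOutputCol("childrenIndex")
val childrenEncoder = new OneHotEncoder().setInputCol("childrenIndex").setOutputCol("childrenVec").setDropLast(false)
val encodePipeline = new Pipeline().setStages(Array(genderIndexer, genderEncoder, childrenIndexer, childrenEncoder))
// fit() trains the indexers; transform() then adds all four derived columns at once
val encodedViaPipeline = encodePipeline.fit(dataDF).transform(dataDF)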
4. Assemble the fields into a feature vector
// Assemble the fields into a feature vector
val assembler = new VectorAssembler().setInputCols(Array("affairs", "age", "yearsmarried", "religiousness", "education", "occupation", "rating", "genderVec", "childrenVec")).setOutputCol("features")
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_df76d5d1e3f4
val vecDF: DataFrame = assembler.transform(encodeDF)
vecDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 12 more fields]
vecDF.select("features").show
+--------------------+
| features|
+--------------------+
|[0.0,37.0,10.0,3....|
|[0.0,27.0,4.0,4.0...|
|[0.0,32.0,15.0,1....|
|[0.0,57.0,15.0,5....|
|[0.0,22.0,0.75,2....|
|[0.0,32.0,1.5,2.0...|
|[0.0,22.0,0.75,2....|
|[0.0,57.0,15.0,2....|
|[0.0,32.0,15.0,4....|
|[0.0,22.0,1.5,4.0...|
|[0.0,37.0,15.0,2....|
|[0.0,27.0,4.0,4.0...|
|[0.0,47.0,15.0,5....|
|[0.0,22.0,1.5,2.0...|
|[0.0,27.0,4.0,4.0...|
|[0.0,37.0,15.0,1....|
|[0.0,37.0,15.0,2....|
|[0.0,22.0,0.75,3....|
|[0.0,22.0,1.5,2.0...|
|[0.0,27.0,10.0,2....|
+--------------------+
only showing top 20 rows
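Note that VectorAssembler flattens the two 2-element one-hot vectors, so features has 7 + 2 + 2 = 11 dimensions; this is why the PCA loading matrix below has 11 rows. A quick sanity check:
// The assembled vector should have 11 slots: 7 numeric columns + genderVec (2) + childrenVec (2)
import org.apache.spark.ml.linalg.Vector
vecDF.select("features").head.getAs[Vector](0).size // 11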
5. Standardization: mean and standard deviation
// Standardize to zero mean and unit standard deviation
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(true)
scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_43d3da1cd3bf
// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(vecDF)
scalerModel: org.apache.spark.ml.feature.StandardScalerModel = stdScal_43d3da1cd3bf
// Center each feature to zero mean and scale to unit standard deviation.
val scaledData: DataFrame = scalerModel.transform(vecDF)
scaledData: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 13 more fields]
scaledData.select("features", "scaledFeatures").show
+--------------------+--------------------+
| features| scaledFeatures|
+--------------------+--------------------+
|[0.0,37.0,10.0,3....|[-0.4413500298573...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
|[0.0,32.0,15.0,1....|[-0.4413500298573...|
|[0.0,57.0,15.0,5....|[-0.4413500298573...|
|[0.0,22.0,0.75,2....|[-0.4413500298573...|
|[0.0,32.0,1.5,2.0...|[-0.4413500298573...|
|[0.0,22.0,0.75,2....|[-0.4413500298573...|
|[0.0,57.0,15.0,2....|[-0.4413500298573...|
|[0.0,32.0,15.0,4....|[-0.4413500298573...|
|[0.0,22.0,1.5,4.0...|[-0.4413500298573...|
|[0.0,37.0,15.0,2....|[-0.4413500298573...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
|[0.0,47.0,15.0,5....|[-0.4413500298573...|
|[0.0,22.0,1.5,2.0...|[-0.4413500298573...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|
|[0.0,37.0,15.0,1....|[-0.4413500298573...|
|[0.0,37.0,15.0,2....|[-0.4413500298573...|
|[0.0,22.0,0.75,3....|[-0.4413500298573...|
|[0.0,22.0,1.5,2.0...|[-0.4413500298573...|
|[0.0,27.0,10.0,2....|[-0.4413500298573...|
+--------------------+--------------------+
only showing top 20 rows
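MinMaxScaler is imported above but never used; it is the usual alternative when each feature should be rescaled to [0, 1] instead of z-scores. A sketch, with minMaxFeatures as an assumed output column name:
// Alternative scaling: rescale every feature to the [0, 1] range
val minMaxScaler = new MinMaxScaler().setInputCol("features").setOutputCol("minMaxFeatures")
val minMaxModel = minMaxScaler.fit(vecDF)
minMaxModel.transform(vecDF).select("features", "minMaxFeatures").show(5)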
6. Principal component analysis (PCA)
// Principal components
val pca = new PCA().setInputCol("scaledFeatures").setOutputCol("pcaFeatures").setK(3).fit(scaledData)
pca.explainedVariance.values // explained variance of each component
res11: Array[Double] = Array(0.28779526464781313, 0.23798543640278289, 0.11742828783633019)
pca.pc // loadings (correlations between the observed variables and the principal components)
res12: org.apache.spark.ml.linalg.DenseMatrix =
res12: org.apache.spark.ml.linalg.DenseMatrix =
-0.12034310848156521 0.05153952289637974 0.6678769450480689
-0.42860623714516627 0.05417889891307473 -0.05592377098140197
-0.44404074412877986 0.1926596811059294 -0.017025575192258197
-0.12233707317255231 0.08053139375662526 -0.5093149296300096
-0.14664751606128462 -0.3872166556211308 -0.03406819489501708
-0.145543746024348 -0.43054860653839705 0.07841454709046872
0.17703994181974803 -0.12792784984216296 -0.5173229755329072
0.2459668445061567 0.4915809641798787 0.010477548320795945
-0.2459668445061567 -0.4915809641798787 -0.010477548320795945
-0.44420980045271047 0.240652448514566 -0.089356723885704
0.4442098004527103 -0.24065244851456588 0.08935672388570405
pca.extractParamMap()
res13: org.apache.spark.ml.param.ParamMap =
{
pca_40a453a54776-inputCol: scaledFeatures,
pca_40a453a54776-k: 3,
pca_40a453a54776-outputCol: pcaFeatures
}
pca.params
res14: Array[org.apache.spark.ml.param.Param[_]] = Array(pca_40a453a54776__inputCol, pca_40a453a54776__k, pca_40a453a54776__outputCol)
val pcaDF: DataFrame = pca.transform(scaledData)
pcaDF: org.apache.spark.sql.DataFrame = [affairs: double, age: double ... 14 more fields]
pcaDF.cache()
res15: pcaDF.type = [affairs: double, age: double ... 14 more fields]
pcaDF.printSchema()
root
|-- affairs: double (nullable = true)
|-- age: double (nullable = true)
|-- yearsmarried: double (nullable = true)
|-- religiousness: double (nullable = true)
|-- education: double (nullable = true)
|-- occupation: double (nullable = true)
|-- rating: double (nullable = true)
|-- gender: string (nullable = true)
|-- children: string (nullable = true)
|-- genderIndex: double (nullable = true)
|-- genderVec: vector (nullable = true)
|-- childrenIndex: double (nullable = true)
|-- childrenVec: vector (nullable = true)
|-- features: vector (nullable = true)
|-- scaledFeatures: vector (nullable = true)
 |-- pcaFeatures: vector (nullable = true)
pcaDF.select("features", "scaledFeatures", "pcaFeatures").show
+--------------------+--------------------+--------------------+
| features| scaledFeatures| pcaFeatures|
+--------------------+--------------------+--------------------+
|[0.0,37.0,10.0,3....|[-0.4413500298573...|[0.27828160409293...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[2.42147114101165...|
|[0.0,32.0,15.0,1....|[-0.4413500298573...|[0.18301418047489...|
|[0.0,57.0,15.0,5....|[-0.4413500298573...|[-2.9795960667914...|
|[0.0,22.0,0.75,2....|[-0.4413500298573...|[1.79299133565688...|
|[0.0,32.0,1.5,2.0...|[-0.4413500298573...|[2.65694237441759...|
|[0.0,22.0,0.75,2....|[-0.4413500298573...|[3.48234503794570...|
|[0.0,57.0,15.0,2....|[-0.4413500298573...|[-2.4215838062079...|
|[0.0,32.0,15.0,4....|[-0.4413500298573...|[-0.6964555195741...|
|[0.0,22.0,1.5,4.0...|[-0.4413500298573...|[2.18771069800414...|
|[0.0,37.0,15.0,2....|[-0.4413500298573...|[-2.4259075891377...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[-0.7743038356008...|
|[0.0,47.0,15.0,5....|[-0.4413500298573...|[-2.6176149267534...|
|[0.0,22.0,1.5,2.0...|[-0.4413500298573...|[2.95788535193022...|
|[0.0,27.0,4.0,4.0...|[-0.4413500298573...|[2.50146472861263...|
|[0.0,37.0,15.0,1....|[-0.4413500298573...|[-0.5123817022008...|
|[0.0,37.0,15.0,2....|[-0.4413500298573...|[-0.9191740114044...|
|[0.0,22.0,0.75,3....|[-0.4413500298573...|[2.97391491782863...|
|[0.0,22.0,1.5,2.0...|[-0.4413500298573...|[3.17940505267806...|
|[0.0,27.0,10.0,2....|[-0.4413500298573...|[0.74585406839527...|
+--------------------+--------------------+--------------------+
only showing top 20 rows
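The three components above explain roughly 0.288 + 0.238 + 0.117 ≈ 64% of the total variance. To choose k less arbitrarily, one can fit PCA at the full dimensionality and read off the cumulative explained variance; a sketch (k = 11 matches the assembled feature width):
// Fit at full width and print the cumulative explained-variance curve
val pcaFull = new PCA().setInputCol("scaledFeatures").setOutputCol("pcaAll").setK(11).fit(scaledData)
val cumulative = pcaFull.explainedVariance.values.scanLeft(0.0)(_ + _).tail
cumulative.zipWithIndex.foreach { case (v, i) => println(f"k=${i + 1}: $v%.3f") }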
7. Clustering
// Note the maximum number of iterations and the silhouette coefficient
val KSSE = (2 to 20 by 1).toList.map { k =>
  // Train a k-means model for each candidate K.
  val kmeans = new KMeans().setK(k).setSeed(1L).setFeaturesCol("scaledFeatures")
  val model = kmeans.fit(scaledData)
  // Evaluate the clustering by computing the Within Set Sum of Squared Errors.
  val WSSSE = model.computeCost(scaledData)
  // K, the maxIter parameter, WSSSE, cluster assignments, per-cluster record counts, cluster centers
  (k, model.getMaxIter, WSSSE, model.summary.cluster, model.summary.clusterSizes, model.clusterCenters)
}
// Choose K from the SSE curve (the elbow method)
val KSSEdf: DataFrame = KSSE.map { x => (x._1, x._2, x._3, x._5) }.toDF("K", "MaxIter", "SSE", "clusterSizes")
KSSE.foreach(println)
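To pick K, look for the elbow where the SSE stops dropping sharply; KSSEdf.orderBy("K").show() makes that easy to eyeball. The comment above also mentions the silhouette coefficient, which is available through ClusteringEvaluator on Spark 2.3 and later. A sketch, assuming that version:
// Spark 2.3+ only: silhouette score per K (closer to 1 is better)
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val evaluator = new ClusteringEvaluator().setFeaturesCol("scaledFeatures")
val silhouettes = (2 to 20).map { k =>
  val model = new KMeans().setK(k).setSeed(1L).setFeaturesCol("scaledFeatures").fit(scaledData)
  (k, evaluator.evaluate(model.transform(scaledData)))
}
silhouettes.foreach(println)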