特征工程

对连续值处理

0.binarizer/二值化

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
spark = SparkSession\
.builder\
.appName("BinarizerExample")\
.getOrCreate() # 创建DataFrame
continuousDataFrame = spark.createDataFrame([
(0, 1.1),
(1, 8.5),
(2, 5.2)
], ["id", "feature"]) # 给定阈值,对连续特征列进行二值化。
binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature") binarizedDataFrame = binarizer.transform(continuousDataFrame) print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show() spark.stop()
Binarizer output with Threshold = 5.100000
+---+-------+-----------------+
| id|feature|binarized_feature|
+---+-------+-----------------+
| 0| 1.1| 0.0|
| 1| 8.5| 1.0|
| 2| 5.2| 1.0|
+---+-------+-----------------+

1.按照给定边界离散化

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer spark = SparkSession\
.builder\
.appName("BucketizerExample")\
.getOrCreate() splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")] data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"]) # 将连续特性列映射到特征桶。
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures") # 按照给定的边界进行分桶
bucketedData = bucketizer.transform(dataFrame) print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
bucketedData.show() spark.stop()
Bucketizer output with 4 buckets
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
| -999.9| 0.0|
| -0.5| 1.0|
| -0.3| 1.0|
| 0.0| 2.0|
| 0.2| 2.0|
| 999.9| 3.0|
+--------+----------------+

2.quantile_discretizer/按分位数离散化

from __future__ import print_function
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("QuantileDiscretizerExample")\
.getOrCreate() data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 9.2), (6, 14.4)]
df = spark.createDataFrame(data, ["id", "hour"])
df = df.repartition(1) # 分成3个桶进行离散化
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result") result = discretizer.fit(df).transform(df)
result.show() spark.stop()
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 0.0|
| 4| 2.2| 0.0|
| 5| 9.2| 1.0|
| 6|14.4| 2.0|
+---+----+------+

3.最大最小值幅度缩放

from __future__ import print_function
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("MaxAbsScalerExample")\
.getOrCreate() dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -8.0]),),
(1, Vectors.dense([2.0, 1.0, -4.0]),),
(2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"]) # 通过除以每个特征的最大绝对值,将每个特征单独缩放到范围[- 1,1]。
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures") # 计算最大最小值用于缩放
scalerModel = scaler.fit(dataFrame) # 缩放幅度到[-1, 1]之间
scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show() spark.stop()
+--------------+----------------+
| features| scaledFeatures|
+--------------+----------------+
|[1.0,0.1,-8.0]|[0.25,0.01,-1.0]|
|[2.0,1.0,-4.0]| [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]| [1.0,1.0,1.0]|
+--------------+----------------+

4.标准化

from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("StandardScalerExample")\
.getOrCreate() dataFrame = spark.read.format("libsvm").load("sample_libsvm_data.txt") # 使用训练集中样本的列汇总统计信息,通过删除均值并缩放到单位方差来标准化特征。
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False) # 计算均值方差等参数
scalerModel = scaler.fit(dataFrame) # 标准化
scaledData = scalerModel.transform(dataFrame)
scaledData.show() spark.stop()
+-----+--------------------+--------------------+
|label| features| scaledFeatures|
+-----+--------------------+--------------------+
| 0.0|(692,[127,128,129...|(692,[127,128,129...|
| 1.0|(692,[158,159,160...|(692,[158,159,160...|
| 1.0|(692,[124,125,126...|(692,[124,125,126...|
| 1.0|(692,[152,153,154...|(692,[152,153,154...|
| 1.0|(692,[151,152,153...|(692,[151,152,153...|
| 0.0|(692,[129,130,131...|(692,[129,130,131...|
| 1.0|(692,[158,159,160...|(692,[158,159,160...|
| 1.0|(692,[99,100,101,...|(692,[99,100,101,...|
| 0.0|(692,[154,155,156...|(692,[154,155,156...|
| 0.0|(692,[127,128,129...|(692,[127,128,129...|
| 1.0|(692,[154,155,156...|(692,[154,155,156...|
| 0.0|(692,[153,154,155...|(692,[153,154,155...|
| 0.0|(692,[151,152,153...|(692,[151,152,153...|
| 1.0|(692,[129,130,131...|(692,[129,130,131...|
| 0.0|(692,[154,155,156...|(692,[154,155,156...|
| 1.0|(692,[150,151,152...|(692,[150,151,152...|
| 0.0|(692,[124,125,126...|(692,[124,125,126...|
| 0.0|(692,[152,153,154...|(692,[152,153,154...|
| 1.0|(692,[97,98,99,12...|(692,[97,98,99,12...|
| 1.0|(692,[124,125,126...|(692,[124,125,126...|
+-----+--------------------+--------------------+
only showing top 20 rows
from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("StandardScalerExample")\
.getOrCreate() dataFrame = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -8.0]),),
(1, Vectors.dense([2.0, 1.0, -4.0]),),
(2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"]) # 计算均值方差等参数
scalerModel = scaler.fit(dataFrame) # 标准化
scaledData = scalerModel.transform(dataFrame)
scaledData.show() spark.stop()
+---+--------------+--------------------+
| id| features| scaledFeatures|
+---+--------------+--------------------+
| 0|[1.0,0.1,-8.0]|[0.65465367070797...|
| 1|[2.0,1.0,-4.0]|[1.30930734141595...|
| 2|[4.0,10.0,8.0]|[2.61861468283190...|
+---+--------------+--------------------+

5.添加多项式特征

from __future__ import print_function
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("PolynomialExpansionExample")\
.getOrCreate() df = spark.createDataFrame([
(Vectors.dense([2.0, 1.0]),),
(Vectors.dense([0.0, 0.0]),),
(Vectors.dense([3.0, -1.0]),)
], ["features"]) # 在多项式空间中进行特征展开。
polyExpansion = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
polyDF = polyExpansion.transform(df) polyDF.show(truncate=False) spark.stop()
+----------+------------------------------------------+
|features |polyFeatures |
+----------+------------------------------------------+
|[2.0,1.0] |[2.0,4.0,8.0,1.0,2.0,4.0,1.0,2.0,1.0] |
|[0.0,0.0] |[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0] |
|[3.0,-1.0]|[3.0,9.0,27.0,-1.0,-3.0,-9.0,1.0,3.0,-1.0]|
+----------+------------------------------------------+

对离散型处理

0.独热向量编码

from __future__ import print_function
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("OneHotEncoderExample")\
.getOrCreate() df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"]) # 将分类值转换为类别索引
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df) # 独热向量编码
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show() spark.stop()
+---+--------+-------------+-------------+
| id|category|categoryIndex| categoryVec|
+---+--------+-------------+-------------+
| 0| a| 0.0|(2,[0],[1.0])|
| 1| b| 2.0| (2,[],[])|
| 2| c| 1.0|(2,[1],[1.0])|
| 3| a| 0.0|(2,[0],[1.0])|
| 4| a| 0.0|(2,[0],[1.0])|
| 5| c| 1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+

对文本型处理

0.去停用词

from __future__ import print_function
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("StopWordsRemoverExample")\
.getOrCreate() sentenceData = spark.createDataFrame([
(0, ["I", "saw", "the", "red", "balloon"]),
(1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"]) # 去停用词
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False) spark.stop()
+---+----------------------------+--------------------+
|id |raw |filtered |
+---+----------------------------+--------------------+
|0 |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1 |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+

1.Tokenizer 分词器

from __future__ import print_function
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("TokenizerExample")\
.getOrCreate() sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"]) # 分词
tokenizer = Tokenizer(inputCol="sentence", outputCol="words") # 通过使用提供的正则表达式模式分割文本
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W") countTokens = udf(lambda words: len(words), IntegerType()) tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(truncate=False) regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(truncate=False) spark.stop()
+-----------------------------------+------------------------------------------+------+
|sentence |words |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5 |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7 |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat] |1 |
+-----------------------------------+------------------------------------------+------+ +-----------------------------------+------------------------------------------+------+
|sentence |words |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5 |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7 |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5 |
+-----------------------------------+------------------------------------------+------+

2.count_vectorizer

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer spark = SparkSession\
.builder\
.appName("CountVectorizerExample")\
.getOrCreate() df = spark.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"]) # 从文档集合中提取词汇表并生成CountVectorizerModel
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0) model = cv.fit(df) result = model.transform(df)
result.show(truncate=False) spark.stop()
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+

3.TF-IDF权重

from __future__ import print_function
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("TfIdfExample")\
.getOrCreate() sentenceData = spark.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"]) # 分词
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData) # 使用哈希技巧将一系列项映射到它们的项频率。
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData) # 计算给定文档集合的逆文档频率(IDF)。
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData) rescaledData.select("label", "features").show() spark.stop()
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(20,[0,5,9,17],[0...|
| 0.0|(20,[2,7,9,13,15]...|
| 1.0|(20,[4,6,13,15,18...|
+-----+--------------------+

4.n-gram语言模型

from __future__ import print_function
from pyspark.ml.feature import NGram
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("NGramExample")\
.getOrCreate() #Hanmeimei loves LiLei
#LiLei loves Hanmeimei wordDataFrame = spark.createDataFrame([
(0, ["Hi", "I", "heard", "about", "Spark"]),
(1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
(2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"]) # 将字符串输入数组转换为n-grams数组的特征转换器。
ngram = NGram(n=2, inputCol="words", outputCol="ngrams") ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False) spark.stop()
+------------------------------------------------------------------+
|ngrams |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark] |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat] |
+------------------------------------------------------------------+

高级变换

0.SQL变换

from __future__ import print_function
from pyspark.ml.feature import SQLTransformer
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("SQLTransformerExample")\
.getOrCreate() df = spark.createDataFrame([
(0, 1.0, 3.0),
(2, 2.0, 5.0)
], ["id", "v1", "v2"]) # 实现由SQL语句定义的转换。
sqlTrans = SQLTransformer(statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show() spark.stop()
+---+---+---+---+----+
| id| v1| v2| v3| v4|
+---+---+---+---+----+
| 0|1.0|3.0|4.0| 3.0|
| 2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

1.R公式变换

from __future__ import print_function
from pyspark.ml.feature import RFormula
from pyspark.sql import SparkSession spark = SparkSession\
.builder\
.appName("RFormulaExample")\
.getOrCreate() dataset = spark.createDataFrame(
[(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "NZ", 15, 0.0)],
["id", "country", "hour", "clicked"]) # 实现根据R模型公式拟合数据集所需的转换。
formula = RFormula(
formula="clicked ~ country + hour",
featuresCol="features",
labelCol="label") output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show() spark.stop()
+--------------+-----+
| features|label|
+--------------+-----+
|[0.0,0.0,18.0]| 1.0|
|[0.0,1.0,12.0]| 0.0|
|[1.0,0.0,15.0]| 0.0|
+--------------+-----+

Spark机器学习基础一的更多相关文章

  1. spark 机器学习基础 数据类型

    spark的机器学习库,包含常见的学习算法和工具如分类.回归.聚类.协同过滤.降维等使用算法时都需要指定相应的数据集,下面为大家介绍常用的spark ml 数据类型.1.本地向量(Local Vect ...

  2. Spark机器学习基础三

    监督学习 0.线性回归(加L1.L2正则化) from __future__ import print_function from pyspark.ml.regression import Linea ...

  3. Spark机器学习基础二

    无监督学习 0.K-means from __future__ import print_function from pyspark.ml.clustering import KMeans #from ...

  4. Spark机器学习基础-监督学习

    监督学习 0.线性回归(加L1.L2正则化) from __future__ import print_function from pyspark.ml.regression import Linea ...

  5. Spark机器学习基础-无监督学习

    0.K-means from __future__ import print_function from pyspark.ml.clustering import KMeans#硬聚类 #from p ...

  6. Spark机器学习基础-特征工程

    对连续值处理 0.binarizer/二值化 from __future__ import print_function from pyspark.sql import SparkSession fr ...

  7. Spark机器学习4·分类模型(spark-shell)

    线性模型 逻辑回归--逻辑损失(logistic loss) 线性支持向量机(Support Vector Machine, SVM)--合页损失(hinge loss) 朴素贝叶斯(Naive Ba ...

  8. 掌握Spark机器学习库(课程目录)

    第1章 初识机器学习 在本章中将带领大家概要了解什么是机器学习.机器学习在当前有哪些典型应用.机器学习的核心思想.常用的框架有哪些,该如何进行选型等相关问题. 1-1 导学 1-2 机器学习概述 1- ...

  9. Spark机器学习MLlib系列1(for python)--数据类型,向量,分布式矩阵,API

    Spark机器学习MLlib系列1(for python)--数据类型,向量,分布式矩阵,API 关键词:Local vector,Labeled point,Local matrix,Distrib ...

随机推荐

  1. vim模式下报错E37: No write since last change (add ! to override)

    故障现象: 使用vim修改文件报错,系统提示如下: E37: No write since last change (add ! to override) 故障原因: 文件为只读文件,无法修改. 解决 ...

  2. SoupUI 5.1.2(专业版)下载(含破解文件)

    包内含原安装包和破解文件,仅用于技术交流,切勿用于商业用途. 安装教程参考:https://www.cnblogs.com/miaojjblog/p/9778839.html 安装包及破解文件下载地址 ...

  3. C语言扫盲及深化学习

    c语言特点: (1)效率高 (2)控制性强 (3)硬件亲和性好 (4)可移植性高 一.关于注释 c语言中注释不能嵌套,因此注释代码时一定要注意源代码中是否已经存在注释.要从逻辑上删除一段代码,利用预编 ...

  4. 改造一下jeecg中的部门树

    假装有需求 关于 jeecg 提供的部门树,相信很多小伙伴都已经用过了,今天假装有那么一个需求 "部门树弹窗选择默认展开下级部门",带着这个需求再次去探索一下吧. 一.改造之前的部 ...

  5. MySQL 8.0 InnoDB新特性

    MySQL 8.0 InnoDB新特性 1.数据字典全部采用InnoDB引擎存储,支持DDL原子性.crash safe,metadata管理更完善 2.快速在线加新列(腾讯互娱DBA团队贡献) 3. ...

  6. 2019年新软件发布分享HanGi.IT.AStrutTie.v2017 1CD

    Steelray Project Viewer 2019.1.69 1CDIAR Embedded Workbench for Renesas M16C-R8C v3.71.1 1CD Mentor ...

  7. Java多线程处理List数据

    实例1: 解决问题:如何让n个线程顺序遍历含有n个元素的List集合 import java.util.ArrayList; import java.util.List; import org.apa ...

  8. 命名空间"xx"已经包含了"xx"的定义

    例: namespace A.B {    public class C    {             } } 注:重名的不仅仅是类,还可以结构,枚举,命名空间本身也有可能重复. 这个类C若与命名 ...

  9. Bioinfo online workshop

    http://icb.med.cornell.edu/index.html%3Fpage_id=2068.html 1. Datacamp https://www.datacamp.com/cours ...

  10. Mysql安装、设置密码、编码

    一.MySQL介绍 MySQL是一个关系型数据库管理系统,由瑞典MySQL AB 公司开发,目前属于 Oracle 旗下公司.MySQL 最流行的关系型数据库管理系统,在 WEB 应用方面MySQL是 ...