Feature Engineering

Processing continuous-valued features

0. Binarizer / binarization

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession\
    .builder\
    .appName("BinarizerExample")\
    .getOrCreate()

# Create a DataFrame
continuousDataFrame = spark.createDataFrame([
    (0, 1.1),
    (1, 8.5),
    (2, 5.2)
], ["id", "feature"])

# Binarize the continuous feature column against the given threshold
binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(continuousDataFrame)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()
spark.stop()
Binarizer output with Threshold = 5.100000
+---+-------+-----------------+
| id|feature|binarized_feature|
+---+-------+-----------------+
| 0| 1.1| 0.0|
| 1| 8.5| 1.0|
| 2| 5.2| 1.0|
+---+-------+-----------------+
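In newer Spark releases (3.0+), Binarizer can also binarize several columns at once through inputCols/outputCols with one threshold per column. A hedged sketch, assuming an active SparkSession and an illustrative DataFrame with two numeric columns v1 and v2:

# Hedged sketch (Spark 3.0+ multi-column API): one threshold per input column
multiDF = spark.createDataFrame([(0.1, 7.0), (6.0, 0.2)], ["v1", "v2"])
multiBinarizer = Binarizer(thresholds=[0.5, 5.0], inputCols=["v1", "v2"],
                           outputCols=["v1_bin", "v2_bin"])
multiBinarizer.transform(multiDF).show()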

1. Bucketizer / discretization with given bucket boundaries

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession\
    .builder\
    .appName("BucketizerExample")\
    .getOrCreate()

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])

# Map the continuous feature column onto buckets defined by the given split points
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")
bucketedData = bucketizer.transform(dataFrame)

print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits()) - 1))
bucketedData.show()
spark.stop()
Bucketizer output with 4 buckets
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
| -999.9| 0.0|
| -0.5| 1.0|
| -0.3| 1.0|
| 0.0| 2.0|
| 0.2| 2.0|
| 999.9| 3.0|
+--------+----------------+
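Bucketizer also has a handleInvalid parameter that controls what happens to NaN values. A minimal sketch, assuming the same SparkSession and splits as above; the extra DataFrame is purely illustrative:

# Hedged sketch: handleInvalid="keep" routes NaN values into an additional bucket instead of raising an error
bucketizerKeep = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures",
                            handleInvalid="keep")
bucketizerKeep.transform(spark.createDataFrame([(float("nan"),), (0.2,)], ["features"])).show()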

2. QuantileDiscretizer / discretization by quantiles

from __future__ import print_function
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("QuantileDiscretizerExample")\
    .getOrCreate()

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 9.2), (6, 14.4)]
df = spark.createDataFrame(data, ["id", "hour"])
# A single partition keeps the approximate quantiles (and hence the bucket boundaries) deterministic here
df = df.repartition(1)

# Discretize into 3 buckets
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
result = discretizer.fit(df).transform(df)
result.show()
spark.stop()
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 0.0|
| 4| 2.2| 0.0|
| 5| 9.2| 1.0|
| 6|14.4| 2.0|
+---+----+------+
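The bucket boundaries come from approximate quantiles, so results can vary with partitioning. A minimal sketch, assuming the same df as above, that tightens the approximation and keeps NaN values in a separate bucket:

# Hedged sketch: relativeError=0.0 requests exact quantile boundaries (at higher cost),
# and handleInvalid="keep" places NaN values into their own bucket
discretizerPrecise = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result",
                                         relativeError=0.0, handleInvalid="keep")
discretizerPrecise.fit(df).transform(df).show()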

3. MaxAbsScaler / scaling by maximum absolute value

from __future__ import print_function
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("MaxAbsScalerExample")\
    .getOrCreate()

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])

# Rescale each feature individually to [-1, 1] by dividing by its maximum absolute value
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

# Compute the per-feature maximum absolute values used for scaling
scalerModel = scaler.fit(dataFrame)

# Rescale to [-1, 1]
scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()
spark.stop()
+--------------+----------------+
| features| scaledFeatures|
+--------------+----------------+
|[1.0,0.1,-8.0]|[0.25,0.01,-1.0]|
|[2.0,1.0,-4.0]| [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]| [1.0,1.0,1.0]|
+--------------+----------------+
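MaxAbsScaler divides by the maximum absolute value; if you actually want min-max scaling onto [0, 1], MinMaxScaler uses the per-column minimum and maximum instead. A minimal sketch, assuming the same dataFrame of dense vectors as above:

from pyspark.ml.feature import MinMaxScaler

# Rescale each feature to the range [0, 1] using the per-column min and max from the training data
mmScaler = MinMaxScaler(inputCol="features", outputCol="minMaxScaled")
mmModel = mmScaler.fit(dataFrame)
mmModel.transform(dataFrame).select("features", "minMaxScaled").show()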

4. StandardScaler / standardization

from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("StandardScalerExample")\
    .getOrCreate()

dataFrame = spark.read.format("libsvm").load("sample_libsvm_data.txt")
# Standardize features to unit variance (without centering), using column statistics of the training samples
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
# Compute the mean, variance and other summary statistics
scalerModel = scaler.fit(dataFrame)
# Standardize
scaledData = scalerModel.transform(dataFrame)
scaledData.show()
spark.stop()
+-----+--------------------+--------------------+
|label| features| scaledFeatures|
+-----+--------------------+--------------------+
| 0.0|(692,[127,128,129...|(692,[127,128,129...|
| 1.0|(692,[158,159,160...|(692,[158,159,160...|
| 1.0|(692,[124,125,126...|(692,[124,125,126...|
| 1.0|(692,[152,153,154...|(692,[152,153,154...|
| 1.0|(692,[151,152,153...|(692,[151,152,153...|
| 0.0|(692,[129,130,131...|(692,[129,130,131...|
| 1.0|(692,[158,159,160...|(692,[158,159,160...|
| 1.0|(692,[99,100,101,...|(692,[99,100,101,...|
| 0.0|(692,[154,155,156...|(692,[154,155,156...|
| 0.0|(692,[127,128,129...|(692,[127,128,129...|
| 1.0|(692,[154,155,156...|(692,[154,155,156...|
| 0.0|(692,[153,154,155...|(692,[153,154,155...|
| 0.0|(692,[151,152,153...|(692,[151,152,153...|
| 1.0|(692,[129,130,131...|(692,[129,130,131...|
| 0.0|(692,[154,155,156...|(692,[154,155,156...|
| 1.0|(692,[150,151,152...|(692,[150,151,152...|
| 0.0|(692,[124,125,126...|(692,[124,125,126...|
| 0.0|(692,[152,153,154...|(692,[152,153,154...|
| 1.0|(692,[97,98,99,12...|(692,[97,98,99,12...|
| 1.0|(692,[124,125,126...|(692,[124,125,126...|
+-----+--------------------+--------------------+
only showing top 20 rows
The same standardization applied to a small DataFrame of dense vectors:

from __future__ import print_function
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("StandardScalerExample")\
    .getOrCreate()

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])

# Scale each feature to unit standard deviation without centering
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
# Compute the mean, variance and other summary statistics
scalerModel = scaler.fit(dataFrame)
# Standardize
scaledData = scalerModel.transform(dataFrame)
scaledData.show()
spark.stop()
+---+--------------+--------------------+
| id| features| scaledFeatures|
+---+--------------+--------------------+
| 0|[1.0,0.1,-8.0]|[0.65465367070797...|
| 1|[2.0,1.0,-4.0]|[1.30930734141595...|
| 2|[4.0,10.0,8.0]|[2.61861468283190...|
+---+--------------+--------------------+

5. PolynomialExpansion / adding polynomial features

from __future__ import print_function
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("PolynomialExpansionExample")\
    .getOrCreate()

df = spark.createDataFrame([
    (Vectors.dense([2.0, 1.0]),),
    (Vectors.dense([0.0, 0.0]),),
    (Vectors.dense([3.0, -1.0]),)
], ["features"])

# Expand the features into a polynomial space; with 2 input features and degree 3,
# the expansion has 9 terms (x, x^2, x^3, y, xy, x^2y, y^2, xy^2, y^3)
polyExpansion = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
polyDF = polyExpansion.transform(df)
polyDF.show(truncate=False)
spark.stop()
+----------+------------------------------------------+
|features |polyFeatures |
+----------+------------------------------------------+
|[2.0,1.0] |[2.0,4.0,8.0,1.0,2.0,4.0,1.0,2.0,1.0] |
|[0.0,0.0] |[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0] |
|[3.0,-1.0]|[3.0,9.0,27.0,-1.0,-3.0,-9.0,1.0,3.0,-1.0]|
+----------+------------------------------------------+
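The expansion grows quickly with the degree: for n input features and degree d there are C(n + d, d) - 1 output terms. A small helper (plain Python, not part of the Spark API) to check this:

from math import factorial

def poly_expansion_size(n_features, degree):
    # Number of monomials of total degree 1..degree over n_features variables: C(n+d, d) - 1
    return factorial(n_features + degree) // (factorial(n_features) * factorial(degree)) - 1

print(poly_expansion_size(2, 3))  # 9, matching the 9-element vectors above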

Processing categorical features

0. One-hot encoding

from __future__ import print_function
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("OneHotEncoderExample")\
    .getOrCreate()

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Convert the categorical values into category indices
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

# One-hot encode the category indices (Spark 2.x transformer-style API)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
spark.stop()
+---+--------+-------------+-------------+
| id|category|categoryIndex| categoryVec|
+---+--------+-------------+-------------+
| 0| a| 0.0|(2,[0],[1.0])|
| 1| b| 2.0| (2,[],[])|
| 2| c| 1.0|(2,[1],[1.0])|
| 3| a| 0.0|(2,[0],[1.0])|
| 4| a| 0.0|(2,[0],[1.0])|
| 5| c| 1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+
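The transformer-style call above matches the Spark 2.x API; in Spark 3.x OneHotEncoder is an estimator that must be fit first and can also encode several columns at once. A hedged sketch of the newer usage, assuming the same indexed DataFrame:

# Spark 3.x style: OneHotEncoder is an estimator and is fit before transforming
encoder3 = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])
encoder3.fit(indexed).transform(indexed).show()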

Processing text features

0. Stop-word removal

from __future__ import print_function
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("StopWordsRemoverExample")\
    .getOrCreate()

sentenceData = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

# Remove stop words
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
spark.stop()
+---+----------------------------+--------------------+
|id |raw |filtered |
+---+----------------------------+--------------------+
|0 |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1 |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+
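The default English stop-word list can be extended or replaced via the stopWords parameter. A minimal sketch, assuming the same sentenceData as above; the extra words are purely illustrative:

# Start from the built-in English list and add domain-specific stop words
customStopWords = StopWordsRemover.loadDefaultStopWords("english") + ["saw", "lamb"]
customRemover = StopWordsRemover(inputCol="raw", outputCol="filtered", stopWords=customStopWords)
customRemover.transform(sentenceData).show(truncate=False)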

1. Tokenizer

from __future__ import print_function
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("TokenizerExample")\
    .getOrCreate()

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

# Simple whitespace tokenization
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
# Split the text on the provided regex pattern (any non-word character)
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# UDF that counts the tokens in each row
countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(truncate=False)

spark.stop()
+-----------------------------------+------------------------------------------+------+
|sentence |words |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5 |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7 |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat] |1 |
+-----------------------------------+------------------------------------------+------+

+-----------------------------------+------------------------------------------+------+
|sentence |words |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark |[hi, i, heard, about, spark] |5 |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7 |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5 |
+-----------------------------------+------------------------------------------+------+

2. CountVectorizer

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

spark = SparkSession\
    .builder\
    .appName("CountVectorizerExample")\
    .getOrCreate()

df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Learn a vocabulary from the document collection and produce a CountVectorizerModel
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)
spark.stop()
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
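The fitted model exposes the learned vocabulary, which is how vector indices map back to terms; a short sketch using the model fit above:

# Index i of the feature vector corresponds to model.vocabulary[i]
print(model.vocabulary)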

3. TF-IDF weighting

from __future__ import print_function
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("TfIdfExample")\
    .getOrCreate()

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Tokenize the sentences
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Map each sequence of terms to its term frequencies using the hashing trick
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Compute the inverse document frequency (IDF) over the document collection
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()
spark.stop()
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(20,[0,5,9,17],[0...|
| 0.0|(20,[2,7,9,13,15]...|
| 1.0|(20,[4,6,13,15,18...|
+-----+--------------------+
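HashingTF is fast and memory-bounded, but hash collisions can merge unrelated terms and the indices cannot be mapped back to words. A hedged alternative sketch that swaps HashingTF for CountVectorizer, assuming the wordsData DataFrame produced by the tokenizer above:

from pyspark.ml.feature import CountVectorizer

# Count term frequencies with an explicit vocabulary instead of the hashing trick
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
cvModel = cv.fit(wordsData)
countFeaturized = cvModel.transform(wordsData)

# Reweight the counts with inverse document frequency, exactly as in the HashingTF version
idfCv = IDF(inputCol="rawFeatures", outputCol="features")
idfCv.fit(countFeaturized).transform(countFeaturized).select("label", "features").show()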

4. NGram / n-gram features

from __future__ import print_function
from pyspark.ml.feature import NGram
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("NGramExample")\
    .getOrCreate()

wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

# A feature transformer that converts an input array of strings into an array of n-grams (here bigrams)
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
spark.stop()
+------------------------------------------------------------------+
|ngrams |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark] |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat] |
+------------------------------------------------------------------+

Advanced transformations

0. SQLTransformer

from __future__ import print_function
from pyspark.ml.feature import SQLTransformer
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("SQLTransformerExample")\
    .getOrCreate()

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])

# Apply a transformation defined by a SQL statement; __THIS__ refers to the input DataFrame
sqlTrans = SQLTransformer(statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
spark.stop()
+---+---+---+---+----+
| id| v1| v2| v3| v4|
+---+---+---+---+----+
| 0|1.0|3.0|4.0| 3.0|
| 2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

1. RFormula

from __future__ import print_function
from pyspark.ml.feature import RFormula
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("RFormulaExample")\
    .getOrCreate()

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

# Apply the transformations needed to fit the dataset against an R model formula
formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
spark.stop()
+--------------+-----+
| features|label|
+--------------+-----+
|[0.0,0.0,18.0]| 1.0|
|[0.0,1.0,12.0]| 0.0|
|[1.0,0.0,15.0]| 0.0|
+--------------+-----+
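In practice these feature-engineering stages are usually chained into a single Pipeline so that fit/transform is applied consistently to training and test data. A minimal sketch, assuming the dataset from the RFormula example above with its categorical "country" column and numeric "hour" column; the stage and column names are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Index the categorical column, one-hot encode it, then assemble everything into one feature vector
indexer = StringIndexer(inputCol="country", outputCol="countryIndex")
encoder = OneHotEncoder(inputCol="countryIndex", outputCol="countryVec")
assembler = VectorAssembler(inputCols=["countryVec", "hour"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipelineModel = pipeline.fit(dataset)
pipelineModel.transform(dataset).select("features").show(truncate=False)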
