Spark机器学习7·降维模型(scala&python)
- PCA(主成分分析法,Principal Components Analysis)
- SVD(奇异值分解法,Singular Value Decomposition)
http://vis-www.cs.umass.edu/lfw/lfw-a.tgz
0 运行环境
export SPARK_HOME=/Users/erichan/Garden/spark-1.5.1-bin-hadoop2.6
cd $SPARK_HOME
bin/spark-shell --name my_mlib --packages org.jblas:jblas:1.2.4-SNAPSHOT --driver-memory 4G --executor-memory 4G --driver-cores 2
1 抽取特征
1.1 载入脸部数据
val PATH = "/Users/erichan/sourcecode/book/Spark机器学习"
val path = PATH+"/lfw/*"
val rdd = sc.wholeTextFiles(path)
val files = rdd.map { case (fileName, content) => fileName.replace("file:", "") }
println(files.count)
1054
1.2 可视化脸部数据(python)
ipython -pylab
PATH = "/Users/erichan/sourcecode/book/Spark机器学习"
path = PATH+"/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg"
ae = imread(path)
imshow(ae)
tmpPath = "/tmp/aeGray.jpg"
aeGary = imread(tmpPath)
imshow(aeGary, cmap=plt.cm.gray)
1.3 提取脸部图片作为向量
1.3.1 载入图片
import java.awt.image.BufferedImage
def loadImageFromFile(path: String): BufferedImage = {
import javax.imageio.ImageIO
import java.io.File
ImageIO.read(new File(path))
}
val aePath = PATH+"/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg"
val aeImage = loadImageFromFile(aePath)
1.3.2 转换灰度、改变尺寸
def processImage(image: BufferedImage, width: Int, height: Int): BufferedImage = {
val bwImage = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY)
val g = bwImage.getGraphics()
g.drawImage(image, 0, 0, width, height, null)
g.dispose()
bwImage
}
val grayImage = processImage(aeImage, 100, 100)
import javax.imageio.ImageIO
import java.io.File
ImageIO.write(grayImage, "jpg", new File("/tmp/aeGray.jpg"))
1.3.3 提取特征向量
def getPixelsFromImage(image: BufferedImage): Array[Double] = {
val width = image.getWidth
val height = image.getHeight
val pixels = Array.ofDim[Double](width * height)
image.getData.getPixels(0, 0, width, height, pixels)
// pixels.map(p => p / 255.0) // optionally scale to [0, 1] domain
}
// put all the functions together
def extractPixels(path: String, width: Int, height: Int): Array[Double] = {
val raw = loadImageFromFile(path)
val processed = processImage(raw, width, height)
getPixelsFromImage(processed)
}
val pixels = files.map(f => extractPixels(f, 50, 50))
println(pixels.take(10).map(_.take(10).mkString("", ",", ", ...")).mkString("\n"))
1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0, ...
247.0,173.0,159.0,144.0,139.0,155.0,32.0,7.0,4.0,5.0, ...
253.0,254.0,253.0,253.0,253.0,253.0,253.0,253.0,253.0,253.0, ...
242.0,242.0,246.0,239.0,238.0,239.0,225.0,165.0,140.0,167.0, ...
47.0,221.0,205.0,46.0,41.0,154.0,127.0,214.0,232.0,232.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
75.0,76.0,72.0,72.0,72.0,74.0,71.0,78.0,54.0,26.0, ...
25.0,27.0,24.0,22.0,26.0,27.0,19.0,16.0,22.0,25.0, ...
240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0,240.0, ...
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
import org.apache.spark.mllib.linalg.Vectors
val vectors = pixels.map(p => Vectors.dense(p))
vectors.setName("image-vectors")
vectors.cache
1.4 正则化
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = false).fit(vectors)
val scaledVectors = vectors.map(v => scaler.transform(v))
2 训练降维模型
2.1 前k个主成分
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val matrix = new RowMatrix(scaledVectors)
val K = 10
val pc = matrix.computePrincipalComponents(K)
val rows = pc.numRows
val cols = pc.numCols
println(rows, cols)
(2500,10)
2.2 可视化特征脸
import breeze.linalg.DenseMatrix
val pcBreeze = new DenseMatrix(rows, cols, pc.toArray)
import breeze.linalg.csvwrite
import java.io.File
csvwrite(new File("/tmp/pc.csv"), pcBreeze)
pc = np.loadtxt("/tmp/pc.csv", delimiter=",")
print(pc.shape)
def plot_gallery(images, h, w, n_row=2, n_col=5):
"""Helper function to plot a gallery of portraits"""
plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
plt.subplot(n_row, n_col, i + 1)
plt.imshow(images[:, i].reshape((h, w)), cmap=plt.cm.gray)
plt.title("Eigenface %d" % (i + 1), size=12)
plt.xticks(())
plt.yticks(())
plot_gallery(pc, 50, 50)
3 使用降维模型
3.1 PCA投影(图像矩阵x主成分矩阵)
val projected = matrix.multiply(pc)
println(projected.numRows, projected.numCols)
println(projected.rows.take(5).mkString("\n"))
3.2 PCA与SVD
val svd = matrix.computeSVD(10, computeU = true)
println(s"U dimension: (${svd.U.numRows}, ${svd.U.numCols})")
println(s"S dimension: (${svd.s.size}, )")
println(s"V dimension: (${svd.V.numRows}, ${svd.V.numCols})")
U dimension: (1054, 10)
S dimension: (10, )
V dimension: (2500, 10)
def approxEqual(array1: Array[Double], array2: Array[Double], tolerance: Double = 1e-6): Boolean = {
// note we ignore sign of the principal component / singular vector elements
val bools = array1.zip(array2).map { case (v1, v2) => if (math.abs(math.abs(v1) - math.abs(v2)) > 1e-6) false else true }
bools.fold(true)(_ & _)
}
println(approxEqual(Array(1.0, 2.0, 3.0), Array(1.0, 2.0, 3.0)))
println(approxEqual(Array(1.0, 2.0, 3.0), Array(3.0, 2.0, 1.0)))
println(approxEqual(svd.V.toArray, pc.toArray))
true
false
true
// compare projections
val breezeS = breeze.linalg.DenseVector(svd.s.toArray)
val projectedSVD = svd.U.rows.map { v =>
val breezeV = breeze.linalg.DenseVector(v.toArray)
val multV = breezeV :* breezeS
Vectors.dense(multV.data)
}
projected.rows.zip(projectedSVD).map { case (v1, v2) => approxEqual(v1.toArray, v2.toArray) }.filter(b => true).count
4 评价降维模型
4.1 评估SVD的k值
val sValues = (1 to 5).map { i => matrix.computeSVD(i, computeU = false).s }
val svd300 = matrix.computeSVD(300, computeU = false)
val sMatrix = new DenseMatrix(1, 300, svd300.s.toArray)
csvwrite(new File("/tmp/s.csv"), sMatrix)
s = np.loadtxt("/tmp/s.csv", delimiter=",")
print(s.shape)
plot(s)
plot(cumsum(s))
plt.yscale('log')
Spark机器学习7·降维模型(scala&python)的更多相关文章
- Spark机器学习5·回归模型(pyspark)
分类模型的预测目标是:类别编号 回归模型的预测目标是:实数变量 回归模型种类 线性模型 最小二乘回归模型 应用L2正则化时--岭回归(ridge regression) 应用L1正则化时--LASSO ...
- Spark机器学习6·聚类模型(spark-shell)
K-均值(K-mean)聚类 目的:最小化所有类簇中的方差之和 类簇内方差和(WCSS,within cluster sum of squared errors) fuzzy K-means 层次聚类 ...
- Spark机器学习4·分类模型(spark-shell)
线性模型 逻辑回归--逻辑损失(logistic loss) 线性支持向量机(Support Vector Machine, SVM)--合页损失(hinge loss) 朴素贝叶斯(Naive Ba ...
- Spark机器学习1·编程入门(scala/java/python)
Spark安装目录 /Users/erichan/Garden/spark-1.4.0-bin-hadoop2.6 基本测试 ./bin/run-example org.apache.spark.ex ...
- 吴裕雄 python 机器学习——等度量映射Isomap降维模型
# -*- coding: utf-8 -*- import numpy as np import matplotlib.pyplot as plt from sklearn import datas ...
- 吴裕雄 python 机器学习——局部线性嵌入LLE降维模型
# -*- coding: utf-8 -*- import numpy as np import matplotlib.pyplot as plt from sklearn import datas ...
- Mac 配置Spark环境scala+python版本(Spark1.6.0)
1. 从官网下载Spark安装包,解压到自己的安装目录下(默认已经安装好JDK,JDK安装可自行查找): spark官网:http://spark.apache.org/downloads.html ...
- 梯度迭代树(GBDT)算法原理及Spark MLlib调用实例(Scala/Java/python)
梯度迭代树(GBDT)算法原理及Spark MLlib调用实例(Scala/Java/python) http://blog.csdn.net/liulingyuan6/article/details ...
- Spark机器学习MLlib系列1(for python)--数据类型,向量,分布式矩阵,API
Spark机器学习MLlib系列1(for python)--数据类型,向量,分布式矩阵,API 关键词:Local vector,Labeled point,Local matrix,Distrib ...
随机推荐
- 利用jsonrpc技术包装uiautomator
昨天一天在网上搜索解决上一篇文章中的exception: monkeyrunner内置uiautomator出错的原因 尽管没找到解决办法.可是让我无意中发现了一个好工具,比sl4a更好用的工具.直接 ...
- poj 2386:Lake Counting(简单DFS深搜)
Lake Counting Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 18201 Accepted: 9192 De ...
- 关于spotlight for Windows和spotlight for oracle的使用
http://blog.csdn.net/luowangjun/article/details/4866084 http://konglx.iteye.com/blog/1873805
- java . -- IDEA运行最简单的Java程序Hello World
以下内容基于IDEA, JAVA EE已经安装好的情况. 如果没有安装好,网上的安装教程非常多的. 下面开始,18个图解 点击Finish 运行的代码如下: public class HelloWor ...
- js封装日历控件
最终效果 代码实现 <script> $(function () { $(".j-calendar").calendar({ date: '2017-08-03', c ...
- git登陆迁移 SourceTree 不能自动识别
公司切换了迁移了git登陆,具体是什么 我也不是很清楚,结果就是,周一上班 好多小伙伴的git 用不了了,办公室里自然是哀嚎一片, 运维小伙伴给出的解决方案是:改个密码就好啦: 于是照做 结果Sour ...
- 第十篇:顺序容器vector,deque,list的选用规则
前言 常见的顺序容器主要有三种 - vector,deque,list.它们实现的功能相差不大,那么实际开发中该如何进行选择呢?本文将为你解答这个问题. 分析 由于这三种容器实现的数据结构原型不同(v ...
- Thrift快速入门
Thrift 简单示例 2017-01-19 16:47:57 首先通过先面两个示例简单感受一下Thrift(RPC)服务端与客户端之间的通信...... RPC学习----Thrift快速入门和Ja ...
- <td></td>之间的
今天开发中遇到了一个很丢脸的事:我把下图中别人写的 给去掉了,我觉得这个很多余,结果被在大神们痛斥了一顿.因为去掉这个 会导致td之间没有数据时td的边框不显示的问题,所以几下这个惨痛的教训,以自警. ...
- REDO 的内容:改变向量
REDO 的内容 ---改变向量 redo的内容并不是sql语句,他是放的一些改变,叫改变向量. 数据库恢复的时候并不是执行sql语句,而是一个物理的过程,是一个数据块的覆盖.是改变数据块的大小. 可 ...