Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

关键词：Local vector，Labeled point，Local matrix，Distributed matrix，RowMatrix，IndexedRowMatrix，CoordinateMatrix，BlockMatrix。
前言：MLlib支持本地向量和存储在单机上的矩阵，当然也支持被存储为RDD的分布式矩阵。一个有监督的机器学习的例子在MLlib里面叫做标签点。

1. 本地向量

一个本地向量是由整数类型和从０开始的索引存储在单机上
。MLlib支持两种本地向量，稠密向量和稀疏向量。稠密向量由一个浮点数组表示它的的所有值，而一个稀疏矩阵由两个平行的数组组成，索引和值。举个例子，一个向量，(1.0,0.0,3.0)能个用稠密表现为[1.0,0.0,3.0] 或者以稀疏的形式表现为（３，[0,2],[1.0,3.0]）,3是这个向量的大小。（本人注解：３为长度，即是元素个数，[0,2]为索引，[1.0,3.0],为值）

１．１MLlib认为如下数据类型是稠密向量：
~NumPys array
~Python list

1.2MLlib认为如下数据类型是稀疏向量：
~MLlib’s SparseVector.
~SciPy’s csc_matrix with a single colum

为了效率，我们推荐使用Numpy arrays ,并使用工厂方法继承Vectors 来创建稀疏矩阵。

import numpy as np

import scipy.sparse as sps

from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.

dv1 = np.array([1.0, 0.0, 3.0])

# Use a Python list as a dense vector.

dv2 = [1.0, 0.0, 3.0]

# Create a SparseVector.

sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])

# Use a single-column SciPy csc_matrix as a sparse vector.

sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))

２．标签点

标签点可以是一个本地的向量，可以是稀疏的也可以是稠密的，总之他们是带有标签的。在MLlib中，标签点用来进行有监督的学习算法。我们使用双精度数来存储一个标签，这样我们既可以用标签点做分类，也可以用来做回归了。对于二分法，一个标签应该不是０就是１。对于多种分类，标签应该是索引从0,1,2,3….

一个标签点用LabelPoint来表示。

from pyspark.mllib.linalg import SparseVector

from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.

pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.

neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

稀疏数据
在训练中，有一个稀疏训练数据是有一件很平常的事情。MLlib支持读取一个以LIBSVM格式存储训练例子。LIBSVM是LIBSVM和LIBLINEAR默认的格式。这是一种每一行带有一个标签的的稀疏向量格式如下：

label index1:value1 index2:value2 ...

索引从１开始的升序排列的。当读取完成之后，这些特征索引转化为从０开始。

MLUtils.loadLibSVMFile　读取存储LIBSVM格式的训练模型

from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

３．本地矩阵
一个本地矩阵有整数型的行，和双精度的列索引，并且存储在单机上。MLlib支持将所有数据存储在一个单独的数组上并且以列为顺序的稠密矩阵，也支持稀疏矩阵。举个例子，比如像下面的稠密矩阵：

这个矩阵是一个存储在一维数组 [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]上的大小为(3,2)的矩阵。

本地矩阵的基础类是Matrix,我们提供两个实现函数：DenseMatrix和SparseMatrix。我们推荐Matrices里面的工厂实现方法来创造本地矩阵。记住，MLlib的本地矩阵是列为序存储。

from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

dm2 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))

sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

４．分布式矩阵
一个分布式矩阵有long类型的行和列，还有double类型的值，并且被分布式存储在一个或者更多的RDDs中。选择正确的格式来存储巨大的分布式矩阵是很重要的事情。将一个分布式矩阵转换可能需要一个全局的清洗，这是代价非常昂贵的。直到现在，四种类型的分布式矩阵已经实现了。
这四种中基础类型叫做 RowMatrix。这种 RowMatrix一种面向行的分布式矩阵，没有所谓的行索引。比如：一个特征向量的集合。它依赖与RDD自己的行，并且RDD的每个行是一个本地向量。对于一个RowMatrix我们假设列的数量不是非常巨大，以便一个单独的本地向量能够合理正确的与驱动交流并且能够存储操作在一个正在使用它的节点上。
IndexedRowMatrix与 RowMatrix除了能被用来区分行和执行合并的行索引不同之外，其他都非常相似。CoordinateMatrix是一种以coordinate list (COO) 格式存储在RDDs条目中的分布式矩阵。
BlockMatrix 是一种被RDDMatrixBlock支持的分布式矩阵，MatrixBlock是元祖(Int, Int, Matrix).

NOTE
潜在的分布式矩阵RDD必须是确定的，因为我们缓存了矩阵的大小，一般来说使用一个非确定性的RDD会导致错误。

RowMatrix
RowMatrix是一个面向行的分布式矩阵，没有所谓的行索引,可以使用RDD的行，这些行是本地向量。既然每个行都被本地向量表示，列的数目被整数范围限制，但是列数目在实际情况应该是比行小的多的。

from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.

rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.

mat = RowMatrix(rows)

# Get its size.

m = mat.numRows()  # 4

n = mat.numCols()  # 3

# Get the rows as an RDD of vectors again.

rowsRDD = mat.rows

IndexedRowMatrix

IndexedRowMatrix与 RowMatrix除了有意义的行索引外，其他都非常相似。它使用RDD索引行，以便每一行代表它的索引和本地向量。
一个 indexedRowMatrix可以被 indexedRowMatrix 创造出来，一个 indexedRowMatrix以能够被转化为RowMatrix通过去掉行索引。

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Create an RDD of indexed rows.

#   - This can be done explicitly with the IndexedRow class:

indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]),

                              IndexedRow(1, [4, 5, 6]),

                              IndexedRow(2, [7, 8, 9]),

                              IndexedRow(3, [10, 11, 12])])

#   - or by using (long, vector) tuples:

indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]),

                              (2, [7, 8, 9]), (3, [10, 11, 12])])

# Create an IndexedRowMatrix from an RDD of IndexedRows.

mat = IndexedRowMatrix(indexedRows)

# Get its size.

m = mat.numRows()  # 4

n = mat.numCols()  # 3

# Get the rows as an RDD of IndexedRows.

rowsRDD = mat.rows

# Convert to a RowMatrix by dropping the row indices.

rowMat = mat.toRowMatrix()

CoordinateMatrix

CoordinateMatrix是一个分布式矩阵，
并且由RDD的条目支持的。每一个条目就是一个元祖(i: Long, j: Long, value: Double），i是行索引，j 是列索引，value是条目值。CoordinateMatrix应该仅仅使用当矩阵规模特别大并且矩阵很稀疏的时候。

CoordinateMatrix 能够被MatrixEntry条目创建， CoordinateMatrix能被转化为 RowMatrix通过使用toRowMatrix，或者一个稀疏行IndexedRowMatrix通过使用 toIndexedRowMatrix.

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Create an RDD of coordinate entries.

#   - This can be done explicitly with the MatrixEntry class:

entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)])

#   - or using (long, long, float) tuples:

entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])

# Create an CoordinateMatrix from an RDD of MatrixEntries.

mat = CoordinateMatrix(entries)

# Get its size.

m = mat.numRows()  # 3

n = mat.numCols()  # 2

# Get the entries as an RDD of MatrixEntries.

entriesRDD = mat.entries

# Convert to a RowMatrix.

rowMat = mat.toRowMatrix()

# Convert to an IndexedRowMatrix.

indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a BlockMatrix.

blockMat = mat.toBlockMatrix()

BlockMatrix

BlockMatrix是一个分布式矩阵，并且被 MatrixBlocks支持， MatrixBlocks是一个元祖， ((Int, Int), Matrix)，(Int, Int)是块索引，Matrix是rowsPerBlock x colsPerBlock的形状。

from pyspark.mllib.linalg import Matrices

from pyspark.mllib.linalg.distributed import BlockMatrix

# Create an RDD of sub-matrix blocks.

blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),

                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])

# Create a BlockMatrix from an RDD of sub-matrix blocks.

mat = BlockMatrix(blocks, 3, 2)

# Get its size.

m = mat.numRows() # 6

n = mat.numCols() # 2

# Get the blocks as an RDD of sub-matrix blocks.

blocksRDD = mat.blocks

# Convert to a LocalMatrix.

localMat = mat.toLocalMatrix()

# Convert to an IndexedRowMatrix.

indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a CoordinateMatrix.

coordinateMat = mat.toCoordinateMatrix()

原文转自 http://blog.csdn.net/qq_30115765/article/details/52594421

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API的更多相关文章

Spark机器学习7·降维模型(scala&python)
PCA(主成分分析法,Principal Components Analysis) SVD(奇异值分解法,Singular Value Decomposition) http://vis-www.cs ...
spark机器学习从0到1基本数据类型之(二）
MLlib支持存储在单个机器上的局部向量和矩阵,以及由一个或多个RDD支持的分布式矩阵. 局部向量和局部矩阵是用作公共接口的简单数据模型. 底层线性代数操作由Breeze提供. 在监督学习中使 ...
spark Using MLLib in Scala/Java/Python
Using MLLib in ScalaFollowing code snippets can be executed in spark-shell. Binary ClassificationThe ...
Spark Mllib里的分布式矩阵（行矩阵、带有行索引的行矩阵、坐标矩阵和块矩阵概念、构成）（图文详解）
不多说,直接上干货! Distributed matrix : 分布式矩阵一般能采用分布式矩阵,说明这数据存储下来,量还是有一定的.在Spark Mllib里,提供了四种分布式矩阵存储形式,均由支 ...
Spark入门实战系列--8.Spark MLlib（下）--机器学习库SparkMLlib实战
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 .MLlib实例 1.1 聚类实例 1.1.1 算法说明聚类(Cluster analys ...
Spark入门实战系列--8.Spark MLlib（上）--机器学习及SparkMLlib简介
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 .机器学习概念 1.1 机器学习的定义在维基百科上对机器学习提出以下几种定义: l“机器学 ...
spark 机器学习基础数据类型
spark的机器学习库,包含常见的学习算法和工具如分类.回归.聚类.协同过滤.降维等使用算法时都需要指定相应的数据集,下面为大家介绍常用的spark ml 数据类型.1.本地向量(Local Vect ...
Spark机器学习库（MLlib）官方指南手册中文版
中文https://blog.csdn.net/liulingyuan6/article/details/53582300 https://yq.aliyun.com/articles/608083 ...
【Spark机器学习速成宝典】基础篇02RDD常见的操作（Python版）
目录引例入门:textFile.collect.filter.first.persist.count 创建RDD的方式:parallelize.textFile 转化操作:map.filter.fl ...

随机推荐

[转帖]TPC-C解析系列01_TPC-C benchmark测试介绍
TPC-C解析系列01_TPC-C benchmark测试介绍 http://www.itpub.net/2019/10/08/3334/ 学习一下. 自从蚂蚁金服自研数据库OceanBase获得TP ...
Session中清除对象方法比较
转载. https://blog.csdn.net/u014401141/article/details/51816308 Session中清除对象方法比较 http://blog.csdn.ne ...
在CentOS下搭建GoLang开发环境
下载安装包 [vagrant@dk ~]$ wget http://mirrors.ustc.edu.cn/golang/go1.13.linux-amd64.tar.gz 使用USTC 中科大开源镜 ...
Linux基础-02-目录文件管理
1. Linux文件系统的层次结构 1) Linux文件系统的树状结构: 在Linux或UNIX操作系统中,所有的文件和目录都被组织成一个以根节点开始的倒置的树状结构. 2) 目录的定义: 目录相当于 ...
Python爬虫刷回复
最近闲的无聊,就想着去看看爬虫,顺着爬虫顺利的做到了模拟登录.刷帖子等等,这里简要说一下. 使用Python2.7写的爬虫,对某论坛做模拟登陆和刷帖子.回复等等,由于之前是没有接触过爬虫,这次之后感觉 ...
WUSTOJ 1324: Base64 Coding（Java）未解决，求题解
题目链接:1324: Base64 Coding 资料:ASCII码表原文是英文,而且篇幅较长.因此下面不粘贴原文,只写中文大意. Description Base64是一种编码算法.它的工作原理是 ...
Kubernetes(k8s) docker 修改 /dev/shm大小
一.问题 /dev/shm在/etc/fstab中挂载,对应tmpfs,实际使用的是内存的空间.默认情况下,/dev/shm为物理内存大小的一半. 在Kubernetes上跑docker,发现/dev ...
Codeforces Round #557 (Div. 1)
A.直接做. #include<vector> #include<cstdio> #include<cstring> #include<iostream> ...
使用 pykafka 进行消费
kafka连接脚本环境:python3,用到的模块有 pykafka,kazoo # coding=utf-8 import pykafka class KafkaReaderThread(obje ...
Java构建器（多个构造器参数）
今天看netty权威指南,第一次听说构建器,百度了几个博客,但是并没有通俗易懂一点儿的,综合别人的博客,总结如下: 1. 构建器是什么? 当创建对象需要传入多个参数的时候我们通常会根据参数的数量写不同 ...

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API的更多相关文章

随机推荐

热门专题