Spark Learning Notes (4): MLlib Basics
MLlib (Machine Learning Library). The main topics covered are:
- Data types
- Statistical tools
  - summary statistics
  - correlations
  - stratified sampling
  - hypothesis testing
  - random data generation
- Classification and regression
  - linear models (SVM, logistic regression, linear regression)
  - naive Bayes
  - decision trees
  - ensembles of trees (random forests and gradient-boosted trees)
  - isotonic regression
- Collaborative filtering
  - ALS (alternating least squares)
- Clustering
  - k-means
  - Gaussian mixture model
  - power iteration clustering (PIC)
  - LDA (latent Dirichlet allocation)
  - streaming k-means
- Dimensionality reduction
  - SVD
  - PCA
- Feature extraction and transformation
- Frequent pattern mining
  - FP-growth
- Optimization
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)
I. Data Types
MLlib's basic data types are local vectors and local matrices; the underlying linear-algebra operations are provided by Breeze and jblas.
1. Local vector: a local vector has integer-typed, 0-based indices and double-typed values, and comes in two flavors, dense and sparse.
The base type is Vector, with two implementations: DenseVector and SparseVector.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
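Dense and sparse vectors share the same read-only Vector interface, so they can be inspected in the same way; a minimal sketch using the vectors created above:
// Both representations expose size, element access and toArray.
println(dv.size)                    // 3
println(sv1(2))                     // 3.0
println(sv1.toArray.mkString(",")) // 1.0,0.0,3.0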
Scala imports scala.collection.immutable.Vector by default, so you have to import org.apache.spark.mllib.linalg.Vector explicitly to use MLlib’s Vector.
- In MLlib, a training example for supervised learning is called a "labeled point".
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
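LabeledPoint is just a case class, so the label and the feature vector can be read back directly; a quick sketch:
// label is a Double, features is an MLlib Vector.
println(pos.label)    // 1.0
println(pos.features) // the dense vector [1.0,0.0,3.0]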
- MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR.
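In this text format each line encodes one labeled sparse feature vector as "label index1:value1 index2:value2 ...", where the indices are one-based and in ascending order; after loading, loadLibSVMFile converts them to zero-based indices. A made-up sample line for illustration:
1.0 1:0.5 3:2.1 10:0.3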
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
2. Local matrix
A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order.
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
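Newer MLlib versions (1.2+) also provide sparse local matrices in compressed sparse column (CSC) format via Matrices.sparse; a minimal sketch, assuming that method is available in your Spark version:
// Create a 3x2 sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC form:
// colPtrs mark where each column starts in values, rowIndices give the row of each value.
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))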
- A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
- A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix): the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index. BlockMatrix supports methods such as add and multiply with another BlockMatrix. A BlockMatrix can be most easily created from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix.
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()
// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()
// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
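When the result is needed in another representation, a BlockMatrix can be converted back; a minimal sketch, assuming the toIndexedRowMatrix and toLocalMatrix methods of the Spark 1.3+ API (toLocalMatrix collects everything to the driver, so it only suits small results):
// Convert the product back to an IndexedRowMatrix, or collect it as a local matrix.
val ataRows = ata.toIndexedRowMatrix()
val ataLocal = ata.toLocalMatrix()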
Whichever model you end up using, read its introduction first and then consult the API doc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package