Using MLLib in Scala
Following code snippets can be executed in spark-shell.

Binary Classification
The following code snippet illustrates how to load a sample dataset, execute a training algorithm on this training data using a static method in the algorithm object, and make predictions with the resulting model to compute the training error.

 import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint // Load and parse the data file
val data = sc.textFile("mllib/data/sample_svm_data.txt")
val parsedData = data.map { line =>
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
} // Run training algorithm to build the model
val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations) // Evaluate model on training examples and compute training error
val labelAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training Error = " + trainErr)

The SVMWithSGD.train() method by default performs L2 regularization with the regularization parameter set to 1.0. If we want to configure this algorithm, we can customize SVMWithSGD further by creating a new object directly and calling setter methods. All other MLlib algorithms support customization in this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.

 import org.apache.spark.mllib.optimization.L1Updater

 val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
.setRegParam(0.1)
.setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)

Linear Regression
The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit

 import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint // Load and parse the data
val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
} // Building the model
val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations) // Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
println("training Mean Squared Error = " + MSE)

Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training Mean Squared Errors.

Clustering
In the following example after loading and parsing data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing k. In fact the optimal k is usually one where there is an “elbow” in the WSSSE graph.

 import org.apache.spark.mllib.clustering.KMeans

 // Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map( _.split(' ').map(_.toDouble)) // Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations) // Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

Collaborative Filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.

 import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating // Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
}) // Build the recommendation model using ALS
val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01) // Evaluate the model on rating data
val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
val predictions = model.predict(usersProducts).map{
case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map{
case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
}.reduce(_ + _)/ratesAndPreds.count
println("Mean Squared Error = " + MSE)

If the rating matrix is derived from other source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.

 val model = ALS.trainImplicit(ratings, 1, 20, 0.01)

Using MLLib in Java
All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.

Using MLLib in Python
Following examples can be tested in the PySpark shell.

Binary Classification
The following example shows how to load a sample dataset, build Logistic Regression model, and make predictions with the resulting model to compute the training error.

 from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array # Load and parse the data
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
model = LogisticRegressionWithSGD.train(parsedData) # Build the model
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
model.predict(point.take(range(1, point.size))))) # Evaluating the model on training data
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

Linear Regression
The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit

 from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array # Load and parse the data
data = sc.textFile("mllib/data/ridge-data/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')])) # Build the model
model = LinearRegressionWithSGD.train(parsedData) # Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda point: (point.item(0),
model.predict(point.take(range(1, point.size)))))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y)/valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

Clustering
In the following example after loading and parsing data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing k. In fact the optimal k is usually one where there is an “elbow” in the WSSSE graph.

 from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt # Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])) # Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
runs=30, initialization_mode="random") # Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training Mean Squared Errors.

Collaborative Filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation by measuring the Mean Squared Error of rating prediction.

 from pyspark.mllib.recommendation import ALS
from numpy import array # Load and parse the data
data = sc.textFile("mllib/data/als/test.data")
ratings = data.map(lambda line: array([float(x) for x in line.split(',')])) # Build the recommendation model using Alternating Least Squares
model = ALS.train(ratings, 1, 20) # Evaluate the model on training data
testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y)/ratesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

If the rating matrix is derived from other source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.

 # Build the recommendation model using Alternating Least Squares based on implicit ratings
model = ALS.trainImplicit(ratings, 1, 20)

spark Using MLLib in Scala/Java/Python的更多相关文章

  1. 朴素贝叶斯算法原理及Spark MLlib实例(Scala/Java/Python)

    朴素贝叶斯 算法介绍: 朴素贝叶斯法是基于贝叶斯定理与特征条件独立假设的分类方法. 朴素贝叶斯的思想基础是这样的:对于给出的待分类项,求解在此项出现的条件下各个类别出现的概率,在没有其它可用信息下,我 ...

  2. 梯度迭代树(GBDT)算法原理及Spark MLlib调用实例(Scala/Java/python)

    梯度迭代树(GBDT)算法原理及Spark MLlib调用实例(Scala/Java/python) http://blog.csdn.net/liulingyuan6/article/details ...

  3. 三种文本特征提取(TF-IDF/Word2Vec/CountVectorizer)及Spark MLlib调用实例(Scala/Java/python)

    https://blog.csdn.net/liulingyuan6/article/details/53390949

  4. Spark机器学习1·编程入门(scala/java/python)

    Spark安装目录 /Users/erichan/Garden/spark-1.4.0-bin-hadoop2.6 基本测试 ./bin/run-example org.apache.spark.ex ...

  5. (一)Spark简介-Java&Python版Spark

    Spark简介 视频教程: 1.优酷 2.YouTube 简介: Spark是加州大学伯克利分校AMP实验室,开发的通用内存并行计算框架.Spark在2013年6月进入Apache成为孵化项目,8个月 ...

  6. Spark: 单词计数(Word Count)的MapReduce实现(Java/Python)

    1 导引 我们在博客<Hadoop: 单词计数(Word Count)的MapReduce实现 >中学习了如何用Hadoop-MapReduce实现单词计数,现在我们来看如何用Spark来 ...

  7. (八)map,filter,flatMap算子-Java&Python版Spark

    map,filter,flatMap算子 视频教程: 1.优酷 2.YouTube 1.map map是将源JavaRDD的一个一个元素的传入call方法,并经过算法后一个一个的返回从而生成一个新的J ...

  8. Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

    问题: 今天用Maven搭建了一个Spark的Scala项目,运行后遇到下面异常: Apache Spark Exception in thread “main” java.lang.NoClassD ...

  9. 如何在本地使用scala或python运行Spark程序

    如何在本地使用scala或python运行Spark程序   包含两个部分: 本地scala语言编写程序,并编译打包成jar,在本地运行. 本地使用python语言编写程序,直接调用spark的接口, ...

随机推荐

  1. 特殊表达式的意义[c++ special expressions]

    [本文链接] http://www.cnblogs.com/hellogiser/p/special-expressions.html x&(x-1)表达式的意义: 统计二进制中1的个数.   ...

  2. 用几条shell命令快速去重10G数据

    试想一下,如果有10G数据,或者更多:怎么才能够快速地去重呢?你会说将数据导入到数据库(mysql等)进行去重,或者用java写个程序进行去重,或者用Hadoop进行处理.如果是大量的数据要写入数据库 ...

  3. tomcat启动是报Multiple Contexts have a path of "/XXX"

    Eclipse集成了tomcat,启动时报如下异常: Could not publish server configuration for Tomcat v7.0 Server at localhos ...

  4. 集群ssh服务和免密码登录的配置

    安装Hadoop之前,由于集群中大量主机进行分布式计算需要相互进行数据通信,服务器之间的连接需要通过ssh来进行,所以要安装ssh服务,默认情况下通过ssh登录服务器需要输入用户名和密码进行连接,如果 ...

  5. 【pymongo】连接认证 auth failed解决方法

    故事背景: 我在虚拟机(ip:192.168.xx.xx)上建立了一个mongo的数据库,里面已经存好了内容.里面的一个database叫做 "adb", 里面有个collecti ...

  6. 江哥的dp题a(codevs 4815)

    题目描述 Description 给出一个长度为N的序列A(A1,A2,A3,...,AN).现选择K个互不相同的元素,要求: 1.两两元素互不相邻 2.元素值之和最大 输入描述 Input Desc ...

  7. 查看Linus中自带的jdk ,设置JAVA_HOME

    在配置hadoop是,进行格式化hadoop的时候,出现找不到jdk 我用Red hat是32位的,没有现成的32位的,敲java , 发现本机有java ,就找了一下其位置 找到了jdk-1.6.0 ...

  8. SQL中行列转换Pivot

    --建表 ),课程 ),分数 int) --插入数据 ) ) ) ) ) ) 1.静态行转列(确定有哪些列) select 姓名, end)语文, end)数学, end)物理 from tb gro ...

  9. UML中的依赖关系

    UML中的五种关系和设计模式中的代码实现. 又重新听了一遍UML中的关系.感觉又是收获很大. UML中的关系有依赖,关联(聚合,组合),泛化(也叫继承),实现 现在一个一个的来实现: 一:依赖 依赖关 ...

  10. 对于JavaScript的函数.NET开发人员应该知道的11件事

    (此文章同时发表在本人微信公众号"dotNET每日精华文章",欢迎右边二维码来关注.) 昨天小感冒今天重感冒,也不能长篇大论.如果你是.NET开发人员,在进入前端开发领域的时候,对 ...