MLlib SVM Example

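1. Data

The data is in LIBSVM format: each line holds a label followed by space-separated index:value pairs for the nonzero features (label index1:value1 index2:value2 ...).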

0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37

1 159:124 160:253 161:255 162:63 186:96 187:244 188:251 189:253 190:62 214:127 215:251 216:251 217:253 218:62 241:68 242:236 243:251 244:211 245:31 246:8 268:60 269:228 270:251 271:251 272:94 296:155 297:253 298:253 299:189 323:20 324:253 325:251 326:235 327:66 350:32 351:205 352:253 353:251 354:126 378:104 379:251 380:253 381:184 382:15 405:80 406:240 407:251 408:193 409:23 432:32 433:253 434:253 435:253 436:159 460:151 461:251 462:251 463:251 464:39 487:48 488:221 489:251 490:251 491:172 515:234 516:251 517:251 518:196 519:12 543:253 544:251 545:251 546:89 570:159 571:255 572:253 573:253 574:31 597:48 598:228 599:253 600:247 601:140 602:8 625:64 626:251 627:253 628:220 653:64 654:251 655:253 656:220 681:24 682:193 683:253 684:220

……
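To make the line layout concrete, here is a minimal sketch of a parser for one such line in plain Scala. It is not MLlib's implementation; `SparseRow` and `parseLibSVMLine` are hypothetical names introduced only for this illustration. The index shift mirrors what `MLUtils.loadLibSVMFile` documents: LIBSVM indices are one-based, and they are converted to zero-based positions on load.

```scala
// Illustrative only: SparseRow and parseLibSVMLine are hypothetical helpers,
// not part of Spark MLlib.
final case class SparseRow(label: Double, indices: Array[Int], values: Array[Double])

def parseLibSVMLine(line: String): SparseRow = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val pairs = tokens.tail.map { token =>
    val parts = token.split(':')
    // LIBSVM indices are 1-based; shift them to 0-based positions.
    (parts(0).toInt - 1, parts(1).toDouble)
  }
  SparseRow(label, pairs.map(_._1), pairs.map(_._2))
}

// The first three features of the first sample line above.
val row = parseLibSVMLine("0 128:51 129:159 130:253")
```

Here `row.label` is the class label and `row.indices` / `row.values` hold only the nonzero entries, which is why this format suits sparse image data like the samples shown.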

2. Code

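    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.MulticlassMetrics
    import org.apache.spark.mllib.util.MLUtils

    // 1. Load the sample data (sc is the SparkContext, e.g. the one provided by spark-shell)
    val data_path = "/user/tmp/sample_libsvm_data.txt"
    val examples = MLUtils.loadLibSVMFile(sc, data_path).cache()

    // 2. Split the samples into a training set and a test set
    val splits = examples.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)
    val numTraining = training.count()
    val numTest = test.count()
    println(s"Training: $numTraining, test: $numTest.")

    // 3. Create the SVM model and set the training parameters
    val numIterations = 1000
    val stepSize = 1.0
    // train() takes regParam before miniBatchFraction, so it has to be passed
    // explicitly; 0.01 is an assumed value, not one given in the original post.
    val regParam = 0.01
    val miniBatchFraction = 1.0
    val model = SVMWithSGD.train(training, numIterations, stepSize, regParam, miniBatchFraction)

    // 4. Predict on the test samples
    val prediction = model.predict(test.map(_.features))
    val predictionAndLabel = prediction.zip(test.map(_.label))

    // 5. Compute the precision on the test set
    val metrics = new MulticlassMetrics(predictionAndLabel)
    val precision = metrics.precision // deprecated since Spark 2.0; use metrics.accuracy there
    println("Precision = " + precision)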

------------------------------------------------------------------------

The following code snippet illustrates how to load a sample dataset, train a model on the training split using a static method on the algorithm object, and score the held-out test split with the resulting model to compute the area under the ROC curve.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold.
    model.clearThreshold()

    // Compute raw scores on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Get evaluation metrics.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = SVMModel.load(sc, "myModelPath")
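The call to `clearThreshold()` matters for the AUC computation: with a threshold set, `predict` returns a 0/1 class, while without one it returns the raw margin, which is what a ROC curve needs. The sketch below mimics that behaviour in plain Scala under the assumption of a linear model; `margin`, `predictWithThreshold`, the weights, and the intercept are illustrative names and values, not MLlib internals.

```scala
// Raw margin of a linear model: w . x + b
def margin(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
  weights.zip(features).map { case (w, x) => w * x }.sum + intercept

// With Some(t): class 1.0 if the score exceeds t, otherwise 0.0.
// With None (threshold cleared): the raw score is returned unchanged.
def predictWithThreshold(score: Double, threshold: Option[Double]): Double =
  threshold match {
    case Some(t) => if (score > t) 1.0 else 0.0
    case None    => score
  }

// Hypothetical weights, intercept, and feature vector for illustration.
val exampleMargin = margin(Array(0.5, -1.0), 0.25, Array(2.0, 1.0))
val hardLabel = predictWithThreshold(exampleMargin, Some(0.0))
val rawScore = predictWithThreshold(exampleMargin, None)
```

Feeding hard labels into `BinaryClassificationMetrics` would collapse the ROC curve to a single operating point, which is why the raw scores are used in the snippet above.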

The SVMWithSGD.train() method by default performs L2 regularization with the regularization parameter set to 1.0. If we want to configure this algorithm, we can customize SVMWithSGD further by creating a new object directly and calling setter methods. All other MLlib algorithms support customization in this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.

    import org.apache.spark.mllib.optimization.L1Updater

    val svmAlg = new SVMWithSGD()
    svmAlg.optimizer
      .setNumIterations(200)
      .setRegParam(0.1)
      .setUpdater(new L1Updater)
    val modelL1 = svmAlg.run(training)

------------------------------------------------------------------------

1. Data

0 2.857738033247042 0 2.061393766919624 2.619965104088255 0 2.004684436494304 2.000347299268466 2.122974378789621 2.228387042742021 2.228387042742023 0 0 0 0 12.72816758217773 0

1 2.857738033247042 0 2.061393766919624 2.619965104088255 0 2.004684436494304 2.000347299268466 2.122974378789621 2.228387042742021 2.228387042742023 0 0 12.72816758217773 0 0 0

1 2.857738033247042 2.52078447201548 0 2.619965104088255 0 2.004684436494304 0 2.122974378789621 0 0 0 0 0 0 0 0

2. Code

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Load and parse the data file: the first field is the label, the rest are dense features
    val data = sc.textFile("mllib/data/sample_svm_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(' ')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }

    // Run training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    val labelAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    println("Training Error = " + trainErr)

Data file: file:///home/hadoop/suanec/suae/workspace/data/mllib/sample_svm_data.txt
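The training-error computation above is simply the fraction of (label, prediction) pairs that disagree. As a minimal sketch, the same logic on a local collection looks like this; `errorRate` and the sample pairs are hypothetical and exist only for illustration.

```scala
// Fraction of (label, prediction) pairs that disagree; 0.0 for an empty input.
def errorRate(labelAndPreds: Seq[(Double, Double)]): Double =
  if (labelAndPreds.isEmpty) 0.0
  else labelAndPreds.count { case (label, pred) => label != pred }.toDouble / labelAndPreds.size

// Hypothetical (label, prediction) pairs: one mismatch out of four.
val samplePairs = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0))
val sampleError = errorRate(samplePairs)
```

Because this is measured on the same data the model was trained on, it tends to be optimistic; the earlier examples hold out a test split for that reason.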
